How a 20 MB data-set brought down a 14 GB ElasticSearch cluster.

For a client project we used ElasticSearch (ES) as one of our read-models. The use-case was pretty straightforward and the amount of data was super small. However, our problems were pretty big. It got to a point where we had to reboot our cluster periodically to prevent it from crashing. So, what was going on? Let's dig in.


tl;dr

Learn about the enabled: false setting and don't always count on "logical thinking" to save you.

The use-case.

To facilitate content publishing we used a "Content-API as a Service"-service. Using such a service, whether developed in-house or as a SaaS, allowed us to quickly roll out new functionality without having to care about how the content was actually managed. All we had to do was consume an API and, after applying some business-logic, use said content to render our templates. There were, however, some limitations that needed to be dealt with. In our case these were:

  1. Slow API responses.
  2. Multiple API calls needed per use-case.
  3. Searching functionality of the API didn't fit our needs.

Search in particular was a big concern for the application. The business not only required everything on the website to be searchable, there were also specialised search requirements that simply could not be met by the content API. In order to satisfy these needs we introduced ElasticSearch into our stack. Not only could ElasticSearch cater to all our search needs, it was also orders of magnitude faster! A match made in heaven, or so we thought.

We kept ElasticSearch up to date by periodically (cronjob) fetching all the latest content from the API, indexing all the documents, and hot-swapping the index using an alias. We did everything by the book. Important parts of this process were:

1. Indexing by the bulk.

Instead of sending documents to ElasticSearch one by one, bulk inserts were used. This has two effects: 1. the number of calls to the cluster is reduced, and 2. the number of times ES is storing/indexing/analysing is reduced. All of these are expensive operations, so reducing any of them will almost always have a positive effect on your ingestion time.
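
To make the bulk approach concrete, here is a minimal Python sketch that builds the newline-delimited body the _bulk endpoint expects: one action line followed by one source line per document. The function names, the batching helper and the sample values are assumptions for illustration, not the code we ran in production.

```python
import json


def build_bulk_body(index_name, type_name, documents):
    """Serialise (id, source) pairs into a newline-delimited _bulk body.

    Every document becomes two lines: an action line and the document source.
    """
    lines = []
    for doc_id, source in documents:
        action = {"index": {"_index": index_name, "_type": type_name, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Yield successive batches so a single request does not grow unbounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch produced by chunked() can be passed to build_bulk_body() and sent to the cluster in a single POST to /_bulk, instead of one request per document.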

2. Disabling refreshes during ingestion.

While bulk inserts reduce the overhead per document, disabling refreshes makes sure ES doesn't spend time making half-ingested data searchable. A refresh is what makes newly indexed documents visible to search, and by default it runs periodically. When you know what kind of data you're storing and what operations you're running, controlling when/if refreshes happen gives a significant boost to your ingestion process. In our case we knew the content only needed to become searchable after all of it was stored. In order to make this work we set the index.refresh_interval setting to -1.

curl -XPUT 'http://localhost:9200/index_name/' -d '{  
    "settings" : {
        "index" : {
            // ...
            "refresh_interval" : -1
        }
    }
}'

After we'd inserted all the documents, a manual refresh was triggered, making all the newly indexed data searchable in one go.

curl -XPOST 'http://localhost:9200/index_name/_refresh'  
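
Put together with the alias hot-swap mentioned earlier, one ingestion cycle might look roughly like the sketch below. It only plans the HTTP requests as (method, path, body) tuples and leaves the actual sending to whatever HTTP client you use. The function name and the index and alias names in the usage note are hypothetical.

```python
import json


def plan_reindex(alias, old_index, new_index, bulk_bodies):
    """Return the ordered (method, path, body) requests for one ingestion cycle."""
    requests = []
    # 1. Create the fresh index with automatic refreshes switched off.
    settings = {"settings": {"index": {"refresh_interval": -1}}}
    requests.append(("PUT", "/" + new_index, json.dumps(settings)))
    # 2. Send the documents in bulk, one request per prepared body.
    for body in bulk_bodies:
        requests.append(("POST", "/" + new_index + "/_bulk", body))
    # 3. Trigger a single manual refresh once everything is stored.
    requests.append(("POST", "/" + new_index + "/_refresh", ""))
    # 4. Repoint the alias from the old index to the new one in one call.
    actions = {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}
    requests.append(("POST", "/_aliases", json.dumps(actions)))
    return requests
```

Calling plan_reindex("content", "content_v1", "content_v2", bodies) yields the create, bulk, refresh and alias requests in the order they should be executed. Because the remove and add actions travel in the same _aliases call, readers of the alias never see an empty index.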

3. Prevent properties from being analysed.

A big part of a search-engine's magic is a result of how, and what, data is analysed. Parts of this process are things like tokenisation and stemming. In short, these processes break down the data to create lookup tables (inverted indices). The result of these operations aids the engine in quickly finding relevant documents.

As one might assume, if every part of a document is analysed, indices grow. When indices grow, so do our disk-space and memory usage. Besides that, analysing data costs time, slowing down the ingestion process. So when we limit the amount of data being analysed we speed up the process. When you look at your data, you can often easily see which parts are, or aren't, important to your use-case. In the type mapping you can mark a property as not_analyzed:

curl -XPUT 'http://localhost:9200/index_name/_mapping/type_name' -d '{  
  "properties": {
    // ...
    "property_name": {
      // ...
      "index": "not_analyzed"
    }
  }
}'

4. Limiting the number of documents you insert.

We don't use (read: you shouldn't use) ES as our single source of truth. ES is only a read-model. At any given time our system should be able to re-hydrate an empty cluster with all the data we need in our read model. This also means we can omit certain parts of this data if it's never going to be read for a particular purpose.

For example: let's store blogposts. Our MySQL/MongoDB/PostgreSQL database might store all of our blogposts, but when we actually want to show blogposts on our websites we don't care about the drafts and deleted blogposts. So when we fill up our indices we can simply omit those blogposts. The less we put into ES, the less it has to do, and the less time is spent on stuff we don't actually need.
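
As a small illustration of that filtering step, the sketch below drops drafts and deleted posts before anything is sent to ES. The "status" field and its values are assumptions; your primary database will have its own way of marking these.

```python
def publishable(posts):
    """Keep only the posts the website will actually show."""
    # Hypothetical status values; adjust to whatever your primary store uses.
    hidden = {"draft", "deleted"}
    return [post for post in posts if post.get("status") not in hidden]
```

Only the result of publishable() is handed to the bulk indexing step, so the read-model never contains documents that would not be shown anyway.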

So far, so good. Right?

Yup, so far so good. We followed every step in the book. But then shit hit the fan. As the number of documents grew, so did our memory usage. Of course, this is to be expected since we're storing more stuff. However, the amount of memory used became more and more disproportionate to the amount of data we had, and over time it kept building up. It would grow until it reached the magical 75% memory mark, which triggers the JVM's garbage collection process to kick in. When not dealt with appropriately this can result in a heap of trouble, as it did for us. This build-up eventually resulted in our cluster entering a state beyond repair. Memory usage went through the roof, peaking at 99% and staying there. This resulted in timeouts, data-loss, and exceptions in our application.

The search.

We needed to do something. We couldn't just let this process get out of hand. In order for our application to keep running we needed to restart the nodes in our cluster to battle the memory congestion. This is not a sustainable strategy in any way, but it prevented our entire operation from halting. So, we got the brightest minds from our team together to come up with possible causes. We even got a consultant from Elastic.co, the company behind ElasticSearch, to have a quick look. The result: nothing. We did everything by the book. The mappings looked fine, the data set was small, the queries were simple and optimised. Yet somehow we overlooked something.

The problem with this approach was that we only tried to tackle the problem with logical thinking. When you follow all the steps, it's sometimes hard to spot the anomaly. It felt like we had hit every dead end on StackOverflow and had tried every solution suggested by our peers. Because the problem persisted, we had to adjust our strategy. Since logical thinking didn't work, we turned to its friend: lateral thinking. In short, lateral thinking comes down to trying creative approaches. It also means you can re-try paths that previously led nowhere, following them a little further than you normally would.

Finding the problem.

The first step was to narrow down our search by dividing our data-sets. Apart from the textual content we had additional information stored in the same cluster. Through the power of deduction, by moving the content to a dedicated cluster, we were able to see the problem originated from this part of the data. This came as a surprise. The other data was continuously updated and somewhat complex. The entire data-set was around 300-400 MB. The textual content only made up around 20 MB of that, yet it seemed to cause all of the problems. As we began to dig deeper into the data, we looked at our mapping. The source of the mapping was small, a YAML file with around 15 types with a handful of defined properties. In order to better understand the effect of our mapping declaration we looked at the JSON mapping obtainable from ES.

curl -XGET 'http://localhost:9200/index_name/_mapping?pretty'

Using ?pretty you get a nicely formatted JSON document. Lo and behold: a >16,000 line JSON blob, staring us straight in the face. My god, this thing was huge. It was equally beautiful as it was terrifying. But what caused this monster-mapping to emerge from the depths of hell? Which developer rain-dance summoned this JSON-tsunami?

As it turns out, simply specifying a property as not_analyzed doesn't actually prevent said property from being processed. The property will still be used to create indices. Additionally, if those properties are of the object type, all properties of any nested data will also be processed, resulting in gigantic mappings. In our case, because the content data had many nested elements, this meant the mapping exploded into the 16k line beast we encountered.
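
To get a feel for how quickly nesting inflates a mapping, the sketch below walks a document and collects every dotted field path that would need its own mapping entry. It is a rough approximation written for this post, not ES's actual dynamic-mapping algorithm, and the function name and sample document are made up.

```python
def mapped_field_paths(value, prefix=""):
    """Collect the dotted paths of every field found in a nested document."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = prefix + key
            paths.add(path)  # the field itself needs a mapping entry
            paths |= mapped_field_paths(child, path + ".")
    elif isinstance(value, list):
        # Arrays have no path of their own; their elements share the parent's.
        for element in value:
            paths |= mapped_field_paths(element, prefix)
    return paths
```

For a hypothetical document such as {"title": "x", "body": {"blocks": [{"type": "p", "text": "a"}, {"type": "img", "src": "u"}]}} this returns six paths, and every new kind of nested block adds more. Multiply that by all the content types and nested elements we had, and a 16k line mapping stops being surprising.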

The solution.

Now that we had identified the issue we could work towards a solution. Our solution consisted of the following things:

  1. Create index projections for all types, explicitly mapping all the fields we actually cared about. This also meant the structure of indexed data was now flattened.
  2. Specify all the mapping of top-level properties.
  3. Put the data we needed to re-hydrate the model with into a separate field, and disable that field. This made sure that the data was not only not analysed, but not indexed either. The data was merely persisted.

A property can be disabled by setting the enabled flag to false in a type's mapping configuration. This will only work for properties that have the object type.

curl -XPUT 'http://localhost:9200/index_name/_mapping/type_name' -d '{  
  "properties": {
    // ...
    "property_name": {
      "type": "object",
      "enabled": false
    }
  }
}'
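
As an illustration of such a projection, the sketch below flattens one content item into a few explicitly mapped fields and tucks the original payload into a disabled object field. All field names (title, slug, body_text, raw, blocks) are hypothetical, and the "string" type assumes the same ES generation as the not_analyzed example above.

```python
# Hypothetical mapping: three searchable fields plus one disabled object field.
ARTICLE_MAPPING = {
    "properties": {
        "title": {"type": "string"},
        "slug": {"type": "string", "index": "not_analyzed"},
        "body_text": {"type": "string"},
        # Stored with the document, but neither analysed nor indexed.
        "raw": {"type": "object", "enabled": False},
    }
}


def project_article(content):
    """Flatten one content item into mapped fields plus an opaque payload."""
    blocks = content.get("blocks", [])
    return {
        "title": content["title"],
        "slug": content["slug"],
        # Only the text of the nested blocks is made searchable.
        "body_text": " ".join(b["text"] for b in blocks if b.get("text")),
        # The full original structure, kept for re-hydrating the model.
        "raw": content,
    }
```

The document stored in ES gets bigger because the content now lives in it twice, but the mapping stays at a handful of fields no matter how deeply the raw payload is nested.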

All the work surrounding this change was done within a single day by a single developer. The size per document went up by 50-100%, yet the size of our index significantly decreased. After deploying the change, the memory pressure decreased in an instant. The memory buildup is now a thing of the past and our DevOps/Ops team is no longer tasked with cluster reboots. As a result of these changes:

  • The ES response-times were greatly improved.
  • Ingestion time was brought down a lot.
  • Developers are more aware of what and why data is indexed.

Final note.

Before starting this adventure I had limited to no prior knowledge of ElasticSearch. Within a couple of weeks I learned the ins and outs, set up 2 clusters, and learned about all kinds of available monitoring. I am, however, by no means an ElasticSearch expert. As a team we had consulted with seasoned ElasticSearch users from multiple companies, even people working at the company behind the product. In the end "lateral thinking", along with the company's willingness to discuss our issues with third-parties, led us to find the solution to our problem.

I hope this helps somebody out there, or gives you some insights to prevent issues like this. They're easy to overlook, even for those who have plenty of experience with a given tool.

PS: if you run into these issues, you may want to consider hiring a data engineer.