Three of the Most Common Mistakes Made when Using Elasticsearch
In addition to being a top choice for enterprise search, Elasticsearch is also growing in popularity as an all-purpose datastore. It’s easy to scale, it’s extremely fast, and it can be accessed via a RESTful API. But because the documentation can be difficult to navigate, there are a number of traps that new Elasticsearch users often fall into. Here are some of the most common mistakes I’ve noticed while working with Elasticsearch at CFPB, with some additional tidbits learned at PyCon 2015 in Montreal:
Not explicitly defining a mapping
Elasticsearch claims to be a “schema-free” data store. Lo and behold, you can fire up an Elasticsearch cluster, create an index, and feed it a JSON document without even thinking about schemas. I find calling it “schema-free” to be a little misleading, however.
What’s actually happening behind the scenes is that Elasticsearch goes through each field of the JSON document you indexed, makes its best guess at the type of each field, and creates a schema for you (a “mapping” in the Elasticsearch world). For example, if it encounters “2014-05-15”, it will add a “date” entry to the mapping for that field. Likewise, 15.5 would be mapped as a floating-point number, and 2500 as an integer.
So if Elasticsearch makes a mapping for you, what’s the problem? Well, Elasticsearch isn’t always right. There are certain situations where the wrong type is picked, and next thing you know you’re getting indexing errors!
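To make the failure mode concrete, here is a toy sketch in Python. It is not Elasticsearch's actual type-detection code, just a simplified stand-in for the guessing described above. The field names (`released`, `price`, `stock`) and the type labels are made up for illustration. The first document sets the guessed mapping, and a later document with a non-date value in the same field no longer fits it.

```python
from datetime import datetime


def guess_type(value):
    """Toy stand-in for dynamic mapping: guess a type label from one JSON value."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    return "object"


# The first document indexed decides the guessed mapping (hypothetical fields).
first_doc = {"released": "2014-05-15", "price": 15.5, "stock": 2500}
guessed_mapping = {field: guess_type(value) for field, value in first_doc.items()}

# A later document puts free text where a date was guessed.
later_doc = {"released": "TBD", "price": 20.0, "stock": 10}
conflicts = [
    field for field, value in later_doc.items()
    if guess_type(value) != guessed_mapping[field]
]

print(guessed_mapping)
print("fields that no longer fit the guessed mapping:", conflicts)
```

In this sketch the `released` field is the one that conflicts, which is the kind of situation where a real cluster would reject the second document.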
How do you fix this? Always define mappings explicitly, ESPECIALLY in any production-like environment. A good practice is to index a few documents and let Elasticsearch guess the types for you. Then grab the generated mapping: GET /index-name/doc_type/_mapping
Next, review each field and correct any type that was guessed wrong. Don’t leave anything up to chance!
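As a rough sketch of what an explicit mapping might look like, the Python below builds the JSON body you could send when creating the index. The field names are hypothetical, and the layout (a document type nested under `mappings`, the `string` type) assumes the 1.x-era API that the GET request above implies. Newer Elasticsearch versions drop document types and split `string` into `text` and `keyword`, so adjust for your version.

```python
import json

INDEX_NAME = "index-name"  # placeholder names, matching the GET request above
DOC_TYPE = "doc_type"

explicit_mapping = {
    "mappings": {
        DOC_TYPE: {
            # Optional: reject documents containing fields not listed below,
            # instead of silently guessing a type for them.
            "dynamic": "strict",
            "properties": {
                "released": {"type": "date", "format": "yyyy-MM-dd"},
                "price": {"type": "float"},
                "stock": {"type": "integer"},
                "name": {"type": "string"},  # 1.x-era type name (assumption)
            },
        }
    }
}

# The body would be sent with a request along the lines of: PUT /index-name
request_line = "PUT /" + INDEX_NAME
print(request_line)
print(json.dumps(explicit_mapping, indent=2))
```

Starting from the mapping Elasticsearch generated and editing it into something like this means every type is one you chose on purpose.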
Setting store = yes for one or more fields
By default, the full document you index in Elasticsearch is stored in a field called _source. When you ask for a certain field in your results, for example, Elasticsearch has no problem going into _source and pulling out that particular field.
It’s also possible to store individual fields separately by setting store = yes in the mapping for a field. Keep in mind that everything is already stored in the _source field. Some people think storing fields separately will speed things up, but in reality it often causes multiple reads instead of the default single read of _source, which slows things down. Unless you know exactly what you’re doing, resist using store = yes.
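If the reason for reaching for store = yes is that you only want a couple of fields back, one alternative is to leave the mapping alone and ask for just those fields from _source in the search request. The sketch below builds such a request body in Python. The field names `name` and `price` are hypothetical.

```python
import json

# Search body that returns only two fields from each hit.
# Nothing in the mapping needs store = yes for this to work, because
# the values are taken from the single stored _source document.
search_body = {
    "_source": ["name", "price"],  # hypothetical field names
    "query": {"match_all": {}},
}

print(json.dumps(search_body, indent=2))
```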
Using queries when you should be using filters
This is easily the most common of the three mistakes. New Elasticsearch users often use queries only. Why are there both queries and filters, and what is the difference between the two?
**Queries** should be used when performing a full-text search, when scoring of results is required (think search results ranked by relevancy).
In almost all other cases, **filters** should be used. Filters are much faster than queries, mainly because they don’t score the results. If you just want to return all of the widgets that are blue, or that cost more than $10, use filters!
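Here is a small sketch of the two widget examples as filter-only searches, built as Python dictionaries. The field names `color` and `price` are assumptions, and wrapping the filter in `constant_score` is just one way to run a filter without scoring, not the only one.

```python
import json


def filter_only_search(filter_clause):
    """Wrap a filter so every matching document gets the same score (no ranking)."""
    return {"query": {"constant_score": {"filter": filter_clause}}}


# All of the widgets that are blue (exact match on a hypothetical `color` field).
blue_widgets = filter_only_search({"term": {"color": "blue"}})

# All of the widgets that cost more than $10 (hypothetical `price` field).
pricey_widgets = filter_only_search({"range": {"price": {"gt": 10}}})

print(json.dumps(blue_widgets, indent=2))
print(json.dumps(pricey_widgets, indent=2))
```

Neither request asks Elasticsearch how well a widget matches, only whether it does, and that is the case filters are meant for.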
It’s also possible to do a filtered query. So if you only need to perform a full-text search on a subset of your data, filter it down first, then run the query on what’s left. It will be much faster that way.
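A filtered query might look roughly like the sketch below. The `filtered` form matches the 1.x-era API this post was written against. Later Elasticsearch versions express the same idea as a `bool` query with a `filter` clause, so both shapes are shown. The field names `description` and `color` are made up for the example.

```python
import json


def filtered_search(text_query, filter_clause):
    """1.x-era shape: narrow the data with the filter, then score with the query."""
    return {"query": {"filtered": {"query": text_query, "filter": filter_clause}}}


def bool_filter_search(text_query, filter_clause):
    """Shape used by later versions: scored `must` clause plus unscored `filter`."""
    return {"query": {"bool": {"must": text_query, "filter": filter_clause}}}


full_text = {"match": {"description": "waterproof"}}  # scored full-text part
only_blue = {"term": {"color": "blue"}}               # unscored yes/no part

print(json.dumps(filtered_search(full_text, only_blue), indent=2))
print(json.dumps(bool_filter_search(full_text, only_blue), indent=2))
```

Either way, the relevance scoring only has to run on the blue widgets rather than on the whole index.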
In conclusion: if you need a full-text search server, or if you want to play around with a NoSQL datastore, I highly recommend giving Elasticsearch a try. Just make sure to avoid these common pitfalls, and you’ll be way ahead of the pack.