I have been working as a backend developer in the CRM industry, where it is all about searching :). Yep, you are correct, it is a system with a lot of data table columns. The backend framework chosen to support this highly customizable search was Java with Spring integration. And yes, your thoughts are right, we chose Elasticsearch (ES) as our datastore. As always there were upvotes and downvotes, but it was the right decision in the end. That was four years ago, when the latest stable version was 1.7. Oops, I completely forgot, let me talk a bit about ES first.
ES is an open source search engine based on Lucene. It is certainly not meant to be a primary data store, but it is efficient for systems with searches everywhere. ES is easily scalable. How is data stored in ES? Elasticsearch stores metadata about each index (consider an index as a table in SQL) and the data itself in files. Nodes are nothing but your servers, which together are called a cluster. Nodes help to keep replicas of your data. ES keeps data in the form of JSON and strictly follows the so-called laws of JSON. If you want to understand more about the basics of ES terms visit here.
Let us get back to what I am actually supposed to talk about. Maintaining ES is really hard and costly. What I mean is that most startups do not have the so-called DevOps mechanisms of auto-scaling and other cool stuff, so as the data grows we need to boot up more machines or nodes to keep ES stable. Otherwise you will see the "Heap Space Error", since ES is built completely in Java. Yeah, Java developers would love this error :).
Real Developer Hardships
- Welcome to the real world of indexing. In Elasticsearch you will always hear about indexing. Indexing is nothing but writing your data to ES, so every create/update operation in the system is an indexing operation as well. There are strategies for indexing, and they should be carefully configured in ES. That means read-heavy and write-heavy systems should each make the appropriate configurations. Indexing eats your system RAM like an Indian eats spicy food (I just love it). Full indexing is a heavy operation, so plan it carefully and keep it away from peak hours.
- No parent-child relationship between indexes in version 1.7. Yes, later versions give you better ways to model relations (the join field, for example, although it works within a single index). How did we resolve it? We had to store the child JSON data inside the parent as well. This was a kind of duplication, because we had a separate index for the child data but still needed to keep a copy in the parent index. Then the question could be: why again in the parent index? Because the searches in the system happen on the parent page while filtering on child data. The alternative is to run separate queries in each index and then link the results together. But then pagination needs to be customized, and your MVP strategy will not work (you cannot release soon because of issues all over).
- Awwww, then come the field analyzers and index analyzers. Whoever says ES is easy, screw them. No, I was just joking. But we had a hard time with the metadata configuration. In ES a field is stored either analyzed or not analyzed. For example, if you want a field that supports wildcard search, you need to store its values as lowercase strings. If you want the search to work with spaces, you need some other analyzer, and the list goes on. The worst part is that you cannot change the analyzer of a field in an existing index directly; you can only set it in a new index. So we need to reindex the data.
- Say your client comes up with something new. They want the search to work with special characters, so now you need to create a special analyzer and add it to your metadata. Yes, you can just close the index and add the analyzer to it. But if you want to assign the newly created analyzer to any of the existing fields, you cannot do it. So it is back to square one: create a new index and reindex the data.
- Now, why can't I store every field with an analyzer from the start? You should not do this, because searching and indexing take more time on fields with analyzers. So always be careful when assigning analyzers to fields.
- Aggregations are the best part of ES. "Group by" is the best synonym I can find. People love to do aggregations in ES. Yes, but they can even break the system, since an aggregation is a very heavy operation that eats up a lot of RAM. They should be carefully built and used. Many recommendation systems prefer to use them nowadays.
- Elasticdump is the nicest library for smooth releases and reindexing. I remember using it at least 3 times a month.
- Prefer query builders rather than filters. Filters are only applied after fetching data from the nodes. Yes, you are right, that is much more time-consuming.
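The analyzer points above can be made a little more concrete with a rough sketch of the JSON you would send when creating an index. This is only an illustration, not our actual configuration: the field names `company_name` and `status` and the analyzer name `lowercase_keyword` are made up, and the mapping uses the typeless `text`/`keyword` syntax of later ES versions rather than the 1.7 syntax. The sketch only builds the request bodies as Python dicts, so it runs without a cluster.

```python
import json


def build_index_body():
    """Body for creating an index with one custom analyzer (hypothetical names)."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Keeps the whole value as a single token and lowercases it,
                    # so wildcard patterns can match case-insensitively.
                    "lowercase_keyword": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                # Analyzed field: pays the analyzer cost at index and search time.
                "company_name": {"type": "text", "analyzer": "lowercase_keyword"},
                # Not analyzed: stored as-is, cheap exact matching.
                "status": {"type": "keyword"},
            }
        },
    }


def build_wildcard_query(field, pattern):
    """Wildcard query body; the pattern is lowercased to match the stored tokens."""
    return {"query": {"wildcard": {field: {"value": pattern.lower()}}}}


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_wildcard_query("company_name", "Acme*"), indent=2))
```

Only `company_name` gets an analyzer here; `status` stays unanalyzed, which is the "be careful when assigning analyzers" advice in practice.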
Conclusion
Upgrade to the latest version as soon as possible. The sooner you upgrade, the better and faster your release process will be. Even with all these issues, I loved working with ES and still love working with it.
Top comments (3)
In our team we have had the ELK stack running for 2 years now and I can agree, ES loves consuming memory, a lot. :)
But configuring it for our purposes was extremely easy, especially with the Docker images for ES 5 and ES 6 now.
What you mention about filtering is interesting. Is it really not better to filter results before feeding them to the query? I guess that is what people are advising.
I felt filtering is overkill if you can do things with a query. But yeah, nice to hear information like the above. Thanks.
The query DSL changed a lot since 1.7. The distinction between queries and filters is a bit more implicit. Mostly this is just about where you put what in the query. Filters are great for performance since there is no ranking involved and since ES has elaborate caching for filter results, which of course matters when you are doing aggregations.
Speaking of which, aggregations are a lot better behaved in later ES versions. It's still basic computer science in the sense that some things just are inherently not fast, but it will at least prevent you from doing most things that would bring down your cluster. I'd say aggregations are a really good reason to use Elasticsearch. Most databases don't really have these types of features at all, or only have a tiny subset of them. Your only other option is doing funky stuff using some big data type asynchronous processing.
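As a rough illustration of "where you put what", here is a minimal sketch of a search body in the newer query DSL: the scored part goes under `must`, the unscored, cacheable part under `filter`, and a `terms` aggregation sits next to the query. The field names (`company_name`, `status`, `owner_id`) and the aggregation name are invented for the example.

```python
def build_search_body(text, status, size=20):
    """Search body with a scored match, an unscored filter and a terms aggregation."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"company_name": text}}],
                # Filter context: yes/no matching only, no scoring.
                "filter": [{"term": {"status": status}}],
            }
        },
        "aggs": {
            # Roughly a "GROUP BY owner_id" with a count per bucket.
            "contacts_per_owner": {"terms": {"field": "owner_id", "size": 10}}
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_search_body("acme", "active"), indent=2))
```

Keeping the aggregation bucket `size` small is one simple way to limit how much memory a single request can ask for.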
You need to figure out index management. Basically a good strategy is to reindex into a new index when your schema changes and use aliases to switch to a new index. Schemas in ES are not really mutable (other than adding stuff).
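A minimal sketch of that reindex-and-switch strategy, assuming versioned index names such as `contacts_v1` and `contacts_v2` behind an alias `contacts` (all three names are hypothetical). The first body is meant for the `_reindex` endpoint available in later ES versions, the second for the `_aliases` endpoint, which applies its actions in one atomic step. The code only builds the bodies and sends nothing.

```python
def build_reindex_body(old_index, new_index):
    """Body for POST /_reindex: copy documents from the old index to the new one."""
    return {"source": {"index": old_index}, "dest": {"index": new_index}}


def build_alias_swap(alias, old_index, new_index):
    """Body for POST /_aliases: move the alias from the old index to the new one."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_reindex_body("contacts_v1", "contacts_v2"), indent=2))
    print(json.dumps(build_alias_swap("contacts", "contacts_v1", "contacts_v2"), indent=2))
```

If the application only ever talks to the alias, the switch needs no code change or redeploy on the application side.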