
Iterating Through Elasticsearch Documents Using Scroll and Ruby

Molly Struve (she/her) · Updated · 7 min read

Code being run against Elasticsearch version 7.4

If you are an avid Rails or ActiveRecord user then you know when you need to iterate over a bunch of records in your database you use find_each or find_in_batches to help you do it. These allow you to process a large number of records without having to pull them all out of the database at once.

But what if you want to do the same thing in Elasticsearch? How can you iterate through your Elasticsearch documents the same way you would your database records? One option is to use paging by setting the from and size params when you make a search request. Below is an example of a search request with paging using dev.to's implementation of the elasticsearch-ruby gem.

Search::Client.search(
  index: "index_name",
  body: {
    query: {
      bool: {
        filter: [
          { term: { user_id: 1 } }
        ]
      }
    },
    from: 10,
    size: 5
  }
)

However, pagination in Elasticsearch has its limits!


For starters, it can be inaccurate at times. Per the warning in their docs:

Elasticsearch uses Lucene’s internal doc IDs as tie-breakers. These internal doc IDs can be completely different across replicas of the same data. When paginating, you might occasionally see that documents with the same sort values are not ordered consistently.

This means that you might request page 1 and see doc A, then when you request page 2 you might see doc A again. Obviously, this is not ideal if you require precise and accurate data when you are iterating over your records.

How far you can page in Elasticsearch is also limited. You can only page as far as the index.max_result_window setting for your index, which defaults to 10,000. This means you can only reach the 10,000th record of your search before Elasticsearch throws an error and tells you to stop. Given these limitations, it's clear we need another solution if we want to iterate over a large amount of data in Elasticsearch.
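To make that 10,000-document ceiling concrete, here is a small sketch (a hypothetical helper, not part of any gem) that computes which from offsets a from/size paging loop could request under the default window before Elasticsearch errors:

```ruby
# Default value of the index.max_result_window setting.
MAX_RESULT_WINDOW = 10_000

# Return the `from` offsets a from/size paging loop may request
# before from + size exceeds the result window.
def paged_froms(total_hits, page_size: 100, window: MAX_RESULT_WINDOW)
  offsets = (0...total_hits).step(page_size).to_a
  offsets.take_while { |from| from + page_size <= window }
end

# Even with 25,000 matching documents, paging stops at offset 9,900:
paged_froms(25_000, page_size: 100).last # => 9900
```

So no matter how many documents match, plain paging can never hand you more than the first 10,000 without changing the index setting.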

Elasticsearch Scroll


Not quite that kind of scroll 😉 I'm talking about the Elasticsearch scroll API. The scroll API is Elasticsearch's solution to deep pagination and/or iterating over a large batch of documents.

the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scroll should NOT be used to serve real-time user requests, but rather for processing large amounts of data. One reason it shouldn't be used for real-time requests is that when you begin a scroll search the data returned will reflect the state of the index at the time that the initial scroll request is made. You can think of it as taking a "snapshot" of the index and then serving you the data as it was at the time that "snapshot" was taken. This means any changes you make to documents after you start your scroll will not be reflected in your scroll.

For example, let's pretend you have 3 documents in your index.

doc1
doc2
doc3

You decide to scroll through those documents one at a time, and you start that scroll at 10:00:00 am. Then at 10:00:01 am you update doc2 while you are still scrolling. The doc2 returned in your scroll will be the original and will NOT have any of the changes you made to it at 10:00:01 am.

Now that we have established scrolling is the way to go for processing a large number of documents in Elasticsearch, the only question left is how do you do it?!

Executing an Elasticsearch Scroll in Ruby

As stated above, in the following examples I will use Search::Client, which is merely dev.to's search client wrapper around the elasticsearch-ruby gem. With that in mind, this is what your initial scroll request is going to look like:

body = {
  query: {
    bool: {
      filter: [
        { term: { user_id: 1 } }
      ]
    }
  }
}
response = Search::Client.search(
  index: <index_name>, 
  scroll: "1m", 
  body: body, 
  size: 3000
)

Whew, there is a lot going on there! Let's unpack it to learn what is happening.

  • body: The body parameter is simply the search request we want to execute. In this case, we are filtering our documents down to those with a specific user_id.
  • index: The name of the index you plan on searching.
  • scroll: The scroll parameter tells Elasticsearch how long to keep the search context open for. This means you have 1 minute to process the set of records you pulled out in a single scroll before the entire scroll expires. If the scroll expires you will receive an error when you try to grab the next set of documents because Elasticsearch essentially "let go" of that initial snapshot it grabbed. 1m = 1 minute. Elasticsearch has its own special time units that it uses for declaring times and intervals.
  • size: In this context, size is the number of documents you would like to grab with each request. It is the equivalent of batch_size when you are using ActiveRecord to loop through database records.

That request will give us our first scroll response which will look like this.

{
  "_scroll_id"=>"a really long string that is your scroll ID", 
  "took"=>1150, 
  "timed_out"=>false, 
  "_shards"=>{"total"=>10, "successful"=>10, "skipped"=>0, "failed"=>0}, 
  "hits"=>{
    "total"=>{"value"=>1343514, "relation"=>"eq"}, 
    "max_score"=>0.0, 
    "hits"=>[{}, {}, {}] # your first set of returned documents
  }
}

A lot of that is meta information you don't need to worry about, but I do want to focus on a few important pieces. Notice that the above response contains a _scroll_id. This is what you send back to Elasticsearch to indicate that you would like the next set of documents in the scroll request. The scroll_id can change between requests or it can stay the same. Regardless, you should always use the most recent scroll_id to make your next request.

To get your second set of documents your request will look like this:

response = Search::Client.scroll(
  :body => { :scroll_id => response['_scroll_id'] }, 
  :scroll => '1m'
)

Notice that we no longer need the query or index parameters from before; all we need to send is the scroll_id in the body to tell Elasticsearch what we want. Again, we send the scroll parameter to tell Elasticsearch how long to keep the next search context open. This request will give you a response similar to the first one: a set of documents with a scroll_id.

Putting it All Together

Now that we know how scrolling works here is a code example of how you can put it all together. This code will loop through your index so you can gather up your documents and do something with them.

body = {
  query: {
    bool: {
      filter: [
        { term: { user_id: 1 } }
      ]
    }
  },
  _source: [:id]
}
response = Search::Client.search(
  index: <index_name>, 
  scroll: "1m", 
  body: body, 
  size: 3000
)

loop do
  hits = response.dig('hits', 'hits')
  break if hits.empty?

  hits.each do |hit|
    # Process/do something with the hit or hits
  end

  response = Search::Client.scroll(
    :body => { :scroll_id => response['_scroll_id'] }, 
    :scroll => '1m'
  )
end
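One refinement worth adding to the loop above: once you are done, you can release the search context immediately instead of letting it hang around until the 1m timeout expires. Here is a hedged sketch that wraps the same loop in a helper and finishes with clear_scroll; the `scroll_each` name is mine, and it assumes the client object responds to search, scroll, and clear_scroll the way the elasticsearch-ruby client does:

```ruby
# Iterate over every hit in a scroll, then free the search context.
# `client` is assumed to respond to #search, #scroll, and #clear_scroll
# like the elasticsearch-ruby client does.
def scroll_each(client, index:, body:, size: 3000, keep_alive: "1m")
  response = client.search(index: index, scroll: keep_alive, body: body, size: size)

  loop do
    hits = response.dig("hits", "hits")
    break if hits.nil? || hits.empty?

    hits.each { |hit| yield hit }

    response = client.scroll(
      body: { scroll_id: response["_scroll_id"] },
      scroll: keep_alive
    )
  end
ensure
  # Release the search context instead of waiting for the timeout.
  scroll_id = response && response["_scroll_id"]
  client.clear_scroll(body: { scroll_id: scroll_id }) if scroll_id
end
```

Because the cleanup lives in an ensure block, the context is released even if processing a hit raises partway through.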

Gotchas


There are a couple of things to watch out for when you are executing scroll requests against your Elasticsearch cluster.

First, try to avoid holding a large scroll request open for a very long period of time. Elasticsearch refers to this as keeping the search context alive. As I mentioned above, scrolling works by taking a "snapshot" of your data and then serving it to you in pieces. This means that Elasticsearch must "hold" all of that in memory.* Having to hold the scroll "snapshot" in memory while doing a lot of data updates can cause your memory to bloat. Memory bloat can lead to issues if you don't have a large surplus of memory to work with.

Secondly, you want to avoid having a lot of scroll requests open at once. Each scroll request has to keep track of the data it has and all of the updates made to that data while the request is open. To keep track of everything, Elasticsearch uses node heap space. If you plan on running a lot of scroll requests you want to make sure your nodes have ample heap for processing the requests. Elasticsearch will also try to save you from yourself by limiting the number of scroll requests you can open at a single time. Per their docs:

To prevent against issues caused by having too many scrolls open, the user is not allowed to open scrolls past a certain limit. By default, the maximum number of open scrolls is 500. This limit can be updated with the search.max_open_scroll_context cluster setting.
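If you genuinely need more simultaneous scrolls, that limit is just a cluster setting. A minimal sketch of the settings payload, assuming your wrapper exposes elasticsearch-ruby's cluster namespace (dev.to's Search::Client may not, so the call below is hypothetical and commented out):

```ruby
# Cluster settings update that raises the open-scroll limit from 500.
settings_body = {
  persistent: { "search.max_open_scroll_context" => 1_000 }
}

# Hypothetical call, assuming access to the underlying client:
# Search::Client.cluster.put_settings(body: settings_body)
```

Remember that every open context costs heap, so raise this only after confirming your nodes can afford it.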

But wait, there's more!


What I just covered are the basics of the scroll API, but there is much more you can do. For example, newer versions of Elasticsearch offer sliced scrolling, which allows you to split a scroll into multiple slices that can be consumed independently. There is also the newer search_after option, which is meant to serve large amounts of data for those real-time user requests. No matter what your needs are, I highly recommend checking out the Elasticsearch docs if you are wondering what's possible.
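To give a taste of sliced scrolling: each slice is an independent scroll over a disjoint subset of the documents, selected by adding a slice object to the request body. Here is a hedged sketch (the `sliced_scroll_bodies` helper is hypothetical, not part of any gem) that builds one body per slice, reusing the user_id filter from earlier:

```ruby
# Build one request body per slice so each slice can be scrolled
# independently (e.g. from separate threads or workers).
def sliced_scroll_bodies(query, slices:)
  (0...slices).map do |slice_id|
    {
      slice: { id: slice_id, max: slices }, # which slice this request consumes
      query: query
    }
  end
end

query = { bool: { filter: [{ term: { user_id: 1 } }] } }
bodies = sliced_scroll_bodies(query, slices: 2)

# Each body would then start its own scroll, e.g.:
#   Search::Client.search(index: "index_name", scroll: "1m", body: bodies[0], size: 3000)
```

Together the slices cover every matching document exactly once, which is what makes them safe to process in parallel.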

I hope you learned a little something from this post! Let me know if you have any questions 😃

*This is a super-simplified explanation. For more details on how Elasticsearch updates and persists data, check out this visualization on merging. If you want to go a step further, this is an old but solid blog post on how Lucene works under the hood.

Posted on May 20 by Molly Struve (she/her) (@molly_struve)

Discussion

Yeah, scrolls in Elastic are super useful, and this is a great introduction to them.

We've been using scrolls so extensively in one project that we ended up implementing them in PostgreSQL too ;)


Can you please blog about it? Would love to read it.


We've had a lot of fun with scroll queries (by which I mean we've found every issue along the way!). Here's a couple of other things to do with older versions which I think are worth mentioning in case folks are looking at older code or just seeing older examples:

  • You may see the "search_type=scan" parameter. This was deprecated, and in v5.0 removed. You don't need it.
  • In older versions it used to be that the initial search request with the scroll param did not return the first batch of data. You only got that with the subsequent scroll call (when you pass the scroll_id). So older code would have the loop arranged a bit differently.

It's interesting what you wrote about the memory use and holding the whole scroll "snapshot" in memory. I wonder if there are any tricks for making it less expensive.


Thanks for writing a great article. It'd be awesome if you could add the Elasticsearch version you are running this against, so later, if things change, folks can still relate.