
More on Geohashing: Covering and an updated DynamoDB library

Peter Miller ・ 3 min read

geohashing-intro (3 Part Series)

1) The Problem of Nearness: Part 1 - Geohash
2) The Problem of Nearness: Part 2 - A Solution with S2
3) More on Geohashing: Covering and an updated DynamoDB library

This is the third part in a series of posts. See the links above to jump to the prior posts.

S2 vs. H3 Covering

In my last post I talked through how we can use S2 cells to fill, or cover, a shape on the map: in this case, a circle within which we want to search for items of interest. I also mentioned that Uber came up with H3, a technique for covering a shape on the map with hexagonal cells.

The images below show S2 and H3 coverings of the same area in Washington, DC.

S2 Covering (shown in blue)
S2 Covering downtown DC

H3 Covering (shown in red)
H3 Covering downtown DC

The S2 covering is composed of cells of different sizes (levels). Some are larger cells that fit mostly inside the circle; others are smaller cells that help cover the edges. This means fewer cells overall are needed to cover the space than in the H3 example, albeit with the downside of a more uneven covering than what we see with H3.

With that somewhat uneven covering, we can see how important the distance calculation in the final step of our S2 geo-search from the prior post really is: it excludes points of interest that came from the outer edges of covering cells but fall outside the circle.
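To make that filtering step concrete, here is a small pure-Python sketch (the coordinates and point names are invented for illustration) of a haversine distance check that discards candidates returned by the covering cells but lying outside the search radius:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points (in degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def filter_by_radius(center, radius_m, candidates):
    """Keep only candidates actually inside the circle; covering cells can
    return points near their edges that fall outside the search radius."""
    lat, lng = center
    return [p for p in candidates
            if haversine_m(lat, lng, p["lat"], p["lng"]) <= radius_m]

# Hypothetical candidates pulled from the covering cells around downtown DC
candidates = [
    {"name": "inside",  "lat": 38.9000, "lng": -77.0365},   # a few hundred meters away
    {"name": "outside", "lat": 38.8893, "lng": -77.0502},   # roughly 1.5 km away
]
nearby = filter_by_radius((38.8977, -77.0365), 1000, candidates)
```

The same check works regardless of which covering produced the candidates, which is why the final distance pass makes the uneven S2 covering acceptable.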

These images also illustrate how S2 cells nest. Any given S2 cell fits completely inside its parent cell at the next coarser level. This nesting lets us calculate coverings with diverse cell sizes. Nesting also lets us roll up location-based data into larger aggregates: if data is stored at a small (high-level) cell, we can roll it up to a neighborhood, city, or region level by summing the data from all the child cells inside a bigger cell that covers the target area.
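To illustrate nesting at the cell-id level, here is a pure-Python sketch of the standard S2 64-bit cell-id bit tricks (no S2 library required; the example id is constructed, not a real DC cell): each id carries a trailing set bit whose position encodes the level, and a parent id is obtained by clearing the child's low bits.

```python
MAX_LEVEL = 30  # S2 leaf cells are level 30

def lsb(cell_id: int) -> int:
    """Lowest set bit of the id; its position encodes the cell's level."""
    return cell_id & -cell_id

def level(cell_id: int) -> int:
    # Leaf cells have lsb == 1 (level 30); each coarser level shifts the lsb left by 2.
    return MAX_LEVEL - (lsb(cell_id).bit_length() - 1) // 2

def parent(cell_id: int) -> int:
    """Id of the cell one level coarser that contains this cell."""
    new_lsb = lsb(cell_id) << 2
    return (cell_id & -new_lsb) | new_lsb

def contains(ancestor: int, descendant: int) -> bool:
    """A cell's descendants occupy the id range [id - lsb + 1, id + lsb - 1]."""
    return ancestor - lsb(ancestor) + 1 <= descendant <= ancestor + lsb(ancestor) - 1
```

Because containment is a simple id-range test, rolling data up from small cells to a neighborhood- or city-sized cell amounts to grouping rows by repeated `parent()` calls, or by one range query over sorted cell ids.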

In contrast, while H3 provides a more precise covering (at least at high resolutions), hexagons do not nest. As we move up resolutions, a child hexagon overlaps its parent only partially, so we cannot build a meaningful H3 covering from hexagons of diverse sizes. Without nesting, aggregation is not as automatic: we can still roll up data from the smaller hexagons, but we cannot map it directly onto larger ones.

Dash-Labs dynamodb-geo

While watching a brilliant discussion on Twitch between DynamoDB gurus Rick Houlihan and Alex DeBrie, at around an hour in, Rick dropped a quick reference to an AWS customer who had taken the DynamoDB Geo library I looked at last post and improved upon it. Thanks to Twitch user switch184 in the comments for pointing me to the repo: https://github.com/Dash-Labs/dynamodb-geo

Geo Library for Amazon DynamoDB


This library was forked from the AWS geo library.

The following limitations of the AWS geo library were the main reasons that necessitated this fork:

  • Usage required a table’s hash and range key to be replaced by geo data. This approach is not feasible for user-centric queries, where the hash and range key have to be domain-model-specific attributes.
  • Developed prior to GSIs, hence it only used LSIs
  • No solution for composite queries, e.g. “Find something within X miles of lat/lng AND category=‘restaurants’”
  • The solution executed the queries and returned the final result; it did not give the client any control over query execution.

What methods are available for geo-querying in this library?

  • Query for a given lat/long
  • Radius query
  • Box/Rectangle query

All of the above queries can be run as composite queries, depending on their…

While this fork of the original dynamodb-geo is in Java, not JavaScript, it is worth your time to take a look. There are many improvements, which are summarized in the repo's readme.

One bit of documentation that stuck out to me was the expanded discussion of hash key length, including the number of queries the library issues for a radius search at a given hash key length. Something that tripped me up at first: the geohash key they are talking about here is the first X digits of an S2 cell id, and the length of that key does not map 1:1 to the level of the cell. As the Dash-Labs repo suggests, a 5- or 6-digit geohash key is well suited for near-proximity searches. I still struggle with the math behind that assertion, but the sample query results are convincing.
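To make that distinction concrete, here is a tiny pure-Python sketch (the cell ids are constructed for illustration, not real DC cells): the geohash key is just the leading decimal digits of the 64-bit cell id, so a parent and a child cell, which sit at different levels, can share the exact same 6-digit key.

```python
def geohash_key(cell_id: int, key_length: int) -> str:
    """First key_length decimal digits of an S2 cell id, used as a hash key."""
    return str(cell_id)[:key_length]

# A leaf cell id (level 30) and its parent (level 29), built with the standard
# S2 bit layout, where the trailing set bit encodes the level.
leaf_id = (2 << 61) | 1
parent_id = (leaf_id & -(1 << 2)) | (1 << 2)
```

Since many cells of different levels collapse onto one key, the key length tunes how much data lands in each DynamoDB partition, not which S2 level is used.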

Thanks for reading and happy geohashing!


Posted on Feb 5

Discussion

 

Do you know why DynamoDB is not offering GEO queries out of the box, like: docs.mongodb.com/manual/geospatial...

instead of us having to use all kinds of unofficially supported libraries to fulfill this need?

 

Two reasons I can think of.

First, as I mentioned in the post, the DynamoDB folks report that some large enterprises have been happy using geohashing solutions like dynamodb-geo. Not mentioned in the post is that you can also add geospatial queries by using Elasticsearch in addition to DynamoDB. If you want to learn more about that option, check out this post on AWS Amplify and Elasticsearch: medium.com/@gerard.sans/finding-th.... So I think the DynamoDB team feels there are good-enough solutions out there and not enough demand for an officially supported solution to justify the effort.

Which leads to point 2. I am not familiar with the underlying data model of MongoDB and how they implement geospatial indexes. In a more traditional RDBMS, doing a nearness query against a geospatial data type involves a scan, albeit on a potentially fast index.

On the DynamoDB side, the key to DynamoDB's consistent performance while scaling out is the use of partition keys to physically separate data, which keeps queries (by that key) performant, but means that scans can be quite slow and expensive. Simply adding some geospatial data types could encourage people to implement geospatial queries in DynamoDB with scans, leading to bad performance on these queries that may be frequently used in your application. AWS wants to lead us to successful patterns.

This technical complexity is likely another reason for not having geospatial data types in DynamoDB.

I definitely don't have any inside info on this, but those are my best guesses.

One last note here. You could use composite keys, for example MD#BETHESDA#20815, to allow for efficient searches within a state, city, and zip code. Hat tip to the amazing Alex DeBrie and his DynamoDB Book for that one.
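Here is a sketch of that composite-key pattern (the items and helper names are invented, and the prefix filter below just simulates DynamoDB's begins_with() key condition in plain Python): a sort key like MD#BETHESDA#20815 lets a single prefix condition narrow a query to a state, a state+city, or an exact state+city+zip.

```python
def location_sort_key(state: str, city: str, zip_code: str = "") -> str:
    """Build a hierarchical sort key such as 'MD#BETHESDA#20815'."""
    parts = [state.upper(), city.upper()]
    if zip_code:
        parts.append(zip_code)
    return "#".join(parts)

# Invented sample items; in DynamoDB these would share a partition key.
ITEMS = [
    {"name": "Bethesda Bagels",  "sk": "MD#BETHESDA#20815"},
    {"name": "Silver Diner",     "sk": "MD#SILVER SPRING#20910"},
    {"name": "Ben's Chili Bowl", "sk": "DC#WASHINGTON#20001"},
]

def begins_with(items, prefix):
    """Stand-in for DynamoDB's begins_with() condition on the sort key."""
    return [i["name"] for i in items if i["sk"].startswith(prefix)]
```

The key design choice is ordering the components from coarse to fine, so every useful prefix of the sort key is itself a meaningful query.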

Hope this helps! Great question to think about.

Do you have any ideas on what exactly you'd like to see?

 

Thanks for your detailed answer.

I know about Elasticsearch. I have written my own blog post about that: dev.to/rpostulart/the-guide-to-imp...

I know about the DynamoDB GEO package, but I see that not many users are using it and the package hasn't been updated in a long time. I don't want to end up in a situation where I have built an app that is used by many users on top of an unsupported package.

And Elasticsearch provisioned by Amplify is an expensive solution when you are a startup and not doing many GEO queries yet, so I am looking at alternatives.

In the past I have worked with MongoDB and this worked great.

Nice post about Elasticsearch! Re: your concern, maybe try pinging James Beswick (@jbesw) or Alex DeBrie (@alexbdebrie) on Twitter to ask for a recommendation for a supported option for GEO queries on DynamoDB? They have both responded to questions from me before. But I get it, a client might be pretty nervous about using an old/clunky library.