loading...

The Problem of Nearness: Part 2 - A Solution with S2

phm200 profile image Peter Miller Updated on ・5 min read

geohashing-intro (3 Part Series)

1) The Problem of Nearness: Part 1 - Geohash 2) The Problem of Nearness: Part 2 - A Solution with S2 3) More on Geohashing: Covering and an updated DynamoDB library

This is part 2 of a series, for the full context see the first post

We covered a lot of theory and math around calculating distance last time. For today's post, we will focus on some of the details of an implementation I referenced.

Caffeinate-Me by @jbesw

James Beswick is a developer advocate at AWS for Serverless. He wrote three fantastic posts about using DynamoDB to implement location-based searches, accompanied by explanatory videos and the implementing Git repo's. Please take a moment to check out his work, it is excellent. I learned a lot playing around with it and also found a few interesting items that I highlight below.

"Geohash" and Google's S2 Geometry

My prior post talked about geohashing, specifically the canonical implementation of GeoHash by Gustavo Niemeyer using alphanumeric hashes to address a grid of nested squares and rectangles that covers the Earth.

The DynamoDB geo library that James' Starbucks locator uses does not use that geohash algorithm. Instead, it uses Google's S2 Geometry for addressing locations. I promised less math, so the big takeaway to focus on here is that points of interest in S2 are placed in cells, that like our squares and rectangles from geohash are nested and cover the Earth's surface. S2 cells are addressed by 64-bit integers (not alphanumeric strings) and certain distance and covering calculations are much faster than with GeoHash.

For a more detailed look at how S2 works, including a fun animated gif of the Hilbert Curve, check out Christian Perone's post on S2.

An important concept in S2 is "covering" geographic shapes. This means identifying the neighboring cells that when tiled together, fill (or nearly fill) the specified shape. For (math) reasons, generating a range of covering cells can be done very quickly with S2. To nerd out a bit more on S2 vs. geohash, check out this post from Fabrice Aneche. Google's presentation on S2 has a nice visual of what covering (referred to as approximating regions here) looks like in practice.

Approximating Regions Using S2 Cells

When I was trying to reach the end of the internet while writing this post, I came across an additional geohashing or spatial library. As any old school war-gamer knows, hexagonal grids are way cooler than square grids, and sure enough, Uber created H3 a "hexagonal hierarchical geospatial indexing system". I'll leave the details for you to look into if you are interested, but Uber states it is a good fit for use cases like analysis of "locations of cars in a city". If that's your thing.

Now that we know we are working with S2, let's check out some of the details of James' Caffeinate-Me app.

Finding Items within a Circle

When using the Caffeinate-Me app, you click around a map and are shown all the Starbucks that are within a circle centered on your click. The code to get the Starbucks within that circle is shown below (from query.js):

myGeoTableManager
  .queryRadius({
    RadiusInMeter: 1000
    CenterPoint: {
      latitude: 40.7769099,
      longitude: -73.9822532
    }
  })

The queryRadius method on the GeoDataManager.js shows how the dynamodb-geo package breaks down this request:

     * @param queryRadiusInput
     *    Container for the necessary parameters to execute radius query request.
     *
     * @return Result of radius query request.
     * */
    GeoDataManager.prototype.queryRadius = function (queryRadiusInput) {
        var _this = this;
        var latLngRect = S2Util_1.S2Util.getBoundingLatLngRectFromQueryRadiusInput(queryRadiusInput);
        var covering = new Covering_1.Covering(new this.config.S2RegionCoverer().getCoveringCells(latLngRect));
        return this.dispatchQueries(covering, queryRadiusInput)
            .then(function (results) { return _this.filterByRadius(results, queryRadiusInput); });
    };

In pseudo-code:

  1. (Line 8) Get a rectangle that defines the min and max latitude and longitudes of a bounding box that encloses a circle of the specified RadiusInMeter from the center point
  2. (Line 9) Get a collection of S2 cell addresses (hashes) that cover this rectangle of space
  3. (Line 10-100) Query DynamoDB to retrieve the Starbucks within the specified S2 cells and then drop the Starbucks that were part of the covering rectangle, but beyond the radius of the circle

The S2 library handily takes care of the details of #1 and #2. More specifically, to back to questions from my first post, that getCoveringCells method is figuring out the neighboring geo-bins (cells). Like with GeoHash, S2 cells have different levels, from 0 (huge) to 30 (1cm squared).

By default, the S2 library will attempt to return 8 S2 cells (possibly at different levels) to cover the given shape. This creates some work for the dispatchQueries method, which has to generate one or more DynamoDB queries per covering cell:

GeoDataManager.prototype.dispatchQueries = function (covering, geoQueryInput) {
        var _this = this;
        var promises = covering.getGeoHashRanges(this.config.hashKeyLength).map(function (range) {
            var hashKey = S2Manager_1.S2Manager.generateHashKey(range.rangeMin, _this.config.hashKeyLength);
            return _this.dynamoDBManager.queryGeohash(geoQueryInput.QueryInput, hashKey, range);
        });
        return Promise.all(promises).then(function (results) {
            var mergedResults = [];
            console.log(results);
            results.forEach(function (queryOutputs) { return queryOutputs.forEach(function (queryOutput) { return mergedResults.push.apply(mergedResults, queryOutput.Items); }); });
            return mergedResults;
        });
    };

In pseudo-code:

  1. (Line 3) Get a collection of geohashes of the table's hash key length that encompasses the covering S2 cells
  2. (Line 4-5) Setup a DynamoDB query that uses the partition key of the hash key, and the range key of with the covering S2 cell addresses to get all the Starbucks in that part of the covering region

That's a lot to take in. For a concrete example, I added some logging to the library, and from a center point in New York City, I got 8 covering S2 cells, which went 1:1 to 8 DynamoDB queries. For example, one query was of hash key -82501, sort key of S2 cell ids between -8520150788008312831 and -8520141991915290625. Another query was of hash key -85199, sort key of S2 cells ids between -8519982196314865663 to -8519982196180647937.

This is where selecting the length of the hash key becomes important. As James explains in his post, based on the radius of the circle you are searching in and length of that key, the number of queries against DynamoDB and how much you are hammering individual partitions (hash keys) can vary dramatically.

The dynamodb-geo library defaults to a 6 digit hash key. James uses a 5 digit hash key in his example. A rule of thumb for these mostly local type of searches seems to be 5 to 7 digits.

There's a lot more to dig into on S2, the dynamodb-geo library and spatial searches, but for, let's call it a day. Please reach out with any comments or questions. Happy geo searching!

geohashing-intro (3 Part Series)

1) The Problem of Nearness: Part 1 - Geohash 2) The Problem of Nearness: Part 2 - A Solution with S2 3) More on Geohashing: Covering and an updated DynamoDB library

Posted on Jan 24 by:

Discussion

markdown guide