
Geospatial Queries on FaunaDB

Taro Woollett-Chiba · 13 min read

Whether you wish to use Niemeyer’s original geohash, Google’s S2, or Uber’s H3, FaunaDB’s flexible query language lets you drop in a variety of geospatial solutions.

Overview

It likely goes without saying, but I’ll state it anyway: if your product/organization isn’t utilizing geospatial data, you’re missing out. Even strictly internet-based companies use geospatial data to aid in internationalization, customer insight, etc. As developers, when we think of geospatial capabilities in databases, we often share a similar set of critical questions.

Can my primary database handle geospatial data or do I need to add another database to my stack?

How does this database store geospatial data, if at all?

What sort of geospatial queries are possible?

Can this database scale in parallel with my primary database?

If you’re lucky and your primary database has geospatial capabilities, you still have to consider the possibility that those capabilities aren’t suited for your work; there’s a reason why Google and Uber have invented their own geospatial solutions. If you’re not so lucky, then you need to shop for a secondary database that’s “geo-enabled”. Of course, this comes with the typical scaling headaches: clustering, distributing, storing backups, etc. In addition, backend complexity is significantly increased by introducing ETL, another database driver at the API layer, capacity differences from the primary database, and more. All of this surface area for error leads to downstream consequences in your product and UX (e.g. delivery progress data is stale and the customer thinks the order is far away).

More often than not, when selecting geo-enabled databases, we opt for conventional databases with a shallow understanding of their geospatial capabilities. This happens when we prioritize maintainability over capability due to the database pains listed above, as a database that poorly addresses our objectives is better than a database that’s often down or unresponsive. Thus we arrive at what this piece is all about:

Using FaunaDB, we get a zero-maintenance database + the freedom to select from a variety of geospatial solutions 😮 .

Zero-maintenance here means we developers don’t have to worry about sharding, replication, clustering, reserving read/write capacity, etc. We simply create a database, and all the scaling and distribution is done behind the scenes. Now, I’m going to walk you through how my company, Cooksto, uses FaunaDB + Google’s S2 to build a location-based marketplace for homemade food.

Example Scenario

Imagine we have 25 meals being sold on the marketplace and a customer who only wants to browse food within walking distance (1000 meters or less). At this scale, a simple distance calculation between the customer’s location and all of the meals isn’t much of a problem. However, we’re going to build with scale in mind for the sake of this walkthrough.

Our meals are represented with blue pins and white circles, while our user/customer is represented by the centered red pin and capital “U”. The gray circle illustrates the search radius (1000 meters). Our objective is to return only and all of the meals within the search radius.

Solution Part 1: Geohashes

When retrieving database entries based on a specific property, it’s often a good idea to utilize indexes (e.g. user by email). Indexing area/location however is significantly more complex than indexing an email or username field, as this sort of data often has a multi-dimensional representation (e.g. latitude & longitude, polygons, etc.). Encoding these multi-dimensional representations into singular strings or “hashes” is precisely the task geohashing was invented for.

Throughout this article I’ve used the term “geohash” generically for many geospatial indexing solutions (S2, H3, etc.), as it accurately describes the function of such solutions. Strictly speaking, however, “geohash” refers to the very first solution of its kind, a library/system invented by Gustavo Niemeyer in 2008. Since then, Google has published a more accurate take on geohashes (with respect to Earth’s sphere-like structure), called S2. I won’t go into too much detail about S2 here, so if you’re interested in reading more about S2, please refer to the official documentation. Below are a couple of images that summarize how S2 “works”.

Source: https://s2geometry.io/devguide/img/s2curve-large.gif
Roughly speaking, S2 projects a cube and its 6 sides onto our lovely planet, Earth 🌎 ❤️ . Each of the six projected sides, numbered 0 - 5, is called a “face” and buckets a deep hierarchy of S2 “Cells”.

Source: https://s2geometry.io/devguide/img/s2hierarchy.gif
Every S2 Cell (the pseudo-squares depicted above) encapsulates 4 children in a hierarchical manner (except the deepest cells, aka “leaf cells”, which have no children). This hierarchy of cells has a maximum depth of 30 levels; at level 30 (the smallest representation of a space), a cell covers approximately a square centimeter.
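To get a feel for how quickly this hierarchy grows, here’s a back-of-the-envelope calculation: 6 faces at level 0, with every cell splitting into 4 children at each level down:

```javascript
// Total number of S2 Cells at a given level:
// 6 root faces, each subdividing into 4 children per level.
const cellsAtLevel = (level) => 6 * 4 ** level;

console.log(cellsAtLevel(0)); // 6 (the faces themselves)
console.log(cellsAtLevel(10)); // 6291456
console.log(cellsAtLevel(30)); // roughly 6.9 quintillion leaf cells
```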

There are many ways to represent an S2 Cell, but when used with FaunaDB, a “Hilbert Quadtree Key” (“quadkey” for short) is the optimal representation, due to how it expresses the cell hierarchy and nesting with prefixes (more on this later). I first learned of this representation from an npm package called s2-geometry. In short, a quadkey is largely a base-4 number (i.e. most digits can only be 0 - 3), with only the first two characters exempt from the base-4 property: the first character is reserved for the face (remember, a number between 0 and 5, inclusive) and the second character is merely a forward slash (i.e. “/”) to separate the face from the cell digits. Here’s an example maximum-length quadkey for an S2 Cell:

4/030202112231011311302131301203

Due to base-4 representation and the fact that S2 Cells divide into 4 children at each level, each added digit represents 1 of 4 children and an increment in level. This allows us to prefix search on the quadkeys for broader/coarser results. Let’s briefly investigate this prefix property, between levels 5 and 9:

const { S2 } = require("s2-geometry");

const hawaii = {
  lat: 19.5968,
  lon: -155.5828,
};

console.log(
  "3/30222" === S2.latLngToKey(hawaii.lat, hawaii.lon, 5) &&
  "3/302221" === S2.latLngToKey(hawaii.lat, hawaii.lon, 6) &&
  "3/3022213" === S2.latLngToKey(hawaii.lat, hawaii.lon, 7) &&
  "3/30222133" === S2.latLngToKey(hawaii.lat, hawaii.lon, 8) &&
  "3/302221331" === S2.latLngToKey(hawaii.lat, hawaii.lon, 9)
); // true

An image of S2 Cells covering Hawaii from levels 5 - 9. Hawaii can be contained within a quadkey of length 7, as can specific parts of the island if we enlarge the quadkeys to 7 + i, where i is some positive integer.


Notice that each level’s key is simply a prefix of the keys at the levels beneath it. This property applies to all levels. It’s precisely this prefix/nesting behavior that we’ll leverage when indexing our meals.

Solution Part 2: FaunaDB

Now that we have our indexing strategy (prefix searching our S2 Cell quadkeys/geohashes), let’s spin up our database resources. If you haven’t already, you’ll want to start by signing up at FaunaDB and creating a database (e.g. $ fauna create-database my_db; one of many ways to create a database, and yes, it’s really that easy).

After creating the database, we’ll need to create our collection. Recall that we’re providing a marketplace for homemade food, so we’ll name our collection “meals” (i.e. CreateCollection({ name: "meals" })). FaunaDB stores data in the form of nested document objects, each of which belongs to a collection (the hierarchy looks something like Databases > Collections > Documents).

Data

It’s time to populate our collection. From here on, I’ll be using the JavaScript driver for FaunaDB. Let’s create 25 meals within 1500 meters of our example customer:

const { query: q, Client } = require("faunadb");
const { randomCirclePoint } = require("random-location");
const { S2 } = require("s2-geometry");

const userCoordinates = { lat: 30.274665, lon: -97.74035 };
const searchRadius = 1000; // meters

// creates an array of 25 random points
const randomPoints = Array.from({ length: 25 }).map(() =>
  randomCirclePoint(
    {
      latitude: userCoordinates.lat,
      longitude: userCoordinates.lon,
    },
    // * 1.5 to demonstrate excluding undesired results
    searchRadius * 1.5
  )
);

// creates an array of 25 meal objects in memory
const meals = randomPoints.map((point, i) => ({
  // usually something like "Marinara Pasta"
  name: `meal-${i}`,
  // ...
  // of course, in the real world, more attributes would go here
  // e.g. initNumServings, pricePerServing, etc.
  // ...
  pickupLocation: {
    // normally the meal's pickup address
    placeName: `placeName-${i}`,
    coordinates: { lat: point.latitude, lon: point.longitude },
    geohash: S2.latLngToKey(
      point.latitude,
      point.longitude,
      30 // max precision level, approximately a square cm
    ),
  },
}));

const client = new Client({
  secret: "<DATABASE_SECRET>",
});

// executes an FQL (Fauna Query Language) query
client
  .query(
    // creates a document in the "meals" collection
    // for each meal in the in-memory meals array
    q.Do(
      meals.map((data) =>
        q.Create(q.Collection("meals"), {
          data,
        })
      )
    )
  )
  .then((data) => console.log(data))
  .catch((err) => console.log(err));

/*
Should console.log something like this:
[ { ref: Ref(Collection("meals"), "264247276731382290"),
    ts: 1588264691537000,
    data: { name: 'meal-0', pickupLocation: [Object] } },
  ...
  { ref: Ref(Collection("meals"), "264247276730322450"),
    ts: 1588264691537000,
    data: { name: 'meal-24', pickupLocation: [Object] } } ]
*/

Boom 💥 . We have a globally distributed database storing all of our meals. Let’s double-check our work and take a peek at the "meals" collection.

client
  .query(
    // maps over a page/array of 25 meal refs
    // e.g. Ref(Collection("meals"), "264069050122895891")
    q.Map(
      q.Paginate(q.Documents(q.Collection("meals"))),
      // given a ref, gets/reads the document from FaunaDB
      q.Lambda((ref) => q.Get(ref))
    )
  )
  .then((data) => console.log(data))
  .catch((err) => console.log(err));

/*
Should console.log something like this:
{ data: [
 { ref: Ref(Collection("meals"), "264069050122895891"),
   ts: 1588094720905000,
   data: [Object] },
 ...,
 { ref: Ref(Collection("meals"), "264145993260360211"),
   ts: 1588168099888000,
   data: [Object] }
] }
*/

Index

Obviously, querying the meals collection itself returns every meal, regardless of location. Recall that our customer only wants to browse meals within “walking distance” (1000 meters). In FaunaDB, whenever you want specific documents, you utilize indexes. An index allows us to search and/or sort documents based on their fields. For example, we could create an index that matches a given string against meal names, returning only the matched meals. In our case, we want an index that returns meals within a certain location, or more specifically, within a given S2 Cell.

For basic matching and sorting on fields as they are, you may simply create an index targeting those fields. However, if more complex matching/sorting is desired, the logic that determines how or when to return specific results is stored within an index binding. In other words, given a source collection, an index binding provides one or more computed fields on each document for you to search and/or sort on. In our case, we want to compute a list of geohash prefixes to search on. For clarification, it’s indeed possible to search/sort on arrays (which is what we’re doing here), in addition to scalar values (e.g. a string). Let’s start with the “compute” logic for our index binding:

const minPrefixLength = 3;

// example max "prefix": 4/030202112231011311302131301203
// two items to note:
// 1) maxPrefixLength represents the entire geohash
// 2) for simplicity's sake, I consider the entire geohash
//    a "prefix" in this code
const maxPrefixLength = 32;

// a range of integers from 3 to 32
const allPossiblePrefixLengths = Array.from(
  { length: maxPrefixLength - minPrefixLength + 1 },
  (_, i) => minPrefixLength + i
);

// given a meal object or variable:
// returns an array of geohash prefixes
const computePrefixes = (meal) =>
  q.Let(
    {
      geohash: q.Select(["data", "pickupLocation", "geohash"], meal),
      prefixLengths: q.Take(
        q.Subtract(q.Length(q.Var("geohash")), minPrefixLength - 1),
        allPossiblePrefixLengths
      ),
    },
    q.Map(
      q.Var("prefixLengths"),
      q.Lambda((prefixLength) => q.SubString(q.Var("geohash"), 0, prefixLength))
    )
  );

// testing computePrefixes with a single meal
client
  .query(
    q.Let(
      {
        meal: q.Get(q.Documents(q.Collection("meals"))),
      },
      {
        geohash: q.Select(["data", "pickupLocation", "geohash"], q.Var("meal")),
        prefixes: computePrefixes(q.Var("meal")),
      }
    )
  )
  .then((data) => console.log(data))
  .catch((err) => console.log(err));

/*
Should console.log something like this:
{ geohash: '4/030202112231011311302131301203',
 prefixes:
  [ '4/0',
    '4/03',
    '4/030',
     ...
    '4/0302021122310113113021313012',
    '4/03020211223101131130213130120',
    '4/030202112231011311302131301203' ] }
*/

In essence, the binding takes a meal document and returns an array of prefixes. Keep in mind that the last prefix in the array could be the entire geohash itself (which is technically not a prefix by definition). Now we just need to refer to our binding in our index creator, leaving us with:

// ...
// the initial variables from above...
// ...

const createPrefixIndex = q.CreateIndex({
  name: "meals_by_geohash",
  source: {
    collection: q.Collection("meals"),
    fields: {
      prefixes: q.Query(q.Lambda(computePrefixes)),
    },
  },
  // terms are what we search on
  terms: [{ binding: "prefixes" }],
  // values are what get returned
  // i.e. a meal's geohash, ref, latitude, and longitude
  values: [
    {
      field: ["data", "pickupLocation", "geohash"],
    },
    {
      field: ["ref"],
    },
    {
      field: ["data", "pickupLocation", "coordinates", "lat"],
    },
    {
      field: ["data", "pickupLocation", "coordinates", "lon"],
    },
  ],
});

client
  .query(createPrefixIndex)
  .then((data) => console.log(data))
  .catch((err) => console.log(err));

/*
Should console.log something like this:
{ ref: Index("meals_by_geohash"),
  ts: 1588263734990000,
  active: false,
  serialized: true,
  name: 'meals_by_geohash',
  source: ...,
  terms: ...,
  values: ...,
  partitions: 1 }
*/

Nice 🙂 . Let’s double-check our work and test the index real quick.

// most of the meals we uploaded earlier...
// share the same geohash up to this prefix...
// so we should see multiple results
const coarseGeohash = "4/030202112231";

// again, multiple meals will likely share this geohash...
// however, fewer meals will likely be returned
const finerGeohash = coarseGeohash + "1";

const readIndex = {
  coarseGeohash: q.Count(q.Match(q.Index("meals_by_geohash"), coarseGeohash)),
  finerGeohash: q.Count(q.Match(q.Index("meals_by_geohash"), finerGeohash)),
};

client
  .query(readIndex)
  .then((data) => console.log(data))
  .catch((err) => console.log(err));

/*
Should console.log something like this:
{ coarseGeohash: 19, finerGeohash: 6 }
*/

Query

With our index created, it’s now possible to retrieve meals within an S2 Cell. If you recall, our customer wishes to browse meals within a specific search radius; when it comes to satisfying this search, we have a spectrum of granularity to choose from. In simpler terms, our search radius is a circle, and we need to find one or more S2 Cells to cover/fill this circle. While using more S2 Cells (likely of varying sizes/levels) will more accurately mimic the shape of our search “circle”, it’ll also incur slightly higher costs. If extreme accuracy isn’t desired, a single S2 Cell can technically cover our search radius, leading to a single read from our index. If moderate accuracy is acceptable, then 4 - 8 S2 Cells should do the trick, resulting in an insignificant 4 - 8 index reads or fewer. This trend continues into the upper tens of S2 Cells, at which point, while the query will be extremely accurate, performance and cost should be considered.

The process of covering our search radius is a bit too detail-intensive for this piece. Fortunately, FQL lends itself perfectly to query generation, which makes it easy to write higher-order FQL functions. With it, I’ve created an npm package called faunadb-geo for us to use. One of the abstract FQL functions in the package, GeoSearch, uses a series of Match, Range, and Difference operations to create the optimal FQL query under the hood. Typically, each read to the database would map to a single S2 Cell, but with FQL, I was able to save on performance and costs by optimizing a significant number of reads to include more than one cell. I’ve included a handful of additional optimizations/adjustments you can select from, which affect the precision of your results, so I suggest exploring the docs/GitHub repo for more info. Below is an example use of GeoSearch.

const { Client, query: q } = require("faunadb");
const { GeoSearch } = require("faunadb-geo")(q);

const userCoordinates = { lat: 30.274665, lon: -97.74035 };
const searchRadius = 1000; // meters

client.query(
  q.Paginate(
    GeoSearch("meals_by_geohash", userCoordinates, searchRadius, {
      maxReadOps: 8,
    })
  )
);

"Read ops" are how FaunaDB calculates costs per read; this is what’s referred to in the maxReadOps parameter. FaunaDB’s pricing is largely on-demand focused; however, the opportunity to reserve read/write ops and more in advance is available. Fun fact: with the base pricing, a whole geoquery (from when it reads the indexes to when it reads each document) of 100 read ops easily undercuts the price of a “pro plan” Algolia search 👏 . It isn’t until the 300 read op range that the cost becomes similar; even then, FaunaDB’s base pricing can be discounted as you scale.

Rather than log results, I’ve provided images of GeoSearch in multiple instances/coverings, demonstrating a spectrum of performance based on the maxReadOps parameter. Keep in mind that the actual number of read ops used will likely be less than what maxReadOps is set to.

An image of S2 Cells and S2 Cell Ranges covering our customer’s search radius. A “Range” is just a series of neighboring/sibling cells we can combine under a single read. The blue polygons represent cells/cell-ranges we wish to read from our index, with each polygon resulting in a single read op. This particular covering had maxReadOps set to a mild 8.

Green cells/cell-ranges represent cost-efficient differences, used to reduce the total number of read ops. This covering had maxReadOps set to 16.

Another covering, but with maxReadOps set to 32.

As you can see, our representation of the search radius can vary according to how many read ops we’d like to incur, along with how accurate we need our results to be. In the case of selling and advertising homemade food, the need for precision doesn’t necessarily outweigh the desire for cost effectiveness. In addition, using the CalculateDistance function from faunadb-geo, we can use FQL’s Filter to drop any results outside of our search radius (refer to the faunadb-geo docs for an example).
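For illustration, here’s a client-side equivalent of that distance filter in plain JavaScript, assuming each index entry is the [geohash, ref, lat, lon] tuple our index returns (the real thing runs server-side in FQL via CalculateDistance):

```javascript
// Haversine distance in meters between two { lat, lon } points.
const EARTH_RADIUS_M = 6371e3;
const toRad = (deg) => (deg * Math.PI) / 180;

const haversineMeters = (a, b) => {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h));
};

// Keep only the index entries within the search radius.
const withinRadius = (page, user, radiusMeters) =>
  page.filter(
    ([, , lat, lon]) => haversineMeters(user, { lat, lon }) <= radiusMeters
  );
```

Since the covering cells only approximate the circle, a final distance check like this is what guarantees “only and all” of the meals inside the radius.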

To conclude, with FaunaDB’s flexibility, both in implementation and performance, we’ve arrived at a solution which meets all of our needs. Our precision needs are satisfied by the ability to query across the planet or down to a few square centimeters. Scale isn’t a concern for our globally-distributed database, where the concept of “instances” isn’t relevant to the developer. Additionally, our desire for cost effectiveness is easily accounted for by FaunaDB’s generous free tier and very competitive on-demand pricing (where in many cases our geoqueries are cheaper than alternatives like Algolia!). If you have any questions or comments, please feel free to reach out to me on the FaunaDB community Slack. I hope you enjoyed this article and that FaunaDB excites you as much as it does me, cheers! 🍻

Discussion

Jon Wyatt

Fantastic, thanks for sharing. This was precisely what I was looking for when beginning to explore adding a map search to my FaunaDB app.

Your faunadb-geo lib looks handy too. Obviously you are using the Node driver, so I assume the request from the client is handled by a Node backend of some sort. I'm using GraphQL and have so far managed to make most of my requests directly to Fauna from the client, leveraging UDFs and ABAC rules, and was planning to do the same for the map radius search. So I'm just wondering if there's any reason you didn't add the logic directly into the Fauna instance as a UDF (or series of them)?

Taro Woollett-Chiba (Author)

Hey Jon! Thank you so much for reading and dropping a comment! As you mentioned, the faunadb-geo lib is performing the necessary S2 calculations and then passing the results to the request for index searches. I did this for the sake of development speed. It might be possible to re-implement critical parts (if not all) of S2's logic in FQL, but that would require a significant amount of work (there's a lot that happens under the hood in S2). I'm still considering this route for the future, but for now, the client-side S2 generation will have to suffice. That being said, if you're using NPM/Yarn on the frontend, you can use the faunadb-geo lib from there and, if it concerns you, implement FQL/UDFs to limit client requests (although it'd be a tad more work). I'd also be interested in potentially implementing such things in the package, so feel free to open an issue. Cheers.

Jon Wyatt

Thanks, finally picking this up now, and your faunadb-geo lib is great! I've very quickly been able to filter by location using it, though I do have a couple of further questions:

As I mentioned, I was mostly using GraphQL for its convenience of fetching the data I want. So what's the best way of returning an object for each matching document with all the fields required, rather than just the list of values specified by the index? Especially where I want to pull values out of other collections that are referenced in the document I've geo-searched on.

Seems like the object would have to be constructed manually from the results, basically doing what FaunaDB does automagically when resolving a GraphQL query. I just wondered if you know of another way?

The other question is: how can you pass additional search terms to the index? If you want to also allow filtering by other fields as well as geohash, is there a way to pass those parameters through to the index?

Thanks again for the great library and the clear explanations given in this article!

Taro Woollett-Chiba (Author)

Hey Jon, regarding GraphQL, I assume you're strictly using Fauna's implementation of it? If so, as you mentioned, writing a UDF along the lines of this docs.fauna.com/fauna/current/api/g... should do the trick. It might also be possible to simply rely on the @index directive; however, I haven't tested this yet.

As for passing additional search terms, your best bet is to create additional indexes and perform those searches outside of the index used for GeoSearch. The input can be cleanly combined in a UDF and assigned to a GraphQL query using @resolver.

Let me know if I missed anything or didn't satisfactorily answer your questions. Cheers.

IHG

Hi There,

I am trying to generate a geohash in C#, but I'm not able to; can you help us with this?

We could also use some help with your faunadb-geo plugin: isn't it available in pure JS or C#, rather than just the Node one?

Taro Woollett-Chiba (Author)

Hey, thanks for the comment!

At the moment, my package doesn't have any support for C#; however, the general approach (i.e. prefix searching) is still applicable. I'd search for a C# implementation of an S2 library and then figure out how to convert the given S2 Cell IDs into "quadkeys". I imagine there must be a C# library for S2 somewhere on the internet; however, the quadkey conversion will likely rely on your own implementation.

As for a pure JS version of my geo package, it's definitely on my radar; however, I'm a bit occupied with other tasks. I do plan on getting to it at some point in the near future. Thanks again for reading and checking out my work, and let me know if you have any other questions. Cheers :)

Yaron Levi

Great Article! Thank you for sharing 🙂
I'll definitely use it as we gradually move to Fauna.
I'm wondering when Fauna will add a Geography data type to make things simpler.

Taro Woollett-Chiba (Author)

Hey Yaron! I appreciate the comment! I believe built-in geo support has been ticketed; however, the team (last I heard) is focused on other critical features, such as streaming/triggers. I believe it's possible to rewrite the S2 library (or perhaps other geohash libraries) in pure FQL, which would be barely distinguishable from built-in/native support. Such an implementation makes me wonder if there's a need for more conveniently sharing complex UDFs and FQL among the Fauna community (maybe a package manager like npm?).