In software development, we sometimes encounter problems that are highly specific to the work we have undertaken. For 99% of the problems we face, Stack Overflow or Reddit (or even ChatGPT these days) comes to our rescue; for the remaining 1%, we are on our own. A software developer who follows "best practices" usually does not run into such problems. But sometimes the situation we find ourselves in is not a "best practice" situation. In fact, these are the situations where "engineering skills" really show. This is the story of how I came up with a solution to one of those "non-best-practice" situations.
In 2018, I was developing an in-house vehicle tracking application for a shuttle service company. The company had a fleet of 1,100 shuttles operating throughout the city. The shuttles' location data was sent to the server and monitored in real time from the frontend apps (ReactJS for web and React Native for mobile) via WebSocket. The company used the Google Maps Reverse Geocoding API to obtain the street name from each shuttle's latitude and longitude. The company and its Google sales partner had signed a fixed annual price agreement for unlimited requests, which had been renewed every year until 2018.
The Problem
In 2018, Google changed the pricing policy for its APIs, switching to a pay-as-you-go model instead of the annual fixed price. This change meant very high costs for my client, which was already a small company.
Our first reaction was to try other paid and free alternatives to Google. Unfortunately, none of them returned results as accurate as Google's. In most of the other services, the street name, neighborhood name, or other information was either missing or simply wrong.
Errors in addresses, especially in street names, mattered a great deal to the shuttle owners who reviewed their vehicles' daily reports. Owners would confront their drivers whenever they saw unfamiliar street names on a route. In short, we needed to solve the problem in the fastest and most cost-effective way possible.
Solution Approaches
We already had 65 million report points in our database. One possible solution, then, was to compare each new incoming coordinate against this data and reuse the address of the nearest point.
SELECT latitude, longitude,
  SQRT(
    POW(69.1 * (latitude - [startlat]), 2) +
    POW(69.1 * ([startlng] - longitude) * COS(latitude / 57.3), 2)
  ) AS distance
FROM TableName
HAVING distance < 25
ORDER BY distance;
Credit: Kaletha's answer on Stack Overflow
Easy peasy, right? OK, here is the new problem. The shuttles are busiest between 06:00 and 08:00 in the morning and between 17:00 and 19:00 in the evening. Due to local regulations and the operating logic of the system, shuttles have to send coordinates to the backend every 250 meters or every 30 seconds (whichever happens first). This means no calculation can take longer than 30 seconds per coordinate, and ideally much less. On our low-cost Windows Server 2016 VPS with 16 GB of RAM, it was impossible to run the SQL queries you see on Stack Overflow in under 30 seconds against 65M rows of data. The solution we needed was an inexpensive one that required much less computation, or even none at all.
The Solution
The solution was to change the paradigm. The first idea that came to my mind was to reduce the two numeric variables, latitude and longitude, to a single value: one number that represents both. After some digging around the web, I came across the quadtree model, a methodology widely used in GIS.
A quadtree is a data structure that divides a region into four equal parts, then divides each part into four again, repeatedly, until the desired resolution is reached. It is used in many fields, from compression algorithms and image processing to mesh generation and even Conway's Game of Life.
If we divided the earth into 4 equal parts, found out which part our coordinate falls into, and repeated that enough times, this method could be a solution to our problem.
I was very surprised that I had never heard of this method until that day.
Let’s calculate the quadtree equivalent of a random point:
const quadtree = (lat, long, zoom) => {
  // Bounding box of the whole world. It is declared inside the function so
  // that repeated calls do not leak state into each other.
  let minLat = -90;
  let maxLat = 90;
  let minLong = -180;
  let maxLong = 180;
  const positionArray = [];

  for (let i = 0; i < zoom; i++) {
    // Shrink the bounding box to the quadrant chosen in the previous step.
    if (i > 0) {
      const prevMidLat = (maxLat - minLat) / 2 + minLat;
      const prevMidLong = (maxLong - minLong) / 2 + minLong;
      const prev = positionArray[i - 1];
      if (prev === 1) { minLat = prevMidLat; maxLong = prevMidLong; } // north-west
      if (prev === 2) { minLat = prevMidLat; minLong = prevMidLong; } // north-east
      if (prev === 3) { maxLat = prevMidLat; maxLong = prevMidLong; } // south-west
      if (prev === 4) { maxLat = prevMidLat; minLong = prevMidLong; } // south-east
    }

    // Decide which quadrant of the current box the point falls into.
    const midLat = (maxLat - minLat) / 2 + minLat;
    const midLong = (maxLong - minLong) / 2 + minLong;
    if (lat >= midLat && long >= midLong) {
      positionArray.push(2); // north-east
    } else if (lat >= midLat && long < midLong) {
      positionArray.push(1); // north-west
    } else if (lat < midLat && long < midLong) {
      positionArray.push(3); // south-west
    } else {
      positionArray.push(4); // south-east
    }

    console.log(minLat, maxLat, minLong, maxLong);
  }

  return positionArray;
};
We calculate the quadtree sequence with the small code snippet above. Here is an example call:
console.log(quadtree(40.98227, -73.783594, 18));
Output should be:
-90 90 -180 180
0 90 -180 0
0 45 -90 0
22.5 45 -90 -45
33.75 45 -90 -67.5
39.375 45 -78.75 -67.5
39.375 42.1875 -78.75 -73.125
40.78125 42.1875 -75.9375 -73.125
40.78125 41.484375 -74.53125 -73.125
40.78125 41.1328125 -73.828125 -73.125
40.95703125 41.1328125 -73.828125 -73.4765625
40.95703125 41.044921875 -73.828125 -73.65234375
40.95703125 41.0009765625 -73.828125 -73.740234375
40.97900390625 41.0009765625 -73.7841796875 -73.740234375
40.97900390625 40.989990234375 -73.7841796875 -73.76220703125
40.97900390625 40.9844970703125 -73.7841796875 -73.773193359375
40.98175048828125 40.9844970703125 -73.7841796875 -73.7786865234375
40.98175048828125 40.983123779296875 -73.7841796875 -73.78143310546875
[
1, 4, 1, 1, 2, 3, 2,
4, 4, 1, 3, 3, 2, 3,
3, 1, 3, 3
]
As you can see, the output of console.log(minLat, maxLat, minLong, maxLong) gets closer and closer to the coordinates we feed into the function. This is expected: as we increase the resolution via the zoom parameter, the square that contains the coordinate gets smaller and smaller, and only nearby coordinates remain inside it.
The prerequisite for the algorithm to work is that the existing 65M coordinates are converted to quadtree format. Once this is done, each new coordinate is compared only against quadtree-formatted fields.
Let's take a look at how quadtree querying works. Below is the quadtree output of five sample points:
// a point from Bursa, Turkey
// console.log(quadtree(40.218454, 29.041702, 18));
// [2, 3, 1, 2, 1, 4, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 4]

// another point from Bursa, Turkey
// console.log(quadtree(40.208137, 28.941271, 18));
// [2, 3, 1, 2, 1, 4, 3, 1, 4, 3, 1, 4, 1, 2, 1, 1, 4, 1]

// a point from İstanbul, Turkey
// console.log(quadtree(40.987669, 29.036766, 18));
// [2, 3, 1, 2, 1, 4, 1, 3, 4, 1, 4, 3, 1, 4, 1, 2, 4, 4]

// a point from NY, USA
// console.log(quadtree(40.748108, -73.985841, 18));
// [1, 4, 1, 1, 2, 3, 4, 2, 1, 2, 2, 1, 3, 3, 2, 2, 1, 2]

// another point from NY, USA
// console.log(quadtree(40.98227, -73.783594, 18));
// [1, 4, 1, 1, 2, 3, 2, 4, 4, 1, 3, 3, 2, 3, 3, 1, 3, 3]
If you look closely at the quadtree outputs, you will notice that the first 5 or 6 digits come out the same when the coordinates are close to each other.
The closer two points are, the more leading digits their quadtree sequences share. Therefore, by trial and error, you can decide how many leading digits to treat as common. In my case, since the first five digits gave a common result for Bursa, Turkey, I created a column called "res1" in the MySQL database and inserted the first five digits of each quadtree calculation into it. The best approach is to group the digits into chunks of decreasing size, so it makes sense to set res1 to 5 digits, res2 to 3 digits, res3 to 3 digits, res4 to 2 digits, res5 to 2 digits, and res6, res7 and res8 to 1 digit each. For faster response times, it also makes sense to index the res columns.
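As a minimal sketch of that grouping (using the 5-3-3-2-2-1-1-1 split described above; adjust the sizes to whatever your own trial and error suggests), the res values can be derived from a quadtree sequence like this:

// Split an 18-element quadtree sequence into the res1..res8 groups.
// The group sizes below follow the example split from the text; they are
// an assumption you should tune for your own data.
const GROUP_SIZES = [5, 3, 3, 2, 2, 1, 1, 1];

const toResColumns = (positionArray) => {
  const columns = {};
  let offset = 0;
  GROUP_SIZES.forEach((size, index) => {
    columns["res" + (index + 1)] = positionArray
      .slice(offset, offset + size)
      .join("");
    offset += size;
  });
  return columns;
};

// Example: the first Bursa point from above
// toResColumns([2, 3, 1, 2, 1, 4, 3, 1, 4, 3, 2, 1, 3, 4, 2, 1, 4, 4])
// => { res1: '23121', res2: '431', res3: '432', res4: '13',
//      res5: '42', res6: '1', res7: '4', res8: '4' }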
Will it work?
We re-encoded our existing data with the quadtree method and encoded every new data point in quadtree format as it arrived. Now it's time to compare. Instead of comparing against millions of rows, let's first SELECT a few points that are close:
SELECT * FROM data WHERE res1=231214 AND res2=314 AND res3=321 AND res4=34
Depending on the density of the service routes, the query above will probably return 10–20 points in under a second! From there it is up to you whether you pick the nearest coordinate with the classic distance methods or increase the query resolution by adding "AND res5=…".
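To make that last step concrete, here is a rough sketch of how the final nearest point could be picked from the small candidate set. The candidates array is a hypothetical stand-in for the rows returned by the query above, and the distance formula mirrors the flat-earth approximation from the Stack Overflow query earlier in the article:

// Approximate distance (roughly in miles), same formula as the SQL snippet.
const approxDistance = (lat1, lon1, lat2, lon2) => {
  const dLat = 69.1 * (lat2 - lat1);
  const dLon = 69.1 * (lon1 - lon2) * Math.cos(lat2 / 57.3);
  return Math.sqrt(dLat * dLat + dLon * dLon);
};

// Pick the closest candidate from the handful of rows returned by the
// res-column query. Rows are assumed to look like
// { latitude, longitude, address }.
const nearestAddress = (lat, long, candidates) =>
  candidates
    .map((row) => ({
      ...row,
      distance: approxDistance(lat, long, row.latitude, row.longitude),
    }))
    .sort((a, b) => a.distance - b.distance)[0];

Sorting 10–20 rows in application code is trivially cheap, which is exactly the point: the heavy filtering has already been done by the indexed res columns.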
An even better approach would be to use a NoSQL database and set up parent-child relationships between the groups of digits. That way, I believe it would be possible to return results even faster than MySQL does.
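Purely as an illustration of that idea (not something I implemented), a document keyed by quadtree digit groups might look roughly like this; the field names are hypothetical:

// Hypothetical document shape for a NoSQL store: points are nested under
// their quadtree digit groups, so a lookup walks parent -> child keys
// instead of filtering indexed columns.
const bucketDocument = {
  res1: "23121",            // coarse quadtree prefix (parent key)
  children: {
    "431": {                // res2 group
      "432": {              // res3 group
        points: [
          { latitude: 40.218454, longitude: 29.041702, address: "..." },
        ],
      },
    },
  },
};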
Requesting new data
The quadtree method can give unreliable results if not enough data has accumulated in the database. To catch suspicious results, the number of points returned by the query is a useful clue. If the query returns fewer than, say, 10 points, the quadtree square may have been set too small. Likewise, if not enough data has been re-encoded in quadtree format, few points will come back even with a large square. In such cases, sending a request to the reverse geocoding service and storing the newly obtained data in quadtree format for future use will increase the accuracy of the system over time, as sketched below.
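A rough sketch of that fallback logic might look like the following. It reuses the quadtree function and the nearestAddress helper from earlier; queryByQuadtree, reverseGeocode and saveEncodedPoint are hypothetical placeholders for your own data access and external API code:

// Resolve an address for a new coordinate: try the quadtree-encoded data
// first, and fall back to the paid reverse geocoding API only when too few
// nearby points exist. The three injected helpers are placeholders.
const MIN_CANDIDATES = 10; // threshold suggested in the text

const resolveAddress = async (lat, long, { queryByQuadtree, reverseGeocode, saveEncodedPoint }) => {
  const positionArray = quadtree(lat, long, 18);
  const candidates = await queryByQuadtree(positionArray); // e.g. the SELECT shown earlier

  if (candidates.length >= MIN_CANDIDATES) {
    // Enough local data: reuse the nearest stored address.
    return nearestAddress(lat, long, candidates).address;
  }

  // Too few local points: ask the external service once, then store the
  // result in quadtree format so future lookups can reuse it.
  const address = await reverseGeocode(lat, long);
  await saveEncodedPoint({ lat, long, positionArray, address });
  return address;
};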
Real World Results
With the quadtree system in place for reverse geocoding, the number of requests sent to Google dropped by 96% in the first year; API requests were sent only for suspicious coordinates. As a result, the pay-as-you-go invoice stayed within the limits of what my client could afford. Thanks to the new algorithm, comparable matching results were obtained up to 60 times faster on the existing server hardware.
Final Say
Creating a new solution to an old problem from a different perspective is the most satisfying part of software development for me. Fortunately (and unfortunately), such problems do not come up very often. Please share your own ideas and criticisms about the quadtree method in the comments. This was my first Medium article, so any feedback would be greatly appreciated.
Legal Notes:
In accordance with the contract we signed with the client company, some details about the operation of the algorithm have been changed for confidentiality. These include the server's operating system, the response times, and the res column grouping. The JS code included above was rewritten from scratch for this article, but the main idea was carried over without any distortion, I promise :)

![Image credit: Wikipedia article](https://res.cloudinary.com/practicaldev/image/fetch/s--cKCQlhYA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3840/1%2AsxzWbjZ6t3dm2nuTos1fkw.png)
