João Vasconcelos
Unveiling the Power of Spatial Clustering with PostgreSQL

Spatial clustering, a technique pivotal in organizing geographical data points based on proximity, fuels the efficient display and analysis of location-based information across numerous platforms. Think about the way Airbnb showcases property listings or how Booking.com handles vast arrays of accommodations - these popular websites rely on spatial clustering to streamline user experiences.

While client-side libraries like Mapbox offer clustering capabilities, they can face performance challenges, especially with extensive datasets. In such cases, performing spatial clustering on the server side, particularly with a robust database like PostgreSQL, emerges as a more efficient solution.

This article dives into the significance of spatial clustering in enhancing performance, shedding light on how PostgreSQL, with its spatial extension PostGIS, allows us to manage and optimize the usage of spatial data.


Requirements

  1. Postgres with the PostGIS extension installed. You can use the docker-compose file below; it uses the official PostGIS image to run Postgres with the extension already installed. Once the container is up, you can verify the setup with the query shown after the compose file.
version: '3'

services:
  postgres:
    image: postgis/postgis:latest
    container_name: my_postgres_container
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: mydatabase
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  postgres_data:

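As a quick sanity check once the container is running (using the credentials from the compose file above), you can connect to mydatabase and confirm that PostGIS is available. The official postgis/postgis image normally enables the extension for you, so the CREATE EXTENSION line below should be a no-op:

-- Enable PostGIS if it is not enabled yet (a no-op on the official image).
CREATE EXTENSION IF NOT EXISTS postgis;

-- Should return the PostGIS version string if everything is in place.
SELECT postgis_full_version();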

Setting up the data

Model definition

Before we can cluster anything, we need data. With that in mind, we are going to create the following table in our database:

CREATE TABLE last_location (
    id SERIAL PRIMARY KEY,
    lat NUMERIC,
    lng NUMERIC,
    geom GEOMETRY(Point, 3395),
    -- PostGIS Point field with SRID 3395
    created_on TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    modified_on TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

The lat and lng fields represent the location in degrees, following the WGS84 reference system used by GPS. The geom field will be a geometry, a spatial data type used to represent a feature in planar (Euclidean) coordinate systems.

In this case, we will represent a point in the World Mercator spatial reference system, hence the SRID 3395. The World Mercator reference system represents almost the entire world (it excludes the polar regions) in meters.

You might be asking "Why do we need 2 distinct ways of representing the location (WGS84 vs. World Mercator)?"

While the WGS84 system is great for representing and visualizing the location on a map (e.g. displaying the points in a Web app), the World Mercator representation in meters will make the cluster calculations easier, since it's a lot simpler to think of a cluster radius in meters than in degrees.
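To make the difference concrete, here is a small illustration (the coordinates below are an arbitrary example point, not part of the dataset we are about to create): the same location expressed in degrees with SRID 4326 and in meters with SRID 3395.

-- The same point in degrees (WGS84, SRID 4326) and in meters (World Mercator, SRID 3395).
SELECT
    ST_AsText(ST_Point(-73.98, 40.75, 4326)) AS wgs84_degrees,
    ST_AsText(ST_Transform(ST_Point(-73.98, 40.75, 4326), 3395)) AS world_mercator_meters;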

Populating the data

Now that we have our table ready to store the locations, we will need to populate it with data. We are not going to be super precise with the locations (some of the data points might end up in the sea), but the goal here is to see how the clustering works.

INSERT INTO last_location (lat, lng, geom)
SELECT
    lat,
    lng,
    ST_TRANSFORM(ST_Point(lng, lat, 4326), 3395) AS geom
FROM (
    SELECT
        (RANDOM() * (49.384358 - 24.396308) + 24.396308) AS lat, -- Range of latitudes in the continental US
        (RANDOM() * (-66.93457 + 125.001636) - 125.001636) AS lng -- Range of longitudes in the continental US
    FROM generate_series(1, 500)
) AS random_locations;

Using the SQL statement above, we are going to generate 500 random records using the range of latitudes and longitudes of the continental US. The geom field is calculated by transforming the WGS84 representation, identified by the SRID (spatial reference identifier) 4326, into the World Mercator representation, SRID 3395.
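If you want to sanity-check the insert before moving on (this check is just a suggestion, not part of the original setup), a quick count per SRID should show all the generated rows stored with SRID 3395:

-- Quick sanity check: every generated row should be stored with SRID 3395.
SELECT ST_SRID(geom) AS srid, count(*) AS num_rows
FROM last_location
GROUP BY ST_SRID(geom);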

In the end, you should have 500 locations spread across the US and adjacent areas. The image below was taken from DBeaver, which has a spatial data viewer that helps a lot when working with this type of data.
Location spread across the US

Clustering the data

After all this work, we finally get to the juicy part: clustering the data.

To cluster the data we are going to use an operation provided by PostGIS, ST_CLUSTERDBSCAN.

This is a window function that returns a cluster number for each input geometry. Besides the geometry, you will need to specify the desired distance (eps) and density (minpoints) parameters that determine each cluster. Since this is a window function and one geometry might be eligible to belong to multiple clusters, we need to use an ORDER BY clause in the window definition to ensure deterministic results.

It's important to understand that the value you assign to eps is expressed in the units of the geometry you are clustering. That means that if your geometry is represented in degrees and eps is 1000, you will be trying to form clusters with a distance of 1000 degrees, not 1000 meters or any other length unit (I fell into this trap and spent about a day trying to understand why my clustering wasn't working).
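A quick way to see this (using two arbitrary rows of our table; the specific ids are only an example) is to measure the distance between the same pair of points in both representations: the geometry stored in SRID 3395 gives a distance in meters, while the transformed SRID 4326 version gives a distance in degrees.

-- Distance between the same two points in each representation.
-- The 3395 geometries yield meters, the 4326 ones yield degrees,
-- so an eps of 100000 only makes sense for the metric geometry.
SELECT
    ST_Distance(a.geom, b.geom) AS distance_in_meters,
    ST_Distance(ST_Transform(a.geom, 4326), ST_Transform(b.geom, 4326)) AS distance_in_degrees
FROM last_location a, last_location b
WHERE a.id = 1 AND b.id = 2;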

Let's try to use it and analyse the output.

-- Query
-- Cluster with a distance of 100000 meters
-- and a minimum of 2 locations to form a cluster
select
    ST_CLUSTERDBSCAN(geom,
        eps := 100000,
        minpoints := 2) over (
    order by
        id) as cluster_id,
    geom
from
    last_location ll
order by
    cluster_id
limit 10;

-- Result
cluster_id|geom                                          |
----------+----------------------------------------------+
         0|POINT (-9620039.471564963 4213160.92111902)   |
         0|POINT (-9685949.732485048 4238072.815488585)  |
         0|POINT (-9626448.620191192 4284895.837818226)  |
         0|POINT (-9739000.227136273 4185753.078454269)  |
         1|POINT (-10511656.94032765 3263151.691116077)  |
         1|POINT (-10657529.563062856 3319832.321128275) |
         1|POINT (-10585249.991795693 3285868.5630962)   |
         1|POINT (-10556119.395681698 3311324.9947481216)|
         2|POINT (-13744886.491918504 3118709.415406296) |
         2|POINT (-13638278.89205634 2927643.4725107397) |

Clusters

Looking at the partial result above, we formed 3 distinct clusters, identified by cluster_id, and we can see which points belong to each cluster.

Finding the center of the cluster

Now that we have formed our clusters, we still need to find each cluster's centroid and calculate the number of locations per cluster. This is the information we would need to represent the clusters on a Web interface.

select
    cluster_loc.cluster_id,
    ST_Y(
        ST_TRANSFORM(
            ST_Centroid(ST_COLLECT(cluster_loc.geom)),
            4326
        )
    ) as lat,
    ST_X(
        ST_TRANSFORM(
            ST_Centroid(ST_COLLECT(cluster_loc.geom)),
            4326
        )
    ) as lng,
    ST_Centroid(ST_COLLECT(cluster_loc.geom)) as centroid,
    count(cluster_loc.cluster_id) as num_points
from
    (
    select
        ST_CLUSTERDBSCAN(geom,
            eps := 100000,
            minpoints := 2) over (
        order by
            id) as cluster_id,
        geom
    from
        last_location ll
    order by
        cluster_id) cluster_loc
group by
    cluster_loc.cluster_id;

If you look at the results of the previous query, you will probably find a row with cluster_id set to null. This is because not all locations are eligible to fit in a cluster, so if you also want to represent the non-clustered locations, you can execute the following query.

select
    cluster_loc.cluster_id,
    ST_Y(
        ST_TRANSFORM(
            cluster_loc.geom,
            4326
        )
    ) as lat,
    ST_X(
        ST_TRANSFORM(
            cluster_loc.geom,
            4326
        )
    ) as lng,
    cluster_loc.geom as centroid,
    0 as num_points
from
    (
    select
        ST_CLUSTERDBSCAN(geom,
            eps := 100000,
            minpoints := 2) over (
        order by
            id) as cluster_id,
        geom
    from
        last_location ll
    order by
        cluster_id) cluster_loc
where
    cluster_loc.cluster_id is null;
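If it helps, here is a sketch of one way to return the clusters and the stand-alone locations in a single statement, reusing the same window definition in a CTE. It keeps the num_points = 0 convention from the query above for stand-alone points; adapt that to whatever your front end expects.

-- One possible way to return clusters and stand-alone locations together.
with clustered as (
    select
        ST_CLUSTERDBSCAN(geom, eps := 100000, minpoints := 2) over (order by id) as cluster_id,
        geom
    from
        last_location
)
-- Clustered locations: one row per cluster with its centroid and point count.
select
    cluster_id,
    ST_Y(ST_TRANSFORM(ST_Centroid(ST_COLLECT(geom)), 4326)) as lat,
    ST_X(ST_TRANSFORM(ST_Centroid(ST_COLLECT(geom)), 4326)) as lng,
    count(*) as num_points
from clustered
where cluster_id is not null
group by cluster_id
union all
-- Stand-alone locations: one row per point, keeping num_points = 0 as above.
select
    cluster_id,
    ST_Y(ST_TRANSFORM(geom, 4326)) as lat,
    ST_X(ST_TRANSFORM(geom, 4326)) as lng,
    0 as num_points
from clustered
where cluster_id is null;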

Final thoughts

Mastering server-side clustering empowers you to enhance the efficiency of location-based applications. You can adapt the logic to your specific needs: restricting the query to the current map bounding box, adjusting the cluster radius dynamically with the zoom level, and exploring numerous other possibilities.
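As a starting point for the bounding-box idea, here is a minimal sketch that only clusters the locations inside the current map viewport (the viewport coordinates below are made up for illustration; in a real application they would come from the client):

-- Cluster only the locations inside a (hypothetical) map viewport.
-- The envelope is built in WGS84 degrees and transformed to SRID 3395 to match the geom column.
select
    ST_CLUSTERDBSCAN(geom, eps := 100000, minpoints := 2) over (order by id) as cluster_id,
    geom
from
    last_location
where
    geom && ST_Transform(
        ST_MakeEnvelope(-100.0, 30.0, -80.0, 45.0, 4326), -- xmin, ymin, xmax, ymax in degrees
        3395
    );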

I trust this guide has been a helpful resource on your journey to optimize and elevate your location-based projects. Happy clustering!

Visit my GitHub to check out all of the code used in this post.


Have questions? Any feedback is welcome, and I will be glad to start a discussion in the comments section.
