DEV Community

Savaş Altürk
Savaş Altürk

Posted on

Spatial Data Analysis with DuckDB

Introduction

In this article, I will explain how to access point of interest (POI) data published as open data by the Overture Maps Foundation with DuckDB and how to perform spatial analysis.

What is DuckDB?

DuckDB is designed to support analytical query workloads, also known as Online analytical processing (OLAP). It includes a columnar-vectorized query execution engine. This is more performant than traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. There are many plugins available. You can easily transfer your data in environments such as Amazon S3, Google Cloud Storage, postgresql using plugins. You can perform spatial analysis by installing the Spatial plugin.

Environment Setup

pip install duckdb==0.8.1
Enter fullscreen mode Exit fullscreen mode

Creating a database

import duckdb
db = duckdb.connect("data.db")
Enter fullscreen mode Exit fullscreen mode

Installing plug-ins for data access and spatial analysis
We install the "spatial" plugin to perform spatial analysis.
We are installing the "httpfs" plugin to access POI data in Amazon S3. Then we define the region as "us-west-2".

db.sql("""
INSTALL spatial;
INSTALL httpfs;
LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';
""")
Enter fullscreen mode Exit fullscreen mode

We create a table in the database and transfer the data in parquet format. 59 million rows of data were transferred to the database in about 34 minutes.

db.sql("""
  create table places as 
  select * from read_parquet('s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*')
""")
Enter fullscreen mode Exit fullscreen mode
db.sql("""
select count(*) as count from places
""").show()
Enter fullscreen mode Exit fullscreen mode

count places

db.sql("""
    select * from places limit 5
""").show()
Enter fullscreen mode Exit fullscreen mode

limit 5

You can find the diagram of the POI data here. There are columns in the data that we need to preprocess.

db.sql("""
    select names, categories, confidence,brand,addresses from places limit 5
""").show()
Enter fullscreen mode Exit fullscreen mode

columns

For example, in order to find out which category it is in the categories column, we need to get the information from the data held in the "struct" type. You can review the document to learn about DuckDB data types.
For example, to extract which country you are located in the "Addresses" column:

db.sql("""
    select replace(json_extract(CAST(addresses AS JSON), '$[0].country')::varchar,'"','') as country from places limit 5
""").show()
Enter fullscreen mode Exit fullscreen mode

country names
After creating a column called “country” to extract country short names, we add the extracted data.

db.sql("""ALTER TABLE places ADD COLUMN country VARCHAR;
       update places set country = replace(json_extract(CAST(places.addresses AS JSON), '$[0].country')::varchar,'"','')

""")
Enter fullscreen mode Exit fullscreen mode

We run the following query to add the POI data in Turkey to a separate table and obtain the address, category, name, geometry information.

db.sql("""
       create or replace table turkey_places as (
              select
                     replace(json_extract(places.addresses::json,'$[0].locality'),'"','')::varchar as locality,
                     replace(json_extract(places.addresses::json,'$[0].region'),'"','')::varchar as region,
                     replace(json_extract(places.addresses::json,'$[0].postcode'),'"','')::varchar as postcode,
                     replace(json_extract(places.addresses::json,'$[0].freeform'),'"','')::varchar as freeform,

                     categories.main as categories_main,

                     replace(json_extract(places.names::json,'$.common[0].value'),'"','')::varchar as names,
                     confidence,
                     bbox,
                     st_transform(st_point(st_y(st_geomfromwkb(geometry)),st_x(st_geomfromwkb(geometry))),'EPSG:4326','EPSG:3857') as geom


              from places 
                     where country ='TR' 
       )


""")
Enter fullscreen mode Exit fullscreen mode

Created table:

db.sql("""
select * from turkey_places limit 5
""").df()
Enter fullscreen mode Exit fullscreen mode

turkey places

As an example, I will examine the POI data in Istanbul. I created two tables to obtain the POI points located within 500 m of the designated park points.

db.sql("""
    create or replace table park_ist as (
        select * from turkey_places where locality = 'İstanbul' and categories_main='park'   
    );

    create or replace table poi_ist as (
        select * from turkey_places where locality = 'İstanbul' and categories_main <> 'park'
    )

""")
Enter fullscreen mode Exit fullscreen mode

Number of POIs in Istanbul:

db.sql(
    """
select count(*) from poi_ist

"""
)
'''
count
  181959
'''
Enter fullscreen mode Exit fullscreen mode

Number of POIs designated as Parks in Istanbul:

db.sql(
    """
select count(*) from park_ist

""")

'''
count
  492
'''
Enter fullscreen mode Exit fullscreen mode

To query the POI points within 500 m of the points included in the park category:

df = db.sql("""
              select  poi_ist.region as poi_ist_region,poi_ist.freeform as poi_ist_freeform,poi_ist.categories_main as poi_ist_categori ,
              park_ist.categories_main as park_categori , park_ist.names as park_names, park_ist.freeform as park_ist_freeform,

              st_distance(poi_ist.geom,park_ist.geom) as dist,
              ST_AsText(poi_ist.geom) as geom,
              ST_AsText(park_ist.geom) as geom2

              from poi_ist, park_ist 

              where ST_DWithin(poi_ist.geom, park_ist.geom,500) 





       """).to_df()

gdf = gpd.GeoDataFrame(df,geometry= gpd.GeoSeries.from_wkt(df['geom']),crs="EPSG:3857")
gdf.to_file("export/poi.geojson",driver="GeoJSON")
Enter fullscreen mode Exit fullscreen mode

I converted the resulting table to pandas dataframe format. I then saved the table as geojson format using geopandas. I visualized it by category info with QGIS.

visualize

POI data can be examined in more detail. I wanted to tell you how to work with spatial data with DuckDB. I hope it was useful. Hope to see you in the next article.

Source:
Overture Maps Foundation
POI Data
DuckDB
DuckDB Spatial extension
DuckDB HTTPFS extension

Top comments (0)