Ola is one of the most extensively used ride-hailing apps in India, which means we work with petabytes of anonymized data related to the movement of vehicles. Performing data science operations on these datasets gives numerous insights into road conditions, traffic congestion and other relevant factors related to vehicle movement at scale.
Data Scientists at Ola Campus Pune work on large datasets that usually do not fit into the memory (RAM) of Virtual Machines or Local PCs.
This usually requires distributed computation tools such as Spark, Hadoop, Flink and Kafka. For occasional experimentation, however, Pandas, GeoPandas and Dask are some of the commonly used tools.
With Dask, we can run any operation, calculation or function in parallel.
This blog will cover working with Mobility data and its visualization. We will cover -
- What is mobility data, and where to find open source mobility data?
- What is Dask, and why is it used?
- Hands-on covering-
- Creating Dask clusters
- Pre-processing of mobility data
- Visualizing cleaned mobility data
What is Mobility data?
All mobile vehicles at Ola are connected to GPS. This is how users know which cabs are closest to them in real-time, share their live location with their peers and loved ones, and plan their travel.
GPS data contains latitude and longitude attributes, which are coordinates for a vehicle's location on the Earth at a given time. This data is also called telemetry/telematics data or mobility data.
You can find out how this telemetry data is beneficial in this blog.
What is Dask?
Dask is an open-source library which enables parallel computing on any computer, cloud VM or local PC. It creates a distributed parallel-processing cluster and builds its own Dask DataFrame from input data such as Pandas dataframes. It then computes any operation, calculation or function in parallel.
It can be deployed and operated both on local machines and on cloud clusters.
Why Dask?
Telemetry data can run into hundreds of gigabytes; it cannot fit into a single Pandas dataframe, and processing it serially would take days. Dask enables parallel processing and helps us work with the data faster and more efficiently, making effective use of the resources of a VM or local PC.
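To get a feel for the workflow, here is a minimal sketch: Dask exposes the same dataframe API as Pandas, but operations are lazy and only execute, in parallel, when .compute() is called. The file pattern and the trip_distance column below are illustrative placeholders, not part of the Ola pipeline.

import dask.dataframe as dd

# Lazily point Dask at many CSV files as one logical dataframe;
# nothing is loaded into memory yet.
df = dd.read_csv("trip_data_*.csv")

# This builds a task graph instead of computing immediately.
mean_distance = df["trip_distance"].mean()

# .compute() executes the graph in parallel across the available workers.
print(mean_distance.compute())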
Hands-on
All of the telemetry data that we use is anonymized. For demonstration purposes, we will be using an open-source New York City taxi trip dataset.
In this blog post, we will try to visualize the pickup and drop-off details of taxis.
Imports
These are the necessary imports required for processing the data
import dask.dataframe as dd
LocalCluster, progress and Client also need to be imported from dask.distributed.
from dask.distributed import LocalCluster, progress, Client
Creating a local cluster
As we will be using Dask, there are some parameters to specify so that the parallel processing runs smoothly. One such parameter is the number of workers, set via n_workers in the dask.distributed library. Generally, each worker is given one thread (threads_per_worker), but this can be tuned to the task load. Another is memory_limit, a RAM threshold for each worker that prevents any single worker from being overloaded. For example, if your virtual machine has 32 GB of RAM and you want to use only 24 GB of it to avoid overloading the machine, you can start 24 workers with a memory_limit of 1 GB each.
Next, all the information about the local cluster is passed to the cluster configurator using kwargs, a dictionary of parameters and values.
kwargs = {'n_workers':24,'threads_per_worker':1,'memory_limit':'1GB'}
cluster = LocalCluster(dashboard_address='127.0.0.1:8085', **kwargs)
client = Client(cluster)
The dashboard address can be modified as per need. It will show the live status of the parallel processing.
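If you are running this in a notebook or script, the dashboard URL can also be read directly from the client object (an optional check, not part of the original setup):

# prints the URL of the live Dask dashboard for this client
print(client.dashboard_link)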
Let’s work on the data.
With the local Dask cluster created, it's time to read the dataset into a Dask DataFrame. Dask automatically splits the large input data into a number of partitions so that it can process the input in parts at later stages.
# read one or more of the downloaded CSV files
df = dd.read_csv(path)
To check how many partitions were formed, you can use:
df
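The dataframe's repr lists its columns and the number of partitions; the partition count can also be printed directly:

# number of partitions Dask created for the input CSV(s)
print(df.npartitions)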
To confirm the CSV is imported, check the first five records of the first partition with the code:
df.partitions[0].head()
Filtering the dataset
As only the pickup details are being visualized, just the columns listed below are required:
- medallion
- pickup_datetime
- pickup_longitude
- pickup_latitude
df2 = df[['medallion','pickup_datetime','pickup_longitude','pickup_latitude']]
The data is then clipped to a start/end date range:
from datetime import datetime as dt
start = "2013-01-01 15:11:48"
end = "2013-01-07 23:59:59"
df3 = df2[(df2['pickup_datetime'] > start) & (df2['pickup_datetime'] < end)]
The start and end date can be chosen as per need.
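Note that pickup_datetime comes out of the CSV as plain strings, so the comparison above is lexicographic; that works for this ISO-like timestamp format, but if you prefer to filter on real timestamps, here is an optional sketch (not part of the original flow):

# parse the string column into proper timestamps (lazy, like other Dask operations)
df2['pickup_datetime'] = dd.to_datetime(df2['pickup_datetime'])
df3 = df2[(df2['pickup_datetime'] > start) & (df2['pickup_datetime'] < end)]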
Convert the Dask DataFrame to CSV
After all the pre-processing of the data, a filtered CSV file is written as output.
n = 0
for part in df3.partitions:
    n += 1
    part.to_csv(f'trip_data_partition-{n}.csv', single_file=True)
Read all the CSV files and merge them into one:
full_df = dd.read_csv(path_to_folder + "/trip_data_partition-*.csv")
full_df.to_csv(path_to_folder + '/trip_data_1_filtered.csv', single_file=True)
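Writes like this can take a while on large inputs. Since progress was imported earlier, one optional way to keep an eye on the write (a sketch, using the same client created above) is to defer the computation and submit it to the cluster:

# build the write tasks without running them, then submit them to the cluster
delayed_write = full_df.to_csv(path_to_folder + '/trip_data_1_filtered.csv',
                               single_file=True, compute=False)
futures = client.compute(delayed_write)
progress(futures)  # live progress bar for the running tasks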
Pre-Visualization
Before the data can be visualized, it should be converted to the desired format. Mapbox requires input in .mbtiles format.
In order to convert the CSV to .mbtiles, follow these steps (a scripted sketch follows the list):
- Convert the CSV to .vrt format using GDAL/ogr2ogr
- Convert .vrt to .geojson using GDAL/ogr2ogr
- Convert .geojson to .mbtiles using tippecanoe
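Here is a rough Python sketch of that pipeline, assuming ogr2ogr (from GDAL) and tippecanoe are installed and on the PATH, and that the filtered CSV contains the pickup_longitude/pickup_latitude columns used above. The .vrt file is a small XML wrapper that tells OGR which columns hold the point coordinates; file names and tippecanoe flags are illustrative.

import subprocess

csv_path = "trip_data_1_filtered.csv"

# OGR VRT wrapper: tells GDAL which CSV columns hold the point coordinates
vrt = f"""<OGRVRTDataSource>
  <OGRVRTLayer name="trip_data_1_filtered">
    <SrcDataSource>{csv_path}</SrcDataSource>
    <GeometryType>wkbPoint</GeometryType>
    <LayerSRS>EPSG:4326</LayerSRS>
    <GeometryField encoding="PointFromColumns" x="pickup_longitude" y="pickup_latitude"/>
  </OGRVRTLayer>
</OGRVRTDataSource>"""

with open("pickups.vrt", "w") as f:
    f.write(vrt)

# .vrt -> .geojson with ogr2ogr
subprocess.run(["ogr2ogr", "-f", "GeoJSON", "pickups.geojson", "pickups.vrt"], check=True)

# .geojson -> .mbtiles with tippecanoe (-zg picks zoom levels automatically)
subprocess.run(["tippecanoe", "-o", "pickups.mbtiles", "-zg",
                "--drop-densest-as-needed", "pickups.geojson"], check=True)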
Visualization
Once the desired format is generated, it can be visualized using Mapbox Studio or QGIS. Here is the desired output:
Conclusion
Data scientists work on phenomenally large datasets, and Dask is a handy tool for exploration within the confines of a single cloud VM or their local PCs. Location data visualization is an essential part of deciding further algorithm development and roadmap for projects. This lays the foundation for data engineering and science to work at scale, with petabytes of data.
We look forward to sharing more of our workflows in the coming days.
P.S. Thanks to the contributors of Dask and the data providers. A big thanks to Sidharth Subramaniam and Yash Chavan, our experts in data science, for helping us throughout this article.
If you have feedback or found this blog post interesting, do Connect with Us!