(This content is based on a blog post originally published on Crate.io.)
When signing up for CrateDB Cloud, you can import the NYC taxi dataset from the console with only a couple of clicks. We like this dataset because it includes information from tens of millions of taxi trips, with multiple data types—including time-series data.
In a previous post, I showed you how to load the NYC taxi dataset into CrateDB Cloud. Now, I will give you some SQL queries you can try. This post will be especially useful if you are (like me) still a beginner in the world of SQL. 🤓
Let's get into it.
Assuming that you already have the full dataset imported, you now have about 40 million data records in your cluster. Let's start by taking a look at our table:
SELECT * FROM nyc_taxi LIMIT 10;
If you scroll to the right, you will see all the information in your table: fare amounts, passenger numbers, tips and tolls paid...
To play around with your data, try the following queries:
1. How many taxi rides were there on each day of July 2019?

The SQL query:

SELECT date_trunc('day', pickup_datetime) AS day,
       COUNT(*) AS rides
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY 1
ORDER BY 1
LIMIT 100;
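If you want to try the day-bucketing idea locally, here is a small sketch using Python's built-in sqlite3 module. SQLite has no date_trunc(), so its date() function stands in for date_trunc('day', ...); the table and timestamps are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (pickup_datetime TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [("2019-07-01T08:15:00",), ("2019-07-01T19:40:00",),
     ("2019-07-02T07:05:00",)],
)

# date() truncates a timestamp to its day, similar to date_trunc('day', ...)
rows = conn.execute("""
    SELECT date(pickup_datetime) AS day, COUNT(*) AS rides
    FROM rides
    GROUP BY day
    ORDER BY day
""").fetchall()
```

Here, rows contains one (day, ride count) pair per day, just like the CrateDB query above.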
2. For each day of July 2019, how many rides were there, how much distance was covered in total, and what was the average distance per ride?

The SQL query:

SELECT date_trunc('day', pickup_datetime) AS day,
       COUNT(*) AS rides,
       SUM(trip_distance) AS total_distance,
       SUM(trip_distance) / COUNT(*) AS average_per_ride
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY day
ORDER BY 1
LIMIT 100;
3. Which hours of the day saw the most rides in July 2019?

The SQL query:

SELECT EXTRACT(HOUR FROM pickup_datetime) AS hour,
       COUNT(*) AS rides,
       SUM(trip_distance) AS total_distance
FROM nyc_taxi
WHERE pickup_datetime BETWEEN '2019-07-01T00:00:00' AND '2019-07-31T23:59:59'
GROUP BY hour
ORDER BY COUNT(*) DESC
LIMIT 100;
4. How many distinct pickup locations appear in the dataset?

With this one, things start getting a little more interesting.
For this query, you would normally do something like this:
SELECT COUNT(DISTINCT(pulocationid)) FROM nyc_taxi;
However, the amount of data is too large. If you try this, CrateDB Cloud will immediately trip a circuit breaker: answering this query exactly would require a full scan of the table, which is unreasonably expensive at this scale.
The solution is to use a function called HYPERLOGLOG_DISTINCT():
SELECT HYPERLOGLOG_DISTINCT(pulocationid) FROM nyc_taxi;
This function uses the HyperLogLog algorithm to estimate the number of distinct values, and it stays fast even on very large datasets. You can use HYPERLOGLOG_DISTINCT() just like you would use COUNT(DISTINCT ...) in your queries.
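To build some intuition for how a distinct count can be estimated without scanning and remembering every value, here is a toy HyperLogLog sketch in Python. The register count and hash choice are illustrative, and this is in no way CrateDB's actual implementation:

```python
import hashlib
import math

def hll_estimate(items, b=10):
    """Toy HyperLogLog: estimate the number of distinct items.

    b index bits -> m = 2**b registers; typical error ~1.04/sqrt(m).
    """
    m = 1 << b
    registers = [0] * m
    for item in items:
        # Derive a stable 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)   # low b bits choose a register
        rest = h >> b       # remaining 64 - b bits
        # rank = position of the first 1-bit (leading zeros + 1)
        rank = (64 - b) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: fall back to linear counting
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw

estimate = hll_estimate(range(100_000))
```

The sketch needs only m small registers regardless of input size, which is why this approach scales to tens of millions of rows; feeding in duplicates does not change the registers, so repeated values are ignored just like in COUNT(DISTINCT ...).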
5. Which location pairs in NYC grossed the most over the entire year? And for each pair, how much more did it gross than the next best one?
To finish, we get truly advanced: this is an example of how to use window functions.
In this case, the query uses the LAG() window function:

SELECT pulocationid,
       dolocationid,
       cnt,
       total,
       next_highest_grossing,
       total - next_highest_grossing AS leads_by
FROM (
    SELECT pulocationid,
           dolocationid,
           COUNT(*) AS cnt,
           SUM(total_amount) AS total,
           LAG(SUM(total_amount), 1) OVER (ORDER BY SUM(total_amount)) AS next_highest_grossing
    FROM nyc_taxi
    WHERE pickup_datetime BETWEEN '2019-01-01T00:00:00' AND '2019-12-31T23:59:59'
    GROUP BY pulocationid, dolocationid
    ORDER BY total DESC
    LIMIT 100
) AS sub;
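If you want to experiment with LAG() without a CrateDB cluster, here is a small sketch using Python's built-in sqlite3 module (SQLite ships window functions since version 3.25). The table and values are made up, and for simplicity the grouped totals are computed in a subquery before LAG() is applied:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup TEXT, dropoff TEXT, total REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("A", "B", 10.0), ("A", "B", 15.0),
     ("B", "C", 8.0),
     ("C", "D", 30.0), ("C", "D", 12.0)],
)

# LAG(total) OVER (ORDER BY total) looks at the previous row in ascending
# order of totals, i.e. the next-lower-grossing pair; subtracting it gives
# each pair's lead over that pair.
rows = conn.execute("""
    SELECT pickup, dropoff, total,
           total - LAG(total, 1) OVER (ORDER BY total) AS leads_by
    FROM (
        SELECT pickup, dropoff, SUM(total) AS total
        FROM trips
        GROUP BY pickup, dropoff
    )
    ORDER BY total DESC
""").fetchall()
```

The lowest-grossing pair has no predecessor, so its leads_by comes back as NULL (None in Python), just as it would for the first row of the CrateDB query.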
Hope this was useful! See you next time 👋