Learning SQL

#sql #dataengineering

Last week I wrote some code in SQL to query the New York taxi trip database. It was part of the homework for the DataTalksClub data engineering zoomcamp, so I had to wait until the homeworks were scored before I could write about it.

Typically for me, I jumped the gun and did the homework before I watched the SQL refresher video. I ended up asking ChatGPT a lot of questions. I knew I had to do a join on the huge data table and a smaller one that had the boroughs and districts spelled out. ChatGPT didn't give me the best answer. It said to add columns to the huge table and insert all the boroughs and districts into it. I got a warning from postgres, something like "do you really want to modify your table?" I said "yes" and added all the boroughs and districts, then queried the table.

The answer is simpler. You perform a join and run the query in the same command. The table isn't modified, and you get your answer. The question was, which boroughs had more than 50,000 green taxi trips for the month of September, 2019? Here's what it looks like:

SELECT zpu."Borough" as pickup_borough, count(*) as count_per_value
FROM green_taxi_data AS trips
JOIN zones AS zpu
ON trips."PULocationID" = zpu."LocationID"
GROUP BY zpu."Borough"
ORDER BY count_per_value DESC;

This prints the output in two columns, pickup_borough and count_per_value. Since the main data table doesn't have Boroughs, just a location id, I had to do a JOIN on the location id. The "zones" table has all the Boroughs and Districts for each location id. Once I did that, I could GROUP BY zpu."Borough" (pickup_borough didn't work) and ORDER BY count, in reverse order. The three boroughs with the most pickups were Manhattan, Queens, and Brooklyn. (There are five boroughs in New York City. The other two are Staten Island and the Bronx.)

"pickup_borough"    "count_per_value"
"Manhattan"         154685
"Queens"            135746
"Brooklyn"          124771
"Bronx"             32712
"Unknown"           944
"Staten Island"     199
"EWR"               6

The second join was slightly more complicated. I had to find the dropoff location for the trip originating in Astoria that had the highest tip amount. Here's what I did:

SELECT
  lpep_pickup_datetime,
  lpep_dropoff_datetime,
  tip_amount,
  zpu."Zone" AS "pickup_zone",
  zdo."Zone" AS "dropoff_loc"
FROM
  green_taxi_data t
JOIN zones zpu ON t."PULocationID" = zpu."LocationID"
JOIN zones zdo ON t."DOLocationID" = zdo."LocationID"
WHERE
  zpu."Zone" = 'Astoria'
ORDER BY tip_amount desc
LIMIT 1;

Here, after joining the tables, I asked postgres to find all the trips where the pickup zone was Astoria, sorted by tip amount in descending order, and I limited the output to one line. The answer was the JFK Airport.

"lpep_pickup_datetime"  "lpep_dropoff_datetime" "tip_amount"    "pickup_zone"   "dropoff_loc"
"2019-09-08 18:10:40"   "2019-09-08 18:44:42"   62.31           "Astoria"       "JFK Airport"

In SQL, you can give a table an alias on a line after you use the alias. This is because the commands are thoroughly parsed before they're executed. You can even leave out the AS command, which I did in the second example. SQL is a declarative language, meaning you tell it what to do, not how to do it.

DEV Community

Learning SQL

Top comments (0)