I may be a few years late to this, but Google's BigQuery is freakin' awesome. The data warehouse is the home to a bunch of really neat datasets which users can traverse shockingly quickly (and for free, up to the first terabyte). In the past few weeks, I've explored the GitHub dataset to find popular Python packages and the frequency of cussing programmers.
Today, I'm going to recreate a blogpost I wrote last year about cab rides in New York City. When I realized the Taxi & Limousine Commission published their ride data, including starting and ending geographic coordinates, I figured there were some interesting patterns waiting to be discovered.
In particular, I wondered how often people took cab rides for ludicrously short distances (less than a tenth of a mile) and how those rides might differ from more conventional rides. I mostly expected geographic and timing discrepancies: surely only drunken late-night patrons or residents of wealthy neighborhoods could justify taking a cab a few hundred feet, right?
While there wasn't much evidence of that hypothesis, I did find something else that was peculiar. Let's split every yellow cab ride since 2009 into those that were longer than a tenth of a mile ("normal") and those that weren't ("micro-rides"), and also into those that have complete geolocational info (pick-up and drop-off both present) and those that don't.
Within those four cohorts, we'll count how often the ride is paid for in cash, how often the fare is $50 or more, and how often that fare is negotiated.
(For the record, this command queried over a billion rows in under six seconds.)
SELECT CASE WHEN trip_distance > 0.1 THEN "normal" ELSE "micro-ride" END AS type, CASE WHEN pickup_longitude = 0 OR pickup_latitude = 0 OR dropoff_longitude = 0 OR dropoff_latitude = 0 THEN 'yes' ELSE 'no' END AS missing_geo, COUNT(*) AS count, STRING(ROUND(SUM(CASE WHEN payment_type = 'CSH' THEN 1 ELSE 0 END)/count*100, 0)) + '%' AS cash, STRING(ROUND(SUM(CASE WHEN total_amount >= 50 THEN 1 ELSE 0 END)/count*100, 2)) + '%' AS fifty_plus, STRING(ROUND(SUM(CASE WHEN rate_code = '5' THEN 1 ELSE 0 END)/count*100,2)) + '%' AS negotiated FROM [nyc-tlc:yellow.trips] GROUP BY type, missing_geo ORDER BY type DESC, missing_geo
Some of these trends are predictable. There are about 100 normal rides for every micro-ride, in which cash is used about half the time. But some of them are a bit odd. Why are there so many instances of micro-rides costing more than $50 when they by definition should be much cheaper? And why is that trend exaggerated for micro-rides missing their geo-locational data? And why does missing geo data increase negotiation frequency so much more for micro-rides than for normal rides?
There might be reasonable explanations for all this. But at first sight, I convinced myself that something sinister was behind this data: In the dark and complex world of New York City, there are a certain class of trips where either the driver or the passenger does not want their exact location logged, and these are the same rides that are paid for in cash at a negotiated price. Feel free to fill in the details with tales of crooked cabbies, brick-shaped packages, and loaded firearms.
Or maybe the rows with missing coordinates or unlikely distances have just been corrupted,in which case our question becomes: is the data dirty in a conventional sense or in a criminal manner?