Danny was scrolling through his Instagram feed when something really caught his eye — “80s Retro Styling 🎸 and Pizza 🍕 Is The Future!”
Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire — so he had one more genius idea to combine with it — he was going to Uberize it — and so Pizza Runner was launched!
Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.
- customer_orders — Customers’ pizza orders with 1 row each for individual pizza with topping exclusions and extras, and order time.
- runner_orders — Orders assigned to runners documenting the pickup time, distance and duration from Pizza Runner HQ to customer, and cancellation remark. runners — Runner IDs and registration date
- pizza_names — Pizza IDs and name
- pizza_recipes — Pizza IDs and topping names
- pizza_toppings — Topping IDs and name
This case study has LOTS of questions — they are broken up by area of focus including:
- A. Pizza Metrics
- B. Runner and Customer Experience
- C. Ingredient Optimisation
- D. Pricing and Ratings
- E. Bonus DML Challenges (DML = Data Manipulation Language)
Before I start with the solutions, I investigate the data and found that there are some cleaning and transformation to do, specifically on the
nullvalues and data types in the
nullvalues and data types in the
- Alter data type in
Firstly, to clean up
extras in the
customer_orders — we create TEMP TABLE
#customer_orders and use CASE WHEN.
SELECT order_id, customer_id, pizza_id, CASE WHEN exclusions IS null OR exclusions LIKE 'null' THEN ' ' ELSE exclusions END AS exclusions, CASE WHEN extras IS NULL or extras LIKE 'null' THEN ' ' ELSE extras END AS extras, order_time INTO #customer_orders -- create TEMP TABLE FROM customer_orders;
Then, we clean the
runner_orders table with CASE WHEN and TRIM and create TEMP TABLE
- pickup_time — Remove nulls and replace with ‘ ‘
- distance — Remove ‘km’ and nulls
- duration — Remove ‘minutes’ and nulls
- cancellation — Remove NULL and null and replace with ‘ ‘
SELECT order_id, runner_id, CASE WHEN pickup_time LIKE 'null' THEN ' ' ELSE pickup_time END AS pickup_time, CASE WHEN distance LIKE 'null' THEN ' ' WHEN distance LIKE '%km' THEN TRIM('km' from distance) ELSE distance END AS distance, CASE WHEN duration LIKE 'null' THEN ' ' WHEN duration LIKE '%mins' THEN TRIM('mins' from duration) WHEN duration LIKE '%minute' THEN TRIM('minute' from duration) WHEN duration LIKE '%minutes' THEN TRIM('minutes' from duration) ELSE duration END AS duration, CASE WHEN cancellation IS NULL or cancellation LIKE 'null' THEN '' ELSE cancellation END AS cancellation INTO #runner_orders FROM runner_orders;
Then, we alter the date according to its correct data type.
- pickup_time to DATETIME type
- distance to FLOAT type
- duration to INT type
ALTER TABLE #runner_orders ALTER COLUMN pickup_time DATETIME, ALTER COLUMN distance FLOAT, ALTER COLUMN duration INT;
Now that the data has been cleaned and transformed, let’s move on solving the questions! 😉
How many pizzas were ordered?
SELECT COUNT(*) AS pizza_order_count FROM #customer_orders;
- Total pizzas ordered are 14.
- How many unique customer orders were made?
SELECT COUNT(DISTINCT order_id) AS unique_order_count FROM #customer_orders;
- There are 10 unique customer orders made.
- How many successful orders were delivered by each runner?
SELECT runner_id, COUNT(order_id) AS successful_orders FROM #runner_orders WHERE distance != 0 GROUP BY runner_id;
- Runner 1 has 4 successful delivered orders.
- Runner 2 has 3 successful delivered orders.
- Runner 3 has 1 successful delivered order.
- How many of each type of pizza was delivered?
SELECT p.pizza_name, COUNT(c.pizza_id) AS delivered_pizza_count FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id JOIN pizza_names AS p ON c.pizza_id = p.pizza_id WHERE r.distance != 0 GROUP BY p.pizza_name;
- There are 9 delivered Meatlovers pizzas.
- There are 3 delivered Vegetarian pizzas.
- How many Vegetarian and Meatlovers were ordered by each customer?
SELECT c.customer_id, p.pizza_name, COUNT(p.pizza_name) AS order_count FROM #customer_orders AS c JOIN pizza_names AS p ON c.pizza_id= p.pizza_id GROUP BY c.customer_id, p.pizza_name ORDER BY c.customer_id;
- Customer 101 ordered 2 Meatlovers pizzas and 1 Vegetarian pizza.
- Customer 102 ordered 2 Meatlovers pizzas and 2 Vegetarian pizzas.
- Customer 103 ordered 3 Meatlovers pizzas and 1 Vegetarian pizza.
- Customer 104 ordered 1 Meatlovers pizza.
- Customer 105 ordered 1 Vegetarian pizza.
- What was the maximum number of pizzas delivered in a single order?
WITH pizza_count_cte AS ( SELECT c.order_id, COUNT(c.pizza_id) AS pizza_per_order FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.distance != 0 GROUP BY c.order_id )
SELECT MAX(pizza_per_order) AS pizza_count FROM pizza_count_cte;
- Maximum number of pizza delivered in a single order is 3 pizzas.
- For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
SELECT c.customer_id, SUM(CASE WHEN c.exclusions <> ' ' OR c.extras <> ' ' THEN 1 ELSE 0 END) AS at_least_1_change, SUM(CASE WHEN c.exclusions = ' ' AND c.extras = ' ' THEN 1 ELSE 0 END) AS no_change FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.distance != 0 GROUP BY c.customer_id ORDER BY c.customer_id;
- Customer 101 and 102 likes his/her pizzas per the original recipe.
- Customer 103, 104 and 105 have their own preference for pizza topping and requested at least 1 change (extra or exclusion topping) on their pizza.
- How many pizzas were delivered that had both exclusions and extras?
SELECT SUM(CASE WHEN exclusions IS NOT NULL AND extras IS NOT NULL THEN 1 ELSE 0 END) AS pizza_count_w_exclusions_extras FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.distance >= 1 AND exclusions <> ' ' AND extras <> ' ';
- Only 1 pizza delivered that had both extra and exclusion topping. That’s one fussy customer!
- What was the total volume of pizzas ordered for each hour of the day?
SELECT DATEPART(HOUR, [order_time]) AS hour_of_day, COUNT(order_id) AS pizza_count FROM #customer_orders GROUP BY DATEPART(HOUR, [order_time]);
- Highest volume of pizza ordered is at 13 (1:00 pm), 18 (6:00 pm) and 21 (9:00 pm).
- Lowest volume of pizza ordered is at 11 (11:00 am), 19 (7:00 pm) and 23 (11:00 pm).
- What was the volume of orders for each day of the week?
SELECT FORMAT(DATEADD(DAY, 2, order_time),'dddd') AS day_of_week, -- add 2 to adjust 1st day of the week as Monday COUNT(order_id) AS total_pizzas_ordered FROM #customer_orders GROUP BY FORMAT(DATEADD(DAY, 2, order_time),'dddd');
- There are 5 pizzas ordered on Friday and Monday.
- There are 3 pizzas ordered on Saturday.
- There is 1 pizza ordered on Sunday.
How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)
SELECT DATEPART(WEEK, registration_date) AS registration_week, COUNT(runner_id) AS runner_signup FROM runners GROUP BY DATEPART(WEEK, registration_date);
- On Week 1 of Jan 2021, 2 new runners signed up.
- On Week 2 and 3 of Jan 2021, 1 new runner signed up.
- What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order?
WITH time_taken_cte AS ( SELECT c.order_id, c.order_time, r.pickup_time, DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS pickup_minutes FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.distance != 0 GROUP BY c.order_id, c.order_time, r.pickup_time )
SELECT AVG(pickup_minutes) AS avg_pickup_minutes FROM time_taken_cte WHERE pickup_minutes > 1;
- The average time taken in minutes by runners to arrive at Pizza Runner HQ to pick up the order is 15 minutes.
- Is there any relationship between the number of pizzas and how long the order takes to prepare?
WITH prep_time_cte AS ( SELECT c.order_id, COUNT(c.order_id) AS pizza_order, c.order_time, r.pickup_time, DATEDIFF(MINUTE, c.order_time, r.pickup_time) AS prep_time_minutes FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.distance != 0 GROUP BY c.order_id, c.order_time, r.pickup_time )
SELECT pizza_order, AVG(prep_time_minutes) AS avg_prep_time_minutes FROM prep_time_cte WHERE prep_time_minutes > 1 GROUP BY pizza_order;
- On average, a single pizza order takes 12 minutes to prepare.
- An order with 3 pizzas takes 30 minutes at an average of 10 minutes per pizza.
- It takes 16 minutes to prepare an order with 2 pizzas which is 8 minutes per pizza — making 2 pizzas in a single order the ultimate efficiency rate.
- What was the average distance travelled for each customer?
SELECT c.customer_id, AVG(r.distance) AS avg_distance FROM #customer_orders AS c JOIN #runner_orders AS r ON c.order_id = r.order_id WHERE r.duration != 0 GROUP BY c.customer_id;
(Assuming that distance is calculated from Pizza Runner HQ to customer’s place)
- Customer 104 stays the nearest to Pizza Runner HQ at average distance of 10km, whereas Customer 105 stays the furthest at 25km.
- What was the difference between the longest and shortest delivery times for all orders?
Firstly, let’s see all the durations for the orders.
SELECT order_id, duration FROM #runner_orders WHERE duration not like ' ';
Then, we find the difference by deducting the shortest (MIN) from the longest (MAX) delivery times.
SELECT MAX(duration::NUMERIC) - MIN(duration::NUMERIC) AS delivery_time_difference FROM #runner_orders WHERE duration not like '% %'
- The difference between longest (40 minutes) and shortest (10 minutes) delivery time for all orders is 30 minutes.
- What was the average speed for each runner for each delivery and do you notice any trend for these values?
SELECT r.runner_id, c.customer_id, c.order_id, COUNT(c.order_id) AS pizza_count, r.distance, (r.duration / 60) AS duration_hr , ROUND((r.distance/r.duration * 60), 2) AS avg_speed FROM #runner_orders AS r JOIN #customer_orders AS c ON r.order_id = c.order_id WHERE distance != 0 GROUP BY r.runner_id, c.customer_id, c.order_id, r.distance, r.duration ORDER BY c.order_id;
(Average speed = Distance in km / Duration in hour)
- Runner 1’s average speed runs from 37.5km/h to 60km/h.
- Runner 2’s average speed runs from 35.1km/h to 93.6km/h. Danny should investigate Runner 2 as the average speed has a 300% fluctuation rate!
- Runner 3’s average speed is 40km/h
- What is the successful delivery percentage for each runner?
SELECT runner_id, ROUND(100 * SUM (CASE WHEN distance = 0 THEN 0 ELSE 1 END) / COUNT(*), 0) AS success_perc FROM #runner_orders GROUP BY runner_id;
- Runner 1 has 100% successful delivery.
- Runner 2 has 75% successful delivery.
Runner 3 has 50% successful delivery
(It’s not right to attribute successful delivery to runners as order cancellations are out of the runner’s control.)
I will continue with Part A, B and C soon!
What are the standard ingredients for each pizza?
What was the most commonly added extra?
What was the most common exclusion?
Generate an order item for each record in the customers_orders table in the format of one of the following:
Meat Lovers - Extra Bacon
Meat Lovers - Exclude Cheese, Bacon - Extra Mushroom, Peppers
- Generate an alphabetically ordered comma separated ingredient list for each pizza order from the customer_orders table and add a 2x in front of any relevant ingredients
"Meat Lovers: 2xBacon, Beef, ... , Salami"
- What is the total quantity of each ingredient used in all delivered pizzas sorted by most frequent first?
If a Meat Lovers pizza costs $12 and Vegetarian costs $10 and there were no charges for changes — how much money has Pizza Runner made so far if there are no delivery fees?
What if there was an additional $1 charge for any pizza extras?
- Add cheese is $1 extra
The Pizza Runner team now wants to add an additional ratings system that allows customers to rate their runner, how would you design an additional table for this new dataset — generate a schema for this new table and insert your own data for ratings for each successful customer order between 1 to 5.
Using your newly generated table — can you join all of the information together to form a table which has the following information for successful deliveries?
- Time between order and pickup
- Delivery duration
- Average speed
- Total number of pizzas
- If a Meat Lovers pizza was $12 and Vegetarian $10 fixed prices with no cost for extras and each runner is paid $0.30 per kilometre travelled — how much money does Pizza Runner have left over after these deliveries?
E. Bonus Questions
If Danny wants to expand his range of pizzas — how would this impact the existing data design? Write an INSERT statement to demonstrate what would happen if a new Supreme pizza with all the toppings was added to the Pizza Runner menu?