Consider a table
DailySales with the following schema:
The table contains the date and the name of the product sold as well as the IDs of the lead and partner to whom it was sold. The name consists of only lowercase English letters. For each
make_name, you need to find the number of distinct
lead_id's and distinct
partner_id's. The result table should be returned in any order.
Consider the following sample
For this input, the expected output would be:
Here, each row represents a distinct date and make_name, and the count of unique leads and partners for each combination.
Let's discuss two different approaches to solve this problem, highlight their strengths and weaknesses, and explain their underlying structures.
This approach directly counts the distinct
partner_id for each combination of
SELECT date_id, make_name, COUNT(DISTINCT lead_id) [unique_leads], COUNT(DISTINCT partner_id) [unique_partners] FROM DailySales GROUP BY date_id, make_name
This query is fairly straightforward and easy to understand. It works by grouping the data by
make_name, and then it counts the distinct leads and partners within each group. However,
COUNT(DISTINCT) can be slow on large data sets. This query runtime is 1073ms, beating 22.3% of other submissions on LeetCode.
This approach first ranks
partner_id within each group defined by
make_name, and then it selects the maximum rank for each group as the count of distinct leads or partners.
WITH ranks AS ( SELECT date_id, make_name, DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY lead_id) [lead_rank], DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY partner_id) [partner_rank] FROM DailySales ) SELECT date_id, make_name, (SELECT MAX(lead_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_leads], (SELECT MAX(partner_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_partners] FROM DailySales ds GROUP BY date_id, make_name
This approach, while more complex, eliminates the need for
COUNT(DISTINCT), which can improve performance in some situations. However, the inclusion of the subqueries in the SELECT clause may cause performance degradation when handling large datasets. This query runtime is 2467ms, beating 5.17% of other submissions on LeetCode.
Both methods solve the problem, but their performances can vary based on the data set size and distribution. The
COUNT(DISTINCT) method is simpler and performed better on LeetCode's platform, making it the preferred solution in this case. However, the performance may vary in real-world RDBMS depending on several factors such as data distribution and the database engine's optimizations.
You can find the original problem at LeetCode.
For more insightful solutions and tech-related content, feel free to connect with me on my Beacons page.