Daily Leads and Partners | LeetCode | MSSQL

#mssql #database #leetcode #tutorial

The Problem

Consider a table DailySales with the following schema:

Column Name	Type
date_id	date
make_name	varchar
lead_id	int
partner_id	int

The table contains the date and the name of the product sold as well as the IDs of the lead and partner to whom it was sold. The name consists of only lowercase English letters. For each date_id and make_name, you need to find the number of distinct lead_id's and distinct partner_id's. The result table should be returned in any order.

Explanation

Consider the following sample DailySales table:

date_id	make_name	lead_id	partner_id
2020-12-8	toyota	0	1
2020-12-8	toyota	1	0
2020-12-8	toyota	1	2
2020-12-7	toyota	0	2
2020-12-7	toyota	0	1
2020-12-8	honda	1	2
2020-12-8	honda	2	1
2020-12-7	honda	0	1
2020-12-7	honda	1	2
2020-12-7	honda	2	1

For this input, the expected output would be:

date_id	make_name	unique_leads	unique_partners
2020-12-8	toyota	2	3
2020-12-7	toyota	1	2
2020-12-8	honda	2	2
2020-12-7	honda	3	2

Here, each row represents a distinct date and make_name, and the count of unique leads and partners for each combination.

The Solution

Let's discuss two different approaches to solve this problem, highlight their strengths and weaknesses, and explain their underlying structures.

Using COUNT(DISTINCT)

This approach directly counts the distinct lead_id and partner_id for each combination of date_id and make_name.

SELECT
    date_id,
    make_name,
    COUNT(DISTINCT lead_id) [unique_leads],
    COUNT(DISTINCT partner_id) [unique_partners]
FROM DailySales
GROUP BY
    date_id,
    make_name

This query is fairly straightforward and easy to understand. It works by grouping the data by date_id and make_name, and then it counts the distinct leads and partners within each group. However, COUNT(DISTINCT) can be slow on large data sets. This query runtime is 1073ms, beating 22.3% of other submissions on LeetCode.

Using DENSE_RANK()

This approach first ranks lead_id and partner_id within each group defined by date_id and make_name, and then it selects the maximum rank for each group as the count of distinct leads or partners.

WITH ranks AS (
    SELECT
        date_id,
        make_name,
        DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY lead_id) [lead_rank],
        DENSE_RANK() OVER (PARTITION BY date_id, make_name ORDER BY partner_id) [partner_rank]
    FROM DailySales
)
SELECT 
    date_id,
    make_name,
    (SELECT MAX(lead_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_leads],
    (SELECT MAX(partner_rank) FROM ranks r WHERE r.date_id = ds.date_id AND r.make_name = ds.make_name) [unique_partners]
FROM DailySales ds
GROUP BY
    date_id,
    make_name

This approach, while more complex, eliminates the need for COUNT(DISTINCT), which can improve performance in some situations. However, the inclusion of the subqueries in the SELECT clause may cause performance degradation when handling large datasets. This query runtime is 2467ms, beating 5.17% of other submissions on LeetCode.

Conclusion

Both methods solve the problem, but their performances can vary based on the data set size and distribution. The COUNT(DISTINCT) method is simpler and performed better on LeetCode's platform, making it the preferred solution in this case. However, the performance may vary in real-world RDBMS depending on several factors such as data distribution and the database engine's optimizations.

You can find the original problem at LeetCode.

For more insightful solutions and tech-related content, feel free to connect with me on my Beacons page.

👉 Check out all the links on my beacons.ai page 👈