Lighthouse LabsW1D4 - Combining DataInstructor: Socorro E. Dominguez-Vidana |
![]() |
![]() |
Name: Socorro Dominguez-Vidana Work: University of Wisconsin-Madison Data Scientist Hobbies: Kung Fu, traveling, learning languages |
Overview¶
- [] Multiple joins
- [] Demonstrate multiple
join
s using a pre-prepared PG database - [] What is a subquery?
- []
SELECT
subquery - []
FROM
subquery - []
WHERE
subquery - [] Demostrate subqueries
%load_ext sql
%sql postgresql://testuser:testpass@localhost/insurance
Introduction to SQL JOIN
s¶
A
JOIN
clause is used to combine rows from two or more tables based on a related column.Types of
JOIN
s:INNER JOIN
: Returns records that have matching values in both tables.LEFT JOIN
: Returns all records from the left table and the matched records from the right table.RIGHT JOIN
: Returns all records from the right table and the matched records from the left table.FULL OUTER JOIN
: Returns all records when there is a match in either left or right table.
Case Study¶
This Binder has an insurance database that simulates a fictional insurance company, HT-Insurance, and contains essential information about clients, policies, claims, agents, and payments.
We are going to follow Emma, an actuarial analyst in HT-Insurance and do some insurance-related activities, such as evaluating claims, tracking policy information, and analyzing payments.
Database Tables and Relationships:¶
- Clients: This table stores basic information about each client, such as their name, address, and contact details.
- Agents: This table contains details about the insurance agents who manage policies for clients.
- Policies: This table tracks the insurance policies taken out by clients, including details about policy type, premium, and the agent managing the policy.
- Claims: This table stores data about claims filed by clients on their policies, including the amount of the claim and its current status.
- Payments: This table records payments made by clients toward their policies, including the amount and the type of payment.
Multiple JOIN
s¶
Sometimes we need to combine data from more than two tables.
Emma wants to analyze the total claim amounts managed by each agent, grouped by agent, to see which agents are handling the highest financial exposure.
%%sql
SELECT
a.first_name AS agent_first_name,
a.last_name AS agent_last_name,
COUNT(cl.claim_id) AS total_claims,
SUM(cl.amount) AS total_claim_amount
FROM
agents a
INNER JOIN
policies p ON a.agent_id = p.agent_id
INNER JOIN
claims cl ON p.policy_id = cl.policy_id
GROUP BY
a.first_name, a.last_name
ORDER BY
total_claim_amount DESC;
* postgresql://sedv8808:***@localhost/insurance 5 rows affected.
agent_first_name | agent_last_name | total_claims | total_claim_amount |
---|---|---|---|
Bob | Lee | 4 | 11000.00 |
Eva | Parker | 2 | 9000.00 |
Alice | Johnson | 4 | 7600.00 |
Carol | White | 4 | 7400.00 |
David | Miller | 2 | 6000.00 |
Emma wants to generate a report of clients
whose policies
are expiring within the next year, along with the agents
responsible for those clients. This will help the team follow up for renewals.
%%sql
SELECT
c.first_name AS client_first_name,
c.last_name AS client_last_name,
p.policy_type,
p.end_date,
a.first_name AS agent_first_name,
a.last_name AS agent_last_name
FROM
clients c
INNER JOIN
policies p ON c.client_id = p.client_id
INNER JOIN
agents a ON p.agent_id = a.agent_id
WHERE
p.end_date BETWEEN CURRENT_DATE AND (CURRENT_DATE + INTERVAL '365 days')
ORDER BY
p.end_date ASC;
* postgresql://sedv8808:***@localhost/insurance 6 rows affected.
client_first_name | client_last_name | policy_type | end_date | agent_first_name | agent_last_name |
---|---|---|---|---|---|
Jane | Smith | Home | 2025-06-15 | Bob | Lee |
Jane | Smith | Home | 2025-06-15 | Bob | Lee |
Michael | Davis | Life | 2025-09-12 | Eva | Parker |
Michael | Davis | Life | 2025-09-12 | Eva | Parker |
Linda | Thomas | Life | 2025-10-01 | Eva | Parker |
Linda | Thomas | Life | 2025-10-01 | Eva | Parker |
Emma continues analyzing client claims and their associated agents.
She needs to find out which clients
have unresolved claims
and who their assigned agents
are.
Let's use multiple JOIN
s to combine the relevant tables and retrieve the information:
%%sql
SELECT
c.first_name AS client_first_name,
c.last_name AS client_last_name,
p.policy_type,
cl.claim_id,
cl.claim_date,
cl.amount,
cl.status,
a.first_name AS agent_first_name,
a.last_name AS agent_last_name
FROM
clients c
INNER JOIN
policies p ON c.client_id = p.client_id
INNER JOIN
claims cl ON p.policy_id = cl.policy_id
INNER JOIN
agents a ON p.agent_id = a.agent_id
WHERE
cl.status IN ('Pending', 'Denied');
* postgresql://sedv8808:***@localhost/insurance 10 rows affected.
client_first_name | client_last_name | policy_type | claim_id | claim_date | amount | status | agent_first_name | agent_last_name |
---|---|---|---|---|---|---|---|---|
Jane | Smith | Home | 10 | 2023-07-22 | 5000.00 | Pending | Bob | Lee |
Jane | Smith | Home | 2 | 2023-07-22 | 5000.00 | Pending | Bob | Lee |
Robert | Brown | Health | 11 | 2023-09-10 | 1500.00 | Denied | Carol | White |
Robert | Brown | Health | 3 | 2023-09-10 | 1500.00 | Denied | Carol | White |
Michael | Davis | Life | 13 | 2023-02-11 | 4500.00 | Pending | Eva | Parker |
Michael | Davis | Life | 5 | 2023-02-11 | 4500.00 | Pending | Eva | Parker |
Sarah | Wilson | Home | 14 | 2023-04-05 | 1800.00 | Denied | Alice | Johnson |
Sarah | Wilson | Home | 6 | 2023-04-05 | 1800.00 | Denied | Alice | Johnson |
Patricia | Taylor | Auto | 16 | 2023-08-20 | 2200.00 | Pending | Carol | White |
Patricia | Taylor | Auto | 8 | 2023-08-20 | 2200.00 | Pending | Carol | White |
This query uses INNER JOIN
s to combine data from the clients, policies, claims, and agents tables. We filter the results to focus on claims that are either "Pending" or "Denied" which need attention.
Challenge time¶
- Write a query to list the first name and last name of each client along with the first name and last name of their assigned agent.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
- Write a query to find all policies that have a premium greater than $1000. Display the policy type, start date, end date, and the premium.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
- Write a query to list all pending claims, showing the claim ID, claim date, claim amount, and the client’s first and last name.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
Subqueries¶
Next, Emma wants to identify all clients who have more than one policy.
- A
subquery
ornested query
is a query that is located inside another query. - They can be used to simplify queries (and to avoid
JOIN
s) - They can return anything a normal query can: single values, single/multiple rows, single/multiple columns, entire tables
- May be difficult to understand when reading code
SELECT * FROM table WHERE id IN
(SELECT id FROM table)
Instead of using multiple joins, let's use a subquery to identify the clients.
%%sql
SELECT
first_name,
last_name
FROM
clients
WHERE
client_id IN (
SELECT client_id
FROM policies
GROUP BY client_id
HAVING COUNT(policy_id) > 1
)
LIMIT 3;
* postgresql://sedv8808:***@localhost/insurance 3 rows affected.
first_name | last_name |
---|---|
Daniel | Jackson |
Chris | Anderson |
Robert | Brown |
- The subquery groups all policies by
client_id
and counts how many policies a client has. - The result of this subquery is a set of client IDs (those who have more than one policy) - a derived value.
- These derived
client_id
s are then passed back to the outer query to filter the clients.
%%sql
SELECT
first_name,
last_name
FROM
clients
LIMIT 3
* postgresql://sedv8808:***@localhost/insurance 3 rows affected.
first_name | last_name |
---|---|
John | Doe |
Jane | Smith |
Robert | Brown |
%%sql
SELECT client_id
FROM policies
GROUP BY client_id
HAVING COUNT(policy_id) > 1
LIMIT 3
* postgresql://sedv8808:***@localhost/insurance 3 rows affected.
client_id |
---|
11 |
9 |
3 |
Where are subqueries used?¶
In the SELECT
statement.
%%sql
SELECT
first_name,
last_name,
(SELECT COUNT(*)
FROM policies p
WHERE p.client_id = c.client_id) AS total_policies
FROM
clients c
LIMIT 5;
* postgresql://sedv8808:***@localhost/insurance 5 rows affected.
first_name | last_name | total_policies |
---|---|---|
John | Doe | 2 |
Jane | Smith | 2 |
Robert | Brown | 2 |
Emily | Clark | 2 |
Michael | Davis | 2 |
In the FROM
clause.
%%sql
SELECT
client_name,
total_claims,
claims_summary
FROM
(SELECT c.first_name || ' ' || c.last_name AS client_name,
COUNT(cl.claim_id) AS total_claims
FROM clients c
INNER JOIN policies p ON c.client_id = p.client_id
INNER JOIN claims cl ON p.policy_id = cl.policy_id
GROUP BY c.first_name, c.last_name) AS claims_summary;
* postgresql://sedv8808:***@localhost/insurance 8 rows affected.
client_name | total_claims | claims_summary |
---|---|---|
Michael Davis | 2 | ("Michael Davis",2) |
Sarah Wilson | 2 | ("Sarah Wilson",2) |
Jane Smith | 2 | ("Jane Smith",2) |
James Johnson | 2 | ("James Johnson",2) |
John Doe | 2 | ("John Doe",2) |
Robert Brown | 2 | ("Robert Brown",2) |
Emily Clark | 2 | ("Emily Clark",2) |
Patricia Taylor | 2 | ("Patricia Taylor",2) |
In the WHERE
Clause (Filtering Condition)
%%sql
SELECT
first_name,
last_name
FROM
clients
WHERE
client_id IN (
SELECT client_id
FROM policies
WHERE premium > 1000
);
* postgresql://sedv8808:***@localhost/insurance 2 rows affected.
first_name | last_name |
---|---|
Sarah | Wilson |
Chris | Anderson |
Now, Emma wants to see who the clients with the Highest Claim Amount are.
%%sql
SELECT
first_name,
last_name
FROM
clients
WHERE
client_id IN (
SELECT client_id
FROM policies
WHERE policy_id IN (
SELECT policy_id
FROM claims
WHERE amount = (SELECT MAX(amount) FROM claims)
)
);
* postgresql://sedv8808:***@localhost/insurance 1 rows affected.
first_name | last_name |
---|---|
Jane | Smith |
Emma also needs to analyze the average premium per policy type to help the underwriting team adjust premium rates. She uses a FROM subquery to simplify the aggregation.
%%sql
SELECT
policy_summary.policy_type,
AVG(policy_summary.total_premium) AS avg_premium
FROM
(SELECT p.policy_type, SUM(p.premium) AS total_premium
FROM policies p
GROUP BY p.client_id, p.policy_type) AS policy_summary
GROUP BY
policy_summary.policy_type;
* postgresql://sedv8808:***@localhost/insurance 4 rows affected.
policy_type | avg_premium |
---|---|
Life | 1550.0000000000000000 |
Health | 700.0000000000000000 |
Auto | 1050.0000000000000000 |
Home | 2466.6666666666666667 |
JOIN
vs. Subquery Discussion¶
Let's discuss when a subquery might be preferred over a join, and vice versa.
For example, in the SELECT subquery above, we could have rewritten the query using a JOIN, but the subquery is preferred in certain cases.
Here’s how the same query could be written using a JOIN:
%%sql
SELECT
p.policy_type,
AVG(p.premium) AS avg_premium
FROM
policies p
GROUP BY
p.policy_type;
* postgresql://sedv8808:***@localhost/insurance 4 rows affected.
policy_type | avg_premium |
---|---|
Life | 775.0000000000000000 |
Health | 350.0000000000000000 |
Auto | 525.0000000000000000 |
Home | 1233.3333333333333333 |
We manage to get the same result in a more efficient and straightforward way!
- Subqueries are often preferred for:
- Simplicity: Breaking complex logic into manageable parts.
- Modularity: Subqueries can be used in different parts of the main query.
- Performance: In some cases, subqueries can be faster due to optimized execution plans.
JOIN
s, on the other hand, are preferred when:
- You need to retrieve large datasets with complex relationships.
- You need to return multiple rows rather than a single derived value.
Challenge Time¶
- Write a query to find clients who have never filed a claim. Use a subquery to check which clients do not have any claims.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
- Write a query to find the highest premium for each policy type. Use a subquery to determine the highest premium amount for each policy type and then return the policy details.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
- Practice using a subquery to calculate the total premium for each client and filter the top 3 based on the highest amount.
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?