Lighthouse Labs

W1D4 - Combining Data

Instructor: Socorro E. Dominguez-Vidana
LHL
Hi Name: Socorro Dominguez-Vidana
Work: University of Wisconsin-Madison
Data Scientist
Hobbies: Kung Fu, traveling, learning languages

Overview¶

  • [] Multiple joins
  • [] Demonstrate multiple joins using a pre-prepared PG database
  • [] What is a subquery?
  • [] SELECT subquery
  • [] FROM subquery
  • [] WHERE subquery
  • [] Demostrate subqueries

Follow with Binder

In [1]:
%load_ext sql
In [2]:
%sql postgresql://testuser:testpass@localhost/insurance

Introduction to SQL JOINs¶

  • A JOIN clause is used to combine rows from two or more tables based on a related column.

  • Types of JOINs:

    • INNER JOIN: Returns records that have matching values in both tables. inner
    • LEFT JOIN: Returns all records from the left table and the matched records from the right table. left
    • RIGHT JOIN: Returns all records from the right table and the matched records from the left table. right
    • FULL OUTER JOIN: Returns all records when there is a match in either left or right table. outer

Case Study¶

This Binder has an insurance database that simulates a fictional insurance company, HT-Insurance, and contains essential information about clients, policies, claims, agents, and payments.

We are going to follow Emma, an actuarial analyst in HT-Insurance and do some insurance-related activities, such as evaluating claims, tracking policy information, and analyzing payments.

Database Tables and Relationships:¶

  • Clients: This table stores basic information about each client, such as their name, address, and contact details.
  • Agents: This table contains details about the insurance agents who manage policies for clients.
  • Policies: This table tracks the insurance policies taken out by clients, including details about policy type, premium, and the agent managing the policy.
  • Claims: This table stores data about claims filed by clients on their policies, including the amount of the claim and its current status.
  • Payments: This table records payments made by clients toward their policies, including the amount and the type of payment.

ERD Diagram

Multiple JOINs¶

Sometimes we need to combine data from more than two tables.

Emma wants to analyze the total claim amounts managed by each agent, grouped by agent, to see which agents are handling the highest financial exposure.

mj

In [3]:
%%sql

SELECT 
    a.first_name AS agent_first_name,
    a.last_name AS agent_last_name,
    COUNT(cl.claim_id) AS total_claims,
    SUM(cl.amount) AS total_claim_amount
FROM 
    agents a
INNER JOIN 
    policies p ON a.agent_id = p.agent_id
INNER JOIN 
    claims cl ON p.policy_id = cl.policy_id
GROUP BY 
    a.first_name, a.last_name
ORDER BY 
    total_claim_amount DESC;
 * postgresql://sedv8808:***@localhost/insurance
5 rows affected.
Out[3]:
agent_first_name agent_last_name total_claims total_claim_amount
Bob Lee 4 11000.00
Eva Parker 2 9000.00
Alice Johnson 4 7600.00
Carol White 4 7400.00
David Miller 2 6000.00

Emma wants to generate a report of clients whose policies are expiring within the next year, along with the agents responsible for those clients. This will help the team follow up for renewals.

In [4]:
%%sql

SELECT 
    c.first_name AS client_first_name,
    c.last_name AS client_last_name,
    p.policy_type,
    p.end_date,
    a.first_name AS agent_first_name,
    a.last_name AS agent_last_name
FROM 
    clients c
INNER JOIN 
    policies p ON c.client_id = p.client_id
INNER JOIN 
    agents a ON p.agent_id = a.agent_id
WHERE 
    p.end_date BETWEEN CURRENT_DATE AND (CURRENT_DATE + INTERVAL '365 days')
ORDER BY 
    p.end_date ASC;
 * postgresql://sedv8808:***@localhost/insurance
6 rows affected.
Out[4]:
client_first_name client_last_name policy_type end_date agent_first_name agent_last_name
Jane Smith Home 2025-06-15 Bob Lee
Jane Smith Home 2025-06-15 Bob Lee
Michael Davis Life 2025-09-12 Eva Parker
Michael Davis Life 2025-09-12 Eva Parker
Linda Thomas Life 2025-10-01 Eva Parker
Linda Thomas Life 2025-10-01 Eva Parker

Emma continues analyzing client claims and their associated agents.

She needs to find out which clients have unresolved claims and who their assigned agents are.

multiple_joins

Let's use multiple JOINs to combine the relevant tables and retrieve the information:

In [5]:
%%sql

SELECT 
    c.first_name AS client_first_name, 
    c.last_name AS client_last_name, 
    p.policy_type,
    cl.claim_id, 
    cl.claim_date, 
    cl.amount, 
    cl.status, 
    a.first_name AS agent_first_name, 
    a.last_name AS agent_last_name
FROM 
    clients c
INNER JOIN 
    policies p ON c.client_id = p.client_id
INNER JOIN 
    claims cl ON p.policy_id = cl.policy_id
INNER JOIN 
    agents a ON p.agent_id = a.agent_id
WHERE 
    cl.status IN ('Pending', 'Denied');
 * postgresql://sedv8808:***@localhost/insurance
10 rows affected.
Out[5]:
client_first_name client_last_name policy_type claim_id claim_date amount status agent_first_name agent_last_name
Jane Smith Home 10 2023-07-22 5000.00 Pending Bob Lee
Jane Smith Home 2 2023-07-22 5000.00 Pending Bob Lee
Robert Brown Health 11 2023-09-10 1500.00 Denied Carol White
Robert Brown Health 3 2023-09-10 1500.00 Denied Carol White
Michael Davis Life 13 2023-02-11 4500.00 Pending Eva Parker
Michael Davis Life 5 2023-02-11 4500.00 Pending Eva Parker
Sarah Wilson Home 14 2023-04-05 1800.00 Denied Alice Johnson
Sarah Wilson Home 6 2023-04-05 1800.00 Denied Alice Johnson
Patricia Taylor Auto 16 2023-08-20 2200.00 Pending Carol White
Patricia Taylor Auto 8 2023-08-20 2200.00 Pending Carol White

This query uses INNER JOINs to combine data from the clients, policies, claims, and agents tables. We filter the results to focus on claims that are either "Pending" or "Denied" which need attention.

Challenge time¶

  1. Write a query to list the first name and last name of each client along with the first name and last name of their assigned agent.
In [6]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
  1. Write a query to find all policies that have a premium greater than $1000. Display the policy type, start date, end date, and the premium.
In [7]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
  1. Write a query to list all pending claims, showing the claim ID, claim date, claim amount, and the client’s first and last name.
In [8]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?

Subqueries¶

Next, Emma wants to identify all clients who have more than one policy.

  • A subquery or nested query is a query that is located inside another query.
  • They can be used to simplify queries (and to avoid JOINs)
  • They can return anything a normal query can: single values, single/multiple rows, single/multiple columns, entire tables
  • May be difficult to understand when reading code
SELECT * FROM table WHERE id IN 
    (SELECT id FROM table)

Instead of using multiple joins, let's use a subquery to identify the clients.

In [9]:
%%sql
SELECT 
    first_name, 
    last_name 
FROM 
    clients 
WHERE 
    client_id IN (
        SELECT client_id 
        FROM policies 
        GROUP BY client_id 
        HAVING COUNT(policy_id) > 1
    )
LIMIT 3;
 * postgresql://sedv8808:***@localhost/insurance
3 rows affected.
Out[9]:
first_name last_name
Daniel Jackson
Chris Anderson
Robert Brown
  • The subquery groups all policies by client_id and counts how many policies a client has.
  • The result of this subquery is a set of client IDs (those who have more than one policy) - a derived value.
  • These derived client_ids are then passed back to the outer query to filter the clients.
In [10]:
%%sql
SELECT 
    first_name, 
    last_name 
FROM 
    clients 
LIMIT 3
 * postgresql://sedv8808:***@localhost/insurance
3 rows affected.
Out[10]:
first_name last_name
John Doe
Jane Smith
Robert Brown
In [11]:
%%sql
SELECT client_id 
        FROM policies 
        GROUP BY client_id 
        HAVING COUNT(policy_id) > 1
LIMIT 3
 * postgresql://sedv8808:***@localhost/insurance
3 rows affected.
Out[11]:
client_id
11
9
3

Where are subqueries used?¶

In the SELECT statement.

In [12]:
%%sql
SELECT 
    first_name, 
    last_name,
    (SELECT COUNT(*) 
     FROM policies p 
     WHERE p.client_id = c.client_id) AS total_policies
FROM 
    clients c
LIMIT 5;
 * postgresql://sedv8808:***@localhost/insurance
5 rows affected.
Out[12]:
first_name last_name total_policies
John Doe 2
Jane Smith 2
Robert Brown 2
Emily Clark 2
Michael Davis 2

In the FROM clause.

In [13]:
%%sql
SELECT 
    client_name, 
    total_claims,
    claims_summary
FROM 
    (SELECT c.first_name || ' ' || c.last_name AS client_name, 
            COUNT(cl.claim_id) AS total_claims
     FROM clients c
     INNER JOIN policies p ON c.client_id = p.client_id
     INNER JOIN claims cl ON p.policy_id = cl.policy_id
     GROUP BY c.first_name, c.last_name) AS claims_summary;
 * postgresql://sedv8808:***@localhost/insurance
8 rows affected.
Out[13]:
client_name total_claims claims_summary
Michael Davis 2 ("Michael Davis",2)
Sarah Wilson 2 ("Sarah Wilson",2)
Jane Smith 2 ("Jane Smith",2)
James Johnson 2 ("James Johnson",2)
John Doe 2 ("John Doe",2)
Robert Brown 2 ("Robert Brown",2)
Emily Clark 2 ("Emily Clark",2)
Patricia Taylor 2 ("Patricia Taylor",2)

In the WHERE Clause (Filtering Condition)

In [14]:
%%sql
SELECT 
    first_name, 
    last_name 
FROM 
    clients 
WHERE 
    client_id IN (
        SELECT client_id 
        FROM policies 
        WHERE premium > 1000
    );
 * postgresql://sedv8808:***@localhost/insurance
2 rows affected.
Out[14]:
first_name last_name
Sarah Wilson
Chris Anderson

Now, Emma wants to see who the clients with the Highest Claim Amount are.

In [15]:
%%sql

SELECT 
    first_name, 
    last_name 
FROM 
    clients 
WHERE 
    client_id IN (
        SELECT client_id 
        FROM policies 
        WHERE policy_id IN (
            SELECT policy_id 
            FROM claims 
            WHERE amount = (SELECT MAX(amount) FROM claims)
        )
    );
 * postgresql://sedv8808:***@localhost/insurance
1 rows affected.
Out[15]:
first_name last_name
Jane Smith

Emma also needs to analyze the average premium per policy type to help the underwriting team adjust premium rates. She uses a FROM subquery to simplify the aggregation.

In [16]:
%%sql
SELECT 
    policy_summary.policy_type, 
    AVG(policy_summary.total_premium) AS avg_premium
FROM 
    (SELECT p.policy_type, SUM(p.premium) AS total_premium 
     FROM policies p 
     GROUP BY p.client_id, p.policy_type) AS policy_summary
GROUP BY 
    policy_summary.policy_type;
 * postgresql://sedv8808:***@localhost/insurance
4 rows affected.
Out[16]:
policy_type avg_premium
Life 1550.0000000000000000
Health 700.0000000000000000
Auto 1050.0000000000000000
Home 2466.6666666666666667

JOIN vs. Subquery Discussion¶

Let's discuss when a subquery might be preferred over a join, and vice versa.

For example, in the SELECT subquery above, we could have rewritten the query using a JOIN, but the subquery is preferred in certain cases.

Here’s how the same query could be written using a JOIN:

In [17]:
%%sql
SELECT 
    p.policy_type, 
    AVG(p.premium) AS avg_premium
FROM 
    policies p
GROUP BY 
    p.policy_type;
 * postgresql://sedv8808:***@localhost/insurance
4 rows affected.
Out[17]:
policy_type avg_premium
Life 775.0000000000000000
Health 350.0000000000000000
Auto 525.0000000000000000
Home 1233.3333333333333333

We manage to get the same result in a more efficient and straightforward way!

  • Subqueries are often preferred for:
  1. Simplicity: Breaking complex logic into manageable parts.
  2. Modularity: Subqueries can be used in different parts of the main query.
  3. Performance: In some cases, subqueries can be faster due to optimized execution plans.
  • JOINs, on the other hand, are preferred when:
  1. You need to retrieve large datasets with complex relationships.
  2. You need to return multiple rows rather than a single derived value.

Challenge Time¶

  1. Write a query to find clients who have never filed a claim. Use a subquery to check which clients do not have any claims.
In [18]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
  1. Write a query to find the highest premium for each policy type. Use a subquery to determine the highest premium amount for each policy type and then return the policy details.
In [19]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?
  1. Practice using a subquery to calculate the total premium for each client and filter the top 3 based on the highest amount.
In [20]:
%%sql
UsageError: %%sql is a cell magic, but the cell body is empty. Did you mean the line magic %sql (single %)?