Lighthouse Labs

W1D5 - Organizing Data

Instructor: Socorro E. Dominguez-Vidana
LHL
Hi Name: Socorro Dominguez-Vidana
Work: University of Wisconsin-Madison
Data Scientist
Hobbies: Kung Fu, traveling, learning languages

Overview¶

  • [] Importance of Organizing Data
  • [] Unions
  • [] Views
  • [] Temporary Tables
  • [] CTEs
  • [] Best practices

Follow with Binder

In [1]:
%load_ext sql
In [2]:
%sql postgresql://sedv8808@localhost/insurance

Importance of Organizing Data¶

  • To make code more understandable for yourself and for others.
  • To improve readability and maintainability.
  • To help in error handling and debugging.
Workplace comics and humor

Work Chronicles. (n.d.). Workplace comics and humor. Retrieved January 6, 2025

What is Organized Data¶

Organized Data has the following properties:

  1. Structured Format: Data is stored in a predefined format, such as rows and columns in a spreadsheet or database table.

  2. Logical Grouping: Data is categorized into logical groups (e.g., by date, category, department, or region).

  3. Consistency: Data follows consistent conventions for formatting, such as dates in a uniform format (e.g., YYYY-MM-DD), standardized names, or units of measurement.

  4. Labeled and Descriptive: Each data point is clearly labeled or associated with metadata, making its purpose or context understandable (e.g. headers like Name, Date, etc).

  5. Easily Queryable or Analyzable: Organized data is stored in a way that allows for efficient querying, filtering, and analysis using tools or programming languages.

UNIONs¶

  • A UNION operator is used to combine the results of two or more SELECT statements into a single result set. The result contains rows from all SELECT queries, stacked vertically.

Types of UNIONs:

  1. UNION: - Combines the result sets of two or more queries and removes duplicate rows from the final output. - Use when you want unique results from multiple queries. - Example:
SELECT name FROM employees
UNION
SELECT name FROM managers;

Big Tech Interviews. (n.d.). SQL UNION vs UNION ALL.

  1. UNION ALL:
    • Combines the result sets of two or more queries but keeps duplicate rows in the output.
    • Use when performance is critical or when duplicates are expected and meaningful.
    • Example:
SELECT name FROM employees
UNION ALL
SELECT name FROM managers;

Big Tech Interviews. (n.d.). SQL UNION vs UNION ALL.

Case Study¶

This Binder has an insurance database that simulates a fictional insurance company, HT-Insurance, and contains essential information about clients, policies, claims, agents, and payments.

We are going to follow Emma, an actuarial analyst in HT-Insurance and do some insurance-related activities, such as evaluating claims, tracking policy information, and analyzing payments.

Database Tables and Relationships:¶

  • Clients: This table stores basic information about each client, such as their name, address, and contact details.
  • Agents: This table contains details about the insurance agents who manage policies for clients.
  • Policies: This table tracks the insurance policies taken out by clients, including details about policy type, premium, and the agent managing the policy.
  • Claims: This table stores data about claims filed by clients on their policies, including the amount of the claim and its current status.
  • Payments: This table records payments made by clients toward their policies, including the amount and the type of payment.

ERD Diagram

Examples¶

Combining Client and Agent Names¶

You want to create a list of all people associated with the company, including both clients and agents, but you only want unique names.

In [3]:
%%sql

SELECT CONCAT(first_name, ' ', last_name) AS person_name
FROM Clients
UNION
SELECT CONCAT(first_name, ' ', last_name) AS person_name
FROM Agents
LIMIT 5;
 * postgresql://sedv8808@localhost/insurance
5 rows affected.
Out[3]:
person_name
James Johnson
Ava Allen
Ethan Lopez
Mia Rivera
Lucas Robinson
Listing All (Repeated) Clients and Agents¶
In [4]:
%%sql

SELECT CONCAT(first_name, ' ', last_name) AS person_name
FROM Clients
UNION ALL
SELECT CONCAT(first_name, ' ', last_name) AS person_name
FROM Agents
ORDER BY person_name DESC
LIMIT 5;
 * postgresql://sedv8808@localhost/insurance
5 rows affected.
Out[4]:
person_name
William Hughes
Sophia Wright
Sophia Martinez
Sarah Wilson
Robert Brown
  • Combines rows from Clients and Agents tables, aligning client_id with agent_id and client_name with agent_name.
  • UNION ALL retains all rows even if they are duplicated.
Comparing Counts Across Tables¶
In [5]:
%%sql

SELECT 'Clients' 
        AS table_name, 
    COUNT(*) AS count_rows,
    COUNT(DISTINCT(client_id)) AS count_distinct
FROM Clients
UNION ALL
SELECT 'Policies' 
        AS table_name,
    COUNT(*) AS count_rows,
    COUNT(DISTINCT(policy_type)) AS count_distinct
FROM Policies
UNION ALL
SELECT 'Claims' 
        AS table_name,
    COUNT(*) AS count_rows,
    COUNT(DISTINCT(status)) AS count_distinct
FROM Claims;
 * postgresql://sedv8808@localhost/insurance
3 rows affected.
Out[5]:
table_name count_rows count_distinct
Clients 30 30
Policies 12 4
Claims 8 3
Combining Financial Transactions by Policy Type (Payments and Claims)¶
In [6]:
%%sql

SELECT 
    pol.policy_type,
    SUM(pa.amount) AS total_amount,
    'Policy Payments' 
        AS transaction_source
FROM policies pol
JOIN payments pa 
    ON pa.policy_id = pol.policy_id
GROUP BY pol.policy_type
UNION ALL
SELECT 
    pol.policy_type,
    SUM(cl.amount) AS total_amount,
    'Claim Settlements' 
        AS transaction_source
FROM claims cl
JOIN policies pol
    ON cl.policy_id = pol.policy_id
GROUP BY policy_type;
 * postgresql://sedv8808@localhost/insurance
8 rows affected.
Out[6]:
policy_type total_amount transaction_source
Life 700.00 Policy Payments
Health 700.00 Policy Payments
Auto 1500.00 Policy Payments
Home 2200.00 Policy Payments
Life 4500.00 Claim Settlements
Health 2000.00 Claim Settlements
Auto 7200.00 Claim Settlements
Home 6800.00 Claim Settlements
  • Calculates the total payment amounts and total settled claim amounts per policy type.
  • Combines these into a single result using UNION ALL.
  • Keeps payments and claim data distinct by including transaction_source.
High-Value Transactions¶
In [7]:
%%sql

SELECT 
    policy_id,
    amount AS transaction_amount,
    'High Payment' 
        AS transaction_type
FROM payments
WHERE amount > 1000
UNION
SELECT 
    policy_id,
    amount AS transaction_amount,
    'High Claim' 
        AS transaction_type
FROM claims
WHERE amount > 1000 AND status = 'Settled';
 * postgresql://sedv8808@localhost/insurance
1 rows affected.
Out[7]:
policy_id transaction_amount transaction_type
6 1200.00 High Payment
  • Combines high-value payments and high-value settled claims into one result set.

When to use UNION¶

  • Combining data from multiple tables
  • Merging data from multiple queries
  • Aggregating data from different sources
  • Same number of columns with compatible data types (column names can be different)

When to use UNION ALL?¶

  • You do not want to lose duplicates.

Remember¶

  • Data types of columns must be compatible
  • Number and order of columns must be same
  • Returns column names from first query

VIEWs¶

Views are "Virtual" tables.

  • Based on Queries: A view is created by defining a SQL query that pulls data from one or more tables.
  • Make data more consistent
  • Improve performance
  • Are typically read-only, but in some cases, they can be updated if they meet certain conditions.
  • Enhance security: Views can restrict access to specific columns or rows, exposing only the required data.
SELECT * FROM my_view

Creating a View¶

CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
In [8]:
%%sql

CREATE VIEW MonthlyPayments AS
SELECT payment_id, policy_id, amount
FROM payments
WHERE payment_type = 'Monthly';
 * postgresql://sedv8808@localhost/insurance
Done.
Out[8]:
[]

Using a View¶

In [9]:
%%sql

SELECT * FROM MonthlyPayments;
 * postgresql://sedv8808@localhost/insurance
1 rows affected.
Out[9]:
payment_id policy_id amount
3 3 300.00

Modifying a View¶

You can update a view's definition using the CREATE OR REPLACE VIEW statement:

In [10]:
%%sql

CREATE OR REPLACE VIEW MonthlyPayments AS
SELECT payment_id, policy_id, amount, payment_type
FROM payments
WHERE payment_type = 'Monthly';
 * postgresql://sedv8808@localhost/insurance
Done.
Out[10]:
[]

Deleting a View¶

You can delete a View the same way you can drop a table.

In [11]:
%%sql

DROP VIEW IF EXISTS MonthlyPayments;
 * postgresql://sedv8808@localhost/insurance
Done.
Out[11]:
[]

Limitations of Views¶

  1. Performance Overhead: Since views are recalculated on each access, they may impact performance for complex or large queries.
  2. Read-Only: Some views cannot be updated or inserted into if they are based on complex queries or use joins and aggregations.
  3. Dependency Issues: Dropping or altering underlying tables can break a view.

Types of Views¶

  1. Simple Views:
  • Based on a single table and do not use functions, joins, or groupings.
  • Usually updatable.
  1. Complex Views:
  • Based on multiple tables and can include joins, subqueries, and aggregation.
  • Often read-only.
  1. Materialized Views:
  • Unlike regular views, materialized views store the query result physically, improving performance for frequent access.

For a Materialized View, the command is:

CREATE MATERIALIZED VIEW MonthlyPayments AS
SELECT payment_id, policy_id, amount
FROM payments
WHERE payment_type = 'Monthly';

Temporary Tables¶

Tables that exist only in a specific "session".

  • Store data temporarily for intermediate calculations or processing.
  • Particularly useful for breaking down complex queries.
  • Temporary tables are automatically dropped at the end of the session or transaction, depending on their scope.
  • To create a Temporary Table:
CREATE TEMP TABLE t_temp_test AS
SELECT * FROM my_table
In [12]:
%%sql

CREATE TEMPORARY TABLE HighValueClaims AS
SELECT claim_id, policy_id, amount
FROM Claims
WHERE amount > 1000;
 * postgresql://sedv8808@localhost/insurance
7 rows affected.
Out[12]:
[]
  • To use it:
SELECT * FROM t_temp_test
In [13]:
%%sql

SELECT * FROM HighValueClaims
 * postgresql://sedv8808@localhost/insurance
7 rows affected.
Out[13]:
claim_id policy_id amount
1 1 2000.00
2 2 5000.00
3 3 1500.00
4 4 3000.00
5 5 4500.00
6 6 1800.00
8 8 2200.00
  • To delete it:
DROP TABLE IF EXISTS t_temp_test
In [14]:
%%sql

DROP TABLE HighValueClaims;
 * postgresql://sedv8808@localhost/insurance
Done.
Out[14]:
[]

Common Table Expressions (CTEs)¶

Common Table Expressions are a temporary, named result set that is defined within a SQL query.

It simplifies complex queries by breaking them into smaller, readable parts.

  1. Temporary Scope:
  • The CTE exists only during the execution of the SQL statement.
  • It is not stored in the database.
  1. Improved Readability:
  • CTEs make complex queries easier to read and maintain by structuring them into logical sections.
  1. Reusable in a Query:
  • A CTE can be referenced multiple times within the same query, reducing repetition.

Defining CTEs¶

WITH cte_name AS (
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
SELECT *
FROM cte_name;

Example: Find clients who have paid more than $1,000 in total.

In [15]:
%%sql

WITH TotalPayments AS (
    SELECT 
        pol.client_id, 
        SUM(pay.amount) AS total_paid
    FROM payments pay
    JOIN policies pol ON pay.policy_id = pol.policy_id
    GROUP BY pol.client_id
)
SELECT 
    cl.client_id, 
    CONCAT(cl.first_name, ' ', cl.last_name) AS client_name, 
    tp.total_paid
FROM TotalPayments tp
JOIN clients cl ON tp.client_id = cl.client_id
WHERE tp.total_paid > 1000;
 * postgresql://sedv8808@localhost/insurance
1 rows affected.
Out[15]:
client_id client_name total_paid
6 Sarah Wilson 1200.00

Why this works?

  • The CTE calculates the total payments (SUM(pay.amount)) made for each client_id.
  • It links the payments table to the policies table using the policy_id column.
  • The GROUP BY pol.client_id ensures we aggregate payments for each unique client_id.

How the CTE works:

  1. The JOIN combines the payments table and policies table using the common column policy_id. This allows us to find out which client_id is associated with each payment.
  2. SUM: For each client_id, the query sums up the amount column from the payments table.
  3. GROUP BY: Ensures the aggregation (SUM) happens at the client_id level, giving one row per client.

The result of the CTE (TotalPayments) is a table that looks something like this:

In [16]:
%%sql

SELECT pol.client_id, 
        SUM(pay.amount) AS total_paid
FROM payments pay
JOIN policies pol ON pay.policy_id = pol.policy_id
GROUP BY pol.client_id;
 * postgresql://sedv8808@localhost/insurance
8 rows affected.
Out[16]:
client_id total_paid
3 300.00
5 700.00
4 550.00
6 1200.00
2 1000.00
7 400.00
1 500.00
8 450.00

The main query:

SELECT 
    cl.client_id, 
    CONCAT(cl.first_name, ' ', cl.last_name) AS client_name, 
    tp.total_paid
FROM TotalPayments tp
JOIN clients cl ON tp.client_id = cl.client_id
WHERE tp.total_paid > 1000;

TotalPayments is joined with the clients table on the client_id column.

  • This allows us to add client details (first_name and last_name) to the total payment data.

This query works because:

  1. It separates the logic into a reusable CTE (TotalPayments) to calculate total payments for each client.
  2. It joins the CTE with the clients table to include client details.
  3. It filters aggregated results using WHERE after the aggregation is done in the CTE.

Challenge Time¶

  1. Create a view called policy_summary that summarizes the total premium paid for each policy type. Include columns for policy_type, total_policies, and total_premium.
  1. Write a query that uses a CTE to calculate:
  • The total number of claims per policy type.
  • The total amount of claim settlements (from claims) per policy type where the status is 'Settled'.
  • Use the CTE result to generate a final output of policy_type, total_claims, and total_claim_amount.
  1. Create a temporary table that stores:
  • All clients who have active policies (i.e., end_date is in the future).
  • Write a query using this temporary table to get the list of clients with their policies, ordered by the policy end date.
  1. Write a query that:
  • Retrieves all policy payments (from payments) and claim settlements (from claims where status is 'Settled').
  • Use UNION for policy payments and claim settlements where the amounts are the same.
  • Use UNION ALL for all other records.
In [ ]: