Lighthouse LabsW2D4 - Data Quality Assurance ProcessInstructor: Socorro E. Dominguez-Vidana |
![]() |
![]() |
Name: Socorro Dominguez-Vidana Work: University of Wisconsin-Madison Data Scientist Hobbies: Kung Fu, traveling, learning languages |
Overview¶
- [] Where Data Comes From
- [] ELT Process Definition
- [] What is QA in transforming and analyzing data with SQL? (5 mins)
- [] QA guidelines & process (45 mins)
- [] Break (10 mins)
- [] How to spot errors using the Sakila PG database (40 mins)
- [] Tasks
Introduction to QA¶
Quality assurance QA is any systematic process of determining whether a product or service meets specified requirements.
QA's Goal: Prevent product defects before they arise.
Software Quality Assurance SQA:
- QA teams find patterns to improve development cycles.
- Finding and fixing coding errors can carry unintended consequences (bugs).
- Developers have multiple strategies to identify potential errors and fix them without introducing bugs.
- Most likely a SQA or Data Engineer's job.
Introduction to ETL¶
- ETL stands for Extract, Transform, Load.
- ETL is a process used in data management to ensure the accuracy, consistency, and quality of data before it is used for decision-making, reporting, or analytics.
- It is a data integration process: combines data from multiple sources, transform it into a consistent format, and load it into a target system.
- Centralized data enables quicker access to insights allowing Faster Decision-Making.
- Provides scalability: Works well with large volumes of data across various industries.
- It does not compromise the raw data.

Use Case:¶
Jane had always loved movies. After years of working in a corporate job, she decided it was time to pursue her dream: opening her own DVD rental shop.
To set the shop, Jane found a small storefront with large windows that offered a view of the bustling street.
As she prepared for the grand opening, she realized that managing a rental shop involved more than just having great movies on display; she needed a system to track inventory, rentals, and customer information.
Extraction

Jane used an Excel where she meticulously copied by hand the titles of the movies she owned, creating columns for key information such as the film title, director, and release year.
She realized she was missing some information and so, she turned to online resources like IMDb, where she searched for each title to find the corresponding genre, cast, and rental price.
Her Excel sheet was growing, Jane also noticed that the Excel was not going to be good for storing who rent which movie... or how to reach to her customers.
She created a form for New Customers. But this information was going to have to be ingested somewhere...
Transfrom
Jane realized that Excel was giving her issues: columns were not strictly numbers or text. So, many times, she thought she could sum things up but that was not the case...
Some rental rates were negative too..

She decided to create a relational database because she knew that SQL would guarantee that the data was in the correct format and she would be able to handle her inventory and customer transactions effectively.
She outlined the following tables:
Film Table: This would contain all the movies available for rent. Columns:
FilmID
,Title
,RentalRate
, andAvailability
.Customer Table: Customers' information. Columns:
CustomerID
,Name
,Email
Rental Table: This table would track each rental transaction. Columns:
RentalID
,FilmID (FK)
,CustomerID (FK)
,RentalDate
,ReturnDate
, andPaymentStatus
.

Not only where the tables designed, Jane also defined the data types and relationships between the tables.
She also ensured that the data was standardized, no rates smaller than 0.
Load
Once the database was set up, Jane began populating it with data. She exported her Excel sheet to CSV format and imported the data into the Film
table.
She loaded the information of the forms into the Customers
table and her Rentals
table is ready for any new actions!
Jane's database is ready for real-time use and analysis.

Key Stages:¶
- Extract: Collect data from sources like databases, flat files, and APIs.
- Transform: Clean, aggregate, and format data to meet business requirements. This can include, but is not limited to:
- Filtering
- Cleansing
- Removing duplicates
- Validating and authenticating the data
- Load: Store the transformed data into a data warehouse or another target system.
QA Guidelines¶
When Jane was creating her Database, she did check several things:
- Data Accuracy: Ensures that data transformations produce correct results.
- Consistency: Maintains uniformity across datasets and results.
- Validation: Jane made sure she gave predefined constraints such as allowable values or acceptable formats and data types.
- Completeness: Jane tried to populate all fields and not leave blank spaces with no reason.
- Error Detection: Identifies issues early in the data pipeline to prevent downstream problems.
- Compliance: Ensures that data meets regulatory and business standards.
- Performance Standards: She made sure the database she created can be run from her computer and it is not too slow.
- Security and Privacy: She has made sure that certain tables, such as the ones containing names and addresses are not accessible by anyone.
- Documentation and Schema Management: Jane documented schema changes, relationships, and data models to ensure clarity and consistency.
Applying QA Techniques in SQL Data Analysis¶
The Sakila Database¶
Now, we are Jane.
![]() |
The Sakila database is a sample database provided by MySQL to learn and practice SQL through a realistic and relatable scenario. It simulates a DVD rental store, featuring various tables that represent different aspects of the business, including films, customers, rentals, payments, and staff. Take a look at the documentation |
This might be too complicated. For some existing Databases, we use SchemaSpy, which is a Java-based tool that analyzes the metadata of a schema in a database and generates a visual representation of it in a browser-displayable format.
For the Sakila database, we can find the SchemaSpy here
Sad News: There is no SchemaSpy for Northwind. - Or I couldn't find it.
%load_ext sql
%sql postgresql://testuser:testpass@localhost:5432/sakila
%%sql
SELECT * FROM FILM
LIMIT 2;
* postgresql://testuser:***@localhost:5432/sakila 2 rows affected.
film_id | title | description | release_year | language_id | original_language_id | rental_duration | rental_rate | length | replacement_cost | rating | last_update | special_features | fulltext |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ACADEMY DINOSAUR | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies | 2006 | 1 | None | 6 | 0.99 | 86 | 20.99 | PG | 2006-02-15 05:03:42 | ['Deleted Scenes', 'Behind the Scenes'] | 'academi':1 'battl':15 'canadian':20 'dinosaur':2 'drama':5 'epic':4 'feminist':8 'mad':11 'must':14 'rocki':21 'scientist':12 'teacher':17 |
2 | ACE GOLDFINGER | A Astounding Epistle of a Database Administrator And a Explorer who must Find a Car in Ancient China | 2006 | 1 | None | 3 | 4.99 | 48 | 12.99 | G | 2006-02-15 05:03:42 | ['Trailers', 'Deleted Scenes'] | 'ace':1 'administr':9 'ancient':19 'astound':4 'car':17 'china':20 'databas':8 'epistl':5 'explor':12 'find':15 'goldfing':2 'must':14 |
%%sql
SELECT * FROM ACTOR
LIMIT 5
* postgresql://testuser:***@localhost:5432/sakila 5 rows affected.
actor_id | first_name | last_name | last_update |
---|---|---|---|
1 | PENELOPE | GUINESS | 2006-02-15 04:34:33 |
2 | NICK | WAHLBERG | 2006-02-15 04:34:33 |
3 | ED | CHASE | 2006-02-15 04:34:33 |
4 | JENNIFER | DAVIS | 2006-02-15 04:34:33 |
5 | JOHNNY | LOLLOBRIGIDA | 2006-02-15 04:34:33 |
- Validate Data Types: To check the data types of the columns in a table, you can query the information schema:
%%sql
-- Validate data types for the 'film' table
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'film';
* postgresql://testuser:***@localhost:5432/sakila 14 rows affected.
column_name | data_type | is_nullable |
---|---|---|
film_id | integer | NO |
title | character varying | NO |
description | text | YES |
release_year | integer | YES |
language_id | integer | NO |
original_language_id | integer | YES |
rental_duration | smallint | NO |
rental_rate | numeric | NO |
length | smallint | YES |
replacement_cost | numeric | NO |
rating | USER-DEFINED | YES |
last_update | timestamp without time zone | NO |
special_features | ARRAY | YES |
fulltext | tsvector | NO |
- Data Profiling: Count the total rows and retrieve sample data from a table.
%%sql
SELECT COUNT(*) AS total_rows FROM customer;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
total_rows |
---|
600 |
%%sql
SELECT *
FROM customer
LIMIT 5;
* postgresql://testuser:***@localhost:5432/sakila 5 rows affected.
customer_id | store_id | first_name | last_name | address_id | activebool | create_date | last_update | active | |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | MARY | SMITH | MARY.SMITH@sakilacustomer.org | 5 | True | 2006-02-14 | 2006-02-15 04:57:20 | 1 |
2 | 1 | PATRICIA | JOHNSON | PATRICIA.JOHNSON@sakilacustomer.org | 6 | True | 2006-02-14 | 2006-02-15 04:57:20 | 1 |
3 | 1 | LINDA | WILLIAMS | LINDA.WILLIAMS@sakilacustomer.org | 7 | True | 2006-02-14 | 2006-02-15 04:57:20 | 1 |
4 | 2 | BARBARA | JONES | BARBARA.JONES@sakilacustomer.org | 8 | True | 2006-02-14 | 2006-02-15 04:57:20 | 1 |
5 | 1 | ELIZABETH | BROWN | ELIZABETH.BROWN@sakilacustomer.org | 9 | True | 2006-02-14 | 2006-02-15 04:57:20 | 1 |
- Unit Testing: You can run unit tests on specific queries to assess data consistency and integrity.
%%sql
SELECT COUNT(*) FROM film;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
count |
---|
1003 |
How many unique rentals happened?
%%sql
SELECT rental_rate, COUNT(*) AS film_count
FROM film
GROUP BY rental_rate;
* postgresql://testuser:***@localhost:5432/sakila 4 rows affected.
rental_rate | film_count |
---|---|
2.99 | 323 |
4.99 | 336 |
-0.99 | 3 |
0.99 | 341 |
How many unique customers are in the 'customer' table?
%%sql
SELECT COUNT(DISTINCT customer_id) FROM customer;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
count |
---|
600 |
Are there any nulls in the film_id
?
%%sql
SELECT COUNT(*) FROM inventory WHERE film_id IS NULL;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
count |
---|
0 |
- Regular Audits Are there any negative rental rates? (We saw earlier that no, but this is an alternate way of checking)
%%sql
SELECT *
FROM film
WHERE rental_rate < 0;
* postgresql://testuser:***@localhost:5432/sakila 3 rows affected.
film_id | title | description | release_year | language_id | original_language_id | rental_duration | rental_rate | length | replacement_cost | rating | last_update | special_features | fulltext |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001 | Negative Movie 1 | Negative Film 1 | 2024 | 1 | None | 3 | -0.99 | 120 | 19.99 | PG | 2024-09-23 16:36:02.580938 | None | '1':3,6 'film':5 'movi':2 'negat':1,4 |
1002 | Negative Movie 2 | Negative Film 2 | 2024 | 1 | None | 5 | -0.99 | 90 | 24.99 | R | 2024-09-23 16:36:02.580938 | None | '2':3,6 'film':5 'movi':2 'negat':1,4 |
1003 | Negative Movie 3 | Negative Film 3 | 2024 | 1 | None | 7 | -0.99 | 150 | 29.99 | G | 2024-09-23 16:36:02.580938 | None | '3':3,6 'film':5 'movi':2 'negat':1,4 |
Manipulating Tables when Finding Errors¶
- Fix Missing or Null Values in Critical Columns
Let's check for missing values in customer_id
, first_name
, last_name
or email
in the customer table.
%%sql
SELECT customer_id, first_name, last_name, email
FROM customer
WHERE first_name IS NULL OR last_name IS NULL OR email IS NULL;
* postgresql://testuser:***@localhost:5432/sakila 0 rows affected.
customer_id | first_name | last_name |
---|
That's great! What would happen if I actually tried to enter a new entry in customers where I had a NULL name?
%%sql
-- Add customer with a NULL first name
INSERT INTO customer (store_id, first_name, last_name, email, address_id, active, create_date)
VALUES (1, NULL, 'Smith', 'smith@example.com', 2, 1, NOW());
* postgresql://testuser:***@localhost:5432/sakila (psycopg2.errors.NotNullViolation) null value in column "first_name" of relation "customer" violates not-null constraint DETAIL: Failing row contains (606, 1, null, Smith, smith@example.com, 2, t, 2024-09-24, 2024-09-24 17:58:35.640635, 1). [SQL: -- Add customer with a NULL first name INSERT INTO customer (store_id, first_name, last_name, email, address_id, active, create_date) VALUES (1, NULL, 'Smith', 'smith@example.com', 2, 1, NOW());] (Background on this error at: https://sqlalche.me/e/20/gkpj)
Let's create a new table customers2
, where NULL values will be allowed:
%%sql
DROP TABLE IF EXISTS customers2;
CREATE TABLE customers2 AS
SELECT * FROM customer;
ALTER TABLE customers2 ALTER COLUMN first_name DROP NOT NULL;
ALTER TABLE customers2 ALTER COLUMN last_name DROP NOT NULL;
ALTER TABLE customers2 ALTER COLUMN email DROP NOT NULL;
* postgresql://testuser:***@localhost:5432/sakila Done. 600 rows affected. Done. Done. Done.
[]
%%sql
-- Add customer2 with a NULL first name
INSERT INTO customers2 (store_id, first_name, last_name, email, address_id, active, create_date)
VALUES (1, NULL, 'Smith', 'smith@example.com', 2, 1, NOW());
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
[]
%%sql
SELECT customer_id, first_name, last_name, email
FROM customers2
WHERE first_name IS NULL OR last_name IS NULL OR email IS NULL;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
customer_id | first_name | last_name | |
---|---|---|---|
None | None | Smith | smith@example.com |
Assuming NULLs were possible.
Possible fix:
Do not alter your raw data; ideally create a new table.
-- Create or replace a fixed version of the 'customer' table with placeholders for missing values
CREATE OR REPLACE TABLE customer_fixed AS
SELECT customer_id,
COALESCE(first_name, 'Unknown') AS first_name,
COALESCE(last_name, 'Unknown') AS last_name,
COALESCE(email, 'no_email@domain.com') AS email
FROM customer;
- Detect Negative Rental Rates
%%sql
SELECT * from film
limit 1;
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
film_id | title | description | release_year | language_id | original_language_id | rental_duration | rental_rate | length | replacement_cost | rating | last_update | special_features | fulltext |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ACADEMY DINOSAUR | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies | 2006 | 1 | None | 6 | 0.99 | 86 | 20.99 | PG | 2006-02-15 05:03:42 | ['Deleted Scenes', 'Behind the Scenes'] | 'academi':1 'battl':15 'canadian':20 'dinosaur':2 'drama':5 'epic':4 'feminist':8 'mad':11 'must':14 'rocki':21 'scientist':12 'teacher':17 |
%%sql
SELECT film_id, title, rental_rate
FROM film
WHERE rental_rate < 0;
* postgresql://testuser:***@localhost:5432/sakila 3 rows affected.
film_id | title | rental_rate |
---|---|---|
1001 | Negative Movie 1 | -0.99 |
1002 | Negative Movie 2 | -0.99 |
1003 | Negative Movie 3 | -0.99 |
Possible fix: Set Negative Rental Rates to 0.99
CREATE OR REPLACE TABLE film_fixed AS
SELECT film_id, title,
CASE
WHEN rental_rate < 0 THEN 0.99
ELSE rental_rate
END AS rental_rate
FROM film;
- Detect Duplicate Customers based on name and email.
%%sql
SELECT first_name, last_name, email, COUNT(*)
FROM customers2
GROUP BY first_name, last_name, email
HAVING COUNT(*) > 1;
* postgresql://testuser:***@localhost:5432/sakila 0 rows affected.
first_name | last_name | count |
---|
Possible fix: Deduplicate Customers by keeping only the first occurrence
CREATE OR REPLACE TABLE customer_deduped AS
WITH ranked_customers AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY first_name, last_name, email ORDER BY customer_id) AS rn
FROM customer
)
SELECT *
FROM ranked_customers
WHERE rn = 1;
- Detect Orphaned Records in Inventory:
film_id
does not exist ininventory
table.
%%sql
SELECT i.inventory_id, i.film_id
FROM inventory i
LEFT JOIN film f ON i.film_id = f.film_id
WHERE f.film_id IS NULL;
* postgresql://testuser:***@localhost:5432/sakila 0 rows affected.
inventory_id | film_id |
---|
%%sql
INSERT INTO inventory (film_id, store_id, last_update)
VALUES
(10001, 1, NOW());
* postgresql://testuser:***@localhost:5432/sakila (psycopg2.errors.ForeignKeyViolation) insert or update on table "inventory" violates foreign key constraint "inventory_film_id_fkey" DETAIL: Key (film_id)=(10001) is not present in table "film". [SQL: INSERT INTO inventory (film_id, store_id, last_update) VALUES (10001, 1, NOW());] (Background on this error at: https://sqlalche.me/e/20/gkpj)
I can have movies information - just not have them in the inventory.
%%sql
SELECT i.inventory_id, i.film_id
FROM inventory i
WHERE i.film_id = 1001
* postgresql://testuser:***@localhost:5432/sakila 0 rows affected.
inventory_id | film_id |
---|
%%sql
SELECT f.film_id
FROM film f
WHERE f.film_id = 1001
* postgresql://testuser:***@localhost:5432/sakila 1 rows affected.
film_id |
---|
1001 |
Possible fix: Remove Orphaned Records
CREATE OR REPLACE TABLE inventory_fixed AS
SELECT *
FROM inventory
WHERE film_id IN (SELECT film_id FROM film);
- Fix Inconsistent or Outdated Data:
Detect Overdue Rentals (Illogical Return Dates): Rentals wherereturn_date
is earlier thanrental_date
%%sql
SELECT rental_id, rental_date, return_date
FROM rental
WHERE return_date < rental_date;
* postgresql://testuser:***@localhost:5432/sakila 0 rows affected.
rental_id | rental_date | return_date |
---|
%%sql
SELECT rental_id, rental_date, return_date
FROM rental_reverse
WHERE return_date < rental_date
LIMIT 2;
* postgresql://testuser:***@localhost:5432/sakila 2 rows affected.
rental_id | rental_date | return_date |
---|---|---|
1 | 2005-05-26 22:04:30 | 2005-05-24 22:53:30 |
2 | 2005-05-28 19:40:33 | 2005-05-24 22:54:33 |
Possible Fix: Set Invalid Return Dates to a Default Value (e.g., rental_date + 1 day)
CREATE OR REPLACE TABLE rental_fixed AS
SELECT rental_id, rental_date,
CASE
WHEN return_date < rental_date THEN rental_date + INTERVAL '1 day'
ELSE return_date
END AS return_date
FROM rental;
Summary¶
Quality Assurance in SQL is vital for ensuring the reliability, accuracy, and integrity of data stored within relational databases.
Through our exploration of the Sakila database, we have applied various SQL queries that underscore the importance of QA practices in data management.
The queries applied to the Sakila database enhance our understanding of data, promote integrity, and facilitate informed decision-making.
Challenge Time¶
Task 1: Write a query to ensure that the number of active customers listed in the customer table matches the number of customers who have rentals in the rental table.
%load_ext sql
%sql postgresql://testuser:testpass@localhost:5432/sakila
%%sql
Task 2: Write a query to check whether each store has a minimum number of copies of each film in their inventory (e.g., at least 5 copies per film in each store).
%%sql
Task 3: Discuss the importance of using foreign key constraints and referential integrity in maintaining data accuracy between tables in the Sakila database (e.g., linking the rental table to the inventory and customer tables)
Appendix¶
DDL Queries
- Creating tables:
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
...
);
- Dropping tables:
DROP TABLE table_name;
- Altering tables (adding a column):
ALTER TABLE table_name
ADD column_name datatype;
- Alter tables (dropping a column):
ALTER TABLE table_name
DROP COLUMN column_name;
- Altering tables (modifying a column):
ALTER TABLE table_name
MODIFY column_name datatype;