PostgreSQL Advanced Query Writing

Introduction

In this lab, you will enhance your PostgreSQL query writing skills by exploring advanced techniques. You'll learn to use subqueries within the WHERE clause to filter data based on the results of another query.

The lab guides you through defining and utilizing Common Table Expressions (CTEs) for improved query readability and modularity. Furthermore, you'll apply window functions like ROW_NUMBER to perform calculations across sets of rows. Finally, you'll master grouping and filtering data using GROUP BY and HAVING clauses to extract meaningful insights from your datasets.

Write a Subquery in a WHERE Clause

In this step, you will learn how to use a subquery within the WHERE clause of a SQL query. A subquery, also known as an inner query or nested query, is a query embedded inside another query. Subqueries are used to return data that will be used in the main query as a condition to further restrict the data to be retrieved.

Understanding Subqueries in the WHERE Clause

A subquery in the WHERE clause is typically used to compare a column's value against the result of the subquery. The subquery is executed first, and its result is then used by the outer query.

Basic Syntax:

SELECT column1, column2
FROM table_name
WHERE column_name IN (SELECT column_name FROM another_table WHERE condition);

Here, the subquery (SELECT column_name FROM another_table WHERE condition) returns a set of values. The outer query then selects rows from table_name where column_name is in that set.

Scenario:

Let's assume you have two tables: employees and departments. The employees table contains information about employees, including their employee_id, employee_name, and department_id. The departments table contains information about departments, including their department_id and department_name.

We want to find all employees who work in the 'Sales' department.

Step 1: Create the Tables and Insert Data

First, connect to the PostgreSQL database using the postgres user:

sudo -u postgres psql

Next, create the departments table:

CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(50)
);

Insert some sample data into the departments table:

INSERT INTO departments (department_name) VALUES
('Sales'),
('Marketing'),
('Engineering');

Now, create the employees table:

CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    employee_name VARCHAR(50),
    department_id INTEGER REFERENCES departments(department_id)
);

Insert some sample data into the employees table:

INSERT INTO employees (employee_name, department_id) VALUES
('Alice', 1),
('Bob', 2),
('Charlie', 1),
('David', 3),
('Eve', 2);

Step 2: Write the Subquery

Now, let's write the query to find all employees who work in the 'Sales' department using a subquery in the WHERE clause.

SELECT employee_name
FROM employees
WHERE department_id IN (SELECT department_id FROM departments WHERE department_name = 'Sales');

Explanation:

The subquery (SELECT department_id FROM departments WHERE department_name = 'Sales') selects the department_id from the departments table where the department_name is 'Sales'. In this case, it will return 1.
The outer query then selects the employee_name from the employees table where the department_id is in the set returned by the subquery (which is just 1).

Step 3: Execute the Query and View the Results

Execute the query in your psql terminal. You should see the following output:

 employee_name
---------------
 Alice
 Charlie
(2 rows)

This shows that Alice and Charlie are the employees who work in the 'Sales' department.

Step 4: Using EXISTS with Subqueries

Another way to use subqueries in the WHERE clause is with the EXISTS operator. The EXISTS operator tests for the existence of rows in a subquery. It returns true if the subquery returns any rows, and false otherwise.

Here's an example of using EXISTS to achieve the same result:

SELECT employee_name
FROM employees e
WHERE EXISTS (
    SELECT 1
    FROM departments d
    WHERE d.department_id = e.department_id
    AND d.department_name = 'Sales'
);

This query achieves the same result as the previous one, but uses the EXISTS operator instead of IN.

Explanation:

The subquery SELECT 1 FROM departments d WHERE d.department_id = e.department_id AND d.department_name = 'Sales' checks if there is a department with the name 'Sales' that has the same department_id as the current employee.
If the subquery returns any rows (meaning there is a 'Sales' department with the same department_id), the EXISTS operator returns true, and the employee's name is selected.

Execute the query in your psql terminal. You should see the same output as before:

 employee_name
---------------
 Alice
 Charlie
(2 rows)

You have now successfully used a subquery in the WHERE clause to filter data based on a condition in another table. You've also learned how to use the EXISTS operator with a subquery.

Define and Use a CTE

In this step, you will learn how to define and use a Common Table Expression (CTE) in PostgreSQL. A CTE is a temporary named result set that you can reference within a single SELECT, INSERT, UPDATE, or DELETE statement. CTEs are useful for breaking down complex queries into simpler, more readable parts.

Understanding CTEs

CTEs are defined using the WITH clause. They exist only for the duration of the query execution.

Basic Syntax:

WITH cte_name AS (
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
SELECT column1, column2
FROM cte_name
WHERE condition;

Here, cte_name is the name you give to the CTE. The SELECT statement inside the parentheses defines the CTE's result set. The outer SELECT statement then queries the CTE as if it were a regular table.

Scenario:

Continuing with the employees and departments tables from the previous step, let's use a CTE to find the names of employees and their corresponding department names.

Step 1: Verify the Tables and Data

Ensure that the employees and departments tables exist and contain the data from the previous step. You can verify this by running the following queries in your psql terminal:

SELECT * FROM departments;
SELECT * FROM employees;

If the tables or data are missing, refer to the previous step to create them and insert the data.

Step 2: Define a CTE

Now, let's define a CTE called EmployeeDepartments that joins the employees and departments tables to retrieve the employee names and their department names.

WITH EmployeeDepartments AS (
    SELECT
        e.employee_name,
        d.department_name
    FROM
        employees e
    JOIN
        departments d ON e.department_id = d.department_id
)
SELECT employee_name, department_name
FROM EmployeeDepartments;

Explanation:

The WITH EmployeeDepartments AS (...) clause defines the CTE named EmployeeDepartments.
The SELECT statement inside the parentheses joins the employees table (aliased as e) with the departments table (aliased as d) on the department_id column.
The outer SELECT statement then retrieves the employee_name and department_name from the EmployeeDepartments CTE.

Step 3: Execute the Query and View the Results

Execute the query in your psql terminal. You should see the following output:

 employee_name | department_name
---------------+-----------------
 Alice         | Sales
 Bob           | Marketing
 Charlie       | Sales
 David         | Engineering
 Eve           | Marketing
(5 rows)

This shows the names of all employees and their corresponding department names.

Step 4: Using CTEs for Filtering

You can also use CTEs to filter data. For example, let's find all employees who work in the 'Sales' department using the EmployeeDepartments CTE.

WITH EmployeeDepartments AS (
    SELECT
        e.employee_name,
        d.department_name
    FROM
        employees e
    JOIN
        departments d ON e.department_id = d.department_id
)
SELECT employee_name
FROM EmployeeDepartments
WHERE department_name = 'Sales';

Explanation:

This query is similar to the previous one, but it adds a WHERE clause to the outer SELECT statement to filter the results to only include employees who work in the 'Sales' department.

Execute the query in your psql terminal. You should see the following output:

 employee_name
---------------
 Alice
 Charlie
(2 rows)

This shows that Alice and Charlie are the employees who work in the 'Sales' department.

You have now successfully defined and used a CTE to join tables and filter data. CTEs can greatly improve the readability and maintainability of complex SQL queries.

Apply Window Functions (e.g., ROW_NUMBER)

In this step, you will learn how to apply window functions in PostgreSQL. Window functions perform calculations across a set of table rows that are related to the current row. They are similar to aggregate functions, but unlike aggregate functions, window functions do not group rows into a single output row. Instead, they provide a value for each row in the result set.

Understanding Window Functions

Window functions use the OVER() clause to define the window of rows for the calculation. The OVER() clause can include PARTITION BY and ORDER BY clauses to further define the window.

Basic Syntax:

SELECT
    column1,
    column2,
    WINDOW_FUNCTION(column3) OVER (
        [PARTITION BY column4]
        [ORDER BY column5]
    ) AS alias_name
FROM
    table_name;

WINDOW_FUNCTION: The name of the window function (e.g., ROW_NUMBER, RANK, SUM, AVG).
OVER(): Specifies the window over which the function operates.
PARTITION BY: Divides the rows into partitions, and the window function is applied to each partition independently.
ORDER BY: Defines the order of rows within each partition.
alias_name: The alias for the calculated window function result.

Scenario:

Continuing with the employees and departments tables from the previous steps, let's use the ROW_NUMBER() window function to assign a unique rank to each employee within their respective departments based on their employee_name.

Step 1: Apply the ROW_NUMBER() Window Function

Now, let's write a query that uses the ROW_NUMBER() window function to assign a rank to each employee within their department.

SELECT
    employee_name,
    department_id,
    ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY employee_name) AS employee_rank
FROM
    employees;

Explanation:

ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY employee_name): This is the window function.
- ROW_NUMBER(): Assigns a unique sequential integer to each row within the window.
- PARTITION BY department_id: Divides the rows into partitions based on the department_id. This means that the ranking will be done separately for each department.
- ORDER BY employee_name: Specifies the order in which the rows within each partition are ranked. In this case, employees are ranked alphabetically by their employee_name.
employee_rank: This is the alias given to the result of the window function.

Step 2: Execute the Query and View the Results

Execute the query in your psql terminal. You should see the following output:

 employee_name | department_id | employee_rank
---------------+---------------+---------------
 Alice         |             1 |             1
 Charlie       |             1 |             2
 Bob           |             2 |             1
 Eve           |             2 |             2
 David         |             3 |             1
(5 rows)

This shows each employee's name, their department ID, and their rank within their department. For example, Alice is ranked 1 in department 1, and Charlie is ranked 2 in department 1.

Step 3: Using Window Functions with CTEs

You can also use window functions within CTEs to make your queries more organized. Let's rewrite the previous query using a CTE.

WITH RankedEmployees AS (
    SELECT
        employee_name,
        department_id,
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY employee_name) AS employee_rank
    FROM
        employees
)
SELECT
    employee_name,
    department_id,
    employee_rank
FROM
    RankedEmployees;

This query produces the same result as the previous one, but it uses a CTE to encapsulate the window function calculation.

Execute the query in your psql terminal. You should see the same output as before.

Group and Filter with GROUP BY and HAVING

In this step, you will learn how to use the GROUP BY and HAVING clauses in PostgreSQL to group rows and filter the grouped results. The GROUP BY clause groups rows that have the same values in specified columns into summary rows, like finding the number of employees in each department. The HAVING clause is used to filter these grouped rows based on a specified condition.

Understanding GROUP BY and HAVING

GROUP BY: This clause groups rows with the same values in one or more columns into a summary row. You typically use aggregate functions (e.g., COUNT, SUM, AVG, MIN, MAX) to calculate values for each group.
HAVING: This clause filters the groups created by the GROUP BY clause. It's similar to the WHERE clause, but it operates on groups rather than individual rows.

Basic Syntax:

SELECT
    column1,
    aggregate_function(column2)
FROM
    table_name
WHERE
    condition
GROUP BY
    column1
HAVING
    aggregate_function(column2) condition;

column1: The column to group by.
aggregate_function(column2): An aggregate function applied to column2 for each group.
WHERE: Filters rows before grouping.
GROUP BY: Groups rows based on the values in column1.
HAVING: Filters groups after grouping, based on the result of the aggregate function.

Scenario:

Continuing with the employees and departments tables, let's use GROUP BY and HAVING to find the departments that have more than one employee.

Step 1: Group by Department and Count Employees

First, let's write a query to group the employees by department and count the number of employees in each department.

SELECT
    d.department_name,
    COUNT(e.employee_id) AS employee_count
FROM
    employees e
JOIN
    departments d ON e.department_id = d.department_id
GROUP BY
    d.department_name;

Explanation:

SELECT d.department_name, COUNT(e.employee_id) AS employee_count: This selects the department name and the count of employee IDs for each department.
FROM employees e JOIN departments d ON e.department_id = d.department_id: This joins the employees and departments tables on the department_id column.
GROUP BY d.department_name: This groups the rows by department name, so the COUNT() function will count the number of employees in each department.

Step 2: Execute the Query and View the Results

Execute the query in your psql terminal. You should see the following output:

 department_name | employee_count
-----------------+----------------
 Engineering     |              1
 Marketing       |              2
 Sales           |              2
(3 rows)

This shows the number of employees in each department.

Step 3: Filter with HAVING

Now, let's add a HAVING clause to filter the results to only include departments that have more than one employee.

SELECT
    d.department_name,
    COUNT(e.employee_id) AS employee_count
FROM
    employees e
JOIN
    departments d ON e.department_id = d.department_id
GROUP BY
    d.department_name
HAVING
    COUNT(e.employee_id) > 1;

Explanation:

This query is the same as the previous one, but it adds a HAVING clause:
- HAVING COUNT(e.employee_id) > 1: This filters the groups to only include departments where the count of employee IDs is greater than 1.

Step 4: Execute the Query and View the Results

Execute the query in your psql terminal. You should see the following output:

 department_name | employee_count
-----------------+----------------
 Marketing       |              2
 Sales           |              2
(2 rows)

This shows only the departments that have more than one employee (Marketing and Sales).

You have now successfully used the GROUP BY and HAVING clauses to group rows and filter the grouped results based on a condition.

Summary

In this lab, you have learned how to write advanced PostgreSQL queries. You started by using subqueries within the WHERE clause to filter data based on the results of another query. This involved understanding the syntax and application of subqueries to compare column values against a set of values returned by the inner query.

You then moved on to defining and utilizing Common Table Expressions (CTEs) to improve query readability and modularity. CTEs allow you to break down complex queries into simpler, more manageable parts.

Next, you applied window functions like ROW_NUMBER to perform calculations across sets of rows. Window functions are similar to aggregate functions, but they provide a value for each row in the result set, rather than grouping rows into a single output row.

Finally, you mastered grouping and filtering data using GROUP BY and HAVING clauses to extract meaningful insights from your datasets. The GROUP BY clause groups rows that have the same values in specified columns into summary rows, while the HAVING clause filters these grouped rows based on a specified condition.