MySQL Subqueries and Nested Operations

Introduction

In this lab, you will explore the power of MySQL subqueries and nested operations. The primary focus is on using subqueries within the WHERE clause to filter data based on conditions derived from other tables or the same table.

You'll learn how to connect to a MySQL server, create a database and tables (customers and orders), and then construct SQL queries that utilize subqueries to identify customers who have placed orders with a total amount exceeding a specific value. The lab also covers using EXISTS with a subquery, testing correlated subqueries, and comparing subquery performance.

Write a Subquery in a WHERE Clause

In this step, you will learn how to use a subquery within the WHERE clause of a SQL statement. A subquery is a query nested inside another query. It's a powerful tool for retrieving data based on conditions derived from other tables or the same table.

Understanding Subqueries

A subquery (or inner query) is a SQL query nested inside a larger query. The subquery is executed first, and its result is used by the outer query. Subqueries can appear in the WHERE, SELECT, FROM, and HAVING clauses.

In the WHERE clause, a subquery is typically used to filter the results of the outer query based on a condition. The subquery returns a single value or a set of values that the outer query uses for comparison.

Scenario

Imagine you have two tables: customers and orders. The customers table contains customer information (e.g., customer_id, name, city), and the orders table contains order information (e.g., order_id, customer_id, order_date, total_amount).

You want to find all customers who have placed at least one order with a total amount greater than $100.

Steps

Connect to the MySQL Server:

Open your terminal and execute the following command to connect to your MySQL server as the root user:
```
sudo mysql -u root
```
You should see the MySQL prompt: mysql>.

Create the Database and Tables:

If you don't already have a database and tables, create them now. Let's create a database named labdb and the customers and orders tables. Execute the following SQL commands in the MySQL prompt:

CREATE DATABASE IF NOT EXISTS labdb;
USE labdb;

CREATE TABLE IF NOT EXISTS customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(255),
    city VARCHAR(255)
);

CREATE TABLE IF NOT EXISTS orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

Insert Sample Data:

Insert some sample data into the tables. Execute the following SQL commands in the MySQL prompt:

INSERT INTO customers (customer_id, name, city) VALUES
(1, 'Alice Smith', 'New York'),
(2, 'Bob Johnson', 'Los Angeles'),
(3, 'Charlie Brown', 'Chicago'),
(4, 'David Lee', 'Houston');

INSERT INTO orders (customer_id, order_date, total_amount) VALUES
(1, '2023-01-15', 120.00),
(2, '2023-02-20', 80.00),
(1, '2023-03-10', 150.00),
(3, '2023-04-05', 200.00),
(2, '2023-05-12', 110.00),
(4, '2023-06-18', 90.00);

Write the Subquery in the WHERE Clause:

Now, write the query to find customers who have placed orders with a total amount greater than $100. Execute the following SQL command in the MySQL prompt:
```
SELECT *
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE total_amount > 100);
```
Explanation:
- The subquery (SELECT customer_id FROM orders WHERE total_amount > 100) selects the customer_id from the orders table where the total_amount is greater than 100.
- The outer query SELECT * FROM customers WHERE customer_id IN (...) selects all columns from the customers table where the customer_id is in the set of customer_ids returned by the subquery.

Observe the Output:

You should see the following output, showing the customers who have placed orders with a total amount greater than $100:

+-------------+-------------+-----------+
| customer_id | name        | city      |
+-------------+-------------+-----------+
|           1 | Alice Smith | New York  |
|           2 | Bob Johnson | Los Angeles |
|           3 | Charlie Brown | Chicago   |
+-------------+-------------+-----------+
3 rows in set (0.00 sec)

Use EXISTS with a Subquery

In this step, you will learn how to use the EXISTS operator with a subquery in MySQL. The EXISTS operator is used to test for the existence of rows in a subquery. It returns TRUE if the subquery returns any rows, and FALSE otherwise.

Understanding EXISTS

The EXISTS operator is often used in the WHERE clause of a SQL statement to filter results based on the existence of related data in another table. It's a powerful alternative to using IN or JOIN operations, and can sometimes be more efficient, especially when dealing with large datasets.

Unlike IN, EXISTS doesn't actually retrieve the data from the subquery. It simply checks if any rows are returned. This can make it faster than IN when you only need to know if a match exists, not the actual values.

Scenario

Continuing with the customers and orders tables from the previous step, let's find all customers who have placed at least one order.

Prerequisites

Make sure you have completed the previous step ("Write a Subquery in a WHERE Clause") and have the labdb database, customers table, and orders table populated with data.

Steps

Write the Query using EXISTS:

Write the query to find customers who have placed at least one order. Execute the following SQL command in the MySQL prompt:
```
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
```
Explanation:
- The outer query SELECT * FROM customers c selects all columns from the customers table, aliased as c.
- The WHERE EXISTS (...) clause checks if the subquery returns any rows.
- The subquery SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id selects the value 1 (it could be any constant value) from the orders table, aliased as o, where the customer_id in the orders table matches the customer_id in the customers table.
- The EXISTS operator returns TRUE if the subquery returns at least one row, indicating that the customer has placed at least one order.

Observe the Output:

You should see the following output, showing all customers who have placed at least one order:

+-------------+-------------+-----------+
| customer_id | name        | city      |
+-------------+-------------+-----------+
|           1 | Alice Smith | New York  |
|           2 | Bob Johnson | Los Angeles |
|           3 | Charlie Brown | Chicago   |
|           4 | David Lee   | Houston   |
+-------------+-------------+-----------+
4 rows in set (0.00 sec)

Modify the Query (Optional):

Let's modify the query to find customers who have not placed any orders. You can do this by using NOT EXISTS. Execute the following SQL command in the MySQL prompt:
```
SELECT *
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
```
Observe the Output:

Since all customers in our sample data have placed orders, this query should return an empty result set:
```
Empty set (0.00 sec)
```

Test a Correlated Subquery

In this step, you will learn about correlated subqueries in MySQL. A correlated subquery is a subquery that references a column from the outer query. This means that the subquery is executed once for each row in the outer query.

Understanding Correlated Subqueries

Unlike a simple subquery, which is executed only once, a correlated subquery depends on the outer query for its values. The subquery uses values from the current row of the outer query to determine its result. This makes correlated subqueries more powerful for certain types of queries, but also potentially less efficient than simple subqueries, especially for large datasets.

Scenario

Continuing with the customers and orders tables, let's find all customers who have placed an order with an amount greater than the average order amount for all orders.

Prerequisites

Make sure you have completed the previous steps and have the labdb database, customers table, and orders table populated with data.

Steps

Write the Correlated Subquery:

Write the query to find customers who have placed an order with an amount greater than the average order amount. Execute the following SQL command in the MySQL prompt:
```
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.total_amount > (SELECT AVG(total_amount) FROM orders)
);
```
Explanation:
- The outer query SELECT c.customer_id, c.name FROM customers c selects the customer_id and name from the customers table, aliased as c.
- The WHERE EXISTS (...) clause checks if the subquery returns any rows.
- The correlated subquery SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id AND o.total_amount > (SELECT AVG(total_amount) FROM orders) selects the value 1 from the orders table, aliased as o, where:
  - o.customer_id = c.customer_id: This is the correlation. The subquery references the customer_id from the outer query's customers table.
  - o.total_amount > (SELECT AVG(total_amount) FROM orders): This condition checks if the order amount is greater than the average order amount across all orders. The AVG(total_amount) subquery is a non-correlated subquery that is executed only once to get the average order amount.

Observe the Output:

You should see the following output, showing the customers who have placed orders with an amount greater than the average order amount:

+-------------+-------------+
| customer_id | name        |
+-------------+-------------+
|           1 | Alice Smith |
|           3 | Charlie Brown |
+-------------+-------------+
2 rows in set (0.00 sec)

Another Example: Find the highest order amount for each customer

Execute the following SQL command in the MySQL prompt to retrieve each customer's ID, name, and their highest order amount. The correlated subquery calculates the maximum order amount for each customer individually.
```
SELECT c.customer_id, c.name, (
    SELECT MAX(o.total_amount)
    FROM orders o
    WHERE o.customer_id = c.customer_id
) AS highest_order_amount
FROM customers c;
```

Observe the Output:

You should see the following output:

+-------------+-------------+-----------------------+
| customer_id | name        | highest_order_amount  |
+-------------+-------------+-----------------------+
|           1 | Alice Smith |                150.00 |
|           2 | Bob Johnson |                110.00 |
|           3 | Charlie Brown |                200.00 |
|           4 | David Lee   |                 90.00 |
+-------------+-------------+-----------------------+
4 rows in set (0.00 sec)

Compare Subquery Performance

In this step, you will learn how to compare the performance of different types of subqueries in MySQL. Understanding the performance characteristics of subqueries is crucial for writing efficient SQL queries, especially when dealing with large datasets.

Understanding Performance Considerations

The performance of a subquery can be affected by several factors, including:

Data Size: The size of the tables involved in the query.
Subquery Type: Whether the subquery is correlated or non-correlated.
Indexing: The presence and effectiveness of indexes on the tables.
MySQL Version: The specific version of MySQL being used, as query optimization techniques can vary.

Scenario

Continuing with the customers and orders tables, let's compare the performance of a subquery using IN versus a subquery using EXISTS to find all customers who have placed at least one order.

Prerequisites

Make sure you have completed the previous steps and have the labdb database, customers table, and orders table populated with data. To make the performance comparison more meaningful, we'll add more data to the orders table.

Steps

Add More Data to the orders Table:

To make the performance comparison more realistic, let's add a significant amount of data to the orders table. We'll insert 1000 orders for each customer using a stored procedure. Execute the following SQL commands in the MySQL prompt:

DELIMITER //
CREATE PROCEDURE insert_many_orders()
BEGIN
  DECLARE i INT DEFAULT 1;
  WHILE i <= 1000 DO
    INSERT INTO orders (customer_id, order_date, total_amount) VALUES (1, CURDATE(), 50.00);
    INSERT INTO orders (customer_id, order_date, total_amount) VALUES (2, CURDATE(), 75.00);
    INSERT INTO orders (customer_id, order_date, total_amount) VALUES (3, CURDATE(), 100.00);
    INSERT INTO orders (customer_id, order_date, total_amount) VALUES (4, CURDATE(), 125.00);
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;
CALL insert_many_orders();
DROP PROCEDURE insert_many_orders;

Explanation:

This SQL script creates a stored procedure called insert_many_orders.
The procedure inserts 1000 orders for each of the four customers into the orders table.
After inserting the data, the procedure is dropped.

Query using IN:

Execute the following query using IN to find all customers who have placed at least one order:
```
SELECT *
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
```
Analyze the Query Execution Plan using EXPLAIN:

Before executing the query, use the EXPLAIN command to analyze the query execution plan. This will give you insights into how MySQL plans to execute the query and identify potential performance bottlenecks. Execute the following SQL command in the MySQL prompt:
```
EXPLAIN SELECT *
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);
```
The output of EXPLAIN will show you the tables being accessed, the indexes being used (if any), and the order in which the operations are performed. Pay attention to the type column, which indicates the type of join or access method used.

Query using EXISTS:

Execute the following query using EXISTS to find all customers who have placed at least one order:

SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

Analyze the Query Execution Plan using EXPLAIN:

Use the EXPLAIN command to analyze the query execution plan for the EXISTS query. Execute the following SQL command in the MySQL prompt:
```
EXPLAIN SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
```
Compare the execution plan with the one from the IN query. Are there any differences in the tables accessed, indexes used, or access methods?
Observations:

In general, EXISTS tends to perform better than IN when the subquery returns a large number of rows. This is because IN needs to compare the outer query's values against all the values returned by the subquery, while EXISTS stops as soon as it finds a match. However, the actual performance can vary depending on the specific query, the data, and the database system. You can use the BENCHMARK() function (as shown in the original document) to get a more precise measurement of execution time, but for this lab, analyzing the EXPLAIN output provides sufficient insight into the query plans.
Clean up (Optional):

If you want to clean up the database and tables, you can execute the following commands in the MySQL prompt:
```
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
DROP DATABASE IF EXISTS labdb;
```
When you have finished all steps, you can exit the MySQL client by typing:
```
exit
```

Summary

In this lab, you learned how to use subqueries within the WHERE clause of a SQL statement to filter data based on conditions derived from other tables or the same table. You practiced connecting to a MySQL server, creating a database and tables, and inserting sample data.

You explored using the IN operator with a subquery to find customers based on related data in the orders table. You also learned about the EXISTS operator as an alternative to IN and practiced using it to check for the existence of related rows.

Furthermore, you were introduced to correlated subqueries, which reference columns from the outer query, and used one to find customers with orders exceeding the average order amount. Finally, you compared the performance of IN and EXISTS subqueries by analyzing their execution plans using the EXPLAIN command, gaining insight into how MySQL processes these queries.