MySQL Partitioning for Large Datasets

Introduction

In this lab, you will learn how to implement MySQL partitioning for large datasets to improve query performance and data management. The lab focuses on partitioning a table by range, specifically using the sale_date column of a sales table.

You will begin by connecting to the MySQL server and creating a sales_data database. Then, you'll create the sales table, partitioning it by the year of the sale_date into partitions for 2020, 2021, 2022, 2023, and a future partition. Subsequent steps will cover querying data from specific partitions, reorganizing partitions using ALTER TABLE, and checking the impact of partitioning on query speed.

Note: For this lab, you only need to enter the MySQL shell once at the beginning and exit at the end. All SQL commands in the following steps should be executed within the same MySQL session.

Create a Partitioned Table

In this step, we will create a database and a partitioned table in MySQL. Partitioning helps in managing large datasets by dividing a table into smaller, more manageable parts based on a specified rule. This can significantly improve query performance, especially for queries that filter data based on the partitioning key.

First, open a terminal in the LabEx VM. You should already be in the ~/project directory.

Connect to the MySQL server as the root user (do this only once at the start of the lab):

sudo mysql -u root

You are now in the MySQL shell. All subsequent SQL commands should be run in this session until you finish the lab.

Let's create a database named sales_data to store our table:

CREATE DATABASE sales_data;

Switch to the newly created database:

USE sales_data;

Now, we will create a table named sales and partition it by the year of the sale_date column. We will create partitions for the years 2020, 2021, 2022, 2023, and a catch-all partition for future dates.

CREATE TABLE sales (
    sale_id INT NOT NULL,
    sale_date DATE NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (sale_id, sale_date)
)
PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

Let's understand the PARTITION BY RANGE clause:

PARTITION BY RANGE (YEAR(sale_date)): This specifies that the table will be partitioned based on the range of values returned by the YEAR() function applied to the sale_date column.
PARTITION p2020 VALUES LESS THAN (2021): This creates a partition named p2020. Any row where the year of sale_date is less than 2021 (i.e., 2020) will be stored in this partition.
PARTITION p2021 VALUES LESS THAN (2022): This creates a partition named p2021 for data from the year 2021.
PARTITION p2022 VALUES LESS THAN (2023): This creates a partition named p2022 for data from the year 2022.
PARTITION p2023 VALUES LESS THAN (2024): This creates a partition named p2023 for data from the year 2023.
PARTITION pFuture VALUES LESS THAN MAXVALUE: This creates a partition named pFuture that will store any data with a sale_date year greater than or equal to 2024. MAXVALUE is a special value that is always greater than any other value.

After executing the CREATE TABLE statement, you can verify the table structure and its partitions using the following command:

SHOW CREATE TABLE sales;

Look for the PARTITION BY RANGE clause in the output to confirm that the table was created with the specified partitions.

Now, let's insert some sample data into the sales table. MySQL will automatically place each row into the correct partition based on the sale_date.

INSERT INTO sales (sale_id, sale_date, amount) VALUES
(1, '2020-12-31', 100.00),
(2, '2021-01-15', 150.00),
(3, '2021-12-25', 200.00),
(4, '2022-06-01', 120.00),
(5, '2022-12-31', 180.00),
(6, '2023-03-10', 250.00),
(7, '2023-09-20', 300.00),
(8, '2024-01-01', 350.00);

You have successfully created a partitioned table and inserted data into it. In the next step, we will learn how to query data from specific partitions.

Query Data from Specific Partitions

In this step, we will explore how to query data efficiently from a partitioned table by targeting specific partitions. This is one of the primary benefits of partitioning, as it allows MySQL to scan only the relevant partitions, significantly reducing the amount of data processed and improving query performance.

Reminder: You should still be in the MySQL shell and using the sales_data database. If you are not, use:

USE sales_data;

To query data from a specific partition, you can include a WHERE clause that filters on the partitioning key. MySQL's query optimizer is often smart enough to identify which partitions are relevant based on the WHERE clause.

For example, to retrieve all sales from the year 2021, you can use the following query. Note that we are using a direct range condition on sale_date. Using functions like YEAR(sale_date) in the WHERE clause can prevent MySQL from using partition pruning, which would cause it to scan all partitions.

SELECT * FROM sales WHERE sale_date >= '2021-01-01' AND sale_date < '2022-01-01';

To see which partitions MySQL is accessing for this query, you can use the EXPLAIN PARTITIONS statement:

EXPLAIN PARTITIONS SELECT * FROM sales WHERE sale_date >= '2021-01-01' AND sale_date < '2022-01-01';

In the output of EXPLAIN PARTITIONS, look at the partitions column. It should show p2021, indicating that MySQL is only scanning the p2021 partition to fulfill this query.

+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | sales | p2021      | ALL  | PRIMARY       | NULL | NULL    | NULL |    2 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+-------------+

You can also query data that spans multiple partitions. For instance, to get sales data for both 2022 and 2023:

SELECT * FROM sales WHERE sale_date >= '2022-01-01' AND sale_date < '2024-01-01';

Using EXPLAIN PARTITIONS again will show that MySQL accesses both p2022 and p2023 partitions:

EXPLAIN PARTITIONS SELECT * FROM sales WHERE sale_date >= '2022-01-01' AND sale_date < '2024-01-01';

The partitions column will show p2022,p2023.

+----+-------------+-------+---------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | partitions    | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+---------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | sales | p2022,p2023   | ALL  | PRIMARY       | NULL | NULL    | NULL |    4 | Using where |
+----+-------------+-------+---------------+------+---------------+------+---------+------+------+-------------+

This demonstrates how partitioning allows MySQL to prune (exclude) irrelevant partitions during query execution, leading to faster results, especially on very large tables where scanning the entire table would be time-consuming.

To see the number of rows in each partition, you can query the INFORMATION_SCHEMA.PARTITIONS table:

SELECT
    PARTITION_NAME,
    TABLE_ROWS
FROM
    INFORMATION_SCHEMA.PARTITIONS
WHERE
    TABLE_SCHEMA = 'sales_data' AND TABLE_NAME = 'sales';

This query provides a clear view of how the data is distributed across your partitions.

+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p2020          |          1 |
| p2021          |          2 |
| p2022          |          2 |
| p2023          |          2 |
| pFuture        |          1 |
+----------------+------------+

You have successfully queried data from specific partitions and observed how MySQL utilizes partitioning for query optimization.

Reorganize and Manage Partitions

In this step, we will learn how to modify the partition structure of an existing table using the ALTER TABLE statement. This is useful for adapting your partitioning scheme as your data grows or your needs change.

Reminder: You should still be in the MySQL shell and using the sales_data database. If you are not, use:

USE sales_data;

Let's say we want to add a new partition for the year 2024. Currently, data for 2024 and beyond is in the pFuture partition. You cannot add a new partition with ADD PARTITION because the pFuture partition is defined with VALUES LESS THAN MAXVALUE, which must always be the last partition.

Instead, we need to REORGANIZE the pFuture partition to split it. We will split pFuture into two new partitions: one for the year 2024 (p2024) and a new pFuture partition for everything after that.

ALTER TABLE sales REORGANIZE PARTITION pFuture INTO (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

This command takes the existing pFuture partition, moves any data for 2024 into the new p2024 partition, and redefines pFuture to cover dates from 2025 onwards. The row with sale_date '2024-01-01' will be moved to p2024.

Let's verify the updated partition structure and row counts:

SELECT
    PARTITION_NAME,
    TABLE_ROWS
FROM
    INFORMATION_SCHEMA.PARTITIONS
WHERE
    TABLE_SCHEMA = 'sales_data' AND TABLE_NAME = 'sales';

You should see the new p2024 partition. The row from 2024 is now in p2024.

+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p2020          |          0 |
| p2021          |          2 |
| p2022          |          2 |
| p2023          |          2 |
| p2024          |          0 |
| pFuture        |          0 |
+----------------+------------+

Now, let's demonstrate merging partitions. Suppose we want to combine the p2020 and p2021 partitions into a single partition named p2020_2021.

ALTER TABLE sales REORGANIZE PARTITION p2020, p2021 INTO (
    PARTITION p2020_2021 VALUES LESS THAN (2022)
);

This command merges the data from p2020 and p2021 into a new partition called p2020_2021. The VALUES LESS THAN (2022) clause defines the new boundary for this merged partition.

Verify the partition structure again:

SELECT
    PARTITION_NAME,
    TABLE_ROWS
FROM
    INFORMATION_SCHEMA.PARTITIONS
WHERE
    TABLE_SCHEMA = 'sales_data' AND TABLE_NAME = 'sales';

You will see that p2020 and p2021 are gone, and p2020_2021 exists with the combined row count.

+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p2020_2021     |          3 |
| p2022          |          2 |
| p2023          |          2 |
| p2024          |          0 |
| pFuture        |          0 |
+----------------+------------+

Finally, let's drop a partition. We can remove the p2024 partition. Note that this will also delete all data within that partition.

ALTER TABLE sales DROP PARTITION p2024;

Verify the partition structure one last time:

SELECT
    PARTITION_NAME,
    TABLE_ROWS
FROM
    INFORMATION_SCHEMA.PARTITIONS
WHERE
    TABLE_SCHEMA = 'sales_data' AND TABLE_NAME = 'sales';

The p2024 partition should no longer be listed.

+----------------+------------+
| PARTITION_NAME | TABLE_ROWS |
+----------------+------------+
| p2020_2021     |          3 |
| p2022          |          2 |
| p2023          |          2 |
| pFuture        |          0 |
+----------------+------------+

You have successfully reorganized, merged, and dropped partitions using ALTER TABLE. This demonstrates the flexibility of managing partitioned tables as your data evolves.

Check Partition Impact on Query Speed

In this step, we will explore how partitioning can impact query performance. While our current dataset is small, we can still observe the principle of partition pruning, where MySQL only scans the necessary partitions. For larger datasets, this effect is much more pronounced.

Reminder: You should still be in the MySQL shell and using the sales_data database. If you are not, use:

USE sales_data;

To observe the impact of partitioning, we can use the EXPLAIN statement, which shows the execution plan of a query. Specifically, EXPLAIN PARTITIONS will show which partitions are accessed.

Let's run a query that filters by the partitioning key (year of sale_date):

EXPLAIN PARTITIONS SELECT * FROM sales WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01';

Observe the partitions column in the output. It should indicate that only the p2023 partition is being scanned.

Now, let's run a query that does not filter directly on the partitioning key, but on another column (amount):

EXPLAIN PARTITIONS SELECT * FROM sales WHERE amount > 200;

In this case, since the query condition is not directly on the partitioning key (sale_date), MySQL might need to scan multiple or all partitions to find the matching rows. The partitions column in the EXPLAIN PARTITIONS output will show which partitions were considered. With our small dataset, it might still scan all partitions.

To get a more detailed look at the query execution process and time spent, you can use MySQL's profiling feature.

Enable profiling:

SET profiling = 1;

Now, run the two queries again:

SELECT * FROM sales WHERE sale_date >= '2023-01-01' AND sale_date < '2024-01-01';
SELECT * FROM sales WHERE amount > 200;

View the profiling results:

SHOW PROFILES;

The output will list the executed queries and their duration. You can then examine the details of a specific query using its Query_ID:

SHOW PROFILE FOR QUERY [Query_ID];

Replace [Query_ID] with the ID of the query you want to analyze from the SHOW PROFILES output. Look at the different stages of execution and the time spent in each.

While the time difference might be negligible with our small dataset, in a real-world scenario with millions of rows, queries that can utilize partition pruning (like the one filtering by YEAR(sale_date)) will be significantly faster than those that need to scan multiple or all partitions.

Finally, disable profiling:

SET profiling = 0;

This step demonstrated how to use EXPLAIN PARTITIONS and profiling to understand how partitioning affects query execution and performance.

Summary

In this lab, you have learned how to implement MySQL partitioning for large datasets to improve query performance and data management. You started by creating a database and a table partitioned by range based on the year of a date column. You then practiced querying data from specific partitions, observing how MySQL uses partition pruning to optimize queries. Finally, you learned how to reorganize partitions by adding, splitting, and merging them using the ALTER TABLE statement, and explored how to use EXPLAIN PARTITIONS and profiling to understand the impact of partitioning on query speed. Partitioning is a powerful technique for managing and querying large tables efficiently in MySQL.

When you have completed all the steps, you can exit the MySQL shell by typing:

exit;