Insert billions of rows into a MySQL database for benchmarking

Swapnil Bhoite
5 min read · Aug 15, 2021


After thorough research on the web, I found that it is difficult to import millions of rows into MySQL. The pain starts with finding the right data source, and once that is identified, inserting the data is cumbersome and can take a long time. You can export large CSV files and import them through the MySQL command line, but that again takes a long time, and if the database instance is not sized generously enough, the load can exhaust it.
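For context, the CSV route typically looks like the sketch below. The file path and the single-column mapping are placeholders for a generic setup, not something taken from my project, and LOCAL loading has to be enabled on both the client and the server.

-- Sketch of the CSV import approach, loading one customer name per line
-- into the customers table defined in the next section
LOAD DATA LOCAL INFILE '/path/to/customers.csv'
INTO TABLE customers
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(customer_name);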

My project required hundreds of millions of records in a MySQL database. I wanted to compare and benchmark AWS Glue's default JDBC read against its parallel read, and likewise the default and parallel reads in Apache Spark. Though I was able to find datasets online, they were not big enough to demonstrate the improvement offered by parallel reads, especially in AWS Glue. This was a blocker for exploring the parallel-read enhancements provided by Glue and Spark in today's big-data world. Finally, I found a solution: a sub-query approach in MySQL INSERT statements to generate the data [2].

  • Create a sample table customers using the statement below in MySQL [1]:
CREATE TABLE IF NOT EXISTS customers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    customer_name TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) ENGINE=INNODB;

To insert 1 million rows into the above table, you can use the insert statement below [2]:

INSERT INTO customers (id, customer_name)
SELECT n, CONCAT('Customer', n)
FROM
(
  -- Cross join six digit tables (0-9) to generate every number from 1 to 1,000,000
  SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + 1 AS n
  FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
) t;

If you need to increase the number of records further, you can add more sub-queries to the statement above. In my case, I inserted 1 billion records into the customers table with the command below:

INSERT INTO customers (id, customer_name)
SELECT n, CONCAT('Customer', n)
FROM
(
  -- Cross join nine digit tables (0-9) to generate every number from 1 to 1,000,000,000
  SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + g.N * 1000000 + h.N * 10000000 + i.N * 100000000 + 1 AS n
  FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) g
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) h
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) i
) t;

Notice the difference in the sub-queries between the 1M and 1B insert statements: each additional digit sub-query multiplies the number of generated rows by 10. Likewise, you can alter the insert statement to control exactly how many records are inserted, as in the sketch below.
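For instance, if you need a row count that is not a power of ten, one option (a sketch I did not run for the benchmark, which assumes an empty customers table so the explicit ids do not collide with earlier inserts) is to keep the generator as-is and filter the derived table, here down to 250,000 rows:

INSERT INTO customers (id, customer_name)
SELECT n, CONCAT('Customer', n)
FROM
(
  -- Same six-digit generator as the 1M statement, producing n = 1..1,000,000
  SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + 1 AS n
  FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
     , (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
) t
WHERE n <= 250000;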

The MySQL database was running on AWS RDS on a db.m5.xlarge instance. The insert statements were executed by connecting to MySQL with the mysql command line from a macOS machine with 8 GB of RAM. The insert time will differ depending on the hardware used to issue the statement and the compute power of the database engine. In terms of size, the 1M records above amount to 1.2 GB, whereas the 1B records amount to 13 GB when written out to an AWS S3 bucket.
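To sanity-check the load before benchmarking, a couple of standard queries can be run from the same mysql session. The schema name your_db below is a placeholder for whichever database holds the table, and the information_schema figures are only an approximation for InnoDB tables.

-- Confirm how many rows were generated
SELECT COUNT(*) FROM customers;

-- Approximate on-disk size of the table (data + indexes) in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'your_db'
  AND table_name = 'customers';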

References:

[1] MySQL CREATE TABLE: https://www.mysqltutorial.org/mysql-create-table/

[2] Insert statement: https://stackoverflow.com/a/24837185
