Easily Generate Mock Data with PostgreSQL

When writing articles about PostgreSQL there are times I need to generate large swaths of data to test and demonstrate a topic (for example, these Timescale articles here and here). It can be a pain to manually insert data or create a CSV and import it into PostgreSQL—which I always have to StackOverflow. Lately, I ran into a much simpler solution to generating data: the generate_series function. It’s perfectly suited for integers and timestamp data types, streamlining various tasks such as populating test databases or setting up date ranges for reports.

Generating Mock Data

generate_series can be thought of as a for-loop or generator to produce a range of numbers or dates.

To generate a range of numbers, we call generate_series with a start and end value:

SELECT * FROM generate_series(1, 10);

The output showing ten rows of integers from one to ten

We can specify a step value by providing generate_series as an optional third argument.

SELECT * FROM generate_series(1, 10, 2.5);

The output with four rows showing 1, 3.5, 6.0, and 8.5 using 2.5 step between one and ten

Inserting Mock Data

generate_series’ power comes to fruition when we need to insert mock data into a table. We can easily insert a range of numbers into a table by using generate_series in a SELECT statement and then inserting the results into a table.

For the rest of this article, we’ll use users table that looks like this for our examples:

CREATE TABLE users (
 id serial PRIMARY KEY,
 name text NOT NULL,
 created_on timestamptz
);

Then, if we wanted to insert five users into the table we could run:

INSERT INTO users (name)
SELECT
    'Dylan'
FROM generate_series(1, 5);

Five rows in the users table all with their name as Dylan

This pattern can be used to insert any number of rows into a table. If you’re feeling brave, try inserting 1,000,000 users into the table by changing generate_series(1, 5) to generate_series(1, 1000000)!

Up to this point we are creating a lot of duplicate data. What if we wanted to create a bunch of users with all distinct names? We can access the current index of the loop (using as) to generate a unique name for each user.

That would look like this:

INSERT INTO users (name)
SELECT
    'Dylan number ' || i
FROM generate_series(1, 5) as i;

Five rows in the users table with names: Dylan number 1, Dylan, number 2, and so on until Dylan number 5

Creating Time-Series Data

generate_series is incredibly effective at generating time series data. This is where I get the most use out of it because it can become extremely tedious to create a lot of data around time ranges.

Similar to the integer example, generate a range of timestamps by providing a start and end value followed by an optional step or interval.

For example, to generate a user for every hour in a week, we could run the following SQL:

INSERT INTO users (name, created_on)
SELECT
    'Dylan',
    time_hour
FROM generate_series(
    TIMESTAMPTZ '2023-11-01', 
    TIMESTAMPTZ '2023-11-07', 
    INTERVAL '1 hour'
) as time_hour;

Ten rows in the users table, each with the name Dylan and a created_on column for each hour between 2023-11-01 to 2023-11-07

This query creates 146 rows with a user named Dylan every hour between November 1st and November 7th, 2023.

You may be asking yourself, can we use multiple generate_series to populate multiple fields? Yes! But, we need to use joins to combine the data from the two series. To quickly create a grid of users and created_on dates we can take the cartesian product of two series, otherwise known as a cross join, by adding another generate_series to the FROM clause.

INSERT INTO users (name, created_on)
SELECT
    'Dylan number' || i,
    time_hour
FROM 
    generate_series(1, 5) as i,
    generate_series(
        TIMESTAMPTZ '2023-11-01', 
        TIMESTAMPTZ '2023-11-07', 
        INTERVAL '2 days'
    ) as time_hour;

In the INSERT statement above, we’ll create four time_hour timestamps for every two days in the week for every iteration of i (there are five of them). Resulting in 4 * 5 = 20 twenty rows created. To better grasp what’s happening here, it’ll be easier to look at the SELECT output of the users table.

Twenty rows in the users tables showing a cross join between name and the created_on column

Using multiple generate_series functions combined with a cartesian product can be a powerful combination for creating unique, time-series data.

Conclusion

PostgreSQL’s generate_series is a game-changer for developers like me, who frequently dive into the deep end of data generation for testing and optimizing queries. It’s a powerhouse that can effortlessly create anything from a handful to millions of rows, transforming a potentially arduous task into a few simple lines of SQL. So, when you need to mock up data, let generate_series be your go-to solution.

Happy data generating!