Posts / easily-generate-mock-data-with-postgresql
Nov 9, 2023

Easily Generate Mock Data with PostgreSQL

Read time 5 min read

When writing articles about PostgreSQL there are times I need to generate large swaths of data to test and demonstrate a topic (for example, these Timescale articles here and here). It can be a pain to manually insert data or create a CSV and import it into PostgreSQL—which I always have to StackOverflow. Lately, I ran into a much simpler solution to generating data: the generate_series function. It’s perfectly suited for integers and timestamp data types, streamlining various tasks such as populating test databases or setting up date ranges for reports.

Generating Mock Data

generate_series can be thought of as a for-loop or generator to produce a range of numbers or dates.

To generate a range of numbers, we call generate_series with a start and end value:

SELECT * FROM generate_series(1, 10);

The output showing ten rows of integers from one to ten

We can specify a step value by providing generate_series as an optional third argument.

SELECT * FROM generate_series(1, 10, 2.5);

The output with four rows showing 1, 3.5, 6.0, and 8.5 using 2.5 step between one and ten

Inserting Mock Data

generate_series’ power comes to fruition when we need to insert mock data into a table. We can easily insert a range of numbers into a table by using generate_series in a SELECT statement and then inserting the results into a table.

For the rest of this article, we’ll use users table that looks like this for our examples:

CREATE TABLE users (
 id serial PRIMARY KEY,
 name text NOT NULL,
 created_on timestamptz
);

Then, if we wanted to insert five users into the table we could run:

INSERT INTO users (name)
SELECT
    'Dylan'
FROM generate_series(1, 5);

Five rows in the users table all with their name as Dylan

This pattern can be used to insert any number of rows into a table. If you’re feeling brave, try inserting 1,000,000 users into the table by changing generate_series(1, 5) to generate_series(1, 1000000)!

Up to this point we are creating a lot of duplicate data. What if we wanted to create a bunch of users with all distinct names? We can access the current index of the loop (using as) to generate a unique name for each user.

That would look like this:

INSERT INTO users (name)
SELECT
    'Dylan number ' || i
FROM generate_series(1, 5) as i;

Five rows in the users table with names: Dylan number 1, Dylan, number 2, and so on until Dylan number 5

Creating Time-Series Data

generate_series is incredibly effective at generating time series data. This is where I get the most use out of it because it can become extremely tedious to create a lot of data around time ranges.

Similar to the integer example, generate a range of timestamps by providing a start and end value followed by an optional step or interval.

For example, to generate a user for every hour in a week, we could run the following SQL:

INSERT INTO users (name, created_on)
SELECT
    'Dylan',
    time_hour
FROM generate_series(
    TIMESTAMPTZ '2023-11-01',
    TIMESTAMPTZ '2023-11-07',
    INTERVAL '1 hour'
) as time_hour;

Ten rows in the users table, each with the name Dylan and a created_on column for each hour between 2023-11-01 to 2023-11-07

This query creates 146 rows with a user named Dylan every hour between November 1st and November 7th, 2023.

You may be asking yourself, can we use multiple generate_series to populate multiple fields? Yes! But, we need to use joins to combine the data from the two series. To quickly create a grid of users and created_on dates we can take the cartesian product of two series, otherwise known as a cross join, by adding another generate_series to the FROM clause.

INSERT INTO users (name, created_on)
SELECT
    'Dylan number' || i,
    time_hour
FROM
    generate_series(1, 5) as i,
    generate_series(
        TIMESTAMPTZ '2023-11-01',
        TIMESTAMPTZ '2023-11-07',
        INTERVAL '2 days'
    ) as time_hour;

In the INSERT statement above, we’ll create four time_hour timestamps for every two days in the week for every iteration of i (there are five of them). Resulting in 4 * 5 = 20 twenty rows created. To better grasp what’s happening here, it’ll be easier to look at the SELECT output of the users table.

Twenty rows in the users tables showing a cross join between name and the created_on column

Using multiple generate_series functions combined with a cartesian product can be a powerful combination for creating unique, time-series data.

Conclusion

PostgreSQL’s generate_series is a game-changer for developers like me, who frequently dive into the deep end of data generation for testing and optimizing queries. It’s a powerhouse that can effortlessly create anything from a handful to millions of rows, transforming a potentially arduous task into a few simple lines of SQL. So, when you need to mock up data, let generate_series be your go-to solution.

Happy data generating!

Need a realistic dataset to test something more interesting than a flat table? Postgres: The Graph Database You Didn’t Know You Had uses generate_series to populate nodes and edges, then walks through recursive CTEs against the same data. For performance work on bigger tables, Explaining PostgreSQL EXPLAIN covers how to read query plans once the rows pile up.

Keep reading the Postgres guide

One of the Postgres posts on the site. The full track, grouped by topic, lives on the guide page .

All posts →