
MySQL Streaming Read for Huge Tables

ZhaoYongChun · 4 min read
Hint: This article is generated by AI translation.

A plain SELECT * loads the entire result set into memory and can easily cause an OutOfMemoryError on huge tables. This article shows how to use dbVisitor's streaming query to safely read MySQL tables of any size.

Why Streaming Query?

There are usually two strategies for querying data:

  1. Full Load: the traditional List<T> list = template.query(...).
    • Pros: Simple to use; the connection is released quickly.
    • Cons: Memory consumption is proportional to data volume, so it fails on large tables.
  2. Pagination Query: using LIMIT offset, size.
    • Pros: Memory usage is bounded.
    • Cons: Deep pagination is a performance disaster. As the offset grows, the database must scan and discard more and more rows before returning each page, so queries get progressively slower.
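The deep-pagination cost can be estimated with back-of-envelope arithmetic. Assuming each LIMIT offset, size query scans and discards roughly offset rows before returning size rows, paging through a whole table does quadratic work. A minimal sketch (the 5-million-row table and page size of 1000 are illustrative, matching the comparison later in this article):

```java
public class PaginationCost {
    // Estimated rows the server scans for one page, assuming
    // LIMIT offset, size discards `offset` rows before returning `size` rows.
    static long rowsScannedForPage(long offset, long size) {
        return offset + size;
    }

    // Total rows scanned to page through an entire table of `total` rows.
    static long totalRowsScanned(long total, long size) {
        long scanned = 0;
        for (long offset = 0; offset < total; offset += size) {
            scanned += rowsScannedForPage(offset, size);
        }
        return scanned;
    }

    public static void main(String[] args) {
        long total = 5_000_000L; // 5 million rows
        long size = 1_000L;      // page size
        // Streaming touches each row once: 5,000,000 rows scanned in total.
        // Pagination re-scans the discarded prefix on every page:
        System.out.println(totalRowsScanned(total, size)); // 12502500000
    }
}
```

Under this model, paging through 5 million rows scans about 12.5 billion rows in total, roughly 2500 times the work of a single streaming pass.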

Streaming Query is a third option. It holds one long-lived connection and lets the database send rows to the client one by one (or in small batches), like running water. The client processes each row and then discards it, so memory usage stays at a very low, constant level (only a small amount of row data is buffered). This makes it possible to process billions of rows on a single machine.

MySQL Specifics

Different databases enable streaming queries in different ways (e.g., PostgreSQL requires turning off auto-commit and setting fetchSize). For MySQL, the JDBC driver has very specific conventions.

According to the MySQL Connector/J Official Documentation, enabling streaming result sets must meet the following three hard conditions:

  1. ResultSet.TYPE_FORWARD_ONLY: Can only traverse forward.
  2. ResultSet.CONCUR_READ_ONLY: Read-only result set.
  3. fetchSize must be set to Integer.MIN_VALUE.

If any condition is not met, the driver silently falls back to fully buffering the result set in memory.
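The three conditions can be captured in a small helper predicate (a hypothetical utility, not part of dbVisitor or Connector/J) that is handy for sanity-checking statement configuration before a large job runs:

```java
import java.sql.ResultSet;

public class MySqlStreamingCheck {
    // True only when all three Connector/J streaming conditions hold:
    // forward-only cursor, read-only result set, fetchSize == Integer.MIN_VALUE.
    static boolean isStreaming(int resultSetType, int concurrency, int fetchSize) {
        return resultSetType == ResultSet.TYPE_FORWARD_ONLY
                && concurrency == ResultSet.CONCUR_READ_ONLY
                && fetchSize == Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        // Meets all three conditions -> streaming mode
        System.out.println(isStreaming(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY, Integer.MIN_VALUE)); // true
        // A "normal" fetch size silently degrades to full buffering
        System.out.println(isStreaming(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY, 1000)); // false
    }
}
```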

dbVisitor Implementation

dbVisitor's core component, JdbcTemplate, exposes enough low-level control to set the parameters above through a PreparedStatementCreator, while a RowCallbackHandler consumes the result row by row.

1. Customize Statement

We need to take over Statement creation and explicitly set the fetchSize.

// Use a lambda to define the PreparedStatement creation logic
PreparedStatementCreator creator = con -> {
    // 1. Specify the cursor type when creating the Statement
    PreparedStatement ps = con.prepareStatement(
            "select id, name, age from huge_user_table", // prefer explicit column names over *
            ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY);

    // 2. Enable MySQL streaming "black magic"
    ps.setFetchSize(Integer.MIN_VALUE);
    return ps;
};

2. Define Row Mapper

To facilitate processing, we usually need to convert each row of ResultSet into a Java object. dbVisitor provides high-performance mappers.

// Automatically map ResultSet to User object
MappingRowMapper<User> rowMapper = new MappingRowMapper<>(User.class);

3. Row-by-Row Consumption Callback

RowCallbackHandler takes over the processing of each row. Note: the logic here must stay lightweight. Once a row has been processed, the object created for it becomes eligible for garbage collection.

RowCallbackHandler handler = (rs, rowNum) -> {
    // Map the current row
    User user = rowMapper.mapRow(rs, rowNum);

    // Business logic: e.g., write to a file, send to MQ, compute statistics
    processUser(user);

    // Progress log: print once every 10,000 rows
    if (rowNum % 10000 == 0) {
        System.out.println("Current processing: " + rowNum);
    }
};

4. Complete Code Example

Combining the above steps:

@Test
public void testStreamQuery() throws SQLException {
    JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);

    // 1. Creator: enable streaming
    PreparedStatementCreator creator = con -> {
        PreparedStatement ps = con.prepareStatement(
                "select * from test_user",
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY);
        ps.setFetchSize(Integer.MIN_VALUE);
        return ps;
    };

    // 2. Executor: streaming callback
    // executeCreator acquires and releases the Connection
    jdbcTemplate.executeCreator(creator, (RowCallbackHandler) (rs, rowNum) -> {
        // Read the ResultSet directly, or convert it with a mapper
        String name = rs.getString("name");
        int age = rs.getInt("age");

        // Simulate business processing
        // System.out.println("Processing: " + name);
    });
}

Performance Comparison

Assume we have a table with 5 million rows, and the JVM heap memory limit is 512MB.

| Query Method | Memory Usage | Time Cost | Result |
| --- | --- | --- | --- |
| Normal query() | > 800 MB | N/A | OOM crash |
| Pagination query (size=1000) | ~50 MB | Extremely slow | Queries slow down as pages advance; total time is very long |
| dbVisitor streaming query | < 30 MB | Fast | Completes stably with smooth GC |

Precautions

  1. Connection Monopoly: the database connection stays occupied for the entire streaming query, until all rows have been traversed. Make sure the connection pool is large enough or that per-row processing is fast.
  2. Concurrency: because the connection is held continuously, high-concurrency business traffic should not share it. Run such analytical tasks on an independent connection pool or data source to avoid blocking core business.
  3. Network Timeout: if the RowCallbackHandler logic is very time-consuming (e.g., calling a remote service for every row), the database server may disconnect (net_write_timeout) because no data has been read for too long.
    • Suggestion: for heavy business logic, adopt the "Producer-Consumer" model. The handler only puts rows into a blocking queue quickly, and a separate thread pool consumes them at its own pace.
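The producer-consumer suggestion above can be sketched with a bounded BlockingQueue: the callback (simulated by a plain loop here) only enqueues rows, while a worker thread does the slow processing. The Row type, EOF sentinel, and queue capacity are illustrative assumptions, not dbVisitor APIs:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamingPipeline {
    // Illustrative row type; in the article this would be the mapped User object.
    record Row(long id, String name) {}

    // Sentinel that tells the consumer no more rows will arrive.
    static final Row EOF = new Row(-1, "<eof>");

    public static int run(int rowCount) throws InterruptedException {
        // Bounded queue: applies back-pressure so the producer never outruns memory
        BlockingQueue<Row> queue = new ArrayBlockingQueue<>(1024);
        AtomicInteger processed = new AtomicInteger();

        Thread consumer = new Thread(() -> {
            try {
                Row row;
                while ((row = queue.take()) != EOF) {
                    // Slow business logic would go here (remote calls, file writes, ...)
                    processed.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer: stands in for the RowCallbackHandler, which should only enqueue and return
        for (long i = 1; i <= rowCount; i++) {
            queue.put(new Row(i, "user-" + i)); // blocks briefly if the consumer falls behind
        }
        queue.put(EOF);
        consumer.join();
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10_000)); // 10000
    }
}
```

Because the queue is bounded, a slow consumer throttles the producer instead of exhausting memory; keeping the enqueue fast is what protects the connection from net_write_timeout.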

With the low-level API that dbVisitor provides, we can harness this tricky database feature, solving hard performance problems while keeping the code clean.