MySQL Streaming Read for Huge Tables
This article is an AI-assisted translation.
An ordinary `SELECT *` loads the entire result set into memory, which can easily cause an OOM on a huge table. This article shows how to use dbVisitor's streaming query to safely read MySQL tables of any size.
Why Streaming Query?
Usually, we have two strategies for querying data:
- **Full Load**: the traditional `List<T> list = template.query(...)`.
  - Pros: simple to use; the connection is released quickly.
  - Cons: memory consumption is proportional to data volume; large result sets crash the JVM.
- **Pagination Query**: using `LIMIT offset, size`.
  - Pros: memory stays bounded.
  - Cons: deep pagination is a performance disaster. As `offset` grows, the database must scan and discard more and more rows before returning each page, so every page is slower than the last.
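The deep-pagination cost can be made concrete with a back-of-the-envelope calculation. The numbers here are hypothetical (a 5-million-row table read in pages of 1,000), and it assumes the common case where serving `LIMIT offset, size` costs the server roughly `offset + size` scanned rows:

```java
public class PaginationCost {
    // Rows the server scans to page through `totalRows` rows, `pageSize` at a time,
    // assuming LIMIT offset,size scans roughly offset + size rows per page.
    static long scannedRows(long totalRows, long pageSize) {
        long scanned = 0;
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            scanned += offset + pageSize;
        }
        return scanned;
    }

    public static void main(String[] args) {
        // Hypothetical: 5 million rows, page size 1000
        System.out.println(scannedRows(5_000_000L, 1_000L)); // prints 12502500000
    }
}
```

Roughly 12.5 billion rows scanned to return 5 million: the total work grows quadratically with the table size, which is why page-by-page export of a huge table never finishes in reasonable time.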
Streaming Query is the third choice. It holds one long-lived connection and lets the database push data to the client row by row (or in small batches), like running water. The client processes a row and discards it, so memory usage stays at a very low, constant level (only a small buffer of row data is held), making it possible to process billions of rows on a single machine.
MySQL Specifics
Different databases enable streaming queries in different ways (e.g., PostgreSQL requires turning off auto-commit and setting fetchSize). For MySQL, the JDBC driver has very specific conventions.
According to the MySQL Connector/J Official Documentation, enabling streaming result sets must meet the following three hard conditions:
1. `ResultSet.TYPE_FORWARD_ONLY`: the cursor can only move forward.
2. `ResultSet.CONCUR_READ_ONLY`: the result set is read-only.
3. `fetchSize` must be set to `Integer.MIN_VALUE`.
If any of these conditions is not met, the driver silently falls back to loading the full result set.
dbVisitor Implementation
dbVisitor's core component `JdbcTemplate` exposes fine-grained control over the underlying JDBC objects: a `PreparedStatementCreator` lets us set the parameters above ourselves, while a `RowCallbackHandler` lets us consume the result row by row.
1. Customize Statement
We need to take over the creation of the `Statement` and force the required `fetchSize`.
```java
// Use a lambda to define the PreparedStatement creation logic
PreparedStatementCreator creator = con -> {
    // 1. Specify the cursor type when creating the Statement
    PreparedStatement ps = con.prepareStatement(
            "select id, name, age from huge_user_table", // prefer explicit column names over *
            ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY
    );
    // 2. Enable MySQL streaming mode (the driver's special convention)
    ps.setFetchSize(Integer.MIN_VALUE);
    return ps;
};
```
2. Define Row Mapper
To facilitate processing, we usually need to convert each row of ResultSet into a Java object. dbVisitor provides high-performance mappers.
```java
// Automatically map each ResultSet row to a User object
MappingRowMapper<User> rowMapper = new MappingRowMapper<>(User.class);
```
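The examples in this article assume a `User` class. A minimal POJO sketch could look like this; the field names (`id`, `name`, `age`) are taken from the columns selected earlier, and getter/setter naming is an assumption about what the mapper expects:

```java
// Minimal User POJO assumed by the examples: properties match the
// id, name, age columns selected from the table.
public class User {
    private Long id;
    private String name;
    private Integer age;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}
```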
3. Row-by-Row Consumption Callback
A `RowCallbackHandler` takes over the processing of each row. Note: keep the logic here lightweight. Once a row has been processed and its reference dropped, the mapped object becomes eligible for garbage collection, so memory usage stays flat.
```java
RowCallbackHandler handler = (rs, rowNum) -> {
    // Map the current row
    User user = rowMapper.mapRow(rs, rowNum);
    // Business logic: e.g., write to a file, send to MQ, compute statistics
    processUser(user);
    // Progress log: print once every 10,000 rows
    if (rowNum % 10_000 == 0) {
        System.out.println("Rows processed: " + rowNum);
    }
};
```
4. Complete Code Example
Combining the above steps:
```java
@Test
public void testStreamQuery() throws SQLException {
    JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);

    // 1. Creator: enable streaming
    PreparedStatementCreator creator = con -> {
        PreparedStatement ps = con.prepareStatement(
                "select * from test_user",
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY
        );
        ps.setFetchSize(Integer.MIN_VALUE);
        return ps;
    };

    // 2. Executor: streaming callback
    // The executeCreator method acquires and releases the Connection
    jdbcTemplate.executeCreator(creator, (RowCallbackHandler) (rs, rowNum) -> {
        // Operate on the ResultSet directly, or convert with a RowMapper
        String name = rs.getString("name");
        int age = rs.getInt("age");
        // Simulate business processing
        // System.out.println("Processing: " + name);
    });
}
```
Performance Comparison
Assume a table with 5 million rows and a JVM heap limited to 512MB.
| Query Method | Memory Usage | Time Cost | Result |
|---|---|---|---|
| Normal query() | > 800MB | N/A | OOM Crash |
| Pagination Query (Size=1000) | ~50MB | Extremely Slow | As pages increase, query gets slower, total time is extremely long |
| dbVisitor Streaming Query | < 30MB | Fast | Stable completion, smooth GC |
Precautions
- **Connection Monopoly**: during a streaming query, the database connection stays occupied until the entire result set has been traversed. Make sure the connection pool is large enough, or that per-row processing is fast enough.
- **Concurrency Issues**: because the connection is held for a long time, it is best to run such analytical tasks on a dedicated connection pool or data source, so they do not starve high-concurrency core business traffic.
- **Network Timeout**: if the `RowCallbackHandler` logic is very time-consuming (e.g., a remote call for every row), the MySQL server may drop the connection (`net_write_timeout`) because the client has not read data for too long.
  - Suggestion: for heavy business logic, adopt the producer-consumer model. The handler only puts rows into a blocking queue quickly, and a separate thread pool drains the queue at its own pace.
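The producer-consumer suggestion can be sketched without any database dependency. In this minimal, hypothetical sketch, the `process` method plays the role of the `RowCallbackHandler` producer, enqueuing rows into a bounded `BlockingQueue`, while a worker thread does the heavy per-row work (an uppercase conversion stands in for real business logic):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StreamConsumerSketch {
    private static final Object POISON = new Object(); // end-of-stream marker

    public static List<String> process(List<String> rows) throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
        List<String> results = new ArrayList<>();

        // Consumer: heavy per-row work happens here, off the JDBC thread
        Thread worker = new Thread(() -> {
            try {
                Object item;
                while ((item = queue.take()) != POISON) {
                    results.add(((String) item).toUpperCase()); // stand-in for real work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        // Producer: in real code, this loop is the RowCallbackHandler
        for (String row : rows) {
            queue.put(row); // blocks when the queue is full: natural back-pressure
        }
        queue.put(POISON);
        worker.join();
        return results;
    }
}
```

The bounded queue is the key design choice: when the consumer falls behind, `put` blocks the producer, which slows down `ResultSet` reads instead of letting rows pile up in memory. If even that pause risks tripping `net_write_timeout`, the handler can instead drop rows to durable intermediate storage.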
With the low-level API provided by dbVisitor, we can easily harness this driver-specific feature, solving a tricky performance problem while keeping the code clean.