Handling large data loads with Spring Batch partitioning

Struggling with slow performance when processing large datasets in Spring Batch? You're not alone. Spring Batch is powerful, but handling big data loads can expose its limitations—unless you know how to scale it properly.
In this article, I'll show you how to leverage the Partitioner interface to efficiently parallelize your jobs, split the workload smartly, and speed up I/O-bound operations like database access.

1. Maven configuration

To get started, make sure your pom.xml includes the necessary dependencies: Spring Batch, Spring Data JPA, the PostgreSQL JDBC driver, and Lombok. You can find the complete pom.xml in the project repository.

2. Spring Boot configuration

Configure your application.yml (or application.properties) with the following settings:

spring:
    application:
        name: PartitionerTest
    batch:
        jdbc:
            initialize-schema: always
    datasource:
        url: "jdbc:postgresql://localhost:5432/postgres"
        username: "postgres"
        password: "postgres"
    jpa:
        hibernate:
            ddl-auto: update
Here's what we're telling Spring Boot:
- name the application PartitionerTest;
- always initialize the Spring Batch metadata schema in the database (initialize-schema: always);
- connect to the local PostgreSQL instance with the default credentials;
- let Hibernate update the database schema automatically from the entities (ddl-auto: update).

Warning: keep in mind that the last bullet point is very dangerous to use in a non-development environment, so handle it with care.

3. Choosing a dataset

To properly test the performance of our batch job, we need a dataset with a significant amount of data. After a brief search on Kaggle, I came across the Stock Market Data (NASDAQ, NYSE, S&P500), which perfectly fits our needs. Big thanks to the author for making it publicly available.

The dataset consists of multiple CSV files, each representing historical price data for a specific stock. Each file is named after its stock ticker. For every ticker (i.e., each CSV), we have the date together with the low, open, high, close, adjusted close, and volume values. Based on this structure, let's define a corresponding JPA entity to model our data:

@Setter
@Getter
@Builder
@NoArgsConstructor
@AllArgsConstructor
@Entity
@Table(
    name = "stock",
    uniqueConstraints = {@UniqueConstraint(columnNames = {"ticker", "date"})},
    indexes = {@Index(columnList = "ticker")}
)
public class StockEntity {
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE)
    private Long id;

    @Column(name = "ticker", length = 10, nullable = false)
    private String ticker;

    @Column(name = "date", nullable = false)
    private LocalDate date;

    @Column(name = "low", nullable = false, precision = 50, scale = 20)
    private BigDecimal low;

    @Column(name = "open", nullable = false, precision = 50, scale = 20)
    private BigDecimal open;

    @Column(name = "volume", nullable = false, precision = 50)
    private BigInteger volume;

    @Column(name = "high", nullable = false, precision = 50, scale = 20)
    private BigDecimal high;

    @Column(name = "close", nullable = false, precision = 50, scale = 20)
    private BigDecimal close;

    @Column(name = "adjusted_close", nullable = false, precision = 50, scale = 20)
    private BigDecimal adjustedClose;
}
A few notes: the (ticker, date) pair is declared unique, the ticker column is indexed since we'll filter by it later, and prices are stored as high-precision decimal columns.

After defining the entity, I created a Spring Batch job (FileLoaderFlow, available in the repository) to parse and load the CSV files into the stock table. I'll skip the details of this import step, as it's not the focus of this article.
For reference, the final table contains 3,258,423 rows — a decent volume of data to benchmark against.

4. Defining the intensive task

Now that we have a populated table, we can define a non-trivial processing task: for each record in the stock table, we want to compute the mean price, defined as the average between the high and low prices.
Given the table contains 3,258,423 rows, this operation involves a significant load — perfect for performance testing.

4.1. Create a reader

We define a JpaPagingItemReader that selects only the necessary fields, projecting them into a DTO:

@Bean
@StepScope
public JpaPagingItemReader<StockMeanDTO> itemReader(
        EntityManagerFactory entityManagerFactory
) {
    return new JpaPagingItemReaderBuilder<StockMeanDTO>()
            .name("meanReader")
            .entityManagerFactory(entityManagerFactory)
            .queryString("SELECT new com.davide99.partitionertest.mean.StockMeanDTO(e.id, e.high, e.low) " +
                            "FROM StockEntity e ORDER BY e.id")
            .pageSize(PAGE_SIZE)
            .build();
}
Where StockMeanDTO looks like this:

@Getter
public class StockMeanDTO {
    private final Long id;
    private final BigDecimal mean;

    public StockMeanDTO(Long id, BigDecimal high, BigDecimal low) {
        this.id = id;
        this.mean = high.add(low)
                        .divide(BigDecimal.valueOf(2), RoundingMode.HALF_UP);
    }
}

4.2. Add the mean field to the entity

We extend the entity to store the computed mean value:

@Column(name = "mean", nullable = false, precision = 50, scale = 20)
@ColumnDefault("0")
private BigDecimal mean = BigDecimal.ZERO;

4.3. Add a repository method for updates

We create a custom method to update the mean field based on the ID:

public interface StockRepository extends JpaRepository<StockEntity, Long> {
    @Modifying
    @Query("UPDATE StockEntity e SET e.mean = :mean WHERE e.id = :id")
    void updateMeanById(@Param("mean") BigDecimal mean, @Param("id") Long id);
}

4.4. Define the writer

Finally, we implement the writer that persists the computed values:

@Bean
public ItemWriter<StockMeanDTO> itemWriter(StockRepository repository) {
    return chunk -> chunk.forEach(
        dto -> repository.updateMeanById(dto.getMean(), dto.getId())
    );
}
With this setup, we now have everything the baseline flow needs: a paging reader that projects each row into a StockMeanDTO, the mean computed in the DTO constructor, and a writer that persists the result back to the stock table. A minimal sketch of how the step and job can be wired together is shown below.
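The step and job wiring itself is not spelled out above, so here is a minimal sketch under the assumption of a single chunk-oriented step; the bean names meanStep and meanJob are illustrative, and CHUNK_SIZE is the same constant used later for the slave step:

@Bean
public Step meanStep(
        ItemReader<StockMeanDTO> reader,
        ItemWriter<StockMeanDTO> writer,
        JobRepository jobRepository,
        PlatformTransactionManager transactionManager
) {
    // Plain chunk-oriented step: read CHUNK_SIZE items, then write them in one transaction
    return new StepBuilder("meanStep", jobRepository)
            .<StockMeanDTO, StockMeanDTO>chunk(CHUNK_SIZE, transactionManager)
            .reader(reader)
            .writer(writer)
            .build();
}

@Bean
public Job meanJob(JobRepository jobRepository, Step meanStep) {
    // Single-step job: this is the baseline, non-partitioned execution
    return new JobBuilder("meanJob", jobRepository)
            .start(meanStep)
            .build();
}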

4.5. Results

With the full pipeline in place, I ran the batch job on the full dataset of 3,258,423 rows.
The execution took exactly 31 minutes and 15 seconds to complete.
This gives us a baseline performance metric for a single-threaded execution of the task. We'll use this as a reference when comparing with other solutions.

5. Parallelizing the task with a Partitioner

To improve performance, we can parallelize the batch processing using a Partitioner. This mechanism splits the workload into multiple independent partitions, each handled by a separate thread.

5.1. Master/Slave Step

The Spring Batch master step is responsible for creating partitions using the partitioner and assigning them to worker steps (the slave steps), which execute the actual logic. We also need a TaskExecutor, which runs multiple slave steps in parallel. Each slave processes the subset of data identified by a key: in our case, the ticker.

5.2. Partition key selection

We need a way to split the data. The ticker is a perfect candidate, as it's already part of our data and defines a natural subset.
Add this method to the repository:

@Query("SELECT DISTINCT e.ticker FROM StockEntity e")
Set<String> findAllTickers();

5.3. Defining the Partitioner

We implement a custom Partitioner which adds the ticker value to the ExecutionContext of each partition:

@Component
@RequiredArgsConstructor
@Profile(MeanPartitionedFlow.JOB_NAME)
public class TickerPartitioner implements Partitioner {
    private final StockRepository repository;

    @Override
    @NonNull
    public Map<String, ExecutionContext> partition(int gridSize) {
        return repository.findAllTickers().stream().collect(Collectors.toMap(
                ticker -> ticker,
                ticker -> {
                    final ExecutionContext context = new ExecutionContext();
                    context.putString("ticker", ticker);
                    return context;
                }
        ));
    }
}    
Note: we're ignoring the gridSize parameter. If there are n tickers, we create n partitions. The actual parallelism is controlled by the thread pool (TaskExecutor), which runs at most m tasks concurrently, where m (the pool size) is typically much smaller than n.

5.4. Master step

The master step uses our partitioner and delegates to the slave step:

@Bean
public Step masterStep(
    JobRepository jobRepository,
    TickerPartitioner partitioner,
    Step slaveStep,
    TaskExecutor taskExecutor
) {
    return new StepBuilder(MASTER_STEP_NAME, jobRepository)
        .partitioner(SLAVE_STEP_NAME, partitioner)
        .step(slaveStep)
        .taskExecutor(taskExecutor)
        .gridSize(1) // Unused in our case
        .build();
}    
Note: the gridSize value passed here is unused, since TickerPartitioner derives the number of partitions from the distinct tickers rather than from gridSize.
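For completeness, the partitioned flow also needs a Job that starts from the master step (this lives in MeanPartitionedFlow in the repository). A minimal sketch, assuming JOB_NAME is the same constant referenced in the @Profile annotation above:

@Bean
public Job meanPartitionedJob(JobRepository jobRepository, Step masterStep) {
    // The job only runs the master step; the master fans the work out to the slave steps
    return new JobBuilder(JOB_NAME, jobRepository)
            .start(masterStep)
            .build();
}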

5.5. Slave step

The slave step is identical to the original non-partitioned step:

@Bean
public Step slaveStep(
    ItemReader<StockMeanDTO> reader,
    ItemWriter<StockMeanDTO> writer,
    PlatformTransactionManager transactionManager,
    JobRepository jobRepository
) {
    return new StepBuilder(SLAVE_STEP_NAME, jobRepository)
        .<StockMeanDTO, StockMeanDTO>chunk(CHUNK_SIZE, transactionManager)
        .reader(reader)
        .writer(writer)
        .build();
}

5.6. Reader adjusted to use the ticker

The reader is updated to filter by ticker using the value injected in the context by the partitioner:

@Bean
@StepScope
public JpaPagingItemReader<StockMeanDTO> itemReader(
    EntityManagerFactory entityManagerFactory,
    @Value("#{stepExecutionContext['ticker']}") String ticker
) {
    return new JpaPagingItemReaderBuilder<StockMeanDTO>()
        .name("meanReader")
        .entityManagerFactory(entityManagerFactory)
        .queryString("SELECT new com.davide99.partitionertest.mean.StockMeanDTO(e.id, e.high, e.low) " +
            "FROM StockEntity e " +
            "WHERE e.ticker = :ticker " +
            "ORDER BY e.id")
        .parameterValues(Map.of("ticker", ticker))
        .pageSize(PAGE_SIZE)
        .build();
}

5.7. Task executor configuration

We define a thread pool to handle parallel execution:

@Bean
public TaskExecutor taskExecutor() {
    final ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(8);   // worker threads kept available for partition execution
    executor.setMaxPoolSize(16);   // upper bound, reached only if the task queue (unbounded by default) fills up
    return executor;
}

5.8. New properties configuration

To support parallel database access, we increased the connection pool size in the YAML configuration:

spring:
    datasource:
        url: "jdbc:postgresql://localhost:5432/postgres"
        username: "postgres"
        password: "postgres"
        hikari:
            maximum-pool-size: 30
This ensures that enough connections are available for concurrent processing: each worker thread holds a connection for the duration of its chunk transaction, and Spring Batch needs additional connections to update its job repository metadata.

5.9. Execution results

With partitioning enabled and the thread pool configured, the job execution time dropped significantly.
Total execution time: 9 minutes and 5 seconds (545 seconds), roughly 29% of the initial benchmark of 31 minutes and 15 seconds (1,875 seconds), or about a 3.4× speedup.

6. Conclusions

Partitioning the workload by ticker and running the slave steps on a thread pool cut the execution time from 31 minutes and 15 seconds down to 9 minutes and 5 seconds, while the reader and writer logic stayed essentially the same. With this approach, Spring Batch becomes a high-performance, modular solution even for very large datasets.

The project repository can be found here.