Spring Batch is one of the few existing tools in the Java Enterprise ecosystem for building batch processes or data pipelines. However, its components (ItemReader/ItemWriter) are primarily oriented toward relational databases, CSV, XML, or JSON.
In a world where Data Lakes and columnar formats are increasingly important, integrating Parquet with Spring Batch opens new possibilities for building data pipelines from the Java world, without depending on complex solutions or different technology stacks that often cause friction in the Enterprise world.
Neither the main Spring Batch project nor the extensions project offers native support for reading or writing Parquet files, and there’s no documentation on the topic. Therefore, in this article, we’ll see how, thanks to Carpet, we can very easily integrate Parquet with Spring Batch.
Introduction
Spring Batch has been the reference solution for implementing batch processes in Java for over 15 years. Its architecture based on Jobs, Steps, ItemReaders, and ItemWriters provides a model for processing large volumes of data reliably, with restart capabilities, transactionality, and error handling.
However, the data ecosystem has evolved significantly in recent years. Data Lakes based on object storage (S3, Azure Blob Storage, Google Cloud Storage) have become a key piece in data persistence systems. And within these Data Lakes, Apache Parquet has consolidated itself as the columnar format par excellence.
Despite this evolution, Spring Batch doesn’t provide direct support for reading or writing Parquet files, and there are hardly any articles or documentation covering this use case. The ItemReader and ItemWriter that come with the framework cover:
- Relational databases: JdbcCursorItemReader, JdbcBatchItemWriter, JpaItemWriter
- Flat files: FlatFileItemReader, FlatFileItemWriter (CSV, fixed-width)
- XML: StaxEventItemReader, StaxEventItemWriter
- JSON: JsonItemReader, JsonFileItemWriter
In many organizations, Data Engineering teams work with Spark, Pandas, or analytical tools that natively consume and generate Parquet. Meanwhile, traditional batch processes in Java continue exporting CSV or inserting directly into databases, ignoring all the advantages that Parquet provides.
Integrating Spring Batch with Parquet allows you to:
- Export data from transactional systems to data lakes in an optimized, well-known, and standardized format
- Consume data generated by other tools (Spark, Python) from Java processes
- Transform and enrich Parquet files while maintaining the format
- Leverage Parquet’s advantages: compression, predicate pushdown, projections, and embedded schema
In this article, I’ll explain how to build a simple ItemReader and ItemWriter to work with Parquet. I don’t intend to build a complete production-ready library here, but rather show how easy it is to do with Carpet. The code I’ll create will be very simple but functional and will serve as a foundation to adapt to your specific needs.
As a prerequisite, I assume you have basic knowledge of Spring Batch (Jobs, Steps, ItemReaders/ItemWriters).
Why Use Parquet in Spring Batch?
Before diving into the implementation, it’s important to understand what advantages Parquet brings in the context of batch processes and when this integration makes sense.
Compression and Storage/Processing Efficiency
One of the clearest advantages of Parquet is its compression capability. Being a columnar format, data of the same type is stored contiguously, allowing highly efficient compression techniques to be applied.
The columnar format, combined with data organized in RowGroups that carry statistics and indexes, allows Parquet to discard rows that don't meet certain conditions without reading the entire file (predicate pushdown) and to skip columns that aren't needed (column projection): if you only need 3 columns from a file that has 50, only the data of those 3 columns is read and processed from disk.
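To give an idea of how projection is used from Java with Carpet (the library we'll rely on later in this article), declaring a record with only the columns you need is enough for the reader to skip the rest. The file path and the TripSummary record below are made up for illustration:

    // Hypothetical file with many columns; the record declares only the three we care about.
    // Carpet matches record components to Parquet columns by name and reads just those columns.
    record TripSummary(String city, double distanceKm, int passengers) { }

    InputFile inputFile = new FileSystemInputFile(new File("/tmp/trips.parquet"));
    CarpetReader<TripSummary> reader = new CarpetReader<>(inputFile, TripSummary.class);
    CloseableIterator<TripSummary> iterator = reader.iterator();
    while (iterator.hasNext()) {
        TripSummary trip = iterator.next();
        // only the projected columns are deserialized
    }
    iterator.close();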
Interoperability with Modern Data Ecosystems
Probably the most important reason to use Parquet in Spring Batch is interoperability. If your organization has a modern data architecture, it’s very likely that other teams are already using Parquet:
- Data Engineering tools: Apache Spark, Apache Flink, Pandas/PyArrow, DuckDB
- Data platforms: Databricks, Snowflake, Google BigQuery, Amazon Redshift, or Athena
- Modern table formats: Delta Lake, Apache Iceberg, or DuckLake
If your Spring Batch process generates a CSV, the Data Engineering team probably has to convert it to Parquet in an additional step before they can work with it efficiently. This adds latency, complexity, and failure points.
Generating Parquet directly from Spring Batch eliminates that friction and allows your data to be immediately available to other teams, or simply enables your Backend team specialized in Java to implement those processes that were previously reserved for the Data Engineering team.
Embedded Schema and Evolution
Unlike CSV or flat formats, Parquet embeds the schema within the file itself. This means that any tool reading the file can:
- Automatically discover what columns it contains and their types
- Validate that the data complies with the expected schema
- Evolve the schema in a controlled way (add columns, deprecate others)
Since the schema travels with each file, adding a new column in your batch process means old consumers will simply ignore it, and renaming a column is an explicit change visible in the schema.
Because it's a binary format with a schema, data types are explicit and validated: there's no ambiguity when interpreting column data. Inspired by Google's Dremel paper, Parquet also supports complex data structures with collections, maps, and nested records.
This way, Parquet isn't limited to a tabular format like CSV: it lets you model records with the same richness as JSON, but with an explicit, typed schema and high compression.
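As an illustration of that richness, a hypothetical Java domain model with nested records, lists, and maps can be persisted as-is to Parquet with a record-based library like Carpet, something a flat CSV cannot represent without flattening or serializing fields to strings:

    // Hypothetical model: nested records, lists, and maps become nested groups in the Parquet schema
    record Address(String street, String city, String zipCode) { }
    record LineItem(String sku, int quantity, double price) { }
    record Invoice(String id, Address billingAddress, List<LineItem> lines, Map<String, String> tags) { }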
Implementing ParquetItemReader
We’ll start from the example provided by Spring Batch in its guides, Creating a Batch Service, where a CSV file is processed, transformed, and persisted to a database.
The example uses the ItemReader implementation that parses CSV files. To process the Parquet equivalent, we'll implement the same interface by extending the AbstractItemCountingItemStreamItemReader abstract class, which takes care of much of Spring Batch's common internal logic (item counting, state saving, and restart support). A minimal version only requires implementing the methods that open, read from, and close the file: doOpen, doRead, and doClose.
Our implementation will be an adapter over the CarpetReader class, which provides this logic. In this simple version, we'll leave the instantiation of CarpetReader to the user of the ParquetItemReader class:
public class ParquetItemReader<T> extends AbstractItemCountingItemStreamItemReader<T> {

    private final CarpetReader<T> carpetReader;
    private CloseableIterator<T> iterator;

    public ParquetItemReader(CarpetReader<T> carpetReader) {
        this.carpetReader = carpetReader;
    }

    @Override
    protected void doOpen() throws Exception {
        // Create the iterator over the Parquet file's records
        this.iterator = carpetReader.iterator();
    }

    @Override
    protected T doRead() throws Exception {
        // Returning null tells Spring Batch that the input is exhausted
        return iterator.hasNext() ? iterator.next() : null;
    }

    @Override
    protected void doClose() throws Exception {
        this.iterator.close();
    }
}
In the Spring configuration, using it comes down to:
@Bean
public ParquetItemReader<Person> parquetReader() throws IOException {
    InputFile inputFile = new FileSystemInputFile(new File("/tmp/sample-data.parquet"));
    CarpetReader<Person> reader = new CarpetReader<>(inputFile, Person.class);
    return new ParquetItemReader<>(reader);
}
Implementing ParquetItemWriter
The tutorial example writes the transformation result to a database. To complete the article, I’ll also write the result to another Parquet file as an excuse to create the ParquetItemWriter.
In this case, we'll implement an ItemWriter by extending AbstractItemStreamItemWriter, which handles the common plumbing for us. The implementation is equally simple, delegating the write logic to CarpetWriter:
public class ParquetItemWriter<T> extends AbstractItemStreamItemWriter<T> {

    private final CarpetWriter<T> carpetWriter;

    public ParquetItemWriter(CarpetWriter<T> carpetWriter) {
        this.carpetWriter = carpetWriter;
    }

    @Override
    public void write(Chunk<? extends T> chunk) throws Exception {
        // Write each item of the chunk to the Parquet file
        for (T item : chunk.getItems()) {
            carpetWriter.write(item);
        }
    }

    @Override
    public void close() {
        try {
            // Closing the writer flushes pending data and writes the Parquet footer
            carpetWriter.close();
        } catch (IOException e) {
            throw new ItemStreamException("Error closing CarpetWriter", e);
        }
    }
}
As with the ItemReader, we delegate the creation of CarpetWriter to the user of the class:
@Bean
public ParquetItemWriter<Person> parquetWriter() throws IOException {
    OutputFile outputFile = new FileSystemOutputFile(new File("/tmp/processed-data.parquet"));
    CarpetWriter<Person> carpetWriter = new CarpetWriter<>(outputFile, Person.class);
    return new ParquetItemWriter<>(carpetWriter);
}
Dependency
To use Carpet in your Maven project, add the following dependency to your pom.xml:
<dependency>
    <groupId>com.jerolba</groupId>
    <artifactId>carpet-record</artifactId>
    <version>0.5.0</version>
</dependency>
The library transitively imports the minimum necessary dependencies of Apache Parquet, which implements the format.
Conclusion
Integrating Apache Parquet with Spring Batch lets you take advantage of a columnar format optimized for storing and processing large volumes of data from traditional Java batch processes, a space that is no longer the exclusive territory of tools like Spark or Pandas.
Although Spring Batch doesn’t offer native Parquet support, with Carpet it’s very easy to build the ItemReader and ItemWriter that facilitate this integration. This opens the door to building more efficient, interoperable data pipelines from Java that are aligned with modern data architectures based on Data Lakes and columnar formats.
You can find the example code in this GitHub repository.
Depending on the feedback received, I’ll consider creating a more complete library that implements these classes generically and configurably for production use.