Spartan Blog - Jerónimo | Jerolba’s blog. Tech, JVM and random stuff.

The Carpet feature that nobody will use

This week I released a new version of Carpet, the Java library for working with Parquet files. In this version, I’ve added a feature that I believe nobody will ever use: the ability to read and write BSON-type columns.

The two versions of Parquet

A few days ago, the creators of DuckDB wrote the article: Query Engines: Gatekeepers of the Parquet File Format, which explained how the engines that process Parquet files as SQL tables are blocking the evolution of the format. This is because those engines are not fully supporting the latest specification, and without this support, the rest of the ecosystem has no incentive to adopt it.

Compression algorithms in Parquet

Apache Parquet is a columnar storage format optimized for analytical workloads, though it can also be used to store any type of structured data solving multiple use cases.

One of its most notable features is the ability to efficiently compress data using different compression techniques at two stages of its process. This reduces storage costs and improves reading performance.

This article explains file compression in Parquet for Java, provides usage examples, and analyzes its performance.

Working with Parquet files in Java using Parquet Carpet

After some time working with Parquet files in Java using the Parquet Avro library, and studying how it worked, I concluded that despite being very useful in multiple use cases and having great potential, the documentation and ecosystem needed for adoption in the Java world was very poor.

Many people are using suboptimal solutions (CSV or JSON files), applying more complex solutions (Spark), or using languages they are not familiar with (Python) because they don’t know how to work with Parquet files easily. That’s why I decided to write this series of articles.

Once you understand it and have the examples, everything is easier. But, can it be even easier? Can we avoid the hassle of using strange libraries that serialize other formats? Yes, it should be even easier.

That’s why I decided to implement an Open Source library that makes working with Parquet from Java extremely simple, something that covers it: Carpet.

Working with Parquet files in Java using Protocol Buffers

This post continues the series of articles about working with Parquet files in Java. This time, I’ll explain how to do it using the Protocol Buffers (PB) library.

Finding examples and documentation on how to use Parquet with Avro is challenging, but with Protocol Buffers, it’s even more complicated.

Working with Parquet files in Java using Avro

In the previous article, I wrote an introduction to using Parquet files in Java, but I did not include any examples. In this article, I will explain how to do this using the Avro library.

Parquet with Avro is one of the most popular ways to work with Parquet files in Java due to its simplicity, flexibility, and because it is the library with the most examples.

Working with Parquet files in Java

Parquet is a widely used format in the Data Engineering realm and holds significant potential for traditional Backend applications. This article serves as an introduction to the format, including some of the unique challenges I’ve faced while using it, to spare you from similar experiences.

Java Serialization with Apache Avro

In previous posts I’ve analyzed Protocol Buffers and FlatBuffers, using JSON as the baseline. In this post, I will analyze Apache Avro and compare it with the previously studied formats.

Java Serialization with Flatbuffers

In the previous post I analyzed Protocol Buffers format, using JSON as baseline. In this post I’m going to analyze FlatBuffers and compare it with previously studied formats.

Java Serialization with Protocol Buffers

At Clarity AI, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. By volume of information, it is not Big Data, but it is enough to be a problem to read and load it efficiently in processes with online users.