Working with Parquet files in Java using Protocol Buffers

This post continues the series of articles about working with Parquet files in Java. This time, I’ll explain how to do it using the Protocol Buffers (PB) library.

Finding examples and documentation on how to use Parquet with Avro is challenging, but with Protocol Buffers, it’s even more complicated.
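
To give a flavour of what the article covers, here is a minimal sketch of writing a Parquet file with the parquet-protobuf module. It assumes a hypothetical class Organization generated by protoc from a .proto schema, plus the Hadoop Path type this API requires; the names and fields are illustrative only.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.proto.ProtoParquetWriter;

// Organization is a hypothetical protoc-generated message; adjust to your own schema.
// The surrounding method must declare or handle IOException.
try (ParquetWriter<Organization> writer =
        new ProtoParquetWriter<>(new Path("organizations.parquet"), Organization.class)) {
    writer.write(Organization.newBuilder()
            .setName("Acme")
            .build());
}
```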

7 min read

Working with Parquet files in Java using Avro

In the previous article, I wrote an introduction to using Parquet files in Java, but I did not include any examples. In this article, I will explain how to do this using the Avro library.

Parquet with Avro is one of the most popular ways to work with Parquet files in Java due to its simplicity, flexibility, and because it is the library with the most examples.
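
As a taste of the approach, this is a minimal sketch of writing a Parquet file with parquet-avro and a GenericRecord. The schema and field names here are made up for illustration and do not come from the article.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical schema with two fields; the article's own example may differ.
Schema schema = SchemaBuilder.record("Organization").fields()
        .requiredString("name")
        .requiredInt("employees")
        .endRecord();

// The surrounding method must declare or handle IOException.
try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("organizations.parquet"))
        .withSchema(schema)
        .build()) {
    GenericRecord record = new GenericData.Record(schema);
    record.put("name", "Acme");
    record.put("employees", 42);
    writer.write(record);
}
```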

11 min read

Working with Parquet files in Java

Parquet is a widely used format in the Data Engineering realm and holds significant potential for traditional Backend applications. This article serves as an introduction to the format, including some of the unique challenges I’ve faced while using it, to spare you from similar experiences.

9 min read

Java Serialization with Protocol Buffers

At Clarity AI, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. By volume, it is not Big Data, but it is enough to make reading and loading it efficiently a real problem in processes that serve online users.
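
For readers unfamiliar with Protocol Buffers, this is a minimal sketch of the serialize/parse round trip the article builds on. Company and its fields are hypothetical placeholders for a protoc-generated class.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Company is a hypothetical protoc-generated message with name and score fields.
Company company = Company.newBuilder()
        .setName("Acme")
        .setScore(87.5)
        .build();

// Serialize the message to a file...
try (FileOutputStream out = new FileOutputStream("company.bin")) {
    company.writeTo(out);
}

// ...and parse it back later, typically in another process.
try (FileInputStream in = new FileInputStream("company.bin")) {
    Company loaded = Company.parseFrom(in);
    System.out.println(loaded.getName());
}
```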

8 min read

Bikey

TL;DR: in this post I present Bikey, a Java library for creating Maps and Sets whose elements have two keys, consuming from 70%–85% up to 99% less memory than the usual data structures. It is Open Source, published at https://github.com/jerolba/bikey, and available in Maven Central.

Bikey map memory consumption comparison

10 min read

Composite key in HashMaps

In my last post I talked about the problems of using an incorrect hash function when you put an object with a composite key into a Java HashMap, but I left one question open: which data structure is better for indexing those objects?

Continuing with the same example, I will talk about products and stores, and I will use their identifiers to form the map key. The candidate data structures, both sketched in the code after this list, are:

  • A single map whose key combines both identifiers: HashMap<Tuple<Integer, Integer>, MyObject>, which I will call TupleMap.
  • A nested map: HashMap<Integer, HashMap<Integer, MyObject>>, which I will call DoubleMap.
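
To make both candidates concrete, here is a minimal sketch, assuming productId, storeId and a MyObject value are already defined. Tuple is a placeholder pair type; the post's own Tuple class may be implemented differently than this record.

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder pair used as a composite key; equals/hashCode come from the record.
record Tuple<A, B>(A first, B second) { }

// TupleMap: a single map keyed by the (productId, storeId) pair.
Map<Tuple<Integer, Integer>, MyObject> tupleMap = new HashMap<>();
tupleMap.put(new Tuple<>(productId, storeId), value);
MyObject fromTuple = tupleMap.get(new Tuple<>(productId, storeId));

// DoubleMap: a map of maps, indexed first by productId and then by storeId.
Map<Integer, Map<Integer, MyObject>> doubleMap = new HashMap<>();
doubleMap.computeIfAbsent(productId, id -> new HashMap<>()).put(storeId, value);
MyObject fromDouble = doubleMap.getOrDefault(productId, Map.of()).get(storeId);
```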

5 min read

Hashing and maps

Don’t worry, I’m not going to talk about blockchain or geopositioning, but about a more boring and basic topic: hash functions and hash maps.

In this post, I will describe a real case at Nextail where changing a single line of code related to a hash function changed the performance of an application, in both CPU consumption and memory.
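
As a hypothetical illustration of the kind of change involved (not the actual Nextail code), consider a composite key class whose hashCode collapses many distinct keys into the same bucket. Switching that single method to one that spreads the values changes how a HashMap behaves, and with it the CPU and memory profile.

```java
import java.util.Objects;

// Hypothetical composite key; not the real class from the case described in the post.
class ProductStoreKey {
    private final int productId;
    private final int storeId;

    ProductStoreKey(int productId, int storeId) {
        this.productId = productId;
        this.storeId = storeId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ProductStoreKey other)) {
            return false;
        }
        return productId == other.productId && storeId == other.storeId;
    }

    // A weak hash: every pair with the same sum collides and piles up in one bucket.
    // public int hashCode() { return productId + storeId; }

    // A better hash: spreads (productId, storeId) pairs across buckets.
    @Override
    public int hashCode() {
        return Objects.hash(productId, storeId);
    }
}
```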