This post continues the series of articles about working with Parquet files in Java. This time, I’ll explain how to do it using the Protocol Buffers (PB) library.
Finding examples and documentation for using Parquet with Avro is already challenging; with Protocol Buffers, it is even harder.
In the previous article, I wrote an introduction to using Parquet files in Java, but I did not include any examples. In this article, I will explain how to do this using the Avro library.
Parquet with Avro is one of the most popular ways to work with Parquet files in Java, due to its simplicity and flexibility, and because it is the library with the most available examples.
Parquet is a widely used format in the Data Engineering realm and holds significant potential for traditional Backend applications. This article serves as an introduction to the format, including some of the unique challenges I’ve faced while using it, to spare you from similar experiences.
In the previous post I analyzed the Protocol Buffers format, using JSON as a baseline. In this post I’m going to analyze FlatBuffers and compare it with the previously studied formats.
At Clarity AI, we generate batches of data that our application has to load and process to show our clients the social impact information of many companies. By volume, it is not Big Data, but it is enough to make reading and loading it efficiently a real problem in processes with online users.
TL;DR: in this post I present Bikey, a Java library for creating Maps and Sets whose elements have two keys, consuming between 70% and 99% less memory than the usual data structures. It is Open Source, published at https://github.com/jerolba/bikey and available in Maven Central.
In my last post I talked about the problems of using an incorrect hash function when you put an object with a composite key in a Java HashMap, but I was left with the question: which data structure is best for indexing those objects?
Continuing with the same example, I will talk about products and stores, and I will use their identifiers to form the map key. The proposed data structures are:
- A single map with a key containing both indexes: HashMap<Tuple<Integer, Integer>, MyObject>, which I will call TupleMap.
- A nested map: HashMap<Integer, HashMap<Integer, MyObject>>, which I will call DoubleMap.
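The two candidates above can be sketched as follows. This is a minimal illustration, not the article's benchmark code: Tuple and MyObject are hypothetical placeholders (here modeled as Java records, which generate the consistent equals and hashCode a composite key requires), and the ids are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

public class MapIndexDemo {

    // Hypothetical composite key; records auto-generate equals/hashCode over both fields
    record Tuple<A, B>(A first, B second) { }

    // Placeholder for the indexed value
    record MyObject(String name) { }

    public static void main(String[] args) {
        int productId = 1;
        int storeId = 2;
        MyObject value = new MyObject("stock");

        // TupleMap: a single map keyed by the (product, store) pair
        Map<Tuple<Integer, Integer>, MyObject> tupleMap = new HashMap<>();
        tupleMap.put(new Tuple<>(productId, storeId), value);
        MyObject fromTuple = tupleMap.get(new Tuple<>(productId, storeId));

        // DoubleMap: nested maps, one level per key
        Map<Integer, Map<Integer, MyObject>> doubleMap = new HashMap<>();
        doubleMap.computeIfAbsent(productId, k -> new HashMap<>()).put(storeId, value);
        MyObject fromDouble = doubleMap.getOrDefault(productId, Map.of()).get(storeId);

        System.out.println(fromTuple.equals(fromDouble)); // prints true
    }
}
```

Both lookups retrieve the same object; the trade-off between them is in memory layout and the cost of the extra hash lookup versus the extra object allocation per key.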
In this post, I will describe a real case at Nextail, where changing a single line of code related to a hash function altered the performance of an application, in both CPU consumption and memory usage.