This week I released a new version of Carpet, the Java library for working with Parquet files. In this version, I’ve added a feature that I believe nobody will ever use: the ability to read and write BSON-type columns.
Parquet supports embedded types, which allow defining complex data types within a column. Each of these types has its own internal representation as a byte array.
Historically, the Parquet format has defined the JSON and BSON types:

- `JSON`, with `binary value (JSON)` in the schema, represents data as a UTF-8 text (`String`) serialization of a JSON object. The difference from storing the same content in a `STRING` type column is that it explicitly defines the content as a JSON object, not just plain text.
- `BSON`, with `binary value (BSON)` in the schema, represents data as a byte array following the BSON specification. Typing it this way also explicitly identifies the content as a BSON object, rather than just any byte array.
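The difference between the two encodings can be illustrated with a tiny document. The sketch below shows how `{"a": 1}` would be laid out in each column type; the BSON bytes are hand-rolled here purely for illustration, and a real application would use a BSON library instead of building them by hand:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class EmbeddedTypesSketch {

    // JSON column: simply the UTF-8 bytes of the JSON text itself
    public static byte[] asJsonColumn() {
        return "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
    }

    // BSON column: length-prefixed binary per the BSON spec:
    // int32 total length (little-endian), elements, trailing 0x00
    public static byte[] asBsonColumn() {
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(12);        // total document length in bytes
        buf.put((byte) 0x10);  // element type: int32
        buf.put((byte) 'a');   // element name "a"
        buf.put((byte) 0x00);  // cstring terminator
        buf.putInt(1);         // element value: 1
        buf.put((byte) 0x00);  // end of document
        return buf.array();
    }
}
```

Both methods return the raw bytes that end up in the column; only the logical type annotation in the schema tells a reader how to interpret them.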
In my limited experience with Parquet, I haven’t come across examples where these embedded types are used, and no one had requested support for them in Carpet.
So why did I decide to implement support for them? The answer is simple: to challenge myself and see if I could do it, and to test whether the codebase was flexible enough to support these features without breaking anything or imposing limitations on Carpet.
Another reason is that new embedded types have recently been defined in the Parquet specification: VARIANT, GEOMETRY, and GEOGRAPHY. There are still no functional implementations of these new types in Parquet, but they are already defined and work is underway on their implementation in at least two languages. It’s wise to lay the groundwork for their future availability.
The Implementation Problem
The key challenge when implementing these types was deciding which Java types to use for representation.
What’s the best Java class to represent JSON? Should I create a Carpet-specific class? Use the `JSONObject` class from the org.json library? Or perhaps the `JsonNode` class from Jackson? Could I use any object and serialize it to JSON? And if so, with which library? Any of these approaches would tie Carpet to a specific implementation and add unnecessary dependencies for 99.99% of use cases, while also forcing users into a particular library.
The same issue applies to BSON. I could use the MongoDB bson library, but that would create a dependency on a specific implementation that virtually no one needs.
Instead, I opted to use Java’s `String` type to represent JSON content and Parquet’s `Binary` class for BSON. This approach keeps Carpet independent of any specific implementation, allowing users to choose their preferred library for handling the content.

The trade-off is that anyone using this functionality (which I suspect will be very few people) will need to declare their attributes as `String` or `Binary` instead of using their own business classes, and will need to handle serialization and deserialization themselves before using Carpet:
```java
record ProductEvent(
    long id,
    String name,
    String jsonData,
    Binary bsonData,
    Instant timestamp) {
}
```
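That caller-side serialization step can be sketched as follows, assuming the record above. The JSON is concatenated by hand here purely for illustration; a real application would use Jackson, Gson, or org.json, and would wrap BSON bytes with Parquet’s `Binary.fromConstantByteArray(...)`:

```java
public class SerializeBeforeCarpet {

    // Hypothetical helper standing in for something like
    // objectMapper.writeValueAsString(productDetails)
    public static String toJson(String color, int size) {
        return "{\"color\":\"" + color + "\",\"size\":" + size + "}";
    }
}
```

The resulting `String` goes straight into the `jsonData` component of the record before writing; reading works the same way in reverse, with the application parsing the `String` after Carpet returns it.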
Annotating Java Types
I couldn’t simply use `String` for JSON and `Binary` for BSON, because Java’s `String` type already maps automatically to Parquet’s `STRING` type. Additionally, `Binary` needs to support multiple Parquet logical types (`STRING`, `BSON`, and eventually `VARIANT`, `GEOMETRY`, and `GEOGRAPHY`).
To solve this, I created annotations for Java record attributes to specify which Parquet data type they should map to.
When you annotate a `String` attribute with `@ParquetJson` or a `Binary` attribute with `@ParquetBson`, you’re telling Carpet that these attributes contain JSON or BSON data respectively, not just regular text or byte arrays.
This record, when written to a Parquet file:
```java
record ProductEvent(
    long id,
    String name,
    @ParquetJson String jsonData,
    @ParquetBson Binary bsonData,
    Instant timestamp) {
}
```
will generate the following Parquet schema:
```
message ProductEvent {
  required int64 id;
  optional binary name (STRING);
  optional binary jsonData (JSON);
  optional binary bsonData (BSON);
  required int64 timestamp (TIMESTAMP(MILLIS,true));
}
```
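For comparison, this is what I would expect the schema to look like without the annotations. The `jsonData` mapping follows from the automatic `String`-to-`STRING` rule mentioned above; the un-annotated `Binary` mapping to a plain binary column with no logical type is my assumption, not something stated earlier:

```
message ProductEvent {
  required int64 id;
  optional binary name (STRING);
  optional binary jsonData (STRING);
  optional binary bsonData;
  required int64 timestamp (TIMESTAMP(MILLIS,true));
}
```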
After introducing annotations to change attribute types in the Parquet schema, I extended this capability beyond embedded types to Java’s `String` and `Enum` types as well.

I added the `@ParquetString` and `@ParquetEnum` annotations to modify the Parquet logical type of Java attributes of type `String`, `Enum`, or `Binary`. This is particularly useful when conforming to a third-party contract while still using the most convenient data types in your Java code.
This record, when written to a Parquet file:
```java
record ProductEvent(
    long id,
    String name,
    @ParquetString Binary productCode,
    @ParquetString MyEnum category,
    @ParquetEnum String type) {
}
```
will generate the following Parquet schema:
```
message ProductEvent {
  required int64 id;
  optional binary name (STRING);
  optional binary productCode (STRING);
  optional binary category (STRING);
  optional binary type (ENUM);
}
```
If you want to know more about the new features, you can read the Carpet documentation on Java type annotations.
Conclusion
While I doubt anyone will use the JSON and BSON functionality in Carpet, developing this feature has been valuable for testing the flexibility of Carpet’s codebase and preparing it for the new embedded types being defined in the Parquet specification.
As a bonus, I’ve enhanced support for Parquet’s `Binary` type and added the ability to modify the logical types of attributes in the Parquet schema, capabilities I hadn’t originally planned but that could prove useful in certain scenarios.
With this new functionality requiring documentation, I took the opportunity to create a dedicated documentation site for Carpet, moving all the content from the GitHub repository’s README to this new resource.
The documentation is built with MkDocs and hosted on GitHub Pages. You can browse the complete documentation at carpet.jerolba.com.