How Apsalar’s Parquet Library Helps Us Efficiently Scale by Reducing Memory Usage and Increasing Concurrency
Usually, Apsalar blog posts and other social media announcements focus on strategic and tactical topics of interest to marketers. Our Tech Blog is very different. Here we will be presenting a new series of posts about Apsalar technology, written primarily by and for technologists.
Our goal with these posts is to discuss the kinds of challenges our team faces and the projects we take on to resolve them. We strongly believe in open source and in the value of sharing experience, so that all technologists can benefit from the collective work of our discipline.
As a leading global mobile data and audiences provider, Apsalar receives and processes billions of data points every day. These come from:
- Billions of consumer interactions on mobile devices through our SDK
- Additional billions of user events through our server-to-server API client integrations
- Tens of millions of signals from our more than 1,000 media and affiliate marketing partners
We process and analyze this data with Apache Spark, normalizing, transforming and enriching it through a highly optimized pipeline. To make the processing as efficient as possible, we store the data in Parquet, a compressed columnar file format that is widely used in the Apache Spark ecosystem.
About Apache Parquet
For those who are unfamiliar, Apache Parquet is a free and open-source column-oriented data format. Parquet has a variety of advantages, not least that it was specifically designed to support highly efficient compression schemes. It also enjoys wide support in the big data ecosystem, from Hadoop to Spark, and as an open-source project it fits Apsalar’s strong preference for open-source technologies.
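To make the column-oriented idea concrete, here is a minimal Python sketch. It is an illustration only: the sample records, the field names, and the `to_columns` and `run_length_encode` helpers are hypothetical. The run-length encoding shown is a simplified stand-in for the dictionary and run-length encodings Parquet itself applies. The sketch shows how pivoting rows into columns places similar values next to each other, which is what makes them compress well.

```python
from itertools import groupby

# Hypothetical event records, as a row-oriented store would hold them.
rows = [
    {"country": "US", "platform": "ios", "event": "launch"},
    {"country": "US", "platform": "ios", "event": "launch"},
    {"country": "US", "platform": "android", "event": "purchase"},
    {"country": "DE", "platform": "android", "event": "launch"},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(rows)
encoded = {name: run_length_encode(vals) for name, vals in columns.items()}
print(encoded["country"])  # [('US', 3), ('DE', 1)]
```

In the row layout, each "US" is separated from the next by unrelated fields. In the column layout the repeats are adjacent, so four country values collapse into two pairs.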
Using Parquet to Convert Enriched Datasets for Processing
Much of our data pipeline is built using Go (Golang) and NSQ, a messaging platform written in Go, which together give us a highly concurrent and scalable services architecture. One key element of our pipeline workflow was the conversion of enriched datasets to Parquet format before delivering them for processing in Apache Spark.
While this approach performed well in several respects, it also meant that we had to use memory-intensive processes for data conversion. This hampered our ability to scale these tasks as rapidly as we could other elements of the pipeline. Clearly, we needed a better way to convert data quickly and at scale.
Developing the Apsalar Parquet Library
Our solution was to develop the Apsalar Parquet Library, which writes Parquet-format files and ensures compatibility with Apache Parquet. Written in C++, the library has been used successfully in production for more than a year. Performance improvements have been massive. Specifically, we’ve reduced memory usage by more than 50X and increased concurrency by more than 8X.
Before we undertook development of our own Parquet library, we conducted a comprehensive assessment of existing open-source libraries. We found many libraries for reading Parquet, but few for writing it. Cloudera’s Impala database reads and writes Parquet, but offers a far broader set of capabilities than we needed for our challenges.
We examined the Impala code to see if we could easily separate out components that were appropriate for our challenges. Unfortunately, we found that the capabilities we needed were actually intertwined in the database code, not “packaged” in an easy-to-isolate module or library.
There were a couple of other Parquet writing library projects in early development stages. Unfortunately, they offered very little support for the key challenges we were facing. So we decided to develop our own library and publish it so that others can benefit from our work. You can check out the code and documentation here. The project provides libParquetfile, a C++ library that can generate Parquet files. It also provides the proto2parq application, which can convert data files or streams containing protobuf-defined records into Parquet format.
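The core of any records-to-Parquet conversion is turning a stream of records into column data. The sketch below is not the libParquetfile or proto2parq API, and it is not a description of how our library works internally. It is a hypothetical Python illustration (the `batched_columns` and `event_stream` functions and the `batch_size` parameter are made up for this example) of converting a record stream in fixed-size batches, similar in spirit to Parquet row groups, so that only one batch is held in memory at a time.

```python
from itertools import islice

def batched_columns(records, batch_size):
    """Yield one dict of column lists per batch of records.

    Hypothetical helper: only `batch_size` records are buffered at a time,
    so memory use does not grow with the length of the stream.
    """
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Assumes every record in the stream has the same fields.
        yield {name: [rec[name] for rec in batch] for name in batch[0]}

def event_stream(n):
    """Generate made-up records lazily, standing in for a real data stream."""
    for i in range(n):
        yield {"user_id": i, "event": "launch" if i % 2 == 0 else "purchase"}

for chunk in batched_columns(event_stream(5), batch_size=2):
    print(chunk)
```

A real writer would encode and flush each batch to disk before reading the next one. The batch size is then the main knob that trades memory use against compression efficiency.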