A few months ago, we made a post on our marketing blog about our new postback system. Response to that post has been so strong that our tech team decided to provide a post that outlined the evolution of our systems, and how we got to the great place that we are in today. For reference, the marketing post appears at the bottom of this tech story.
By Fazal Majid
Cofounder and CTO, Apsalar
As a mobile attribution and audiences platform, Apsalar Mobile Marketing Cloud is at the center of the mobile advertising ecosystem. Our platform tracks mobile device user activity on behalf of our customers and attributes credit for app installs and re-engagement events to the right media company and campaign based on rules-based last-click attribution.
We act as a measurement hub for the mobile app marketing ecosystem and must integrate with a wide variety of partners such as ad networks and analytics providers. Those integrations – Apsalar currently has more than 1000 of them – provide the information necessary to track and attribute credit for paid events and feedback to drive media optimization. For some media companies, postbacks are actually used in their billing systems to determine how much the client can be charged for media services rendered.
Mobile Postbacks and Challenges
“Postbacks” have historically been central to this feedback process. Mobile postbacks are outbound API calls we make to partners and platforms to alert them to marketing events and in-app user actions that clients have chosen to share. For example, with an app user acquisition or “install” campaign, we must identify the media partner and campaign responsible for driving the install, and then share that finding with relevant solution providers. Often, this process requires that we arbitrate competing claims from media partners who are part of a brand’s user acquisition efforts.
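To make the arbitration idea concrete, here is a minimal sketch of rules-based last-click attribution among competing claims. The claim structure and the seven-day lookback window are illustrative assumptions, not Apsalar's actual rules.

```python
# Sketch of last-click arbitration: among all media partners claiming
# credit, pick the one with the most recent click inside the lookback
# window. Claim fields and the window length are hypothetical.
from datetime import datetime, timedelta

def attribute_last_click(install_time, claims, lookback_days=7):
    """Return the winning claim, or None for an organic install."""
    window_start = install_time - timedelta(days=lookback_days)
    eligible = [c for c in claims
                if window_start <= c["click_time"] <= install_time]
    if not eligible:
        return None  # no claim in the window: treat as organic
    return max(eligible, key=lambda c: c["click_time"])
```

In practice the rules per client are richer than this (view-through windows, per-partner overrides), but the last-click core is just "latest eligible claim wins."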
Installs aren’t the only events that get posted back. In-app purchases, registrations, cart abandons, and any other in-app user actions may be tracked and shared by clients. Mobile postbacks are entirely separate from the inbound API calls coming from our SDK- or API-integrated clients, but they are arguably just as critical to measuring and ensuring app marketing success.
Beyond the task of attributing credit for user acquisition, actually delivering postbacks is a significant technological challenge. Our postback processing and delivery system:
- Must deliver huge volumes of these outbound API calls – billions per day
- Must deliver postbacks as close to real time as possible
- Must be absolutely accurate because, as we mentioned earlier, mobile postbacks drive media fees and implement macros and other conditional logic that powers more effective marketing campaigns
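The macros mentioned above can be pictured as placeholders in a partner's URL template that we fill from the attributed event. This is a minimal sketch; the `{name}` template syntax, the field names, and the partner URL are illustrative assumptions, not Apsalar's actual macro language.

```python
# Sketch of postback macro expansion: replace {name} placeholders in a
# partner URL template with URL-encoded values from the attributed event.
# Template syntax and field names are hypothetical.
from urllib.parse import quote

def expand_macros(template: str, event: dict) -> str:
    """Fill every {key} macro in the template from the event dict."""
    url = template
    for key, value in event.items():
        url = url.replace("{" + key + "}", quote(str(value), safe=""))
    return url

partner_template = (
    "https://tracker.example-network.com/postback"
    "?click={click_id}&event={event}&rev={revenue}"
)
event = {"click_id": "abc123", "event": "install", "revenue": 0.0}
```

A real macro language also supports conditionals and per-partner defaults, which is where much of the complexity hides.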
There are also infrastructure and environmental challenges to delivering accurate and timely postbacks. Sometimes mobile postback transmission needs to be retried if a media company’s infrastructure is having temporary issues and does not receive and acknowledge the app postbacks the first time they are transmitted.
When mobile postback retransmission is required, it must be clear to the vendor that the postbacks are “repeats.” That clarity is critical to ensure that there is no double counting of in-app user actions. Also, retries/retransmissions must not create backlogs for postbacks to other networks.
Further, all postbacks are delivered via the Internet, and we need to plan and have contingencies for when Internet performance is disrupted or spotty. Again, postbacks should be delivered only once, despite the environmental challenges.
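The retry behavior described above can be sketched in two parts: a backoff schedule, and a visible marker so a retransmitted postback is clearly a repeat. The schedule values and the `retry_count` parameter name are assumptions; real integrations vary per partner.

```python
# Sketch of a retry policy for failed postback deliveries. The backoff
# base/schedule and the retry_count query parameter are illustrative.
def backoff_schedule(max_retries: int, base_seconds: float = 30.0) -> list:
    """Exponential backoff delays for successive retransmissions."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]

def with_retry_marker(url: str, attempt: int) -> str:
    """Tag a redelivered postback so the partner can de-duplicate it."""
    if attempt == 0:
        return url  # first delivery is unmarked
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}retry_count={attempt}"
```

Marking repeats this way keeps the burden of not double counting explicit on both sides of the integration.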
In addition to these hurdles, a postback system needs to be flexible enough to allow for the specific rules and policies of particular media vendors, governments and industry trade organizations. With some very large media partners, there are additional contractual requirements around privacy, data segregation and retention. Clients and partners must also adhere to specific commitments made to organizations like the FTC, the EU and other regulatory authorities.
Another challenge for an attribution provider like Apsalar is that not all attribution information is known at install time. Many of the largest media providers for the app industry are “self-attributing,” meaning that we must query them to determine whether they touched the user who has taken the desired action. We query the outside APIs of these self-reporting sources (using different postbacks) and must wait for the replies before we make a final determination of which media company deserves credit for an install. If we sent a postback immediately at install time, we might have to countermand it a few moments later, and few partners are prepared for that type of contingency. Therefore, we need to wait until the final attribution determination is made – as long, but only as long, as necessary – before delivering the postback message.
Unfortunately, the deliver-exactly-once requirement is impossible to guarantee in a distributed system like postback delivery; well-known impossibility results in distributed computing show that no protocol can achieve it in all failure cases. Nevertheless, we must strive to get as close as possible to this ideal.
Because of the latency involved and the stress it puts on a machine’s networking stack, we handle postbacks on different clusters than the rest of our services. This means we need a queueing system to hand postbacks and feedback messages off to those clusters.
How Apsalar Solved the Postback Challenges
Our mobile postback solution has had a number of versions. Each was designed to deliver further precision and scale as key variables changed in our industry, including:
- Massive expansion in the number of media and technology partners used by app marketers
- Explosive growth in the number of apps in the industry, and Apsalar’s multi-year record of triple-digit sales growth
- Heterogeneity in the buying models used to purchase digital media, and the implications of those for the number and types of postbacks required for optimization
- Growing mobile app advertiser interest in “downstream events,” so that media could be optimized for user quality instead of just raw install counts. User quality optimization means that a mobile marketer requests optimization toward any desirable post-install action – for example, app usage metrics.
Downstream event postbacks mean that instead of sending one postback for an install, we must send multiple postbacks whenever a consumer takes an action the advertiser deems important enough to track.
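This fan-out can be sketched as a lookup from an event name to the set of partners whose configuration subscribes to it. The configuration shape and partner names here are simplified assumptions.

```python
# Sketch of downstream-event fan-out: one in-app event triggers a
# postback to every partner subscribed to that event. The config
# structure is an illustrative simplification.
def fanout(event_name: str, partner_configs: dict) -> list:
    """Return the partners that should receive a postback for this event."""
    return sorted(partner
                  for partner, subscribed in partner_configs.items()
                  if event_name in subscribed)

configs = {
    "network_a": {"install", "purchase"},
    "network_b": {"install"},
    "analytics_x": {"install", "purchase", "registration"},
}
```

With hundreds of partners and clients tracking many event types, this multiplication is what turns installs into billions of daily postbacks.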
The development journey can reveal a lot about a technology team. Let me take you through the “generations” of our postback delivery platforms, and how we steadily evolved to meet new challenges and got to the industry-leading system we use today.
Architecture V1: Python + PgQ, single postback process per attribution process
Historically we have been a Python shop, dropping into C for speed when warranted. The queuing system we used initially was Skype’s Skytools PgQ, based on PostgreSQL. It is very robust and has strong data integrity features, and performs significantly better than other RDBMS-backed queues by cleverly leveraging PostgreSQL’s MVCC system to avoid performing one transaction per message.
This approach had many outstanding qualities, but as the number and importance of self-attributing media sources increased, there were inevitably occasions when we needed to reissue postbacks because the attribution decision had been revised. That caused challenges for some media providers. As a result, we quickly evolved our approach to address this.
V2: Python + PgQ + Redis, single postback process per attribution
To prevent the need for many corrective postbacks, we added a stateful low-pass filter, using Redis to store the state. The logic was fairly complex and took some time to fully debug across its edge cases. This solution fixed the reissue problem, but without further evolution it could have given rise to scale issues – especially in an industry experiencing near-exponential growth.
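The core of such a filter can be sketched as a set of already-released postbacks keyed by device, event, and partner, so that a revised attribution decision does not trigger a duplicate. A plain dict stands in for Redis here; class and method names are illustrative, not Apsalar's actual code.

```python
# Sketch of a stateful low-pass filter for postbacks: remember which
# (device, event, partner) triples were already released and suppress
# repeats. In production the store would be Redis; a dict stands in.
class PostbackFilter:
    def __init__(self, store=None):
        self.store = store if store is not None else {}

    def should_send(self, device_id: str, event: str, partner: str) -> bool:
        """True on first sight of this triple, False for any repeat."""
        key = (device_id, event, partner)
        if key in self.store:
            return False  # already released once; suppress the duplicate
        self.store[key] = True
        return True
```

A production version also needs key expiry (Redis TTLs) so the state does not grow without bound.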
V3: Python + NSQ + Redis, single postback process per attribution process
As postback volume increased, it became clear that we would ultimately reach the limits of PgQ throughput, because PgQ serializes all messages through a single PostgreSQL database. We needed a more scalable approach, so our team performed a thorough analysis of scalable queueing systems. Ultimately, we chose Bit.ly’s NSQ, because it:
- Is implemented in Go and offers much lower ongoing overhead
- Is relatively simple to set up and administer
- Has extensive yet pragmatic telemetry
- Provides durability by storing queues on disk, unlike gnatsd or ZeroMQ
In addition, we implemented both spillover and retry queues so that slow networks would not degrade the overall performance of our system.
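The routing decision behind those queues can be sketched as a small function: redeliveries go to the retry queue, and partners whose endpoints have been responding slowly are diverted to the spillover queue so they cannot back up everyone else. The latency threshold and queue names are illustrative assumptions.

```python
# Sketch of queue routing with spillover and retry queues. The 2-second
# slowness threshold and the queue names are hypothetical.
def choose_queue(recent_latency_ms: float, attempt: int = 0,
                 slow_threshold_ms: float = 2000.0) -> str:
    """Pick the queue for a postback based on delivery history."""
    if attempt > 0:
        return "retry"      # redeliveries never block fresh traffic
    if recent_latency_ms > slow_threshold_ms:
        return "spillover"  # slow partner: keep it off the main queue
    return "main"
```

Isolating slow and retried traffic this way is what keeps one misbehaving partner endpoint from creating backlogs for all the others.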
V4: Python + NSQ + Redis, single postback process per attribution process, ElasticSearch + filebeat
The approach outlined above addressed our postback timing and delivery challenges. It represented a major step forward and served our company well for some time. Our next set of evolutions focused on giving our Customer Support team better ways to debug problems that occurred in production.
To address these, we set up ElasticSearch on the postback server logs, using filebeat to transport the logs from the postback server cluster to the ElasticSearch cluster.
V5: Python + NSQ + Redis, single postback process per attribution process, CitusDB + cstore_fdw
Unfortunately, we quickly ran into limitations with ElasticSearch and filebeat. Filebeat was a particular problem: despite being written in Go, it would consume over 1GB of RAM and so much CPU that it started to crowd out the postback processes doing the actual work.
To resolve this, we switched to CitusDB with the cstore_fdw columnar storage engine, which reduced disk requirements and sped up ad-hoc queries: its block skip indexes perform well across all columns without the overhead of a traditional B-tree index.
V6: Python + NSQ + Redis, multiple postback processes per attribution process, CitusDB + cstore_fdw
Those who haven’t worked much with mobile data would be surprised at how quickly data volumes grow in this medium. This, plus the multi-year triple-digit growth in our client base, meant that our SDK traffic ramped at a staggering rate. As I mentioned earlier, until that point Apsalar was leveraging Python, as were many in the mobile data arena. Many still do.
But Python cannot scale across more than one CPU core due to its Global Interpreter Lock (GIL). This raised the risk of future delays should data volumes spike beyond typical levels. What was clear was that we needed a two-pronged approach to increasing the scalability of our postback system.
In the short-term, we eliminated legacy bottlenecks in the postback server daemon so multiple instances could run in parallel for the same attribution server.
We then radically increased throughput by running multiple instances – as many as five – to prevent, and where necessary absorb, transient backlogs. Because each Python process saturates an entire CPU core, we also provisioned 2½ times as many CPU cores in our postback cluster. This enabled us to grow our capacity considerably while we simultaneously created a completely new postback system that could increase our capacity by orders of magnitude.
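Running several postback processes for one attribution server needs a stable way to split the work. One common approach, sketched here, is hashing the device identifier so all postbacks for a given device land on the same worker, preserving per-device ordering. The hashing scheme is an assumption; the worker count of five matches the text.

```python
# Sketch of spreading postbacks across parallel worker processes with a
# stable hash, so one device's postbacks always go to the same worker.
# CRC32 as the hash is an illustrative choice.
import zlib

def pick_worker(device_id: str, num_workers: int = 5) -> int:
    """Stable assignment of a device's postbacks to one of N workers."""
    return zlib.crc32(device_id.encode("utf-8")) % num_workers
```

Stable partitioning like this lets instances run without coordinating on every message, which is what makes horizontal scaling cheap.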
V7 (today): Go + NSQ, multiple postback processes per attribution process, Postback low-pass filter/duplicate prevention done upstream in attribution, CitusDB + cstore_fdw
Our long-term solution began with a systematic process to transition performance-critical portions of our code from Python to Go. Go is a programming language developed at Google that delivers roughly 80% of the performance of C while maintaining 80% of the productivity of Python. Further, its authors have been vigilant about keeping it small, elegant and simple, and have avoided the “second-system effect” of piling on features that could reduce stability and performance. In fact, each successive release of Go has significantly improved performance.
Our success with these sequential updates convinced us that the time had come to do a full rewrite of the postback server from Python to Go. We could use the data in CitusDB to build automated regression tests for the macro language based on thousands of postback rules and millions of actual production postbacks. We also revisited how the low-pass filtering works, by moving it upstream to the attribution server, which now tracks what outstanding attribution checks are still pending, and only releases postbacks to NSQ when all have reported in. We call the new system “Gostback.”
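The upstream gate described above can be sketched as a small state machine: the attribution server records which self-attributing network queries are still outstanding and releases the postback to NSQ only when every one has replied. This is a Python sketch of the idea (the production system is Go); class and method names are illustrative.

```python
# Sketch of the upstream attribution gate: track outstanding
# self-attributing network checks and signal release only when all have
# reported in. Names are hypothetical, not the Gostback internals.
class AttributionGate:
    def __init__(self, install_id, pending_networks):
        self.install_id = install_id
        self.pending = set(pending_networks)  # networks not yet heard from
        self.claims = {}                      # network -> claimed credit?

    def record_reply(self, network: str, claimed: bool) -> bool:
        """Record one network's reply; True means all checks are done
        and the final postback may be released to the queue."""
        self.pending.discard(network)
        self.claims[network] = claimed
        return not self.pending
```

Because the gate lives upstream in attribution, downstream consumers never see a postback that might later be countermanded.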
Our team is extremely proud of Gostback. We’ve massively increased throughput and capacity, and our automated regression tests using the data in Citus meant far fewer bugs throughout the development and deployment processes. Apsalar now enjoys a much more reliable system that eliminates race conditions. And when the self-attributing vendor responses come back to us in seconds, as is usually the case, we no longer need to provision an artificial wait time before sending the postbacks.
I am very proud of our team and how we met both short-term needs and prepared for long-term challenges. In mobile, we face a constantly changing environment and colossal increases in both the amount of data we receive and how quickly we need to process it. These needs constantly challenge and energize our team and make working here so exciting and rewarding.
——–The Original marketing post appears below——–
APSALAR DEBUTS NEW POSTBACK MANAGEMENT SYSTEM TO ENSURE MAXIMUM POSTBACK ACCURACY AND SPEED OF DELIVERY
We’re pleased to announce that Apsalar has developed and deployed an entirely new postback management system to raise the standard in the mobile marketing measurement category by ensuring maximum accuracy and speed.
The new system addresses both the profound complexities of delivering accurate and timely information to clients and partners, and the constant growth of both the number and types of data that need to be posted back.
A Little Background on Postbacks
Postbacks are a key way that clients and partners optimize marketing programs. As an attribution platform, it is our most important responsibility to ensure that we precisely track consumer signals and give credit to the source that actually drove the action. In app marketing, the most common sources are major media properties like Facebook and Google, as well as more than 1,000 ad networks, affiliate platforms, and APKs. We do this using postbacks, which are basically outbound API calls we make to partners, as opposed to the inbound API calls coming from our SDK or our API-integrated clients.
Clients and sources rely on postbacks to pay for performance and to optimize the programs designed to derive maximum installs and ROI. Sound relatively straightforward? It isn’t.
Some of the biggest challenges of the postback ecosystem are:
- Processing postbacks is a tremendous volume challenge. We need to deliver accurate information billions of times every day.
- They need to be delivered as close as possible to real-time
- They implement often complex macros and other conditional logic so that measurement and automated optimization can take place
- Different media sources have different systems and need data parsed and organized in different ways
- Some of the largest partners in the industry – like Facebook, Google, and Twitter – have their own sets of contractual requirements. These can demand any or all of the following: user privacy protections, data segregation and data retention limits.
- They need to comply with regulatory standards and commitments made to bodies like the US Federal Trade Commission (FTC), and with regulations like the EU’s General Data Protection Regulation.
- Cloud computing issues. The massive cloud computing services like AWS and those from Google and Microsoft are extremely popular – with clients, partners, and even our competitors. But occasional instability and performance issues can result in additional complications for postbacks. That’s one reason why we have our own servers – to deliver more consistent performance and a higher degree of data security.
- They need to function when Internet instability, partner data stacks, and other sources of delay require that we hold a postback until all parties have delivered relevant data.
Another major challenge is that not all relevant data is automatically delivered to an attribution provider. For some sources, we must ping a series of partners after an install to determine when and if marketing on their platforms “touched” these users.
While we wait for a response to such a ping, it’s important that we not send a postback that will need to be rescinded later. Finally, an attribution platform must be constantly expanding its capacity to deal with the massive growth inherent in the mobile marketing industry, and in its own growth in client count.
From Old to New System
When we built our first postback system, we used a programming language called Python, which is known for simplicity, clarity and readability. But as both the volume of postbacks and the variety of integrations grew, our Python platform struggled during peaks of global usage. Without getting too deep into detail, the system would reach the limits of its throughput and then begin to cause delays in postback delivery.
Over the last two years, we had made incremental changes in the platform and made major hardware purchases to address the growth in “normal” postback volumes. But peaks caused occasional temporary delays which could be a source of frustration to some clients and partners. One client, in particular, decided to move on from Apsalar as a result of these temporary delays. For us, the loss of a client is a big deal. In fact, any client dissatisfaction is a big deal for us. The business statistic we’re proudest of is our tiny churn rate.
Our completely rearchitected system will help ensure that our client satisfaction remains high.
Which brings us to today. Our new postback system is built in Go, a programming language developed by Google, and a new database system called Citus DB 5. Citus is a distributed database that makes accessing and processing data far faster than our previous systems.
Speed, accuracy, and scale have all dramatically increased. In a recent test, speed for critical postback tasks increased more than 40X. Perhaps even more importantly, the speed and flexibility of the new system, coupled with a new postback queueing capability that ensures we receive and process all signals from media sources before we deliver postbacks, helped address the persistent industry challenge of rescinded and duplicated postbacks.
Scale has also been addressed. We can now process postback volumes many times higher than the highest peaks we have ever experienced. And the structure of the system enables us to add hardware seamlessly – ahead of new client implementations – so that our capacity always stays ahead of what we need.
We can now state confidently that we are more than ready for the increasing demands and postback volumes. We’re ready for unprecedented levels of scale. It’s something that we can discuss in greater detail with clients and prospects. In fact, it’s something we encourage companies to ask Apsalar, and our competitors, about as part of their due diligence processes.
We’re extremely proud of the team that made this complete redo of the postback system possible. They achieved this key milestone – all while making changes to the old system so that we could handle potential postback spikes while we created the new system. Most of all, we’re immensely happy to be able to deliver a new standard of accuracy, timeliness and capacity to our clients.