by Stephen Ostrowski
May 21, 2020

60 billion raw events.

Around a million events per second. 

A petabyte of data. 

That’s what Kyle Kincaid estimates his team ingests daily at adtech firm Viant. Clearly, a shortage of data isn’t the obstacle for the Irvine-based company; the challenge is wrapping its head around all of it.

“We have 250 million users and 700 million different devices,” said Kincaid, the director of software engineering at Viant. “For us, it’s about understanding what impression opportunities are really going to drive revenue for our clients.”

That volume reflects the growth of the real-time bidding industry. To wit: Markets and Reports valued the market at $6.6 billion in 2019 and expects it to soar to $27.2 billion by 2024.

 

Viant's LA offices.

 

“With the introduction of real-time bidding, there are just explosions of events that create gigantic amounts of data,” said Pratik Patil, manager of software engineering. “Now that we have this data, which is literally a gold mine sitting around in storage, everybody wants to use it. People want more and more data, and they want the results very quickly.” 

To utilize the treasure trove of data, Viant’s operations team must ensure that the infrastructure in place can process it efficiently while remaining cost- and security-conscious, said VP of Technology Operations Lee Sautia.

“This large amount of data definitely puts a strain on infrastructure at some point,” Sautia said.

But Viant’s kept up, processing more data last year than in the previous 19 years, according to the company. How? As Kincaid, Patil and Sautia explained, it’s the byproduct of a multi-year initiative to develop internal tools, shift to the cloud and leverage the power of automation. 

 

What’s the data you’re looking at, and where is it coming from?

Kincaid: On the real-time bidding (RTB) end, we have about 60 billion raw events per day, roughly a million per second. The total data size, unfiltered, is close to a petabyte. The first stage is really just a reduction of the data down to the attributes that we or our clients may be interested in; we take it from a petabyte to about 60 terabytes. That’s the first ingestion of the runtime data, which is based on OpenRTB data describing our impression opportunities. We then generate a number of aggregations off of that data and derive reports from it.
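For a rough sense of what that first-stage reduction and aggregation can look like, here is a minimal sketch using Google's BigQuery Python client. The project, table and field names are hypothetical; Viant's actual schema and tooling are not public.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Keep only the attributes of interest from the raw events,
# writing the reduced rows into a much smaller table.
reduce_job = client.query(
    """
    SELECT event_ts, impression_id, advertiser_id, device_id, bid_price
    FROM `my-project.rtb.raw_events`
    WHERE DATE(event_ts) = CURRENT_DATE()
    """,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.rtb.reduced_events",
        write_disposition="WRITE_APPEND",
    ),
)
reduce_job.result()  # wait for the reduction to finish

# Derive a simple daily aggregation: impressions and spend per advertiser.
agg_job = client.query(
    """
    SELECT DATE(event_ts) AS day, advertiser_id,
           COUNT(*) AS impressions, SUM(bid_price) AS spend
    FROM `my-project.rtb.reduced_events`
    GROUP BY day, advertiser_id
    """
)
for row in agg_job.result():
    print(row.day, row.advertiser_id, row.impressions, row.spend)
```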

Patil: Kyle’s team’s ad data, about 60 terabytes per day, is the source for my projects. We also have a lot of integrations with external partners, from whom we get data on a daily, weekly or monthly basis. That adds roughly a few terabytes of external data each month. A lot of that data integration happens separately from the ad-serving input.

Sautia: Over the past year, we’ve worked to provide more of an automation platform with respect to infrastructure. The goal at some point is hopefully making it self-serve. We definitely surpassed the projections of how much data we were going to be saving. 

 

Power Users

What’s in a name? For Viant, it used to be advertisementbanners.com, the company’s name when it launched in 1999. With more than two decades of industry experience, Viant has leveraged a robust amount of data to inform a “people-based advertising” approach that delivers relevant, targeted advertising to consumers. The company’s cloud-based data lake connects marketers with data points on more than 1 billion user profiles, along with other sources.

 

What facilitated the increased data processing output?

Kincaid: We used to operate with a data warehousing team through which all reporting went. We slowly transitioned from this black-box reporting model, which had a very long turnaround time for integrating new data and generating new reports, to something much more self-service. Anyone within the company could use the tools that we’ve developed, Magnus and Goliath, to gain insight into whatever aspect of the data they’re interested in.

Patil: Previously, when you submitted a job to the data warehouse team, it would take around 24 hours to get the data back. With the introduction of new internal tools like potens.io and the data lake platform (DLP), anyone in the company, as well as outside clients, can get their data in seconds. All the teams want their hands on the data because everybody’s doing some kind of advanced reporting. Once these tools were in place, usage just exploded.

 

What internal teams helped facilitate these efforts? 

Kincaid: Our data warehousing team, which built the tools that we’re using, and the business intelligence team, which is digging to find insightful, actionable data. We’ve had to make adjustments to our ingest pipeline. We have tools written by our architect to make data accessible in near real-time to everybody. We’ve consolidated into a single schema with one central data repository that everything goes off of. 

Patil: In late 2015, Viant decided to go 100 percent cloud. The data warehouse team started building internal tools, while other teams worked out how to optimize queries, how to optimize data jobs and how to ramp up on this new cloud infrastructure. Now, we are at an optimal pace for development.

 


 

What were some of the challenges you faced during this shift?

Kincaid: On the ingest side, we have about a petabyte of data that we have to move every day, pulled from roughly 1,000 hosts. It was somewhat of a challenge to keep costs low, deal with infrastructure issues and get data efficiently into BigQuery. We had to write tools that let us transform data and efficiently stream it into our primary log source.
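The tools Kincaid mentions are internal, but the general pattern of transforming records and streaming them into BigQuery can be sketched with the public Python client. The destination table and the row fields below are invented for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.rtb.primary_log"  # hypothetical destination table

def transform(raw_event: dict) -> dict:
    # Flatten and rename just the fields downstream reports care about.
    return {
        "event_ts": raw_event["timestamp"],
        "impression_id": raw_event["imp_id"],
        "advertiser_id": raw_event.get("adv_id", "unknown"),
        "bid_price": float(raw_event.get("price", 0.0)),
    }

def stream_batch(raw_events: list[dict]) -> None:
    rows = [transform(e) for e in raw_events]
    errors = client.insert_rows_json(TABLE_ID, rows)  # streaming insert
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```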

Sautia: We weren’t able to move as fast initially because of our aging infrastructure platform. We were using outdated code pipelines, deployment methods and instance types in Amazon. As part of the infrastructure refresh, we redesigned building and shipping code from the ground up. All of that helped make swapping in newer and faster hardware easier, which in turn sped up our data pipeline.

Patil: Migrating to the cloud was a big challenge. We moved away from the traditional physical data centers, where you put in everything and it has a fixed cost. A lot of people in the company didn’t know how to optimize their queries. Teaching people how to store tables properly in the cloud in the first place, meaning how to partition them and how to cluster them, reduces costs drastically.
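As a concrete example of the kind of table setup Patil describes, here is how a date-partitioned, clustered BigQuery table might be created with the Python client. The schema and table name are hypothetical, not Viant's.

```python
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("advertiser_id", "STRING"),
    bigquery.SchemaField("impression_id", "STRING"),
    bigquery.SchemaField("bid_price", "FLOAT"),
]

table = bigquery.Table("my-project.rtb.reduced_events", schema=schema)
# Partition by day so queries scan only the dates they actually need...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
# ...and cluster by the column most often filtered on.
table.clustering_fields = ["advertiser_id"]

client.create_table(table)
```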
 


 

What advice would you have for any technologists looking to undertake a similar infrastructure shift?

Kincaid: Only build in-house what you absolutely need to. Leveraging cloud infrastructure has been a big win for us, as well as utilizing common schemas and really looking at the entire data pipeline. 

Sautia: Automate early. It’s harder to automate cloud infrastructure that’s already running. If you can start out automating deployment of instances and managed technologies within the cloud platform, it makes life easier in the end.
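Viant hasn't published its automation stack, but the spirit of Sautia's advice, scripting instance deployment instead of provisioning by hand, can be illustrated with a few lines of boto3 against AWS (the AMI ID, region and tags here are placeholders).

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Launch a tagged worker instance from code rather than the console,
# so the same definition can be versioned and repeated.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "role", "Value": "data-pipeline-worker"}],
    }],
)
print("Launched", response["Instances"][0]["InstanceId"])
```

In practice, teams often reach for declarative tools such as Terraform or CloudFormation for this, but the underlying point is the same: codify the deployment before the footprint grows.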

Patil: Every cloud infrastructure resource, storage and query costs something. Knowing cloud optimizations early helps reduce these costs. With the cloud, a lot of tech debt goes away; you move from traditional architecture to event-based architecture, which streamlines processes. Having a data retention policy set very early on, and built into your product requirements, goes a long way in saving costs.
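A retention policy like the one Patil recommends can be enforced directly in the warehouse. Here is a minimal sketch with the BigQuery Python client, assuming a hypothetical dataset name: a default table expiration keeps old data from accumulating indefinitely.

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = client.get_dataset("my-project.rtb")  # hypothetical dataset
# Expire tables 90 days after creation so storage costs stay bounded.
dataset.default_table_expiration_ms = 90 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```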
