Data Scientists Share What Technologies They’re Using to Build Their Data Pipelines — and Why

Written by Alton Zenon III
Published on Jan. 28, 2020

Scaling is a key consideration when selecting data tools at AI engineering company Beyond Limits and Seriously, a mobile game startup.

Seriously Data Engineer Alex Cano said his team doesn’t have to spend much time managing and scaling infrastructure because many of its tools, like Google Pub/Sub, handle that work for them. That frees up bandwidth for Seriously’s data team to tackle more pressing challenges, like supporting the more than 3,000 levels in its game “Best Fiends.”

Meanwhile, Beyond Limits Data Science Manager Michael Andric said his team relies on Amazon Simple Storage Service (S3) — a popular data science tool due to its limitless data storage capabilities — to expand the amount of data they can work with and their ability to manage it. 

However, there’s more to scaling than choosing the right tools. For Andric, open communication among his team and with stakeholders is key, while Cano said refactoring old code helps his team keep performance on track.

 

Michael Andric
Data Science Manager • Beyond Limits

Beyond Limits creates AI solutions designed to provide human-like cognitive reasoning for companies across industries including energy, manufacturing and healthcare. That work means its data team handles huge data sets. Data Science Manager Michael Andric said that, given this volume, the team turns to tools like Kafka and Amazon S3 for their ability to handle expansive (and growing) pools of information.

 

What tools are you using to build your data pipeline, and why did you choose those technologies?

Our data pipeline incorporates a mix of technologies that allow flexibility, both in terms of how team members gain direct access to the data they need and for the types of processes they want to perform. 

For live streaming data, we use Kafka because it is effective at moving large amounts of data reliably and quickly. For raw data storage, we use Amazon S3 because it is scalable, resilient and very cost-effective. From there, we can query and run our extract, transform and load (ETL) process to bring data into our PostgreSQL cluster, which allows team members to easily perform analytical functions at scale.
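As a rough sketch of what that first hop can look like in practice, the snippet below consumes events from a hypothetical “sensor-events” Kafka topic and lands them in a hypothetical “raw-events” S3 bucket using the kafka-python and boto3 client libraries. The topic, bucket and batch size are assumptions for illustration, not Beyond Limits’ actual setup.

```python
# Sketch of a Kafka -> S3 landing step; topic, bucket and batch size are invented for illustration.
import json
import time
import uuid

import boto3                       # AWS SDK for Python
from kafka import KafkaConsumer    # kafka-python client

consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:  # flush every 1,000 events into one object
        key = f"raw/{time.strftime('%Y/%m/%d')}/{uuid.uuid4()}.jsonl"
        body = "\n".join(json.dumps(event) for event in batch)
        s3.put_object(Bucket="raw-events", Key=key, Body=body.encode("utf-8"))
        batch = []
```

From S3, a scheduled ETL job can then pick up those files and load the results into PostgreSQL for analysis.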

Our data pipeline incorporates a mix of technologies that allow flexibility.”

 

As you scale, what steps are you taking to ensure your data pipeline continues to scale with the business?

The most important step is continuing to talk to each other. A culture of communication across the team helps to ensure individual data needs are being met right now, while also harnessing our collective viewpoints to keep a pulse on new tools that would help our company grow over time. 

 

Alex Cano
Data Engineer • Seriously

Many of the tools used by Seriously’s data team are chosen for their ease of use, according to Data Engineer Alex Cano. For instance, his team uses the data platform Snowflake, which integrates with many of the systems data scientists already employ and uses machine learning-driven data preparation to cut down on redundant and tedious tasks.

 

What tools are you using to build your data pipeline, and why did you choose those technologies?

We’re on Google Cloud Platform, so a majority of our data pipeline resides within Google’s services. For our ingestion pipeline, our main app is built on Google App Engine, which forwards events to Google Pub/Sub. After data lands in Pub/Sub, events are processed, enriched and anonymized before landing in raw form in Snowflake. Once in Snowflake, we transform our data in an ELT pattern — executed by Airflow and hosted on Google Cloud Composer — to provide properly normalized and validated views and tables. We chose these services primarily because they are managed (we don’t have a dedicated infrastructure team) and horizontally scalable.
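A minimal sketch of that Pub/Sub processing step might look like the following, assuming a hypothetical “game-events” subscription, an “anonymized-events” topic and a “my-project” project ID, and using the google-cloud-pubsub client. The anonymization shown here (hashing the user ID) is an illustrative choice, not necessarily how Seriously does it.

```python
# Sketch of a Pub/Sub processing step; project, subscription and topic names are invented.
import hashlib
import json

from google.cloud import pubsub_v1  # google-cloud-pubsub client

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()
subscription = subscriber.subscription_path("my-project", "game-events")
topic = publisher.topic_path("my-project", "anonymized-events")


def handle(message: pubsub_v1.subscriber.message.Message) -> None:
    event = json.loads(message.data.decode("utf-8"))
    # Anonymize: replace the raw user ID with a one-way hash before it is stored.
    event["user_id"] = hashlib.sha256(event["user_id"].encode("utf-8")).hexdigest()
    # Enrich: tag the event with the pipeline stage for downstream debugging.
    event["pipeline_stage"] = "processed"
    publisher.publish(topic, json.dumps(event).encode("utf-8"))
    message.ack()


streaming_pull = subscriber.subscribe(subscription, callback=handle)
streaming_pull.result()  # block and process messages as they arrive
```

From there, the anonymized events can be loaded into Snowflake in raw form and reshaped by the Airflow-driven ELT jobs.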

We use Kubernetes as an executor to run as much code in parallel as possible.”

 

As you scale, what steps are you taking to ensure your data pipeline continues to scale with the business?

We routinely refactor old code when its processing time doesn’t scale well. We have historical information about our batch job runs in Airflow, so we can tell which code needs to be refactored for performance.
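One way to surface those refactoring candidates is to query Airflow’s own metadata database, where the task_instance table records a duration for every task run. The sketch below assumes it runs inside an environment where Airflow and its metadata database are configured, and simply ranks tasks by average run time.

```python
# Sketch: rank Airflow tasks by average run time to spot refactoring candidates.
# Assumes it runs where Airflow and its metadata database are configured.
from sqlalchemy import func

from airflow.models import TaskInstance
from airflow.settings import Session

session = Session()
slowest = (
    session.query(
        TaskInstance.dag_id,
        TaskInstance.task_id,
        func.avg(TaskInstance.duration).label("avg_seconds"),
    )
    .filter(TaskInstance.duration.isnot(None))
    .group_by(TaskInstance.dag_id, TaskInstance.task_id)
    .order_by(func.avg(TaskInstance.duration).desc())
    .limit(10)
    .all()
)

for dag_id, task_id, avg_seconds in slowest:
    print(f"{dag_id}.{task_id}: {avg_seconds:.0f}s on average")
```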

We also try to leverage Airflow’s scheduling capability. Then we use Kubernetes as an executor to run as much code in parallel as possible, instead of one large batch job that takes several hours.
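One common way to get that fan-out from Airflow onto Kubernetes is the KubernetesPodOperator (the team may equally rely on the KubernetesExecutor). The DAG below is a sketch with invented names that runs one pod per data partition instead of a single serial batch job.

```python
# Sketch: fan a scheduled transform out into parallel Kubernetes pods, one per partition.
# DAG ID, namespace, image and partition names are invented for illustration.
# Import path assumes the apache-airflow-providers-cncf-kubernetes package.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="transform_events",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    for partition in ["ios", "android", "web"]:
        # Each task launches its own pod, so the partitions run side by side
        # rather than as one multi-hour serial batch job.
        KubernetesPodOperator(
            task_id=f"transform_{partition}",
            name=f"transform-{partition}",
            namespace="data-pipelines",
            image="gcr.io/my-project/transformer:latest",
            arguments=["--partition", partition],
        )
```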

 

Responses have been edited for length and clarity. Images via listed companies.
