The Data Cleaning Techniques These Data Scientists Swear By

Written by Janey Zitomer
Published on May 26, 2020

At price transparency platform GoodRx, the process of setting data projects up for success begins before the team even receives data: It starts with a well-defined governance strategy centered around group contribution. 

Director of Data Analytics and Science Caroline Furman said that business stakeholders, compliance team members and engineers all affect data quality, whether indirectly through funding initiatives or directly through validation. That’s why all parties play a role in data cleaning. 

“While building a culture of data is important, so is explicitly setting roles and responsibilities for its management,” Furman said. 

Collaboration isn’t the only way to ensure data quality.

At Pluto TV, data professionals standardize, validate and de-duplicate data for analysis, reporting and machine learning. The result? The ability to replay data and conduct real-time record processing.

Caroline Furman
Director of Data Analytics and Science • GoodRx

According to Furman, her team doesn’t leave data cleaning and organization up to chance. Multiple people check the data at various points in the pipeline, ensuring the input is accurate, complete, consistent and timely. And there’s no shortage of data. GoodRx gathers prices for more than 70,000 pharmacies across the U.S. so customers can access up-to-date cost information.

 

Before you even get to the cleaning phase, what steps do you take to ensure the data you’re collecting is as precise as possible?

We work with many data sources of varying degrees of cleanliness, and we follow a collaborative approach that involves multiple teams. Data is critical to how we make decisions and measure success, so almost every team has a hand in quality, whether implicitly or explicitly.  

 

Tell us a bit about your approach to cleaning data.

We work closely with our data engineering team to take a non-destructive approach to data. Setting up for success begins before data even gets here. It starts with having a well-defined process and governance strategy. While building a culture of data is important, so is explicitly setting roles and responsibilities for its management. 

We securely store and archive the raw data we ingest so we can distinguish cleanliness issues present when the data first arrived from mistakes we introduce along its journey. As data flows through our pipelines and transforms into the shape analysts and data scientists use, multiple checks at various stages ensure accuracy, completeness, consistency, uniqueness and, of course, timeliness. 
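As a rough illustration of the kinds of checks Furman describes, here is a minimal sketch of a stage-level quality report in pandas. It is not GoodRx’s actual code; the column names and the 24-hour freshness threshold are hypothetical.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    """Basic completeness, uniqueness, accuracy and timeliness checks.

    Assumes hypothetical columns: pharmacy_id, price, and a tz-aware
    UTC datetime column updated_at.
    """
    now = pd.Timestamp.now(tz="UTC")
    return {
        # Completeness: required fields must be populated.
        "null_rows": int(df[["pharmacy_id", "price"]].isna().any(axis=1).sum()),
        # Uniqueness: one record per pharmacy per update time.
        "duplicate_rows": int(df.duplicated(subset=["pharmacy_id", "updated_at"]).sum()),
        # Accuracy: a price of zero or less is clearly wrong.
        "invalid_prices": int((df["price"] <= 0).sum()),
        # Timeliness: flag records more than a day old.
        "stale_rows": int(((now - df["updated_at"]) > pd.Timedelta("24h")).sum()),
    }
```

Running a report like this at each pipeline stage makes it possible to trace a quality issue back to the stage that introduced it.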

“It starts with having a well-defined process and governance strategy.” 

How have you made this process easier, faster and more efficient over time? 

Our data operations have come a long way. In the early days, our end users were the ones discovering data issues. Nowadays, GoodRx uses Airflow to schedule and orchestrate our data pipelines. We use Slack and PagerDuty to make sure our jobs are running smoothly, and we rely on automated testing and health dashboards in Looker to improve efficiency.  
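A minimal sketch of how that pattern fits together, assuming an Airflow failure callback that posts to a Slack incoming webhook; the DAG name, schedule, check logic and webhook URL are placeholders, not GoodRx’s actual configuration:

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK = "https://hooks.slack.com/services/..."  # placeholder URL

def alert_slack(context):
    """Failure callback: post the failed task's ID to a Slack channel."""
    failed_task = context["task_instance"].task_id
    requests.post(SLACK_WEBHOOK, json={"text": f"Task failed: {failed_task}"})

def run_quality_checks():
    """Placeholder for the kind of automated checks described above."""

with DAG(
    dag_id="daily_price_pipeline",  # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": alert_slack,  # alert on every task failure
    },
) as dag:
    PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)
```

With the callback in default_args, any task that fails pages the team instead of waiting for an end user to notice a bad number.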

 

Stephen Shelton
VP of Business Intelligence • Pluto TV

Pluto TV data scientists work closely with software test engineers to ensure accuracy and consistency across all product releases. VP of Business Intelligence Stephen Shelton said the online television service uses first-party and third-party libraries to enrich and classify its data.

 

Before you even get to the cleaning phase, what steps do you take to ensure the data you’re collecting is as precise as possible? 

Pluto TV runs on its data and on the quality of that data. We collect data from more than 20 unique devices. We use a rigorous set of quality assurance standards to ensure that each device generates a common set of data points for every platform. 
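A hedged illustration of that kind of parity check, as a minimal sketch; the required field set and the sample events are hypothetical, not Pluto TV’s actual QA suite:

```python
# Verify that every device platform emits the same common set of data
# points. REQUIRED_FIELDS and the sample events below are hypothetical.
REQUIRED_FIELDS = {"event_type", "device_id", "platform", "timestamp"}

def missing_fields(event: dict) -> set:
    """Return any required data points absent from an event payload."""
    return REQUIRED_FIELDS - event.keys()

sample_events = {
    "roku": {"event_type": "play", "device_id": "a1", "platform": "roku",
             "timestamp": "2020-05-26T00:00:00Z"},
    "web": {"event_type": "play", "device_id": "b2", "platform": "web"},
}

for platform, event in sample_events.items():
    gaps = missing_fields(event)
    if gaps:
        print(platform, "is missing:", sorted(gaps))
# Output: web is missing: ['timestamp']
```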

 

Tell us a bit about your approach to cleaning data.

Pluto TV leverages multiple data validation steps on both raw and processed data in our data lake and data warehouse, respectively. We standardize, validate and de-duplicate data for analysis, reporting and machine learning. The data science team partners with software test engineers and quality assurance employees to regularly monitor every software release for accuracy and consistency. 
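A minimal sketch of what standardizing, validating and de-duplicating might look like on a batch of viewing events, assuming pandas and hypothetical column names; this is illustrative, not Pluto TV’s actual pipeline code:

```python
import pandas as pd

def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize, validate and de-duplicate a batch of event records."""
    out = df.copy()
    # Standardize: canonical casing and a proper timestamp type.
    out["platform"] = out["platform"].str.strip().str.lower()
    out["timestamp"] = pd.to_datetime(out["timestamp"], utc=True)
    # Validate: drop records missing required fields.
    out = out.dropna(subset=["event_id", "platform", "timestamp"])
    # De-duplicate: keep one row per event ID.
    return out.drop_duplicates(subset="event_id", keep="first")
```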

The data engineering team monitors, manages and maintains our streaming data pipelines and our data lake. We process over three terabytes of data daily through five stages of capture: transient, raw, cleansed, trusted and curated. We validate streamed data against custom schemas for every event type and field. 
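Per-event-type schema validation might look something like the sketch below, using the open-source jsonschema library; the schema itself is an illustrative placeholder rather than one of Pluto TV’s actual event schemas:

```python
from jsonschema import ValidationError, validate

# Hypothetical schema for one event type; a real pipeline would keep one
# schema per event type and validate every field against it.
PLAY_EVENT_SCHEMA = {
    "type": "object",
    "required": ["device_id", "channel", "timestamp"],
    "properties": {
        "device_id": {"type": "string"},
        "channel": {"type": "string"},
        "timestamp": {"type": "string"},
    },
}

def is_valid_play_event(event: dict) -> bool:
    """Accept an event only if it conforms to its type's schema."""
    try:
        validate(instance=event, schema=PLAY_EVENT_SCHEMA)
        return True
    except ValidationError:
        return False
```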

“We standardize, validate and de-duplicate data for analysis, reporting and machine learning.”

How have you made this process easier, faster and more efficient over time? 

In 2019, Pluto TV rebuilt its entire data pipeline using the Apache Kafka platform to address the need for a real-time, fault-tolerant and highly scalable data infrastructure. The ability to replay data and conduct real-time record processing powers our data mining and model development, as well as our model training efforts.
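Replay in this sense means rewinding a consumer to the start of a topic and reprocessing every record. A minimal sketch with the kafka-python client, where the broker address, topic name and handler are placeholders:

```python
from kafka import KafkaConsumer, TopicPartition

def process(payload: bytes) -> None:
    """Placeholder handler for each record."""
    print(payload[:80])

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    enable_auto_commit=False,
)
partitions = [
    TopicPartition("viewing-events", p)  # hypothetical topic name
    for p in consumer.partitions_for_topic("viewing-events")
]
consumer.assign(partitions)
consumer.seek_to_beginning(*partitions)  # rewind for a full replay

for record in consumer:  # catches up, then keeps processing in real time
    process(record.value)
```

Because Kafka retains records on disk, the same topic serves both replay for model training and low-latency processing for live workloads.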

 

Responses have been edited for length and clarity. Images via listed companies.
