How to Build a Successful Data Science Workflow

Written by Janey Zitomer
Published on Apr. 21, 2020

The diversification of data science tools and technologies in recent years has broadened the data landscape. While that’s beneficial, GoodRx’s Henry Mei said, managers must choose wisely from a potentially overwhelming number of options to find what best fits department and company needs.

GoGuardian data scientists stay away from proprietary platforms and any services that claim to make “data science easy for everyone.” Ryan Johnson, director of science and analytics, said those tools limit advanced teams in terms of languages, libraries, model architectures or deployment options.

“We want the flexibility and freedom to push boundaries, experiment with the latest model architectures and try novel ways to deploy our models,” Johnson said. 

The following data scientists don’t let the assortment of tools and technologies distract them from developing and maintaining stable workflows. Instead, they focus on the often overlooked “human factor” in data and on harnessing the complementary skills of data scientists and engineers. 

 

GoGuardian

GoGuardian data scientists have benefited from their ability to create and tear down compute and storage resources without oversight from other departments. Not having that additional red tape allows the team to adjust plans depending on site traffic and server load. Johnson said that the fewer barriers to stability and reproducibility, the better. 
 

Tell us a bit about your technical process for building data science workflows. 

Our data science products typically move through the following phases: problem definition with the product team; data acquisition and labeling (we build a lot of classifiers); exploratory data analysis and proof of concept; stakeholder approval; model training and testing; additional stakeholder approval; and operationalization with our engineering team. 

To continuously update the model, we repeat the model training, testing and stakeholder approval phases.

Throughout the phases outlined above, we stick to flexible open-source tools such as Python, Jupyter for exploratory data analysis, Pandas, NumPy, Boto, Keras, TensorFlow, PyTorch and Flask. All code is managed in GitHub and executed in Docker containers. We also make heavy use of basic cloud infrastructure such as AWS EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service). Finally, we build our own tools to make our lives easier. We hope to open source some of these internal tools soon.
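As a rough illustration of how those pieces can fit together (not GoGuardian’s actual pipeline), the sketch below pulls a labeled file from S3 with Boto3, loads it with Pandas and fits a small Keras classifier. The bucket, key and column names are placeholders.

```python
# Illustrative sketch only: placeholder bucket, key and column names.
import boto3
import pandas as pd
from tensorflow import keras

# Pull a labeled dataset down from S3 (Boto3 is the AWS SDK for Python).
s3 = boto3.client("s3")
s3.download_file("example-ds-bucket", "labeled/events.csv", "events.csv")

# Exploratory loading with Pandas; "label" is a hypothetical target column.
df = pd.read_csv("events.csv")
X = df.drop(columns=["label"]).values
y = df["label"].values

# A deliberately tiny Keras binary classifier, just to show the shape of the workflow.
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(X.shape[1],)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, validation_split=0.2)
```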

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Peer review is the gold standard in science. We find that peer review of all code and logic helps ensure high-quality code and sound scientific thinking. Data science teams should be committing to a version control system and creating pull requests that are reviewed by other data scientists or engineers. These reviews should cover exploratory data analysis, model training and code headed for deployment.

We use Docker containers to make sure data scientist “A” can spin up a container and run data scientist “B’s” code with minimal effort. We also containerize our models to make it easier for our engineering team to deploy. All this Docker use will eventually create some additional overhead around managing Docker images, but it’s less of a burden than asking other data scientists or other teams to work with non-containerized code.
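As a rough sketch of what that hand-off can look like (not GoGuardian’s actual tooling), the Docker SDK for Python lets one data scientist run a teammate’s containerized code without recreating the underlying environment. The image name, command and mount paths below are hypothetical.

```python
# Illustrative sketch only: hypothetical image name, command and mount paths.
import docker

client = docker.from_env()  # talk to the local Docker daemon

# Data scientist "A" runs data scientist "B's" containerized training code.
# Logs come back as bytes once the container finishes.
logs = client.containers.run(
    image="registry.example.com/ds-team/classifier:latest",
    command="python train.py --epochs 5",
    volumes={"/data/shared": {"bind": "/data", "mode": "ro"}},  # shared data, read-only
    remove=True,  # clean up the container after it exits
)
print(logs.decode())
```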

The data science team should have the ability to create and tear down compute and storage resources on their own. Data science workloads oscillate wildly between needing very little compute and storage and needing a great deal of both. Trying to manage this through ticket submission will create barriers to reproducibility and stability, among other problems.

“Make heavy use of cloud services such as on-demand computing and flexible storage.”

What advice do you have for other data scientists looking to improve how they build their workflows?

Data scientists really shouldn’t be reinventing the wheel here. Engineering teams have well-established processes for writing code to build products. Models are just a different type of product. Pick up some basic software engineering skills and practice them before implementation.

Make heavy use of cloud services such as on-demand computing and flexible storage. You can completely automate and standardize the creation, utilization and removal of these resources into a few lines of code. Data science teams can easily develop simple tools or scripts that spin these resources up in a reliable and repeatable manner at the start or end of a project. 
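A minimal sketch of that idea with Boto3, assuming placeholder resource names and a default region; a real script would add error handling and make sure the bucket is empty before deleting it.

```python
# Illustrative sketch only: placeholder AMI, instance type and bucket name.
import boto3

ec2 = boto3.resource("ec2")
s3 = boto3.client("s3")

# Start of a project: create scratch storage and an on-demand compute instance.
# (In regions other than us-east-1, create_bucket also needs a LocationConstraint.)
s3.create_bucket(Bucket="example-ds-project-scratch")
instance = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
)[0]
instance.wait_until_running()

# End of the project: tear everything back down so it stays cheap and repeatable.
instance.terminate()
s3.delete_bucket(Bucket="example-ds-project-scratch")  # bucket must already be empty
```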

 

GoodRx

GoodRx is largely tool-agnostic, but Mei, a data science manager, said they standardize around Python and SageMaker for production models. He attributes successful workflows to the creation of processes — possibly where none previously existed — and collaboration between data scientists and engineers. 
 

Tell us a bit about your technical process for building data science workflows. 

I have really enjoyed working with AWS, Google Cloud and Databricks because these platforms allow for seamless integration of a variety of services. They are also supported with documentation by large communities. GoodRx is largely tool-agnostic for data exploration and model-building, but we standardize around Python and SageMaker when it comes to production models. 
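As an illustrative sketch only (not GoodRx’s code), a production-style training-and-deployment step might look like the following; the entry point, IAM role and instance types are placeholders, and parameter names follow version 2 of the SageMaker Python SDK.

```python
# Illustrative sketch only: placeholder entry point, IAM role and instance types.
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="0.23-1",
)

# Train on data already staged in S3, then stand up a real-time endpoint.
estimator.fit({"train": "s3://example-bucket/train/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Tear the endpoint down when it is no longer needed.
predictor.delete_endpoint()
```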

There has been a huge diversification of tools and technologies in recent years. There isn’t a clear winner anymore. The data landscape is broad and it’s become more difficult to answer this question. 

 

What processes or best practices have you found most effective in creating workflows that are reproducible and stable?

Automate a reproducible and stable workflow in code. There are decades of hard-earned software engineering best practices in continuous integration, which data professionals should take inspiration from. 

Data scientists come from a range of backgrounds, and many strong data scientists may still not have expertise in automation. Moreover, project automation can be unusually complicated. Both code and data need to be tested and models can be non-deterministic. 
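One common pattern for testing non-deterministic training code is to pin random seeds where possible and assert that a metric clears a threshold rather than matches an exact value. The synthetic data and stand-in model below are illustrative only.

```python
# Illustrative sketch only: synthetic data and a stand-in model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def test_classifier_clears_threshold():
    rng = np.random.RandomState(0)           # pin the seed so the test is repeatable
    X = rng.normal(size=(500, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple, learnable signal

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

    # Assert a tolerance band rather than an exact score, since training can vary.
    assert model.score(X_test, y_test) > 0.80
```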

I’m not advocating that data scientists must also be engineers, but many patterns from engineering best practices can and should be adopted for data science. I recommend using version control and pull requests to manage change as well as pairing with colleagues on model development and code.

“Having the right tools does not necessarily create stable workflows.”

What advice do you have for other data scientists looking to improve how they build their workflows?

Different problems often require different tools. Having the right tools does not necessarily create stable workflows. The human factor in data is often overlooked. Building a culture of quality takes a village and involves many complementary skills. It often requires creating processes where none exist.

 

Responses have been edited for length and clarity. Images via listed companies.
