How LeaseLock Builds Successful Data Workflows
If you have a hammer, everything can look like a nail.
People tend to over-rely on the tools they’re most familiar with, even if superior tools are available.
When it comes to building workflows in the data science industry, this sentiment is especially true. Why? There is no one-size-fits-all approach to solving any given data science problem, which means data scientists who rely on tried-and-true tools to build workflows might be missing out on simpler and more effective methods.
“In the data realm, software engineers usually neglect some steps in their workflow,” Amir Katoozian, data engineering manager at LeaseLock, said. “Adding those steps to a data science workflow requires time and a certain level of software engineering expertise. It will also contribute to a more robust workflow.”
Learning which tools and processes companies use to solve different problems can be enormously helpful to data scientists. That’s why Built In LA sat down with Katoozian to get the inside scoop on his processes.
What are your favorite tools that you’ve used to build your data science workflow?
Airflow enables the orchestration of the data engineering pipeline using directed acyclic graphs (DAGs). LeaseLock uses a dimensional modeling data structure to create ready-to-use data assets. This model consists of roughly 20 dimension table tasks and about six fact table tasks. Airflow helps manage how these tasks run in relation to each other. At LeaseLock, Airflow’s two main use cases are parallelizing independent tasks for faster execution and managing dependencies, so that a task only runs after all of its upstream tasks have finished successfully.
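To make the scheduling idea concrete, here is a minimal sketch in plain Python. It uses the standard library's graphlib module rather than Airflow, so it illustrates the concept and is not LeaseLock's actual pipeline. The table names and the plan_batches helper are hypothetical. In Airflow itself, the same relationships are typically declared between tasks with the >> operator.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of upstream
# tasks that must finish before it is allowed to start.
dependencies = {
    "dim_property": set(),
    "dim_resident": set(),
    "dim_date": set(),
    "fact_lease": {"dim_property", "dim_resident", "dim_date"},
    "fact_claim": {"dim_property", "dim_date"},
}


def plan_batches(deps):
    """Group tasks into batches. Tasks in the same batch have all of
    their upstream tasks finished, so they could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task that is unblocked now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unblocking downstream tasks
    return batches


batches = plan_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {', '.join(batch)}")
```

In this sketch the three dimension tasks land in the first batch because nothing blocks them, and the two fact tasks wait for the second batch.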
Looker is a business intelligence (BI) and visualization tool that enables non-technical users to easily create their own customized reports. The main advantage of Looker is that you can define the relationship between a group of tables in your database in what is called an “explore.” You can also define metrics, which are called measures, using one or more tables. The easy-to-digest Looker user interface (UI) allows anyone to pull the columns and metrics and create a custom report.
What are some of your best practices for creating reproducible and stable workflows?
Using a continuous integration and continuous delivery (CI/CD) platform is important to make sure your production environment is running smoothly. At LeaseLock, we use CircleCI as our CI/CD tool to deploy any code changes. You can use a CI/CD platform to test the installation of dependencies, set rules around linting and formatting, run unit tests and generate coverage reports. You can also package up your application (e.g., as a Docker image) and deliver it to the production platform.
Using a version control platform such as GitHub allows us to track code changes and identify the cause of failures in a workflow. It also enables us to set merge rules around pull request titles, description of changes, mandatory reviews, etc.
All CI/CD platforms can and should be integrated with GitHub. The combination of the two is the key to a reproducible and stable workflow.
What advice do you have for other data scientists looking to improve their workflows?
In the data realm, software engineering teams usually neglect some steps in their workflow. Adding those steps to a data science workflow requires time and a certain level of software engineering expertise, but will contribute to a more robust workflow.
- Unit testing: Test all functions and logic by creating hypothetical test cases to make sure each function works as expected. Python has a library for this called pytest.
- Integration testing: Compare the output of a modified workflow with the output of the original workflow to ensure the changes are aligned with your expectations.
- Version control: Many data science teams ignore version control and instead focus more on the science and math behind the solutions they provide. Adding Git to your workflow might seem like a headache that adds extra steps, but its benefits are absolutely worth the hassle. Slowing down and taking some time to execute the process correctly yields reproducible and more accurate results!
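The first two practices can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LeaseLock's code: monthly_rent_total, diff_outputs and the sample rows are all hypothetical. The test_ functions use plain assert statements, which pytest collects and runs when the file is passed to it, and diff_outputs shows one simple way to compare the output of a modified workflow against the original.

```python
def monthly_rent_total(rents):
    """Hypothetical business logic: sum rents, skipping missing values."""
    return sum(rent for rent in rents if rent is not None)


# Unit tests: pytest collects functions named test_* and reports any
# assert that fails.
def test_total_sums_values():
    assert monthly_rent_total([1000, 1500]) == 2500


def test_total_skips_missing_values():
    assert monthly_rent_total([1000, None]) == 1000


def test_total_of_no_rents_is_zero():
    assert monthly_rent_total([]) == 0


# Integration-style check: compare rows produced by the original and the
# modified workflow, matched on a key column.
def diff_outputs(original_rows, modified_rows, key):
    original = {row[key]: row for row in original_rows}
    modified = {row[key]: row for row in modified_rows}
    shared = original.keys() & modified.keys()
    return {
        "missing": sorted(original.keys() - modified.keys()),
        "added": sorted(modified.keys() - original.keys()),
        "changed": sorted(k for k in shared if original[k] != modified[k]),
    }


# Hypothetical sample outputs from the two versions of a workflow.
original_output = [
    {"lease_id": 1, "total": 2500},
    {"lease_id": 2, "total": 1000},
]
modified_output = [
    {"lease_id": 1, "total": 2500},
    {"lease_id": 2, "total": 1100},
    {"lease_id": 3, "total": 900},
]

report = diff_outputs(original_output, modified_output, key="lease_id")
print(report)
```

An empty report means the change did not alter the output. Any entries under missing, added or changed should match what the change was expected to do before it is merged.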
LeaseLock’s real estate insurtech platform creates more valuable and efficient rental properties for owners and operators.