Putting predictive models into production is one of the most direct ways that data scientists can add value to an organization. By learning how to build and deploy scalable model pipelines, data scientists can own more of the model production process and more rapidly deliver data products.
This book provides a hands-on approach to scaling up Python code so that it runs in distributed environments and supports robust pipelines. Readers will learn how to set up machine learning models as web endpoints, serverless functions, and streaming pipelines using multiple cloud environments. It is intended for analytics practitioners who already have experience with Python libraries such as Pandas and scikit-learn, and it focuses on scaling up prototype models to production.
From startups to trillion-dollar companies, data science is playing an important role in helping organizations maximize the value of their data. This book helps data scientists level up their careers by taking ownership of data products, with applied examples that demonstrate how to:
- Translate models developed on a laptop to scalable deployments in the cloud
- Develop end-to-end systems that automate data science workflows
- Own a data product from conception to production
The accompanying Jupyter notebooks provide examples of scalable pipelines across multiple cloud environments, tools, and libraries (github.com/bgweber/DS_Production).
Here are the topics covered by Data Science in Production:
- Chapter 1: Introduction – This chapter will motivate the use of Python and discuss the discipline of applied data science, present the data sets, models, and cloud environments used throughout the book, and provide an overview of automated feature engineering.
- Chapter 2: Models as Web Endpoints – This chapter shows how to consume data from web endpoints and how to host machine learning models as endpoints using the Flask and Gunicorn libraries. We’ll start with scikit-learn models and also set up a deep learning endpoint with Keras.
- Chapter 3: Models as Serverless Functions – This chapter will build upon the previous chapter and show how to set up model endpoints as serverless functions using AWS Lambda and GCP Cloud Functions.
- Chapter 4: Containers for Reproducible Models – This chapter will show how to use containers for deploying models with Docker. We’ll also explore scaling up with ECS and Kubernetes, and building web applications with Plotly Dash.
- Chapter 5: Workflow Tools for Model Pipelines – This chapter focuses on scheduling automated workflows using Apache Airflow. We’ll set up a pipeline that pulls data from BigQuery, applies a model, and saves the results.
- Chapter 6: PySpark for Batch Modeling – This chapter will introduce readers to PySpark using the community edition of Databricks. We’ll build a batch model pipeline that pulls data from a data lake, generates features, applies a model, and stores the results in a NoSQL database.
- Chapter 7: Cloud Dataflow for Batch Modeling – This chapter will introduce the core components of Cloud Dataflow and implement a batch model pipeline for reading data from BigQuery, applying an ML model, and saving the results to Cloud Datastore.
- Chapter 8: Streaming Model Workflows – This chapter will introduce readers to Kafka and PubSub for streaming messages in a cloud environment. After working through this material, readers will know how to use these message brokers to create streaming model pipelines with PySpark and Dataflow that provide near real-time predictions.
Excerpts of these chapters are available on Medium (@bgweber), and a book sample is available on Leanpub.