1 Introduction

1.1 What is data science in production?

What does it mean for code to be running in production? I think there are three main characteristics:

It’s run on another computer. This is called your production environment and is typically a Linux server. This typically means it’s a different operating system to your development environment, which is often a Windows or Mac laptop. If you’ve never used Linux before, you’ve got some new stuff to learn. And even if your development environment is on a Linux server, your production environment is not going to be interactive, which means that you’re going to need to learn some new debugging skills and more about logging. You likely won’t have control over this server either - you’ll have an IT administrator, or a team of IT folks whose expertise is in things like Linux, network security, authentication and identity access management, cloud operations, and more. You’re going to meet new people, and this is a good thing, because if you had to take on these roles to administer your production environment, you wouldn’t have time to do the important data science work that you now need to put into production.
It’s run repeatedly. A production job is not a one off. It’s something’s that run repeatedly over months or years. This introduces new challenges because of all the things that can change over time: the data schema, package versions, system libraries, operating systems, the universe, and requirements.
Someone cares about the results. If your job stops working someone is going to bug you about it. This means that you need to take additional care when programming to minimise the chances of failure and if something goes wrong while you’re on vacation, you want your colleagues to be able to step in. That implies shared knowledge, tools, and processes.

1.2 What is data science in production done well?

Data science in production can sometimes happen by accident. You create something of value, it’s shared across the organization, and before you know it, you’re “in production,” but the foundation of this production system was so hastily put together that it could crumble much like the houses built of straw and sticks in the fable about the three little pigs.

The topics of this book are selected to provide a well-rounded and strong foundation for your data science in production. The chapters in this book address:

Getting code to production: How do we actually move code from development to a production server? Not just once, but in a way that’s repeatable and automated.
Code in production: When running code repeatedly in production, code efficiency, optimization, and observability become much more important, and the consequences of dependency changes are higher. This chapter will help you prepare your code to be robust and performant, while planning for what could go wrong and how to avoid surprises.
Working with data: How do you access the data you need in production, securely?
Managing packages and environments: We discuss the best practices in package management, reproducibility, and strategies for striking a balance between package availability, environment stability, and secure environments.
Security: We discuss security practices within your code and package dependencies, and ensuring secure access to your production data science so that the only people who have access to your work are those who should have access
Operations: Version control, automation, monitoring, auditing. These are the elements that provide the grease to the wheels of your production machine.
Infrastructure and architecture: It takes more than optimized code to keep a production environment humming. The infrastructure your code runs on can be critical to production success.
The “Us” in production: You, your team, and your organization will need to work together to establish and follow organizational norms and processes. And you’ll work with teams in different organizations with different roles in supporting production. We’ll discuss those roles and appreciate their perspectives, making it easier to have productive conversations when it’s time to collaborate.

1.3 Key vocabulary

1.3.1 Data science products

Documents produced by Quarto and RMarkdown (including dashboards). In this book we’ll focus on Quarto, but the same ideas apply to RMarkdown too.
Interactive apps produced by Shiny. These are the bread and butter of most data scientists. Like reports, they’re aimed at humans, but instead of being a static document that’s only controlled by the data scientist, they’re an interactive collaboration between the data scientist and the end-user.
APIs produced by Plumber. Sometimes you don’t need to communicate to other humans inside your organization, but other computers. This is where an API comes in: you provide an API that’s a black box from the outside but runs R code inside. A common use case of APIs for data scientists is to provide access to models: the caller supplies some data and the model returns a prediction or classification.
Models. Often exposed via an API but have their own development and deployment process.
Websites (including books and blogs) produced by Quarto. Sometimes you need more than one Quarto file, you need multiple Quarto files that live in one directory and are cross-linked. You can think of this as a book if it’s something that you’d tend to read through in linear order, a blog if it’s primarily organized in chronological order, or otherwise a website. I think every team should have at least one website, a Quarto site where you define your team’s recommended practices.
Packages. Like websites, I think every team should have at least one. Your first package should collect together your most common data loading code, any functions for computing common metrics, and the themes you use to make your other reports and apps match your corporate style guide.
Scripts that have side effects, like updating data or sending notifications.

1.3.2 Deploying vs executing

There are always at three steps in the ideal production environment:

From your development environment, you deploy your code to the production environment. This captures all dependencies that your code needs to run and transmits them along with your code to the production environment.
The production environment then tests your code. If the tests fail, the deployment fails, ensuring that you never deploy code with known bugs (unfortunately, there’s no way to protect against unknown bugs!). This step is optional, particularly when you’re starting out, but it’s important as you work on higher-stakes longer-term projects because it makes it easier for you to make changes to your code without worrying that you are breaking something unexpected.
If the tests pass, the production environment then executes the job, either on a schedule or on demand.

Typically your code will be executed many more times than it is deployed. This means that it often makes sense to do some upfront work in the deployment step in order to save time in the production step. For example, a deployment might involve creating a docker container with all packages and other dependencies. This takes a little time, but saves time when executing the job because you can just download the container, rather than having to install everything from scratch.

1.3.3 Production vs test vs development environments

This process gives rise to three different environments in which your code is run:

The production environment, where your code is executed. You want this environment to be as minimal as possible because these environments are usually created on the fly then thrown away, and you want this to happen as quickly as possible. The vast majority of production environments are inside a docker container.
The test environment, which is where you run your tests. This requires everything in the production environment, plus whatever testing tools you’re using, and is also highly likely to be a docker container.
Your development environment, which is where you write your code and test it interactively. This includes everything in the test environment, plus all the tools you use for interactive development (i.e. devtools/usethis).

It’s important to keep the deploy and execute steps clear in your mind because your production environment has to replicate your development environment as closely as possible. That means that part of deployment is capturing all the packages you used, and which versions that you currently have installed.

You will spend most of your time working in the development environment, but your code will mostly be run in the production environment.

1.3.4 Batch vs interactive jobs

It’s useful to split production jobs up into two categories based on whether or not someone is waiting for the job to complete:

A batch job is run in the background, typically either on a schedule or when some other job completes. A typical batch job renders documents, transforms and saves data, fits models, or sends notifications (or any combination of the above). If needed, these jobs can be computationally intensive because they’re running in the background.
In an interactive job, someone is waiting for the result, either a human (for an app) or another computer (for an API). They’re executed dynamically when someone needs them, and if there’s a lot of demand, multiple instances might be run at the same time. Interactive jobs should not generally be computationally intensive because someone is waiting for the results.

Some examples that help illustrate the difference are:

Depending on what technology you use to build it, a dashboard might be a batch job (e.g. flexdashboard, quarto dashboards) or an interactive job (e.g. shinydashboard, shiny + bslib).
If your interactive job is slow, a powerful and general technique is to pair it with a batch job that performs as much computation as possible and caches results.
Parameterising an RMarkdown report turns it from batch a job to an interactive job. (Behind the scenes parameterising a report turns it into a Shiny app that runs rmarkdown::render() on demand.)
You could use shinylive to turn a shiny app from an interactive job to a batch job. This produces a static HTML file that you can deploy anywhere and the viewer’s computer now does any computation.
The online version of this book is an example a batch job. Every time I push a change to GitHub, a GitHub action renders the book and publishes it to GitHub pages.

1.3.5 Push-button vs git-backed

There are two common ways to deploy a job:

Push-button deployment: As the name suggests, you push a button in your IDE and deployment just happens. This is a great place to start and is convenient when iterating at the beginning of a new project, especially when it’s just you working on it.
Git-backed deployment: You commit your code and push it (this time to a central git repository), and then it’s automatically deployed (if you have tests, they must pass). This requires a bit more work but is the workflow that we recommend for the long term because it clearly establishes the git repo as the source of truth and enables multiple people to collaborate on the same job.

1.3.6 “production” vs “Production”

It’s useful to draw the distinction between lower case and upper case production. “Production” is “production” with an additional constraint: whenever something breaks, someone gets paged regardless of the time of day or day or week. This is the definition of production that most IT organisations think of.

The highest level of caring leads to Production-with-a-capital-P which adds in paging: i.e, if something goes wrong, you need to fix it immediately, even if it’s 2am on Saturday morning. Most data science projects don’t end up in Production, not because you can’t put R in Production, but because you don’t want to put your data scientists in Production.

This book covers “production” not “Production” because we don’t believe that you want to put your data scientists in production. It’s not what they’re good at and not what they’re trained for.

1.4 Why R?

Because R is an unparalleled environment for exploratory data analysis. So why not use it for production too? It’s totally possible - it just requires some new tools and techniques, and a slightly different mindset, that you need for interactive analysis.

It is 100% possible to put R in production and many companies are already doing so. The goal of this book is to publicise the patterns that make it easiest and most effective, so that regardless of the size of the maturity of your data science org, you can put your R code into production with a minimum of fuss.

(Both the open source and pro sides of Posit have also spent a bunch of time making the whole process as painless as possible. There might not be the same range of deployment options available as in Python, but the options we provide have been thoughtfully design to “just work” so that you can focus on the data analysis challenges that you care about, rather than wrestling with systems to get your code running in production.)

Additionally, the tidyverse team and Posit developers generally have spent a lot of work across the stack to make all the pieces fit together smoothly, minimising the number of paper cuts that you’ll experience.

1.5 Platforms

Putting code into production requires some computational platform. While many of the details are the same regardless of platform, there are often minor, but important differences. Not possible to cover every possible platform so in this book we’ll focus on three:

GitHub Actions: Git-backed deployment for batch apps. Supports any programming language. Hosted offering. GitHub Actions are free for public repos and available to anyone with a GitHub account. They’re a great way to learn the basics of R in production and to generate outputs for the whole world to see. But it’s less likely that you’ll use them in a job. Very popular in open source world (e.g. many R packages use it for automated test) so useful tool regardless.
Posit Connect Cloud: Git-backed deployment for interactive apps (batch apps and push-button deployment coming soon). A centrally hosted service. Supports R and Python. At this moment in time is free and public, but in the future there will be a paid offering that is private.
Posit Connect: Git-backed and push-button deployment for interactive and batch apps. Supports R and Python. On prem offering designed to make publishing R (and now Python) data science content as easy as possible.

I have picked these three because they embrace three different ways of working, but they have all been tooled up specifically to support R. Two of them are created by my employer Posit. High quality hosting is expensive so they products cost money. However, Posit Connect Cloud will always have with a free version that can be used for publicly available products, and we also discuss GitHub Actions which is also free for public usage.