1 Introduction
1.1 What is data science in production?
What does it mean for code to be running in production? I think there are three main characteristics:
It’s run on another computer. This is called your production environment and is typically a Linux server. This typically means its a different operating system to your development environment, which is often a Windows or Mac laptop. If you’ve never used Linux before, you’ve got some new stuff to learn. And even if your development environment is on a Linux server, your production environment is not going to be interactive, which means that you’re going to need to learn some new debugging skills and more about logging.
It’s run repeatedly. A production job is not a one off. It’s something’s that run repeatedly over months or years. This introduces new challenges because of all the things that can change over time: the data schema, package versions, system libraries, operating systems, the universe, and requirements.
Someone cares about the results. If your job stops working someone is going to bug you about it. This means that you need to take additional care when programming to minimise the chances of failure and if something goes wrong while you’re on vacation, you want your colleagues to be able to step in. That implies shared knowledge, tools, and processes.
1.2 Key vocabulary
1.2.1 Deploying vs executing
There are always at three steps in the ideal production environment:
From your development environment, you deploy your code to the production environment. This captures all dependencies that your code needs to run and transmits them along with your code to the production environment.
The production environment then tests your code. If the tests fail, the deployment fails, ensuring that you never deploy code with known bugs (unfortunately, there’s no way to protect against unknown bugs!). This step is optional, particularly when you’re starting out, but it’s important as you work on higher-stakes longer-term projects because it makes it easier for you to make changes to your code without worrying that you are breaking something unexpected.
If the tests pass, the production environment then executes the job, either on a schedule or on demand.
Typically your code will be executed many more times than it is deployed. This means that it often makes sense to do some upfront work in the deployment step in order to save time in the production step. For example, a deployment might involve creating a docker container with all packages and other dependencies. This takes a little time, but saves time when executing the job because you can just download the container, rather than having to install everything from scratch.
1.2.2 Production vs test vs development environments
This process gives rise to three different environments in which your code is run:
The production environment, where your code is executed. You want this environment to be as minimal as possible because these environments are usually created on the fly then thrown away, and you want this to happen as quickly as possible. The vast majority of production environments are inside a docker container.
The test environment, which is where you run your tests. This requires everything in the production environment, plus whatever testing tools you’re using, and is also highly likely to be a docker container.
Your development environment, which is where you write your code and test it interactively. This includes everything in the test environment, plus all the tools you use for interactive development (i.e. devtools/usethis).
It’s important to keep the deploy and execute steps clear in your mind because your production environment has to replicate your development environment as closely as possible. That means that part of deployment is capturing all the packages you used, and which versions that you currently have installed.
You will spend most of your time working in the development environment, but your code will mostly be run in the production environment.
1.2.3 Batch vs interactive jobs
It’s useful to split production jobs up into two categories based on whether or not someone is waiting for the job to complete:
A batch job is run in the background, typically either on a schedule or when some other job completes. A typical batch job renders documents, transforms and saves data, fits models, or sends notifications (or any combination of the above). If needed, these jobs can be computationally intensive because they’re running in the background.
In an interactive job, someone is waiting for the result, either a human (for an app) or another computer (for an API). They’re executed dynamically when someone needs them, and if there’s a lot of demand, multiple instances might be run at the same time. Interactive jobs should not generally be computationally intensive because someone is waiting for the results.
Some examples that help illustrate the difference are:
Depending on what technology you use to build it, a dashboard might be a batch job (e.g. flexdashboard, quarto dashboards) or an interactive job (e.g. shinydashboard, shiny + bslib).
If your interactive job is slow, a powerful and general technique is to pair it with a batch job that performs as much computation as possible and caches results.
Parameterising an RMarkdown report turns it from batch a job to an interactive job. (Behind the scenes parameterising a report turns it into a Shiny app that runs
rmarkdown::render()
on demand.)You could use shinylive to turn a shiny app from an interactive job to a batch job. This produces a static HTML file that you can deploy anywhere and the viewer’s computer now does any computation.
The online version of this book is an example a batch job. Every time I push a change to GitHub, a GitHub action renders the book and publishes it to GitHub pages.
1.2.5 “production” vs “Production”
It’s useful to draw the distinction between lower case and upper case production. “Production” is “production” with an additional constraint: whenever something breaks, someone gets paged regardless of the time of day or day or week. This is the definition of production that most IT organisations think of.
The highest level of caring leads to Production-with-a-capital-P which adds in paging: i.e, if something goes wrong, you need to fix it immediately, even if it’s 2am on Saturday morning. Most data science projects don’t end up in Production, not because you can’t put R in Production, but because you don’t want to put your data scientists in Production.
This book covers “production” not “Production” becuase we don’t believe that you want to put your data scientists in production. It’s not what they’re good at and not what they’re trained for.
1.3 Why R?
Because R is an unparalleled environment for exploratory data analysis. So why not use it for production to? It’s totally possible - it just requires some new tools and techniques, and a slightly different mindset, that you need for interactive analysis.
It is 100% possible to put R in production and many companies are already doing so. The goal of this book is to publicise the patterns that make it easiest and most effective, so that regardless of the size of the maturity of your data science org, you can put your R code into production with a minimum of fuss.
(Both the open source and pro sides of Posit have also spent a bunch of time making the whole process as painless as possible. There might not be the same range of deployment options available as in Python, but the options we provide have been thoughtfully design to “just work” so that you can focus on the data analysis challenges that you care about, rather than wrestling with systems to get your code running in production.)
Additionally, the tidyverse team and Posit developers generally have spent a lot of work across the stack to make all the pieces fit together smoothly, minimising the number of paper cuts that you’ll experience.
1.4 Platforms
Putting code into production requires some computational platform. While many of the details are the same regardless of platform, there are often minor, but important differences. Not possible to cover every possible platform so in this book we’ll focus on three:
GitHub Actions: Git-backed deployment for batch apps. Supports any programming language. Hosted offering. GitHub Actions are free for public repos and available to anyone with a GitHub account. They’re a great way to learn the basics of R in production and to generate outputs for the whole world to see. But it’s less likely that you’ll use them in a job. Very popular in open source world (e.g. many R packages use it for automated test) so useful tool regardless.
Posit Connect Cloud: Git-backed deployment for interactive apps (batch apps and push-button deployment coming soon). A centrally hosted service. Supports R and Python. At this moment in time is free and public, but in the future there will be a paid offering that is private.
Posit Connect: Git-backed and push-button deployment for interactive and batch apps. Supports R and Python. On prem offering designed to make publishing R (and now Python) data science content as easy as possible.
I have picked these three because they embrace three different ways of working, but they have all been tooled up specifically to support R. Two of them are created by my employer Posit. High quality hosting is expensive so they products cost money. However, Posit Connect Cloud will always have with a free version that can be used for publicly available products, and we also discuss GitHub Actions which is also free for public usage.