3 Production projects
This chapter will introduce you to our recommended practices for organising your production projects. Our advice is fairly prescriptive because we want you to spend your time doing data science, not fretting about how to organise your project. It’s also much easier to collaborate with others if everyone shares the same project layout. But the most important thing is to be consistent, and you can’t be consistent unless you’ve documented what you do. So if you want to deviate from what we describe here, make sure to write it down!
In this chapter, you’ll learn about three important principles:
- A project is a self-contained directory: every file a project needs lives in that directory, and every file that project creates goes in that directory.
- There’s a single source of truth: you’ll inevitably end up with multiple copies of your project, and you need to decide which one is “real”. We recommend using Git, and treating the copy of your project that lives in your central Git repo as the truth.
- Deployment is automated: deploying your code to production should involve minimal manual steps. We recommend you build a workflow so your product is deployed as soon as it’s pushed to the main branch of your Git repo.
Let’s dive in!
3.1 Each project is a self-contained directory
Each project you work in should be defined by a self-contained directory. Self-contained means two things:
- The directory includes all the files needed to run the project; i.e. it only reads files inside this directory.
- The project only ever modifies files in this directory; i.e. it only writes files inside this directory.
This principle is the foundation for making your code work when it’s run on another machine (either a colleague’s or your production server).
3.1.1 Organising your project
We recommend that every project have either a `main.R` or `index.qmd` in the project root. This file is the project entry point, and will be run when your project is deployed. For very simple projects, you can put all of your code inside these files. But as soon as your project no longer fits in one file, we recommend moving everything to a separate `exec/` directory. This helps keep a clear distinction between your data science product and the code and metadata that supports its deployment. You can organise your `exec/` directory however makes sense for you, but we recommend that you follow the advice in the tidyverse style guide to ensure that your file names convey as much usable information as possible.
Once you have a separate `exec/` directory, your `main.R` will just provide the code needed to execute your product:
- For a report or dashboard: `quarto::quarto_render("exec/index.qmd")`.
- For a book or website: `quarto::quarto_render("exec/")`.
- For an R script: `source("exec/main.R", chdir = TRUE)`.
- For a Shiny app: `shiny::runApp("exec/")`.
- For a plumber API: `plumber::plumb("exec/")`.
Later on, we’ll introduce the producethis package, which stores metadata in your `DESCRIPTION` file so you can use `producethis::exec()` in your `main.R`, regardless of your product type.
Over time, most projects will accumulate other files and directories, such as the following (a sketch of a typical layout appears after this list):
- A `README.md` for overall project-level documentation. This is a great place to describe the overall goal of the project, provide a link to the deployed product, and generally to jot down anything important that you and your collaborators need to remember about this project.
- A `renv.lock` file that specifies the exact versions of the dependencies needed to run your project.
- A pair of `R/` and `tests/testthat`¹ directories. You’ll create these when you extract out the first reusable function, which you’ll put in a file inside of `R/`. You can then write tests for this code, which live in `tests/testthat`. This is an important technique as your project grows because it allows you to isolate key pieces and test them independently from the rest of the product. This gives you the confidence to refactor your code and update your dependencies while knowing that your product will continue to be correct.
- `data/`: small supplementary datasets. Most data science products will access data dynamically as needed, but it’s sometimes useful to include a small amount of rarely changing static data.
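To make this concrete, here’s a sketch of what a mature project might look like. The specific file names inside `R/`, `tests/testthat/`, and `data/` are hypothetical; `fs::dir_tree()` is just a convenient way to print a directory tree.

```r
# A hypothetical mature project, printed with fs::dir_tree():
fs::dir_tree("my-project")
#> my-project
#> ├── main.R                    # entry point, run on deployment
#> ├── README.md                 # project-level documentation
#> ├── renv.lock                 # exact dependency versions
#> ├── exec/
#> │   └── index.qmd             # the data science product itself
#> ├── R/
#> │   └── helpers.R             # reusable functions
#> ├── tests/
#> │   └── testthat/
#> │       └── test-helpers.R    # tests for R/helpers.R
#> └── data/
#>     └── lookup.csv            # small, rarely changing static data
```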
(If you’ve created an R package before, a lot of these directories might look familiar. That’s no coincidence! Sharing as much structure as possible with packages allows us to reuse a lot of tools and makes it easier to learn one if you already know the other. That said, we don’t believe that production projects should be packages, because there are subtle but important differences in their aims and structure.)
It’s ok to include multiple data science products in a single project as long as they are very closely related (i.e. they share the same dependencies and use many of the same reusable functions). In that case, we recommend that you put each product in a subdirectory of `exec/` and have a matching `main-{subdirectory-name}.R` file for each directory, as sketched below. Deploying multi-product projects is a bit more complicated and we’ll cover it in XYZ.
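For example, a project containing both a dashboard and a plumber API (all names here are hypothetical) might look like this:

```r
# A hypothetical two-product layout, printed with fs::dir_tree():
fs::dir_tree("my-project")
#> my-project
#> ├── main-dashboard.R   # runs quarto::quarto_render("exec/dashboard/index.qmd")
#> ├── main-api.R         # runs plumber::plumb("exec/api/")
#> └── exec/
#>     ├── dashboard/
#>     │   └── index.qmd
#>     └── api/
#>         └── plumber.R
```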
3.1.2 Paths and working directories
When you deploy your project to production or share your project with other people, it’s not going to live in the same location. That means absolute paths won’t work, and you need to use relative paths². There are two types of relative paths (a short example follows this list):
- Project-relative paths start at the project root, and can be created with `here::here()`. You should only need project-relative paths infrequently, because the tools we recommend will automatically load files from the most commonly used project directories. For example, `producethis::load()` will automatically source all files in `R/` and lazily load all datasets in `data/`, and `producethis::test()` will automatically run the tests in `tests/testthat`.
- Product-relative paths start at the same location as the app, API, report, or script that you’re deploying. These will generally be the same as `here::here("exec/")` unless you have multiple products in subdirectories.
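Here’s a minimal sketch of the difference (the file names are invented for illustration; `here::here()` resolves from the project root regardless of the current working directory):

```r
# Project-relative: resolves from the project root, no matter where the
# code is run from.
sales <- readRDS(here::here("data", "sales.rds"))

# Product-relative: code inside exec/report.qmd can use a plain relative
# path, which resolves from the product's own directory...
logo <- "figures/logo.png"                        # i.e. exec/figures/logo.png

# ...which is equivalent to this project-relative path:
logo <- here::here("exec", "figures", "logo.png")
```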
In general, avoid manually changing the working directory by calling `setwd()` or clicking some button in your IDE. Rely on relative paths to clearly and consistently indicate where files are located.
3.1.3 Dependencies
A literal reading of the rule to only read and write inside your project directory implies that you can’t use packages! That’s because the default configuration of R stores packages in one or two central locations. So whenever you call `library()` it’s going to read files outside of the project, and whenever you call `install.packages()` it’s going to write files outside of the project. Obviously your analyses are going to be severely limited if you can’t use packages, so how do we resolve this tension?
We’ll come back to this idea in much more detail later, but the basic idea is to use a package like {renv} that creates a project-specific library. That way, `library()` will read and `install.packages()` will write inside your project.
That makes it ok to use `library()`, but you still shouldn’t use `install.packages()`. The key problem with `install.packages()` is that it defaults to installing the latest version of the package available on CRAN. If you started work on this project in 2020, and your collaborator started work on it in 2025, you might end up with radically different versions of the package, which might lead to differences in your analyses. We’ll come back to this problem in much greater detail in Chapter TODO.
Note that you won’t actually want to store the packages inside your project, because they’re rather large and you need different versions for different operating systems. Instead, you’ll record the package names and versions that your project needs in a metadata file, and the people using your project will install these packages as needed.
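We’ll cover this properly later, but as a preview, here’s a minimal sketch of the renv workflow (these are real renv functions, though your exact workflow may differ):

```r
renv::init()      # create a project-specific library and an renv.lock file
# ...develop as usual; packages now install into the project library...
renv::snapshot()  # record the exact package versions in renv.lock

# On another machine (a collaborator's, or the production server):
renv::restore()   # reinstall exactly the versions recorded in renv.lock
```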
3.1.4 Exceptions
There are a handful of exceptions to the self-contained rule. We’ll go into these in much more detail later in the book, but it’s good to be aware of them now:
- R: despite needing R to run your code, you won’t include it in your project. Instead, you’ll use tools like renv that automatically record the version of R that you’re running so that your deployment environment can use the same version.
- Secrets: depending on how your authentication system works, your project may need secrets like passwords or API keys in order to run. These should never be stored in the project itself. We’ll talk about alternatives in …; one common approach is sketched below.
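For example, one common approach (not the only one) is to read secrets from environment variables at runtime; the variable name below is hypothetical:

```r
# Read a secret from the environment rather than from a file in the project.
api_key <- Sys.getenv("ACME_API_KEY")  # hypothetical variable name
if (!nzchar(api_key)) {
  stop("ACME_API_KEY is not set; see the README for how to configure it.")
}
```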
3.2 There’s one source of truth
Whenever you have multiple copies of something, you need to think carefully about which one is going to be the “source of truth” (i.e. if they’re different, which one do you believe?). In production scenarios, you have at least two copies of the project: one in your development environment and one in your production environment. If it’s a collaborative project, there’s at least one more copy for each of your collaborators. So which of these copies should you “bless” as the official source? I’d argue none of them: instead, you need to introduce one more copy, the copy that lives in a central Git repository. This pattern is illustrated in Figure 3.1.
Figure 3.2 shows a less ideal workflow, where colleagues who want to see your code download it from your production server. This works ok if they just want to look at your code, but if they edit it, how do they get the changes back to you? It’s also easy for them to edit it and then deploy back to production, leading to an inconsistency between your version and their version.
We recommend this workflow because, while Git takes some time to learn, it provides a principled way to maintain and share code, tracking the provenance of every change. There are many ways to use Git, but a central repository is the easiest to understand and is well supported by modern tools like GitHub. The following sections go into more detail about why we recommend Git and GitHub.
(The idea of the single source of truth also applies to packages. The source of truth is not the packages installed in your library, but the packages you have recorded in a metadata file because these are what the production server and your colleagues will use.)
We recommend using one git repo per project, but there’s a common alternative: the monorepo. When using a monorepo all of your projects live inside a single git repo. That is compatible with our workflow, as long as you ensure that your project doesn’t read from or write to any of the other top-level directories in the monorepo.
3.2.1 Git
In this book, we assume that you’re already familiar with the basics of Git. You certainly don’t need to be an expert but you should know how to add and commit files, push and pull, and create branches. If you haven’t used Git before, I’d recommend that you put this book down now, and read Happy Git and GitHub for the useR.
I also expect you to know the other key skill for using Git: how to Google or ask an LLM for advice when you get yourself stuck in a situation that you’ve never seen before. Another great resource for common problems is Oh Shit, Git!?!.
3.2.2 GitHub
Git is valuable even if you use it locally; it makes it easier to understand how your code is changing over time, and makes it possible to undo mistakes. But Git is vastly more valuable when you use it to share your code with others, using a tool like GitHub. Throughout the rest of the book, we’ll talk about GitHub exclusively, but this is really just shorthand for your “Git hosting platform”. There are many professional and open source solutions, including GitLab, Bitbucket, and Azure DevOps. Your organisation will usually provide this as a service, and you’ll need to adapt to what you’re given. Fortunately this is generally not too hard, since all the modern platforms provide very similar tools. If your organisation doesn’t already provide GitHub or similar, you should immediately start a campaign to get it³.
We’ll talk about GitHub a lot when it comes to working in a team. The most important tools are issues and pull requests. Issues provide a great way to track TODO items for your project, making it easy to see what work remains. Pull requests facilitate code review, which is an extremely powerful technique for improving quality and upskilling your entire team.
GitHub is also great when you’re a solo data scientist or learning data science. As you’ll learn later in the book, GitHub Actions allows you to build your own production environment and experience putting jobs into production just like you would in a larger organisation. If you’re just starting out in your career, GitHub is also a great place to build up a portfolio of data science projects that you can talk about when interviewing for jobs, showcasing some of the production jobs you’ve created for your own life. If you’re in that position, I’d highly recommend watching GitHub: How To Tell Your Professional Story, by Abigail Haddad.
3.3 Deployment is automated
Our final principle of the project lifestyle is that deployment is automated, based on the contents of your central Git repository (this is often called continuous deployment). The code that you deploy should come from your central Git repo, not your local computer. In other words, we recommend that you use push-to-deploy (aka Git-backed deployments), not click-to-deploy from your IDE. It’s totally fine to start with click-to-deploy to learn the basics, but over time you want to transition to a fully automated solution.
We’ll go into the details later in the book, but we also recommend that deployments be gated behind tests. In other words, you push to start the deployment process, but it will only complete if all your tests pass. This decreases the chances of accidentally deploying broken code.
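To make that concrete, here’s an illustrative sketch of a gated deployment step. The gating logic here is ours, not a prescribed setup; `testthat::test_dir()` is a real testthat function, and `producethis::exec()` was introduced earlier:

```r
# Illustrative only: refuse to run the product unless the full test suite
# passes. stop_on_failure = TRUE turns any test failure into an error,
# which aborts the deployment before the product is executed.
testthat::test_dir("tests/testthat", stop_on_failure = TRUE)

# Only reached if every test passed:
producethis::exec()
```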
For high-stakes projects, we recommend working in a branch until you are confident that the work is correct. This allows you (and your colleagues) to review your code before it’s deployed, and gives you a safe space to iterate. As you’ll learn later in the book, you can also configure your automation to deploy branches to a “staging” environment. A staging environment is identical to your production environment in every way, except that only data scientists use it, not your users. This allows you to fully test your project before your stakeholders see it. Once you’re confident that your update is ready to go live, you merge your PR back into the main branch, and your automated system will deploy it to production.
3.4 Summing it up
Footnotes

1. You might wonder why tests don’t just live in `tests/`. Firstly, this allows for the use of testing packages other than testthat (including packages that don’t exist now but might in the future). Secondly, this matches the directory structure of R packages, which is useful because it allows us to share tools between production projects and packages.
2. i.e. never start the path with `/`, `C:/`, or `~`.
3. If you’re interviewing for a data science job, you should definitely ask what Git hosting platform they use. If they don’t use one, or worse, don’t use Git, this is a major red flag.