3 Production projects
This chapter will introduce you to our recommended practices for organising your production-bound projects. This advice helps you organise your own projects, but is particularly important when you’re collaborating with others. Our advice is fairly prescriptive because we want you to spend your time doing data science, not fretting about how to organise your project. It’s also much easier to collaborate with others if everyone shares the same project layout. But the most important thing is to be consistent, and you can’t be consistent unless you’ve documented what you do. So if you want to deviate from what we describe here, make sure to write it down!
In this chapter, you’ll learn about three important principles:
- A project is a self-contained directory: every file a project needs lives in that directory, and every file that project creates goes in that directory.
- There’s a single source of truth: you’ll inevitably end up with multiple copies of your project and you need to decide which one is “real”. We recommend using git, and making the copy of your project that lives in a central git repo the truth.
- Deployment is automated: deploying your code to production should involve minimal manual steps. We recommend you build a workflow so that as soon as your code is merged into the main branch, it gets deployed to your production server.
3.1 A project is a directory
The first, and most important, organisational principle is that each project you work on is defined by a self-contained directory. This directory includes all the files needed to run the project and all files created by the project; i.e. your code should not read or write any files outside of the project directory. This principle is the foundation of making your code work when it’s run on another machine (whether that’s a colleague’s computer or your production server).
3.1.0.1 Paths and working directories
When you deploy your project to production or share your project with other people, it’s not going to live in the same location. That means absolute paths won’t work and you should always use project-relative paths. The safest way to generate a project-relative path is to use here::here(), which always creates a path relative to the root of the current project. In other words, regardless of where or how you call here::here("data/foo.csv"), it will always point to the foo.csv file that lives in the data directory at the top level of the project.
You can also use relative paths1, but it’s important to note that these are relative to the working directory, and the working directory varies based on how your R code is run. For shiny apps, plumber APIs, and quarto/RMarkdown docs, the working directory will be the directory the file lives in. For R scripts, the working directory will be the project root2.
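For example, here’s a minimal sketch of the difference (the data/sales.csv file is hypothetical):

```r
# install.packages("here")  # if you don't already have it
library(here)

# Works no matter what the working directory is, because here() resolves
# paths from the project root (e.g. the directory containing the .Rproj
# file or the .git directory).
sales <- read.csv(here("data", "sales.csv"))

# A plain relative path only works if the working directory happens to be
# the project root, which depends on how the code is run.
sales <- read.csv("data/sales.csv")
```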
3.1.1 Files and subdirectories
Our advice on project structure is still evolving, but where possible we recommend that you follow the same conventions as R packages. We don’t believe that data analysis projects need to be packages, but where the two overlap, you get a number of tools for free. For example, this means:
- To add some overall documentation about your project, include a README.md (possibly generated from a README.qmd).
- When you start writing functions that you reuse in multiple places in your analysis, you should put those functions in files that live in R/.
- When your functions get important enough and complicated enough that you need to test them, your tests should live in tests/testthat/ (see the sketch below).
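To make that concrete, here’s a hypothetical sketch of a helper that lives in R/ and its matching test in tests/testthat/ (the function and file names are made up for illustration):

```r
# R/clean_names.R: a helper reused by several analysis scripts.
clean_names <- function(x) {
  x <- tolower(x)
  gsub("[^a-z0-9]+", "_", x)
}

# tests/testthat/test-clean_names.R: run with testthat::test_dir("tests/testthat").
library(testthat)
source(here::here("R", "clean_names.R"))

test_that("clean_names() lowercases and replaces punctuation", {
  expect_equal(clean_names("Net Revenue"), "net_revenue")
})
```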
Don’t worry if you’ve never created an R package before; we’ll cover these conventions in more detail as we work through this book.
In terms of naming individual files, we recommend that you follow the advice in the tidyverse style guide.
3.1.2 Dependencies
A literal reading of the rule to only read and write inside your project directory implies that you can’t use packages! That’s because the default configuration of R stores packages in one or two central locations. So whenever you call library() it’s going to read files outside of the project, and whenever you call install.packages() it’s going to write files outside of the project. Obviously your analyses are going to be severely limited if you can’t use packages, so how do we resolve this rule?
We’ll come back to this idea in much more detail later, but the basic idea is to use a package like {renv} that creates a project-specific library. That way library() will read and install.packages() will write inside your project.
That makes it ok to use library(), but you still shouldn’t use install.packages(). The key problem with install.packages() is that it defaults to installing the latest version of the package available on CRAN. If you started work on this project in 2020, and your collaborator started work on it in 2025, you might end up with radically different versions of the package, which might lead to differences in your analysis. We’ll come back to this problem in much greater detail in Chapter TODO.
Note that you won’t actually want to store the packages themselves inside your project because they’re rather large and you need different versions of the packages for different operating systems. Instead, you’ll record the package names and versions that your project needs in a metadata file, and the people using your project will install these packages as needed.
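A minimal sketch of what that workflow looks like with {renv} (we’ll walk through the details later in the book):

```r
# Run once per project: sets up a project-specific library and writes an
# renv.lock metadata file recording package names and versions (plus the
# version of R you're using).
renv::init()

# After you install or upgrade packages, update renv.lock to match.
renv::snapshot()

# On another machine (a collaborator's laptop or the production server),
# reinstall exactly the versions recorded in renv.lock.
renv::restore()
```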
3.1.3 Exceptions
There are a handful of exceptions to the self-contained rule. We’ll go into these in much more detail later in the book, but it’s good to be aware of them now:
- R: despite needing R to run your code, you won’t include it in your project. Instead, you’ll use tools like renv that automatically record the version of R that you’re running so that your deployment environment can use the same version.
- Secrets: depending on how your authentication system works, your project may need secrets like passwords or API keys in order to run. These should never be stored in the project itself. We’ll talk about alternatives in … (one common approach is sketched after this list).
- Data: by and large, your project will not include data, because your data will be retrieved on demand with a database query or API call. That doesn’t mean you won’t ever commit data — it’s certainly useful to include small supplementary datasets (like look-up tables that come from your data documentation rather than the database), and example data used for testing.
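As a preview, here’s a sketch of one common approach to secrets: keep them in environment variables, which for local development can live in a .Renviron file that you never commit, and read them at runtime. The API_KEY name below is hypothetical.

```r
# .Renviron (add it to .gitignore so it is never committed):
# API_KEY=not-a-real-key

# In your code, read the secret from the environment instead of hard-coding it.
api_key <- Sys.getenv("API_KEY")
if (!nzchar(api_key)) {
  stop("API_KEY is not set; see the project README for how to configure it.")
}
```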
3.2 There’s one source of truth
Whenever you have multiple copies of something, you need to think carefully about which one is going to be the “source of truth” (i.e. if they’re different, which one do you believe?). In production scenarios, you have at least two copies of the project: one in your development environment and one in your production environment. If it’s a collaborative project, there’s at least one more copy for each of your collaborators. So which of these copies should you “bless” and make the official source? I’d argue none of them: instead, you need to introduce one more copy, the one that lives in a central git repository. This pattern is illustrated in Figure 3.1.
We recommend this workflow because, while Git takes some time to learn, it provides a principled way to maintain and share code, tracking the provenance of every change. There are many ways to use Git, but a central repository is the easiest to understand and is well supported by modern tools like GitHub. The following sections go into more detail about why we recommend Git and GitHub.
(The idea of a single source of truth also applies to packages. The source of truth is not the packages installed in your library, but the packages you have recorded in a metadata file, because these are what the production server and your colleagues will use.)
3.2.1 Git
In this book, we assume that you’re already familiar with the basics of Git. You certainly don’t need to be an expert but you should know how to add and commit files, push and pull, and create branches. If you haven’t used Git before, I’d recommend that you put this book down now, and read Happy Git and GitHub for the useR.
I also expect you to know the other key skill for using Git: how to Google or ask an LLM for advice when you get yourself stuck in a situation that you’ve never seen before. Another great resource for common problems is Oh Shit, Git!?!.
3.2.2 GitHub
Git is valuable even if you use it locally; it makes it easier to understand how your code is changing over time, and makes it possible to undo mistakes. But Git is vastly more valuable when you use it to share your code with others, using a tool like GitHub. Throughout the rest of the book, we’ll talk about GitHub exclusively, but this is really just shorthand for your “Git-hosting platform”. There are many professional and open source solutions, including GitLab, Bitbucket, and Azure DevOps. You’re unlikely to get to choose, so you’ll need to adapt to what you’re given, but this is generally not too hard since modern platforms all provide pretty similar tools. If your organisation doesn’t already have access to some Git hosting platform, you should immediately start a campaign to get it3.
We’ll talk about GitHub a lot when it comes to working with your team. But GitHub is also great when you’re a solo data scientist or learning data science. As you’ll learn later in the book, GitHub Actions allows you to build your own production environment and gain experience putting jobs into production just like you would in a larger organisation.
If you’re just starting out in your career, GitHub is also a great place to build up a portfolio of data science projects that you can talk about when interviewing for jobs, showcasing some of the production jobs you’ve created for your own life. If you’re in that position, I’d highly recommend watching GitHub: How To Tell Your Professional Story, by Abigail Haddad.
3.3 Deployment is automated
Our final principle of the project lifestyle is that deployment is automated, based on the contents of your central git repository. The code that you deploy should ideally come from your Git host, not your local computer. In other words, we recommend that you use push-to-deploy (aka Git-backed deployments), not click-to-deploy from your IDE. It’s totally fine to start with click-to-deploy to learn the basics, but over time you want to transition to a fully automated solution.
We’ll go into the details later in the book, but we also recommend that deployments be gated behind tests. In other words, you push to start the deployment process, but it will only complete if all your tests pass. This decreases the chances of accidentally deploying broken code.
For high-stakes projects, we recommend working in a branch until you are confident that the work is correct. This allows you (and your colleagues) to review your code before it’s deployed, and gives you a safe space to iterate. As you’ll learn later in the book, you can also configure your automation to deploy branches to a “staging” environment. A staging environment is identical to your production environment in every way, except that only data scientists use it, not your users. This allows you to fully test your project before your stakeholders see it. Once you’re confident that your update is ready to go live, you merge your PR back into the main branch, and your automated system will deploy it to production.
1. i.e. never start the path with /, C:/, or ~.
2. Assuming you haven’t manually changed the working directory by calling setwd() or clicking some button in your IDE.
3. If you’re interviewing for a data science job, you should definitely ask what Git hosting platform they use. If they don’t use one, or worse don’t use Git, this is a major red flag.