3  Project-oriented data science

While the focus of this book is putting your R code into production, we also need to lay out some principles for how to organise your work. In this chapter, you’ll learn the principles of project-oriented data science. This advice helps you organise your own projects and helps you share your work with others.

Overall, our advice is fairly prescriptive because we want you to spend most of your time doing data science, not fretting about how to organise your project. And it’s much easier to collaborate with others if everyone shares the same project layout. But the most important thing is that your team shares consistent and documented principles; so if you deviate from what we describe here, make sure to write it down!

There are three basic principles to project-oriented data science:

  • Each project is a directory.

  • Each project has a single source of truth.

  • Deployments are automated.

Before we can get to those principles we need to recommend a little setup.

3.1 Setup

One of the advantages of choosing to live the project-oriented lifestyle is that you get a bunch of free tools from the usethis package. We’ll use these throughout the book, so we recommend automatically loading usethis in your .Rprofile so they’re always available to you. The easiest way to do so is to use the helper supplied by usethis:

usethis::use_usethis()

I highly recommend running that function and following its advice before continuing.
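
For reference, the advice it gives boils down to adding a snippet like this to your .Rprofile (a sketch of the usual recommendation; the exact wording may differ):

# Load usethis in interactive sessions only, so scripts don't depend on it
if (interactive()) {
  suppressMessages(require(usethis))
}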

If you use RStudio, we also highly recommend that you run usethis::use_blank_slate() to ensure that RStudio never saves or reloads state. This forces you to capture all important information in your code files, ensuring that you never have objects in your global environment that you don’t know how to recreate. You don’t need to worry about this if you use Positron, since it defaults to those settings.

3.2 Each project is a directory

The most important principle of project-oriented data science is that a project is defined by a directory. This directory should contain all the files needed to run the project, and all the files created by the project. In other words, your code should not read or write any files outside of the project directory.

Whenever you refer to a file in your project, always use a relative path (i.e. never start the path with /, C:/, or ~). And since you should only use files inside your project, you should never use paths starting with .. either. Relative paths are relative to the working directory, which should always be the project directory (i.e. you should never use setwd() to manually change it). This is important because when you share your project with other people, you can’t expect it to live in the same place, so absolute paths are unlikely to work.
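
For example (a sketch with a hypothetical data file):

# Good: a relative path, resolved from the project directory
sales <- read.csv("data/sales.csv")

# Bad: absolute paths that only work on one particular machine
sales <- read.csv("C:/Users/me/sales-project/data/sales.csv")
sales <- read.csv("~/sales-project/data/sales.csv")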

There is one small challenge when using R Markdown and Quarto: paths in an .Rmd/.qmd are relative to the location of the file, not the project. If this becomes a problem, you can work around it by using here::here(). This function builds paths starting from the project directory, so you can use here::here("data/my-data.csv") to refer to a file in the project regardless of where the file that uses it lives.
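
A minimal sketch, with a hypothetical file layout:

# Inside reports/summary.qmd, a bare "data/my-data.csv" would be resolved
# relative to reports/; here::here() always starts from the project root
my_data <- read.csv(here::here("data/my-data.csv"))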

3.2.1 Dependencies

One less obvious consequence of the project-directory connection is how to handle packages. Usually in R, all your packages live in one or two central locations given by .libPaths(). So whenever you call library() it’s going to read files outside of the project, and whenever you call install.packages() it’s going to write files outside of the project. Obviously your analyses are going to be severely limited if you can’t use packages, so what can you do?

We’ll come back to this idea in much more detail later, but the basic idea is to use a package like {renv} that creates a project-specific library. That way library() will read, and install.packages() will write, inside your project. That makes it ok to use library(), but you still shouldn’t use install.packages(). The key problem with install.packages() is that it defaults to installing the latest version of the package available on CRAN. If you started work on this project in 2020, and your collaborator started work on it in 2025, you might end up with radically different versions of a package, which might lead to differences in your analyses. We’ll come back to this problem in much greater detail in Chapter TODO.

Note that you won’t actually want to store the packages themselves inside your project: they’re rather large, and you need different builds of each package for different operating systems. Instead, you’ll record the package names and versions that your project needs in a metadata file, and the people using your project will install those packages as needed.
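
As a preview, the core renv workflow looks roughly like this (a sketch; we’ll cover the details later):

renv::init()      # create a project-specific library and a renv.lock metadata file
renv::snapshot()  # record the packages (and versions) the project currently uses
renv::restore()   # on another machine, reinstall exactly those versions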

There’s one last tool that never lives inside the project: R itself. Unfortunately there’s currently no standard way to record which version of R your project requires inside the project.

3.2.2 Secrets

There’s one exception to the rule that everything your project needs should be contained within the project: secrets, which are generally used for authentication.
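
One common pattern (not necessarily the only one we’ll discuss) is to keep secrets in environment variables, for example in a .Renviron file that is never committed, and read them at runtime. A minimal sketch with a hypothetical API key:

# MY_SERVICE_API_KEY is a hypothetical secret set outside the project,
# e.g. in ~/.Renviron or in your deployment platform's settings
api_key <- Sys.getenv("MY_SERVICE_API_KEY")
if (!nzchar(api_key)) {
  stop("MY_SERVICE_API_KEY is not set")
}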

3.2.3 Files and subdirectories

Where relevant, we recommend that you follow the same conventions as R packages. We don’t believe that every data analysis project needs to be a package, but where the two overlap, following the package conventions gets you a number of tools for free. For example, this means:

  • To add some overall documentation about your project, include a README.md (possibly generated from a README.qmd).

  • When you start writing functions that you reuse in multiple places in your analysis, you should put those functions in files that live in R/.

  • When your functions get important enough and complicated enough that you need to test them, your tests should live in tests/testthat/.

Don’t worry if you’ve never created an R package before; we’ll cover these conventions in more detail as we work through this book.
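
If you have usethis loaded (Section 3.1), its helpers will create these files in the conventional places for you. A quick sketch (the file names are hypothetical, and some helpers assume a package-like setup):

usethis::use_readme_md()      # create README.md
usethis::use_r("helpers")     # create R/helpers.R
usethis::use_test("helpers")  # create tests/testthat/test-helpers.R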

3.2.4 File naming advice

  1. File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.
  2. File names should be human readable: use file names to describe what’s in the file.
  3. File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.

For example, here’s a directory of files that ignores these principles:
alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt

There are a variety of problems here: it’s hard to tell which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport), and some names don’t describe their contents (run-first and temp).

Here’s a better way of naming and organizing the same set of files:

01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt

3.2.5 Data

While your code is going to fetch data and possibly save it locally, you probably won’t want to commit that data to your repo (that’s a recipe for your repo exploding in size). Make sure to add that sort of data to your .gitignore so that you never accidentally commit it. That doesn’t mean you won’t ever commit data: it’s certainly useful to include small supplementary datasets (like lookup tables that come from your data documentation rather than the database), and example data that you use for testing.
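
For example, assuming your code caches downloaded data under data/raw/ (a hypothetical path), you can tell Git to ignore it with:

usethis::use_git_ignore("data/raw/")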

3.3 Each project has a single source of truth

Whenever you have multiple copies of something, you need to carefully think about which one is going to be the “source of truth” (i.e. if they’re different, which one do you believe?). In production scenarios, you have at least two copies of the project: one in your development environment and one in your production environment. If it’s a collaborative project there’s at least one more copy for each of your collaborators. So which of these copies should you “bless” and make it the official source? I’d argue none of them: instead you need to introduce one more copy, which is the copy that lives in a central git repository. This pattern is illustrated in Figure 3.1.

[Figure: a flowchart in which bidirectional arrows connect “You”, “Colleague A”, and “Colleague B” to a central “git” box, and a dashed one-way arrow runs from “git” to “Production”.]
Figure 3.1: A collaborative development workflow where multiple team members interact with a central git repository, and code is deployed from that repository to production.

We recommend this workflow because, while Git takes some time to learn, it provides a principled way to maintain and share code, tracking the provenance of every change. There are many ways to use Git, but a central repository is the easiest to understand and is well supported by modern tools like GitHub. The following sections go into more detail about why we recommend Git and GitHub.

(The idea of a single source of truth also applies to packages. The source of truth is not the packages installed in your library, but the packages you have recorded in a metadata file, because these are what the production server and your colleagues will use.)

3.3.1 Git

In this book, we assume that you’re already familiar with the basics of Git. You certainly don’t need to be an expert but you should know how to add and commit files, push and pull, and create branches. If you haven’t used Git before, I’d recommend that you put this book down now, and read Happy Git and GitHub for the useR.

I also expect you to know the other key skill for using Git: how to Google or ask an LLM for advice when you get yourself stuck in a situation that you’ve never seen before. Another great resource for common problems is Oh Shit, Git!?!.

3.3.2 GitHub

Git is valuable even if you use it locally; it makes it easier to understand how your code is changing over time, and makes it possible to undo mistakes. But git is vastly more valuable when you use it to share your code with others, using a tool like GitHub. Throughout the rest of the book, we’ll talk about GitHub exclusively, but this is really just a shorthand for saying your “Git-hosting platform”. There are many professional and open source solutions including GitLab, Bitbucket, or Azure DevOps. You’re unlikely to get to choose so you’ll need to adapt to what you’re given; but this is generally not too hard since the modern platforms all provide pretty similar tools. If your organisation doesn’t already have access to some git hosting platform, you should immediately start a campaign to get it1.
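
If you have a local project that isn’t on a hosting platform yet, one way to get it there is with usethis. A sketch for GitHub specifically (other platforms have their own tooling):

usethis::use_git()     # put the project under local version control
usethis::use_github()  # create a repository on GitHub and push to it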

We’ll talk about GitHub a lot when it comes to working with your team. But GitHub is also great when you’re a solo data scientist or learning data science. As you’ll learn later in the book, GitHub Actions allows you to build your own production environment and experience putting jobs into production just like you would in a larger organisation.

If you’re just starting out in your career, GitHub is also a great place to build up a portfolio of data science projects that you can talk about when interviewing for jobs, showcasing some of the production jobs you’ve created for your own life. If you’re in that position, I’d highly recommend watching GitHub: How To Tell Your Professional Story, by Abigail Haddad.

3.4 Automated deployments

Our last principle of the project lifestyle is that deployment is automated, based on the contents of your central git repository. The code that you deploy should ideally come from your Git host, not your local computer. In other words, we recommend that you use push-to-deploy (aka Git-backed deployments), not click-to-deploy from your IDE. It’s totally fine to start with click-to-deploy to learn the basics, but over time you want to transition to a fully automated solution.

We’ll go into the details later in the book, but we also recommend that deployments be gated behind tests. In other words, you push to start the deployment process, but it will only complete if all your tests pass. This decreases the chances of accidentally deploying broken code.

For high-stakes projects, we also recommend working in a branch until you are confident that the work is correct. As we’ll see, you can configure a branch to deploy into your staging environment so that you can experiment with confidence, without having to worry that stakeholders will see your mistakes. (This also gives your colleagues a way to review your code before it’s merged.) Once you’re confident, you can merge your branch back into the main flow of development and have the results automatically pushed to production.


  1. If you’re interviewing for a data science job, you should definitely ask what Git hosting platform they use. If they don’t use one, or worse don’t use Git, this is a major red flag.↩︎