3  Project-oriented data science

While the focus of this book is putting your R code into production, we also need to lay out some principles for how to organise your work. In this chapter, you’ll learn the principles of project-oriented data science. This advice helps you organise your own projects and helps you share your work with others.

Overall, our advice is fairly prescriptive because we want you to spend most of your time doing data science, not fretting about how to organise your project. And it’s much easier to collaborate with others if everyone shares the same project layout. But the most important thing is that your team shares consistent and documented principles; so if you deviate from what we describe here, make sure to write it down!

There are three basic principles to project-oriented data science:

  • Each project is a directory.

  • Each project has a single source of truth.

  • Deployments are automated.

Before we can get to those principles we need to recommend a little setup.

3.1 Setup

One of the advantages of choosing to live the project-oriented lifestyle is that you get a bunch of free tools from the usethis package. We’ll use these throughout the book, so we recommend automatically loading usethis in your .Rprofile so they’re always available to you. The easiest way to do so is to use the helper supplied by usethis:

usethis::use_usethis()

I highly recommend running that function and following its advice before continuing.
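
For reference, the advice it gives boils down to adding a snippet like this to your .Rprofile (a sketch of the usual recommendation; the exact wording may differ):

# Load usethis in interactive sessions only, so scripts don't depend on it
if (interactive()) {
  suppressMessages(require(usethis))
}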

If you use RStudio, we also highly recommend that you run usethis::use_blank_slate() to ensure that RStudio never saves or reloads state. This forces you to capture all important information in your code files, ensuring that you never have objects in your global environment that you don’t know how to recreate. You don’t need to worry about this if you use Positron, since it defaults to those settings.

3.2 Each project is a directory

The most important principle of project-oriented data science is that a project is defined by a directory. This directory should contain all the files needed to run the project, and all the files created by the project. In other words, your code should not read or write any files outside of the project directory.

Whenever you refer to a file in your project, always use a relative path (i.e. never start the path with /, C:/, or ~). And since you should only use files inside your project, you should never use paths starting with .. either. Relative paths are relative to the working directory, which should always be the project directory (i.e. you should never use setwd() to manually change it). This is important because when you share your project with other people, you can’t expect it to live in the same place, so absolute paths are unlikely to work.
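
For example (a sketch with a hypothetical data file):

# Good: a relative path, resolved from the project directory
sales <- read.csv("data/sales.csv")

# Bad: absolute paths that only work on one particular machine
sales <- read.csv("C:/Users/me/sales-project/data/sales.csv")
sales <- read.csv("~/sales-project/data/sales.csv")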

There is one small challenge when using R Markdown and Quarto: paths in an .Rmd/.qmd are relative to the location of the file, not the project. If this becomes a problem, you can work around it by using here::here(). This function builds paths starting from the project directory, so you can use here::here("data/my-data.csv") to refer to a file in the project regardless of where the file that uses it lives.
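
A minimal sketch, with a hypothetical file layout:

# Inside reports/summary.qmd, a bare "data/my-data.csv" would be resolved
# relative to reports/; here::here() always starts from the project root
my_data <- read.csv(here::here("data/my-data.csv"))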

3.2.1 Dependencies

One less obvious consequence of the project-directory connection is how to handle packages. Usually in R, all your packages live in one or two central locations given by .libPaths(). So whenever you call library() it’s going to read files outside of the project, and whenever you call install.packages() it’s going to write files outside of the project. Obviously your analyses are going to be severely limited if you can’t use packages, so what can you do?

We’ll come back to this idea in much more detail later, but the basic idea is to use a package like {renv} that creates a project-specific library. That way library() will read, and install.packages() will write, inside your project. That makes it ok to use library(), but you still shouldn’t use install.packages(). The key problem with install.packages() is that it defaults to installing the latest version of the package available on CRAN. If you started work on this project in 2020, and your collaborator started work on it in 2025, you might end up with radically different versions of a package, which might lead to differences in your analyses. We’ll come back to this problem in much greater detail in Chapter TODO.

Note that you won’t actually want to store the packages themselves inside your project: they’re rather large, and you need different builds of each package for different operating systems. Instead, you’ll record the package names and versions that your project needs in a metadata file, and the people using your project will install those packages as needed.
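
As a preview, the core renv workflow looks roughly like this (a sketch; we’ll cover the details later):

renv::init()      # create a project-specific library and a renv.lock metadata file
renv::snapshot()  # record the packages (and versions) the project currently uses
renv::restore()   # on another machine, reinstall exactly those versions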

There’s one last tool that never lives inside the project: R itself. Unfortunately there’s currently no standard way to record which version of R your project requires inside the project.

3.2.2 Secrets

There’s one exception to the rule that everything your project needs should be contained within the project: secrets, which are generally used for authentication.
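
One common pattern (not necessarily the only one we’ll discuss) is to keep secrets in environment variables, for example in a .Renviron file that is never committed, and read them at runtime. A minimal sketch with a hypothetical API key:

# MY_SERVICE_API_KEY is a hypothetical secret set outside the project,
# e.g. in ~/.Renviron or in your deployment platform's settings
api_key <- Sys.getenv("MY_SERVICE_API_KEY")
if (!nzchar(api_key)) {
  stop("MY_SERVICE_API_KEY is not set")
}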

3.2.3 Files and subdirectories

Where relevant, we recommend that you follow the same conventions as R packages. We don’t believe that every data analysis project needs to be a package, but where the two overlap, following the package conventions gets you a number of tools for free. For example, this means:

  • To add some overall documentation about your project, include a README.md (possibly generated from a README.qmd).

  • When you start writing functions that you reuse in multiple places in your analysis, you should put those functions in files that live in R/.

  • When your functions get important enough and complicated enough that you need to test them, your tests should live in tests/testthat/.

Don’t worry if you’ve never created an R package before; we’ll cover these conventions in more detail as we work through this book.
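
If you have usethis loaded (Section 3.1), its helpers will create these files in the conventional places for you. A quick sketch (the file names are hypothetical, and some helpers assume a package-like setup):

usethis::use_readme_md()      # create README.md
usethis::use_r("helpers")     # create R/helpers.R
usethis::use_test("helpers")  # create tests/testthat/test-helpers.R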

3.2.4 File naming advice

  1. File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.
  2. File names should be human readable: use file names to describe what’s in the file.
  3. File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.

For example, here’s a directory of files that ignores these principles:
alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt

There are a variety of problems here: it’s hard to tell which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport), and some names don’t describe their contents (run-first and temp).

Here’s a better way of naming and organizing the same set of files:

01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt

3.2.5 Data

While your code is going to fetch data and possibly save it locally, you probably won’t want to commit that data to your repo (that’s a recipe for your repo exploding in size). Make sure to add that sort of data to your .gitignore so that you never accidentally commit it. That doesn’t mean you won’t ever commit data: it’s certainly useful to include small supplementary datasets (like lookup tables that come from your data documentation rather than the database), and example data that you use for testing.
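
For example, assuming your code caches downloaded data under data/raw/ (a hypothetical path), you can tell Git to ignore it with:

usethis::use_git_ignore("data/raw/")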

3.3 Each project has a single source of truth

Whenever you have multiple copies of something, you need to carefully think about which one is going to be the “source of truth” (i.e. if they’re different, which one do you believe?). In production scenarios, you have at least two copies of the project: one in your development environment and one in your production environment. If it’s a collaborative project there’s at least one more copy for each of your collaborators. So which of these copies should you “bless” and make it the official source? I’d argue none of them: instead you need to introduce one more copy, which is the copy that lives in a central git repository. This pattern is illustrated in Figure 3.1.

[Figure: a flowchart in which bidirectional arrows connect “You”, “Colleague A”, and “Colleague B” to a central “git” box, and a dashed one-way arrow runs from “git” to “Production”.]
Figure 3.1: A collaborative development workflow where multiple team members interact with a central git repository, and code is deployed from that repository to production.

We recommend this workflow because, while Git takes some time to learn, it provides a principled way to maintain and share code, tracking the provenance of every change. There are many ways to use Git, but a central repository is the easiest to understand and is well supported by modern tools like GitHub. The following sections go into more detail about why we recommend Git and GitHub.

(The idea of a single source of truth also applies to packages. The source of truth is not the packages installed in your library, but the packages you have recorded in a metadata file, because these are what the production server and your colleagues will use.)

3.3.1 Git

In this book, we assume that you’re already familiar with the basics of Git. You certainly don’t need to be an expert but you should know how to add and commit files, push and pull, and create branches. If you haven’t used Git before, I’d recommend that you put this book down now, and read Happy Git and GitHub for the useR.

I also expect you to know the other key skill for using Git: how to Google or ask an LLM for advice when you get yourself stuck in a situation that you’ve never seen before. Another great resource for common problems is Oh Shit, Git!?!.

3.3.2 GitHub

Git is valuable even if you use it locally; it makes it easier to understand how your code is changing over time, and makes it possible to undo mistakes. But git is vastly more valuable when you use it to share your code with others, using a tool like GitHub. Throughout the rest of the book, we’ll talk about GitHub exclusively, but this is really just a shorthand for saying your “Git-hosting platform”. There are many professional and open source solutions including GitLab, Bitbucket, or Azure DevOps. You’re unlikely to get to choose so you’ll need to adapt to what you’re given; but this is generally not too hard since the modern platforms all provide pretty similar tools. If your organisation doesn’t already have access to some git hosting platform, you should immediately start a campaign to get it1.
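
If you have a local project that isn’t on a hosting platform yet, one way to get it there is with usethis. A sketch for GitHub specifically (other platforms have their own tooling):

usethis::use_git()     # put the project under local version control
usethis::use_github()  # create a repository on GitHub and push to it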

We’ll talk about GitHub a lot when it comes to working with your team. But GitHub is also great when you’re a solo data scientist or learning data science. As you’ll learn later in the book, GitHub Actions allows you to build your own production environment and experience putting jobs into production just like you would in a larger organisation.

If you’re just starting out in your career, GitHub is also a great place to build up a portfolio of data science projects that you can talk about when interviewing for jobs, showcasing some of the production jobs you’ve created for your own life. If you’re in that position, I’d highly recommend watching GitHub: How To Tell Your Professional Story, by Abigail Haddad.

3.4 Automated deployments

Our last principle of the project lifestyle is that deployment is automated, based on the contents of your central git repository. The code that you deploy should ideally come from your Git host, not your local computer. In other words, we recommend that you use push-to-deploy (aka Git-backed deployments), not click-to-deploy from your IDE. It’s totally fine to start with click-to-deploy to learn the basics, but over time you want to transition to a fully automated solution.

We’ll go into the details later in the book, but we also recommend that deployments be gated behind tests. In other words, you push to start the deployment process, but it will only complete if all your tests pass. This decreases the chances of accidentally deploying broken code.

For high-stakes projects, we also recommend working in a branch until you are confident that the work is correct. As we’ll see, you can configure a branch to deploy into your staging environment so that you can experiment with confidence, without having to worry that stakeholders will see your mistakes. (This also gives your colleagues a way to review your code before it’s merged.) Once you’re confident, you can merge your branch back into the main flow of development and have the results automatically pushed to production.


  1. If you’re interviewing for a data science job, you should definitely ask what Git hosting platform they use. If they don’t use one, or worse don’t use Git, this is a major red flag.↩︎