6 Package installation

When working in production, you’re much more likely to be using a Linux server. R package installations are a little different there, so in this chapter you’ll learn more about the best ways to install R packages on Linux, regardless of whether it’s your development or production environment. There are three challenges you’ll need to overcome:

You’re probably most used to installing packages on a Mac or Windows computer. There are some important differences with Linux and to understand them, you’ll need some new vocabulary like binary packages and system libraries.
Production jobs are usually run in a throwaway container. That means packages are installed every time your production job runs and the speed of package installation becomes more much important than in your development environment.
You want to make sure that you’re installing exactly the same package versions on your development and production environments.

We’ll tackle each of those challenges in this chapter. But if you’re already familiar with the problems and just want to hear the solutions, there are two many takeaways from this chapter:

Use Posit Public Package Manager (P3M for short) instead of CRAN. This is a free service provided by Posit that provides R package binaries for Linux and makes package installation much much faster.
Use pak::pak() instead of install.packages(). pak() gives more actionable feedback if package installation fails due to a missing system dependency.

We’ll begin by learning about pak, because it’s useful for both development and deployment, regardless of what platform you’re using. You’ll then learn the vocabulary you need in order to understand the difference between package installation on Mac/Windows and Linux. We’ll finish off by discussing how you can match your development and deployment package versions.

6.1 pak

Regardless of where you’re installing packages, we recommend that you use pak::pak() instead of install.packages(). There are three main reasons:

It’s safe: pak works out an installation plan up front, and tells you exactly what it’s going to do. It also protects you against common failure modes including a package being loaded in another session on Windows or missing a system dependency on Linux.
It’s convenient: as well as installing packages from CRAN, pak can also install packages from GitHub, GitLab, Bioconductor and much much more. It also makes it easy to install historical versions of packages and works with P3M to provide CRAN snapshots at any given point in time.
It’s fast: pak downloads and installs packages in parallel. It also caches packages, making it fast to switch between multiple versions of the same package.

Using pak is simple. First install it:

install.packages("pak")

And then call it:

pak::pak("tidyverse")
# Install a package from GitHub
pak::pak("r-lib/rlang")

pak has many more capabilities that you’ll can learn about in its getting started guide.

6.2 Installing a package on Linux

When you install a package from CRAN on Mac and Windows you get a self-contained binary package:

Self-contained means that you don’t need to install any other tools to make it work.
Binary means that CRAN has done all the work to get the package ready for your specific operating system. In turn that means installation happens quickly because all R needs to do is unzip a file.

Things are different on Linux because CRAN only provides source packages. That means you need to compile the package (which can take multiple minutes for complex packages) and if the package has any external dependencies, you’ll need to install those before compilation will succeed. For example, take the xml2 package. Most of the work in xml2 is done in C code, and while there is some C code in xml2 itself, most of the code comes from the external libxml2 library¹. When you install xml2 on Mac or Windows you can get a binary package that already contains the external code. If you install xml2 on Linux, you have to first install libxml2 on your computer and then compile the package (which might take a few minutes).

To resolve these two problems we can use P3M and pak.

6.2.1 Package binaries from P3M

P3M is a freely available service² provided by Posit. It’s very similar to CRAN, but provides binaries for many popular Linux distributions, like CentOS, Rocky Linux, OpenSUSE, RHEL, SLES, Ubuntu and Debian. (Not relevant to this discussion, but very useful in general, it also provides snapshots so you can easily roll back packages to any point in time.)

You should ask your system administrator to configure P3M for your production systems so that it just works. If you need to do it yourself, you can following the P3M setup instructions, e.g.:

options(repos = c(
  CRAN = "https://p3m.dev/cran/__linux__/bookworm/latest",
))

(Note that the URL varies based on the Linux distribution you use, so please don’t blindly copy and paste that example!)

You might also want to use P3M in your development environment. In that case you might consider using both CRAN and P3M: that allows you to get binaries from P3M if they’re available; otherwise you’ll get the latest version from CRAN, which may require compiling from source. It’s up to you to make the trade-off between grabbing the absolute latest version and a version that’s faster to install.

options(repos = c(
  P3M  = "https://p3m.dev/cran/__linux__/bookworm/latest",
  CRAN = "https://cloud.r-project.org"
))

You can also use pak to help automate this task:

# override CRAN
pak::repo_add(CRAN = "PPM@latest")

# supplement P3M
pak::repo_add(P3M = "PPM@latest")

Regardless of how you set up the repos, you’ll need to make this change globally, which typically means running the code in your .Rprofile. The easiest way to open this file is by running usethis::edit_r_profile().

If you’re using renv, follow Shannon Pileggi’s advice to use P3M in your existing projects (new renv projects will automatically use P3M if you’ve set up as your default repo). If you haven’t heard of renv, don’t worry, you’ll learn about it Section 15.1.

6.2.2 System dependencies with pak

Using P3M resolves the speed problem by giving you binary packages, but it doesn’t give you self-contained binaries: the binaries expect system dependencies to be installed in standard locations. That’s the convention on Linux servers because typically a server admin wants to control exactly what versions of system libraries are used. This is most important for security: if an urgent update is needed, you want to be able to update it in just one place.

That means you’ll need to install some system dependencies in order to make some packages work. How do you know which packages need which system dependencies? That’s a tricky problem because R projects have a very casual way of declaring system dependencies. Fortunately, however, Posit has invested a bunch of time and effort into turning that casual metadata into something actionable, which you can find at https://github.com/rstudio/r-system-requirements.

pak() uses this metadata to automatically report if any system dependencies are missing, telling you exactly what you need to install. For example, if you attempt to install the tidyverse on a fresh Linux system, you’ll get a message like this:

→ Will install 101 packages.
→ Will download 31 CRAN packages (34.93 MB), cached: 70 (33.19 MB).

[...]

✖ Missing 11 system packages. You'll probably need to install them manually:
+ libcurl4-openssl-dev  - curl
+ libfontconfig1-dev    - systemfonts
+ libfreetype6-dev      - ragg, systemfonts, textshaping
+ libfribidi-dev        - textshaping
+ libharfbuzz-dev       - textshaping
+ libjpeg-dev           - ragg
+ libpng-dev            - ragg
+ libssl-dev            - curl, openssl
+ libtiff-dev           - ragg
+ libxml2-dev           - xml2
+ pandoc                - knitr, reprex, rmarkdown

You can then drop that info into a ticket to your IT department.

pak also provides tools to do this programmatically, e.g:

pak::pkg_sysreqs("devtools", sysreqs_platform = "centos")

── Install scripts ──────────────────────────────────────────────── Centos NA ──
yum install -y libX11-devel git libcurl-devel openssl-devel make zlib-devel \
  freetype-devel libjpeg-turbo-devel libpng-devel libtiff-devel libicu-devel \
  fontconfig-devel fribidi-devel harfbuzz-devel libxml2-devel

── Packages and their system dependencies ──────────────────────────────────────
clipr       – libX11-devel
credentials – git
curl        – libcurl-devel, openssl-devel
fs          – make
gitcreds    – git
httpuv      – make, zlib-devel
openssl     – openssl-devel
ragg        – freetype-devel, libjpeg-turbo-devel, libpng-devel, libtiff-devel
remotes     – git
sass        – make
stringi     – libicu-devel
systemfonts – fontconfig-devel, freetype-devel
textshaping – freetype-devel, fribidi-devel, harfbuzz-devel
xml2        – libxml2-devel

You can vary the sysreqs_platform to see one of the reasons that installing system dependencies is so frustrating to do by hand: every Linux distribution seems to use a slightly different name for the same system dependency.

If you’re server admin, we recommend that you install the most common set of system dependencies up front. That doesn’t take up a huge amount of disk space, and it saves everyone time by installing a bunch of packages at once rather than them dribbling in one at a time via tickets. You can find that list at https://docs.posit.co/connect/admin/r/dependencies/index.html.

6.3 Matching package versions

Now you know how to efficiently install packages on Linux so we can move on to tackling the final challenge: installing the right versions of the packages. The goal is to install the same versions of packages in your deployment environment as your development environment so that you get the same results. There are many ways you can do this, but Posit’s open source and pro tools have standardised on one format: manifest.json.

To generate a manifest.json, call rsconnect::writeManifest(). This function looks through all the code in your project, identifying all the packages that it uses³, finds all the packages that those packages use, and then records their versions in manifest.json. It also records some other useful metadata like the version of R that you’re using. You’ll then check this file into Git, and include it whenever you deploy your package.

Connect and Connect Cloud will automatically use this file (and in fact require it). For GitHub actions, you’ll need to use the setup-manifest step. This installs the version of R described in the manifest, and then all the packages that you need.

jobs:
  setup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-manifest@feature/setup-manifest

When you update packages, you’ll need to remember to update this file and re-commit it to Git.

Note that this not a complete solution to package management because while the versions on the deployment environment are locked, the versions on your development environment are not. Imagine that you create a dashboard, and it runs successfully for a couple of years without changes. When you come back it, the code no longer runs because you’ve installed a bunch of new versions in your development environment. We’ll come back to that problem in Section 15.1.

Using latest versions

There is another workflow available on GitHub Actions that worth briefly talking about: always installing the latest versions of packages from CRAN. This is very easy (all you need is a list of package names) but also the high risk. It’s more suitable code that you’re sharing widely with others that you want to run anywhere, not just in a curated deployment environment. To use this style all you need to do record package names in a DESCRIPTION file and use the setup-r-dependencies step in your action. I’d encourage you to have at least one job that uses this workflow. It shouldn’t be a critical job, but it’ll help you get a better sense of risk and reward, and it’ll give you one small job that requires a regular care and feeding.

Confusingly, the C equivalent of R’s packages are called libraries.↩︎
We also provide the non-free Posit Package Manager which you can use inside your organisation. Using PPM typically makes your IT department happy because you’re not installing code from random corners of the internet, and it makes you happy because it allows you to install all the open source data science packages (in R and Python) that you need to do your job. You can also use PPM to distribute your own internal packages. We’ll come back to that later in the book.↩︎
Whether that’s with library() or require(), via :: or something else.↩︎