When working in production, you’re much more likely to be using a Linux server. R package installations are a little different there, so in this chapter you’ll learn more about the best ways to install R packages on Linux, regardless of whether it’s your development or production environment. There are three challenges you’ll need to overcome:
You’re probably most used to installing packages on a Mac or Windows computer. There are some important differences with Linux and to understand them, you’ll need some new vocabulary like binary packages and system libraries.
Production jobs are usually run in a throwaway container. That means packages are installed every time your production job runs and the speed of package installation becomes more much important than in your development environment.
You want to make sure that you’re installing exactly the same package versions on your development and production environments.
We’ll tackle each of those challenges in this chapter. But if you’re already familiar with the problems and just want to hear the solutions, there are two many takeaways from this chapter:
Use Posit Public Package Manager (P3M for short) instead of CRAN. This is a free service provided by Posit that provides R package binaries for Linux and makes package installation much much faster.
Use pak::pak() instead of install.packages(). pak() gives more actionable feedback if package installation fails due to a missing system dependency.
We’ll begin by learning about pak, because it’s useful for both development and deployment, regardless of what platform you’re using. You’ll then learn the vocabulary you need in order to understand the difference between package installation on Mac/Windows and Linux. We’ll finish off by discussing how you can match your development and deployment package versions.
12.1 pak
Regardless of where you’re installing packages, we recommend that you use pak::pak() instead of install.packages(). There are three main reasons:
It’s safe: pak works out an installation plan up front, and tells you exactly what it’s going to do. It also protects you against common failure modes including a package being loaded in another session on Windows or missing a system dependency on Linux.
It’s convenient: as well as installing packages from CRAN, pak can also install packages from GitHub, GitLab, Bioconductor and much much more. It also makes it easy to install historical versions of packages and works with P3M to provide CRAN snapshots at any given point in time.
It’s fast: pak downloads and installs packages in parallel. It also caches packages, making it fast to switch between multiple versions of the same package.
Using pak is simple. First install it:
install.packages("pak")
And then call it:
pak::pak("tidyverse")# Install a package from GitHubpak::pak("r-lib/rlang")
pak has many more capabilities that you’ll can learn about in its getting started guide.
12.2 Installing a package on Linux
When you install a package from CRAN on Mac and Windows you get a self-contained binary package:
Self-contained means that you don’t need to install any other tools to make it work.
Binary means that CRAN has done all the work to get the package ready for your specific operating system. In turn that means installation happens quickly because all R needs to do is unzip a file.
Things are different on Linux because CRAN only provides source packages. That means you need to compile the package (which can take multiple minutes for complex packages) and if the package has any external dependencies, you’ll need to install those before compilation will succeed. For example, take the xml2 package. Most of the work in xml2 is done in C code, and while there is some C code in xml2 itself, most of the code comes from the external libxml2 library1. When you install xml2 on Mac or Windows you can get a binary package that already contains the external code. If you install xml2 on Linux, you have to first install libxml2 on your computer and then compile the package (which might take a few minutes).
To resolve these two problems we can use P3M and pak.
12.2.1 Package binaries from P3M
P3M is a freely available service2 provided by Posit. It’s very similar to CRAN, but provides binaries for many popular Linux distributions, like CentOS, Rocky Linux, OpenSUSE, RHEL, SLES, Ubuntu and Debian. (Not relevant to this discussion, but very useful in general, it also provides snapshots so you can easily roll back packages to any point in time.)
You should ask your system administrator to configure P3M for your production systems so that it just works. If you need to do it yourself, you can following the P3M setup instructions, e.g.:
(Note that the URL varies based on the Linux distribution you use, so please don’t blindly copy and paste that example!)
You might also want to use P3M in your development environment. In that case you might consider using both CRAN and P3M: that allows you to get binaries from P3M if they’re available; otherwise you’ll get the latest version from CRAN, which may require compiling from source. It’s up to you to make the trade-off between grabbing the absolute latest version and a version that’s faster to install.
Regardless of how you set up the repos, you’ll need to make this change globally, which typically means running the code in your .Rprofile. The easiest way to open this file is by running usethis::edit_r_profile().
If you’re using renv, follow Shannon Pileggi’s advice to use P3M in your existing projects (new renv projects will automatically use P3M if you’ve set up as your default repo). If you haven’t heard of renv, don’t worry, you’ll learn about it Section 15.1.
12.2.2 System dependencies with pak
Using P3M resolves the speed problem by giving you binary packages, but it doesn’t give you self-contained binaries: the binaries expect system dependencies to be installed in standard locations. That’s the convention on Linux servers because typically a server admin wants to control exactly what versions of system libraries are used. This is most important for security: if an urgent update is needed, you want to be able to update it in just one place.
That means you’ll need to install some system dependencies in order to make some packages work. How do you know which packages need which system dependencies? That’s a tricky problem because R projects have a very casual way of declaring system dependencies. Fortunately, however, Posit has invested a bunch of time and effort into turning that casual metadata into something actionable, which you can find at https://github.com/rstudio/r-system-requirements.
pak() uses this metadata to automatically report if any system dependencies are missing, telling you exactly what you need to install. For example, if you attempt to install the tidyverse on a fresh Linux system, you’ll get an message like this:
You can vary the sysreqs_platform to see one of the reasons that installing system dependencies is so frustrating to do by hand: every Linux distribution seems to use a slightly different name for the same system dependency.
If you’re server admin, we recommend that you install the most common set of system dependencies up front. That doesn’t take up a huge amount of disk space, and it saves everyone time by installing a bunch of packages at once rather than them dribbling in one at a time via tickets. You can find that list at https://docs.posit.co/connect/admin/r/dependencies/index.html.
12.3 Matching package versions
Now you know how to efficiently install packages on Linux so we can move on to tackling the final challenge: installing the right versions of the packages. The goal is to install the same versions of packages in your deployment environment as your development environment so that you get the same results. There are many ways you can do this, but Posit’s open source and pro tools have standardised on one format: manifest.json.
To generate a manifest.json, call rsconnect::writeManifest(). This function looks through all the code in your project, identifying all the packages that it uses3, finds all the packages that those packages use, and then records their versions in manifest.json. It also records some other useful metadata like the version of R that you’re using. You’ll then check this file into Git, and include it whenever you deploy your package.
Connect and Connect Cloud will automatically use this file (and in fact require it). For GitHub actions, you’ll need to use the setup-manifest step. This installs the version of R described in the manifest, and then all the packages that you need.
When you update packages, you’ll need to remember to update this file and re-commit it to Git.
Note that this not a complete solution to package management because while the versions on the deployment environment are locked, the versions on your development environment are not. Imagine that you create a dashboard, and it runs successfully for a couple of years without changes. When you come back it, the code no longer runs because you’ve installed a bunch of new versions in your development environment. We’ll come back to that problem in Section 15.1.
Using latest versions
There is another workflow available on GitHub Actions that worth briefly talking about: always installing the latest versions of packages from CRAN. This is very easy (all you need is a list of package names) but also the high risk. It’s more suitable code that you’re sharing widely with others that you want to run anywhere, not just in a curated deployment environment. To use this style all you need to do record package names in a DESCRIPTION file and use the setup-r-dependencies step in your action. I’d encourage you to have at least one job that uses this workflow. It shouldn’t be a critical job, but it’ll help you get a better sense of risk and reward, and it’ll give you one small job that requires a regular care and feeding.
Confusingly, the C equivalent of R’s packages are called libraries.↩︎
We also provide the non-free Posit Package Manager which you can use inside your organisation. Using PPM typically makes your IT department happy because you’re not installing code from random corners of the internet, and it makes you happy because it allows you to install all the open source data science packages (in R and Python) that you need to do your job. You can also use PPM to distribute your own internal packages. We’ll come back to that later in the book.↩︎
Whether that’s with library() or require(), via :: or something else.↩︎