Chapter 10 Using Third-Party Code

Before you can use third-party code, you must first install it. After it is installed, you must “load” it into your session. I will describe both of these steps in R and Python.

10.1 Installing Packages In R

In R, there are thousands of free, user-created packages (Lander 2017). You can download most of these from the Comprehensive R Archive Network (CRAN). You can also download packages from other publishing platforms like Bioconductor or GitHub. Installing from CRAN is more commonplace, and extremely easy to do. Just use the install.packages() function. This can be run inside your R console, so there is no need to type things into the command line.

install.packages("thePackage")

10.2 Installing Packages In Python

In Python, installing packages is more complicated. Commands must be written in the command line, and there are multiple package managers. This isn’t surprising, because Python is used more extensively than R in fields other than data science.

If you followed the suggestions provided earlier in the text, then you installed Anaconda. This means you will usually be using the conda command. Point-and-click interfaces are available as well.

conda install the_package

There are some packages that will not be available using this method. For more information on that situation, see here.
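Before installing anything, it can be handy to check from inside Python whether a module can already be found. The following is a minimal sketch using only the standard library. The helper name is_installed is my own, and the second module name is a made-up one that should not exist on your machine.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# "json" ships with Python; the second name is hypothetical
print(is_installed("json"))
print(is_installed("the_package_that_does_not_exist"))
```

Keep in mind that the name you import is not always the name you install. For example, the scikit-learn package is imported as sklearn.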

10.3 Loading Packages In R

After it is installed on your machine, third-party code needs to be “loaded” into your R or Python session.

Loading a package is relatively simple in R; however, complications can arise when different variables share the same name. This happens relatively often because

it’s easy to create a variable in the global environment that has the same name as another object you don’t know about, and
different packages you load in sometimes share names accidentally.

Starting off with the basics, here’s how to load in a package of third-party code. Just type the following into your R console.

library(thePackage)

You can also use the require() function, which has slightly different behavior when the requested package is not found.

To understand this more deeply, we need to talk about environments again. We discussed these before in 6.3, but only in the context of user-defined functions. When we load in a package with library(), we make its contents available by putting it all in an environment for that package.

An environment holds the names of objects. There are usually several environments, and each holds a different set of functions and variables. All the variables you define are in an environment, every package you load in gets its own environment, and all the functions that come in R pre-loaded have their own environment.

Formally, each environment is a pair of two things: a frame and an enclosure. The frame is the set of symbol-value pairs, and the enclosure is a pointer to the parent environment. If you’ve heard of a linked list in a computer science class, a chain of environments has the same structure.

Moreover, all of these environments are connected in a chain-like structure. To see what environments are loaded on your machine, and what order they were loaded in, use the search() function. This displays the search path, or the ordered sequence of all of your environments.
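To make the frame-and-enclosure idea concrete, here is a toy model written in Python. It is only a sketch of the linked structure described above, not how R is implemented. The class Env, the functions search_path() and lookup(), and the placeholder string values are all names I made up for illustration.

```python
class Env:
    """A toy environment: a frame (dict) plus a pointer to its enclosure."""
    def __init__(self, name, frame, enclosure=None):
        self.name = name
        self.frame = frame          # symbol-value pairs
        self.enclosure = enclosure  # parent environment, or None at the end

def search_path(env):
    """Walk the chain and collect environment names, loosely like search()."""
    names = []
    while env is not None:
        names.append(env.name)
        env = env.enclosure
    return names

def lookup(symbol, env):
    """Return the first value bound to symbol, walking from env outward."""
    while env is not None:
        if symbol in env.frame:
            return env.frame[symbol]
        env = env.enclosure
    raise NameError(f"object '{symbol}' not found")

# Placeholder contents; real environments hold many more names
base = Env("package:base", {"mean": "base's mean"})
stats = Env("package:stats", {"sd": "stats' sd"}, enclosure=base)
global_env = Env(".GlobalEnv", {"x": 42}, enclosure=stats)

print(search_path(global_env))
print(lookup("x", global_env))
print(lookup("mean", global_env))
```

Looking up a name starts at the first environment and follows the enclosure pointers until a match is found, which is why the ordering of the chain matters.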

Alternatively, if you’re using RStudio, the search path, and the contents of each of its environments, are displayed in the “Environment” window. You can choose which environment you’d like to look at by selecting it from the drop-down menu. This allows you to see all of the variables in that particular environment. The global environment (i.e. ".GlobalEnv") is displayed by default, because that is where you store all the objects you are creating in the console.

Figure 10.1: The Environment Window in RStudio

When you call library(thePackage), the package has an environment created for it, and it is inserted between the global environment and the most recently loaded package. When you want to access an object by name, R will first search the global environment, and then it will traverse the environments in the search path in order. This has a few important implications.

First, don’t define variables in the global environment that are already named in another environment. There are many variables that come pre-loaded in the base package (to see them, type ls("package:base")), and if you like using a lot of packages, you’re increasing the number of names you should avoid using.
Second, don’t library() in a package unless you need it, and if you do, be aware of all the names it will mask in packages you loaded in before. The good news is that library() will often print warnings letting you know which names have been masked. The bad news is that it’s somewhat out of your control–if you need two packages, then they might have a shared name, and the only thing you can do about it is watch the order you load them in.
Third, don’t use library() inside code that is source()’d in other files. For example, if you attach a package to the search path from within a function you defined, anybody that uses your function loses control over the order of packages that get attached.

All is not lost if there is a name conflict. The variables haven’t disappeared. It’s just slightly more difficult to refer to them. For instance, if I load in Hmisc (Harrell Jr, Charles Dupont, and others. 2021), I get a warning that format.pval and units are now masked because they were names that were in "package:base". I can still refer to these masked variables with the double colon operator (::).

library(Hmisc)
# this now refers to Hmisc's format.pval 
# because it was loaded more recently
format.pval 
Hmisc::format.pval # in this case is the same as above
# the below code is the only way 
# you can get base's format.pval now
base::format.pval
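The masking behavior above can be mimicked in Python with collections.ChainMap, which looks a key up in several dictionaries in order and returns the first match. This is only an analogy: the three dictionaries below are toy stand-ins for R environments, and their string values are placeholders.

```python
from collections import ChainMap

# Toy stand-ins for three R environments; the values are placeholders
global_env = {}
hmisc = {"format.pval": "Hmisc's format.pval", "units": "Hmisc's units"}
base = {"format.pval": "base's format.pval", "units": "base's units",
        "sum": "base's sum"}

# Lookup order mirrors the search path: global, newest package, then base
search = ChainMap(global_env, hmisc, base)

print(search["format.pval"])  # first match wins, so base's version is masked
print(base["format.pval"])    # going straight to one frame, loosely like ::
print(search["sum"])          # only base defines this name, so nothing is masked
```

Nothing was deleted from the base dictionary. The masked entry is still there, and it is only the unqualified lookup that stops finding it.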

10.4 Loading Packages In Python

In Python, you use the import statement to access objects defined in another file. It is slightly more complicated than R’s library() function, but it is also more flexible. To make the contents of a package called, say, the_package available, type one of the following inside a Python session.

import the_package
import the_package as tp 
from the_package import *

To describe the difference between these three approaches, as well as to highlight the important takeaways and compare them with the important takeaways in the last section, we need to discuss what a Python module is, what a package is, and what a Python namespace is.¹⁷

A Python module is a .py file, separate from the one you are currently editing, with function and/or object definitions in it.¹⁸
A package is a group of modules.¹⁹
A namespace is “a mapping from names to objects.”

With these definitions, we can define importing. According to the Python documentation, “[t]he import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope.”
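Here is a small sketch of the “binds the results to a name” part of that definition. I am using the math module from the standard library so that it runs without installing anything. The three statements find the same module, and they differ only in which names they bind.

```python
import math            # binds the name "math"
import math as m       # binds the name "m" to the same module object
from math import sqrt  # binds only the name "sqrt"

print(m is math)          # both names point at one module object
print(sqrt is math.sqrt)  # the bare name is the very same function
print(math.sqrt(16.0), m.sqrt(16.0), sqrt(16.0))
```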

The sequence of places Python looks for a module is called the search path. This is not the same as R’s search path, though. In Python, the search path is a list of places to look for modules, not a list of places to look for variables. To see it, import sys, then type sys.path.
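A short sketch of inspecting the search path is below. The entries will differ from machine to machine, so I only print the first few. I also peek at sys.modules, which is the dictionary where Python keeps modules that have already been imported.

```python
import math
import sys

# sys.path is an ordinary list of locations to search for modules
print(type(sys.path))
for location in sys.path[:3]:
    print(repr(location))

# Modules that were already imported are cached in sys.modules
print("math" in sys.modules)
```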

After a module is found, the variable names inside it become available to the importing module. These variables are available in the global scope, but the names you use to access them will depend on what kind of import statement you used. From there, you are using the same scoping rules that we described in 6.6, which means the LEGB acronym still applies.

In both languages, an (unqualified) variable name can only refer to one object at any time. This does not necessarily have anything to do with using third-party code–you can redefine objects, but don’t expect to be able to access the old object after you do it.

The same thing can happen when you use third-party code.

In R, you have to worry about the order of library() and require() calls, because there is potential masking going on.
If you don’t want to worry about masking, don’t use library() or require(), and just refer to variables using the :: operator (e.g. coolPackage::specialFunc()).
In Python, loading packages using either the import package format or the import package as p format means you do not need to worry about the order of imports because you will be forced to qualify variable names (e.g. package.func() or p.func()).
In Python, if you load third-party code using either from package import foo or from package import *, you won’t have to qualify variable names, but imported objects will overwrite any variables that happen to have the same name as something you’re importing.
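The last two points can be seen in a few lines. This is a sketch that uses the standard library’s math module, and the variables sqrt and pi are names I chose only to provoke a clash.

```python
sqrt = "my own variable"  # a name we defined ourselves
from math import sqrt     # silently rebinds "sqrt" to math's function

print(callable(sqrt))     # our string is no longer reachable by this name

pi = 3                    # our own, deliberately crude, value
import math               # qualified import, so our "pi" is left alone
print(pi, math.pi)
```

No warning is printed when the from-style import replaces the string, which is part of why that style calls for more care.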

The way variable names are stored is only slightly different between R and Python.

Python namespaces are similar to R environments in that they hold name-value pairs; however
Python namespaces are unlike R environments in that they are not arranged into a sorted list.
Also, Python modules may be organized into a nested or tree-like structure, whereas R packages will always have a flat structure.
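To see a namespace as “a mapping from names to objects,” and to see nesting, consider this sketch. It uses the built-in vars() function together with two standard library modules, math and xml.etree.ElementTree, which I picked only because they ship with Python.

```python
import math
import xml.etree.ElementTree as ET

# A module's namespace is exposed as a dictionary of name -> object
namespace = vars(math)
print(type(namespace))
print(namespace["pi"] is math.pi)

# Modules can nest: ElementTree lives inside etree, which lives inside xml
print(ET.__name__)
```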

10.4.1 `import`ing Examples

In the example below, we import the entire numpy package in a way that lets us refer to it as np. This reduces the amount of typing that is required of us, and it also protects against variable name clashing. We then use the normal() function to simulate normal random variables. This function is in the random sub-module, which is a sub-module in numpy that collects all of the pseudorandom number generation functionality together.

import numpy as np # import all of numpy
np.random.normal(size=4)
## array([-0.6986366 , -0.17539033,  0.46794932,  0.47517799])

This is one use of the dot operator (.). It is also used to access attributes and methods of objects (more information on that will come later in chapter 14). normal is inside of random, which is itself inside of np.

As a second example, suppose we were interested in the stats sub-module found inside the scipy package. We could import all of scipy, but just like the above example, that would mean we would need to consistently refer to a variable’s module, the sub-module, and the variable name. For long programs, it would become tedious to type scipy.stats.norm over and over again. Instead, let’s import the sub-module (or sub-package) and ignore the rest of scipy.

from scipy import stats
stats.norm().rvs(size=4)
## array([ 2.49062124,  0.09135411,  1.13549852, -1.49587011])

Now we don’t have to type scipy every time we use something in scipy.stats.

Finally, we can import the function directly, and refer to it with only one letter. This is highly discouraged, though. We are much more likely to accidentally use the name n twice. Further, n is not a very descriptive name, which means it could be difficult to understand what your program is doing later.

from numpy.random import normal as n
n(size=4)
## array([-0.83022531, -0.12589462, -2.29715655, -1.47360775])

Keep in mind, you’re always at risk of accidentally re-using names, even if you aren’t importing anything. For example, consider the following code.

n = 3 # now you can't use n as a function 
n()

This is very bad, because now you cannot use the n() function that was imported from the numpy.random sub-module earlier. In other words, it is no longer callable. The error message from the above code will be something like TypeError: 'int' object is not callable.
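If you would like to reproduce this without installing numpy, here is a sketch of the same mistake using math.sqrt from the standard library. The alias s is my own stand-in for the one-letter name, and the try block is only there so that the script keeps running after the error.

```python
from math import sqrt as s  # a short alias, like the discouraged one above

s = 3                       # rebinding the name discards the function
try:
    s()
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```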

Use the dir() function to see what is available inside a module. Here are a few examples. Type them into your own machine to see what they output.

dir(np) # numpy stuff
dir(__builtins__) #built-in stuff
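Because dir() returns a plain list of strings, you can filter it like any other list. Below is a sketch that uses the standard library’s math module. I also import builtins, which is the importable module where names like len and print live.

```python
import builtins
import math

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])
print("sqrt" in public_names)

# Built-in functions are listed under the builtins module
print("len" in dir(builtins))
```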

10.5 Exercises

What are important differences in the package installation procedures of R and Python? Select all that apply.

Installing R packages can be done from within R, while installing packages in Python can be done in the command line.
Installing R packages can usually be done with the same function install.packages(), while installing packages in Python can be done with a variety of package installers such as pip install and conda install.
There is only one package repository for R, but many for Python.
There is only one package repository for Python, but many for R.

What are important similarities and differences in the package loading procedures of R and Python? Select all that apply.

R and Python both have a search path.
R’s :: operator is very similar to Python’s . operator because they can both help access variable names inside packages.
Python namespaces are unlike R environments in that they are not arranged into a sorted list.
library(package) in R is similar to from package import * in Python because it will allow you to refer to all variables in package without qualification.
Python packages might have sub-modules whereas R’s packages do not.

In Python, which of the following is, generally speaking, the best way to import?

import the_package
from the_package import *
import the_package as tp

In Python, which of the following is, generally speaking, the worst way to import?

import the_package
from the_package import *
import the_package as tp

In R, if you want to use a function func() from package, do you always have to use library(package) or require(package) first?

Yes, otherwise func() won’t be available.
No, you can just use package::func() without calling any function that performs pre-loading.

References

Harrell Jr, Frank E, with contributions from Charles Dupont, and many others. 2021. Hmisc: Harrell Miscellaneous. https://CRAN.R-project.org/package=Hmisc.

Lander, Jared P. 2017. R for Everyone: Advanced Analytics and Graphics (2nd Edition). 2nd ed. Addison-Wesley Professional.

¹⁷ I am avoiding any mention of R’s namespaces and modules. These are things that exist, but they are different from Python’s namespaces and modules, and are not within the scope of this text.
¹⁸ The scripts you write are modules. They usually come with the intention of being run from start to finish. Other non-script modules are just a bag of definitions to be used in other places.
¹⁹ Sometimes a package is called a library but I will avoid this terminology.