Chapter 6 Functions

This text has already covered how to use functions that come to us pre-made. At least we have discussed how to use them in a one-off way: write the name of the function, write some parentheses after that name, and then plug in any requisite arguments by writing them, comma-separated, between those two parentheses. This is how it works in both R and Python.

In this section we take a look at how to define our own functions. This will not only help us to understand pre-made functions, but it will also be useful if we need some extra functionality that isn’t already provided to us.

Writing our own functions is also useful for “packaging up” computations. The utility of this will become apparent very soon. Consider the task of estimating a regression model. If you have a function that performs all of the required calculations, then

  • you can estimate models without having to think about lower-level details or write any code yourself, and
  • you can re-use this function every time you fit any model on any data set for any project.

6.1 Defining R Functions

To create a function in R, we need another function called function(). We give the output of function() a name in the same way we give names to any other variable in R, by using the assignment operator <- . Here’s an example of a toy function called addOne(). Here myInput is a placeholder that refers to whatever the user of the function ends up plugging in.

Below the definition, the function is called with an input of 41. When this happens, the following sequence of events occurs:

  • The value 41 is assigned to myInput
  • myOutput is given the value 42
  • myOutput, which is 42, is returned from the function
  • the temporary variables myInput and myOutput are destroyed.

We get the desired answer, and all the unnecessary intermediate variables are cleaned up and thrown away after they are no longer needed.

If you are interested in writing a function, I recommend that you first write the logic outside of a function. This initial code will be easier to debug because your temporary variables will not be destroyed after the final result has been obtained. Once you are happy with the working code, you can copy and paste the logic into a function definition, and replace permanent variables with function inputs like myInput above.

6.2 Defining Python Functions

To create a function in Python, we use the def statement (instead of the function() function in R). The desired name of the function comes next. After that, the formal parameters come, comma-separated inside parentheses, just like in R.

Defining a function in Python is a little more concise. There is no assignment operator like there is in R, there are no curly braces, and return isn’t a function like it is in R, so there is no need to use parentheses after it. There is one syntactic addition, though–we need a colon (:) at the end of the first line of the definition.

Here is an example of a toy function called add_one().
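The listing itself is not reproduced in this text, so what follows is only a minimal sketch of what such a definition could look like. It reuses the names add_one, my_input, and my_output from the surrounding description.

```python
def add_one(my_input):
    # my_input is a placeholder for whatever the caller plugs in
    my_output = my_input + 1
    return my_output

# call the function with an input of 41
result = add_one(41)
print(result)  # 42
```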

Below the definition, the function is called with an input of 41. When this happens, the following sequence of events occurs:

  • The value 41 is assigned to my_input
  • my_output is given the value 42
  • my_output, which is 42, is returned from the function
  • the temporary variables my_input and my_output are destroyed.

We get the desired answer, and all the unnecessary intermediate variables are cleaned up and thrown away after they are no longer needed.

6.3 More Details On R’s User-Defined Functions

Technically, in R, functions are defined as three things bundled together:

  1. a formal argument list (also known as formals),
  2. a body, and
  3. a parent environment.

The formal argument list is exactly what it sounds like. It is the list of arguments a function takes. You can access a function’s formal argument list using the formals() function. Note that this is not the list of actual arguments a user will plug in, which isn’t knowable at the time the function is created.

Here is another function that takes a parameter called whichNumber that comes with a default argument of 1. If the caller of the function does not specify what she wants to add to myInput, addNumber() will use 1 as the default. This default value shows up in the output of formals(addNumber).

The function’s body is also exactly what it sounds like. It is the work that a function performs. You can access a function’s body using the body() function.

Every function you create also has a parent environment[1]. You can get/set this using the environment() function. Environments help a function know which variables it is allowed to use and how to use them. The parent environment of a function is where the function was created, and it contains variables outside of the body that the function can also use. The rules of which variables a function can use are called scoping. When you create functions in R, you are primarily using lexical scoping. This is discussed in more detail in section 6.5.

There is a lot more information about environments that isn’t provided in this text. For instance, a user-defined function also has binding, execution, and calling environments associated with it, and environments are used in creating package namespaces, which are important when two packages each have a function with the same name.

6.4 More Details On Python’s User-Defined Functions

Roughly, Python functions have the same things R functions have. They have a formal parameter list, a body, and there are namespaces created that help organize which variables the function can access, as well as which pieces of code can call this new function. A namespace is just a “mapping from names to objects.”

These three concepts are analogous to those in R. The names are sometimes a bit different, and the information isn’t organized in the same way. To access these bits of information, you need to access the special attributes of a function. User-defined functions in Python have many pieces of information attached to them. If you’d like to see all of them, you can consult the “Data model” chapter of the Python language reference.

So, for instance, let’s try to find the formal parameter list of a user-defined function below. This is, again, the collection of inputs a function takes. Just like in R, these are not the actual arguments a user will plug in; those aren’t knowable at the time the function is created.[2] Here we have another function called add_number() that takes a parameter which_number that is accompanied by a default argument of 1.
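Since the listing is missing here, the following is a hedged sketch of how add_number() might be written and inspected. The body and the name my_output are assumptions; __code__.co_varnames, __code__.co_argcount, and __defaults__ are standard attributes of Python function and code objects.

```python
def add_number(my_input, which_number=1):
    # which_number comes with a default argument of 1
    my_output = my_input + which_number
    return my_output

print(add_number(41))     # 42, the default is used
print(add_number(41, 2))  # 43

code = add_number.__code__
# co_varnames lists parameters first, then other local variables
print(code.co_varnames)                     # ('my_input', 'which_number', 'my_output')
# keep only the formal parameters
print(code.co_varnames[:code.co_argcount])  # ('my_input', 'which_number')
# default arguments are stored on the function itself
print(add_number.__defaults__)              # (1,)
```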

The __code__ attribute has much more to offer. To see a list of names of all its contents, you can use dir(add_number.__code__).

Don’t worry if the notation add_number.__code__ looks strange. The dot (.) operator will become more clear in the future chapter on object-oriented programming. For now, just think of __code__ as being an object belonging to add_number. Objects that belong to other objects are called attributes in Python. The dot operator helps us access attributes inside other objects. It also helps us access objects belonging to modules that we import into our scripts.

6.5 Function Scope in R

R uses lexical scoping. This means, in R,

  1. functions can use local variables that are defined inside themselves,
  2. functions can use global variables defined in the environment where the function itself was defined,
  3. functions cannot necessarily use global variables defined in the environment where the function was called, and
  4. functions will prefer local variables to global variables if there is a name clash.

The first characteristic is obvious. The second and third are important to distinguish between. Consider the code below. sillyFunction() can access a because sillyFunction() and a are defined in the same place.

On the other hand, the following example will not work because a and anotherSillyFunc() are not defined in the same place. Calling the function is not the same as defining a function.

Finally, here is a demonstration of a function preferring one a over another. When sillyFunction() attempts to access a, it first looks in its own body, and so the innermost one gets used. On the other hand, print(a) shows 3, the global variable.

The same concept applies if you create functions within functions. The inner function innerFunc() looks “inside-out” for variables, but only in the place it was defined.

Below we call outerFunc(), which then calls innerFunc(). innerFunc() can refer to the variable b, because it lies in the same environment in which innerFunc() was created. Interestingly, innerFunc() can also refer to the variable a, because that variable was captured by outerFunc(), which in turn makes it available to innerFunc().

Here’s another interesting example. If we ask outerFunc() to return the function innerFunc() (not the return object of innerFunc()…functions are objects, too!), then we might be surprised to see that innerFunc() can still successfully refer to b, even though it doesn’t exist inside the calling environment. But don’t be surprised! What matters is what was available when the function was created.

We use this property all the time when we create functions that return other functions. This is discussed in more detail in chapter 15. In the above example, outerFuncV2(), the function that returned another function, is called a function factory.

Sometimes people will refer to R’s functions as closures to emphasize that they are capturing variables from the parent environment in which they were created, to emphasize the data that they are bundled with.

6.6 Function Scope in Python

Python uses lexical scoping just like R. This means, in Python,

  1. functions can use local variables that are defined inside themselves,
  2. functions have an order of preference for which variable to prefer in the case of a name clash, and
  3. functions can sometimes use variables defined outside themselves, but that ability depends on where the function and variable were defined, not where the function was called.

Regarding characteristics (2) and (3), there is a famous acronym that describes the rules Python follows when finding and choosing variables: LEGB.

  • L: Local,
  • E: Enclosing,
  • G: Global, and
  • B: Built-in.

A Python function will search for a variable in these namespaces in this order.[3]

“Local” refers to variables that are defined inside of the function’s block. The function below uses the local a over the global one.
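As a small illustration of the local rule, here is a sketch in which the function name use_local() and the string values are made up for this example.

```python
a = "global a"

def use_local():
    a = "local a"  # shadows the global a inside this function only
    return a

print(use_local())  # local a
print(a)            # global a
```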

“Enclosing” refers to variables that were defined in the enclosing namespace, but not the global namespace. These variables are sometimes called free variables. In the example below, there is no local a variable for inner_func(), but there is a global one, and one in the enclosing namespace. inner_func() chooses the one in the enclosing namespace. Moreover, inner_func() keeps its own reference to that a, even after outer_func() has finished running and its local variables would otherwise have been destroyed.
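A sketch of the enclosing rule follows. The string values and the name returned_func are assumptions chosen for illustration; co_freevars is a standard attribute of code objects that lists a function's free variables.

```python
a = "global a"

def outer_func():
    a = "enclosing a"
    def inner_func():
        # no local a here, so Python looks in the enclosing namespace next
        return a
    return inner_func

returned_func = outer_func()  # outer_func() has finished running by now
print(returned_func())                    # enclosing a
print(returned_func.__code__.co_freevars) # ('a',)
```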

“Global” scope contains variables defined in the module-level namespace. If the code in the below example was the entirety of your script, then a would be a global variable.
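A minimal sketch of the global rule, assuming these few lines are the whole script. The name read_global() is hypothetical.

```python
a = "global a"  # defined at module level

def read_global():
    # there is no local or enclosing a, so the module-level a is found
    return a

print(read_global())  # global a
```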

Just like in R, Python functions cannot necessarily find variables where the function was called. For example, here is some code that mimics the above R example. Both a and b are accessible from within inner_func(). That is due to LEGB.
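The code being described is not reproduced here. One plausible version, with string values assumed for illustration, is sketched below.

```python
a = "outside both"

def outer_func():
    b = "inside one"
    def inner_func():
        # a is found in the global namespace, b in the enclosing one
        return a, b
    return inner_func()

print(outer_func())  # ('outside both', 'inside one')
```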

However, if outer_func() is called inside another function, it still does not have access to variables at the call site, because it was defined somewhere else. You might be surprised at how the following code behaves. Does it print the string "this is the a I want to use now!"? No!
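A sketch of code that behaves this way is given next. The name third_func() is hypothetical, and the structure is an assumption made to match the printed output.

```python
a = "outside both"

def outer_func():
    b = "inside one"
    def inner_func():
        # both names are resolved where inner_func() was defined
        print(a)
        print(b)
    return inner_func

def third_func():
    # this local a is invisible to inner_func()
    a = "this is the a I want to use now!"
    returned_func = outer_func()
    returned_func()

third_func()
```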

## outside both
## inside one

If you feel like you understand lexical scoping, great! You should be ready to take on chapter 15, then. If not, keep playing around with examples. Without understanding the scoping rules R and Python share, writing your own functions will persistently feel more difficult than it really is.

6.7 Modifying a Function’s Arguments

Can/should we modify a function’s argument? The flexibility to do this sounds empowering; however, not doing it is recommended because it makes programs easier to reason about.

6.7.1 Passing By Value In R

In R, it is difficult for a function to modify one of its arguments.[4] Consider the following code.

The function f has an argument called arg. When f(a) is performed, changes are made to a copy of a. When a function constructs a copy of all input variables inside its body, this is called pass-by-value semantics. This copy is a temporary intermediate value that only serves as a starting point for the function to produce a return value of 2.

arg could have been called a, and the same behavior would take place. However, giving these two things different names is helpful to remind you and others that R copies its arguments.

It is still possible to modify a, but I don’t recommend doing this either. I will discuss this more in section 6.8.

6.7.2 Passing By Assignment In Python

The story is more complicated in Python. Python functions have pass-by-assignment semantics. This terminology is fairly specific to Python. What this means is that your ability to modify the arguments of a function depends on

  • what the type of the argument is, and
  • what you’re trying to do to it.

We will go through some examples first, and then explain why this works the way it does. Here is some code that is analogous to the example above.
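A sketch of what this first example could look like, reusing the names f, arg, and a from the R discussion. The starting value of 1 is an assumption.

```python
a = 1

def f(arg):
    arg = 2  # re-binds the local name arg only
    return arg

print(f(a))  # 2
print(a)     # 1
```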

In this case, a is not modified. That is because a is an int. ints are immutable in Python, which means that their value cannot be changed after they are created, either inside or outside of the function’s scope. However, consider the case when a is a list, which is a mutable type. A mutable type is one that can have its value changed after it’s created.
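A sketch of the list version follows. The starting value [999] is made up for illustration.

```python
a = [999]

def f(arg):
    arg[0] = 2  # mutates the list object in place
    return arg

print(f(a))  # [2]
print(a)     # [2]
```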

In this case a is modified. Changing the value of the argument inside the function effects changes to that variable outside of the function.

Ready to be confused? Here is a tricky third example. What happens if we take in a list, but try to do something else with it?
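A sketch of this third case, again with an assumed starting value of [999]. The only change from mutating an element is that the whole name arg is assigned a new list.

```python
a = [999]

def f(arg):
    arg = [2]  # re-binds arg to a brand new list object
    return arg

print(f(a))  # [2]
print(a)     # [999]
```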

That time a did not permanently change in the global scope. Why does this happen? I thought lists were mutable!

The reason behind all of this doesn’t even have anything to do with functions, per se. Rather, it has to do with how Python manages objects, values, and types. It also has to do with what happens during assignment.

Let’s revisit the above code, but bring everything out of a function. Python is pass-by-assignment, so all we have to do is understand how assignment works. Starting with the immutable int example, we have the following.
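A sketch of the function-free version is shown here, with the three assignments placed on the first three lines and the checks kept after them. The checks are additions for illustration.

```python
a = 1
arg = a
arg = 2

print(a)                 # 1
print(arg)               # 2
print(id(a) == id(arg))  # False
print(a is arg)          # False
```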

The id() function returns the identity of an object, which is kind of like its memory address. Identities of objects are unique and constant. If two variables, a and b say, have the same identity, a is b will evaluate to True. Otherwise, it will evaluate to False.

In the first line, the name a is bound to the object 1. In the second line, the name arg is bound to the object that is referred to by the name a. After the second line finishes, arg and a are two names for the same object (a fact that you can confirm by inserting arg is a immediately after this line).

In the third line, arg is bound to 2. The variable arg can be changed, but only by re-binding it with a separate object. Re-binding arg does not change the value referred to by a because a still refers to 1, an object separate from 2. There is no reason to re-bind a because it wasn’t mentioned at all in the third line.

If we go back to the first function example, it’s basically the same idea. The only difference, however, is that arg is in its own scope. Let’s look at a simplified version of our second code chunk that uses a mutable list.
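One way the simplified list example might look, with [999] as an assumed starting value:

```python
a = [999]
arg = a
arg[0] = 2

print(arg is a)  # True
print(a)         # [2]
```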

In this example, when we run arg = a, the name arg is bound to the same object that is bound to a. This much is the same. The only difference here, though, is that because lists are mutable, changing the first element of arg is done “in place”, and all variables can access the mutated object.

Why did the third example produce unexpected results? The difference is in the line arg = [2]. This re-binds the name arg to a different object. lists are still mutable, but this has nothing to do with re-binding; re-binding a name works no matter what type of object you’re binding it to. In this case we are re-binding arg to a completely different list.

6.8 Accessing and Modifying Captured Variables

In the last section, we were talking about variables that were passed in as function arguments. Here we are talking about variables that are captured. They are not passed in as arguments, but they are still used inside a function. In general, even though it is possible to access and modify non-local captured variables in both languages, it is not a good idea.

6.8.1 Accessing Captured Variables in R

As Hadley Wickham writes in his book, “[l]exical scoping determines where, but not when to look for values.” R has dynamic lookup, meaning code inside a function will only try to access a referred-to variable when the function is running, not when it is defined.

Consider the R code below. The dataReadyForModeling() function is created in the global environment, and the global environment contains a Boolean variable called dataAreClean.

Now imagine sharing some code with a collaborator. Imagine, further, that your collaborator is the subject-matter expert, and knows little about R programming. Suppose that he changes dataAreClean, a global variable in the script, after he is done. Shouldn’t this induce a relatively trivial change to the overall program?

Let’s explore this hypothetical further. Consider what could happen if any of the following (very typical) conditions are true:

  • you or your collaborators aren’t sure what dataReadyForModeling() will return because you don’t understand dynamic lookup, or
  • it’s difficult to visually keep track of all assignments to dataAreClean (e.g. your script is quite long or it changes often), or
  • you are not running code sequentially (e.g. you are repeatedly testing chunks at a time instead of clearing out your memory and source()ing from scratch, over and over again).

In each of these situations, understanding of the program would be compromised. However, if you follow the above principle of never referring to non-local variables in function code, all members of the group could do their own work separately, minimizing the dependence on one another.

Another reason violating this could be troublesome is if you define a function that refers to a nonexistent variable. Defining the function will never throw an error because R will assume that variable is defined in the global environment. Calling the function might throw an error, unless you accidentally defined the variable, or if you forgot to delete a variable whose name you no longer want to use. Defining myFunc() with the code below will not throw an error, even if you think it should!

6.8.2 Accessing Captured Variables in Python

It is the same exact situation in Python. Consider everything_is_safe(), a function that is analogous to dataReadyForModeling().
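A sketch of how everything_is_safe() might look is given below. The global flag data_are_clean and the variables before and after are hypothetical names chosen to mirror the R discussion.

```python
data_are_clean = True  # hypothetical module-level flag

def everything_is_safe():
    # data_are_clean is looked up each time the function runs,
    # not when the function is defined
    return data_are_clean

before = everything_is_safe()
data_are_clean = False  # someone changes the global later on
after = everything_is_safe()

print(before)  # True
print(after)   # False
```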

We can also define my_func(), which is analogous to myFunc(). Defining this function doesn’t throw an error either.
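For illustration, here is a sketch in which my_func() refers to a deliberately undefined name. The name variable_that_does_not_exist is made up, and the try block is only there to show that the error appears at call time.

```python
def my_func():
    # variable_that_does_not_exist is deliberately never defined
    return variable_that_does_not_exist + 1

# defining my_func() raised no error; calling it does
try:
    my_func()
except NameError as err:
    print("NameError:", err)
```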

So stay away from referring to variables outside the body of your function!

6.8.3 Modifying Captured Variables In R

Now what if we want to be extra bad, and in addition to accessing global variables, we modify them, too?

In the program above, makeATwo() copies a into arg. It then assigns 2 to that copy. Then it takes that 2 and writes it to the global a variable in the parent environment. It does this using R’s super assignment operator <<-. Regardless of the inputs passed in to this function, it will always assign exactly 2 to a, no matter what.

This is problematic because you are pre-occupying your mind with one function: makeATwo(). Whenever you write code that depends on a (or on things that depend on a, or on things that depended on things that depend on a, or …), you’ll have to repeatedly interrupt your train of thought to try and remember if what you’re doing is going to be okay with the current and future makeATwo() call sites.

6.8.4 Modifying Captured Variables In Python

There is something in Python that is similar to R’s super assignment operator (<<-). It is the global keyword. This keyword will let you modify global variables from inside a function.
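A sketch of the global keyword in action follows. It loosely mirrors makeATwo(); the name make_a_two() and the starting value of 1 are assumptions.

```python
a = 1

def make_a_two(arg):
    global a  # assignments to a below target the module-level a
    arg = 2
    a = arg

make_a_two(a)
print(a)  # 2
```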

The upside to the global keyword is that it makes hunting for side effects relatively easy (a function’s side effects are changes it makes to non-local variables). Yes, this keyword should be used sparingly, even more sparingly than merely referring to global variables, but if you are ever debugging, and you want to hunt down places where variables are surprisingly being changed, you can hit Ctrl-F and search for the phrase “global.”

6.9 Exercises

6.9.1 R Questions

Suppose you have a matrix \(\mathbf{X} \in \mathbb{R}^{n \times p}\) and a column vector \(\mathbf{y} \in \mathbb{R}^{n}\). To estimate the linear regression model \[\begin{equation} \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \epsilon, \end{equation}\] where \(\boldsymbol{\beta} \in \mathbb{R}^p\) is a column vector of coefficients and \(\epsilon \in \mathbb{R}^n\) is a column vector of errors, you can use calculus instead of numerical optimization. The formula for the least squares estimate of \(\boldsymbol{\beta}\) is \[\begin{equation} \hat{\boldsymbol{\beta}} = (\mathbf{X}^\intercal \mathbf{X})^{-1} \mathbf{X}^\intercal \mathbf{y}. \end{equation}\]

Once this \(p\)-dimensional vector is found, you can also obtain the predicted (or fitted) values

\[\begin{equation} \hat{\mathbf{y}} := \mathbf{X}\hat{\boldsymbol{\beta}}, \end{equation}\] and the residuals (or errors)

\[\begin{equation} \mathbf{y} - \hat{\mathbf{y}} \end{equation}\]

Write a function called getLinModEstimates() that takes in two arguments in the following order:

  • the vector of response data \(\mathbf{y}\)
  • the matrix of predictors \(\mathbf{X}\).

Have it return a named list with three outputs inside:

  • the coefficient estimates as a vector,
  • a vector of fitted values, and
  • a vector of residuals.

The three elements of the returned list should have the names coefficients, fitVals, and residuals.

Write a function called monteCarlo that

  • takes as an input a function sim(n) that simulates n scalar variables,
  • takes as an input a function that evaluates \(f(x)\) on each random variable sample and that ideally takes in all of the random variables as a vector, and
  • returns a function that takes one integer-valued argument (num_sims) and outputs a length one vector.

Assume sim(n) only has one argument: n, which is the number of simulations desired. sim(n)’s output should be a length n vector.

The output of this returned function should be a Monte Carlo estimate of the expectation: \(\mathbb{E}[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(X^i)\).

Write a function called myDFT() that computes the Discrete Fourier Transform of a vector and returns another vector. Feel free to check your work against spec.pgram(), fft(), or astsa::mvspec(), but do not include calls to those functions in your submission. Also, you should be aware that different functions transform and scale the answer differently, so be sure to read the documentation of any function you use to test against.

Given data \(x_1,x_2,\ldots,x_n\), \(i = \sqrt{-1}\), and the Fourier/fundamental frequencies \(\omega_j= j/n\) for \(j=0,1,\ldots,n-1\), we define the discrete Fourier transform (DFT) as:

\[\begin{equation} \label{eq:DFT} d(\omega_j)= n^{-1/2} \sum_{t=1}^n x_t e^{-2 \pi i \omega_j t} \end{equation}\]

6.9.2 Python Questions

Estimating statistical models often involves some form of optimization, and oftentimes, optimization is performed numerically. One of the most famous optimization algorithms is Newton’s method.

Suppose you have a function \(f(x)\) that takes a scalar-valued input and returns a scalar as well. Also, suppose you have the function’s derivative \(f'(x)\), its second derivative \(f''(x)\), and a starting point guess for what the minimizing input of \(f(x)\) is: \(x_0\).

The algorithm repeatedly applies the following recursion:

\[\begin{equation} x_{n+1} = x_{n} - \frac{f'(x_n)}{f''(x_{n})}. \end{equation}\] Under appropriate regularity conditions for \(f\), after many iterations of the above recursion, when \(\tilde{n}\) is very large, \(x_{\tilde{n}}\) will be nearly the same as \(x_{\tilde{n}-1}\), and \(x_{\tilde{n}}\) will be close to \(\text{argmin}_x f(x)\). In other words, \(x_{\tilde{n}}\) is approximately the minimizer of \(f\), and approximately a root of \(f'\).

  1. Write a function called f that takes a float x and returns \((x-42)^2 - 33\).
  2. Write a function called f_prime that takes a float and returns the derivative of the above.
  3. Write a function called f_dub_prime that takes a float and returns an evaluation of the second derivative of \(f\).
  4. Theoretically, what is the minimizer of \(f\)? Assign your answer to the variable best_x.
  5. Write a function called minimize() that takes three arguments, and performs ten iterations of Newton’s algorithm, after which it returns \(x_{10}\). Don’t be afraid of copy/pasting ten or so lines of code. We haven’t learned loops yet, so that’s fine. The ordered arguments are:
    • the function that evaluates the derivative of the function you’re interested in,
    • the function that evaluates the second derivative of your objective function,
    • an initial guess of the minimizer.
  6. Test your function by plugging in the above functions, and use a starting point of \(10\). Assign the output to a variable called x_ten.

Write a function called smw_inverse(A,U,C,V) that returns the inverse of a matrix using the Sherman-Morrison-Woodbury formula (Guttman 1946). Have it take the arguments \(A\), \(U\), \(C\), and \(V\) in that order and as Numpy ndarrays. Assume that A is a diagonal matrix.

\[\begin{equation} (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}V A^{-1} \end{equation}\] Despite being difficult to remember, this formula can be quite handy for speeding up matrix inversions when \(A\) and \(C\) are easier to invert (e.g. if \(A\) is diagonal and \(C\) is a scalar). The formula shows up often in applications where you multiply matrices together (there are many such examples).

To check your work, pick certain inputs, and make sure your formula corresponds with the naive, left-hand-side approach.

References

Guttman, Louis. 1946. “Enlargement Methods for Computing the Inverse Matrix.” The Annals of Mathematical Statistics 17 (3): 336–43. https://doi.org/10.1214/aoms/1177730946.


  1. Primitive functions are functions that contain no R code and are internally implemented in C. These are the only type of function in R that don’t have a parent environment.

  2. You might have noticed that Python uses two different words to prevent confusion. Unlike R, Python uses the word “parameter” (instead of “argument”) to refer to the inputs a function takes, and “arguments” to the specific values a user plugs in.

  3. Functions aren’t the only thing that get their own namespace. Classes do, too. More information on classes is provided in Chapter 14.

  4. There are some exceptions to this, but it’s generally true.