Chapter 8 Data Frames

The rectangular array of information (e.g., an Excel spreadsheet) is what many think of when they hear the word “data.” Each column contains elements of a shared data type, and these data types can vary from column to column.

There is a type for this in R and Python: a data frame. It might even be the most common way that data is stored in both R and Python programs because many functions that read in data from an external source return objects of this type (e.g. read.csv() in R and pd.read_csv() in Python).

R and Python’s data frames have a number of things in common:

  1. each column must be the same length as all other columns,
  2. each column’s elements will all have the same type,
  3. elements in the same row can have different types,
  4. columns and rows can be named in different ways,
  5. there are many ways to get and set different subsets of data, and
  6. when reading in data, the same sorts of difficulties arise in both languages.

8.1 Data Frames in R

Let’s consider as an example Fisher’s “Iris” data set (Fisher 1988), hosted by the UCI Machine Learning Repository (Dua and Graff 2017). We will read this data set in from a comma-separated file (more information on input/output can be found in chapter 9). The file can be downloaded from https://archive.ics.uci.edu/ml/datasets/iris.
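A minimal sketch of reading it in might look like this; "iris.csv" is an assumed local file name, and the raw UCI file has no header row, so we supply column names ourselves.

```r
# read the raw file, naming the columns ourselves
irisData <- read.csv("iris.csv", header = FALSE,
                     col.names = c("sepal.length", "sepal.width",
                                   "petal.length", "petal.width", "species"),
                     stringsAsFactors = FALSE, na.strings = "?")
head(irisData)  # always look at what you read in
```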

There are some exceptions, but most data sets can be stored as a data.frame. These kinds of two-dimensional data sets are quite common. Any particular row is often an observation on one experimental unit (e.g. person, place or thing). Looking at a particular column gives you one kind of measurement stored for all observations.

Do not rely on the default arguments of read.csv() or read.table()! After you read in a data frame, always check to make sure that

  1. the number of columns is correct because the correct column separator was used (cf. sep=),
  2. column names were parsed correctly, if there were any in the raw text file,
  3. the first row of data wasn’t used as column names, if there weren’t column names in the text file (cf. header=),
  4. the last few rows aren’t reading in empty space,
  5. character columns are read in correctly (cf. stringsAsFactors=), and
  6. special characters signifying missing data were correctly identified (cf. na.strings=).

A data.frame is a special case of a list. Every element of the list is a column. Columns can be vectors or factors, and they can all be of a different type. This is one of the biggest differences between data frames and matrices. They are both two-dimensional, but a matrix needs its elements to be all the same type. Unlike a general list, a data.frame requires all of its columns to have the same number of elements. In other words, a data.frame is not a “ragged” list.
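A small sketch makes the contrast concrete:

```r
# columns of different types are fine in a data.frame...
df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"), stringsAsFactors = FALSE)
str(df)
# ...but a matrix coerces everything to a single type (here, character)
m <- matrix(c(1.5, 2.5, "a", "b"), ncol = 2)
typeof(m)  # "character"
```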

Oftentimes you will need to extract pieces of information from a data.frame. This can be done in many ways. If the columns have names, you can use the $ operator to access a single column; the result is a vector (or factor), which you might store in a new variable. You can also use the [ operator to access one or more columns by name.
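For instance, a couple of hedged examples, assuming the irisData frame from above:

```r
oneCol  <- irisData$sepal.length                      # a single column, as a vector
twoCols <- irisData[c("sepal.length", "sepal.width")] # two columns, as a data.frame
```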

The [ operator is also useful for selecting rows and columns by index numbers, or by some logical criteria.
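Here is a sketch of both, again using irisData:

```r
irisData[1:3, c(1, 2)]                                    # rows 1-3, first two columns
setosas <- irisData[irisData$species == "Iris-setosa", ]  # rows matching a condition
```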

In the code above, irisData$species == "Iris-setosa" creates a logical vector (try it!) using the vectorized == operator. The [ operator selects the rows for which the corresponding element of this logical vector is TRUE.

Be careful: depending on how you use the square brackets, you can either get a data.frame or a vector. As an example, try both class(irisData[,1]) and class(irisData[,c(1,2)]).
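If you want to rule out the surprise, base R’s drop= argument prevents the simplification to a vector:

```r
class(irisData[, 1])                # "numeric" -- simplified to a vector
class(irisData[, c(1, 2)])          # "data.frame"
class(irisData[, 1, drop = FALSE])  # "data.frame" -- simplification suppressed
```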

In R, data.frames might have row names. You can get and set this character vector with the rownames() function. You can access rows by name using the square bracket operator.
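For example, here is a sketch that assigns some hypothetical row names and then uses one:

```r
rownames(irisData) <- paste0("flower", seq_len(nrow(irisData)))  # set row names
irisData["flower1", ]                                            # select a row by name
```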

Code that modifies data usually looks quite similar to code extracting data. You’ll notice a lot of the same symbols (e.g. $, [, etc.), but the assignment arrow (<-) will point in the other direction.
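A hedged sketch; the column sepal.area is hypothetical, not part of the raw data:

```r
# create a new column from two existing ones
irisData$sepal.area <- irisData$sepal.length * irisData$sepal.width
# overwrite a single element with the same bracket syntax
irisData[1, "sepal.width"] <- 3.5
```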

8.2 Data Frames in Python

The Pandas library in Python has data frames that are modeled after R’s (McKinney 2017).

The structure is very similar to that of R’s data frame. It’s two-dimensional, and you can access columns and rows by name or number. Each column is a Series object, and each column can have its own dtype, just as in R. Again, because elements only need to share a type within a column, this is a big difference between 2-d NumPy ndarrays and DataFrames (cf. R’s matrix versus R’s data.frame).
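As a small sketch, each column below is a Series with its own dtype:

```python
import pandas as pd

df = pd.DataFrame({"measurement": [1.5, 2.7], "label": ["a", "b"]})
print(df.dtypes)  # measurement is float64; label is object
```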

Again, do not rely on the default arguments of pd.read_csv()! After you read in a data set, always check that the following hold (a sketch of a careful call comes after the list):

  1. the number of columns is correct because the correct column separator was used (cf. sep=),
  2. column names were parsed correctly, if there were any in the raw text file,
  3. the first row of data wasn’t used as column names, if there weren’t column names in the text file (cf. header=),
  4. the last few rows aren’t reading in empty space,
  5. character columns are read in correctly (cf. dtype=), and
  6. special characters signifying missing data were correctly identified (cf. na_values=).
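A careful call might look something like this; "iris.csv" is an assumed local file name, and the raw UCI iris file has no header row, so we supply column names ourselves.

```python
import pandas as pd

# spell out the arguments instead of trusting the defaults
iris_data = pd.read_csv("iris.csv", sep=",", header=None,
                        names=["sepal.length", "sepal.width",
                               "petal.length", "petal.width", "species"],
                        na_values="?")
iris_data.head()  # inspect what was actually read in
```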

Square brackets are a little different in Python than they are in R. Just like in R, you can access columns by name with square brackets, and you can also access rows. Unlike R, though, you don’t have to specify both rows and columns every time you use the square brackets.
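A few hedged examples, assuming the iris_data frame from above:

```python
iris_data['sepal.length']                   # one name: returns a Series
iris_data[['sepal.length', 'sepal.width']]  # a list of names: returns a DataFrame
iris_data[:3]                               # a slice selects rows, not columns
```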

Notice that iris_data['sepal.length'] returns a Series, while iris_data[["sepal.length", "sepal.width"]] returns a Pandas DataFrame. This behavior is similar to what happened in R. For more details, see the indexing section of the Pandas documentation.

You can select columns and rows by number with the .iloc method. iloc is (probably) short for “integer location.”
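A couple of examples, again using iris_data:

```python
iris_data.iloc[0]        # the first row, as a Series
iris_data.iloc[:3, :2]   # the first three rows and the first two columns
```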

Selecting columns by anything besides integer position can be done with the .loc indexer. You should generally prefer it for accessing columns, because accessing things by name instead of number is more readable. Here are some examples.
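Two sketches, assuming the column names used above:

```python
# a slice of column labels; both endpoints are included
iris_data.loc[:, 'sepal.width':'petal.width']
# a logical condition selecting rows
iris_data.loc[iris_data['species'] == "Iris-setosa", :]
```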

Notice we used a slice (i.e. 'sepal.width':'petal.width') to access many columns by referring only to the left-most and the right-most. Unlike slicing with integers, the right end is included. Also note that this does not work with the regular square bracket operator (plain square brackets interpret a slice as a row selection, so iris_data['sepal.width':'petal.width'] fails). The second example keeps only the rows where the "species" column equals "Iris-setosa".

Each DataFrame in Pandas comes with an .index attribute. This is analogous to row names in R, but it’s much more flexible because the index can take on a variety of types. This can help us highlight the difference between .loc and .iloc. Recall that .loc is label-based selection. Labels don’t necessarily have to be strings. Consider the following example.
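Here is a sketch; reversing the index is a hypothetical manipulation chosen to make the point:

```python
iris_data.index = iris_data.index[::-1]  # reverse the row labels
iris_data.loc[0]   # label-based: the row *labeled* 0 is now the last row
iris_data.iloc[0]  # position-based: still the physical first row
```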

iris_data.loc[0] selects the row labeled 0. Because the index was reversed above, that is actually the last row. If you want the first row by position, use iris_data.iloc[0].

Modifying data inside a data frame looks quite similar to extracting data. You’ll recognize a lot of the methods mentioned earlier.
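A hedged sketch (the column sepal.area is hypothetical):

```python
import numpy as np

# assign a NumPy array to a brand-new column
iris_data['sepal.area'] = (np.array(iris_data['sepal.length'])
                           * np.array(iris_data['sepal.width']))
# overwrite a subset of rows using .loc
iris_data.loc[iris_data['species'] == "Iris-setosa", 'sepal.area'] = 0.0
```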

You can also use the .assign() method to create a new column. This method does not modify the data frame in place. It returns a new DataFrame with the additional column.
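For example (petal_area is hypothetical; keyword names passed to .assign() must be valid Python identifiers, so dots are not allowed):

```python
wider = iris_data.assign(petal_area=iris_data['petal.length']
                         * iris_data['petal.width'])
```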

Above we were assigning NumPy arrays to columns of a DataFrame. Be careful when you’re assigning Series objects: you’ll see in the documentation that “pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.”

8.3 Exercises

8.3.1 R Questions

Consider the data set "original_rt_snippets.txt" (Socher et al. 2013), hosted by the UCI Machine Learning Repository (Dua and Graff 2017). We will calculate term frequency-inverse document frequency statistics (Jones 1972) on this data set; this is a common data transformation used in text mining and natural language processing. You may use the stringr library for this question, if you wish.

  1. Read in this data set as a character vector and call it corpus.
  2. Create a vector called dictionary that contains the following phrases: “charming”, “fantasy”, “hate”, and “boring”.
  3. Construct a data.frame with four columns called bagOfWords that contains the number of appearances of each word in the dictionary. Match the exact phrases. For simplicity, don’t worry about the case of letters or using regular expressions (c.f. section 3.9). Label the columns of this data.frame with the phrases you’re searching for. Try to write code that is easy to modify if you decide to change the set of phrases in your dictionary.
  4. Create a data.frame with four columns called termFrequency. Each element should correspond with the count from the previous data.frame. Instead of the count, each element should be \(\log(1 + \text{count})\).
  5. Create a vector of length four called invDocFrequency. The inverse document frequency formula for any term \(t\) is \(\log([\text{number of documents in corpus}])\) minus \(\log([\text{number of documents that contain term } t])\). Make sure the names of this vector are the same as the words in the dictionary.
  6. Create a data.frame called tfidf (short for “term frequency-inverse document frequency”). For row/document \(d\), and column/term \(t\), the formula is the product: \([\text{term frequency of term } t \text{ and document } d ] \times [\text{inverse doc. freq. of term } t]\).
  7. Extract the elements of corpus that have at least one nonzero element in the corresponding row of tfidf. Call the vector informativeDocuments.
  8. Do you see any documents that were labeled as informative, that do not actually contain the words you searched for?

mtcars is a data set that is built into R, so you don’t need to read it in. You can read more about it by typing ?datasets::mtcars.

  1. Create a new data.frame called withoutDisp that is the same as mtcars, but has its disp column removed.
  2. Create a new column for withoutDisp called coolnessScore. The formula is \(\frac{1}{\text{mpg}} + \text{quarter mile time}\).
  3. Create a new data.frame called sortedWD that is equal to withoutDisp, but sorted in descending order by the coolness score.
  4. Create a new data.frame from sortedWD called specialRows that keeps only rows where \(\text{weight (1000lbs)} + \text{rear axle ratio} < 7\).
  5. Calculate the percent reduction in number of rows, going from sortedWD to specialRows. Call it percReduction. Make sure it is between \(0\) and \(100\).

This question investigates the Zillow Home Value Index (ZHVI) for single family homes.

  1. read in "Zip_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv". Call the data.frame homeData. Remember to be careful with file paths. Also, when looking at the data set with a text editor, be sure that “word wrap” is not turned on.
  2. Extract the rows of homeData that correspond with Charlottesville, VA, and assign them as a data.frame to the variable cvilleData
  3. Assign all the unique zip codes to a character vector called cvilleZips.
  4. Extract the columns of cvilleData that correspond with home prices, and transpose them so that each row of the result corresponds with a different month. Call this new data.frame timeSeriesData. Also, make sure that the column names of this new data.frame are set equal to the appropriate zip codes.
  5. Write a function called getAveMonthlyChange that takes in a numeric vector, and returns the average change. Your function should not return an NA, so be sure to deal with NAs appropriately.
  6. Calculate the average monthly price change for each zip code. Store your results in a vector called aveMonthlyChanges. Make sure this vector has named elements so one can extract elements by zip code.

8.3.2 Python Questions

This question deals with looking at historical prices of the S&P500 Index. This data was downloaded from https://finance.yahoo.com (“GSPC Data” 2021). It contains prices starting from “2007-01-03” and going until “2021-10-01”.

  1. Read in the data file "gspc.csv" as a DataFrame and call the variable gspc.
  2. Use .set_index() to change the index of gspc to its "Index" column. Store the new DataFrame as gspc_good_idx.
  3. Recall the formula for log returns provided in exercises of chapter 3. Add a column to gspc_good_idx called log_returns. Calculate them from the column GSPC.Adjusted. Make sure to scale them by \(100\) and put an np.nan in the first element where you don’t have a return.
  4. Extract all the returns that are available for the year 2021, and store them as a Series called this_year_returns.
  5. Add a column to gspc_good_idx containing the drawdown time series. Call this column drawdown. To calculate drawdown at a given date, subtract that date’s price from the running maximum price that is current at that date. Use only adjusted close prices for this calculation.
  6. Add a column to gspc_good_idx containing the percent drawdown time series. Call this column perc_drawdown. Use the previous column, but make this number a percent of the corresponding running maximum.
  7. What is the maximum drawdown of this time series? Store it as a percent in the value mdd.

In this question we’ll look at some data on radon measurements (Gelman and Hill 2007). Instead of reading in a text file, we will load the data into Python using the tensorflow_datasets module (“TensorFlow Datasets, a Collection of Ready-to-Use Datasets” 2021).

Please include the following code in your submission.
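A minimal version, assuming the radon data ships with tensorflow_datasets under the name 'radon' and that the questions below expect a Pandas DataFrame named d, might look like this:

```python
import tensorflow_datasets as tfds

# load the radon data set and convert it to a Pandas DataFrame
d = tfds.as_dataframe(tfds.load('radon', split='train'))
```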

Many of you will need to install tensorflow and tensorflow_datasets before you’re able to import them. If that’s so, please read section 10.2 for more information on how to install packages.

  1. Assign to worst_row the row of d that is associated with the highest recorded level of radon. Make sure it is a DataFrame.
  2. Assign to nrows and ncols the number of rows and columns of d, respectively.
  3. Assign the most common column data type to most_common_dtype. Make sure the variable is of type numpy.dtype.
  4. Are there any observations from Virginia in this data set? If so, assign True to any_va. Otherwise assign False.
  5. Create a new column in d called dist_from_cville. Use the Haversine formula to calculate distance between each row and the University of Virginia, in kilometers.
    • Assume the University of Virginia is at 38.0336\(^\circ\) N, 78.5080\(^\circ\) W.
    • Assume the Earth’s radius \(r = 6378.137\) kilometers.
    • The formula for distance between \((\lambda_1, \phi_1)\) (signed longitude in radians, signed latitude in radians) and \((\lambda_2, \phi_2)\) is
    \[\begin{equation} 2 \hspace{1mm} r \hspace{1mm} \text{arcsin}\left( \sqrt{ \sin^2\left( \frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2) \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right) \end{equation}\]
  6. What is the average radon measurement among all measurements taken at the place that is closest to where we are now? Assign your answer as a float to close_ave.

References

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.

Fisher, R. A. 1988. “Iris.” UCI Machine Learning Repository.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.

Jones, Karen Spärck. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28: 11–21.

McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 2nd ed. O’Reilly Media, Inc.

Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. “Parsing with Compositional Vector Grammars.” In EMNLP.

“TensorFlow Datasets, a Collection of Ready-to-Use Datasets.” 2021. https://www.tensorflow.org/datasets.