Chapter 8 Data Frames
The rectangular array of information (e.g. an Excel spreadsheet) is what many think of when they hear the word “data.” Each column contains elements of a shared data type, and these data types can vary from column to column.
There is a type for this in R and Python: a data frame. It might even be the most common way that data is stored in both R and Python programs because many functions that read in data from an external source return objects of this type (e.g. read.csv() in R and pd.read_csv() in Python).
R and Python’s data frames have a number of things in common:
- each column must be the same length as all other columns,
- each column’s elements will all have the same type,
- elements within a single row can have different types,
- columns and rows can be named in different ways,
- there are many ways to get and set different subsets of data, and
- when reading in data, the same sorts of difficulties arise in both languages.
8.1 Data Frames in R
Let’s consider as an example Fisher’s “Iris” data set (Fisher 1988), hosted by the UCI Machine Learning Repository (Dua and Graff 2017). We will read this data set in from a comma-separated file (more information on input/output can be found in chapter 9). The file can be downloaded from https://archive.ics.uci.edu/ml/datasets/iris.
irisData <- read.csv("data/iris.csv", header = F)
head(irisData, 3)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
typeof(irisData)
## [1] "list"
class(irisData) # we'll talk more about classes later
## [1] "data.frame"
dim(irisData)
## [1] 150 5
nrow(irisData)
## [1] 150
ncol(irisData)
## [1] 5
There are some exceptions, but most data sets can be stored as a data.frame. These kinds of two-dimensional data sets are quite common. Any particular row is often an observation on one experimental unit (e.g. person, place or thing). Looking at a particular column gives you one kind of measurement stored for all observations.
Do not rely on the default arguments of read.csv() or read.table()! After you read in a data frame, always check the items below (a sketch of a more careful call follows this list).
- the number of columns is correct because the correct column separator was used (c.f. sep=),
- column names were parsed correctly, if there were some in the raw text file,
- the first row of data wasn’t used as a column name sequence, if there weren’t column names in the text file,
- the last few rows aren’t reading in empty spaces,
- character columns are read in correctly (c.f. stringsAsFactors=), and
- special characters signifying missing data were correctly identified (c.f. na.strings=).
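For example, a more defensive call spells these arguments out rather than trusting the defaults. The following is only a sketch: the separator, header setting, and the "?" missing-data marker are assumptions you would adjust for your own file.
irisData <- read.csv("data/iris.csv",
                     sep = ",",                  # the column separator
                     header = FALSE,             # this file has no header row
                     stringsAsFactors = FALSE,   # keep character columns as character
                     na.strings = c("NA", "?"))  # assumed missing-data markers
# quick sanity checks after reading
dim(irisData)   # expected number of rows and columns?
str(irisData)   # did each column get a sensible type?
head(irisData)  # does the first row look like data, not column names?
tail(irisData)  # any empty rows at the bottom?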
A data.frame is a special case of a list. Every element of the list is a column. Columns can be vectors or factors, and they can all have different types. This is one of the biggest differences between data frames and matrices. They are both two-dimensional, but a matrix needs all of its elements to be the same type. Unlike a general list, a data.frame requires all of its columns to have the same number of elements. In other words, a data.frame is not a “ragged” list.
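To see this in code, here is a minimal sketch with a small made-up data.frame (not the iris data): the data.frame is stored as a list of equal-length columns, while coercing it to a matrix forces everything into one common type.
df <- data.frame(height = c(1.5, 1.8), name = c("a", "b"))
typeof(df)         # "list": a data.frame is a list of columns
is.list(df)        # TRUE
sapply(df, class)  # each column keeps its own class
m <- as.matrix(df) # coercing to a matrix...
typeof(m)          # ..."character": everything is forced to one common type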
Oftentimes you will need to extract pieces of information from a data.frame. This can be done in many ways. If the columns have names, you can use the $ operator to access a single column; this gives you a vector (or factor) that you might assign to a new variable. You can also use the [ operator to access multiple columns by name.
colnames(irisData) <- c("sepal.length", "sepal.width",
"petal.length","petal.width",
"species")
firstCol <- irisData$sepal.length
head(firstCol)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
firstTwoCols <- irisData[c("sepal.length", "sepal.width")]
head(firstTwoCols, 3)
## sepal.length sepal.width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
The [ operator is also useful for selecting rows and columns by index numbers, or by some logical criteria.
topLeft <- irisData[1,1] # first row, first col
topLeft
## [1] 5.1
firstThreeRows <- irisData[1:3,] # rows 1-3, all cols
firstThreeRows
## sepal.length sepal.width petal.length petal.width species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
# rows where species column is setosa
setosaOnly <- irisData[irisData$species == "Iris-setosa",]
setosaOnly[1:3,-1]
## sepal.width petal.length petal.width species
## 1 3.5 1.4 0.2 Iris-setosa
## 2 3.0 1.4 0.2 Iris-setosa
## 3 3.2 1.3 0.2 Iris-setosa
In the code above, irisData$species == "Iris-setosa" creates a logical vector (try it!) using the vectorized == operator. The [ operator selects the rows for which the corresponding element of this logical vector is TRUE.
Be careful: depending on how you use the square brackets, you can either get a data.frame or a vector. As an example, try both class(irisData[,1]) and class(irisData[,c(1,2)]).
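For instance, here is a small sketch of that check; drop= is a standard argument of the data frame’s square bracket operator that prevents a single column from being simplified to a vector.
class(irisData[, 1])               # "numeric": the data.frame structure is dropped
class(irisData[, c(1, 2)])         # "data.frame"
class(irisData[, 1, drop = FALSE]) # "data.frame": drop = FALSE keeps the structure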
In R, data.frames might have row names. You can get and set this character vector with the rownames() function. You can access rows by name using the square bracket operator.
head(rownames(irisData))
## [1] "1" "2" "3" "4" "5" "6"
rownames(irisData) <- as.numeric(rownames(irisData)) + 1000
head(rownames(irisData))
## [1] "1001" "1002" "1003" "1004" "1005" "1006"
irisData["1002",]
## sepal.length sepal.width petal.length petal.width species
## 1002 4.9 3 1.4 0.2 Iris-setosa
Code that modifies data usually looks quite similar to code extracting data. You’ll notice a lot of the same symbols (e.g. $, [, etc.), but the assignment arrow (<-) will point in the other direction.
irisData$columnOfOnes <- rep(1, nrow(irisData))
irisData[,1] <- NULL #delete first col
irisData[1:2,1] <- rnorm(n = 2, mean = 999)
irisData[,'sepal.width'] <- rnorm(n = nrow(irisData), mean = -999)
irisData[irisData$species == "Iris-setosa", 'species'] <- "SETOSA!"
head(irisData, 3)
## sepal.width petal.length petal.width species columnOfOnes
## 1001 -998.9036 1.4 0.2 SETOSA! 1
## 1002 -997.3385 1.4 0.2 SETOSA! 1
## 1003 -999.6752 1.3 0.2 SETOSA! 1
8.2 Data Frames in Python
The Pandas library in Python has data frames that are modeled after R’s (McKinney 2017).
import pandas as pd
iris_data = pd.read_csv("data/iris.csv", header = None)
iris_data.head(3)
## 0 1 2 3 4
## 0 5.1 3.5 1.4 0.2 Iris-setosa
## 1 4.9 3.0 1.4 0.2 Iris-setosa
## 2 4.7 3.2 1.3 0.2 Iris-setosa
iris_data.shape
## (150, 5)
len(iris_data) # num rows
## 150
len(iris_data.columns) # num columns
## 5
list(iris_data.dtypes)[:3]
## [dtype('float64'), dtype('float64'), dtype('float64')]
list(iris_data.dtypes)[3:]
## [dtype('float64'), dtype('O')]
The structure is very similar to that of R’s data frame. It’s two dimensional, and you can access columns and rows by name or number. Each column is a Series object, and each column can have a different dtype, which is analogous to the situation in R. Again, because elements only need to share a type within a column, this is a big difference between two-dimensional NumPy ndarrays and DataFrames (c.f. R’s matrix versus R’s data.frame).
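As a quick illustration (a sketch using the iris_data object read in above), converting the mixed-type DataFrame to a NumPy array forces one common dtype, whereas the DataFrame keeps a dtype per column.
arr = iris_data.to_numpy()  # four float columns plus one string column
arr.dtype                   # dtype('O'): everything upcast to object
iris_data.dtypes            # per-column dtypes are preserved in the DataFrame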
Again, do not rely on the default arguments of pd.read_csv()! After you read in a data set, always check the items below (a sketch of a more careful call follows this list).
- the number of columns is correct because the correct column separator was used (c.f. sep=),
- column names were parsed correctly, if there were some in the raw text file,
- the first row of data wasn’t used as a column name sequence, if there weren’t column names in the text file (c.f. header=),
- the last few rows aren’t reading in empty spaces,
- character columns are read in correctly (c.f. dtype=), and
- special characters signifying missing data were correctly identified (c.f. na_values=).
Square brackets are a little different in Python than they are in R. Just like in R, you can access columns by name with square brackets, and you can also access rows. Unlike R, though, you don’t have to specify both rows and columns every time you use the square brackets.
iris_data.columns = ["sepal.length", "sepal.width", "petal.length",
"petal.width", "species"]
first_col = iris_data['sepal.length']
first_col.head()
## 0 5.1
## 1 4.9
## 2 4.7
## 3 4.6
## 4 5.0
## Name: sepal.length, dtype: float64
first_two_cols = iris_data[["sepal.length", "sepal.width"]]
first_two_cols.head(3)
## sepal.length sepal.width
## 0 5.1 3.5
## 1 4.9 3.0
## 2 4.7 3.2
Notice that iris_data['sepal.length'] returns a Series and iris_data[["sepal.length", "sepal.width"]] returns a Pandas DataFrame. This behavior is similar to what happened in R. For more details, see the Pandas documentation on indexing and selecting data.
You can select columns and rows by number with the .iloc indexer. iloc is (probably) short for “integer location.”
# specify rows/cols by number
top_left = iris_data.iloc[0,0]
top_left
## 5.1
first_three_rows_without_last_col = iris_data.iloc[:3,:-1]
first_three_rows_without_last_col
## sepal.length sepal.width petal.length petal.width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
Selecting columns by anything besides integer position can be done with the .loc indexer. You should generally prefer it for accessing columns, because accessing things by name instead of number is more readable. Here are some examples.
sepal_w_to_pedal_w = iris_data.loc[:,'sepal.width':'petal.width']
sepal_w_to_pedal_w.head()
## sepal.width petal.length petal.width
## 0 3.5 1.4 0.2
## 1 3.0 1.4 0.2
## 2 3.2 1.3 0.2
## 3 3.1 1.5 0.2
## 4 3.6 1.4 0.2
setosa_only = iris_data.loc[iris_data['species'] == "Iris-setosa",]
# don't need the redundant column anymore
del setosa_only['species']
setosa_only.head(3)
## sepal.length sepal.width petal.length petal.width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
Notice we used a slice (i.e. 'sepal.width':'petal.width') to access many columns by referring only to the left-most and the right-most. Unlike slicing with integers, the right end is included. Also note that this does not work with the regular square bracket operator (i.e. iris_data['sepal.width':'petal.width']). The second example keeps only the rows where the "species" column elements are equal to "Iris-setosa".
Each DataFrame in Pandas comes with an .index attribute. This is analogous to a row name in R, but it’s much more flexible because the index can take on a variety of types. This can help us highlight the difference between .loc and .iloc. Recall that .loc was label-based selection. Labels don’t necessarily have to be strings. Consider the following example.
iris_data.index
## RangeIndex(start=0, stop=150, step=1)
# reverse the index
iris_data = iris_data.set_index(iris_data.index[::-1])
iris_data.iloc[-2:,:3] # top is now bottom
## sepal.length sepal.width petal.length
## 1 6.2 3.4 5.4
## 0 5.9 3.0 5.1
iris_data.loc[0] # last row has 0 index
## sepal.length 5.9
## sepal.width 3
## petal.length 5.1
## petal.width 1.8
## species Iris-virginica
## Name: 0, dtype: object
iris_data.iloc[0] # first row with big index
## sepal.length 5.1
## sepal.width 3.5
## petal.length 1.4
## petal.width 0.2
## species Iris-setosa
## Name: 149, dtype: object
iris_data.loc[0] selects the row whose index label is 0. Because the code above reversed the index, this is actually the last row. If you want the first row, use iris_data.iloc[0].
Modifying data inside a data frame looks quite similar to extracting data. You’ll recognize a lot of the methods mentioned earlier.
import numpy as np
n_rows = iris_data.shape[0]
iris_data['col_ones'] = np.repeat(1.0, n_rows)
iris_data.iloc[:2,0] = np.random.normal(loc=999, size=2)
rand_nums = np.random.normal(loc=-999, size=n_rows)
iris_data.loc[:,'sepal.width'] = rand_nums
setosa_rows = iris_data['species'] == "Iris-setosa"
iris_data.loc[setosa_rows, 'species'] = "SETOSA!"
del iris_data['petal.length']
iris_data.head(3)
## sepal.length sepal.width petal.width species col_ones
## 149 999.146739 -998.446586 0.2 SETOSA! 1.0
## 148 998.005821 -999.015224 0.2 SETOSA! 1.0
## 147 4.700000 -999.803985 0.2 SETOSA! 1.0
You can also use the .assign() method to create a new column. This method does not modify the data frame in place. It returns a new DataFrame with the additional column.
iris_data = iris_data.assign(new_col_name = np.arange(n_rows))
del iris_data['sepal.length']
iris_data.head(3)
## sepal.width petal.width species col_ones new_col_name
## 149 -998.446586 0.2 SETOSA! 1.0 0
## 148 -999.015224 0.2 SETOSA! 1.0 1
## 147 -999.803985 0.2 SETOSA! 1.0 2
Above we were assigning NumPy arrays to columns of a DataFrame. Be careful when you’re assigning Series objects. You’ll see in the documentation that “Pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.”
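Here is a small sketch of what that alignment means in practice; the DataFrame and Series below are made up purely for illustration.
import pandas as pd
df = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=[2, 1, 0])
s = pd.Series([10.0, 20.0, 30.0], index=[0, 1, 2])
df['by_label'] = s                # aligned on index labels: the row labeled 0 gets 10.0
df['by_position'] = s.to_numpy()  # a plain array is assigned by position: the first row gets 10.0
df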
8.3 Exercises
8.3.1 R Questions
Consider the data set "original_rt_snippets.txt" (Socher et al. 2013), hosted by the UCI Machine Learning Repository (Dua and Graff 2017). We will calculate the term frequency-inverse document frequency statistics (Jones 1972) on this data set, which is a common data transformation technique used in text mining and natural language processing. You may use the stringr library for this question, if you wish.
- read in this data set as a vector and call it corpus.
- create a vector called dictionary that contains the following phrases: “charming”, “fantasy”, “hate”, and “boring”.
- Construct a data.frame with four columns called bagOfWords that contains the number of appearances of each word in the dictionary. Match the exact phrases. For simplicity, don’t worry about the case of letters or using regular expressions (c.f. section 3.9). Label the columns of this data.frame with the phrases you’re searching for. Try to write code that is easy to modify if you decide to change the set of phrases in your dictionary.
- Create a data.frame with four columns called termFrequency. Each element should correspond with the count from the previous data.frame. Instead of the count, each element should be .
- Create a vector of length four called invDocFrequency. The inverse document frequency for any term is the log of the total number of documents minus the log of the number of documents containing that term. Make sure the names of this vector are the same as the words in the dictionary.
- Create a data.frame called tfidf (short for “term frequency-inverse document frequency”). For row/document i, and column/term j, the formula is the product of the term frequency and the inverse document frequency.
- Extract the elements of corpus that have at least one nonzero element in the corresponding row of tfidf. Call the vector informativeDocuments.
- Do you see any documents that were labeled as informative, that do not actually contain the words you searched for?
mtcars is a data set that is built into R, so you don’t need to read it in. You can read more about it by typing ?datasets::mtcars.
- Create a new data.frame called withoutDisp that is the same as mtcars, but has its disp column removed.
- Create a new column for withoutDisp called coolnessScore. The formula is .
- Create a new data.frame called sortedWD that is equal to withoutDisp, but sorted in descending order by the coolness score.
- Create a new data.frame from sortedWD called specialRows that keeps only rows where .
- Calculate the percent reduction in number of rows, going from sortedWD to specialRows. Call it percReduction. Make sure it is between and .
This question investigates the Zillow Home Value Index (ZHVI) for single family homes.
- read in "Zip_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv". Call the data.frame homeData. Remember to be careful with file paths. Also, when looking at the data set with a text editor, be sure that “word wrap” is not turned on.
- Extract the rows of homeData that correspond with Charlottesville, VA, and assign them as a data.frame to the variable cvilleData.
- Assign all the unique zip codes to a character vector called cvilleZips.
- Extract the columns of cvilleData that correspond with home prices, and transpose them so that each row of the result corresponds with a different month. Call this new data.frame timeSeriesData. Also, make sure that the column names of this new data.frame are set equal to the appropriate zip codes.
- Write a function called getAveMonthlyChange that takes in a numeric vector, and returns the average change. Your function should not return an NA, so be sure to deal with NAs appropriately.
- Calculate the average monthly price change for each zip code. Store your results in a vector called aveMonthlyChanges. Make sure this vector has named elements so one can extract elements by zip code.
8.3.2 Python Questions
This question deals with looking at historical prices of the S&P500 Index. This data was downloaded from https://finance.yahoo.com (“GSPC Data” 2021). It contains prices starting from “2007-01-03” and going until “2021-10-01”.
- Read in the data file "gspc.csv" as a data.frame and call the variable gspc.
- Use .set_index() to change the index of gspc to its "Index" column. Store the new DataFrame as gspc_good_idx.
- Recall the formula for log returns provided in the exercises of chapter 3. Add a column to gspc_good_idx called log_returns. Calculate them from the column GSPC.Adjusted. Make sure to scale them by and put an np.nan in the first element where you don’t have a return.
- Extract all the returns that are available for the year 2021, and store them as a Series called this_year_returns.
- Add a column to gspc_good_idx containing the drawdown time series. Call this column drawdown. To calculate drawdown at a given date, subtract that date’s price from the running maximum price that is current at that date. Use only adjusted close prices for this calculation.
- Add a column to gspc_good_idx containing the percent drawdown time series. Call this column perc_drawdown. Use the previous column, but make this number a percent of the corresponding running maximum.
- What is the maximum drawdown of this time series? Store it as a percent in the value mdd.
In this question we’ll look at some data on radon measurements (Gelman and Hill 2007). Instead of reading in a text file, we will load the data into Python using the tensorflow_datasets module (“TensorFlow Datasets, a Collection of Ready-to-Use Datasets” 2021).
Please include the following code in your submission.
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np
d = tfds.load("radon")
d = pd.DataFrame(tfds.as_dataframe(d['train']))
Many of you will need to install tensorflow and tensorflow_datasets before you’re able to import them. If that’s so, please read section 10.2 for more information on how to install packages.
- Assign to worst_row the row of d that is associated with the highest recorded level of radon. Make sure it is a DataFrame.
- Assign to nrows and ncols the number of rows and columns of d, respectively.
- Assign the most common column data type to most_common_dtype. Make sure the variable is of type numpy.dtype.
- Are there any observations from Virginia in this data set? If so, assign True to any_va. Otherwise assign False.
- Create a new column in d called dist_from_cville. Use the Haversine formula to calculate the distance between each row and the University of Virginia, in kilometers.
  - Assume the University of Virginia is at 38.0336 N, 78.5080 W.
  - Assume the Earth’s radius is kilometers.
  - The formula for the distance between two points (lambda1, phi1) and (lambda2, phi2), where each point is a (signed longitude in radians, signed latitude in radians) pair, is 2*r*arcsin(sqrt(sin^2((phi2 - phi1)/2) + cos(phi1)*cos(phi2)*sin^2((lambda2 - lambda1)/2))).
- What is the average radon measurement among all measurements taken at the place that is closest to where we are now? Assign your answer as a float to close_ave.
References
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Fisher, R. A. 1988. “Iris.” UCI Machine Learning Repository.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.
“GSPC Data.” 2021. https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.
Jones, Karen Spärck. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28: 11–21.
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. 2nd ed. O’Reilly Media, Inc.
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. “Parsing with Compositional Vector Grammars.” In EMNLP.
“TensorFlow Datasets, a Collection of Ready-to-Use Datasets.” 2021. https://www.tensorflow.org/datasets.