Chapter 8 Data Frames
The rectangular array of information (e.g. an Excel spreadsheet ) is what many think of when they hear the word “data.” Each column contains elements of a shared data type, and these data types can vary from column to column.
There is a type for this in R and Python: a data frame. It might even be the most common way that data is stored in both R and Python programs because many functions that read in data from an external source return objects of this type (e.g. read.csv()
in R and pd.read_csv()
in Python).
R and Python’s data frames have a number of things in common:
- each column must be the same length as all other columns,
- each column’s elements will all have the same type,
- elements in any rows can have different types,
- columns and rows can be named in different ways,
- there are many ways to get and set different subsets of data, and
- when reading in data, the same sorts of difficulties arise in both languages.
8.1 Data Frames in R
Let’s consider as an example Fisher’s “Iris” data set (Fisher 1988) hosted by (Dua and Graff 2017). We will read this data set in from a comma separated file (more information on input/output can be found in chapter 9). This file can be downloaded from this link: https://archive.ics.uci.edu/ml/datasets/iris.
irisData <- read.csv("data/iris.csv", header = F)
head(irisData, 3)
## V1 V2 V3 V4 V5
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
typeof(irisData)
## [1] "list"
class(irisData) # we'll talk more about classes later
## [1] "data.frame"
dim(irisData)
## [1] 150 5
nrow(irisData)
## [1] 150
ncol(irisData)
## [1] 5
There are some exceptions, but most data sets can be stored as a data.frame
. These kinds of two-dimensional data sets are quite common. Any particular row is often an observation on one experimental unit (e.g. person, place or thing). Looking at a particular column gives you one kind of measurement stored for all observations.
Do not rely on the default arguments of read.csv()
or read.table()
! After you read in a data frame, always check to make sure that
- the number of columns is correct because the correct column separator was used (c.f.
sep=
), - column names were parsed correctly, if there were some in the raw text file,
- the first row of data wasn’t used as a column name sequence, if there weren’t column names in the text file, and
- the last few rows aren’t reading in empty spaces
- character columns are read in correctly (c.f.
stringsAsFactors=
), and - special characters signifying missing data were correctly identified (c.f.
na.strings=
).
A data.frame
is a special case of a list
. Every element of the list is a column. Columns can be vector
s or factor
s, and they can all be of a different type. This is one of the biggest differences between data frames and matrix
s. They are both two-dimensional, but a matrix
needs elements to be all the same type. Unlike a general list
, a data.frame
requires all of its columns to have the same number of elements. In other words, the data.frame
is not a “ragged” list.
Often times you will need to extract pieces of information from a data.frame
. This can be done in many ways. If the columns have names, you can use the $
operator to access a single column. Accessing a single column might be followed up by creating a new vector. You can also use the [
operator to access multiple columns by name.
colnames(irisData) <- c("sepal.length", "sepal.width",
"petal.length","petal.width",
"species")
firstCol <- irisData$sepal.length
head(firstCol)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
firstTwoCols <- irisData[c("sepal.length", "sepal.width")]
head(firstTwoCols, 3)
## sepal.length sepal.width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
The [
operator is also useful for selecting rows and columns by index numbers, or by some logical criteria.
topLeft <- irisData[1,1] # first row, first col
topLeft
## [1] 5.1
firstThreeRows <- irisData[1:3,] # rows 1-3, all cols
firstThreeRows
## sepal.length sepal.width petal.length petal.width species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
# rows where species column is setosa
setosaOnly <- irisData[irisData$species == "Iris-setosa",]
setosaOnly[1:3,-1]
## sepal.width petal.length petal.width species
## 1 3.5 1.4 0.2 Iris-setosa
## 2 3.0 1.4 0.2 Iris-setosa
## 3 3.2 1.3 0.2 Iris-setosa
In the code above, irisData$species == "Iris-setosa"
creates a logical vector (try it!) using the vectorized ==
operator. The [
operator selects the rows for which the corresponding element of this logical vector is TRUE
.
Be careful: depending on how you use the square brackets, you can either get a data.frame
or a vector.
As an example, try both class(irisData[,1])
and class(irisData[,c(1,2)])
.
In R, data.frame
s might have row names. You can get and set this character vector
with the rownames()
function. You can access rows by name using the square bracket operator.
head(rownames(irisData))
## [1] "1" "2" "3" "4" "5" "6"
rownames(irisData) <- as.numeric(rownames(irisData)) + 1000
head(rownames(irisData))
## [1] "1001" "1002" "1003" "1004" "1005" "1006"
irisData["1002",]
## sepal.length sepal.width petal.length petal.width species
## 1002 4.9 3 1.4 0.2 Iris-setosa
Code that modifies data usually looks quite similar to code extracting data. You’ll notice a lot of the same symbols (e.g. $
, [
, etc.), but the (<-
) will point in the other direction.
irisData$columnOfOnes <- rep(1, nrow(irisData))
irisData[,1] <- NULL #delete first col
irisData[1:2,1] <- rnorm(n = 2, mean = 999)
irisData[,'sepal.width'] <- rnorm(n = nrow(irisData), mean = -999)
irisData[irisData$species == "Iris-setosa", 'species'] <- "SETOSA!"
head(irisData, 3)
## sepal.width petal.length petal.width species columnOfOnes
## 1001 -998.9036 1.4 0.2 SETOSA! 1
## 1002 -997.3385 1.4 0.2 SETOSA! 1
## 1003 -999.6752 1.3 0.2 SETOSA! 1
8.2 Data Frames in Python
The Pandas library in Python has data frames that are modeled after R’s (McKinney 2017).
import pandas as pd
iris_data = pd.read_csv("data/iris.csv", header = None)
iris_data.head(3)
## 0 1 2 3 4
## 0 5.1 3.5 1.4 0.2 Iris-setosa
## 1 4.9 3.0 1.4 0.2 Iris-setosa
## 2 4.7 3.2 1.3 0.2 Iris-setosa
iris_data.shape
## (150, 5)
len(iris_data) # num rows
## 150
len(iris_data.columns) # num columns
## 5
list(iris_data.dtypes)[:3]
## [dtype('float64'), dtype('float64'), dtype('float64')]
list(iris_data.dtypes)[3:]
## [dtype('float64'), dtype('O')]
The structure is very similar to that of R’s data frame. It’s two dimensional, and you can access columns and rows by name or number. Each column is a Series
object, and each column can have a different dtype
, which is analogous to R’s situation. Again, because the elements need to be the same type along columns only, this is a big difference between 2-d Numpy ndarray
s and DataFrame
s (c.f. R’s matrix
versus R’s data.frame
).
Again, do not rely on the default arguments of pd.read_csv()
! After you read in a data set, always check that
- the number of columns is correct because the correct column separator was used (c.f.
sep=
), - column names were parsed correctly, if there were some in the raw text file,
- the first row of data wasn’t used as a column name sequence, if there weren’t column names in the text file (c.f.
header=
), and - the last few rows aren’t reading in empty spaces
- character columns are read in correctly (c.f.
dtype=
), and - special characters signifying missing data were correctly identified (c.f.
na.values=
).
Square brackets are a little different in Python than they are in R. Just like in R, you can access columns by name with square brackets, and you can also access rows. Unlike R, though, you don’t have to specify both rows and columns every time you use the square brackets.
iris_data.columns = ["sepal.length", "sepal.width", "petal.length",
"petal.width", "species"]
first_col = iris_data['sepal.length']
first_col.head()
## 0 5.1
## 1 4.9
## 2 4.7
## 3 4.6
## 4 5.0
## Name: sepal.length, dtype: float64
first_two_cols = iris_data[["sepal.length", "sepal.width"]]
first_two_cols.head(3)
## sepal.length sepal.width
## 0 5.1 3.5
## 1 4.9 3.0
## 2 4.7 3.2
Notice that iris_data['sepal.length']
returns a Series
and iris_data[["sepal.length", "sepal.width"]]
returns a Pandas DataFrame
. This behavior is similar to what happened in R. For more details, click here.
You can select columns and rows by number with the .iloc
method. iloc
is (probably) short for “integer location.”
# specify rows/cols by number
top_left = iris_data.iloc[0,0]
top_left
## 5.1
first_three_rows_without_last_col = iris_data.iloc[:3,:-1]
first_three_rows_without_last_col
## sepal.length sepal.width petal.length petal.width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
Selecting columns by anything besides integer number can be done with the .loc()
method. You should generally prefer this method to access columns because accessing things by name instead of number is more readable. Here are some examples.
sepal_w_to_pedal_w = iris_data.loc[:,'sepal.width':'petal.width']
sepal_w_to_pedal_w.head()
## sepal.width petal.length petal.width
## 0 3.5 1.4 0.2
## 1 3.0 1.4 0.2
## 2 3.2 1.3 0.2
## 3 3.1 1.5 0.2
## 4 3.6 1.4 0.2
setosa_only = iris_data.loc[iris_data['species'] == "Iris-setosa",]
# don't need the redundant column anymore
del setosa_only['species']
setosa_only.head(3)
## sepal.length sepal.width petal.length petal.width
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
Notice we used a slice
(i.e. 'sepal.width':'pedal.width'
) to access many columns by only referring to the left-most and the right-most. Unlike slicing with numbers, the right end is included. Also note that this does not work with the regular square bracket operator (i.e. iris_data['sepal.width':'pedal.width']
). The second example filters out the rows where the "species"
column elements are equal to "Iris-setosa"
.
Each DataFrame
in Pandas comes with an .index
attribute. This is analogous to a row name in R, but it’s much more flexible because the index can take on a variety of types. This can help us highlight the difference between .loc
and .iloc
. Recall that .loc
was label-based selection. Labels don’t necessarily have to be strings. Consider the following example.
iris_data.index
# reverse the index
## RangeIndex(start=0, stop=150, step=1)
iris_data = iris_data.set_index(iris_data.index[::-1])
iris_data.iloc[-2:,:3] # top is now bottom
## sepal.length sepal.width petal.length
## 1 6.2 3.4 5.4
## 0 5.9 3.0 5.1
iris_data.loc[0] # last row has 0 index
## sepal.length 5.9
## sepal.width 3
## petal.length 5.1
## petal.width 1.8
## species Iris-virginica
## Name: 0, dtype: object
iris_data.iloc[0] # first row with big index
## sepal.length 5.1
## sepal.width 3.5
## petal.length 1.4
## petal.width 0.2
## species Iris-setosa
## Name: 149, dtype: object
iris_data.loc[0]
selects the 0
th index. The second line reversed the indexes, so this is actually the last row. If you want the first row, use iris_data.iloc[0]
.
Modifying data inside a data frame looks quite similar to extracting data. You’ll recognize a lot of the methods mentioned earlier.
import numpy as np
n_rows = iris_data.shape[0]
iris_data['col_ones'] = np.repeat(1.0, n_rows)
iris_data.iloc[:2,0] = np.random.normal(loc=999, size=2)
rand_nums = np.random.normal(loc=-999, size=n_rows)
iris_data.loc[:,'sepal.width'] = rand_nums
setosa_rows = iris_data['species'] == "Iris-setosa"
iris_data.loc[setosa_rows, 'species'] = "SETOSA!"
del iris_data['petal.length']
iris_data.head(3)
## sepal.length sepal.width petal.width species col_ones
## 149 999.146739 -998.446586 0.2 SETOSA! 1.0
## 148 998.005821 -999.015224 0.2 SETOSA! 1.0
## 147 4.700000 -999.803985 0.2 SETOSA! 1.0
You can also use the .assign()
method to create a new column. This method does not modify the data frame in place. It returns a new DataFrame
with the additional column.
iris_data = iris_data.assign(new_col_name = np.arange(n_rows))
del iris_data['sepal.length']
iris_data.head(3)
## sepal.width petal.width species col_ones new_col_name
## 149 -998.446586 0.2 SETOSA! 1.0 0
## 148 -999.015224 0.2 SETOSA! 1.0 1
## 147 -999.803985 0.2 SETOSA! 1.0 2
Above we were assigning Numpy arrays to columns of a DataFrame
. Be careful when you’re assigning Series
objects. You’ll see in the documentation that “Pandas aligns all AXES when setting Series and DataFrame
from .loc
, and .iloc
.”.
8.3 Exercises
8.3.1 R Questions
Consider the data set "original_rt_snippets.txt"
(Socher et al. 2013), which is hosted by (Dua and Graff 2017). We will calculate the term frequency-inverse document frequency statistics (Jones 1972) on this data set, which is a common data transformation technique used in text mining and natural language processing. You may use the stringr
library for this question, if you wish.
- read in this data set as a
vector
and call itcorpus
. - create a
vector
calleddictionary
that contains the following phrases: “charming”,“fantasy”, “hate”, and “boring”. - Construct a
data.frame
with four columns calledbagOfWords
that contains the number of appearances of each word in the dictionary. Match the exact phrases. For simplicity, don’t worry about the case of letters or using regular expressions (c.f. section 3.9). Label the columns of thisdata.frame
with the phrases you’re searching for. Try to write code that is easy to modify if you decide to change the set of phrases in your dictionary. - Create a
data.frame
with four columns calledtermFrequency
. Each element should correspond with the count from the previousdata.frame
. Instead of the count, each element should be \(\log(1 + \text{count})\). - Create a
vector
of length four calledinvDocFrequency
. The inverse document frequency formula for any term \(t\) is \(\log([\text{number of documents in corpus}])\) minus \(\log([\text{number of documents that contain term } t])\). Make sure the names of this vector are the same as the words in the dictionary. - Create a
data.frame
calledtfidf
(short for “term frequency-inverse document frequency”). For row/document \(d\), and column/term \(t\), the formula is the product: \([\text{term frequency of term } t \text{ and document } d ] \times [\text{inverse doc. freq. of term } t]\). - Extract the elements of
corpus
that have at least one nonzero element in the corresponding row oftfidf
. Call thevector
informativeDocuments
. - Do you see any documents that were labeled as informative, that do not actually contain the words you searched for?
mtcars
is a data set that is built into R, so you don’t need to read it in. You can read more about it by typing ?datasets::mtcars
.
- Create a new
data.frame
calledwithoutDisp
that is the same asmtcars
, but has itsdisp
column removed. - Create a new column for
withoutDisp
calledcoolnessScore
. The formula is \(\frac{1}{\text{mpg}} + \text{quarter mile time}\). - Create a new
data.frame
calledsortedWD
that is equal towithoutDisp
, but sorted in descending order by the coolness score. - Create a new
data.frame
fromsortedWD
calledspecialRows
that keeps only rows where \(\text{weight (1000lbs)} + \text{rear axle ratio} < 7\) - Calculate the percent reduction in number of rows, going from
sortedWD
tospecialRows
. Call itpercReduction
. Make sure it is between \(0\) and \(100\).
This question investigates the Zillow Home Value Index (ZHVI) for single family homes.
- read in
"Zip_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv"
. Call thedata.frame
homeData
. Remember to be careful with file paths. Also, when looking at the data set with a text editor, be sure that “word wrap” is not turned on. - Extract the rows of
homeData
that correspond with Charlottesville, VA, and assign them as adata.frame
to the variablecvilleData
- Assign all the unique zip codes to a
character vector
calledcvilleZips
. - Extract the columns of
cvilleData
that correspond with home prices, and transpose them so that each row of the result corresponds with a different month. Call this newdata.frame
timeSeriesData
. Also, make sure that the column names of this newdata.frame
are set equal to the appropriate zip codes. - Write a function called
getAveMonthlyChange
that takes in a numericvector
, and returns the average change. Your function should not return anNA
, so be sure to deal withNA
s appropriately. - Calculate the average monthly price change for each zip code. Store your results in a
vector
calledaveMonthlyChanges
. Make sure thisvector
has named elements so one can extract elements by zip code.
8.3.2 Python Questions
This question deals with looking at historical prices of the S&P500 Index. This data was downloaded from https://finance.yahoo.com (“GSPC Data” 2021). It contains prices starting from “2007-01-03” and going until “2021-10-01”.
- Read in the data file
"gspc.csv"
as adata.frame
and call the variablegspc
. - Use
.set_index()
to change the index ofgspc
to its"Index"
column. Store the newDataFrame
asgspc_good_idx
. - Recall the formula for log returns provided in exercises of chapter 3. Add a column to
gspc_good_idx
calledlog_returns
. Calculate them from the columnGSPC.Adjusted
. Make sure to scale them by \(100\) and put annp.nan
in the first element where you don’t have a return. - Extract all the returns that are available for the year 2021, and store them as a
Series
calledthis_year_returns
. - Add a column to
gspc_good_idx
containing the drawdown time series. Call this columndrawdown
. To calculate drawdown at a given date, subtract that date’s price from the running maximum price that is current at that date. Use only adjusted close prices for this calculation.
- Add a column to
gspc_good_idx
containing the percent drawdown time series. Call this columnperc_drawdown
. Use the previous column, but make this number a percent of the corresponding running maximum. - What is the maximum drawdown of this time series? Store it as a percent in the value
mdd
.
In this question we’ll look at some data on radon measurements (Gelman and Hill 2007). Instead of reading in a text file, we will load the data into Python using the tensorflow_datasets
module (“TensorFlow Datasets, a Collection of Ready-to-Use Datasets” 2021).
Please include the following code in your submission.
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np
d = tfds.load("radon")
d = pd.DataFrame(tfds.as_dataframe(d['train']))
Many of you will need to install tensorflow
and tensorflow_datasets
before you’re able to import
it. If that’s so, please read section 10.2 for more information on how to install packages.
- Assign to
worst_row
the row ofd
that is associated the highest recorded level of radon. Make sure it aDataFrame
. - Assign to
nrows
andncols
the number of rows and columns ofd
, respectively. - Assign the most common column data type to
most_common_dtype
. Make sure the variable is of typenumpy.dtype
- Are there any observations from Virginia in this data set? If so, assign
True
toany_va
. Otherwise assignFalse
. - Create a new column in
d
calleddist_from_cville
. Use the Haversine formula to calculate distance between each row and the University of Virginia, in kilometers.- Assume the University of Virginia is at 38.0336\(^\circ\) N, 78.5080\(^\circ\)W
- Assume the Earth’s radius \(r = 6378.137\) kilometers.
- The formula for distance between \((\lambda_1, \phi_1)\) (signed longitude in radians, signed latitude in radians) and \((\lambda_2, \phi_2)\) is
- What is the average radon measurement between all measurements taken at the place that is closest to where we are now? Assign your answer as a
float
toclose_ave
References
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Fisher, Test, R.A. & Creator. 1988. “Iris.” UCI Machine Learning Repository.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.
“GSPC Data.” 2021. https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.
Jones, Karen Spärck. 1972. “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” Journal of Documentation 28: 11–21.
McKinney, Wes. 2017. Python for Data Analysis: Data Wrangling with Pandas, Numpy, and Ipython. 2nd ed. O’Reilly Media, Inc.
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. 2013. “Parsing with Compositional Vector Grammars.” In EMNLP.
“TensorFlow Datasets, a Collection of Ready-to-Use Datasets.” 2021. https://www.tensorflow.org/datasets.