Chapter 7 Categorical Data
While statisticians may describe data as being either categorical or numerical, this classification is different than classifying data by its type in a program. So, strictly speaking, if you have categorical data, you are not obligated to use any particular type to represent it in your script.
However, there are types that are specifically designed to be used with categorical data, and so they are especially advantageous to use if you end up with the opportunity. We describe a few of them here in this chapter.
7.1 factor
s in R
Categorical data in R is often stored in a factor
variable. factor
s are more special than vector
s of integers because
- they have a
levels
attribute, which is comprised of all the possible values that each response could be; - they may or may not be ordered, which will also control how they are used in mathematical functions;
- they might have a
contrasts
attribute, which will control how they are used in statistical modeling functions.
Here is a first example. Say we asked three people what their favorite season was. The data might look something like this.
allSeasons <- c("spring", "summer", "autumn", "winter")
responses <- factor(c("autumn", "summer", "summer"),
levels = allSeasons)
levels(responses)
## [1] "spring" "summer" "autumn" "winter"
is.factor(responses)
## [1] TRUE
is.ordered(responses)
## [1] FALSE
#contrasts(responses)
# ^ controls how factor is used in different functions
factor
s always have levels, which is the collection of all possible unique values each observation can take.
You should be careful if you are not specifying them directly. What happens when you use the default option and replace the second assignment in the above code with responses <- factor(c("autumn", "summer", "summer"))
? The documentation of factor()
will tell you that, by default, factor()
will just take the unique values found in the data. In this case, nobody prefers winter or spring, and so neither will show up in levels(responses)
. This may or may not be what you want.
factor
s can be ordered or unordered. Ordered factor
s are for ordinal data. Ordinal data is a particular type of categorical data that recognizes the categories have a natural order (e.g. low/ medium/high and not red/green/blue).
As another example, say we asked ten people how much they liked statistical computing, and they could only respond “love it”, “it’s okay” or “hate it”. The data might look something like this.
ordFeelOptions <- c("hate it", "it's okay", "love it")
responses <- factor(c("love it", "it's okay", "love it",
"love it", "it's okay", "love it",
"love it", "love it", "it's okay",
"it's okay"),
levels = ordFeelOptions,
ordered = TRUE)
levels(responses)
## [1] "hate it" "it's okay" "love it"
is.factor(responses)
## [1] TRUE
is.ordered(responses)
## [1] TRUE
# contrasts(responses)
When creating ordered factors with factor()
, be mindful that the levels=
argument is assumed to be ordered when you plug it into factor()
. In the above example, if you specified levels = c("love it", "it's okay", "hate it")
, then the factor would assume love it < it's okay < hate it
, which may or may not be what you want.
Last, factor
s may or may not have a contrast
attribute. You can get or set this with the contrasts()
function. This will influence some of the functions you use on your data that estimate statistical models.
I will not discuss specifics of contrasts in this text, but the overall motivation is important. In short, the primary reason for using factor
s is that they are designed to allow control over how you model categorical data. To be more specific, changing attributes of a factor
could control the paremeterization of a model you’re estimating. If you’re using a particular function for modeling with categorical data, you need to know how it treats factors. On the other hand, if you’re writing a function that performs modeling of categorical data, you should know how to treat factors.
Here are two examples that you might come across in your studies.
Consider using
factor
s as inputs to a function that performs linear regression. With linear regression models, if you have categorical inputs, there are many choices for how to write down a model. In each model, the collection of parameters will mean different things. In R, you might pick the model by creating thefactor
in a specific way.Suppose you are interested in estimating a classification model. In this case, the dependent variable is categorical, not the independent variable. With these types of models, choosing whether or not your
factor
is ordered is critical. These options would estimate completely different models, so choose wisely!
The mathematical details of these examples is outside of the scope of this text. If you have not learned about dummy variables in a regression course, or if you have not considered the difference between multinomial logistic regression and ordinal logistic regression, or if you have but you’re just a little rusty, that is totally fine. I only mention these as examples for how the factor
type can trigger special behavior.
In addition to creating one with factor()
, there are two other common ways that you can end up with factors
:
- creating factors from numerical data, and
- when reading in an external data file, one of the columns is coerced to a
factor
.
Here is an example of (1). We can take non-categorical data, and cut()
it into something categorical.
stockReturns <- rnorm(6) # not categorical here
typeOfDay <- cut(stockReturns, breaks = c(-Inf, 0, Inf))
typeOfDay
## [1] (-Inf,0] (0, Inf] (-Inf,0] (-Inf,0] (0, Inf] (-Inf,0]
## Levels: (-Inf,0] (0, Inf]
levels(typeOfDay)
## [1] "(-Inf,0]" "(0, Inf]"
is.factor(typeOfDay)
## [1] TRUE
is.ordered(typeOfDay)
## [1] FALSE
Finally, be mindful of how different functions read in external data sets. When reading in an external file, if a particular function comes across a column that has characters in it, it will need to decide whether to store that column as a character vector, or as a factor
. For example, read.csv()
and read.table()
have a stringsAsFactors=
argument that you should be mindful of.
7.2 Two Options for Categorical Data in Pandas
Pandas provides two options for storing categorical data. They are both very similar to R’s factor
s. You may use either
- a Pandas
Series
with a specialdtype
, or - a Pandas
Categorical
container.
Pandas’ Series
were discussed earlier in sections 3.2 and 3.4. These were containers that forced every element to share the same dtype
. Here, we specify dtype="category"
in pd.Series()
.
import pandas as pd
szn_s = pd.Series(["autumn", "summer", "summer"], dtype = "category")
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s.cat.ordered
## False
szn_s.dtype
## CategoricalDtype(categories=['autumn', 'summer'], ordered=False)
type(szn_s)
## <class 'pandas.core.series.Series'>
The second option is to use Pandas’ Categorical
containers. They are quite similar, so the choice is subtle. Like Series
containers, they also force all of their elements to share the same shared dtype
.
szn_c = pd.Categorical(["autumn", "summer", "summer"])
szn_c.categories
## Index(['autumn', 'summer'], dtype='object')
szn_c.ordered
## False
szn_c.dtype
## CategoricalDtype(categories=['autumn', 'summer'], ordered=False)
type(szn_c)
## <class 'pandas.core.arrays.categorical.Categorical'>
You might have noticed that, with the Categorical
container, methods and data members were not accessed through the .cat
accessor. It is also more similar to R’s factor
s because you can specify more arguments in the constructor.
all_szns = ["spring","summer", "autumn", "winter"]
szn_c2 = pd.Categorical(["autumn", "summer", "summer"],
categories = all_szns,
ordered = False)
In Pandas, just like in R, you need to be very careful about what the categories
(c.f levels
) are. If you are using ordinal data, they need to be specified in the correct order. If you are using small data sets, be cognizant of whether all the categories show up in the data–otherwise they will not be correctly inferred.
With Pandas’ Series
it’s more difficult to specify a nondefault dtype
. One option is to change them after the object has been created.
szn_s = szn_s.cat.set_categories(
["autumn", "summer","spring","winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'spring', 'winter'], dtype='object')
szn_s = szn_s.cat.remove_categories(['spring','winter'])
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s = szn_s.cat.add_categories(["fall", "winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'fall', 'winter'], dtype='object')
Another option is to create the dtype
before you create the Series
, and pass it into pd.Series()
.
cat_type = pd.CategoricalDtype(
categories=["autumn", "summer", "spring", "winter"],
ordered=True)
responses = pd.Series(
["autumn", "summer", "summer"],
dtype = cat_type)
responses
## 0 autumn
## 1 summer
## 2 summer
## dtype: category
## Categories (4, object): ['autumn' < 'summer' < 'spring' < 'winter']
Just like in R, you can convert numerical data into categorical. The function even has the same name as in R: pd.cut()
. Depending on the type of the input, it will return either a Series
or a Categorical
.
import numpy as np
stock_returns = np.random.normal(size=10) # not categorical
# array input means Categorical output
type_of_day = pd.cut(stock_returns,
bins = [-np.inf, 0, np.inf],
labels = ['bad day', 'good day'])
type(type_of_day)
# Series in means Series out
## <class 'pandas.core.arrays.categorical.Categorical'>
type_of_day2 = pd.cut(pd.Series(stock_returns),
bins = [-np.inf, 0, np.inf],
labels = ['bad day', 'good day'])
type(type_of_day2)
## <class 'pandas.core.series.Series'>
Finally, when reading in data from an external source, choose carefully whether you want character data to be stored as a string type, or as a categorical type. Here we use pd.read_csv()
to read in Fisher’s Iris data set (Fisher 1988) hosted by (Dua and Graff 2017). More information on Pandas’ DataFrames
can be found in the next chapter.
import numpy as np
# make 5th col categorical
my_data = pd.read_csv("data/iris.csv", header=None,
dtype = {4:"category"})
my_data.head(1)
## 0 1 2 3 4
## 0 5.1 3.5 1.4 0.2 Iris-setosa
my_data.dtypes
## 0 float64
## 1 float64
## 2 float64
## 3 float64
## 4 category
## dtype: object
np.unique(my_data[4]).tolist()
## ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
7.3 Exercises
7.3.1 R Questions
Read in this chess data set (“Chess (King-Rook vs. King-Pawn)” 1989), hosted by (Dua and Graff 2017), with the following code. You will probably have to change your working directory, but if you do, make sure to comment out that code before you submit your script to me.
- Are all of the columns
factor
s? AssignTRUE
orFALSE
toallFactors
. - Should any of these
factor
s be ordered? AssignTRUE
orFALSE
toideallyOrdered
. Hint: read the data set description from https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29. - Are any of these factors currently ordered? Assign
TRUE
orFALSE
tocurrentlyOrdered
. - What percent (between \(0\) and \(100\)) of the time is the first column equal to
'f'
? Assign your answer topercentF
.
Suppose you have the following vector
. Please make sure to include this code in your script.
- create a
factor
fromnormSamps
. Map each element to"within 1 sd"
or"outside 1 sd"
depending on whether the element is within \(1\) theoretical standard deviation of \(0\) or not. Call thefactor
withinOrNot
.
7.3.2 Python Questions
Consider the following simulated letter grade data for two students:
import pandas as pd
import numpy as np
poss_grades = ['A+','A','A-','B+','B','B-',
'C+','C','C-','D+','D','D-',
'F']
grade_values = {'A+':4.0,'A':4.0,'A-':3.7,'B+':3.3,'B':3.0,'B-':2.7,
'C+':2.3,'C':2.0,'C-':1.7,'D+':1.3,'D':1.0,'D-':.67,
'F':0.0}
student1 = np.random.choice(poss_grades, size = 10, replace = True)
student2 = np.random.choice(poss_grades, size = 12, replace = True)
- Convert the two Numpy arrays to one of the Pandas types for categorical data that the textbook discussed. Call these two variables
s1
ands2
. - These data are categorical. Are they ordinal? Make sure to adjust
s1
ands2
accordingly. - Calculate the two student GPAs. Assign the floating point numbers to variables named
s1_gpa
ands2_gpa
. Usegrade_values
to convert each letter grade to a number, and then average all the numbers for each student together using equal weights. - Is each category equally-spaced? If yes, then these are said to be interval data. Does your answer to this question affect the legitimacy of averaging together any ordinal data? Assign a
str
response to the variableave_ord_data_response
. Hint: consider (any) two different data sets that happen to produce the same GPA. Is the equality of these two GPAs misleading? - Compute the mode grade for each student. Assign your answers as
str
s to the variabless1_mode
ands2_mode
. If there are more than one modes, then assign the one that comes first alphabetically.
Suppose you are creating a classifier whose job it is to predict labels. Consider the following DataFrame
of predicted labels next to their corresponding actual labels. Please make sure to include this code in your script.
import pandas as pd
import numpy as np
d = pd.DataFrame({'predicted label' : [1,2,2,1,2,2,1,2,3,2,2,3],
'actual label': [1,2,3,1,2,3,1,2,3,1,2,3]},
dtype='category')
d.dtypes[0]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)
d.dtypes[1]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)
- Assign the prediction accuracy, as a percent (between \(0\) and \(100\)), to the variable
perc_acc
. - Create a confusion matrix to better assess which labels your classifier has a difficult time with. This should be a \(3 \times 3\) Numpy
ndarray
of percentages. The row will correspond to the predicted label, the column will correspond to the actual label, and number in location \((0,2)\), say, will be the percent of observations where your model predicted label1
and the actual was a label3
. Call the variableconfusion
.
References
“Chess (King-Rook vs. King-Pawn).” 1989. UCI Machine Learning Repository.
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Fisher, Test, R.A. & Creator. 1988. “Iris.” UCI Machine Learning Repository.