Chapter 7 Categorical Data

While statisticians may describe data as being either categorical or numerical, this classification is different from classifying data by its type in a program. So, strictly speaking, if you have categorical data, you are not obligated to use any particular type to represent it in your script.

However, some types are specifically designed for categorical data, and they are advantageous to use whenever you have the opportunity. We describe a few of them in this chapter.

7.1 `factor`s in R

Categorical data in R is often stored in a factor variable. factors are more special than vectors of integers because

they have a levels attribute, which consists of all the possible values that each response could take;
they may or may not be ordered, which will also control how they are used in mathematical functions;
they might have a contrasts attribute, which will control how they are used in statistical modeling functions.

Here is a first example. Say we asked three people what their favorite season was. The data might look something like this.

allSeasons <- c("spring", "summer", "autumn", "winter")
responses <- factor(c("autumn", "summer", "summer"), 
                    levels = allSeasons)
levels(responses)
## [1] "spring" "summer" "autumn" "winter"
is.factor(responses)
## [1] TRUE
is.ordered(responses)
## [1] FALSE
# contrasts(responses)
# ^ controls how the factor is used in different functions

factors always have levels, which is the collection of all possible unique values each observation can take.

You should be careful if you are not specifying them directly. What happens when you use the default option and replace the second assignment in the above code with responses <- factor(c("autumn", "summer", "summer"))? The documentation of factor() will tell you that, by default, factor() will just take the unique values found in the data. In this case, nobody prefers winter or spring, and so neither will show up in levels(responses). This may or may not be what you want.

factors can be ordered or unordered. Ordered factors are for ordinal data. Ordinal data is a particular type of categorical data in which the categories have a natural order (e.g. low/medium/high, as opposed to red/green/blue).

As another example, say we asked ten people how much they liked statistical computing, and they could only respond “love it”, “it’s okay” or “hate it”. The data might look something like this.

ordFeelOptions <- c("hate it", "it's okay", "love it")
responses <- factor(c("love it", "it's okay", "love it", 
                      "love it", "it's okay", "love it", 
                      "love it", "love it", "it's okay", 
                      "it's okay"), 
                    levels = ordFeelOptions,
                    ordered = TRUE)
levels(responses)
## [1] "hate it"   "it's okay" "love it"
is.factor(responses)
## [1] TRUE
is.ordered(responses)
## [1] TRUE
# contrasts(responses)

When creating ordered factors with factor(), be mindful that factor() takes the order of the levels= argument as the order of the categories, from lowest to highest. In the above example, if you specified levels = c("love it", "it's okay", "hate it"), then the factor would assume love it < it's okay < hate it, which may or may not be what you want.

Last, factors may or may not have a contrasts attribute. You can get or set this with the contrasts() function. This will influence some of the functions you use on your data that estimate statistical models.

I will not discuss specifics of contrasts in this text, but the overall motivation is important. In short, the primary reason for using factors is that they are designed to allow control over how you model categorical data. To be more specific, changing attributes of a factor could control the parameterization of a model you’re estimating. If you’re using a particular function for modeling with categorical data, you need to know how it treats factors. On the other hand, if you’re writing a function that performs modeling of categorical data, you should know how to treat factors.

Here are two examples that you might come across in your studies.

Consider using factors as inputs to a function that performs linear regression. With linear regression models, if you have categorical inputs, there are many choices for how to write down a model. In each model, the collection of parameters will mean different things. In R, you might pick the model by creating the factor in a specific way.
Suppose you are interested in estimating a classification model. In this case, the dependent variable is categorical, not the independent variable. With these types of models, choosing whether or not your factor is ordered is critical. These options would estimate completely different models, so choose wisely!

The mathematical details of these examples are outside the scope of this text. If you have not learned about dummy variables in a regression course, or if you have not considered the difference between multinomial logistic regression and ordinal logistic regression, or if you have but you’re just a little rusty, that is totally fine. I only mention these as examples of how the factor type can trigger special behavior.

In addition to creating one with factor(), there are two other common ways that you can end up with factors:

1. creating factors from numerical data, and
2. reading in an external data file in which one of the columns is coerced to a factor.

Here is an example of (1). We can take non-categorical data, and cut() it into something categorical.

stockReturns <- rnorm(6) # not categorical here
typeOfDay <- cut(stockReturns, breaks = c(-Inf, 0, Inf)) 
typeOfDay
## [1] (-Inf,0] (0, Inf] (-Inf,0] (-Inf,0] (0, Inf] (-Inf,0]
## Levels: (-Inf,0] (0, Inf]
levels(typeOfDay)
## [1] "(-Inf,0]" "(0, Inf]"
is.factor(typeOfDay)
## [1] TRUE
is.ordered(typeOfDay)
## [1] FALSE

Finally, pay attention to how different functions read in external data sets. When a function reading an external file comes across a column that has characters in it, it must decide whether to store that column as a character vector or as a factor. For example, read.csv() and read.table() have a stringsAsFactors= argument that controls this choice. Its default changed from TRUE to FALSE in R 4.0.0, so the same script can behave differently across R versions unless you set the argument explicitly.

7.2 Two Options for Categorical Data in Pandas

Pandas provides two options for storing categorical data. They are both very similar to R’s factors. You may use either

a Pandas Series with a special dtype, or
a Pandas Categorical container.

Pandas’ Series were discussed earlier in sections 3.2 and 3.4. These were containers that forced every element to share the same dtype. Here, we specify dtype="category" in pd.Series().

import pandas as pd
szn_s = pd.Series(["autumn", "summer", "summer"], dtype = "category") 
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s.cat.ordered
## False
szn_s.dtype
## CategoricalDtype(categories=['autumn', 'summer'], ordered=False)
type(szn_s)
## <class 'pandas.core.series.Series'>

The second option is to use Pandas’ Categorical containers. The two options are quite similar, so the choice between them is subtle. Like Series containers, Categorical containers force all of their elements to share the same dtype.

szn_c = pd.Categorical(["autumn", "summer", "summer"])
szn_c.categories
## Index(['autumn', 'summer'], dtype='object')
szn_c.ordered
## False
szn_c.dtype
## CategoricalDtype(categories=['autumn', 'summer'], ordered=False)
type(szn_c)
## <class 'pandas.core.arrays.categorical.Categorical'>

You might have noticed that, with the Categorical container, methods and data members were not accessed through the .cat accessor. It is also more similar to R’s factors because you can specify more arguments in the constructor.

all_szns = ["spring","summer", "autumn", "winter"]
szn_c2 = pd.Categorical(["autumn", "summer", "summer"], 
                        categories = all_szns, 
                        ordered = False)
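
Both containers store each observation as a small integer code that points into the list of categories, which is the same idea behind R’s factors. The following is a minimal sketch of how to inspect those codes. The variable names ending in _codes are introduced here only for illustration.

```python
import pandas as pd

all_szns = ["spring", "summer", "autumn", "winter"]
szn_c = pd.Categorical(["autumn", "summer", "summer"])
szn_c2 = pd.Categorical(["autumn", "summer", "summer"],
                        categories=all_szns, ordered=False)
szn_s = pd.Series(["autumn", "summer", "summer"], dtype="category")

# each code is the position of the observation's value within .categories
inferred_codes = szn_c.codes.tolist()    # categories were inferred from the data
explicit_codes = szn_c2.codes.tolist()   # categories were given as all_szns
series_codes = szn_s.cat.codes.tolist()  # a Series goes through the .cat accessor
print(inferred_codes, explicit_codes, series_codes)
```

The same three responses receive different codes depending on which categories are in play, so code that relies on the integer codes should always state the categories explicitly.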

In Pandas, just like in R, you need to be very careful about what the categories (cf. levels) are. If you are using ordinal data, they need to be specified in the correct order. If you are using small data sets, be cognizant of whether all the categories show up in the data; otherwise, they will not be correctly inferred.
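
As a rough illustration of the point about inferred categories, the sketch below compares letting Pandas infer the categories with stating them up front. The names answers, inferred, explicit, counts and typo are made up for this example.

```python
import pandas as pd

all_szns = ["spring", "summer", "autumn", "winter"]
answers = ["autumn", "summer", "summer"]

inferred = pd.Categorical(answers)                       # categories come from the data
explicit = pd.Categorical(answers, categories=all_szns)  # categories stated up front

print(list(inferred.categories))  # only the seasons somebody picked
print(list(explicit.categories))  # all four seasons, in the order given

# counts are reported for every category, including those nobody picked
counts = explicit.value_counts()
print(counts)

# a value that is not among the categories is stored as missing
typo = pd.Categorical(["autum", "summer"], categories=all_szns)
print(typo.isna().tolist())
```

Stating the categories explicitly has a side benefit: misspelled responses turn into missing values that you can detect, rather than silently becoming a new category.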

With Pandas’ Series it’s more difficult to specify a nondefault dtype. One option is to change the categories after the object has been created.

szn_s = szn_s.cat.set_categories(
  ["autumn", "summer","spring","winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'spring', 'winter'], dtype='object')
szn_s = szn_s.cat.remove_categories(['spring','winter'])
szn_s.cat.categories
## Index(['autumn', 'summer'], dtype='object')
szn_s = szn_s.cat.add_categories(["fall", "winter"])
szn_s.cat.categories
## Index(['autumn', 'summer', 'fall', 'winter'], dtype='object')

Another option is to create the dtype before you create the Series, and pass it into pd.Series().

cat_type = pd.CategoricalDtype(
  categories=["autumn", "summer", "spring", "winter"],
  ordered=True)
responses = pd.Series(
  ["autumn", "summer", "summer"], 
  dtype = cat_type)
responses
## 0    autumn
## 1    summer
## 2    summer
## dtype: category
## Categories (4, object): ['autumn' < 'summer' < 'spring' < 'winter']
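
To see what ordered=True buys you, here is a small sketch that reuses the “hate it”/“it’s okay”/“love it” scale from the R section. The names feel_type, feelings, unordered and the other variables are assumptions made for this example only.

```python
import pandas as pd

feel_type = pd.CategoricalDtype(
    categories=["hate it", "it's okay", "love it"], ordered=True)
feelings = pd.Series(["love it", "it's okay", "hate it", "love it"],
                     dtype=feel_type)

# ordered categoricals support min/max, comparisons, and sorting,
# all of which follow the category order rather than alphabetical order
lowest, highest = feelings.min(), feelings.max()
at_least_okay = (feelings >= "it's okay").tolist()
sorted_feelings = feelings.sort_values().tolist()
print(lowest, highest, at_least_okay, sorted_feelings)

# the same comparison on an unordered categorical is rejected
unordered = pd.Series(["love it", "hate it"], dtype="category")
try:
    unordered >= "hate it"
    comparison_allowed = True
except TypeError:
    comparison_allowed = False
print(comparison_allowed)
```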

Just like in R, you can convert numerical data into categorical. The function even has the same name as in R: pd.cut(). Depending on the type of the input, it will return either a Series or a Categorical.

import numpy as np
stock_returns = np.random.normal(size=10) # not categorical 
# array input means Categorical output
type_of_day = pd.cut(stock_returns, 
                      bins = [-np.inf, 0, np.inf], 
                      labels = ['bad day', 'good day']) 
type(type_of_day)
## <class 'pandas.core.arrays.categorical.Categorical'>
# Series in means Series out
type_of_day2 = pd.cut(pd.Series(stock_returns), 
                      bins = [-np.inf, 0, np.inf], 
                      labels = ['bad day', 'good day']) 
type(type_of_day2)
## <class 'pandas.core.series.Series'>
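
One detail worth checking with pd.cut() is which bin a value lands in when it sits exactly on a bin edge. By default the bins are closed on the right, which matches the (-Inf,0] intervals printed by R’s cut() earlier. The sketch below uses a few hand-picked returns instead of random ones so that the result is predictable. The variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

returns = [-1.0, 0.0, 2.5]  # hand-picked so that one value sits on the edge
edges = [-np.inf, 0, np.inf]
day_labels = ["bad day", "good day"]

# default: bins are (-inf, 0] and (0, inf], so 0.0 counts as a bad day
right_closed = list(pd.cut(returns, bins=edges, labels=day_labels))

# right=False: bins are [-inf, 0) and [0, inf), so 0.0 counts as a good day
left_closed = list(pd.cut(returns, bins=edges, labels=day_labels, right=False))

print(right_closed)
print(left_closed)
```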

Finally, when reading in data from an external source, choose carefully whether you want character data to be stored as a string type, or as a categorical type. Here we use pd.read_csv() to read in Fisher’s Iris data set (Fisher 1988) hosted by (Dua and Graff 2017). More information on Pandas’ DataFrames can be found in the next chapter.

import numpy as np
# make 5th col categorical
my_data = pd.read_csv("data/iris.csv", header=None, 
                        dtype = {4:"category"}) 
my_data.head(1)
##      0    1    2    3            4
## 0  5.1  3.5  1.4  0.2  Iris-setosa
my_data.dtypes
## 0     float64
## 1     float64
## 2     float64
## 3     float64
## 4    category
## dtype: object
np.unique(my_data[4]).tolist()
## ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
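
If a column has already been read in as strings, it can still be converted afterwards with .astype("category"). The following sketch compares the two routes. To stay self-contained it reads from a short in-memory string (csv_text, an assumption that stands in for the real file) rather than from data/iris.csv.

```python
import io
import pandas as pd

# three rows typed inline as a stand-in for the real file
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# route 1: ask for a categorical column while reading
at_read = pd.read_csv(io.StringIO(csv_text), header=None,
                      dtype={4: "category"})

# route 2: read the column as strings, then convert it
after_read = pd.read_csv(io.StringIO(csv_text), header=None)
after_read[4] = after_read[4].astype("category")

print(at_read.dtypes)
print(after_read[4].cat.categories.tolist())
```

Either route ends with a categorical column. Converting at read time simply avoids holding an intermediate column of strings.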

7.3 Exercises

7.3.1 R Questions

Read in this chess data set (“Chess (King-Rook vs. King-Pawn)” 1989), hosted by (Dua and Graff 2017), with the following code. You will probably have to change your working directory, but if you do, make sure to comment out that code before you submit your script to me.

d <- read.csv("kr-vs-kp.data", header=FALSE, stringsAsFactors = TRUE)
head(d)

Are all of the columns factors? Assign TRUE or FALSE to allFactors.
Should any of these factors be ordered? Assign TRUE or FALSE to ideallyOrdered. Hint: read the data set description from https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King-Pawn%29.
Are any of these factors currently ordered? Assign TRUE or FALSE to currentlyOrdered.
What percent (between $0$ and $100$ ) of the time is the first column equal to 'f'? Assign your answer to percentF.

Suppose you have the following vector. Please make sure to include this code in your script.

normSamps <- rnorm(100)

Create a factor from normSamps. Map each element to "within 1 sd" or "outside 1 sd" depending on whether the element is within $1$ theoretical standard deviation of $0$ or not. Call the factor withinOrNot.

7.3.2 Python Questions

Consider the following simulated letter grade data for two students:

import pandas as pd
import numpy as np
poss_grades = ['A+','A','A-','B+','B','B-',
               'C+','C','C-','D+','D','D-',
               'F']
grade_values = {'A+':4.0,'A':4.0,'A-':3.7,'B+':3.3,'B':3.0,'B-':2.7,
                'C+':2.3,'C':2.0,'C-':1.7,'D+':1.3,'D':1.0,'D-':.67,
                'F':0.0}
student1 = np.random.choice(poss_grades, size = 10, replace = True)
student2 = np.random.choice(poss_grades, size = 12, replace = True)

Convert the two Numpy arrays to one of the Pandas types for categorical data that the textbook discussed. Call these two variables s1 and s2.
These data are categorical. Are they ordinal? Make sure to adjust s1 and s2 accordingly.
Calculate the two student GPAs. Assign the floating point numbers to variables named s1_gpa and s2_gpa. Use grade_values to convert each letter grade to a number, and then average all the numbers for each student together using equal weights.
Is each category equally-spaced? If yes, then these are said to be interval data. Does your answer to this question affect the legitimacy of averaging together any ordinal data? Assign a str response to the variable ave_ord_data_response. Hint: consider (any) two different data sets that happen to produce the same GPA. Is the equality of these two GPAs misleading?
Compute the mode grade for each student. Assign your answers as strs to the variables s1_mode and s2_mode. If there is more than one mode, then assign the one that comes first alphabetically.

Suppose you are creating a classifier whose job it is to predict labels. Consider the following DataFrame of predicted labels next to their corresponding actual labels. Please make sure to include this code in your script.

import pandas as pd
import numpy as np
d = pd.DataFrame({'predicted label' : [1,2,2,1,2,2,1,2,3,2,2,3], 
                  'actual label': [1,2,3,1,2,3,1,2,3,1,2,3]}, 
                  dtype='category')
d.dtypes.iloc[0]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)
d.dtypes.iloc[1]
## CategoricalDtype(categories=[1, 2, 3], ordered=False)

Assign the prediction accuracy, as a percent (between $0$ and $100$ ), to the variable perc_acc.
Create a confusion matrix to better assess which labels your classifier has a difficult time with. This should be a $3 \times 3$ Numpy ndarray of percentages. The row will correspond to the predicted label, the column will correspond to the actual label, and the number in location $(0, 2)$, say, will be the percent of observations where your model predicted label 1 and the actual label was 3. Call the variable confusion.

References

“Chess (King-Rook vs. King-Pawn).” 1989. UCI Machine Learning Repository.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Fisher, R. A. 1988. “Iris.” UCI Machine Learning Repository.