Chapter 12 Reshaping and Combining Data Sets
12.1 Ordering and Sorting Data
Sorting a data set, in ascending order, say, is a common task. You might need to do it because
- ordering and ranking is commonly done in nonparametric statistics,
- you want to inspect the most “extreme” observations in a data set,
- it’s a pre-processing step before generating visualizations.
In R, it all starts with vector
s. There are two common functions you should know: sort()
and order()
. sort()
returns the sorted data, while order()
returns the order indexes.
sillyData <- rnorm(5)
print(sillyData)
## [1] 0.3903776 0.5796584 1.4929115 0.3704896 -1.3450719
sort(sillyData)
## [1] -1.3450719 0.3704896 0.3903776 0.5796584 1.4929115
order(sillyData)
## [1] 5 4 1 2 3
order()
is useful if you’re sorting a data frame by a particularly column. Below, we inspect the top 5 most expensive cars in an example data set (“SAS Viya Example Data Sets” 2021). Notice that we need to clean up the MSRP
(a character
vector) a little first. We use the function gsub()
to find patterns in the text, and replace them with the empty string.
carData <- read.csv("data/cars.csv")
noDollarSignMSRP <- gsub("$", "", carData$MSRP, fixed = TRUE)
carData$cleanMSRP <- as.numeric(gsub(",", "", noDollarSignMSRP,
fixed = TRUE))
rowIndices <- order(carData$cleanMSRP, decreasing = TRUE)[1:5]
carData[rowIndices,c("Make", "Model", "MSRP", "cleanMSRP")]
## Make Model MSRP cleanMSRP
## 335 Porsche 911 GT2 2dr $192,465 192465
## 263 Mercedes-Benz CL600 2dr $128,420 128420
## 272 Mercedes-Benz SL600 convertible 2dr $126,670 126670
## 271 Mercedes-Benz SL55 AMG 2dr $121,770 121770
## 262 Mercedes-Benz CL500 2dr $94,820 94820
In Python, Numpy has np.argsort()
and np.sort()
.
import numpy as np
silly_data = np.random.normal(size=5)
print(silly_data)
## [-0.52817175 -1.07296862 0.86540763 -2.3015387 1.74481176]
print( np.sort(silly_data) )
## [-2.3015387 -1.07296862 -0.52817175 0.86540763 1.74481176]
np.argsort(silly_data)
## array([3, 1, 0, 2, 4])
For Pandas’ DataFrame
s, most of the functions I find useful are methods attached to the DataFrame
class. That means that, as long as something is inside a DataFrame
, you can use dot notation.
import pandas as pd
car_data = pd.read_csv("data/cars.csv")
car_data['no_dlr_msrp'] = car_data['MSRP'].str.replace("$", "",
regex = False)
no_commas = car_data['no_dlr_msrp'].str.replace(",","")
car_data['clean_MSRP'] = no_commas.astype(float)
car_data = car_data.sort_values(by='clean_MSRP', ascending = False)
car_data[["Make", "Model", "MSRP", "clean_MSRP"]].head(5)
## Make Model MSRP clean_MSRP
## 334 Porsche 911 GT2 2dr $192,465 192465.0
## 262 Mercedes-Benz CL600 2dr $128,420 128420.0
## 271 Mercedes-Benz SL600 convertible 2dr $126,670 126670.0
## 270 Mercedes-Benz SL55 AMG 2dr $121,770 121770.0
## 261 Mercedes-Benz CL500 2dr $94,820 94820.0
Pandas’ DataFrame
s and Series
have a .replace()
method. We use this to remove dollar signs and commas from the MSRP column. Note that we had to access the .str
attribute of the Series
column before we used it. After the string was processed, we converted it to a Series
of float
s with the astype()
method.
Finally, sorting the overall data frame could have been done with the same approach as the code we used in R (i.e. raw subsetting by row indexes), but there is a built-in method called sort_values()
that will do it for us.
12.2 Stacking Data Sets and Placing Them Shoulder to Shoulder
Stacking data sets on top of each other is a common task. You might need to do it if
- you need to add a new row (or many rows) to a data frame,
- you need to recombine data sets (e.g. recombine a train/test split), or
- you’re creating a matrix in a step-by-step way.
In R, this can be done with rbind()
(short for “row bind”). Consider the following example that makes use of GIS data queried from (Albemarle County Geographic Data Services Office 2021) and cleaned with code from (Ford 2016).
realEstate <- read.csv("data/albemarle_real_estate.csv")
train <- realEstate[-1,]
test <- realEstate[1,]
str(rbind(test, train), strict.width = "cut")
## 'data.frame': 30381 obs. of 12 variables:
## $ YearBuilt : int 1769 1818 2004 2006 2004 1995 1900 1960 ..
## $ YearRemodeled: int 1988 1991 NA NA NA NA NA NA NA NA ...
## $ Condition : chr "Average" "Average" "Average" "Average" ..
## $ NumStories : num 1.7 2 1 1 1.5 2.3 2 1 1 1 ...
## $ FinSqFt : int 5216 5160 1512 2019 1950 2579 1530 800 9..
## $ Bedroom : int 4 6 3 3 3 3 4 2 2 2 ...
## $ FullBath : int 3 4 2 3 3 2 1 1 1 1 ...
## $ HalfBath : int 0 1 1 0 0 1 0 0 0 0 ...
## $ TotalRooms : int 8 11 9 10 8 8 6 4 4 4 ...
## $ LotSize : num 5.1 453.9 42.6 5 5.5 ...
## $ TotalValue : num 1096600 2978600 677800 453200 389200 ...
## $ City : chr "CROZET" "CROZET" "CROZET" "CROZET" ...
sum(rbind(test, train) != realEstate)
## [1] NA
The above example was with data.frame
s. This example of rbind()
is with matrix
objects.
rbind(matrix(1,nrow = 2, ncol = 3),
matrix(2,nrow = 2, ncol = 3))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 2 2 2
## [4,] 2 2 2
In Python, you can stack data frames with pd.concat()
. It has a lot of options, so feel free to peruse them. You can also replace the call to pd.concat()
below with test.append(train)
. Consider the example below that uses the Albemarle County real estate data (Albemarle County Geographic Data Services Office 2021) (Ford 2016).
import pandas as pd
real_estate = pd.read_csv("data/albemarle_real_estate.csv")
train = real_estate.iloc[1:,]
test = real_estate.iloc[[0],] # need the extra brackets!
stacked = pd.concat([test,train], axis=0)
stacked.iloc[:3,:3]
## YearBuilt YearRemodeled Condition
## 0 1769 1988.0 Average
## 1 1818 1991.0 Average
## 2 2004 NaN Average
(stacked != real_estate).sum().sum()
## 28251
Take note of the extra square brackets when we create test
. If you use real_estate.iloc[0,]
instead, it will return a Series
with all the elements coerced to the same type, and this won’t pd.concat()
properly with the rest of the data!
12.3 Merging or Joining Data Sets
If you have two different data sets that provide information about the same experimental units, you can put the two data sets together using a merge
(aka join
) operation. In R, you can use the merge()
function. In Python, you can use the .merge()
method.
Merging (or joining) data sets is not the same as placing them shoulder to shoulder. Placing data sets shoulder to shoulder will not reorder the rows of your data and the operation requires that both input data sets have the same number of rows to start off with. On the other hand, merging data takes care to match rows together in an intelligent way, and it can handle situations of missing or duplicate matches. In both cases, the resulting data set is wider, but with merging, the output might end contain either more or fewer rows.
Here’s a clarifying example. Suppose you have to sets of supposedly anonymized data about individual accounts on some online platforms.
# in R
baby1 <- read.csv("data/baby1.csv", stringsAsFactors = FALSE)
baby2 <- read.csv("data/baby2.csv", stringsAsFactors = FALSE)
head(baby1)
## idnum height.inches. email_address
## 1 1 74 fakeemail123@gmail.com
## 2 3 66 anotherfake@gmail.com
## 3 4 62 notreal@gmail.com
## 4 23 62 notreal@gmail.com
head(baby2)
## idnum phone email
## 1 3901283 5051234567 notreal@gmail.com
## 2 41823 5051234568 notrealeither@gmail.com
## 3 7198273 5051234568 anotherfake@gmail.com
The first thing you need to ask yourself is “which column is the unique identifier that is shared between these two data sets?” In our case, they both have an “identification number” column. However, these two data sets are coming from different online platforms, and these two places use different schemes to number their users.
In this case, it is better to merge on the email addresses. Users might be using different email addresses on these two platforms, but there’s a stronger guarantee that matched email addresses means that you’re matching the right accounts. The columns are named differently in each data set, so we must specify them by name.
# in R
merge(baby1, baby2, by.x = "email_address", by.y = "email")
## email_address idnum.x height.inches. idnum.y phone
## 1 anotherfake@gmail.com 3 66 7198273 5051234568
## 2 notreal@gmail.com 4 62 3901283 5051234567
## 3 notreal@gmail.com 23 62 3901283 5051234567
In Python, merge()
is a method attached to each DataFrame
instance.
# in Python
baby1.merge(baby2, left_on = "email_address", right_on = "email")
## idnum_x height(inches) ... phone email
## 0 3 66 ... 5051234568 anotherfake@gmail.com
## 1 4 62 ... 5051234567 notreal@gmail.com
## 2 23 62 ... 5051234567 notreal@gmail.com
##
## [3 rows x 6 columns]
The email addresses anotherfake@gmail.com
and notreal@gmail.com
exist in both data sets, so each of these email addresses will end up in the result data frame. The rows in the result data set are wider and have more attributes for each individual.
Notice the duplicate email address, too. In this case, either the user signed up for two accounts using the same email, or one person signed up for an account with another person’s email address. In the case of duplicates, both rows will match with the same rows in the other data frame.
Also, in this case, all email addresses that weren’t found in both data sets were thrown away. This does not necessarily need to be the intended behavior. For instance, if we wanted to make sure no rows were thrown away, that would be possible. In this case, though, for email addresses that weren’t found in both data sets, some information will be missing. Recall that Python and R handle missing data differently (see 3.8.2).
# in R
merge(baby1, baby2,
by.x = "email_address", by.y = "email",
all.x = TRUE, all.y = TRUE)
## email_address idnum.x height.inches. idnum.y phone
## 1 anotherfake@gmail.com 3 66 7198273 5051234568
## 2 fakeemail123@gmail.com 1 74 NA NA
## 3 notreal@gmail.com 4 62 3901283 5051234567
## 4 notreal@gmail.com 23 62 3901283 5051234567
## 5 notrealeither@gmail.com NA NA 41823 5051234568
# in Python
le_merge = baby1.merge(baby2,
left_on = "email_address", right_on = "email",
how = "outer")
le_merge.iloc[:5,3:]
## idnum_y phone email
## 0 NaN NaN NaN
## 1 7198273.0 5.051235e+09 anotherfake@gmail.com
## 2 3901283.0 5.051235e+09 notreal@gmail.com
## 3 3901283.0 5.051235e+09 notreal@gmail.com
## 4 41823.0 5.051235e+09 notrealeither@gmail.com
You can see it’s slightly more concise in Python. If you are familiar with SQL, you might have heard of inner and outer joins. This is where Pandas takes some of its argument names from.
Finally, if both data sets have multiple values in the column you’re joining on, the result can have more rows than either table. This is because every possible match shows up.
12.4 Long Versus Wide Data
12.4.1 Long Versus Wide in R
Many types of data can be stored in either a wide or long format.
The classical example is data from a longitudinal study. If an experimental unit (in the example below this is a person) is repeatedly measured over time, each row would correspond to an experimental unit and an observation time in a data set in a long form.
peopleNames <- c("Taylor","Taylor","Charlie","Charlie")
fakeLongData1 <- data.frame(person = peopleNames,
timeObserved = c(1, 2, 1, 2),
nums = c(100,101,300,301))
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
A long format can also be used if you have multiple observations (at a single time point) on an experimental unit. Here is another example.
myAttrs <- c("attrA","attrB","attrA","attrB")
fakeLongData2 <- data.frame(person = peopleNames,
attributeName = myAttrs,
nums = c(100,101,300,301))
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
If you would like to reshape the long data sets into a wide format, you can use the reshape()
function. You will need to specify which columns correspond with the experimental unit, and which column is the “factor” variable.
fakeWideData1 <- reshape(fakeLongData1,
direction = "wide",
timevar = "timeObserved",
idvar = "person",
varying = c("before","after"))
# ^ varying= arg becomes col names in new data set
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
fakeWideData1
## person before after
## 1 Taylor 100 101
## 3 Charlie 300 301
# timevar= is a misnomer here
fakeWideData2 <- reshape(fakeLongData2,
direction = "wide",
timevar = "attributeName",
idvar = "person",
varying = c("attribute A","attribute B"))
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
fakeWideData2
## person attribute A attribute B
## 1 Taylor 100 101
## 3 Charlie 300 301
reshape()
will also go in the other direction: it can take wide data and convert it into long data
reshape(fakeWideData1,
direction = "long",
idvar = "person",
varying = list(c("before","after")),
v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
reshape(fakeWideData2,
direction = "long",
idvar = "person",
varying = list(c("attribute A","attribute B")),
v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
12.4.2 Long Versus Wide in Python
With Pandas, we can take make long data wide with pd.DataFrame.pivot()
, and we can go in the other direction with pd.DataFrame.melt()
.
When going from long to wide, make sure to use the pd.DataFrame.reset_index()
method afterwards to reshape the data and remove the index. Here is an example similar to the one above.
import pandas as pd
fake_long_data1 = pd.DataFrame(
{'person' : ["Taylor","Taylor","Charlie","Charlie"],
'time_observed' : [1, 2, 1, 2],
'nums' : [100,101,300,301]})
fake_long_data1
## person time_observed nums
## 0 Taylor 1 100
## 1 Taylor 2 101
## 2 Charlie 1 300
## 3 Charlie 2 301
pivot_data1 = fake_long_data1.pivot(index='person',
columns='time_observed',
values='nums')
fake_wide_data1 = pivot_data1.reset_index()
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
Here’s one more example showing the same functionality–going from long to wide format.
people_names = ["Taylor","Taylor","Charlie","Charlie"]
attribute_list = ['attrA', 'attrB', 'attrA', 'attrB']
fake_long_data2 = pd.DataFrame({'person' : people_names,
'attribute_name' : attribute_list,
'nums' : [100,101,300,301]})
fake_wide_data2 = fake_long_data2.pivot(index='person',
columns='attribute_name',
values='nums').reset_index()
fake_wide_data2
## attribute_name person attrA attrB
## 0 Charlie 300 301
## 1 Taylor 100 101
Here are some examples of going in the other direction: from wide to long with pd.DataFrame.melt()
. The first example specifies value columns by integers.
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
fake_wide_data1.melt(id_vars = "person", value_vars = [1,2])
## person time_observed value
## 0 Charlie 1 300
## 1 Taylor 1 100
## 2 Charlie 2 301
## 3 Taylor 2 101
The second example uses strings to specify value columns.
12.5 Exercises
12.5.1 R Questions
Recall the car.data
data set (“Car Evaluation” 1997), which is hosted by (Dua and Graff 2017).
- Read in the data set as
carData
. - Convert the third and fourth columns to ordered
factor
s. - Order the data by the third and then the fourth column (simultaneously). Do not change the data in place. Instead store it under the name
ordCarData1
- Order the data by the fourth and then the third column (simultaneously). Do not change the data in place. Instead store it under the name
ordCarData2
day1Data <- data.frame(idNum = 1:10,
measure = rnorm(10))
day2Data <- data.frame(idNum = 11:20,
measure = rnorm(10))
- Pretend
day1Data
andday2Data
are two separate data sets that possess the same type of measures but on different experimental units. Stackday1Data
on top ofday2Data
and call the resultstackedData
. - Pretend
day1Data
andday2Data
are different measurements on the same experimental units. Place them shoulder to shoulder and call the resultsideBySide
. Putday1Data
first, andday2Data
second.
If you are dealing with random matrices, you might need to vectorize a matrix object. This is not the same as “vectorization” in programming. Instead, it means you write the matrix as a big column vector by stacking the columns on top of each other. Specifically, if you have a real-valued matrix , then
where is the th column as an column vector. There is another operator that we will use, the Kronecker product:
If is and is , then is .
- Write a function called
vec(myMatrix)
. Its input should be onematrix
object. It’s output should be avector
. Hint:matrix
objects are stored in column-major order. - Write a function called
unVec(myVector, nRows)
that takes in the vectorized matrix as avector
, splits that into elements withnRows
elements, and then places them together shoulder-to-shoulder as amatrix
. - Write a function called
stackUp(m, BMat)
that returns where is a length column vector of ones. You may check your work with%x%
, but do not use this in your function. - Write a function called
shoulderToShoulder(n, BMat)
that returns where is a length row vector of ones. You may check your work with%x%
, but do not use this in your function.
This problem uses the Militarized Interstate Disputes (v5.0) (Palmer et al.) data set from The Correlates of War Project. There are four .csv
files we use for this problem. MIDA 5.0.csv
contains the essential attributes of each militarized interstate dispute from 1/1/1816 through 12/31/2014. MIDB 5.0.csv
describes the participants in each of those disputes. MIDI 5.0.csv
contains the essential elements of each militarized interstate incident, and MIDIP 5.0.csv
describes the participants in each of those incidents.
- Read in the four data sets and give them the names
mida
,midb
,midi
, andmidp
. Take care to convert all instances of-9
toNA
. - Examine all rows of
midb
where itsdispnum
column equals2
. Do not changemidb
permanently. Are these two rows corresponding to the same conflict? If so, assignTRUE
tosameConflict
. Otherwise, assignFALSE
. - Join the first two data sets together on the dispute number column (
dispnum
). Call the resultingdata.frame
join1
. Do not address any concerns about duplicate columns. - Is there any difference between doing an inner join and an outer join in the previous question? If there was a difference, assign
TRUE
totheyAreNotTheSame
. Otherwise, assignFALSE
to it. - Join the last two data sets together by
incidnum
and call the resultjoin2
. Is there any difference between an inner and an outer join for this problem? Why or why not? Do not address any concerns about duplicate columns. - The codebook mentions that the last two data sets don’t go as far back in time as the first two. Suppose then that we only care about the events in
join2
. Mergejoin2
andjoin1
in a way where all undesired rows fromjoin1
are discarded, and all rows fromjoin2
are kept. Call the resultingdata.frame
midData
. Do not address any concerns about duplicate columns. - Use a scatterplot to display the relationship between the maximum duration and the end year. Plot each country as a different color.
- Create a
data.frame
calledlongData
that has the following three columns frommidp
:incidnum
(incident identification number)stabb
(state abbreviation of participant) andfatalpre
(precise number of fatalities). Convert this to “wide” format. Make the new table calledwideData
. Use the incident number row as a unique row-identifying variable. - Bonus Question: identify all column pairs that contain duplicate information in
midData
, remove all but one of the columns, and change the column name back to its original name.
12.5.2 Python Questions
Once again, recall the "car.data"
data set (“Car Evaluation” 1997).
- Read in the data set as
car_data
. - Order the data by the third and then the fourth column. Do not change the data in place. Instead store it under the name
ord_car_data1
- Order the data by the fourth and then the third column. Do not change the data in place. Instead store it under the name
ord_car_data2
Consider the following random data set.
indexes = np.random.choice(np.arange(20),size=20,replace=False)
d1 = pd.DataFrame({'a' : indexes,
'b' : np.random.normal(size=20)})
d2 = pd.DataFrame({'a' : indexes + 20,
'b' : np.random.normal(size=20)})
- Pretend
d1
andd2
are two separate data sets that possess the same type of measures but on different experimental units. Stackd1
on top ofd2
and call the resultstacked_data_sets
. Make sure theindex
of the result is the numbers through - Pretend
d1
andd2
are different measurements on the same experimental units. Place them shoulder to shoulder and call the resultside_by_side_data_sets
. Putd1
first, andd2
second.
Consider the following two data sets:
import numpy as np
import pandas as pd
dog_names1 = ['Charlie','Gus', 'Stubby', 'Toni','Pearl']
dog_names2 = ['Charlie','Gus', 'Toni','Arya','Shelby']
nicknames = ['Charles','Gus The Bus',np.nan,'Toni Bologna','Porl']
breed_names = ['Black Lab','Beagle','Golden Retriever','Husky',np.nan]
dataset1 = pd.DataFrame({'dog': dog_names1,
'nickname': nicknames})
dataset2 = pd.DataFrame({'dog':dog_names2,
'breed':breed_names})
- Join/merge the two data sets together in such a way that there is a row for every dog, whether or not both tables have information for that dog. Call the result
merged1
. - Join/merge the two data sets together in such a way that there are only rows for every dog in
dataset1
, whether or not there is information about these dogs’ breeds. Call the resultmerged2
. - Join/merge the two data sets together in such a way that there are only rows for every dog in
dataset2
, whether or not there is information about the dogs’ nicknames. Call the resultmerged3
. - Join/merge the two data sets together in such a way that all rows possess complete information. Call the result
merged4
.
Let’s consider Fisher’s “Iris” data set (Fisher 1988) again.
- Read in
iris.csv
and store theDataFrame
with the nameiris
. Let it have the column names'a'
,'b'
,'c'
,'d'
and'e'
. - Create a
DataFrame
calledname_key
that stores correspondences between long names and short names. It should have three rows and two columns. The long names are the unique values of column five ofiris
. The short names are either's'
,'vers'
or'virg'
. Use the column names'long name'
and'short name'
. - Merge/join the two data sets together to give
iris
a new column with information about short names. Do not overwriteiris
. Rather, give theDataFrame
a new name:iris_with_short_names
. Remove any columns with duplicate information. - Change the first four column names of
iris_with_short_names
tos_len
,s_wid
,p_len
, andp_wid
. Use Matplotlib to create a figure with 4 subplots arranged into a grid. On each subplot, plot a histogram of these four columns. Make sure to use x-axis labels so viewers can tell which column is being plotted in each subplot. - Let’s go back to
iris
. Change that to long format. Store it as aDataFrame
calledlong_iris
. Make the column namesrow
,variable
andvalue
, in that order. Last, make sure it is sorted (simultaneously/once) byrow
and thenvariable
.
References
Albemarle County Geographic Data Services Office. 2021. “Albemarle County GIS Web.” https://www.albemarle.org/government/community-development/gis-mapping/gis-data.
“Car Evaluation.” 1997. UCI Machine Learning Repository.
Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Fisher, Test, R.A. & Creator. 1988. “Iris.” UCI Machine Learning Repository.
Ford, Clay. 2016. “ggplot: Files for UVA StatLab workshop, Fall 2016.” GitHub Repository. https://github.com/clayford/ggplot2; GitHub.
Palmer, Glenn, Roseanne W McManus, Vito D’Orazio, Michael R Kenwick, Mikaela Karstens, Chase Bloch, Nick Dietrich, Kayla Kahn, Kellan Ritter, and Michael J Soules“The Mid5 Dataset, 2011–2014: Procedures, Coding Rules, and Description.” Conflict Management and Peace Science 0 (0): 0738894221995743. https://doi.org/10.1177/0738894221995743.
“SAS Viya Example Data Sets.” 2021. https://support.sas.com/documentation/onlinedoc/viya/examples.htm.