# Chapter 12 Reshaping and Combining Data Sets

## 12.1 Ordering and Sorting Data

Sorting a data set, in ascending order, say, is a common task. You might need to do it because

- ordering and ranking is commonly done in
*nonparametric statistics*, - you want to inspect the most “extreme” observations in a data set,
- it’s a pre-processing step before generating visualizations.

In R, it all starts with `vector`

s. There are two common functions you should know: `sort()`

and `order()`

. `sort()`

returns the sorted *data*, while `order()`

returns the *order indexes*.

```
sillyData <- rnorm(5)
print(sillyData)
## [1] 0.3903776 0.5796584 1.4929115 0.3704896 -1.3450719
sort(sillyData)
## [1] -1.3450719 0.3704896 0.3903776 0.5796584 1.4929115
order(sillyData)
## [1] 5 4 1 2 3
```

`order()`

is useful if you’re sorting a data frame by a particularly column. Below, we inspect the top 5 most expensive cars in an example data set (“SAS^{} Viya^{} Example Data Sets” 2021). Notice that we need to clean up the `MSRP`

(a `character`

vector) a little first. We use the function `gsub()`

to find patterns in the text, and replace them with the empty string.

```
carData <- read.csv("data/cars.csv")
noDollarSignMSRP <- gsub("$", "", carData$MSRP, fixed = TRUE)
carData$cleanMSRP <- as.numeric(gsub(",", "", noDollarSignMSRP,
fixed = TRUE))
rowIndices <- order(carData$cleanMSRP, decreasing = TRUE)[1:5]
carData[rowIndices,c("Make", "Model", "MSRP", "cleanMSRP")]
```

```
## Make Model MSRP cleanMSRP
## 335 Porsche 911 GT2 2dr $192,465 192465
## 263 Mercedes-Benz CL600 2dr $128,420 128420
## 272 Mercedes-Benz SL600 convertible 2dr $126,670 126670
## 271 Mercedes-Benz SL55 AMG 2dr $121,770 121770
## 262 Mercedes-Benz CL500 2dr $94,820 94820
```

In Python, Numpy has `np.argsort()`

and `np.sort()`

.

```
import numpy as np
silly_data = np.random.normal(size=5)
print(silly_data)
## [-0.52817175 -1.07296862 0.86540763 -2.3015387 1.74481176]
print( np.sort(silly_data) )
## [-2.3015387 -1.07296862 -0.52817175 0.86540763 1.74481176]
np.argsort(silly_data)
## array([3, 1, 0, 2, 4])
```

For Pandas’ `DataFrame`

s, most of the functions I find useful are methods attached to the `DataFrame`

class. That means that, as long as something is inside a `DataFrame`

, you can use dot notation.

```
import pandas as pd
car_data = pd.read_csv("data/cars.csv")
car_data['no_dlr_msrp'] = car_data['MSRP'].str.replace("$", "",
regex = False)
no_commas = car_data['no_dlr_msrp'].str.replace(",","")
car_data['clean_MSRP'] = no_commas.astype(float)
car_data = car_data.sort_values(by='clean_MSRP', ascending = False)
car_data[["Make", "Model", "MSRP", "clean_MSRP"]].head(5)
## Make Model MSRP clean_MSRP
## 334 Porsche 911 GT2 2dr $192,465 192465.0
## 262 Mercedes-Benz CL600 2dr $128,420 128420.0
## 271 Mercedes-Benz SL600 convertible 2dr $126,670 126670.0
## 270 Mercedes-Benz SL55 AMG 2dr $121,770 121770.0
## 261 Mercedes-Benz CL500 2dr $94,820 94820.0
```

Pandas’ `DataFrame`

s and `Series`

have a `.replace()`

method. We use this to remove dollar signs and commas from the MSRP column. Note that we had to access the `.str`

attribute of the `Series`

column before we used it. After the string was processed, we converted it to a `Series`

of `float`

s with the `astype()`

method.

Finally, sorting the overall data frame could have been done with the same approach as the code we used in R (i.e. raw subsetting by row indexes), but there is a built-in method called `sort_values()`

that will do it for us.

## 12.2 Stacking Data Sets and Placing Them Shoulder to Shoulder

Stacking data sets on top of each other is a common task. You might need to do it if

- you need to add a new row (or many rows) to a data frame,
- you need to recombine data sets (e.g. recombine a train/test split), or
- you’re creating a matrix in a step-by-step way.

In R, this can be done with `rbind()`

(short for “row bind”). Consider the following example that makes use of GIS data queried from (Albemarle County Geographic Data Services Office 2021) and cleaned with code from (Ford 2016).

```
realEstate <- read.csv("data/albemarle_real_estate.csv")
train <- realEstate[-1,]
test <- realEstate[1,]
str(rbind(test, train), strict.width = "cut")
## 'data.frame': 30381 obs. of 12 variables:
## $ YearBuilt : int 1769 1818 2004 2006 2004 1995 1900 1960 ..
## $ YearRemodeled: int 1988 1991 NA NA NA NA NA NA NA NA ...
## $ Condition : chr "Average" "Average" "Average" "Average" ..
## $ NumStories : num 1.7 2 1 1 1.5 2.3 2 1 1 1 ...
## $ FinSqFt : int 5216 5160 1512 2019 1950 2579 1530 800 9..
## $ Bedroom : int 4 6 3 3 3 3 4 2 2 2 ...
## $ FullBath : int 3 4 2 3 3 2 1 1 1 1 ...
## $ HalfBath : int 0 1 1 0 0 1 0 0 0 0 ...
## $ TotalRooms : int 8 11 9 10 8 8 6 4 4 4 ...
## $ LotSize : num 5.1 453.9 42.6 5 5.5 ...
## $ TotalValue : num 1096600 2978600 677800 453200 389200 ...
## $ City : chr "CROZET" "CROZET" "CROZET" "CROZET" ...
sum(rbind(test, train) != realEstate)
## [1] NA
```

The above example was with `data.frame`

s. This example of `rbind()`

is with `matrix`

objects.

```
rbind(matrix(1,nrow = 2, ncol = 3),
matrix(2,nrow = 2, ncol = 3))
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 2 2 2
## [4,] 2 2 2
```

In Python, you can stack data frames with `pd.concat()`

. It has a lot of options, so feel free to peruse them. You can also replace the call to `pd.concat()`

below with `test.append(train)`

. Consider the example below that uses the Albemarle County real estate data (Albemarle County Geographic Data Services Office 2021) (Ford 2016).

```
import pandas as pd
real_estate = pd.read_csv("data/albemarle_real_estate.csv")
train = real_estate.iloc[1:,]
test = real_estate.iloc[[0],] # need the extra brackets!
stacked = pd.concat([test,train], axis=0)
stacked.iloc[:3,:3]
## YearBuilt YearRemodeled Condition
## 0 1769 1988.0 Average
## 1 1818 1991.0 Average
## 2 2004 NaN Average
(stacked != real_estate).sum().sum()
## 28251
```

Take note of the extra square brackets when we create `test`

. If you use `real_estate.iloc[0,]`

instead, it will return a `Series`

with all the elements coerced to the same type, and this won’t `pd.concat()`

properly with the rest of the data!

## 12.3 Merging or Joining Data Sets

If you have two different data sets that provide information about the same experimental units, you can put the two data sets together using a ** merge** (aka

**) operation. In R, you can use the**

`join`

`merge()`

function. In Python, you can use the `.merge()`

method.Merging (or joining) data sets is not the same as placing them shoulder to shoulder. Placing data sets shoulder to shoulder will not reorder the rows of your data and the operation requires that both input data sets have the same number of rows to start off with. On the other hand, merging data takes care to match rows together in an intelligent way, and it can handle situations of missing or duplicate matches. In both cases, the resulting data set is wider, but with merging, the output might end contain either more or fewer rows.

Here’s a clarifying example. Suppose you have to sets of supposedly anonymized data about individual accounts on some online platforms.

```
# in R
baby1 <- read.csv("data/baby1.csv", stringsAsFactors = FALSE)
baby2 <- read.csv("data/baby2.csv", stringsAsFactors = FALSE)
head(baby1)
## idnum height.inches. email_address
## 1 1 74 fakeemail123@gmail.com
## 2 3 66 anotherfake@gmail.com
## 3 4 62 notreal@gmail.com
## 4 23 62 notreal@gmail.com
head(baby2)
## idnum phone email
## 1 3901283 5051234567 notreal@gmail.com
## 2 41823 5051234568 notrealeither@gmail.com
## 3 7198273 5051234568 anotherfake@gmail.com
```

The first thing you need to ask yourself is *“which column is the unique identifier that is shared between these two data sets?”* In our case, they both have an “identification number” column. However, these two data sets are coming from different online platforms, and these two places use different schemes to number their users.

In this case, it is better to merge on the email addresses. Users might be using different email addresses on these two platforms, but there’s a stronger guarantee that matched email addresses means that you’re matching the right accounts. The columns are named differently in each data set, so we must specify them by name.

```
# in R
merge(baby1, baby2, by.x = "email_address", by.y = "email")
## email_address idnum.x height.inches. idnum.y phone
## 1 anotherfake@gmail.com 3 66 7198273 5051234568
## 2 notreal@gmail.com 4 62 3901283 5051234567
## 3 notreal@gmail.com 23 62 3901283 5051234567
```

In Python, `merge()`

is a method attached to each `DataFrame`

instance.

```
# in Python
baby1.merge(baby2, left_on = "email_address", right_on = "email")
## idnum_x height(inches) ... phone email
## 0 3 66 ... 5051234568 anotherfake@gmail.com
## 1 4 62 ... 5051234567 notreal@gmail.com
## 2 23 62 ... 5051234567 notreal@gmail.com
##
## [3 rows x 6 columns]
```

The email addresses `anotherfake@gmail.com`

and `notreal@gmail.com`

exist in both data sets, so each of these email addresses will end up in the result data frame. The rows in the result data set are wider and have more attributes for each individual.

Notice the duplicate email address, too. In this case, either the user signed up for two accounts using the same email, or one person signed up for an account with another person’s email address. In the case of duplicates, both rows will match with the same rows in the other data frame.

Also, in this case, all email addresses that weren’t found in both data sets were thrown away. This does not necessarily need to be the intended behavior. For instance, if we wanted to make sure no rows were thrown away, that would be possible. In this case, though, for email addresses that weren’t found in both data sets, some information will be missing. Recall that Python and R handle missing data differently (see 3.8.2).

```
# in R
merge(baby1, baby2,
by.x = "email_address", by.y = "email",
all.x = TRUE, all.y = TRUE)
## email_address idnum.x height.inches. idnum.y phone
## 1 anotherfake@gmail.com 3 66 7198273 5051234568
## 2 fakeemail123@gmail.com 1 74 NA NA
## 3 notreal@gmail.com 4 62 3901283 5051234567
## 4 notreal@gmail.com 23 62 3901283 5051234567
## 5 notrealeither@gmail.com NA NA 41823 5051234568
```

```
# in Python
le_merge = baby1.merge(baby2,
left_on = "email_address", right_on = "email",
how = "outer")
le_merge.iloc[:5,3:]
## idnum_y phone email
## 0 NaN NaN NaN
## 1 7198273.0 5.051235e+09 anotherfake@gmail.com
## 2 3901283.0 5.051235e+09 notreal@gmail.com
## 3 3901283.0 5.051235e+09 notreal@gmail.com
## 4 41823.0 5.051235e+09 notrealeither@gmail.com
```

You can see it’s slightly more concise in Python. If you are familiar with SQL, you might have heard of inner and outer joins. This is where Pandas takes some of its argument names from.

Finally, if both data sets have multiple values in the column you’re joining on, the result can have more rows than either table. This is because *every possible match* shows up.

## 12.4 Long Versus Wide Data

### 12.4.1 Long Versus Wide in R

Many types of data can be stored in either a **wide** or **long** format.

The classical example is data from a *longitudinal study.* If an experimental unit (in the example below this is a person) is repeatedly measured over time, each row would correspond to an experimental unit *and* an observation time in a data set in a long form.

```
peopleNames <- c("Taylor","Taylor","Charlie","Charlie")
fakeLongData1 <- data.frame(person = peopleNames,
timeObserved = c(1, 2, 1, 2),
nums = c(100,101,300,301))
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
```

A long format can also be used if you have multiple observations (at a single time point) on an experimental unit. Here is another example.

```
myAttrs <- c("attrA","attrB","attrA","attrB")
fakeLongData2 <- data.frame(person = peopleNames,
attributeName = myAttrs,
nums = c(100,101,300,301))
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
```

If you would like to reshape the long data sets into a wide format, you can use the `reshape()`

function. You will need to specify which columns correspond with the experimental unit, and which column is the “factor” variable.

```
fakeWideData1 <- reshape(fakeLongData1,
direction = "wide",
timevar = "timeObserved",
idvar = "person",
varying = c("before","after"))
# ^ varying= arg becomes col names in new data set
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
fakeWideData1
## person before after
## 1 Taylor 100 101
## 3 Charlie 300 301
```

```
# timevar= is a misnomer here
fakeWideData2 <- reshape(fakeLongData2,
direction = "wide",
timevar = "attributeName",
idvar = "person",
varying = c("attribute A","attribute B"))
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
fakeWideData2
## person attribute A attribute B
## 1 Taylor 100 101
## 3 Charlie 300 301
```

`reshape()`

will also go in the other direction: it can take wide data and convert it into long data

```
reshape(fakeWideData1,
direction = "long",
idvar = "person",
varying = list(c("before","after")),
v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData1
## person timeObserved nums
## 1 Taylor 1 100
## 2 Taylor 2 101
## 3 Charlie 1 300
## 4 Charlie 2 301
reshape(fakeWideData2,
direction = "long",
idvar = "person",
varying = list(c("attribute A","attribute B")),
v.names = "nums")
## person time nums
## Taylor.1 Taylor 1 100
## Charlie.1 Charlie 1 300
## Taylor.2 Taylor 2 101
## Charlie.2 Charlie 2 301
fakeLongData2
## person attributeName nums
## 1 Taylor attrA 100
## 2 Taylor attrB 101
## 3 Charlie attrA 300
## 4 Charlie attrB 301
```

### 12.4.2 Long Versus Wide in Python

With Pandas, we can take make long data wide with `pd.DataFrame.pivot()`

, and we can go in the other direction with `pd.DataFrame.melt()`

.

When going from long to wide, make sure to use the `pd.DataFrame.reset_index()`

method afterwards to reshape the data and remove the index. Here is an example similar to the one above.

```
import pandas as pd
fake_long_data1 = pd.DataFrame(
{'person' : ["Taylor","Taylor","Charlie","Charlie"],
'time_observed' : [1, 2, 1, 2],
'nums' : [100,101,300,301]})
fake_long_data1
## person time_observed nums
## 0 Taylor 1 100
## 1 Taylor 2 101
## 2 Charlie 1 300
## 3 Charlie 2 301
pivot_data1 = fake_long_data1.pivot(index='person',
columns='time_observed',
values='nums')
fake_wide_data1 = pivot_data1.reset_index()
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
```

Here’s one more example showing the same functionality–going from long to wide format.

```
people_names = ["Taylor","Taylor","Charlie","Charlie"]
attribute_list = ['attrA', 'attrB', 'attrA', 'attrB']
fake_long_data2 = pd.DataFrame({'person' : people_names,
'attribute_name' : attribute_list,
'nums' : [100,101,300,301]})
fake_wide_data2 = fake_long_data2.pivot(index='person',
columns='attribute_name',
values='nums').reset_index()
fake_wide_data2
## attribute_name person attrA attrB
## 0 Charlie 300 301
## 1 Taylor 100 101
```

Here are some examples of going in the other direction: from wide to long with `pd.DataFrame.melt()`

. The first example specifies value columns by integers.

```
fake_wide_data1
## time_observed person 1 2
## 0 Charlie 300 301
## 1 Taylor 100 101
fake_wide_data1.melt(id_vars = "person", value_vars = [1,2])
## person time_observed value
## 0 Charlie 1 300
## 1 Taylor 1 100
## 2 Charlie 2 301
## 3 Taylor 2 101
```

The second example uses strings to specify value columns.

## 12.5 Exercises

### 12.5.1 R Questions

Recall the `car.data`

data set (“Car Evaluation” 1997), which is hosted by (Dua and Graff 2017).

- Read in the data set as
`carData`

. - Convert the third and fourth columns to
*ordered*`factor`

s. - Order the data by the third and then the fourth column (simultaneously). Do not change the data in place. Instead store it under the name
`ordCarData1`

- Order the data by the fourth and then the third column (simultaneously). Do not change the data in place. Instead store it under the name
`ordCarData2`

```
day1Data <- data.frame(idNum = 1:10,
measure = rnorm(10))
day2Data <- data.frame(idNum = 11:20,
measure = rnorm(10))
```

- Pretend
`day1Data`

and`day2Data`

are two separate data sets that possess the same type of measures but on different experimental units. Stack`day1Data`

on top of`day2Data`

and call the result`stackedData`

. - Pretend
`day1Data`

and`day2Data`

are different measurements on the same experimental units. Place them shoulder to shoulder and call the result`sideBySide`

. Put`day1Data`

first, and`day2Data`

second.

If you are dealing with random matrices, you might need to **vectorize** a matrix object. This is not the same as “vectorization” in programming. Instead, it means you write the matrix as a big column vector by stacking the columns on top of each other. Specifically, if you have a \(n \times p\) real-valued matrix \(\mathbf{X}\), then

\[\begin{equation} \text{vec}(\mathbf{X}) =\begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_p \end{bmatrix} \end{equation}\]

where \(\mathbf{X}_i\) is the \(i\)th column as an \(n \times 1\) column vector. There is another operator that we will use, the **Kronecker product:**

\[\begin{equation} \mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11} \mathbf{B} & \cdots & a_{1n} \mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{m1} \mathbf{B} & \cdots & a_{mn} \mathbf{B} \\ \end{bmatrix}. \end{equation}\]

If \(\mathbf{A}\) is \(m \times n\) and \(\mathbf{B}\) is \(p \times q\), then \(\mathbf{A} \otimes \mathbf{B}\) is \(pm \times qn\).

- Write a function called
`vec(myMatrix)`

. Its input should be one`matrix`

object. It’s output should be a`vector`

. Hint:`matrix`

objects are stored in column-major order. - Write a function called
`unVec(myVector, nRows)`

that takes in the vectorized matrix as a`vector`

, splits that into elements with`nRows`

elements, and then places them together shoulder-to-shoulder as a`matrix`

. - Write a function called
`stackUp(m, BMat)`

that returns \(\mathbf{1}_m \otimes \mathbf{B}\) where \(\mathbf{1}_m\) is a length \(m\) column vector of ones. You may check your work with`%x%`

, but do not use this in your function. - Write a function called
`shoulderToShoulder(n, BMat)`

that returns \(\mathbf{1}^\intercal_n \otimes \mathbf{B}\) where \(\mathbf{1}_n^\intercal\) is a length \(n\) row vector of ones. You may check your work with`%x%`

, but do not use this in your function.

This problem uses the Militarized Interstate Disputes (v5.0) (Palmer et al.) data set from The Correlates of War Project. There are four `.csv`

files we use for this problem. `MIDA 5.0.csv`

contains the essential attributes of each militarized interstate dispute from 1/1/1816 through 12/31/2014. `MIDB 5.0.csv`

describes the participants in each of those disputes. `MIDI 5.0.csv`

contains the essential elements of each militarized interstate incident, and `MIDIP 5.0.csv`

describes the participants in each of those incidents.

- Read in the four data sets and give them the names
`mida`

,`midb`

,`midi`

, and`midp`

. Take care to convert all instances of`-9`

to`NA`

. - Examine all rows of
`midb`

where its`dispnum`

column equals`2`

. Do not change`midb`

permanently. Are these two rows corresponding to the same conflict? If so, assign`TRUE`

to`sameConflict`

. Otherwise, assign`FALSE`

. - Join the first two data sets together on the dispute number column (
`dispnum`

). Call the resulting`data.frame`

`join1`

. Do not address any concerns about duplicate columns. - Is there any difference between doing an inner join and an outer join in the previous question? If there was a difference, assign
`TRUE`

to`theyAreNotTheSame`

. Otherwise, assign`FALSE`

to it. - Join the last two data sets together by
`incidnum`

and call the result`join2`

. Is there any difference between an inner and an outer join for this problem? Why or why not? Do not address any concerns about duplicate columns. - The codebook mentions that the last two data sets don’t go as far back in time as the first two. Suppose then that we only care about the events in
`join2`

. Merge`join2`

and`join1`

in a way where all undesired rows from`join1`

are discarded, and all rows from`join2`

are kept. Call the resulting`data.frame`

`midData`

. Do not address any concerns about duplicate columns. - Use a scatterplot to display the relationship between the maximum duration and the end year. Plot each country as a different color.
- Create a
`data.frame`

called`longData`

that has the following three columns from`midp`

:`incidnum`

(incident identification number)`stabb`

(state abbreviation of participant) and`fatalpre`

(precise number of fatalities). Convert this to “wide” format. Make the new table called`wideData`

. Use the incident number row as a unique row-identifying variable. - Bonus Question: identify all column pairs that contain duplicate information in
`midData`

, remove all but one of the columns, and change the column name back to its original name.

### 12.5.2 Python Questions

Once again, recall the `"car.data"`

data set (“Car Evaluation” 1997).

- Read in the data set as
`car_data`

. - Order the data by the third and then the fourth column. Do not change the data in place. Instead store it under the name
`ord_car_data1`

- Order the data by the fourth and then the third column. Do not change the data in place. Instead store it under the name
`ord_car_data2`

Consider the following random data set.

```
indexes = np.random.choice(np.arange(20),size=20,replace=False)
d1 = pd.DataFrame({'a' : indexes,
'b' : np.random.normal(size=20)})
d2 = pd.DataFrame({'a' : indexes + 20,
'b' : np.random.normal(size=20)})
```

- Pretend
`d1`

and`d2`

are two separate data sets that possess the same type of measures but on different experimental units. Stack`d1`

on top of`d2`

and call the result`stacked_data_sets`

. Make sure the`index`

of the result is the numbers \(0\) through \(39\) - Pretend
`d1`

and`d2`

are different measurements on the same experimental units. Place them shoulder to shoulder and call the result`side_by_side_data_sets`

. Put`d1`

first, and`d2`

second.

Consider the following two data sets:

```
import numpy as np
import pandas as pd
dog_names1 = ['Charlie','Gus', 'Stubby', 'Toni','Pearl']
dog_names2 = ['Charlie','Gus', 'Toni','Arya','Shelby']
nicknames = ['Charles','Gus The Bus',np.nan,'Toni Bologna','Porl']
breed_names = ['Black Lab','Beagle','Golden Retriever','Husky',np.nan]
dataset1 = pd.DataFrame({'dog': dog_names1,
'nickname': nicknames})
dataset2 = pd.DataFrame({'dog':dog_names2,
'breed':breed_names})
```

- Join/merge the two data sets together in such a way that there is a row for every dog, whether or not both tables have information for that dog. Call the result
`merged1`

. - Join/merge the two data sets together in such a way that there are only rows for every dog in
`dataset1`

, whether or not there is information about these dogs’ breeds. Call the result`merged2`

. - Join/merge the two data sets together in such a way that there are only rows for every dog in
`dataset2`

, whether or not there is information about the dogs’ nicknames. Call the result`merged3`

. - Join/merge the two data sets together in such a way that all rows possess complete information. Call the result
`merged4`

.

Let’s consider Fisher’s “Iris” data set (Fisher 1988) again.

- Read in
`iris.csv`

and store the`DataFrame`

with the name`iris`

. Let it have the column names`'a'`

,`'b'`

,`'c'`

,`'d'`

and`'e'`

. - Create a
`DataFrame`

called`name_key`

that stores correspondences between long names and short names. It should have three rows and two columns. The long names are the unique values of column five of`iris`

. The short names are either`'s'`

,`'vers'`

or`'virg'`

. Use the column names`'long name'`

and`'short name'`

. - Merge/join the two data sets together to give
`iris`

a new column with information about short names. Do not overwrite`iris`

. Rather, give the`DataFrame`

a new name:`iris_with_short_names`

. Remove any columns with duplicate information. - Change the first four column names of
`iris_with_short_names`

to`s_len`

,`s_wid`

,`p_len`

, and`p_wid`

. Use Matplotlib to create a figure with 4 subplots arranged into a \(2 \times 2\) grid. On each subplot, plot a histogram of these four columns. Make sure to use x-axis labels so viewers can tell which column is being plotted in each subplot. - Let’s go back to
`iris`

. Change that to long format. Store it as a`DataFrame`

called`long_iris`

. Make the column names`row`

,`variable`

and`value`

, in that order. Last, make sure it is sorted (simultaneously/once) by`row`

and then`variable`

.

### References

Albemarle County Geographic Data Services Office. 2021. “Albemarle County GIS Web.” https://www.albemarle.org/government/community-development/gis-mapping/gis-data.

“Car Evaluation.” 1997. UCI Machine Learning Repository.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Fisher, Test, R.A. & Creator. 1988. “Iris.” UCI Machine Learning Repository.

Ford, Clay. 2016. “ggplot: Files for UVA StatLab workshop, Fall 2016.” *GitHub Repository*. https://github.com/clayford/ggplot2; GitHub.

Palmer, Glenn, Roseanne W McManus, Vito D’Orazio, Michael R Kenwick, Mikaela Karstens, Chase Bloch, Nick Dietrich, Kayla Kahn, Kellan Ritter, and Michael J Soules“The Mid5 Dataset, 2011–2014: Procedures, Coding Rules, and Description.” *Conflict Management and Peace Science* 0 (0): 0738894221995743. https://doi.org/10.1177/0738894221995743.

“SAS^{} Viya^{} Example Data Sets.” 2021. https://support.sas.com/documentation/onlinedoc/viya/examples.htm.