Chapter 12 Reshaping and Combining Data Sets

12.1 Ordering and Sorting Data

Sorting a data set, in ascending order, say, is a common task. You might need to do it because

  1. ordering and ranking is commonly done in nonparametric statistics,
  2. you want to inspect the most “extreme” observations in a data set,
  3. it’s a pre-processing step before generating visualizations.

In R, it all starts with vectors. There are two common functions you should know: sort() and order(). sort() returns the sorted data, while order() returns the order indexes.

order() is useful if you’re sorting a data frame by a particularly column. Below, we inspect the top 5 most expensive cars in an example data set (“SAS Viya Example Data Sets” 2021). Notice that we need to clean up the MSRP (a character vector) a little first. We use the function gsub() to find patterns in the text, and replace them with the empty string.

##              Make                 Model     MSRP cleanMSRP
## 335       Porsche           911 GT2 2dr $192,465    192465
## 263 Mercedes-Benz             CL600 2dr $128,420    128420
## 272 Mercedes-Benz SL600 convertible 2dr $126,670    126670
## 271 Mercedes-Benz          SL55 AMG 2dr $121,770    121770
## 262 Mercedes-Benz             CL500 2dr  $94,820     94820

In Python, Numpy has np.argsort() and np.sort().

For Pandas’ DataFrames, most of the functions I find useful are methods attached to the DataFrame class. That means that, as long as something is inside a DataFrame, you can use dot notation.

Pandas’ DataFrames and Series have a .replace() method. We use this to remove dollar signs and commas from the MSRP column. Note that we had to access the .str attribute of the Series column before we used it. After the string was processed, we converted it to a Series of floats with the astype() method.

Finally, sorting the overall data frame could have been done with the same approach as the code we used in R (i.e. raw subsetting by row indexes), but there is a built-in method called sort_values() that will do it for us.

12.2 Stacking Data Sets and Placing Them Shoulder to Shoulder

Stacking data sets on top of each other is a common task. You might need to do it if

  1. you need to add a new row (or many rows) to a data frame,
  2. you need to recombine data sets (e.g. recombine a train/test split), or
  3. you’re creating a matrix in a step-by-step way.

In R, this can be done with rbind() (short for “row bind”). Consider the following example that makes use of GIS data queried from (Albemarle County Geographic Data Services Office 2021) and cleaned with code from (Ford 2016).

The above example was with data.frames. This example of rbind() is with matrix objects.

In Python, you can stack data frames with pd.concat(). It has a lot of options, so feel free to peruse them. You can also replace the call to pd.concat() below with test.append(train). Consider the example below that uses the Albemarle County real estate data (Albemarle County Geographic Data Services Office 2021) (Ford 2016).

Take note of the extra square brackets when we create test. If you use real_estate.iloc[0,] instead, it will return a Series with all the elements coerced to the same type, and this won’t pd.concat() properly with the rest of the data!

12.3 Merging or Joining Data Sets

If you have two different data sets that provide information about the same experimental units, you can put the two data sets together using a merge (aka join) operation. In R, you can use the merge() function. In Python, you can use the .merge() method.

Merging (or joining) data sets is not the same as placing them shoulder to shoulder. Placing data sets shoulder to shoulder will not reorder the rows of your data and the operation requires that both input data sets have the same number of rows to start off with. On the other hand, merging data takes care to match rows together in an intelligent way, and it can handle situations of missing or duplicate matches. In both cases, the resulting data set is wider, but with merging, the output might end contain either more or fewer rows.

Here’s a clarifying example. Suppose you have to sets of supposedly anonymized data about individual accounts on some online platforms.

The first thing you need to ask yourself is “which column is the unique identifier that is shared between these two data sets?” In our case, they both have an “identification number” column. However, these two data sets are coming from different online platforms, and these two places use different schemes to number their users.

In this case, it is better to merge on the email addresses. Users might be using different email addresses on these two platforms, but there’s a stronger guarantee that matched email addresses means that you’re matching the right accounts. The columns are named differently in each data set, so we must specify them by name.

In Python, merge() is a method attached to each DataFrame instance.

The email addresses anotherfake@gmail.com and notreal@gmail.com exist in both data sets, so each of these email addresses will end up in the result data frame. The rows in the result data set are wider and have more attributes for each individual.

Notice the duplicate email address, too. In this case, either the user signed up for two accounts using the same email, or one person signed up for an account with another person’s email address. In the case of duplicates, both rows will match with the same rows in the other data frame.

Also, in this case, all email addresses that weren’t found in both data sets were thrown away. This does not necessarily need to be the intended behavior. For instance, if we wanted to make sure no rows were thrown away, that would be possible. In this case, though, for email addresses that weren’t found in both data sets, some information will be missing. Recall that Python and R handle missing data differently (see 3.8.2).

You can see it’s slightly more concise in Python. If you are familiar with SQL, you might have heard of inner and outer joins. This is where Pandas takes some of its argument names from.

Finally, if both data sets have multiple values in the column you’re joining on, the result can have more rows than either table. This is because every possible match shows up.

12.4 Long Versus Wide Data

12.4.1 Long Versus Wide in R

Many types of data can be stored in either a wide or long format.

The classical example is data from a longitudinal study. If an experimental unit (in the example below this is a person) is repeatedly measured over time, each row would correspond to an experimental unit and an observation time in a data set in a long form.

A long format can also be used if you have multiple observations (at a single time point) on an experimental unit. Here is another example.

If you would like to reshape the long data sets into a wide format, you can use the reshape() function. You will need to specify which columns correspond with the experimental unit, and which column is the “factor” variable.

reshape() will also go in the other direction: it can take wide data and convert it into long data

12.4.2 Long Versus Wide in Python

With Pandas, we can take make long data wide with pd.DataFrame.pivot(), and we can go in the other direction with pd.DataFrame.melt().

When going from long to wide, make sure to use the pd.DataFrame.reset_index() method afterwards to reshape the data and remove the index. Here is an example similar to the one above.

Here’s one more example showing the same functionality–going from long to wide format.

Here are some examples of going in the other direction: from wide to long with pd.DataFrame.melt(). The first example specifies value columns by integers.

The second example uses strings to specify value columns.

12.5 Exercises

12.5.1 R Questions

Recall the car.data data set (“Car Evaluation” 1997), which is hosted by (Dua and Graff 2017).

  1. Read in the data set as carData.
  2. Convert the third and fourth columns to ordered factors.
  3. Order the data by the third and then the fourth column (simultaneously). Do not change the data in place. Instead store it under the name ordCarData1
  4. Order the data by the fourth and then the third column (simultaneously). Do not change the data in place. Instead store it under the name ordCarData2
  1. Pretend day1Data and day2Data are two separate data sets that possess the same type of measures but on different experimental units. Stack day1Data on top of day2Data and call the result stackedData.
  2. Pretend day1Data and day2Data are different measurements on the same experimental units. Place them shoulder to shoulder and call the result sideBySide. Put day1Data first, and day2Data second.

If you are dealing with random matrices, you might need to vectorize a matrix object. This is not the same as “vectorization” in programming. Instead, it means you write the matrix as a big column vector by stacking the columns on top of each other. Specifically, if you have a n×p real-valued matrix X, then

vec(X)=[X1Xp]

where Xi is the ith column as an n×1 column vector. There is another operator that we will use, the Kronecker product:

AB=[a11Ba1nBam1BamnB].

If A is m×n and B is p×q, then AB is pm×qn.

  1. Write a function called vec(myMatrix). Its input should be one matrix object. It’s output should be a vector. Hint: matrix objects are stored in column-major order.
  2. Write a function called unVec(myVector, nRows) that takes in the vectorized matrix as a vector, splits that into elements with nRows elements, and then places them together shoulder-to-shoulder as a matrix.
  3. Write a function called stackUp(m, BMat) that returns 1mB where 1m is a length m column vector of ones. You may check your work with %x%, but do not use this in your function.
  4. Write a function called shoulderToShoulder(n, BMat) that returns 1nB where 1n is a length n row vector of ones. You may check your work with %x%, but do not use this in your function.

This problem uses the Militarized Interstate Disputes (v5.0) (Palmer et al.) data set from The Correlates of War Project. There are four .csv files we use for this problem. MIDA 5.0.csv contains the essential attributes of each militarized interstate dispute from 1/1/1816 through 12/31/2014. MIDB 5.0.csv describes the participants in each of those disputes. MIDI 5.0.csv contains the essential elements of each militarized interstate incident, and MIDIP 5.0.csv describes the participants in each of those incidents.

  1. Read in the four data sets and give them the names mida, midb, midi, and midp. Take care to convert all instances of -9 to NA.
  2. Examine all rows of midb where its dispnum column equals 2. Do not change midb permanently. Are these two rows corresponding to the same conflict? If so, assign TRUE to sameConflict. Otherwise, assign FALSE.
  3. Join the first two data sets together on the dispute number column (dispnum). Call the resulting data.frame join1. Do not address any concerns about duplicate columns.
  4. Is there any difference between doing an inner join and an outer join in the previous question? If there was a difference, assign TRUE to theyAreNotTheSame. Otherwise, assign FALSE to it.
  5. Join the last two data sets together by incidnum and call the result join2. Is there any difference between an inner and an outer join for this problem? Why or why not? Do not address any concerns about duplicate columns.
  6. The codebook mentions that the last two data sets don’t go as far back in time as the first two. Suppose then that we only care about the events in join2. Merge join2 and join1 in a way where all undesired rows from join1 are discarded, and all rows from join2 are kept. Call the resulting data.frame midData. Do not address any concerns about duplicate columns.
  7. Use a scatterplot to display the relationship between the maximum duration and the end year. Plot each country as a different color.
  8. Create a data.frame called longData that has the following three columns from midp: incidnum (incident identification number) stabb (state abbreviation of participant) and fatalpre (precise number of fatalities). Convert this to “wide” format. Make the new table called wideData. Use the incident number row as a unique row-identifying variable.
  9. Bonus Question: identify all column pairs that contain duplicate information in midData, remove all but one of the columns, and change the column name back to its original name.

12.5.2 Python Questions

Once again, recall the "car.data" data set (“Car Evaluation” 1997).

  1. Read in the data set as car_data.
  2. Order the data by the third and then the fourth column. Do not change the data in place. Instead store it under the name ord_car_data1
  3. Order the data by the fourth and then the third column. Do not change the data in place. Instead store it under the name ord_car_data2

Consider the following random data set.

  1. Pretend d1 and d2 are two separate data sets that possess the same type of measures but on different experimental units. Stack d1 on top of d2 and call the result stacked_data_sets. Make sure the index of the result is the numbers 0 through 39
  2. Pretend d1 and d2 are different measurements on the same experimental units. Place them shoulder to shoulder and call the result side_by_side_data_sets. Put d1 first, and d2 second.

Consider the following two data sets:

  1. Join/merge the two data sets together in such a way that there is a row for every dog, whether or not both tables have information for that dog. Call the result merged1.
  2. Join/merge the two data sets together in such a way that there are only rows for every dog in dataset1, whether or not there is information about these dogs’ breeds. Call the result merged2.
  3. Join/merge the two data sets together in such a way that there are only rows for every dog in dataset2, whether or not there is information about the dogs’ nicknames. Call the result merged3.
  4. Join/merge the two data sets together in such a way that all rows possess complete information. Call the result merged4.

Let’s consider Fisher’s “Iris” data set (Fisher 1988) again.

  1. Read in iris.csv and store the DataFrame with the name iris. Let it have the column names 'a','b','c', 'd' and 'e'.
  2. Create a DataFrame called name_key that stores correspondences between long names and short names. It should have three rows and two columns. The long names are the unique values of column five of iris. The short names are either 's', 'vers' or 'virg'. Use the column names 'long name' and 'short name'.
  3. Merge/join the two data sets together to give iris a new column with information about short names. Do not overwrite iris. Rather, give the DataFrame a new name: iris_with_short_names. Remove any columns with duplicate information.
  4. Change the first four column names of iris_with_short_names to s_len, s_wid, p_len, and p_wid. Use Matplotlib to create a figure with 4 subplots arranged into a 2×2 grid. On each subplot, plot a histogram of these four columns. Make sure to use x-axis labels so viewers can tell which column is being plotted in each subplot.
  5. Let’s go back to iris. Change that to long format. Store it as a DataFrame called long_iris. Make the column names row, variable and value, in that order. Last, make sure it is sorted (simultaneously/once) by row and then variable.

References

Albemarle County Geographic Data Services Office. 2021. “Albemarle County GIS Web.” https://www.albemarle.org/government/community-development/gis-mapping/gis-data.

“Car Evaluation.” 1997. UCI Machine Learning Repository.

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Fisher, Test, R.A. & Creator. 1988. “Iris.” UCI Machine Learning Repository.

Ford, Clay. 2016. “ggplot: Files for UVA StatLab workshop, Fall 2016.” GitHub Repository. https://github.com/clayford/ggplot2; GitHub.

Palmer, Glenn, Roseanne W McManus, Vito D’Orazio, Michael R Kenwick, Mikaela Karstens, Chase Bloch, Nick Dietrich, Kayla Kahn, Kellan Ritter, and Michael J Soules“The Mid5 Dataset, 2011–2014: Procedures, Coding Rules, and Description.” Conflict Management and Peace Science 0 (0): 0738894221995743. https://doi.org/10.1177/0738894221995743.