Chapter 9 Input and Output

9.1 General Input Considerations

So far, this text has been favoring the creation of small pieces of data within our scripts. The avoidance of reading in data from an external file has been avoided primarily for pedagogical purposes. In general, one might have

data read in from a plain text file (e.g. "my_data.csv" or "log_file.txt" ),
data read in from a database (e.g. MySQL, PostgreSQL, etc.), or
data created in a script (either deterministically or randomly).

When discussing reading in data, this text mostly focuses on the first category. Here are the reasons for its doing so:

text-files are more readily-available to students than databases,
teaching the second category requires teaching SQL, and that would introduce conceptual overlap, and
the third category is programmatically self-explanatory.

The third reason does not imply data created by code is unimportant. For example, it is the most common approach to create data used in simulation studies. Authors writing statistical papers need to demonstrate that their techniques work on “nice” data: data simulated from a known data-generating process. In a simulation study, unlike in the “real-world,” you have access to the parameters generating your data, and you can examine data that might otherwise be unobserved or hidden. Further, with data from the real-world, there is no guarantee your model correctly matches the true model.

Can your code/technique/algorithm, at the very least, obtain parameter estimates that are “in-line” with the parameters your code is using to simulate data? Are forecasts or predictions obtained by your method accurate? These kinds of questions can often only be answered by simulating fake data. Programmatically, simulating data like this largely involves calling functions that we have seen before (e.g. rnorm() in R or np.random.choice() in Python). This may or may not involve setting a pseudorandom number seed, first, for reproducibility.

Also, benchmark data sets are often readily available through specialized function calls.

Even though this chapter is written to teach you how to read in files into R and Python, you should not expect that you will know how to read in all data sets after reading this section. For both R and Python, there are an enormous amount of functions, different functions have different return types, different functions are suited for different file types, many functions are spread across a plethora of third party libraries, and many of these functions have an enormous amount of arguments. You will probably not be able to memorize everything. In my very humble opinion, I doubt you should want to.

Instead, focus on developing your ability to identify and diagnose data input problems. Reading in a data set correctly is often a process of trial-and-error. After attempting to read in a data set, always check the following items. Many of these points were previously mentioned in section @(data-frames-in-r). Some apply to reading in text data more than reading in structured data from a database, and vice versa.

Check that the correct column separator was used, or the correct “fixed-width format” was expected. If mistakes are made, data frame columns are going to be combined or split apart in weird ways, and often the wrong types are going to be used for pieces of data (e.g. "2,3" instead of 2 and 3.) Also, watch out for when separators are found inside data elements or column names. For example, sometimes it’s unclear whether people’s names in the “last, first” format can be stored in one or two columns. Also, text data might surprise you with unexpected spaces or other whitespace is a common separator.
Check that the column names were parsed and stored correctly. Column names should not be stored as data in R/Python. Functions that read in data should not expect column names when they don’t exist in the actual file.
Check that empty space and metadata was ignored correctly. Data descriptions are sometimes stored in the same file as the data itself, and that should be skipped over when it’s being read in. Empty space between column names and data shouldn’t be stored. This can occur at the beginning of the file, and even at the end of the file.
Check that type choice and recognition of special characters are performed correctly. Are letters stored as strings or as something else such as an R factor? Are dates and times stored as a special date/time type, or as strings? Is missing data correctly identified? Sometimes data providers use outrageous numbers like \(-9999\) to represent missing data–don’t store that as a float or integer!
Be ready to prompt R or Python to recognize a specific character encoding if you are reading in text data written in another language. All text data has a character encoding, which is a mapping of numbers to characters. Any specific encoding will dictate what characters are recognizable in a program. If you try to read in data written in another language, the function you are using will likely complain about unrecognized characters. Fortunately, these errors and warnings are easily fixed by specifying a nondefault argument such as encoding= or fileEncoding=.

This is no small task. To make matters worse:

you can’t (or shouldn’t) edit the raw data to suit your needs, to make it easier to read in. You have to work with what you are given. If you were allowed to edit, say, a text file you downloaded onto your own machine, you shouldn’t–it will lead to code that doesn’t run anywhere else. Additionally, if you abuse write privileges on your company’s database, for example–that could be very dangerous as well.
Data sets are often quite large, so manually checking each element is often impossible. In this situation you will have to resign yourself to checking the top and bottom of a data set, or maybe anticipate a specific place where problems are likely to appear.

9.2 Reading in Text Files with R

You’ve seen examples of read.csv() used earlier in the book, so it should not surprise you that this is one of the most common ways to read in data in R. Another important function is read.table().

If you look at the source code for read.csv() (type the name of the function without parentheses into the console and press the <Enter> key), you will see it calls read.table(). The primary difference between these functions is default arguments. Mind the default arguments. Do not be completely averse to writing a long-line of code to read in a data set correctly. Or do, and choose the function with the best default arguments.

Consider the “Challenger USA Space Shuttle O-Ring Data Set” from (Dua and Graff 2017). The first few rows of the raw text file¹⁴ looks like this.

6 0 66  50  1
6 1 70  50  2
6 0 69  50  3

It does not use commas as separators, and there is no header information, so read.csv() used with its default arguments will produce an incorrect result. It will miss the first row by counting it as a column name, and store everything in one column with the wrong type.

d <- read.csv("data/o-ring-erosion-only.data")
dim(d) # one row short, only 1 col
## [1] 22  1
typeof(d[,1])
## [1] "character"

Specifying header=FALSE fixes the column name issue, but sep = " " does not fix the separator issue.

d <- read.csv("data/o-ring-erosion-only.data", 
              header=FALSE, sep = " ")
str(d)
## 'data.frame':    23 obs. of  7 variables:
##  $ V1: int  6 6 6 6 6 6 6 6 6 6 ...
##  $ V2: int  0 1 0 0 0 0 0 0 1 1 ...
##  $ V3: int  66 70 69 68 67 72 73 70 57 63 ...
##  $ V4: int  NA NA NA NA NA NA 100 100 200 200 ...
##  $ V5: int  50 50 50 50 50 50 NA NA NA 10 ...
##  $ V6: int  NA NA NA NA NA NA 7 8 9 NA ...
##  $ V7: int  1 2 3 4 5 6 NA NA NA NA ...

One space is strictly one space. Some rows have two, though. This causes there to be two too many columns filled with NAs.

After digging into the documentation a bit further, you will notice that "" works for “one or more spaces, tabs, newlines or carriage returns.” This is why read.table(), with its default arguments, works well.

d <- read.table("data/o-ring-erosion-only.data")
str(d)
## 'data.frame':    23 obs. of  5 variables:
##  $ V1: int  6 6 6 6 6 6 6 6 6 6 ...
##  $ V2: int  0 1 0 0 0 0 0 0 1 1 ...
##  $ V3: int  66 70 69 68 67 72 73 70 57 63 ...
##  $ V4: int  50 50 50 50 50 50 100 100 200 200 ...
##  $ V5: int  1 2 3 4 5 6 7 8 9 10 ...

This data set has columns whose widths are “fixed”, too. It is in “fixed width format” because any given column has all its elements take up a constant amount of characters. The third column has integers with two or three digits, but no matter what, each row has the same number of characters.

You may choose to exploit this and use a specialized function that reads in data in a fixed width format (e.g. read.fwf()). The frustrating thing about this approach, though, is that you have to specify what those widths are. This can be quite tedious, particularly if your data set has many columns and/or many rows. The upside though, is that the files can be a little bit smaller, because the data provider does not have to waste characters on separators.

In the example below, we specify widths that include blank spaces to the left of the digits. On the other hand, if we specified widths=c(2,2,4,4,1), which includes spaces to the right of digits, then columns would have been recognized as characters.

d <- read.fwf("data/o-ring-erosion-only.data", 
              widths = c(1,2,3,4,3)) # or try c(2,2,4,4,1)
str(d)
## 'data.frame':    23 obs. of  5 variables:
##  $ V1: int  6 6 6 6 6 6 6 6 6 6 ...
##  $ V2: int  0 1 0 0 0 0 0 0 1 1 ...
##  $ V3: int  66 70 69 68 67 72 73 70 57 63 ...
##  $ V4: int  50 50 50 50 50 50 100 100 200 200 ...
##  $ V5: int  1 2 3 4 5 6 7 8 9 10 ...

If you need to read in some text data that does not possess a tabular structure, then you may need readLines(). This function will read in all of the text, separate each line into an element of a character vector, and will not make any attempt to parse lines into columns. Further processing can be accomplished using the techniques from section 3.9.

html_data <- readLines("data/Google.html", warn = FALSE)
head(html_data, 1)
## [1] "<!DOCTYPE html>"

Some of you may have had difficulty reading in the above data. This can happen if your machine’s default character encoding is different than mine. For instance, if your character encoding is “GBK”, then you might get a warning message like “invalid input found on input connection.” This message means that your machine didn’t recognize some of the characters in the data set.

These errors are easy to fix, though, so don’t worry. Just specify an encoding argument in your function that reads in data.

tmp <- read.table("data/Google.html", sep = "~", 
                  fileEncoding = "UTF-8") # makes errors disappear

9.3 Reading in Text Files with Pandas

A wide variety of different file formats can be read in with Pandas. I will only mention a few functions here.

Recall R has read.table() and read.csv(), and that they are very similar. In Pandas, pd.read_csv() and pd.read_table() have a lot in common, too. Their primary difference is the default column separator, as well.

Recall the O-Ring data from above. The columns are not separated by commas, so if we treat it as a comma-separated file, the resulting Pandas DataFrame is going to be missing all but one of its columns.

import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data")
d.shape # one column and missing a row
## (22, 1)
d.columns # column labels are data
## Index(['6 0 66  50  1'], dtype='object')

By default, pd.read_csv() is expecting column labels, which is also a problem. Unlike R, though, the header= argument is not expected to be a Boolean. You will need to provide a None, instead. The separator needs to be just right, too.

pd.read_csv("data/o-ring-erosion-only.data", 
              header=None, sep = " ").head(2) # 1 space: no
##    0  1   2   3     4   5    6
## 0  6  0  66 NaN  50.0 NaN  1.0
## 1  6  1  70 NaN  50.0 NaN  2.0
pd.read_csv("data/o-ring-erosion-only.data", 
              header=None, sep = "\t").head(2) # tabs: no
##                0
## 0  6 0 66  50  1
## 1  6 1 70  50  2
pd.read_table("data/o-ring-erosion-only.data", 
              header=None).head(2) # default sep is tabs, so no
##                0
## 0  6 0 66  50  1
## 1  6 1 70  50  2
pd.read_csv("data/o-ring-erosion-only.data", 
              header=None, sep = "\s+").head(2) # 1 or more spaces: yes
##    0  1   2   3  4
## 0  6  0  66  50  1
## 1  6  1  70  50  2

Reading in fixed width files can be done in a way that is nearly identical to the way we did it in R. Here is an example.

d = pd.read_fwf("data/o-ring-erosion-only.data", 
                widths = [1,2,3,4,3], header=None) # try [2,2,4,4,1]
d.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 23 entries, 0 to 22
## Data columns (total 5 columns):
##  #   Column  Non-Null Count  Dtype
## ---  ------  --------------  -----
##  0   0       23 non-null     int64
##  1   1       23 non-null     int64
##  2   2       23 non-null     int64
##  3   3       23 non-null     int64
##  4   4       23 non-null     int64
## dtypes: int64(5)
## memory usage: 1.0 KB

If you had chosen widths=[2,2,4,4,1], instead, then the trailing whitespace will cause Pandas to recognize a dtype of object. The reason it is not recognized as a string is because strings can be of different length, and all string types specify a maximum length. If you want to enforce a maximum length, there may be some speed advantages. In the below example, we use d.astype() to convert two columns’ types to pd.StringDtype.

d = pd.read_fwf("data/o-ring-erosion-only.data", 
                  widths = [2,2,4,4,1], header=None)
list(d.dtypes)[:4]
## [dtype('int64'), dtype('int64'), dtype('O'), dtype('O')]
d = d.astype({2:'string', 3:'string'}) 
list(d.dtypes)[:4]
## [dtype('int64'), dtype('int64'), StringDtype, StringDtype]

Just like in R, you may run into an encoding issue with a file. For instance, the following will not work because the file contains Chinese characters. If you mostly work with UTF-8 files, you will receive a UnicodeDecodeError if you try to run the following code.

pd.read_csv("data/message.txt")

However, the error messages disappear when you specify encoding="gbk".¹⁵

pd.read_csv("data/message.txt", encoding = "gbk")
## Empty DataFrame
## Columns: [恭喜发财]
## Index: []

You may also read in unstructured, nontabular data with Python. Use the built-in open() function to open up a file in read mode, and then use f.readlines() to return a list of strings.

f = open("data/Google.html", "r")
d = f.readlines()
d[:1]
## ['<!DOCTYPE html>\n']
print(type(d), type(d[0]))
## <class 'list'> <class 'str'>

9.4 Saving Data in R

Storing data is important for saving your progress. For example, sometimes running a script that performs data cleaning can take a very long time. Saving your progress might free you from the responsibility of running that script many times.

In R, there are many options for storing data. I will mention two: writing data out to a plain text file, and saving a serialized object.

9.4.1 Writing Out Tabular Plain Text Data in R

If you want to write out tabular data to a text file, use write.table() or write.csv(). There are two arguments that you must specify, at a minimum: the first argument is your R object (typically a matrix or data.frame), and the second argument is the file path on your hard drive.

Here is an example of writing out d to a file called "oring_out.csv". I choose to include column names, but not row names. I also use commas to separate columns.

write.table(d, file = "data/oring_out.csv", 
            col.names = TRUE, row.names = FALSE, sep = ";")

The above will not print anything to the R console, but we can use a text editor to take a look at the raw text file on our hard drive. Here are the first three rows.

"V1";"V2";"V3";"V4";"V5"
6;0;66;50;1
6;1;70;50;2

9.4.2 Serialization in R

Alternatively you may choose to store your data in a serialized form. With this approach, you are still saving your data in a more permanent way to your hard drive, but it is stored in format that’s usually more memory efficient.

Recall that a common reason for writing out data is to save your progress. When you want to save your progress, it is important to ask yourself: “is it better to save my progress as a serialized object, or as a raw text file?”

When making this decision, consider versatility. On the one hand, raw text files are more versatile and can be used in more places. On the other hand, versatility is often bug prone.

For example, suppose you want to save a cleaned up data.frame. Are you sure you will remember to store that column of strings as character and not a factor? Does any code that uses this data.frame require that this column be in this format?

For instance, let’s save the object d in a file called oring.rds.

saveRDS(d, file = "data/oring.rds")
rm(d)
exists("d")
## [1] FALSE

After it is saved with saveRDS(), we are free to delete the variable with rm(), because it can be read back in later on. To do this, call readRDS(). This is file has a special format that is recognized by R, so you will not need to worry about any of the usual struggles that occur when reading in data from a plain text file. Additionally, .rds files are typically smaller–oring.rds is only 248 bytes, while "oring_out.csv" is 332 bytes.

d2 <- readRDS(file = "data/oring.rds")
head(d2, 3)
##   V1 V2 V3 V4 V5
## 1  6  0 66 50  1
## 2  6  1 70 50  2
## 3  6  0 69 50  3

You can serialize multiple objects at once, too! Convention dictates that these files end with the .RData suffix. Save your entire global environment with save() or save.image(), and bring it back with load() or attach().

rm(list=ls()) # remove everything
a <- 1
b <- 2
save.image(file = "data/my-current-workspace.RData")
rm(list=ls()) 
load("data/my-current-workspace.RData")
ls() # print all objects in your workspace
## [1] "a" "b"

9.5 Saving Data in Python

9.5.1 Writing Out Tabular Plain Text Data in Python

You can write out tabular data with a variety of DataFrame methods that are named to_*().. pd.DataFrame.to_csv() has a lot of common with write.csv() in R. Below we write out d to a file called oring_out2.csv.

import pandas as pd
d = pd.read_csv("data/o-ring-erosion-only.data", 
                  header=None, sep = "\s+")
d.to_csv("data/oring_out2.csv", 
          header=True, index=False, sep = ",")

Here is how the first few rows of that file looks in a text editor.

0,1,2,3,4
6,0,66,50,1
6,1,70,50,2

9.5.2 Serialization in Python

Serialization functionality is readily available in Python, just like it is in R. In Python, the pickle and cPickle libraries are probably the most commonly used. Serializing objects with these libraries is known as pickling an object.

Pandas has a .to_pickle() wrapper method attached to every DataFrame. Once the pickled object is saved, the file can be read back into Python with pd.read_pickle(). These functions are extremely convenient, because they call all the required pickle code and hide a decent amount of complexity.

Here is an example of writing out d and then reading the pickled object back in. In Python 3, the file suffix for pickled objects is usually .pickle, but there are many other choices.

d.to_pickle("data/oring.pickle")
del d
d_is_back = pd.read_pickle("data/oring.pickle")
d_is_back.head(2)
##    0  1   2   3  4
## 0  6  0  66  50  1
## 1  6  1  70  50  2

Unfortunately, "oring.pickle" is much larger (1,676 bytes) than the original text file "o-ring-erosion-only.data" (322 bytes). This is for two reasons. First, the original data set is small, so the overhead of pickling this object is relatively pronounced, and second, we are not taking advantage of any compression. If you use something like d_is_back.to_pickle("data/oring.zip") it will become smaller.

In Python, unlike in R, it is more difficult to serialize all of the objects you currently have in memory. It is possible, but it will likely require the use of a third-party library.

Speaking of third-party code, there are many that provide alternative serialization solutions in both R and Python. I do not discuss any in this text. However, I will mention that some of them may provide combinations of the following: an increase in read and write speed, a decrease in required memory, improved security¹⁶, improved human readability and interoperability between multiple programming languages. If any of these sound potentially beneficial, I encourage you to conduct further research.

9.6 Exercises

9.6.1 R Questions

Consider again the data set called "gspc.csv", which contains daily open, high, low and close values for the S&P500 Index.

Read in this data set as a data.frame, and call it myData. Do not include the code that achieves this in your assignment submission.
Write out this object as myData.rds. After you are finished, remove myData from memory. Do not include the code that achieves this in your assignment submission.
Read in myData.rds, and store the variable as financialData. Do include the code that achieves this in your project submission. Make sure this code assumes that myData.rds is in the same folder as the code file io_lab.R.

9.6.2 Python Questions

We will use the "Google.html" data set mentioned in the chapter.

Use open() to open the "Google.html" file. Store the output of the function as my_file.
Use the .readlines() method of the file to write the contents of the file as a list called html_data
Coerce the list to a DataFrame with one column called html
Create a Series called nchars_ineach that stores the number of characters in each line of text. Hint: the Series.str attribute has a lot of helpful methods.
Create an int-like variable called num_div_tags that holds the total number of times the phrase “<div>” appears in the file.

Consider the data set called "gspc.csv", which contains daily open, high, low and close values for the S&P500 Index.

Read in this data set as a DataFrame, and call it my_data. Do not include the code that achieves this in your assignment submission.
Write out this object as "my_data.pickle". After you are finished, remove my_data from memory. Do not include the code that achieves this in your assignment submission.
Read in "my_data.pickle", and store the variable as financial_data. Do include the code that achieves this in your project submission. Make sure this code assumes that "my_data.pickle" is in the same folder as the code file io_lab.py.

References

Dua, Dheeru, and Casey Graff. 2017. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.

Open raw text files with text editor programs, not with programs that perform any kind of processing. For instance, if you open it with Microsoft Excel, the appearance of the data will change, and important information helping you to read your data into R or Python will not be available to you.↩
A list of more options of encodings that are built into Python,are available here.↩
The documentation for pickle mentions that the library is “not secure against erroneous or maliciously constructed data” and recommends that you “[n]ever unpickle data received from an untrusted or unauthenticated source.”↩