How to read data safely with only base R?

When you read data into R you may silently introduce issues that complicate your analysis. The most common problems are handled automatically by the read_*() (underscore) functions of the readr package – which I encourage you to use (see this post). But if you are careful you can of course use only base R – that is, the read.*() (dot) functions that come with R via the utils package. If that’s your choice, read the documentation at ?read.table carefully. Here I want to highlight three arguments:

stringsAsFactors: Use FALSE to interpret text strings as text strings. Unfortunately, in R versions before 4.0.0 the default converts text strings to factors (see also the argument as.is); since R 4.0.0 the default is FALSE. If you don’t know what a factor is (?factor) you most likely don’t want it. (Even if you do want factors, your code will be more readable if you explicitly coerce strings to factors later, with something like as.factor("your-string").) In short, always set stringsAsFactors = FALSE explicitly, so your code behaves the same in any version of R (for details read this post on the history of stringsAsFactors).
colClasses: A character vector of classes to be assumed for each column. For example, c(x = "numeric", y = "integer", z = "character").
na.strings: A character vector of strings which are to be interpreted as NA values. For example, c("", "NA", "NULL", "-").
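
As a minimal sketch of how the three arguments above combine, the example below reads a small inline table through the text argument of read.csv() (so no file is needed) and then coerces one column to a factor explicitly. The data and the column names id, count and group are hypothetical, chosen only for illustration.

```r
# Hypothetical inline data; "-" and the empty last field stand for missing values.
csv_text <- "id,count,group
1,10,a
2,-,b
3,30,"

dat <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA", "NULL", "-"),
  colClasses = c(id = "numeric", count = "integer", group = "character")
)

# Coerce to a factor explicitly, and only where a factor is actually wanted.
dat$group_factor <- factor(dat$group)

str(dat)
```

Keeping the coercion in its own line makes it obvious which columns are factors by design rather than by accident.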

I emphasize these arguments because they are buried among many other arguments, so they are easy to miss. (Sure, the arguments header and sep are important, but you are unlikely to miss them because they are in the second and third positions of the function’s signature; see ?read.table.)
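
One way to see how deep these arguments sit is to look up their positions in the signature programmatically. The sketch below uses formals() from base R. The positions of the later arguments can differ between R versions, so no output is shown here.

```r
# Names of all arguments of read.table(), in signature order.
arg_names <- names(formals(utils::read.table))

# Position of each argument of interest within the signature.
of_interest <- c("header", "sep", "na.strings", "colClasses", "stringsAsFactors")
positions <- match(of_interest, arg_names)
names(positions) <- of_interest

positions
length(arg_names)  # total number of arguments
```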

To show you these arguments in action let’s suppose I have a spreadsheet that looks like this:

I can simply import this dataset from the Environment tab of RStudio, using the option Import Dataset > From Text (base) …

The panel that pops up will help me to (1) find the file I want to import; (2) preview the dataset; and (3) select values for the most common arguments. (Notice that Strings as factors is checked.)

But to make my analysis reproducible I instead read the data via code. Accepting all defaults – as in the panel above – is equivalent to running this code:

my_path <- here::here("static/my_dataset.csv")
my_dataset <- read.csv(my_path)

And this is the result.

my_dataset

##    x    y    z
## 1  1    2    a
## 2  3    4     
## 3 NA <NA>    b
## 4  0 NULL    -
## 5 NA    - NULL

str(my_dataset)

## 'data.frame':    5 obs. of  3 variables:
##  $ x: int  1 3 NA 0 NA
##  $ y: Factor w/ 4 levels "-","2","4","NULL": 2 3 NA 4 1
##  $ z: Factor w/ 5 levels "","-","a","b",..: 3 1 4 2 5

In this example the defaults are not enough. Here are some problems:

x: I want a double (real number) but instead I got an integer.
y: I want an integer but instead I got a factor. That is because “NULL” was interpreted as the literal string “NULL”; thus the entire column was interpreted first as a text string and then converted to a factor.
z: I want a character string but instead I got a factor.

But I can fix these problems with the arguments you just learned about.

my_path <- here::here("static/my_dataset.csv")

my_dataset <- read.csv(
  my_path,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA", "NULL", "-"),
  colClasses = c("double", "integer", "character")
)

my_dataset

##    x  y    z
## 1  1  2    a
## 2  3  4 <NA>
## 3 NA NA    b
## 4  0 NA <NA>
## 5 NA NA <NA>

str(my_dataset)

## 'data.frame':    5 obs. of  3 variables:
##  $ x: num  1 3 NA 0 NA
##  $ y: int  2 4 NA NA NA
##  $ z: chr  "a" NA "b" NA ...

Now the dataset is ready for analysis.
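
To try this without the original file, here is a self-contained sketch. The file contents are an assumption: a hypothetical CSV reconstructed to be consistent with the output printed above, written to a temporary file rather than to static/my_dataset.csv. It uses "numeric", the class name listed in ?read.table, for the double column, and it ends with checks that fail loudly if any column has an unexpected type.

```r
# Hypothetical file contents, reconstructed from the printed output above.
csv_lines <- c(
  "x,y,z",
  "1,2,a",
  "3,4,",
  "NA,NA,b",
  "0,NULL,-",
  "NA,-,NULL"
)
tmp_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_path)

safe <- read.csv(
  tmp_path,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA", "NULL", "-"),
  colClasses = c("numeric", "integer", "character")
)

# Fail early if any column did not get the intended type.
stopifnot(
  is.double(safe$x),
  is.integer(safe$y),
  is.character(safe$z)
)

# Count the missing values per column.
colSums(is.na(safe))
```

Putting the type checks right after the read means a change in the input file surfaces as an error at import time rather than as a confusing result later in the analysis.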

See an example using the argument row.names
Go to a similar post but using the readr package

Acknowledgements

Thanks to Suzanne Lao for sharing her tricks and for encouraging me to write this post.

fgeo blog
