When you read data into R you may silently introduce some issues that can complicate your data analysis. The most common problems are automatically handled by the
read_*() (underscore) functions of the readr package – which I encourage you to use (see this post). But if you are careful, of course you can use only base R – that is, the
read.*() (dot) functions – that come with R via the utils package. If that’s your choice you should read the documentation of
?read.table carefully. Here I want to highlight three arguments:
stringsAsFactors: Set it to FALSE to interpret text strings as text strings. Unfortunately, the default in R versions before 4.0.0 converts text strings to factors (read about the argument as.is). If you don’t know what a factor is (?factor) you most likely don’t want it. (Even if you do want factors, your code will be more readable if you explicitly coerce strings to factors later, with something like as.factor("your-string").) In short, always use stringsAsFactors = FALSE (for details read this post on the history of stringsAsFactors).
colClasses: A character vector of classes to be assumed for each column. For example,
c(x = "numeric", y = "integer", z = "character").
na.strings: A character vector of strings which are to be interpreted as NA values. For example:
c("", "NA", "NULL", "-")
I emphasize these arguments because they are buried among many other arguments, so they are easy to miss. (Sure, the arguments header and sep are important, but you are unlikely to miss them because they are in the second and third positions of the function’s signature; see ?read.table.)
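To check those positions yourself, you can list the leading arguments of read.table(). This is only a minimal sketch; the variable name first_args is mine.

```r
# List the first three formal arguments of read.table(),
# in the order they appear in the function's signature.
first_args <- names(formals(utils::read.table))[1:3]
print(first_args)
```

The arguments header and sep come right after file, whereas stringsAsFactors, colClasses and na.strings sit much further down the list.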
To show you these arguments in action let’s suppose I have a spreadsheet that looks like this:
I can simply import this dataset from the Environment tab of RStudio, using the option Import Dataset > From Text (base) …
The panel that pops up will help me to (1) find the file I want to import; (2) preview the dataset; and (3) select values for the most common arguments. (Notice that Strings as factors is checked.)
But to make my analysis reproducible I instead read the data via code. Accepting all defaults – as in the panel above – is equivalent to running this code:
my_path <- here::here("static/my_dataset.csv")
my_dataset <- read.csv(my_path)
And this is the result.
##    x    y    z
## 1  1    2    a
## 2  3    4
## 3 NA <NA>    b
## 4  0 NULL    -
## 5 NA    - NULL
## 'data.frame': 5 obs. of 3 variables:
##  $ x: int 1 3 NA 0 NA
##  $ y: Factor w/ 4 levels "-","2","4","NULL": 2 3 NA 4 1
##  $ z: Factor w/ 5 levels "","-","a","b",..: 3 1 4 2 5
In this example the defaults are not enough. Here are some problems:
x: I want a double (real number) but instead I got an integer.
y: I want an integer but instead I got a factor. That is because “NULL” was interpreted as the literal string “NULL”; thus the entire column was interpreted first as a text string and then converted to a factor.
z: I want a character string but instead I got a factor.
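If you want to reproduce these problems without the file, something like the sketch below may help. The contents of csv_text are my guess at what the file holds, reconstructed from the printed output, so treat them as an assumption. I set stringsAsFactors = TRUE explicitly to mimic the old default, because since R 4.0.0 the default is FALSE. The names csv_text, default_read and column_classes are hypothetical.

```r
# Hypothetical file contents, reconstructed from the printed output
# (an assumption, not the real file).
csv_text <- "x,y,z
1,2,a
3,4,
NA,NA,b
0,NULL,-
,-,NULL"

# Mimic the pre-4.0.0 default by asking for factors explicitly.
default_read <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the class that each column ended up with.
column_classes <- vapply(default_read, class, character(1))
print(column_classes)

# "NULL" and "-" survive as literal levels instead of becoming NA.
print(levels(default_read$y))
```

Checking the classes right after import is a quick way to spot columns that were read as something other than what you intended.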
But I can fix these problems with the arguments you just learned about.
my_path <- here::here("static/my_dataset.csv")
my_dataset <- read.csv(
  my_path,
  stringsAsFactors = FALSE,
  na.strings = c("", "NA", "NULL", "-"),
  colClasses = c("double", "integer", "character")
)
my_dataset
##    x  y    z
## 1  1  2    a
## 2  3  4 <NA>
## 3 NA NA    b
## 4  0 NA <NA>
## 5 NA NA <NA>
## 'data.frame': 5 obs. of 3 variables:
##  $ x: num 1 3 NA 0 NA
##  $ y: int 2 4 NA NA NA
##  $ z: chr "a" NA "b" NA ...
Now the dataset is ready for analysis.
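To try the fix without a file on disk, you could pass inline text to read.csv() via its text argument. This is a sketch under assumptions: csv_text is my reconstruction of the file contents from the printed output, and the names inline_dataset and z_factor are hypothetical. The last step shows the explicit coercion to a factor that I recommended earlier, in case you do want one.

```r
# Hypothetical file contents, reconstructed from the printed output
# (an assumption, not the real file).
csv_text <- "x,y,z
1,2,a
3,4,
NA,NA,b
0,NULL,-
,-,NULL"

inline_dataset <- read.csv(
  text = csv_text,
  stringsAsFactors = FALSE,
  # Treat all of these strings as missing values.
  na.strings = c("", "NA", "NULL", "-"),
  # A named vector matches classes to columns by name.
  colClasses = c(x = "numeric", y = "integer", z = "character")
)

str(inline_dataset)

# If a factor is really what you want, coerce explicitly afterwards.
z_factor <- as.factor(inline_dataset$z)
print(levels(z_factor))
```

Naming the elements of colClasses keeps the call readable and guards against assigning a class to the wrong column if the column order changes.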
Thanks to Suzanne Lao for sharing her tricks and for encouraging me to write this post.