Importing a dataset into R can be challenging. It often introduces silent issues that cause severe errors later in your data analysis. Getting this right is important, and it is also simple if you use RStudio and the readr package.

# install.packages("readr")
library(readr)
# install.packages("here")
library(here)
## here() starts at C:/Users/LeporeM/Documents/Dropbox/git/fgeo.blog

Suppose that you have a spreadsheet that looks like this:

A simple way to import this dataset is from the Environment tab of RStudio, using the option Import Dataset > From Text (readr) …

The panel that pops up helps you find the file you want to import, and shows useful previews of the data (central panel) and of the code (bottom left) that the selected options generate.

The defaults are often enough. You can click Import, but it is usually better to copy the generated code, paste it into your script, and run it from there, so that the import step is recorded in your analysis.

In this example the defaults are not enough. Some missing values are not automatically identified, and the column y is not parsed as the type I want: it comes in as character, whereas I want a double (real number) rather than an integer.

file_path <- here::here("static/my_data.csv")
my_data <- read_csv(file_path)
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
my_data
## # A tibble: 5 x 2
##       x y    
##   <int> <chr>
## 1     1 2    
## 2     3 4    
## 3    NA <NA> 
## 4     0 NULL 
## 5    NA -
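One way to find out which strings keep a column from being parsed as a number is to read everything as character and look for values that are neither missing nor numeric. The following is a minimal sketch, not part of the original workflow: the temporary file `tmp` is a hypothetical stand-in for `my_data.csv`, with contents made up to mimic the output above, and `raw`, `not_numeric` and `suspects` are illustrative names.

```r
library(readr)

# Hypothetical stand-in for the real file; the contents are made up
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4", "NA,NA", "0,NULL", "NA,-"), tmp)

# Read every column as character so that nothing is guessed or converted
raw <- read_csv(tmp, col_types = cols(.default = col_character()))

# Values of y that are not missing, yet cannot be converted to a number
not_numeric <- is.na(suppressWarnings(as.numeric(raw$y))) & !is.na(raw$y)
suspects <- unique(raw$y[not_numeric])
suspects
```

Any strings that show up in `suspects` are candidates for the `na` argument, provided they really stand for missing values in your data.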

To fix this I’ll show two additional arguments of readr::read_csv() (and friends) that are good to know about because they solve common problems: na and col_types.

my_data <- read_csv(
  file_path, 
  # Strings to treat as missing values
  na = c("", "NA", "NULL", "-999", "-"), 
  # Column types: "i" is integer, "d" is double
  col_types = list(x = "i", y = "d")
)

my_data
## # A tibble: 5 x 2
##       x     y
##   <int> <dbl>
## 1     1     2
## 2     3     4
## 3    NA    NA
## 4     0    NA
## 5    NA    NA
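Because import problems tend to be silent, it can help to check the result right after reading. The sketch below is one possible way to do that, under stated assumptions: the temporary file `tmp` is a hypothetical stand-in for `my_data.csv` with made-up contents, `checked` is an illustrative name, and the column types are spelled out with `cols()`, `col_integer()` and `col_double()` as a more explicit alternative to the one-letter codes.

```r
library(readr)

# Hypothetical stand-in for the real file; the contents are made up
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4", "NA,NA", "0,NULL", "NA,-"), tmp)

checked <- read_csv(
  tmp,
  na = c("", "NA", "NULL", "-999", "-"),
  # Same types as "i" and "d", written out in full
  col_types = cols(x = col_integer(), y = col_double())
)

# Stop with an error if the import is not what we expect
stopifnot(
  is.integer(checked$x),
  is.double(checked$y),
  nrow(problems(checked)) == 0
)
```

If any condition fails, `stopifnot()` stops the script at the import step, instead of letting a wrong column type surface much later in the analysis.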