Last updated: 2019-10-02

Checks: 7 passed, 0 failed

Knit directory: wflow-r4ds/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9001). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20190925) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
html 5472b4d John Blischak 2019-10-02 Build site.
Rmd a23c44c John Blischak 2019-10-02 Chp 8 exercises on readr

Setup

library(tidyverse)

Getting started

p. 128

  1. What function would you use to read a file where fields were separated with “|”?

read_delim() with delim = "|".
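
For example, a minimal sketch using literal data in place of a file path (readr treats a string containing a newline as the file contents):

# quick check with pipe-delimited literal data
read_delim("x|y\n1|2", delim = "|")
# should return a 1-row tibble with numeric columns x and y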

  2. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

intersect(names(formals(read_csv)), names(formals(read_tsv)))
 [1] "file"            "col_names"       "col_types"      
 [4] "locale"          "na"              "quoted_na"      
 [7] "quote"           "comment"         "trim_ws"        
[10] "skip"            "n_max"           "guess_max"      
[13] "progress"        "skip_empty_rows"

In fact they share all the same arguments:

identical(names(formals(read_csv)), names(formals(read_tsv)))
[1] TRUE

read_csv() and read_tsv() are both wrappers around the internal function read_delimited():

names(formals(readr:::read_delimited))
 [1] "file"            "tokenizer"       "col_names"      
 [4] "col_types"       "locale"          "skip"           
 [7] "skip_empty_rows" "comment"         "n_max"          
[10] "guess_max"       "progress"       

  3. What are the most important arguments to read_fwf()?

names(formals(read_fwf))
 [1] "file"            "col_positions"   "col_types"      
 [4] "locale"          "na"              "comment"        
 [7] "trim_ws"         "skip"            "n_max"          
[10] "guess_max"       "progress"        "skip_empty_rows"

The key argument is col_positions, which the help page describes as:

Column positions, as created by fwf_empty(), fwf_widths() or fwf_positions(). To read in only selected fields, use fwf_positions(). If the width of the last column is variable (a ragged fwf file), supply the last end position as NA.
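
For example, a sketch adapted from the ?read_fwf help page, assuming the fwf-sample.txt example file that ships with readr:

# locate the example fixed-width file bundled with readr
fwf_sample <- readr_example("fwf-sample.txt")
# supply fixed widths and column names for the three fields
read_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))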

  4. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

"x,y\n1,'a,b'"

Set quote to ':

read_delim("x,y\n1,'a,b'", delim = ",", quote = "'")
# A tibble: 1 x 2
      x y    
  <dbl> <chr>
1     1 a,b  

As of readr 1.1.0 (released in March 2017), you can just use read_csv():

read_csv("x,y\n1,'a,b'", quote = "'")
# A tibble: 1 x 2
      x y    
  <dbl> <chr>
1     1 a,b  

  5. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

# 2 column names but 3 columns of data
read_csv("a,b\n1,2,3\n4,5,6")
Warning: 2 parsing failures.
row col  expected    actual         file
  1  -- 2 columns 3 columns literal data
  2  -- 2 columns 3 columns literal data
# A tibble: 2 x 2
      a     b
  <dbl> <dbl>
1     1     2
2     4     5
# Each row has a different number of columns
read_csv("a,b,c\n1,2\n1,2,3,4")
Warning: 2 parsing failures.
row col  expected    actual         file
  1  -- 3 columns 2 columns literal data
  2  -- 3 columns 4 columns literal data
# A tibble: 2 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2    NA
2     1     2     3
# There is an opening quote in the second row but no closing quote
read_csv("a,b\n\"1")
Warning: 2 parsing failures.
row col                     expected    actual         file
  1  a  closing quote at end of file           literal data
  1  -- 2 columns                    1 columns literal data
# A tibble: 1 x 2
      a b    
  <dbl> <chr>
1     1 <NA> 
# Both columns are characters b/c they contain a mix of numbers and characters
read_csv("a,b\n1,2\na,b")
# A tibble: 2 x 2
  a     b    
  <chr> <chr>
1 1     2    
2 a     b    
# The delimiter is a `;`, so everything is in one column
read_csv("a;b\n1;3")
# A tibble: 1 x 1
  `a;b`
  <chr>
1 1;3  

Parsing a vector

p. 136

  1. What are the most important arguments to locale()?

This seems like a pretty context-dependent question. In this chapter, they use decimal_mark to accommodate different numeric styles, date_names to parse month and day names according to the conventions of a specific location, and encoding to specify the encoding used by the file. I think tz for time zone would also be useful.
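
A sketch of a locale combining those arguments (the specific values here are just placeholders):

locale(
  date_names = "fr",      # month and day names in French
  decimal_mark = ",",     # comma as the decimal mark
  tz = "Europe/Paris",    # time zone for date-times without an offset
  encoding = "Latin1"     # encoding of the input file
)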

  2. What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to ","? What happens to the default value of decimal_mark when you set the grouping_mark to "."?

locale(decimal_mark = ".", grouping_mark = ".")
Error: `decimal_mark` and `grouping_mark` must be different
locale(decimal_mark = ",")$grouping_mark
[1] "."
locale(grouping_mark = ".")$decimal_mark
[1] ","

  3. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

The date_format can be used to parse dates that are not in the default YYYY-MM-DD format:

parse_date("01/31/2000")
Warning: 1 parsing failure.
row col   expected     actual
  1  -- date like  01/31/2000
[1] NA
# January 31, 2000
parse_date("01/31/2000", locale = locale(date_format = "%m/%d/%Y"))
[1] "2000-01-31"

According to the readr locales vignette, the time_format argument is not used, so it would never be useful. However, the vignette is outdated: time_format works in exactly the same way as date_format.

parse_time("17:55:14")
17:55:14
parse_time("5:55:14 PM")
17:55:14
# Example of a non-standard time
parse_time("h5m55s14 PM")
Warning: 1 parsing failure.
row col   expected      actual
  1  -- time like  h5m55s14 PM
NA
parse_time("h5m55s14 PM", locale = locale(time_format = "h%Hm%Ms%S %p"))
17:55:14

  4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

You can create one by passing custom arguments to locale() and saving the result; a fully custom example is sketched after the output below. Many languages are already supported:

(es <- locale("es"))
<locale>
Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days:   domingo (dom.), lunes (lun.), martes (mar.), miércoles (mié.),
        jueves (jue.), viernes (vie.), sábado (sáb.)
Months: enero (ene.), febrero (feb.), marzo (mar.), abril (abr.), mayo
        (may.), junio (jun.), julio (jul.), agosto (ago.),
        septiembre (sept.), octubre (oct.), noviembre (nov.),
        diciembre (dic.)
AM/PM:  a. m./p. m.
str(es)
List of 7
 $ date_names   :List of 5
  ..$ mon   : chr [1:12] "enero" "febrero" "marzo" "abril" ...
  ..$ mon_ab: chr [1:12] "ene." "feb." "mar." "abr." ...
  ..$ day   : chr [1:7] "domingo" "lunes" "martes" "miércoles" ...
  ..$ day_ab: chr [1:7] "dom." "lun." "mar." "mié." ...
  ..$ am_pm : chr [1:2] "a. m." "p. m."
  ..- attr(*, "class")= chr "date_names"
 $ date_format  : chr "%AD"
 $ time_format  : chr "%AT"
 $ decimal_mark : chr "."
 $ grouping_mark: chr ","
 $ tz           : chr "UTC"
 $ encoding     : chr "UTF-8"
 - attr(*, "class")= chr "locale"

  5. What’s the difference between read_csv() and read_csv2()?

read_csv2() uses ; for the field separator and , for the decimal point. This is common in some European countries.
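
A quick illustration with literal data (a sketch; both values should parse as the doubles 1.5 and 2.3):

read_csv2("a;b\n1,5;2,3")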

  6. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

From the online book Programming with Unicode (CC BY-SA 3.0 license), the most popular encodings on the internet are:

1st (56%): ASCII
2nd (23%): Western Europe encodings (ISO 8859-1, ISO 8859-15 and cp1252)
3rd (8%): Chinese encodings (GB2312, …)
and then come Korean (EUC-KR), Cyrillic (cp1251, KOI8-R, …), East Europe (cp1250, ISO-8859-2), Arabic (cp1256, ISO-8859-6), etc.
(UTF-8 was not used on the web in 2001)

Note that I used DuckDuckGo for the online search :-)
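
To see why the encoding matters in practice, here is a sketch along the lines of the Latin-1 example used in the chapter:

x1 <- "El Ni\xf1o was particularly bad this year"
# decoded correctly only when the right encoding is supplied
parse_character(x1, locale = locale(encoding = "Latin1"))
# readr can also guess the encoding from the raw bytes
guess_encoding(charToRaw(x1))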

  7. Generate the correct format string to parse each of the following dates and times:

See ?strptime for the available conversion specifiers (not sure whether to be proud or depressed that I remembered off the top of my head that %B is the full month name).

d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
[1] "2010-01-01"
# Alternatively can specify date_format via locale argument
parse_date(d1, locale = locale(date_format = "%B %d, %Y"))
[1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
[1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
[1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
[1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
[1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%H:%M:%OS %p")
23:15:10.12
# Alternatively can specify time_format via locale argument
parse_time(t2, locale = locale(time_format = ("%H:%M:%OS %p")))
23:15:10.12

%OS is strange. Apparently it is R-specific, and I couldn’t get readr to accept the digit suffix (%OSn):

Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0). Further, for strptime %OS will input seconds including fractional seconds. Note that %S does not read fractional parts on output.
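
For comparison, base R does honor the digit suffix on output (a quick sketch):

# print the current time with seconds to 3 decimal places
format(Sys.time(), "%H:%M:%OS3")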


sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
[5] readr_1.3.1     tidyr_1.0.0     tibble_2.1.3    ggplot2_3.2.1  
[9] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2           cellranger_1.1.0     pillar_1.4.2        
 [4] compiler_3.6.1       git2r_0.26.1.9000    workflowr_1.4.0.9001
 [7] tools_3.6.1          zeallot_0.1.0        digest_0.6.21       
[10] lubridate_1.7.4      jsonlite_1.6         evaluate_0.14       
[13] lifecycle_0.1.0      nlme_3.1-141         gtable_0.3.0        
[16] lattice_0.20-38      pkgconfig_2.0.3      rlang_0.4.0         
[19] cli_1.1.0            rstudioapi_0.10      yaml_2.2.0          
[22] haven_2.1.1          xfun_0.9             withr_2.1.2         
[25] xml2_1.2.2           httr_1.4.1           knitr_1.25          
[28] hms_0.5.1            generics_0.0.2       fs_1.3.1            
[31] vctrs_0.2.0          rprojroot_1.2        grid_3.6.1          
[34] tidyselect_0.2.5     glue_1.3.1           R6_2.4.0            
[37] fansi_0.4.0          readxl_1.3.1         rmarkdown_1.15      
[40] modelr_0.1.5         magrittr_1.5         whisker_0.4         
[43] backports_1.1.4      scales_1.0.0         htmltools_0.3.6     
[46] rvest_0.3.4          assertthat_0.2.1     colorspace_1.4-1    
[49] utf8_1.1.4           stringi_1.4.3        lazyeval_0.2.2      
[52] munsell_0.5.0        broom_0.5.2          crayon_1.3.4