class: center, middle, inverse, title-slide

# Show me the errors you didn’t look for!

### Claus Thorn Ekstrøm
UCPH Biostatistics
.small[
ekstrom@sund.ku.dk
]

### useR, July 6th 2017
.small[Slides @
biostatistics.dk/talks/
]

---
background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1   14   NA   NA
2   10   NA   99
3   19   14   40
4    5    7    7
```

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1   14   NA   NA
2   10   NA   99
3   19   14   40
4    5    7    7
5   12    5    1
6   11   88    4
```

---
background-image: url(pics/mau.png)
background-size: 80%
class: center

# Monthly Alcohol units

---

# Reproducible research

What **didn't** we check?

--

- Need **experts in relevant field**
- Merge existing databases into megadatabases
- New technologies revive old data

---
class: center

.footnotesize[
```
   colour int string    numeric uniq      ident
3     red   1      a -0.8356286    3 Irrelevant
4     red   2      a  1.5952808    4 Irrelevant
5     red   2      a  0.3295078    5 Irrelevant
6     red   6      b -0.8204684    6 Irrelevant
7     red   6      b  0.4874291    7 Irrelevant
8     red   6      b  0.7383247    8 Irrelevant
9     red 999      c  0.5757814    9 Irrelevant
10    red  NA      c -0.3053884   10 Irrelevant
11   blue   4      c  1.5117812   11 Irrelevant
12   blue  82      .  0.3898432   12 Irrelevant
13   blue  NA        -0.6212406   13 Irrelevant
14   <NA> NaN  other -2.2146999   14 Irrelevant
15   <NA>   5  OTHER  1.1249309   15 Irrelevant
```
]

---
class: middle

# dataMaid

```r
library(dataMaid)
data(toyData)
*clean(toyData)
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](https://github.com/ekstroem/dataMaid) for more info. Stable version on CRAN.
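
---

# Sketch: flagging suspected missing codes

The 88 and 99 entries in the TLFB table are the kind of values a screening step should surface. A minimal base R sketch of the idea follows. It is not how dataMaid implements its checks, and the list of suspect codes and the helper name `flag_suspects` are assumptions made for illustration.

```r
# TLFB toy data as shown on the earlier slide
tlfb <- data.frame(
  day1 = c(14, 10, 19, 5, 12, 11),
  day2 = c(NA, NA, 14, 7, 5, 88),
  day3 = c(NA, 99, 40, 7, 1, 4)
)

# Assumed (hypothetical) list of codes that often stand in for "missing"
suspect_codes <- c(88, 99, 999)

# Return the suspect codes that occur as regular values in a vector
flag_suspects <- function(x, codes = suspect_codes) {
  sort(unique(x[!is.na(x) & x %in% codes]))
}

# One entry per column; an empty vector means nothing was flagged
flagged <- lapply(tlfb, flag_suspects)
flagged
```

Whether a flagged value really is a missing code or a genuine observation is still a decision for a human who knows the data.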

---
background-image: url(pics/flowchart.png)
background-size: 100%
class: center

# Flowchart

---

# Part 1: Data cleaning summary

<img src="pics/summ.png" width="100%" style="display: block; margin: auto;" />

---
background-image: url(pics/miss.png)
background-size: 100%

# Part 2: Summary table

---
background-image: url(pics/out1.png)
background-size: 100%

# Part 3: Variable list

---
background-image: url(pics/out2.png)
background-size: 100%

---

# Using dataMaid interactively

.footnotesize[
```r
> check(toyData$int) # Individual check
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
$identifyOutliers
Note that the following possible outlier values were detected: 82, 999.

> check(toyData$int, numericChecks = "identifyMissing")
```
]

---

# Using dataMaid interactively

.pull-left[
.footnotesize[
```r
visualize(toyData$int)
```
]
]

.pull-right[
![](useR2017_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

---

# Using dataMaid interactively

.footnotesize[
```r
summarize(toyData$int)
```

```
     Feature                   Result
[1,] "Variable type"           "numeric"
[2,] "Number of missing obs."  "3 (20 %)"
[3,] "Number of unique values" "8"
[4,] "Median"                  "4.5"
[5,] "1st and 3rd quartiles"   "1.75; 6"
[6,] "Min. and max."           "1; 999"
```
]

---

# Using dataMaid interactively

.small[
```r
> allSummaryFunctions()
----------------------------------------------------------------
name           description           classes
-------------- --------------------- ---------------------------
centralValue   Compute median        character, Date, factor,
               or mode               integer, labelled,
                                     logical, numeric

countMissing   Compute ratio of      character, Date, factor,
               missing obs.          integer, labelled,
                                     logical, numeric

minMax         Find min and max      integer, numeric, Date
               values

quartiles      Compute 1st and       Date, integer, numeric
               3rd quartiles

uniqueValues   Count number of       character, Date, factor,
               unique values         integer, labelled,
                                     logical, numeric

variableType   Data class of         character, Date, factor,
               variable              integer, labelled,
                                     logical, numeric
----------------------------------------------------------------
```
]

---

# Extending dataMaid

.footnotesize[
```r
isSSN <- function(v, nMax = NULL, ...) {
  out <- list(problem = FALSE, message = "")
  # inherits() also works when v has more than one class (e.g. labelled)
  if (inherits(v, c("character", "factor", "labelled"))) {
    # grepl() gives one logical per element, which is what any() expects
    if (any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain SSNs."
    }
  }
  out
}
```
]

---

# Adding the function

.footnotesize[
```r
DF <- data.frame(ids = c("111-22-3333", "123-45-6789", "111-22-3333"),
                 id2 = c("111223333", "123456789", "4728491283"),
                 stringsAsFactors = FALSE)
clean(DF, characterChecks = c("isSSN"))
```

```
Warning: Seems to contain SSNs.
```
]

---

# In summary

# 42

--

* We **need** reproducible research to ensure that we can document exactly what we have done, and also what we **haven't** done.
* `dataMaid` addresses part of that requirement.
* **You** are your worst collaborator.
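
---

# Sketch: trying the check on its own

Before plugging a custom check into a report, it can help to run it directly on each column. The sketch below repeats the SSN check from the earlier slide under a hypothetical name, `is_ssn_like`, so that it runs in base R without dataMaid. It only illustrates the pattern matching and is not part of the package.

```r
# Stand-alone variant of the SSN check (hypothetical name, base R only)
is_ssn_like <- function(v) {
  out <- list(problem = FALSE, message = "")
  if (inherits(v, c("character", "factor", "labelled"))) {
    # Look for the pattern ddd-dd-dddd anywhere in each value
    if (any(grepl("\\d{3}-\\d{2}-\\d{4}", as.character(v)))) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain SSNs."
    }
  }
  out
}

# Same example data as on the earlier slide
DF <- data.frame(ids = c("111-22-3333", "123-45-6789", "111-22-3333"),
                 id2 = c("111223333", "123456789", "4728491283"),
                 stringsAsFactors = FALSE)

# Apply the check to every column and collect the problem flags
res <- lapply(DF, is_ssn_like)
vapply(res, function(r) r$problem, logical(1))
```

Only `ids` matches the hyphenated pattern. The unformatted digits in `id2` pass unnoticed, which is a reminder that a check reports only what it was written to look for.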