class: center, middle, inverse, title-slide

# Show me the data you didn’t consider!

## Reducing unknown unknowns in data science

### Claus Thorn Ekstrøm
UCPH Biostatistics
.small[
ekstrom@sund.ku.dk
]

### Ann Arbor R Users meeting, March 9th
.small[Slides @
biostatistics.dk/talks/
]

---
background-image: url(pics/newtrump.jpg)
background-size: 100%
class: middle

---
background-image: url(pics/guardian.png)
background-size: 100%
class: center, middle

---
class: center, middle

# p-value

---
background-image: url(pics/redcard+sociology.jpg)
background-size: 100%
class: center, middle

---
background-image: url(pics/brexit.jpg)
background-size: 100%
class: center, middle

---
background-image: url(pics/trump-headlines.jpg)
background-size: 100%
class: center, middle

---
class: center, middle, inverse

# Quiz

---

# The life of a data scientist

> .large[Data scientists, according to interviews and expert estimates, spend from **50 percent to 80 percent** of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.]
>
> .right[-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014]

???

Notes

---
background-image: url(pics/bd.jpg)
background-size: 100%
class: center, middle

# In reality

---

# Beyond the hype

> .large[Many of the existing problems with small data are also applicable to big data. ...]
>
> .large[The problems do not disappear because the data sizes become larger. They become **worse**.]

---

# State of the art?

> **Statistical analysis**
>
> .large[All of the data were analyzed with data processing software and figures with Microsoft excel 2007.]
>
> .pull-right[-- Tayefe *et al.*, Advances in Bioresearch, 2014]

---
background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---
class: center

# Timeline follow back

```
  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17
```

--

```
  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17
5   10   24    2
6   19   88    8
```

---
class: center

# Monthly Alcohol units

![](dataMaid_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Reproducible research

What **didn't** we check?
--

- Many studies cannot be replicated: time, money, unique
- New technologies increase data sizes
- Merge existing databases into megadatabases
- Genetic data for future analyses

--

## You are your worst collaborator.

---
class: middle

# dataMaid

```r
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(toyData)
clean(toyData)
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](https://github.com/ekstroem/dataMaid) for more info.

---
background-image: url(pics/mvar3.png)
background-size: 100%
class: center, middle

---
background-image: url(pics/msummary.png)
background-size: 100%
class: center, middle

---
class: center, middle

<img src="pics/flowchart.png" width="50%" style="display: block; margin: auto;" />

---

# Using dataMaid interactively

```r
check(toyData$var2) # Individual check
check(toyData$var2, numericChecks = "identifyMissing")
visualize(toyData$var2)
summarize(toyData$var2)
summarize(toyData$var2, numericSummaries = c("centralValue", "minMax"))
```

---

# Extending dataMaid

```r
isID <- function(v, nMax = NULL, ...) {
  out <- list(problem = FALSE, message = "")
  # class(v) can have length > 1, so wrap the test in any()
  if (any(class(v) %in% setdiff(allClasses(), c("logical", "Date")))) {
    v <- as.character(v)
    lengths <- nchar(v)
    # Flag values that all share one length of at least 10 characters
    if (all(lengths >= 10) && length(unique(lengths)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain IDs."
    }
  }
  out
}
```

---

# Adding the function

```r
data("exampleData")
# Add a column of random 10-letter strings that should trigger isID
exampleData$names <- sapply(1:300, function(i) {
  paste0(sample(LETTERS, size = 10), collapse = "")
})
clean(exampleData, preChecks = c("isID"))
```

---
class: middle, center

# Communicate!
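---

# Backup: trying the ID check on its own

The logic of the `isID` check can be tried without dataMaid installed. The sketch below is a stand-alone base R variant, not part of the package: the name `looks_like_id` and the `min_chars` argument are made up for illustration, and dropping missing values before counting characters is an added assumption.

```r
# Hypothetical stand-alone variant of the isID check (not a dataMaid function).
looks_like_id <- function(v, min_chars = 10) {
  out <- list(problem = FALSE, message = "")
  # Skip classes where ID codes make no sense
  if (!inherits(v, c("logical", "Date"))) {
    # Assumption: missing values are ignored when measuring lengths
    n_chars <- nchar(as.character(v[!is.na(v)]))
    if (length(n_chars) > 0 &&
        all(n_chars >= min_chars) &&
        length(unique(n_chars)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain IDs."
    }
  }
  out
}

# 300 random 10-letter strings, as on the "Adding the function" slide
set.seed(1)
ids <- vapply(1:300, function(i) {
  paste0(sample(LETTERS, size = 10), collapse = "")
}, character(1))

looks_like_id(ids)$problem              # TRUE
looks_like_id(c(1.5, 22, 333))$problem  # FALSE
```

Returning a list with `problem` and `message` mirrors the shape used on the "Extending dataMaid" slide, so the same body could be moved into a dataMaid check once the package is loaded.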