
# Show me the data you didn’t consider!
## Reducing unknown unknowns in data science
### Claus Thorn Ekstrøm<br>UCPH Biostatistics<br>.small[<a href="mailto:ekstrom@sund.ku.dk">ekstrom@sund.ku.dk</a>]
### Ann Arbor R Users meeting, March 9th<br>.small[Slides @ <a href="http://www.biostatistics.dk/talks/">biostatistics.dk/talks/</a>]

---

background-image: url(pics/guardian.png)
background-size: 100%
class: center, middle

---

# p-value

---

background-image: url(pics/redcard+sociology.jpg)
background-size: 100%
class: center, middle

---

background-image: url(pics/brexit.jpg)
background-size: 100%
class: center, middle

---
background-image: url(pics/trump-headlines.jpg)
background-size: 100%
class: center, middle

---

# Quiz

---

# The life of a data scientist

> .large[Data scientists, according to interviews and expert estimates, spend from **50 percent to 80 percent** of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.]
> 
> .right[-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014]

???

Notes

---

# In reality

---

# Beyond the hype

> .large[Many of the existing problems with small data are also applicable to big data. ...]
>
> .large[The problems do not disappear because the data sizes become larger. They become **worse**.]

---

# State of the art?

> **Statistical analysis**
>
> .large[All of the data were analyzed with data processing software and figures with Microsoft excel 2007.]
>
> .pull-right[-- Tayefe *et al*, Advances in Bioresearch, 2014]

---

background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---

# Timeline follow back

```
  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17
```

```
  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17
5   10   24    2
6   19   88    8
```
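
Values such as 88 and 99 look more like missing-value codes than plausible counts. A minimal base R sketch of how they could be flagged for manual inspection, assuming the six rows above are stored in a data frame called `tlfb` (a hypothetical name) and that 88 and 99 are the codes to look for:

```r
# Hypothetical data frame holding the six rows shown above
tlfb <- data.frame(
  day1 = c(18, 14, 20, 23, 10, 19),
  day2 = c(NA, NA, 17, 14, 24, 88),
  day3 = c(NA, 99, 40, 17,  2,  8)
)

# Assumed missing-value codes; adjust to the study's codebook
suspectCodes <- c(88, 99)

# Logical matrix marking the cells that match a suspect code
isSuspect <- sapply(tlfb, function(x) x %in% suspectCodes)

# Number of suspect values per variable
nSuspect <- colSums(isSuspect)
nSuspect

# Row and column positions of the suspect values
suspectPos <- which(isSuspect, arr.ind = TRUE)
suspectPos
```

Whether a flagged value is really a miscoded missing value is still a decision for a human who knows the data.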

---

# Monthly Alcohol units

![](dataMaid_files/figure-html/unnamed-chunk-3-1.png)

---

# Reproducible research

What **didn't** we check?

- Many studies cannot be replicated: too little time or money, or unique data
- New technologies increase data sizes
- Merge existing databases into megadatabases
- Genetic data for future analyses

## You are your worst collaborator.

---

# dataMaid

```r
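# Install the development version from GitHub (requires devtools)
devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(toyData)
# Generate a data overview report for the whole data frame
# (newer dataMaid releases name this function makeDataReport())
clean(toyData)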
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](https://github.com/ekstroem/dataMaid) for more info.

---

background-image: url(pics/mvar3.png)
background-size: 100%
class: center, middle

---

background-image: url(pics/msummary.png)
background-size: 100%
class: center, middle

---

# Using dataMaid interactively

```r
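check(toyData$var2)  # Run the default checks on one variable
check(toyData$var2, numericChecks = "identifyMissing")  # One specific check
visualize(toyData$var2)  # Plot the variable's distribution
summarize(toyData$var2)  # Default summaries
summarize(toyData$var2,  # Only the selected summaries
          numericSummaries = c("centralValue", "minMax"))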
```

---

# Extending dataMaid

```r
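# Custom check: flag variables that look like ID codes, i.e. all
# values have the same length and are longer than 10 characters
isID <- function(v, nMax = NULL, ...) {
  out <- list(problem = FALSE, message = "")
  # any() keeps if() valid for objects with more than one class
  if (any(class(v) %in% setdiff(allClasses(),
                                c("logical", "Date")))) {
    v <- as.character(v)
    nChars <- nchar(v)  # renamed to avoid masking base::lengths()
    if (all(nChars > 10) &&
        length(unique(nChars)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain IDs."
    }
  }
  out
}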
```
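
To see what such a check returns before plugging it into a report, it can help to call a stripped-down variant directly on a vector. The sketch below is not part of dataMaid: `looksLikeID` and the example vectors are hypothetical, and the class filter based on `allClasses()` is dropped so that the code runs in base R alone.

```r
# Simplified, base-R-only variant of the check above (hypothetical name);
# the class filter from the original is omitted
looksLikeID <- function(v, ...) {
  out <- list(problem = FALSE, message = "")
  nChars <- nchar(as.character(v))
  # Flag if every value is longer than 10 characters and all lengths agree;
  # isTRUE() treats an NA result (missing values) as "not flagged"
  if (isTRUE(all(nChars > 10)) && length(unique(nChars)) == 1) {
    out$problem <- TRUE
    out$message <- "Warning: Seems to contain IDs."
  }
  out
}

ids  <- c("AB1234567890", "CD0987654321")  # 12 characters each
ages <- c(34, 51, 28)

looksLikeID(ids)$problem   # TRUE
looksLikeID(ages)$problem  # FALSE
```

The return value is a list with a logical `problem` and a `message`, the same shape as the output of `isID` above.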

---

# Adding the function

```r
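data("exampleData")
# Add a variable of random 11-letter strings, long enough to
# pass isID()'s "more than 10 characters" test
exampleData$names <- sapply(1:300,
  function(i) { paste0(sample(LETTERS, size = 11),
                       collapse = "") })
# Run isID as a pre-check when generating the report
clean(exampleData,
  preChecks = c("isID"))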
```

---

# Communicate!