Show me the data you didn’t consider!

Reducing unknown unknowns in data science

Claus Thorn Ekstrøm
UCPH Biostatistics
ekstrom@sund.ku.dk

Ann Arbor R Users meeting, March 9th
Slides @ biostatistics.dk/talks/

p-value

Quiz

The life of a data scientist

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014

Notes

In reality

Beyond the hype

Many of the existing problems with small data are also applicable to big data. ...

The problems do not disappear because the data sizes become larger. They become worse.

State of the art?

Statistical analysis

All of the data were analyzed with data processing software and figures with Microsoft excel 2007.

-- Tayefe et al, Advances in Bioresearch, 2014

The RESCueH project

Timeline follow back

  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17

  day1 day2 day3
1   18   NA   NA
2   14   NA   99
3   20   17   40
4   23   14   17
5   10   24    2
6   19   88    8
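
The 88 and 99 entries above look like placeholder codes rather than real daily counts. That reading is an assumption, since the slides do not say so. Under that assumption, the following base-R sketch shows how such codes inflate a per-person total and how recoding them to NA changes the result. The names `tlfb` and `suspect_codes` are made up for illustration.

```r
# Timeline follow-back data as shown above (rows = participants)
tlfb <- data.frame(
  day1 = c(18, 14, 20, 23, 10, 19),
  day2 = c(NA, NA, 17, 14, 24, 88),
  day3 = c(NA, 99, 40, 17, 2, 8)
)

# ASSUMPTION: 88 and 99 are missing-value codes, not observed units
suspect_codes <- c(88, 99)

# Naive totals treat the codes as real units
naive_total <- rowSums(tlfb, na.rm = TRUE)

# Recode the suspect values to NA before summing
is_suspect <- sapply(tlfb, function(col) col %in% suspect_codes)
cleaned <- tlfb
cleaned[is_suspect] <- NA
clean_total <- rowSums(cleaned, na.rm = TRUE)

# Compare the two totals side by side
data.frame(naive_total, clean_total)
```

Participants 2 and 6 are the only rows whose totals change, which is the kind of discrepancy that is easy to miss once the data have been aggregated.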

Monthly Alcohol units

Reproducible research

What didn't we check?

  • Many studies cannot be replicated: too costly in time or money, or the data are unique
  • New technologies increase data sizes
  • Merge existing databases into megadatabases
  • Genetic data for future analyses

You are your worst collaborator.

dataMaid

devtools::install_github("ekstroem/dataMaid")
library(dataMaid)
data(toyData)
clean(toyData)

Documentation to be read and evaluated by a human.

See github.com/ekstroem/dataMaid for more info.

Using dataMaid interactively

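check(toyData$var2)  # Run the default checks on a single variable
check(toyData$var2, numericChecks = "identifyMissing")  # Run one named check only
visualize(toyData$var2)  # Plot the distribution
summarize(toyData$var2)  # Default summaries
summarize(toyData$var2,
          numericSummaries = c("centralValue", "minMax"))  # Selected summaries only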

Extending dataMaid

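isID <- function(v, nMax = NULL, ...) {
  # Default result: no problem found
  out <- list(problem = FALSE, message = "")
  # Skip logical and Date variables; any() copes with multi-class objects
  if (any(class(v) %in% setdiff(allClasses(), c("logical", "Date")))) {
    v <- as.character(v)
    lengths <- nchar(v)
    # Flag variables whose values all share one length above 10 characters
    if (all(lengths > 10, na.rm = TRUE) && length(unique(lengths)) == 1) {
      out$problem <- TRUE
      out$message <- "Warning: Seems to contain IDs."
    }
  }
  out
}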
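
The same return shape, a list with a logical `problem` and a character `message`, can carry other checks. As a hedged illustration, here is a second check written in plain base R. `hasMissingCodes` and its default `codes` are hypothetical and not part of dataMaid; the sketch only shows the structure and does not reproduce the rules of the package's own `identifyMissing` check.

```r
# Hypothetical check, NOT part of dataMaid: flags numeric variables that
# contain values often used as missing-data codes.
hasMissingCodes <- function(v, codes = c(88, 99, 999), ...) {
  # Same result structure as isID(): a flag plus a message
  out <- list(problem = FALSE, message = "")
  if (is.numeric(v)) {
    # Collect the distinct suspect values that actually occur
    found <- sort(unique(v[!is.na(v) & v %in% codes]))
    if (length(found) > 0) {
      out$problem <- TRUE
      out$message <- paste0("Warning: Possible missing-value codes: ",
                            paste(found, collapse = ", "), ".")
    }
  }
  out
}

hasMissingCodes(c(18, 14, NA, 99))$problem   # TRUE
hasMissingCodes(c(18, 14, NA, 17))$problem   # FALSE
hasMissingCodes(c("a", "b"))$problem         # FALSE: not numeric
```

Calling the function directly on a vector, as here, is a quick way to test a new check before handing it to the report generator.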

Adding the function

data("exampleData")
# isID() only flags values longer than 10 characters, so draw 12 letters
exampleData$names <- sapply(1:300, function(i) {
  paste0(sample(LETTERS, size = 12), collapse = "")
})
clean(exampleData, preChecks = c("isID"))

Communicate!
