class: center, middle, inverse, title-slide

# Show me the data you didn’t consider!

## Reducing unknown unknowns in data science

### Claus Thorn Ekstrøm
UCPH Biostatistics
.small[
ekstrom@sund.ku.dk
]

### Data Science Day 2017, January 30th
.small[Slides @
biostatistics.dk/talks/
]

---
background-image: url(pics/newtrump.jpg)
background-size: 100%
class: middle

---
background-image: url(pics/guardian.png)
background-size: 100%
class: center, middle

---
class: center, middle

# p-value

---
background-image: url(pics/redcard+sociology.jpg)
background-size: 100%
class: center, middle

---
background-image: url(pics/brexit.jpg)
background-size: 100%
class: center, middle

---
background-image: url(pics/trump-headlines.jpg)
background-size: 100%
class: center, middle

---
class: center, middle, inverse

# Quiz

---

# The life of a data scientist

> .large[Data scientists, according to interviews and expert estimates, spend from **50 percent to 80 percent** of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.]
>
> .right[-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014]

???

Notes

---
background-image: url(pics/bd.jpg)
background-size: 100%
class: center, middle

# In reality

---

# Beyond the hype

> .large[Many of the existing problems with small data are also applicable to big data. ...]
>
> .large[The problems do not disappear because the data sizes become larger. They become **worse**.]

---

# State of the art?

> **Statistical analysis**
>
> .large[All of the data were analyzed with data processing software and figures with Microsoft excel 2007.]
>
> .pull-right[-- Tayefe *et al.*, Advances in Bioresearch, 2014]

---
background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---
class: center

# Timeline follow back

```
  day1 day2 day3
1    8   NA   NA
2    7   NA   99
3   13   13   40
4   21    6    2
```

--

```
  day1 day2 day3
1    8   NA   NA
2    7   NA   99
3   13   13   40
4   21    6    2
5   12   25   16
6   14   88   22
```

---
class: center

# Monthly Alcohol units

![](microsoft_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---

# Reproducible research

What **didn't** we check?
--

- Many studies cannot be replicated: time, money, unique
- New technologies increase data sizes
- Merge existing databases into megadatabases

--

## You are your worst collaborator.

---
class: middle

# dataMaid

```r
library(dataMaid)
data(toyData)    # load the example data set shipped with dataMaid
clean(toyData)   # generate the data report; newer releases call this makeDataReport()
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](https://github.com/ekstroem/dataMaid) for more info.

---
background-image: url(pics/mvar3.png)
background-size: 100%
class: center, middle

---
background-image: url(pics/msummary.png)
background-size: 100%
class: center, middle

---
class: middle, center

# Communicate!

---
background-image: url(pics/crisis-averted.jpg)
background-size: 100%
class: center, middle