class: center, middle, inverse, title-slide

# Seven deadly sins of data science

### Claus Thorn Ekstrøm

UCPH Biostatistics
.small[ekstrom@sund.ku.dk]

### May 30th, 2018
.small[Slides @ biostatistics.dk/talks/]

---

# A long time ago ...

> *I'd like to convince you to give a talk [...]*
>
> *It's at the end of May. A rather big conference. Check this out: https://intelligentcloud.dk*

--

> *They could use a **"grumpy professor's"** advice on good basic statistics :-)*

---
class: center, middle

.yellow[.large[**A NEW HOPE**]]

.Large[data].HUGE[Science]

???

Digitalization

- Paradigm shift in AI (statistical learning)
- Near human or superhuman performance in image and sound recognition, and text processing
- Automation of decision processes (rebranding old ideas as AI, fuelled by the increase in computing power)
- xxx
- ... and the naive hope that more data will make difficult problems easy

---

# What makes a good scientist?

Be curious ... keep learning new things ... remember it is a collaborative effort

--

| Scientist | Seller |
|:-----------|:----------|
| Be sceptical of your results | "Sell" your results |
| Interpret conclusions carefully | Highlight/exaggerate importance |
| "Publish" negative results | Publish strategically |
| Replicate replicate replicate | Replicate ... if you must |
| Novel exciting results are less likely to be true. Double check them | Publish novel results before they get scooped |

---
class: inverse, center, middle

.Huge[What is the question?]

---
class: center

![](ic_files/figure-html/unnamed-chunk-1-1.svg)<!-- -->

---
class: center

![](ic_files/figure-html/unnamed-chunk-2-1.svg)<!-- -->

---

.pull-left[

`\(p\)`-value hacking

Cluster analysis

Cherry picking

Network analysis

Marketing

.yellow[Use recommendations from pharma industry]

]

.pull-right[

<img src="pics/cluster.png" width="2277" />

]

---
class: inverse, center, middle

.Huge[Representativity]

---

## Population and sample

![](ic_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

Generalization and external validity

---
background-image: url(pics/mm.png)
background-size: 120%

???

Guardian, May 24th

---

# Global Drug Survey

`\(N \approx 140000\)` globally, `\(N \approx 13500\)` in DK

Sampling: volunteers from Facebook, Reddit, Twitter, partners.

*Their statements:*

Can **not** be used to say anything about drug use prevalence.

*Can* be used to say something about the *patterns*.

.yellow[In DK: "Easier to get cocaine than a pizza"]

---
background-image: url("pics/mushr.jpeg")
background-size: 100%

---

# Global Drug Survey

> *"Magic mushrooms are one of the safest drugs in the world," said Adam Winstock, [...] pointing out that the bigger risk was people picking and eating the wrong mushrooms.*

--

> *"**Death from toxicity** is almost unheard of with poisoning with more dangerous fungi being a much greater risk in terms of serious harms."*

---
class: inverse, center, middle

.Huge[Confounding]

---

# What is confounding?

> Confounding is when an association is **distorted** due to a mix-up with other factors that are associated with the outcome and exposure.
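
---

# Confounding: a simulated sketch

A minimal sketch of the definition of confounding, not taken from the talk's own analyses. It assumes a made-up setting in which a single confounder drives both exposure and outcome, while the exposure has no causal effect on the outcome at all. The names `simulate`, `pearson` and `residuals` are hypothetical helpers written for this illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, zs):
    """Residuals of a least-squares line of ys on zs (removes the linear effect of zs)."""
    mz, my = statistics.fmean(zs), statistics.fmean(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [y - my - slope * (z - mz) for y, z in zip(ys, zs)]


def simulate(n=5000, seed=1):
    """Return (crude, adjusted) correlation between exposure and outcome.

    Assumption for illustration: the exposure has NO causal effect on the
    outcome; both simply depend on the same confounder plus independent noise.
    """
    rng = random.Random(seed)
    confounder = [rng.gauss(0, 1) for _ in range(n)]
    exposure = [c + rng.gauss(0, 1) for c in confounder]
    outcome = [c + rng.gauss(0, 1) for c in confounder]

    # Crude association: ignores the confounder entirely.
    crude = pearson(exposure, outcome)
    # Adjusted association: correlate what is left after removing
    # the linear effect of the confounder from both variables.
    adjusted = pearson(residuals(exposure, confounder),
                       residuals(outcome, confounder))
    return crude, adjusted


crude, adjusted = simulate()
print(f"crude correlation:    {crude:.2f}")
print(f"adjusted correlation: {adjusted:.2f}")
```

Under these assumptions the crude correlation should come out near 0.5 and the adjusted one near 0: the whole association is produced by the confounder. Real data rarely come with the confounder measured this cleanly, so this shows the mechanism only and is no recipe for adjustment.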
---
class: center, middle

![](ic_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---
class: center, middle

![](ic_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

# Will reproductive behaviour change following the birth of a child with a severe disease?

PhD on **stoppage**: change in reproductive behaviour.

<img src="pics/asd.png" width="2008" />

--

.center[.yellow[Mother's age]]

---
class: inverse, center, middle

.Huge[Modeling is serious]

---

> In God we trust ... all others must bring data

--

With regard to "Big data" we often hear:

> The data will speak for themselves

--

*If* you ask them how the data fit a model then they *might* give you an answer.

.yellow[But **you** decide the class of models!]

---
background-image: url("pics/turing.jpg")
background-size: 100%

---
class: inverse, center, middle

.Huge[Correlation and causality]

---

**Correlation** indicates a relationship between two events. For example, two events tend to occur together.

--

**Causation** indicates that the occurrence of one event has *caused* the occurrence of a second event. These two events also occur together, but there is a causal mechanism!

.yellow[Correlation does not imply causation!]

---
class: center, middle

![](ic_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---
background-image: url("pics/fish.png")
background-size: 100%

---
class: inverse, center, middle

.Huge[Kill your darlings]

---

## Take care not to pander to your own expectations

What happens to your analysis process when you

* find the result you expect?
* find a result you did not expect?

--

When are you ready to cut your losses?

.yellow[What is your role?]

---
background-image: url("pics/marcbjarke.png")
background-size: 100%

---
class: inverse, center, middle

.Huge[Beware the wisdom of (small, homogeneous) crowds]

---

# The wisdom of crowds

Data science problems are rarely off-the-shelf problems.

How is knowledge acquired and passed on?

Key criteria for wise crowds:

1. Diversity of opinion
2. Independence - People's opinions aren't determined by the opinions of those around them.
3. People are able to specialize and draw on local knowledge
4. Method for aggregating

.yellow[Why?]

---
background-image: url(pics/soccer.png)
background-size: 100%

---

# Summary

1. What's the purpose?
2. What data are available? What do they represent?
3. Did you consider all confounders?
4. What models did you use and why?
5. Don't overstate your conclusions (from associations)
6. Careful with preconceptions
7. Set up a framework to expand your knowledge

Stand up for your results