class: center, middle, inverse, title-slide # Clean up your data screening process with reporteR ### Anne Helby Petersen and Claus Thorn Ekstrøm
UCPH Biostatistics ### December 3rd, 2020 --- class: animated, fadeIn layout: true --- background-image: url(pics/title.png) background-size: 100% --- background-image: url(pics/manbeer.jpg) background-size: 100% class: middle, center # The RESCueH project --- class: center # Timeline follow back ``` ## day1 day2 day3 ## 1 22 NA NA ## 2 5 NA 99 ## 3 1 9 40 ## 4 23 15 15 ``` --- class: center # Timeline follow back ``` ## day1 day2 day3 ## 1 22 NA NA ## 2 5 NA 99 ## 3 1 9 40 ## 4 23 15 15 ## 5 15 10 23 ## 6 17 88 12 ``` --- class: center # Monthly Alcohol units ![](whyr_datareporter_files/figure-html/unnamed-chunk-3-1.png)<!-- --> --- background-image: url(pics/flower.png) background-size: 60% class: center, middle --- # Design idea for dataReporter * Aid researchers in identifying problems that we didn't consider. * Ensure that everything you checked (and **didn't check**) is documented. * Enable **collaborations** with non-statisticians. * Encourage reproducible research dataReporter is **not** * a cleaner * a formal rule validator Documentation to be **read** and **assessed** by a human. --- .footnote[ # Data on US presidents ```r head(dataReporter::presidentData, 5) ``` ``` ## lastName firstName orderOfPresidency birthday stateOfBirth ## 1 Washington George 1 1732-02-22 Virginia ## 2 Adams John 2 1735-10-30 Massachusetts ## 3 Jefferson Thomas 3 1743-04-13 Virginia ## 4 Madison James 4 1751-03-16 Virginia ## 5 Monroe James 5 1758-04-28 Virginia ## assassinationAttempt sex ethnicity presidencyYears ageAtInauguration ## 1 0 Male Caucasian 7 57 ## 2 0 Male Caucasian 3 61 ## 3 0 Male Caucasian 8 57 ## 4 0 Male Caucasian 8 57 ## 5 0 Male Caucasian 8 58 ## favoriteNumber ## 1 3+0.000000i ## 2 4+0.000000i ## 3 0+1.414214i ## 4 10+0.000000i ## 5 3+0.000000i ``` ] --- # Let's see it in action ```r # Install package devtools::install_github("ekstroem/dataReporter") # Load package library(dataReporter) # Get example data data(presidentData) # Main function makeDataReport(presidentData) # Produces report # When data is screened and "clear" for use makeCodebook(presidentData) # Produces condensed codebook ``` --- background-image: url(pics/dr-1.png) background-size: 80% --- background-image: url(pics/dr-2.png) background-size: 90% --- background-image: url(pics/dr-3.png) background-size: 85% --- background-image: url(pics/dr-5.png) background-size: 100% --- background-image: url(pics/dr-4.png) background-size: 100% --- background-image: url(pics/dr-6.png) background-size: 82% .caption-right-vertical[makeCodebook() result] --- background-image: url(pics/helper.png) background-size: 100% --- background-image: url(pics/flow.png) background-size: 90% --- # Pre-checking - Check if a variable is eligible for being described/checked further in the report - Don't want to describe/check: 1. Keys (IDs) 2. Singular variables (same value for all) 3. Variable classes that are not supported by `dataReporter` 4. ... --- background-image: url(pics/dr-3.png) background-size: 85% --- background-image: url(pics/flow.png) background-size: 90% --- # Three steps of data quality assesment 1. **Summarize**: Simple summary statistics - what's in the variable? 2. **Visualize**: Look at the distribution of the variable - is the distribution as expected? 3. **Check**: Look for specific problems - is the variable valid? -- What summaries/visualization/checks are relevant depends on the **class** of the variable. --- # Looking at `presidencyYears` ```r presidentData$presidencyYears ``` ``` ## [1] 7 3 8 8 8 4 8 4 0 3 4 1 2 4 4 4 3 8 4 ## [20] 0 3 4 4 4 4 7 4 8 2 5 4 12 7 8 2 5 5 2 ## [39] 4 8 4 8 8 Inf NA NA ``` -- ```r class(presidentData$presidencyYears) ``` ``` ## [1] "numeric" ``` --- # Following the flow of `presidencyYears` Note: Precheck passed - not key/singleton/unsupported -- <img src="pics/presidencyyears1.png" width="100%" style="display: block; margin: auto;" /> --- # Following the flow of `presidencyYears` Summarize <img src="pics/presidencyyears2.png" width="100%" style="display: block; margin: auto;" /> --- # Following the flow of `presidencyYears` Visualize <img src="pics/presidencyyears3.png" width="100%" style="display: block; margin: auto;" /> --- # Following the flow of `presidencyYears` Check <img src="pics/presidencyyears4.png" width="100%" style="display: block; margin: auto;" /> <!-- --- --> <!-- # Variable classes --> <!-- `dataReporter` currently supports seven variable classes: --> <!-- - **C** `character` --> <!-- - **F** `factor` --> <!-- - **I** `integer` --> <!-- - **L** `labelled` (and `haven_labelled`) --> <!-- - **B** `logical` (Boolean) --> <!-- - **N** `numeric` --> <!-- - **D** `Date` --> <!-- The variables should be organized in a `data.frame`, `tibble` or `matrix`. --> --- # Built-in summary functions (1/2) ```r allSummaryFunctions() ``` -- ```r ------------------------------------------------------------------------- name description classes -------------- ------------------------------- -------------------------- centralValue Compute median for numeric character, Date, factor, variables, mode for integer, labelled, categorical variables haven_labelled, logical, numeric countMissing Compute proportion of missing character, Date, factor, observations integer, labelled, haven_labelled, logical, numeric minMax Find minimum and maximum integer, numeric, Date values quartiles Compute 1st and 3rd quartiles Date, integer, numeric refCat Find reference level factor uniqueValues Count number of unique values character, Date, factor, integer, labelled, haven_labelled, logical, numeric variableType Data class of variable character, Date, factor, integer, labelled, haven_labelled, logical, numeric ------------------------------------------------------------------------- ``` --- # Built-in summary functions (2/2) ```r ------------------------------------------------------------------------- name description classes -------------- ------------------------------- -------------------------- quartiles Compute 1st and 3rd quartiles Date, integer, numeric refCat Find reference level factor uniqueValues Count number of unique values character, Date, factor, integer, labelled, haven_labelled, logical, numeric variableType Data class of variable character, Date, factor, integer, labelled, haven_labelled, logical, numeric ------------------------------------------------------------------------- ``` --- # Built-in visual functions ```r allVisualFunctions() ------------------------------------------------------------------------------- name description classes ---------------- ------------------------------- ------------------------------ basicVisual Histograms and barplots using character, Date, factor, graphics integer, labelled, haven_labelled, logical, numeric standardVisual Histograms and barplots using character, Date, factor, ggplot2 integer, labelled, haven_labelled, logical, numeric tableVisual Distribution tables character, factor, labelled, haven_labelled ------------------------------------------------------------------------------- ``` --- # Built-in check functions (1/3) ```r allCheckFunctions() ---------------------------------------------------------- name description ------------------------- -------------------------------- identifyCaseIssues Identify case issues identifyLoners Identify levels with < 6 obs. identifyMissing Identify miscoded missing values identifyNums Identify misclassified numeric or integer variables ``` --- # Built-in check functions (2/3) ```r ---------------------------------------------------------- name description ------------------------- -------------------------------- identifyOutliers Identify outliers identifyOutliersTBStyle Identify outliers (Tukey's Boxplot style) identifyWhitespace Identify prefixed and suffixed whitespace ``` --- # Built-in check functions (3/3) ```r ---------------------------------------------------------- name description ------------------------- -------------------------------- isCPR Identify Danish CPR numbers isKey Check if the variable is a key isSingular Check if the variable contains only a single value isSupported Check if the variable class is supported by dataReporter. ---------------------------------------------------------- ``` --- # Having a look at the documentation ```r ?identifyMissing ``` <img src="pics/identifymissingdocu.png" width="80%" style="display: block; margin: auto;" /> --- # Let's try an example: Report tweaking - Output an html document instead of pdf - Skip summaries altogether - Use the `tableVisual` function for `character` variables - Only print up to 5 problematic values from checks - Treat `Name` variables as if the were `character` variables - Name the report "Dirty presidents" - ... --- # Using dataReporter interactively The summarize/visual/check steps can be run interactively too: ```r > check(presidentData$stateOfBirth) $identifyMissing No problems found. $identifyWhitespace No problems found. $identifyLoners Note that the following levels have at most five observations: Arkansas, California, Connecticut, Georgia, Gondor, ..., North Carolina, Pennsylvania, Tennessee, Texas, Vermont (11 values omitted). $identifyCaseIssues Note that there might be case problems with the following levels: New york, New York. $identifyNums No problems found. ``` --- # More interactive use All individual summary/visual/check functions can be used interactively as well: ```r > identifyCaseIssues(presidentData$stateOfBirth) Note that there might be case problems with the following levels: New york, New York. ``` -- ```r str(identifyCaseIssues(presidentData$stateOfBirth)) ``` ``` ## List of 3 ## $ problem : logi TRUE ## $ message : chr "Note that there might be case problems with the following levels: \\\"New york\\\", \\\"New York\\\"." ## $ problemValues: chr [1:2] "New york" "New York" ## - attr(*, "class")= chr "checkResult" ``` --- # Adding custom functions New summary function that counts zeros: ```r countZeros <- function(v, ...) { val <- sum(na.omit(v) == 0) summaryResult(list(feature = "No. zeros", result = val, value = val)) } ``` -- ```r countZeros <- summaryFunction(countZeros, description = "Count number of zeros", classes = c("character", "factor", "integer", "labelled", "numeric")) ``` --- # Custom function is now added to overview ```r > allSummaryFunctions() ---------------------------------------------------------------------------- name description classes -------------- ------------------------------- ----------------------------- countZeros Count number of zeros character, factor, integer, labelled, numeric centralValue Compute median for numeric character, Date, factor, variables, mode for integer, labelled, categorical variables haven_labelled, logical, numeric countMissing Compute proportion of missing character, Date, factor, observations integer, labelled, haven_labelled, logical, numeric ``` --- # Use custom functions interactively ```r countZeros(presidentData$presidencyYears) ``` ``` ## No. zeros: 2 ``` --- # Use custom functions in a report ```r makeDataReport(presidentData, summaries = setSummaries( numeric = defaultNumericSummaries(add = "countZeros"))) ``` -- <img src="pics/presidencyyears5.png" width="90%" style="display: block; margin: auto;" /> --- # Further resources for dataReporter - **Article** in Journal of Statistical Software describing report making and interactive use: https://www.jstatsoft.org/article/view/v090i06 - Detailed **vignette** about extending `dataReporter` with custom functions: `vignette("extending_dataReporter")` - 3-4 hour **workshop** available: Ask for materials if you want to teach about `dataReporter` yourself - or invite us come and do it! ☺ --- # Don't be a stranger Get in touch at https://github.com/ekstroem/dataReporter if you - have a bug to report - have a suggestion for new features - have written a cool summary/visual/check function or contact us at ahpe [at] sund.ku.dk (Anne) and ekstrom [at] sund.ku.dk (Claus)