class: center, middle, inverse, title-slide

# Reproducible data screening and quality control
## Module 2: Data Screening with dataMaid
### Claus Thorn Ekstrøm and Anne Helby Petersen, UCPH Biostatistics
### eRum, Budapest, May 14th, 2018
.footnotesize[
Slides/homepage
]

---
background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1   25   NA   NA
2   15   NA   99
3    9   12   40
4   21   14    7
```

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1   25   NA   NA
2   15   NA   99
3    9   12   40
4   21   14    7
5   12   24   12
6    7   88   16
```

---
background-image: url(pics/mau.png)
background-size: 80%
class: center

# Monthly alcohol units

---

# Reproducible research

What **didn't** we check?

--

- Need **experts in the relevant field**
- Merge existing databases into megadatabases
- New technologies revive old data

---
class: center

.small[
```
# A tibble: 15 x 6
   pill  events region change id    spotifysong
   <fct>  <dbl> <fct>   <dbl> <fct> <fct>
 1 red        1 a      -0.626 1     Irrelevant
 2 red        1 a       0.184 2     Irrelevant
 3 red        1 a      -0.836 3     Irrelevant
 4 red        2 a       1.60  4     Irrelevant
 5 red        2 a       0.330 5     Irrelevant
 6 red        6 b      -0.820 6     Irrelevant
 7 red        6 b       0.487 7     Irrelevant
 8 red        6 b       0.738 8     Irrelevant
 9 red      999 c       0.576 9     Irrelevant
10 red       NA c      -0.305 10    Irrelevant
11 blue       4 c       1.51  11    Irrelevant
12 blue      82 .       0.390 12    Irrelevant
13 blue      NA " "    -0.621 13    Irrelevant
14 <NA>     NaN other  -2.21  14    Irrelevant
15 <NA>       5 OTHER   1.12  15    Irrelevant
```
]

---
class: middle

# dataMaid

```r
> library(dataMaid)
> data(toyData)
*> makeDataReport(toyData)
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](https://github.com/ekstroem/dataMaid) for more info and a [paper with details (accepted for publication in JSS)](https://github.com/ekstroem/dataMaid/raw/master/latex/article_accept.pdf).

Stable version on CRAN.

---
background-image: url(pics/flowchart2.png)
background-size: 100%
class: center

# `dataMaid` flowchart
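
---

# Adjusting the report call

The default `makeDataReport(toyData)` call can be adjusted through arguments. A minimal sketch, assuming the arguments `output`, `onlyProblematic`, `replace`, `file`, `render` and `openResult` behave as described in the package documentation. The file name is made up for illustration, and the call is skipped if `dataMaid` is not installed.

.footnotesize[
```r
# Sketch only: runs the report step only when dataMaid is available
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)
  data(toyData)

  # Hypothetical file name, placed in a temporary directory
  report_file <- file.path(tempdir(), "toyData_report.Rmd")

  makeDataReport(toyData,
                 output = "html",         # ask for HTML explicitly
                 onlyProblematic = TRUE,  # list only variables with flagged problems
                 replace = TRUE,          # overwrite an earlier report with this name
                 file = report_file,      # where the R Markdown source is written
                 render = FALSE,          # write the .Rmd source without rendering it
                 openResult = FALSE)      # do not open any file afterwards
}
```
]

With `render = FALSE` only the R Markdown source is produced, which is handy for inspecting or versioning the report before rendering it.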
---

# Part 1: Data cleaning summary

<img src="pics/summ.png" width="100%" style="display: block; margin: auto;" />

---
background-image: url(pics/miss.png)
background-size: 100%

# Part 2: Summary table

---
background-image: url(pics/out1.png)
background-size: 100%

# Part 3: Variable list

---
background-image: url(pics/out2.png)
background-size: 100%

---

# `dataMaid` common arguments

.small[
| Argument | Description |
| ------------- |:--------------------------------|
| `mode` | Tasks to perform. Default is `c("summarize", "visualize", "check")` |
| `replace` | Logical. Should existing dataMaid reports be overwritten? Default `FALSE` |
| `output` | Output format. Choices are `"pdf"`, `"html"`, `"word"` |
| `onlyProblematic` | Logical. Show only variables with problems. Default `FALSE` |
| `maxProbVals` | Maximum number of unique problematic values printed per check. Positive integer or `Inf` (default 10) |
]

---
class: inverse, middle

# Exercise 2

1. Load the `dataMaid` package
2. Use `makeDataReport()` to generate a data report
3. Screen the `presiData` data for errors we did not find before.

Hunt for errors!

---

# Next up

* How to use dataMaid interactively
* How to extend dataMaid with your own functions for summarizing, visualizing and checking data

---
class: center

.small[
```
# A tibble: 15 x 6
   pill  events region change id    spotifysong
   <fct>  <dbl> <fct>   <dbl> <fct> <fct>
 1 red        1 a      -0.626 1     Irrelevant
 2 red        1 a       0.184 2     Irrelevant
 3 red        1 a      -0.836 3     Irrelevant
 4 red        2 a       1.60  4     Irrelevant
 5 red        2 a       0.330 5     Irrelevant
 6 red        6 b      -0.820 6     Irrelevant
 7 red        6 b       0.487 7     Irrelevant
 8 red        6 b       0.738 8     Irrelevant
 9 red      999 c       0.576 9     Irrelevant
10 red       NA c      -0.305 10    Irrelevant
11 blue       4 c       1.51  11    Irrelevant
12 blue      82 .       0.390 12    Irrelevant
13 blue      NA " "    -0.621 13    Irrelevant
14 <NA>     NaN other  -2.21  14    Irrelevant
15 <NA>       5 OTHER   1.12  15    Irrelevant
```
]

---

# Using `dataMaid` interactively

.footnotesize[
```r
> check(toyData$events)
```

```
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
$identifyOutliers
Note that the following possible outlier values were detected: 82, 999.
```

```r
> check(toyData$events,
+       checks = setChecks(numeric = "identifyMissing"))
```

```
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
```
]

???

`checks` are the functions that the variable is checked with.
Specific ones can be set (also in `makeDataReport`).
Show that the result is a list with 2 elements, and the form they have.

---

# Overview of `check` functions

.footnotesize[
```r
> allCheckFunctions()
```
]

.scriptsize[
```
----------------------------------------------------------------------------------
name                 description                     classes
-------------------- ------------------------------- -----------------------------
identifyCaseIssues   Identify case issues            character, factor

identifyLoners       Identify levels with < 6 obs.   character, factor

identifyMissing      Identify miscoded missing       character, Date, factor,
                     values                          integer, labelled, logical,
                                                     numeric
```
]

and more

---

# Interactive `dataMaid` - visualizations

.pull-left[
.footnotesize[
```r
> visualize(toyData$events)
```

The available `visual` functions can be listed as well:

```r
> allVisualFunctions()
```
]
]

.pull-right[
![](module2_files/figure-html/unnamed-chunk-13-1.png)<!-- -->
]

---

# Interactive `dataMaid` - summaries

.footnotesize[
```r
> summarize(toyData$events)
```

```
$variableType
Variable type: numeric
$countMissing
Number of missing obs.: 3 (20 %)
$uniqueValues
Number of unique values: 8
$centralValue
Median: 4.5
$quartiles
1st and 3rd quartiles: 1.75; 6
$minMax
Min. and max.: 1; 999
```
]

---

# Overview of the `summary` function library

.scriptsize[
```
> allSummaryFunctions()

----------------------------------------------------------------------------
name           description                     classes
-------------- ------------------------------- -----------------------------
centralValue   Compute median for numeric      character, Date, factor,
               variables, mode for             integer, labelled, logical,
               categorical variables           numeric

countMissing   Compute proportion of missing   character, Date, factor,
               observations                    integer, labelled, logical,
                                               numeric

minMax         Find minimum and maximum        integer, numeric, Date
               values

quartiles      Compute 1st and 3rd quartiles   Date, integer, numeric

uniqueValues   Count number of unique values   character, Date, factor,
                                               integer, labelled, logical,
                                               numeric

variableType   Data class of variable          character, Date, factor,
                                               integer, labelled, logical,
                                               numeric
----------------------------------------------------------------------------
```
]

---

# Everything is a function

Note that you can call the individual `check`, `summary`, and `visual` functions directly.

.footnotesize[
```r
> countMissing(toyData$events)
```

```
Number of missing obs.: 3 (20 %)
```

```r
> centralValue(toyData$events)
```

```
Median: 4.5
```

```r
> identifyOutliers(toyData$events)
```

```
Note that the following possible outlier values were detected: 82, 999.
```
]

---

# Extending `dataMaid`

* Custom check, visual, or summary functions.
* Few requirements: only the input and output formats are fixed.
* Easiest to start from a **template** and modify it.
* See `vignette("extending_dataMaid")` for detailed instructions. Or the exercises!

---

## Custom `summaryFunction` - template

.footnotesize[
```
mySummaryFunction <- function(v, ...) {
  val <- [ result of whatever summary we are doing ]
  res <- [ properly escaped version of val ]
  summaryResult(list(feature = "[Feature name]", result = res,
                     value = val))
}
```
]

Example (`centralValue` for numeric/integer)

.footnotesize[
```r
> function (v, maxDecimals = 2) {
+     v <- na.omit(v)
+     val <- median(v)
+     summaryResult(list(feature = "Median", result = round(val,
+         maxDecimals), value = val))
+ }
```
]

---

## Custom `checkFunction` - example

.small[
```r
> isSSN <- function(v, nMax = NULL, ...) {
+   # Default result: no problem found
+   out <- list(problem = FALSE,
+               message = "",
+               problemValues = NULL)
+   # inherits() also copes with objects that carry several classes
+   if (inherits(v, c("character", "factor",
+                     "labelled"))) {
+     # grepl() returns logicals, so any() needs no coercion
+     if (any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
+       out$problem <- TRUE
+       out$message <- "Warning: Seems to contain SSNs."
+       out$problemValues <- "Will not show"
+     }
+   }
+   out
+ }
```
]

---

## Using the function

.footnotesize[
```r
> DF <- data.frame(ids = c("111-22-3333", "123-45-6789",
+                          "111-22-3333"),
+                  id2 = c("111223333", "123456789", "4728491283"),
+                  stringsAsFactors = FALSE)
>
> check(DF, characterChecks = c("isSSN"))
```

```
$ids
$ids$isSSN
$ids$isSSN$problem
[1] TRUE

$ids$isSSN$message
[1] "Warning: Seems to contain SSNs."

$ids$isSSN$problemValues
[1] "Will not show"



$id2
$id2$isSSN
$id2$isSSN$problem
[1] FALSE

$id2$isSSN$message
[1] ""

$id2$isSSN$problemValues
NULL
```
]

---
class: inverse

# Exercise 2b/2c

How to tailor `dataMaid` to work with *your* dataset:

* Work interactively with `dataMaid` (exercise 2b)
* Create new custom summary, check and visualization functions (exercise 2c)

Pick whatever you want. Or jump back and forth.
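
---

## Custom `checkFunction` - a second sketch

A possible starting point for exercise 2c: a check for numeric variables that follows the same input and output pattern as `isSSN`. The function name `identifyNegative` and the vector `counts` are made up for illustration and are not part of `dataMaid`. The sketch is plain R, so it runs without the package.

.footnotesize[
```r
# Hypothetical check: flag negative values in a numeric variable,
# e.g. counts that can never be below zero.
identifyNegative <- function(v, nMax = NULL, ...) {
  # Same result structure as isSSN: problem flag, message, values
  out <- list(problem = FALSE, message = "", problemValues = NULL)
  if (is.numeric(v)) {
    # Drop missing values, keep each negative value once, sorted
    neg <- sort(unique(v[!is.na(v) & v < 0]))
    if (length(neg) > 0) {
      out$problem <- TRUE
      out$message <- paste0("Note: negative values found: ",
                            paste(neg, collapse = ", "), ".")
      out$problemValues <- neg
    }
  }
  out
}

# Made-up example data
counts <- c(1, 2, -1, NA, 6, -99)
res <- identifyNegative(counts)
res$message
```
]

Assuming `dataMaid` accepts it in the same way as `isSSN`, the check could then be tried through `check()`, for instance with `checks = setChecks(numeric = "identifyNegative")`. This call is an untested assumption here.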