class: center, middle, inverse, title-slide

# Cleaning Up the Data Cleaning Process
## Module 2: Data Screening with dataMaid
### Claus Thorn Ekstrøm and Anne Helby Petersen
UCPH Biostatistics
### CSP, Portland, OR, Feb. 15th, 2018
.footnotesize[
Slides/homepage
]

---
background-image: url(pics/manbeer.jpg)
background-size: 100%
class: middle, center

# The RESCueH project

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1    4   NA   NA
2   24   NA   99
3   15   23   40
4    2    2   25
```

---
class: center

# Timeline followback (TLFB)

```
  day1 day2 day3
1    4   NA   NA
2   24   NA   99
3   15   23   40
4    2    2   25
5    1   17    8
6   12   88    4
```

---
background-image: url(pics/mau.png)
background-size: 80%
class: center

# Monthly Alcohol units

---

# Reproducible research

What **didn't** we check?

--

- Need **experts in relevant field**
- Merge existing databases into megadatabases
- New technologies revive old data

---
class: center

.small[
```
# A tibble: 15 x 6
   pill  events region change id    spotifysong
   <fct>  <dbl> <fct>   <dbl> <fct> <fct>
 1 red     1.00 a      -0.626 1     Irrelevant
 2 red     1.00 a       0.184 2     Irrelevant
 3 red     1.00 a      -0.836 3     Irrelevant
 4 red     2.00 a       1.60  4     Irrelevant
 5 red     2.00 a       0.330 5     Irrelevant
 6 red     6.00 b      -0.820 6     Irrelevant
 7 red     6.00 b       0.487 7     Irrelevant
 8 red     6.00 b       0.738 8     Irrelevant
 9 red   999    c       0.576 9     Irrelevant
10 red    NA    c      -0.305 10    Irrelevant
11 blue    4.00 c       1.51  11    Irrelevant
12 blue   82.0  .       0.390 12    Irrelevant
13 blue   NA    " "    -0.621 13    Irrelevant
14 <NA>  NaN    other  -2.21  14    Irrelevant
15 <NA>    5.00 OTHER   1.12  15    Irrelevant
```
]

---
class: middle

# dataMaid

```r
> library(dataMaid)
> data(toyData)
*> makeDataReport(toyData)
```

Documentation to be **read** and **evaluated** by a human.

See [github.com/ekstroem/dataMaid](github.com/ekstroem/dataMaid) for more info. Stable version on CRAN.

---
background-image: url(pics/flowchart2.png)
background-size: 100%
class: center

# `dataMaid` flowchart
---

# Part 1: Data cleaning summary

<img src="pics/summ.png" width="100%" style="display: block; margin: auto;" />

---
background-image: url(pics/miss.png)
background-size: 100%

# Part 2: Summary table

---
background-image: url(pics/out1.png)
background-size: 100%

# Part 3: Variable list

---
background-image: url(pics/out2.png)
background-size: 100%

---

# `dataMaid` common arguments

.small[

| Argument | Description |
| ------------- |:--------------------------------|
| `mode` | Tasks to perform. `c("summarize", "visualize", "check")` is default |
| `replace` | Logical. Should existing dataMaid reports be overwritten? Default `FALSE` |
| `output` | Output format. Choices are `"pdf"`, `"html"`, `"word"` |
| `onlyProblematic` | Logical. Show only variables with problems. Default `FALSE` |
| `maxProbVals` | Maximum number of unique values printed. Positive int or `Inf` (default 10) |

]

---
class: inverse, middle

# Exercise 2

1. Load the `dataMaid` package
2. Use `makeDataReport()` to generate a data report
3. Screen the `bigPresidentData` data for errors we did not find before.

Hunt for errors!

---
class: center

.small[
```
# A tibble: 15 x 6
   pill  events region change id    spotifysong
   <fct>  <dbl> <fct>   <dbl> <fct> <fct>
 1 red     1.00 a      -0.626 1     Irrelevant
 2 red     1.00 a       0.184 2     Irrelevant
 3 red     1.00 a      -0.836 3     Irrelevant
 4 red     2.00 a       1.60  4     Irrelevant
 5 red     2.00 a       0.330 5     Irrelevant
 6 red     6.00 b      -0.820 6     Irrelevant
 7 red     6.00 b       0.487 7     Irrelevant
 8 red     6.00 b       0.738 8     Irrelevant
 9 red   999    c       0.576 9     Irrelevant
10 red    NA    c      -0.305 10    Irrelevant
11 blue    4.00 c       1.51  11    Irrelevant
12 blue   82.0  .       0.390 12    Irrelevant
13 blue   NA    " "    -0.621 13    Irrelevant
14 <NA>  NaN    other  -2.21  14    Irrelevant
15 <NA>    5.00 OTHER   1.12  15    Irrelevant
```
]

---

# Using `dataMaid` interactively

.footnotesize[
```r
> check(toyData$events)
```

```
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
$identifyOutliers
Note that the following possible outlier values were detected: 82, 999.
```

```r
> check(toyData$events,
+       numericChecks = "identifyMissing")
```

```
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
```
]

???

The checks are the functions that are applied to the variable.
Specific checks can be selected (also in `makeDataReport`).
Show that the result is a list with 2 elements, and the form they take.

---

# Overview of `check` functions

.footnotesize[
```r
> allCheckFunctions()
```
]

.footnotesize[
```
----------------------------------------------------------------------------------
name                 description                     classes
-------------------- ------------------------------- -----------------------------
identifyCaseIssues   Identify case issues            character, factor

identifyLoners       Identify levels with < 6 obs.   character, factor

identifyMissing      Identify miscoded missing       character, Date, factor,
                     values                          integer, labelled, logical,
                                                     numeric
```
]

and more

---

# Interactive `dataMaid` - visualizations

.pull-left[
.footnotesize[
```r
> visualize(toyData$events)
```

Can also check the available `visual` functions

```r
> allVisualFunctions()
```
]
]

.pull-right[
![](module2_files/figure-html/unnamed-chunk-13-1.png)
]

---

# Interactive `dataMaid` - summaries

.footnotesize[
```r
> summarize(toyData$events)
```

```
$variableType
Variable type: numeric
$countMissing
Number of missing obs.: 3 (20 %)
$uniqueValues
Number of unique values: 8
$centralValue
Median: 4.5
$quartiles
1st and 3rd quartiles: 1.75; 6
$minMax
Min. and max.: 1; 999
```
]

---

# Using dataMaid interactively

.small[
```
> allSummaryFunctions()
----------------------------------------------------------------------
name          description                     classes
------------- ------------------------------- ------------------------
centralValue  Compute median or mode          character, Date, factor,
                                              integer, labelled,
                                              logical, numeric

countMissing  Compute ratio of missing obs.   character, Date, factor,
                                              integer, labelled,
                                              logical, numeric

minMax        Find min and max values         integer, numeric, Date

quartiles     Compute 1st and 3rd quartiles   Date, integer, numeric

uniqueValues  Count number of unique values   character, Date, factor,
                                              integer, labelled,
                                              logical, numeric

variableType  Data class of variable          character, Date, factor,
                                              integer, labelled,
                                              logical, numeric
----------------------------------------------------------------------
```
]

---

# Everything is a function

Note that you can call the individual `check`, `summary`, and `visualize` functions directly.

.footnotesize[
```r
> countMissing(toyData$events)
```

```
Number of missing obs.: 3 (20 %)
```

```r
> centralValue(toyData$events)
```

```
Median: 4.5
```

```r
> identifyOutliers(toyData$events)
```

```
Note that the following possible outlier values were detected: 82, 999.
```
]

---

# Extending `dataMaid`

Custom check, visual, or summary functions.

Few requirements: only the input and output formats are fixed.

Easiest to start from one of the **templates** and modify it.

Check vignette `vignette("extending_dataMaid")` for detailed instructions.

Or the exercises!

---

## Custom `summaryFunction` - template

.footnotesize[
```
mySummaryFunction <- function(v, ...) {
  val <- [ result of whatever summary we are doing ]
  res <- [ properly escaped version of val ]
  summaryResult(list(feature = "[Feature name]", result = res, value = val))
}
```
]

Example (centralValue for numeric/integer)

.footnotesize[
```r
> function (v, maxDecimals = 2) {
+   v <- na.omit(v)
+   val <- median(v)
+   summaryResult(list(feature = "Median", result = round(val,
+       maxDecimals), value = val))
+ }
```
]

---

## Custom `checkFunction` - template

.small[
```r
> isSSN <- function(v, nMax = NULL, ...) {
+   out <- list(problem = FALSE,
+               message = "",
+               problemValues = NULL)
+   # inherits() also works when v has more than one class
+   if (inherits(v, c("character", "factor",
+                     "labelled"))) {
+     if (any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
+       out$problem <- TRUE
+       out$message <- "Warning: Seems to contain SSNs."
+       out$problemValues <- "Will not show"
+     }
+   }
+   out
+ }
```
]

---

## Using the function

.footnotesize[
```r
> DF <- data.frame(ids = c("111-22-3333", "123-45-6789",
+                          "111-22-3333"),
+                  id2 = c("111223333", "123456789", "4728491283"),
+                  stringsAsFactors = FALSE)
>
> check(DF, characterChecks = c("isSSN"))
```

```
$ids
$ids$isSSN
$ids$isSSN$problem
[1] TRUE

$ids$isSSN$message
[1] "Warning: Seems to contain SSNs."

$ids$isSSN$problemValues
[1] "Will not show"



$id2
$id2$isSSN
$id2$isSSN$problem
[1] FALSE

$id2$isSSN$message
[1] ""

$id2$isSSN$problemValues
NULL
```
]

---
class: inverse

# Exercise 2b/2c

How to tailor `dataMaid` to work with *your* dataset:

* Work interactively with `dataMaid` (exercise 2b)
* Create new custom summary, check and visualization functions (exercise 2c)

Pick whatever you want. Or jump back and forth.
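
---

## Sketch: a range check in the same style

The `isSSN()` example can be adapted to other rules. Below is a minimal sketch, assuming a hypothetical check named `isAboveMax()` with an illustrative threshold `maxVal`; neither is part of `dataMaid`. It uses base R only and returns the same three-element list (`problem`, `message`, `problemValues`).

.footnotesize[
```r
# Hypothetical custom check (not part of dataMaid): flag numeric values
# above a plausible maximum chosen from subject-matter knowledge.
# The default maxVal = 50 is an illustrative assumption.
isAboveMax <- function(v, maxVal = 50, ...) {
  out <- list(problem = FALSE,
              message = "",
              problemValues = NULL)
  if (is.numeric(v)) {
    # which() drops NA and NaN comparisons, so missing values are ignored
    bad <- sort(unique(v[which(v > maxVal)]))
    if (length(bad) > 0) {
      out$problem <- TRUE
      out$message <- paste0("Values above ", maxVal, ": ",
                            paste(bad, collapse = ", "), ".")
      out$problemValues <- bad
    }
  }
  out
}

# The events column of the toy data, typed in by hand
events <- c(1, 1, 1, 2, 2, 6, 6, 6, 999, NA, 4, 82, NA, NaN, 5)
res <- isAboveMax(events)
res$message
```
]

Called directly on `events`, the sketch flags 82 and 999. Plugging such a function into `check()` or `makeDataReport()` would follow the steps in `vignette("extending_dataMaid")`.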