class: center, middle, inverse, title-slide

# Cleaning Up the Data Cleaning Process
## Module 3: Row-wise constraints
### Claus Thorn Ekstrøm and Anne Helby Petersen
UCPH Biostatistics
### CSP, Portland, OR, Feb. 15th, 2018
.footnotesize[
Slides/homepage
]

---

##Row-wise or column-wise checks?

<img src="pics/colrow1.png" width="75%" style="display: block; margin: auto;" />

---

##Row-wise or column-wise checks?

<img src="pics/colrow2.png" width="75%" style="display: block; margin: auto;" />

---

##Row-wise or column-wise checks?

<img src="pics/colrow3.png" width="75%" style="display: block; margin: auto;" />

---

##Row-wise *and* column-wise constraints!

* `dataMaid` performs class-dependent checks for each variable in a dataset, one at a time (column-wise)
    + Pros: Easy to document what was (not) done, lets you get started without a lot of prior knowledge, easy to share with collaborators
    + Shortcomings: Generally cannot detect internal consistency issues or use non-class-dependent variable constraints

---

##Row-wise *and* column-wise constraints!

An R package that performs row-wise checks: `validate`

```r
library(validate)
```

Note: Different use of the term "validation" - no longer about format, type and range, but used as a synonym for "check".

---

##`validate` - overview:

* Splits checking into two steps:
    1. Define the checking rules in a `validator` object, using the `validator()` function.
    2. Confront the data with the rules in a call to `confront()`, thereby obtaining a `confrontation` object.
* Easy to assess what problems were found and to document what was checked (saved in `validator` object).
* <span style="color:red">Beware: Non-standard object storage might cause trouble!</span>

---

###Make `validator` object

.footnotesize[
```r
val1 <- validator(
  `Adult president` = ageAtInauguration >= 18,
  `Alive at inauguration` = dateOfDeath >= presidencyBeginDate,
  `Positive first name` = firstName*2 > firstName
)
```
]

---

###Confront data with `validator` object:

.footnotesize[
```r
con1 <- confront(bpD, val1)
summary(con1)[, 1:6]
```

```
##                    name items passes fails nNA error
## 1       Adult.president    47     47     0   0 FALSE
## 2 Alive.at.inauguration    47     39     1   7 FALSE
## 3   Positive.first.name     0      0     0   0  TRUE
```
]

---

##Understand confrontation results

Lots of functions available for inspecting confrontations:

* `summary()`: Overview of confrontation results
* `aggregate()`: Compute percentage pass/fail/na
* `sort()`: Sort results by problem prevalence
* `values()`: For each observation and each check: `TRUE`/`FALSE`/`NA`
* `barplot()`: Visual overview of check results
* `errors()`: What errors were caught?
* `warnings()`: What warnings were caught?
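
---

##Understand confrontation results

A minimal sketch of a few of these inspection functions, assuming the `validate` package is installed. The data frame `toy` and the rule names `adult` and `ordered` are hypothetical and are not part of the workshop data.

```r
library(validate)

# Hypothetical toy data (not the presidents data used in these slides)
toy <- data.frame(
  age   = c(25, 17, 40, NA),
  start = as.Date(c("2001-01-01", "2005-06-01", "2010-03-15", "2012-01-01")),
  end   = as.Date(c("2003-01-01", "2004-06-01", NA, "2015-01-01"))
)

# Two row-wise rules; missing values make a check evaluate to NA
val <- validator(
  adult   = age >= 18,
  ordered = end >= start
)

con <- confront(toy, val)

summary(con)    # one line per rule: items, passes, fails, nNA, ...
values(con)     # TRUE/FALSE/NA for each observation and each rule
aggregate(con)  # pass/fail/NA counts and proportions per rule
errors(con)     # rules that could not be evaluated (none expected here)
```

Here `values()` is the most direct way to locate the offending rows, while `summary()` and `aggregate()` give the per-rule totals.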
---

##Warning: modify-by-reference

* `validator` and `confrontation` objects use "modify-by-reference" semantics rather than the "copy-on-modify" semantics that are standard in R
* Another package that uses "modify-by-reference": `data.table`
* If one is not aware of modify-by-reference, it can have problematic consequences
    + Unintentional overwriting of full objects, attributes or parts
    + Caching in rmarkdown does not work (but throws no errors)

---

##Warning: modify-by-reference

```r
v1 <- validator(check1 = sex == "Male")
v1
```

```
## Object of class 'validator' with 1 elements:
##  check1: sex == "Male"
```

```r
v2 <- v1                   # no copy is made: v2 refers to the same object as v1
names(v2) <- "All males"   # so renaming v2 also renames v1
v1
```

```
## Object of class 'validator' with 1 elements:
##  All.males: sex == "Male"
```

---

##Warning: modify-by-reference

Make a copy using `[TRUE]`:

```r
v1 <- validator(check1 = sex == "Male")
v2 <- v1[TRUE]             # subsetting returns a new, independent object
names(v2) <- "All males"
v1
```

```
## Object of class 'validator' with 1 elements:
##  check1: sex == "Male"
```

---
class: inverse

# Exercise 3

We will now add checking for row-wise constraints to our data screening process.

* Go through exercise 3 (Note: the last couple of questions in exercise 3 can be pretty tricky)