Cleaning Up the Data Cleaning Process

Module 2: Data Screening with dataMaid

Claus Thorn Ekstrøm and Anne Helby Petersen
UCPH Biostatistics

CSP, Portland, OR, Feb. 15th, 2018

1

The RESCueH project

2

Timeline followback (TLFB)

day1 day2 day3
1 4 NA NA
2 24 NA 99
3 15 23 40
4 2 2 25
5 1 17 8
6 12 88 4
4

Monthly Alcohol units

5

Reproducible research

What didn't we check?

  • Need experts in relevant field
  • Merge existing databases into megadatabases
  • New technologies revive old data
6
# A tibble: 15 x 6
pill events region change id spotifysong
<fct> <dbl> <fct> <dbl> <fct> <fct>
1 red 1.00 a -0.626 1 Irrelevant
2 red 1.00 a 0.184 2 Irrelevant
3 red 1.00 a -0.836 3 Irrelevant
4 red 2.00 a 1.60 4 Irrelevant
5 red 2.00 a 0.330 5 Irrelevant
6 red 6.00 b -0.820 6 Irrelevant
7 red 6.00 b 0.487 7 Irrelevant
8 red 6.00 b 0.738 8 Irrelevant
9 red 999 c 0.576 9 Irrelevant
10 red NA c -0.305 10 Irrelevant
11 blue 4.00 c 1.51 11 Irrelevant
12 blue 82.0 . 0.390 12 Irrelevant
13 blue NA " " -0.621 13 Irrelevant
14 <NA> NaN other -2.21 14 Irrelevant
15 <NA> 5.00 OTHER 1.12 15 Irrelevant
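Several of the problems in this table can be spotted by hand. A minimal base R sketch, with two columns re-typed from the table; the candidate placeholder codes (99, 999, ".", blank) are assumptions made for illustration, not dataMaid's rules:

```r
# Two columns re-typed from the table above (not the packaged toyData object).
events <- c(1, 1, 1, 2, 2, 6, 6, 6, 999, NA, 4, 82, NA, NaN, 5)
region <- c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c",
            ".", " ", "other", "OTHER")

# Values already coded as missing: is.na() is TRUE for both NA and NaN.
n_missing <- sum(is.na(events))                         # 3

# Values that look like placeholder codes rather than data.
# The candidate codes are assumptions chosen for this illustration.
odd_events <- events[events %in% c(99, 999)]            # 999
odd_region <- region[trimws(region) %in% c("", ".")]    # "." " "

# Categories that differ only by letter case.
vals <- unique(region)
clash <- tolower(vals)[duplicated(tolower(vals))]
case_issues <- vals[tolower(vals) %in% clash]           # "other" "OTHER"
```

Repeating this by hand for every variable is the work that the report automates.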
7

dataMaid

> library(dataMaid)
> data(toyData)
> makeDataReport(toyData)

Documentation to be read and evaluated by a human.

See github.com/ekstroem/dataMaid for more info. Stable version on CRAN.

8

dataMaid flowchart

Flowchart: data frame -> summarize -> visualize -> check -> .Rmd / render (check also loops back to summarize)
9

Part 1: Data cleaning summary

10

Part 2: Summary table

11

Part 3: Variable list

12
13

dataMaid common arguments

Argument         Description
mode             Tasks to perform. Default: c("summarize", "visualize", "check")
replace          Logical. Should existing dataMaid reports be overwritten? Default: FALSE
output           Output format. Choices are "pdf", "html", or "word"
onlyProblematic  Logical. Show only variables with problems. Default: FALSE
maxProbVals      Maximum number of unique values printed. Positive integer or Inf. Default: 10
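
A sketch of how these arguments combine in a single call, using the toyData example from earlier. The particular values are arbitrary choices for illustration:

```r
library(dataMaid)
data(toyData)

# Skip the plots, overwrite any earlier report for toyData, write HTML,
# list only variables where a problem was found, and print at most
# 5 unique values.
makeDataReport(toyData,
               mode = c("summarize", "check"),
               replace = TRUE,
               output = "html",
               onlyProblematic = TRUE,
               maxProbVals = 5)
```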
14

Exercise 2

  1. Load the dataMaid package
  2. Use makeDataReport() to generate a data report
  3. Screen the bigPresidentData data for errors we did not find before.

Hunt for errors!

15

Using dataMaid interactively

> check(toyData$events)
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
$identifyOutliers
Note that the following possible outlier values were detected: 82, 999.
> check(toyData$events,
+ numericChecks = "identifyMissing")
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
17

The checks are the functions that each variable is checked with.

Specific checks can be selected (also in makeDataReport).

Show that the result is a list with 2 elements, and show the form they take.
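
The object returned by check() can be stored and inspected instead of printed. A sketch using the toyData$events example from this slide; the fields inside each element are inspected with str() rather than assumed:

```r
library(dataMaid)
data(toyData)

res <- check(toyData$events)

length(res)     # number of checks that were run
names(res)      # names of the checks that were run
str(res[[1]])   # fields of a single check result
```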

Overview of check functions

> allCheckFunctions()
----------------------------------------------------------------------------------
name                 description                        classes
-------------------- ---------------------------------- --------------------------
identifyCaseIssues   Identify case issues               character, factor
identifyLoners       Identify levels with < 6 obs.      character, factor
identifyMissing      Identify miscoded missing values   character, Date, factor,
                                                        integer, labelled, logical,
                                                        numeric

and more

18

Interactive dataMaid - visualizations

> visualize(toyData$events)

Can also check the available visual functions

> allVisualFunctions()

19

Interactive dataMaid - summaries

> summarize(toyData$events)
$variableType
Variable type: numeric
$countMissing
Number of missing obs.: 3 (20 %)
$uniqueValues
Number of unique values: 8
$centralValue
Median: 4.5
$quartiles
1st and 3rd quartiles: 1.75; 6
$minMax
Min. and max.: 1; 999
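
Most of these figures can be cross-checked in base R. A sketch with the events column re-typed from the toyData table (assuming the packaged data matches the printed table). The unique-value count is left out because the slides do not show how NaN is counted there:

```r
events <- c(1, 1, 1, 2, 2, 6, 6, 6, 999, NA, 4, 82, NA, NaN, 5)

n_miss   <- sum(is.na(events))            # 3; is.na() also catches NaN
pct_miss <- 100 * mean(is.na(events))     # 20
med      <- median(events, na.rm = TRUE)  # 4.5
qs       <- quantile(events, c(0.25, 0.75),
                     na.rm = TRUE, names = FALSE)  # 1.75 6
rng      <- range(events, na.rm = TRUE)   # 1 999
```

Note how the 999 placeholder sets the maximum while leaving the median untouched.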
20

Using dataMaid interactively

> allSummaryFunctions()
----------------------------------------------------------------------------
name           description                     classes
-------------- ------------------------------- -----------------------------
centralValue   Compute median or mode          character, Date, factor,
                                               integer, labelled, logical,
                                               numeric
countMissing   Compute ratio of missing obs.   character, Date, factor,
                                               integer, labelled, logical,
                                               numeric
minMax         Find min and max values         integer, numeric, Date
quartiles      Compute 1st and 3rd quartiles   Date, integer, numeric
uniqueValues   Count number of unique values   character, Date, factor,
                                               integer, labelled, logical,
                                               numeric
variableType   Data class of variable          character, Date, factor,
                                               integer, labelled, logical,
                                               numeric
----------------------------------------------------------------------------
21

Everything is a function

Note that you can call the individual check, summary, and visualize functions directly.

> countMissing(toyData$events)
Number of missing obs.: 3 (20 %)
> centralValue(toyData$events)
Median: 4.5
> identifyOutliers(toyData$events)
Note that the following possible outlier values were detected: 82, 999.
22

Extending dataMaid

Custom check, visual, or summary functions.

Few requirements: only the input and output formats are fixed.

Easiest to start from a template and modify it.

See vignette("extending_dataMaid") for detailed instructions. Or try the exercises!

23

Custom summaryFunction - template

mySummaryFunction <- function(v, ...) {
  val <- [ result of whatever summary we are doing ]
  res <- [ properly escaped version of val ]
  summaryResult(list(feature = "[Feature name]",
                     result = res,
                     value = val))
}

Example (centralValue for numeric/integer)

> function (v, maxDecimals = 2) {
+ v <- na.omit(v)
+ val <- median(v)
+ summaryResult(list(feature = "Median", result = round(val,
+ maxDecimals), value = val))
+ }
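
Filling in the template, a hypothetical summary that counts zeros could look as follows. countZeros is an illustrative name, not presented here as a built-in dataMaid function; only summaryResult() is taken from the template:

```r
library(dataMaid)

# Hypothetical custom summary: how many observations are exactly zero?
countZeros <- function(v, ...) {
  val <- sum(v == 0, na.rm = TRUE)   # missing values are ignored
  summaryResult(list(feature = "No. zeros",
                     result = val,
                     value = val))
}

z <- countZeros(c(0, 1, 0, NA, 3))
z
```

Here result and value coincide because a plain count needs no escaping or rounding before it is printed.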
24

Custom checkFunction - template

> isSSN <- function(v, nMax = NULL, ...) {
+   out <- list(problem = FALSE,
+               message = "",
+               problemValues = NULL)
+   # inherits() also handles objects with several classes, e.g. ordered factors
+   if (inherits(v, c("character", "factor", "labelled"))) {
+     if (any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
+       out$problem <- TRUE
+       out$message <- "Warning: Seems to contain SSNs."
+       out$problemValues <- "Will not show"
+     }
+   }
+   out
+ }
25

Using the function

> DF <- data.frame(ids=c("111-22-3333","123-45-6789",
+ "111-22-3333"),
+ id2=c("111223333", "123456789", "4728491283"),
+ stringsAsFactors=FALSE)
>
> check(DF, characterChecks = c("isSSN"))
$ids
$ids$isSSN
$ids$isSSN$problem
[1] TRUE
$ids$isSSN$message
[1] "Warning: Seems to contain SSNs."
$ids$isSSN$problemValues
[1] "Will not show"
$id2
$id2$isSSN
$id2$isSSN$problem
[1] FALSE
$id2$isSSN$message
[1] ""
$id2$isSSN$problemValues
NULL
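
The difference between the two columns comes down to the regular expression, which can be tried on its own in base R. The anchored variant is an optional tightening, not part of the slides:

```r
ids <- c("111-22-3333", "123-45-6789", "111223333", "4728491283")

# The pattern from isSSN: three digits, dash, two digits, dash, four digits.
ssn_pattern <- "\\d{3}-\\d{2}-\\d{4}"
hits <- grepl(ssn_pattern, ids)   # TRUE TRUE FALSE FALSE

# The pattern is unanchored, so it also fires on longer strings that merely
# contain an SSN-like fragment. Anchoring with ^ and $ requires a full match.
loose  <- grepl(ssn_pattern, "ref 111-22-3333x")                     # TRUE
strict <- grepl(paste0("^", ssn_pattern, "$"), "ref 111-22-3333x")   # FALSE
```

Whether the loose or the strict version is preferable depends on whether false alarms or missed SSNs are the bigger concern for the data at hand.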
26

Exercise 2b/2c

How to tailor dataMaid to work with your dataset:

  • Work interactively with dataMaid (exercise 2b)
  • Create new custom summary, check and visualization functions (exercise 2c)

Pick whatever you want. Or jump back and forth.

27
