  day1 day2 day3
1    4   NA   NA
2   24   NA   99
3   15   23   40
4    2    2   25
5    1   17    8
6   12   88    4
What didn't we check?
# A tibble: 15 x 6
   pill  events region change id    spotifysong
   <fct>  <dbl> <fct>   <dbl> <fct> <fct>
 1 red     1.00 a      -0.626 1     Irrelevant
 2 red     1.00 a       0.184 2     Irrelevant
 3 red     1.00 a      -0.836 3     Irrelevant
 4 red     2.00 a       1.60  4     Irrelevant
 5 red     2.00 a       0.330 5     Irrelevant
 6 red     6.00 b      -0.820 6     Irrelevant
 7 red     6.00 b       0.487 7     Irrelevant
 8 red     6.00 b       0.738 8     Irrelevant
 9 red   999    c       0.576 9     Irrelevant
10 red    NA    c      -0.305 10    Irrelevant
11 blue    4.00 c       1.51  11    Irrelevant
12 blue   82.0  .       0.390 12    Irrelevant
13 blue   NA    " "    -0.621 13    Irrelevant
14 <NA>  NaN    other  -2.21  14    Irrelevant
15 <NA>    5.00 OTHER   1.12  15    Irrelevant
> library(dataMaid)
> data(toyData)
> makeDataReport(toyData)
Documentation to be read and evaluated by a human.
See github.com/ekstroem/dataMaid for more info. Stable version on CRAN.
dataMaid flowchart
dataMaid common arguments
Argument | Description |
---|---|
mode | Tasks to perform. c("summarize", "visualize", "check") is default |
replace | Logical. Should existing dataMaid reports be overwritten? Default FALSE |
output | Output format. Choices are "pdf", "html", "word" |
onlyProblematic | Logical. Show only variables with problems. Default FALSE |
maxProbVals | Maximum number of unique values printed. Positive integer or Inf (default 10) |
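As a rough sketch of how the arguments in the table combine in one call: the `file`, `render`, and `openResult` arguments and the temporary path are not from the slides. They are assumptions used here only to write the R Markdown source to a scratch location without rendering or opening it.

```r
library(dataMaid)
data(toyData)

# Assumption: illustrative scratch location for the report source
report_file <- file.path(tempdir(), "toyData_report.Rmd")

makeDataReport(
  toyData,
  mode = c("summarize", "check"),  # skip the "visualize" task
  output = "html",                 # one of "pdf", "html", "word"
  onlyProblematic = TRUE,          # list only variables with flagged problems
  maxProbVals = 5,                 # print at most 5 problematic values
  replace = TRUE,                  # overwrite an earlier report at this path
  file = report_file,              # assumption: not covered on the slides
  render = FALSE,                  # assumption: write the .Rmd only
  openResult = FALSE               # assumption: do not open the result
)
```

Dropping the last three arguments gives the interactive behaviour from the slides: the report is rendered and opened in the chosen output format.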
dataMaid package
Use makeDataReport() to generate a data report.
Check the bigPresidentData data for errors we did not find before.
Hunt for errors!
dataMaid interactively
> check(toyData$events)
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
$identifyOutliers
Note that the following possible outlier values were detected: 82, 999.
> check(toyData$events,
+       numericChecks = "identifyMissing")
$identifyMissing
The following suspected missing value codes enter as regular values: 999, NaN.
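To make the idea behind the missing-code message concrete, here is a minimal base R sketch. It is not dataMaid's implementation: the rule (flag NaN and values written only with 9s) and the function name `suspect_missing` are simplifying assumptions for illustration, and the vector is copied from the `events` column shown earlier.

```r
# events column as printed in the toyData tibble
events <- c(1, 1, 1, 2, 2, 6, 6, 6, 999, NA, 4, 82, NA, NaN, 5)

# Hypothetical, simplified rule: NaN, or a value written only with 9s (99, 999, -99, ...)
suspect_missing <- function(v) {
  vals <- unique(v[!is.na(v) | is.nan(v)])  # drop proper NA, keep NaN
  vals[is.nan(vals) | grepl("^-?9{2,}$", as.character(vals))]
}

suspect_missing(events)
```

On this vector the sketch flags the same two values as the check output above, 999 and NaN; real data will usually need a broader list of candidate codes.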
The checks are the functions that each variable is checked with.
Specific checks can be selected (also in makeDataReport).
Show that the result is a list with 2 elements, and the form those elements have.
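A small sketch of how the structure of the result could be inspected. The element names are taken from the printed output above, and the three fields are the ones used in the check function template later in the deck; treating the returned object as a plain list is an assumption.

```r
library(dataMaid)
data(toyData)

res <- check(toyData$events)

length(res)  # one element per check that was run
names(res)   # names of the check functions

# Each element carries the same three fields
res$identifyMissing$problem        # logical: was a problem found?
res$identifyMissing$message        # text shown in the report
res$identifyMissing$problemValues  # the offending values
```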
check functions
> allCheckFunctions()
----------------------------------------------------------------------------------
name                 description                         classes
-------------------- ----------------------------------- -------------------------
identifyCaseIssues   Identify case issues                character, factor
identifyLoners       Identify levels with < 6 obs.       character, factor
identifyMissing      Identify miscoded missing values    character, Date, factor, integer, labelled, logical, numeric
and more
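For intuition about the first two checks in the table, a base R sketch on the `region` column from the toy data. This illustrates the ideas only and is not how dataMaid implements them; the variable names are made up for the example.

```r
# region column as printed in the toyData tibble
region <- c(rep("a", 5), rep("b", 3), rep("c", 3), ".", " ", "other", "OTHER")

# "Loners": levels with fewer than 6 observations
counts <- table(region)
loners <- names(counts)[counts < 6]

# Case issues: distinct levels that differ only by upper/lower case
lev <- unique(region)
lower <- tolower(lev)
case_issues <- lev[lower %in% lower[duplicated(lower)]]

loners
case_issues
```

With only 15 rows every level falls below the 6-observation threshold, which shows why such a rule works better as a prompt for a human reader than as an automatic filter.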
dataMaid - visualizations
> visualize(toyData$events)
Can also check the available visual functions
> allVisualFunctions()
dataMaid - summaries
> summarize(toyData$events)
$variableType
Variable type: numeric
$countMissing
Number of missing obs.: 3 (20 %)
$uniqueValues
Number of unique values: 8
$centralValue
Median: 4.5
$quartiles
1st and 3rd quartiles: 1.75; 6
$minMax
Min. and max.: 1; 999
> allSummaryFunctions()
----------------------------------------------------------------------
name          description                    classes
------------  -----------------------------  -------------------------
centralValue  Compute median or mode         character, Date, factor, integer, labelled, logical, numeric
countMissing  Compute ratio of missing obs.  character, Date, factor, integer, labelled, logical, numeric
minMax        Find min and max values        integer, numeric, Date
quartiles     Compute 1st and 3rd quartiles  Date, integer, numeric
uniqueValues  Count number of unique values  character, Date, factor, integer, labelled, logical, numeric
variableType  Data class of variable         character, Date, factor, integer, labelled, logical, numeric
----------------------------------------------------------------------
Note that you can call the individual check, summary, and visualize functions directly.
> countMissing(toyData$events)
Number of missing obs.: 3 (20 %)
> centralValue(toyData$events)
Median: 4.5
> identifyOutliers(toyData$events)
Note that the following possible outlier values were detected: 82, 999.
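The numbers reported above can be cross-checked in base R. This sketch copies the `events` column from the printed tibble and uses only base functions; it is a sanity check, not a description of dataMaid's internals.

```r
# events column as printed in the toyData tibble
events <- c(1, 1, 1, 2, 2, 6, 6, 6, 999, NA, 4, 82, NA, NaN, 5)

# is.na() is TRUE for both NA and NaN
n_missing <- sum(is.na(events))
pct_missing <- 100 * mean(is.na(events))

# Summaries on the observed values only
observed <- events[!is.na(events)]
med <- median(observed)
quarts <- quantile(observed, c(0.25, 0.75))
rng <- range(observed)

n_missing; pct_missing; med; quarts; rng
```

Note that the median and maximum still include the 999 that the checks flag as a suspected missing code, so summaries are best read together with the check results.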
dataMaid
Custom check, visual, or summary functions.
Few requirements: only the input and output formats are fixed.
Easiest to start from one of the templates and modify it.
Check the vignette vignette("extending_dataMaid") for detailed instructions. Or the exercises!
summaryFunction - template
mySummaryFunction <- function(v, ...) {
  val <- [ result of whatever summary we are doing ]
  res <- [ properly escaped version of val ]
  summaryResult(list(feature = "[Feature name]", result = res, value = val))
}
Example (centralValue for numeric/integer)
> function (v, maxDecimals = 2) {
+   v <- na.omit(v)
+   val <- median(v)
+   summaryResult(list(feature = "Median", result = round(val,
+     maxDecimals), value = val))
+ }
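As one way the template might be filled in, here is a hypothetical custom summary that counts zeros. The function name, feature label, and description are invented for illustration; the `summaryFunction()` call that attaches a description and the applicable classes follows the pattern in the extending_dataMaid vignette as I understand it, so check the vignette for your installed version.

```r
library(dataMaid)

# Hypothetical custom summary: how many observations equal zero?
countZeros <- function(v, ...) {
  val <- sum(v == 0, na.rm = TRUE)  # raw result
  res <- val                        # a plain number needs no escaping
  summaryResult(list(feature = "No. zeros", result = res, value = val))
}

# Attach a description and the classes the summary applies to
countZeros <- summaryFunction(countZeros,
  description = "Count number of zeros",
  classes = c("integer", "numeric"))

countZeros(c(0, 1, 0, NA, 3))
```

How the new function is then passed to summarize() or makeDataReport() depends on the package version, so that step is left to the vignette.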
checkFunction - template
> isSSN <- function(v, nMax = NULL, ...) {
+   # Default result: no problem found
+   out <- list(problem = FALSE,
+               message = "",
+               problemValues = NULL)
+   # any() guards against objects with more than one class
+   if (any(class(v) %in% c("character", "factor",
+                           "labelled"))) {
+     # Look for the pattern ddd-dd-dddd anywhere in the values
+     if (any(grepl("\\d{3}-\\d{2}-\\d{4}", v))) {
+       out$problem <- TRUE
+       out$message <- "Warning: Seems to contain SSNs."
+       out$problemValues <- "Will not show"
+     }
+   }
+   out
+ }
> DF <- data.frame(ids = c("111-22-3333", "123-45-6789",
+                          "111-22-3333"),
+                  id2 = c("111223333", "123456789", "4728491283"),
+                  stringsAsFactors = FALSE)
>
> check(DF, characterChecks = c("isSSN"))
$ids
$ids$isSSN
$ids$isSSN$problem
[1] TRUE

$ids$isSSN$message
[1] "Warning: Seems to contain SSNs."

$ids$isSSN$problemValues
[1] "Will not show"

$id2
$id2$isSSN
$id2$isSSN$problem
[1] FALSE

$id2$isSSN$message
[1] ""

$id2$isSSN$problemValues
NULL
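The pattern in isSSN is unanchored, so it also fires when the digits sit inside a longer string. The base R sketch below compares it with an anchored variant; the `noisy` value and the `strict` pattern are made-up additions for illustration, and `[0-9]` is used in place of `\\d` purely for readability.

```r
ids <- c("111-22-3333", "123-45-6789", "111-22-3333")
id2 <- c("111223333", "123456789", "4728491283")
noisy <- "ref 9111-22-33334"  # hypothetical value: SSN-like digits inside a longer code

loose <- "[0-9]{3}-[0-9]{2}-[0-9]{4}"     # same idea as the pattern in isSSN
strict <- "^[0-9]{3}-[0-9]{2}-[0-9]{4}$"  # whole value must be ddd-dd-dddd

grepl(loose, ids)     # matches every value in ids
grepl(loose, id2)     # no hyphens, so no match
grepl(loose, noisy)   # matches inside the longer string
grepl(strict, noisy)  # anchored pattern rejects it
```

Which behaviour is preferable depends on the data: for a privacy screen, the looser pattern errs on the side of raising a warning.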
How to tailor dataMaid to work with your dataset:
dataMaid (exercise 2b)
Pick whatever you want. Or jump back and forth.