Exercise 1

For this initial exercise we will try to identify errors in a data frame containing information about the US presidents. First, we install the package needed to get the dataset that we will be working with. If you have already installed dataMaid, you can skip this step.

install.packages("dataMaid")

Now we have installed the dataMaid package, which contains the raw dataset bigPresidentData. We can see the first couple of rows of the dataset with the head() command.

library("dataMaid")
data("bigPresidentData")
head(bigPresidentData)
##      lastName firstName orderOfPresidency   birthday dateOfDeath
## 38       Ford    Gerald                38 1913-07-14  2006-12-26
## 2       Adams      John                 2 1735-10-30  1826-07-04
## 31     Hoover   Herbert                31 1874-08-10  1964-10-20
## 1  Washington    George                 1 1732-02-22  1799-12-14
## 13   Fillmore   Millard                13 1800-01-07  1874-03-08
## 42    Clinton   William                42 1946-08-19        <NA>
##     stateOfBirth       party presidencyBeginDate presidencyEndDate
## 38      Nebraska  Republican          1974-08-09        1977-01-20
## 2  Massachusetts  Federalist          1797-03-04        1801-03-04
## 31          Iowa  Republican          1929-03-04        1933-03-04
## 1       Virginia Independent          1789-04-30        1797-03-04
## 13      New York        Whig          1850-07-09        1853-03-04
## 42      Arkansas  Democratic          1993-01-20        2001-01-20
##    assassinationAttempt  sex ethnicity presidencyYears ageAtInauguration
## 38                    1 Male Caucasian               2                61
## 2                     0 Male Caucasian               3                61
## 31                    0 Male Caucasian               4                54
## 1                     0 Male Caucasian               7                57
## 13                    0 Male Caucasian               2                50
## 42                    0 Male Caucasian               8                46
##    favoriteNumber
## 38           2+0i
## 2            4+0i
## 31           5+0i
## 1            3+0i
## 13           7+0i
## 42           7+0i

We are being told by our research collaborators that the dataset should contain information on 15 variables related to the US presidents. The variables are:

  • lastName The surname of the president.
  • firstName The first name of the president.
  • orderOfPresidency A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
  • birthday The date of birth of the president.
  • dateOfDeath The date of the president’s death.
  • stateOfBirth A character variable with the state in which the president was born.
  • party A character variable with the political party to which the president was associated.
  • presidencyBeginDate A Date variable with the date of inauguration of the president.
  • presidencyEndDate A Date variable with the date at which the presidency ends.
  • assassinationAttempt A text variable indicating whether there was an assassination attempt on the president (1) or not (0).
  • sex A factor variable with the sex of the president.
  • ethnicity A factor variable with the ethnicity of the president.
  • presidencyYears A numeric variable with the duration of the presidency, in years.
  • ageAtInauguration The age of the president at inauguration.
  • favoriteNumber A complex type variable with a fictional favorite number for each president.

Try to find and document as many errors as you can in bigPresidentData. In particular, focus on validity (does the data conform to the correct syntax, with the right format, type, and range?), and on consistency, accuracy, and uniqueness.

[Hint: If you are new to R, you can use the functions str() and print() to inspect the data frame, and the $ operator to extract individual variables from it.]
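
For example, a first inspection could look like this:

str(bigPresidentData)       # variable classes and a preview of each column
print(bigPresidentData)     # print the full data frame to the console
bigPresidentData$birthday   # extract a single variable with the $ operator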

Exercise 2

Here we will use the makeDataReport() function from the dataMaid package to screen the data for obvious and non-obvious errors. We will continue working on the bigPresidentData dataset in order to find errors that we previously missed.

Note that Exercise 2 contains a lot of sub-exercises. We do not expect you to solve all of them in the time given. Remember that the primary focus is to identify potential errors in the dataset.

  1. Run the makeDataReport() function to make your own report for the bigPresidentData and look through it. What legitimate errors are found? What warnings are unnecessary?

  2. Start by verifying the validity of the data (format, type, and range). When dataMaid encounters a variable type that it does not know, it prints a note and skips the variable. If we want to perform specific checks on a variable with an unknown type, we must either fix the class in the data frame (when relevant) or force dataMaid to treat the variable as if it belonged to another class.

    Try calling help(makeDataReport) and look at the documentation for the argument treatXasY. Use this argument in a new call to makeDataReport() so that, for example, the favoriteNumber variable is handled like a numeric variable (one possible call is sketched after this list). [Hint: you may need to use the replace = TRUE argument too, to overwrite any previous reports.]

  3. Are there any errors that do not pop up when using the makeDataReport() function? What about consistency in the data? What about uniqueness in the data?
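
One possible call for the treatXasY part in item 2 above, assuming that the argument takes a named list mapping the original class (here complex, the class of favoriteNumber) to the class it should be treated as; see help(makeDataReport) to confirm the exact format:

makeDataReport(bigPresidentData, replace = TRUE,
               treatXasY = list(complex = "numeric"))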

Note how everything is documented and can be used in collaboration with a research partner.

The argument checks can be used to control which checks are performed for variables of each supported class. The helper function setChecks() makes it easy to add or remove certain checks for certain variable classes while allCheckFunctions() returns an overview of all available checkFunctions to choose from.

The following bit of code shows how to change the choice of checks for numeric variables to not include a check for outliers (i.e. the checkFunction identifyOutliers):

makeDataReport(bigPresidentData, replace = TRUE,
               checks = setChecks(numeric = defaultNumericChecks(remove="identifyOutliers")))

# or, without the use of defaultNumericChecks():
makeDataReport(bigPresidentData, replace = TRUE,
               checks = setChecks(numeric = "identifyMissing"))

Now, it’s your turn to make a makeDataReport() call that does not check whether character variables have less than 5 unique values. You can proceed as follows:

  1. Use allCheckFunctions() to find out which checkFunction does the check for less than 5 unique values. Check that this function is indeed among the default checkFunctions for character variables by calling defaultCharacterChecks().

  2. Use the checks argument, setChecks() and defaultCharacterChecks() to remove the “less than 5 unique values”-check from the checks performed for character variables. Look at the overview table on the first page of your report. Can you find a difference?

The default plots used in a dataMaid report are made using the ggplot2 package. However, you can change the plot style to base R graphics if you please by using the visuals argument in makeDataReport().

  1. Use allVisualFunctions() to identify the name of the visualFunction that you need to use if you want base R graphics style plots.

  2. Look at the documentation for setVisuals(), e.g. by calling ?setVisuals. Here, you see an argument called all. Call setVisuals() in the console, using the all argument to specify base R graphics style plots for all variable types. Try also calling setVisuals() with the new function supplied to the factor and character arguments only.

  3. Use makeDataReport()’s visuals argument and setVisuals() to obtain a data report where character and factor variables use base R graphics plots, while all other variable classes use ggplot2 plots (one possible call is sketched after this list).

  4. The choice of summaries included for each variable class in the report can also be controlled, and the manner in which this is done is very similar to the procedures you have tried for checks and visuals above. Experiment with the functions setSummaries(), allSummaryFunctions() and defaultCharacterSummaries() in order to remove “Mode” from the summaries listed for character variables in the data report (using the summaries argument).

  5. We would now like a report that only displays the results of checks (no visuals or summaries) and only lists variables that actually were found to have potential problems. But for each problem found, we would like to see all problematic values listed, not just the first 10, as is currently the case. Moreover, we would like to change the report title (as displayed on the front page) to “Problem flagging report”, and we would also like to have “problemflagging” appended to the file name of the report, so that we can easily tell it apart from the usual data report.

    Look at the documentation for makeDataReport() and try to produce the report described here.
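
One possible call for items 3 and 4 combined. It assumes that basicVisual is the base R graphics visualFunction listed by allVisualFunctions(), that the “Mode” line for character variables comes from the centralValue summaryFunction, and that defaultCharacterSummaries() accepts a remove argument like its checks counterpart; verify all three with allVisualFunctions(), allSummaryFunctions() and the documentation before relying on them:

makeDataReport(bigPresidentData, replace = TRUE,
               visuals = setVisuals(character = "basicVisual",
                                    factor = "basicVisual"),
               summaries = setSummaries(
                 character = defaultCharacterSummaries(remove = "centralValue")))

For item 5, a possible starting point, assuming makeDataReport() has mode, onlyProblematic, maxProbVals, reportTitle and vol arguments that behave as their names suggest (check help(makeDataReport) for the exact details):

makeDataReport(bigPresidentData, mode = "check", onlyProblematic = TRUE,
               maxProbVals = Inf, reportTitle = "Problem flagging report",
               vol = "problemflagging")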

Exercise 2b - using dataMaid interactively

The primary intent of the dataMaid package is to generate the combined report produced by makeDataReport(). However, that report is produced by a series of check, summary, and visualization functions that can also be called directly. This enables us to work interactively with the functions that are part of the dataMaid package.

Recall that the allCheckFunctions(), allSummaryFunctions(), and allVisualFunctions() show the built-in functions that are used for checks, summaries, and visuals, respectively.

  1. Run check(bigPresidentData$ageAtInauguration), visualize(bigPresidentData$ageAtInauguration), and summarize(bigPresidentData$ageAtInauguration) and verify that you obtain information identical to what you saw in the report previously generated by makeDataReport(bigPresidentData).

Look at str(check(bigPresidentData$ageAtInauguration)). The check() function returns a list with a length matching the number of checks, and each element is itself a list of length 3 containing information about whether a problem was found, the message that is printed in the report, and the list of observations that gave rise to the problem.

  1. Earlier, we saw how we could modify the choice of checks by using the checks argument in combination with the setChecks() function. Modify the manual check for presidencyYears such that it only returns potential outliers.
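
One possible call, assuming check() accepts the same checks argument as makeDataReport():

check(bigPresidentData$presidencyYears,
      checks = setChecks(numeric = "identifyOutliers"))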

When a particular value is flagged as a potential error we might want to report this back to the data provider so they can verify if the data is correct. In order to do this we need to be able to identify the rows that give rise to these potential problems.

  1. Use check() to identify the values that are thought to be potential outliers for the presidencyYears variable and store the result in an object problems.

    Recall that problems consists of a list (of 1 element) of lists. Return the vector with the problemValues and save it to a vector probs. [ Hint: You can reference elements of a list by position using double square brackets [[ ]], and named elements of a list using the $ operator. ]

  2. Use the vector of potential error values to identify the indices (rows) of bigPresidentData which contain values for presidencyYears that are part of the vector of problem values, probs. [ Hint: the %in% operator tests which elements of one vector occur in another vector; combine it with which() to obtain the corresponding indices. ]

    Print all the rows that were just identified.
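
Putting these steps together, one possible sketch (it assumes that the outlier check is the only check performed, so that its result is the first element of the returned list):

problems <- check(bigPresidentData$presidencyYears,
                  checks = setChecks(numeric = "identifyOutliers"))
probs <- problems[[1]]$problemValues                        # the flagged values
rows <- which(bigPresidentData$presidencyYears %in% probs)  # rows containing them
bigPresidentData[rows, ]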

We can also use check() on a full data frame. This will return a list of lists of lists. We can use this to extract check results across different variables.

For the next 3 questions we will use the toyData dataset.

  1. Run the following two lines to load the toyData dataset and print it

    data(toyData)
    toyData
    ## # A tibble: 15 x 6
    ##    pill  events region change id    spotifysong
    ##    <fct>  <dbl> <fct>   <dbl> <fct> <fct>      
    ##  1 red     1.00 a      -0.626 1     Irrelevant 
    ##  2 red     1.00 a       0.184 2     Irrelevant 
    ##  3 red     1.00 a      -0.836 3     Irrelevant 
    ##  4 red     2.00 a       1.60  4     Irrelevant 
    ##  5 red     2.00 a       0.330 5     Irrelevant 
    ##  6 red     6.00 b      -0.820 6     Irrelevant 
    ##  7 red     6.00 b       0.487 7     Irrelevant 
    ##  8 red     6.00 b       0.738 8     Irrelevant 
    ##  9 red   999    c       0.576 9     Irrelevant 
    ## 10 red    NA    c      -0.305 10    Irrelevant 
    ## 11 blue    4.00 c       1.51  11    Irrelevant 
    ## 12 blue   82.0  .       0.390 12    Irrelevant 
    ## 13 blue   NA    " "    -0.621 13    Irrelevant 
    ## 14 <NA>  NaN    other  -2.21  14    Irrelevant 
    ## 15 <NA>    5.00 OTHER   1.12  15    Irrelevant

We will try to identify all values in the dataset that might be wrongly encoded missing values.

  1. Do a check() on the full toyData data frame, but only consider the identifyMissing check. [ Hint: You can use the all = "identifyMissing" argument to setChecks(), as described previously. ]

  2. (Somewhat R-tricky) Return a vector of values that are potential missing values across the full dataset. [ Hint: You can use sapply() to extract the problemValues for each element in the list, and then wrap it in unlist() to combine everything into a single vector. ]

    This approach is especially useful if you have data in wide format and you want to summarize problem values across different repeats of the same underlying variable (e.g. the same quantity measured over time).
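
A possible sketch of both steps, assuming that each element of the data-frame-level check() result contains an identifyMissing entry with its problemValues:

fullCheck <- check(toyData, checks = setChecks(all = "identifyMissing"))
# extract the problem values for each variable and combine them into one vector
unlist(sapply(fullCheck, function(x) x$identifyMissing$problemValues))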

Exercise 2c - customizing with new functions

Build your own summaryFunction and checkFunction

Next up is the task of writing custom summaryFunctions and checkFunctions. Note that a complete guide to writing such extensions is available in the vignette: vignette("extending_dataMaid"). We do not cover writing visualFunctions in these exercises, but we recommend that you look through the vignette if you are particularly interested in this subject.

Make a summaryFunction: refCat()

First up is building a summaryFunction, refCat(), which will give the reference category for factor variables, i.e. the first level. For instance, for the variable pill in toyData, the reference category is blue:

library(dataMaid)
toyData$pill
##  [1] red  red  red  red  red  red  red  red  red  red  blue blue blue <NA>
## [15] <NA>
## Levels: blue red

Below, we have provided a template for building refCat() that you can use as a starting point. There are a few points to note about this function:

  • The final call to summaryResult() changes the class of the output, so that it will be indistinguishable from that of the built-in summaryFunctions.
  • The list argument used for summaryResult() has three entries, feature, result and value. In the example below, the contents of result and value are identical, but generally, result should be used for the result one wants printed in the data report, while value is not mandatory, but can be used to store whatever we are summarizing in its original data class.
refCat <- function(v, ...) {
  val <- #the reference category of the variable v
  res <- val
  summaryResult(list(feature = "Reference category", result = res,
                     value = val))
}
  1. Use the template to finish writing refCat(). Call it on pill from toyData in order to test whether it is working (one possible completion is sketched at the end of this subsection).

  2. Now, we will add refCat() to dataMaid’s known summaryFunctions. First, try calling allSummaryFunctions() to see what summary functions are already available. We want refCat() to be added to the output of this function. This is done by use of summaryFunction(). Fill in the missing pieces in the code below, run it, and try calling allSummaryFunctions() again afterwards.

refCat <- summaryFunction(refCat,
  description = #Text describing what the summaryFunction does,
    ,
  classes = c(#[vector of data types that the function is intended for]
    )
  )
  3. Lastly, we will use refCat() from within makeDataReport(). Run the following bit of code, which adds refCat() to the summaries used for factor variables, and look at the result.
makeDataReport(bigPresidentData, 
               summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")),
               vol = "_withRefCat")
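
For reference, here is one possible completion of the two refCat() templates above (a sketch; the description text and the choice of classes are just suggestions):

refCat <- function(v, ...) {
  val <- levels(v)[1]  # the reference category is the first level of the factor
  res <- val
  summaryResult(list(feature = "Reference category", result = res,
                     value = val))
}

refCat <- summaryFunction(refCat,
  description = "Reference category (first level) of a factor variable",
  classes = "factor")

refCat(toyData$pill)  # test it on the pill variable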

Make a checkFunction: identifyNonStartCase

We will now write a check function that checks whether character variables are written in start case (i.e., “With Capital Letters For Each New Word”). This might be relevant for variables containing names. Again, we give a template below that can be used as a starting point.

identifyNonStartCase <- function(v, nMax, maxDecimals, ...) {
  
  #do the check
  problemValues <- #vector of values in v that are problematic, i.e. that are not 
    #written in start case. If no problem is encountered, it should be set to NULL
  
  problem <- #is there a problem? TRUE/FALSE
  
  problemStatus <- list(problem = problem,
                        problemValues = problemValues)

  problemMessage <- #Message that is printed prior to listing
                    # problem values in the dataMaid output,
                    # ending with a colon
    
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)

  checkResult(list(problem = problem,
    message = outMessage,
    problemValues = problemValues))
}

identifyNonStartCase <- checkFunction(identifyNonStartCase,
  description = #Some text describing the checkFunction
    ,
  classes = c(#[the data types that this function is intended to be used for]
    )
)

A few comments for the helper functions seen in the template:

  • messageGenerator() helps create streamlined, properly escaped messages that can be printed in the data report without any problems. It also omits extra problemValues if nMax is smaller than the encountered number of problemValues.
  • checkResult() works just like summaryResult() and converts the output class. Note that problemValues must be in their original data class (i.e. the raw values from v), so that the contents of problemValues can be used to identify where the problems were found.
  • checkFunction() is also like its summaryFunction cousin mentioned above: it converts the function into a checkFunction and makes sure that it becomes available in an allCheckFunctions() call.
  1. Finish identifyNonStartCase() (one possible completion is sketched after this list). A few tips that might be helpful:

    • One strategy for a check like the current one is first to manipulate the variable so that it adheres to the check rule (in this case: is in start case) and then to compare the original version of the variable with the new one. Any entries that differ between the two versions of the variable are then deemed problemValues, and the problem indicator must be set to TRUE if length(problemValues) > 0.
    • strsplit() splits character strings into smaller parts. Try calling it on a character string with split = " ".
    • toupper() and tolower() change the case of letters to upper and lower case, respectively.
    • If you are familiar with regular expressions, this is of course also a very good strategy for identifying non-start case values.
  2. Use identifyNonStartCase() on the variable stateOfBirth from bigPresidentData. Try using it on all character variables in bigPresidentData by use of the function check().

  3. Add identifyNonStartCase() to the checks used on character variables in a makeDataReport() call on bigPresidentData. Use the argument vol = "_nonStartCase" to give the new report a different file name than the old ones. Compare the new report with an old version. Can you find the description you wrote for identifyNonStartCase() when calling checkFunction()?
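
For reference, here is one possible completion of the identifyNonStartCase() template (a sketch: it assumes start case can be approximated by upper-casing the first letter of each space-separated word and lower-casing the rest, so names such as “McCarthy” would also be flagged):

# helper: convert each space-separated word to "First letter upper, rest lower"
toStartCase <- function(x) {
  sapply(strsplit(x, split = " "), function(words) {
    paste(paste0(toupper(substring(words, 1, 1)),
                 tolower(substring(words, 2))), collapse = " ")
  })
}

identifyNonStartCase <- function(v, nMax = 10, maxDecimals = 2, ...) {
  v <- as.character(v)
  differs <- !is.na(v) & v != toStartCase(v)   # entries that change when converted
  problemValues <- NULL
  if (any(differs)) problemValues <- unique(v[differs])
  problem <- !is.null(problemValues)
  problemStatus <- list(problem = problem, problemValues = problemValues)
  problemMessage <- "The following values are not written in start case:"
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
  checkResult(list(problem = problem,
                   message = outMessage,
                   problemValues = problemValues))
}

identifyNonStartCase <- checkFunction(identifyNonStartCase,
  description = "Identify character values that are not written in start case",
  classes = "character")

identifyNonStartCase(bigPresidentData$stateOfBirth)  # try it out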

Exercise 3

In order to solve the exercises below, you will need to use logical operators in R. Note that their documentation is available through a ?Logic call. Here is a brief summary of the most commonly used logical operators:

  • &: And. x & y evaluates to TRUE if both x and y evaluate to TRUE.
  • |: Or. x | y evaluates to TRUE if either x or y evaluates to TRUE.
  • !: Not. !x evaluates to TRUE if x evaluates to FALSE (and vice versa).

You will also need to use relational operators, which are documented in ?Comparison:

  • <: Strictly smaller than.
  • >: Strictly greater than.
  • <=: Smaller than or equal to.
  • >=: Greater than or equal to.
  • ==: Equals - note the double equality sign!
  • !=: Not equal.

Note that these relational operators are not just implemented for comparing numbers, but also for comparing dates.
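
For example, Date objects compare chronologically, so a constraint such as birthday < dateOfDeath can be expressed directly:

as.Date("1732-02-22") < as.Date("1799-12-14")
## [1] TRUE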

Row-wise constraints

Below, we list some sets of variables in bigPresidentData for which logical row-wise constraints must be fulfilled. For each set of variables, identify what logical constraint(s) must be satisfied and check if it does indeed hold, using validator() and confront() from validate. We have provided an example below that you can use for inspiration.

  • birthday, dateOfDeath, presidencyBeginDate and presidencyEndDate
  • presidencyBeginDate, presidencyEndDate and presidencyYears
  • birthday, ageAtInauguration and presidencyBeginDate
  • orderOfPresidency and presidencyBeginDate (tricky - remember to handle missing data)

Tip: The floor() function rounds a number down to the nearest integer. Tip: The rank() function returns the ranks of the values in a vector.

Try to experiment with the different functions available for inspecting confrontations, i.e. summary(), aggregate(), sort(), values(), barplot(), errors() and warnings(), in order to get an idea of what each of them does.

Example

We wish to ensure that all presidents are born at an earlier date than their time of death (if relevant). This can be done in validate by creating a validator object with the appropriate logical statement and confronting the data with it:

library(validate)
## 
## Attaching package: 'validate'
## The following objects are masked from 'package:dataMaid':
## 
##     description, description<-
## The following object is masked from 'package:dplyr':
## 
##     expr
#define check
bdaycheck <- validator(birthday < dateOfDeath)
#save confrontation
bdayconfr <- confront(bigPresidentData, bdaycheck)

#look at confrontation
summary(bdayconfr)
##   name items passes fails nNA error warning             expression
## 1   V1    47     41     0   6 FALSE   FALSE birthday < dateOfDeath

We see that all observations either have missing information in the check (nNA = 6) or pass it (passes = 41). We can look at the results of the check together with the relevant variables, birthday and dateOfDeath, by using the values() function:

#look at contents of values(bdayconfr)
head(values(bdayconfr))
##        V1
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,]   NA
#view president names, birthday, date of death and bdayconfr values
#together:
View(cbind(bigPresidentData[, c("firstName", "lastName", "birthday", "dateOfDeath")],
           values(bdayconfr)))

Here, we see that the NA values from the confrontation correspond to presidents who are still alive and therefore have dateOfDeath = NA.

One at a time, please! (advanced)

There can only be one president at a time. However, a mistake in the data has caused two presidencies to overlap. Locate the error using validator() and confront() from validate.

Tip: It may be helpful to begin by sorting the data according to presidencyBeginDate.


Exercise 4

Until now we have identified a number of errors and inconsistencies in the bigPresidentData. For each data error you encountered above, you should:

  1. Correct the error
  2. Check the resulting dataset to ensure that you did indeed clean up the data errors.

When all the corrections are completed you should be able to run all the error checks from dataMaid and validate and end up without any critical errors.

Remember to follow rules 1 and 2 and consider how these corrections are best documented.

Exercise 4b

For now we assume that we have a dataset that has been screened and is clean enough to be used for subsequent analyses.

  1. Use the makeCodebook() function to produce a final codebook that could be passed on as documentation of the dataset for the data analysis.
  2. Use the option to set shortDescription attributes for the dataset to explain that:
    1. The information in favoriteNumber has been obtained by consulting a Ouija board or, when that failed, by just typing in a number; consequently, the accuracy may be low.
    2. For assassinationAttempt, 1 means yes and 0 means no.
    3. For firstName, it is literally the first name: no middle names or initials. Also try using the labels attribute to add some variable labels of your own choosing. Then run the makeCodebook() command again.
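
A possible sketch for setting the attributes, assuming that short descriptions are stored in a column attribute named shortDescription and variable labels in an attribute named label (the exercise text calls it the labels attribute); check the makeCodebook() documentation to confirm the exact attribute names:

attr(bigPresidentData$favoriteNumber, "shortDescription") <-
  "Obtained by consulting a Ouija board or, when that failed, by typing in a number; accuracy may be low"
attr(bigPresidentData$assassinationAttempt, "shortDescription") <-
  "1 means yes, 0 means no"
attr(bigPresidentData$firstName, "shortDescription") <-
  "Literally the first name; no middle names or initials"
attr(bigPresidentData$firstName, "label") <- "First name"  # assumed label attribute

makeCodebook(bigPresidentData, replace = TRUE)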