Exercise 1

For this initial exercise we will be trying to identify errors in a data frame containing information about the US presidents. First we install the packages needed to get the dataset that we will be working with. If you have already installed dataMaid from home, you can skip this step.

install.packages("dataMaid")

Now we have installed the dataMaid package which contains the raw dataset, presiData. We can see the first couple of rows of the dataset with the head command.

library("dataMaid")
load(url("http://biostatistics.dk/eRum2018/data/presiData.rda"))
head(presiData)
##     lastName firstName orderOfPresidency   birthday dateOfDeath
## 10     Tyler      John                10 1790-03-29  1862-01-18
## 2      Adams      John                 2 1735-10-30  1826-07-04
## 25  McKinley   William                25 1843-01-29  1901-09-14
## 32 Roosevelt  Franklin                32 1882-01-30  1945-04-12
## 14    Pierce  Franklin                14 1804-11-03  1869-10-08
## 24 Cleveland    Grover                24 1837-03-18  1908-06-24
##     stateOfBirth assassinationAttempt  sex ethnicity presidencyYears
## 10      Virginia                    0 Male Caucasian               3
## 2  Massachusetts                    0 Male Caucasian               3
## 25          Ohio                    1 Male Caucasian               4
## 32      New York                    1 Male Caucasian              12
## 14 New Hampshire                    0 Male Caucasian               4
## 24    New Jersey                    0 Male Caucasian               4
##    ageAtInauguration favoriteNumber
## 10                51           1+0i
## 2                 61           4+0i
## 25                54           3+0i
## 32                51           6+0i
## 14                48           4+0i
## 24                55           2+0i

We are being told by our research collaborators that the dataset should contain information on 11 variables related to the US presidents. The variables are

  • lastName The surname of the president.
  • firstName The first name of the president.
  • orderOfPresidency A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
  • birthday The date of birth of the president.
  • dateOfDeath The date of death of the president.
  • stateOfBirth A character variable with the state in which the president was born.
  • assassinationAttempt A text variable indicating whether there was an assassination attempt (1) or not (0) on the president.
  • sex A factor variable with the sex of the president.
  • ethnicity A factor variable with the ethnicity of the president.
  • presidencyYears A numeric variable with the duration of the presidency, in years.
  • ageAtInauguration The age of the president at inauguration.
  • favoriteNumber A complex type variable with a fictional favorite number for each president.

Try to find and document as many errors as you can in the presiData. In particular, focus on validity (the data conform to the correct syntax with the right format, type, and range), consistency, accuracy, and uniqueness.

[Hint: If you are new to R then you can use the functions str() and print() to inspect the data frame, and the $ operator to extract individual variables from it.]
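As a minimal sketch of the hint above, the following uses a small, made-up data frame (toy, not presiData) to show how str(), print() and $ might be combined with a few base R helpers when looking for problems with type, range and uniqueness. All names and values in toy are invented for illustration.

```r
# A small, made-up data frame standing in for presiData (values are invented)
toy <- data.frame(
  lastName = c("Adams", "Tyler", "Adams", " Pierce"),
  ageAtInauguration = c(61, 51, 61, 480),
  sex = factor(c("Male", "Male", "Male", "male")),
  stringsAsFactors = FALSE
)

str(toy)     # one line per variable: its type and the first few values
print(toy)   # the full data frame

ages <- toy$ageAtInauguration       # $ extracts a single variable as a vector
ageRange <- range(ages)             # implausible minimum or maximum values stand out
sexLevels <- levels(toy$sex)        # inconsistent coding shows up as extra levels
dupRows <- which(duplicated(toy))   # rows that exactly repeat an earlier row

ageRange
sexLevels
dupRows
```

The same calls can be pointed at presiData one variable at a time.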

Exercise 2

Here we will use the makeDataReport() function from the dataMaid package to screen the dataset for obvious and non-obvious errors. We will continue working on the presiData in order to find errors that we previously missed.

Note that exercise 2 contains a lot of exercises. We do not expect you to be able to solve all of them in the time given. Remember the primary focus is to identify potential errors in the dataset.

  1. Run the makeDataReport() function to make your own report for the presiData and look through it. What legitimate errors are found? What warnings are unnecessary?

  2. Start by verifying the validity of the data (format, type, and range). When dataMaid encounters a variable type that is unknown then it prints a note and skips the variable. If we want to perform specific checks for a variable with an unknown type then we must fix the class in the data frame (when relevant) or alternatively force dataMaid to consider a specific class as if it were another class.

    Handle the unusual variable class used for the variables lastName and firstName by changing their classes to something more standard. This can be done using the code

    class(presiData$firstName) <- "character"
    class(presiData$lastName) <- "character"

    Try calling help(makeDataReport) and look at the documentation for the argument treatXasY. Use this argument in a new call to makeDataReport() so that for example the favoriteNumber variable will be handled like a numeric variable.

    [Hint: you may need to use the replace=TRUE argument too to overwrite any previous reports.]
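The class replacement used for firstName and lastName can be tried out on a small vector first. The sketch below uses only base R; the class name "Name" is just an example of a non-standard class and is an assumption, not something checked against presiData.

```r
# A character vector carrying a non-standard class ("Name" is an example class)
x <- c("Adams", "Tyler")
class(x) <- "Name"

classBefore <- class(x)                  # the non-standard class
isStandard <- inherits(x, "character")   # FALSE: class-based code does not see a character

# Same kind of replacement as used for firstName and lastName above
class(x) <- "character"
classAfter <- class(x)

classBefore
isStandard
classAfter
```

The underlying values are untouched; only the class that tools such as dataMaid dispatch on has changed.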

Note how everything is documented and can be used in collaboration with a research partner.

The argument checks can be used to control which checks are performed for variables of each supported class. The helper function setChecks() makes it easy to add or remove certain checks for certain variable classes while allCheckFunctions() returns an overview of all available checkFunctions to choose from.

The following bit of code shows how to change the choice of checks for numeric variables to not include a check for outliers (i.e. the checkFunction identifyOutliers):

makeDataReport(presiData, replace = TRUE,
               checks = setChecks(numeric = defaultNumericChecks(remove="identifyOutliers")))

# or, without using defaultNumericChecks():
makeDataReport(presiData, replace  = TRUE,
               checks = setChecks(numeric = "identifyMissing"))

Now, it’s your turn to make a makeDataReport() call that does not check whether or not character variables have less than 5 unique levels. You can proceed as follows:

  1. Use allCheckFunctions() to find out which checkFunction does the check for less than 5 unique values. Check that this function is indeed among the default checkFunctions for character variables by calling defaultCharacterChecks().

  2. Use the checks argument, setChecks() and defaultCharacterChecks() to remove the “less than 5 unique values”-check from the checks performed for character variables. Look at the overview table on the first page of your report. Can you find a difference?

The default plots used in a dataMaid report are made using the ggplot2 package. However, you can change the plot style to base R graphics if you please by using the visuals argument in makeDataReport().

  1. Use allVisualFunctions() to identify the name of the visualFunction that you need to use if you want base R graphics style plots.

  2. Look at the documentation for setVisuals(), e.g. by calling ?setVisuals. Here, you see an argument called all. Call setVisuals() in the console using the all argument to select base R graphics style plots for every variable class. Try also calling setVisuals() with the new function in the factor and character arguments only.

  3. Use makeDataReport()’s visuals argument and setVisuals() to obtain a data report where character and factor variables use base R graphics plots, while all other variable classes use ggplot2 plots.

  4. The choice of summaries included for each variable class in the report can also be controlled, and the manner in which this is done is very similar to the procedures you have tried for checks and visuals in the above. Experiment with the functions setSummaries(), allSummaryFunctions() and defaultCharacterSummaries() in order to remove “Mode” from the summaries listed for such variables in the data report (using the argument summaries).

  5. We would now like a report that only displays the results of checks (no visuals or summaries) and only lists variables that actually were found to have potential problems. But for each problem found, we would like to see all problematic values listed, not just the first 10, as is currently the case. Moreover, we would like to rename the report title (as displayed on the front page) to be “Problem flagging report”, and we would also like to have “problemflagging” appended to the file name of the report, so that we can easily tell it apart from the usual data report.

    Look at the documentation for makeDataReport() and try to produce the report described here.

Exercise 2b - using dataMaid interactively

The primary intent of the dataMaid package is to generate the combined report produced by makeDataReport(). However, the contents of that report are produced by a series of check, summary, and visualization functions that can also be called directly. This enables us to work interactively with the functions that are part of the dataMaid package.

Recall that allCheckFunctions(), allSummaryFunctions(), and allVisualFunctions() show the built-in functions that are used for checks, summaries, and visuals, respectively.

  1. Run check(presiData$ageAtInauguration), visualize(presiData$ageAtInauguration), and summarize(presiData$ageAtInauguration) and verify that you obtain information identical to what you saw in the report previously generated by makeDataReport(presiData).

Look at str(check(presiData$ageAtInauguration)). The check() function returns a list with a length matching the number of checks, and each element is itself a list of length 3 containing information about whether a problem was found, the message that is printed in the report, and the list of observations that gave rise to the problem.
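The structure described above can be explored on a hand-built stand-in before working with a real result. In this sketch, mockResult and its contents (including the message text) are made up to mirror the description; the final part runs only if dataMaid is installed and applies str() to an actual check() result.

```r
# Mock of the described structure: one element per check, each a list with
# `problem`, `message` and `problemValues` (all values here are invented)
mockResult <- list(
  identifyMissing = list(problem = FALSE, message = "", problemValues = NULL),
  identifyOutliers = list(
    problem = TRUE,
    message = "Made-up message about possible outliers: 99.",
    problemValues = 99
  )
)

length(mockResult)                                # number of checks performed
mockResult$identifyOutliers$problem               # was a problem found?
mockResult[["identifyOutliers"]]$problemValues    # values behind the problem

# Names of the checks that flagged something
flagged <- names(Filter(function(res) isTRUE(res$problem), mockResult))
flagged

# With dataMaid installed, the same navigation applies to a real result
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)
  data(toyData)
  realResult <- check(toyData$events)
  str(realResult, max.level = 2)
}
```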

  1. Earlier, we saw how we could modify the choice of checks by using the checks argument in combination with the setChecks() function. Modify the manual check for presidencyYears such that it only returns potential outliers.

When a particular value is flagged as a potential error we might want to report this back to the data provider so they can verify if the data is correct. In order to do this we need to be able to identify the rows that give rise to these potential problems.

  1. Use check() to identify the values that are thought to be potential outliers for the presidencyYears variable and store the result in an object problems.

    Recall that problems consists of a list (of 1 element) of lists. Return the vector with the problemValues and save them to a vector probs. [ Hint: You can reference elements of a list using double square brackets [[]] or the $ operator ]

  2. Use the vector of potential error values to identify the indices (rows) of presiData which contain values for presidencyYears that are part of the vector of problem values, probs. [ Hint: the %in% operator returns a logical vector indicating which elements of one vector are found in another vector; wrap it in which() to obtain the corresponding indices. ]

    Print all the rows that were just identified.
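The row lookup in the steps above might be sketched as follows on a made-up data frame. Here toy and the contents of probs are invented for illustration; in the exercise, probs would come from the problemValues returned by check().

```r
# Toy stand-in for presiData; names and numbers are made up
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  presidencyYears = c(4, 8, 12, 4, Inf),
  stringsAsFactors = FALSE
)
probs <- c(12, Inf)   # pretend these were returned as problemValues

isProblem <- toy$presidencyYears %in% probs   # logical vector, one entry per row
problemRows <- which(isProblem)               # row indices of the TRUE entries

toy[problemRows, ]   # all columns for the flagged rows
```

Keeping the row indices, rather than only the values, makes it easier to report the full records back to the data provider.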

We can also use check() on a full data frame. This will return a list of lists of lists. We can use this to extract check results across different variables.

For the next 3 questions we will use the toyData dataset.

  1. Run the following two lines to load the toyData dataset and print it

    data(toyData)
    toyData
    ## # A tibble: 15 x 6
    ##    pill  events region change id    spotifysong
    ##    <fct>  <dbl> <fct>   <dbl> <fct> <fct>      
    ##  1 red        1 a      -0.626 1     Irrelevant 
    ##  2 red        1 a       0.184 2     Irrelevant 
    ##  3 red        1 a      -0.836 3     Irrelevant 
    ##  4 red        2 a       1.60  4     Irrelevant 
    ##  5 red        2 a       0.330 5     Irrelevant 
    ##  6 red        6 b      -0.820 6     Irrelevant 
    ##  7 red        6 b       0.487 7     Irrelevant 
    ##  8 red        6 b       0.738 8     Irrelevant 
    ##  9 red      999 c       0.576 9     Irrelevant 
    ## 10 red       NA c      -0.305 10    Irrelevant 
    ## 11 blue       4 c       1.51  11    Irrelevant 
    ## 12 blue      82 .       0.390 12    Irrelevant 
    ## 13 blue      NA " "    -0.621 13    Irrelevant 
    ## 14 <NA>     NaN other  -2.21  14    Irrelevant 
    ## 15 <NA>       5 OTHER   1.12  15    Irrelevant

We will try to identify all the values found in the dataset that are possible missing values that have been wrongly encoded.

  1. Do a full check() on the full toyData data frame but only consider the identifyMissing check. [ Hint: Note you can use the all="identifyMissing" argument to setChecks() as described previously ]

  2. (Somewhat tricky) Return a vector of values that are potential missing values across the full dataset. [ Hint: You can use sapply() to extract the problemValues for each element in the list, and then wrap it in unlist() to combine it all into a single vector.]

    This approach is especially useful if you have data in wide format and you want to summarize problem values across different repeats of the same underlying variable (e.g., if it is the same value measured over time).
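The extraction pattern from the hint can be tried on a hand-built nested list first. In the sketch below, mockChecks is a made-up stand-in for a data frame check result (a list of variables, each holding a list of checks, each holding a result list). It uses lapply() rather than sapply() so that the intermediate result is always a list, whatever the number of problem values per variable.

```r
# Made-up stand-in for a check result on a whole data frame
mockChecks <- list(
  events = list(identifyMissing = list(
    problem = TRUE, message = "...", problemValues = c(999, NaN))),
  region = list(identifyMissing = list(
    problem = TRUE, message = "...", problemValues = c(".", " "))),
  change = list(identifyMissing = list(
    problem = FALSE, message = "", problemValues = NULL))
)

# One entry per variable, holding that variable's problem values
perVariable <- lapply(mockChecks, function(varRes) {
  varRes$identifyMissing$problemValues
})

# Flatten into one vector; mixing numbers and text coerces everything to character
allProbs <- unlist(perVariable)
allProbs
```

The names of the flattened vector record which variable each value came from.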

Exercise 2c - writing custom functions

Next up is the task of writing custom summaryFunctions and checkFunctions. Note that a complete guide to writing such extensions is available in the vignette: vignette("extending_dataMaid"). We do not cover writing visualFunctions in the exercises, but we recommend that you look through the vignette if you are particularly interested in this subject.

Make a summaryFunction: refCat()

First up is building a summaryFunction, refCat(), which will give the reference category for factor variables, i.e. the first level. For instance, for the variable pill in toyData, the reference category is blue:

library(dataMaid)
toyData$pill
##  [1] red  red  red  red  red  red  red  red  red  red  blue blue blue <NA>
## [15] <NA>
## Levels: blue red

Below, we have provided a template for building refCat() that you can use as a starting point. There are a few points to note about this function:

  • The final call to summaryResult() changes the class of the output, so that it will be indistinguishable from that of the built-in summaryFunctions.
  • The list argument used for summaryResult() has three entries: feature, result and value. In the example below, the contents of result and value are identical, but generally, result should be used for the result one wants printed in the data report, while value is not mandatory, but can be used to store whatever we are summarizing in its original data class.

refCat <- function(v, ...) {
  val <- #the reference category of the variable v
  res <- val
  summaryResult(list(feature = "Reference category", result = res,
                     value = val))
}

  1. Use the template to finish writing refCat(). Call it on pill from toyData in order to test whether it is working.

  2. Now, we will add refCat() to dataMaid's known summaryFunctions. First, try calling allSummaryFunctions() to see what summary functions are already available. We want refCat() to be added to the output of this function. This is done by use of summaryFunction(). Fill in the missing pieces in the code below, run it, and try calling allSummaryFunctions() again afterwards.

refCat <- summaryFunction(refCat,
  description = #Text describing what the summaryFunction does,
    ,
  classes = c(#[vector of data types that the function is intended for]
    )
  )

  1. Lastly, we will use refCat() from makeDataReport(). Run the following code, which adds refCat() to the summaries used for factor variables, and look at the result.

makeDataReport(presiData,
               summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")),
               vol = "_withRefCat")
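The summaryResult() and summaryFunction() pattern used for refCat() can also be seen in a different summary. The sketch below is an illustration under stated assumptions, not part of the exercise solution: countLevels() and levelCount() are hypothetical names, and the dataMaid part runs only if the package is installed.

```r
# Plain base R helper holding the actual computation (hypothetical name)
levelCount <- function(v) length(levels(v))

if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)

  # Hypothetical summaryFunction: reports the number of levels of a factor
  countLevels <- function(v, ...) {
    val <- levelCount(v)
    summaryResult(list(feature = "Number of levels", result = val,
                       value = val))
  }

  # Register the function so that allSummaryFunctions() can list it
  countLevels <- summaryFunction(countLevels,
    description = "Count the number of levels",
    classes = c("factor"))

  print(countLevels(factor(c("red", "blue", "red"))))
}
```

Keeping the computation in a small helper makes it easy to test the logic separately from the report machinery.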

Make a checkFunction: identifyNonStartCase

We will now write a check function that checks whether character variables are written in start case (i.e., “With Capital Letters For Each New Word”). This might be relevant for variables containing names. Again, we give a template below that can be used as a starting point.

identifyNonStartCase <- function(v, nMax, maxDecimals, ...) {
  
  #do the check
  problemValues <- #vector of values in v that are problematic, i.e. that are not 
    #written in start case. If no problem is encountered, it should be set to NULL
  
  problem <- #is there a problem? TRUE/FALSE
  
  problemStatus <- list(problem = problem,
                        problemValues = problemValues)

  problemMessage <- #Message that is printed prior to listing
                    # problem values in the dataMaid output,
                    # ending with a colon
    
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)

  checkResult(list(problem = problem,
    message = outMessage,
    problemValues = problemValues))
}

identifyNonStartCase <- checkFunction(identifyNonStartCase,
  description = #Some text describing the checkFunction
    ,
  classes = c(#[the data types that this function is intended to be used for]
    )
)

A few comments for the helper functions seen in the template:

  • messageGenerator() helps create streamlined, properly escaped messages that can be printed in the data report without any problems. It also omits extra problemValues if nMax is smaller than the number of problemValues encountered.
  • checkResult() works just like summaryResult() and converts the output class. Note that problemValues must be in their original data class (i.e. the raw values from v), because then the contents of problemValues can be used to identify where problems were found.
  • checkFunction() is also like its summaryFunction() cousin mentioned above: it converts the function into a checkFunction and makes sure that it becomes available in an allCheckFunctions() call.

  1. Finish identifyNonStartCase(). A few tips that might be helpful:

    • One strategy for a check like the current one is first manipulating the variable so that it adheres to the check rule (in this case: is in start case) and secondly, comparing the original version of the variable with the new one. Any entries that differ between the two versions of the variable are then deemed problemValues, and the problem indicator must be set to TRUE if length(problemValues) > 0.
    • strsplit() splits character strings into smaller parts. Try calling it on a character string with split = " ".
    • toupper() and tolower() change the case of letters to upper and lower case, respectively.
    • If you are familiar with regular expressions, they are of course also a very good tool for identifying non-start case values.

  2. Use identifyNonStartCase() on the variable stateOfBirth from presiData. Try using it on all character variables in presiData by use of the function check().

  3. Add identifyNonStartCase() to the checks used on character variables in a makeDataReport() call on presiData. Use the argument vol = "_nonStartCase" to give the new report a different file name than the old ones. Compare the new report with an old version. Can you find the description you wrote for identifyNonStartCase() when calling checkFunction()?
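The "manipulate, then compare" strategy from the tips can be sketched in base R as follows. toStartCase() is a hypothetical helper introduced here for illustration, and states is a made-up vector. The helper assumes single spaces between words and no missing values, so it is a starting point rather than a complete check.

```r
# Hypothetical helper: convert each string to start case, assuming words are
# separated by single spaces and that x contains no NA values
toStartCase <- function(x) {
  words <- strsplit(x, split = " ", fixed = TRUE)
  vapply(words, function(w) {
    paste0(toupper(substring(w, 1, 1)), tolower(substring(w, 2)),
           collapse = " ")
  }, character(1))
}

states <- c("New York", "ohio", "New jersey", "Virginia")   # made-up values

fixed <- toStartCase(states)               # version that follows the rule
notStartCase <- states[states != fixed]    # original entries that differ
problem <- length(notStartCase) > 0        # is there a problem?

notStartCase
problem
```

Note that the comparison returns the original values, which is what problemValues in the template is meant to hold.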


Exercise 3

Exercise 3a

Until now we have identified a number of errors and inconsistencies in the presiData. For each data error you encountered in the above, you should:

  1. Correct the error
  2. Check the resulting dataset to ensure that you did indeed clean up the data errors.

When all the corrections are completed you should be able to run all the error checks from dataMaid and end up without any critical errors.

Remember to follow rules 1 and 2 and consider how these corrections are best documented.

Exercise 3b

For now we assume that we have a dataset that has been screened and is clean enough to be used for subsequent analyses.

  1. Use the makeCodebook() function to produce a final codebook that could be passed on as documentation of the dataset for the data analysis.
  2. Use the option to set shortDescription attributes for the dataset to explain that:
    1. The information in favoriteNumber has been obtained by consulting a Ouija board or - when that failed - just typing in a number, and consequently the accuracy may be low.
    2. For assassinationAttempt, 1 means yes and 0 means no.
    3. For firstName, it is literally the first name. No middle names or initials.

    Also try using the label attribute to add some variable labels of your own choosing. Run the codebook command again.
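Setting the attributes mentioned above might look like the following sketch. It uses a small, made-up data frame (toy) and base R's attr() only; the label text is an invented example. In the exercise the same assignments would be made on presiData before rerunning the codebook command.

```r
# Made-up stand-in for presiData
toy <- data.frame(
  firstName = c("John", "Franklin"),
  assassinationAttempt = c(0, 1),
  stringsAsFactors = FALSE
)

# Attach a short description and a label to individual variables
attr(toy$assassinationAttempt, "shortDescription") <- "1 means yes and 0 means no"
attr(toy$firstName, "label") <- "First name of the president"

# Read the attributes back to confirm that they were stored on the columns
desc <- attr(toy$assassinationAttempt, "shortDescription")
lab <- attr(toy$firstName, "label")
desc
lab
```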