For this initial exercise we will try to identify errors in a data frame containing information about the US presidents. First we install the package needed to get the dataset that we will be working with. If you have already installed dataMaid at home, you can skip this step.
install.packages("dataMaid")
Now we have installed the dataMaid package, which contains the raw dataset, presiData. We can see the first few rows of the dataset with the head command.
library("dataMaid")
load(url("http://biostatistics.dk/eRum2018/data/presiData.rda"))
head(presiData)
## lastName firstName orderOfPresidency birthday dateOfDeath
## 10 Tyler John 10 1790-03-29 1862-01-18
## 2 Adams John 2 1735-10-30 1826-07-04
## 25 McKinley William 25 1843-01-29 1901-09-14
## 32 Roosevelt Franklin 32 1882-01-30 1945-04-12
## 14 Pierce Franklin 14 1804-11-03 1869-10-08
## 24 Cleveland Grover 24 1837-03-18 1908-06-24
## stateOfBirth assassinationAttempt sex ethnicity presidencyYears
## 10 Virginia 0 Male Caucasian 3
## 2 Massachusetts 0 Male Caucasian 3
## 25 Ohio 1 Male Caucasian 4
## 32 New York 1 Male Caucasian 12
## 14 New Hampshire 0 Male Caucasian 4
## 24 New Jersey 0 Male Caucasian 4
## ageAtInauguration favoriteNumber
## 10 51 1+0i
## 2 61 4+0i
## 25 54 3+0i
## 32 51 6+0i
## 14 48 4+0i
## 24 55 2+0i
We are being told by our research collaborators that the dataset should contain information on 11 variables related to the US presidents. The variables are
lastName
The surname of the president.
firstName
The first name of the president.
orderOfPresidency
A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
birthday
The date of birth of the president.
dateOfDeath
The date of death of the president.
stateOfBirth
A character variable with the state in which the president was born.
assassinationAttempt
A text variable indicating whether there was an assassination attempt (1) or not (0) on the president.
sex
A factor variable with the sex of the president.
ethnicity
A factor variable with the ethnicity of the president.
presidencyYears
A numeric variable with the duration of the presidency, in years.
ageAtInauguration
The age at inauguration.
favoriteNumber
A complex type variable with a fictional favorite number for each president.
Try to find and document as many errors as you can in presiData. In particular, focus on validity (conforms to the correct syntax with the right format, type, and range), on consistency, accuracy, and uniqueness.
[Hint: If you are new to R then you can use the functions str and print, and the $ operator to extract variables from the data frame.]
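As a minimal sketch of the tools named in the hint, the following uses a small made-up data frame (toy and its values are hypothetical stand-ins, not taken from presiData) to show how str, print and $ can be combined with simple base R checks:

```r
# Hypothetical toy data; names and values are made up for illustration
toy <- data.frame(
  lastName = c("Adams", "Tyler", "Adams"),
  ageAtInauguration = c(61, 51, 57),
  stringsAsFactors = FALSE
)

str(toy)                          # compact overview: one line per variable with its class
print(toy)                        # the full data frame

ages <- toy$ageAtInauguration     # extract a single variable as a vector
class(ages)                       # storage class of the variable
range(ages)                       # smallest and largest value (a quick range check)
dup <- duplicated(toy$lastName)   # TRUE for values that repeat an earlier one
```

The same calls can be pointed at presiData and its variables to look for unexpected classes, implausible ranges and repeated entries.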
Here we will use makeDataReport() from the dataMaid package to screen the dataset for obvious and non-obvious errors. We will continue working on presiData from the dataMaid package in order to find errors that we previously missed.
Note that exercise 2 contains a lot of exercises. We do not expect you to be able to solve all of them in the time given. Remember the primary focus is to identify potential errors in the dataset.
Run the makeDataReport() function to make your own report for presiData and look through it. What legitimate errors are found? What warnings are unnecessary?
Start by verifying the validity of the data (format, type, and range). When dataMaid encounters a variable type that is unknown, it prints a note and skips the variable. If we want to perform specific checks for a variable with an unknown type, we must fix the class in the data frame (when relevant) or alternatively force dataMaid to consider a specific class as if it were another class.
Handle the unusual variable class used for the variables lastName and firstName by changing their classes to something more standard. This can be done using the code
class(presiData$firstName) <- "character"
class(presiData$lastName) <- "character"
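To see what such a class assignment does, here is a small sketch on a made-up vector. The class name "Name" and the values are assumptions used purely for illustration:

```r
# Hypothetical vector carrying a non-standard class, here called "Name"
x <- structure(c("Tyler", "Adams"), class = "Name")

class(x)                              # "Name": tools that dispatch on class see an unknown type
before <- inherits(x, "character")    # FALSE: the class attribute hides the underlying type

class(x) <- "character"               # same kind of assignment as used above
after <- inherits(x, "character")     # TRUE: now treated as an ordinary character vector
```

The underlying values are unchanged; only the class that functions dispatch on is different.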
Try calling help(makeDataReport) and look at the documentation for the argument treatXasY. Use this argument in a new call to makeDataReport() so that, for example, the favoriteNumber variable will be handled like a numeric variable.
[Hint: you may need to use the replace=TRUE argument too to overwrite any previous reports.]
Note how everything is documented and can be used in collaboration with a research partner.
The argument checks can be used to control which checks are performed for variables of each supported class. The helper function setChecks() makes it easy to add or remove certain checks for certain variable classes, while allCheckFunctions() returns an overview of all available checkFunctions to choose from.
The following bit of code shows how to change the choice of checks for numeric variables to not include a check for outliers (i.e. the checkFunction identifyOutliers):
makeDataReport(presiData, replace = TRUE,
  checks = setChecks(numeric = defaultNumericChecks(remove = "identifyOutliers")))
# or, without the use of defaultNumericChecks():
makeDataReport(presiData, replace = TRUE,
  checks = setChecks(numeric = "identifyMissing"))
Now, it’s your turn to make a makeDataReport() call that does not check whether or not character variables have fewer than 5 unique levels. You can proceed as follows:
Use allCheckFunctions() to find out which checkFunction does the check for fewer than 5 unique values. Check that this function is indeed among the default checkFunctions for character variables by calling defaultCharacterChecks().
Use the checks argument, setChecks() and defaultCharacterChecks() to remove the “less than 5 unique values”-check from the checks performed for character variables. Look at the overview table on the first page of your report. Can you find a difference?
The default plots used in a dataMaid report are made using the ggplot2 package. However, you can change the plot style to base R graphics if you please by using the visuals argument in makeDataReport().
Use allVisualFunctions() to identify the name of the visualFunction that you need to use if you want base R graphics style plots.
Look at the documentation for setVisuals(), e.g. by calling ?setVisuals. Here, you see an argument called all. Call setVisuals() in the console using the all argument to specify the graphics style plots. Try also calling setVisuals() with the new function in the factor and character arguments only.
Use makeDataReport()’s visuals argument and setVisuals() to obtain a data report where character and factor variables use base R graphics plots, while all other variable classes use ggplot2 plots.
The choice of summaries included for each variable class in the report can also be controlled, and the manner in which this is done is very similar to the procedures you have tried for checks and visuals above. Experiment with the functions setSummaries(), allSummaryFunctions() and defaultCharacterSummaries() in order to remove “Mode” from the summaries listed for such variables in the data report (using the argument summaries).
We would now like a report that only displays the results of checks (no visuals or summaries) and only lists variables that actually were found to have potential problems. But for each problem found, we would like to see all problematic values listed, not just the first 10, as is currently the case. Moreover, we would like to rename the report title (as displayed on the front page) to be “Problem flagging report”, and we would also like to have “problemflagging” appended to the file name of the report, so that we can easily tell it apart from the usual data report.
Look at the documentation for makeDataReport() and try to produce the report described here.
dataMaid interactively
The primary intent of the dataMaid package is to generate the combined report produced by makeDataReport(). However, the reports generated by makeDataReport() are produced by a series of check, summary, and visualization functions that can be called directly. This enables us to work interactively with the functions that are part of the dataMaid package.
Recall that allCheckFunctions(), allSummaryFunctions(), and allVisualFunctions() show the built-in functions that are used for checks, summaries, and visuals, respectively.
Run check(presiData$ageAtInauguration), visualize(presiData$ageAtInauguration), and summarize(presiData$ageAtInauguration) and verify that you obtain information identical to what you saw in the report previously generated by makeDataReport(presiData).
Look at str(check(presiData$ageAtInauguration)). The check() function returns a list with a length matching the number of checks, and each element is itself a list of length 3 containing information about whether a problem was found, the message that is printed in the report, and the list of observations that gave rise to the problem.
The checks that are performed can be changed by using the checks argument in combination with the setChecks() function. Modify the manual check for presidencyYears such that it only returns potential outliers.
When a particular value is flagged as a potential error, we might want to report this back to the data provider so they can verify whether the data is correct. In order to do this we need to be able to identify the rows that give rise to these potential problems.
Use check() to identify the values that are thought to be potential outliers for the presidencyYears variable and store the result in an object problems.
Recall that problems consists of a list (of 1 element) of lists. Return the vector with the problemValues and save them to a vector probs. [Hint: You can reference elements of a list using double square brackets [[]] or the $ operator.]
Use the vector of potential error values to identify the indices (rows) of presiData which contain values for presidencyYears that are part of the vector of problem values, probs. [Hint: the operator %in% returns a logical vector indicating which elements of a vector are found within another vector; wrapping the result in which() gives the corresponding indices.]
Print all the rows that were just identified.
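The row lookup described in these steps can be sketched in base R as follows. The data frame toy and the flagged values in probs are made up for illustration and are not results from check():

```r
# Hypothetical stand-in for presiData with a few made-up rows
toy <- data.frame(
  lastName = c("A", "B", "C", "D"),
  presidencyYears = c(4, 12, 8, 0)
)
probs <- c(12, 0)   # made-up values playing the role of flagged problem values

flagged <- toy$presidencyYears %in% probs   # logical vector, one entry per row
rows <- which(flagged)                      # row indices where the value is in probs
toy[rows, ]                                 # print the matching rows
```

Note that %in% itself returns TRUE/FALSE for each row; either the logical vector or the indices from which() can be used to subset the data frame.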
We can also use check()
on a full data frame. This will return a list of lists of lists. We can use this to extract check results across different variables.
For the next 3 questions we will use the toyData dataset.
Run the following two lines to load the toyData dataset and print it
data(toyData)
toyData
## # A tibble: 15 x 6
## pill events region change id spotifysong
## <fct> <dbl> <fct> <dbl> <fct> <fct>
## 1 red 1 a -0.626 1 Irrelevant
## 2 red 1 a 0.184 2 Irrelevant
## 3 red 1 a -0.836 3 Irrelevant
## 4 red 2 a 1.60 4 Irrelevant
## 5 red 2 a 0.330 5 Irrelevant
## 6 red 6 b -0.820 6 Irrelevant
## 7 red 6 b 0.487 7 Irrelevant
## 8 red 6 b 0.738 8 Irrelevant
## 9 red 999 c 0.576 9 Irrelevant
## 10 red NA c -0.305 10 Irrelevant
## 11 blue 4 c 1.51 11 Irrelevant
## 12 blue 82 . 0.390 12 Irrelevant
## 13 blue NA " " -0.621 13 Irrelevant
## 14 <NA> NaN other -2.21 14 Irrelevant
## 15 <NA> 5 OTHER 1.12 15 Irrelevant
We will try to identify all the values found in the dataset that are possible missing values that have been wrongly encoded.
Do a full check() on the full toyData data frame but only consider the identifyMissing check. [Hint: Note that you can use the all="identifyMissing" argument to setChecks() as described previously.]
(Somewhat tricky) Return a vector of values that are potential missing values across the full dataset. [Hint: You can use sapply() to extract the problemValues for each element in the list, and then wrap it in unlist() to combine it all into a single vector.]
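The extraction pattern from the hint can be sketched on a hand-built nested list. The object mock below only imitates the variable, check, result layout described above; its contents are made up and are not the output of check():

```r
# Hypothetical mock of the nested structure: variable -> check -> result fields
mock <- list(
  events = list(identifyMissing = list(problem = TRUE,
                                       message = "made-up message",
                                       problemValues = c(999, NaN))),
  region = list(identifyMissing = list(problem = TRUE,
                                       message = "made-up message",
                                       problemValues = c(".", " "))),
  change = list(identifyMissing = list(problem = FALSE,
                                       message = "",
                                       problemValues = NULL))
)

# Pull out the problem values for each variable ...
perVar <- sapply(mock, function(res) res$identifyMissing$problemValues)
# ... and flatten them into one vector (mixed types are coerced to character)
allVals <- unlist(perVar)
```

Variables without problems contribute nothing, since their problemValues entry is NULL. The names of allVals indicate which variable each value came from.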
This approach is especially useful if you have data in wide format and you want to summarize problem values across different repeats of the same underlying variable (e.g., if it is the same quantity measured over time).
Next up is the task of writing custom summaryFunctions and checkFunctions. Note that a complete guide to writing such extensions is available in the vignette: vignette("extending_dataMaid"). We do not cover writing visualFunctions in the exercises, but we recommend that you look through the vignette if you are particularly interested in this subject.
refCat()
First up is building a summaryFunction, refCat(), which will give the reference category for factor variables, i.e. the first level. For instance, for the variable pill in toyData, the reference category is blue:
library(dataMaid)
toyData$pill
## [1] red red red red red red red red red red blue blue blue <NA>
## [15] <NA>
## Levels: blue red
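As a small base R sketch of what "first level" means, the factor below is made up to resemble pill. By default the levels are sorted, which is why blue comes before red:

```r
# Made-up factor resembling the pill variable
f <- factor(c("red", "red", "blue", NA))

levels(f)                      # all levels, sorted by default
first <- levels(f)[1]          # the first level, i.e. the reference category

g <- relevel(f, ref = "red")   # move "red" to the front without changing the data
firstAfter <- levels(g)[1]
```

Missing values are not levels, so they do not affect which category comes first.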
Below, we have provided a template for building refCat() that you can use as a starting point. There are a few points to note about this function:
summaryResult() changes the class of the output, so that it will be indistinguishable from that of the built-in summaryFunctions.
The list passed to summaryResult() has three entries, feature, result and value. In the example below, the contents of result and value are identical, but generally, result should be used for the result one wants printed in the data report, while value is not mandatory, but can be used to store whatever we are summarizing in its original data class.
refCat <- function(v, ...) {
  val <- #the reference category of the variable v
  res <- val
  summaryResult(list(feature = "Reference category", result = res,
                     value = val))
}
Use the template to finish writing refCat(). Call it on pill from toyData in order to test whether it is working.
Now, we will add refCat() to dataMaid's known summaryFunctions. First, try calling allSummaryFunctions() to see what summary functions are already available. We want refCat() to be added to the output of this function. This is done by use of summaryFunction(). Fill in the missing pieces in the code below, run it, and try calling allSummaryFunctions() again afterwards.
refCat <- summaryFunction(refCat,
  description = #Text describing what the summaryFunction does
  ,
  classes = c(#[vector of data types that the function is intended for]
  )
)
We can now use refCat() from makeDataReport(). Run the following bit of code, adding refCat() to the summaries used for factor variables, and look at the result.
makeDataReport(presiData,
  summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")),
  vol = "_withRefCat")
identifyNonStartCase
We will now write a check function that checks whether character variables are written in start case (i.e., “With Capital Letters For Each New Word”). This might be relevant for variables containing names. Again, we give a template below that can be used as a starting point.
identifyNonStartCase <- function(v, nMax, maxDecimals, ...) {
  #do the check
  problemValues <- #vector of values in v that are problematic, i.e. that are not
    #written in start case. If no problem is encountered, it should be set to NULL
  problem <- #is there a problem? TRUE/FALSE
  problemStatus <- list(problem = problem,
                        problemValues = problemValues)
  problemMessage <- #Message that is printed prior to listing
    # problem values in the dataMaid output,
    # ending with a colon
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
  checkResult(list(problem = problem,
                   message = outMessage,
                   problemValues = problemValues))
}
identifyNonStartCase <- checkFunction(identifyNonStartCase,
  description = #Some text describing the checkFunction
  ,
  classes = c(#[the data types that this function is intended to be used for]
  )
)
A few comments on the helper functions seen in the template:
messageGenerator() helps create streamlined, properly escaped messages that can be printed in the data report without any problems. It also omits extra problemValues if nMax is smaller than the number of problemValues encountered.
checkResult() works just like summaryResult() and converts the output class. Note that problemValues must be in their original data class (i.e. the raw values from v), because then the contents of problemValues can be used to identify where problems were found.
checkFunction() is also like its summaryFunction() cousin mentioned above: it converts the function into a checkFunction and makes sure that it becomes available in an allCheckFunctions() call.
Finish identifyNonStartCase(). A few tips that might be helpful:
problemValues and the problem indicator must agree: problem must be set to TRUE if length(problemValues) > 0.
strsplit() splits character strings into smaller parts. Try calling it on a character string with split = " ".
toupper() and tolower() change the case of letters to upper and lower case, respectively.
Use identifyNonStartCase() on the variable stateOfBirth from presiData. Try using it on all character variables in presiData by use of the function check().
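The string helpers named in the tips above can be tried out on a single made-up value before building them into the check. The sketch below is base R only and is not a finished identifyNonStartCase(); it just shows how the pieces behave:

```r
s <- "new hampshire"                                  # made-up example value

words <- strsplit(s, split = " ")[[1]]                # split into words
firstLetters <- substr(words, 1, 1)                   # first letter of each word
startsUpper <- firstLetters == toupper(firstLetters)  # TRUE where a word starts in upper case

# The same steps on a value that is already in start case
words2 <- strsplit("New York", split = " ")[[1]]
startsUpper2 <- substr(words2, 1, 1) == toupper(substr(words2, 1, 1))
```

strsplit() returns a list with one element per input string, which is why [[1]] is used here; applied to a whole variable, each element would need to be handled in turn.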
Add identifyNonStartCase() to the checks used on character variables in a makeDataReport() call on presiData. Use the argument vol = "_nonStartCase" to give the new report a different file name than the old ones. Compare the new report with an old version. Can you find the description you wrote for identifyNonStartCase() when calling checkFunction()?
Until now we have identified a number of errors and inconsistencies in presiData. For each data error you encountered in the above, you should:
When all the corrections are completed you should be able to run all the error checks from dataMaid and end up without any critical errors.
Remember to follow rules 1 and 2 and consider how these corrections best are documented.
For now we assume that we have a dataset that has been screened and is clean enough to be used for subsequent analyses.
Use the makeCodebook() function to produce a final codebook that could be passed on as documentation (of the dataset) for the data analysis.
Add shortDescription attributes to the dataset to explain that:
favoriteNumber has been obtained by consulting a Ouija board or, when that failed, just typing in a number, and consequently the accuracy may be low.
For assassinationAttempt, 1 means yes and 0 means no.
For firstName it is literally the first name. No middle names or initials.
Also try using the label attribute to add some variable labels of your own choosing. Run the codebook command again.
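Attributes such as shortDescription and label can be attached with base R's attr(). The sketch below uses a made-up data frame and made-up texts; only the attribute names come from the exercise:

```r
# Hypothetical stand-in for the cleaned dataset
toy <- data.frame(favoriteNumber = c(1, 4), assassinationAttempt = c(0, 1))

# Attach a free-text description and a label to individual variables
attr(toy$favoriteNumber, "shortDescription") <- "Made-up description text"
attr(toy$assassinationAttempt, "label") <- "Assassination attempt (1 = yes, 0 = no)"

attributes(toy$favoriteNumber)   # inspect what is now stored on the variable
```

The attributes are stored on the variables themselves, so they travel with the data frame when it is passed on to other functions.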