For this initial exercise we will try to identify errors in a data frame containing information about the US presidents. First we install the package needed to get the dataset that we will be working with. If you have already installed dataMaid from home, you can skip this step.
install.packages("dataMaid")
Now we have installed the dataMaid package, which contains the raw dataset, bigPresidentData. We can see the first few rows of the dataset with the head command.
library("dataMaid")
data("bigPresidentData")
head(bigPresidentData)
## lastName firstName orderOfPresidency birthday dateOfDeath
## 38 Ford Gerald 38 1913-07-14 2006-12-26
## 2 Adams John 2 1735-10-30 1826-07-04
## 31 Hoover Herbert 31 1874-08-10 1964-10-20
## 1 Washington George 1 1732-02-22 1799-12-14
## 13 Fillmore Millard 13 1800-01-07 1874-03-08
## 42 Clinton William 42 1946-08-19 <NA>
## stateOfBirth party presidencyBeginDate presidencyEndDate
## 38 Nebraska Republican 1974-08-09 1977-01-20
## 2 Massachusetts Federalist 1797-03-04 1801-03-04
## 31 Iowa Republican 1929-03-04 1933-03-04
## 1 Virginia Independent 1789-04-30 1797-03-04
## 13 New York Whig 1850-07-09 1853-03-04
## 42 Arkansas Democratic 1993-01-20 2001-01-20
## assassinationAttempt sex ethnicity presidencyYears ageAtInauguration
## 38 1 Male Caucasian 2 61
## 2 0 Male Caucasian 3 61
## 31 0 Male Caucasian 4 54
## 1 0 Male Caucasian 7 57
## 13 0 Male Caucasian 2 50
## 42 0 Male Caucasian 8 46
## favoriteNumber
## 38 2+0i
## 2 4+0i
## 31 5+0i
## 1 3+0i
## 13 7+0i
## 42 7+0i
We are told by our research collaborators that the dataset should contain information on 18 variables related to the US presidents. The variables are
lastName: The surname of the president.
firstName: The first name of the president.
orderOfPresidency: A factor variable indicating the order of the presidents (with George Washington as number 1 and Donald Trump as number 45).
birthday: The date of birth of the president.
dateOfDeath: The date of the president’s death.
stateOfBirth: A character variable with the state in which the president was born.
party: A character variable with the political party with which the president was associated.
presidencyBeginDate: A Date variable with the date of inauguration of the president.
presidencyEndDate: A Date variable with the date at which the presidency ends.
assassinationAttempt: A text variable indicating whether there was an assassination attempt (1) or not (0) on the president.
sex: A factor variable with the sex of the president.
ethnicity: A factor variable with the ethnicity of the president.
presidencyYears: A numeric variable with the duration of the presidency, in years.
ageAtInauguration: The age at inauguration.
favoriteNumber: A complex type variable with a fictional favorite number for each president.
Try to find and document as many errors as you can in the bigPresidentData. In particular, focus on validity (conforms to the correct syntax with the right format, type, and range), consistency, accuracy, and uniqueness.
[Hint: If you are new to R then you can use the functions str, print and the $ operator to extract variables from the data frame.]
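As a rough illustration of the hint, the sketch below inspects a made-up miniature data frame using base R only. The data frame toy and its contents are assumptions for illustration and are not taken from bigPresidentData.

```r
# Hypothetical miniature data frame; NOT the real bigPresidentData.
toy <- data.frame(
  lastName = c("Adams", "Adams", "Ford"),
  orderOfPresidency = factor(c(2, 6, 38)),
  presidencyYears = c(4, 4, 2),
  stringsAsFactors = FALSE
)

str(toy)                          # class and first values of every column
class(toy$orderOfPresidency)      # "factor"
range(toy$presidencyYears)        # smallest and largest value: 2 4
table(toy$lastName)               # frequency of each surname
anyDuplicated(toy$lastName) > 0   # TRUE: at least one surname is repeated
```

The same calls can be pointed at the columns of bigPresidentData to compare each variable with its description above.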
Here we use makeDataReport() from the dataMaid package to screen the data for obvious and non-obvious errors. We will continue working on the bigPresidentData from the dataMaid package in order to find errors that we previously missed.
Note that exercise 2 contains a lot of exercises. We do not expect you to be able to solve all of them in the time given. Remember the primary focus is to identify potential errors in the dataset.
Run the makeDataReport() function to make your own report for the bigPresidentData and look through it. What legitimate errors are found? What warnings are unnecessary?
Start by verifying the validity of the data (format, type, and range). When dataMaid encounters a variable type that is unknown, it prints a note and skips the variable. If we want to perform specific checks for a variable with an unknown type, then we must fix the class in the data frame (when relevant) or alternatively force dataMaid to consider a specific class as if it were another class.
Try calling help(makeDataReport) and look at the documentation for the argument treatXasY. Use this argument in a new call to makeDataReport() so that, for example, the favoriteNumber variable will be handled like a numeric variable. [Hint: you may need to use the replace=TRUE argument too to overwrite any previous reports.]
Are there any errors that do not pop up using the makeDataReport() function? What about consistency in the data? What about uniqueness in the data?
Note how everything is documented and can be used in collaboration with a research partner.
The argument checks can be used to control which checks are performed for variables of each supported class. The helper function setChecks() makes it easy to add or remove certain checks for certain variable classes, while allCheckFunctions() returns an overview of all available checkFunctions to choose from.
The following bit of code shows how to change the choice of checks for numeric variables to not include a check for outliers (i.e. the checkFunction identifyOutliers):
makeDataReport(bigPresidentData, replace = TRUE,
               checks = setChecks(numeric = defaultNumericChecks(remove = "identifyOutliers")))
# or, without the use of defaultNumericChecks():
makeDataReport(bigPresidentData, replace = TRUE,
               checks = setChecks(numeric = "identifyMissing"))
Now, it’s your turn to make a makeDataReport() call that does not check whether or not character variables have less than 5 unique levels. You can proceed as follows:
Use allCheckFunctions() to find out which checkFunction does the check for less than 5 unique values. Check that this function is indeed among the default checkFunctions for character variables by calling defaultCharacterChecks().
Use the checks argument, setChecks() and defaultCharacterChecks() to remove the “less than 5 unique values” check from the checks performed for character variables. Look at the overview table on the first page of your report. Can you find a difference?
The default plots used in a dataMaid report are made using the ggplot2 package. However, you can change the plot style to base R graphics if you please by using the visuals argument in makeDataReport().
Use allVisualFunctions() to identify the name of the visualFunction that you need to use if you want base R graphics style plots.
Look at the documentation for setVisuals(), e.g. by calling ?setVisuals. Here, you see an argument called all. Call setVisuals() in the console using the all argument to specify the graphics style plots. Try also calling setVisuals() with the new function in the factor and character arguments only.
Use makeDataReport()’s visuals argument and setVisuals() to obtain a data report where character and factor variables use base R graphics plots, while all other variable classes use ggplot2 plots.
The choice of summaries included for each variable class in the report can also be controlled, and the manner in which this is done is very similar to the procedures you have tried for checks and visuals above. Experiment with the functions setSummaries(), allSummaryFunctions() and defaultCharacterSummaries() in order to remove “Mode” from the summaries listed for such variables in the data report (using the argument summaries).
We would now like a report that only displays the results of checks (no visuals or summaries) and only lists variables that actually were found to have potential problems. But for each problem found, we would like to see all problematic values listed, not just the first 10, as is currently the case. Moreover, we would like to rename the report title (as displayed on the front page) to be “Problem flagging report”, and we would also like to have “problemflagging” appended to the file name of the report, so that we can easily tell it apart from the usual data report.
Look at the documentation for makeDataReport() and try to produce the report described here.
Using dataMaid interactively
The primary intent of the dataMaid package is to generate the combined report produced by makeDataReport(). However, the reports generated by makeDataReport() are produced by a series of check, summary, and visualization functions that can be called directly. This enables us to work interactively with the functions that are part of the dataMaid package.
Recall that allCheckFunctions(), allSummaryFunctions(), and allVisualFunctions() show the built-in functions that are used for checks, summaries, and visuals, respectively.
Call check(bigPresidentData$ageAtInauguration), visualize(bigPresidentData$ageAtInauguration), and summarize(bigPresidentData$ageAtInauguration) and verify that you obtain information identical to what you saw in the report previously generated by makeDataReport(bigPresidentData).
Look at str(check(bigPresidentData$ageAtInauguration)). The check() function returns a list with a length matching the number of checks, and each element is itself a list of length 3 containing information about whether a problem was found, the message that is printed in the report, and the list of observations that gave rise to the problem.
check() also takes the checks argument, used in combination with the setChecks() function. Modify the manual check for presidencyYears such that it only returns potential outliers.
When a particular value is flagged as a potential error we might want to report this back to the data provider so they can verify whether the data is correct. In order to do this we need to be able to identify the rows that give rise to these potential problems.
Use check() to identify the values that are thought to be potential outliers for the presidencyYears variable and store the result in an object problems.
Recall that problems consists of a list (of 1 element) of lists. Extract the problemValues and save them to a vector probs. [Hint: You can reference elements of a list using double square brackets [[]], and named elements of a list can be referenced using the $ operator.]
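The list indexing described in the hint can be tried out without dataMaid. The sketch below uses a hand-built list that only imitates the structure described above (one element per check, each with problem, message and problemValues); the message text and values are made up.

```r
# Hand-built stand-in for the result of check(); the values are made up.
problems <- list(
  identifyOutliers = list(
    problem = TRUE,
    message = "Possible outlier values were detected: 0, 12.",
    problemValues = c(0, 12)
  )
)

length(problems)                      # 1: one check, one list element
first <- problems[[1]]                # [[ ]] extracts the element itself
names(first)                          # "problem" "message" "problemValues"
probs <- problems[[1]]$problemValues  # $ extracts a named element
probs                                 # 0 12
```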
Use the vector of potential error values to identify the indices (rows) of bigPresidentData which contain values for presidencyYears that are part of the vector of problem values, probs. [Hint: the operator %in% tests, element by element, whether the values of one vector are found within another vector; wrapping the result in which() returns the matching indices.]
Print all the rows that were just identified.
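A minimal sketch of the %in% approach, using a made-up data frame d and made-up problem values rather than the real data:

```r
# Hypothetical data and problem values, for illustration only.
d <- data.frame(name = c("a", "b", "c", "d"), years = c(4, 12, 8, 0))
probs <- c(0, 12)

flagged <- d$years %in% probs   # logical vector: FALSE TRUE FALSE TRUE
which(flagged)                  # row indices: 2 4
d[flagged, ]                    # the rows holding a flagged value
```

Indexing with the logical vector and indexing with which(flagged) select the same rows here; the logical form is often the shorter one to write.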
We can also use check() on a full data frame. This will return a list of lists of lists. We can use this to extract check results across different variables.
For the next 3 questions we will use the toyData dataset.
Run the following two lines to load the toyData dataset and print it:
data(toyData)
toyData
## # A tibble: 15 x 6
## pill events region change id spotifysong
## <fct> <dbl> <fct> <dbl> <fct> <fct>
## 1 red 1.00 a -0.626 1 Irrelevant
## 2 red 1.00 a 0.184 2 Irrelevant
## 3 red 1.00 a -0.836 3 Irrelevant
## 4 red 2.00 a 1.60 4 Irrelevant
## 5 red 2.00 a 0.330 5 Irrelevant
## 6 red 6.00 b -0.820 6 Irrelevant
## 7 red 6.00 b 0.487 7 Irrelevant
## 8 red 6.00 b 0.738 8 Irrelevant
## 9 red 999 c 0.576 9 Irrelevant
## 10 red NA c -0.305 10 Irrelevant
## 11 blue 4.00 c 1.51 11 Irrelevant
## 12 blue 82.0 . 0.390 12 Irrelevant
## 13 blue NA " " -0.621 13 Irrelevant
## 14 <NA> NaN other -2.21 14 Irrelevant
## 15 <NA> 5.00 OTHER 1.12 15 Irrelevant
We will try to identify all the values found in the dataset that are possible missing values that have been wrongly encoded.
Do a full check() on the full toyData data frame but only consider the identifyMissing check. [Hint: Note that you can use the all = "identifyMissing" argument to setChecks() as described previously.]
(Somewhat R-tricky) Return a vector of values that are potential missing values across the full dataset. [Hint: You can use sapply() to extract the problemValues for each element in the list, and then wrap it in unlist() to combine it all into a single vector.]
This approach is especially useful if you have data in wide format and you want to summarize problem values across different repeats of the same underlying variable (e.g., if it is the same value measured over time).
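The sapply() and unlist() combination can be rehearsed on a hand-built nested list. The object res below is an assumption that only mimics the "list of lists of lists" shape described above, with one check per variable and made-up problem values.

```r
# Hand-built stand-in for check() on a data frame with one check per variable.
res <- list(
  events = list(identifyMissing = list(problem = TRUE, message = "...",
                                       problemValues = c(999, NaN))),
  region = list(identifyMissing = list(problem = TRUE, message = "...",
                                       problemValues = c(".", " "))),
  change = list(identifyMissing = list(problem = FALSE, message = "",
                                       problemValues = NULL))
)

# One entry per variable: the problem values of its first (and only) check.
perVariable <- sapply(res, function(x) x[[1]]$problemValues)

# Flatten into a single vector; variables without problems drop out.
allProblems <- unlist(perVariable)
allProblems
```

Note that unlist() coerces everything to a common type, so numeric problem values end up as character strings when they are combined with character ones.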
Build your own summaryFunction and checkFunction
Next up is the task of writing custom summaryFunctions and checkFunctions. Note that a complete guide to writing such extensions is available in the vignette: vignette("extending_dataMaid"). We do not cover writing visualFunctions in the exercises, but we recommend that you look through the vignette if you are particularly interested in this subject.
refCat()
First up is building a summaryFunction, refCat(), which will give the reference category for factor variables, i.e. the first level. For instance, for the variable pill in toyData, the reference category is blue:
library(dataMaid)
toyData$pill
## [1] red red red red red red red red red red blue blue blue <NA>
## [15] <NA>
## Levels: blue red
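Why blue rather than red? By default, factor() sorts the levels alphabetically, and the first level acts as the reference category. A small base R sketch with a made-up factor (not toyData itself):

```r
# Hypothetical factor, mirroring the structure of toyData$pill.
pill <- factor(c("red", "red", "blue", NA))

levels(pill)      # "blue" "red": sorted alphabetically, NA is not a level
levels(pill)[1]   # "blue", the reference category

# relevel() is base R's way to pick a different reference category.
pill2 <- relevel(pill, ref = "red")
levels(pill2)[1]  # "red"
```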
Below, we have provided a template for building refCat() that you can use as a starting point. There are a few points to note about this function:
summaryResult() changes the class of the output, so that it will be indistinguishable from that of the built-in summaryFunctions.
The list passed to summaryResult() has three entries, feature, result and value. In the example below, the contents of result and value are identical, but generally, result should be used for the result one wants printed in the data report, while value is not mandatory, but can be used to store whatever we are summarizing in its original data class.
refCat <- function(v, ...) {
  val <- #the reference category of the variable v
  res <- val
  summaryResult(list(feature = "Reference category", result = res,
                     value = val))
}
Use the template to finish writing refCat(). Call it on pill from toyData in order to test whether it is working.
Now, we will add refCat() to dataMaid's known summaryFunctions. First, try calling allSummaryFunctions() to see what summary functions are already available. We want refCat() to be added to the output of this function. This is done by use of summaryFunction(). Fill in the missing pieces in the code below, run it, and try calling allSummaryFunctions() again afterwards.
refCat <- summaryFunction(refCat,
description = #Text describing what the summaryFunction does,
,
classes = c(#[vector of data types that the function is intended for]
)
)
Use refCat() from makeDataReport(). Run the following bit of code, adding refCat() to the summaries used for factor variables, and look at the result.
makeDataReport(bigPresidentData,
  summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")),
  vol = "_withRefCat")
identifyNonStartCase
We will now write a check function that checks whether character variables are written in start case (i.e., “With Capital Letters For Each New Word”). This might be relevant for variables containing names. Again, we give a template below that can be used as a starting point.
identifyNonStartCase <- function(v, nMax, maxDecimals, ...) {
#do the check
problemValues <- #vector of values in v that are problematic, i.e. that are not
#written in start case. If no problem is encountered, it should be set to NULL
problem <- #is there a problem? TRUE/FALSE
problemStatus <- list(problem = problem,
problemValues = problemValues)
problemMessage <- #Message that is printed prior to listing
# problem values in the dataMaid output,
# ending with a colon
outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
checkResult(list(problem = problem,
message = outMessage,
problemValues = problemValues))
}
identifyNonStartCase <- checkFunction(identifyNonStartCase,
description = #Some text describing the checkFunction
,
classes = c(#[the data types that this function is intended to be used for]
)
)
A few comments for the helper functions seen in the template:
messageGenerator() helps create streamlined, properly escaped messages that can be printed in the data report without any problems. It also omits extra problemValues if nMax is smaller than the number of problemValues encountered.
checkResult() works just like summaryResult() and converts the output class. Note that problemValues must be in their original data class (i.e. the raw values from v), because then the contents of problemValues can be used to identify where problems were found.
checkFunction() is also like its summaryFunction cousin mentioned above: It converts the function into a checkFunction and makes sure that it becomes available in an allCheckFunctions() call.
Finish identifyNonStartCase(). A few tips that might be helpful:
Fill in problemValues, and set the problem indicator to TRUE if length(problemValues) > 0.
strsplit() splits character strings into smaller parts. Try calling it on a character string with split = " ".
toupper() and tolower() change the case of letters to upper and lower case, respectively.
Use identifyNonStartCase() on the variable stateOfBirth from bigPresidentData. Try using it on all character variables in bigPresidentData by use of the function check().
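The string helpers mentioned in the tips can be tried out on a single made-up string before they go into the check function. The sketch below also uses substr(), a base R function that the tips do not mention; it is shown here as one possible way to pick out first letters, not as the intended solution.

```r
x <- "new York"

parts <- strsplit(x, split = " ")[[1]]   # "new" "York"; strsplit() returns a list
firstLetters <- substr(parts, 1, 1)      # "n" "Y"
toupper(firstLetters)                    # "N" "Y"
tolower(x)                               # "new york"

# TRUE for words that already begin with an upper-case letter
firstLetters == toupper(firstLetters)    # FALSE TRUE
```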
Add identifyNonStartCase() to the checks used on character variables in a makeDataReport() call on bigPresidentData. Use the argument vol = "_nonStartCase" to give the new report a different file name than the old ones. Compare the new report with an old version. Can you find the description you wrote for identifyNonStartCase() when calling checkFunction()?
In order to solve the exercises below, you will need to use logical operators in R. Note that their documentation is available through a ?Logic call. Here is a brief summary of the most commonly used logical operators:
&: And. x & y evaluates to TRUE if both x and y evaluate to TRUE.
|: Or. x | y evaluates to TRUE if either x or y evaluates to TRUE.
!: Not. !x evaluates to TRUE if x evaluates to FALSE (and vice versa).
You will also need to use relational operators, which are documented in ?Comparison:
<: Strictly smaller than.
>: Strictly greater than.
<=: Smaller than or equal to.
>=: Greater than or equal to.
==: Equals - note the double equality sign!
!=: Not equal.
Note that these relational operators are not just implemented for comparing numbers, but also dates.
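A short sketch of how these operators behave on Date vectors, including what happens when a date is missing. The dates are made up for illustration.

```r
# Hypothetical dates, not taken from bigPresidentData.
born <- as.Date(c("1900-01-01", "1950-06-15", "1980-03-01"))
died <- as.Date(c("1970-05-05", NA, "1975-01-01"))

born < died                  # TRUE NA FALSE: a missing date gives NA
is.na(died) | born < died    # TRUE TRUE FALSE: missing dates count as passing
!(born < died)               # FALSE NA TRUE
born != died                 # TRUE NA TRUE
```

Combining a comparison with is.na() through | is one common way to state a rule that should only apply when the value is present.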
Below, we list some sets of variables in bigPresidentData for which logical row-wise constraints must be fulfilled. For each set of variables, identify what logical constraint(s) must be satisfied and check if they do indeed hold, using validator() and confront() from validate. We have provided an example below that you can use for inspiration.
birthday, dateOfDeath, presidencyBeginDate and presidencyEndDate
presidencyBeginDate, presidencyEndDate and presidencyYears
birthday, ageAtInauguration and presidencyBeginDate
orderOfPresidency and presidencyBeginDate (tricky - remember to handle missing data)
Tip: The floor() function rounds a number down to the nearest integer.
Tip: The rank() function provides the ranks of the values in a vector.
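The two tips can be explored in isolation first. The sketch below is base R only; dividing by 365.25 is an approximation chosen here for illustration, and the vector startYear is made up.

```r
floor(57.9)                                # 57
floor(-1.2)                                # -2: always rounds towards minus infinity

# Whole years between two dates (365.25 days per year is an approximation).
begin <- as.Date("1789-04-30")
born  <- as.Date("1732-02-22")
floor(as.numeric(begin - born) / 365.25)   # 57

# rank() with missing values: na.last = "keep" leaves NA in place.
startYear <- c(1801, NA, 1789, 1797)
r <- rank(startYear, na.last = "keep")
r                                          # 3 NA 1 2
```

Without na.last = "keep", rank() places missing values last and gives them a rank of their own, which would silently shift a comparison against an ordering variable.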
Try to experiment with the different functions available for inspecting confrontations, summary(), aggregate(), sort(), values(), barplot(), errors() and warnings(), in order to get an idea of what they each do.
We wish to ensure that all presidents are born at an earlier date than their time of death (if relevant). This can be done in validate by creating a validator object with the appropriate logical statement and confronting the data with it:
library(validate)
##
## Attaching package: 'validate'
## The following objects are masked from 'package:dataMaid':
##
## description, description<-
## The following object is masked from 'package:dplyr':
##
## expr
#define check
bdaycheck <- validator(birthday < dateOfDeath)
## Found more than one class "rule" in cache; using the first, from namespace 'cli'
## Also defined by 'validate'
#save confrontation
bdayconfr <- confront(bigPresidentData, bdaycheck)
#look at confrontation
summary(bdayconfr)
## name items passes fails nNA error warning expression
## 1 V1 47 41 0 6 FALSE FALSE birthday < dateOfDeath
We see that all observations either have missing information (nNA = 6) in the check or pass it (passes = 41). We can look at the results of the check together with the relevant variables, birthday and dateOfDeath, by using the values() function:
#look at contents of values(bdayconfr)
head(values(bdayconfr))
## V1
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] NA
#view president names, birthday, date of death and bdayconfr values
#together:
View(cbind(bigPresidentData[, c("firstName", "lastName", "birthday", "dateOfDeath")],
values(bdayconfr)))
Here, we see that the NA values from the confrontation correspond to presidents who are still alive and therefore have dateOfDeath = NA.
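If the NA results should count as passes rather than as missing, the rule itself can handle the missing dates. The sketch below is one possible formulation on a made-up data frame d; the rule name bornBeforeDeath is an arbitrary label chosen here, and the validate part is skipped when the package is not installed.

```r
# Hypothetical data; the third row is deliberately inconsistent.
d <- data.frame(
  birthday    = as.Date(c("1900-01-01", "1950-06-15", "1980-03-01")),
  dateOfDeath = as.Date(c("1970-05-05", NA, "1975-01-01"))
)

# Base R version of the rule, with missing dates of death counted as passing.
ok <- is.na(d$dateOfDeath) | d$birthday < d$dateOfDeath
which(!ok)   # 3

# The same rule in validate (skipped if the package is not installed).
if (requireNamespace("validate", quietly = TRUE)) {
  library(validate)
  rule <- validator(bornBeforeDeath = is.na(dateOfDeath) | birthday < dateOfDeath)
  conf <- confront(d, rule)
  print(summary(conf))
}
```

Whether a missing value should pass or be reported separately is a judgment call; keeping the NA count visible, as in the example above, can be the more informative choice.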
There can only be one president at a time. However, a mistake in the data has caused two presidencies to overlap. Locate the error using validator() and confront() from validate.
Tip: It may be helpful to begin by sorting the data according to presidencyBeginDate.
Until now we have identified a number of errors and inconsistencies in the bigPresidentData. For each data error you encountered in the above, you should:
When all the corrections are completed you should be able to run all the error checks from dataMaid and validate and end up without any critical errors.
Remember to follow rules 1 and 2 and consider how these corrections are best documented.
For now we assume that we have a dataset that has been screened and is clean enough to be used for subsequent analyses.
Use the makeCodebook() function to produce a final codebook that could be passed on as documentation of the dataset for the data analysis.
Add shortDescription attributes for the dataset to explain that:
favoriteNumber has been obtained by consulting a Ouija board or - when that failed - just typing in a number, and consequently the accuracy may be low.
For assassinationAttempt, 1 means yes and 0 means no.
For firstName, it is literally the first name. No middle names or initials.
Also try using the labels attribute to add some variable labels of your own choosing. Run the codebook command again.
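Attributes are attached to individual columns with base R's attr(). The sketch below sets a shortDescription on a made-up data frame d; the attribute name is the one used in the exercise text, while the data and the exact wording are assumptions.

```r
# Hypothetical miniature data frame; NOT the real bigPresidentData.
d <- data.frame(firstName = c("George", "John"),
                assassinationAttempt = c("0", "0"))

# attr<- attaches free-text metadata to a single column.
attr(d$assassinationAttempt, "shortDescription") <- "1 means yes and 0 means no"
attr(d$firstName, "shortDescription") <- "First name only; no middle names or initials"

attr(d$assassinationAttempt, "shortDescription")   # read it back

# With dataMaid loaded, the codebook would then be produced by
# makeCodebook(d)
```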