Exploratory exercise - no solution here.
makeDataReport()
Run the makeDataReport() function to make your own report for the presiData and look through it. What legitimate errors are found? What warnings are unnecessary?
We open the dataMaid package, load the presiData dataset and run our first data report:
#open dataMaid
library(dataMaid)
##
## Attaching package: 'dataMaid'
## The following object is masked from 'package:dplyr':
##
## summarize
#load data
load(url("http://biostatistics.dk/eRum2018/data/presiData.rda"))
#make data report for it
makeDataReport(presiData)
We see that the report has identified the following legitimate errors:
* orderOfPresidency is stored as a factor variable, although it might more naturally be stored as a numeric (validity)
* Inf (infinity) occurs as a length of presidency (validity)
* ageAtInauguration has been misclassified as a categorical variable, even though it is supposed to be numeric (validity)
On the other hand, the following warnings are not really necessary:
* The stateOfBirth variable has levels with less than five observations.
* sex only contains one level (“male”), but this is not a mistake in the data
* In the ethnicity variable, the value African American has less than five observations. However, this is not a mistake in the data either.
Handle the unusual variable class used for the variables lastName and firstName by changing their classes to something more standard. Try calling help(makeDataReport) and look at the documentation for the argument treatXasY. Use this argument in a new call to makeDataReport() so that for example the favoriteNumber variable will be handled like a numeric variable.
We change the class of lastName
and firstName
as suggested:
class(presiData$firstName) <- "character"
class(presiData$lastName) <- "character"
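As a small illustration of what such a class() assignment does, here is a sketch on a toy vector. The class name "Name" and the values are made up for this example; they are assumptions and are not taken from presiData.

```r
# Toy character vector carrying a made-up, non-standard class
# (hypothetical example, not the actual presiData variables)
x <- structure(c("Lincoln", "Grant"), class = "Name")
class(x)

# Overwrite the class with a standard one, as done for firstName/lastName
class(x) <- "character"
class(x)
```

After the assignment, functions that dispatch on the class of the variable will treat x as an ordinary character vector.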
We make a new report where variables of class complex
(like favoriteNumber
) are handled as numeric
variables:
makeDataReport(presiData, treatXasY = list(complex = "numeric"),
replace = TRUE)
Note that we can see that this was successful in two ways:
* favoriteNumber in the Variable List, where we can see that the summaries usually used for numeric variables are now applied
Note also that we now have checks performed for the firstName and lastName variables.
Use allCheckFunctions()
to find out which checkFunction does the check for less than 5 unique values. Check that this function is indeed among the default checkFunctions for character
variables by calling defaultCharacterChecks()
.
We call allCheckFunctions() to see what check functions are available:
allCheckFunctions()
name | description | classes |
---|---|---|
identifyCaseIssues | Identify case issues | character, factor |
identifyLoners | Identify levels with < 6 obs. | character, factor |
identifyMissing | Identify miscoded missing values | character, Date, factor, integer, labelled, logical, numeric |
identifyNums | Identify misclassified numeric or integer variables | character, factor, labelled |
identifyOutliers | Identify outliers | Date, integer, numeric |
identifyOutliersTBStyle | Identify outliers (Turkish Boxplot style) | Date, integer, numeric |
identifyWhitespace | Identify prefixed and suffixed whitespace | character, factor, labelled |
isCPR | Identify Danish CPR numbers | character, Date, factor, integer, labelled, logical, numeric |
isEmpty | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isKey | Check if the variable is a key | character, Date, factor, integer, labelled, logical, numeric |
isSingular | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isSupported | Check if the variable class is supported by dataMaid. | character, Date, factor, integer, labelled, logical, numeric |
We see that the check function identifyLoners
looks for levels with strictly less than 6 observations. This is the function we were looking for. Let’s see if it is among the default options used in makeDataReport()
for checking character
variables:
defaultCharacterChecks()
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
– and it’s there.
Use the checks
argument, setChecks()
and defaultCharacterChecks()
to remove the “less than 5 unique values”-check from the checks performed for character
variables. Look at the overview table on the first page of your report. Can you find a difference?
We remove the identifyLoners
check function from the checks used for character
variables in a data report:
makeDataReport(presiData,
checks = setChecks(character = defaultCharacterChecks(remove = "identifyLoners")),
replace = TRUE)
We note that in the table in the Data Report Overview that marks what checks were performed, “Identify levels with <6 obs.” is no longer checked for character
variables. And if we look at the character
variables in the Variable List, we also see that it is no longer performed.
If you are unsure about exactly what happens in the code bit above, try running the functions one by one and look at the output:
defaultCharacterChecks()
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
defaultCharacterChecks(remove = "identifyLoners")
## [1] "identifyMissing" "identifyWhitespace" "identifyCaseIssues"
## [4] "identifyNums"
setChecks()
## $character
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
##
## $factor
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
##
## $labelled
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
##
## $numeric
## [1] "identifyMissing" "identifyOutliers"
##
## $integer
## [1] "identifyMissing" "identifyOutliers"
##
## $logical
## NULL
##
## $Date
## [1] "identifyOutliers" "identifyMissing"
setChecks(character = defaultCharacterChecks(remove = "identifyLoners"))
## $character
## [1] "identifyMissing" "identifyWhitespace" "identifyCaseIssues"
## [4] "identifyNums"
##
## $factor
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
##
## $labelled
## [1] "identifyMissing" "identifyWhitespace" "identifyLoners"
## [4] "identifyCaseIssues" "identifyNums"
##
## $numeric
## [1] "identifyMissing" "identifyOutliers"
##
## $integer
## [1] "identifyMissing" "identifyOutliers"
##
## $logical
## NULL
##
## $Date
## [1] "identifyOutliers" "identifyMissing"
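The step-by-step output above can be mimicked with plain vectors and lists. The following is only a conceptual sketch in base R, assuming nothing about how dataMaid implements these functions; the object names are made up for illustration.

```r
# The default character checks, copied from the output above
defaultChecks <- c("identifyMissing", "identifyWhitespace", "identifyLoners",
                   "identifyCaseIssues", "identifyNums")

# Conceptually, remove = "identifyLoners" drops one name from the vector
withoutLoners <- setdiff(defaultChecks, "identifyLoners")

# Conceptually, setChecks() returns a list with one entry per variable class;
# supplying the character argument overrides only that entry
checkList <- list(character = defaultChecks,
                  factor = defaultChecks,
                  numeric = c("identifyMissing", "identifyOutliers"))
checkList$character <- withoutLoners
str(checkList)
```

Only the character entry changes; the factor entry still contains identifyLoners, which matches what the real output shows.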
Use allVisualFunctions()
to identify the name of the visualFunction that you need to use if you want base R graphics style plots.
We call allVisualFunctions()
:
allVisualFunctions()
name | description | classes |
---|---|---|
basicVisual | Histograms and barplots using graphics | character, Date, factor, integer, labelled, logical, numeric |
standardVisual | Histograms and barplots using ggplot2 | character, Date, factor, integer, labelled, logical, numeric |
And we note that we should use basicVisual
if we want base R graphics plots.
Call setVisuals()
in the console using the all
argument to specify the graphics
style plots. Try also calling setVisuals()
with the new function in the factor
and character
arguments only.
We use the all
argument for the setVisuals()
function to specify that all variable classes should use basicVisual
as their visual function:
setVisuals(all = "basicVisual")
## $character
## [1] "basicVisual"
##
## $factor
## [1] "basicVisual"
##
## $labelled
## [1] "basicVisual"
##
## $numeric
## [1] "basicVisual"
##
## $integer
## [1] "basicVisual"
##
## $logical
## [1] "basicVisual"
##
## $Date
## [1] "basicVisual"
We see that basicVisual
is indeed listed for all variables. We now specify only character
and factor
variables to use the basicVisual
function:
setVisuals(character = "basicVisual",
factor = "basicVisual")
## $character
## [1] "basicVisual"
##
## $factor
## [1] "basicVisual"
##
## $labelled
## [1] "standardVisual"
##
## $numeric
## [1] "standardVisual"
##
## $integer
## [1] "standardVisual"
##
## $logical
## [1] "standardVisual"
##
## $Date
## [1] "standardVisual"
Note that we have not yet used these arguments for anything: we have just looked at the functions that are used to specify what visual functions to use.
Use makeDataReport()
’s visuals
argument and setVisuals()
to obtain a data report where character
and factor
variables use base R graphics plots, while all other variable classes use ggplot2
plots.
This task corresponds to using setVisuals()
just like we did in exercise 2.8. E.g. by looking at the documentation for makeDataReport()
by using help()
or ?
, we can find out that the choice of visual functions is specified in the argument visuals
. So the following command will make a data report for presiData
where character
and factor
variables are visualized using the graphics
package rather than ggplot2
:
makeDataReport(presiData,
visuals = setVisuals(character = "basicVisual",
factor = "basicVisual"),
replace = TRUE)
Experiment with the functions setSummaries()
, allSummaryFunctions()
and defaultCharacterSummaries()
in order to remove “Mode” from the summaries listed for such variables in the data report (using the argument summaries
).
First, we see what summary functions are available in order to identify the one that produces the “mode” information:
allSummaryFunctions()
name | description | classes |
---|---|---|
centralValue | Compute median for numeric variables, mode for categorical variables | character, Date, factor, integer, labelled, logical, numeric |
countMissing | Compute proportion of missing observations | character, Date, factor, integer, labelled, logical, numeric |
minMax | Find minimum and maximum values | integer, numeric, Date |
quartiles | Compute 1st and 3rd quartiles | Date, integer, numeric |
uniqueValues | Count number of unique values | character, Date, factor, integer, labelled, logical, numeric |
variableType | Data class of variable | character, Date, factor, integer, labelled, logical, numeric |
We see that the function centralValue
computes modes for categorical variables. Now, we can use defaultCharacterSummaries()
and setSummaries()
in a call to makeDataReport()
to create a report where centralValue
is not used for character variables:
makeDataReport(presiData,
summaries = setSummaries(character = defaultCharacterSummaries(remove = "centralValue")),
replace = TRUE)
We would now like a report that only displays the results of checks (no visuals or summaries) and only lists variables that actually were found to have potential problems. But for each problem found, we would like to see all problematic values listed, not just the first 10, as is currently the case. Moreover, we would like to rename the report title (as displayed on the front page) to be “Problem flagging report”, and we would also like to have “problemflagging” appended to the file name of the report, so that we can easily tell it apart from the usual data report.
This customized report can be generated in the following way:
makeDataReport(presiData, mode = "check",
onlyProblematic = TRUE,
maxProbVals = Inf,
reportTitle = "Problem flagging report",
vol = "_problemflagging")
Here,
* mode = "check" specifies that we only want checks done, with no visuals or summaries
* onlyProblematic = TRUE makes sure that only variables for which problems were found are included in the report
* maxProbVals = Inf means that there is no upper limit (or, an infinite one, if you will) for how many problematic values are printed in the checks
* reportTitle = "Problem flagging report" puts “Problem flagging report” as the title on the first page
* vol = "_problemflagging" appends “_problemflagging” to the file name so that it now reads “dataMaid_presiData_problemflagging” (with a file extension depending on your operating system and availability of LaTeX.)
dataMaid interactively
Run check(presiData$ageAtInauguration), visualize(presiData$ageAtInauguration), and summarize(presiData$ageAtInauguration) and verify that you obtain information identical to what you saw in the report previously generated by makeDataReport(presiData).
We run the three check/visualize/summarize steps on the variable ageAtInauguration interactively:
check(presiData$ageAtInauguration)
## $identifyMissing
## No problems found.
## $identifyWhitespace
## No problems found.
## $identifyLoners
## Note that the following levels have at most five observations: 42, 43, 46, 47, 48, 49, 50, 51, 52, 54 (13 additional values omitted).
## $identifyCaseIssues
## No problems found.
## $identifyNums
## Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
visualize(presiData$ageAtInauguration)
summarize(presiData$ageAtInauguration)
## $variableType
## Variable type: character
## $countMissing
## Number of missing obs.: 0 (0 %)
## $uniqueValues
## Number of unique values: 23
## $centralValue
## Mode: "54"
Look at str(check(presiData$ageAtInauguration)).
We try calling the structure (str()) function on the output of a check() call:
ageAtInaugCheck <- check(presiData$ageAtInauguration)
str(ageAtInaugCheck)
## List of 5
## $ identifyMissing :List of 3
## ..$ problem : logi FALSE
## ..$ message : chr ""
## ..$ problemValues: NULL
## ..- attr(*, "class")= chr "checkResult"
## $ identifyWhitespace:List of 3
## ..$ problem : logi FALSE
## ..$ message : chr ""
## ..$ problemValues: NULL
## ..- attr(*, "class")= chr "checkResult"
## $ identifyLoners :List of 3
## ..$ problem : logi TRUE
## ..$ message : chr "Note that the following levels have at most five observations: \\\"42\\\", \\\"43\\\", \\\"46\\\", \\\"47\\\", "| __truncated__
## ..$ problemValues: chr [1:23] "42" "43" "46" "47" ...
## ..- attr(*, "class")= chr "checkResult"
## $ identifyCaseIssues:List of 3
## ..$ problem : logi FALSE
## ..$ message : chr ""
## ..$ problemValues: NULL
## ..- attr(*, "class")= chr "checkResult"
## $ identifyNums :List of 3
## ..$ problem : logi TRUE
## ..$ message : chr "Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclas"| __truncated__
## ..$ problemValues: NULL
## ..- attr(*, "class")= chr "checkResult"
We observe that the results are stored in a list-type object with one entry for each check function used. And for each check function, we then find a new list with three entries: problem (TRUE if a problem was found), message (the message printed in the report) and problemValues (the values in the variable that were deemed problematic).
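Because the result is an ordinary nested list, it can be queried with the usual list tools. The sketch below builds a small mock object with the same three-entry layout (the object mockCheck and its contents are invented for illustration and are not real dataMaid output) and extracts the names of the checks that flagged a problem.

```r
# Mock object with the same layout as a check() result (hypothetical values)
mockCheck <- list(
  identifyMissing = list(problem = FALSE, message = "", problemValues = NULL),
  identifyLoners = list(problem = TRUE, message = "Some levels are rare",
                        problemValues = c("42", "43")),
  identifyNums = list(problem = TRUE, message = "Looks numeric",
                      problemValues = NULL)
)

# Pull out the problem indicator from each check result
flagged <- vapply(mockCheck, function(x) x$problem, logical(1))

# Names of the checks that found a problem
names(flagged)[flagged]
```

The same vapply() call should work on a real check() result for a single variable, since it only relies on the problem entry described above.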
Modify the manual check for presidencyYears
such that it only returns potential outliers.
We call check()
and use the checks
argument to specify that the variable presidencyYears
(a numeric) should only be subjected to a check for outliers.
First, we see what class the variable has:
class(presiData$presidencyYears)
## [1] "numeric"
Next, we find the name of the check function that identifies potential outliers:
allCheckFunctions()
name | description | classes |
---|---|---|
identifyCaseIssues | Identify case issues | character, factor |
identifyLoners | Identify levels with < 6 obs. | character, factor |
identifyMissing | Identify miscoded missing values | character, Date, factor, integer, labelled, logical, numeric |
identifyNums | Identify misclassified numeric or integer variables | character, factor, labelled |
identifyOutliers | Identify outliers | Date, integer, numeric |
identifyOutliersTBStyle | Identify outliers (Turkish Boxplot style) | Date, integer, numeric |
identifyWhitespace | Identify prefixed and suffixed whitespace | character, factor, labelled |
isCPR | Identify Danish CPR numbers | character, Date, factor, integer, labelled, logical, numeric |
isEmpty | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isKey | Check if the variable is a key | character, Date, factor, integer, labelled, logical, numeric |
isSingular | Check if the variable contains only a single value | character, Date, factor, integer, labelled, logical, numeric |
isSupported | Check if the variable class is supported by dataMaid. | character, Date, factor, integer, labelled, logical, numeric |
And lastly, we specify that this is the check we want for numeric variables and apply it to presidencyYears
:
check(presiData$presidencyYears,
checks = setChecks(numeric = "identifyOutliers"))
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
Note that the same output could have been obtained by calling identifyOutliers
directly:
identifyOutliers(presiData$presidencyYears)
## Note that the following possible outlier values were detected: 12, Inf.
Use check()
to identify the values that are thought to be potential outliers for the presidencyYears
variable and store the result in an object problems
. Return the vector with the problemValues
and save them to a vector probs
.
We store the results of checking presidencyYears
for outliers in an object called problems
:
problems <- check(presiData$presidencyYears,
checks = setChecks(numeric = "identifyOutliers"))
#look at the contents
problems
## $identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
#look at the structure of the contents
str(problems)
## List of 1
## $ identifyOutliers:List of 3
## ..$ problem : logi TRUE
## ..$ message : chr "Note that the following possible outlier values were detected: \\\"12\\\", \\\"Inf\\\"."
## ..$ problemValues: num [1:2] 12 Inf
## ..- attr(*, "class")= chr "checkResult"
We now want to save the problem values found by the identifyOutliers
check. In order to do this, we must first select the identifyOutliers
entry of the list problems
and then select the problem values within that list. Below, we show two ways of doing this, using either list indexing ([[]]
) or $
. We store the problematic values in a new object called probs
:
#select first entry of problems using indexing, and then choose
#problemValues by use of the entry name
probs <- problems[[1]]$problemValues
#Select the results from `identifyOutliers` using the entry name,
#and then select `problemValues` also by use of the entry name
probs <- problems$identifyOutliers$problemValues
In both objects we will find the same contents, namely:
probs
## [1] 12 Inf
Use the vector of potential error values to identify the indices (rows) of presiData
which contain values for presidencyYears
that are part of the vector of problem values, probs
.
We now identify the row numbers of presiData
where potential outliers were identified in the presidencyYears
variable:
probRows <- which(presiData$presidencyYears %in% probs)
probRows
## [1] 4 8
A brief comment on the code: The inner statement, presiData$presidencyYears %in% probs
returns a vector of TRUE
(for observations that are among the problematic values) and FALSE
(for observations that are not in the probs
vector). By wrapping the statement in a which()
call, instead of getting the logical vector, we get the row numbers of the places in the vector that are TRUE
.
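The interplay between %in% and which() can be seen on a small toy vector. The values below are made up for illustration and are not taken from presiData.

```r
# Toy stand-ins for presiData$presidencyYears and probs (hypothetical values)
years <- c(4, 8, 12, 4, Inf, 8)
toyProbs <- c(12, Inf)

# Logical vector: TRUE where the value is among the problem values
isProblem <- years %in% toyProbs
isProblem

# Positions of the TRUE entries
toyRows <- which(isProblem)
toyRows
```

Here the third and fifth entries are matched, so which() returns their positions rather than the values themselves.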
Next, we can use the indices stored in probRows to look at the full dataset for those rows:
presiData[probRows, ]
## lastName firstName orderOfPresidency birthday dateOfDeath
## 32 Roosevelt Franklin 32 1882-01-30 1945-04-12
## 44 Obama Barack 44 1961-08-04 <NA>
## stateOfBirth assassinationAttempt sex ethnicity presidencyYears
## 32 New York 1 Male Caucasian 12
## 44 Hawaii 0 Male African American Inf
## ageAtInauguration favoriteNumber
## 32 51 6+0i
## 44 47 6+0i
And now we can see that President Obama was listed to have an infinite presidency – a data error – while Franklin Roosevelt was the president with a 12 year presidency. However, that is not a mistake - it is just an unusual value.
We look at the toyData
dataset:
data(toyData)
toyData
## # A tibble: 15 x 6
## pill events region change id spotifysong
## <fct> <dbl> <fct> <dbl> <fct> <fct>
## 1 red 1 a -0.626 1 Irrelevant
## 2 red 1 a 0.184 2 Irrelevant
## 3 red 1 a -0.836 3 Irrelevant
## 4 red 2 a 1.60 4 Irrelevant
## 5 red 2 a 0.330 5 Irrelevant
## 6 red 6 b -0.820 6 Irrelevant
## 7 red 6 b 0.487 7 Irrelevant
## 8 red 6 b 0.738 8 Irrelevant
## 9 red 999 c 0.576 9 Irrelevant
## 10 red NA c -0.305 10 Irrelevant
## 11 blue 4 c 1.51 11 Irrelevant
## 12 blue 82 . 0.390 12 Irrelevant
## 13 blue NA " " -0.621 13 Irrelevant
## 14 <NA> NaN other -2.21 14 Irrelevant
## 15 <NA> 5 OTHER 1.12 15 Irrelevant
Do a full check()
on the full toyData
data frame but only consider the identifyMissing
check.
We use check()
to look for missing values in the toyData
dataset:
check(toyData, checks = setChecks(all = "identifyMissing"))
## $pill
## $pill$identifyMissing
## No problems found.
##
## $events
## $events$identifyMissing
## The following suspected missing value codes enter as regular values: 999, NaN.
##
## $region
## $region$identifyMissing
## The following suspected missing value codes enter as regular values: , ..
##
## $change
## $change$identifyMissing
## No problems found.
##
## $id
## $id$identifyMissing
## No problems found.
##
## $spotifysong
## $spotifysong$identifyMissing
## No problems found.
Return a vector of values that are potential missing values across the full dataset.
We now collect a vector with all potential missing values in the toyData dataset. Note that this will coerce them to all have the same data class (namely, character), as vectors in R can only have a single class.
The strategy goes as follows:
* Use check() like in the previous exercise to identify where there are problems. As we just saw, when called on a full dataset, this function returns a list (of variables) of lists (of checks) of lists (problem status, message and problem values).
* Use sapply() to select the problem values from each identifyMissing result. Note that if no problems are found, NULL is stored under problemValues.
* Use unlist() to obtain a character vector with the result. Note that this function automatically drops NULL entries.
missCheck <- check(toyData, checks = setChecks(all = "identifyMissing"))
missCheckValList <- sapply(missCheck, function(x) x$identifyMissing$problemValues)
missCheckVals <- unlist(missCheckValList)
#Look at the results
missCheckVals
## events1 events2 region1 region2
## "999" "NaN" " " "."
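To see why the result is a named character vector, the same sapply() and unlist() steps can be tried on a hand-built nested list. The object mockMiss below is a made-up stand-in with the same layout as the check() result; it is an assumption for illustration, not real dataMaid output.

```r
# Hand-built stand-in for the nested check() result (hypothetical object)
mockMiss <- list(
  pill = list(identifyMissing = list(problemValues = NULL)),
  events = list(identifyMissing = list(problemValues = c(999, NaN))),
  region = list(identifyMissing = list(problemValues = c(" ", ".")))
)

# One entry per variable; the entries differ in length, so we get a list
mockValList <- sapply(mockMiss, function(x) x$identifyMissing$problemValues)

# unlist() drops the NULL entry, coerces everything to character and
# builds names from the variable name plus a running number
mockVals <- unlist(mockValList)
mockVals
```

The numeric values 999 and NaN end up as the strings "999" and "NaN" because they share a vector with the character values from region.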
Use the template to finish writing refCat()
. Call it on pill
from toyData
in order to test whether it is working.
We fill out the missing line of refCat()
where we choose the first level of the variable (assuming that v
is a factor):
refCat <- function(v, ...) {
val <- levels(v)[1]
res <- val
summaryResult(list(feature = "Reference category", result = res,
value = val))
}
And we use it on pill
from toyData
:
refCat(toyData$pill)
## Reference category: blue
It seems to be working as intended. Note that we can also use it in a summarize()
call. For instance, below we add it to the summaries performed on factor
variables and summarize the full toyData
dataset:
summarize(toyData,
summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")))
## $pill
## $pill$variableType
## Variable type: factor
## $pill$countMissing
## Number of missing obs.: 2 (13.33 %)
## $pill$uniqueValues
## Number of unique values: 2
## $pill$centralValue
## Mode: "red"
## $pill$refCat
## Reference category: blue
##
## $events
## $events$variableType
## Variable type: numeric
## $events$countMissing
## Number of missing obs.: 3 (20 %)
## $events$uniqueValues
## Number of unique values: 8
## $events$centralValue
## Median: 4.5
## $events$quartiles
## 1st and 3rd quartiles: 1.75; 6
## $events$minMax
## Min. and max.: 1; 999
##
## $region
## $region$variableType
## Variable type: factor
## $region$countMissing
## Number of missing obs.: 0 (0 %)
## $region$uniqueValues
## Number of unique values: 7
## $region$centralValue
## Mode: "a"
## $region$refCat
## Reference category:
##
## $change
## $change$variableType
## Variable type: numeric
## $change$countMissing
## Number of missing obs.: 0 (0 %)
## $change$uniqueValues
## Number of unique values: 15
## $change$centralValue
## Median: 0.33
## $change$quartiles
## 1st and 3rd quartiles: -0.62; 0.66
## $change$minMax
## Min. and max.: -2.21; 1.6
##
## $id
## $id$variableType
## Variable type: factor
## $id$countMissing
## Number of missing obs.: 0 (0 %)
## $id$uniqueValues
## Number of unique values: 15
## $id$centralValue
## Mode: "1"
## $id$refCat
## Reference category: 1
##
## $spotifysong
## $spotifysong$variableType
## Variable type: factor
## $spotifysong$countMissing
## Number of missing obs.: 0 (0 %)
## $spotifysong$uniqueValues
## Number of unique values: 1
## $spotifysong$centralValue
## Mode: "Irrelevant"
## $spotifysong$refCat
## Reference category: Irrelevant
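In the output for pill, the reference category is blue even though the mode is red. This follows from how levels(v)[1] works: by default, factor() sorts the levels alphabetically, regardless of how often each value occurs. The sketch below uses a toy factor (made-up values, not toyData itself) and the base R function relevel() to show how the first level can be changed.

```r
# Toy factor resembling pill (hypothetical values)
toyPill <- factor(c("red", "red", "red", "blue", NA))

# Levels are sorted alphabetically by default, so "blue" comes first
levels(toyPill)[1]

# relevel() moves a chosen level to the front
toyPill2 <- relevel(toyPill, ref = "red")
levels(toyPill2)[1]
```

A function like refCat() that reads levels(v)[1] would then report red for the releveled factor.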
First, try calling allSummaryFunctions()
to see what summary functions are already available. We want refCat()
to be added to the output of this function. This is done by use of summaryFunction()
. Fill in the missing pieces in the code, run it, and try calling allSummaryFunctions()
again afterwards.
First, we see what summary functions are available already:
allSummaryFunctions()
name | description | classes |
---|---|---|
centralValue | Compute median for numeric variables, mode for categorical variables | character, Date, factor, integer, labelled, logical, numeric |
countMissing | Compute proportion of missing observations | character, Date, factor, integer, labelled, logical, numeric |
minMax | Find minimum and maximum values | integer, numeric, Date |
quartiles | Compute 1st and 3rd quartiles | Date, integer, numeric |
uniqueValues | Count number of unique values | character, Date, factor, integer, labelled, logical, numeric |
variableType | Data class of variable | character, Date, factor, integer, labelled, logical, numeric |
And then we make refCat()
a proper summary function by using the summaryFunction()
function to change its class:
refCat <- summaryFunction(refCat,
description = "Identify reference level",
classes = c("factor")
)
and we see that refCat
is now added to the output of allSummaryFunctions()
:
allSummaryFunctions()
name | description | classes |
---|---|---|
refCat | Identify reference level | factor |
centralValue | Compute median for numeric variables, mode for categorical variables | character, Date, factor, integer, labelled, logical, numeric |
countMissing | Compute proportion of missing observations | character, Date, factor, integer, labelled, logical, numeric |
minMax | Find minimum and maximum values | integer, numeric, Date |
quartiles | Compute 1st and 3rd quartiles | Date, integer, numeric |
uniqueValues | Count number of unique values | character, Date, factor, integer, labelled, logical, numeric |
variableType | Data class of variable | character, Date, factor, integer, labelled, logical, numeric |
Please note that even though we have written that the function should only be used for factor
variables, this is not enforced automatically: It is up to the user to consult allSummaryFunctions()
and only apply appropriate summaries.
Run the following code bit, adding refCat()
to the summaries used for factor variables and look at the result.
makeDataReport(presiData,
summaries = setSummaries(factor = defaultFactorSummaries(add = "refCat")),
vol = "_withRefCat")
We now see that “reference category” is included among the summaries used for factor
variables.
Finish identifyNonStartCase()
.
Here is an example of how the function can be written. But please note that there are many different approaches for solving the problem. This solution draws on list operations (with sapply()) but avoids regular expressions. Note also that we add a default value to the mandatory argument, nMax, so that it is easier to call from the console. However, this default setting will be overwritten both by check() and makeDataReport() if the function is used within one of these functions.
identifyNonStartCase <- function(v, nMax = 10, ...) {
  #omit NA values from v and only keep unique values. We convert to
  #character so that factor variables can be split as well
  v <- unique(as.character(na.omit(v)))
  #for each entry in v, split it around blank spaces. Note that this
  #function returns a list with one entry per entry in v
  vSplit <- strsplit(v, split = " ")
  #We then transform all entries of v to be lower case. Note that this
  #version is not used below, so capital letters after the first letter
  #of a word are not flagged
  vSplitAllLower <- sapply(vSplit, tolower)
  #We define a helper function that capitalizes the first letter of
  #each character string and leaves the remaining letters as they are
  foo <- function(x) {
    capFirstLetters <- toupper(substring(x, 1, 1))
    x <- paste(capFirstLetters, substring(x, 2), sep = "")
    x
  }
  #And we use the helper function to obtain a Start Case version of v (first
  #as list, then as a variable). simplify = FALSE keeps the result a list
  #even when all entries have the same number of words
  vSplitStartCase <- sapply(vSplit, foo, simplify = FALSE)
  vStartCase <- sapply(vSplitStartCase, function(x) paste(x, collapse = " "))
  #We can then compare the original v with the Start Case v in order to
  #find problems (i.e. places where they differ)
  problemPlaces <- v != vStartCase
  #The problemValues are then the values in v in the problemPlaces. If
  #no problems are found, we store NULL
  if (any(problemPlaces)) {
    problemValues <- v[problemPlaces]
  } else problemValues <- NULL
  #We store a logical indicator of whether a problem was found
  problem <- any(problemPlaces)
  problemStatus <- list(problem = problem,
                        problemValues = problemValues)
  problemMessage <- "The following values were not in start case:"
  outMessage <- messageGenerator(problemStatus, problemMessage, nMax)
  checkResult(list(problem = problem,
                   message = outMessage,
                   problemValues = problemValues))
}
identifyNonStartCase <- checkFunction(identifyNonStartCase,
description = "Identify entries that are not written in Start Case",
classes = c("character", "factor")
)
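The core of the check, splitting on blanks, capitalizing the first letter of each word and pasting the words back together, can be tried out in isolation without dataMaid. The helper name toStartCase and the example values below are made up for this sketch; it is one possible way to write the conversion, not the only one.

```r
# Hypothetical stand-alone helper: capitalize the first letter of each word
toStartCase <- function(s) {
  words <- strsplit(s, split = " ", fixed = TRUE)
  vapply(words, function(w) {
    # capitalize the first letter, keep the rest, then rejoin the words
    paste0(toupper(substring(w, 1, 1)), substring(w, 2), collapse = " ")
  }, character(1))
}

states <- c("New york", "New York", "iowa")
converted <- toStartCase(states)
converted

# Values that differ from their Start Case version are the problem values
states[states != converted]
```

Using vapply() with character(1) guarantees one string per input value, which keeps the comparison with the original vector aligned.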
Use identifyNonStartCase()
on the variable stateOfBirth
from presiData
. Try using it on all character
variables in presiData
by use of the function check()
.
We use identifyNonStartCase()
on stateOfBirth
:
identifyNonStartCase(presiData$stateOfBirth)
## The following values were not in start case: New york.
And we find the lower-case “New York” entry that we have encountered before. We add identifyNonStartCase
to the checks applied on character
variables and use it to check presiData
:
check(presiData,
checks = setChecks(character = defaultCharacterChecks(add = "identifyNonStartCase")))
## $lastName
## $lastName$identifyMissing
## No problems found.
## $lastName$identifyWhitespace
## The following values appear with prefixed or suffixed white space: Truman.
## $lastName$identifyLoners
## Note that the following levels have at most five observations: Truman, Adams, Arathornson, Arthur, Buchanan, Bush, Carter, Cleveland, Clinton, Coolidge (30 additional values omitted).
## $lastName$identifyCaseIssues
## No problems found.
## $lastName$identifyNums
## No problems found.
## $lastName$identifyNonStartCase
## No problems found.
##
## $firstName
## $firstName$identifyMissing
## The following suspected missing value codes enter as regular values: ..
## $firstName$identifyWhitespace
## No problems found.
## $firstName$identifyLoners
## Note that the following levels have at most five observations: ., Abraham, Andrew, Aragorn, Barack, Benjamin, Chester, Dwight, Dwight D, Franklin (22 additional values omitted).
## $firstName$identifyCaseIssues
## No problems found.
## $firstName$identifyNums
## No problems found.
## $firstName$identifyNonStartCase
## No problems found.
##
## $orderOfPresidency
## $orderOfPresidency$identifyMissing
## No problems found.
## $orderOfPresidency$identifyWhitespace
## No problems found.
## $orderOfPresidency$identifyLoners
## Note that the following levels have at most five observations: 0, 1, 10, 11, 12, 13, 14, 15, 16, 17 (36 additional values omitted).
## $orderOfPresidency$identifyCaseIssues
## No problems found.
## $orderOfPresidency$identifyNums
## Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
##
## $birthday
## $birthday$identifyOutliers
## Note that the following possible outlier values were detected: 1300-03-01.
## $birthday$identifyMissing
## No problems found.
##
## $dateOfDeath
## $dateOfDeath$identifyOutliers
## Note that the following possible outlier values were detected: 1510-01-01.
## $dateOfDeath$identifyMissing
## No problems found.
##
## $stateOfBirth
## $stateOfBirth$identifyMissing
## No problems found.
## $stateOfBirth$identifyWhitespace
## No problems found.
## $stateOfBirth$identifyLoners
## Note that the following levels have at most five observations: Arkansas, California, Connecticut, Georgia, Gondor, Hawaii, Illinois, Indiana, Iowa, Kentucky (12 additional values omitted).
## $stateOfBirth$identifyCaseIssues
## Note that there might be case problems with the following levels: New york, New York.
## $stateOfBirth$identifyNums
## No problems found.
## $stateOfBirth$identifyNonStartCase
## The following values were not in start case: New york.
##
## $assassinationAttempt
## $assassinationAttempt$identifyMissing
## No problems found.
## $assassinationAttempt$identifyOutliers
## Note that the following possible outlier values were detected: 1.
##
## $sex
## $sex$identifyMissing
## No problems found.
## $sex$identifyWhitespace
## No problems found.
## $sex$identifyLoners
## No problems found.
## $sex$identifyCaseIssues
## No problems found.
## $sex$identifyNums
## No problems found.
##
## $ethnicity
## $ethnicity$identifyMissing
## No problems found.
## $ethnicity$identifyWhitespace
## No problems found.
## $ethnicity$identifyLoners
## Note that the following levels have at most five observations: African American.
## $ethnicity$identifyCaseIssues
## No problems found.
## $ethnicity$identifyNums
## No problems found.
##
## $presidencyYears
## $presidencyYears$identifyMissing
## The following suspected missing value codes enter as regular values: Inf.
## $presidencyYears$identifyOutliers
## Note that the following possible outlier values were detected: 12, Inf.
##
## $ageAtInauguration
## $ageAtInauguration$identifyMissing
## No problems found.
## $ageAtInauguration$identifyWhitespace
## No problems found.
## $ageAtInauguration$identifyLoners
## Note that the following levels have at most five observations: 42, 43, 46, 47, 48, 49, 50, 51, 52, 54 (13 additional values omitted).
## $ageAtInauguration$identifyCaseIssues
## No problems found.
## $ageAtInauguration$identifyNums
## Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
## $ageAtInauguration$identifyNonStartCase
## No problems found.
##
## $favoriteNumber
## $favoriteNumber$NoChecksPerformed
## No problems found.
Add identifyNonStartCase() to the checks used on character variables in a makeDataReport() call on presiData. Use the argument vol = "_nonStartCase" to give the new report a different file name than the old ones. Compare the new report with an old version. Can you find the description you wrote for identifyNonStartCase() when calling checkFunction()?
We do as told:
makeDataReport(presiData,
checks = setChecks(character = defaultCharacterChecks(add = "identifyNonStartCase")),
vol = "_nonStartCase")
We see that the description is added on the first page of the report in the table that summarizes what checks were performed.
We now show how the data mistakes can be fixed, one by one. Since we advise that this is done in a self-contained R script, we provide the answers as a commented chunk of R code. Note that we correct all errors, some of which you might not (yet) be aware of.
#Load packages
library(dataMaid)
#Load data (from the same URL as in the first exercise)
load(url("http://biostatistics.dk/eRum2018/data/presiData.rda"))
#Copy data. This is the data we will make changes to.
pd <- presiData
#We fix the following mistake:
#The variables firstName and lastName are stored using a non-standard
#class even though they are really just character variables.
class(pd$firstName) <- "character"
class(pd$lastName) <- "character"
#We fix the following mistake:
#Aragorn Arathornson is included in the dataset.
pd <- pd[!(pd$firstName == "Aragorn" & pd$lastName == "Arathornson"),]
#We fix the following mistake:
#Trump has "." listed as his first name (firstName).
pd[pd$lastName == "Trump", "firstName"] <- "Donald"
#We fix the following mistake:
#Obama's presidency duration is listed as infinite (presidencyYears).
pd[pd$lastName == "Obama", "presidencyYears"] <- 8
#We fix the following mistake:
#Trump's state of birth (New York) was spelled with a lower case "y" (stateOfBirth).
pd[pd$lastName == "Trump", "stateOfBirth"] <- "New York"
#We fix the following mistake:
#Truman's last name is prefixed with whitespace (lastName).
pd[pd$lastName == " Truman", "lastName"] <- "Truman"
#We fix the following mistake:
#ageAtInauguration is coded as a character variable.
pd$ageAtInauguration <- as.numeric(pd$ageAtInauguration)
#We fix the following mistake:
#James Garfield's state of birth (stateOfBirth) has been changed from Ohio to Indiana
#(state of birth of Jim Davis, the creator of the cartoon "Garfield").
pd$stateOfBirth[pd$firstName == "James" & pd$lastName == "Garfield"] <- "Ohio"
#We fix the following mistake:
#Calvin Coolidge has had his first name changed to "Hobbes" (firstName).
pd$firstName[pd$firstName == "Hobbes"] <- "Calvin"
#We fix the following mistake:
#Eisenhower appears twice in the dataset, one time with the first name "Dwight"
#and one time with the first name "Dwight D".
#Note: We delete the observation with the extra "D" as the other presidents do
#not have their middle names included
pd <- pd[pd$firstName != "Dwight D",]
#We fix the following mistake:
#Lincoln has had his date of death recorded as 1801-04-15 rather than the actual
#date 1865-04-15.
pd[pd$lastName == "Lincoln", "dateOfDeath"] <- as.Date("1865-04-15")
#Save a new copy of the data
save(list = "pd", file = "presiData_cleaned.rdata")
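Before saving, it can be worth checking that fixes of this kind actually took effect. The following is only a minimal sketch in base R: the data frame `toy` and its values are hypothetical and merely mimic three of the problems above (prefixed whitespace, a duplicated person and an infinite duration). It is not part of the cleaning script itself.

```r
# Hypothetical toy data (illustrative values, not taken from presiData)
toy <- data.frame(
  firstName = c("Harry", "Dwight", "Dwight D", "Barack"),
  lastName = c(" Truman", "Eisenhower", "Eisenhower", "Obama"),
  presidencyYears = c(7.8, 8, 8, Inf),
  stringsAsFactors = FALSE
)

# Apply the same kind of fixes as in the script above
toy$lastName <- trimws(toy$lastName)                 # remove surrounding whitespace
toy <- toy[toy$firstName != "Dwight D", ]            # drop the duplicated person
toy[toy$lastName == "Obama", "presidencyYears"] <- 8 # replace the infinite duration

# Simple assertions: stopifnot() raises an error if a condition is FALSE
stopifnot(
  !any(grepl("^\\s|\\s$", toy$lastName)),
  nrow(toy) == 3,
  all(is.finite(toy$presidencyYears))
)
```

The same `stopifnot()` conditions could be adapted to `pd`, so that the script fails loudly if a fix silently stops matching any rows.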
Use the makeCodebook() function to produce a final codebook that could be passed on as documentation of the dataset for the data analysis.
We make a codebook of pd:
makeCodebook(pd, replace = TRUE)
Use the option to set shortDescription attributes for the dataset to explain that: a) the information in favoriteNumber has been obtained by consulting a Ouija board or, when that failed, just typing in a number, and consequently the accuracy may be low, b) for assassinationAttempt, 1 means yes and 0 means no, c) firstName is literally the first name, with no middle names or initials. Also try using the label attribute to add some variable labels of your own choosing. Run the codebook command again.
We add short descriptions:
attr(pd$favoriteNumber, "shortDescription") <- "The information has been obtained by consulting a Ouija board or - when that failed - just typing in a number. Thus, the accuracy may be low."
attr(pd$assassinationAttempt, "shortDescription") <- "1 means yes and 0 means no."
attr(pd$firstName, "shortDescription") <- "Only literal first names, no inclusion of middle names or initials"
And we add labels:
attr(pd$favoriteNumber, "label") <- "Favorite number"
attr(pd$assassinationAttempt, "label") <- "Was there an assassination attempt on the president?"
attr(pd$firstName, "label") <- "First name"
attr(pd$lastName, "label") <- "Last name"
attr(pd$presidencyYears, "label") <- "Duration of presidency"
attr(pd$orderOfPresidency, "label") <- "Presidency order"
attr(pd$birthday, "label") <- "Date of birth"
attr(pd$dateOfDeath, "label") <- "Date of death"
attr(pd$stateOfBirth, "label") <- "Birth state of the president"
attr(pd$sex, "label") <- "Sex"
attr(pd$ethnicity, "label") <- "Ethnicity"
attr(pd$ageAtInauguration, "label") <- "Age at inauguration"
And lastly, we rerun the codebook:
makeCodebook(pd, replace = TRUE)
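The labels and short descriptions are ordinary R attributes, so they can be read back with attr(). The sketch below uses base R only and a hypothetical data frame `toy` (not pd). It also illustrates one reason for adding the attributes after the cleaning steps: row subsetting of a data frame drops custom attributes from plain vector columns.

```r
# Hypothetical toy data with one indicator variable
toy <- data.frame(attempt = c(0, 1, 0))

# Set attributes in the same way as for pd above
attr(toy$attempt, "label") <- "Assassination attempt?"
attr(toy$attempt, "shortDescription") <- "1 means yes and 0 means no."

# Read the attributes back
label <- attr(toy$attempt, "label")
description <- attr(toy$attempt, "shortDescription")

# Row subsetting creates new column vectors without the custom attributes
toySub <- toy[toy$attempt == 0, , drop = FALSE]
labelLost <- is.null(attr(toySub$attempt, "label"))
```

Here `labelLost` is TRUE, so any attributes set before a row-deleting step would have to be set again afterwards.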
Thanks for reading! If you have questions or comments about these exercises, don’t hesitate to contact Anne Helby Petersen at ahpe@sund.ku.dk.