NOTE: This function has been replaced by makeDataReport. The current
function is no longer updated and is only included for backwards compatibility.
clean(data, output = c("pdf", "html"), render = TRUE, useVar = NULL,
  ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE,
  labelled_as = c("factor"), mode = c("summarize", "visualize", "check"),
  smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL, replace = FALSE, vol = "", standAlone = TRUE, twoCol = TRUE,
  quiet = TRUE, openResult = TRUE,
  characterChecks = defaultCharacterChecks(),
  factorChecks = defaultFactorChecks(),
  labelledChecks = defaultLabelledChecks(),
  numericChecks = defaultNumericChecks(),
  integerChecks = defaultIntegerChecks(),
  logicalChecks = defaultLogicalChecks(),
  dateChecks = defaultDateChecks(), allChecks = NULL,
  characterSummaries = defaultCharacterSummaries(),
  factorSummaries = defaultFactorSummaries(),
  labelledSummaries = defaultLabelledSummaries(),
  numericSummaries = defaultNumericSummaries(),
  integerSummaries = defaultIntegerSummaries(),
  logicalSummaries = defaultLogicalSummaries(),
  dateSummaries = defaultDateSummaries(), allSummaries = NULL,
  allVisuals = "standardVisual", listChecks = TRUE, maxProbVals = 10,
  maxDecimals = 2, addSummaryTable = TRUE, reportTitle = NULL,
  treatXasY = NULL, ...)
data | The dataset to be checked. This dataset should be of class |
---|---|
output | Output format. Options are |
render | Should the output file be rendered (defaults to |
useVar | Variables to clean. If |
ordering | Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order). |
onlyProblematic | A logical. If |
labelled_as | A string explaining the way to handle labelled vectors.
Currently |
mode | Vector of tasks to perform among the three categories "summarize", "visualize" and "check".
The default, |
smartNum | If |
preChecks | Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks finds a problem, the variable will not be summarized, visualized or checked. |
file | The filename of the outputted rmarkdown (.Rmd) file.
If set to |
replace | If |
vol | Extra text string or numeric that is appended to the end of the output
file name(s). For example, if the dataset is called "myData", no file argument is
supplied and |
standAlone | A logical. If |
twoCol | A logical. Should the results from the summarize and visualize
steps be presented in two columns? Defaults to |
quiet | A logical. If |
openResult | A logical. If |
characterChecks | A vector of the names of error-checking functions to apply to character vectors. |
factorChecks | A vector of the names of error-checking functions to apply to factor vectors. |
labelledChecks | A vector of the names of error-checking functions to apply to labelled vectors. |
numericChecks | A vector of the names of error-checking functions to apply to numeric vectors. |
integerChecks | A vector of the names of error-checking functions to apply to integer vectors. |
logicalChecks | A vector of the names of error-checking functions to apply to logical vectors. |
dateChecks | A vector of the names of error-checking functions to apply to Date vectors. |
allChecks | Vector of function names that should be used as check-functions
for all variable types. Note that this argument overwrites the arguments
|
characterSummaries | A vector of the names of summary functions to apply to character vectors. |
factorSummaries | A vector of the names of summary functions to apply to factor vectors. |
labelledSummaries | A vector of the names of summary functions to apply to labelled vectors. |
numericSummaries | A vector of the names of summary functions to apply to numeric vectors. |
integerSummaries | A vector of the names of summary functions to apply to integer vectors. |
logicalSummaries | A vector of the names of summary functions to apply to logical vectors. |
dateSummaries | A vector of the names of summary functions to apply to Date vectors. |
allSummaries | Vector of function names that should be used as summary
functions for all variable types. Note that this argument overwrites the arguments
|
allVisuals | A single function name. This function is called to
create the plots for each variable in the "visualize" step. The default,
|
listChecks | A logical. Controls whether the checks that were used for each
possible variable type are summarized in the output. Defaults to |
maxProbVals | A positive integer or |
maxDecimals | A positive integer or |
addSummaryTable | A logical. If |
reportTitle | A text string. If supplied, this will be the printed title of the report. If left unspecified, the title is based on the name of the supplied dataset. |
treatXasY | A list that indicates how non-standard variable classes should be treated.
This parameter allows you to include variables that are not of class |
… | Other arguments that are passed on to the precheck, checking, summary and visualization functions. |
The function does not return anything. Its side effect (the production of a data cleaning overview document) is the reason for running the function.
Run a set of class-specific validation checks to check the variables in a dataset for potential errors. The checking steps are chosen according to user input and/or the data type of each variable. The results are saved to an R markdown file, which can be rendered into an easy-to-read document. This document also includes summaries and visualizations of each variable in the dataset.
For each variable, a set of pre-checks (controlled by the preChecks
argument) is run first, and then a battery of functions is applied
depending on the variable class. For each variable type, the
summarize/visualize/check functions are applied and the results are
written to an R markdown file.
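The two-stage flow described above can be sketched in base R. This is only an illustration of the control flow, not the dataMaid implementation: the helpers `is_singular`, `has_missing_codes`, `has_whitespace` and `check_variable`, the `toy` data frame, and the choice of 999/-999 as missing value codes are all hypothetical.

```r
# Hypothetical stand-ins for pre-check and check functions. These are NOT
# the dataMaid functions; each returns TRUE when it finds a problem.
is_singular <- function(v) length(unique(v[!is.na(v)])) <= 1
has_missing_codes <- function(v) any(v %in% c(999, -999))
has_whitespace <- function(v) any(grepl("^\\s+|\\s+$", v))

pre_checks <- list(isSingular = is_singular)
class_checks <- list(
  numeric   = list(missingCodes = has_missing_codes),
  character = list(whitespace = has_whitespace)
)

check_variable <- function(v) {
  # Stage 1: pre-checks. A variable that fails any of them is skipped.
  pre <- vapply(pre_checks, function(f) f(v), logical(1))
  if (any(pre)) {
    return(list(skipped = TRUE, failedPreChecks = names(pre)[pre]))
  }
  # Stage 2: apply the checks registered for the variable's class.
  checks <- class_checks[[class(v)[1]]]
  if (is.null(checks)) checks <- list()
  list(skipped = FALSE,
       problems = vapply(checks, function(f) f(v), logical(1)))
}

toy <- data.frame(
  events   = c(1, 4, 999, 2),
  region   = c("a", " b", "c", "a"),
  constant = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)
results <- lapply(toy, check_variable)
```

In this sketch, `constant` fails the pre-check and is skipped, while `events` and `region` reach their class-specific checks.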
#> $pill
#> $pill$identifyMissing
#> No problems found.
#> $pill$identifyWhitespace
#> No problems found.
#> $pill$identifyLoners
#> Note that the following levels have at most five observations: blue.
#> $pill$identifyCaseIssues
#> No problems found.
#> $pill$identifyNums
#> No problems found.
#>
#> $events
#> $events$identifyMissing
#> The following suspected missing value codes enter as regular values: 999, NaN.
#> $events$identifyOutliers
#> Note that the following possible outlier values were detected: 82, 999.
#>
#> $region
#> $region$identifyMissing
#> The following suspected missing value codes enter as regular values: , ..
#> $region$identifyWhitespace
#> The following values appear with prefixed or suffixed white space: .
#> $region$identifyLoners
#> Note that the following levels have at most five observations: , ., a, b, c, other, OTHER.
#> $region$identifyCaseIssues
#> Note that there might be case problems with the following levels: other, OTHER.
#> $region$identifyNums
#> No problems found.
#>
#> $change
#> $change$identifyMissing
#> No problems found.
#> $change$identifyOutliers
#> Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.
#>
#> $id
#> $id$identifyMissing
#> No problems found.
#> $id$identifyWhitespace
#> No problems found.
#> $id$identifyLoners
#> Note that the following levels have at most five observations: 1, 10, 11, 12, 13, 14, 15, 2, 3, 4 (5 additional values omitted).
#> $id$identifyCaseIssues
#> No problems found.
#> $id$identifyNums
#> Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
#>
#> $spotifysong
#> $spotifysong$identifyMissing
#> No problems found.
#> $spotifysong$identifyWhitespace
#> No problems found.
#> $spotifysong$identifyLoners
#> No problems found.
#> $spotifysong$identifyCaseIssues
#> No problems found.
#> $spotifysong$identifyNums
#> No problems found.
# NOT RUN {
DF <- data.frame(x = 1:15)
clean(DF)
# }

# NOT RUN {
data(testData)
clean(testData)
# }

# Overwrite any existing files generated by clean
# NOT RUN {
clean(testData, replace = TRUE)
# }

# Only include problematic variables in the output document
# NOT RUN {
clean(testData, replace = TRUE, onlyProblematic = TRUE)
# }

# Add a user-defined check function to the checks performed on character variables.
# Here we add functionality to search for the string wally (ignoring case)
# NOT RUN {
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case = TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
  description = "Search for the string 'wally' ignoring case",
  classes = c("character")
)

# Add the newly defined function to the list of checks used for characters.
clean(testData, characterChecks = c(defaultCharacterChecks(), "wheresWally"),
      replace = TRUE)
# }

# Handle non-supported variable classes using treatXasY: treat raw as character and
# treat complex as numeric. We also add a list variable, but as lists are not
# handled through treatXasY, this variable will be caught in the preChecks and skipped:
# NOT RUN {
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
clean(toyData, replace = TRUE,
      treatXasY = list(raw = "character", complex = "numeric"))
# }
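The matching logic inside the wheresWally example can be tried on its own in base R before it is wrapped with checkResult and checkFunction. A minimal sketch, assuming a hypothetical input vector `v` that is not part of any packaged dataset:

```r
# Hypothetical input vector, made up for illustration only.
v <- c("Alice", "Wally", "bob", "WALLY99", NA)

# Same matching step as in wheresWally: a case-insensitive search for "wally".
res <- grepl("wally", v, ignore.case = TRUE)

problem <- any(res)      # was at least one match found?
problemValues <- v[res]  # the values that triggered the check
```

Note that `grepl` returns `FALSE` for `NA` entries, so missing values are never reported as matches by this check.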