NOTE: This function has been replaced by makeDataReport. It is no longer updated and is included only for backwards compatibility.

clean(data, output = c("pdf", "html"), render = TRUE, useVar = NULL,
  ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE,
  labelled_as = c("factor"), mode = c("summarize", "visualize", "check"),
  smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL, replace = FALSE, vol = "", standAlone = TRUE,
  twoCol = TRUE, quiet = TRUE, openResult = TRUE,
  characterChecks = defaultCharacterChecks(),
  factorChecks = defaultFactorChecks(),
  labelledChecks = defaultLabelledChecks(),
  numericChecks = defaultNumericChecks(),
  integerChecks = defaultIntegerChecks(),
  logicalChecks = defaultLogicalChecks(), dateChecks = defaultDateChecks(),
  allChecks = NULL, characterSummaries = defaultCharacterSummaries(),
  factorSummaries = defaultFactorSummaries(),
  labelledSummaries = defaultLabelledSummaries(),
  numericSummaries = defaultNumericSummaries(),
  integerSummaries = defaultIntegerSummaries(),
  logicalSummaries = defaultLogicalSummaries(),
  dateSummaries = defaultDateSummaries(), allSummaries = NULL,
  allVisuals = "standardVisual", listChecks = TRUE, maxProbVals = 10,
  maxDecimals = 2, addSummaryTable = TRUE, reportTitle = NULL,
  treatXasY = NULL, ...)

Arguments

data

The dataset to be checked. This dataset should be of class data.frame, tibble or matrix. If it is of class matrix, it will be converted to a data.frame.
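As a rough illustration of what a matrix-to-data.frame conversion looks like, the sketch below uses base R's as.data.frame(). Whether clean() converts its input in exactly this way is an assumption; the matrix values and column names are made up.

```r
# A small matrix with named columns (made-up values).
m <- matrix(1:6, ncol = 2, dimnames = list(NULL, c("a", "b")))

# Base R conversion: each matrix column becomes a data.frame variable.
df <- as.data.frame(m)

class(df)
names(df)
```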

output

Output format. Options are "pdf" (the default) and "html".

render

Should the output file be rendered (defaults to TRUE), i.e. should a pdf/html document be generated and saved to disk?

useVar

Variables to clean. If NULL (the default), all variables in data are included. If a vector of variable names is supplied, only the variables in data that are also in useVar are included in the data cleaning overview document.
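The selection rule described above can be sketched in base R. The helper select_vars() is hypothetical and not part of dataMaid, and keeping the variables in dataset order is an assumption.

```r
# Hypothetical helper: which variables would be included for a given useVar?
select_vars <- function(data, useVar = NULL) {
  if (is.null(useVar)) {
    return(names(data))               # NULL means all variables
  }
  names(data)[names(data) %in% useVar]  # names not found in data are ignored
}

d <- data.frame(a = 1, b = 2, c = 3)
select_vars(d)
select_vars(d, useVar = c("c", "a", "z"))
```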

ordering

Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order).

onlyProblematic

A logical. If TRUE, only the variables flagged as problematic in the check step will be included in the variable list.

labelled_as

A string explaining the way to handle labelled vectors. Currently "factor" (the default) is the only possibility. This means that labelled variables that appear factor-like (by having a non-NULL labels-attribute) will be treated as factors, while other labelled variables will be treated as whatever base variable class they inherit from.

mode

Vector of tasks to perform among the three categories "summarize", "visualize" and "check". The default, c("summarize", "visualize", "check"), implies that all three steps are performed. The steps selected in mode will be performed for each variable in data and their results are presented in the second part of the outputted data cleaning overview document. The "summarize" step is responsible for creating the summary table, the "visualize" step is responsible for creating the plot and the "check" step is responsible for performing checks on the variable and printing the results if any problems are found.

smartNum

If TRUE (the default), numeric and integer variables with fewer than 5 unique values are treated as factor variables in the checking, visualization and summary steps, and a message notifying the reader of this is printed in the data summary.
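The rule can be sketched as a small predicate in base R. The helper treat_as_factor() is hypothetical, and counting only non-missing values is an assumption about how unique values are determined.

```r
# Hypothetical helper: would smartNum treat this variable as a factor?
treat_as_factor <- function(v, smartNum = TRUE, cutoff = 5) {
  # is.numeric() is TRUE for both integer and double vectors.
  # Assumption: missing values are not counted as a unique value.
  smartNum && is.numeric(v) && length(unique(v[!is.na(v)])) < cutoff
}

treat_as_factor(c(1, 2, 3, 1, 2))       # three unique values
treat_as_factor(1:15)                   # fifteen unique values
treat_as_factor(c(1, 2), smartNum = FALSE)
```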

preChecks

Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized, visualized or checked.

file

The filename of the outputted rmarkdown (.Rmd) file. If set to NULL (the default), the filename will be the name of data prefixed with "dataMaid_", if this qualifies as a valid file name (e.g. no special characters allowed). Otherwise, clean() tries to create a valid filename by substituting illegal characters. Note that a valid file is of type .Rmd, hence all filenames should have a ".Rmd" suffix.

replace

If FALSE (the default), an error is thrown if one of the files that are about to be created (the .Rmd overview file and possibly also a .html or .pdf file) already exists. If TRUE, no checks are performed and files on disk might thus be overwritten.
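The behaviour described for replace resembles the following base R guard. This is only a sketch under that reading: check_overwrite() is a hypothetical helper, and the error message is invented for illustration.

```r
# Hypothetical guard: refuse to continue if any target file already exists,
# unless replace = TRUE.
check_overwrite <- function(files, replace = FALSE) {
  existing <- files[file.exists(files)]
  if (!replace && length(existing) > 0) {
    stop("File(s) already exist: ", paste(existing, collapse = ", "))
  }
  invisible(TRUE)
}

# Demonstration with a temporary file so nothing real is touched.
tmp <- tempfile(fileext = ".Rmd")
file.create(tmp)
blocked <- try(check_overwrite(tmp), silent = TRUE)   # errors: file exists
allowed <- check_overwrite(tmp, replace = TRUE)       # passes
unlink(tmp)
```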

vol

Extra text string or numeric that is appended to the end of the output file name(s). For example, if the dataset is called "myData", no file argument is supplied and vol=2, the output file will be called "dataMaid_myData2.Rmd".
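The naming scheme from the file and vol descriptions can be reproduced with paste0(). The helper default_file_name() is hypothetical, and the sketch ignores the substitution of illegal characters mentioned under file.

```r
# Hypothetical helper mirroring the documented default file name:
# "dataMaid_" + dataset name + vol + ".Rmd"
default_file_name <- function(data_name, vol = "") {
  paste0("dataMaid_", data_name, vol, ".Rmd")
}

default_file_name("myData")
default_file_name("myData", vol = 2)  # "dataMaid_myData2.Rmd", as in the text
```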

standAlone

A logical. If TRUE, the document begins with a markdown YAML preamble such that it can be rendered as a stand alone rmarkdown file, e.g. by calling render. If FALSE, this preamble is removed. Moreover, no matter the input to the render argument, the document will now not be rendered, as it has no preamble.

twoCol

A logical. Should the results from the summarize and visualize steps be presented in two columns? Defaults to TRUE.

quiet

A logical, or the string "silent". If TRUE (the default), only a few messages are printed to the screen as clean runs. If FALSE, no messages are suppressed. The third option, "silent", renders the function completely silent, such that only fatal errors are printed.

openResult

A logical. If TRUE (the default), the last file produced by clean is automatically opened at the end of the function run. This means that if render = TRUE, the rendered pdf or html file is opened, while if render = FALSE, the .Rmd file is opened.

characterChecks

A vector of the names of error-checking functions to apply to character vectors.

factorChecks

A vector of the names of error-checking functions to apply to factor vectors.

labelledChecks

A vector of the names of error-checking functions to apply to labelled vectors.

numericChecks

A vector of the names of error-checking functions to apply to numeric vectors.

integerChecks

A vector of the names of error-checking functions to apply to integer vectors.

logicalChecks

A vector of the names of error-checking functions to apply to logical vectors.

dateChecks

A vector of the names of error-checking functions to apply to Date vectors.

allChecks

Vector of function names that should be used as check-functions for all variable types. Note that this argument overwrites the arguments characterChecks, factorChecks, etc.

characterSummaries

A vector of the names of summary functions to apply to character vectors.

factorSummaries

A vector of the names of summary functions to apply to factor vectors.

labelledSummaries

A vector of the names of summary functions to apply to labelled vectors.

numericSummaries

A vector of the names of summary functions to apply to numeric vectors.

integerSummaries

A vector of the names of summary functions to apply to integer vectors.

logicalSummaries

A vector of the names of summary functions to apply to logical vectors.

dateSummaries

A vector of the names of summary functions to apply to Date vectors.

allSummaries

Vector of function names that should be used as summary functions for all variable types. Note that this argument overwrites the arguments characterSummaries, factorSummaries, etc.

allVisuals

A single function name. This function is called to create the plots for each variable in the "visualize" step. The default, "standardVisual", thus calls the visualFunction standardVisual for each variable in data.

listChecks

A logical. Controls whether the checks that were used for each possible variable type are summarized in the output. Defaults to TRUE.

maxProbVals

A positive integer or Inf. Maximum number of unique values printed from check-functions. In the case of Inf, all problematic values are printed. Defaults to 10.
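The effect of maxProbVals on printed values can be approximated in base R. The helper format_problem_values() is hypothetical, and both the sorting of values as character strings and the wording of the suffix are assumptions modelled on the example output for check(toyData).

```r
# Hypothetical helper: print at most maxProbVals unique values.
format_problem_values <- function(values, maxProbVals = 10) {
  # Sort as strings in C-locale order so the result is reproducible.
  vals <- sort(unique(as.character(values)), method = "radix")
  n <- length(vals)
  if (is.finite(maxProbVals) && n > maxProbVals) {
    shown <- paste(vals[seq_len(maxProbVals)], collapse = ", ")
    paste0(shown, " (", n - maxProbVals, " additional values omitted)")
  } else {
    paste(vals, collapse = ", ")   # Inf, or few enough values: print all
  }
}

format_problem_values(1:15)
format_problem_values(1:15, maxProbVals = Inf)
```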

maxDecimals

A positive integer or Inf. Number of decimals used when printing numerical values in the data summary and in problematic values from the data checks. If Inf, no rounding is performed.

addSummaryTable

A logical. If TRUE (the default), a summary table of the variable checks is added between the Data Cleaning Summary and the Variable List.

reportTitle

A text string. If supplied, this will be the printed title of the report. If left unspecified, the title is based on the name of the supplied dataset.

treatXasY

A list that indicates how non-standard variable classes should be treated. This parameter allows you to include variables that are not of class factor, character, labelled, numeric, integer, logical or Date (or a class that inherits from any of these classes). The names of the list are the new classes and the entries are the names of the classes they should be treated as. For example, if clean() should treat variables of class raw as characters and variables of class complex as numeric, use treatXasY = list(raw = "character", complex = "numeric").
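One way to picture the mapping is a lookup from a variable's class to the class it should be handled as. The helper effective_class() is hypothetical and says nothing about how clean() resolves classes internally.

```r
# Hypothetical helper: look up how a variable would be treated.
effective_class <- function(v, treatXasY = NULL) {
  cls <- class(v)[1]
  if (!is.null(treatXasY) && cls %in% names(treatXasY)) {
    treatXasY[[cls]]   # mapped to a supported class
  } else {
    cls                # unchanged; unsupported classes stay unsupported
  }
}

mapping <- list(raw = "character", complex = "numeric")
effective_class(as.raw(1:3), mapping)
effective_class(1:3 + 2i, mapping)
effective_class(list(1, 2), mapping)  # lists are not covered by the mapping
```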

...

Other arguments that are passed on to the pre-check, checking, summary and visualization functions.

Value

The function does not return anything. Its side effect (the production of a data cleaning overview document) is the reason for running the function.

Details

Run a set of class-specific validation checks to check the variables in a dataset for potential errors. Performs checking steps according to user input and/or data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read document. This document also includes summaries and visualizations of each variable in the dataset.

For each variable, a set of pre-checks (controlled by the preChecks argument) is run first, and then a battery of functions is applied depending on the variable class. For each variable type, the summarize/visualize/check functions are applied and the results are written to an R markdown file.
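The two-stage flow described above (pre-checks first, then class-dependent functions) can be sketched in base R. All names here (process_variable, pre_checks, class_functions) are hypothetical stand-ins, and the functions used are simple placeholders rather than dataMaid's own checks and summaries.

```r
# Hypothetical sketch of the per-variable flow.
process_variable <- function(v, pre_checks, class_functions) {
  # Stage 1: run every pre-check; TRUE means a problem was found.
  failed <- vapply(pre_checks, function(f) f(v), logical(1))
  if (any(failed)) {
    return(list(skipped = TRUE, failed = names(pre_checks)[failed]))
  }
  # Stage 2: apply the functions registered for the variable's class.
  funs <- class_functions[[class(v)[1]]]
  list(skipped = FALSE, results = lapply(funs, function(f) f(v)))
}

# Placeholder pre-check: a variable with a single distinct value.
pre_checks <- list(singular = function(v) length(unique(v)) <= 1)

# Placeholder class-specific functions.
class_functions <- list(
  numeric   = list(range = range, n_missing = function(v) sum(is.na(v))),
  character = list(n_unique = function(v) length(unique(v)))
)

r1 <- process_variable(c(1.5, 2, 8), pre_checks, class_functions)
r2 <- process_variable(rep("a", 4), pre_checks, class_functions)
```

In this sketch r1 carries the results for a numeric variable, while r2 is skipped because the pre-check flags it.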

Examples

data(testData)
data(toyData)
check(toyData)
#> $pill
#> $pill$identifyMissing
#> No problems found.
#> $pill$identifyWhitespace
#> No problems found.
#> $pill$identifyLoners
#> Note that the following levels have at most five observations: blue.
#> $pill$identifyCaseIssues
#> No problems found.
#> $pill$identifyNums
#> No problems found.
#>
#> $events
#> $events$identifyMissing
#> The following suspected missing value codes enter as regular values: 999, NaN.
#> $events$identifyOutliers
#> Note that the following possible outlier values were detected: 82, 999.
#>
#> $region
#> $region$identifyMissing
#> The following suspected missing value codes enter as regular values: , ..
#> $region$identifyWhitespace
#> The following values appear with prefixed or suffixed white space: .
#> $region$identifyLoners
#> Note that the following levels have at most five observations: , ., a, b, c, other, OTHER.
#> $region$identifyCaseIssues
#> Note that there might be case problems with the following levels: other, OTHER.
#> $region$identifyNums
#> No problems found.
#>
#> $change
#> $change$identifyMissing
#> No problems found.
#> $change$identifyOutliers
#> Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.
#>
#> $id
#> $id$identifyMissing
#> No problems found.
#> $id$identifyWhitespace
#> No problems found.
#> $id$identifyLoners
#> Note that the following levels have at most five observations: 1, 10, 11, 12, 13, 14, 15, 2, 3, 4 (5 additional values omitted).
#> $id$identifyCaseIssues
#> No problems found.
#> $id$identifyNums
#> Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
#>
#> $spotifysong
#> $spotifysong$identifyMissing
#> No problems found.
#> $spotifysong$identifyWhitespace
#> No problems found.
#> $spotifysong$identifyLoners
#> No problems found.
#> $spotifysong$identifyCaseIssues
#> No problems found.
#> $spotifysong$identifyNums
#> No problems found.
#>
# NOT RUN {
DF <- data.frame(x = 1:15)
clean(DF)
# }
# NOT RUN {
data(testData)
clean(testData)
# }
# Overwrite any existing files generated by clean
# NOT RUN {
clean(testData, replace=TRUE)
# }
# Only include problematic variables in the output document
# NOT RUN {
clean(testData, replace=TRUE, onlyProblematic=TRUE)
# }
# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
# NOT RUN {
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case=TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
  description = "Search for the string 'wally' ignoring case",
  classes = c("character")
)

# Add the newly defined function to the list of checks used for characters.
clean(testData, characterChecks=c(defaultCharacterChecks(), "wheresWally"),
  replace=TRUE)
# }
# Handle non-supported variable classes using treatXasY: treat raw as character and
# treat complex as numeric. We also add a list variable, but as lists are not
# handled through treatXasY, this variable will be caught in the preChecks and skipped:
# NOT RUN {
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
clean(toyData, replace = TRUE,
  treatXasY = list(raw = "character", complex = "numeric"))
# }