Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or the data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read data report in pdf, html or word format. This report also includes summaries and visualizations of each variable in the dataset.

makeDataReport(data, output = NULL, render = TRUE, useVar = NULL,
  ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE,
  labelled_as = c("factor"), mode = c("summarize", "visualize", "check"),
  smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL, replace = FALSE, vol = "", standAlone = TRUE,
  twoCol = TRUE, quiet = TRUE, openResult = TRUE,
  summaries = setSummaries(), visuals = setVisuals(),
  checks = setChecks(), listChecks = TRUE, maxProbVals = 10,
  maxDecimals = 2, addSummaryTable = TRUE, codebook = FALSE,
  reportTitle = NULL, treatXasY = NULL, ...)

Arguments

data

The dataset to be checked. This dataset should be of class data.frame, tibble or matrix. If it is of class matrix, it will be converted to a data.frame.

output

Output format. Options are "pdf", "word" (.docx) and "html". If NULL (the default), the output format depends on two sequential checks. First, if a LaTeX installation is available, pdf output is chosen. Second, if no LaTeX installation is found and the operating system is Windows, word output is used. Lastly, if neither of these checks is positive, html output is used.
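
The fallback order for output = NULL can be sketched as a small base R function. This is only an illustration of the documented order: chooseOutput() and its two logical arguments are hypothetical names that are not part of dataMaid, and how the package actually detects a LaTeX installation is not shown.

```r
# Hypothetical helper: mirrors the documented fallback order only.
chooseOutput <- function(hasLatex, isWindows) {
  if (hasLatex) {
    "pdf"    # first check: a LaTeX installation is available
  } else if (isWindows) {
    "word"   # second check: no LaTeX, but the OS is Windows
  } else {
    "html"   # neither check is positive
  }
}

chooseOutput(hasLatex = FALSE, isWindows = TRUE)
#> [1] "word"
```

In practice, one plausible way to supply isWindows would be .Platform$OS.type == "windows", but that is an assumption about usage rather than a description of the package internals.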

render

Should the output file be rendered (defaults to TRUE), i.e. should a pdf/word/html document be generated and saved to disk?

useVar

Variables to describe in the report. If NULL (the default), all variables in data are included. If a vector of variable names is supplied, only the variables in data that are also in useVar are included in the data report.

ordering

Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order).
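
The combined effect of the useVar and ordering arguments can be sketched in base R as follows. The helper selectVars() is hypothetical and not part of dataMaid. It only illustrates the documented behaviour: keep the variables in data that are also in useVar, then optionally sort them alphabetically.

```r
# Hypothetical helper, not part of dataMaid.
selectVars <- function(data, useVar = NULL,
                       ordering = c("asIs", "alphabetical")) {
  ordering <- match.arg(ordering)
  vars <- names(data)
  # Keep only variables that are in both data and useVar (dataset order).
  if (!is.null(useVar)) vars <- vars[vars %in% useVar]
  if (ordering == "alphabetical") vars <- sort(vars)
  vars
}

DF <- data.frame(b = 1:3, a = 4:6, c = 7:9)
selectVars(DF, useVar = c("c", "b"))
#> [1] "b" "c"
selectVars(DF, ordering = "alphabetical")
#> [1] "a" "b" "c"
```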

onlyProblematic

A logical. If TRUE, only the variables flagged as problematic in the check step will be included in the variable list.

labelled_as

A string explaining the way to handle labelled vectors. Currently "factor" (the default) is the only possibility. This means that labelled variables that appear factor-like (by having a non-NULL labels-attribute) will be treated as factors, while other labelled variables will be treated as whatever base variable class they inherit from.

mode

Vector of tasks to perform among the three categories "summarize", "visualize" and "check". The default, c("summarize", "visualize", "check"), implies that all three steps are performed. The steps selected in mode will be performed for each variable in data and their results are presented in the second part of the outputted data report. The "summarize" step is responsible for creating the summary table, the "visualize" step is responsible for creating the plot and the "check" step is responsible for performing checks on the variable and printing the results if any problems are found.

smartNum

If TRUE (the default), numeric and integer variables with fewer than 5 unique values are treated as factor variables in the checking, visualization and summary steps, and a message notifying the reader of this is printed in the data summary.
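
A minimal sketch of the smartNum rule in base R is shown below. The helper looksLikeFactor() is hypothetical and not part of dataMaid. It also assumes that missing values do not count towards the number of unique values, which the text above does not specify.

```r
# Hypothetical helper, not part of dataMaid.
# Assumption: NA is not counted as a unique value.
looksLikeFactor <- function(v, threshold = 5) {
  is.numeric(v) && length(unique(v[!is.na(v)])) < threshold
}

looksLikeFactor(c(1, 2, 2, 3, NA))        # 3 unique values
#> [1] TRUE
looksLikeFactor(c(1.5, 2, 3.1, 4, 5, 6))  # 6 unique values
#> [1] FALSE
```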

preChecks

Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized, visualized or checked.

file

The filename of the outputted rmarkdown (.Rmd) file. If set to NULL (the default), the filename will be the name of data prefixed with "dataMaid_", if this qualifies as a valid file name (e.g. no special characters allowed). Otherwise, makeDataReport() tries to create a valid filename by substituting illegal characters. Note that a valid file is of type .Rmd, hence all filenames should have a ".Rmd"-suffix.

replace

If FALSE (the default), an error is thrown if one of the files that are about to be created (the .Rmd overview file and possibly also a .html, .pdf or .docx file) already exists. If TRUE, no checks are performed and files on disk thus might be overwritten.

vol

Extra text string or numeric that is appended to the end of the output file name(s). For example, if the dataset is called "myData", no file argument is supplied and vol=2, the output file will be called "dataMaid_myData2.Rmd".
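
The default naming pattern described for file and vol can be reproduced with a one-line base R sketch. The helper defaultReportFile() is hypothetical and not part of dataMaid, and it deliberately ignores the substitution of illegal characters mentioned under file.

```r
# Hypothetical helper reproducing the documented default name pattern.
# It does not handle illegal characters in the dataset name.
defaultReportFile <- function(dataName, vol = "") {
  paste0("dataMaid_", dataName, vol, ".Rmd")
}

defaultReportFile("myData", vol = 2)
#> [1] "dataMaid_myData2.Rmd"
```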

standAlone

A logical. If TRUE, the document begins with a markdown YAML preamble such that it can be rendered as a stand alone rmarkdown file, e.g. by calling render. If FALSE, this preamble is removed. Moreover, no matter the input to the render argument, the document will now not be rendered, as it has no preamble.

twoCol

A logical. Should the results from the summarize and visualize steps be presented in two columns? Defaults to TRUE.

quiet

A logical. If TRUE (the default), only a few messages are printed to the screen as makeDataReport runs. If FALSE, no messages are suppressed. A third option, "silent", renders the function completely silent, such that only fatal errors are printed.

openResult

A logical. If TRUE (the default), the last file produced by makeDataReport is automatically opened at the end of the function run. This means that if render = TRUE, the rendered pdf, word or html file is opened, while if render = FALSE, the .Rmd file is opened.

summaries

A list of summaries to use on each supported variable type. We recommend using setSummaries for creating this list and refer to the documentation of this function for more details.

visuals

A list of visual functions to use on each supported variable type. We recommend using setVisuals for creating this list and refer to the documentation of this function for more details.

checks

A list of checks to use on each supported variable type. We recommend using setChecks for creating this list and refer to the documentation of this function for more details.

listChecks

A logical. Controls whether the checks that were used for each possible variable type are summarized in the output. Defaults to TRUE.

maxProbVals

A positive integer or Inf. Maximum number of unique values printed from check-functions. In the case of Inf, all problematic values are printed. Defaults to 10.
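
The truncation controlled by maxProbVals can be sketched in base R as below. The helper formatProbVals() is hypothetical and not the function dataMaid uses internally. The wording of the suffix is modelled on the "(5 additional values omitted)" text that appears in the check(toyData) output in the Examples section.

```r
# Hypothetical helper, not part of dataMaid.
formatProbVals <- function(vals, maxProbVals = 10) {
  vals <- unique(vals)
  # Inf means "print everything"; otherwise keep the first maxProbVals values.
  shown <- if (is.finite(maxProbVals)) head(vals, maxProbVals) else vals
  out <- paste(shown, collapse = ", ")
  omitted <- length(vals) - length(shown)
  if (omitted > 0) {
    out <- paste0(out, " (", omitted, " additional values omitted)")
  }
  out
}

formatProbVals(1:15)
#> [1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (5 additional values omitted)"
formatProbVals(1:3)
#> [1] "1, 2, 3"
```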

maxDecimals

A positive integer or Inf. Number of decimals used when printing numerical values in the data summary and in problematic values from the data checks. If Inf, no rounding is performed.

addSummaryTable

A logical. If TRUE (the default), a summary table of the variable checks is added between the Data Cleaning Summary and the Variable List. Only one of addSummaryTable and addCodebookTable can be TRUE.

codebook

A logical. Defaults to FALSE. If TRUE then the document is tweaked to better represent a codebook.

reportTitle

A text string. If supplied, this will be the printed title of the report. If left unspecified, the title will include the name of the supplied dataset.

treatXasY

A list that indicates how non-standard variable classes should be treated. This parameter allows you to include variables that are not of class factor, character, labelled, numeric, integer, logical or Date (or a class that inherits from any of these classes). The names of the list are the new classes and the entries are the names of the classes they should be treated as. For example, if makeDataReport() should treat variables of class raw as characters and variables of class complex as numeric, you should put treatXasY = list(raw = "character", complex = "numeric").
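
The lookup implied by treatXasY can be sketched in base R as follows. The helper resolveClass() is hypothetical and not part of dataMaid. It also assumes that only the first class of a variable is looked up, which may differ from how the package handles inheritance.

```r
# Hypothetical helper, not part of dataMaid.
# Assumption: only the first element of class(v) is looked up.
resolveClass <- function(v, treatXasY = NULL) {
  cls <- class(v)[1]
  if (!is.null(treatXasY) && cls %in% names(treatXasY)) {
    treatXasY[[cls]]   # treat the non-standard class as the mapped class
  } else {
    cls                # leave all other classes unchanged
  }
}

mapping <- list(raw = "character", complex = "numeric")
resolveClass(as.raw(1:3), mapping)
#> [1] "character"
resolveClass(1:3 + 2i, mapping)
#> [1] "numeric"
resolveClass(1:3, mapping)
#> [1] "integer"
```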

...

Other arguments that are passed on to the precheck, checking, summary and visualization functions.

Value

The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.

Details

For each variable, a set of pre-check functions (controlled by the preChecks argument) is first run, and then a battery of functions is applied depending on the variable class. For each variable type, the summarize/visualize/check functions are applied and the results are written to an R markdown file.
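
The per-variable flow can be illustrated with a toy base R sketch. All names below (isSingularToy, steps, processVariable) are hypothetical stand-ins and not the dataMaid functions. The sketch leaves out the class-dependent choice of functions and the writing of results to an R markdown file. It only shows that a variable failing a pre-check is skipped by the later steps.

```r
# Toy stand-ins; none of these are dataMaid functions.
isSingularToy <- function(v) length(unique(v)) <= 1  # TRUE means "problem found"

steps <- list(
  summarize = function(v) length(v),  # toy summary: number of observations
  check     = function(v) anyNA(v)    # toy check: any missing values?
)

processVariable <- function(v, preChecks, steps) {
  for (pc in preChecks) {
    if (pc(v)) return(NULL)  # a failed pre-check skips all later steps
  }
  lapply(steps, function(f) f(v))
}

DF <- data.frame(x = c(1, 2, NA), y = c(7, 7, 7))
results <- lapply(DF, processVariable,
                  preChecks = list(isSingularToy), steps = steps)
str(results)
```

Here x passes the pre-check and gets a summary and a check result, while y has only one unique value and is skipped.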

Examples

data(testData)
data(toyData)
check(toyData)
#> $pill
#> $pill$identifyMissing
#> No problems found.
#> $pill$identifyWhitespace
#> No problems found.
#> $pill$identifyLoners
#> Note that the following levels have at most five observations: blue.
#> $pill$identifyCaseIssues
#> No problems found.
#> $pill$identifyNums
#> No problems found.
#>
#> $events
#> $events$identifyMissing
#> The following suspected missing value codes enter as regular values: 999, NaN.
#> $events$identifyOutliers
#> Note that the following possible outlier values were detected: 82, 999.
#>
#> $region
#> $region$identifyMissing
#> The following suspected missing value codes enter as regular values: , ..
#> $region$identifyWhitespace
#> The following values appear with prefixed or suffixed white space: .
#> $region$identifyLoners
#> Note that the following levels have at most five observations: , ., a, b, c, other, OTHER.
#> $region$identifyCaseIssues
#> Note that there might be case problems with the following levels: other, OTHER.
#> $region$identifyNums
#> No problems found.
#>
#> $change
#> $change$identifyMissing
#> No problems found.
#> $change$identifyOutliers
#> Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.
#>
#> $id
#> $id$identifyMissing
#> No problems found.
#> $id$identifyWhitespace
#> No problems found.
#> $id$identifyLoners
#> Note that the following levels have at most five observations: 1, 10, 11, 12, 13, 14, 15, 2, 3, 4 (5 additional values omitted).
#> $id$identifyCaseIssues
#> No problems found.
#> $id$identifyNums
#> Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
#>
#> $spotifysong
#> $spotifysong$identifyMissing
#> No problems found.
#> $spotifysong$identifyWhitespace
#> No problems found.
#> $spotifysong$identifyLoners
#> No problems found.
#> $spotifysong$identifyCaseIssues
#> No problems found.
#> $spotifysong$identifyNums
#> No problems found.
#>
# NOT RUN {
DF <- data.frame(x = 1:15)
makeDataReport(DF)
# }

# NOT RUN {
data(testData)
makeDataReport(testData)
# }

# Overwrite any existing files generated by makeDataReport
# NOT RUN {
makeDataReport(testData, replace=TRUE)
# }

# Change output format to Word/docx:
# NOT RUN {
makeDataReport(testData, replace=TRUE, output = "word")
# }

# Only include problematic variables in the output document
# NOT RUN {
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)
# }

# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
# NOT RUN {
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case=TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
  description = "Search for the string 'wally' ignoring case",
  classes = c("character")
)

# Add the newly defined function to the list of checks used for characters.
makeDataReport(testData,
  checks = setChecks(character = defaultCharacterChecks(with = "wheresWally")),
  replace=TRUE)
# }

# Handle non-supported variable classes using treatXasY: treat raw as character and
# treat complex as numeric. We also add a list variable, but as lists are not
# handled through treatXasY, this variable will be caught in the preChecks and skipped:
# NOT RUN {
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace = TRUE,
  treatXasY = list(raw = "character", complex = "numeric"))
# }