Produce a data report

Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or the data type of the inputted variable. The checks are saved to an R markdown file which can be rendered into an easy-to-read data report in pdf, html or word formats. This report also includes summaries and visualizations of each variable in the dataset.

makeDataReport(data, output = NULL, render = TRUE, useVar = NULL,
  ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE,
  labelled_as = c("factor"), mode = c("summarize", "visualize", "check"),
  smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"),
  file = NULL, replace = FALSE, vol = "", standAlone = TRUE,
  twoCol = TRUE, quiet = TRUE, openResult = TRUE,
  summaries = setSummaries(), visuals = setVisuals(),
  checks = setChecks(), listChecks = TRUE, maxProbVals = 10,
  maxDecimals = 2, addSummaryTable = TRUE, codebook = FALSE,
  reportTitle = NULL, treatXasY = NULL, ...)

Arguments

data	The dataset to be checked. This dataset should be of class `data.frame`, `tibble` or `matrix`. If it is of class `matrix`, it will be converted to a `data.frame`.
output	Output format. Options are `"pdf"`, `"word"` (.docx) and `"html"`. If `NULL` (the default), the output format depends on two sequential checks. First, if a LaTeX installation is available, `pdf` output is chosen. Second, if no LaTeX installation is found and the operating system is Windows, `word` output is used. Lastly, if neither of these checks is positive, `html` output is used.
render	Should the output file be rendered (defaults to `TRUE`), i.e. should a pdf/word/html document be generated and saved to the disc?
useVar	Variables to describe in the report. If `NULL` (the default), all variables in `data` are included. If a vector of variable names is supplied, only the variables in `data` that are also in `useVar` are included in the data report.
ordering	Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order).
onlyProblematic	A logical. If `TRUE`, only the variables flagged as problematic in the check step will be included in the variable list.
labelled_as	A string explaining the way to handle labelled vectors. Currently `"factor"` (the default) is the only possibility. This means that labelled variables that appear factor-like (by having a non-`NULL` `labels`-attribute) will be treated as factors, while other labelled variables will be treated as whatever base variable class they inherit from.
mode	Vector of tasks to perform among the three categories "summarize", "visualize" and "check". The default, `c("summarize", "visualize", "check")`, implies that all three steps are performed. The steps selected in `mode` will be performed for each variable in `data` and their results are presented in the second part of the outputted data report. The "summarize" step is responsible for creating the summary table, the "visualize" step is responsible for creating the plot and the "check" step is responsible for performing checks on the variable and printing the results if any problems are found.
smartNum	If `TRUE` (the default), numeric and integer variables with fewer than 5 unique values are treated as factor variables in the checking, visualization and summary steps, and a message notifying the reader of this is printed in the data summary.
preChecks	Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that should be performed before the summary/visualization/checking step. If any of these checks find problems, the variable will not be summarized, visualized or checked.
file	The filename of the outputted rmarkdown (.Rmd) file. If set to `NULL` (the default), the filename will be the name of `data` prefixed with "dataMaid_", if this qualifies as a valid file name (e.g. no special characters allowed). Otherwise, `makeDataReport()` tries to create a valid filename by substituting illegal characters. Note that a valid file is of type .Rmd, hence all filenames should have a ".Rmd"-suffix.
replace	If `FALSE` (the default), an error is thrown if one of the files that are about to be created (the .Rmd overview file and possibly also a .html, .pdf or .docx file) already exists. If `TRUE`, no checks are performed and files on disc might thus be overwritten.
vol	Extra text string or numeric that is appended to the end of the output file name(s). For example, if the dataset is called "myData", no file argument is supplied and `vol=2`, the output file will be called "dataMaid_myData2.Rmd".
standAlone	A logical. If `TRUE`, the document begins with a markdown YAML preamble such that it can be rendered as a stand alone rmarkdown file, e.g. by calling `render`. If `FALSE`, this preamble is removed. Moreover, no matter the input to the `render` argument, the document will now not be rendered, as it has no preamble.
twoCol	A logical. Should the results from the summarize and visualize steps be presented in two columns? Defaults to `TRUE`.
quiet	Either a logical or `silent`. If `TRUE` (the default), only a few messages are printed to the screen as `makeDataReport` runs. If `FALSE`, no messages are suppressed. The third option, `silent`, renders the function completely silent, such that only fatal errors are printed.
openResult	A logical. If `TRUE` (the default), the last file produced by `makeDataReport` is automatically opened by the end of the function run. This means that if `render = TRUE`, the rendered pdf, word or html file is opened, while if `render = FALSE`, the .Rmd file is opened.
summaries	A list of summaries to use on each supported variable type. We recommend using `setSummaries` for creating this list and refer to the documentation of this function for more details.
visuals	A list of visual functions to use on each supported variable type. We recommend using `setVisuals` for creating this list and refer to the documentation of this function for more details.
checks	A list of checks to use on each supported variable type. We recommend using `setChecks` for creating this list and refer to the documentation of this function for more details.
listChecks	A logical. Controls whether the checks that were used for each possible variable type are summarized in the output. Defaults to `TRUE`.
maxProbVals	A positive integer or `Inf`. Maximum number of unique values printed from check-functions. In the case of `Inf`, all problematic values are printed. Defaults to `10`.
maxDecimals	A positive integer or `Inf`. Number of decimals used when printing numerical values in the data summary and in problematic values from the data checks. If `Inf`, no rounding is performed.
addSummaryTable	A logical. If `TRUE` (the default), a summary table of the variable checks is added between the Data Cleaning Summary and the Variable List. Only one of `addSummaryTable` and `addCodebookTable` can be `TRUE`.
codebook	A logical. Defaults to `FALSE`. If `TRUE` then the document is tweaked to better represent a codebook.
reportTitle	A text string. If supplied, this will be the printed title of the report. If left unspecified, the title is built from the name of the supplied dataset.
treatXasY	A list that indicates how non-standard variable classes should be treated. This parameter allows you to include variables that are not of class `factor`, `character`, `labelled`, `numeric`, `integer`, `logical` or `Date` (or a class that inherits from any of these classes). The names of the list are the new classes and the entries are the names of the classes they should be treated as. For example, if `makeDataReport()` should treat variables of class `raw` as characters and variables of class `complex` as numeric, use `treatXasY = list(raw = "character", complex = "numeric")`.
…	Other arguments that are passed on to the pre-check, checking, summary and visualization functions.
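
The fallback order described for `output = NULL` can be pictured with a small base R sketch. The helper `chooseOutputFormat()` and the way the two flags are detected are assumptions made for illustration; they are not part of dataMaid and may differ from how the package performs these checks.

```r
# Hypothetical helper mirroring the documented fallback order for output = NULL.
# It is NOT a dataMaid function; the caller supplies both flags.
chooseOutputFormat <- function(hasLatex, isWindows) {
  if (hasLatex) {
    "pdf"    # first check: LaTeX available
  } else if (isWindows) {
    "word"   # second check: no LaTeX, but running on Windows
  } else {
    "html"   # neither check was positive
  }
}

# One possible way to obtain the flags in base R (an assumption,
# not necessarily the detection dataMaid uses):
hasLatex  <- unname(nzchar(Sys.which("pdflatex")))
isWindows <- identical(.Platform$OS.type, "windows")
chooseOutputFormat(hasLatex, isWindows)
```

Passing `output` explicitly, as in `output = "word"`, bypasses this kind of automatic choice.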

Value

The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.

Details

For each variable, a set of pre-check functions (controlled by the `preChecks` argument) is run first, and then a battery of functions is applied depending on the variable class. For each variable type the summarize/visualize/check functions are applied and the results are written to an R markdown file.
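
As a rough illustration of this two-stage flow, the following base R sketch runs a single stand-in pre-check on each variable and only proceeds to the class-dependent steps when the pre-check passes. The names `isSingularSketch()` and `describeVariable()` are hypothetical and do not reproduce dataMaid's internals; the real pre-checks and per-class functions are more elaborate.

```r
# Hypothetical stand-in for a pre-check: flags variables that hold
# at most one distinct non-missing value. Not a dataMaid function.
isSingularSketch <- function(v) {
  length(unique(v[!is.na(v)])) <= 1
}

# Hypothetical per-variable driver: pre-check first, then record which
# steps would be applied for the variable's class.
describeVariable <- function(v, mode = c("summarize", "visualize", "check")) {
  if (isSingularSketch(v)) {
    # A failing pre-check means the variable is skipped entirely.
    return(list(skipped = TRUE, class = class(v)[1], steps = character(0)))
  }
  list(skipped = FALSE, class = class(v)[1], steps = mode)
}

df <- data.frame(x = 1:5, constant = rep(1, 5))
results <- lapply(df, describeVariable)

results$x$steps          # all three steps would run for x
results$constant$skipped # TRUE: the pre-check caught this variable
```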

Examples

data(testData)
data(toyData)

check(toyData)
#> $pill
#> $pill$identifyMissing
#> No problems found.
#> $pill$identifyWhitespace
#> No problems found.
#> $pill$identifyLoners
#> Note that the following levels have at most five observations: blue.
#> $pill$identifyCaseIssues
#> No problems found.
#> $pill$identifyNums
#> No problems found.
#> 
#> $events
#> $events$identifyMissing
#> The following suspected missing value codes enter as regular values: 999, NaN.
#> $events$identifyOutliers
#> Note that the following possible outlier values were detected: 82, 999.
#> 
#> $region
#> $region$identifyMissing
#> The following suspected missing value codes enter as regular values:  , ..
#> $region$identifyWhitespace
#> The following values appear with prefixed or suffixed white space:  .
#> $region$identifyLoners
#> Note that the following levels have at most five observations:  , ., a, b, c, other, OTHER.
#> $region$identifyCaseIssues
#> Note that there might be case problems with the following levels: other, OTHER.
#> $region$identifyNums
#> No problems found.
#> 
#> $change
#> $change$identifyMissing
#> No problems found.
#> $change$identifyOutliers
#> Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.
#> 
#> $id
#> $id$identifyMissing
#> No problems found.
#> $id$identifyWhitespace
#> No problems found.
#> $id$identifyLoners
#> Note that the following levels have at most five observations: 1, 10, 11, 12, 13, 14, 15, 2, 3, 4 (5 additional values omitted).
#> $id$identifyCaseIssues
#> No problems found.
#> $id$identifyNums
#> Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
#> 
#> $spotifysong
#> $spotifysong$identifyMissing
#> No problems found.
#> $spotifysong$identifyWhitespace
#> No problems found.
#> $spotifysong$identifyLoners
#> No problems found.
#> $spotifysong$identifyCaseIssues
#> No problems found.
#> $spotifysong$identifyNums
#> No problems found.
#> 

 
# NOT RUN {
DF <- data.frame(x = 1:15)
makeDataReport(DF)
# }
# NOT RUN {
data(testData)
makeDataReport(testData)
# }
# Overwrite any existing files generated by makeDataReport
# NOT RUN {
makeDataReport(testData, replace=TRUE)
# }
# Change output format to Word/docx:
# NOT RUN {
makeDataReport(testData, replace=TRUE, output = "word")
# }
# Only include problematic variables in the output document
# NOT RUN {
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)
# }
# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
# NOT RUN {
wheresWally <- function(v, ...) {
     res <- grepl("wally", v, ignore.case=TRUE)
     problem <- any(res)
     message <- "Wally was found in these data"
     checkResult(list(problem = problem,
                      message = message,
                      problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
                             description = "Search for the string 'wally' ignoring case",
                             classes = c("character")
                             )
# Add the newly defined function to the list of checks used for characters.
makeDataReport(testData,
      checks = setChecks(character = defaultCharacterChecks(with = "wheresWally")),
      replace=TRUE)
# }
#Handle non-supported variable classes using treatXasY: treat raw as character and
#treat complex as numeric. We also add a list variable, but as lists are not 
#handled through treatXasY, this variable will be caught in the preChecks and skipped:
# NOT RUN {
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace  = TRUE,
    treatXasY = list(raw = "character", complex = "numeric"))
# }
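
The class remapping used in the last example can be thought of as a simple lookup from a variable's class to the class it should be treated as. The sketch below uses a hypothetical helper, `resolveClass()`, written in base R for illustration only; it is not dataMaid's implementation.

```r
# Hypothetical helper: look up the first class of v in the treatXasY list
# and fall back to the class itself when no mapping is given.
resolveClass <- function(v, treatXasY = NULL) {
  cls <- class(v)[1]
  if (!is.null(treatXasY) && cls %in% names(treatXasY)) {
    return(treatXasY[[cls]])
  }
  cls
}

mapping <- list(raw = "character", complex = "numeric")

resolveClass(as.raw(1:3), mapping)   # "character"
resolveClass(c(1, 2) + 2i, mapping)  # "numeric"
resolveClass(list(1, 2), mapping)    # "list": no mapping, so unchanged
```

A class with no entry in the list, such as `list` here, keeps its original class, which is why the list variable in the example above is left to the pre-checks.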
