Make a data overview report that summarizes the contents of a dataset and flags potential problems. The potential problems are identified by running a set of class-specific validation checks, so that different checks are performed on different variable types. The checking steps can be customized according to user input and/or the data type of the input variable. The checks are saved to an R markdown file, which can be rendered into an easy-to-read data report in pdf, html or word format. This report also includes summaries and visualizations of each variable in the dataset.
makeDataReport(data, output = NULL, render = TRUE, useVar = NULL, ordering = c("asIs", "alphabetical"), onlyProblematic = FALSE, labelled_as = c("factor"), mode = c("summarize", "visualize", "check"), smartNum = TRUE, preChecks = c("isKey", "isSingular", "isSupported"), file = NULL, replace = FALSE, vol = "", standAlone = TRUE, twoCol = TRUE, quiet = TRUE, openResult = TRUE, summaries = setSummaries(), visuals = setVisuals(), checks = setChecks(), listChecks = TRUE, maxProbVals = 10, maxDecimals = 2, addSummaryTable = TRUE, codebook = FALSE, reportTitle = NULL, treatXasY = NULL, ...)
Argument | Description |
---|---|
data | The dataset to be checked. This dataset should be of class |
output | Output format. Options are |
render | Should the output file be rendered (defaults to |
useVar | Variables to describe in the report.
If |
ordering | Choose the ordering of the variables in the variable presentation. The options are "asIs" (ordering as in the dataset) and "alphabetical" (alphabetical order). |
onlyProblematic | A logical. If |
labelled_as | A string explaining the way to handle labelled vectors.
Currently |
mode | Vector of tasks to perform among the three categories "summarize", "visualize" and "check".
The default, |
smartNum | If |
preChecks | Vector of function names for check functions used in the pre-check stage. The pre-check stage consists of variable checks that are performed before the summary/visualization/checking step. If any of these checks finds a problem, the variable will not be summarized, visualized, or checked. |
file | The filename of the outputted rmarkdown (.Rmd) file.
If set to |
replace | If |
vol | Extra text string or numeric that is appended to the end of the output
file name(s). For example, if the dataset is called "myData", no file argument is
supplied and |
standAlone | A logical. If |
twoCol | A logical. Should the results from the summarize and visualize
steps be presented in two columns? Defaults to |
quiet | A logical. If |
openResult | A logical. If |
summaries | A list of summaries to use on each supported variable type. We recommend
using |
visuals | A list of visual functions to use on each supported variable type. We recommend
using |
checks | A list of checks to use on each supported variable type. We recommend
using |
listChecks | A logical. Controls whether the checks that were used for each
possible variable type are summarized in the output. Defaults to |
maxProbVals | A positive integer or |
maxDecimals | A positive integer or |
addSummaryTable | A logical. If |
codebook | A logical. Defaults to |
reportTitle | A text string. If supplied, this will be the printed title of the report. If left unspecified, the title is based on the name of the supplied dataset. |
treatXasY | A list that indicates how non-standard variable classes should be treated.
This parameter allows you to include variables that are not of class |
… | Other arguments that are passed on to the precheck, checking, summary and visualization functions. |
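
As a rough illustration of how several of the arguments above combine, the call below is a minimal sketch rather than a recommended configuration. It assumes the dataMaid package (which provides makeDataReport and the testData example dataset) is installed, and it is guarded so that it does nothing otherwise. The variable name `tasks` is introduced here only for illustration. With `render = FALSE` only the R markdown file is written to the working directory, and `openResult = FALSE` keeps it from being opened.

```r
# Tasks to run; a subset of the documented "summarize", "visualize", "check".
# `tasks` is an illustrative name, not part of the package.
tasks <- c("summarize", "check")

# Guarded so the snippet is a no-op when dataMaid is not installed (assumption:
# makeDataReport and testData come from the dataMaid package).
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library(dataMaid)
  data(testData)

  makeDataReport(
    testData,
    output = "html",            # target format recorded for the report
    render = FALSE,             # write the .Rmd file only, do not render it
    ordering = "alphabetical",  # present variables in alphabetical order
    mode = tasks,               # skip the "visualize" step
    replace = TRUE,             # overwrite files from earlier runs
    openResult = FALSE          # do not open the result automatically
  )
}
```

Setting `render = TRUE` instead would also produce the rendered report, which requires a working R markdown toolchain on the machine.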
The function does not return anything. Its side effect (the production of a data report) is the reason for running the function.
For each variable, a set of pre-check functions (controlled by the preChecks argument) is run first, and then a battery of functions is applied depending on the variable class. For each variable type, the summarize/visualize/check functions are applied and the results are written to an R markdown file.
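
The flow described above (pre-checks first, then class-specific functions) can be sketched in base R. This is not the package's implementation: every function and object name below (`is_singular`, `checks_by_class`, `check_variable`, `toy`) is hypothetical, and the checks are simplified stand-ins chosen only to show the dispatch-by-class idea.

```r
# Hypothetical pre-check: does the variable hold at most one distinct value?
is_singular <- function(v) length(unique(v[!is.na(v)])) <= 1

# Hypothetical class-specific checks, keyed by variable class.
checks_by_class <- list(
  numeric = list(
    has_missing = function(v) anyNA(v),
    has_outliers = function(v) {
      v <- v[!is.na(v)]
      q <- quantile(v, c(0.25, 0.75))
      iqr <- q[2] - q[1]
      # Simple 1.5 * IQR rule, used here only as an example criterion.
      any(v < q[1] - 1.5 * iqr | v > q[2] + 1.5 * iqr)
    }
  ),
  character = list(
    has_missing = function(v) anyNA(v),
    has_whitespace = function(v) any(grepl("^[[:space:]]|[[:space:]]$", v))
  )
)

# Run the pre-check first; only variables that pass get class-specific checks.
check_variable <- function(v) {
  if (is_singular(v)) return("skipped by pre-check")
  cls <- if (is.numeric(v)) "numeric" else class(v)[1]
  fns <- checks_by_class[[cls]]
  if (is.null(fns)) return("skipped: unsupported class")
  vapply(fns, function(f) f(v), logical(1))
}

toy <- data.frame(
  amount = c(1, 2, 3, 2, 100),
  city = c("Oslo", " Oslo", "Bergen", "Bergen", NA),
  constant = rep(1, 5),
  stringsAsFactors = FALSE
)

results <- lapply(toy, check_variable)
str(results)
```

In this sketch the `constant` column never reaches the class-specific checks, which mirrors the documented behavior that a variable failing a pre-check is not summarized, visualized, or checked.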
#> $pill
#> $pill$identifyMissing
#> No problems found.
#> $pill$identifyWhitespace
#> No problems found.
#> $pill$identifyLoners
#> Note that the following levels have at most five observations: blue.
#> $pill$identifyCaseIssues
#> No problems found.
#> $pill$identifyNums
#> No problems found.
#>
#> $events
#> $events$identifyMissing
#> The following suspected missing value codes enter as regular values: 999, NaN.
#> $events$identifyOutliers
#> Note that the following possible outlier values were detected: 82, 999.
#>
#> $region
#> $region$identifyMissing
#> The following suspected missing value codes enter as regular values: , ..
#> $region$identifyWhitespace
#> The following values appear with prefixed or suffixed white space: .
#> $region$identifyLoners
#> Note that the following levels have at most five observations: , ., a, b, c, other, OTHER.
#> $region$identifyCaseIssues
#> Note that there might be case problems with the following levels: other, OTHER.
#> $region$identifyNums
#> No problems found.
#>
#> $change
#> $change$identifyMissing
#> No problems found.
#> $change$identifyOutliers
#> Note that the following possible outlier values were detected: 1.12, 1.51, 1.6.
#>
#> $id
#> $id$identifyMissing
#> No problems found.
#> $id$identifyWhitespace
#> No problems found.
#> $id$identifyLoners
#> Note that the following levels have at most five observations: 1, 10, 11, 12, 13, 14, 15, 2, 3, 4 (5 additional values omitted).
#> $id$identifyCaseIssues
#> No problems found.
#> $id$identifyNums
#> Note: The variable consists exclusively of numbers and takes a lot of different values. Is it perhaps a misclassified numeric variable?
#>
#> $spotifysong
#> $spotifysong$identifyMissing
#> No problems found.
#> $spotifysong$identifyWhitespace
#> No problems found.
#> $spotifysong$identifyLoners
#> No problems found.
#> $spotifysong$identifyCaseIssues
#> No problems found.
#> $spotifysong$identifyNums
#> No problems found.
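
The printed results are nested by variable and then by check. A base R sketch of pulling out only the flagged entries from such a structure is given below. It uses plain lists as stand-ins, not the package's own result objects; the field names `problem`, `message` and `problemValues` follow the fields used in the custom check example on this page, the messages are copied from the output, and the `problem` flags and the object names `results` and `flagged` are assumptions made for illustration.

```r
# Plain-list stand-in for a small part of the nested results (assumed shape).
results <- list(
  pill = list(
    identifyMissing = list(problem = FALSE, message = "No problems found.",
                           problemValues = NULL),
    identifyLoners = list(
      problem = TRUE,
      message = "Note that the following levels have at most five observations: blue.",
      problemValues = "blue"
    )
  ),
  events = list(
    identifyOutliers = list(
      problem = TRUE,
      message = "Note that the following possible outlier values were detected: 82, 999.",
      problemValues = c(82, 999)
    )
  ),
  spotifysong = list(
    identifyMissing = list(problem = FALSE, message = "No problems found.",
                           problemValues = NULL)
  )
)

# Keep only the checks that flagged a problem, then drop variables with none.
flagged <- lapply(results, function(var_checks) {
  Filter(function(res) isTRUE(res$problem), var_checks)
})
flagged <- Filter(length, flagged)

# Print one line per flagged check.
for (v in names(flagged)) {
  for (chk in names(flagged[[v]])) {
    cat(v, " / ", chk, ": ", flagged[[v]][[chk]]$message, "\n", sep = "")
  }
}
```

A filter like this gives a similar effect, on a small scale, to restricting a report to problematic variables with `onlyProblematic = TRUE`.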
# NOT RUN {
DF <- data.frame(x = 1:15)
makeDataReport(DF)
# }

# NOT RUN {
data(testData)
makeDataReport(testData)
# }

# Overwrite any existing files generated by makeDataReport
# NOT RUN {
makeDataReport(testData, replace=TRUE)
# }

# Change output format to Word/docx:
# NOT RUN {
makeDataReport(testData, replace=TRUE, output = "word")
# }

# Only include problematic variables in the output document
# NOT RUN {
makeDataReport(testData, replace=TRUE, onlyProblematic=TRUE)
# }

# Add user defined check-function to the checks performed on character variables:
# Here we add functionality to search for the string wally (ignoring case)
# NOT RUN {
wheresWally <- function(v, ...) {
  res <- grepl("wally", v, ignore.case=TRUE)
  problem <- any(res)
  message <- "Wally was found in these data"
  checkResult(list(problem = problem, message = message, problemValues = v[res]))
}

wheresWally <- checkFunction(wheresWally,
  description = "Search for the string 'wally' ignoring case",
  classes = c("character")
)

# Add the newly defined function to the list of checks used for characters.
makeDataReport(testData,
  checks = setChecks(character = defaultCharacterChecks(with = "wheresWally")),
  replace=TRUE)
# }

# Handle non-supported variable classes using treatXasY: treat raw as character and
# treat complex as numeric. We also add a list variable, but as lists are not
# handled through treatXasY, this variable will be caught in the preChecks and skipped:
# NOT RUN {
toyData$rawVar <- as.raw(c(1:14, 1))
toyData$compVar <- c(1:14, 1) + 2i
toyData$listVar <- as.list(c(1:14, 1))
makeDataReport(toyData, replace = TRUE,
  treatXasY = list(raw = "character", complex = "numeric"))
# }