A checkFunction, to be called from check, that identifies outlier values in a Date, numeric or integer variable.

identifyOutliers(v, nMax = 10, maxDecimals = 2)

Arguments

v

A Date, numeric or integer variable to check.

nMax

The maximum number of problematic values to report. Default is 10. Set to Inf to include all problematic values in the output message, or to 0 for no output.

maxDecimals

A positive integer or Inf. Number of decimals used when printing numerical values in the data summary and in problematic values from the data checks. If Inf, no rounding is performed.

Value

A checkResult with three entries: $problem (a logical indicating whether outliers were found), $message (a message describing which values are outliers) and $problemValues (the outlier values).
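
As a minimal sketch of how the three entries might be inspected, assuming the package that provides identifyOutliers is already attached (the snippet is guarded so that it does nothing otherwise):

```r
x <- c(1:10, 200, 200, 700)

# Guarded so the snippet is a no-op when identifyOutliers() is not available
if (exists("identifyOutliers")) {
  res <- identifyOutliers(x, nMax = Inf)  # nMax = Inf: report all problematic values
  print(res$problem)        # logical: were any outliers found?
  print(res$message)        # text describing which values are outliers
  print(res$problemValues)  # the outlier values themselves
}
```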

Details

Outliers are identified using a rule that is appropriate for asymmetric (skewed) data. An observation is considered an outlier if it falls outside the range

$$\left[\, Q1 - 1.5 \exp(a \cdot MC) \cdot IQR,\;\; Q3 + 1.5 \exp(b \cdot MC) \cdot IQR \,\right]$$

where Q1, Q3, and IQR are the first quartile, third quartile, and inter-quartile range, MC is the 'medcouple', a robust measure and estimator of skewness, and a and b are appropriate constants (-4 and 3). The medcouple is defined as a scaled median difference of the left and right halves of the distribution and, unlike the classical skewness, is not based on the third moment.
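
The rule can be sketched in a few lines of R. This is an illustration only, not the function's internal code: adjustedFences is a hypothetical helper, the quartiles come from quantile() with its default method (which may differ from what identifyOutliers uses), and the medcouple is passed in as a number. The usage part assumes the robustbase package is installed, whose mc() function estimates the medcouple.

```r
# Hypothetical helper (not part of the package): skewness-adjusted outlier
# fences for a given medcouple value `mc`, using the constants quoted above.
adjustedFences <- function(x, mc, a = -4, b = 3) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * exp(a * mc) * iqr,
    upper = q[2] + 1.5 * exp(b * mc) * iqr)
}

x <- c(1:10, 200, 200, 700)

# Assumption: robustbase is installed; skipped silently otherwise.
if (requireNamespace("robustbase", quietly = TRUE)) {
  fences <- adjustedFences(x, mc = robustbase::mc(x))
  print(fences)
  print(x[x < fences["lower"] | x > fences["upper"]])  # values outside the range
}
```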

When the data are symmetric, the rule reduces to the standard outlier rule also used in Tukey boxplots (consistent with the boxplot function), i.e. outliers are values that are smaller than the first quartile minus 1.5 times the inter-quartile range (IQR) or greater than the third quartile plus 1.5 times the IQR.
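
To see why, note that the medcouple of a symmetric distribution is 0, so both exponential factors equal 1 regardless of the constants a and b:

$$\exp(a \cdot 0) = \exp(b \cdot 0) = 1 \quad\Longrightarrow\quad \left[\, Q1 - 1.5 \cdot IQR,\;\; Q3 + 1.5 \cdot IQR \,\right]$$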

For Date variables, the calculations are done on their raw numeric format (as obtained by using unclass), after which they are translated back to Dates. Note that no rounding is performed for Dates, no matter the value of maxDecimals.
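
The round trip between Dates and their raw numeric format can be illustrated in base R. This is only a sketch of the conversion, not the function's internal code, and the example dates are made up.

```r
# Made-up example dates
d <- as.Date(c("2020-01-01", "2020-01-15", "2020-02-01"))

# Raw numeric representation: number of days since 1970-01-01
raw <- unclass(d)
print(raw)

# Numeric summaries such as quartiles are computed on the raw values ...
q <- quantile(raw, probs = c(0.25, 0.75), names = FALSE)

# ... and translated back to Dates afterwards
print(as.Date(q, origin = "1970-01-01"))
```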

Examples

identifyOutliers(c(1:10, 200, 200, 700))
#> Note that the following possible outlier values were detected: 1, 2, 200, 700.