A checkFunction to be called from check
that identifies outlier values
in a Date/numeric/integer variable.
identifyOutliers(v, nMax = 10, maxDecimals = 2)
Argument | Description
---|---
v | A Date, numeric or integer variable to check.
nMax | The maximum number of problematic values to report. Default is 10.
maxDecimals | A positive integer or
A checkResult
with three entries:
$problem
(a logical indicating whether outliers were found),
$message
(a message describing which values are outliers) and
$problemValues
(the outlier values).
Outliers are identified based on an outlier rule that is appropriate for asymmetric data. Outliers are observations outside the range
$$\left[\, Q1 - 1.5 \, e^{a \cdot MC} \cdot IQR \;,\;\; Q3 + 1.5 \, e^{b \cdot MC} \cdot IQR \,\right]$$
where Q1, Q3, and IQR are the first quartile, third quartile, and interquartile range, MC is the 'medcouple', a robust measure of skewness, and a and b are constants (a = -4 and b = 3). The medcouple is defined as a scaled median difference of the left and right halves of the distribution, and hence, unlike the classical skewness, it is not based on the third moment.
When the data are symmetric, MC is zero and the rule reduces to the
standard outlier rule also used in Tukey boxplots (consistent with
the boxplot function), i.e. outliers are values that are
smaller than the first quartile minus 1.5 times the interquartile range (IQR)
or greater than the third quartile plus 1.5 times the IQR.
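The range formula and its symmetric special case can be illustrated with a small base R sketch. The helper name adjusted_fences and its arguments are hypothetical and not part of the package. The medcouple is passed in by the caller rather than computed, the quartiles come from quantile() with its default settings, and the formula is applied exactly as written above for any value of mc. The value mc = 0.5 is purely illustrative and is not claimed to be the medcouple of the example data.

```r
# Hypothetical helper (not part of the package): lower and upper fences
# of the skewness-adjusted outlier rule for a given medcouple `mc`.
adjusted_fences <- function(x, mc, a = -4, b = 3, coef = 1.5) {
  x <- as.numeric(x)
  # First and third quartile (default quantile type), names dropped
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  c(lower = q[1] - coef * exp(a * mc) * iqr,
    upper = q[2] + coef * exp(b * mc) * iqr)
}

x <- c(1:10, 200, 200, 700)

# mc = 0: both exponentials equal 1, so the fences are the usual
# Tukey fences, 1.5 * IQR away from the quartiles.
symmetric <- adjusted_fences(x, mc = 0)

# Illustrative positive medcouple (an assumption, not computed from x):
# the lower fence moves in towards Q1, the upper fence moves out.
skewed <- adjusted_fences(x, mc = 0.5)
flagged <- x[x < skewed["lower"] | x > skewed["upper"]]
```

With a positive medcouple the rule becomes stricter on the short (left) tail and more lenient on the long (right) tail, which is why small values can be flagged that the symmetric rule would accept.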
For Date variables, the calculations are done on their raw numeric format (as
obtained by using unclass), after which they are translated back to Dates.
Note that no rounding is performed for Dates, no matter the value of maxDecimals.
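The round trip for Date variables can be sketched in base R as follows. This is only an illustration of the unclass and convert-back steps. For brevity it uses the plain symmetric upper fence (Q3 plus 1.5 times the IQR) in place of the skewness-adjusted rule, and the variable names are made up for the example.

```r
# Example dates with one value far in the future
dates <- as.Date(c("2021-03-01", "2021-03-02", "2021-03-04",
                   "2021-03-05", "2030-01-01"))

# Raw numeric format: number of days since 1970-01-01
raw <- unclass(dates)

# Simplified rule (assumption): plain upper Tukey fence on the raw values
q <- unname(quantile(raw, probs = c(0.25, 0.75)))
upper <- q[2] + 1.5 * (q[2] - q[1])
flagged_raw <- raw[raw > upper]

# Translate the flagged raw values back to Dates
flagged_dates <- as.Date(flagged_raw, origin = "1970-01-01")
```

Because the fences are computed on whole-day counts and the flagged values are converted straight back to Dates, there is nothing for a decimal rounding setting to act on.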
identifyOutliers(c(1:10, 200, 200, 700))
#> Note that the following possible outlier values were detected: 1, 2, 200, 700.