Data cleaning and validation are the first steps in any data analysis, because the validity of the conclusions hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including coding errors, malfunctioning measurement equipment, and inconsistent data-generation manuals.

We present a systematic, analytical approach to data cleaning that makes the data cleaning process just as structured and well-documented as the rest of the data analysis. The primary software tool is the dataMaid R package, which implements an extensive and customisable suite of quality assessment tools for identifying potential problems in a dataset. The results are summarised in an auto-generated, non-technical, stand-alone document that is readable by statisticians and non-statisticians alike. The course thus teaches practical skills that aid the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control.
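As a rough illustration of the kind of report described above, the following minimal sketch generates one for the small `toyData` example dataset that ships with dataMaid. It assumes that dataMaid is already installed and that the tools needed to render R Markdown to HTML are available on your machine. The argument values are illustrative choices, not course requirements.

```r
library(dataMaid)

# toyData is a small example dataset included in the dataMaid package
data(toyData)

# Generate a stand-alone data report for the dataset.
# output = "html"     : render the report as an HTML document
# replace = TRUE      : overwrite a report left over from an earlier run
# openResult = FALSE  : do not open the finished report automatically
makeDataReport(toyData,
               output = "html",
               replace = TRUE,
               openResult = FALSE)
```

The report is written to the current working directory, so run the code from a folder where you are happy to have files created.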

Learning objectives

After completing the course, a participant will:

have a vocabulary for understanding, describing, and discussing the necessary steps and process of data screening and cleaning.
have an overview of existing R packages that can be used as a foundation for reproducible data screening.
be able to produce reports that are relevant for their specific data cleaning needs.
be able to extend and customise the data screening steps to their own needs.

Practical information

Participants are assumed to be R users, but need not be familiar with writing R extensions. Bring a laptop with R installed, as the course contains several hands-on exercises in R.

Installation

Before the course starts, make sure that you have installed the latest version of R and of the following R packages:

install.packages(c("dataMaid", "validate", "lubridate"))

RStudio is recommended.
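To confirm that the installation worked, a short check along the following lines may help. It is only a sketch using base R. The variable names `required`, `installed` and `missing_pkgs` are made up for this example.

```r
# Packages listed in the installation instructions above
required <- c("dataMaid", "validate", "lubridate")

# TRUE for each package that R can find and load from the current library paths
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report any packages that still need to be installed
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All course packages are installed.")
}
```

If any package is reported as missing, rerun the `install.packages()` call above and check its output for error messages.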

Downloading the slides

The slides can be found by following the menu above. To download a copy, print or save the slides from each module to a PDF file. Changing the layout when printing to PDF should also make it possible to produce 4-up or 6-up handouts.

Cleaning Up the Data Cleaning Process: Challenges and Solutions in R
