Data cleaning and validation are the first steps in any data analysis, as the validity of the conclusions from the analysis hinges on the quality of the input data. Mistakes in the data can arise for any number of reasons, including erroneous codings, malfunctioning measurement equipment, and inconsistent data generation manuals.
We present a systematic, analytical approach to data cleaning that makes the cleaning process just as structured and well documented as the rest of the data analysis. The primary software tool is the dataMaid R package, which implements an extensive and customisable suite of quality assessment tools for identifying potential problems in a dataset. The results are summarised in an auto-generated, non-technical, stand-alone document readable by statisticians and non-statisticians alike. Thus, the course teaches practical skills that aid the dialogue between data analysts and field experts, while also providing easy documentation of reproducible data cleaning steps and data quality control.
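As a rough illustration of the workflow described above, the following minimal sketch runs dataMaid's checks on a dataset and then generates the overview document. It assumes that dataMaid is installed from CRAN and that the tools needed to render R Markdown to HTML are available. It uses toyData, a small example dataset shipped with the package; the argument values are illustrative choices, not course recommendations.

```r
# Minimal sketch, assuming the dataMaid package is installed from CRAN.
library(dataMaid)

# toyData is a small example dataset included in dataMaid;
# replace it with your own data frame in practice.
data(toyData)

# Run the default checks on every variable in the dataset.
# The result is a list with one entry per variable.
checks <- check(toyData)
print(checks)

# Generate the stand-alone overview document for the whole dataset.
# output = "html"     : render the report as HTML (illustrative choice)
# replace = TRUE      : overwrite a report left over from an earlier run
# openResult = FALSE  : do not open the report automatically
makeDataReport(toyData, output = "html", replace = TRUE, openResult = FALSE)
```

The report is written to the current working directory, so it can be shared with field experts who do not use R.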
After having followed the course a participant will
Participants are assumed to be R users, but need not be familiar with writing R extensions. Bring a laptop with R installed, as the course contains several hands-on exercises in R.
Before the course starts you should make sure that you have installed the latest version of:
The slides can be found by following the menu above. To download a copy, print or save the slides from each module to a PDF file. This also makes it possible to produce 4-up or 6-up handouts by changing the layout when printing to PDF.