# Reproducible data screening and quality control
## Module 1: Introduction and terminology
### Claus Thorn Ekstrøm and Anne Helby Petersen<br>UCPH Biostatistics
### eRum, Budapest, May 14th, 2018<br>.footnotesize[<a href="https://ekstroem.github.io/dataMaid/eRum2018/index.html">Slides/homepage</a>]

---

background-image: url(pics/redcard+sociology.jpg)
background-size: 96%
class: center, middle

???

Science and statistics have been under pressure in recent years.

Especially in psychology, it has proven difficult to replicate important results.

This has resulted in a discussion about what scientific knowledge is, and how different groups 
can look so differently at the same data.

---

background-image: url(pics/soccer.png)
background-size: 98%
class: center, middle

???

Soccer la la la la

Red cards to dark-skinned players

29 research groups

---

# Classical research process

From idea ...

<div id="htmlwidget-f58ee175655fc6341135" style="width:100%;height:35%;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-f58ee175655fc6341135">{"x":{"diagram":"\ndigraph dot {\n\ngraph [layout = dot,\n       rankdir = LR,\n       bgcolor=\"#000000\",\n       size=2]\n\nnode [shape = circle,\n      style = filled,\n      fillcolor = DimGray,\n      fontcolor = White,\n      fontsize=15,\n      fontname=Helvetica,\n      label = \"\", \n      penwidth=4, \n      margin=0.05,\n      color=White]\na [label=\"Design\"]\nb [label=\"Collect\"]\nc [label=\"Analyze\"]\nd [label=\"Publish\"]\n\nedge [color = White, penwidth=4]\na -> b -> c \nc -> d [color=red, penwidth=4]\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

???

The red line shows where peer review comes in. Total summary. We need to document the steps we took throughout, as shown in the soccer example.

We want reproducible research

---

# Why is reproducible research important?

> **Statistical analysis**
>
> .large[All of the data were analyzed with data processing software and figures with Microsoft excel 2007.]
>
> .pull-right[-- Tayefe *et al*, Advances in Bioresearch, 2014]

???

Full statistical analysis section from a scientific paper.

Clearly impossible to reproduce

---

# Terminology

**Reproducibility**

Given code/data/materials, can I get *the same* (=identical) numbers that you did?

**Replicability**

Given the scientific protocol, can I get *the same* (=in agreement) result that you did in my own study?
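
A minimal sketch of reproducibility in the sense above, assuming a hypothetical analysis (`run_analysis()` and its simulated BMI values are made up for illustration). With the code and the random seed fixed, two runs give identical numbers:

```r
# Hypothetical analysis: the mean of simulated BMI values.
# Fixing the seed makes the random numbers, and hence the result, repeatable.
run_analysis <- function(seed) {
  set.seed(seed)
  bmi <- rnorm(100, mean = 25, sd = 4)
  mean(bmi)
}

first  <- run_analysis(seed = 2018)
second <- run_analysis(seed = 2018)

# Identical, not merely similar, numbers
identical(first, second)
```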

???

However, what do we really do?

---

## The life of a data scientist

> .large[Data scientists, according to interviews and expert estimates, spend from **50 percent to 80 percent** of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.]
>
> .right[.small[-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014]]

???

Notes

---

## Realistic process plan

<div id="htmlwidget-f4dbfbcc957e7c6d8d55" style="width:100%;height:25%;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-f4dbfbcc957e7c6d8d55">{"x":{"diagram":"\ndigraph dot {\n\ngraph [layout = dot,\n       rankdir = LR,\n       bgcolor=\"#000000\",\n       size=2]\n\nnode [shape = circle,\n      style = filled,\n      fillcolor = DimGray,\n      fontcolor = White,\n      fontsize=15,\n      fontname=Helvetica,\n      label = \"\", \n      penwidth=4, \n      margin=0.05,\n      color=White]\na [label=\"Design\"]\nb [label=\"Collect\"]\nc [label=\"Analyze\"]\nd [label=\"Publish\"]\n\nedge [color = White, penwidth=4]\na -> b \nc -> d \nb -> c [color=red, penwidth=16]\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

???

GIGO (garbage in, garbage out)

Huge impact **here**

*   Filtering, selection, error-fixing, ...

---

# Plan for workshop

1.  Introduction, data, error-hunting
2.  Data screening with `dataMaid`. Extending `dataMaid`
3.  Data cleaning and reporting

Exercises.

If you haven't already: go to [ekstroem.github.io/dataMaid/eRum2018/](https://ekstroem.github.io/dataMaid/eRum2018/index.html) and install the required packages.

---

background-image: url(pics/flower.png)
background-size: 60%
class: center, middle

---

background-image: url(pics/structure.png)
background-position: right
background-size: 30%

# From raw data to analysis-ready

.small[
.pull-left[
*   Wrangle to put into correct format and type (validity)
*   Screen to look for consistency, accuracy and uniqueness
*   Validate to check for consistency, accuracy and uniqueness
*   Clean data 
*   Check (screen/validate) again
]]
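
The steps above could be sketched in base R roughly as follows. The data, the helper `is_valid_bmi()` and the 10-80 range are assumptions made up for illustration, not part of the workshop material:

```r
# Hypothetical raw data: BMI stored as text, one duplicated row,
# one implausible value
raw <- data.frame(
  id  = c(1, 2, 2, 3, 4),
  bmi = c("35.2", "31.1", "31.1", "24.2", "311"),
  stringsAsFactors = FALSE
)

# Assumed validation rule: missing is allowed, otherwise 10 <= BMI <= 80
is_valid_bmi <- function(x) is.na(x) | (x >= 10 & x <= 80)

# Wrangle: work on a copy and put the variable into the correct type
dat <- raw
dat$bmi <- as.numeric(dat$bmi)

# Screen: look for things that stand out
summary(dat$bmi)
dat[duplicated(dat$id), ]

# Validate: check the explicit rule
dat[!is_valid_bmi(dat$bmi), ]

# Clean: drop exact duplicate rows and set invalid values to missing
clean <- dat[!duplicated(dat), ]
clean$bmi[!is_valid_bmi(clean$bmi)] <- NA

# Check again
anyDuplicated(clean$id) == 0
all(is_valid_bmi(clean$bmi))
```

Setting the implausible value to missing is only one possible cleaning decision; whether to correct, remove or keep it depends on what is known about the data.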

---

## What is data?

### **Context**

*   Data type : BMI
*   Values : Non-negative. Idea about lower and upper boundary
*   Missing  : Half the individuals were not measured

### **Content**

*   Data type : numeric
*   Values  : 23.5, 31.1, ...
*   Missing : Code as `.`
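
As a small sketch of how content and context meet in practice, the code below reads a made-up file in which missing values are coded as `.` and then compares the result with what the context leads us to expect. The file contents are hypothetical:

```r
# Hypothetical raw file contents; "." is the missing code
raw_text <- "id,bmi
1,23.5
2,31.1
3,.
4,."

# Content: tell R how missing values are coded so the column becomes numeric
dat <- read.csv(text = raw_text, na.strings = ".")
class(dat$bmi)

# Context: compare with what we know about the variable
mean(is.na(dat$bmi))           # share missing; about half is expected
range(dat$bmi, na.rm = TRUE)   # should lie within plausible BMI bounds
```

Without `na.strings = "."` the column would be read as text, because `.` is not a number.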

???

Note the complete overlap between context and content.

Crucial: someone must know the topic!

---

# Wide vs. long

| id| bmi0| bmi52|
|--:|----:|-----:|
|  1| 35.2|  24.2|
|  2| 31.1|  27.0|

| id| value| time|
|--:|-----:|----:|
|  1|  35.2|    0|
|  2|  31.1|    0|
|  1|  24.2|   52|
|  2|  27.0|   52|

*   Work row-wise or column-wise?
*   Tidy data?
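
One way to go from the wide table to the long table is base R's `reshape()`, sketched below. The object names `wide` and `long` are chosen for illustration:

```r
# The wide table from the slide: one row per individual
wide <- data.frame(id = c(1, 2), bmi0 = c(35.2, 31.1), bmi52 = c(24.2, 27.0))

# Stack the two BMI columns into one value column plus a time column
long <- reshape(
  wide,
  direction = "long",
  varying   = c("bmi0", "bmi52"),  # columns holding the repeated measurements
  v.names   = "value",             # name of the stacked measurement column
  timevar   = "time",              # name of the new time column
  times     = c(0, 52),            # time attached to each varying column
  idvar     = "id"
)

# Tidy up row names and column order to match the long table
rownames(long) <- NULL
long <- long[, c("id", "value", "time")]
long
```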

???

Advantages and disadvantages to both. You need to be aware of what you are dealing with.

---

## First rule of reproducible data wrangling

.Large[*Thou shalt never manually modify your raw data.*]

There are no exceptions to this rule.

---

## Second rule of reproducible data wrangling

.Large[*Thou shalt never overwrite your raw data.*]

There are no exceptions to this rule either.
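
A sketch of what the two rules can look like in a script. The file names, the temporary folder and the correction itself are hypothetical; the point is only that the fix lives in code and the result is saved under a new name:

```r
# Hypothetical file names; a temporary directory stands in for the project folder
project <- tempfile("project")
dir.create(project)
raw_file   <- file.path(project, "bmi_raw.csv")
clean_file <- file.path(project, "bmi_clean.csv")

# Stand-in for the raw data as delivered (contains an obvious typing error)
write.csv(data.frame(id = 1:3, bmi = c(35.2, 311, 27.0)),
          raw_file, row.names = FALSE)

# Guard against accidental edits by making the raw file read-only
Sys.chmod(raw_file, mode = "0444")

# Every correction lives in code, so it is documented and can be rerun
dat <- read.csv(raw_file)
dat$bmi[dat$id == 2 & dat$bmi == 311] <- 31.1  # assumed correction, for illustration

# Save the result under a new name; the raw file is left untouched
write.csv(dat, clean_file, row.names = FALSE)
```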

---

# R markdown

Format for writing **reproducible**, dynamic reports with R. Embed R code and results into slideshows, PDFs, HTML documents, Word files and more. See [cheat sheet at RStudio](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf).

```r
> install.packages(c("rmarkdown", "knitr"))
```
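
A minimal sketch of an R Markdown document, written to a temporary file from R and rendered if the tools are available. The document text and file name are made up for illustration:

```r
# Build the code-chunk fence programmatically to keep this example readable
fence <- strrep("`", 3)

# A tiny R Markdown document: YAML header, inline R code and one code chunk
rmd <- c(
  "---",
  "title: \"BMI screening\"",
  "output: html_document",
  "---",
  "",
  "The mean BMI at baseline is `r mean(c(35.2, 31.1))`.",
  "",
  paste0(fence, "{r}"),
  "summary(c(35.2, 31.1))",
  fence
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd, rmd_file)

# Rendering needs the rmarkdown package and pandoc; skip if either is missing
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

The numbers in the rendered report are computed when the document is rendered, so they cannot drift away from the code that produced them.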

---

# Validity - data formats

Technically correct data requires that the data formats are correct.

```r
> DF
```

```
  id bmi0 bmi52
1  1 35.2  24.2
2  2 31.1  27.0
```

```r
> lapply(DF, class)
```

```
$id
[1] "numeric"

$bmi0
[1] "numeric"

$bmi52
[1] "numeric"
```
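
For contrast, a sketch of what a format problem can look like. `DF2` is a hypothetical variant of `DF` in which one BMI value was typed with a decimal comma:

```r
# Hypothetical variant of DF with one malformed entry in bmi0
DF2 <- data.frame(
  id    = c(1, 2),
  bmi0  = c("35,2", "31.1"),
  bmi52 = c(24.2, 27.0),
  stringsAsFactors = FALSE
)

# A single malformed entry turns the whole column into character
sapply(DF2, class)

# Flag entries that do not convert cleanly before changing anything
bad <- is.na(suppressWarnings(as.numeric(DF2$bmi0))) & !is.na(DF2$bmi0)
DF2[bad, c("id", "bmi0")]
```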

---

# Exercise 1

See the exercises at `www.biostatistics.dk/eRum2018/`
(note that the URL is case sensitive).

Get the `presiData` and load the `dataMaid` package:

```r
> library(dataMaid)
> load(url("http://biostatistics.dk/eRum2018/data/presiData.rda"))
```

Hunt for errors!