# Reproducible data screening and quality control
## Module 3: Time to clean and other options
### Claus Thorn Ekstrøm and Anne Helby Petersen, UCPH Biostatistics
### eRum, Budapest, May 14th, 2018 .footnotesize[<a href="https://ekstroem.github.io/dataMaid/eRum2018/index.html">Slides/homepage</a>]

---

#Summarizing the errors

* What did we find so far?
* What didn't we find?

---

#Mistakes in presiData:

.footnotesize[
* Aragorn Arathorn is included in the dataset.
* Trump has "." listed as his first name (firstName).
* Obama's presidency duration is listed as infinite (presidencyYears).
* Trump's state of birth (New York) was spelled with a lower case "y" (stateOfBirth).
* Truman's last name is prefixed with whitespace (lastName).
* ageAtInauguration is coded as a character variable.
* James Garfield's state of birth (stateOfBirth) has been changed from Ohio to Indiana (the state of birth of Jim Davis, the creator of the cartoon "Garfield").
* Calvin Coolidge has had his first name changed to "Hobbes" (firstName).
* Eisenhower appears twice in the dataset.
* Lincoln has had his date of death changed from 1865-04-15 to 1801-04-15 (dateOfDeath).
]

---

#Data cleaning

Not the best term ... and should not be unsupervised

---

#Data cleaning in R

In an R-script:

1. Make a copy of the dataset. 
2. Use indexing to locate the problem in the data.
3. Overwrite the faulty value with a correct one - if you know it - or `NA` to mark that information is missing in this spot. 
4. Save the copy of the "cleaned" data in a **new** file.
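
A minimal sketch of these four steps, assuming a small made-up data frame `raw` (the values and the output file name are illustrative and not taken from the tutorial data):

```r
# Hypothetical toy data with two known problems
raw <- data.frame(
  lastName = c("Obama", " Truman"),
  presidencyYears = c(Inf, 7),
  stringsAsFactors = FALSE
)

# 1. Work on a copy; `raw` stays untouched
cleaned <- raw

# 2. + 3. Locate each problem with a logical vector and overwrite it
cleaned[cleaned$lastName == " Truman", "lastName"] <- "Truman"
# The correct value is treated as unknown here, so mark it as missing
cleaned[is.infinite(cleaned$presidencyYears), "presidencyYears"] <- NA

# 4. Save the cleaned copy in a new file (illustrative file name)
outFile <- file.path(tempdir(), "cleanedData.csv")
write.csv(cleaned, outFile, row.names = FALSE)
```

Because all changes live in the script, rerunning it on the raw data reproduces the cleaned file.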

---

#Selection - rows/observations

Two systems for selecting observations in `data.frame`s in R:

By index (row number) or using a logical vector.

```r
> (tD <- head(toyData, 3)) 
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

---

#Selection - rows/observations

Four equivalent ways to get the second row of `tD`:

```r
> tD[2, ] #indexing
> tD[c(FALSE, TRUE, FALSE), ] #manual logical vector 
> tD[tD$id == 2, ] #informative logical vector
> tD %>% filter(id == 2) #tidyverse, requires library(dplyr)
```

```
  pill events region    change id spotifysong
2  red      1      a 0.1836433  2  Irrelevant
```

---

#Selection - rows/observations

Use informative logical vectors as much as possible!

```r
> tD
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

```r
> #Mark positive change as missing:
> tD[tD$change > 0, "change"] <- NA
```
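
A small sketch of why this matters, using a made-up data frame `toy` (not the tutorial's `toyData`): once the rows are reordered, a hard-coded row number points at a different observation, while a condition on `id` still finds the intended one.

```r
toy <- data.frame(id = 1:3, change = c(-0.63, 0.18, -0.84))

# Reorder the rows, e.g. by sorting on `change`
sorted <- toy[order(toy$change), ]

# Row number 2 is no longer the observation with id 2
sorted[2, "id"]                   # 1

# The logical vector still picks the observation we mean
sorted[sorted$id == 2, "change"]  # 0.18
```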

---

#Selection - columns/variables

*ALWAYS* use variable names.

```r
> #readable, informative code:
> tD[tD$change > 0, "change"] <- NA
> 
> # Indexing by numbers easily becomes 
> # a source of error by itself:
> tD[tD$change > 0, 4] <- NA
```
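
As a sketch of how a column number can go wrong, assume the same made-up `toy` data arrives with its columns in a different order (both data frames are illustrative):

```r
toy <- data.frame(id = 1:3, change = c(-0.63, 0.18, -0.84))

# The same data with the columns in a different order
swapped <- toy[, c("change", "id")]

# By name: the same variable in both versions
identical(toy[, "change"], swapped[, "change"])  # TRUE

# By number: column 2 is a different variable in each version
names(toy)[2]      # "change"
names(swapped)[2]  # "id"
```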

---

background-image: url(pics/structure.png)
background-size: 30%
background-position: right

## Finishing up after cleaning

We should now have 
a cleaned dataset 
that can form the 
basis for future 
analyses.

With documentation 
of how we got 
there!

---

# Create codebook

Produce a summary document for subsequent analyses.

```r
> makeCodebook(presidentData)
```

Add a label (similar to the `labelled` package) or extra information to a variable:

```r
> pD <- presidentData
> attr(pD$presidencyYears, "label") <- 
+ "Full years as president"
```

```r
> attr(pD$birthday, "shortDescription") <- 
+ "Dates are stored in YYYY-MM-DD format"
```
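
Both are ordinary R attributes, so they can be set and read back with base R alone. A minimal sketch on a made-up data frame (the column values and the description text are illustrative):

```r
pD <- data.frame(presidencyYears = c(8, 4, 2))

attr(pD$presidencyYears, "label") <- "Full years as president"
attr(pD$presidencyYears, "shortDescription") <-
  "Illustrative extra information about the variable"

# Read a single attribute back
attr(pD$presidencyYears, "label")

# List the names of all attributes set on the variable
names(attributes(pD$presidencyYears))
```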

---

# Exercise 3

Correct the errors you have found so far.

Make sure to make the cleaning process reproducible.

Remember **rules 1 and 2**!

Create the final codebook with additional information about some of the variables.

```r
> makeCodebook(myCleanedData)
```

---

##Row-wise or column-wise checks?

---

##Row-wise *and* column-wise constraints!
  
* `dataMaid` performs class-dependent checks for each variable in a dataset, one at a time (column-wise)
  + Pros: Easy to document what was (not) done, lets you get started without a lot of prior knowledge, easy to share with collaborators
  + Shortcomings: Generally cannot detect internal consistency issues or use constraints that do not depend on the variable class
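
As a sketch of what a row-wise (internal consistency) check could look like in base R, assume a hand-typed toy data frame with `birthday` and `dateOfDeath` columns (this is not `presidentData` itself). Each column looks plausible on its own; the problem only shows when the two are compared within a row.

```r
toy <- data.frame(
  lastName = c("Lincoln", "Garfield"),
  birthday = as.Date(c("1809-02-12", "1831-11-19")),
  dateOfDeath = as.Date(c("1801-04-15", "1881-09-19")),
  stringsAsFactors = FALSE
)

# Row-wise rule: nobody dies before they are born
violates <- toy$dateOfDeath < toy$birthday

# Use which() so that rows with missing dates are not selected
toy[which(violates), "lastName"]  # "Lincoln"
```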

---

##Row-wise *and* column-wise constraints!

An R package that performs row-wise checks: `validate`. Check out the talk on Wednesday @ 14.50 by Edwin de Jonge:

**validatetools - resolve and simplify contradictive or redundant data validation rules**

Note: This is a different use of the term "validation". It is no longer about format, type and range, but is used as a synonym for "check".

---

# Thank you!

Please grab hold of us here or via email

.pull-left[Anne [ahpe@sund.ku.dk](mailto:ahpe@sund.ku.dk)] .pull-right[Claus [ekstrom@sund.ku.dk](mailto:ekstrom@sund.ku.dk)]