# Cleaning Up the Data Cleaning Process
## Module 4: Time to clean!
### Claus Thorn Ekstrøm and Anne Helby Petersen UCPH Biostatistics
### CSP, Portland, OR, Feb. 15th, 2018 .footnotesize[<a href="https://ekstroem.github.io/dataMaid/CSP2018/index.html">Slides/homepage</a>]

---

# Summarizing the errors

* What did we find so far?
* What didn't we find?

---

# Data cleaning

"Data cleaning" is not the best term ... and it should never be done unsupervised.

---

# Data cleaning in R

In an R-script:

1. Make a copy of the dataset. 
2. Use indexing to locate the problem in the data.
3. Overwrite the faulty value with the correct one, if you know it, or with `NA` to mark that the information is missing.
4. Save the copy of the "cleaned" data in a **new** file.
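
A minimal sketch of these four steps. The data frame `rawData`, the faulty value, and the output file name are made up for illustration only:

```r
# Hypothetical raw data: the age of subject 2 is clearly an error
rawData <- data.frame(id = 1:3, age = c(34, 340, 51))

# 1. Make a copy; the original object is left untouched
cleanData <- rawData

# 2. + 3. Locate the problem by id and overwrite it with NA
#         (the correct value is not known in this example)
cleanData[cleanData$id == 2, "age"] <- NA

# 4. Save the cleaned copy in a new file (file name is an assumption)
outFile <- file.path(tempdir(), "cleanData.csv")
write.csv(cleanData, outFile, row.names = FALSE)
```

Keeping these lines in a script means the cleaning can be rerun from the raw data at any time.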

---

# Selection - rows/observations

Two ways to select observations in a `data.frame` in R: by index (row number) or with a logical vector.

```r
> (tD <- head(toyData, 3)) 
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

---

# Selection - rows/observations

Four equivalent ways to get the second row of `tD`:

```r
> tD[2, ] #indexing
> tD[c(FALSE, TRUE, FALSE), ] #manual logical vector 
> tD[tD$id == 2, ] #informative logical vector
> tD %>% filter(id == 2) #using tidyverse (needs library(dplyr))
```

```
  pill events region    change id spotifysong
2  red      1      a 0.1836433  2  Irrelevant
```

---

# Selection - rows/observations

Use informative logical vectors as much as possible!

```r
> tD
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

```r
> #Mark positive change as missing:
> tD[tD$change > 0, "change"] <- NA
```

---

# Selection - columns/variables

*ALWAYS* use variable names.

```r
> #readable, informative code:
> tD[tD$change > 0, "change"] <- NA
> 
> # Indexing by numbers easily becomes 
> # a source of error by itself:
> tD[tD$change > 0, 4] <- NA
```
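
A small sketch of how indexing by number can go wrong. The data frame `tD2` is a made-up stand-in with the same kind of columns as `tD`, and the column reordering is a hypothetical upstream change:

```r
# Hypothetical data frame with columns like those in tD
tD2 <- data.frame(pill = "red", events = 1, region = "a",
                  change = c(-0.6, 0.2, -0.8), id = 1:3)

# Someone reorders the columns upstream
reordered <- tD2[, c("id", "pill", "events", "region", "change")]

# By name: still hits the intended variable
byName <- reordered
byName[byName$change > 0, "change"] <- NA

# By number: column 4 is now region, so the wrong variable is overwritten
byNumber <- reordered
byNumber[byNumber$change > 0, 4] <- NA
```

Here `byName` has a missing `change` in row 2, while `byNumber` has lost `region` in row 2 and still contains the faulty `change` value.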

---

background-image: url(pics/structure.png)
background-size: 30%
background-position: right

## Finishing up after cleaning

We should now have
a cleaned dataset 
that can form the 
basis for future 
analyses.

With documentation 
of how we got 
there!

---

# Create codebook

Produce a summary document for subsequent analyses.

```r
> makeCodebook(bigPresidentData)
```

Add a label (similar to the `labelled` package) or extra information:

```r
> bPD <- bigPresidentData
> attr(bPD$presidencyYears, "label") <- 
+ "Full years as president"
```

```r
> attr(bPD$dateOfDeath, "shortDescription") <- 
+ "Missing means that the person is still alive"
```
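
The attributes set this way are ordinary R attributes, so they can be read back with `attr()`. A minimal sketch, using a made-up stand-in `bPD2` rather than the real `bigPresidentData`:

```r
# Hypothetical stand-in for bigPresidentData
bPD2 <- data.frame(presidencyYears = c(8, 4))

# Attach a label to the variable
attr(bPD2$presidencyYears, "label") <- "Full years as president"

# Read the label back to check that it was stored
storedLabel <- attr(bPD2$presidencyYears, "label")
storedLabel
```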

---

# Exercise 4

Correct the errors you have found so far.

Make sure to make the cleaning process reproducible.

Remember **rules 1 and 2**!

Create the final codebook with additional information about some of the variables.

```r
> makeCodebook(myCleanedData)
```

---

# Thank you!

Please grab hold of us here or via email

.pull-left[Anne [ahpe@sund.ku.dk](mailto:ahpe@sund.ku.dk)] .pull-right[Claus [ekstrom@sund.ku.dk](mailto:ekstrom@sund.ku.dk)]