class: center, middle, inverse, title-slide

# Cleaning Up the Data Cleaning Process

## Module 4: Time to clean!

### Claus Thorn Ekstrøm and Anne Helby Petersen
UCPH Biostatistics

### CSP, Portland, OR, Feb. 15th, 2018
.footnotesize[
Slides/homepage
]

---

#Summarizing the errors

* What did we find so far?
* What didn't we find?

---

#Data cleaning

.center[
<img src="pics/datacleaning.jpg" width="60%" />
]

Not the best term ... and it should not be unsupervised

---

#Data cleaning in R

In an R-script:

1. Make a copy of the dataset.
2. Use indexing to locate the problem in the data.
3. Overwrite the faulty value with the correct one, if you know it, or with `NA` to mark that the information is missing in this spot.
4. Save the copy of the "cleaned" data in a **new** file.

---

#Selection - rows/observations

Two systems for selecting observations in `data.frame`s in R: by index (row number) or using a logical vector.

```r
> (tD <- head(toyData, 3))
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

---

#Selection - rows/observations

Four equivalent ways to get the second row of `tD`:

```r
> tD[2, ] #indexing
> tD[c(FALSE, TRUE, FALSE), ] #manual logical vector
> tD[tD$id == 2, ] #informative logical vector
> tD %>% filter(id == 2) #using tidyverse (requires dplyr)
```

```
  pill events region    change id spotifysong
2  red      1      a 0.1836433  2  Irrelevant
```

---

#Selection - rows/observations

Use informative logical vectors as much as possible!

```r
> tD
```

```
  pill events region     change id spotifysong
1  red      1      a -0.6264538  1  Irrelevant
2  red      1      a  0.1836433  2  Irrelevant
3  red      1      a -0.8356286  3  Irrelevant
```

```r
> #Mark non-positive change as missing:
> tD[tD$change <= 0, "change"] <- NA
```

---

#Selection - columns/variables

*ALWAYS* use variable names.

```r
> #readable, informative code:
> tD[tD$change <= 0, "change"] <- NA
>
> # Indexing by numbers easily becomes
> # a source of error by itself:
> tD[tD$change <= 0, 4] <- NA
```

---

background-image: url(pics/structure.png)
background-size: 30%
background-position: right

## Finishing up after cleaning

We should now have<br>
a cleaned dataset<br>
that can form the<br>
basis for future<br>
analyses.<br>
With documentation<br>
of how we got<br>
there!

---

# Create codebook

Produce a summary document for subsequent analyses.

.footnotesize[
```r
> makeCodebook(bigPresidentData)
```
]

Add a label (similar to the `labelled` package) or extra information:

.footnotesize[
```r
> bPD <- bigPresidentData
> attr(bPD$presidencyYears, "label") <-
+   "Full years as president"
```
]

.footnotesize[
```r
> attr(bPD$dateOfDeath, "shortDescription") <-
+   "Missing means that the person is still alive"
```
]

---

class: inverse

# Exercise 4

Correct the errors you have found so far.

Make sure the cleaning process is reproducible. Remember **rules 1 and 2**!

Create the final codebook with additional information about some of the variables.

```r
> makeCodebook(myCleanedData)
```

---

class: middle, center

# Thank you!

Please grab hold of us here or via email

.pull-left[Anne<br>[ahpe@sund.ku.dk](mailto:ahpe@sund.ku.dk)]

.pull-right[Claus<br>[ekstrom@sund.ku.dk](mailto:ekstrom@sund.ku.dk)]