class: center, middle, inverse, title-slide # Cleaning Up the Data Cleaning Process ## Module 1: Introduction and terminology ### Claus Thorn Ekstrøm and Anne Helby Petersen
UCPH Biostatistics ### CSP, Portland, OR, Feb. 15th, 2018
.footnotesize[
Slides/homepage
] --- background-image: url(pics/redcard+sociology.jpg) background-size: 96% class: center, middle ??? Science og statistics har været under pres de seneste år. Specielt indenfor psykologi har det vist sig svært at replikere vigtige resultater Resulteret i en diskussion om, hvad videnserfaring var, og hvordan forskellige grupper kan se så forskelligt på de samme data. --- background-image: url(pics/soccer.png) background-size: 98% class: center, middle ??? Soccer la la la la Rødt kort til farvede spillere 29 forskergrupper --- # Classical research process From idea ...
.pull-right[... to publication.] ??? The red line shows where peer review comes in. Total summary. We need to document the steps we did througout as shown in the soccer example. We want reproducible research --- # Why is reproducible research important? > **Statistical analysis** > > .large[All of the data were analyzed with data processing software and figures with Microsoft excel 2007.] > > .pull-right[-- Tayefe *et al*, Advances in Bioresearch, 2014] ??? Full statistical analysis section from a scientific paper. Clearly impossible to reproduce --- # Terminology **Reproducibility** Given code/data/materials, can I get *the same* (=identical) numbers that you did? **Replicability** Given scientific protocol, can I get *the same* (=in agreement) result that you did in my own study? ??? However, what do we really do? --- ## The life of a data scientist > .large[Data scientists, according to interviews and expert estimates, spend from **50 percent to 80 percent** of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.] > > .right[.small[-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014]] ??? Noter --- ## Realistic process plan
<img src="pics/cartoon-metadata.png" width="1200" /> ??? GIGO Huge impact **here** * Filtering, selection, error-fixing, ... --- class: middle # Plan for workshop 1. Introduction, data, error-hunting 2. Data screening with `dataMaid`. Extending `dataMaid` 3. Validity and consistency checking using `validate` 4. Data cleaning and reporting Exercises. If you haven't already: go to [ekstroem.github.io/dataMaid/CSP2018/](https://ekstroem.github.io/dataMaid/CSP2018/index.html) and install the required packages. --- background-image: url(pics/flower.png) background-size: 60% class: center, middle --- background-image: url(pics/structure.png) background-position: right background-size: 30% # From raw data to analysis-ready .small[ .pull-left[ * Wrangle to put into correct format and type (validity) * Screen to look for consistency, accuracy and uniqueness * Validate to check for consistency, accuracy and uniqueness * Clean data * Check (screen/validate) again ]] --- ## What is data? ### **Context** * Data type : BMI * Values : Non-negative. Idea about lower and upper boundary * Missing : Half the individuals were not measured ### **Content** -- * Data type : numeric * Values : 23.5, 31.1, ... * Missing : Code as `.` ??? Note the complete overlap between context and content. Crucial: someone must know the topic! --- # Wide vs. long .pull-left[ | id| bmi0| bmi52| |--:|----:|-----:| | 1| 35.2| 24.2| | 2| 31.1| 27.0| ] .pull-right[ | id| value| time| |--:|-----:|----:| | 1| 35.2| 0| | 2| 31.1| 0| | 1| 24.2| 52| | 2| 27.0| 52| ] -- * Work row-wise or column-wise? * Tidy data? ??? Fordele og ulemper ved begge dele. Man skal være opmærksom på, hvad man har med at gøre. --- class: center, middle ## First rule of reproducible data wrangling .Large[*Thou shall never manually modify your raw data.*] There are no exceptions to this rule. --- class: center, middle ## Second rule of reproducible data wrangling .Large[*Thou shall never overwrite your raw data.*] There are no exceptions to this rule either. --- # R markdown Format for writing **reproducible**, dynamic reports with R. Embed R code and results into slideshows, pdfs, html documents, Word files and more. See [cheat sheet at RStudio](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf). ```r > install.packages("rmarkdown", "knitr") ``` <img src="pics/rmarkdown.png" width="784" /> --- # Validity - data formats Technically corrct data requires that the data formats are correct .pull-left[ ```r > DF ``` ``` id bmi0 bmi52 1 1 35.2 24.2 2 2 31.1 27.0 ``` ] .pull-right[ ```r > lapply(DF, class) ``` ``` $id [1] "numeric" $bmi0 [1] "numeric" $bmi52 [1] "numeric" ``` ] --- class: inverse, middle # Exercise 1 Get the `bigPresidentData` from the `dataMaid` package: ```r > library(dataMaid) > data("bigPresidentData") ``` Hunt for errors!