Chapter 2 Exploratory Data Analysis (EDA) 1
2.1 R with R Studio and/or R Studio.cloud
2.1.1 Course Contents
- 2020-12-08: Introduction: About the course
  - An introduction to open and public data, and data science
- 2020-12-15: Exploratory Data Analysis (EDA) 1 [led by hs]
  - R Basics with RStudio and/or RStudio.cloud; R Script, swirl
- 2021-12-22: Exploratory Data Analysis (EDA) 2 [led by hs]
  - R Markdown; Introduction to tidyverse; RStudio Primers
- 2022-01-12: Exploratory Data Analysis (EDA) 3 [led by hs]
  - Introduction to tidyverse; Public Data, WDI, etc.
- 2022-01-19: Exploratory Data Analysis (EDA) 4 [led by hs]
  - Introduction to tidyverse; WDI, UN, WHO, etc.
- 2022-01-26: Exploratory Data Analysis (EDA) 5 [led by hs]
  - Introduction to tidyverse; WDI, OECD, US gov, etc.
- 2022-02-02: Inference Statistics 1
- 2022-02-09: Inference Statistics 2
- 2022-02-16: Inference Statistics 3
- 2022-02-23: Project Presentation
2.1.2 Learning Resources, I
2.1.2.1 Textbooks
“R for Data Science” by Hadley Wickham and Garrett Grolemund:
- Free Online Book: https://r4ds.had.co.nz
“R for Data Science: Exercise Solutions” by Jeffrey B. Arnold
- Free Online Book: https://jrnold.github.io/r4ds-exercise-solutions/
2.1.2.2 Other Resources (MOOCs)
- edX: HarvardX Data Science - 9 courses. Textbook:
  - “Introduction to Data Science” by Rafael A. Irizarry (free online book).
- Coursera: JHU Data Science - 10 courses. List of companion books:
  - “R Programming for Data Science” by Roger Peng (free online book).
  - “Exploratory Data Analysis with R” by Roger Peng (free online book).
  - “Report Writing for Data Science in R” by Roger Peng
  - “Statistical Inference for Data Science” by Brian Caffo
  - “Regression Modeling for Data Science in R” by Brian Caffo
2.1.3 EDA1: Contents
What is R?
Why R?
- the First Example
What is R Studio and R Studio Cloud?
Installation of R and R Studio
R Studio Basics
R Studio Cloud Basics
- Project, R Console
R Basics using an R Script
{swirl}: Learn R, in R
EDA: Coronavirus, the first example
Assignment 1 and Assignment 2 in Moodle
2.1.4 What is R?
2.1.4.1 R (programming language), Wikipedia
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License.
2.1.5 Why R? – Responses by Hadley Wickham
2.1.5.1 r4ds: R is a great place to start your data science journey because
- R is an environment designed from the ground up to support data science.
- R is not just a programming language, but it is also an interactive environment for doing data science.
- To support interaction, R is a much more flexible language than many of its peers.
2.1.5.2 Why R today?
When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason.
Interview: “Advice to Young (and Old) Programmers, H. Wickham”
2.1.6 The First Example
plot(cars)
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
summary(lm(cars$dist~cars$speed))
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
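The same regression can also be written with a formula and a `data` argument, which keeps the fitted model in an object for later use. The following is a minimal sketch, not part of the course script; the names `fit`, `new_speeds` and `predicted` are chosen here for illustration.

```r
# Fit stopping distance (ft) on speed (mph) using the built-in cars data.
# `fit` is a name chosen for this sketch.
fit <- lm(dist ~ speed, data = cars)

# Estimated intercept and slope; these match the Estimate column
# of the summary shown above (about -17.58 and 3.93).
coef(fit)

# Predicted stopping distances at 10 and 20 mph.
# The column in newdata must be named `speed`, as in the formula.
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)
predicted

# Redraw the scatter plot and add the fitted line from the stored model.
plot(cars)
abline(fit)
```

With this form the slope coefficient is labelled `speed` rather than `cars$speed`, and `predict()` can be applied to new speeds, which is not possible when the model is fitted with `cars$dist ~ cars$speed`.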
2.1.7 What is RStudio? https://rstudio.com
RStudio is an integrated development environment, or IDE, for R programming.
2.1.7.1 R Studio (Wikipedia)
RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.
2.1.7.2 R Studio Cloud https://rstudio.cloud
RStudio Cloud is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online.
2.1.8 Installation of R and R Studio
2.1.8.1 R Installation
To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly.
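To see which version of R is currently installed before deciding whether to update, something like the following can be typed in the console. This is a small sketch; the threshold "4.0.0" is only an example value, not a course requirement.

```r
# Version of R running in this session, as a character string.
R.version.string

# The same information as a version object that can be compared.
current <- getRversion()
current

# Compare against a minimum version; "4.0.0" is only an example threshold.
current >= "4.0.0"
```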
2.1.8.2 R Studio Installation
Download and install it from http://www.rstudio.com/download.
RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.
2.1.9 R Studio
2.1.9.1 The First Step
- Start R Studio Application
- Top Menu: File > New Project > New Directory > New Project, then enter a directory name and use Browse to choose the parent directory in which the new directory will be created
2.1.9.2 When You Start the Project
- Go to the directory you created
- Double-click ‘Directory Name’.Rproj
Or,
- Start R Studio
- File > Open Project (or choose from Recent Project)
In this way the working directory of the session is set to the project directory, and R can find related files without difficulty (see getwd(), setwd()).
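A short sketch of how the working directory can be inspected from the console. The folder name "data" and the file name "sample.csv" are hypothetical and used only to illustrate a relative path.

```r
# Directory that R currently treats as the working directory.
# After opening a project, this is the project directory.
getwd()

# Build a path relative to the working directory.
# "data" and "sample.csv" are hypothetical names used only for illustration.
csv_path <- file.path("data", "sample.csv")
csv_path

# Check whether the file exists before trying to read it.
file.exists(csv_path)

# Files and folders at the top level of the working directory.
list.files()
```

Because a project sets the working directory automatically, relative paths such as the one above usually make an explicit setwd() call unnecessary.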
2.1.10 R Studio Cloud
2.1.10.1 Cloud Free
- Up to 15 projects total
- 1 shared space (5 members and 10 projects max)
- 15 project hours per month
- Up to 1 GB RAM per project
- Up to 1 CPU per project
- Up to 1 hour background execution time
2.1.10.2 How to Start R Studio Cloud
- Go to https://rstudio.cloud/
- Sign Up: top right
- Email address or Google account
- New Project: Project Name
- R Console
2.1.11 Let’s Try R Basics
2.1.11.2 R Scripts
- Copy a script in Moodle: basics.R
- In RStudio (Workspace in RStudio.cloud, Project in RStudio) choose File > New File > R Script and paste it.
- Choose File > Save with a name; e.g. basics (.R will be added automatically)
2.1.11.3 Helpful Resources
- Cheat Sheets in RStudio: https://www.rstudio.com/resources/cheatsheets/
- ‘Quick R’ by DataCamp: https://www.statmethods.net/management
2.1.13 Practicum: R Studio Cloud (or R Studio) and R basics
2.1.13.1 Let’s Try R Basics
- R Studio Cloud
- Create an account
- Create a Project
- R Studio Basics
- R Basics
- basics.R
2.1.13.2 Basics.R
The script with the outputs.
#################
#
# basics.R
#
################
# 'Quick R' by DataCamp may be a handy reference:
# https://www.statmethods.net/management/index.html
# Cheat Sheet at RStudio: https://www.rstudio.com/resources/cheatsheets/
# Base R Cheat Sheet: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
# To execute the line: Control + Enter (Windows and Linux), Command + Enter (Mac)
## try your experiments on the console
## calculator
3 + 7
## [1] 10
### +, -, *, /, ^ (or **), %%, %/%
3 + 10 / 2
## [1] 8
3^2
## [1] 9
2^3
## [1] 8
2*2*2
## [1] 8
### assignment: <-, (=, ->, assign())
x <- 5
x
## [1] 5
#### object_name <- value, '<-' shortcut: Alt (option) + '-' (hyphen or minus)
#### Object names must start with a letter and can only contain letters, numbers, _ and .
this_is_a_long_name <- 5^3
this_is_a_long_name
## [1] 125
char_name <- "What is your name?"
char_name
## [1] "What is your name?"
#### Use 'tab completion' and 'up arrow'
### ls(): list of all assignments
ls()
## [1] "char_name" "this_is_a_long_name" "x"
ls.str()
## char_name : chr "What is your name?"
## this_is_a_long_name : num 125
## x : num 5
#### check Environment in the upper right pane
### (atomic) vectors
5:10
## [1] 5 6 7 8 9 10
a <- seq(5,10)
a
## [1] 5 6 7 8 9 10
b <- 5:10
identical(a,b)
## [1] TRUE
seq(5,10,2) # same as seq(from = 5, to = 10, by = 2)
## [1] 5 7 9
c1 <- seq(0,100, by = 10)
c2 <- seq(0,100, length.out = 10)
c1
## [1] 0 10 20 30 40 50 60 70 80 90 100
c2
## [1] 0.00000 11.11111 22.22222 33.33333 44.44444 55.55556 66.66667
## [8] 77.77778 88.88889 100.00000
length(c1)
## [1] 11
#### ? seq ? length ? identical
(die <- 1:6)
## [1] 1 2 3 4 5 6
zero_one <- c(0,1) # same as 0:1
die + zero_one # c(1,2,3,4,5,6) + c(0,1): the shorter vector is re-used (recycled)
## [1] 1 3 3 5 5 7
d1 <- rep(1:3,2) # repeat
d1
## [1] 1 2 3 1 2 3
die == d1
## [1] TRUE TRUE TRUE FALSE FALSE FALSE
d2 <- as.character(die == d1)
d2
## [1] "TRUE" "TRUE" "TRUE" "FALSE" "FALSE" "FALSE"
d3 <- as.numeric(die == d1)
d3
## [1] 1 1 1 0 0 0
### class() for class and typeof() for type
### class of vectors: numeric, character, logical
### types of vectors: doubles, integers, characters, logicals (complex and raw)
typeof(d1); class(d1)
## [1] "integer"
## [1] "integer"
typeof(d2); class(d2)
## [1] "character"
## [1] "character"
typeof(d3); class(d3)
## [1] "double"
## [1] "numeric"
sqrt(2)
## [1] 1.414214
sqrt(2)^2
## [1] 2
sqrt(2)^2 - 2
## [1] 4.440892e-16
typeof(sqrt(2))
## [1] "double"
typeof(2)
## [1] "double"
typeof(2L)
## [1] "integer"
5 == c(5)
## [1] TRUE
length(5)
## [1] 1
### Subsetting
(A_Z <- LETTERS)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
A_F <- A_Z[1:6]
A_F
## [1] "A" "B" "C" "D" "E" "F"
A_F[3]
## [1] "C"
A_F[c(3,5)]
## [1] "C" "E"
large <- die > 3
large
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
even <- die %in% c(2,4,6)
even
## [1] FALSE TRUE FALSE TRUE FALSE TRUE
A_F[large]
## [1] "D" "E" "F"
A_F[even]
## [1] "B" "D" "F"
A_F[die < 4]
## [1] "A" "B" "C"
### Compare df with df1 <- data.frame(number = die, alphabet = A_F)
df <- data.frame(number = die, alphabet = A_F, stringsAsFactors = FALSE)
df
## number alphabet
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 6 F
df$number
## [1] 1 2 3 4 5 6
df$alphabet
## [1] "A" "B" "C" "D" "E" "F"
df[3,2]
## [1] "C"
df[4,1]
## [1] 4
df[1]
## number
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
class(df[1])
## [1] "data.frame"
class(df[[1]])
## [1] "integer"
identical(df[[1]], die)
## [1] TRUE
identical(df[1],die)
## [1] FALSE
####################
# The First Example
####################
plot(cars)
# Help
? cars
# cars is in the 'datasets' package
data()
# help(cars) does the same as ? cars
# You can use Help tab in the right bottom pane
help(plot)
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /Library/Frameworks/R.framework/Versions/4.2/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
##
##
## Using the first match ...
? par
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
x <- cars$speed
y <- cars$dist
min(x)
## [1] 4
mean(x)
## [1] 15.4
quantile(x)
## 0% 25% 50% 75% 100%
## 4 12 15 19 25
plot(cars)
abline(lm(cars$dist ~ cars$speed))
summary(lm(cars$dist ~ cars$speed))
##
## Call:
## lm(formula = cars$dist ~ cars$speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
boxplot(cars)
hist(cars$speed)
hist(cars$dist)
hist(cars$dist, breaks = seq(0,120, 10))
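The script above lists a few points without demonstrating them: the operators %% and %/%, the tiny non-zero result of sqrt(2)^2 - 2, and the suggestion to compare df with a data frame built without stringsAsFactors = FALSE. The sketch below offers follow-up experiments for these. Note one assumption: since R 4.0.0 the default is stringsAsFactors = FALSE, so the sketch sets stringsAsFactors = TRUE explicitly to make the difference visible in any version. The names remainder_ok, exact_equal and near_equal are chosen here for illustration.

```r
# 1. Remainder (%%) and integer division (%/%).
7 %% 3    # remainder of 7 divided by 3
7 %/% 3   # integer quotient of 7 divided by 3
# The two results recombine to the original number.
remainder_ok <- (7 %/% 3) * 3 + (7 %% 3) == 7

# 2. Floating-point comparison.
# sqrt(2)^2 differs from 2 by a tiny rounding error, so == reports FALSE,
# while all.equal() compares within a small tolerance.
exact_equal <- sqrt(2)^2 == 2
near_equal  <- isTRUE(all.equal(sqrt(2)^2, 2))
exact_equal
near_equal

# 3. Character columns versus factors in a data frame.
die <- 1:6
A_F <- LETTERS[1:6]
df  <- data.frame(number = die, alphabet = A_F, stringsAsFactors = FALSE)
# stringsAsFactors = TRUE is set explicitly here (an assumption of this sketch).
df1 <- data.frame(number = die, alphabet = A_F, stringsAsFactors = TRUE)
class(df$alphabet)    # character vector
class(df1$alphabet)   # factor
levels(df1$alphabet)  # the distinct values stored by the factor
```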
2.2 Practicum: Swirl and more on R Script
2.2.1 Swirl: An interactive learning environment for R and statistics
- {swirl} website: https://swirlstats.com
- JHU Data Science in Coursera uses swirl for exercises.
2.2.1.1 Swirl Courses
- R Programming: The basics of programming in R
- Regression Models: The basics of regression modeling in R
- Statistical Inference: The basics of statistical inference in R
- Exploratory Data Analysis: The basics of exploring data in R
You can install other swirl courses as well:
- Swirl Courses Organized by Title
- Swirl Courses Organized by Author’s Name
- Github: swirl courses
install_course("Course Name Here")
2.2.2 Install and Start Swirl Courses
2.2.2.1 Three Steps to Start Swirl
install.packages("swirl") # Only the first time.
library(swirl) # Every time you start swirl
swirl() # Every time you start or resume swirl
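Since the installation step is needed only once, the three steps can be wrapped so that the package is installed only when it is missing. This is a sketch under stated assumptions: `ensure_package` is a hypothetical helper written for this example and is not part of swirl or base R.

```r
# `ensure_package` is a hypothetical helper written for this sketch;
# it is not part of swirl or base R.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE instead of raising an error
  # when the package is not installed.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available after the attempt.
  requireNamespace(pkg, quietly = TRUE)
}

# swirl is an interactive tutorial, so start it only in an interactive session.
if (interactive()) {
  if (ensure_package("swirl")) {
    library(swirl)
    swirl()
  }
}
```

Running the three original lines one by one in the console works just as well; the helper only avoids reinstalling the package on every run of a script.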
2.2.2.2 R Programming: The basics of programming in R
1: Basic Building Blocks 2: Workspace and Files 3: Sequences of Numbers
4: Vectors 5: Missing Values 6: Subsetting Vectors
7: Matrices and Data Frames 8: Logic 9: Functions
10: lapply and sapply 11: vapply and tapply 12: Looking at Data
13: Simulation 14: Dates and Times 15: Base Graphics
2.2.2.3 Recommended Sections in Order
1, 3, 4, 5, 6, 7, 12, 15, 14, 8, 9, 10, 11, 13, 2
- Section 2 discusses the directories and file systems of a computer
- Sections 9, 10, 11 are for programming
2.2.2.4 Controlling a swirl Session
... <-- That’s your cue to press Enter to continue
You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.
If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you’ll see a short message letting you know you’ve done so.
When you are at the R prompt (>):
- Typing skip() allows you to skip the current question.
- Typing play() lets you experiment with R on your own; swirl will ignore what you do…
- UNTIL you type nxt() which will regain swirl’s attention.
- Typing bye() causes swirl to exit. Your progress will be saved.
- Typing main() returns you to swirl’s main menu.
- Typing info() displays these options again.