1 About the Course

1.1 ‘QALL401 Data Analysis for Researchers’

This course is an introduction to data science (DS), centered on exploratory data analysis (EDA), which is an essential part of scientific research and of evidence-based decision making by responsible global citizens. Students acquire the knowledge and principles needed to use computers appropriately when making research results public and communicating their outcomes. Since data science underpins artificial intelligence technologies, its ethical issues are becoming increasingly important.

We introduce R, a widely used free software environment for statistical computing and graphics, and R Markdown, an authoring format that enables easy creation of dynamic documents, presentations, and reports from R, supporting reproducible research and literate programming.

We will experience the process of data science, build a foundation for developing data science skills, and take time to think about the ethical issues of its outcomes.

Instructors: Taisei Kaizoji, Professor of Economics, and Hiroshi Suzuki, Part-time Instructor

Description: This course helps students from many academic fields develop skills to obtain the information they need from open data and to create charts and graphs for visualization. Students also learn the fundamentals of data analysis and write short articles that include data-based reasoning. The laboratory work uses open software such as R, and guest lectures on data analysis for research are included.

Key Words: open data, data visualization, data analysis, data reasoning, R

Features: laboratory work (practicum), short articles, guest lectures

1.1.1 Course Overview

The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.

Syllabus: https://campus.icu.ac.jp/public/ehandbook/PreviewSyllabus.aspx?regno=32002&year=2022&term=3

1.1.2 Course Schedule

  1. 2022-12-07: Introduction: About the course [led by TK]
    - An introduction to open and public data, and data science

  2. 2022-12-14: Exploratory Data Analysis (EDA) 1 [led by HS]
    - R Basics with RStudio and/or RStudio.cloud
    - tidyverse using Toy Data
    - Assignment One

  3. 2022-12-21: Exploratory Data Analysis (EDA) 2 [led by HS]
    - R Markdown for reproducibility and communication
    - dplyr for transforming data
    - Assignment Two

  4. 2023-01-11: Exploratory Data Analysis (EDA) 3 [led by HS]
    - WDI, a package for searching and downloading World Development Indicators (a short sketch follows this schedule)
    - ggplot2 for data visualization
    - Assignment Three

  5. 2023-01-18: Exploratory Data Analysis (EDA) 4 [led by HS]
    - tidyr for tidying data
    - Workflow of EDA
    - Assignment Four

  6. 2023-01-25: Exploratory Data Analysis (EDA) 5 [led by HS]
    - Data Modeling
    - Roundups, R Markdown revisited
    - Assignment Five

  7. 2023-02-01: Introduction to the PPDAC (Problem-Plan-Data-Analysis-Conclusion) Cycle [led by TK]
    - PPDAC in EDA
    - owidR

  8. 2023-02-08: Model building I [led by TK]
    - World Bank data
    - Merging data

  9. 2023-02-15: Model building II [led by TK]
    - Analyzing data and communication

  10. 2023-02-22: Project Presentation
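
As a preview of the package-based weeks above, the sketch below shows how World Bank data might be downloaded with the WDI package and plotted with ggplot2. It is a minimal, illustrative example rather than part of the syllabus; the indicator code and country selection are our own choices.

    library(WDI)       # client for the World Bank's World Development Indicators
    library(ggplot2)

    # Browse the indicator catalogue for matching codes
    WDIsearch("gdp per capita")

    # Download one indicator (GDP per capita, constant US$) for a few countries
    gdp <- WDI(country = c("JP", "US", "SE"),
               indicator = "NY.GDP.PCAP.KD",
               start = 2000, end = 2020)

    # Plot the series; the value column is named after the indicator code
    ggplot(gdp, aes(x = year, y = NY.GDP.PCAP.KD, colour = country)) +
      geom_line() +
      labs(y = "GDP per capita (constant US$)")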

1.1.3 Objective and Grading Policy

The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.

Grading policy:

A. Course participation by giving feedback - 10%

B. Short papers: Assignments 1-5 - 30%

C. Presentation - 20%

D. Final report - 40%

1.1.4 Learning Resources

1.1.4.1 Textbooks and References

1.1.5 Interactive Tutorials for R

1.1.5.2 Swirl: An interactive learning environment for R and statistics

swirl is a console-based interactive tutorial containing several courses. We did not use it in class this academic year.
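
For self-study, swirl can be installed and started from the R console as follows; this is a minimal sketch and, as noted above, not required for the course.

    install.packages("swirl")   # one-time installation from CRAN
    library(swirl)              # load the package
    swirl()                     # start the interactive tutorials in the console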

1.1.6 A message to students

This course consists of the following components.

  1. Lecture Notes
  • We provide slides, notes, and lecture notes from past years
  2. Lectures
  • We offer Zoom as an option and provide its recordings
  3. Textbook
  • R for Data Science - you can read it online
  4. Practicum in class
  • We provide the class log as R Notebooks or R Scripts
  5. Interactive Tutorials
  • Posit Primers - you can practice online
  6. Assignments - format: R Notebook
  • We provide feedback and responses to each submission in an R Notebook
  7. Student Presentation - format: R Notebook, Slides, …
  • Last class
  8. Final Paper - format: R Notebook (including code) and PDF (8 pages)
  • Due: two weeks after the last class

These components are closely linked. We do not check your engagement with Posit Primers, but the lectures from week two to week six are designed to follow Posit Primers. For assignments, you may submit an R Notebook that still contains code chunks with errors; the instructors will give feedback and suggestions. We also offer a personal tutorial meeting on Zoom upon request.
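
For reference, an assignment R Notebook is an R Markdown (.Rmd) file with a YAML header, prose, and code chunks. Below is a minimal sketch; the title and chunk contents are only illustrative.

    ---
    title: "Assignment One"
    author: "Your Name"
    output: html_notebook
    ---

    A few sentences of data reasoning go here.

    ```{r}
    library(tidyverse)
    glimpse(mpg)   # a first look at a toy data set from ggplot2
    ```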

Our goal is that you develop the skills to explore and analyze data on your own, mainly using open and public data. We truly hope you enjoy the course.

1.2 Introduction to Exploratory Data Analysis

1.2.1 What is data science?

Wikipedia https://en.wikipedia.org/wiki/Data_science

An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

  • Create Insights
  • Impact Decision Making
  • Maintain & Improve Over Time

1.3 Gapminder

1.3.1 Hans Rosling (1948 – 2017)

Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (Wikipedia)

1.3.2 Factfulness is … From the book

recognizing when a decision feels urgent and remembering that it rarely is.

To control the urgency instinct, take small steps.

  • Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
  • Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
  • Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
  • Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.

1.4 Exploratory Data Analysis

1.4.1 What is EDA (Posit Primers: Visualise Data)

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
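
As an illustration of this cycle, the sketch below uses the gapminder package's excerpt of the Gapminder data (see 1.3) to pose one question and answer it with a plot. The packages and variable choices are assumptions made for the example, not a prescribed workflow.

    library(gapminder)   # Gapminder excerpt: country, continent, year, lifeExp, pop, gdpPercap
    library(dplyr)
    library(ggplot2)

    # Question: how was income related to life expectancy in 2007?
    gapminder %>%
      filter(year == 2007) %>%
      ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent, size = pop)) +
      geom_point(alpha = 0.7) +
      scale_x_log10() +
      labs(title = "Life expectancy vs. GDP per capita, 2007",
           x = "GDP per capita (log scale)", y = "Life expectancy (years)")

    # What we see (for example, the spread within each continent) suggests new
    # questions, which we pursue with further transformation and visualisation.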

1.5 Open and Public Data, World Bank

1.5.1 Open Government Data Toolkit: Open Data Defined

The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

  1. The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
  2. The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.
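
To illustrate what being technically open means in practice, a machine-readable CSV file on a public server can be read directly with free tools such as R. The URL below is a placeholder, not a real data set.

    # Hypothetical example: replace the placeholder URL with any openly licensed CSV
    url <- "https://example.org/open-data/indicator.csv"
    dat <- read.csv(url)   # base R reads the machine-readable file over the web
    head(dat)              # inspect the first few rows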