1 About the Course
1.1 ‘QALL401 Data Analysis for Researchers’
This course is an introduction to data science (DS), with a focus on exploratory data analysis (EDA), which is an essential part of scientific research and of the evidence-based decision making expected of a responsible global citizen. Students acquire the knowledge and principles needed to use computers appropriately when publishing research results and communicating their outcomes. Because data science supports artificial intelligence technologies, its ethical issues are becoming more and more important.
We introduce R, a widely used free software environment for statistical computing and graphics, and R Markdown, an authoring format that enables easy creation of dynamic documents, presentations, and reports from R, supporting reproducible research and literate programming.
We will experience the process of data science, lay a foundation for developing data science skills, and take time to think about the ethical issues raised by its outcomes.
Instructors: Taisei Kaizoji, Professor of Economics, and Hiroshi Suzuki, Part-time Instructor
Description: This course helps students from many academic fields develop the skills to obtain the information they need from open data and to create charts and graphs for visualization. Students also learn the fundamentals of data analysis and write short articles that include reasoning from data. The laboratory work uses open software such as R, and the course includes guest lectures on data analysis for research.
Key Words: open data, data visualization, data analysis, data reasoning, R
Features: laboratory work - practicum, write short articles, guest lectures
1.1.1 Course Overview
The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.
Syllabus: https://campus.icu.ac.jp/public/ehandbook/PreviewSyllabus.aspx?regno=32002&year=2022&term=3
1.1.2 Course Schedule
2022-12-07: Introduction: About the course [led by TK]
- An introduction to open and public data, and data science
2022-12-14: Exploratory Data Analysis (EDA) 1 [led by HS]
- R Basics with RStudio and/or RStudio.cloud
- tidyverse using Toy Data
- Assignment One
2022-12-21: Exploratory Data Analysis (EDA) 2 [led by HS]
- R Markdown for reproducibility and communication
- dplyr for transforming data
- Assignment Two
2023-01-11: Exploratory Data Analysis (EDA) 3 [led by HS]
- WDI, a package for searching and downloading World Development Indicators
- ggplot2 for data visualization
- Assignment Three
2023-01-18: Exploratory Data Analysis (EDA) 4 [led by HS]
- tidyr for tidying data
- Workflow of EDA
- Assignment Four
2023-01-25: Exploratory Data Analysis (EDA) 5 [led by HS]
- Data Modeling
- Roundups, R Markdown revisited
- Assignment Five
2023-02-01: Introduction to PPDAC (Problem-Plan-Data-Analysis-Conclusion) Cycle [led by TK]
- PPDAC in EDA
- owidR
2023-02-08: Model building I [led by TK]
- World Bank data
- Merging data
2023-02-15: Model building II [led by TK]
- Analyzing data and communications
2023-02-22: Project Presentation
1.1.3 Objective and Grading Policy
The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.
Grading policy:
A. Course participation by giving feedback - 10%
B. Short papers: Assignments 1-5 - 30%
C. Presentation - 20%
D. Final report - 40%
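As a small illustration of how these weights combine, the following base R sketch computes a weighted total. The component scores are hypothetical values on an assumed 0-100 scale, and the names `weights`, `scores`, and `final_score` are chosen only for this example; the actual marking scale is not specified here.

```r
# Component weights taken from the grading policy (fractions of the final grade).
weights <- c(participation = 0.10, assignments = 0.30,
             presentation = 0.20, final_report = 0.40)

# Hypothetical component scores on an assumed 0-100 scale (illustrative only).
scores <- c(participation = 90, assignments = 80,
            presentation = 70, final_report = 85)

# Sanity check: the weights should account for the whole grade.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Weighted total: multiply each score by its weight, then sum.
# Indexing by name keeps scores and weights aligned even if their order differs.
final_score <- sum(weights * scores[names(weights)])
final_score
```

Indexing `scores` by `names(weights)` is a deliberate choice: it avoids silently pairing a score with the wrong weight.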
1.1.4 Learning Resources
1.1.4.1 Textbooks and References
- “R for Data Science” by Hadley Wickham and Garrett Grolemund:
- Free Online Book: https://r4ds.had.co.nz
- Visit the bookdown site: https://bookdown.org
- Many more on the archive page.
1.1.5 Interactive Tutorials for R
1.1.5.1 Posit Primers https://posit.cloud/learn/primers
- The Basics – r4ds: Explore, I
- Work with Data – r4ds: Wrangle, I
- Visualize Data – r4ds: Explore, II
  - Exploratory Data Analysis
  - Bar Charts
  - Histograms
  - Boxplots and Counts
  - Scatterplots
  - Line plots and maps
  - Overplotting
  - Customize plots
- Tidy Your Data – r4ds: Wrangle, II
- Iterate – r4ds: Program
- Write Functions – r4ds: Program
  - Function Basics
  - How to Write a Function
  - Argument Matching
  - Environments and Scoping
  - Control Flow
  - Advanced Control Flow
  - Loops in R
- Report Reproducibly – r4ds: Communicate
1.1.5.2 Swirl: An interactive learning environment for R and statistics
It is a console-based interactive tutorial containing several courses. We did not use it in class this academic year.
- {swirl} website: https://swirlstats.com
- JHU Data Science on Coursera uses swirl for exercises.
1.1.6 A message to students
This course consists of the following components.
- Lecture Note
  - We provide slides, notes, and lecture notes from past years
- Lecture
  - We offer Zoom as an option and provide recordings
- Textbook
  - R for Data Science - you can read online
- Practicum in class
  - We provide the log as R Notebooks or R scripts
- Interactive Tutorial
  - Posit Primers - you can practice online
- Assignments - format: R Notebook
  - We provide feedback on each assignment, with responses in an R Notebook
- Student Presentation - format: R Notebook, Slides, …
  - Last class
- Final Paper - format: R Notebook (including code) and PDF (8 pages)
  - Due: Two weeks after the last class
The components are closely linked. We do not check your engagement with Posit Primers, but the lectures from week two to week six are designed to follow them. For assignments, you may submit an R Notebook that contains code chunks with errors; the instructors will try to give feedback and suggestions. We can also set up a personal tutorial meeting on Zoom upon request.
Our goal is for you to develop the skills to explore and analyze data by yourself, mainly using open public data. We truly hope you enjoy the course.
1.2 Introduction to Exploratory Data Analysis
1.2.1 What is data science?
Wikipedia https://en.wikipedia.org/wiki/Data_science
An inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
- Create Insights
- Impact Decision Making
- Maintain & Improve Over Time
1.3 Gapminder
1.3.1 Hans Rosling (1948 – 2017)
Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (Wikipedia)
- Books:
- Factfulness: Ten Reasons We’re Wrong About The World - And Why Things Are Better Than You Think, 2018
- How I Learned to Understand the World: A Memoir, 2020
- Gapminder: https://www.gapminder.org
- You are probably wrong about: Upgrade Your World View
- Bubble Chart: Income vs Life Expectancy over time, 1800 - 2020
- How many variables?
- Videos: The best stats you’ve ever seen, Hans Rosling
1.3.2 Factfulness is … From the book
recognizing when a decision feels urgent and remembering that it rarely is.
To control the urgency instinct, take small steps.
- Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
- Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
- Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
- Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.
1.4 Exploratory Data Analysis
1.4.1 What is EDA? (Posit Primers: Visualize Data)
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
- Generate questions about your data
- Search for answers by visualising, transforming, and/or modeling your data
- Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
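One pass through this cycle might look like the following minimal sketch. It assumes only base R and the `mtcars` data set that ships with R, so it runs without extra packages; in the course itself the same steps would more likely use dplyr and ggplot2. The question asked and the names `cars`, `fit`, and `by_cyl` are illustrative choices, not part of the course material.

```r
# One pass through the EDA cycle, using the mtcars data set that ships with R.

# 1. Generate a question: do heavier cars tend to have lower fuel efficiency?

# 2a. Transform: keep only the two variables the question is about.
cars <- mtcars[, c("wt", "mpg")]   # wt: weight in 1000 lbs, mpg: miles per gallon
summary(cars)

# 2b. Visualise: a scatterplot of fuel efficiency against weight.
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# 2c. Model: a simple linear regression of mpg on weight.
fit <- lm(mpg ~ wt, data = cars)
coef(fit)   # the sign of the wt coefficient gives the direction of the trend

# 3. Refine the question: does fuel efficiency also differ by cylinder count?
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
by_cyl
```

The last step deliberately ends with a new question rather than a conclusion, since each answer in EDA usually feeds the next round of the cycle.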
1.5 Open and Public Data, World Bank
1.5.1 Open Government Data Toolkit: Open Data Defined
The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
- The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
- The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.
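To illustrate why machine-readable, non-proprietary formats matter, here is a minimal sketch that reads CSV text with base R and summarises it. The data are made up for this example and stand in for a file downloaded from an open data catalog; the column names and the objects `csv_text`, `open_data`, and `mean_by_year` are hypothetical.

```r
# A tiny, made-up CSV standing in for a file from an open data catalog.
csv_text <- "country,year,value
A,2020,1.5
B,2020,2.5
A,2021,2.0
B,2021,3.0"

# CSV is plain text, so freely available tools such as base R can read it directly.
# For a real download, pass a file path or URL to read.csv() instead of text =.
open_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(open_data)

# Once loaded, the data can be analysed immediately, e.g. the mean value per year.
mean_by_year <- aggregate(value ~ year, data = open_data, FUN = mean)
mean_by_year
```

The same few lines would not work on a scanned table or a password-protected spreadsheet, which is the practical point of the "technically open" requirement.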