1 About the Course

1.1 ‘QALL401 Data Analysis for Researchers’

This course is an introduction to data science (DS), centered on exploratory data analysis (EDA), which is an essential part of scientific research and of evidence-based decision making by responsible global citizens. Students acquire the knowledge and principles needed to use computers appropriately when making research results public and communicating their outcomes. Since data science underpins artificial intelligence technologies, its ethical issues are becoming increasingly important.

We introduce R, a widely used free software environment for statistical computing and graphics, and R Markdown, an authoring format that enables easy creation of dynamic documents, presentations, and reports from R, supporting reproducible research and literate programming.

We will experience the process of data science, build a foundation for developing data science skills, and take time to think about the ethical issues of its outcomes.

Instructors: Taisei Kaizoji, Professor of Economics, and Hiroshi Suzuki, Part-time Instructor

Description: This course helps students from many academic fields develop skills to obtain the information they need from open data and to create charts and graphs for visualization. Students also learn the fundamentals of data analysis and write short articles that include data-based reasoning. The laboratory work uses open software such as R, and guest lectures on data analysis for research are included.

Key Words: open data, data visualization, data analysis, data reasoning, R

Features: laboratory work (practicum), short articles, guest lectures

1.1.1 Course Overview

The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.

Syllabus: https://campus.icu.ac.jp/public/ehandbook/PreviewSyllabus.aspx?regno=32002&year=2022&term=3

1.1.2 Course Schedule

  1. 2022-12-07: Introduction: About the course [led by TK]
    - An introduction to open and public data, and data science

  2. 2022-12-14: Exploratory Data Analysis (EDA) 1 [led by HS]
    - R Basics with RStudio and/or RStudio.cloud
    - tidyverse using Toy Data
    - Assignment One

  3. 2022-12-21: Exploratory Data Analysis (EDA) 2 [led by HS]
    - R Markdown for reproducibility and communication
    - dplyr for transforming data
    - Assignment Two

  4. 2023-01-11: Exploratory Data Analysis (EDA) 3 [led by HS]
    - WDI, a package for searching and downloading World Development Indicators (a short sketch follows this schedule)
    - ggplot2 for data visualization
    - Assignment Three

  5. 2023-01-18: Exploratory Data Analysis (EDA) 4 [led by HS]
    - tidyr for tidying data
    - Workflow of EDA
    - Assignment Four

  6. 2023-01-25: Exploratory Data Analysis (EDA) 5 [led by HS]
    - Data Modeling
    - Roundups, R Markdown revisited
    - Assignment Five

  7. 2023-02-01: Introduction to the PPDAC (Problem-Plan-Data-Analysis-Conclusion) Cycle [led by TK]
    - PPDAC in EDA
    - owidR

  8. 2023-02-08: Model building I [led by TK]
    - World Bank data
    - Merging data

  9. 2023-02-15: Model building II [led by TK]
    - Analyzing data and communication

  10. 2023-02-22: Project Presentation
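
As a preview of the package-based weeks above, the sketch below shows how World Bank data might be downloaded with the WDI package and plotted with ggplot2. It is a minimal, illustrative example rather than part of the syllabus; the indicator code and country selection are our own choices.

    library(WDI)       # client for the World Bank's World Development Indicators
    library(ggplot2)

    # Browse the indicator catalogue for matching codes
    WDIsearch("gdp per capita")

    # Download one indicator (GDP per capita, constant US$) for a few countries
    gdp <- WDI(country = c("JP", "US", "SE"),
               indicator = "NY.GDP.PCAP.KD",
               start = 2000, end = 2020)

    # Plot the series; the value column is named after the indicator code
    ggplot(gdp, aes(x = year, y = NY.GDP.PCAP.KD, colour = country)) +
      geom_line() +
      labs(y = "GDP per capita (constant US$)")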

1.1.3 Objective and Grading Policy

The objective of this course is to learn the fundamentals of data science. Using the free software R and its IDE, RStudio, students will learn how to collect data, transform it into appropriate forms, and visualize it. Students will also learn how to analyze data and present their findings to others.

Grading policy:

A. Course participation by giving feedback - 10%

B. Short papers: Assignments 1-5 - 30%

C. Presentation - 20%

D. Final report - 40%

1.1.4 Learning Resources

1.1.4.1 Textbooks and References

1.1.5 Interactive Tutorials for R

1.1.5.2 Swirl: An interactive learning environment for R and statistics

swirl is a console-based interactive tutorial containing several courses. We did not use it in class this academic year.
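
For self-study, swirl can be installed and started from the R console as follows; this is a minimal sketch and, as noted above, not required for the course.

    install.packages("swirl")   # one-time installation from CRAN
    library(swirl)              # load the package
    swirl()                     # start the interactive tutorials in the console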

1.1.6 A message to students

This course consists of the following components.

  1. Lecture Notes
  • We provide slides, notes, and lecture notes from past years
  2. Lectures
  • We offer Zoom as an option and provide its recordings
  3. Textbook
  • R for Data Science - you can read it online
  4. Practicum in class
  • We provide the class log as R Notebooks or R Scripts
  5. Interactive Tutorials
  • Posit Primers - you can practice online
  6. Assignments - format: R Notebook
  • We provide feedback and responses to each submission in an R Notebook
  7. Student Presentation - format: R Notebook, Slides, …
  • Last class
  8. Final Paper - format: R Notebook (including code) and PDF (8 pages)
  • Due: two weeks after the last class

These components are closely linked. We do not check your engagement with Posit Primers, but the lectures from week two to week six are designed to follow Posit Primers. For assignments, you may submit an R Notebook that still contains code chunks with errors; the instructors will give feedback and suggestions. We also offer a personal tutorial meeting on Zoom upon request.
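
For reference, an assignment R Notebook is an R Markdown (.Rmd) file with a YAML header, prose, and code chunks. Below is a minimal sketch; the title and chunk contents are only illustrative.

    ---
    title: "Assignment One"
    author: "Your Name"
    output: html_notebook
    ---

    A few sentences of data reasoning go here.

    ```{r}
    library(tidyverse)
    glimpse(mpg)   # a first look at a toy data set from ggplot2
    ```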

Our goal is that you develop the skills to explore and analyze data on your own, mainly using open and public data. We truly hope you enjoy the course.

1.2 Introduction to Exploratory Data Analysis

1.2.1 What is data science?

Wikipedia https://en.wikipedia.org/wiki/Data_science

An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

  • Create Insights
  • Impact Decision Making
  • Maintain & Improve Over Time

1.3 Gapminder

1.3.1 Hans Rosling (1948 – 2017)

Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (Wikipedia)

1.3.2 Factfulness is … From the book

recognizing when a decision feels urgent and remembering that it rarely is.

To control the urgency instinct, take small steps.

  • Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
  • Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
  • Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
  • Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.

1.4 Exploratory Data Analysis

1.4.1 What is EDA (Posit Primers: Visualise Data)

EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

  1. Generate questions about your data

  2. Search for answers by visualising, transforming, and/or modeling your data

  3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
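
As an illustration of this cycle, the sketch below uses the gapminder package's excerpt of the Gapminder data (see 1.3) to pose one question and answer it with a plot. The packages and variable choices are assumptions made for the example, not a prescribed workflow.

    library(gapminder)   # Gapminder excerpt: country, continent, year, lifeExp, pop, gdpPercap
    library(dplyr)
    library(ggplot2)

    # Question: how was income related to life expectancy in 2007?
    gapminder %>%
      filter(year == 2007) %>%
      ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent, size = pop)) +
      geom_point(alpha = 0.7) +
      scale_x_log10() +
      labs(title = "Life expectancy vs. GDP per capita, 2007",
           x = "GDP per capita (log scale)", y = "Life expectancy (years)")

    # What we see (for example, the spread within each continent) suggests new
    # questions, which we pursue with further transformation and visualisation.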

1.5 Open and Public Data, World Bank

1.5.1 Open Government Data Toolkit: Open Data Defined

The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

  1. The data must be legally open, which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
  2. The data must be technically open, which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.
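
To illustrate what being technically open means in practice, a machine-readable CSV file on a public server can be read directly with free tools such as R. The URL below is a placeholder, not a real data set.

    # Hypothetical example: replace the placeholder URL with any openly licensed CSV
    url <- "https://example.org/open-data/indicator.csv"
    dat <- read.csv(url)   # base R reads the machine-readable file over the web
    head(dat)              # inspect the first few rows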