tidyr and WIR2022
a3_123456.nb.html by
replacing 123456 with your ID)
a3_123456.Rmd, a3_123456.nb.html, a3_123456.nb.html to Moodle.
Choose data with at least two categorical variables and at least two numerical variables.
Explore the data through visualization with ggplot2.
Observations based on your data visualization and difficulties and questions encountered, if any.
Due: 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!
library(tidyverse)
library(readxl) # for excel files
library(WDI)
The following is useful when you use WDI.
wdi_cache <- WDIcache()
Or, write the cache to a file and read it back from your computer. Since wdi_cache is a list of two data frames, we cannot use write_csv(); instead, we use write_rds().
write_rds(wdi_cache, "./data/wdi_cache")
wdi_cache <- read_rds("./data/wdi_cache")
WDIcache() produces a list containing two data frames: wdi_cache$series and wdi_cache$country.
glimpse(wdi_cache)
List of 2
$ series :'data.frame': 21034 obs. of 5 variables:
..$ indicator : chr [1:21034] "1.0.HCount.1.90usd" "1.0.HCount.2.5usd" "1.0.HCount.Mid10to50" "1.0.HCount.Ofcl" ...
..$ name : chr [1:21034] "Poverty Headcount ($1.90 a day)" "Poverty Headcount ($2.50 a day)" "Middle Class ($10-50 a day) Headcount" "Official Moderate Poverty Rate-National" ...
..$ description : chr [1:21034] "The poverty headcount index measures the proportion of the population with daily per capita income (in 2011 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income below the of"| __truncated__ ...
..$ sourceDatabase : chr [1:21034] "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" ...
..$ sourceOrganization: chr [1:21034] "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of data from National Statistical Offices." ...
$ country:'data.frame': 299 obs. of 9 variables:
..$ iso3c : chr [1:299] "ABW" "AFE" "AFG" "AFR" ...
..$ iso2c : chr [1:299] "AW" "ZH" "AF" "A9" ...
..$ country : chr [1:299] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
..$ region : chr [1:299] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
..$ capital : chr [1:299] "Oranjestad" "" "Kabul" "" ...
..$ longitude: chr [1:299] "-70.0167" "" "69.1761" "" ...
..$ latitude : chr [1:299] "12.5167" "" "34.5228" "" ...
..$ income : chr [1:299] "High income" "Aggregates" "Low income" "Aggregates" ...
..$ lending : chr [1:299] "Not classified" "Aggregates" "IDA" "Aggregates" ...
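The reason write_rds() is used for the cache can be checked on a small stand-in. The following is a minimal sketch, assuming a made-up list toy_cache with the same shape as wdi_cache (a list of two data frames) and a temporary file instead of ./data/wdi_cache; the values are invented for illustration.

```r
library(readr)
library(tibble)

# Toy stand-in for wdi_cache: a list of two data frames (all values are made up)
toy_cache <- list(
  series  = tibble(indicator = c("X.1", "X.2"), name = c("first", "second")),
  country = tibble(iso2c = c("AA", "BB"), country = c("Aland", "Bland"))
)

# An RDS file stores any single R object, so the whole list goes into one file
path <- tempfile(fileext = ".rds")
write_rds(toy_cache, path)

# Reading it back returns a list with the same two data frames
toy_back <- read_rds(path)
names(toy_back)
all.equal(toy_cache, toy_back)
```

A CSV file holds one rectangular table, which is why write_csv() does not fit a list of two tables.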
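Please add mode = "wb" (write binary) to download.file(). This should work better, especially on Windows, where writing in the default text mode can corrupt a binary file such as an Excel workbook.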
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary,
destfile = "./data/WIR2022s.xlsx",
mode = "wb")
If you get an error, download the file directly from the methodology site to your computer, then open it with Excel and save it in the data folder of your RStudio project. RStudio can then recognize it as an Excel file.
Generally, a text file such as a CSV file is easy to import, but a binary file is harder to handle: unless R can recognize its file type (for example, Excel), it cannot import the data.
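To avoid downloading the same file every time the notebook runs, the download can be guarded by a check for the file. The following is a minimal sketch; download_if_missing() is a hypothetical helper written for this note, not a function from any package, and the commented call at the end reuses the URL and path from the text.

```r
# Hypothetical helper: download a binary file only when it is not there yet
download_if_missing <- function(url, destfile) {
  # Create the destination folder (for example ./data) if it does not exist
  if (!dir.exists(dirname(destfile))) {
    dir.create(dirname(destfile), recursive = TRUE)
  }
  # mode = "wb" writes the file in binary mode
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the names from the text (commented out: it needs a network connection)
# download_if_missing(url_summary, "./data/WIR2022s.xlsx")
```

With this design, a manually saved copy in the data folder is used as it is, which matches the fallback described above.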
excel_sheets("./data/WIR2022s.xlsx")
[1] "Index" "F1" "F2" "F3" "F4" "F5."
[7] "F6" "F7" "F8" "F9" "F10" "F11"
[13] "F12" "F13" "F14" "F15" "T1" "data-F1"
[19] "data-F2" "data-F3" "data-F4" "data-F5" "data-F6" "data-F7"
[25] "data-F8" "data-F9" "data-F10" "data-F11" "data-F12" "data-F13."
[31] "data-F14." "data-F15"
Reproducibility and literate programming are critical to exploratory data analysis (EDA). They serve communication: with readers of the paper, with graders of the assignments, and with yourself, as we always forget. Please think about the reader of the article, and record the procedure and output so that the reader can easily understand what you have done.
The data source is critical. Unless the reader can obtain the same data
quickly, communication about EDA cannot start. If the data is not
downloaded automatically by a code chunk, you should explain how
to obtain the data and which part of the data you used. This is crucial
when you copy and paste data using
read_delim(clipboard()). Please describe how the
reader can retrieve the same data easily. It is best to reread your paper
from the beginning, in some cases as a hard copy, to check whether
the reader can reproduce what you have done in the article.
In this Assignment Four, we required the following:
You can create a simple chart, such as a histogram or a box plot, with
only one variable. If you have two variables, you can create a scatter
plot. But with ggplot2, you can create various charts with
rich information using more than two variables. For example, year
can serve as a numerical variable as it is, or as a categorical variable
when converted with factor(year) or turned into a character vector by
as.character(year). So the distinction between categorical
variables and numerical variables is flexible. The purpose of this
assignment is to experience creating a chart with rich information using
more than two variables.
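The flexibility of year can be seen on a toy table. The following is a minimal sketch with made-up values; the names toy, p_num and p_cat are chosen for this illustration only.

```r
library(ggplot2)
library(tibble)

# Toy data (made-up values): year is stored as a number
toy <- tibble(
  year  = rep(c(2000, 2010, 2020), each = 2),
  group = rep(c("a", "b"), times = 3),
  value = c(1, 2, 2, 3, 3, 5)
)

# year as a numerical variable: a continuous x axis, one line per group
p_num <- ggplot(toy, aes(x = year, y = value, color = group)) +
  geom_line()

# year as a categorical variable: one cluster of bars per year
p_cat <- ggplot(toy, aes(x = factor(year), y = value, fill = group)) +
  geom_col(position = "dodge") +
  labs(x = "year")

# factor() turns the three distinct years into three levels
levels(factor(toy$year))
```

Both charts use three variables (year, group and value); only the role of year changes.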
However, I needed to clarify the requirements on the variables for some of you. So I sent an extra message through Announcements saying that you do not need to take them so strictly.
If you use WDI, the following may be examples:
If you use WIR, the following may be examples you saw in the executive summary:
two categorical and one numerical: F1, F2, F4, F13 (year in this case is categorical), F15
two numerical and one categorical: F6, F7, F10
Three categorical: F3
Two categorical and two numerical: F8, F11
Data visualization is a key to EDA. Create various charts and write down what you can and cannot observe from each chart.
The following are the first two fundamental questions to keep in mind.
Here is a list of data your classmates used for Assignment Four.
As for WIR2022, please refer to: https://ds-sl.github.io/data-analysis/wir2022.nb.html
I added explanations to each chart.
There is a step-by-step explanation of how to recreate a chart.
df_f8 <- read_excel("./data/WIR2022s.xlsx", sheet = "data-F8")
df_f8
pivot_longer.
df_f8_rev <- df_f8 %>% filter(year == "2020") %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value")
df_f8_rev
ggplot2.
Then, in this case, geom_col seems to fit.
df_f8_rev %>%
ggplot() +
geom_col(aes(x = country, y = value, fill = type), position = "dodge") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Private versus public wealth in rich countries in 2020",
x = "", y = "wealth as % of national income", fill = "")
Can you find similar data of this type for other countries?
It is in Chapter 3 of the report:
https://wir2022.wid.world/chapter-3/
From the methodology page, which I explained on January 25, you can download the data for Chapter 3: WIR2022TablesFigures-Chapter3.xlsx
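The two pivot_longer() steps used for F8 above may be easier to follow on a toy table. The following is a minimal sketch with made-up values and hypothetical country names A and B; the special name ".value" tells pivot_longer() that the part of the column name after "_" becomes a new column name rather than a value.

```r
library(tidyr)
library(tibble)

# Toy wide table (made-up values): one column per country and type
wide <- tibble(
  year = 2020,
  A_public = 0.1, A_private = 5.0,
  B_public = -0.2, B_private = 6.0
)

# Step 1: split "A_public" into country = "A" and a column named public
step1 <- pivot_longer(wide, !year,
                      names_to = c("country", ".value"), names_sep = "_")
names(step1)

# Step 2: stack the public and private columns into type and value
step2 <- pivot_longer(step1, c(public, private),
                      names_to = "type", values_to = "value")
step2
```

After step 2 there is one row per country and type, which is the shape geom_col() with fill = type expects.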
The strange-looking line graph has what is called a sawtooth shape, which occurs very often. So let me explain it.
WDI indicator: BX.KLT.DINV.CD.WD: Foreign Direct Investment (FDI) inflows
Step 1. Import the data.
df_fdi <- WDI(country = "all", indicator = c(fdi = "BX.KLT.DINV.CD.WD"), start =1970 , extra = TRUE, cache = NULL)
df_fdi
income names.
The following Base R code does the same as the tidyverse code
df_fdi %>% distinct(income) %>% pull(). If the list
is long, it may be better to check it as a tibble with
df_fdi %>% distinct(income). You can also use
DT::datatable(df_fdi) and search for items of interest, though
it takes up a lot of memory.
unique(df_fdi$income)
[1] "Low income" "Aggregates" "Upper middle income"
[4] "Lower middle income" "High income" NA
[7] "Not classified"
df_fdi %>% ggplot(aes(x=year, y=fdi, color=income)) + geom_line()
We observe several problems, but the most significant is that the chart
looks like a sawtooth. This is because there are many y
values at the same x value: each income group contains many countries,
and geom_line() connects all of their points in one line. When you draw a line graph, you
need to choose only a few countries, or use group_by and summarize and
plot the summarized data. Another option is to use a model to
summarize the data of each group with geom_smooth(). Since
we want a curve rather than a straight line, we use “loess” with
span, which we also used to draw some of the WIR2022 charts.
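The cause of the sawtooth can be reproduced on a toy table. The following is a minimal sketch with made-up values; toy, p_sawtooth, toy_mean and p_mean are names chosen for this illustration.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy data (made-up values): three units in one group, four years each
toy <- tibble(
  year  = rep(2000:2003, times = 3),
  unit  = rep(c("a", "b", "c"), each = 4),
  group = "g1",
  y     = c(1, 2, 3, 4, 10, 11, 12, 13, 5, 5, 5, 5)
)

# Three y values at every year in a single line: geom_line() zigzags between them
p_sawtooth <- ggplot(toy, aes(year, y, color = group)) + geom_line()

# One value per group and year: a single clean line
toy_mean <- toy %>%
  group_by(group, year) %>%
  summarize(y_mean = mean(y), .groups = "drop")
p_mean <- ggplot(toy_mean, aes(year, y_mean, color = group)) + geom_line()
```

Printing p_sawtooth and p_mean side by side shows the difference; the mean is only one possible summary, and the median or the sum may suit other objectives better.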
group_by and summarize.
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
filter(!income %in% c("Aggregates","Not classified")) %>%
group_by(income, year) %>% summarize(fdi_mean = mean(fdi)) %>%
ggplot(aes(x=year,y=fdi_mean,color=income)) +
geom_line()
`summarise()` has grouped output by 'income'. You can override using the `.groups` argument.
If you do not want the message "`summarise()` has grouped
output by 'income'. You can override using the `.groups`
argument.", try the following, which adds .groups = "drop".
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
filter(!income %in% c("Aggregates","Not classified")) %>%
group_by(income, year) %>% summarize(fdi_mean = mean(fdi), .groups = "drop") %>%
ggplot(aes(x=year,y=fdi_mean,color=income)) +
geom_line()
geom_smooth with loess and span.
Do you see similarities and differences? We need to choose one or the other according to our objective, and explain the choice.
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
filter(!income %in% c("Aggregates","Not classified")) %>%
ggplot(aes(x=year,y=fdi,color=income)) +
geom_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)
It may be a good choice to use scale_y_log10(). However,
since log10 is not finite when the value is not positive, you need to
keep only the observations where the indicator is positive. Let us see how many
nonpositive values there are at each income level.
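What log10() does with nonpositive values can be checked directly in Base R. The following is a minimal sketch on a made-up vector x.

```r
# Made-up values: one negative, one zero, two positive
x <- c(-5, 0, 10, 1000)

# log10() of a negative number is NaN (with a warning) and log10(0) is -Inf
log_x <- suppressWarnings(log10(x))
log_x

# Only the positive values have a finite logarithm
is.finite(log_x)
x[x > 0]
```

This is why the chart with scale_y_log10() is drawn only after filter(fdi > 0).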
df_fdi %>% filter(!income %in% c(NA, "Aggregates")) %>% filter(fdi <= 0) %>%
ggplot(aes(x = income, fill = income)) + geom_bar() +
labs(title = "Number of observations where FDI is not positive") +
theme(legend.position = "none")
df_fdi %>% drop_na(income) %>% filter(fdi > 0) %>%
filter(!income %in% c("Aggregates","Not classified")) %>%
ggplot(aes(x=year,y=fdi,color=income)) +
geom_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE) +
scale_y_log10() + labs(title="FDI Values <= 0 Excluded")
Note. If this is the target chart, it may be better
to check the number of NA values, zero values, negative values, and positive
values in each income group. I add
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA")))
in order to set the order of the labels. Please try the same without that
line.
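When classifying values with case_when(), missing values need is.na(), because any comparison with NA, including x == NA, returns NA rather than TRUE. The following is a minimal sketch on a made-up column; the column names wrong and right are chosen for this illustration.

```r
library(dplyr)
library(tibble)

# Made-up values: one positive, one zero, one negative, one missing
toy <- tibble(fdi = c(5, 0, -3, NA))

toy_classified <- toy %>%
  mutate(
    # fdi == NA is never TRUE, so the missing value stays unclassified (NA)
    wrong = case_when(
      fdi == NA ~ "NA",
      fdi == 0 ~ "Zero",
      fdi < 0 ~ "Negative",
      fdi > 0 ~ "Positive"),
    # is.na(fdi) is TRUE for the missing value, so it gets the label "NA"
    right = case_when(
      is.na(fdi) ~ "NA",
      fdi == 0 ~ "Zero",
      fdi < 0 ~ "Negative",
      fdi > 0 ~ "Positive")
  ) %>%
  # The factor levels fix the order of the labels in a legend
  mutate(right = factor(right, levels = c("Positive", "Zero", "Negative", "NA")))

toy_classified
```

Without the factor() step, the labels appear in alphabetical order.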
df_fdi %>% select(country, year, fdi, income) %>%
filter(!income %in% c("Aggregates", NA)) %>%
mutate(value = case_when(
is.na(fdi) ~ "NA",
fdi == 0 ~ "Zero",
fdi < 0 ~ "Negative",
fdi > 0 ~ "Positive")) %>%
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
labs(x = "")
vjust and hjust values to place the labels in appropriate places:
theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))
df_fdi %>% select(country, year, fdi, income) %>%
filter(!income %in% c("Aggregates", NA)) %>%
mutate(value = case_when(
is.na(fdi) ~ "NA",
fdi == 0 ~ "Zero",
fdi < 0 ~ "Negative",
fdi > 0 ~ "Positive")) %>%
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1)) +
labs(x = "")
stringr is a core tidyverse package, so library(tidyverse) already attaches it. With an explicit namespace, the following works even when it is not attached:
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15))
Change the width value to fit your chart. Once stringr is attached by
library(tidyverse) or library(stringr), then
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))
is enough.
df_fdi %>% select(country, year, fdi, income) %>%
filter(!income %in% c("Aggregates", NA)) %>%
mutate(value = case_when(
is.na(fdi) ~ "NA",
fdi == 0 ~ "Zero",
fdi < 0 ~ "Negative",
fdi > 0 ~ "Positive")) %>%
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15)) +
labs(x = "")
\n for the line feed.
df_fdi %>% select(country, year, fdi, income) %>%
filter(!income %in% c("Aggregates", NA)) %>%
mutate(value = case_when(
is.na(fdi) ~ "NA",
fdi == 0 ~ "Zero",
fdi < 0 ~ "Negative",
fdi > 0 ~ "Positive")) %>%
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15)) +
labs(title = "long long long long long long long \nlong long long title", x = "")
Step 1. If you want to use your own color palette, choose the codes or the color names from the following sites.
color_list <- c("#00AE9D","#F58220","#6C676E")
df_f1 <- read_excel("./data/WIR2022s.xlsx", sheet = "data-F1")
New names:
df_f1_rev <- pivot_longer(df_f1, -1, names_to = "group", values_to = "value")
df_f1_rev
geom_col(), change the default fill color using the list of colors in Step 1, and change the scale of the y axis to percent.
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = color_list) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x = "")
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = color_list) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="") +
geom_text(aes(x = ...1, y = value, group = group,
label = scales::label_percent(accuracy=1)(value)),
position = position_dodge(0.8))
vjust = -0.2.
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = color_list) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="") +
geom_text(aes(x = ...1, y = value, group = group,
label = scales::label_percent(accuracy=1)(value)), vjust = -0.2,
position = position_dodge(0.8))
0.03 to the value of y by y = value+0.03. Great!
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge", width = 0.8) +
scale_fill_manual(values = color_list) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="") +
geom_text(aes(x = ...1, y = value+0.03, group = group,
label = scales::label_percent(accuracy=1)(value)),
position = position_dodge(0.8))
Please try as many kinds of charts as possible. You can learn only by experience or from others.
year as a group?
df_wdi <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN"), start = 1990, extra = TRUE, cache = wdi_cache)
df_wdi
df_wdi %>%
filter(year %in% c("1988", "1998", "2008", "2018")) %>%
filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
ggplot(aes(x=year)) +
geom_boxplot(aes(y=lifeExp, fill=country))
I erased the second line, filter(year %in% c("1988", "1998", "2008", "2018")), but the result is very similar.
df_wdi %>%
filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
ggplot(aes(x=year)) +
geom_boxplot(aes(y=lifeExp, fill=country))
If you look at the table, you can see that year is an integer vector, not a character vector. Then what happens if we remove the quotation marks? The next chart is no longer a box plot, because for each year there is only one value for each country.
df_wdi %>%
filter(year %in% c(1988, 1998, 2008, 2018)) %>%
filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
ggplot(aes(x=factor(year))) +
geom_boxplot(aes(y=lifeExp, fill=country))
If we want to take year as a group after selecting some
years, then we should try the next one using factor(year). You
can easily change the label of the x axis with labs(x = "year").
We should also notice that there are no values for 1988, because the data was downloaded with start = 1990. We should check
such basic information first.
df_wdi %>%
filter(year %in% c(1988, 1998, 2008, 2018)) %>%
filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
ggplot(aes(x=factor(year), y=lifeExp, fill=country)) +
geom_col(position = "dodge", col = "black")
It is also possible if you change year to a character vector by
mutate(year = as.character(year)).
df_wdi %>% mutate(year = as.character(year)) %>%
filter(year %in% c("1998", "2008", "2018")) %>%
filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
ggplot(aes(x=year, y=lifeExp, fill=country)) +
geom_col(position = "dodge", col = "black") +
labs(x = "year")
Data of the World Development Indicators are in a uniform format and downloadable using the R package WDI, so they are easy to handle. However, other data require transformation to make them tidy. We give a couple of examples. Most of the UN data are in CSV, and you can get a link quickly or download the file by clicking. Though the data structure is not uniform, it is relatively easy to handle.
By the following, you can see that the first row does not contain the column names. R gives column names such as ...1, ...2, etc., when a column name is empty.
You can copy the link (URL) by right-click or ctrl+click.
url_un_edu <- "https://data.un.org/_Docs/SYB/CSV/SYB65_309_202209_Education.csv"
un_edu <- read_csv(url_un_edu)
New names:
Rows: 7283 Columns: 7
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (7): T08, Enrolment in primary, secondary and tertiary education leve...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
un_edu
Let us skip the first row by adding skip = 1.
un_edu <- read_csv(url_un_edu, skip = 1)
New names:
Rows: 7282 Columns: 7
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
un_edu
It is a very large data set, and we need to check the values.
summary(un_edu)
Region/Country/Area ...2 Year Series
Min. : 1.0 Length:7282 Min. :2000 Length:7282
1st Qu.:178.0 Class :character 1st Qu.:2005 Class :character
Median :417.0 Mode :character Median :2010 Mode :character
Mean :408.8 Mean :2012
3rd Qu.:626.0 3rd Qu.:2015
Max. :894.0 Max. :2021
Value Footnotes Source
Min. : 0.0 Length:7282 Length:7282
1st Qu.: 71.6 Class :character Class :character
Median : 100.2 Mode :character Mode :character
Mean : 2534.7
3rd Qu.: 133.6
Max. :750125.0
We can see that Year ranges from 2000 to 2021. The first variable,
Region/Country/Area, and the fifth variable, Value, are numerical
(read_csv parsed them as dbl, i.e., double, and num, respectively), and you can see this
from the summary as well. But it is not easy to see the other variables. Let
us try them one by one.
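As a compact first look at character columns, the number of distinct values in each of them can be counted in one step. The following is a minimal sketch on a made-up table; toy and its column names are hypothetical and only mimic the shape of the UN data.

```r
library(dplyr)
library(tibble)

# Toy table (made-up values) with one numeric and three character columns
toy <- tibble(
  id     = 1:4,
  area   = c("A", "A", "B", "B"),
  series = c("s1", "s2", "s1", "s2"),
  note   = c(NA, NA, "est.", NA)
)

# Count distinct values in every character column (NA counts as a value)
toy_overview <- toy %>%
  summarise(across(where(is.character), n_distinct))
toy_overview
```

A column with only a few distinct values is a candidate for a categorical variable, and distinct() then shows what those values are.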
un_edu %>% distinct(...2)
un_edu %>% distinct(Series)
un_edu %>% distinct(Footnotes)
un_edu %>% distinct(Source)
df_un_edu <- un_edu %>%
select(Region = ...2, Year, Series, Value)
df_un_edu
Is there a way to separate regions from countries?
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
filter(!is.na(iso2c)) %>% distinct(Region)
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
filter(is.na(iso2c)) %>% distinct(Region)
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
filter(is.na(iso2c)) %>% distinct(Region) %>% pull()
[1] "Total, all countries or areas" "Northern Africa"
[3] "Sub-Saharan Africa" "Northern America"
[5] "Latin America & the Caribbean" "Central Asia"
[7] "Eastern Asia" "South-eastern Asia"
[9] "Southern Asia" "Western Asia"
[11] "Europe" "Oceania"
[13] "Anguilla" "Bahamas"
[15] "Bolivia (Plurin. State of)" "China, Hong Kong SAR"
[17] "China, Macao SAR" "Congo"
[19] "Cook Islands" "Côte d’Ivoire"
[21] "Curaçao" "Dem. People's Rep. Korea"
[23] "Dem. Rep. of the Congo" "Egypt"
[25] "Gambia" "Holy See"
[27] "Iran (Islamic Republic of)" "Kyrgyzstan"
[29] "Lao People's Dem. Rep." "Micronesia (Fed. States of)"
[31] "Montserrat" "Netherlands Antilles [former]"
[33] "Niue" "Republic of Korea"
[35] "Republic of Moldova" "Saint Kitts and Nevis"
[37] "Saint Lucia" "Saint Vincent & Grenadines"
[39] "Slovakia" "State of Palestine"
[41] "Sudan [former]" "Tokelau"
[43] "Türkiye" "United Rep. of Tanzania"
[45] "United States of America" "Venezuela (Boliv. Rep. of)"
[47] "Viet Nam" "Yemen"
For some countries, iso2c is not assigned because their names in the UN data differ from the names in wdi_cache$country. From the list above, the first 12 are probably areas, and their values are aggregates.
area <- df_un_edu %>% distinct(Region) %>% slice(1:12) %>% pull()
area
[1] "Total, all countries or areas" "Northern Africa"
[3] "Sub-Saharan Africa" "Northern America"
[5] "Latin America & the Caribbean" "Central Asia"
[7] "Eastern Asia" "South-eastern Asia"
[9] "Southern Asia" "Western Asia"
[11] "Europe" "Oceania"
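The check with left_join() and is.na(iso2c) used above can also be written with the filtering joins of dplyr. The following is a minimal sketch on made-up tables; obs and ref are hypothetical stand-ins for df_un_edu and wdi_cache$country.

```r
library(dplyr)
library(tibble)

# Toy tables (made-up values): names in obs do not always match names in ref
obs <- tibble(Region = c("Europe", "Japan", "Viet Nam"), Value = c(1, 2, 3))
ref <- tibble(country = c("Japan", "Vietnam"), iso2c = c("JP", "VN"))

# Rows of obs whose name is found in ref
matched <- semi_join(obs, ref, by = c("Region" = "country"))

# Rows of obs whose name is not found in ref: areas and differently spelled countries
unmatched <- anti_join(obs, ref, by = c("Region" = "country"))

matched
unmatched
```

As in the real data, the unmatched rows mix true areas with countries whose names are spelled differently, so the list still needs to be checked by eye.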
un_edu_area <- df_un_edu %>% filter(Region %in% area)
un_edu_region <- df_un_edu %>% filter(!Region %in% area)
Now we can start studying the data.
un_edu_area %>%
filter(Series %in% c("Gross enrollment ratio - Upper secondary level (male)", "Gross enrollment ratio - Upper secondary level (female)")) %>%
ggplot(aes(Year, Value, color = Region, linetype = Series)) + geom_line()
un_edu_area %>%
filter(Series %in% c("Gross enrollment ratio - Upper secondary level (male)", "Gross enrollment ratio - Upper secondary level (female)")) %>%
pivot_wider(names_from = Series, values_from = Value) %>%
mutate (Ratio = `Gross enrollment ratio - Upper secondary level (female)`/`Gross enrollment ratio - Upper secondary level (male)`) %>%
ggplot(aes(Year, Ratio, color = Region, linetype = Region)) + geom_line() +
labs(title = "Upper Secondary Level Education", subtitle = "Ratio = female/male")
The data structure is similar to the previous one. So use
skip=1, and check the variables briefly.
url_un_pop = "https://data.un.org/_Docs/SYB/CSV/SYB65_246_202209_Population%20Growth,%20Fertility%20and%20Mortality%20Indicators.csv"
df_un_pop <- read.csv(url_un_pop, skip = 1)
df_un_pop
df_un_pop %>% distinct(Source)
df_un_pop %>% distinct(Footnotes)
df_un_pop %>% distinct(X)
df_un_pop %>% distinct(Series)
pop_area <- df_un_pop %>% distinct(X) %>% slice(1:30) %>% pull()
pop_area
[1] "Total, all countries or areas" "Africa"
[3] "Northern Africa" "Sub-Saharan Africa"
[5] "Eastern Africa" "Middle Africa"
[7] "Southern Africa" "Western Africa"
[9] "Northern America" "Latin America & the Caribbean"
[11] "Caribbean" "Central America"
[13] "South America" "Asia"
[15] "Central Asia" "Eastern Asia"
[17] "South-central Asia" "South-eastern Asia"
[19] "Southern Asia" "Western Asia"
[21] "Europe" "Eastern Europe"
[23] "Northern Europe" "Southern Europe"
[25] "Western Europe" "Oceania"
[27] "Australia and New Zealand" "Melanesia"
[29] "Micronesia" "Polynesia"
un_pop <- df_un_pop %>% select(Region = X, Year, Series, Value)
un_pop
Let us change the names of the series.
un_pop_wide <- un_pop %>% pivot_wider(names_from = Series, values_from = Value)
colnames(un_pop_wide) <- c("Region", "Year", "IncRate", "Fert", "InfDeath", "MatDeath", "LifeExp", "LifeExpM", "LifeExpF")
un_pop_wide
un_pop_long <- un_pop_wide %>% pivot_longer(cols = -c(1,2), names_to = "Series", values_to = "Value")
un_pop_long
un_pop_long_area <- un_pop_long %>% filter(Region %in% pop_area)
un_pop_long_region <- un_pop_long %>% filter(!Region %in% pop_area)
un_pop_wide_area <- un_pop_wide %>% filter(Region %in% pop_area)
un_pop_wide_region <- un_pop_wide %>% filter(!Region %in% pop_area)
Now we can visualize data.
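Assigning colnames() by position, as done above for un_pop_wide, depends on the order in which the series appear in the file. The following is a minimal sketch of renaming by name instead; the two long series names and the short names are hypothetical, not the actual UN series names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long table (made-up values and hypothetical series names)
toy_long <- tibble(
  Region = c("A", "A", "B", "B"),
  Year   = 2020,
  Series = rep(c("Long series name one (unit)", "Long series name two (unit)"), 2),
  Value  = c(1.5, 80, 2.5, 70)
)

# Widen, then rename each column by its own name, so column order does not matter
toy_wide <- toy_long %>%
  pivot_wider(names_from = Series, values_from = Value) %>%
  rename(
    One = `Long series name one (unit)`,
    Two = `Long series name two (unit)`
  )
toy_wide
```

If a series is missing or misspelled, rename() stops with an error instead of silently attaching a short name to the wrong column.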
In the following, we explain how to download data using the R package
wid. First, you need to install the package. However, it is
not an official R package yet; you need to use the package
devtools to install it from GitHub.
install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
I have not studied it fully, but you can download the data with a package
called wid. See here.
After installing the package, check the codebook of the
indicators. The following is not the ratio given in F8, but an
example.
library(wid)
wwealg <- download_wid(indicators = "wwealg", areas = "all", years = "all")
wwealp <- download_wid(indicators = "wwealp", areas = "all", years = "all")
public <- wwealg %>% select(country, year, public = value)
public
private <- wwealp %>% select(country, year, private = value)
private
public_vs_private <- public %>% left_join(private)
Joining, by = c("country", "year")
public_vs_private
df_pub_priv <- public_vs_private %>% pivot_longer(cols = c(3,4), names_to = "category", values_to = "value") %>% left_join(wdi_cache$country, by = c("country"="iso2c")) %>%
select(country = country.y, iso2c = country, year, category, value, region, income, lending)
df_pub_priv
unique(df_pub_priv$country)
[1] "Andorra"
[2] "United Arab Emirates"
[3] "Afghanistan"
[4] "Antigua and Barbuda"
[5] NA
[6] "Albania"
[7] "Armenia"
[8] "Angola"
[9] "Argentina"
[10] "American Samoa"
[11] "Austria"
[12] "Australia"
[13] "Aruba"
[14] "Azerbaijan"
[15] "Bosnia and Herzegovina"
[16] "Barbados"
[17] "Bangladesh"
[18] "Belgium"
[19] "Burkina Faso"
[20] "Bulgaria"
[21] "Bahrain"
[22] "Burundi"
[23] "Benin"
[24] "Bermuda"
[25] "Brunei Darussalam"
[26] "Bolivia"
[27] "Brazil"
[28] "Bahamas, The"
[29] "Bhutan"
[30] "Botswana"
[31] "Belize"
[32] "Canada"
[33] "Congo, Dem. Rep."
[34] "Central African Republic"
[35] "Congo, Rep."
[36] "Switzerland"
[37] "Cote d'Ivoire"
[38] "Chile"
[39] "Cameroon"
[40] "China"
[41] "Colombia"
[42] "Costa Rica"
[43] "Cuba"
[44] "Cabo Verde"
[45] "Curacao"
[46] "Cyprus"
[47] "Czechia"
[48] "Germany"
[49] "Djibouti"
[50] "Denmark"
[51] "Dominica"
[52] "Dominican Republic"
[53] "Algeria"
[54] "Ecuador"
[55] "Estonia"
[56] "Egypt, Arab Rep."
[57] "Eritrea"
[58] "Spain"
[59] "Ethiopia"
[60] "Finland"
[61] "Fiji"
[62] "Micronesia, Fed. Sts."
[63] "France"
[64] "Gabon"
[65] "United Kingdom"
[66] "Grenada"
[67] "Georgia"
[68] "Ghana"
[69] "Greenland"
[70] "Gambia, The"
[71] "Guinea"
[72] "Equatorial Guinea"
[73] "Greece"
[74] "Guatemala"
[75] "Guam"
[76] "Guinea-Bissau"
[77] "Guyana"
[78] "Hong Kong SAR, China"
[79] "Honduras"
[80] "Croatia"
[81] "Haiti"
[82] "Hungary"
[83] "Indonesia"
[84] "Ireland"
[85] "Israel"
[86] "Isle of Man"
[87] "India"
[88] "Iraq"
[89] "Iran, Islamic Rep."
[90] "Iceland"
[91] "Italy"
[92] "Jamaica"
[93] "Jordan"
[94] "Japan"
[95] "Kenya"
[96] "Kyrgyz Republic"
[97] "Cambodia"
[98] "Kiribati"
[99] "Comoros"
[100] "St. Kitts and Nevis"
[101] "Korea, Dem. People's Rep."
[102] "Korea, Rep."
[103] "Kuwait"
[104] "Cayman Islands"
[105] "Kazakhstan"
[106] "Lao PDR"
[107] "Lebanon"
[108] "St. Lucia"
[109] "Liechtenstein"
[110] "Sri Lanka"
[111] "Liberia"
[112] "Lesotho"
[113] "Lithuania"
[114] "Luxembourg"
[115] "Latvia"
[116] "Libya"
[117] "Morocco"
[118] "Monaco"
[119] "Moldova"
[120] "Montenegro"
[121] "Madagascar"
[122] "Marshall Islands"
[123] "North Macedonia"
[124] "Mali"
[125] "Myanmar"
[126] "Mongolia"
[127] "Macao SAR, China"
[128] "Northern Mariana Islands"
[129] "Mauritania"
[130] "Malta"
[131] "Mauritius"
[132] "Maldives"
[133] "Malawi"
[134] "Mexico"
[135] "Malaysia"
[136] "Mozambique"
[137] "Namibia"
[138] "New Caledonia"
[139] "Niger"
[140] "Nigeria"
[141] "Nicaragua"
[142] "Netherlands"
[143] "Norway"
[144] "Nepal"
[145] "Nauru"
[146] "New Zealand"
[147] "OECD members"
[148] "Oman"
[149] "Panama"
[150] "Peru"
[151] "French Polynesia"
[152] "Papua New Guinea"
[153] "Philippines"
[154] "Pakistan"
[155] "Poland"
[156] "Puerto Rico"
[157] "West Bank and Gaza"
[158] "Portugal"
[159] "Palau"
[160] "Paraguay"
[161] "Qatar"
[162] "Romania"
[163] "Serbia"
[164] "Russian Federation"
[165] "Rwanda"
[166] "Saudi Arabia"
[167] "Solomon Islands"
[168] "Seychelles"
[169] "Sudan"
[170] "Sweden"
[171] "Singapore"
[172] "Slovenia"
[173] "Slovak Republic"
[174] "Sierra Leone"
[175] "San Marino"
[176] "Senegal"
[177] "Somalia"
[178] "Suriname"
[179] "South Sudan"
[180] "Sao Tome and Principe"
[181] "El Salvador"
[182] "Sint Maarten (Dutch part)"
[183] "Syrian Arab Republic"
[184] "Eswatini"
[185] "Turks and Caicos Islands"
[186] "Chad"
[187] "Togo"
[188] "Thailand"
[189] "Tajikistan"
[190] "Timor-Leste"
[191] "Turkmenistan"
[192] "Tunisia"
[193] "Tonga"
[194] "Turkiye"
[195] "Trinidad and Tobago"
[196] "Tuvalu"
[197] "Taiwan, China"
[198] "Tanzania"
[199] "Ukraine"
[200] "Uganda"
[201] "United States"
[202] "Uruguay"
[203] "Uzbekistan"
[204] "St. Vincent and the Grenadines"
[205] "Venezuela, RB"
[206] "British Virgin Islands"
[207] "Virgin Islands (U.S.)"
[208] "Vietnam"
[209] "Vanuatu"
[210] "Samoa"
[211] "IBRD only"
[212] "IDA only"
[213] "Least developed countries: UN classification"
[214] "Low income"
[215] "Lower middle income"
[216] "Yemen, Rep."
[217] "South Africa"
[218] "Zambia"
[219] "Zimbabwe"
df_pub_priv %>%
filter(country %in% c("Japan", "Norway", "Sweden", "Denmark", "Finland"), year %in% 1970:2020) %>%
ggplot(aes(year, value, color = country, linetype = category)) + geom_line()
We chose two indicators, ‘wealg’ and ‘wealp’, each prefixed with the one-letter code ‘w’ (hence "wwealg" and "wwealp" in the code). WIR2022 indicators consist of 6 characters: a 1-letter code plus a 5-letter code. You can find the list in the codebook.
If you want to study WIR2022, please study the report, the codebook, and the wid vignette together with the R Notebook.
As I mentioned earlier, the data tables used in the report are available from the following page.