tidyr
and WIR2022
a3_123456.nb.html
by
replacing 123456 with your ID)
a3_123456.Rmd
,a3_123456.nb.html
,a3_123456.nb.html
to Moodle.
Choose data with at least two categorical variables and at least two numerical variables.
Explore the data through visualization using ggplot2.
Observations based on your data visualization and difficulties and questions encountered, if any.
Due: 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!
library(tidyverse)
library(readxl) # for excel files
library(WDI)
The following is useful when you use WDI.
wdi_cache <- WDIcache()
Or, write the cache to a file and read it from your computer. Since wdi_cache is a list of two data frames, we cannot use write_csv(); instead, we use write_rds().
write_rds(wdi_cache, "./data/wdi_cache")
wdi_cache <- read_rds("./data/wdi_cache")
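As a small illustration of why write_rds() is used here, the following sketch saves a list of two data frames to an RDS file and reads it back. The object toy_cache and the temporary file path are made up for this example; they are not the real WDI cache.

```r
library(readr)

# A toy list of two data frames, shaped like the output of WDIcache()
# (hypothetical values, for illustration only)
toy_cache <- list(
  series  = data.frame(indicator = c("A.1", "B.2"), name = c("First", "Second")),
  country = data.frame(iso2c = c("JP", "NO"), country = c("Japan", "Norway"))
)

# write_csv() expects a single data frame; an RDS file can store any R object
path <- tempfile(fileext = ".rds")
write_rds(toy_cache, path)
toy_back <- read_rds(path)

# Both data frames come back under the same names
names(toy_back)
identical(toy_cache, toy_back)
```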
WDIcache() produces a list containing two data frames: wdi_cache$series and wdi_cache$country.
glimpse(wdi_cache)
List of 2
$ series :'data.frame': 21034 obs. of 5 variables:
..$ indicator : chr [1:21034] "1.0.HCount.1.90usd" "1.0.HCount.2.5usd" "1.0.HCount.Mid10to50" "1.0.HCount.Ofcl" ...
..$ name : chr [1:21034] "Poverty Headcount ($1.90 a day)" "Poverty Headcount ($2.50 a day)" "Middle Class ($10-50 a day) Headcount" "Official Moderate Poverty Rate-National" ...
..$ description : chr [1:21034] "The poverty headcount index measures the proportion of the population with daily per capita income (in 2011 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income below the of"| __truncated__ ...
..$ sourceDatabase : chr [1:21034] "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" ...
..$ sourceOrganization: chr [1:21034] "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of data from National Statistical Offices." ...
$ country:'data.frame': 299 obs. of 9 variables:
..$ iso3c : chr [1:299] "ABW" "AFE" "AFG" "AFR" ...
..$ iso2c : chr [1:299] "AW" "ZH" "AF" "A9" ...
..$ country : chr [1:299] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
..$ region : chr [1:299] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
..$ capital : chr [1:299] "Oranjestad" "" "Kabul" "" ...
..$ longitude: chr [1:299] "-70.0167" "" "69.1761" "" ...
..$ latitude : chr [1:299] "12.5167" "" "34.5228" "" ...
..$ income : chr [1:299] "High income" "Aggregates" "Low income" "Aggregates" ...
..$ lending : chr [1:299] "Not classified" "Aggregates" "IDA" "Aggregates" ...
Please add mode = "wb" (write in binary mode). This should work better, because an Excel file is a binary file.
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary,
  destfile = "./data/WIR2022s.xlsx",
  mode = "wb")
If you get an error, download the file directly from the methodology site to your computer, then open it with Excel and save it in the data folder of your RStudio project. R can then recognize it as an Excel file.
Generally, a text file such as a CSV file is easy to import, but a binary file is harder to handle, because R cannot import the data unless it recognizes the file type, for example, Excel.
excel_sheets("./data/WIR2022s.xlsx")
[1] "Index" "F1" "F2" "F3" "F4" "F5."
[7] "F6" "F7" "F8" "F9" "F10" "F11"
[13] "F12" "F13" "F14" "F15" "T1" "data-F1"
[19] "data-F2" "data-F3" "data-F4" "data-F5" "data-F6" "data-F7"
[25] "data-F8" "data-F9" "data-F10" "data-F11" "data-F12" "data-F13."
[31] "data-F14." "data-F15"
Reproducibility and literate programming are critical to exploratory data analysis (EDA). They serve communication: with readers of the paper, with graders of the assignments, and with yourself, as we always forget. Please think about the reader of the article, and record the procedure and output so that the reader can easily understand what you have done.
The data source is critical. Unless the reader can obtain the same data quickly, the communication on EDA does not start. If the data is not downloaded automatically through a code chunk, you should explain how to obtain the data and which part of the data you used. This is crucial when you copy and paste data using read_delim(clipboard()). Please describe how the reader can retrieve the same data easily. It is best to read your paper from the beginning, in some cases as a hard copy, to check whether the reader can reproduce what you have done in the article.
In this Assignment Four, we required the following:
You can create a simple chart, such as a histogram or a box plot, with only one variable. If you have two variables, you can create a scatter plot. But with ggplot2, you can create various charts with rich information using more than two variables. For example, the year can be used as a numerical variable or, after converting it with factor(year) or as.character(year), as a categorical variable. So the distinction between categorical variables and numerical variables is flexible. The purpose of this assignment is to experience creating a chart with rich information using more than two variables.
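The following minimal sketch shows the same year column used once as a numerical variable and once as a categorical variable. The data frame toy and its values are made up for this illustration.

```r
library(ggplot2)

# Toy data (made-up values): four observations for each of three years
toy <- data.frame(
  year  = rep(c(2000, 2010, 2020), each = 4),
  value = c(1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6)
)

# year as a numerical variable: a continuous x axis
p_num <- ggplot(toy, aes(x = year, y = value)) + geom_point()

# year as a categorical variable: one box per year
p_cat <- ggplot(toy, aes(x = factor(year), y = value)) + geom_boxplot() +
  labs(x = "year")

p_num
p_cat
```

With factor(year), ggplot2 treats each distinct year as a separate group, which is what a box plot needs.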
However, the requirements on the variables were not clear to some of you, so I sent out an extra message through Announcement saying that you do not need to take them so strictly.
If you use WDI, the following may be examples:
If you use WIR, the following may be examples you saw in the executive summary:
Two categorical and one numerical: F1, F2, F4, F13 (year in this case is categorical), F15
Two numerical and one categorical: F6, F7, F10
Three categorical: F3
Two categorical and two numerical: F8, F11
Data visualization is a key to EDA. Create various charts and write down what you can and cannot observe from each chart.
The following are the first two fundamental questions to keep in mind.
Here is a list of data your classmates used for Assignment Four.
As for WIR2022, please refer to: https://ds-sl.github.io/data-analysis/wir2022.nb.html
I added explanations to each chart.
There is a step-by-step explanation of how to recreate a chart.
df_f8 <- read_excel("./data/WIR2022s.xlsx", sheet = "data-F8")
df_f8
pivot_longer.
df_f8_rev <- df_f8 %>% filter(year == "2020") %>%
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
         Spain_public = Spain, Spain_private = 'Spain (private)',
         France_public = France, France_private = 'France (private)',
         UK_public = UK, UK_private = 'UK (private)',
         Japan_public = Japan, Japan_private = 'Japan (private)',
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value")
df_f8_rev
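The two pivot_longer() calls can be easier to follow on a small table. The sketch below uses a made-up table toy_wide with only two countries; the values are hypothetical and are not taken from WIR2022.

```r
library(dplyr)
library(tidyr)

# Toy wide table (made-up values) with column names of the form country_type
toy_wide <- tibble(
  year = "2020",
  Germany_public = 0.1, Germany_private = 6.0,
  Japan_public = -0.1, Japan_private = 6.2
)

# Step 1: split each name at "_"; the first part goes to `country`, and the
# second part (".value") becomes the column names `public` and `private`
toy_step1 <- toy_wide %>%
  pivot_longer(!year, names_to = c("country", ".value"), names_sep = "_")

# Step 2: stack the public and private columns into `type` and `value`
toy_long <- toy_step1 %>%
  pivot_longer(c(public, private), names_to = "type", values_to = "value")

toy_step1
toy_long
```

After the first step there is one row per country; after the second step there is one row per country and type, which is the shape geom_col() with fill = type expects.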
ggplot2. Then, in this case, geom_col seems to fit.
df_f8_rev %>%
  ggplot() +
  geom_col(aes(x = country, y = value, fill = type), position = "dodge") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Private versus public wealth in rich countries in 2020",
       x = "", y = "wealth as % of national income", fill = "")
Can you find similar data of this type for other countries?
It is in Chapter 3 of the report:
https://wir2022.wid.world/chapter-3/
From the methodology page, which I explained on January 25, you can download the data for Chapter 3: WIR2022TablesFigures-Chapter3.xlsx
The strange-looking line graph is said to have a sawtooth shape, and this happens very often. So let me explain it.
WDI indicator: BX.KLT.DINV.CD.WD: Foreign Direct Investment (FDI) inflows
Step 1. Import the data.
df_fdi <- WDI(country = "all", indicator = c(fdi = "BX.KLT.DINV.CD.WD"), start = 1970, extra = TRUE, cache = NULL)
df_fdi
income names. The following code in Base R does the same as the following using tidyverse: df_fdi %>% distinct(income) %>% pull(). If the list is long, it may be better to view it as a tibble with df_fdi %>% distinct(income). You can also use DT::datatable(df_fdi) and search for items of interest, though it takes up a lot of memory.
unique(df_fdi$income)
[1] "Low income" "Aggregates" "Upper middle income"
[4] "Lower middle income" "High income" NA
[7] "Not classified"
df_fdi %>% ggplot(aes(x=year, y=fdi, color=income)) + geom_line()
We observe several problems, but the most significant one is that the chart looks like a sawtooth. This is because there are many y values at the same x value: each income group contains many countries. When you draw a line graph, you need to choose only several countries, or use group_by and summarize and plot the summarized data. There is another option: we can use a model to summarize the data of each group using geom_smooth(). Since we want a curve rather than a straight line, we use "loess" with span, which we also used to draw some of the WIR2022 charts.
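A tiny example may make the cause of the sawtooth easier to see. The data frame toy below is made up: two groups with two countries each, so every group has two y values in every year.

```r
library(dplyr)
library(ggplot2)

# Toy data (made-up values): two groups, two countries each, three years
toy <- tibble(
  group   = rep(c("A", "B"), each = 6),
  country = rep(c("a1", "a2", "b1", "b2"), each = 3),
  year    = rep(2000:2002, times = 4),
  y       = c(1, 2, 3, 11, 12, 13, 5, 6, 7, 25, 26, 27)
)

# Two y values per year in each group: the line jumps up and down
p_saw <- ggplot(toy, aes(x = year, y = y, color = group)) + geom_line()

# One mean per group and year: exactly one y value at each x
toy_mean <- toy %>%
  group_by(group, year) %>%
  summarize(y_mean = mean(y), .groups = "drop")
p_mean <- ggplot(toy_mean, aes(x = year, y = y_mean, color = group)) + geom_line()

p_saw
p_mean
```

The mean is only one possible summary; median() or sum() could be used in the same place, depending on the question.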
group_by and summarize.
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
  filter(!income %in% c("Aggregates","Not classified")) %>%
  group_by(income, year) %>% summarize(fdi_mean = mean(fdi)) %>%
  ggplot(aes(x=year,y=fdi_mean,color=income)) +
  geom_line()
`summarise()` has grouped output by 'income'. You can override using the `.groups` argument.
If you do not want the message "`summarise()` has grouped output by 'income'. You can override using the `.groups` argument.", try the following, which adds .groups = "drop".
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
  filter(!income %in% c("Aggregates","Not classified")) %>%
  group_by(income, year) %>% summarize(fdi_mean = mean(fdi), .groups = "drop") %>%
  ggplot(aes(x=year,y=fdi_mean,color=income)) +
  geom_line()
geom_smooth with loess and span.
Do you see similarities and differences? We need to choose one or the other according to our objective, and explain the choice.
df_fdi %>% drop_na(fdi) %>% drop_na(income) %>%
  filter(!income %in% c("Aggregates","Not classified")) %>%
  ggplot(aes(x=year,y=fdi,color=income)) +
  geom_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)
It may be a good choice to use scale_y_log10(). However, since log10 is not defined for values that are not positive, you need to keep only the observations where the indicator is positive. Let us see how many non-positive values are in each income level.
df_fdi %>% filter(!income %in% c(NA, "Aggregates")) %>% filter(fdi <= 0) %>%
  ggplot(aes(x = income, fill = income)) + geom_bar() +
  labs(title = "Number of observations where FDI is not positive") +
  theme(legend.position = "none")
df_fdi %>% drop_na(income) %>% filter(fdi > 0) %>%
  filter(!income %in% c("Aggregates","Not classified")) %>%
  ggplot(aes(x=year,y=fdi,color=income)) +
  geom_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE) +
  scale_y_log10() + labs(title="Negative and Zero FDI Values Excluded")
Note. If this is the target chart, it may be better to check the number of NA values, zero values, negative values, and positive values in each income group. I add
mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA")))
in order to set the order of the labels. Please try the same without that line.
df_fdi %>% select(country, year, fdi, income) %>%
  filter(!income %in% c("Aggregates", NA)) %>%
  mutate(value = case_when(
    is.na(fdi) ~ "NA",  # fdi == NA is always NA, so test with is.na()
    fdi == 0 ~ "Zero",
    fdi < 0 ~ "Negative",
    fdi > 0 ~ "Positive")) %>%
  mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
  group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
  labs(x = "")
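One detail is worth checking when classifying values in this way: a comparison with NA is itself NA, so a condition written with == NA never matches, while is.na() does. The vector x below is made up for this small check.

```r
library(dplyr)

# Toy values (made up), including a missing one
x <- c(-2, 0, 3, NA)

# A comparison with NA is itself NA, so `x == NA` never selects anything
x == NA

# is.na() returns TRUE or FALSE and can be used as a case_when() condition
value <- case_when(
  is.na(x) ~ "NA",
  x == 0 ~ "Zero",
  x < 0 ~ "Negative",
  x > 0 ~ "Positive"
)
value
```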
vjust and hjust values to place the labels in appropriate places:
theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))
df_fdi %>% select(country, year, fdi, income) %>%
  filter(!income %in% c("Aggregates", NA)) %>%
  mutate(value = case_when(
    is.na(fdi) ~ "NA",  # fdi == NA is always NA, so test with is.na()
    fdi == 0 ~ "Zero",
    fdi < 0 ~ "Negative",
    fdi > 0 ~ "Positive")) %>%
  mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
  group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1)) +
  labs(x = "")
stringr, which is part of the tidyverse (the stringr:: prefix works whether or not the package is attached).
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15))
Change the width value to fit your chart. If stringr is attached, for example by library(stringr), then
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))
is enough.
df_fdi %>% select(country, year, fdi, income) %>%
  filter(!income %in% c("Aggregates", NA)) %>%
  mutate(value = case_when(
    is.na(fdi) ~ "NA",  # fdi == NA is always NA, so test with is.na()
    fdi == 0 ~ "Zero",
    fdi < 0 ~ "Negative",
    fdi > 0 ~ "Positive")) %>%
  mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
  group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15)) +
  labs(x = "")
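To see what str_wrap() does to the axis labels, it can be applied directly to a character vector. The vector labels below simply lists the four income names for this illustration.

```r
library(stringr)

# Income labels as they appear on the x axis
labels <- c("Low income", "Lower middle income", "Upper middle income", "High income")

# str_wrap() inserts "\n" where a label would otherwise exceed the width
wrapped <- str_wrap(labels, width = 15)
wrapped

# cat() shows how a wrapped label is displayed
cat(wrapped[2])
```

Short labels are left unchanged, and long labels are broken between words.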
\n for the line feed.
df_fdi %>% select(country, year, fdi, income) %>%
  filter(!income %in% c("Aggregates", NA)) %>%
  mutate(value = case_when(
    is.na(fdi) ~ "NA",  # fdi == NA is always NA, so test with is.na()
    fdi == 0 ~ "Zero",
    fdi < 0 ~ "Negative",
    fdi > 0 ~ "Positive")) %>%
  mutate(value = factor(value, levels = c("Positive", "Zero", "Negative", "NA"), labels = c("Positive", "Zero", "Negative", "NA"))) %>%
  group_by(income, value) %>% summarize(n = n(), .groups = "drop") %>%
  ggplot(aes(income, n, fill = value)) + geom_col(position="dodge") +
  scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 15)) +
  labs(title = "long long long long long long long \nlong long long title", x = "")
Step 1. If you want to use your own color palette, choose the codes or the color names from the following sites.
color_list <- c("#00AE9D","#F58220","#6C676E")
df_f1 <- read_excel("./data/WIR2022s.xlsx", sheet = "data-F1")
New names:
df_f1_rev <- pivot_longer(df_f1, -1, names_to = "group", values_to = "value")
df_f1_rev
geom_col(), change the default fill color using the list of colors in Step 1, and change the scale of the y axis into percents.
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge", width = 0.8) +
  scale_fill_manual(values = color_list) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = "")
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge", width = 0.8) +
  scale_fill_manual(values = color_list) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x="") +
  geom_text(aes(x = ...1, y = value, group = group,
                label = scales::label_percent(accuracy=1)(value)),
            position = position_dodge(0.8))
vjust = -0.2.
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge", width = 0.8) +
  scale_fill_manual(values = color_list) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x="") +
  geom_text(aes(x = ...1, y = value, group = group,
                label = scales::label_percent(accuracy=1)(value)), vjust = -0.2,
            position = position_dodge(0.8))
0.03 to the value of y by y = value+0.03. Great!
df_f1_rev[df_f1_rev$group != "Top 1%",] %>%
  ggplot(aes(x = ...1, y = value, fill = group)) +
  geom_col(position = "dodge", width = 0.8) +
  scale_fill_manual(values = color_list) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x="") +
  geom_text(aes(x = ...1, y = value+0.03, group = group,
                label = scales::label_percent(accuracy=1)(value)),
            position = position_dodge(0.8))
Please try as many different charts as possible. You can learn only by experience or from others.
year as a group?
df_wdi <- WDI(
  country = "all",
  indicator = c(lifeExp = "SP.DYN.LE00.IN"), start = 1990, extra = TRUE, cache = wdi_cache)
df_wdi
df_wdi %>%
  filter(year %in% c("1988", "1998", "2008", "2018")) %>%
  filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
  ggplot(aes(x=year)) +
  geom_boxplot(aes(y=lifeExp, fill=country))
I erased the second line, filter(year %in% c("1988", "1998", "2008", "2018")), but the result is very similar.
df_wdi %>%
  filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
  ggplot(aes(x=year)) +
  geom_boxplot(aes(y=lifeExp, fill=country))
If you look at the table, you can see that year is an integer vector, not a character vector. Then what happens if we remove the quotation marks? The next chart is not a box plot anymore, because for each year there is only one value for each country.
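The quoted years matched an integer column because %in% converts both sides to a common type before comparing. The short check below uses a made-up integer vector year in place of the real column.

```r
# An integer vector, standing in for the year column (made-up values)
year <- c(1998L, 2008L, 2018L)

# %in% coerces the integers to character, so quoted years still match
in_quoted <- year %in% c("1988", "1998", "2018")
in_quoted

# Without quotation marks the result is the same
in_plain <- year %in% c(1988, 1998, 2018)
in_plain

# factor() turns each distinct year into a category for a discrete axis
levels(factor(year))
```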
df_wdi %>%
  filter(year %in% c(1988, 1998, 2008, 2018)) %>%
  filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
  ggplot(aes(x=factor(year))) +
  geom_boxplot(aes(y=lifeExp, fill=country))
If we want to take year as a group after selecting some years, then we should try the next using factor(year). You can change the label of the x axis by labs(x = "year") easily. We should also notice that there are no values for 1988, because the data were downloaded with start = 1990. We should check such basic information first.
df_wdi %>%
  filter(year %in% c(1988, 1998, 2008, 2018)) %>%
  filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
  ggplot(aes(x=factor(year), y=lifeExp, fill=country)) +
  geom_col(position = "dodge", col = "black")
It is also possible if you change year to a character vector by mutate(year = as.character(year)).
df_wdi %>% mutate(year = as.character(year)) %>%
  filter(year %in% c("1998", "2008", "2018")) %>%
  filter(country %in% c("Afghanistan", "Israel", "Azerbaijan", "Austria", "Australia")) %>%
  ggplot(aes(x=year, y=lifeExp, fill=country)) +
  geom_col(position = "dodge", col = "black") +
  labs(x = "year")
Data of World Development Indicators are in a uniform format and downloadable using the R package WDI, so they are easy to handle. However, other data require transformation to make them tidy. We give a couple of examples. Most of the UN data are in CSV format, and you can get a link quickly or download a file by clicking. Though the data structure is not uniform, it is relatively easy to handle.
By the following, you can see that the first row is not the column name. R gives column names such as ...1, ...2, etc., when the column name is empty.
You can copy the link (URL) by right click or ctrl+click.
url_un_edu <- "https://data.un.org/_Docs/SYB/CSV/SYB65_309_202209_Education.csv"
un_edu <- read_csv(url_un_edu)
New names:
Rows: 7283 Columns: 7
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (7): T08, Enrolment in primary, secondary and tertiary education leve...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
un_edu
Let us skip the first row by adding skip = 1.
un_edu <- read_csv(url_un_edu, skip = 1)
New names:
Rows: 7282 Columns: 7
── Column specification ─────────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
un_edu
It is a large data set, and we need to check the values.
summary(un_edu)
Region/Country/Area ...2 Year Series
Min. : 1.0 Length:7282 Min. :2000 Length:7282
1st Qu.:178.0 Class :character 1st Qu.:2005 Class :character
Median :417.0 Mode :character Median :2010 Mode :character
Mean :408.8 Mean :2012
3rd Qu.:626.0 3rd Qu.:2015
Max. :894.0 Max. :2021
Value Footnotes Source
Min. : 0.0 Length:7282 Length:7282
1st Qu.: 71.6 Class :character Class :character
Median : 100.2 Mode :character Mode :character
Mean : 2534.7
3rd Qu.: 133.6
Max. :750125.0
We can see that Year ranges from 2000 to 2021. The first variable, Region/Country/Area, and the fifth variable, Value, are numerical (dbl and num in the column specification), and you can see this from the summary as well. But it is not easy to see the other variables. Let us try them one by one.
un_edu %>% distinct(...2)
un_edu %>% distinct(Series)
un_edu %>% distinct(Footnotes)
un_edu %>% distinct(Source)
df_un_edu <- un_edu %>%
  select(Region = ...2, Year, Series, Value)
df_un_edu
Is there a way to separate regions from countries?
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
  filter(!is.na(iso2c)) %>% distinct(Region)
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
  filter(is.na(iso2c)) %>% distinct(Region)
df_un_edu %>% left_join(wdi_cache$country, by = c("Region"="country")) %>%
  filter(is.na(iso2c)) %>% distinct(Region) %>% pull()
[1] "Total, all countries or areas" "Northern Africa"
[3] "Sub-Saharan Africa" "Northern America"
[5] "Latin America & the Caribbean" "Central Asia"
[7] "Eastern Asia" "South-eastern Asia"
[9] "Southern Asia" "Western Asia"
[11] "Europe" "Oceania"
[13] "Anguilla" "Bahamas"
[15] "Bolivia (Plurin. State of)" "China, Hong Kong SAR"
[17] "China, Macao SAR" "Congo"
[19] "Cook Islands" "Côte d’Ivoire"
[21] "Curaçao" "Dem. People's Rep. Korea"
[23] "Dem. Rep. of the Congo" "Egypt"
[25] "Gambia" "Holy See"
[27] "Iran (Islamic Republic of)" "Kyrgyzstan"
[29] "Lao People's Dem. Rep." "Micronesia (Fed. States of)"
[31] "Montserrat" "Netherlands Antilles [former]"
[33] "Niue" "Republic of Korea"
[35] "Republic of Moldova" "Saint Kitts and Nevis"
[37] "Saint Lucia" "Saint Vincent & Grenadines"
[39] "Slovakia" "State of Palestine"
[41] "Sudan [former]" "Tokelau"
[43] "Türkiye" "United Rep. of Tanzania"
[45] "United States of America" "Venezuela (Boliv. Rep. of)"
[47] "Viet Nam" "Yemen"
For some countries, iso2c is not assigned because the country names differ between the two data sets. In the list above, the first 12 are probably areas, and their values are aggregated values.
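As a possible alternative to left_join() followed by filter(is.na(iso2c)), dplyr's anti_join() returns the unmatched rows directly. The sketch below uses two small made-up tables, toy_un and toy_wdi, standing in for df_un_edu and wdi_cache$country.

```r
library(dplyr)

# Toy tables (made-up rows) standing in for df_un_edu and wdi_cache$country
toy_un <- tibble(Region = c("Europe", "Egypt", "Japan", "Viet Nam"))
toy_wdi <- tibble(
  country = c("Egypt, Arab Rep.", "Japan", "Vietnam"),
  iso2c   = c("EG", "JP", "VN")
)

# anti_join() keeps the rows of the first table that have no match in the second
unmatched <- toy_un %>%
  anti_join(toy_wdi, by = c("Region" = "country")) %>%
  pull(Region)
unmatched
```

The unmatched names mix true areas, such as Europe, with countries whose names are spelled differently in the two sources, so the list still has to be checked by eye.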
area <- df_un_edu %>% distinct(Region) %>% slice(1:12) %>% pull()
area
[1] "Total, all countries or areas" "Northern Africa"
[3] "Sub-Saharan Africa" "Northern America"
[5] "Latin America & the Caribbean" "Central Asia"
[7] "Eastern Asia" "South-eastern Asia"
[9] "Southern Asia" "Western Asia"
[11] "Europe" "Oceania"
un_edu_area <- df_un_edu %>% filter(Region %in% area)
un_edu_region <- df_un_edu %>% filter(!Region %in% area)
Now we can start studying the data.
un_edu_area %>%
  filter(Series %in% c("Gross enrollment ratio - Upper secondary level (male)", "Gross enrollment ratio - Upper secondary level (female)")) %>%
  ggplot(aes(Year, Value, color = Region, linetype = Series)) + geom_line()
un_edu_area %>%
  filter(Series %in% c("Gross enrollment ratio - Upper secondary level (male)", "Gross enrollment ratio - Upper secondary level (female)")) %>%
  pivot_wider(names_from = Series, values_from = Value) %>%
  mutate(Ratio = `Gross enrollment ratio - Upper secondary level (female)`/`Gross enrollment ratio - Upper secondary level (male)`) %>%
  ggplot(aes(Year, Ratio, color = Region, linetype = Region)) + geom_line() +
  labs(title = "Upper Secondary Level Education", subtitle = "Ratio = female/male")
The data structure is similar to the previous one. So use skip = 1, and check the variables briefly.
url_un_pop <- "https://data.un.org/_Docs/SYB/CSV/SYB65_246_202209_Population%20Growth,%20Fertility%20and%20Mortality%20Indicators.csv"
df_un_pop <- read.csv(url_un_pop, skip = 1)
df_un_pop
df_un_pop %>% distinct(Source)
df_un_pop %>% distinct(Footnotes)
df_un_pop %>% distinct(X)
df_un_pop %>% distinct(Series)
pop_area <- df_un_pop %>% distinct(X) %>% slice(1:30) %>% pull()
pop_area
[1] "Total, all countries or areas" "Africa"
[3] "Northern Africa" "Sub-Saharan Africa"
[5] "Eastern Africa" "Middle Africa"
[7] "Southern Africa" "Western Africa"
[9] "Northern America" "Latin America & the Caribbean"
[11] "Caribbean" "Central America"
[13] "South America" "Asia"
[15] "Central Asia" "Eastern Asia"
[17] "South-central Asia" "South-eastern Asia"
[19] "Southern Asia" "Western Asia"
[21] "Europe" "Eastern Europe"
[23] "Northern Europe" "Southern Europe"
[25] "Western Europe" "Oceania"
[27] "Australia and New Zealand" "Melanesia"
[29] "Micronesia" "Polynesia"
un_pop <- df_un_pop %>% select(Region = X, Year, Series, Value)
un_pop
Let us change the names of series.
un_pop_wide <- un_pop %>% pivot_wider(names_from = Series, values_from = Value)
colnames(un_pop_wide) <- c("Region", "Year", "IncRate", "Fert", "InfDeath", "MatDeath", "LifeExp", "LifeExpM", "LifeExpF")
un_pop_wide
un_pop_long <- un_pop_wide %>% pivot_longer(cols = -c(1,2), names_to = "Series", values_to = "Value")
un_pop_long
un_pop_long_area <- un_pop_long %>% filter(Region %in% pop_area)
un_pop_long_region <- un_pop_long %>% filter(!Region %in% pop_area)
un_pop_wide_area <- un_pop_wide %>% filter(Region %in% pop_area)
un_pop_wide_region <- un_pop_wide %>% filter(!Region %in% pop_area)
Now we can visualize data.
In the following, we explain how to download data using the R package wid. First, you need to install the package. However, it is not an official R package yet (it is distributed on GitHub); you need to use the package devtools to install it.
install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
I have not studied it fully, but you can download the data with the package wid. See here.
After installing the package, check the codebook of the
indicators. The following is not the ratio given in F8, but an
example.
library(wid)
wwealg <- download_wid(indicators = "wwealg", areas = "all", years = "all")
wwealp <- download_wid(indicators = "wwealp", areas = "all", years = "all")
public <- wwealg %>% select(country, year, public = value)
public
private <- wwealp %>% select(country, year, private = value)
private
public_vs_private <- public %>% left_join(private)
Joining, by = c("country", "year")
public_vs_private
df_pub_priv <- public_vs_private %>% pivot_longer(cols = c(3,4), names_to = "category", values_to = "value") %>% left_join(wdi_cache$country, by = c("country"="iso2c")) %>%
  select(country = country.y, iso2c = country, year, category, value, region, income, lending)
df_pub_priv
unique(df_pub_priv$country)
[1] "Andorra"
[2] "United Arab Emirates"
[3] "Afghanistan"
[4] "Antigua and Barbuda"
[5] NA
[6] "Albania"
[7] "Armenia"
[8] "Angola"
[9] "Argentina"
[10] "American Samoa"
[11] "Austria"
[12] "Australia"
[13] "Aruba"
[14] "Azerbaijan"
[15] "Bosnia and Herzegovina"
[16] "Barbados"
[17] "Bangladesh"
[18] "Belgium"
[19] "Burkina Faso"
[20] "Bulgaria"
[21] "Bahrain"
[22] "Burundi"
[23] "Benin"
[24] "Bermuda"
[25] "Brunei Darussalam"
[26] "Bolivia"
[27] "Brazil"
[28] "Bahamas, The"
[29] "Bhutan"
[30] "Botswana"
[31] "Belize"
[32] "Canada"
[33] "Congo, Dem. Rep."
[34] "Central African Republic"
[35] "Congo, Rep."
[36] "Switzerland"
[37] "Cote d'Ivoire"
[38] "Chile"
[39] "Cameroon"
[40] "China"
[41] "Colombia"
[42] "Costa Rica"
[43] "Cuba"
[44] "Cabo Verde"
[45] "Curacao"
[46] "Cyprus"
[47] "Czechia"
[48] "Germany"
[49] "Djibouti"
[50] "Denmark"
[51] "Dominica"
[52] "Dominican Republic"
[53] "Algeria"
[54] "Ecuador"
[55] "Estonia"
[56] "Egypt, Arab Rep."
[57] "Eritrea"
[58] "Spain"
[59] "Ethiopia"
[60] "Finland"
[61] "Fiji"
[62] "Micronesia, Fed. Sts."
[63] "France"
[64] "Gabon"
[65] "United Kingdom"
[66] "Grenada"
[67] "Georgia"
[68] "Ghana"
[69] "Greenland"
[70] "Gambia, The"
[71] "Guinea"
[72] "Equatorial Guinea"
[73] "Greece"
[74] "Guatemala"
[75] "Guam"
[76] "Guinea-Bissau"
[77] "Guyana"
[78] "Hong Kong SAR, China"
[79] "Honduras"
[80] "Croatia"
[81] "Haiti"
[82] "Hungary"
[83] "Indonesia"
[84] "Ireland"
[85] "Israel"
[86] "Isle of Man"
[87] "India"
[88] "Iraq"
[89] "Iran, Islamic Rep."
[90] "Iceland"
[91] "Italy"
[92] "Jamaica"
[93] "Jordan"
[94] "Japan"
[95] "Kenya"
[96] "Kyrgyz Republic"
[97] "Cambodia"
[98] "Kiribati"
[99] "Comoros"
[100] "St. Kitts and Nevis"
[101] "Korea, Dem. People's Rep."
[102] "Korea, Rep."
[103] "Kuwait"
[104] "Cayman Islands"
[105] "Kazakhstan"
[106] "Lao PDR"
[107] "Lebanon"
[108] "St. Lucia"
[109] "Liechtenstein"
[110] "Sri Lanka"
[111] "Liberia"
[112] "Lesotho"
[113] "Lithuania"
[114] "Luxembourg"
[115] "Latvia"
[116] "Libya"
[117] "Morocco"
[118] "Monaco"
[119] "Moldova"
[120] "Montenegro"
[121] "Madagascar"
[122] "Marshall Islands"
[123] "North Macedonia"
[124] "Mali"
[125] "Myanmar"
[126] "Mongolia"
[127] "Macao SAR, China"
[128] "Northern Mariana Islands"
[129] "Mauritania"
[130] "Malta"
[131] "Mauritius"
[132] "Maldives"
[133] "Malawi"
[134] "Mexico"
[135] "Malaysia"
[136] "Mozambique"
[137] "Namibia"
[138] "New Caledonia"
[139] "Niger"
[140] "Nigeria"
[141] "Nicaragua"
[142] "Netherlands"
[143] "Norway"
[144] "Nepal"
[145] "Nauru"
[146] "New Zealand"
[147] "OECD members"
[148] "Oman"
[149] "Panama"
[150] "Peru"
[151] "French Polynesia"
[152] "Papua New Guinea"
[153] "Philippines"
[154] "Pakistan"
[155] "Poland"
[156] "Puerto Rico"
[157] "West Bank and Gaza"
[158] "Portugal"
[159] "Palau"
[160] "Paraguay"
[161] "Qatar"
[162] "Romania"
[163] "Serbia"
[164] "Russian Federation"
[165] "Rwanda"
[166] "Saudi Arabia"
[167] "Solomon Islands"
[168] "Seychelles"
[169] "Sudan"
[170] "Sweden"
[171] "Singapore"
[172] "Slovenia"
[173] "Slovak Republic"
[174] "Sierra Leone"
[175] "San Marino"
[176] "Senegal"
[177] "Somalia"
[178] "Suriname"
[179] "South Sudan"
[180] "Sao Tome and Principe"
[181] "El Salvador"
[182] "Sint Maarten (Dutch part)"
[183] "Syrian Arab Republic"
[184] "Eswatini"
[185] "Turks and Caicos Islands"
[186] "Chad"
[187] "Togo"
[188] "Thailand"
[189] "Tajikistan"
[190] "Timor-Leste"
[191] "Turkmenistan"
[192] "Tunisia"
[193] "Tonga"
[194] "Turkiye"
[195] "Trinidad and Tobago"
[196] "Tuvalu"
[197] "Taiwan, China"
[198] "Tanzania"
[199] "Ukraine"
[200] "Uganda"
[201] "United States"
[202] "Uruguay"
[203] "Uzbekistan"
[204] "St. Vincent and the Grenadines"
[205] "Venezuela, RB"
[206] "British Virgin Islands"
[207] "Virgin Islands (U.S.)"
[208] "Vietnam"
[209] "Vanuatu"
[210] "Samoa"
[211] "IBRD only"
[212] "IDA only"
[213] "Least developed countries: UN classification"
[214] "Low income"
[215] "Lower middle income"
[216] "Yemen, Rep."
[217] "South Africa"
[218] "Zambia"
[219] "Zimbabwe"
df_pub_priv %>%
  filter(country %in% c("Japan", "Norway", "Sweden", "Denmark", "Finland"), year %in% 1970:2020) %>%
  ggplot(aes(year, value, color = country, linetype = category)) + geom_line()
We chose two indicators, 'wealg' and 'wealp', each prefixed with the one-letter code 'w'. WIR2022 indicators consist of 6 characters: a 1-letter code plus a 5-letter code. You can find the list in the codebook.
If you want to study WIR2022, please study the report, the codebook, and the wid vignette together with the R Notebook.
As I mentioned earlier, the data tables used in the report are available from the following page.