B Appendix B Public Data

B.1 Introduction

B.1.1 Open Data Defined by World Bank

The term “Open Data” has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.

  1. The data must be , which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
  2. The data must be , which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.

B.1.2 A List of Open Data Catalogue

B.1.2.1 International Institutions

B.1.2.3 Other Open Public Data

B.1.2.4 Financial Data

B.1.2.4.2 Quantmod

B.1.3 Examples

B.1.3.2 Hans Rosling (1948 – 2017)

Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute[4] and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (wikipedia)

B.1.3.2.1 Factfulness is … From the book

recognizing when a decision feels urgent and remembering that it rarely is.

To control the urgency instinct, take small steps.

  • Take a breath. When your urgency instinct is triggered, your other instincts kick in and your analysis shuts down. Ask for more time and more information. It’s rarely now or never and it’s rarely either/or.
  • Insist on the data. If something is urgent and important, it should be measured. Beware of data that is relevant but inaccurate, or accurate but irrelevant. Only relevant and accurate data is useful.
  • Beware of fortune-tellers. Any prediction about the future is uncertain. Be wary of predictions that fail to acknowledge that. Insist on a full range of scenarios, never just the best or worst case. Ask how often such predictions have been right before.
  • Be wary of drastic action. Ask what the side effects will be. Ask how the idea has been tested. Step-by-step practical improvements, and evaluation of their impact, are less dramatic but usually more effective.

B.2 World Bank

B.2.1 About

  • World Bank: https://www.worldbank.org
  • Who we are:
    • To end extreme poverty: By reducing the share of the global population that lives in extreme poverty to 3 percent by 2030.
    • To promote shared prosperity: By increasing the incomes of the poorest 40 percent of people in every country.
  • World Bank Open Data: https://data.worldbank.org

B.2.2 WDI - World Development Indicaters

  • World Development Indicators (WDI): the World Bank’s premier compilation of cross-country comparable data on development.
    • Poverty and Inequality
    • People
    • Environment
    • Economy
    • States and Markets
    • Global Links

B.2.3 R Package WDI

  • WDI: World Development Indicators and Other World Bank Data
  • Search and download data from over 40 databases hosted by the World Bank, including the World Development Indicators (‘WDI’), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators.
  • Version: 2.7.4
  • Materials: README - usage
    • NEWS - version history
  • Published: 2021-04-06
  • Reference manual: WDI.pdf

B.2.3.1 Function WDI: World Development Indicators (World Bank)

  • Description: Downloads the requested data by using the World Bank’s API, parses the resulting XML file, and formats it in long country-year format.
  • Usage
WDI(
  country = "all",
  indicator = "NY.GDP.PCAP.KD",
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
)
  • Arguments
    • country:
      Vector of countries (ISO-2 character codes, e.g. “BR”, “US”, “CA”) for which the data is needed. Using the string “all” instead of individual iso codes pulls data for every available country.

    • indicator:
      Character vector of indicators codes. See the WDIsearch() function. If you supply a named vector, the indicators will be automatically renamed: ‘c(’women_private_sector’ = ‘BI.PWK.PRVS.FE.ZS’)’

    • start: Start date, usually a year in integer format. Must be 1960 or greater.

    • end:
      End date, usually a year in integer format. Must be greater than the ‘start’ argument.

    • extra:
      TRUE returns extra variables such as region, iso3c code, and incomeLevel

    • cache:
      NULL (optional) a list created by WDIcache() to be used with the extra=TRUE argument

    • Value: Data frame with country-year observations. You can extract a data.frame with indicator names and descriptive labels by inspecting the label attribute of the resulting data.frame: attr(dat, 'label')

B.2.3.2 Function WDIsearch

Search names and descriptions of available WDI series

  • Description
    Data frame with series code, name, description, and source for the WDI series which match the given criteria

  • Usage
    WDIsearch(string = “gdp”, field = “name”, short = TRUE, cache = NULL)

  • Arguments

    • string:
      Character string. Search for this string using grep with ignore.case=TRUE.

    • field:
      Character string. Search this field. Admissible fields: ‘indicator’, ‘name’, ‘description’, ‘sourceDatabase’, ‘sourceOrganization’

    • short;
      TRUE: Returns only the indicator’s code and name. FALSE: Returns the indicator’s code, name, description, and source.

    • cache;
      Data list generated by the WDIcache function. If omitted, WDIsearch will search a local list of series.

    • Value; Data frame with code, name, source, and description of all series which match the criteria.

B.2.4 The First Example

B.2.4.1 Setup

In this R Notebook, we will use the following packages. Istall them before you compile this R Notebook.

  1. tidyverse Package, a collection of packages for data science
  2. WDI Package for World Development Indicators
library(tidyverse)
library(WDI)

B.2.4.2 GDP Per Capita

The following is taken from the usage.

WDI(
  country = "all",
  indicator = "NY.GDP.PCAP.KD",
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
)
WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", cache = NULL)
##               indicator                               name
## 11431    NY.GDP.PCAP.KD GDP per capita (constant 2015 US$)
## 11432 NY.GDP.PCAP.KD.ZG   GDP per capita growth (annual %)
WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = NULL)
##               indicator                               name
## 11431    NY.GDP.PCAP.KD GDP per capita (constant 2015 US$)
## 11432 NY.GDP.PCAP.KD.ZG   GDP per capita growth (annual %)
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          description
## 11431                                                                 GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.
## 11432 Annual percentage growth rate of GDP per capita based on constant local currency. GDP per capita is gross domestic product divided by midyear population. GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources.
##                     sourceDatabase
## 11431 World Development Indicators
## 11432 World Development Indicators
##                                                              sourceOrganization
## 11431 World Bank national accounts data, and OECD National Accounts data files.
## 11432 World Bank national accounts data, and OECD National Accounts data files.
df <- as_tibble(WDI(
  country = c("CN", "IN", "US", "ID","PK", "BR", "NG", "BD", "RU", "MX", "JP"), 
  indicator = "NY.GDP.PCAP.KD",
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
))
df
## # A tibble: 671 × 5
##    country    iso2c iso3c  year NY.GDP.PCAP.KD
##    <chr>      <chr> <chr> <int>          <dbl>
##  1 Bangladesh BD    BGD    2020          1620.
##  2 Bangladesh BD    BGD    2019          1582.
##  3 Bangladesh BD    BGD    2018          1481.
##  4 Bangladesh BD    BGD    2017          1395.
##  5 Bangladesh BD    BGD    2016          1323.
##  6 Bangladesh BD    BGD    2015          1248.
##  7 Bangladesh BD    BGD    2014          1185.
##  8 Bangladesh BD    BGD    2013          1130.
##  9 Bangladesh BD    BGD    2012          1078.
## 10 Bangladesh BD    BGD    2011          1024.
## # … with 661 more rows
ggplot(df) + 
  geom_line(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
  labs(title = "GDP per capita (constant 2010 US$)")
## Warning: Removed 29 rows containing missing values (`geom_line()`).

ggplot(df) + 
  geom_line(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
  scale_y_continuous(trans='log10') +
  labs(title = "GDP per capita (constant 2010 US$)", 
       subtitle = "Log10 Scale")
## Warning: Removed 29 rows containing missing values (`geom_line()`).

ggplot(df) + 
  geom_line(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
  scale_y_continuous(trans='log10') +
  facet_wrap(vars(country)) +
  labs(title = "GDP per capita (constant 2010 US$)", 
      subtitle = "Log10 Scale")
## Warning: Removed 29 rows containing missing values (`geom_line()`).

### More Examples

B.2.4.4 Population

WDIsearch(string = "Population, Total", field = "name", cache = NULL)
##            indicator                                            name
## 9659  JI.POP.URBN.ZS Urban population, total (% of total population)
## 17674    SP.POP.TOTL                               Population, total
WDIsearch(string = "SP.POP.TOTL", field = "indicator", cache = NULL)
##                indicator                                             name
## 17674        SP.POP.TOTL                                Population, total
## 17675  SP.POP.TOTL.FE.IN                               Population, female
## 17676  SP.POP.TOTL.FE.ZS       Population, female (% of total population)
## 17677    SP.POP.TOTL.ICP                       SP.POP.TOTL.ICP:Population
## 17678 SP.POP.TOTL.ICP.ZS SP.POP.TOTL.ICP.ZS:Population shares (World=100)
## 17679  SP.POP.TOTL.MA.IN                                 Population, male
## 17680  SP.POP.TOTL.MA.ZS         Population, male (% of total population)
## 17681     SP.POP.TOTL.ZS                          Population (% of total)

B.2.4.5 More Than One Indicator

dfp <- as_tibble(WDI(
  country = c("CN", "IN", "US", "ID","PK", "BR", "NG", "BD", "RU", "MX", "JP"), 
  indicator = c("NY.GDP.PCAP.KD","SP.POP.TOTL"),
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
))
dfp
## # A tibble: 671 × 6
##    country    iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.TOTL
##    <chr>      <chr> <chr> <int>          <dbl>       <dbl>
##  1 Bangladesh BD    BGD    1960           463.    48013505
##  2 Bangladesh BD    BGD    1961           478.    49362834
##  3 Bangladesh BD    BGD    1962           490.    50752150
##  4 Bangladesh BD    BGD    1963           474.    52202008
##  5 Bangladesh BD    BGD    1964           511.    53741721
##  6 Bangladesh BD    BGD    1965           504.    55385114
##  7 Bangladesh BD    BGD    1966           501.    57157651
##  8 Bangladesh BD    BGD    1967           476.    59034250
##  9 Bangladesh BD    BGD    1968           505.    60918452
## 10 Bangladesh BD    BGD    1969           497.    62679765
## # … with 661 more rows
ggplot(dfp) + 
  geom_line(aes(x = year, y = SP.POP.TOTL, color = country)) +
  labs(title = "Population, total")

ggplot(dfp) + 
  geom_line(aes(x = year, y = SP.POP.TOTL, color = country)) +
  scale_y_continuous(trans='log10') +
  labs(title = "Population, total", 
       subtitle = "Log10 Scale")

B.2.4.5.1 All Countries
df_gdp_all <- as_tibble(WDI(
  country = "all", 
  indicator = c("NY.GDP.PCAP.KD","SP.POP.TOTL"),
  start = 1960,
  end = 2020,
  extra = FALSE,
  cache = NULL,
  latest = NULL,
  language = "en"
))
df_gdp_all
## # A tibble: 16,226 × 6
##    country     iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.TOTL
##    <chr>       <chr> <chr> <int>          <dbl>       <dbl>
##  1 Afghanistan AF    AFG    1960             NA     8996967
##  2 Afghanistan AF    AFG    1961             NA     9169406
##  3 Afghanistan AF    AFG    1962             NA     9351442
##  4 Afghanistan AF    AFG    1963             NA     9543200
##  5 Afghanistan AF    AFG    1964             NA     9744772
##  6 Afghanistan AF    AFG    1965             NA     9956318
##  7 Afghanistan AF    AFG    1966             NA    10174840
##  8 Afghanistan AF    AFG    1967             NA    10399936
##  9 Afghanistan AF    AFG    1968             NA    10637064
## 10 Afghanistan AF    AFG    1969             NA    10893772
## # … with 16,216 more rows
df_gdp_all %>% filter(year == 2020) %>% arrange(desc(SP.POP.TOTL))
## # A tibble: 266 × 6
##    country                    iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.TOTL
##    <chr>                      <chr> <chr> <int>          <dbl>       <dbl>
##  1 World                      1W    "WLD"  2020         10549.  7763932702
##  2 IDA & IBRD total           ZT    "IBT"  2020          5017.  6571053159
##  3 Low & middle income        XO    "LMY"  2020          4862.  6494812232
##  4 Middle income              XP    "MIC"  2020          5341.  5811279241
##  5 IBRD only                  XF    "IBD"  2020          6290.  4862446431
##  6 Early-demographic dividend V2    "EAR"  2020          3374.  3332103561
##  7 Lower middle income        XN    ""     2020          2297.  3318682068
##  8 Upper middle income        XT    ""     2020          9395.  2492597173
##  9 East Asia & Pacific        Z4    "EAS"  2020         11136.  2361517682
## 10 Late-demographic dividend  V3    "LTE"  2020          9711.  2316803603
## # … with 256 more rows
df_gdp_all %>% filter(year == 2020) %>% arrange(NY.GDP.PCAP.KD)
## # A tibble: 266 × 6
##    country                  iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.TOTL
##    <chr>                    <chr> <chr> <int>          <dbl>       <dbl>
##  1 Burundi                  BI    BDI    2020           271.    11890781
##  2 Malawi                   MW    MWI    2020           394.    19129955
##  3 Central African Republic CF    CAF    2020           415.     4829764
##  4 Madagascar               MG    MDG    2020           442.    27691019
##  5 Somalia                  SO    SOM    2020           445.    15893219
##  6 Congo, Dem. Rep.         CD    COD    2020           505.    89561404
##  7 Niger                    NE    NER    2020           523.    24206636
##  8 Afghanistan              AF    AFG    2020           530.    38928341
##  9 Mozambique               MZ    MOZ    2020           575.    31255435
## 10 Liberia                  LR    LBR    2020           616.     5057677
## # … with 256 more rows
df_gdp_all %>% filter(year == 2020) %>% arrange(desc(NY.GDP.PCAP.KD))
## # A tibble: 266 × 6
##    country        iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.TOTL
##    <chr>          <chr> <chr> <int>          <dbl>       <dbl>
##  1 Monaco         MC    MCO    2020        159222.       39244
##  2 Luxembourg     LU    LUX    2020        104879.      630419
##  3 Bermuda        BM    BMU    2020         99729.       63893
##  4 Switzerland    CH    CHE    2020         85685.     8636561
##  5 Ireland        IE    IRL    2020         78733.     4985674
##  6 Cayman Islands KY    CYM    2020         77959.       65720
##  7 Norway         NO    NOR    2020         75017.     5379475
##  8 Singapore      SG    SGP    2020         58982.     5685807
##  9 United States  US    USA    2020         58060.   331501080
## 10 Australia      AU    AUS    2020         58030.    25693267
## # … with 256 more rows
df_gdp_all %>% filter(year == 2020) %>% mutate(GDP = NY.GDP.PCAP.KD * SP.POP.TOTL) %>% arrange(desc(GDP))
## # A tibble: 266 × 7
##    country                   iso2c iso3c  year NY.GDP.PCAP.KD SP.POP.T…¹     GDP
##    <chr>                     <chr> <chr> <int>          <dbl>      <dbl>   <dbl>
##  1 World                     1W    "WLD"  2020         10549. 7763932702 8.19e13
##  2 High income               XD    ""     2020         40336. 1240684527 5.00e13
##  3 OECD members              OE    "OED"  2020         35869. 1372980201 4.92e13
##  4 Post-demographic dividend V4    "PST"  2020         41227. 1117278019 4.61e13
##  5 IDA & IBRD total          ZT    "IBT"  2020          5017. 6571053159 3.30e13
##  6 Low & middle income       XO    "LMY"  2020          4862. 6494812232 3.16e13
##  7 Middle income             XP    "MIC"  2020          5341. 5811279241 3.10e13
##  8 IBRD only                 XF    "IBD"  2020          6290. 4862446431 3.06e13
##  9 East Asia & Pacific       Z4    "EAS"  2020         11136. 2361517682 2.63e13
## 10 Upper middle income       XT    ""     2020          9395. 2492597173 2.34e13
## # … with 256 more rows, and abbreviated variable name ¹​SP.POP.TOTL

B.2.4.6 Gender

as_tibble(WDIsearch(string = "Gender", field = "name", cache = NULL))
## # A tibble: 382 × 2
##    indicator            name                                                    
##    <chr>                <chr>                                                   
##  1 2.3_GIR.GPI          "Gender parity index for gross intake ratio in grade 1" 
##  2 2.6_PCR.GPI          "Gender parity index for primary completion rate "      
##  3 5.51.01.07.gender    "Gender equality"                                       
##  4 BI.EMP.PWRK.PB.FE.ZS "Public sector employment, as a share of paid employmen…
##  5 BI.EMP.PWRK.PB.MA.ZS "Public sector employment, as a share of paid employmen…
##  6 BI.EMP.TOTL.PB.FE.ZS "Public sector employment, as a share of total employme…
##  7 BI.EMP.TOTL.PB.MA.ZS "Public sector employment, as a share of total employme…
##  8 BI.WAG.PREM.PB.FE    "Public sector wage premium, by gender: Female (compare…
##  9 BI.WAG.PREM.PB.FM    "P-Value: Public sector wage premium, by gender (compar…
## 10 BI.WAG.PREM.PB.FM.ED "P-Value: Gender wage premium in the public sector, by …
## # … with 372 more rows
as_tibble(WDIsearch(string = "Females", field = "name", cache = NULL))
## # A tibble: 69 × 2
##    indicator            name                                                    
##    <chr>                <chr>                                                   
##  1 BI.PWK.PRVS.CK.FE.ZS Females, as a share of private paid employees by occupa…
##  2 BI.PWK.PRVS.EO.FE.ZS Females, as a share of private paid employees by occupa…
##  3 BI.PWK.PRVS.FE.Q1.ZS Females, as a share of private paid employees by wage q…
##  4 BI.PWK.PRVS.FE.Q2.ZS Females, as a share of private paid employees by wage q…
##  5 BI.PWK.PRVS.FE.Q3.ZS Females, as a share of private paid employees by wage q…
##  6 BI.PWK.PRVS.FE.Q4.ZS Females, as a share of private paid employees by wage q…
##  7 BI.PWK.PRVS.FE.Q5.ZS Females, as a share of private paid employees by wage q…
##  8 BI.PWK.PRVS.FE.ZS    Females, as a share of private paid employees           
##  9 BI.PWK.PRVS.PN.FE.ZS Females, as a share of private paid employees by occupa…
## 10 BI.PWK.PRVS.SN.FE.ZS Females, as a share of private paid employees by occupa…
## # … with 59 more rows
WDIsearch(string = "BI.PWK.PRVS.FE.ZS", field = "indicator", cache = NULL)
##              indicator                                          name
## 1636 BI.PWK.PRVS.FE.ZS Females, as a share of private paid employees
WDIsearch(string = "BI.PWK.PUBS.FE.ZS", field = "indicator", cache = NULL)
##              indicator                                         name
## 1659 BI.PWK.PUBS.FE.ZS Females, as a share of public paid employees
# Rename indicators on the fly
dfwe <- as_tibble(WDI(country = c("CN", "IN", "US", "ID","PK", "BR", "NG", "BD", "RU", "MX", "JP"), 
               indicator = c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS',
                                  'women_public_sector' = 'BI.PWK.PUBS.FE.ZS')))
dfwe
## # A tibble: 231 × 6
##    country    iso2c iso3c  year women_private_sector women_public_sector
##    <chr>      <chr> <chr> <int>                <dbl>               <dbl>
##  1 Bangladesh BD    BGD    2000               NA                  NA    
##  2 Bangladesh BD    BGD    2001               NA                  NA    
##  3 Bangladesh BD    BGD    2002               NA                  NA    
##  4 Bangladesh BD    BGD    2003                0.193               0.231
##  5 Bangladesh BD    BGD    2004               NA                  NA    
##  6 Bangladesh BD    BGD    2005               NA                  NA    
##  7 Bangladesh BD    BGD    2006               NA                  NA    
##  8 Bangladesh BD    BGD    2007               NA                  NA    
##  9 Bangladesh BD    BGD    2008               NA                  NA    
## 10 Bangladesh BD    BGD    2009               NA                  NA    
## # … with 221 more rows
summary(dfwe, country)
##    country             iso2c              iso3c                year     
##  Length:231         Length:231         Length:231         Min.   :2000  
##  Class :character   Class :character   Class :character   1st Qu.:2005  
##  Mode  :character   Mode  :character   Mode  :character   Median :2010  
##                                                           Mean   :2010  
##                                                           3rd Qu.:2015  
##                                                           Max.   :2020  
##                                                                         
##  women_private_sector women_public_sector
##  Min.   :0.08965      Min.   :0.1285     
##  1st Qu.:0.30555      1st Qu.:0.4159     
##  Median :0.40629      Median :0.5508     
##  Mean   :0.35629      Mean   :0.4679     
##  3rd Qu.:0.46033      3rd Qu.:0.5760     
##  Max.   :0.58404      Max.   :0.6583     
##  NA's   :153          NA's   :152
dfwe %>% group_by(country) %>% select(2,4,5,6) %>%
  summarize(private_n = sum(is.na(women_private_sector)), 
            private = mean(is.na(women_private_sector)), 
            public_n = sum(is.na(women_public_sector)), 
            public = mean(is.na(women_public_sector)))
## Adding missing grouping variables: `country`
## # A tibble: 11 × 5
##    country       private_n private public_n public
##    <chr>             <int>   <dbl>    <int>  <dbl>
##  1 Bangladesh           17   0.810       17  0.810
##  2 Brazil                6   0.286        6  0.286
##  3 China                19   0.905       19  0.905
##  4 India                17   0.810       17  0.810
##  5 Indonesia            20   0.952       20  0.952
##  6 Japan                21   1           21  1    
##  7 Mexico                5   0.238        5  0.238
##  8 Nigeria              19   0.905       18  0.857
##  9 Pakistan             13   0.619       13  0.619
## 10 Russia                4   0.190        4  0.190
## 11 United States        12   0.571       12  0.571

B.2.5 World Bank Country and Lending Groups and an Option extra = TRUE

B.2.5.1 Review Basics: World Development Indicators: ?WDI

  1. Basic Usage
WDI(country = "all", indicator = "NY.GDP.PCAP.KD")
  1. Vector Notation
WDI(country = c("US", "CN", JP"), # ISO-2 codes
   indicator = c("gdp_pcap" = "NY.GDP.PCAP.KD", 
                "life_exp" = "SP.DYN.LE00.IN"))
  1. Use Extra
WDI(country = "all", 
   indicator = c("gdp_pcap" = NY.GDP.PCAP.KD", 
                "life_exp" = "SP.DYN.LE00.IN"), extra = TRUE)

extra: TRUE returns extra variables such as region, iso3c code, and incomeLevel

B.2.5.2 World Bank Country and Lending Groups

B.2.5.2.1 About CLASS.xls

This table classifies all World Bank member countries (189), and all other economies with populations of more than 30,000. For operational and analytical purposes, economies are divided among income groups according to 2019 gross national income (GNI) per capita, calculated using the World Bank Atlas method. The groups are:

  • low income, $1,035 or less;
  • lower middle income, $1,036 - 4,045;
  • upper middle income, $4,046 - 12,535;
  • and high income, $12,536 or more.

The effective operational cutoff for IDA eligibility is $1,185 or less.

B.2.5.2.2 Geographic classifications
  • IBRD: International Bank for Reconstruction and Development
  • IDA: International Development Association

 

  • IDA countries are those that lack the financial ability to borrow from IBRD.

  • IDA credits are deeply concessional—interest-free loans and grants for programs aimed at boosting economic growth and improving living conditions.

  • IBRD loans are noncessional.

  • Blend countries are eligible for IDA credits because of their low per capita incomes but are also eligible for IBRD because they are financially creditworthy.

B.2.5.2.3 Note
  • The term country, used interchangeably with economy, does not imply political independence but refers to any territory for which authorities report separate social or economic statistics.

  • Income classifications set on 1 July 2020 remain in effect until 1 July 2021.

  • Argentina, which was temporarily unclassified in July 2016 pending release of revised national accounts statistics, was classified as upper middle income for FY17 as of 29 September 2016 based on alternative conversion factors. Also effective 29 September 2016, Syrian Arab Republic is reclassified from IBRD lending category to IDA-only.

  • On 29 March 2017, new country codes were introduced to align World Bank 3-letter codes with ISO 3-letter codes: Andorra (AND), Dem. Rep. Congo (COD), Isle of Man (IMN), Kosovo (XKX), Romania (ROU), Timor-Leste (TLS), and West Bank and Gaza (PSE).

B.2.5.3 Importing Excel Files

url_class <- "https://databankfiles.worldbank.org/data/download/site-content/CLASS.xlsx"
download.file(url = url_class, destfile = "data/CLASS.xlsx")
B.2.5.3.1 Countries

Let us look at the first sheet.

  1. The column names are in the 5th row.
  2. The country data starts from the 7th row.
  3. Zimbabue is at the last row.
library(readxl)
wb_countries_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, skip = 0, n_max =219) 
wb_countries <- wb_countries_tmp %>% 
  select(country = Economy, iso3c = Code, region = Region, income = `Income group`, lending = "Lending category", other = "Other (EMU or HIPC)")
wb_countries
## # A tibble: 218 × 6
##    country              iso3c region                     income    lending other
##    <chr>                <chr> <chr>                      <chr>     <chr>   <chr>
##  1 Aruba                ABW   Latin America & Caribbean  High inc… <NA>    <NA> 
##  2 Afghanistan          AFG   South Asia                 Low inco… IDA     HIPC 
##  3 Angola               AGO   Sub-Saharan Africa         Lower mi… IBRD    <NA> 
##  4 Albania              ALB   Europe & Central Asia      Upper mi… IBRD    <NA> 
##  5 Andorra              AND   Europe & Central Asia      High inc… <NA>    <NA> 
##  6 United Arab Emirates ARE   Middle East & North Africa High inc… <NA>    <NA> 
##  7 Argentina            ARG   Latin America & Caribbean  Upper mi… IBRD    <NA> 
##  8 Armenia              ARM   Europe & Central Asia      Upper mi… IBRD    <NA> 
##  9 American Samoa       ASM   East Asia & Pacific        Upper mi… <NA>    <NA> 
## 10 Antigua and Barbuda  ATG   Latin America & Caribbean  High inc… IBRD    <NA> 
## # … with 208 more rows
B.2.5.3.2 Regions
  1. Regions start from the 227th row.
  2. Regions end at the 272th row.
wb_regions_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, skip = 0, n_max =266) %>% 
  slice(-(1:220))
wb_regions <- wb_regions_tmp %>% 
  select(region = Economy, iso3c = Code) %>% drop_na()
wb_regions
## # A tibble: 45 × 2
##    region                                        iso3c
##    <chr>                                         <chr>
##  1 Caribbean small states                        CSS  
##  2 Central Europe and the Baltics                CEB  
##  3 Early-demographic dividend                    EAR  
##  4 East Asia & Pacific                           EAS  
##  5 East Asia & Pacific (excluding high income)   EAP  
##  6 East Asia & Pacific (IDA & IBRD)              TEA  
##  7 Euro area                                     EMU  
##  8 Europe & Central Asia                         ECS  
##  9 Europe & Central Asia (excluding high income) ECA  
## 10 Europe & Central Asia (IDA & IBRD)            TEC  
## # … with 35 more rows

Let us look at the second sheet.

wb_groups_tmp <- read_excel("data/CLASS.xlsx", sheet = "Groups") # sheet = 3
wb_groups <- wb_groups_tmp %>% 
  select(gcode = GroupCode, group = GroupName, iso3c = CountryCode, country = CountryName)

B.2.5.4 Filtering Join

Description

gdp_pcap <- WDI(country = "all", indicator = "NY.GDP.PCAP.KD") 
head(gdp_pcap)
##                       country iso2c iso3c year NY.GDP.PCAP.KD
## 1 Africa Eastern and Southern    ZH   AFE 2021       1477.249
## 2 Africa Eastern and Southern    ZH   AFE 2020       1452.730
## 3 Africa Eastern and Southern    ZH   AFE 2019       1534.890
## 4 Africa Eastern and Southern    ZH   AFE 2018       1544.078
## 5 Africa Eastern and Southern    ZH   AFE 2017       1546.796
## 6 Africa Eastern and Southern    ZH   AFE 2016       1548.813
gdp_pcap_extra <- WDI(country = "all", indicator = "NY.GDP.PCAP.KD", extra = TRUE)
head(gdp_pcap_extra)
##       country iso2c iso3c year NY.GDP.PCAP.KD status lastupdated     region
## 1 Afghanistan    AF   AFG 2015       556.0072         2022-09-16 South Asia
## 2 Afghanistan    AF   AFG 2011       511.9985         2022-09-16 South Asia
## 3 Afghanistan    AF   AFG 2014       565.1793         2022-09-16 South Asia
## 4 Afghanistan    AF   AFG 2013       568.9645         2022-09-16 South Asia
## 5 Afghanistan    AF   AFG 2012       557.9497         2022-09-16 South Asia
## 6 Afghanistan    AF   AFG 2007       392.7105         2022-09-16 South Asia
##   capital longitude latitude     income lending
## 1   Kabul   69.1761  34.5228 Low income     IDA
## 2   Kabul   69.1761  34.5228 Low income     IDA
## 3   Kabul   69.1761  34.5228 Low income     IDA
## 4   Kabul   69.1761  34.5228 Low income     IDA
## 5   Kabul   69.1761  34.5228 Low income     IDA
## 6   Kabul   69.1761  34.5228 Low income     IDA
  • Compare the following:

    • 16,226 rows = 13,054 rows (country) + 3172 rows (not country)
gdp_pcap_country <- gdp_pcap %>% semi_join(wb_countries, by = "country")
head(gdp_pcap_country)
##       country iso2c iso3c year NY.GDP.PCAP.KD
## 1 Afghanistan    AF   AFG 2021             NA
## 2 Afghanistan    AF   AFG 2020       529.7412
## 3 Afghanistan    AF   AFG 2019       555.1390
## 4 Afghanistan    AF   AFG 2018       546.7430
## 5 Afghanistan    AF   AFG 2017       553.3551
## 6 Afghanistan    AF   AFG 2016       552.9969
gdp_pcap_extra %>% filter(region != "Aggregates") %>% slice(10)
##       country iso2c iso3c year NY.GDP.PCAP.KD status lastupdated     region
## 1 Afghanistan    AF   AFG 2003         332.22         2022-09-16 South Asia
##   capital longitude latitude     income lending
## 1   Kabul   69.1761  34.5228 Low income     IDA
  • What is the difference?

    • 13,176 rows is not equal to 13,054 rows + 183 rows
gdp_pcap_extra %>% filter(region != "Aggregates") %>% anti_join(gdp_pcap_country) %>% slice(10)
## Joining, by = c("country", "iso2c", "iso3c", "year", "NY.GDP.PCAP.KD")
##         country iso2c iso3c year NY.GDP.PCAP.KD status lastupdated
## 1 Cote d'Ivoire    CI   CIV 2003       1566.577         2022-09-16
##               region      capital longitude latitude              income
## 1 Sub-Saharan Africa Yamoussoukro   -4.0305    5.332 Lower middle income
##   lending
## 1     IDA
gdp_pcap %>% anti_join(wb_countries) %>% slice(10)
## Joining, by = c("country", "iso3c")
##                       country iso2c iso3c year NY.GDP.PCAP.KD
## 1 Africa Eastern and Southern    ZH   AFE 2012        1513.37

Note: World Bank provides the WDI_csv for more information.

  1. WDICountry-Series.csv - 965,9KB
  2. WDICountry.csv - 125.8KB
  3. WDIData.csv - 193 MB
  4. WDIFootNote.csv - 61.1MB
  5. WDISeries-Time.csv - 46.1KB
  6. WDISeries.csv” - 3.5MB

B.2.5.5 Join Tables

There are three types of joining tables. Commands are from tidyverse packages though there is a way to do the same by Base R with appropriate arguments.

  1. Bind rows: bind_rows(), intersect(), setdiff(), union()
  2. Bind columns: bind_cols(), left_join(), right_join(), inner_join(), full_join()
  3. Filtering join: semi_join(), anti_join()

B.2.5.6 Join Tables: Quick References

  1. https://r4ds.had.co.nz/relational-data.html#relational-data

  2. Cheatsheet: Data Transformation, pages 2 and 3. You can download it from RStudio > Help.

  3. Tidyverse Homepage:

    1. Efficiently bind multiple data frames by row and column: bind_rows(), bind_cols()
    2. Set operations: intersect(), setdiff(), union()
    3. Mutating joins: left_join(), right_join(), inner_join(), full_join()
    4. Filtering joins: semi_join(), anti_join()
  4. R Studio Primers: Tidy Your Data – r4ds: Wrangle, II

    • Reshape Data, Separate and Unite Columns, Join Data Sets

B.3 United Nations

` * UN Data: https://data.un.org - Datamarts: http://data.un.org/Explorer.aspx

B.3.1 Importing Data

  1. Get the URL (uniform resource locator) - copy the link
  • url_of_data <- "https://data.un.org/--long url--.csv"
  1. Download the file into the destfile in data folder:
  • download.file(url = url_of_data, destfile = "data/un_pop.csv")
  1. Read the file:
  • df_un_pop <- read_csv("data/un_pop.csv")

Alternative, skip 2 and read the file using the URL. * read_csv(url)

B.3.2 Example

B.3.2.1 Population

The followind do not look line the pdf above.

df_un_pop <- read_csv("https://data.un.org/_Docs/SYB/CSV/SYB64_1_202110_Population,%20Surface%20Area%20and%20Density.csv", skip = 1)
## New names:
## Rows: 7260 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "," chr (4): ...2, Series, Footnotes, Source dbl (2): Region/Country/Area, Year num (1): Value
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types =
## FALSE` to quiet this message.
## • `` -> `...2`
head(df_un_pop)
## # A tibble: 6 × 7
##   `Region/Country/Area` ...2                    Year Series Value Footn…¹ Source
##                   <dbl> <chr>                  <dbl> <chr>  <dbl> <chr>   <chr> 
## 1                     1 Total, all countries …  2010 Popul… 6957. <NA>    Unite…
## 2                     1 Total, all countries …  2010 Popul… 3508. <NA>    Unite…
## 3                     1 Total, all countries …  2010 Popul… 3449. <NA>    Unite…
## 4                     1 Total, all countries …  2010 Sex r…  102. <NA>    Unite…
## 5                     1 Total, all countries …  2010 Popul…   27  <NA>    Unite…
## 6                     1 Total, all countries …  2010 Popul…   11  <NA>    Unite…
## # … with abbreviated variable name ¹​Footnotes
colnames(df_un_pop)
## [1] "Region/Country/Area" "...2"                "Year"               
## [4] "Series"              "Value"               "Footnotes"          
## [7] "Source"
un_pop_tbl <- df_un_pop %>% select(num = "Region/Country/Area", region = "...2", year = "Year", series = "Series", value = "Value") %>% 
  pivot_wider(names_from = series, values_from = value) 
head(un_pop_tbl)
## # A tibble: 6 × 11
##     num region      year Popul…¹ Popul…² Popul…³ Sex r…⁴ Popul…⁵ Popul…⁶ Popul…⁷
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1 Total, al…  2010   6957.   3508.   3449.   102.     27      11      53.5
## 2     1 Total, al…  2015   7380.   3721.   3659.   102.     26.2    12.2    56.7
## 3     1 Total, al…  2019   7713.   3889.   3824.   102.     25.6    13.2    59.3
## 4     1 Total, al…  2021   7875.   3970.   3905.   102.     25.3    13.7    60.5
## 5     2 Africa      2010   1039.    518.    521.    99.5    41.5     5.1    35.1
## 6     2 Africa      2015   1182.    590.    592.    99.7    41.1     5.3    39.9
## # … with 1 more variable: `Surface area (thousand km2)` <dbl>, and abbreviated
## #   variable names ¹​`Population mid-year estimates (millions)`,
## #   ²​`Population mid-year estimates for males (millions)`,
## #   ³​`Population mid-year estimates for females (millions)`,
## #   ⁴​`Sex ratio (males per 100 females)`,
## #   ⁵​`Population aged 0 to 14 years old (percentage)`,
## #   ⁶​`Population aged 60+ years old (percentage)`, ⁷​`Population density`
colnames(un_pop_tbl) <- c("num", "region", "year", "total", "male", "female", "ratio", "0-14", "60+", "density", "area")
head(un_pop_tbl)
## # A tibble: 6 × 11
##     num region         year total  male female ratio `0-14` `60+` density   area
##   <dbl> <chr>         <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl>
## 1     1 Total, all c…  2010 6957. 3508.  3449. 102.    27    11      53.5     NA
## 2     1 Total, all c…  2015 7380. 3721.  3659. 102.    26.2  12.2    56.7 136162
## 3     1 Total, all c…  2019 7713. 3889.  3824. 102.    25.6  13.2    59.3 130094
## 4     1 Total, all c…  2021 7875. 3970.  3905. 102.    25.3  13.7    60.5     NA
## 5     2 Africa         2010 1039.  518.   521.  99.5   41.5   5.1    35.1     NA
## 6     2 Africa         2015 1182.  590.   592.  99.7   41.1   5.3    39.9  30311