Macroeconomic Trends Explorer

Authors

Ephrata Getachew, Aika Shorayeva, Ebony Wamwitha

Published

December 13, 2023

Introduction

Understanding what drives a country’s development is an ongoing challenge in macroeconomics. Building on the work we completed in the middle of the semester, our final project explores macroeconomic patterns across many nations. This time, we focus on the key indicators behind the Human Development Index (HDI): life expectancy at birth, expected and mean years of schooling, and Gross National Income (GNI) per capita. Using unsupervised learning, we cluster countries on these HDI indicators to reveal patterns that cut across geographic borders. Our main goal is to predict the HDI category a given nation may fall into and to compare the results with the official HDI ranking, which lets us evaluate how well our clustering performs. We also develop a predictive model for countries’ Gross Domestic Product (GDP) using a combination of the HDI indicators. Splitting the data into training and testing sets allows us to assess prediction accuracy on data the model has not seen.

k-Means Clustering over HDI Indicators

HDI (Human Development Index), a commonly used measure of each country’s social and economic development, is determined on the basis of the following factors:

  1. Health
  2. Education
  3. Standard of Living

HDI is calculated as the geometric mean of the three dimension indices. Countries are ranked by their HDI value and placed into 4 categories: very high (0.800 and above), high (0.700 to 0.799), medium (0.550 to 0.699), and low human development (below 0.550).
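
As a rough illustration of the calculation just described, the sketch below takes the geometric mean of three dimension indices and assigns a category from the cutoffs. The function names `hdi_from_indices` and `hdi_category` and the index values are hypothetical, made up for this example; they are not taken from UNDP data or from our project code.

```r
# Geometric mean of the three dimension indices (each assumed to lie in [0, 1]).
hdi_from_indices <- function(health, education, income) {
  (health * education * income)^(1 / 3)
}

# Assign one of the four development categories from an HDI value.
# right = FALSE closes intervals on the left, e.g. [0.80, Inf) is "Very high".
hdi_category <- function(hdi) {
  cut(
    hdi,
    breaks = c(-Inf, 0.55, 0.70, 0.80, Inf),
    labels = c("Low", "Medium", "High", "Very high"),
    right  = FALSE
  )
}

# Hypothetical index values for three made-up countries.
hdi <- hdi_from_indices(
  health    = c(0.95, 0.80, 0.50),
  education = c(0.90, 0.65, 0.40),
  income    = c(0.92, 0.70, 0.45)
)

data.frame(hdi = round(hdi, 3), category = hdi_category(hdi))
```

Because the geometric mean multiplies the indices, a low score in one dimension pulls the overall value down more than it would under a simple average.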

For our cluster analysis, we tried to reproduce this categorization by different means: we clustered all of the countries in 3-dimensional space over three variables (Expected Years of Schooling, Life Expectancy, and GNI), chosen to represent the three factors listed above. The visualization is available at the link below.

3-Dimensional Clustering Visualization
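
The clustering step described above can be sketched in a few lines of base R. This is a minimal sketch rather than our exact pipeline: the data are simulated, the column names are placeholders, and both the standardisation step and the choice of k = 4 (to mirror the four HDI categories) are assumptions made for illustration.

```r
set.seed(42)  # k-means starts from random centres; fix the seed for reproducibility

# Simulated stand-in for the country-level data (values are not real).
n <- 60
countries <- data.frame(
  exp_yrs_sch = runif(n, 5, 20),
  life_exp    = runif(n, 50, 85),
  gni         = runif(n, 1000, 70000)
)

# Standardise each variable so that GNI, measured in the tens of thousands,
# does not dominate the Euclidean distances (an assumption of this sketch).
scaled <- scale(countries)

# k = 4 is assumed here to mirror the four HDI categories.
fit <- kmeans(scaled, centers = 4, nstart = 25)

countries$cluster <- factor(fit$cluster)
table(countries$cluster)  # number of countries per cluster
```

Using `nstart = 25` reruns the algorithm from several random starting points and keeps the best solution, which makes the result less sensitive to a poor initial choice of centres.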

Cluster Mapping

While the 3-dimensional visualization helps us make sense of how the clusters were formed, a map makes it easier to see which countries our model grouped together.

Interpretation

While there are some distinct patterns (e.g., Northern American and European countries being grouped together) that seem to coincide with the categories generated by UNDP, we also observe many inconsistencies between the two lists. There are several potential reasons for that:

  • Different Methodologies
    As mentioned previously, the HDI value of each country is calculated as a geometric mean of the variables of interest. UNDP also applies strict cutoff points to place countries into the different categories of human development. Meanwhile, rather than categorizing countries along a single linear scale, k-Means clustering over multiple variables identifies distinct clusters of data points in space, which might not always coincide with the list generated by the previous method.
  • Different Variables
    For research purposes, we identified 3 variables of interest (Expected Years of Schooling, Life Expectancy, and GNI) to represent the various facets of human development. UNDP uses the same variables, but it often adds further variants of them (e.g., mean years of schooling or the rate of education completion) to capture a fuller picture.
  • Other Factors
    There might also be a series of other factors, not disclosed to the public, that UNDP considers whilst making the ranking and that we have not accounted for.
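
One simple way to quantify the mismatch between the two lists discussed above is a cross-tabulation of cluster labels against HDI categories. The sketch below uses a small hypothetical data frame (`comparison`, with made-up labels for eight countries) and computes cluster purity: each cluster is matched to its most common category, and we count the share of countries that fall in the matched category. This is offered as one possible measure, not the comparison we actually ran.

```r
# Hypothetical labels for eight countries: cluster ids and HDI categories.
comparison <- data.frame(
  cluster  = c(1, 1, 2, 2, 3, 3, 4, 4),
  category = c("Very high", "Very high", "High", "Medium",
               "Medium", "Medium", "Low", "High")
)

# Rows are clusters, columns are HDI categories.
xtab <- table(comparison$cluster, comparison$category)
xtab

# Cluster ids are arbitrary, so match each cluster to its most common
# category and count the countries that fall in that category.
matched <- apply(xtab, 1, max)
purity  <- sum(matched) / sum(xtab)
purity
```

A purity of 1 would mean that every cluster contains countries from a single HDI category; lower values indicate more mixing between the two groupings.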

Predictive Model

A linear regression model was developed using the macro_trends_imp dataset to predict Gross Domestic Product (GDP). The model leverages three key predictors: Gross National Income (gni), Years of Schooling (yrs_sch), and Life Expectancy (life_exp). The dataset was split into a training set, encompassing observations up to the year 2018, and a test set comprising data from 2019 onwards. The assessment of the model’s performance involved analyzing prediction errors and conducting linear regression analyses for each predictor.
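
The split and model fit described above can be sketched as follows. The data here are simulated so that the example runs on its own; the `year` column name is an assumption, and the simulated coefficients have no connection to the real macro_trends_imp values.

```r
set.seed(1)

# Simulated stand-in for macro_trends_imp; the `year` column name is assumed.
n <- 200
macro_trends_imp <- data.frame(
  year     = sample(2000:2022, n, replace = TRUE),
  gni      = runif(n, 1000, 60000),
  yrs_sch  = runif(n, 5, 20),
  life_exp = runif(n, 50, 85)
)
macro_trends_imp$gdp <- 0.9 * macro_trends_imp$gni +
  500 * macro_trends_imp$yrs_sch + rnorm(n, sd = 3000)

# Chronological split: fit on earlier years, evaluate on later ones.
train <- subset(macro_trends_imp, year <= 2018)
test  <- subset(macro_trends_imp, year >= 2019)

# Same formula as in the model summary reported in this section.
model <- lm(gdp ~ gni + yrs_sch + life_exp, data = train)
summary(model)
```

Splitting by year rather than at random means the test set checks how well the model extrapolates to a later period, which is a harder task than predicting held-out observations from the same years.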


Call:
lm(formula = gdp ~ gni + yrs_sch + life_exp, data = train)

Residuals:
   Min     1Q Median     3Q    Max 
-31222  -3511  -2281   -899 170532 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2151.41052 1866.52861   1.153 0.249131    
gni            0.97019    0.01218  79.658  < 2e-16 ***
yrs_sch      770.97389   97.00764   7.948 2.44e-15 ***
life_exp    -127.21633   35.96436  -3.537 0.000409 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12710 on 4119 degrees of freedom
Multiple R-squared:  0.7058,    Adjusted R-squared:  0.7056 
F-statistic:  3294 on 3 and 4119 DF,  p-value: < 2.2e-16

In the linear regression analysis on GDP, gni and yrs_sch have significant positive coefficients (about 0.97 and 771, respectively), while life_exp has a smaller but still significant negative coefficient (about -127). The R-squared value of 0.7058 indicates that the model explains approximately 70.58% of the variability in GDP in the training data.

The linear regression model has not accurately captured the relationship between GDP and GNI, likely because of outliers in the training data: the largest residual in the summary above is 170,532, against a residual standard error of 12,710. The plot suggests that the model struggled to fit the extreme values on the left side, which may skew its predictions.

Table

The model struggles to predict GDP accurately for several reasons. Economies are complex: they can be unpredictable and are easily influenced by external factors, which makes precise modeling difficult. Limited data for some countries makes it harder to establish strong connections between the predictors and GDP, resulting in less dependable predictions. Moreover, each country has unique characteristics, such as regional dependencies and different economic structures, that the chosen predictors might not fully capture.
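
To make the idea of prediction error concrete, the sketch below fits the same formula on simulated data and summarises the errors on a held-out set. The helper `make_data` and all values are hypothetical; in our project the `train` and `test` sets would come from the year-based split described earlier.

```r
set.seed(2)

# Hypothetical helper that simulates data with the same column names.
make_data <- function(n) {
  d <- data.frame(
    gni      = runif(n, 1000, 60000),
    yrs_sch  = runif(n, 5, 20),
    life_exp = runif(n, 50, 85)
  )
  d$gdp <- 0.9 * d$gni + 500 * d$yrs_sch + rnorm(n, sd = 3000)
  d
}
train <- make_data(300)
test  <- make_data(80)

model <- lm(gdp ~ gni + yrs_sch + life_exp, data = train)

# Signed error: positive values mean the model over-predicts GDP.
test$pred  <- predict(model, newdata = test)
test$error <- test$pred - test$gdp

c(
  mean_error = mean(test$error),          # bias (over- or under-prediction)
  rmse       = sqrt(mean(test$error^2)),  # penalises large errors more heavily
  mae        = mean(abs(test$error))      # typical absolute error
)
```

The mean signed error shows whether predictions lean too high or too low on average, while RMSE and MAE describe how large the errors are regardless of direction.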

Additionally, we considered only three variables in the model, leaving out many other factors. The model also treats each predictor as making a separate, additive contribution, but in the real world these predictors are often interdependent. For instance, a higher Gross National Income (GNI) might raise both Years of Schooling (yrs_sch) and Life Expectancy (life_exp). Neglecting such interdependencies can lead to an oversimplified model that fails to capture the dynamics of the economic system. To improve accuracy, future models should consider a broader set of variables and account for the interconnections between predictors.
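
A quick way to check how strongly the predictors overlap is to look at their pairwise correlations and a variance inflation factor (VIF). The sketch below uses simulated predictors in which schooling and life expectancy are constructed to rise with GNI, so the numbers illustrate the diagnostic only and say nothing about the real data.

```r
set.seed(3)

# Simulated predictors: schooling and life expectancy both rise with GNI.
n <- 150
gni      <- runif(n, 1000, 60000)
yrs_sch  <- 6 + gni / 5000 + rnorm(n, sd = 1.5)
life_exp <- 55 + gni / 2500 + rnorm(n, sd = 3)
predictors <- data.frame(gni, yrs_sch, life_exp)

# Pairwise Pearson correlations; values near 1 or -1 signal strong overlap.
round(cor(predictors), 2)

# VIF for gni: 1 / (1 - R^2) from regressing it on the other predictors.
# Larger values indicate that gni is largely explained by the others.
r2_gni  <- summary(lm(gni ~ yrs_sch + life_exp, data = predictors))$r.squared
vif_gni <- 1 / (1 - r2_gni)
vif_gni
```

When predictors are highly correlated, individual coefficient estimates become unstable and can even change sign, so diagnostics like these are worth running before interpreting each coefficient on its own.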

Conclusion

By clustering countries with unsupervised learning and predicting GDP with supervised learning, our project offered insight into the dynamics of economic development. A notable observation, however, is that our model tends to overpredict GDP for the test years (2019 onwards), likely because of the unprecedented global disruption caused by the COVID-19 pandemic. This suggests that the model should be improved to better account for unanticipated shocks, and that more timely data may be needed for greater accuracy.

Suggestions for the future

Our investigation of macroeconomic patterns using Human Development Index (HDI) indicators opens up several areas for further study. Future versions of the project could use more sophisticated algorithms and broaden the dataset to cover a wider range of socioeconomic factors. Addressing the effects of external shocks such as the COVID-19 pandemic should also be a priority, so that future models can adjust to unforeseen disruptions.
