Module 4.5

Model Selection

Prework
  • Install the olsrr package: install.packages("olsrr") and read the documentation pertaining to variable selection
  • Run the following code to get set up for this module:
Code
library(tidyverse)
library(vdemlite)

# Load V-Dem data for 2019
model_data <- fetchdem(
  indicators = c(
  "v2x_libdem", 
  "e_gdppc", 
  "v2cacamps",
  "v2x_gender",
  "v2x_corr",
  "e_regionpol_6C"),
  start_year = 2006, 
  end_year = 2006
  ) |>
  rename(
    country = country_name, 
    lib_dem = v2x_libdem, 
    wealth = e_gdppc,
    polarization = v2cacamps,
    women_empowerment = v2x_gender,
    corruption = v2x_corr,
    region = e_regionpol_6C
    ) |>
  mutate(
    region = factor(
    region,
    labels = c(
      "Eastern Europe", 
      "Latin America", 
      "MENA", 
      "Sub-Saharan Africa", 
      "Western Europe & North America", 
      "Asia & Pacific")),
    log_wealth = log(wealth) # have to manually log-transform wealth for olsrr
    ) |>
  drop_na(lib_dem:region) # remove missing values for olsrr

#glimpse(model_data)

Overview

Now that we know how to run a regression we can talk about how to select the best model. In this module, we will explore the process of model selection in multiple regression contexts. Model selection involves choosing which predictor variables to include in a regression model to best explain the outcome variable while balancing complexity and interpretability.

This decision process involves navigating the fundamental bias-variance tradeoff in statistical modeling. Including more predictors can reduce bias by capturing important relationships between variables and the outcome. However, additional predictors also increase model variance by making the model more susceptible to overfitting to the specific sample under study. Conversely, models with fewer predictors may exhibit reduced variance but increased bias through the omission of important explanatory variables.

Several approaches can guide variable selection decisions. Theory-driven selection incorporates variables based on theoretical understanding of the phenomenon under investigation based on domain expertise. Statistical criteria employ measures such as AIC, BIC, or adjusted R-squared to balance model fit against complexity. Cross-validation assesses model performance on held-out data to evaluate generalizability.

Multiple regression modeling also requires consideration of several methodological concerns. Multicollinearity occurs when predictor variables exhibit high correlations with one another, complicating the separation of individual variable effects. Overfitting represents the tendency for models to fit training data too closely, resulting in poor generalization to new observations. Additionally, complex models present interpretation challenges, and larger numbers of predictors necessitate correspondingly larger sample sizes for reliable parameter estimation.

Model Evaluation Criteria

Model selection procedures require criteria for comparing alternative specifications. Each criterion approaches the balance between model fit and complexity differently, leading to potentially different optimal models.

Adjusted R-squared

Adjusted R-squared addresses a fundamental limitation of regular R-squared: the tendency for R-squared to increase mechanically with the addition of any predictor variable, regardless of that variable’s true explanatory value. The adjusted R-squared applies a penalty for additional parameters according to the formula:

\[R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}\]

where \(n\) represents the number of observations and \(k\) represents the number of predictor variables. Higher adjusted R-squared values indicate superior model performance, as they reflect greater explained variance while accounting for model complexity.

Akaike Information Criterion (AIC)

The Akaike Information Criterion provides an alternative approach to model evaluation grounded in information theory. AIC balances model fit against complexity through the formula:

\[AIC = -2(\text{log-likelihood}) + 2k\]

where \(k\) represents the number of parameters (including predictors and intercept) and log-likelihood measures model fit to the observed data. Lower AIC values indicate superior models, representing the opposite interpretation from R-squared measures. AIC typically favors slightly more complex models than adjusted R-squared due to its lighter penalty structure for additional parameters.

P-values as Selection Criteria

Recall that p-values represent the probability of observing a test statistic as extreme or more extreme than what was actually observed, assuming the null hypothesis is true. In regression contexts, p-values test whether individual regression coefficients differ significantly from zero. p-values have historically been employed in variable selection procedures, with common approaches involving the removal of variables with p-values exceeding predetermined thresholds (typically 0.05 or 0.10).

However, p-value-based variable selection introduces important methodological concerns. Stepwise selection increases the risk of a false positive (Type 1 error) by overfitting to noise. In the standard hypothesis testing framework featuring a five percent critical value, one out of 20 tests will yield a statistically significant result purely by chance. This risk is compounded in stepwise selection procedures that test multiple variables sequentially, as each test increases the likelihood of identifying spurious relationships.

Stepwise regression can also lead to false negatives (Type II error) by excluding variables that may have substantive theoretical importance but appear non-significant in intermediate steps.

Another concern is post-selection inference bias resulting from the fact that the same data are used both to select the model and to conduct inference. In this context, the reported p-values and confidence intervals no longer reflect valid hypothesis tests as they understate the true uncertainty associated with the selected model.

These limitations suggest that p-value-based selection should be approached with considerable caution, particularly compared to criteria like adjusted R-squared or AIC that explicitly account for the model selection process and do not suffer from the same multiple testing problems.

Modern Approaches to Model Selection

While stepwise selection remains widely employed, contemporary statistical practice increasingly favors alternative approaches that address some limitations of traditional methods.

Regularization methods such as LASSO and Ridge regression incorporate variable selection directly into the model fitting process. These techniques apply penalties to coefficient magnitudes, automatically shrinking less important predictors toward zero.

Cross-validation approaches evaluate model performance through systematic partitioning of data into training and testing subsets. This strategy provides more reliable assessments of generalizability compared to information criteria calculated on the same data used for model fitting.

Ensemble methods acknowledge model uncertainty by averaging predictions across multiple plausible models rather than selecting a single optimal specification. These approaches often demonstrate superior predictive performance compared to any individual model.

These advanced techniques are important to be aware of even though they exceed the scope of introductory regression courses.

Automated Stepwise Selection

Automated selection procedures can systematically implement variable selection strategies. This section demonstrates these approaches using data from the Varieties of Democracy (V-Dem) project to predict liberal democracy scores across countries. We will use the olsrr package, which provides a suite of functions for stepwise regression and variable selection.

Note

For stepwise selection to work, you must have no missing data in your dataset; the olsrr package will not work with missing data. If you look back at the setup code chunk, you will noticed that we have removed any rows with missing values in the model_data dataset using the drop_na() function from the tidyr package in the last line of the code.

Also, olssr cannot handle transformed variables, so we have to manually log-transform the wealth variable in the setup code chunk.

The analysis begins with estimation of a full model incorporating all available predictors:

full_model <- lm(lib_dem ~ log_wealth + polarization + women_empowerment + 
                 corruption + region, data = model_data)

summary(full_model)

Call:
lm(formula = lib_dem ~ log_wealth + polarization + women_empowerment + 
    corruption + region, data = model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43252 -0.06331  0.00253  0.07402  0.26311 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           0.125962   0.093258   1.351   0.1787    
log_wealth                           -0.001450   0.014026  -0.103   0.9178    
polarization                         -0.009154   0.008323  -1.100   0.2730    
women_empowerment                     0.640860   0.072556   8.833 1.61e-15 ***
corruption                           -0.344607   0.055068  -6.258 3.35e-09 ***
regionLatin America                   0.062175   0.033837   1.837   0.0680 .  
regionMENA                           -0.004997   0.043303  -0.115   0.9083    
regionSub-Saharan Africa             -0.011206   0.034987  -0.320   0.7491    
regionWestern Europe & North America  0.109096   0.039341   2.773   0.0062 ** 
regionAsia & Pacific                 -0.029083   0.036100  -0.806   0.4216    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1228 on 162 degrees of freedom
Multiple R-squared:  0.8055,    Adjusted R-squared:  0.7947 
F-statistic: 74.55 on 9 and 162 DF,  p-value: < 2.2e-16

The full model achieves an adjusted R-squared of 0.795, indicating that approximately 79.5 percent of the variance in democracy scores can be explained by the included predictors.

Forward Selection

Forward selection begins with an intercept-only model and sequentially adds predictors. At each step, the procedure selects the variable that produces the greatest improvement in the chosen criterion (adjusted R-squared in this example):

library(olsrr)

forward_model <- ols_step_forward_adj_r2(full_model)

forward_model

                                  Stepwise Summary                                   
-----------------------------------------------------------------------------------
Step    Variable               AIC         SBC         SBIC        R2       Adj. R2 
-----------------------------------------------------------------------------------
 0      Base Model             41.896      48.191    -449.065    0.00000    0.00000 
 1      women_empowerment    -131.886    -122.444    -622.069    0.64012    0.63801 
 2      corruption           -217.702    -205.112    -706.085    0.78402    0.78146 
 3      region               -224.442    -196.115    -720.133    0.80405    0.79568 
 4      polarization         -223.726    -192.251    -719.231    0.80550    0.79596 
-----------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.897       RMSE                  0.119 
R-Squared               0.806       MSE                   0.014 
Adj. R-Squared          0.796       Coef. Var            29.792 
Pred R-Squared          0.782       AIC                -223.726 
MAE                     0.090       SBC                -192.251 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression     10.111          8          1.264    84.382    0.0000 
Residual        2.441        163          0.015                     
Total          12.553        171                                    
--------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
                               model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------------------------------
                         (Intercept)     0.120         0.076                  1.580    0.116    -0.030     0.271 
                   women_empowerment     0.641         0.072        0.471     8.898    0.000     0.499     0.784 
                          corruption    -0.342         0.047       -0.387    -7.214    0.000    -0.435    -0.248 
                 regionLatin America     0.063         0.033        0.082     1.871    0.063    -0.003     0.129 
                          regionMENA    -0.005         0.043       -0.006    -0.128    0.899    -0.090     0.079 
            regionSub-Saharan Africa    -0.009         0.030       -0.016    -0.311    0.756    -0.069     0.050 
regionWestern Europe & North America     0.109         0.039        0.139     2.796    0.006     0.032     0.185 
                regionAsia & Pacific    -0.028         0.035       -0.038    -0.811    0.418    -0.096     0.040 
                        polarization    -0.009         0.008       -0.045    -1.105    0.271    -0.026     0.007 
-----------------------------------------------------------------------------------------------------------------

The forward selection results demonstrate the sequential addition process. Women’s empowerment enters first due to its highest individual contribution to adjusted R-squared, followed by corruption control, with the process continuing until no remaining variables provide meaningful improvements to model fit.

Backward Elimination

Backward elimination employs the opposite strategy, beginning with the full model and sequentially removing predictors. At each step, the procedure eliminates the variable whose removal least deteriorates model performance:

backward_model <- ols_step_backward_adj_r2(full_model)

backward_model

                               Stepwise Summary                               
----------------------------------------------------------------------------
Step    Variable        AIC         SBC         SBIC        R2       Adj. R2 
----------------------------------------------------------------------------
 0      Full Model    -221.737    -187.115    -717.119    0.80552    0.79471 
 1      log_wealth    -223.726    -192.251    -719.294    0.80550    0.79596 
----------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.897       RMSE                  0.119 
R-Squared               0.806       MSE                   0.014 
Adj. R-Squared          0.796       Coef. Var            29.792 
Pred R-Squared          0.782       AIC                -223.726 
MAE                     0.090       SBC                -192.251 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression     10.111          8          1.264    84.382    0.0000 
Residual        2.441        163          0.015                     
Total          12.553        171                                    
--------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
                               model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------------------------------
                         (Intercept)     0.120         0.076                  1.580    0.116    -0.030     0.271 
                        polarization    -0.009         0.008       -0.045    -1.105    0.271    -0.026     0.007 
                   women_empowerment     0.641         0.072        0.471     8.898    0.000     0.499     0.784 
                          corruption    -0.342         0.047       -0.387    -7.214    0.000    -0.435    -0.248 
                 regionLatin America     0.063         0.033        0.082     1.871    0.063    -0.003     0.129 
                          regionMENA    -0.005         0.043       -0.006    -0.128    0.899    -0.090     0.079 
            regionSub-Saharan Africa    -0.009         0.030       -0.016    -0.311    0.756    -0.069     0.050 
regionWestern Europe & North America     0.109         0.039        0.139     2.796    0.006     0.032     0.185 
                regionAsia & Pacific    -0.028         0.035       -0.038    -0.811    0.418    -0.096     0.040 
-----------------------------------------------------------------------------------------------------------------

Bidirectional Selection

Bidirectional selection combines forward and backward approaches, allowing both addition and removal of variables at each step. This method can potentially identify optimal models that pure forward or backward procedures might miss:

both_model <- ols_step_both_adj_r2(full_model)

both_model

                                    Stepwise Summary                                     
---------------------------------------------------------------------------------------
Step    Variable                   AIC         SBC         SBIC        R2       Adj. R2 
---------------------------------------------------------------------------------------
 0      Base Model                 41.896      48.191    -449.065    0.00000    0.00000 
 1      women_empowerment (+)    -131.886    -122.444    -622.069    0.64012    0.63801 
 2      corruption (+)           -217.702    -205.112    -706.085    0.78402    0.78146 
 3      region (+)               -224.442    -196.115    -720.133    0.80405    0.79568 
 4      polarization (+)         -223.726    -192.251    -719.231    0.80550    0.79596 
---------------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.897       RMSE                  0.119 
R-Squared               0.806       MSE                   0.014 
Adj. R-Squared          0.796       Coef. Var            29.792 
Pred R-Squared          0.782       AIC                -223.726 
MAE                     0.090       SBC                -192.251 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression     10.111          8          1.264    84.382    0.0000 
Residual        2.441        163          0.015                     
Total          12.553        171                                    
--------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
                               model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------------------------------
                         (Intercept)     0.120         0.076                  1.580    0.116    -0.030     0.271 
                   women_empowerment     0.641         0.072        0.471     8.898    0.000     0.499     0.784 
                          corruption    -0.342         0.047       -0.387    -7.214    0.000    -0.435    -0.248 
                 regionLatin America     0.063         0.033        0.082     1.871    0.063    -0.003     0.129 
                          regionMENA    -0.005         0.043       -0.006    -0.128    0.899    -0.090     0.079 
            regionSub-Saharan Africa    -0.009         0.030       -0.016    -0.311    0.756    -0.069     0.050 
regionWestern Europe & North America     0.109         0.039        0.139     2.796    0.006     0.032     0.185 
                regionAsia & Pacific    -0.028         0.035       -0.038    -0.811    0.418    -0.096     0.040 
                        polarization    -0.009         0.008       -0.045    -1.105    0.271    -0.026     0.007 
-----------------------------------------------------------------------------------------------------------------

Comparison of the full and stepwise-selected models reveals how little changes in explanatory power despite model simplification. The adjusted R-squared for the full model is 0.7947, while the adjusted R-squared for the model selected via stepwise procedures is 0.796. At the same time, the number of predictors decreases slightly, from 9 in the full model to 8 in the selected model.

This minimal difference in fit highlights an important point: variable selection procedures may reduce complexity but often yield models with similar performance. This highlights the relevance of domain expertise in making meaningful modeling decisions.

Incorporating Domain Knowledge

Statistical criteria alone provide insufficient guidance for model specification. Substantive theory and domain expertise must inform variable selection decisions. In democracy research, certain variables maintain theoretical importance regardless of their statistical significance in particular samples.

For example, modernization theory establishes wealth (GDP per capita) as a fundamental predictor of democratic development while comparative political analysis routinely includes regional controls to account for spatial diffusion effects and shared historical experiences.

In this scenario, we have strong reasons to believe that wealth and region are important predictors of democracy. We are, in other words, developing our model based on our theoretical understanding of the world, rather than simply relying on statistical criteria. Or, another way of stating it is that we are basing our model on our hypotheses about the outcome in question, rather than simply relying on statistical criteria.

To this end, the olsrr package accommodates theoretical constraints through the include argument, which forces retention of specified variables throughout the selection process:

theory_guided_model <- ols_step_both_adj_r2(full_model, include = c("log_wealth", "region"))

theory_guided_model

                                    Stepwise Summary                                     
---------------------------------------------------------------------------------------
Step    Variable                   AIC         SBC         SBIC        R2       Adj. R2 
---------------------------------------------------------------------------------------
 0      Base Model                -76.144     -50.964    -576.247    0.53048    0.51341 
 1      women_empowerment (+)    -180.455    -152.127    -678.110    0.74694    0.73614 
 2      corruption (+)           -222.457    -190.982    -718.039    0.80406    0.79445 
 3      polarization (+)         -221.737    -187.115    -717.119    0.80552    0.79471 
---------------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.898       RMSE                  0.119 
R-Squared               0.806       MSE                   0.014 
Adj. R-Squared          0.795       Coef. Var            29.883 
Pred R-Squared          0.779       AIC                -221.737 
MAE                     0.090       SBC                -187.115 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression     10.111          9          1.123    74.552    0.0000 
Residual        2.441        162          0.015                     
Total          12.553        171                                    
--------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
                               model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------------------------------
                         (Intercept)     0.126         0.093                  1.351    0.179    -0.058     0.310 
                          log_wealth    -0.001         0.014       -0.006    -0.103    0.918    -0.029     0.026 
                 regionLatin America     0.062         0.034        0.081     1.837    0.068    -0.005     0.129 
                          regionMENA    -0.005         0.043       -0.006    -0.115    0.908    -0.091     0.081 
            regionSub-Saharan Africa    -0.011         0.035       -0.019    -0.320    0.749    -0.080     0.058 
regionWestern Europe & North America     0.109         0.039        0.140     2.773    0.006     0.031     0.187 
                regionAsia & Pacific    -0.029         0.036       -0.039    -0.806    0.422    -0.100     0.042 
                   women_empowerment     0.641         0.073        0.471     8.833    0.000     0.498     0.784 
                          corruption    -0.345         0.055       -0.391    -6.258    0.000    -0.453    -0.236 
                        polarization    -0.009         0.008       -0.045    -1.100    0.273    -0.026     0.007 
-----------------------------------------------------------------------------------------------------------------

The mandatory inclusion of wealth and regional indicators alters the selection process, demonstrating how theoretical considerations should guide rather than merely supplement statistical procedures. In this case, the theory guided approach essentially brought us back to the full model, but in other contexts you may find that the model is more parsimonious while still retaining theoretically important variables.

Your Turn!!

Apply model selection using AIC as the evaluation criterion for the democracy data by using the ols_step_both_aic() function from the olsrr package.

  1. Perform bidirectional stepwise selection using AIC
  2. Compare the selected variables to those chosen by adjusted R-squared
  3. Calculate and compare AIC values for both the full model and your selected model
# Bidirectional selection using AIC
aic_model <- ols_step_both_aic(full_model)
aic_model

                                    Stepwise Summary                                     
---------------------------------------------------------------------------------------
Step    Variable                   AIC         SBC         SBIC        R2       Adj. R2 
---------------------------------------------------------------------------------------
 0      Base Model                 41.896      48.191    -449.065    0.00000    0.00000 
 1      women_empowerment (+)    -131.886    -122.444    -622.069    0.64012    0.63801 
 2      corruption (+)           -217.702    -205.112    -706.085    0.78402    0.78146 
 3      region (+)               -224.442    -196.115    -720.133    0.80405    0.79568 
---------------------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                           
----------------------------------------------------------------
R                       0.897       RMSE                  0.120 
R-Squared               0.804       MSE                   0.014 
Adj. R-Squared          0.796       Coef. Var            29.812 
Pred R-Squared          0.784       AIC                -224.442 
MAE                     0.090       SBC                -196.115 
----------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                 
--------------------------------------------------------------------
               Sum of                                               
              Squares         DF    Mean Square      F         Sig. 
--------------------------------------------------------------------
Regression     10.093          7          1.442    96.133    0.0000 
Residual        2.460        164          0.015                     
Total          12.553        171                                    
--------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
                               model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
-----------------------------------------------------------------------------------------------------------------
                         (Intercept)     0.135         0.075                  1.798    0.074    -0.013     0.283 
                   women_empowerment     0.640         0.072        0.470     8.874    0.000     0.498     0.783 
                          corruption    -0.359         0.045       -0.407    -8.050    0.000    -0.447    -0.271 
                 regionLatin America     0.064         0.033        0.083     1.900    0.059    -0.003     0.130 
                          regionMENA    -0.007         0.043       -0.008    -0.163    0.871    -0.092     0.078 
            regionSub-Saharan Africa    -0.013         0.030       -0.021    -0.430    0.668    -0.072     0.046 
regionWestern Europe & North America     0.110         0.039        0.141     2.822    0.005     0.033     0.186 
                regionAsia & Pacific    -0.031         0.035       -0.041    -0.884    0.378    -0.099     0.038 
-----------------------------------------------------------------------------------------------------------------
# Compare AIC values
cat("Full Model AIC:", round(AIC(full_model), 2), "\n")
Full Model AIC: -221.74 
cat("Selected Model AIC:", round(AIC(aic_model$model), 2), "\n")
Selected Model AIC: -224.44 
# Compare which variables were selected
cat("\nVariables in Adjusted R-squared model:", 
    paste(names(coef(both_model$model))[-1], collapse = ", "), "\n")

Variables in Adjusted R-squared model: women_empowerment, corruption, regionLatin America, regionMENA, regionSub-Saharan Africa, regionWestern Europe & North America, regionAsia & Pacific, polarization 
cat("Variables in AIC model:", 
    paste(names(coef(aic_model$model))[-1], collapse = ", "), "\n")
Variables in AIC model: women_empowerment, corruption, regionLatin America, regionMENA, regionSub-Saharan Africa, regionWestern Europe & North America, regionAsia & Pacific 

Additional Methodological Considerations

Multicollinearity

Multicollinearity occurs when predictor variables exhibit substantial correlations with one another. This condition creates several analytical problems including unstable coefficient estimates, inflated standard errors that reduce the power to detect significant relationships, and interpretation difficulties when attempting to isolate individual variable effects.

Variance Inflation Factors (VIF) provide a diagnostic tool for assessing multicollinearity severity:

We can use the ols_vif_tol() function from the olsrr package to calculate VIF values for the full model:

ols_vif_tol(full_model)
                             Variables Tolerance      VIF
1                           log_wealth 0.3106153 3.219416
2                         polarization 0.7322931 1.365573
3                    women_empowerment 0.4230084 2.364019
4                           corruption 0.3080812 3.245898
5                  regionLatin America 0.6160077 1.623356
6                           regionMENA 0.4755160 2.102979
7             regionSub-Saharan Africa 0.3557731 2.810780
8 regionWestern Europe & North America 0.4714850 2.120958
9                 regionAsia & Pacific 0.5080389 1.968353

Conventional guidelines suggest that VIF values exceeding 5-10 indicate problematic multicollinearity requiring remedial action. Here we do not see any variables exceeding that threshold so we should feel comfortable ruling out multicollinearity as a major concern in this model.

Note

Another common package for calculating VIF is car, which provides the vif() function. You can use it by simply loading car (after you have installed it of course!) and running vif(full_model).

Examination of correlations among continuous predictors provides additional insight into potential multicollinearity issues:

# Correlation matrix for numeric predictors
model_data |>
  select(log_wealth, polarization, women_empowerment, corruption) |>
  cor() |>
  round(3)
                  log_wealth polarization women_empowerment corruption
log_wealth             1.000       -0.381             0.340     -0.674
polarization          -0.381        1.000            -0.312      0.500
women_empowerment      0.340       -0.312             1.000     -0.594
corruption            -0.674        0.500            -0.594      1.000

Here we would be looking for correlations exceeding 0.7 or 0.8, which would indicate potential multicollinearity concerns. In this case, we do not see any such high correlations, again giving us confidence to proceed with the analysis.

Sample Size Requirements

Adequate sample sizes ensure reliable parameter estimation in multiple regression models. The conventional guideline recommends minimum ratios of 10-15 observations per predictor variable. The current analysis includes 172 observations and 9 predictors, yielding a ratio of 19.1:1, which meets standard adequacy thresholds.

Model Validation

Selected models require validation to assess their reliability and generalizability. Residual analysis examines whether regression assumptions hold in the fitted model. Cross-validation techniques test model performance on independent data subsets. Out-of-sample prediction provides the strongest test of model generalizability when feasible. Substantive interpretation ensures that statistical results align with theoretical expectations and domain knowledge. We will discuss some of these issues in greater detail in a subsequent module on regression diagnostics.

Summary

Model selection involves balancing competing objectives of explanatory power and parsimony. Multiple evaluation criteria exist, including adjusted R-squared, AIC, and p-values, each with distinct properties that can lead to different optimal models. Automated stepwise procedures provide systematic approaches to variable selection, though they should supplement rather than replace theoretical reasoning.

Domain knowledge plays a crucial role in model specification, often requiring the retention of theoretically important variables regardless of their statistical performance in particular samples. Additional methodological considerations include multicollinearity assessment, sample size adequacy, and model validation procedures.

The ultimate goal of model selection extends beyond maximizing statistical criteria to developing models that are simultaneously statistically sound and substantively meaningful. This dual objective requires integrating statistical techniques with theoretical understanding and domain expertise throughout the modeling process.