Supervised Learning

Overview of Supervised Learning

General overview

Supervised learning is a type of machine learning that fits a model to predict the values of a labeled target variable. The purpose of applying supervised learning to this data is to gain insight into how strongly the features are related to one another and whether they can be used to estimate another variable’s value. Three types of supervised learning are performed in the following sections: binary classification, multi-class classification, and regression. The model selection methods, evaluation metrics, and results are discussed within each of those sections.

Below is an outline of general information applied to all models:

Data preprocessing

To be used in supervised learning, the data has to be cleaned, processed, and standardized following specific criteria. The following techniques were applied to the data for use in the models throughout this section:

Handling missing data - Missing data points were imputed with the mean (or median, depending on the model) of the other data points in that feature. This option was chosen because the dataset is too small to drop rows with missing values in certain columns; imputation preserves the sample size with a reasonable approximation.

Handling categorical variables - Categorical variables (Subregion and prostitution policy) were one-hot encoded so that they could be represented numerically and included in the supervised learning models.

 - **What is one-hot encoding?**: a technique that converts categorical data into numerical indicator variables (0 or 1). Each level of the categorical variable becomes its own column; if a row belongs to a given level, a 1 is placed in that level's column and 0s in the columns for the remaining levels.
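
As a minimal illustration of this encoding, here is pandas' get_dummies (the same function used in the preprocessing code below) applied to a hypothetical toy column, not the project data:

import pandas as pd

# Toy example: one categorical column with two levels
toy = pd.DataFrame({'Subregion': ['Western Europe', 'Eastern Europe', 'Western Europe']})
encoded = pd.get_dummies(toy, prefix='Subregion', dtype=int)
print(encoded)
# Each level becomes its own 0/1 indicator column
# (Subregion_Eastern Europe, Subregion_Western Europe)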

Standardization - The data was scaled using scikit-learn’s StandardScaler, which applies z-score standardization so that all features are on a comparable scale (mean 0, unit variance).
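
A minimal sketch of what StandardScaler does, on hypothetical values: each feature x is transformed to z = (x - mean) / standard deviation, column by column.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature matrix on very different scales
X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)
# After scaling, each column has mean 0 and unit variance
print(X_scaled.mean(axis=0))  # ~[0., 0.]
print(X_scaled.std(axis=0))   # ~[1., 1.]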

Testing and training strategy

All models in this section were split into training and testing sets using scikit-learn’s train_test_split (from sklearn.model_selection) with test_size=0.2, assigning roughly 80% of the data to training and 20% to testing.
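
A minimal sketch of this split on hypothetical data (the random_state of 6547 matches the seed used in the models below):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels, just to illustrate the 80/20 proportions
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.repeat([0, 1], 25)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6547
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)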

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('../../data/processed-data/MLdataset.csv')
df.columns
Index(['Country', 'Detected_victims_sum', 'Detected_victims_slope',
       'Convicted_traffickers_sum', 'Convicted_traffickers_slope',
       'Prosecuted_traffickers_sum', 'Prosecuted_traffickers_slope',
       'Convictions_over_prosecutions_mean', 'Number_Repatriated_Victims_sum',
       'Tier_mean', 'Tier_slope', 'Criminal_justice_mean',
       'Criminal_justice_slope', 'GDP_per_capita_mean', 'GDP_per_capita_slope',
       'Political_stability_mean', 'Political_stability_slope',
       'Population_mean', 'Refugee_population_mean',
       'Refugee_population_slope', 'Unemployment_rate_mean',
       'Unemployment_rate_slope', 'Detections_per_100_mean',
       'Reported_Years_Detected_Victims', 'Subregion',
       'Nonpunishment_policy_before2021', 'Nonpunishment_policy_after2021',
       'Limited_nonpunishment_policy', 'Post_soviet_states', 'EU_members',
       'Tier_2024', 'Visa_free_destinations', 'GSI_gov_response_score',
       'prostitution_policy'],
      dtype='object')
df.dtypes
Country                                object
Detected_victims_sum                  float64
Detected_victims_slope                float64
Convicted_traffickers_sum             float64
Convicted_traffickers_slope           float64
Prosecuted_traffickers_sum            float64
Prosecuted_traffickers_slope          float64
Convictions_over_prosecutions_mean    float64
Number_Repatriated_Victims_sum        float64
Tier_mean                             float64
Tier_slope                            float64
Criminal_justice_mean                 float64
Criminal_justice_slope                float64
GDP_per_capita_mean                   float64
GDP_per_capita_slope                  float64
Political_stability_mean              float64
Political_stability_slope             float64
Population_mean                       float64
Refugee_population_mean               float64
Refugee_population_slope              float64
Unemployment_rate_mean                float64
Unemployment_rate_slope               float64
Detections_per_100_mean               float64
Reported_Years_Detected_Victims         int64
Subregion                              object
Nonpunishment_policy_before2021       float64
Nonpunishment_policy_after2021        float64
Limited_nonpunishment_policy          float64
Post_soviet_states                    float64
EU_members                            float64
Tier_2024                             float64
Visa_free_destinations                float64
GSI_gov_response_score                float64
prostitution_policy                    object
dtype: object

Binary Classification

In this section, I model the features that are important for predicting whether or not a country has passed legislation specifying a victim non-punishment policy. I will train, evaluate, and compare the following models:

Model selection:

  • Logistic Regression: a method that classifies binary variables by predicting the likelihood that a given input belongs to one of two predefined categories, using information from the other features in the data
    • Uses a sigmoid function to map input features to binary outcomes (0 or 1); see the sketch after this list
    • Inclusion justification: this method was chosen because it is one of the most fundamental methods for binary classification due to its simplicity, interpretability, and high performance on linearly separable data
  • Random Forest: an ensemble method that builds multiple decision trees and combines their predictions through majority voting (or averaging of predicted probabilities)
    • Each tree is trained on a random subset of data and features, reducing overfitting and improving generalization
    • Inclusion justification: it is one of the most common and fundamental supervised learning algorithms and has many strengths, including flexibility, accuracy, and reduced overfitting through bagging (aggregating results across bootstrap samples)
  • Naive Bayes: a method that makes a “naive” assumption of feature independence and uses Bayes’ theorem to predict class membership probabilities
    • Inclusion justification: this method is known to work well on small training sets and high-dimensional, complex data, which characterizes the data used for this project.
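
A minimal sketch of the sigmoid mapping used by logistic regression (the scores here are hypothetical; in the fitted model they come from a learned linear combination of the features):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores w·x + b for three observations
scores = np.array([-3.0, 0.0, 2.5])
probs = sigmoid(scores)
print(probs)                       # ~[0.047, 0.5, 0.924]
print((probs >= 0.5).astype(int))  # [0 1 1] after thresholding at 0.5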

Evaluation metrics:

The following evaluation metrics were calculated for each model and compared:

  • Accuracy
  • Precision
  • Recall
  • F1 score
  • ROC-AUC
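
These metrics all follow from the confusion-matrix counts (TP, TN, FP, FN). A minimal sketch with hypothetical labels; note that ROC-AUC is computed from predicted probabilities rather than hard class labels, as in the code below.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Accuracy = (TP + TN) / total; Precision = TP / (TP + FP)
# Recall = TP / (TP + FN);      F1 = harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))   # 0.625
print(precision_score(y_true, y_pred))  # 0.6
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # ~0.667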

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns

# prepare data by one hot encoding categorical variables
def prepare_data_with_categorical(df, target_variable):
    # dummy variables
    categorical_columns = ['Subregion', 'prostitution_policy']
    dummy_dfs = []
    
    for col in categorical_columns:
        if col in df.columns:
            dummy_df = pd.get_dummies(df[col], prefix=col)
            dummy_dfs.append(dummy_df)
    
    # Combine numerical and dummy variables
    numerical_df = df.select_dtypes(include=['float64', 'int64'])
    X = pd.concat([numerical_df] + dummy_dfs, axis=1)

    columns_to_drop = [target_variable, 'Country']
    X = X.drop([col for col in columns_to_drop if col in X.columns], axis=1)
    
    # Handle NAs
    imputer = SimpleImputer(strategy='median')
    X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
    
    # Standardize 
    numerical_cols = numerical_df.columns
    numerical_cols = [col for col in numerical_cols if col in X.columns]
    scaler = StandardScaler()
    X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
    
    return X, numerical_cols

def train_and_evaluate_models(X_train, X_test, y_train, y_test):
    # Initialize models
    models = {
        'Logistic Regression': LogisticRegression(random_state=6547),
        'Random Forest': RandomForestClassifier(random_state=6547),
        'Naive Bayes': GaussianNB()
    }
    
    results = []
    feature_importances = {}
    
    for name, model in models.items():
        # Train
        model.fit(X_train, y_train)
        
        # Predict
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        
        # evaluation metrics
        metrics = {
            'Model': name,
            'Accuracy': accuracy_score(y_test, y_pred),
            'Precision': precision_score(y_test, y_pred),
            'Recall': recall_score(y_test, y_pred),
            'F1 Score': f1_score(y_test, y_pred),
            'ROC-AUC': roc_auc_score(y_test, y_pred_proba)
        }
        results.append(metrics)
        
        # feature importance
        if name == 'Logistic Regression':
            importance = np.abs(model.coef_[0])
        elif name == 'Random Forest':
            importance = model.feature_importances_
        elif name == 'Naive Bayes':
            # Proxy for importance: absolute difference in per-feature variance between the two classes
            importance = np.abs(model.var_[0] - model.var_[1])

        feature_importances[name] = importance
    
    return pd.DataFrame(results), feature_importances

def plot_model_comparison(results):
    metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
    
    plt.figure(figsize=(12, 6))
    x = np.arange(len(metrics))
    width = 0.2
    
    for i, model in enumerate(results['Model']):
        values = results.loc[results['Model'] == model, metrics].values[0]
        plt.bar(x + i*width, values, width, label=model)
    
    plt.xlabel('Metrics')
    plt.ylabel('Score')
    plt.title('Model Performance Comparison')
    plt.xticks(x + width*1.5, metrics, rotation=45)
    plt.legend()
    plt.tight_layout()
    plt.show()

# function to plot the feature importance
def plot_feature_importance(X, feature_importances, top_n=10):
    for model_name, importance in feature_importances.items():
        feat_imp = pd.DataFrame({
            'Feature': X.columns,
            'Importance': importance
        })
        feat_imp = feat_imp.sort_values('Importance', ascending=False).head(top_n)
        
        plt.figure(figsize=(10, 6))
        sns.barplot(data=feat_imp, x='Importance', y='Feature')
        plt.title(f'Top {top_n} Feature Importance - {model_name}')
        plt.tight_layout()
        plt.show()

def run_analysis(df, target_variable):
    if target_variable not in df.columns:
        raise ValueError(f"Target variable '{target_variable}' not found in DataFrame")
    
    # Prepare data
    X, numerical_cols = prepare_data_with_categorical(df, target_variable)
    y = df[target_variable]
    
    # Handle missing values in target variable
    if y.isnull().any():
        print(f"Warning: Removing {y.isnull().sum()} rows with missing values in target variable")
        mask = ~y.isnull()
        X = X[mask]
        y = y[mask]
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=6547
    )
    
    # Train and evaluate models
    results, feature_importances = train_and_evaluate_models(
        X_train, X_test, y_train, y_test
    )
    
    print(f"\nModel Performance Metrics for predicting {target_variable}:")
    print(results.to_string(index=False))
 
    plot_model_comparison(results)

    plot_feature_importance(X, feature_importances)
    
    return results, feature_importances, X


results, feature_importances, X = run_analysis(df, 'Nonpunishment_policy_after2021')

Model Performance Metrics for predicting Nonpunishment_policy_after2021:
              Model  Accuracy  Precision  Recall  F1 Score  ROC-AUC
Logistic Regression       0.7   0.750000     0.6  0.666667     0.88
      Random Forest       0.8   0.800000     0.8  0.800000     0.96
        Naive Bayes       0.8   0.714286     1.0  0.833333     0.92

Model performance comparison:

  • Accuracy: overall proportion of correct predictions (both true positives and true negatives)
    • Random Forest and Naive Bayes perform best, each with an accuracy of 0.80
    • Logistic Regression is slightly lower (0.70)
  • Precision: reliability in correctly predicting the positive class
    • This means a low rate of false positives
    • Random Forest has the highest precision (0.80), followed by Logistic Regression (0.75)
    • Naive Bayes’ score is slightly lower (~0.71)
  • Recall: how well the model identifies true positive cases (true positive rate)
    • Naive Bayes identifies every positive case (recall of 1.0)
    • Random Forest is lower (0.80) and Logistic Regression is lowest (0.60)
  • F1 score: balanced measure of precision and recall
    • Naive Bayes performs best (~0.83), followed by Random Forest (0.80)
    • Logistic Regression is lower (~0.67)
  • ROC-AUC: measures the model’s overall ability to distinguish between classes
    • Random Forest achieves the highest score (0.96)
    • Naive Bayes (0.92) and Logistic Regression (0.88) are slightly lower

Feature importance comparison:

For both Logistic Regression and Random Forest, the two most important features were Nonpunishment_policy_before2021 and Limited_nonpunishment_policy, respectively. These two features stand out significantly from the rest, with a large gap between Limited_nonpunishment_policy and the next most important feature. This differs from the Naive Bayes model’s feature importance, which is more evenly distributed and decreases more gradually across the rankings. Furthermore, the most important feature for the Naive Bayes model was the slope of the change in a country’s refugee population over the years, an interesting deviation from the other two models.

Summary

According to the model performance comparison, the Random Forest model performed the strongest overall, achieving the highest accuracy, precision, and ROC-AUC, indicating that it is the strongest of these options for predicting whether or not a country has a victim non-punishment policy based on the rest of the data. Naive Bayes performed second best, achieving the highest recall and F1 score, while Logistic Regression trailed on most measures.

Multiclass Classification

In this section, I will model the important features for predicting which Tier placement level a country received in 2024. I will compare two models: Random Forest and XGBoost.

Model selection:

  • Random Forest: defined in the binary classification section

  • XGBoost: ensemble method that builds trees sequentially, with each new tree attempting to correct the mistakes of all previous trees combined

    • Its key components are gradient boosting, regularization, and optimization features (a minimal sketch follows this list)
    • Inclusion justification: ensemble methods tend to perform strongly. Random Forest is a great fundamental method, and XGBoost is another strong ensemble algorithm for multi-class classification, but its use of boosting rather than bagging may provide insight that Random Forest’s bagging cannot capture.
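
A minimal, untuned sketch of fitting XGBoost’s classifier on hypothetical data, showing the boosting-related parameters that are tuned later in this section (n_estimators, max_depth, learning_rate, subsample):

import numpy as np
import xgboost as xgb

# Hypothetical binary-labeled data for illustration
rng = np.random.default_rng(6547)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Trees are added sequentially: each new tree is fit to the errors of the ensemble so far,
# learning_rate shrinks each tree's contribution, and subsample adds randomness per tree
model = xgb.XGBClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, subsample=0.9, random_state=6547
)
model.fit(X_toy, y_toy)
print(model.predict(X_toy[:5]))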

Evaluation metrics:

The following evaluation metrics were performed and compared:

  • Accuracy
  • Precision
  • Recall
  • F1 score

The weighted average and the macro average were both calculated for each of these metrics.
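
A minimal sketch of the difference on hypothetical labels: the macro average weights every class equally, while the weighted average weights each class by its support (number of true instances).

from sklearn.metrics import precision_score

# Hypothetical imbalanced multi-class labels (class 2 has only one true example)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

# Macro: unweighted mean of the per-class precisions
print(precision_score(y_true, y_pred, average='macro', zero_division=0))     # ~0.444
# Weighted: per-class precisions weighted by class support
print(precision_score(y_true, y_pred, average='weighted', zero_division=0))  # ~0.556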

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Train-Test Split
def split_data(X, y, test_size=0.2, random_state=6547):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=test_size, 
        random_state=random_state, 
        stratify=y
    )
    
    print("Training set shape:", X_train.shape)
    print("Test set shape:", X_test.shape)
    
    return X_train, X_test, y_train, y_test

# 2. Cross Validation
def perform_cross_validation(X, y, models, cv=5):
    results = {}
    for name, model in models.items():
        cv_scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
        results[name] = {
            'mean_cv_score': cv_scores.mean(),
            'std_cv_score': cv_scores.std()
        }
        print(f"\n{name} CV Results:")
        print(f"Mean CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
    return results

# 3. Hyperparameter Tuning
def tune_hyperparameters(X_train, y_train):
    # Hyperparameters tuned for both models --> n_estimators & max_depth

    # RF hyperparameters:
    # max_depth --> default=None (nodes are expanded until leaves are pure)
    # min_samples_split --> min required number of samples to split a node
    # min_samples_leaf --> min number of required samples for a leaf
    rf_param_grid = {
        'n_estimators': [50, 100, 200, 300],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 3, 5],
        'min_samples_leaf': [1, 2, 4]
    }
    
    # XGB hyperparameters: 
    # max_depth --> default=6
    # learning_rate --> how fast the model learns to adjust predictions
    # subsample --> the fraction of samples used to train each tree
    xgb_param_grid = {
        'n_estimators': [50, 100, 200, 300],
        'max_depth': [3, 4, 5, 6],
        'learning_rate': [0.01, 0.05, 0.1],
        'subsample': [0.6, 0.8, 0.9]
    }
    
    # Initialize models
    rf_model = RandomForestClassifier(random_state=6547)
    xgb_model = xgb.XGBClassifier(random_state=6547)
    
    # GridSearchCV
    rf_grid = GridSearchCV(rf_model, rf_param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    rf_grid.fit(X_train, y_train)
    xgb_grid = GridSearchCV(xgb_model, xgb_param_grid, cv=5, scoring='accuracy', n_jobs=-1)
    xgb_grid.fit(X_train, y_train)
    
    print("\nBest Parameters:")
    print("Random Forest:", rf_grid.best_params_)
    print("XGBoost:", xgb_grid.best_params_)
    
    return rf_grid.best_estimator_, xgb_grid.best_estimator_

# 4. Train Models
def train_models(X_train, y_train, best_rf, best_xgb):
    # Train models with best parameters
    best_rf.fit(X_train, y_train)
    best_xgb.fit(X_train, y_train)
    
    return best_rf, best_xgb

# 5. Evaluate Models
def evaluate_models(models, X_train, X_test, y_train, y_test):
    results = {}
    
    for name, model in models.items():
        # Make predictions
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        
        # evaluation metrics
        results[name] = {
            'train_accuracy': accuracy_score(y_train, train_pred),
            'test_accuracy': accuracy_score(y_test, test_pred),
            'classification_report': classification_report(y_test, test_pred),
            'confusion_matrix': confusion_matrix(y_test, test_pred),
            'predictions': test_pred
        }
        
        print(f"Training accuracy: {results[name]['train_accuracy']:.3f}")
        print(f"Test accuracy: {results[name]['test_accuracy']:.3f}")
        print("\nClassification Report:")
        print(results[name]['classification_report'])
    
    return results

# 6. Visualization 
# function that plots confusion matrices
def plot_confusion_matrices(evaluation_results):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # RF confusion matrix
    sns.heatmap(evaluation_results['Random Forest']['confusion_matrix'], 
                annot=True, fmt='d', ax=ax1)
    ax1.set_title('Random Forest Confusion Matrix')
    ax1.set_xlabel('Predicted')
    ax1.set_ylabel('True')
    
    # XGBoost confusion matrix
    sns.heatmap(evaluation_results['XGBoost']['confusion_matrix'], 
                annot=True, fmt='d', ax=ax2)
    ax2.set_title('XGBoost Confusion Matrix')
    ax2.set_xlabel('Predicted')
    ax2.set_ylabel('True')
    
    plt.tight_layout()
    plt.show()

def plot_feature_importance_comparison(rf_model, xgb_model, feature_names, top_n=15):
    # Get feature importance for both models
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'Random Forest': rf_model.feature_importances_,
        'XGBoost': xgb_model.feature_importances_
    })
    
    # Calculate mean importance across both models
    importance_df['mean_importance'] = importance_df[['Random Forest', 'XGBoost']].mean(axis=1)
    
    # Sort by mean importance and get top N features
    top_features = importance_df.nlargest(top_n, 'mean_importance')['feature'].values
    importance_df = importance_df[importance_df['feature'].isin(top_features)]
    
    # Melt the dataframe for plotting
    importance_melted = importance_df.melt(
        id_vars=['feature'], 
        value_vars=['Random Forest', 'XGBoost'],
        var_name='Model', 
        value_name='Importance'
    )
    
    # Create the plot
    plt.figure(figsize=(12, 8))
    sns.barplot(
        data=importance_melted,
        y='feature',
        x='Importance',
        hue='Model'
    )
    plt.title('Top 15 Features Importance Comparison')
    plt.tight_layout()
    plt.show()

# Main execution 
def run_modeling_pipeline(X, y):
    X_train, X_test, y_train, y_test = split_data(X, y)
    
    # initial cross-validation
    initial_models = {
        'Random Forest': RandomForestClassifier(random_state=6547),
        'XGBoost': xgb.XGBClassifier(random_state=6547)
    }
    cv_results = perform_cross_validation(X, y, initial_models)
    
    # tune hyperparameters
    best_rf, best_xgb = tune_hyperparameters(X_train, y_train)
    
    # train models with best parameters
    final_rf, final_xgb = train_models(X_train, y_train, best_rf, best_xgb)
    
    # evaluate
    final_models = {
        'Random Forest': final_rf,
        'XGBoost': final_xgb
    }
    evaluation_results = evaluate_models(final_models, X_train, X_test, y_train, y_test)
    
    plot_confusion_matrices(evaluation_results)
    plot_feature_importance_comparison(final_rf, final_xgb, X.columns)
    
    return final_models, evaluation_results

y = df['Nonpunishment_policy_after2021']

final_models, evaluation_results = run_modeling_pipeline(X, y)
Training set shape: (36, 40)
Test set shape: (10, 40)

Random Forest CV Results:
Mean CV Accuracy: 0.738 (+/- 0.117)

XGBoost CV Results:
Mean CV Accuracy: 0.760 (+/- 0.170)

Best Parameters:
Random Forest: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
XGBoost: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.9}
Training accuracy: 1.000
Test accuracy: 0.800

Classification Report:
              precision    recall  f1-score   support

         0.0       0.83      0.83      0.83         6
         1.0       0.75      0.75      0.75         4

    accuracy                           0.80        10
   macro avg       0.79      0.79      0.79        10
weighted avg       0.80      0.80      0.80        10

Training accuracy: 0.861
Test accuracy: 0.900

Classification Report:
              precision    recall  f1-score   support

         0.0       0.86      1.00      0.92         6
         1.0       1.00      0.75      0.86         4

    accuracy                           0.90        10
   macro avg       0.93      0.88      0.89        10
weighted avg       0.91      0.90      0.90        10

Model performance comparison:

The following evaluation metrics were all defined previously under the binary classification model performance comparison:

  • Accuracy
    • Random Forest Mean CV Accuracy: 0.738 (+/- 0.117)
    • XGBoost Mean CV Accuracy: 0.760 (+/- 0.170)
    • The mean CV accuracies are close (0.738 vs. 0.760) and their intervals overlap, indicating that the two models achieve very similar accuracy on this data.
  • Confusion matrices: four-quadrant tables that show where the model’s predictions on the testing dataset fall: true positives, true negatives, false positives, and false negatives (a minimal sketch appears below).
    • Random Forest confusion matrix:
      • True Negatives (0,0): 5
      • False Positives (0,1): 1
      • False Negatives (1,0): 1
      • True Positives (1,1): 3
    • XGBoost confusion matrix:
      • True Negatives (0,0): 6
      • False Positives (0,1): 0
      • False Negatives (1,0): 1
      • True Positives (1,1): 3

The models perform very similarly; XGBoost simply predicts one more true negative (one fewer false positive) than Random Forest does.
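
A minimal sketch of how these counts are arranged by scikit-learn’s confusion_matrix (hypothetical labels; rows are true classes, columns are predicted classes):

from sklearn.metrics import confusion_matrix

# Hypothetical binary labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]

# Layout: [[TN, FP],
#          [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # [[4 1]
                                         #  [1 4]]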

Feature importance comparison

XGBoost assigned higher importance values overall than the Random Forest model did; for almost all indicators, XGBoost’s importance was higher, and for the few where it was not, Random Forest’s importance was still low. For both models, the non-punishment policy indicator was the most important feature, with XGBoost assigning it over 25% of the total importance. Overall, the Random Forest model spreads its importance more evenly across a wider range of indicators, while XGBoost concentrates more of its importance in fewer indicators.

Summary

In summary, both models performed very similarly on their performance metrics, with XGBoost performing slightly better. Some considerations for these models are the potential for overfitting (Random Forest reached a perfect training accuracy) and the extremely small class sizes (only 2 members in the smallest class), which may affect cross-validation and model reliability.

Regression

In this section, I will model which features are important for predicting a country’s mean Tier placement score over the years 2013 to 2023. I will compare two supervised regression methods, Lasso regression and Random Forest regression, on how well they perform and which features they identify as important.

Model selection:

  • Lasso (Least Absolute Shrinkage and Selection Operator) Regression: this method incorporates L1 regularization into linear regression, making it more robust to overfitting and able to perform feature selection (see the sketch after this list)
    • Inclusion justification: this method was chosen over plain linear regression because it is known to adapt well to high-dimensional data with many features, particularly when many of those features may not be relevant, which characterizes the data being used

  • Random Forest Regressor: the application of the ensemble Random Forest method to numerical regression
    • Inclusion justification: this method works well on high-dimensional data with complex, non-linear relationships, outliers, and mixed data types, which is a necessary accommodation for this data.
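
A minimal sketch of how Lasso’s L1 penalty performs feature selection on hypothetical data (the alpha here is arbitrary; the analysis below uses alpha=0.001):

import numpy as np
from sklearn.linear_model import Lasso

# Lasso minimizes (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1;
# the L1 penalty shrinks some coefficients exactly to zero
rng = np.random.default_rng(6547)
X_toy = rng.normal(size=(50, 5))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.1, size=50)  # only 2 informative features

lasso = Lasso(alpha=0.1)
lasso.fit(X_toy, y_toy)
print(lasso.coef_)  # coefficients on the three uninformative features end up at (or near) zero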

Evaluation metrics:

  • RMSE
  • MAE
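
A minimal sketch of both metrics on hypothetical actual and predicted values:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical actual vs. predicted values, for illustration only
y_true = np.array([1.0, 2.0, 2.5, 3.0])
y_pred = np.array([1.2, 1.8, 2.9, 2.6])

# RMSE = sqrt(mean((y_true - y_pred)^2)); penalizes large errors more heavily
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # ~0.316
# MAE = mean(|y_true - y_pred|); average absolute error in the target's units
print(mean_absolute_error(y_true, y_pred))          # 0.3
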
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns

def prepare_data_for_regression(df, target_variable):
    # target variable
    y = df[target_variable].copy()
    
    # process data
    X, numerical_cols = prepare_data_with_categorical(df, target_variable)
    if target_variable in X.columns:
        X = X.drop([target_variable], axis=1)
    imputer = SimpleImputer(strategy='mean') # fill NAs with mean
    X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)
    y_imputed = pd.Series(imputer.fit_transform(y.values.reshape(-1, 1)).ravel(), index=y.index)
    
    return X_imputed, y_imputed

def train_and_evaluate_models(X, y, random_state=6547):

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    
    # Initialize 
    models = {
        'Lasso': Lasso(alpha=0.001, random_state=random_state),
        'Random Forest': RandomForestRegressor(random_state=random_state)
    }
    
    # store results
    results = {}
    
    for name, model in models.items():
        print(f"\n=== {name} Regression ===")
        
        # Train model
        model.fit(X_train, y_train)
        
        # Predict
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
        
        # Calculate metrics
        metrics = {
            'train_r2': r2_score(y_train, train_pred),
            'test_r2': r2_score(y_test, test_pred),
            'rmse': np.sqrt(mean_squared_error(y_test, test_pred)),
            'mae': mean_absolute_error(y_test, test_pred)
        }
        
        # Cross-validation
        cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
        metrics['cv_mean'] = cv_scores.mean()
        metrics['cv_std'] = cv_scores.std()
        
        print(f"Training R²: {metrics['train_r2']:.4f}")
        print(f"Test R²: {metrics['test_r2']:.4f}")
        print(f"RMSE: {metrics['rmse']:.4f}")
        print(f"MAE: {metrics['mae']:.4f}")
        print(f"CV R² Score: {metrics['cv_mean']:.4f} (+/- {metrics['cv_std'] * 2:.4f})")
        
        # store results
        results[name] = {
            'model': model,
            'metrics': metrics,
            'predictions': test_pred,
            'true_values': y_test
        }
        
        # predicted vs actual
        plt.figure(figsize=(8, 6))
        plt.scatter(y_test, test_pred, alpha=0.5)
        plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
        plt.xlabel('Actual Values')
        plt.ylabel('Predicted Values')
        plt.title(f'{name} Regression: Predicted vs Actual Values')
        plt.tight_layout()
        plt.show()
        
        # Feature importance
        if name in ['Lasso', 'Random Forest']:
            feature_importance = pd.DataFrame({
                'Feature': X.columns,
                'Importance': abs(model.coef_) if name == 'Lasso' 
                            else model.feature_importances_
            }).sort_values('Importance', ascending=True)
            
            plt.figure(figsize=(10, max(8, len(feature_importance) * 0.3)))
            plt.barh(range(len(feature_importance)), feature_importance['Importance'])
            plt.yticks(range(len(feature_importance)), feature_importance['Feature'])
            plt.title(f'{name} Feature Importance')
            plt.xlabel('Importance Value')
            plt.tight_layout()
            plt.show()
    
    return results

# Compare models
def compare_models_plot(results):

    metrics_df = pd.DataFrame({
        name: {
            'Test R²': results[name]['metrics']['test_r2'],
            'RMSE': results[name]['metrics']['rmse'],
            'MAE': results[name]['metrics']['mae'],
            'CV R² (mean)': results[name]['metrics']['cv_mean']
        } for name in results.keys()
    }).T
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    metrics_df.plot(kind='bar', y='Test R²', ax=axes[0,0], title='Test R² Score')
    metrics_df.plot(kind='bar', y='RMSE', ax=axes[0,1], title='Root Mean Square Error')
    metrics_df.plot(kind='bar', y='MAE', ax=axes[1,0], title='Mean Absolute Error')
    metrics_df.plot(kind='bar', y='CV R² (mean)', ax=axes[1,1], title='Cross-validation R² Score')
    
    plt.tight_layout()
    plt.show()
    
    return metrics_df

# Run analysis
def run_regression_analysis(df, target_variable):
    print(f"\nPredicting {target_variable}")
    
    # prepare data
    X, y = prepare_data_for_regression(df, target_variable)

    results = train_and_evaluate_models(X, y)
    metrics_comparison = compare_models_plot(results)
    
    return results, metrics_comparison


results, metrics_comparison = run_regression_analysis(df, 'Tier_mean')

Predicting Tier_mean

=== Lasso Regression ===
Training R²: 0.9916
Test R²: -0.7330
RMSE: 0.6209
MAE: 0.4644
CV R² Score: -1.6745 (+/- 1.6866)
/opt/anaconda3/lib/python3.12/site-packages/sklearn/linear_model/_coordinate_descent.py:678: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.279e-03, tolerance: 6.260e-04
  model = cd_fast.enet_coordinate_descent(
/opt/anaconda3/lib/python3.12/site-packages/sklearn/linear_model/_coordinate_descent.py:678: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.373e-03, tolerance: 6.061e-04
  model = cd_fast.enet_coordinate_descent(


=== Random Forest Regression ===
Training R²: 0.9431
Test R²: 0.3085
RMSE: 0.3922
MAE: 0.2772
CV R² Score: 0.4162 (+/- 0.2322)

Model performance comparison

  • R²: Coefficient of Determination
    • represents how well the model explains the variability in the data
    • Lasso’s negative score on both the test set and in cross-validation indicates that the model explains less of the variability than a constant function that always predicts the mean
    • Random Forest’s score is quite low; the model explains only about 31% of the variability in the test data
  • RMSE: Root Mean Squared Error
    • Lasso has a higher RMSE, indicating a higher magnitude of prediction errors
  • MAE: Mean Absolute Error
    • Lasso’s score is higher, indicating larger absolute prediction errors

Feature importance comparison

The two models identified very different features as most important. The top three for Lasso regression were whether the country is located in the Western Europe subregion (~0.8) and whether prostitution in the country is legal (~0.45) or decriminalized (~0.4). The top three for the Random Forest model were the average criminal justice score (~0.33), the number of visa-free destinations that can be accessed with the country’s passport (~0.26), and the average GDP per capita in USD (~0.11). The magnitudes of the importance values are considerably higher for the Lasso model.

Summary

These evaluation metrics indicate that while neither model performs very well for predicting the mean Tier score per country, the Lasso regression model performs extremely poorly. The negative R² value for Lasso regression means that this model is actually worse at predicting the average Tier score per country than a constant function that always predicts the mean of the data. Overall, neither model proves reliable for estimating the average Tier score from 2013 to 2023 using regression.

Discussion

According to the evaluation metrics, the best, most reliable models were the binary classifiers, with the Random Forest binary classifier performing best among the three. The regression models were the poorest performing and showed the most failures. However, the findings presented here should not be generalized. The data used in these supervised learning models is very limited and flawed: it is a very small dataset (only 46 rows), and for certain classifications, the classes were too small to derive significant or reliable results. Future iterations of supervised learning on this topic should consider applying neural networks, given the complex nature of this data.