Unsupervised Learning

Overview of Unsupervised Learning

Unsupervised learning is a type of machine learning that analyzes data for patterns and relationships without any labels on the data points. To put it simply, unsupervised learning “blindly” infers associations among data points by assessing their similarities, often after their dimensionality has been boiled down. This differs from supervised learning, which already knows the labels of the data points and builds a model to predict that labeled variable by assessing its relevance to the other features. For this unsupervised learning section, I will first perform dimensionality reduction, which allows for preliminary visualization of the underlying relationships between data points. Next, I will perform three different methods of unsupervised clustering: K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and hierarchical clustering.

# Load dataset
import pandas as pd
df = pd.read_csv('../../data/processed-data/MLdataset.csv')

Dimensionality Reduction

Dimensionality reduction is a critical step when working with unsupervised learning. Often, unsupervised learning takes in very high-dimensional data, meaning data with many features, or columns. With so many dimensions, it is difficult to holistically visualize and understand the multidimensional relationships between data points. Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are two important methods for dimensionality reduction. While they perform different tasks and deliver different insights, they both allow high-dimensional data to be visualized on a 2-dimensional plane by combining the standardized features into a small number of comparable dimensions. Since unsupervised learning aims to find relationships between variables without prior labels, the ability to compare data points at this level is crucial.

Dimensionality reduction also provides practical benefits by mitigating common issues such as multicollinearity and overfitting. This is important because it allows a clearer view of the true structure of the data and its relationships.

PCA

Principal Component Analysis (PCA) is a common data preprocessing step which reduces the number of dimensions in large datasets by transforming correlated variables into a smaller set of new variables (principal components). The result is a smaller set of dimensions that retains most of the original information. PCA improves computational efficiency and helps identify trends, patterns, or outliers in high-dimensional data. See IBM’s definition for further information.
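
To make the mechanics concrete, here is a minimal sketch (on a small random stand-in matrix, not the project data) showing that the principal components are simply the eigenvectors of the covariance matrix of the centered data, and that projecting onto the top eigenvectors reproduces scikit-learn’s PCA output up to sign:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))        # stand-in data: 50 rows, 5 features
X = X - X.mean(axis=0)              # center the data, as PCA does internally

# Eigendecomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]         # re-sort by variance, descending
components = eigvecs[:, order[:2]]        # top-2 principal directions
X_manual = X @ components                 # project onto them

# Compare with scikit-learn (each component may differ only in sign)
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))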

In the code below, I will:

1. Prepare the data for dimensionality reduction by handling categorical variables, filling missing values with the column means, and scaling the data using scikit-learn’s StandardScaler (a Python tool that applies z-score standardization, putting all features on a standard, comparable scale)
2. Write functions to:
- create a plot that visualizes the cumulative variance explained as each additional component (variable) is included
- calculate the optimal number of clusters using the silhouette score on the PCA-transformed data
- create a scatter plot that visualizes how similar data points within the same cluster are to each other and how distinct the different clusters are

import pandas as pd             
import numpy as np                
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler  
from sklearn.impute import SimpleImputer  
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score  
import seaborn as sns        


def prepare_data(df):
    # One-hot encode the categorical variables (turns them into numerical 0 or 1)
    df_encoded = pd.get_dummies(df, columns=['Subregion', 'prostitution_policy'])
    
    # Drop non-numeric columns
    df_encoded = df_encoded.drop(['Country'], axis=1)
    
    # Handle missing values
    df_encoded = df_encoded.fillna(df_encoded.mean())
    
    # Scale
    scaler = StandardScaler()
    X = scaler.fit_transform(df_encoded)
    
    return X, df_encoded.columns

# analyzes the optimal number of PCA components (the number of features to include)
def analyze_pca(X):
    pca = PCA()
    pca.fit(X)
    
    # cumulative variance ratio
    cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
    
    # Plot explained variance
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(cumulative_variance_ratio) + 1), 
             cumulative_variance_ratio, 'bo-')
    plt.axhline(y=0.8, color='r', linestyle='--', label='80% explained variance')
    plt.axhline(y=0.95, color='g', linestyle='--', label='95% explained variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('PCA Explained Variance')
    plt.legend()
    plt.show()
    
    # Two different thresholds are used : .80 vs 0.95
    # Though an 80% explained variance is quite low, this is complex data and I wanted to be able to compare to 
    # understand how much explained variance is lost with fewer components included
    n_components_80 = np.argmax(cumulative_variance_ratio >= 0.8) + 1 
    n_components_95 = np.argmax(cumulative_variance_ratio >= 0.95) + 1
    
    print(f"Number of components needed for 80% variance: {n_components_80}")
    print(f"Number of components needed for 95% variance: {n_components_95}")
    
    return n_components_80, n_components_95

# function to identify the optimal number of clusters to visualize using the elbow method and silhouette score 

def find_optimal_clusters(X_pca, min_clusters=2, max_clusters=10):
    silhouette_scores = []
    cluster_range = range(min_clusters, max_clusters + 1)
    
    for k in cluster_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(X_pca)
        score = silhouette_score(X_pca, labels)
        silhouette_scores.append(score)
        print(f"Number of clusters: {k}, Silhouette score: {score:.4f}")
    
    # Find the number of clusters with the max silhouette score
    optimal_k = cluster_range[np.argmax(silhouette_scores)]
    print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")
    
    # silhouette scores
    plt.figure(figsize=(10, 6))
    plt.plot(cluster_range, silhouette_scores, marker='o', linestyle='--')
    plt.title('Silhouette Score for Different Numbers of Clusters')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Silhouette Score')
    plt.show()
    
    return optimal_k
    



# function to run both of the above and output results
def pca_analysis(df):
    # output annotations
    print("1. Preparing and standardizing data...")
    X_scaled, feature_names = prepare_data(df)  # Unpack both returned values
    
    print("\n2. Analyzing optimal number of components...")
    # analyze number of components needed
    pca_full = PCA()
    pca_full.fit(X_scaled)
    cumulative_variance_ratio = np.cumsum(pca_full.explained_variance_ratio_)
    
    # Plot explained variance
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(cumulative_variance_ratio) + 1), 
             cumulative_variance_ratio, 'bo-')
    plt.axhline(y=0.8, color='r', linestyle='--', label='80% explained variance')
    plt.axhline(y=0.95, color='g', linestyle='--', label='95% explained variance')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('PCA Explained Variance')
    plt.legend()
    plt.show()
    
    print("\n3. Performing PCA transformation for visualization...")
    # 2D for visualization
    pca_2d = PCA(n_components=2)
    X_pca = pca_2d.fit_transform(X_scaled)
    print(f"Explained variance ratios for first two components: {pca_2d.explained_variance_ratio_}")
    
    # First plot: Basic PCA scatter plot without clustering
    plt.figure(figsize=(10, 8))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.7)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title('PCA: First Two Principal Components')
    plt.grid(True)
    plt.show()
    
    print("\n4. Finding optimal number of clusters...")
    opt_k = find_optimal_clusters(X_pca)
    print(f"Optimal number of clusters: {opt_k}")
    
    print("\n5. Performing K-means clustering...")
    kmeans = KMeans(n_clusters=opt_k, random_state=6547)
    cluster_labels = kmeans.fit_predict(X_pca)
    
    # Second plot: PCA with cluster assignments
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], 
                         c=cluster_labels, cmap='viridis', 
                         alpha=0.7)
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.title(f'PCA with {opt_k} Clusters')
    plt.colorbar(scatter)
    plt.grid(True)
    plt.show()
    
    results = {
        'X_scaled': X_scaled,
        'X_pca': X_pca,
        'pca': pca_2d,
        'cluster_labels': cluster_labels,
        'optimal_k': opt_k,
        'explained_variance_ratio': pca_2d.explained_variance_ratio_
    }
    
    return results

pca_results = pca_analysis(df)
1. Preparing and standardizing data...

2. Analyzing optimal number of components...


3. Performing PCA transformation for visualization...
Explained variance ratios for first two components: [0.16484017 0.11272867]


4. Finding optimal number of clusters...
Number of clusters: 2, Silhouette score: 0.3721
Number of clusters: 3, Silhouette score: 0.4892
Number of clusters: 4, Silhouette score: 0.4150
Number of clusters: 5, Silhouette score: 0.4281
Number of clusters: 6, Silhouette score: 0.4086
Number of clusters: 7, Silhouette score: 0.3730
Number of clusters: 8, Silhouette score: 0.3758
Number of clusters: 9, Silhouette score: 0.4114
Number of clusters: 10, Silhouette score: 0.3952

Optimal number of clusters based on silhouette score: 3

Optimal number of clusters: 3

5. Performing K-means clustering...

The explained variance plot indicates that 80% of the variance in the data can be explained when it is reduced to 13 principal components, and 95% of the variance can be explained when approximately 22 principal components are included. This suggests that the data is highly complex and that there are no clear, discernible relationships between data points: the variance cannot be captured by only a few highly representative components.
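
As a quick cross-check of those counts (a small side sketch reusing prepare_data from above, not reused later), scikit-learn’s PCA also accepts a fractional n_components, in which case it keeps just enough components to reach that share of explained variance:

# Keep the smallest number of components explaining at least 80% / 95% of the variance
X_scaled_check, _ = prepare_data(df)
pca_80 = PCA(n_components=0.80, svd_solver='full').fit(X_scaled_check)
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_scaled_check)
print('Components for 80% variance:', pca_80.n_components_)
print('Components for 95% variance:', pca_95.n_components_)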

t-SNE

t-Distributed Stochastic Neighbor Embedding, or t-SNE, is another essential unsupervised dimensionality reduction technique. It differs from PCA in that it is non-linear: while PCA relies on linear projections that maximize variance and preserve larger, global pairwise distances, t-SNE focuses on preserving smaller, more local pairwise distances. t-SNE allows for the visualization of high-dimensional data by converting the pairwise relationships between data points into joint probabilities and minimizing the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data; essentially, it compares the data as if the points were drawn from distributions. The underlying methods are quite complex, but Datacamp and Scikit-learn.org’s explanations are very helpful for understanding the tool.
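
As a small illustration of that objective (a sketch on random stand-in data, not the project dataset), a fitted scikit-learn TSNE object exposes the final Kullback-Leibler divergence it minimized:

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 10))    # stand-in high-dimensional data

tsne_demo = TSNE(n_components=2, perplexity=10, random_state=42)
embedding = tsne_demo.fit_transform(X_demo)

# Final value of the objective: KL divergence between the joint probabilities
# of the high-dimensional data and of the 2D embedding
print('Final KL divergence:', tsne_demo.kl_divergence_)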

In the code below, I will:
- transform the data with t-SNE
- visualize the t-SNE plots at varying degrees of perplexity (perplexity is a hyperparameter, or key influencing setting, of t-SNE, commonly described as quantifying the balance between preserving the global and the local structure of the data; varying the perplexity should therefore produce plots with different clustering characteristics)
- calculate the optimal number of clusters using the same silhouette score-based function from the PCA section
- plot the t-SNE embedding shaded by the optimal number of clusters identified

from sklearn.manifold import TSNE

def tsne_analysis(df):
    # add annotations to accompany the outputs: 
    print("1. Preparing and standardizing data...")
    X_scaled, feature_names = prepare_data(df)
    
    print("\n2. Testing different perplexity values...")
    # test different perplexity values
    perplexities = [5, 25, 40]
    tsne_results = {}
    
    fig, axes = plt.subplots(1, len(perplexities), figsize=(20, 6))
    
    for idx, perp in enumerate(perplexities):
        print(f"Computing t-SNE with perplexity={perp}...")
        tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
        tsne_result = tsne.fit_transform(X_scaled)
        
        # Store 
        tsne_results[f'perplexity_{perp}'] = tsne_result
        
        # Plot
        axes[idx].scatter(tsne_result[:, 0], tsne_result[:, 1])
        axes[idx].set_title(f't-SNE (perplexity={perp})')
        axes[idx].set_xlabel('t-SNE 1')
        axes[idx].set_ylabel('t-SNE 2')
    
    plt.tight_layout()
    plt.show()
    
    print("\n3. Using perplexity=25 for further analysis...")
    # chose 25 because it's in the middle
    X_tsne = tsne_results['perplexity_25']
    
    print("\n4. Finding optimal number of clusters...")
    opt_k = find_optimal_clusters(X_tsne)
    print(f"Optimal number of clusters: {opt_k}")
    
    print("\n5. Performing K-means clustering...")
    kmeans = KMeans(n_clusters=opt_k, random_state=6547)
    cluster_labels = kmeans.fit_predict(X_tsne)
    
    # Plot t-SNE with cluster assignments
    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                         c=cluster_labels, cmap='viridis', 
                         alpha=0.7)
    plt.xlabel('t-SNE 1')
    plt.ylabel('t-SNE 2')
    plt.title(f't-SNE with {opt_k} Clusters')
    plt.colorbar(scatter)
    plt.grid(True)
    plt.show()
    
    results = {
        'X_scaled': X_scaled,
        'tsne_results': tsne_results,
        'X_tsne': X_tsne,
        'cluster_labels': cluster_labels,
        'optimal_k': opt_k
    }
    
    return results

# Usage:
tsne_results = tsne_analysis(df)
1. Preparing and standardizing data...

2. Testing different perplexity values...
Computing t-SNE with perplexity=5...
Computing t-SNE with perplexity=25...
Computing t-SNE with perplexity=40...


3. Using perplexity=25 for further analysis...

4. Finding optimal number of clusters...
Number of clusters: 2, Silhouette score: 0.3900
Number of clusters: 3, Silhouette score: 0.3288
Number of clusters: 4, Silhouette score: 0.3529
Number of clusters: 5, Silhouette score: 0.3727
Number of clusters: 6, Silhouette score: 0.3548
Number of clusters: 7, Silhouette score: 0.3796
Number of clusters: 8, Silhouette score: 0.3820
Number of clusters: 9, Silhouette score: 0.4362
Number of clusters: 10, Silhouette score: 0.3830

Optimal number of clusters based on silhouette score: 9

Optimal number of clusters: 9

5. Performing K-means clustering...

Interestingly, the varying perplexity values input into t-SNE did not produce visually distinguishable plots; none of the perplexity levels showed significant differences in clustering, or clear clustering at all. However, the data points sit closer together at higher perplexity values, though not necessarily in distinct clusters.

PCA and t-SNE differ in their aims, methods, and outputs. As mentioned, PCA is a linear technique that prioritizes preserving the global structure of the data, while t-SNE is non-linear and prioritizes local structure. While t-SNE is used almost exclusively for visualization, PCA is also a general-purpose dimensionality reduction step. One clear distinction between the two tools on this data is the difference in the optimal number of clusters identified by performing silhouette score optimization on the PCA- and t-SNE-transformed data. While PCA identified three clusters (which is consistent and reasonable given the numbers identified by the three algorithms deployed below), t-SNE identified nine. This highlights how imbalanced, inconsistent, and difficult the data is, likely due to its small sample size and complexity.

K-Means

K-Means is a distance-based clustering method based on vector quantization. Its goal is to partition the observations into k clusters, where each observation belongs to the cluster with the nearest center, by minimizing the within-cluster variance. Its main concern is therefore the similarity or difference between data points, which it quantifies using Euclidean distance. The main hyperparameter of a K-Means model is n_clusters (the number of clusters), which must be set for the clustering algorithm to be meaningful on the data. There are two common ways to identify the optimal number of clusters for K-Means: the elbow method and the silhouette score.
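
Before turning to those two methods, here is a short sketch (on random stand-in data, not the project dataset) verifying that the quantity K-Means minimizes, exposed by scikit-learn as inertia_, is just the sum of squared Euclidean distances from each point to its assigned cluster center:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))               # stand-in data

kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

# Recompute the objective by hand: squared distance of each point to its assigned center
assigned_centers = kmeans_demo.cluster_centers_[kmeans_demo.labels_]
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(kmeans_demo.inertia_, manual_inertia)      # the two values should match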

Elbow Method:
- The optimal number of clusters is considered the point where the line on the graph begins to plateau

Silhouette Score:
- Measures how similar each point is to its own cluster compared with the nearest neighboring cluster
- The optimal number of clusters can be determined explicitly by taking the number with the highest score (see the short sketch below)
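
The sketch below (on random stand-in data, and assuming no cluster is a single point) spells out that definition: for each point i, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the silhouette coefficient is (b - a) / max(a, b); the overall silhouette score is the average over all points.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, silhouette_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))                 # stand-in data
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)

D = pairwise_distances(X_demo)
scores = []
for i in range(len(X_demo)):
    own = labels == labels[i]
    a = D[i, own & (np.arange(len(X_demo)) != i)].mean()                        # within own cluster
    b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])    # nearest other cluster
    scores.append((b - a) / max(a, b))

print(np.mean(scores), silhouette_score(X_demo, labels))   # the two values should match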

The code below will:
- Perform both the elbow method and the silhouette score method and output their plots
- Automate the identification of the optimal silhouette score, and input that number of clusters into the K-Means model
- Output visualizations that compare the number of observations in each cluster and the distribution of key features across the different clusters (by going back and extracting the labeled data)

Hyperparameters: Settings that need to be set in advance for the model to work (cannot be calculated or derived from the model’s learning process)

Hyperparameter tuning: the process of selecting the best hyperparameters for a machine learning model to optimize its performance.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# find optimal number of clusters
def find_optimal_clusters(data, max_clusters=10):
    inertias = []
    silhouette_scores = []
    
    for k in range(2, max_clusters + 1):
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(data)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(data, kmeans.labels_))
    
    # elbow curve
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.plot(range(2, max_clusters + 1), inertias, 'bo-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method for Optimal k')
    
    # Silhouette score plot
    plt.subplot(1, 2, 2)
    plt.plot(range(2, max_clusters + 1), silhouette_scores, 'ro-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score for Optimal k')
    
    plt.tight_layout()
    plt.show()
    
    # Find optimal k based on silhouette score
    optimal_k = silhouette_scores.index(max(silhouette_scores)) + 2
    print(f"\nOptimal number of clusters based on silhouette score: {optimal_k}")
    print(f"Silhouette score for k={optimal_k}: {max(silhouette_scores):.3f}")
    
    return optimal_k

# Perform K-means clustering
def perform_kmeans(data, n_clusters):
    kmeans = KMeans(n_clusters=n_clusters, random_state=6547, n_init=10)
    clusters = kmeans.fit_predict(data)
    
    return kmeans, clusters

# visualize
def plot_cluster_distributions(data, clusters, feature_names):
    df_clustered = pd.DataFrame(data, columns=feature_names)
    df_clustered['Cluster'] = clusters
    
    plt.figure(figsize=(10, 5))
    cluster_sizes = pd.Series(clusters).value_counts().sort_index()
    plt.bar(cluster_sizes.index, cluster_sizes.values)
    plt.xlabel('Cluster')
    plt.ylabel('Number of Samples')
    plt.title('Cluster Size Distribution')
    plt.show()
    
    # chose these features to compare
    selected_features = [
        "Tier_mean",
        "Nonpunishment_policy_after2021",
        "Detected_victims_sum",
        "Criminal_justice_mean",  
        "Political_stability_mean",
        "EU_members"
    ]
    
    # Create box plots for selected features
    plt.figure(figsize=(20, 8))  # accommodate 6 plots
    for i, feature in enumerate(selected_features):
        plt.subplot(1, 6, i+1)  # 1 row, 6 columns
        sns.boxplot(x='Cluster', y=feature, data=df_clustered)
        plt.xticks(rotation=45)
        plt.title(feature)
    plt.tight_layout()
    plt.show()

# execution
def run_kmeans_analysis(df):
    print("Preparing data...")
    scaled_data, feature_names = prepare_data(df)
    
    # Find optimal number of clusters
    print("\nFinding optimal number of clusters...")
    optimal_k = find_optimal_clusters(scaled_data)
    
    # clustering
    print("\nPerforming K-means clustering...")
    kmeans_model, clusters = perform_kmeans(scaled_data, optimal_k)
    
    # Visualize results
    print("\nCreating visualizations...")
    plot_cluster_distributions(scaled_data, clusters, feature_names)
    
    return kmeans_model, clusters, feature_names

# Run 
kmeans_model, clusters, feature_names = run_kmeans_analysis(df)

# Print summary stats for clusters
print("\nCluster Summary:")
for i in range(len(np.unique(clusters))):
    print(f"\nCluster {i} size: {np.sum(clusters == i)}")
print("\nCluster labels:", np.unique(clusters))
Preparing data...

Finding optimal number of clusters...


Optimal number of clusters based on silhouette score: 2
Silhouette score for k=2: 0.135

Performing K-means clustering...

Creating visualizations...


Cluster Summary:

Cluster 0 size: 30

Cluster 1 size: 16

Cluster labels: [0 1]

From the beginning, it is clear that this data either does not have many underlying similarities or is too small for significant similarities to be identified: the elbow method plot does not show a clear elbow point, and the silhouette score plot varies considerably. The silhouette score identified 2 clusters as optimal for keeping the majority of points close to their own cluster relative to the nearest other cluster. These 2 clusters are imbalanced, with one containing 30 observations and the other 16.

I also chose to add a dimension of visualization that brings in important indicators from the original dataset to see how they vary across the clustered groups. This allows some exploration into which features intrinsically group the data points. The Tier mean, average criminal justice score, average political stability score, and EU membership all appear to vary between the two clusters, especially EU membership, which almost completely splits the two clusters except for one outlier in each group. Victim non-punishment policy appears not to vary across groups, indicating that it is a rather random variable and does not have much association with the features that show stronger relationships with one another.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, is an unsupervised clustering algorithm that defines clusters as dense regions separated by regions of lower density. It groups together data points that are closely packed and identifies outliers as noise. It has completely different parameters from K-Means and does not require the optimal number of clusters to be identified beforehand. Its two main hyperparameters are eps (epsilon, the maximum distance between two points for them to be considered neighbors) and min_samples (also known as MinPts, the minimum number of points required to form a dense region).
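
Before searching over parameter values, here is a minimal sketch of a single DBSCAN call (on random stand-in data, with arbitrary eps and min_samples values) showing the labeling convention relied on below, where -1 marks noise points:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))                 # stand-in data

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X_demo)

# -1 is reserved for noise, so it is excluded when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('clusters:', n_clusters, '| noise points:', np.sum(labels == -1))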

The code below performs the following:
- Identifies the best DBSCAN hyperparameters by trying different combinations of given values
- Runs the DBSCAN algorithm
- Plots the distribution of counts in the clusters and the variation of the key indicators across those clusters

from sklearn.cluster import DBSCAN

def find_best_dbscan_params(data):
    # ranges for parameters
    eps_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    min_samples_values = [2, 3, 4, 5]
    
    best_score = -1
    best_params = None
    best_clusters = None
    
    results = []
    
    for eps in eps_values:
        for min_samples in min_samples_values:

            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            clusters = dbscan.fit_predict(data)
            
            # number of clusters (excluding noise)
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
            n_noise = list(clusters).count(-1)
            
            # Only evaluate if at least 2 clusters & not too many noise points
            if n_clusters >= 2 and n_noise < len(data) * 0.5:  # less than 50% noise
                try:
                    score = silhouette_score(data, clusters)
                    results.append({
                        'eps': eps,
                        'min_samples': min_samples,
                        'n_clusters': n_clusters,
                        'n_noise': n_noise,
                        'silhouette': score
                    })
                    
                    if score > best_score:
                        best_score = score
                        best_params = (eps, min_samples)
                        best_clusters = clusters
                        
                    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, "
                          f"{n_noise} noise points, silhouette={score:.3f}")
                except:
                    continue
    
    if best_params is None:
        print("Could not find suitable parameters. Trying with larger eps values...")
        return find_best_dbscan_params_large(data)
    
    return best_params, best_clusters

# automate the search for best parameters
def find_best_dbscan_params_large(data):
    eps_values = [3.5, 4.0, 4.5, 5.0, 6.0, 7.0]
    min_samples_values = [2, 3, 4]
    
    best_score = -1
    best_params = None
    best_clusters = None
    
    for eps in eps_values:
        for min_samples in min_samples_values:
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            clusters = dbscan.fit_predict(data)
            
            n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
            n_noise = list(clusters).count(-1)
            
            if n_clusters >= 2 and n_noise < len(data) * 0.5:
                try:
                    score = silhouette_score(data, clusters)
                    if score > best_score:
                        best_score = score
                        best_params = (eps, min_samples)
                        best_clusters = clusters
                        
                    print(f"eps={eps}, min_samples={min_samples}: {n_clusters} clusters, "
                          f"{n_noise} noise points, silhouette={score:.3f}")
                except:
                    continue
    
    return best_params, best_clusters

# Run the analysis
# Prepare data
scaled_data, feature_names = prepare_data(df)

# Find best parameters and perform DBSCAN
best_params, clusters = find_best_dbscan_params(scaled_data)

if best_params is not None:
    print(f"\nBest parameters found: eps={best_params[0]}, min_samples={best_params[1]}")
    
    # Visualize results
    plot_cluster_distributions(scaled_data, clusters, feature_names)
    
    # Print cluster summary
    print("\nCluster Summary:")
    for i in sorted(set(clusters)):
        if i == -1:
            print(f"Noise points: {np.sum(clusters == i)}")
        else:
            print(f"Cluster {i} size: {np.sum(clusters == i)}")
else:
    print("\nCould not find suitable clustering parameters for this dataset")
Could not find suitable parameters. Trying with larger eps values...
eps=6.0, min_samples=2: 4 clusters, 11 noise points, silhouette=0.046
eps=6.0, min_samples=3: 3 clusters, 13 noise points, silhouette=0.034

Best parameters found: eps=6.0, min_samples=2


Cluster Summary:
Noise points: 11
Cluster 0 size: 27
Cluster 1 size: 3
Cluster 2 size: 3
Cluster 3 size: 2

The DBSCAN algorithm found that 4 clusters was the optimal number to capture this data. However, given DBSCAN’s ability to identify outliers as noise, it also places outliers in their own group, denoted -1. Observations in the -1 group are noise points: they cannot be included in any cluster because they are too far away and isolated to be within the specified eps distance of any cluster. DBSCAN identified 11 noise points, which indicates that there is significant variation among the data even at levels of reduced dimensionality. Among the other clusters, one captured 27 data points while the three others had 2 or 3 data points each. This reflects a dataset that is highly imbalanced in terms of clustering and similarities.

The boxplot visualization comparing key indicators across the clusters shows that certain indicators clearly differ by cluster. For example, though Cluster 2 is very small (with a sample size of three), it appears that the points in that cluster are very successful, high-performing countries: their mean Tier score is very low, indicating consistently excellent placement in the Tier rankings, and their mean criminal justice and political stability scores are very high. Conversely, clusters 1 and 3 appear to perform more poorly on these indicators. DBSCAN and the application of labels to the blind clustering groups allow for interesting insights into the nature of the data.

Hierarchical clustering

Hierarchical clustering is another unsupervised learning method that groups data points into branches of a tree of nested clusters based on hierarchical divisions of characteristics. There are two types of hierarchical clustering:

1. Agglomerative (or bottom-up approach): begins with many small clusters and repeatedly merges them into larger ones until a single cluster is formed
2. Divisive (or top-down approach): begins with a single cluster and continues to divide the data into successive clusters until they are all single data points

For this analysis, agglomerative hierarchical clustering will be performed (SciPy’s linkage function with Ward’s method builds the tree bottom-up by repeatedly merging the closest clusters). This approach allows the clustering tendencies of the data to be visualized through the branching of the dendrogram: branches that span a long succession of merges group points that stem from a similar common trait.

from scipy.cluster.hierarchy import linkage, dendrogram

# Build the cluster tree bottom-up using Ward's minimum-variance linkage
linkage_matrix = linkage(scaled_data, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

The interpretation of a hierarchical clustering model is that the optimal number of clusters can be decided by where the dendrogram is “cut”. In other words, if you “cut” the dendrogram horizontally at a certain distance level on the y-axis, the vertical lines crossed at that point are the clusters. In this dendrogram, cutting around distance = 15 results in 3 clusters (the orange, green, and red branches).
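
To turn that visual cut into actual cluster assignments, SciPy’s fcluster can be applied to the same linkage matrix. The sketch below (reusing the linkage_matrix computed above; the variable names are only illustrative) retrieves three flat clusters, which corresponds to cutting at roughly distance = 15:

from scipy.cluster.hierarchy import fcluster

# Cut the tree so that exactly 3 flat clusters remain
hier_labels = fcluster(linkage_matrix, t=3, criterion='maxclust')
print(np.unique(hier_labels, return_counts=True))      # cluster ids and their sizes

# Equivalently, cut at an explicit distance threshold taken from the dendrogram
hier_labels_by_distance = fcluster(linkage_matrix, t=15, criterion='distance')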

Based on this, the dendrogram suggests that the data can naturally be grouped into 3 clusters at a reasonable threshold, with varying internal similarities. It demonstrates that there are two larger “families”, or branches, shown in green and red. The orange cluster appears smaller and more distinct, indicating that its points must differ from the rest in many ways.

A helpful quality of this dendrogram is the representation of the data indices on the x-axis. These can be compared with the actual dataset for further exploration into the trends and natural characteristics of the data, as sketched below.
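
As a sketch of that comparison (reusing linkage_matrix and the original df from above, and assuming the dendrogram leaf positions correspond to row positions of df), the leaf order returned by dendrogram can be mapped back to country names:

from scipy.cluster.hierarchy import dendrogram

# no_plot=True returns the layout information without drawing a second figure
dendro_info = dendrogram(linkage_matrix, no_plot=True)
leaf_order = dendro_info['leaves']          # sample indices in left-to-right leaf order

# Map leaf positions back to the original rows (country names)
print(df['Country'].iloc[leaf_order].tolist())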

Summary

The unsupervised learning models used in this section found different numbers of optimal cluster groupings for this data; what especially stood out was that the t-SNE-transformed data found 9 clusters to be optimal. Insights gained from the dimensionality reduction include the recognition that this data does not naturally conform to clusters, even at varying perplexity levels in the t-SNE analysis. All of the unsupervised learning methods found a different optimal value for n_clusters: K-Means found two clusters, DBSCAN found four (plus a noise group that captured 11 data points), and hierarchical clustering found three. One thing that was consistent across the unsupervised learning models is that all of them identified one cluster that was significantly larger than the other(s), which reflects imbalanced data.

Overall, this unsupervised exploration indicates a complex dataset with patterns that are difficult to uncover. Next, this data will be explored using supervised learning to see whether more information on the interactions between variables can be gained.