Exploratory Data Analysis

The purpose of this code is to perform exploratory data analysis (EDA) on the data through visualizations and statistical testing in order to surface trends, correlations, and other informative qualities of the data. The visualizations include interactive map plots (to explore geospatial patterns), a heatmap (representing a correlation matrix), a pairplot of scatterplots (to compare numerical variables), line graphs (to visualize data over time), and bar charts, stacked bar charts, and boxplots (to compare levels of categorical variables). The statistical tests performed are the Kruskal-Wallis test with Dunn’s pairwise comparisons and a t-test.

To start, necessary libraries and the two datasets must be loaded:

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

# temporal data
timedf = pd.read_csv('../../data/processed-data/complete_merge.csv')
timedf['Tier_2024'] = timedf['Tier_2024'].astype('category')

# ML data
df = pd.read_csv('../../data/processed-data/MLdataset.csv')

Interactive Maps

Tiers and Victim Nonpunishment Policy by Country

# work on a copy so the plotting columns added below do not end up in df
tiermap = df.copy()

# Make new column renaming the Tier to a more representative name for plotting
tier_map = {
    1.0: 'Tier 1',
    2.0: 'Tier 2',
    2.2: 'Tier 2 Watch List',
    3.0: 'Tier 3'
}
tiermap['Tier_Label'] = tiermap['Tier_2024'].map(tier_map)

# Map ISO codes to country name in dataset -- this helps choropleth identify locations
iso_codes = {
    'Albania': 'ALB', 'Andorra': 'AND', 'Armenia': 'ARM', 'Austria': 'AUT',
    'Azerbaijan': 'AZE', 'Belgium': 'BEL', 'Bosnia and Herzegovina': 'BIH',
    'Bulgaria': 'BGR', 'Croatia': 'HRV', 'Cyprus': 'CYP', 'Czechia': 'CZE',
    'Denmark': 'DNK', 'Estonia': 'EST', 'Finland': 'FIN', 'France': 'FRA',
    'Georgia': 'GEO', 'Germany': 'DEU', 'Greece': 'GRC', 'Hungary': 'HUN',
    'Iceland': 'ISL', 'Ireland': 'IRL', 'Italy': 'ITA', 'Latvia': 'LVA',
    'Lithuania': 'LTU', 'Luxembourg': 'LUX', 'Malta': 'MLT', 'Moldova': 'MDA',
    'Monaco': 'MCO', 'Montenegro': 'MNE', 'Netherlands': 'NLD',
    'North Macedonia': 'MKD', 'Norway': 'NOR', 'Poland': 'POL', 'Portugal': 'PRT',
    'Romania': 'ROU', 'Russian Federation': 'RUS', 'Serbia': 'SRB',
    'Slovakia': 'SVK', 'Slovenia': 'SVN', 'Spain': 'ESP', 'Sweden': 'SWE',
    'Switzerland': 'CHE', 'Turkiye': 'TUR', 'Ukraine': 'UKR',
    'United Kingdom': 'GBR'
}

tiermap['ISO'] = tiermap['Country'].map(iso_codes)

color_map = {
    'Tier 1': '#1a9850',      
    'Tier 2': '#fee08d',   
    'Tier 2 Watch List': '#f46d43',  
    'Tier 3': '#d73027'
}

# Create the choropleth map
fig = px.choropleth(
    tiermap,
    locations='ISO',
    locationmode='ISO-3',
    color='Tier_Label',
    scope='europe',
    color_discrete_map=color_map,
    category_orders={'Tier_Label': ['Tier 1', 'Tier 2', 'Tier 2 Watch List', 'Tier 3']},
    title='Human Trafficking Tier Ratings in Europe',
    custom_data=['Country', 'Tier_Label'] 
)


# Add star markers to denote countries with nonpunishment policy 
marker_data = tiermap[tiermap['Nonpunishment_policy_after2021'] == 1]
fig.add_scattergeo(
    locations=marker_data['ISO'],
    locationmode='ISO-3',
    marker=dict(size=10, symbol='star', color='black', line=dict(width=1, color='white')),
    name='Nonpunishment Policy After 2021',
    hoverinfo='text',
    text=marker_data['Country']
)

# specify layout
fig.update_layout(
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
        center=dict(lon=15, lat=50),
        lataxis_range=[35, 70],
        lonaxis_range=[-10, 40]
    ),
    width=1000,
    height=600,
    template='plotly_white',
    legend_title_text='Tier Rating',
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    )
)

fig.update_traces(
    hovertemplate="<b>%{customdata[0]}</b><br>%{customdata[1]}<extra></extra>"
)

fig.show()

This map suggests a geographical trend in the Tier placements. Countries in Northern and Western Europe are, with a few exceptions, mainly Tier 1, whereas countries in Southern Europe are all in Tier 2 or on the Tier 2 Watch List, as is most of Eastern Europe. Countries that specify a victim nonpunishment policy in national legislation show no apparent geographical trend or relationship to Tier placement.
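Because `.map()` returns `NaN` for any country name that is missing from the dictionary, a country spelled differently in the data would simply not be drawn on the map. The following is a minimal sketch of a check for this, using a toy frame and a toy dictionary (both hypothetical stand-ins, not the project's data):

```python
import pandas as pd

# Hypothetical stand-ins for df and iso_codes, for illustration only
toy = pd.DataFrame({'Country': ['Albania', 'Turkey', 'Czech Republic', 'France']})
toy_iso_codes = {'Albania': 'ALB', 'Turkiye': 'TUR', 'Czechia': 'CZE', 'France': 'FRA'}

toy['ISO'] = toy['Country'].map(toy_iso_codes)

# Names with no dictionary key come back as NaN; list them so they can be fixed
unmapped = sorted(toy.loc[toy['ISO'].isna(), 'Country'].unique())
print(unmapped)  # ['Czech Republic', 'Turkey']
```

Running the same two lines against the real data before plotting would show whether any country drops off the map because of a naming mismatch.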

Detected Trafficking Victims over time by Country

# map the country codes determined in the previous code 
timedf['ISO'] = timedf['Country'].map(iso_codes)

# interactive choropleth map of the raw counts of detected victims
fig = px.choropleth(
    timedf,
    locations='ISO',
    locationmode='ISO-3',
    color='Detected_victims',
    scope='europe',
    color_continuous_scale='Reds', # specify color palette 
    title='Changes in Human Trafficking Detections in Europe',
    animation_frame='Year', # adds slider feature for years
    hover_name='Country', # hover the country name
    labels={'Detected_victims': 'Detected victims'},  
    custom_data=['Country', 'Detected_victims'] 
)

# specify layout
fig.update_layout(
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
        center=dict(lon=15, lat=50),
        lataxis_range=[35, 70],
        lonaxis_range=[-10, 40]
    ),
    width=1000,
    height=600,
    template='plotly_white',
    coloraxis_colorbar=dict(
        title="Detected victims"
    )
)

# Customize hover text
fig.update_traces(
    hovertemplate="<b>%{customdata[0]}</b><br>Detected victims: %{customdata[1]}<extra></extra>"
)

fig.show()

Detected Trafficking Victims per 100 Population over time by Country

# same code as above, just switch to detected victims per 100 population

# interactive choropleth map
fig = px.choropleth(
    timedf,
    locations='ISO',
    locationmode='ISO-3',
    color='Detections_per_100',
    scope='europe',
    color_continuous_scale='Reds', # specify color palette 
    title='Changes in Human Trafficking Detections per 100 People in Europe',
    animation_frame='Year', # adds slider feature for years
    hover_name='Country', # hover the country name
    labels={'Detections_per_100': 'Detections per 100'},  
    custom_data=['Country', 'Detections_per_100'] 
)

# Layout and appearance
fig.update_layout(
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
        center=dict(lon=15, lat=50),
        lataxis_range=[35, 70],
        lonaxis_range=[-10, 40]
    ),
    width=1000,
    height=600,
    template='plotly_white',
    coloraxis_colorbar=dict(
        title="Detections per 100",
        tickformat=".2f"
    )
)

# Customize hover text
fig.update_traces(
    hovertemplate="<b>%{customdata[0]}</b><br>Detections per 100: %{customdata[1]:.2f}<extra></extra>"
)

# Show the map
fig.show()

This shows that over the years, Bulgaria has consistently had a high rate of detecting trafficking victims per 100 population compared to other countries. This visualization highlights which countries with smaller populations are cracking down on detecting trafficking victims.
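For reference, the per-100 rate is the raw count scaled by population. A minimal sketch of the calculation, assuming a population column (the `Population` name and the numbers here are hypothetical; the processed dataset already ships `Detections_per_100`):

```python
import pandas as pd

# Toy data: two countries with the same raw count but different populations
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'Detected_victims': [500, 500],
    'Population': [1_000_000, 10_000_000],  # hypothetical column name
})

# detected victims per 100 inhabitants
toy['Detections_per_100'] = toy['Detected_victims'] / toy['Population'] * 100
print(toy[['Country', 'Detections_per_100']])
```

The two toy countries detect the same number of victims, but the smaller one has a rate ten times higher, which is why the two maps can tell different stories.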

Prostitution Policy by Country in Europe

# map the country codes to ISO-3 so the choropleth can identify locations
df['ISO'] = df['Country'].map(iso_codes)

# specify the color mapping for different levels of prostitution policy
color_map = {
    'prohibited': '#fa0202',    
    'neoabolitionism': '#f57c18',   
    'abolitionism': '#311a98',
    'decriminalization': '#80f72a',
    'legal': '#27981a'
}


fig = px.choropleth(df,
                    locations='ISO',
                    locationmode='ISO-3',
                    color='prostitution_policy',
                    scope='europe',
                    color_discrete_map=color_map,
                    category_orders={'prostitution_policy': ['legal', 'decriminalization', 'abolitionism', 'neoabolitionism', 'prohibited']},
                    title='Prostitution Policy by Country in Europe',
                    custom_data=['Country', 'prostitution_policy']
                    )

# specify layout
fig.update_layout(
    title_x=0.5,
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular',
        center=dict(lon=15, lat=50),
        lataxis_range=[35, 70],
        lonaxis_range=[-10, 40]
    ),
    width=1000,
    height=600,
    template='plotly_white',
    legend_title_text='Prostitution Policy',
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01
    )
)

fig.update_traces(
    hovertemplate="<b>%{customdata[0]}</b><br>%{customdata[1]}<extra></extra>"
)

fig.show()

Heatmap correlation plot


# drop non-numerical columns
dropcols = ['Country', 'Subregion', 'prostitution_policy'] 
# select_dtypes also excludes any remaining text columns (e.g. the ISO codes added for the maps)
numeric_df = df.drop(columns=dropcols).select_dtypes(include='number')

# calculate the correlation matrix
correlation_matrix = numeric_df.corr()

# heat map visualization 
plt.figure(figsize=(20, 16))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', cbar=True)
plt.title("Correlation Matrix")
plt.show()

Key observations: The country indicators do not have strong correlations with the human trafficking detection data. The detection data is only relatively strongly correlated with other detection indicators (e.g. the correlation coefficient between the sum of detected victims and the sum of repatriated victims is 0.62, which indicates a moderately strong positive relationship between the two).

The highest correlation coefficients in the data between pairs of variables are:

0.87 between the mean criminal justice score and the mean political stability over time
-0.80 between the mean criminal justice score and the mean Tier placement over time
0.80 between the mean criminal justice score and Henley passport index score
0.77 between the mean criminal justice score and the mean GDP per capita
0.76 between the mean political stability over time and Henley passport index score
0.72 between the mean political stability and the mean GDP per capita over time
0.72 between the EU membership and the Henley passport index score
-0.71 between the mean Tier placement over time and Henley passport index score
-0.67 between the mean criminal justice score and the Tier placement in 2024
-0.64 between the mean political stability and the mean Tier placement over time

The mean criminal justice score of a country appears in the four highest pairwise correlation coefficients, and in five of the top ten. This suggests that the criminal justice score is a central quality of a country that is related to a multitude of other variables. In terms of data that reflects government response, the mean Tier placement over time is strongly negatively correlated (-0.80) with the mean criminal justice score, indicating that the higher the criminal justice score, the lower the Tier placement (the more likely it is to be 1, which represents strong government efforts to combat human trafficking). The mean Tier placement over time is also negatively correlated with the Henley passport index score (-0.71), which counts the visa-free destinations granted by the country’s passport, and with the mean political stability (-0.64).
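A ranked list like the one above can also be pulled from the correlation matrix programmatically rather than read off the heatmap. The following is a minimal sketch on toy data; the helper name `top_correlations` is hypothetical and is not part of the original analysis:

```python
import numpy as np
import pandas as pd

def top_correlations(corr: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n variable pairs with the largest absolute correlation."""
    # keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # rank by absolute value so strong negative correlations are kept
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data: y is a multiple of x, z tends to move in the opposite direction
toy = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [5, 3, 4, 1, 2],
})
top = top_correlations(toy.corr(), n=3)
print(top)
```

Applied to `correlation_matrix`, the same helper would return the pairs sorted by strength, which makes counts such as "how many of the top ten involve the criminal justice score" easy to verify.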

Pairplot

In the code below, I subsetted to the variables that I deemed to be the most significant in the correlation matrix heatmap visualization (the variables that appeared in most of the high correlation coefficients) and visualized them in a pairplot format. This shows the specific datapoints’ relationships to one another. I also included the GSI government response score and the sum of detected victims in this pairplot, because the first offers an insightful perspective into government efforts to address human trafficking and the second offers insight into results. Assessing these variables alongside the other strong indicators may reveal some important findings. I decided to color the data points by their Tier placement in 2024.

significants = df[['Criminal_justice_mean', 'Visa_free_destinations', 'GDP_per_capita_mean', 'Political_stability_mean', 'EU_members', 'GSI_gov_response_score', 'Detected_victims_sum', 'Tier_2024']]
# custom palette to original orange and blue-- it started outputting the hue in shades of pink originally
custom_palette = sns.color_palette(["#1f77b4", "#ff7f0e", "#aec7e8", "#ffbb78"])  # blue and orange theme
sns.pairplot(significants,hue='Tier_2024', palette=custom_palette)

The pairplot above reveals interesting characteristics in the data:

Aside from showing the relationships between numerical variables, the hue representing the Tiers also shows that, generally, Tier 1 is associated with higher criminal justice scores, more visa-free destinations, higher GDP per capita, and greater political stability.
The mean criminal justice scores appear to differ between countries in Tier 1 and Tier 2, as seen in the top-left plot, in which the peak of the distribution of Tier 1 criminal justice mean values lies at a higher score than that of Tier 2.
GSI government response and the sum of detected victims appear to have little or no relationship to the other indicators. This can be inferred from the vertical line shapes of the scatter plots in the GSI_gov_response_score column (the second from the right), which indicate that all GSI government response scores are very similar and do not vary over a considerable range. Thus, they are not correlated with the other indicators. Furthermore, the distributions visualized in the Kernel Density Estimation plots (the non-scatterplot plots) overlap between Tiers for the government response score, indicating that there is no apparent relationship between GSI government response and Tier ranking in 2024.
The scatterplots pertaining to the sum of detected victims all have scattered, irregular shapes, indicating that none of the variables explains the sum of detected victims.

Line plots

Total Detected Victims over time

This plot visualizes the total number of detected victims over time using the temporal dataset.

# Sum of detected victims in each subregion per year
data_summary = timedf.groupby(['Year', 'Subregion'])['Detected_victims'].sum().reset_index()

# line plot: 
plt.figure(figsize=(12, 6))
sns.lineplot(data=data_summary, x='Year', y='Detected_victims', hue='Subregion', marker='o')
xticks = range(int(data_summary['Year'].min()), int(data_summary['Year'].max()) + 1, 2)
plt.xticks(ticks=xticks)
plt.title('Detected Victims Over Time by Subregion')
plt.xlabel('Year')
plt.ylabel('Sum of Detected Victims')
plt.legend(title='Subregion')
plt.grid(True)
plt.show()

This plot shows that between 2011 and 2016, detected victims of human trafficking were highest in Eastern Europe. It is difficult to derive overall trends from this visualization, as the changes over time are erratic. However, the number of detected victims in Southern Europe appears to be increasing rather steadily. A noticeable drop in detected victims is apparent for all regions in 2021 and 2022, likely due to the COVID-19 pandemic.

Detected Victims per 100 Citizens per Subregion over time

# Sum of the per-100 detection rates of the countries in each subregion per year
per100 = timedf.groupby(['Year', 'Subregion'])['Detections_per_100'].sum().reset_index()

# line plot: 
plt.figure(figsize=(12, 6))
sns.lineplot(data=per100, x='Year', y='Detections_per_100', hue='Subregion', marker='o')
xticks = range(int(per100['Year'].min()), int(per100['Year'].max()) + 1, 2)
plt.xticks(ticks=xticks)
plt.title('Detected Victims per 100 Over Time by Subregion')
plt.xlabel('Year')
plt.ylabel('Detected Victims per 100 People')
plt.legend(title='Subregion')
plt.grid(True)
plt.show()

In contrast to the previous plot, this one shows the trends in the standardized rate of detected human trafficking victims per country per year, represented as the number of detected victims per 100 population. This plot shows that in recent years, countries in Southern Europe have surpassed all others in the number of victims they detected per 100 population.
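One caveat when aggregating rates: summing per-country rates favours subregions that contain more countries. The sketch below, on toy numbers with a hypothetical `Population` column, compares the summed rate with the mean rate and with a pooled rate (total detections over total population):

```python
import pandas as pd

# Toy data: three small countries in one subregion, one larger country in another
toy = pd.DataFrame({
    'Subregion': ['North', 'North', 'North', 'South'],
    'Detected_victims': [10, 10, 10, 60],
    'Population': [1000, 1000, 1000, 2000],  # hypothetical column name
})
toy['Detections_per_100'] = toy['Detected_victims'] / toy['Population'] * 100

by_sub = toy.groupby('Subregion')
summed_rate = by_sub['Detections_per_100'].sum()   # grows with the number of countries
mean_rate = by_sub['Detections_per_100'].mean()    # average country-level rate

# pooled rate: all detections in the subregion over all inhabitants
totals = by_sub[['Detected_victims', 'Population']].sum()
pooled_rate = totals['Detected_victims'] / totals['Population'] * 100

print(pd.DataFrame({'sum': summed_rate, 'mean': mean_rate, 'pooled': pooled_rate}))
```

In this toy case the summed rates are identical for both subregions, while the mean and pooled rates show the second subregion detecting three times as many victims per 100 people. Which aggregation is appropriate depends on the question being asked.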

Detected Victims by Country, grouped by Subregion

# create a 3x2 grid of subplots (one per subregion, with one spare)
fig, axes = plt.subplots(3, 2, figsize=(20, 25))
axes = axes.flatten() 

# create a plot for each of the five subregions (rows with a missing Subregion are skipped)
subregions = timedf['Subregion'].dropna().unique()
for idx, subregion in enumerate(subregions):
    # filter data for the current subregion
    subregion_data = timedf[timedf['Subregion'] == subregion]
    
    # plot line plot
    ax = axes[idx] 
    for country in subregion_data['Country'].unique():
        country_data = subregion_data[subregion_data['Country'] == country]
        ax.plot(country_data['Year'], country_data['Detected_victims'], 
                label=country, marker='o')
    
    # Customize subplots
    ax.set_title(f'Detected Victims Over Time in {subregion}', pad=20)
    ax.set_xlabel('Year')
    ax.set_ylabel('Detected Victims')
    ax.grid(True, alpha=0.3)
    
    # logistics
    ax.tick_params(axis='x', rotation=45)
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Country')

# remove any unused subplot axes
for unused_ax in axes[len(subregions):]:
    fig.delaxes(unused_ax)

# adjust layout to prevent overlap
plt.tight_layout()
plt.show()

This plot shows that interestingly, each subregion has one country that stands out compared to the others with much higher relative counts of detected victims of human trafficking. The outlier country in Southern Europe is Italy, in Western Asia it’s Turkey, in Western Europe it’s the Netherlands, in Eastern Europe it’s Romania, and in Northern Europe it’s the UK, but the data reported by the UK stops after 2016. The Netherlands appears to be the country with the highest counts of detected victims over time.

Tier over time by Subregion

tiersub = timedf.groupby(['Year', 'Subregion'])['Tier'].mean().reset_index()

# line plot: 
plt.figure(figsize=(12, 6))
sns.lineplot(data=tiersub, x='Year', y='Tier', hue='Subregion', marker='o')
xticks = range(int(tiersub['Year'].min()), int(tiersub['Year'].max()) + 1, 2)
plt.xticks(ticks=xticks)
plt.title('Average Tier Placement per Subregion Over Time')
plt.xlabel('Year')
plt.ylabel('Average Tier Placement')
plt.legend(title='Subregion')
plt.grid(True)
plt.show()

This plot visualizes the average Tier placement per subregion over time. The lines mostly trend gradually upward, suggesting that the subregions are getting worse in their Tier placements (a higher value is a worse placement). Alternatively, this may mean that the USDS is becoming stricter in its Tier placements over time. The exception is Western Asia, which remains steadily at an average of Tier 2 over time. An important characteristic to note is the significant change in Eastern European Tier placements over the years: in 2023, the Eastern European mean surpassed that of Western Asia and now lies above a value of 2, indicating a new weight of countries on the Tier 2 Watchlist and in Tier 3.

GDP per Capita over time by Subregion

# Group by Year and Subregion to calculate average GDP
data_gdp = timedf.groupby(['Year', 'Subregion'])['GDP_per_capita'].mean().reset_index()

# Line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=data_gdp, x='Year', y='GDP_per_capita', hue='Subregion', marker='o')
plt.title('GDP per Capita Over Time by Subregion')
plt.xlabel('Year')
plt.ylabel('Average GDP per Capita')
plt.legend(title='Subregion')
plt.grid(True)
plt.show()

This plot shows that GDP per capita is substantially higher in Western and Northern Europe than in the other subregions. It is highest in Western Europe, with a noticeable increase in recent years.

Bar charts

Tier placements in 2024

This plot visualizes the breakdown of countries in Tier placements in the USDS’s 2024 Trafficking in Persons report. As a reminder:

Tier 1 (1): countries that fully comply with the Trafficking Victim Protection Act’s (TVPA) minimum standards
Tier 2 (2): countries that do not fully meet the TVPA’s minimum standards but are making significant efforts to do so
Tier 2 Watchlist (2.2): countries in Tier 2 that also are either:
- experiencing an increase in human trafficking prevalence that they are not proportionally responding to, OR
- failing to provide evidence of increasing efforts
Tier 3 (3): countries that do not fully meet the TVPA’s minimum standards and are not making significant efforts to do so.

# convert Tier_2024 to categorical 
df['Tier_2024'] = df['Tier_2024'].astype('category')

plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='Tier_2024') 
plt.xlabel('Tier Placements')
plt.ylabel('Count')
plt.title('Counts of Each Level of Tier_2024')
plt.show()

This shows that the majority of countries in Europe are in Tier 2, followed by Tier 1. The Tier 2 Watchlist and Tier 3 both contain very few countries in 2024.

Convictions and Prosecution totals by Subregion

This code creates a grouped bar chart to compare the total number of convicted and prosecuted traffickers in each subregion. This may help provide some insight into which regions are stricter about convicting traffickers once prosecuted, which may reflect how seriously they take the crime.

# group by Subregion to sum convictions and prosecutions
sub = timedf.groupby('Subregion')[['Convicted_traffickers', 'Prosecuted_traffickers']].sum().reset_index()

# Melt the data for easier plotting
sub_melted = sub.melt(
    id_vars='Subregion',
    value_vars=['Convicted_traffickers', 'Prosecuted_traffickers'],
    var_name='Metric',
    value_name='Count'
)

# Bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=sub_melted, x='Subregion', y='Count', hue='Metric')
plt.title('Convictions vs Prosecutions by Subregion')
plt.xlabel('Subregion')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

This plot shows that the ratio of convicted traffickers to prosecuted traffickers is highest in Northern Europe, where there is a smaller gap between the two counts. However, it is imperative to recognize the scale of the counts here: Northern Europe has far fewer cases of prosecuted traffickers. This may reflect a poorer job by law enforcement in detecting traffickers, or that there are simply fewer traffickers in this subregion.
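The ratio described above can be computed directly instead of being judged from bar heights. A minimal sketch on made-up counts (the `Conviction_ratio` column name is hypothetical):

```python
import pandas as pd

# Toy yearly counts for two subregions (made-up numbers)
toy = pd.DataFrame({
    'Subregion': ['North', 'North', 'East', 'East'],
    'Convicted_traffickers': [15, 25, 100, 200],
    'Prosecuted_traffickers': [20, 30, 400, 600],
})

# total convictions and prosecutions per subregion, as in the bar chart
sub_totals = toy.groupby('Subregion')[['Convicted_traffickers', 'Prosecuted_traffickers']].sum()

# share of prosecuted traffickers who were convicted
sub_totals['Conviction_ratio'] = (
    sub_totals['Convicted_traffickers'] / sub_totals['Prosecuted_traffickers']
)
print(sub_totals)
```

Printing the totals next to the ratio keeps the scale visible, so a high ratio built on very few prosecutions is not over-interpreted.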

Tier by Prostitution Policy

# crosstab to count occurrences
crosstab = pd.crosstab(df['Tier_2024'], df['prostitution_policy'])

# stacked bar chart
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6))

plt.title('Prostitution policy by Tier')
plt.xlabel('Tier')
plt.ylabel('Count')
plt.legend(title='Prostitution policy')
# set customize y axis tick intervals: 
max_y = crosstab.values.sum(axis=1).max() 
plt.yticks(np.arange(0, max_y + 2, 2)) 
plt.show()

This stacked bar chart shows that few countries in Tier 1 have prohibited prostitution policy, whereas this policy constitutes a larger share in the lower Tiers.
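Since the Tiers contain different numbers of countries, shares are easier to compare than raw counts. A minimal sketch, on toy data, of row-normalizing a crosstab with the `normalize='index'` option of `pd.crosstab`:

```python
import pandas as pd

# Toy data: six hypothetical countries
toy = pd.DataFrame({
    'Tier': ['Tier 1', 'Tier 1', 'Tier 1', 'Tier 1', 'Tier 2', 'Tier 2'],
    'policy': ['legal', 'legal', 'legal', 'prohibited', 'prohibited', 'legal'],
})

# normalize='index' turns each row into shares that sum to 1
shares = pd.crosstab(toy['Tier'], toy['policy'], normalize='index')
print(shares)
```

Passing a table like `shares` to `.plot(kind='bar', stacked=True)` would give bars of equal height, so the policy mix in each Tier can be compared directly.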

Tier by Nonpunishment Policy

# crosstab to count occurrences
crosstab = pd.crosstab(df['Tier_2024'], df['Nonpunishment_policy_after2021'])

# stacked bar chart
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6))

plt.title('Victim nonpunishment policy by Tier')
plt.xlabel('Tier')
plt.ylabel('Count')
plt.legend(title='Victim nonpunishment policy')
max_y = crosstab.values.sum(axis=1).max()  
plt.yticks(np.arange(0, max_y + 2, 2))  
plt.show()

There appears to be no noticeable difference in victim nonpunishment policy between Tiers.
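A visual impression like this one can be checked with a chi-squared test of independence on the same kind of crosstab. The following is a minimal sketch using `scipy.stats.chi2_contingency` on made-up counts (not the project's data); it also flags small expected counts, under which the chi-squared approximation becomes unreliable:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: rows are Tiers, columns are policy absent (0) / present (1)
table = pd.DataFrame(
    [[9, 8], [12, 11], [1, 1]],
    index=['Tier 1', 'Tier 2', 'Tier 2 Watch List'],
    columns=[0, 1],
)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")

# rule of thumb: expected counts below 5 make the approximation questionable
small_cells = bool((expected < 5).any())
print("Any expected count below 5:", small_cells)
```

With so few countries in the Tier 2 Watchlist and Tier 3, the real table would likely have small expected counts as well, so any such test result should be read with caution.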

Tier Placements 2024 Breakdown by Subregion

# convert Tier_2024 back to numeric, then drop rows without a Tier and the Western Asia subregion
df['Tier_2024'] = pd.to_numeric(df['Tier_2024'], errors='coerce')
df_cleaned = df.dropna(subset=['Tier_2024']).copy()
df_cleaned = df_cleaned[df_cleaned['Subregion'] != "Western Asia"].copy()

# order Subregion levels
subregion_order = ['Northern Europe', 'Western Europe', 'Southern Europe', 'Eastern Europe']
df_cleaned['Subregion'] = pd.Categorical(df_cleaned['Subregion'], categories=subregion_order, ordered=True)

# create pivot table (counts of countries per Subregion and Tier)
tier_pivot = pd.crosstab(df_cleaned['Subregion'], df_cleaned['Tier_2024'])

# stacked bar chart
tier_pivot.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='Set2')

plt.title('Distribution of Tier by Subregion')
plt.xlabel('Subregion')
plt.ylabel('Count')
plt.legend(title='Tier')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This output shows not only the counts of countries in each subregion, but also the breakdown of Tiers. From this plot, we see that the majority of countries in Northern and Western Europe are Tier 1. There are zero Tier 1 countries in Southern Europe, and both Tier 2 Watchlist countries (Malta and Serbia) are in Southern Europe. The only Tier 3 country in Europe (Russia) is in Eastern Europe.

Boxplots

Sum of detected victims by Tier, Nonpunishment Policy, Prostitution Policy, and EU Membership

The code below outputs a visualization with four boxplots assessing the relationship of various categorical variables to the sum of detected victims, facilitating side-by-side comparison.


def boxplots(df):
    # allow 4 subplots
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 15))
    
    # 1. Detected Victims sum by Tier
    sns.boxplot(data=df, x='Tier_2024', y='Detected_victims_sum', ax=ax1)
    ax1.set_title('Sum of Detected Victims by Tier')
    ax1.set_ylabel('Number of Detected Victims')
    
    # 2. Detected Victims sum by Nonpunishment Policy 
    sns.boxplot(data=df, x='Nonpunishment_policy_after2021', y='Detected_victims_sum', ax=ax2)
    ax2.set_title('Sum of Detected Victims by Nonpunishment Policy')
    ax2.set_xlabel('Has Nonpunishment Policy')
    ax2.set_ylabel('Number of Detected Victims')
    
    # 3. Detected Victims sum by Prostitution Policy 
    sns.boxplot(data=df, x='prostitution_policy', y='Detected_victims_sum', ax=ax3)
    ax3.set_title('Sum of Detected Victims by Prostitution Policy')
    ax3.set_xlabel('Prostitution policy')
    ax3.set_ylabel('Number of Detected Victims')

    # 4. Detected Victims sum by EU membership 
    sns.boxplot(data=df, x='EU_members', y='Detected_victims_sum', ax=ax4)
    ax4.set_title('Sum of Detected Victims by EU membership')
    ax4.set_xlabel('EU Member')
    ax4.set_ylabel('Number of Detected Victims')

    plt.tight_layout()

# the function only draws the figure, so there is nothing to assign
boxplots(df)
plt.show()

Among the four categorical variables assessed, none appear to have noticeable associations with the sum of detected victims. The only trends that stand out are that the sum is much lower in the Tier 2 Watchlist compared to the other Tiers, and that the sum of detected victims is lower and has a narrower range for the prostitution policies of prohibition and neoabolitionism. There is no discernible difference in the sum of detected victims by victim nonpunishment policy or by EU membership.

Mean Criminal Justice Score by Tier, Nonpunishment Policy, Prostitution Policy, and EU Membership


def boxplots(df):
    # allow 4 subplots
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(20, 15))
    
    # 1. Mean Criminal Justice by Tier
    sns.boxplot(data=df, x='Tier_2024', y='Criminal_justice_mean', ax=ax1)
    ax1.set_title('Average Criminal Justice Score by Tier')
    ax1.set_ylabel('Average Criminal Justice Score')
    
    # 2. Mean Criminal Justice by Nonpunishment Policy 
    sns.boxplot(data=df, x='Nonpunishment_policy_after2021', y='Criminal_justice_mean', ax=ax2)
    ax2.set_title('Average Criminal Justice Score by Nonpunishment Policy')
    ax2.set_xlabel('Has Nonpunishment Policy')
    ax2.set_ylabel('Average Criminal Justice Score')
    
    # 3. Mean Criminal Justice by Prostitution Policy 
    sns.boxplot(data=df, x='prostitution_policy', y='Criminal_justice_mean', ax=ax3)
    ax3.set_title('Average Criminal Justice Score by Prostitution Policy')
    ax3.set_xlabel('Prostitution policy')
    ax3.set_ylabel('Average Criminal Justice Score')

    #4. Mean Criminal Justice by EU membership 
    sns.boxplot(data=df, x='EU_members', y='Criminal_justice_mean', ax=ax4)
    ax4.set_title('Average Criminal Justice Score by EU membership')
    ax4.set_xlabel('EU Member')
    ax4.set_ylabel('Average Criminal Justice Score')
    

    plt.tight_layout()

# the function only draws the figure, so there is nothing to assign
boxplots(df)
plt.show()

The side-by-side visualization of these boxplots conveys that criminal justice scores are higher in Tier 1, in countries with neoabolitionist prostitution policy, and in EU members. The criminal justice score is alarmingly low for Tier 3 countries and, interestingly, lower in countries where prostitution is prohibited. There is no strong apparent difference by nonpunishment policy, but the visualization suggests that the mean criminal justice scores among countries that do not specify a victim nonpunishment policy are actually slightly higher.

Kruskal-Wallis test: Criminal justice scores by Tier

To take the boxplot results of the mean criminal justice scores by Tier one step further, the statistical significance of the difference between groups will be assessed with the Kruskal-Wallis test along with Dunn’s pairwise comparison test. These nonparametric tests were chosen because they are more robust to violations of the assumptions of ANOVA: for example, the Kruskal-Wallis test does not assume normally distributed data and can handle small and unequal sample sizes across groups.

import scikit_posthocs as sp
from scipy.stats import kruskal

# A condition for the KW test is that all groups must have at least two observations, 
# so the data must be filtered to only include these. 
df_filtered = df.groupby('Tier_2024').filter(lambda x: len(x) >= 2).copy()

# drop missing values 
df_filtered = df_filtered.dropna(subset=['Criminal_justice_mean'])

# Perform Kruskal-Wallis test
groups = [df_filtered[df_filtered['Tier_2024'] == tier]['Criminal_justice_mean'] for tier in df_filtered['Tier_2024'].unique()]
h_stat, p_value = kruskal(*groups)
print(f"H-statistic: {h_stat}, P-value: {p_value}")

# Pairwise comparisons using Dunn’s test
pairwise_results = sp.posthoc_dunn(df_filtered, val_col='Criminal_justice_mean', group_col='Tier_2024', p_adjust='bonferroni')
print(pairwise_results)

H-statistic: 17.21799939487991, P-value: 0.0006374006592237717
          1.0       2.0  2.2       3.0
1.0  1.000000  0.000883  1.0  0.142186
2.0  0.000883  1.000000  1.0  1.000000
2.2  1.000000  1.000000  1.0  1.000000
3.0  0.142186  1.000000  1.0  1.000000

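These results indicate a statistically significant difference in the median of the mean criminal justice score across Tiers (H ≈ 17.22, p ≈ 0.0006). In Dunn’s pairwise comparisons with Bonferroni adjustment, only the Tier 1 and Tier 2 pair differs significantly (adjusted p ≈ 0.0009); no other pair falls below the 0.05 threshold.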
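The choice of a nonparametric test rests on ANOVA's assumptions being doubtful for this data. The following is a minimal sketch of how those assumptions could be checked, using the Shapiro-Wilk test for normality within each group and Levene's test for equal variances. The scores are randomly generated toy values, not the project's data:

```python
import numpy as np
from scipy.stats import levene, shapiro

# Toy scores for three groups of unequal size (randomly generated)
rng = np.random.default_rng(0)
groups = {
    'Tier 1': rng.normal(70, 5, size=20),
    'Tier 2': rng.normal(55, 12, size=22),
    'Tier 2 Watch List': rng.normal(50, 8, size=3),
}

# Shapiro-Wilk per group: a small p-value suggests departure from normality
shapiro_p = {}
for name, values in groups.items():
    _, p_norm = shapiro(values)
    shapiro_p[name] = p_norm
    print(f"{name}: n = {len(values)}, Shapiro-Wilk p = {p_norm:.3f}")

# Levene's test: a small p-value suggests unequal variances across groups
levene_stat, levene_p = levene(*groups.values())
print(f"Levene p = {levene_p:.3f}")
```

With groups as small as the Tier 2 Watchlist, these checks have little power, which is itself a reason to prefer the rank-based Kruskal-Wallis test.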

T-test: Criminal Justice Scores by EU Membership

To further explore the differences visualized in the boxplots above, a t-test will be performed to statistically assess if EU members are distinguished from their counterparts when it comes to criminal justice scores.

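from scipy.stats import ttest_ind

ttestdf = df.dropna(subset=['Criminal_justice_mean'])
# split the scores into the two groups (assumes EU_members is coded 1 = member, 0 = non-member)
eu_scores = ttestdf.loc[ttestdf['EU_members'] == 1, 'Criminal_justice_mean']
non_eu_scores = ttestdf.loc[ttestdf['EU_members'] == 0, 'Criminal_justice_mean']
# Welch's t-test, which does not assume equal variances between the groups
t_stat, p_value = ttest_ind(eu_scores, non_eu_scores, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

T-statistic: -20.595011394291824, P-value: 6.5410194981716e-24

The values printed above come from a call that passed the EU_members column itself as the first sample, which compares the 0/1 membership indicator against the scores rather than the scores of the two groups, so they should not be read as the size of the group difference. With the samples split by membership as in the code above, a p-value below the alpha threshold of 0.05 would provide significant evidence of a difference in the true mean criminal justice scores between EU and non-EU member countries, which is the pattern the boxplots suggest.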

Criminal Justice Scores by Subregion

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Subregion', y='Criminal_justice_mean')
plt.title('Criminal Justice Scores by Subregion')
plt.xlabel('Subregion')
plt.ylabel('Criminal Justice Score')
plt.xticks(rotation=45)
plt.show()

The average criminal justice scores in Western and Northern Europe appear to be consistently and considerably higher than the other regions. Eastern Europe in particular has a country that has a very low criminal justice mean score, around 20/100.

Convictions over Prosecutions Ratio by Tier

# Convictions over prosecutions by Tier 
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Tier_2024', y='Convictions_over_prosecutions_mean')
plt.title('Convictions over prosecutions ratio by Tier')
plt.xlabel('Tier')
plt.ylabel('Convictions over prosecutions ratio')
plt.xticks(rotation=45)
plt.show()

This plot shows that, contrary to what might be expected, countries with Tier 2 Watchlist placements in 2024 had higher ratios of convictions to prosecutions. This could signal a considerable effort by law enforcement to actually punish traffickers. However, we cannot draw firm conclusions from this visualization because it does not show the sample sizes behind each Tier, and the Tier 2 Watchlist contains only two countries.
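The missing sample sizes can be tabulated alongside the group medians. A minimal sketch with toy values (the ratios are made up; only the column names follow the dataset):

```python
import pandas as pd

# Toy data: 5 Tier 1, 6 Tier 2, 2 Tier 2 Watchlist and 1 Tier 3 country
toy = pd.DataFrame({
    'Tier_2024': [1.0] * 5 + [2.0] * 6 + [2.2] * 2 + [3.0],
    'Convictions_over_prosecutions_mean': [
        0.5, 0.6, 0.4, 0.7, 0.5,       # Tier 1
        0.3, 0.4, 0.5, 0.2, 0.6, 0.4,  # Tier 2
        0.9, 0.8,                      # Tier 2 Watchlist
        0.1,                           # Tier 3
    ],
})

# number of countries and median ratio behind each box
summary = (
    toy.groupby('Tier_2024')['Convictions_over_prosecutions_mean']
       .agg(['count', 'median'])
)
print(summary)
```

A box built on two observations, like the Watchlist group here, reflects those two countries rather than a distribution, so its position should not be compared with the larger groups at face value.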

GSI Government response score by Tier

# GSI government response score by Tier 
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Tier_2024', y='GSI_gov_response_score')
plt.title('GSI Government Response Score by Tier')
plt.xlabel('Tier')
plt.ylabel('GSI Government Response Score')
plt.xticks(rotation=45)
plt.show()

This plot shows that the government response score to human trafficking does not vary considerably among Tiers. This is thought-provoking, as Tier 1 countries are expected to have better government responses, since that is the basis of their classification.

Sum of Repatriated Victims by Tier


plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Tier_2024', y='Number_Repatriated_Victims_sum')
plt.title('Sum of repatriated victims by Tier')
plt.xlabel('Tier')
plt.ylabel('Repatriated victims sum')
plt.xticks(rotation=45)
plt.show()

The sum of repatriated victims does not vary significantly by Tier, except for Malta and Serbia, the Tier 2 Watchlist countries. However, this may be explained by the fact that Malta, at least, has a very small population.

Summary

This EDA allowed for deeper insight into this data and into which indicators interact with each other. This is a critical phase for a project that tackles such a multifaceted problem. Specifically, this EDA highlighted how the mean criminal justice score of a country is a critical indicator related to various variables indirectly relevant to human trafficking drivers and government response, such as Tier placement. Other indicators that were expected to be more associated with government response to human trafficking, namely the specification of a victim nonpunishment policy and Walk Free’s GSI government response score, appeared unrelated to Tier placement and criminal justice scores. The implications of these results are discussed in the report of this project, and the indicators are further explored for their importance in predicting factors related to human trafficking in the supervised learning section.