Data Cleaning

Overview

In this document, I combine the individual datasets retained in the data collection phase. All of the data comes in tabular format (either Excel or CSV files), but the sources are inconsistent in several respects, such as how country names are presented (some sources use a country’s full, official title, for example “United Kingdom of Great Britain and Northern Ireland” rather than “United Kingdom”). The goal of this data cleaning code is to standardize the data to a consistent format so that all of the indicators can be merged into one dataset.

This process will ultimately produce two datasets: one that retains the temporal aspect of the data (observations over multiple years for each country), and one that aggregates the time series into representative metrics, such as the slope and mean, so that the data can be used in machine learning models. More information on this process is described in the “Create ML Dataset” section below.

Code

UNODC data cleaning

The code below shows the workflow for cleaning and subsetting the large, comprehensive raw dataset on trafficking in human beings from the UNODC down to a dataframe that only includes data on detected trafficking victims, convicted traffickers, and prosecuted traffickers. I created two main datasets from this data: one filtered to entries that specify the nationality of the people detected / convicted / prosecuted, and one that only includes the crude total for each indicator in each country in each year. The code for compiling the former is below:

To subset to data on the nationality of the persons detected / convicted / prosecuted, I filtered the ‘Dimension’ column to ‘by country of repatriation’. The subsequent column, ‘Category’, provides the nationality. An entry of ‘by country of repatriation’ means that a national of the country of interest was detected as a trafficking victim, or convicted / prosecuted, in a country different from their nationality. This data was included to enable analysis of how countries detect / convict / prosecute their own citizens versus foreigners.

Numerous entries were denoted “<5”, so any observation with this value was converted to a numeric 5. Though this is an approximation (the true count is somewhere below 5), it was considered the most appropriate way to handle the suppressed entries. The same step is applied in the second code block below.

import pandas as pd

# UNODC human trafficking dataframe with additional data on citizenship of victims / traffickers
repats = pd.read_excel('../../data/raw-data/UNODC_TIP_data.xlsx', skiprows=2) # remove first two rows

# 1. subset to Region == "Europe"
repats = repats[repats['Region'] == 'Europe']

# 2. subset indicator to only detected trafficking victims
repats = repats[repats['Indicator'] == 'Detected trafficking victims']

# 3. subset dimension column to only include "Total" and "by country of repatriation"
repats = repats[repats['Dimension'].isin(['Total', 'by country of repatriation'])]

# 4. subset to the Sex == "Total"
repats = repats[repats['Sex'] == 'Total']

# 5. subset to Age == "Total"
repats = repats[repats['Age'] == 'Total']

# 6. rename txtVALUE --> "Counts"
repats = repats.rename(columns={'txtVALUE': 'Counts'})

# 7. convert observations where Counts = "<5" to numeric value 5
repats['Counts'] = repats['Counts'].replace("<5", 5)
repats['Counts'] = pd.to_numeric(repats['Counts'], errors='coerce')

# 8. remove unnecessary columns
columns_to_drop = ['Iso3_code', 'Region', 'Sex', 'Age', 'Unit of measurement', 'Source']
repats = repats.drop(columns=columns_to_drop)

# 9. keep only the repatriation rows where the reported nationality (Category) differs from the
#    detecting country, then sum repatriated victims for each country and year
foreign_victims = repats[
    (repats['Dimension'] == 'by country of repatriation') & 
    (repats['Category'] != repats['Country'])
]

repat_totals = foreign_victims.groupby(['Country', 'Year'])['Counts'].sum().reset_index()
repat_totals = repat_totals.rename(columns = {'Counts': 'Number_Repatriated_Victims'})

repat_totals
Country Year Number_Repatriated_Victims
0 Albania 2014 5.0
1 Albania 2016 5.0
2 Albania 2017 21.0
3 Albania 2018 21.0
4 Albania 2019 10.0
... ... ... ...
231 United Kingdom of Great Britain and Northern I... 2012 47.0
232 United Kingdom of Great Britain and Northern I... 2013 30.0
233 United Kingdom of Great Britain and Northern I... 2014 29.0
234 United Kingdom of Great Britain and Northern I... 2015 23.0
235 United Kingdom of Great Britain and Northern I... 2016 114.0

236 rows × 3 columns
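
To gauge how much the “<5” approximation matters, a quick illustrative check counts how many raw entries are censored, re-reading the raw file so the count reflects the data before the conversion in step 7:

# illustrative: count the censored entries in the raw file
raw = pd.read_excel('../../data/raw-data/UNODC_TIP_data.xlsx', skiprows=2)
print((raw['txtVALUE'] == '<5').sum(), 'of', len(raw), "raw entries are censored as '<5'")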

The code below, in contrast to the prior block, simply extracts the total counts for each indicator (detected trafficking victims, convicted traffickers, and prosecuted traffickers), regardless of contextual characteristics such as the citizenship of the persons involved. The resulting dataframe has five columns: 1.) the country name, 2.) the subregion (e.g., Southern Europe, Eastern Europe, Northern Europe, Western Europe), to explore differences among these categories, 3.) the indicator, 4.) the year, and 5.) the count. This dataframe will be merged with other country features in code cells further below.

# COMBINED TOTALS dataframe with all trafficking data variables

totals = pd.read_excel('../../data/raw-data/UNODC_TIP_data.xlsx', skiprows=2) # remove first two rows

# 1. subset to European and Central Asian countries by ISO3 code
european_country_codes = [
    "ALB", "AND", "ARM", "AUT", "AZE", "BEL", "BIH", "BGR", "CHE", "CYP", "CZE",
    "DEU", "DNK", "ESP", "EST", "FIN", "FRA", "GBR", "GEO", "GRC", "HRV", "HUN",
    "IRL", "ISL", "ITA", "KAZ", "KGZ", "KOS", "LIE", "LTU", "LUX", "LVA", "MCO",
    "MDA", "MKD", "MLT", "MNE", "NLD", "NOR", "POL", "PRT", "ROU", "RUS", "SMR",
    "SRB", "SVK", "SVN", "SWE", "TJK", "TKM", "TUR", "UKR", "UZB"
]
totals = totals[totals['Iso3_code'].isin(european_country_codes)]

# 2. subset to the three indicators of interest
totals = totals[totals['Indicator'].isin(['Detected trafficking victims', 'Persons convicted', 'Persons prosecuted'])]

# 3. subset the Dimension column to "Total" only
totals = totals[totals['Dimension'] == 'Total']

# 4. subset to the Sex == "Total"
totals = totals[totals['Sex'] == 'Total']

# 5. subset to Age == "Total"
totals = totals[totals['Age'] == 'Total']

# 6. rename txtVALUE --> "Counts"
totals = totals.rename(columns={'txtVALUE': 'Counts'})

# 7. convert observations where Counts = "<5" to numeric value 5
totals['Counts'] = totals['Counts'].replace("<5", 5)
totals['Counts'] = pd.to_numeric(totals['Counts'], errors='coerce')

# 8. remove unnecessary columns
columns_to_drop = ['Iso3_code', 'Region', 'Dimension', 'Category', 'Sex', 'Age', 'Unit of measurement', 'Source']
totals = totals.drop(columns=columns_to_drop)

# 9. pivot wider so that each indicator is its own column
totals = totals.pivot(index=["Country", "Subregion", "Year"], columns="Indicator", values="Counts").reset_index()
totals.columns.name = None  # remove the 'Indicator' name that pivot leaves on the columns index

# 10. rename the columns so they don't have spaces:
totals = totals.rename(columns={
                                'Detected trafficking victims': 'Detected_victims', 
                                'Persons convicted': 'Convicted_traffickers',
                                'Persons prosecuted': 'Prosecuted_traffickers'})

# 11. create a NEW column to represent the seriousness of law enforcement: the number of convictions divided by the number of prosecutions
totals['convictions_over_prosecutions'] = totals['Convicted_traffickers'] / totals['Prosecuted_traffickers']

# 12. MERGE the number of repatriated victims 
totals = totals.merge(repat_totals, on = ["Country", "Year"], how = "outer")

print(totals.head())
   Country        Subregion  Year  Detected_victims  Convicted_traffickers  \
0  Albania  Southern Europe  2003               NaN                   40.0   
1  Albania  Southern Europe  2004               NaN                   44.0   
2  Albania  Southern Europe  2005               NaN                   49.0   
3  Albania  Southern Europe  2006               NaN                   56.0   
4  Albania  Southern Europe  2007               NaN                    9.0   

   Prosecuted_traffickers  convictions_over_prosecutions  \
0                     NaN                            NaN   
1                     NaN                            NaN   
2                     NaN                            NaN   
3                     NaN                            NaN   
4                    25.0                           0.36   

   Number_Repatriated_Victims  
0                         NaN  
1                         NaN  
2                         NaN  
3                         NaN  
4                         NaN  
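
Because the convictions and prosecutions counted in a given year do not necessarily stem from the same cases, the new ratio can exceed 1. A quick illustrative check flags such country-years:

# illustrative: flag country-years where convictions exceed prosecutions
over_one = totals[totals['convictions_over_prosecutions'] > 1]
print(over_one[['Country', 'Year', 'convictions_over_prosecutions']].head())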

World Bank data

# World Bank indicators data
wb = pd.read_csv("../../data/raw-data/europe_world_bank_data.csv")
print(wb.head())
   Country  Year  Criminal_justice  GDP_per_capita  Political_stability  \
0  Albania  2010         42.654030     4094.349686            38.388626   
1  Albania  2011         41.314552     4437.141161            37.440758   
2  Albania  2012         39.906105     4247.631343            40.284359   
3  Albania  2013         38.967136     4413.063383            49.289101   
4  Albania  2014         44.230770     4578.633208            61.428570   

   Population  Refugee_population  Unemployment_rate  
0   2913021.0                75.0             14.086  
1   2905195.0                79.0             13.481  
2   2900401.0                84.0             13.376  
3   2895092.0                96.0             15.866  
4   2889104.0               111.0             18.055  

Both the World Bank and UNODC datasets are subset to countries in Europe and Central Asia, but in order to merge the datasets on country name and year, we need to ensure that the country names are consistent across the two datasets. Country names, especially those with longer official names or special characters, can be denoted differently by different sources: for example, one source may call Moldova by its official name, the “Republic of Moldova”, and Turkey may be denoted by its official spelling, Turkiye or Türkiye. So, in this step, I compare the country names and then standardize both datasets to the same names for accurate merging.


# isolate unique country names from both datasets
wb_countries = wb['Country'].unique()
totals_countries = totals['Country'].unique()

# Show countries in totals but not in wb
missing_in_wb = set(totals_countries) - set(wb_countries)
# Show countries in wb but not in totals
missing_in_totals = set(wb_countries) - set(totals_countries)

print("Countries in UNODC dataset but not in World Bank data:\n", missing_in_wb)
print("\nCountries in World Bank data but not in UNODC data:\n", missing_in_totals)
Countries in UNODC dataset but not in World Bank data:
 {'Türkiye', 'Slovakia', 'United Kingdom of Great Britain and Northern Ireland', 'Kyrgyzstan', 'Republic of Moldova', 'Belarus'}

Countries in World Bank data but not in UNODC data:
 {'Moldova', 'Slovak Republic', 'Kyrgyz Republic', 'Turkiye', 'San Marino', 'United Kingdom'}
print(totals['Country'].unique())
['Albania' 'Andorra' 'Armenia' 'Austria' 'Azerbaijan' 'Belarus' 'Belgium'
 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia'
 'Denmark' 'Estonia' 'Finland' 'France' 'Georgia' 'Germany' 'Greece'
 'Hungary' 'Iceland' 'Ireland' 'Italy' 'Kazakhstan' 'Kyrgyzstan' 'Latvia'
 'Liechtenstein' 'Lithuania' 'Luxembourg' 'Malta' 'Monaco' 'Montenegro'
 'Netherlands' 'North Macedonia' 'Norway' 'Poland' 'Portugal'
 'Republic of Moldova' 'Romania' 'Russian Federation' 'Serbia' 'Slovakia'
 'Slovenia' 'Spain' 'Sweden' 'Switzerland' 'Tajikistan' 'Turkmenistan'
 'Türkiye' 'Ukraine'
 'United Kingdom of Great Britain and Northern Ireland' 'Uzbekistan']

With this information, I’ve written a mapping dictionary to apply to both datasets so the country names match. Where applicable, I’ve chosen the simpler version of each name for both datasets.

country_name_mapping = {
    "Türkiye": "Turkiye",
    "Turkey": "Turkiye",
    "Slovak Republic": "Slovakia",
    "Republic of Moldova": "Moldova",
    "Kyrgyz Republic": "Kyrgyzstan",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Czech Republic": "Czechia",
    "Russia": "Russian Federation",
}

# apply the dictionary to both datasets
wb['Country'] = wb['Country'].replace(country_name_mapping)
totals['Country'] = totals['Country'].replace(country_name_mapping)

# now, check that the mapping worked: 
wb_countries = wb['Country'].unique()
totals_countries = totals['Country'].unique()
missing_in_wb = set(totals_countries) - set(wb_countries)
missing_in_totals = set(wb_countries) - set(totals_countries)

print("Countries in UNODC dataset but not in World Bank data:\n", missing_in_wb)
print("\nCountries in World Bank data but not in UNODC data:\n", missing_in_totals)
Countries in UNODC dataset but not in World Bank data:
 {'Belarus'}

Countries in World Bank data but not in UNODC data:
 {'San Marino'}

This output confirms that the mapping worked and each country is now represented by the same name in both datasets. However, San Marino is present in the World Bank data but not in the UNODC data, indicating that there is no human trafficking data for San Marino, so I drop its rows from the World Bank dataset. I will merge the two datasets with an “outer” join on the shared variables Country and Year, meaning every unique Country-Year combination is kept: even when one dataset has no data for a given row, the other may, and that information is still valuable and insightful. Belarus, conversely, appears in the UNODC data but not in the World Bank extract; its rows survive the outer merge, with the World Bank indicators left missing.

# drop San Marino from World Bank data
wb = wb[wb['Country'] != "San Marino"]

# Merge the datasets!
df = pd.merge(totals, wb, on=["Country", "Year"], how="outer")
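
As a quick check on the outer merge, the missingness pattern shows how many country-year rows came from only one source (a sketch, using GDP_per_capita as a proxy for World Bank coverage):

# sketch: count country-year rows contributed by only one of the two sources
only_unodc = df['GDP_per_capita'].isna() & df['Detected_victims'].notna()
only_wb = df['Detected_victims'].isna() & df['GDP_per_capita'].notna()
print('UNODC-only rows:', only_unodc.sum(), '| World Bank-only rows:', only_wb.sum())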

The following code creates a new variable, Detections_per_100, which standardizes the yearly number of detected victims (from the UNODC data) to the country’s population size (from the World Bank data) by dividing the number of detected victims by the population and multiplying this fraction by 100 for each year.

# Replace zeros in the population column with NA, then drop rows missing Population or Detected_victims to avoid division errors
df['Population'] = df['Population'].replace(0, None) 
df.dropna(subset=['Population', 'Detected_victims'], inplace=True) 

# Add a new column for detections per 100 people
df['Detections_per_100'] = (df['Detected_victims'] / df['Population']) * 100

# Display the updated dataset
print(df[['Country', 'Detected_victims', 'Population', 'Detections_per_100']].head())
    Country  Detected_victims  Population  Detections_per_100
7   Albania              97.0   2913021.0            0.003330
8   Albania              84.0   2905195.0            0.002891
9   Albania              92.0   2900401.0            0.003172
10  Albania              95.0   2895092.0            0.003281
11  Albania             125.0   2889104.0            0.004327
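
As a sanity check, the first row matches a hand calculation: 97 detected victims / 2,913,021 people × 100 ≈ 0.003330 detections per 100 people.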

Tier placements over time data


tiers = pd.read_csv('../../data/raw-data/TIP_Tiers_overyears.csv')

# filter to only keep rows where Location has "Europe" in the entry value
tiers = tiers[tiers['Location'].str.contains('Europe', na=False)].copy()

# clean column names-- remove brackets and content within brackets
tiers.columns = tiers.columns.str.replace(r"\[.*\]", "", regex=True).str.strip()

# drop unnecessary columns
tiers = tiers.drop(columns=['Location', 'Main article'])

# clean country names using the mapping from the previous step-- this ensures that this data matches the existing df
country_name_mapping = {
    "Türkiye": "Turkiye",
    "Turkey": "Turkiye",
    "Slovak Republic": "Slovakia",
    "Republic of Moldova": "Moldova",
    "Kyrgyz Republic": "Kyrgyzstan",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Czech Republic": "Czechia",
    "Russia": "Russian Federation",
}
tiers['Country'] = tiers['Country'].replace(country_name_mapping)

# melt the data into long format (columns: Country, Year, Tier Placement)
tier_long = tiers.melt(id_vars=['Country'], 
                         var_name='Year', 
                         value_name='Tier')

# encode the Tier 2 watchlist value '2w' as 2.2 
tier_long['Tier'] = tier_long['Tier'].replace('2w', '2.2')

# convert 'Tier' and 'Year' to numeric 
tier_long['Tier'] = pd.to_numeric(tier_long['Tier'], errors='coerce')
tier_long['Year'] = pd.to_numeric(tier_long['Year'], errors='coerce')

print(tier_long.head())
# MERGE with dataset
df = df.merge(tier_long, on=["Country", "Year"], how="outer")
                  Country  Year  Tier
0                 Albania  2011   2.0
1                 Austria  2011   1.0
2                 Belarus  2011   2.0
3                 Belgium  2011   1.0
4  Bosnia and Herzegovina  2011   1.0
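
A quick illustrative check that the '2w' encoding and numeric conversion behaved as intended, with 2.2 marking the Tier 2 watchlist:

# illustrative: the unique tier values should now all be numeric
print(sorted(tier_long['Tier'].dropna().unique()))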

Next, I will create a dataset of “static” variables. These are variables for which data was only available for a country’s most recent status, not over many years, and they mostly concern government policy.

Henley passport data

# load dataset with Henley Passport Index Data
passport = pd.read_csv('../../data/raw-data/henley_passport_index.csv')

# clean the country names using the country_name_mapping dictionary defined earlier: 
country_name_mapping = {
    "Türkiye": "Turkiye",
    "Turkey": "Turkiye",
    "Slovak Republic": "Slovakia",
    "Republic of Moldova": "Moldova",
    "Kyrgyz Republic": "Kyrgyzstan",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "Czech Republic": "Czechia",
    "Russia": "Russian Federation",
}

passport['Country'] = passport['Country'].replace(country_name_mapping)

# clean the Country column by subsetting to the countries in df
EU_centralasia = df['Country'].unique()
passport = passport[passport['Country'].isin(EU_centralasia)]
print(passport.head())
print(passport['Country'].unique())
       Country  Visa_free_destinations
1      Albania                     123
3      Andorra                     171
7      Armenia                      68
9      Austria                     191
10  Azerbaijan                      71
['Albania' 'Andorra' 'Armenia' 'Austria' 'Azerbaijan' 'Belarus' 'Belgium'
 'Bosnia and Herzegovina' 'Bulgaria' 'Croatia' 'Cyprus' 'Czechia'
 'Denmark' 'Estonia' 'Finland' 'France' 'Georgia' 'Germany' 'Greece'
 'Hungary' 'Iceland' 'Ireland' 'Italy' 'Kazakhstan' 'Kosovo' 'Kyrgyzstan'
 'Latvia' 'Liechtenstein' 'Lithuania' 'Luxembourg' 'Malta' 'Moldova'
 'Montenegro' 'Netherlands' 'North Macedonia' 'Norway' 'Poland' 'Portugal'
 'Romania' 'Russian Federation' 'Serbia' 'Slovakia' 'Slovenia' 'Spain'
 'Sweden' 'Switzerland' 'Tajikistan' 'Turkiye' 'Turkmenistan' 'Ukraine'
 'United Arab Emirates' 'United Kingdom' 'Uzbekistan']

Policy data

# load the manually compiled policy dataset
policy = pd.read_excel('../../data/raw-data/EU_policy.xlsx')

# clean the data:
# Replace "X" with 1 (infer_objects restores proper dtypes and avoids a pandas FutureWarning about silent downcasting)
policy = policy.replace("X", 1).infer_objects(copy=False)
# Replace NaN with 0
policy = policy.fillna(0)
# clean country names 
policy['Country'] = policy['Country'].replace(country_name_mapping)
# clean country names 
policy['Country'] = policy['Country'].replace(country_name_mapping)

# Define a mapping for the Tier columns
tier_mapping = {
    "Tier_1": 1,
    "Tier_2": 2,
    "Tier_2_watchlist": 2.2,
    "Tier_3": 3
}

# Melt the DataFrame for the Tier columns
tier_columns = ["Tier_1", "Tier_2", "Tier_2_watchlist", "Tier_3"]
melted = policy.melt(id_vars=["Country"], value_vars=tier_columns, 
                     var_name="Tier_Type", value_name="Tier_Value")

# Map the Tier_Type to numerical values and filter rows where Tier_Value is 1
melted['Tier_2024'] = melted['Tier_Type'].map(tier_mapping)
melted = melted[melted['Tier_Value'] == 1]

# Drop unnecessary columns and merge back if needed
melted = melted[['Country', 'Tier_2024']]

policy = policy.drop(columns=["Tier_1", "Tier_2", "Tier_2_watchlist", "Tier_3"])

# Merge back to the original DataFrame if required
policy = policy.merge(melted, on="Country", how="left")

print(policy.head())
print(policy.columns)
      Country  Nonpunishment_policy_before2021  \
0     Albania                              0.0   
1     Andorra                              0.0   
2     Armenia                              1.0   
3     Austria                              0.0   
4  Azerbaijan                              1.0   

   Nonpunishment_policy_after2021  Limited_nonpunishment_policy  \
0                             1.0                           0.0   
1                             0.0                           0.0   
2                             1.0                           1.0   
3                             0.0                           0.0   
4                             1.0                           0.0   

   Prostitution_abolitionism  Prostitution_neoabolitionism  \
0                        0.0                           0.0   
1                        0.0                           0.0   
2                        0.0                           0.0   
3                        0.0                           0.0   
4                        0.0                           0.0   

   Prostitution_decriminalization  Prostitution_legal  \
0                             0.0                 0.0   
1                             0.0                 0.0   
2                             0.0                 0.0   
3                             0.0                 1.0   
4                             0.0                 0.0   

   Prostitution_prohibited  Post_soviet_states  EU_members  Tier_2024  
0                      1.0                 1.0         0.0        2.0  
1                      0.0                 0.0         0.0        NaN  
2                      1.0                 1.0         0.0        2.0  
3                      0.0                 0.0         1.0        1.0  
4                      1.0                 1.0         0.0        2.0  
Index(['Country', 'Nonpunishment_policy_before2021',
       'Nonpunishment_policy_after2021', 'Limited_nonpunishment_policy',
       'Prostitution_abolitionism', 'Prostitution_neoabolitionism',
       'Prostitution_decriminalization', 'Prostitution_legal',
       'Prostitution_prohibited', 'Post_soviet_states', 'EU_members',
       'Tier_2024'],
      dtype='object')
policy['Tier_2024'].unique()
array([2. , nan, 1. , 3. , 2.2])
# isolate unique country names from both datasets
passport_countries = passport['Country'].unique()
policy_countries = policy['Country'].unique()

# Show countries in policy data but not in passport data
missing_in_passport = set(policy_countries) - set(passport_countries)
# Show countries in passport data but not in policy data
missing_in_policy = set(passport_countries) - set(policy_countries)

print("Countries in policy data but missing in passport data:\n", missing_in_passport)
print("\nCountries in passport index data but not in policy data:\n", missing_in_policy)
Countries in policy data but missing in passport data:
 {'Monaco'}

Countries in passport index data but not in policy data:
 {'Kazakhstan', 'Turkmenistan', 'Tajikistan', 'Liechtenstein', 'Kyrgyzstan', 'United Arab Emirates', 'Uzbekistan'}

In an earlier pass, the passport data was missing entries for Turkiye, Czechia, Belarus, Kosovo, and Russia (likely because of a different notation), so I went back to the previous cleaning steps, manually checking the CSV output from the data collection phase, to include them.

The mismatches that remain above are genuine coverage differences rather than notation inconsistencies: the policy data includes Monaco, which the passport data lacks, while the passport data covers several Central Asian countries and the United Arab Emirates that the manually compiled policy dataset does not. Because the policy data serves as the base of the left merge below, the passport-only countries will simply be dropped.

GSI government response score data from Walk Free

The first dataset loaded in the cell below is from the Walk Free Foundation, an international non-profit human rights group dedicated to research and spreading awareness in the pursuit of eradicating modern slavery. It published a comprehensive dataset for the year 2023 titled “The Global Slavery Index”. One of the indicators for which it collects data is what it calls the “government response score”. This is calculated through a complex framework encapsulating a variety of activities performed by a government to address modern slavery, ranging from efforts to ensure appropriate identification and protection of victims, to deploying the necessary personnel and resources, to maintaining a strong judicial system trained to tackle modern slavery cases. Walk Free experts calculate this metric from extensive policy research. It is important to note that this is not an official, validated, government-verified source.

# load Walk Free's Global Slavery Index dataset
# extract only the 5th sheet of the excel file, which contains data on the government response score
walkfree = pd.read_excel('../../data/raw-data/2023-Global-Slavery-Index-Data.xlsx', sheet_name=4, skiprows=2)

# subset to countries in Europe and Central Asia 
walkfree = walkfree[walkfree["Region"] == "Europe and Central Asia"]

# subset to only include important columns -- Country and the Total (which refers to the total number of appropriate anti-trafficking actions taken by government)
walkfree = walkfree[["Country", "Total (n)"]]

# clean the Country column with the dictionary previously defined
walkfree['Country'] = walkfree['Country'].replace(country_name_mapping)

# rename the column to the appropriate, identifiable indicator 
walkfree = walkfree.rename(columns={"Total (n)": "GSI_gov_response_score"})

print(walkfree.head())
       Country  GSI_gov_response_score
0      Albania                    48.0
5      Armenia                    42.0
7      Austria                    48.0
8   Azerbaijan                    46.0
13     Belarus                    37.0

Now, I will merge the Henley Passport Index, policy, and Walk Free data into one dataset:

static_data = policy.merge(passport, on="Country", how="left")
print(static_data.head())
static_data = static_data.merge(walkfree, on="Country", how="left")
      Country  Nonpunishment_policy_before2021  \
0     Albania                              0.0   
1     Andorra                              0.0   
2     Armenia                              1.0   
3     Austria                              0.0   
4  Azerbaijan                              1.0   

   Nonpunishment_policy_after2021  Limited_nonpunishment_policy  \
0                             1.0                           0.0   
1                             0.0                           0.0   
2                             1.0                           1.0   
3                             0.0                           0.0   
4                             1.0                           0.0   

   Prostitution_abolitionism  Prostitution_neoabolitionism  \
0                        0.0                           0.0   
1                        0.0                           0.0   
2                        0.0                           0.0   
3                        0.0                           0.0   
4                        0.0                           0.0   

   Prostitution_decriminalization  Prostitution_legal  \
0                             0.0                 0.0   
1                             0.0                 0.0   
2                             0.0                 0.0   
3                             0.0                 1.0   
4                             0.0                 0.0   

   Prostitution_prohibited  Post_soviet_states  EU_members  Tier_2024  \
0                      1.0                 1.0         0.0        2.0   
1                      0.0                 0.0         0.0        NaN   
2                      1.0                 1.0         0.0        2.0   
3                      0.0                 0.0         1.0        1.0   
4                      1.0                 1.0         0.0        2.0   

   Visa_free_destinations  
0                   123.0  
1                   171.0  
2                    68.0  
3                   191.0  
4                    71.0  

Merge all into comprehensive dataset

final_merge = df.merge(static_data, on="Country", how="inner")
final_merge.head()
Country Subregion Year Detected_victims Convicted_traffickers Prosecuted_traffickers convictions_over_prosecutions Number_Repatriated_Victims Criminal_justice GDP_per_capita ... Prostitution_abolitionism Prostitution_neoabolitionism Prostitution_decriminalization Prostitution_legal Prostitution_prohibited Post_soviet_states EU_members Tier_2024 Visa_free_destinations GSI_gov_response_score
0 Albania Southern Europe 2010 97.0 17.0 39.0 0.435897 NaN 42.654030 4094.349686 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 123.0 48.0
1 Albania Southern Europe 2011 84.0 15.0 28.0 0.535714 NaN 41.314552 4437.141161 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 123.0 48.0
2 Albania Southern Europe 2012 92.0 5.0 30.0 0.166667 NaN 39.906105 4247.631343 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 123.0 48.0
3 Albania Southern Europe 2013 95.0 5.0 42.0 0.119048 NaN 38.967136 4413.063383 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 123.0 48.0
4 Albania Southern Europe 2014 125.0 NaN NaN NaN 5.0 44.230770 4578.633208 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 2.0 123.0 48.0

5 rows × 29 columns

Now, I have noticed that it would be easier for EDA to collapse the prostitution policy data into a single categorical column, instead of multiple binary columns with 0/1 entries. The code below performs this transformation:

# create the prostitution policy mapping
prostitution_map = {}

# identify all of the prostitution columns (representing the different levels of policy)
prostitution_cols = [col for col in final_merge.columns if col.startswith('Prostitution_')]

# for each row, find which prostitution policy is marked as 1
for idx, row in final_merge.iterrows():
    for col in prostitution_cols:
        if row[col] == 1:
            # get the policy name from after the underscore
            # (named policy_name to avoid shadowing the policy dataframe defined earlier)
            policy_name = col.split('_')[1].lower()
            # create a key of country and year
            key = (row['Country'], row['Year'])
            prostitution_map[key] = policy_name

# add this new column to the merged dataset and drop the original binary prostitution policy columns
final_merge['prostitution_policy'] = final_merge.apply(lambda row: prostitution_map.get((row['Country'], row['Year']), None), axis=1)
final_merge = final_merge.drop(columns=prostitution_cols)

# view the output
print(final_merge[['Country', 'Year', 'prostitution_policy']].head())
   Country  Year prostitution_policy
0  Albania  2010          prohibited
1  Albania  2011          prohibited
2  Albania  2012          prohibited
3  Albania  2013          prohibited
4  Albania  2014          prohibited
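
The row-wise loop above is straightforward, but the same mapping can be expressed in a vectorized way. A sketch, assuming pre_drop is a hypothetical copy of final_merge taken before the binary columns were dropped:

# sketch: vectorized equivalent of the loop, on a hypothetical pre-drop copy of final_merge
mask = pre_drop[prostitution_cols].eq(1)
pre_drop['prostitution_policy'] = (
    mask.idxmax(axis=1)                      # name of the first column equal to 1 in each row
        .str.split('_').str[1].str.lower()   # keep the policy level after the underscore
        .where(mask.any(axis=1))             # rows with no policy marked become NaN
)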

The output looks good! This dataframe will now be saved to a CSV for use in the EDA section of the project.

final_merge.to_csv('../../data/processed-data/complete_merge.csv', index=False)

Create ML Dataset

The dataset compiled in the previous steps includes observations over time. However, time series data cannot be used directly in general machine learning models, because the autocorrelation of the observations violates the assumption that the data is independently and identically distributed (I.I.D.). While the temporal data will be useful for data visualization and for exploring trends over time, its temporal structure needs to be addressed before the data can be used in supervised and unsupervised learning models. There are various ways to transform time series data for machine learning; for this project, I aggregate the temporal data into single point estimates for each country across all years. Each variable is reduced to either its slope (rate of change) over the years, computed with scipy.stats’ linregress tool, its mean, or its sum. The aggregate metric for each variable was chosen with particular consideration of the meaning of that variable. For example, the number of detected victims was aggregated to its sum: the count varies significantly from year to year, which could distort the mean, and the total may provide more insight.

A function to calculate the slope using linregress was created and the aggregation metrics were applied in a for loop.

import numpy as np
from scipy.stats import linregress

df = final_merge

def calculate_slope(group, x_col, y_col):
    # x_col is the predictor (Year) and y_col is the indicator to regress on it
    # drop any missing pairs so that linregress does not raise an error
    filtered_group = group.dropna(subset=[x_col, y_col])
    
    if len(filtered_group) > 1:  # calculating a slope requires at least two data points
        x = filtered_group[x_col]
        y = filtered_group[y_col]
        return linregress(x, y).slope
    return np.nan  # return NaN if there are not enough data points for a slope

# Aggregate metrics
aggregated_data = []

# Group by country:
for country, group in df.groupby('Country'):
    group = group.sort_values('Year')  # sort by year for slope calculations
    
    # calculate aggregated values
    data = {
        'Country': country,
        'Detected_victims_sum': group['Detected_victims'].sum(skipna=True), # skipna=True (the default) ignores missing years in the sum
        'Detected_victims_slope': calculate_slope(group, 'Year', 'Detected_victims'),
        'Convicted_traffickers_sum': group['Convicted_traffickers'].sum(skipna=True),
        'Convicted_traffickers_slope': calculate_slope(group, 'Year', 'Convicted_traffickers'),
        'Prosecuted_traffickers_sum': group['Prosecuted_traffickers'].sum(skipna=True),
        'Prosecuted_traffickers_slope': calculate_slope(group, 'Year', 'Prosecuted_traffickers'),
        'Convictions_over_prosecutions_mean': group['convictions_over_prosecutions'].mean(skipna=True),
        'Number_Repatriated_Victims_sum': group['Number_Repatriated_Victims'].sum(skipna=True),
        'Tier_mean': group['Tier'].mean(skipna=True),
        'Tier_slope': calculate_slope(group, 'Year', 'Tier'),
        'Criminal_justice_mean': group['Criminal_justice'].mean(skipna=True),
        'Criminal_justice_slope': calculate_slope(group, 'Year', 'Criminal_justice'),
        'GDP_per_capita_mean': group['GDP_per_capita'].mean(skipna=True),
        'GDP_per_capita_slope': calculate_slope(group, 'Year', 'GDP_per_capita'),
        'Political_stability_mean': group['Political_stability'].mean(skipna=True),
        'Political_stability_slope': calculate_slope(group, 'Year', 'Political_stability'),
        'Population_mean': group['Population'].mean(skipna=True),
        'Refugee_population_mean': group['Refugee_population'].mean(skipna=True),
        'Refugee_population_slope': calculate_slope(group, 'Year', 'Refugee_population'),
        'Unemployment_rate_mean': group['Unemployment_rate'].mean(skipna=True),
        'Unemployment_rate_slope': calculate_slope(group, 'Year', 'Unemployment_rate'),
        'Detections_per_100_mean': group['Detections_per_100'].mean(skipna=True)
    }
    aggregated_data.append(data)

# additional column:
# count the number of years each country has reported detected victims of human trafficking 
reported_years = df.groupby('Country')['Detected_victims'].count().reset_index()
reported_years.rename(columns={'Detected_victims': 'Reported_Years_Detected_Victims'}, inplace=True)
# a country with a high value in this column suggests 1.) a high prevalence of human trafficking in the country or 
# 2.) a dedicated government effort to detect victims

# store as dataframe
aggregated_df = pd.DataFrame(aggregated_data)

# merge the number of reported years
aggregated_df = aggregated_df.merge(reported_years, on="Country", how="left")

aggregated_df.head()
Country Detected_victims_sum Detected_victims_slope Convicted_traffickers_sum Convicted_traffickers_slope Prosecuted_traffickers_sum Prosecuted_traffickers_slope Convictions_over_prosecutions_mean Number_Repatriated_Victims_sum Tier_mean ... GDP_per_capita_slope Political_stability_mean Political_stability_slope Population_mean Refugee_population_mean Refugee_population_slope Unemployment_rate_mean Unemployment_rate_slope Detections_per_100_mean Reported_Years_Detected_Victims
0 Albania 1235.0 2.353147 42.0 -4.600000 349.0 1.361111 0.314332 92.0 2.016667 ... 159.811093 51.233780 1.355719 2.875263e+06 107.083333 3.884615 14.4861 -0.198939 0.003584 12
1 Andorra 25.0 0.000000 10.0 0.000000 43.0 -2.153846 0.575758 0.0 NaN ... -282.908853 98.107819 0.477088 7.638540e+04 NaN NaN NaN NaN 0.006550 5
2 Armenia 177.0 -0.206294 75.0 -0.248918 83.0 -1.416667 0.833808 0.0 NaN ... 142.277720 34.566121 -2.971136 2.869258e+06 22847.083333 4956.115385 15.0360 -1.061524 0.000514 12
3 Austria 1344.0 25.339827 134.0 -0.843034 1194.0 16.659612 0.213883 96.0 1.000000 ... 121.474676 86.073943 -2.083460 8.663381e+06 89785.700000 11512.599567 5.2954 -0.005429 0.001531 10
4 Azerbaijan 792.0 6.300699 231.0 0.956522 177.0 -0.392857 0.762087 0.0 NaN ... -275.485406 24.015193 -1.409298 9.660932e+06 1403.500000 -28.510490 5.3550 0.068252 0.000676 12

5 rows × 24 columns
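
To illustrate what calculate_slope returns, a toy example with made-up values:

# toy example (made-up values): victims increase by 10 per year, so the slope is 10.0
toy = pd.DataFrame({'Year': [2010, 2011, 2012], 'Detected_victims': [10, 20, 30]})
print(calculate_slope(toy, 'Year', 'Detected_victims'))  # 10.0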

Next, I add the static variables to this temporally aggregated dataframe by extracting the first observation of each static indicator for each country (so a single value per country) and merging it with the aggregated dataframe.

static_vars = final_merge[['Country', 'Subregion', 'Nonpunishment_policy_before2021',
       'Nonpunishment_policy_after2021', 'Limited_nonpunishment_policy',
       'Post_soviet_states', 'EU_members', 'Tier_2024', 'Visa_free_destinations',
       'GSI_gov_response_score', 'prostitution_policy']]

static_vars = static_vars.drop_duplicates(subset='Country', keep='first')
print(static_vars)
                    Country        Subregion  Nonpunishment_policy_before2021  \
0                   Albania  Southern Europe                              0.0   
13                  Andorra  Southern Europe                              0.0   
18                  Armenia     Western Asia                              1.0   
30                  Austria   Western Europe                              0.0   
43               Azerbaijan     Western Asia                              1.0   
55                  Belarus              NaN                              1.0   
67                  Belgium              NaN                              0.0   
79   Bosnia and Herzegovina  Southern Europe                              0.0   
92                 Bulgaria   Eastern Europe                              0.0   
105                 Croatia  Southern Europe                              0.0   
118                  Cyprus     Western Asia                              1.0   
127                 Czechia   Eastern Europe                              0.0   
140                 Denmark  Northern Europe                              0.0   
153                 Estonia  Northern Europe                              0.0   
166                 Finland  Northern Europe                              1.0   
179                  France              NaN                              0.0   
191                 Georgia     Western Asia                              1.0   
202                 Germany   Western Europe                              1.0   
215                  Greece  Southern Europe                              1.0   
228                 Hungary   Eastern Europe                              0.0   
241                 Iceland              NaN                              0.0   
253                 Ireland  Northern Europe                              0.0   
266                   Italy  Southern Europe                              0.0   
279                  Kosovo              NaN                              0.0   
291                  Latvia  Northern Europe                              0.0   
304               Lithuania  Northern Europe                              1.0   
317              Luxembourg              NaN                              1.0   
329                   Malta  Southern Europe                              1.0   
341                 Moldova   Eastern Europe                              0.0   
354              Montenegro  Southern Europe                              0.0   
367             Netherlands   Western Europe                              0.0   
380         North Macedonia  Southern Europe                              0.0   
393                  Norway  Northern Europe                              0.0   
406                  Poland              NaN                              0.0   
418                Portugal  Southern Europe                              0.0   
431                 Romania              NaN                              1.0   
443      Russian Federation   Eastern Europe                              0.0   
448                  Serbia  Southern Europe                              0.0   
461                Slovakia   Eastern Europe                              0.0   
474                Slovenia  Southern Europe                              0.0   
487                   Spain              NaN                              1.0   
499                  Sweden  Northern Europe                              0.0   
512             Switzerland   Western Europe                              0.0   
525                 Turkiye     Western Asia                              0.0   
538                 Ukraine   Eastern Europe                              0.0   
551          United Kingdom  Northern Europe                              0.0   

     Nonpunishment_policy_after2021  Limited_nonpunishment_policy  \
0                               1.0                           0.0   
13                              0.0                           0.0   
18                              1.0                           1.0   
30                              0.0                           0.0   
43                              1.0                           0.0   
55                              1.0                           1.0   
67                              0.0                           0.0   
79                              1.0                           0.0   
92                              1.0                           0.0   
105                             0.0                           0.0   
118                             1.0                           0.0   
127                             0.0                           0.0   
140                             0.0                           0.0   
153                             0.0                           0.0   
166                             1.0                           1.0   
179                             0.0                           0.0   
191                             1.0                           1.0   
202                             1.0                           0.0   
215                             1.0                           0.0   
228                             0.0                           0.0   
241                             0.0                           0.0   
253                             0.0                           0.0   
266                             0.0                           0.0   
279                             0.0                           0.0   
291                             1.0                           0.0   
304                             1.0                           0.0   
317                             1.0                           0.0   
329                             1.0                           0.0   
341                             0.0                           0.0   
354                             0.0                           0.0   
367                             0.0                           0.0   
380                             0.0                           0.0   
393                             0.0                           0.0   
406                             0.0                           0.0   
418                             0.0                           0.0   
431                             1.0                           1.0   
443                             0.0                           0.0   
448                             0.0                           0.0   
461                             1.0                           1.0   
474                             0.0                           0.0   
487                             1.0                           1.0   
499                             0.0                           0.0   
512                             0.0                           0.0   
525                             0.0                           0.0   
538                             0.0                           0.0   
551                             1.0                           1.0   

     Post_soviet_states  EU_members  Tier_2024  Visa_free_destinations  \
0                   1.0         0.0        2.0                   123.0   
13                  0.0         0.0        NaN                   171.0   
18                  1.0         0.0        2.0                    68.0   
30                  0.0         1.0        1.0                   191.0   
43                  1.0         0.0        2.0                    71.0   
55                  1.0         0.0        3.0                    81.0   
67                  0.0         1.0        1.0                   190.0   
79                  0.0         0.0        2.0                   123.0   
92                  1.0         0.0        2.0                   177.0   
105                 0.0         1.0        2.0                   183.0   
118                 0.0         1.0        1.0                   178.0   
127                 1.0         1.0        1.0                   189.0   
140                 0.0         1.0        1.0                   190.0   
153                 1.0         1.0        1.0                   185.0   
166                 0.0         1.0        1.0                   191.0   
179                 0.0         1.0        1.0                   192.0   
191                 1.0         0.0        NaN                   122.0   
202                 0.0         1.0        1.0                   192.0   
215                 0.0         1.0        2.0                   188.0   
228                 1.0         1.0        2.0                   187.0   
241                 0.0         0.0        1.0                   184.0   
253                 0.0         1.0        2.0                   191.0   
266                 0.0         1.0        2.0                   192.0   
279                 0.0         0.0        2.0                    79.0   
291                 1.0         1.0        2.0                   184.0   
304                 1.0         1.0        1.0                   185.0   
317                 0.0         1.0        1.0                   191.0   
329                 0.0         1.0        2.2                   187.0   
341                 1.0         0.0        2.0                   122.0   
354                 0.0         0.0        2.0                   128.0   
367                 0.0         1.0        1.0                   191.0   
380                 0.0         0.0        2.0                   128.0   
393                 0.0         0.0        2.0                   190.0   
406                 1.0         1.0        1.0                   188.0   
418                 0.0         1.0        2.0                   189.0   
431                 1.0         1.0        2.0                   177.0   
443                 1.0         0.0        3.0                   116.0   
448                 0.0         0.0        2.2                   140.0   
461                 0.0         1.0        2.0                   184.0   
474                 0.0         1.0        2.0                   184.0   
487                 0.0         1.0        1.0                   192.0   
499                 0.0         1.0        1.0                   191.0   
512                 0.0         0.0        2.0                   190.0   
525                 0.0         0.0        NaN                   116.0   
538                 1.0         0.0        2.0                   148.0   
551                 0.0         0.0        1.0                   190.0   

     GSI_gov_response_score prostitution_policy  
0                      48.0          prohibited  
13                      NaN                None  
18                     42.0          prohibited  
30                     48.0               legal  
43                     46.0          prohibited  
55                     37.0          prohibited  
67                     46.0   decriminalization  
79                     45.0        abolitionism  
92                     43.0          prohibited  
105                    46.0          prohibited  
118                    45.0        abolitionism  
127                    46.0        abolitionism  
140                    48.0        abolitionism  
153                    44.0        abolitionism  
166                    47.0        abolitionism  
179                    48.0     neoabolitionism  
191                    48.0          prohibited  
202                    48.0               legal  
215                    48.0               legal  
228                    43.0               legal  
241                    37.0     neoabolitionism  
253                    49.0     neoabolitionism  
266                    46.0        abolitionism  
279                    33.0          prohibited  
291                    46.0               legal  
304                    45.0          prohibited  
317                    40.0                None  
329                    38.0        abolitionism  
341                    37.0          prohibited  
354                    47.0          prohibited  
367                    52.0               legal  
380                    45.0        abolitionism  
393                    49.0     neoabolitionism  
406                    43.0        abolitionism  
418                    52.0        abolitionism  
431                    45.0        abolitionism  
443                    19.0          prohibited  
448                    44.0          prohibited  
461                    45.0        abolitionism  
474                    44.0        abolitionism  
487                    49.0        abolitionism  
499                    49.0     neoabolitionism  
512                    39.0               legal  
525                    38.0               legal  
538                    40.0          prohibited  
551                    53.0        abolitionism  

Now, merge this dataset with the aggregated metrics dataset. The output should have one row for each unique country.

MLdata = aggregated_df.merge(static_vars, on="Country", how="left")
MLdata.head()
Country Detected_victims_sum Detected_victims_slope Convicted_traffickers_sum Convicted_traffickers_slope Prosecuted_traffickers_sum Prosecuted_traffickers_slope Convictions_over_prosecutions_mean Number_Repatriated_Victims_sum Tier_mean ... Subregion Nonpunishment_policy_before2021 Nonpunishment_policy_after2021 Limited_nonpunishment_policy Post_soviet_states EU_members Tier_2024 Visa_free_destinations GSI_gov_response_score prostitution_policy
0 Albania 1235.0 2.353147 42.0 -4.600000 349.0 1.361111 0.314332 92.0 2.016667 ... Southern Europe 0.0 1.0 0.0 1.0 0.0 2.0 123.0 48.0 prohibited
1 Andorra 25.0 0.000000 10.0 0.000000 43.0 -2.153846 0.575758 0.0 NaN ... Southern Europe 0.0 0.0 0.0 0.0 0.0 NaN 171.0 NaN None
2 Armenia 177.0 -0.206294 75.0 -0.248918 83.0 -1.416667 0.833808 0.0 NaN ... Western Asia 1.0 1.0 1.0 1.0 0.0 2.0 68.0 42.0 prohibited
3 Austria 1344.0 25.339827 134.0 -0.843034 1194.0 16.659612 0.213883 96.0 1.000000 ... Western Europe 0.0 0.0 0.0 0.0 1.0 1.0 191.0 48.0 legal
4 Azerbaijan 792.0 6.300699 231.0 0.956522 177.0 -0.392857 0.762087 0.0 NaN ... Western Asia 1.0 1.0 0.0 1.0 0.0 2.0 71.0 46.0 prohibited

5 rows × 34 columns
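
A final illustrative check: the aggregated dataset should contain exactly one row per country.

# illustrative: confirm one row per country after the merge
assert MLdata['Country'].is_unique
print(MLdata.shape)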

The merged, aggregated dataset is exported as a CSV file to the processed-data folder for use in exploratory data analysis and machine learning.

MLdata.to_csv('../../data/processed-data/MLdataset.csv', index=False)