The goal of this project is to collect and combine a variety of country indicators that may be informative in representing a country’s response to human trafficking. This goal was established after analyzing the requirements for Tier placements in the U.S. Department of State’s annual Trafficking in Persons report, in which countries are categorized based on rather vague characteristics. This project instead seeks to aggregate data that represents various dimensions of a country’s ability to fight human trafficking, as well as its prioritization of protecting its people from the crime. The data includes features on the financial and political stability of a country; its enactment of policies, including the legal status of prostitution and whether it outlines a victim non-punishment policy; characteristics of the strength of the nation’s passport and its judicial system; the number of detected victims of human trafficking in that country and of its nationals detected abroad; and more. By collecting holistically representative, well-rounded data on features identified through the literature review as influential factors in human trafficking response, this project seeks to provide insight into the shortcomings of current categorizations of countries and to shed light on countries making considerable progress without recognition. Moreover, the insights gained may help inform improved methods for assessing country performance.
Motivation
Human trafficking is one of the world’s fastest growing crimes, and it takes the freedom, health, well-being, and humanity of people every day. It is a crime that is overlooked in research and government prioritization because it is so complex to identify, intervene in, and solve. Because it is an underground crime, cracking down on it requires specialized and intensive methods.
There is a considerable divide between countries in Europe. Generally, countries in Western and Northern Europe are considered highly developed, stable, and peaceful, whereas countries in Southern and Eastern Europe (especially those formerly part of the Soviet bloc or Yugoslavia) face more geopolitical tensions and other disadvantages stemming from a history of political turmoil. Human trafficking tends to occur more in places that are less financially, politically, and legally stable, which makes the crime more prevalent in Southern and Eastern Europe.
Countries in Southern and Eastern Europe are consistently placed in lower Tiers by the U.S. Department of State. However, significant efforts and reforms may be underway in those countries to combat a crime with which they are unfortunately plagued at a higher rate. This means that the criteria used to compare countries may not reflect the entire situation. For example, countries in Northern and Western Europe tend to be more financially able to fund resources to fight human trafficking, but they do not need to deploy these resources as heavily because the rate of the crime is lower. In contrast, Southern and Eastern European countries must stretch more of their limited funding to provide more intervention. Therefore, the motivation behind this project is to uncover whether other indicator spaces may provide a more comprehensive explanation of a country’s true response to human trafficking.
Objectives
The main objectives of this project are to analyze data in order to:
Identify which indicators are the most important for characterizing the trends and flows of human trafficking in countries in Europe
Determine which indicators are the most explanatory of a government’s response to human trafficking
Provide insight into how well, or how poorly, the included data can explain a country’s Tier placement
By providing these analyses, this project hopes to present a better understanding of what national characteristics may influence a country’s human trafficking trends and government response abilities.
Methods
Data collection
All of the data used in this project was stored in CSV files in tabular format.
Data collection methods for this project included manual CSV downloads, manual data entry, extracting data from online sites using public APIs, and parsing webpages with BeautifulSoup. The variables included in the dataset are listed below, alongside the source from which they were extracted and the method by which they were extracted.
All of the data from the UNODC was downloaded from dataUNODC’s page on trafficking in persons, which hosts the organization’s official statistics on human trafficking variables.
The World Bank’s public API was used to extract information on relevant factors influencing human trafficking responses and patterns identified from the literature review. The base URL of the API takes two inputs: the country ISO code and the indicator code. I manually found the datasets on their web pages and copied their indicator abbreviation codes from the URL of each page. These indicators were stored in a dictionary called indicators so they could be looped over as inputs to the base URL and their data fetched. I also specified the country ISO codes so that data would only be extracted for countries in Europe and Central Asia. ChatGPT was consulted to create the list of countries for my specified geographic region, which I then confirmed against the International Organization for Standardization’s official website. Since I was running into trouble extracting the data in an appropriate format on my own, ChatGPT was also consulted for specifying the parameters in the fetch_indicator_data function, which helped me work around the fact that the records are stored in the second element of the JSON response.
Python’s BeautifulSoup package was also used to parse data from two Wikipedia pages: the USDS Tier placements over time and the Henley Passport Index per country. BeautifulSoup accomplishes web scraping by parsing the HTML content of a webpage: it accesses the raw HTML, builds a “parse tree”, and extracts the specified information. Methods such as find(), find_all(), or tag navigation can then be used to locate and extract the desired data.
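As a brief illustration of that pattern (not the project’s full extraction code, which appears in the Code section below), a sketch of locating a wikitable and iterating over its rows:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and build the parse tree
page = requests.get("https://en.wikipedia.org/wiki/Henley_Passport_Index")
soup = BeautifulSoup(page.content, "html.parser")

# Locate the first table with class "wikitable" and iterate over its rows
table = soup.find("table", {"class": "wikitable"})
for row in table.find_all("tr")[1:5]:  # skip the header row; preview a few rows
    cells = [c.text.strip() for c in row.find_all(["td", "th"])]
    print(cells)  # each row becomes a list of cell strings
```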
The preliminary merged dataset, to be manipulated in the data cleaning phase, contained the following variables:
Country
Year
Detected victims of human trafficking (UNODC)
Human trafficking convictions per country (UNODC)
Human trafficking prosecutions per country (UNODC)
Population (World Bank)
GDP per capita (World Bank)
Unemployment rate (World Bank)
Criminal justice score (World Bank)
Political stability (World Bank)
Percentage of population that are refugees (World Bank)
Prostitution policy (manually extracted and input into dataset from the European Parliament)
Number of visa-free destinations with national passport (also known as Henley passport index) (Wikipedia)
EU membership (manually extracted and input into dataset from the E.U. official site)
Post-Soviet status (manually extracted and input into dataset from Wikipedia)
GSI Government Response Score
The Global Slavery Index (GSI) dataset is compiled and published by Walk Free, an international human rights nonprofit that streamlines research and calls for policy reform on modern slavery, an umbrella term encompassing exploitation, forced labour, human trafficking, and forced marriage
The government response score is one of many metrics developed in the GSI dataset; it is based on a country’s fulfillment of dozens of activity criteria
To extract the GSI government response score per country, the total accumulated points for criteria satisfaction were taken from the results sheet of the dataset
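A hedged sketch of this extraction with pandas, where the file name, sheet name, and column names are assumptions for illustration (the actual GSI workbook may label them differently):

```python
import pandas as pd

# Hypothetical file, sheet, and column names -- the actual GSI workbook may differ
gsi = pd.read_excel("../../data/raw-data/GSI_results.xlsx", sheet_name="Results")

# Keep the country name and the total accumulated points for government response
gov_response = gsi[["Country", "Total"]].rename(columns={"Total": "GSI_gov_response_score"})
gov_response.to_csv("../../data/raw-data/gsi_gov_response.csv", index=False)
```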
The resulting CSV file with all of this combined data had 565 rows, one for each unique combination of country and year. While some variables (namely, the data from the UNODC, the World Bank, and the Tier placements over time) had values over time, the rest of the indicators were static variables, meaning a single value was extracted per country. These static values were repeated for every year within each country.
However, temporal data of this kind violates the i.i.d. (independent and identically distributed) assumption required by standard machine learning algorithms. In order to use this data for supervised and unsupervised learning, the temporal data had to be aggregated into single point estimates characterizing each indicator’s behavior over time. This was completed in the data cleaning section of the project.
Data cleaning
Various functions from Python’s pandas package were deployed for data cleaning: the standardization of country naming, the merging of different datasets, the aggregation of temporal data into single metrics for input into machine learning models, and more.
Country names were standardized to common country title representations by first comparing the differing denotations between the UNODC and World Bank datasets, and then manually writing a mapping to change any instance of a given representation of a country title to the preferred one. For example, the Czech Republic was listed as “Czechoslovakia” in some sources, and Turkey was denoted as “Turkiye”.
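A small sketch of this kind of standardization (the dictionary entries and column name here are illustrative, not the full mapping used in the project):

```python
import pandas as pd

# Illustrative name mapping -- the project's full dictionary covers every mismatched spelling found
name_map = {
    "Turkiye": "Turkey",
    "Republic of Moldova": "Moldova",
    "Russian Federation": "Russia",
}

def standardize_countries(df, column="Country"):
    # Replace any known alternate spelling with the preferred country title
    df[column] = df[column].replace(name_map)
    return df
```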
Using pandas tools, the temporal data was aggregated into single metrics for input into the machine learning models, representing either the rate of change of the variable over time, the mean value over time, or the sum of values; some variables were given more than one aggregate representation. ChatGPT was consulted while brainstorming this data manipulation strategy by prompting the LLM with the question: “What are the best ways to force data with temporal component to obey IID assumptions?” The rate of change was calculated as the regression slope of the indicator value across consecutive years, using scipy.stats’ linregress tool in Python (a sketch of this aggregation follows below). The final aggregated dataset has shape (46, 43): 46 rows, one per unique country, and 43 final indicators. Any variable titles ending in _sum represent the total accumulated count of the value over the years; any ending in _mean represent the mean value over the years; and any ending in _slope represent the regression slope of the value over the years, for which a positive value indicates that the indicator has been increasing over time.
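A minimal sketch of the aggregation step, assuming a long-format DataFrame with Country, Year, and a value column (column names are illustrative):

```python
import pandas as pd
from scipy.stats import linregress

def slope_over_years(group, value_col="Value"):
    # Regression slope of the indicator against the year; NaN if fewer than two observations
    valid = group.dropna(subset=[value_col])
    if len(valid) < 2:
        return float("nan")
    return linregress(valid["Year"].astype(float), valid[value_col].astype(float)).slope

def aggregate_indicator(df, value_col="Value"):
    # Collapse each country's time series into single point estimates: mean, sum, and slope
    summary = df.groupby("Country").agg(
        **{f"{value_col}_mean": (value_col, "mean"),
           f"{value_col}_sum": (value_col, "sum")}
    )
    summary[f"{value_col}_slope"] = df.groupby("Country").apply(slope_over_years, value_col=value_col)
    return summary.reset_index()
```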
The following variables composed the final, aggregated dataset for input into the machine learning models:
The data visualization component of this project utilizes numerous Python packages, including pandas, numpy, matplotlib.pyplot, and plotly.express, the latter providing the choropleth tool used to create interactive, time-based maps. Interactive maps, correlation matrix heatmaps, pairplots, line graphs, bar plots, and boxplots were all generated to provide visual perspectives on the data. Furthermore, statistical hypothesis tests such as the t-test and the Kruskal-Wallis test with Dunn’s pairwise analysis were performed to assess the statistical significance of differences between levels of categorical variables.
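As a brief sketch of the animated choropleth pattern with plotly.express (the indicator column shown here is illustrative, and the CSV path is the one produced in the data collection code below):

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("../../data/raw-data/europe_world_bank_data.csv")

# Animated choropleth: one frame per year, colored by the chosen indicator
fig = px.choropleth(
    df,
    locations="Country",
    locationmode="country names",   # match on English country names rather than ISO codes
    color="GDP_per_capita",         # illustrative indicator column
    animation_frame="Year",
    scope="europe",
    color_continuous_scale="Viridis",
)
fig.show()
```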
The t-test is a hypothesis test that compares the means of two groups.
Assumptions:
Normality: the data in each group should be roughly Normally distributed
Equality of variances: the variance of each group should be approximately the same
If this assumption can’t be confirmed, Welch’s t-test can be performed
Types of t-tests:
One-sample t-test: assesses if the mean of one group differs significantly from the known population mean
Independent two-sample t-test: Compares the means of two independent groups to determine if they are significantly different.
Paired t-test: Compares the means of two related groups (for example, before-and-after measurements)
For the EDA in this project, Welch’s two-sample independent t-test is performed using scipy.stats.ttest_ind; a brief sketch follows below
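A minimal sketch of this call, assuming the aggregated country-level dataset is loaded as final_df and using an illustrative indicator and grouping column (the actual column names in the project may differ):

```python
from scipy.stats import ttest_ind

# Illustrative split: compare an indicator between EU and non-EU countries
eu = final_df.loc[final_df["EU_membership"] == 1, "Detected_victims_mean"].dropna()
non_eu = final_df.loc[final_df["EU_membership"] == 0, "Detected_victims_mean"].dropna()

# equal_var=False gives Welch's t-test, which does not assume equal variances
t_stat, p_value = ttest_ind(eu, non_eu, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```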
The Kruskal-Wallis test compares the medians of three or more independent groups to determine any statistically significant differences using scipy.stats.kruskal
Non-parametric alternative to ANOVA
Non-parametric: means that the test does not depend on assumptions like Normality or equality of variances
The test statistic is calculated from the rank sums of the observations in each group
Works well for small sample sizes or groups with varying sample sizes
When the Kruskal-Wallis test detects a significant difference, Dunn’s test is used for pairwise comparisons between groups to determine which specific groups differ (using scikit_posthocs)
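Similarly, a hedged sketch of the Kruskal-Wallis test followed by Dunn’s pairwise comparisons, again assuming an illustrative final_df with hypothetical column names:

```python
from scipy.stats import kruskal
import scikit_posthocs as sp

# Illustrative grouping: compare an indicator across prostitution policy categories
groups = [g["Detected_victims_mean"].dropna()
          for _, g in final_df.groupby("Prostitution_policy")]

h_stat, p_value = kruskal(*groups)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")

# If the Kruskal-Wallis test is significant, Dunn's test identifies which pairs of groups differ
if p_value < 0.05:
    dunn = sp.posthoc_dunn(final_df, val_col="Detected_victims_mean",
                           group_col="Prostitution_policy", p_adjust="bonferroni")
    print(dunn)
```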
Machine learning
The supervised and unsupervised components used various tools from Python’s scikit-learn (sklearn) package for different machine learning tasks, as well as more specialized packages for specific models, such as xgboost. The models used are briefly outlined below, with more specific and detailed explanations in the respective sections of the project report in which they are applied.
Unsupervised Learning
Dimensionality Reduction:
Principal Component Analysis (PCA): Reduces data dimensions by transforming features into orthogonal principal components while retaining maximum variance
t-Distributed Stochastic Neighbor Embedding (t-SNE): Projects high-dimensional data into a lower-dimensional space, preserving local structure for visualization
Clustering Methods:
K-Means: Partitions data into clusters by minimizing intra-cluster Euclidean distances
Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Groups data points based on density, detecting clusters of varying shapes while labeling outliers as noise
Hierarchical Clustering: Groups data points by constructing a hierarchy of clusters using an agglomerative (bottom-up) or divisive (top-down) approach
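As a brief, non-definitive sketch of how these dimensionality reduction and clustering tools chain together in scikit-learn (X stands for the country-by-indicator feature matrix with missing values already handled, and the component and cluster counts are illustrative; the detailed applications appear in the unsupervised learning section):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Scale the (46, 43) country-by-indicator matrix before reduction and clustering
X_scaled = StandardScaler().fit_transform(X)

# Reduce to a handful of orthogonal components that retain most of the variance
pca = PCA(n_components=5, random_state=0)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)

# 2-D embedding for visualization; perplexity must stay below the number of samples
X_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_scaled)

# Cluster the PCA-reduced data into an illustrative number of groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_pca)
```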
Supervised Learning
Classification Methods:
Logistic Regression: Predicts binary outcomes using a logistic function
Random Forest: Ensemble method using decision trees to improve accuracy and reduce overfitting
Naive Bayes: Probabilistic classifier based on Bayes’ theorem, assuming feature independence
XGBoost: Gradient-boosting algorithm that combines multiple weak learners for robust classification
Regression Methods:
Lasso Regression: Linear regression with L1 regularization to enforce sparsity and prevent overfitting
Random Forest Regressor: Uses ensemble decision trees to model non-linear relationships in regression tasks
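Likewise, a minimal sketch of the supervised classification workflow, assuming an illustrative feature matrix X and binary target y (for example, a binarized Tier placement); the detailed, tuned applications appear in the supervised learning section:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Hold out a stratified test set from the small country-level dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```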
As noted, please see the respective unsupervised and supervised learning sections of this project for more detailed explanations of the models alongside their code applications.
All code in this project was executed using the programming language Python in Jupyter notebooks, and this website was assembled using Quarto.
Code
In the following code, the World Bank’s public API was used to extract information on relevant factors influencing human trafficking responses and patterns, which were identified from the literature review. As described in the methods section, this approach involves both manual and automated extraction of data. The manual component entailed finding the dataset links online and copying their indicator codes into the indicators dictionary in Step 1, as seen below. Then, ChatGPT was consulted to help compile the list of ISO codes identifying countries in Europe and Central Asia, as seen in Step 2. A function was written in Step 3 to parse the data, and the remainder of the code focuses on formatting the data. The dataset compiled below has four columns: 1.) the country name (written out in its official English World Bank name), 2.) the year (ranging from 2010 to 2022), 3.) the indicator, and 4.) the value of that indicator in that year for that country. This data was stored in a CSV for later use.
```python
import requests
import pandas as pd

# The base URL initialized here allows for the unique input of country code and indicator code,
# so that data on each of the specified indicators can be extracted for each of the specified
# country codes pertaining to countries in Europe and Central Asia.
base_url = "http://api.worldbank.org/v2/country/{country_code}/indicator/{indicator_code}"

# Step 1: specify indicators to extract
indicators = {
    "Population": "SP.POP.TOTL",
    "GDP_per_capita": "NY.GDP.PCAP.CD",
    "Unemployment_rate": "SL.UEM.TOTL.NE.ZS",
    "Political_stability": "PV.PER.RNK",
    "Criminal_justice": "RL.PER.RNK",
    "Refugee_population": "SM.POP.REFG"
}

# Step 2: specify country ISO codes for countries in Europe and Central Asia
european_country_codes = [
    "ALB", "AND", "ARM", "AUT", "AZE", "BEL", "BIH", "BGR", "CHE", "CYP", "CZE",
    "DEU", "DNK", "ESP", "EST", "FIN", "FRA", "GBR", "GEO", "GRC", "HRV", "HUN",
    "IRL", "ISL", "ITA", "KAZ", "KGZ", "KOS", "LIE", "LTU", "LUX", "LVA", "MCO",
    "MDA", "MKD", "MLT", "MNE", "NLD", "NOR", "POL", "PRT", "ROU", "RUS", "SMR",
    "SRB", "SVK", "SVN", "SWE", "TJK", "TKM", "TUR", "UKR", "UZB"
]

# Step 3: function to extract data from the World Bank API
def fetch_indicator_data(indicator_code, indicator_name, country_code):
    url = base_url.format(country_code=country_code, indicator_code=indicator_code)
    # 2010 to 2022 selected as the date range (to match the UNODC years)
    params = {"format": "json", "per_page": 1000, "date": "2010:2022"}
    response = requests.get(url, params=params)
    if response.status_code == 200:
        data = response.json()
        if len(data) > 1:
            df = pd.DataFrame(data[1])  # since the data is in the second list of the response
            if not df.empty:
                df["country"] = df["country"].apply(lambda x: x["value"])
                df = df[["country", "date", "value"]]  # only extract the necessary columns
                df["indicator"] = indicator_name
                return df
    return pd.DataFrame()

# list initialized to store the dataframes
dataframes = []

# Step 4: loop through indicators and countries
for code in european_country_codes:
    for name, indicator_code in indicators.items():
        df = fetch_indicator_data(indicator_code, name, code)
        if not df.empty:
            dataframes.append(df)

# Step 5: combine, reorder, rename
combined_data = pd.concat(dataframes, ignore_index=True)
combined_data = combined_data[["country", "date", "indicator", "value"]]
combined_data = combined_data.rename(columns={
    "country": "Country",
    "date": "Year",
    "indicator": "Indicator",
    "value": "Value"
})

# Step 6: pivot wider so each indicator has its own column
combined_data = combined_data.pivot(index=["Country", "Year"], columns="Indicator", values="Value").reset_index()
combined_data.columns.name = None

print(combined_data.head())

# Step 7: save to CSV for use later
combined_data.to_csv("../../data/raw-data/europe_world_bank_data.csv", index=False)
```
/var/folders/0s/3s0t3s4d31d4xtm_jt4pfhpr0000gn/T/ipykernel_11283/638825811.py:52: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
combined_data = pd.concat(dataframes, ignore_index=True)
The code below scrapes a table from Wikipedia’s page on the Henley Passport Index, which indicates the number of countries that a passport from each country allows visa-free entrance into. Python’s BeautifulSoup package was used to parse the page content and extract it in the format of a Pandas DataFrame. This data was collected for investigation into whether the restrictiveness of border crossing ability with a passport has a relationship with trafficking in that country.
ChatGPT and Claude AI were consulted for assistance in debugging the loop code: some countries were nested within the same row in a special way because the Rank column of the Wikipedia table groups multiple rows together when they have the same number of visa-free destinations. With my original code, not every country was captured in the output. With help from ChatGPT and Claude, which I used to investigate the debug output, I was able to modify the code to include every country.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://en.wikipedia.org/wiki/Henley_Passport_Index'

# sends a GET request to perform HTML parsing
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'class': 'wikitable'})

# initialize
data = []
current_visa_free = None

for row in table.find_all('tr')[1:]:  # skip the first (header) row
    cells = row.find_all(['td', 'th'])
    if len(cells) >= 2:
        # Check if this row has a visa-free number
        if len(cells) >= 3:
            visa_text = cells[2].text.strip()
            try:
                current_visa_free = int(visa_text)
            except ValueError:
                continue
        country_cell = cells[1]
        # Check if the cell has rowspan -- this identifies and handles grouped countries
        rowspan = cells[0].get('rowspan') if cells[0].get('rowspan') else 1
        # Get the rows that are grouped under the same Rank
        if rowspan and int(rowspan) > 1:
            # To handle grouped entries:
            next_rows = []
            current_row = row
            for _ in range(int(rowspan) - 1):
                current_row = current_row.find_next_sibling('tr')
                if current_row:
                    next_rows.append(current_row)
            # Process the current row
            for link in country_cell.find_all('a'):
                country_name = link.text.strip()
                # only add if country_name is not a number -- this ensures that only countries are in the row for proper degrouping
                if country_name and not country_name.isdigit():
                    data.append({
                        'Country': country_name,
                        'Visa_free_destinations': current_visa_free
                    })
            # Process the following rows in the same Rank group
            for next_row in next_rows:
                next_cells = next_row.find_all(['td', 'th'])
                if next_cells:
                    for link in next_cells[0].find_all('a'):
                        country_name = link.text.strip()
                        # Only add if country_name is not a number -- same as previously
                        if country_name and not country_name.isdigit():
                            data.append({
                                'Country': country_name,
                                'Visa_free_destinations': current_visa_free
                            })
        else:
            # For countries (rows) that are not grouped in a Rank with other countries
            for link in country_cell.find_all('a'):
                country_name = link.text.strip()
                # Only add if country_name is not a number
                if country_name and not country_name.isdigit():
                    data.append({
                        'Country': country_name,
                        'Visa_free_destinations': current_visa_free
                    })

df = pd.DataFrame(data)

# clean and sort the data in alphabetical order by country, which is the order of the other datasets used in this project
df = df.dropna()
df = df.drop_duplicates()
df = df.sort_values('Country', ascending=True).reset_index(drop=True)
print(df)

# save to CSV for later use
df.to_csv('../../data/raw-data/henley_passport_index.csv', index=False)
```
The data on USDS Trafficking in Persons Tier placements over time (from 2013 to 2023) was also extracted from the Trafficking in Persons Report Wikipedia page using Python’s BeautifulSoup package. The output was saved to a CSV file to be merged with the other indicators in the data cleaning section.
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# adjusts index to find the necessary table with data
tables = soup.find_all('table', {'class': 'wikitable'})
target_table = tables[0]  # to specify that the first table is the one to extract data from

# Extract table headers
headers = [header.text.strip() for header in target_table.find_all('th')]

# Extract table rows
rows = []
for row in target_table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all(['td', 'th'])
    row_data = [cell.text.strip() for cell in cells]
    rows.append(row_data)

df = pd.DataFrame(rows, columns=headers)
print(df.head())  # the output shows that column names will need to be cleaned

# save to CSV for later use
df.to_csv('../../data/raw-data/TIP_Tiers_overyears.csv', index=False)
```
Country Location 2011 [4] 2012 [5] 2013 [6] 2014 [7] \
0 Afghanistan Central Asia 2w 2w 2w 2
1 Albania Southeast Europe 2 2 2w 2
2 Algeria Northeast Africa 3 3 3 3
3 Angola Central Africa 2w 2w 2w 2w
4 Antigua and Barbuda Caribbean Sea 2 2 2w 2
2015 [8] 2016 [9] 2017 [10] 2018 [11] 2019 [12] 2020 [13] 2021 [14] \
0 2 2w 2 2 2w 3 3
1 2 2 2 2 2 2 2
2 3 3 2w 2w 2w 3 3
3 2 2 2 2w 2w 2 2
4 2w 2w 2w 2 2 2 2
2023 [15] Main article
0 3 Human trafficking in Afghanistan
1 2 Human trafficking in Albania
2 3 Human trafficking in Algeria
3 2 Human trafficking in Angola
4 2 Human trafficking in Antigua and Barbuda
Since the remainder of the data is manually collected and extracted, it will be handled in the data cleaning portion.
Summary
Challenges
This project was limited with regard to numerous data considerations, including:
The reports of detected human trafficking victims are likely inaccurate and underrepresentative of the true scope of the crime
The complexity of interpreting data on trafficking counts, convictions, and prosecutions: it is difficult to determine whether a low count of detected human trafficking victims reflects a safer country with a lower prevalence of the crime, or a country whose law enforcement is failing to identify and intervene in human trafficking cases
Manual extraction and input of policy data: since much of the policy information was found in the form of written policy reports, the information was manually input into a CSV file. This has both strengths and weaknesses: the benefit is that the data input was carefully supervised, ensuring accuracy of data collection; the disadvantage is that it makes the data used in this project harder to reproduce in future iterations.
Dataset size for machine learning: after aggregating the temporal data to satisfy the i.i.d. assumption, only 46 observations (one per country) remained for training the models
Computational and analytical issues:
When performing the supervised learning analysis, some of the methods attempted returned blank results, meaning that the algorithms were unable to output any predictions or evaluation metrics on the data. This problem occurred with the Gradient Boosting and Support Vector Regression models. It signifies that the data does not carry enough information for these methods to train a usable model.
Conclusion and Future Steps
This project requires better data in order to produce truly significant technical results and impact; the current dataset is simply too small, and its features too weakly related, to provide meaningful insight
Thus, future research should look into applying LSTM or ARIMA models to the temporal data compiled in the first part of the data collection process of this project, which is considerably larger
Research could also look into expanding beyond just European countries
Footnotes
The Group of Experts on Action against Trafficking in Human Beings (GRETA) is responsible for monitoring the implementation of the Council of Europe Convention on Action against Trafficking in Human Beings by the Parties. It conducts research and publishes country reports evaluating government activities and performance.↩︎