Data Collection

Overview

Goals

The goal of this project is to collect and combine a variety of country indicators that may be informative in representing a country’s response to human trafficking. This goal was established after analyzing the requirements for Tier placements in the U.S. Department of State’s Trafficking in Persons annual report, in which countries are categorized based on some rather vague characteristics. This project instead seeks to aggregate data that represents various dimensions of a country’s ability to fight human trafficking, as well as its prioritization of protecting its people from the crime. This data includes features on the financial and political stability of a country; its enactment of policies, including the legal status of prostitution and whether it outlines a victim non-punishment policy; characteristics of the strength of the nation’s passport and its judicial system; the number of detected victims of human trafficking in the country and of its nationals detected abroad; and more. By collecting holistic, well-rounded data on features identified through the literature review as influential factors in human trafficking response, this project seeks to provide insight into the shortcomings of current categorizations of countries and to shed light on countries making considerable progress without recognition. Moreover, the insights gained may help inform improved methods for assessing country performance.

Motivation

Human trafficking is the world’s fastest-growing crime, and it takes the freedom, health, well-being, and humanity of people every day. It is a crime that is overlooked in research and government prioritization because of how complex it is to identify, intervene in, and solve. As an underground crime, it requires specialized and intensive methods to crack down on.

There is a considerable divide between countries in Europe. Generally, countries in Western and Northern Europe are considered to be highly civilized, advanced, stable, and peaceful, whereas countries in Southern and Eastern Europe (especially those formerly part of the Soviet bloc or Yugoslavia) are known to have more geopolitical tensions and other disadvantages as a result of a history of political turmoil. Human trafficking tends to occur more in places that are less financially, politically, and legally stable, which results in the crime being more prevalent in Southern and Eastern Europe.

Countries in Southern and Eastern Europe are consistently placed in lower Tiers by the U.S. Department of State. However, significant efforts and reforms may be underway in those countries to combat a crime they are unfortunately plagued with at a higher rate. This means that the criteria used to compare countries may not reflect the entire situation. For example, countries in Northern and Western Europe tend to be more financially able to fund resources to fight human trafficking, but they need to deploy fewer of these resources because the rate of the crime is lower; in contrast, Southern and Eastern European countries must stretch more limited funding to provide more intervention. Therefore, the motivation behind this project is to uncover whether other indicator spaces may provide a more comprehensive explanation of a country’s true response to human trafficking.

Objectives

The main objectives of this project are to analyze data in order to:

  • Identify which indicators are the most important for characterizing the trends and flows of human trafficking in countries in Europe
  • Determine which indicators are the most explanatory of a government’s response to human trafficking
  • Provide insights into how well (or how poorly) the included data can explain a country’s Tier placement

By providing these analyses, this project hopes to present a better understanding of what national characteristics may influence a country’s human trafficking trends and government response abilities.

Methods

Data collection

All of the data used in this project was stored in CSV files in tabular format.

Data collection methods for this project included manual CSV downloads, manual data entry, extracting data from online sources using public APIs, and parsing webpages using BeautifulSoup. The variables included in the dataset are listed below, alongside the source from which they were extracted and the method by which they were extracted.

All of the data from the UNODC was downloaded from dataUNODC’s page on trafficking in persons, which contains the most authoritative official statistics on human trafficking variables.

The World Bank’s public API was used to extract information on relevant factors influencing human trafficking responses and patterns identified from the literature review (the full code appears in the Code section below). The base URL of the API takes two inputs: the country ISO code and the indicator code. I manually found the datasets on their web pages and copied their indicator abbreviation codes from the URL of each page. These indicators were stored in a dictionary called indicators so they could be looped over as inputs to the base URL and their data fetched. I also specified the country ISO codes so that data would only be extracted for countries in Europe and Central Asia. ChatGPT was consulted to create the list of countries, given my specified geographic region; I then confirmed it against the International Organization for Standardization’s official website. Since I was running into trouble extracting the data in an appropriate format on my own, ChatGPT was also consulted for specifying the parameters in the fetch_indicator_data function, which helped me work around the data being stored in the second element of the returned JSON response.

Python’s BeautifulSoup package was also used to parse data from two Wikipedia pages: the USDS Tier placements over time and the Henley Passport Index per country. BeautifulSoup accomplishes web scraping by parsing the HTML content of a webpage: it accesses the raw HTML and builds a “parse tree” from which the specified information can be extracted. Methods such as find(), find_all(), or tag navigation can then be used to locate and extract the desired data.
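
A minimal, generic sketch of this pattern is shown here (using the Henley Passport Index page as in the full scraping code later in this section; that code handles the table structure in much more detail):

import requests
from bs4 import BeautifulSoup

# Fetch the page and build BeautifulSoup's parse tree from the raw HTML
url = "https://en.wikipedia.org/wiki/Henley_Passport_Index"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# find() returns the first matching tag; find_all() returns every match
table = soup.find("table", {"class": "wikitable"})
for row in table.find_all("tr")[1:3]:  # first two data rows, header skipped
    print([cell.text.strip() for cell in row.find_all(["td", "th"])])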

The preliminary merged dataset, to be manipulated in the data cleaning phase, contained the following variables:

  • Country
  • Year
  • Detected victims of human trafficking (UNODC)
  • Human trafficking convictions per country (UNODC)
  • Human trafficking prosecutions per country (UNODC)
  • Population (World Bank)
  • GDP per capita (World Bank)
  • Unemployment rate (World Bank)
  • Criminal justice score (World Bank)
  • Political stability (World Bank)
  • Percentage of population that are refugees (World Bank)
  • Tier placement in 2024 (2024 USDS Trafficking in Persons report)
  • Tier placements every year from 2013-2023 (Wikipedia)
  • Victim nonpunishment policy (manually extracted and input into dataset from GRETA’s non-punishment principle report)1
  • Prostitution policy (manually extracted and input into dataset from the European Parliament)
  • Number of visa-free destinations with national passport (also known as Henley passport index) (Wikipedia)
  • EU membership (manually extracted and input into dataset from the E.U. official site)
  • Post-soviet membership (manually extracted and input into dataset from Wikipedia)
  • GSI Government Response Score
    • The Global Slavery Index (GSI) dataset is codified and presented by Walk Free, an international human rights nonprofit that conducts research and advocates for policy reform on modern slavery, an umbrella term encompassing exploitation, forced labour, human trafficking, and forced marriage
    • The government response score is one of many metrics developed in the GSI dataset. The score is based on a country’s fulfillment of dozens of activity criteria
    • To extract the GSI government response score per country, the total accumulated points across these criteria were taken from the results sheet of the dataset

The raw datasets can be found here:

  • UNODC data
  • World Bank data
  • Visa free destinations data
  • Tier placements over time
  • Policy data (all manually extracted)
  • GSI Government response score

The resulting CSV file with all of this combined data had 565 rows, one for each unique combination of country and year. While some variables (namely, the data from the UNODC, the World Bank, and the Tier placements over time) varied over time, the rest of the indicators were static variables, meaning a single value was extracted per country. These static values were repeated for every year within each country.

However, temporal data of this kind violates the i.i.d. (independent and identically distributed) assumption made by many machine learning algorithms. In order to use this data for supervised and unsupervised learning, the temporal data had to be collapsed into single point estimates characterizing each indicator’s behavior over time. This was completed in the data cleaning section of the project.

Data cleaning

Various functions from Python’s pandas package were deployed for data cleaning: the standardization of country naming, the merging of different datasets, the aggregation of temporal data into single metrics for input into machine learning models, and more.

Country names were standardized to common country titles by first comparing the differing denotations between the UNODC and World Bank datasets, and then manually writing a mapping to change any instance of a given presentation of a country title to the preferred one. For example, the Czech Republic was listed as “Czechoslovakia” in some sources, and Turkey was denoted as “Turkiye”.
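
A minimal sketch of this kind of mapping is shown below; the entries here are illustrative rather than the project’s actual dictionary:

import pandas as pd

# Hypothetical mapping from source-specific spellings to the preferred country title
name_map = {
    "Turkiye": "Turkey",
    "Republic of Moldova": "Moldova",
    "Russian Federation": "Russia",
}

# Small frame standing in for one of the raw datasets
raw = pd.DataFrame({"Country": ["Turkiye", "Albania", "Russian Federation"]})
raw["Country"] = raw["Country"].replace(name_map)
print(raw["Country"].tolist())  # ['Turkey', 'Albania', 'Russia']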

Using pandas tools, the temporal data was aggregated into single metrics for input into the machine learning models, representing either the rate of change of the variable over time, the mean value over time, or the sum of values; some variables included more than one aggregate representation. ChatGPT was consulted while brainstorming this data manipulation strategy, by prompting the LLM with the question: “What are the best ways to force data with a temporal component to obey IID assumptions?” The rate of change was calculated as the regression slope of the indicator values across the years, using the linregress function from scipy.stats; a minimal sketch of this aggregation approach is shown below. The final dataset for input into the machine learning models has shape (46, 43), indicating 46 rows (one per unique country) and 43 final indicators. Any variable titles ending in _sum represent the total accumulated count of the value over the years, any ending in _mean represent the mean value over the years, and any ending in _slope represent the regression slope of the value over the years, for which a positive value indicates that the indicator has been increasing.
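
The sketch below illustrates this aggregation approach on a toy long-format frame with Country, Year, and a single numeric indicator; the column name and values are made up for illustration:

import pandas as pd
from scipy.stats import linregress

# Illustrative long-format data: one row per country-year (values are synthetic)
df = pd.DataFrame({
    "Country": ["Albania"] * 3 + ["Austria"] * 3,
    "Year": [2010, 2011, 2012] * 2,
    "Detected_victims": [95, 84, 92, 30, 34, 41],
})

# Sum and mean aggregates per country
agg = df.groupby("Country").agg(
    Detected_victims_sum=("Detected_victims", "sum"),
    Detected_victims_mean=("Detected_victims", "mean"),
)

# Regression slope of the indicator over the years, per country
agg["Detected_victims_slope"] = (
    df.groupby("Country")[["Year", "Detected_victims"]]
      .apply(lambda g: linregress(g["Year"], g["Detected_victims"]).slope)
)
print(agg)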

The following variables composed the final, aggregated dataset for input into the machine learning models:

  • Country
  • Subregion
  • Detected_victims_sum
  • Detected_victims_slope
  • Convicted_traffickers_sum
  • Convicted_traffickers_slope
  • Prosecuted_traffickers_sum
  • Prosecuted_traffickers_slope
  • Convictions_over_prosecutions_mean
  • Number_Repatriated_Victims_sum
  • Tier_mean
  • Tier_slope
  • Criminal_justice_mean
  • Criminal_justice_slope
  • GDP_per_capita_mean
  • GDP_per_capita_slope
  • Political_stability_mean
  • Political_stability_slope
  • Population_mean
  • Refugee_population_mean
  • Refugee_population_slope
  • Unemployment_rate_mean
  • Unemployment_rate_slope
  • Detections_per_100_mean
  • Reported_Years_Detected_Victims
  • Nonpunishment_policy_before2021
  • Nonpunishment_policy_after2021
  • Limited_nonpunishment_policy
  • Post_soviet_states
  • EU_members
  • Visa_free_destinations
  • GSI_gov_response_score
  • prostitution_policy
  • Tier_2024

Download the final dataset

Exploratory Data Analysis (EDA)

The data visualization component of this project utilizes numerous Python packages, including pandas, numpy, matplotlib.pyplot, and plotly.express, whose choropleth tool was used for the creation of interactive, time-based maps. Interactive maps, correlation matrix heatmaps, pairplots, line graphs, bar plots, and boxplots were all generated to provide visual perspectives on the data. Furthermore, statistical hypothesis tests such as the t-test and the Kruskal-Wallis test with Dunn’s pairwise analysis were performed to assess the statistical significance of differences between levels of categorical variables.
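
As a small illustration of the interactive-map component, a plotly.express choropleth can be built from a country/year/value frame roughly as follows; the column names and values here are made up, not the project’s actual data:

import pandas as pd
import plotly.express as px

# Illustrative data: one row per country-year, with ISO-3 codes for the map
df = pd.DataFrame({
    "iso_code": ["ALB", "ALB", "AUT", "AUT"],
    "Country": ["Albania", "Albania", "Austria", "Austria"],
    "Year": [2010, 2011, 2010, 2011],
    "Detected_victims": [95, 84, 30, 34],
})

fig = px.choropleth(
    df,
    locations="iso_code",       # ISO-3 country codes
    color="Detected_victims",
    hover_name="Country",
    animation_frame="Year",     # slider over the time dimension
    scope="europe",
)
fig.show()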

  • The t-test is a hypothesis test that compares the means of two groups.
    • Assumptions:
      1. Normality: the data in each group should be roughly Normally distributed
      2. Equality of variances: the variance of each group should be approximately the same
        • If this assumption can’t be confirmed, Welch’s t-test can be performed
    • Types of t-tests:
      1. One-sample t-test: assesses if the mean of one group differs significantly from the known population mean
      2. Independent two-sample t-test: Compares the means of two independent groups to determine if they are significantly different.
      3. Paired t-test: Compares the means of two related groups (for example, before-and-after measurements)
    • For the EDA in this project, Welch’s two-sample independent t-test will be performed using scipy.stats.ttest_ind
  • The Kruskal-Wallis test compares the medians of three or more independent groups to determine any statistically significant differences using scipy.stats.kruskal
    • Non-parametric alternative to ANOVA
    • Non-parametric: means that the test does not depend on assumptions like Normality or equality of variances
    • Test statistic is calculated from the rank sums of observations in each group
    • Works well for small sample sizes or groups with varying sample sizes
    • When the Kruskal-Wallis test detects a significant difference, Dunn’s test is used for pairwise comparisons between groups to determine which specific groups differ (using scikit_posthocs); a minimal usage sketch of these tests follows this list
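
Below, both tests are run on synthetic groups (group labels and values are made up; scipy and scikit_posthocs are assumed to be installed):

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, kruskal
import scikit_posthocs as sp

rng = np.random.default_rng(0)
eu = rng.normal(70, 10, 20)       # synthetic scores for one group (e.g., EU members)
non_eu = rng.normal(60, 12, 20)   # synthetic scores for the other group

# Welch's two-sample t-test (equal_var=False relaxes the equal-variance assumption)
t_stat, t_p = ttest_ind(eu, non_eu, equal_var=False)

# Kruskal-Wallis across three synthetic groups (e.g., subregions)
west, east, south = rng.normal(70, 10, 15), rng.normal(55, 10, 15), rng.normal(60, 10, 15)
h_stat, kw_p = kruskal(west, east, south)

# Dunn's pairwise post-hoc test, used if the Kruskal-Wallis result is significant
long = pd.DataFrame({
    "value": np.concatenate([west, east, south]),
    "group": ["west"] * 15 + ["east"] * 15 + ["south"] * 15,
})
dunn = sp.posthoc_dunn(long, val_col="value", group_col="group", p_adjust="bonferroni")
print(t_stat, t_p, h_stat, kw_p)
print(dunn)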

Machine learning

The supervised and unsupervised components used various tools from Python’s scikit-learn (sklearn) package for different machine learning tasks, as well as more specialized packages such as xgboost for specific models. The models used are briefly outlined below, with illustrative code sketches after each outline and more specific, detailed explanations in the respective sections of the project report in which they are applied.

Unsupervised Learning

  • Dimensionality Reduction:
    • Principal Component Analysis (PCA): Reduces data dimensions by transforming features into orthogonal principal components while retaining maximum variance
    • t-Distributed Stochastic Neighbor Embedding (t-SNE): Projects high-dimensional data into a lower-dimensional space, preserving local structure for visualization
  • Clustering Methods:
    • K-Means: Partitions data into clusters by minimizing intra-cluster Euclidean distances
    • Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Groups data points based on density, detecting clusters of varying shapes while labeling outliers as noise
    • Hierarchical Clustering: Groups data points by constructing a hierarchy of clusters using an agglomerative (bottom-up) or divisive (top-down) approach
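
As a brief illustration of how these methods are typically wired together with scikit-learn (the feature matrix below is synthetic; the actual applications are in the unsupervised learning section):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic matrix standing in for the 46 countries x country-level indicators
X = np.random.default_rng(1).normal(size=(46, 10))

X_scaled = StandardScaler().fit_transform(X)          # scale features before PCA/clustering
X_pca = PCA(n_components=2).fit_transform(X_scaled)   # project onto two principal components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(X_pca.shape, np.bincount(labels))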

Supervised Learning

  • Classification Methods:
    • Logistic Regression: Predicts binary outcomes using a logistic function
    • Random Forest: Ensemble method using decision trees to improve accuracy and reduce overfitting
    • Naive Bayes: Probabilistic classifier based on Bayes’ theorem, assuming feature independence
    • XGBoost: Gradient-boosting algorithm that combines multiple weak learners for robust classification
  • Regression Methods:
    • Lasso Regression: Linear regression with L1 regularization to enforce sparsity and prevent overfitting
    • Random Forest Regressor: Uses ensemble decision trees to model non-linear relationships in regression tasks
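
Similarly, a minimal supervised-learning sketch (synthetic features and Tier-style labels; the actual models, features, and tuning are detailed in the supervised learning section):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(46, 10))        # synthetic indicator matrix
y = rng.integers(1, 4, size=46)      # synthetic Tier-style labels (1, 2, 3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))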

As noted, please see the respective unsupervised and supervised learning sections of this project for more detailed explanations of the models alongside their code applications.

All code in this project was executed using the programming language Python in Jupyter notebooks, and this website was assembled using Quarto.

Code

In the following code, the World Bank’s public API was used to extract information on relevant factors influencing human trafficking responses and patterns, which were identified from the literature review. As described in the methods section, this approach combines manual and automated extraction of data. The manual component entailed finding the dataset links online and copying their indicator codes into the indicators dictionary in Step 1, as seen below. Then, ChatGPT was consulted to help compile the list of ISO codes identifying countries in Europe and Central Asia, as seen in Step 2. A function was written in Step 3 to parse the data, and the remainder of the code focuses on formatting the data. The dataset compiled below has four columns: 1) the country name (written out in its official English World Bank form), 2) the year (ranging from 2010 to 2022), 3) the indicator, and 4) the value of that indicator in that year for that country. This data was stored in a CSV for later use.

import requests
import pandas as pd

# The base URL initialized here allows for the unique input of country code and indicator code, 
# so that data on each of the specified indicators can be extracted for each of the specified 
# country codes pertaining to countries in Europe and Central Asia. 
base_url = "http://api.worldbank.org/v2/country/{country_code}/indicator/{indicator_code}"

# Step 1: specify indicators to extract
indicators = {
    "Population": "SP.POP.TOTL",
    "GDP_per_capita": "NY.GDP.PCAP.CD",
    "Unemployment_rate": "SL.UEM.TOTL.NE.ZS",
    "Political_stability": "PV.PER.RNK", 
    "Criminal_justice": "RL.PER.RNK", 
    "Refugee_population": "SM.POP.REFG"
}

# Step 2: specify country ISO codes for countries in Europe and Central Asia
european_country_codes = [
    "ALB", "AND", "ARM", "AUT", "AZE", "BEL", "BIH", "BGR", "CHE", "CYP", "CZE",
    "DEU", "DNK", "ESP", "EST", "FIN", "FRA", "GBR", "GEO", "GRC", "HRV", "HUN",
    "IRL", "ISL", "ITA", "KAZ", "KGZ", "KOS", "LIE", "LTU", "LUX", "LVA", "MCO",
    "MDA", "MKD", "MLT", "MNE", "NLD", "NOR", "POL", "PRT", "ROU", "RUS", "SMR",
    "SRB", "SVK", "SVN", "SWE", "TJK", "TKM", "TUR", "UKR", "UZB"
]

# Step 3: Function to extract from World Bank API
def fetch_indicator_data(indicator_code, indicator_name, country_code):
    url = base_url.format(country_code=country_code, indicator_code=indicator_code)
    params = {"format": "json", "per_page": 1000, "date": "2010:2022"} # 2010 to 2022 selected as data range (to match the UNODC years)
    response = requests.get(url, params=params)
    if response.status_code == 200:
        data = response.json()
        if len(data) > 1:
            df = pd.DataFrame(data[1])  # the records are in the second element of the JSON response
            if not df.empty:
                df["country"] = df["country"].apply(lambda x: x["value"])  
                df = df[["country", "date", "value"]]  # specified to only extract the necessary columns
                df["indicator"] = indicator_name
                return df
    return pd.DataFrame()

# list initialized to store the dataframes
dataframes = []

# Step 4: loop through indicators and countries
for code in european_country_codes:
    for name, indicator_code in indicators.items():
        df = fetch_indicator_data(indicator_code, name, code)
        if not df.empty:
            dataframes.append(df)

# Step 5: Combine, reorder, rename
combined_data = pd.concat(dataframes, ignore_index=True)
combined_data = combined_data[["country", "date", "indicator", "value"]]
combined_data = combined_data.rename(columns={
    "country": "Country",
    "date": "Year", 
    "indicator": "Indicator",
    "value": "Value"
})

# Step 6: pivot wider so each indicator has its own column
combined_data = combined_data.pivot(index=["Country", "Year"], columns="Indicator", values="Value").reset_index()
combined_data.columns.name = None  
combined_data.columns = [str(col) for col in combined_data.columns]  # ensure all column names are plain strings

print(combined_data.head())

# Step 7: save to csv for use later
combined_data.to_csv("../../data/raw-data/europe_world_bank_data.csv", index=False)
print("Data extraction complete. Saved to 'europe_world_bank_data.csv'.")
   Country  Year  Criminal_justice  GDP_per_capita  Political_stability  \
0  Albania  2010         42.654030     4094.349686            38.388626   
1  Albania  2011         41.314552     4437.141161            37.440758   
2  Albania  2012         39.906105     4247.631343            40.284359   
3  Albania  2013         38.967136     4413.063383            49.289101   
4  Albania  2014         44.230770     4578.633208            61.428570   

   Population  Refugee_population  Unemployment_rate  
0   2913021.0                75.0             14.086  
1   2905195.0                79.0             13.481  
2   2900401.0                84.0             13.376  
3   2895092.0                96.0             15.866  
4   2889104.0               111.0             18.055  
Data extraction complete. Saved to 'europe_world_bank_data.csv'.

The code below scrapes a table from Wikipedia’s page on the Henley Passport Index, which indicates the number of countries that a passport from each country allows visa-free entrance into. Python’s BeautifulSoup package was used to parse the page content and extract it in the format of a Pandas DataFrame. This data was collected for investigation into whether the restrictiveness of border crossing ability with a passport has a relationship with trafficking in that country.

ChatGPT and Claude were consulted for assistance in debugging the loop code: some countries were nested within the same row because the Rank column in the Wikipedia table groups multiple rows together when they share the same number of visa-free destinations. With my original code, not every country was captured in the output. With help from ChatGPT and Claude, which I used to investigate the debug output, I was able to modify the code to include every country.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://en.wikipedia.org/wiki/Henley_Passport_Index'
# sends a GET request to perform HTML parsing
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table', {'class': 'wikitable'})

# initialize
data = []
current_visa_free = None


for row in table.find_all('tr')[1:]: # skip the first (header) row
    cells = row.find_all(['td', 'th'])
    
    if len(cells) >= 2:
        # Check if this row has a visa-free number
        if len(cells) >= 3:
            visa_text = cells[2].text.strip()
            try:
                current_visa_free = int(visa_text)
            except ValueError:
                continue
                
        country_cell = cells[1]
        
        # Check if the cell has rowspan-- this identifies and handles grouped countries
        rowspan = cells[0].get('rowspan') if cells[0].get('rowspan') else 1
        
        # Get the rows that are grouped under the same Rank 
        if rowspan and int(rowspan) > 1:
            # To handle grouped entries: 
            next_rows = []
            current_row = row
            for _ in range(int(rowspan)-1):
                current_row = current_row.find_next_sibling('tr')
                if current_row:
                    next_rows.append(current_row)
            
            # Process the current row
            for link in country_cell.find_all('a'):
                country_name = link.text.strip()
                # only add if country_name is not a number -- this ensures that only countries are in the row for proper degrouping
                if country_name and not country_name.isdigit():
                    data.append({
                        'Country': country_name,
                        'Visa_free_destinations': current_visa_free
                    })
            
            # Process the following rows in the same Rank group
            for next_row in next_rows:
                next_cells = next_row.find_all(['td', 'th'])
                if next_cells:
                    for link in next_cells[0].find_all('a'):
                        country_name = link.text.strip()
                        # Only add if country_name is not a number -- same as previously
                        if country_name and not country_name.isdigit():
                            data.append({
                                'Country': country_name,
                                'Visa_free_destinations': current_visa_free
                            })
        else:
            # For countries (rows) that are not grouped in a Rank with other countries
            for link in country_cell.find_all('a'):
                country_name = link.text.strip()
                # Only add if country_name is not a number
                if country_name and not country_name.isdigit():
                    data.append({
                        'Country': country_name,
                        'Visa_free_destinations': current_visa_free
                    })

df = pd.DataFrame(data)

# clean and sort the data in alphabetical order by country, which is the order of the other datasets used in this project
df = df.dropna()
df = df.drop_duplicates()
df = df.sort_values('Country', ascending=True).reset_index(drop=True)

print(df)

# save to CSV for later use
df.to_csv('../../data/raw-data/henley_passport_index.csv', index=False)
         Country  Visa_free_destinations
0    Afghanistan                      26
1        Albania                     123
2        Algeria                      55
3        Andorra                     171
4         Angola                      52
..           ...                     ...
194    Venezuela                     124
195      Vietnam                      51
196        Yemen                      33
197       Zambia                      70
198     Zimbabwe                      65

[199 rows x 2 columns]

The data on USDS Trafficking in Persons Tier placements over time (from 2013 to 2023) was also extracted from Wikipedia, this time from the page on the Trafficking in Persons Report, using Python’s BeautifulSoup package. The output was saved to a CSV file to be merged with the other indicators in the data cleaning section.

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "https://en.wikipedia.org/wiki/Trafficking_in_Persons_Report"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# find all wikitables on the page and select the one containing the Tier data
tables = soup.find_all('table', {'class': 'wikitable'})
target_table = tables[0]  # the first wikitable holds the Tier placements by year

# Extract table headers
headers = [header.text.strip() for header in target_table.find_all('th')]

# Extract table rows
rows = []
for row in target_table.find_all('tr')[1:]:  # Skip the header row
    cells = row.find_all(['td', 'th'])
    row_data = [cell.text.strip() for cell in cells]
    rows.append(row_data)


df = pd.DataFrame(rows, columns=headers)
print(df.head())
# the output shows that column names will need to be cleaned

# save to CSV for later use
df.to_csv('../../data/raw-data/TIP_Tiers_overyears.csv', index=False)
               Country          Location 2011 [4] 2012 [5] 2013 [6] 2014 [7]  \
0          Afghanistan      Central Asia       2w       2w       2w        2   
1              Albania  Southeast Europe        2        2       2w        2   
2              Algeria  Northeast Africa        3        3        3        3   
3               Angola    Central Africa       2w       2w       2w       2w   
4  Antigua and Barbuda     Caribbean Sea        2        2       2w        2   

  2015 [8] 2016 [9] 2017 [10] 2018 [11] 2019 [12] 2020 [13] 2021 [14]  \
0        2       2w         2         2        2w         3         3   
1        2        2         2         2         2         2         2   
2        3        3        2w        2w        2w         3         3   
3        2        2         2        2w        2w         2         2   
4       2w       2w        2w         2         2         2         2   

  2023 [15]                              Main article  
0         3          Human trafficking in Afghanistan  
1         2              Human trafficking in Albania  
2         3              Human trafficking in Algeria  
3         2               Human trafficking in Angola  
4         2  Human trafficking in Antigua and Barbuda  

Since the remainder of the data is manually collected and extracted, it will be handled in the data cleaning portion.

Summary

Challenges

  • This project was limited by numerous data considerations, including:

    • The reports of detected human trafficking victims are likely inaccurate and underrepresentative of the true scope of the crime
    • The complexity of interpreting data on trafficking counts, convictions, and prosecutions: it is difficult to determine whether a low count of detected victims reflects a safer country with a lower prevalence of the crime, or a country whose law enforcement is performing poorly at identifying and intervening in human trafficking cases
    • Manual extraction and input of policy data: since much of the policy information was found in the form of written policy reports, the information was manually input into a CSV file. This has both strengths and weaknesses: the benefit is that the data input was carefully supervised, ensuring accuracy of data collection; the disadvantage is that it makes the data used in this project harder to reproduce in future iterations.
    • Dataset size for machine learning
  • Computational and analytical issues:

    • When performing supervised learning analysis, some of the methods attempted returned blank results, meaning the algorithms were unable to output any predictions or evaluation metrics on the data. This problem occurred with the Gradient Boosting and Support Vector Regression models. This suggests that the data does not carry enough signal for these methods to train a useful model.

Conclusion and Future Steps

  • This project requires better data in order to produce truly significant technical results and impact; the dataset is simply too small, and its indicators too weakly related, to provide meaningful insight
  • Thus, future research should look into applying LSTM or ARIMA models to the larger, temporal data compiled in the first part of the data collection process of this project
  • Research could also look into expanding beyond just European countries

Footnotes

  1. The Group of Experts on Action against Trafficking in Human Beings (GRETA) is responsible for monitoring the implementation of the Council of Europe Convention on Action against Trafficking in Human Beings by the Parties. It conducts research and publishes country reports evaluating government activities and performance.↩︎