This page documents the data acquisition and preprocessing steps behind this project, so that the methods used to tell this story are transparent.
Incident data is sourced from both public and private channels, including:
Systematic media filtering using a custom-built data scraping tool developed for Humanitarian Outcomes
Direct submissions from aid organizations and security entities
Information-sharing agreements with regional and field-level security consortiums
All incident reports undergo annual cross-verification with relevant agencies to ensure accuracy. Note that the most recent entries are provisional and may be updated following formal verification.
Data Cleaning
Data cleaning was performed independently using Python. The following tasks were completed:
Standardized column names
Coerced numeric columns to numeric types and replaced negative values with missing values (NaN)
Converted date columns (year, month, and day) and geospatial coordinate columns (latitude and longitude) to numeric types
Converted all columns representing counts of affected people to numeric types
Filtered and validated date columns:
Retained only valid date ranges:
Years between 1997 and 2025
Months between 1 and 12
Days between 1 and 31
Removed duplicate rows
Exported the cleaned dataset to a new CSV file for further analysis
The code for this data cleaning can be found in the cell below.
Code
import pandas as pd
import numpy as np
import janitor  # pyjanitor registers the clean_names() method on DataFrames

# Load the dataset
df = pd.read_csv('../data/security_incidents.csv')
print(f"\nDataset shape: {df.shape}")
print(f"Number of missing values:\n{df.isna().sum()}")

# 1. Clean column names (similar to janitor in R)
df = df.clean_names()
print("Cleaned column names:")
print(df.columns.tolist())

# 2. Clean year column - keep only 1997-2025 and convert to numeric
df['year'] = pd.to_numeric(df['year'], errors='coerce')
df = df[(df['year'] >= 1997) & (df['year'] <= 2025)]
df['year'] = df['year'].astype('int64')  # convert to integer type
print(f"\nUnique years after cleaning: {sorted(df['year'].unique())}")

# 3. Clean month column - keep only 1-12
df['month'] = pd.to_numeric(df['month'], errors='coerce')
df = df[(df['month'] >= 1) & (df['month'] <= 12)]
df['month'] = df['month'].astype('int64')
print(f"Unique months after cleaning: {sorted(df['month'].unique())}")

# 4. Clean day column - keep only 1-31
df['day'] = pd.to_numeric(df['day'], errors='coerce')
df = df[(df['day'] >= 1) & (df['day'] <= 31)]
df['day'] = df['day'].astype('int64')
print(f"Day column range: {df['day'].min()} - {df['day'].max()}")

# 5. Clean the count columns, coercing to numeric and replacing
#    negative values with NaN
if len(df.columns) > 30:
    for col in df.columns[10:30]:  # 0-indexed slice 10:30 selects columns 10-29
        df[col] = pd.to_numeric(df[col], errors='coerce')
        df[col] = df[col].apply(lambda x: x if x >= 0 else np.nan)

# 6. Ensure latitude and longitude are numeric
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')
print(f"\nLatitude range: {df['latitude'].min()} - {df['latitude'].max()}")
print(f"Longitude range: {df['longitude'].min()} - {df['longitude'].max()}")

# 7. Remove duplicates
original_count = len(df)
df = df.drop_duplicates()
new_count = len(df)
print(f"\nRemoved {original_count - new_count} duplicate rows")

# Final dataset summary
print(f"\nFinal dataset shape: {df.shape}")
print(f"Number of missing values:\n{df.isna().sum()}")

# Save cleaned dataset
df.to_csv('../data/cleaned_security_incidents.csv', index=False)
print("\nCleaned data saved to 'cleaned_security_incidents.csv'")
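Note that filtering days to the range 1-31 does not by itself guarantee valid calendar dates: a value like February 30 would still pass the individual range checks. The sketch below is a minimal additional validation, not part of the project's cleaning cell; it assumes the cleaned CSV produced above and uses pandas' to_datetime with errors='coerce' to flag impossible year/month/day combinations.

import pandas as pd

# Load the cleaned dataset produced by the cell above
df = pd.read_csv('../data/cleaned_security_incidents.csv')

# Assemble a single datetime column from the validated year/month/day parts;
# errors='coerce' turns impossible combinations (e.g. February 30) into NaT
df['date'] = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')

# Report and drop any rows whose parts do not form a real calendar date
invalid = df['date'].isna().sum()
print(f"Rows with impossible dates: {invalid}")
df = df[df['date'].notna()]

Any row flagged as NaT here has components that pass the individual range filters but do not combine into a real date.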
Tools, Packages, Methods
This project employed a range of coding tools and techniques to create an interactive storytelling experience. Custom HTML styling was used to design and enhance the homepage effects. The dynamic global map timeline was built with Plotly choropleth mapping, and the interactive dashboard was likewise developed with Plotly. The interactive scatterplot was created using Altair. The entire project was implemented in Python, and all open-source code is available in the linked GitHub repository.
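As an illustration of the choropleth timeline approach, the sketch below shows a minimal Plotly Express version. This is not the project's exact code; the 'country' column name and the per-country, per-year aggregation are assumptions about the cleaned dataset's schema.

import pandas as pd
import plotly.express as px

df = pd.read_csv('../data/cleaned_security_incidents.csv')

# Count incidents per country per year ('country' column name is assumed)
counts = (df.groupby(['country', 'year'])
            .size()
            .reset_index(name='incidents'))

# Animated choropleth: one frame per year, countries shaded by incident count
fig = px.choropleth(
    counts,
    locations='country',
    locationmode='country names',  # match on country names rather than ISO codes
    color='incidents',
    animation_frame='year',
    color_continuous_scale='Reds',
)
fig.show()

Passing animation_frame='year' is what produces the timeline slider and play controls in Plotly Express.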