Projects

End-to-end projects spanning applied deep learning, computer vision, NLP and LLM pipelines, RAG systems, data science, data analytics, and data engineering — built with production-quality standards including HPC-distributed GPU training, cloud storage (AWS S3), containerised deployments with Docker, CI/CD automation, and live web applications. Use the filter to explore by domain.

PROJECT 01

Dog Breed Classification
via Deep Learning & Transfer Learning

Deep Learning & Computer Vision

A fine-grained image classification system capable of identifying 120 dog breeds from the Stanford Dogs Dataset (20,580 images). The project explores the full deep learning lifecycle — from data preprocessing and augmentation design through distributed GPU training to final evaluation — demonstrating strong generalization with a top-1 test accuracy of 84.20%.

84.20%
Top-1 Accuracy
96.54%
Top-3 Accuracy
98.05%
Top-5 Accuracy
0.67%
Generalization Gap
120
Breed Classes
20,580
Training Images

Training Pipeline

Stanford Dogs Dataset 20,580 images · 120 breeds
Annotation Parsing XML bounding boxes
BBox Crop + Resize 224×224 px · +7% accuracy
Stratified Split 70 / 20 / 10
On-the-fly Augmentation Albumentations · 10 transforms
ResNet50 ImageNet pretrained · frozen backbone
AdamW + CosineAnnealingLR 50 epochs · AMP · grad clipping
Evaluation 84.56% Top-1 · 98.25% Top-5
  • Applied transfer learning with a pretrained ResNet50 (ImageNet) backbone, replacing the final classification head for 120-class output on fine-grained dog breed recognition.
  • Built a robust preprocessing pipeline using bounding-box-guided cropping and 224×224 resizing, yielding ~7% accuracy gain over uncropped baselines.
  • Implemented aggressive on-the-fly data augmentation via Albumentations (random crops, flips, rotations, colour jitter, Gaussian noise/blur, random erasing) to combat overfitting.
  • Trained over 50 epochs on a 2-GPU distributed setup (2× Tesla V100, University of Luxembourg HPC) using PyTorch DDP, AdamW optimiser with CosineAnnealingLR scheduling, and gradient clipping.
  • Achieved a generalization gap of only 0.67% between validation and test accuracy, demonstrating strong model robustness and minimal overfitting.
Python PyTorch Torchvision Albumentations CUDA PyTorch DDP ResNet50 Scikit-learn NumPy Pandas Matplotlib Seaborn HPC / Tesla V100
PROJECT 02

Spacecraft & Space Debris Detection
via YOLOv11 Fine-Tuning

Deep Learning & Computer Vision

An object detection system to identify and localize 10 real ESA spacecraft (including SOHO, XMM-Newton, and LISA Pathfinder) alongside space debris in 1024×1024 satellite imagery, built on the SPARK 2022 Challenge dataset (University of Luxembourg SnT). The project fine-tunes a YOLOv11-Medium model with a targeted augmentation pipeline simulating real-world orbital imaging conditions.

~86%
Classification Accuracy
10
ESA Spacecraft Classes
1024px
Image Resolution
YOLOv11-M
Base Architecture

Detection Pipeline

SPARK 2022 Dataset 1024×1024 · 11 classes
CSV Annotation Parsing bbox coordinates
YOLO Label Conversion normalised x·y·w·h
Offline Augmentation Albumentations · orbital conditions
YOLOv11-Medium pretrained · fine-tuned
SLURM HPC Training 35 epochs · Tesla V100 · Iris cluster
Inference Pipeline class + bbox · ~86% accuracy
  • Fine-tuned a YOLOv11-Medium model on a custom annotated dataset for multi-class detection of ESA spacecraft (SOHO, XMM-Newton, LISA Pathfinder, and others) and space debris.
  • Converted bounding-box annotations from CSV format to normalized YOLO labels, handling all preprocessing and annotation pipeline engineering end-to-end.
  • Designed a targeted data augmentation pipeline using Albumentations (Gaussian blur, motion blur, brightness/contrast shifts, horizontal/vertical flips, additive Gaussian noise) to simulate sensor noise, orbital lighting, and motion artifacts.
  • Configured training with an initial learning rate of 0.01, L2 weight decay regularization, and 35 epochs with early stopping to prevent overfitting on the multi-class detection task.
  • Deployed training on the University of Luxembourg HPC Iris cluster via SLURM job scheduling; built a full inference pipeline generating class labels and bounding-box coordinates on unseen test imagery.

Detected Spacecraft Classes

SOHO XMM-Newton LISA Pathfinder + 7 other ESA spacecraft Space Debris
Python YOLOv11 Ultralytics Albumentations SLURM HPC / GPU Cluster OpenCV
PROJECT 03

EEG Topomap Binary Classification
via Deep Learning

Deep Learning & Computer Vision

A binary image classification system that predicts whether a brain's neurological response to a visual design stimulus is positive or negative, using EEG topographic map (topomap) images as input. The project applies computer vision to a neuroscience domain — replacing manual expert interpretation of EEG patterns with an end-to-end deep learning pipeline. Built entirely from scratch in PyTorch on a small, class-imbalanced dataset of 1,178 images, achieving 90.96% accuracy on a fully held-out test set.

90.96%
Test Accuracy
97.18%
Best Val Accuracy
1,178
Total Images
60
Training Epochs
2
Classes
Custom CNN
Architecture

Model Architecture

Input 3×128×128
Conv Block 1 16→32
Conv Block 2 64→128
Conv Block 3 256
Conv Block 4 512
AdaptiveAvgPool
FC Head 8192→1024→1024→1
Sigmoid → {0, 1}
  • Designed a custom CNN from scratch — no pretrained backbone — with four progressively deepening convolutional blocks (16→32→64→128→256→512 filters), AdaptiveAvgPool2d for input-size robustness, and dual Dropout (p=0.5) on the fully connected head, achieving strong regularization on a small dataset.
  • Applied stratified 70/15/15 train/val/test splits to preserve class balance across a naturally imbalanced dataset (678 bad / 500 good), ensuring evaluation integrity throughout training and final reporting.
  • Built a full training pipeline with on-the-fly data augmentation (random horizontal flips, ±10° rotations, brightness/contrast color jitter), BCELoss, Adam optimizer (lr = 0.001), and ReduceLROnPlateau scheduling (factor 0.5, patience 5) — checkpointing on best validation loss across 60 epochs.
  • Implemented a batched inference API (load_and_predict) with automatic GPU/CPU device detection, non-blocking memory transfers for CUDA overlap, and configurable batch sizes — enabling the model to be deployed as a standalone, reusable prediction module.
  • The model achieves 90.96% accuracy on the held-out test set — a set the model never saw during training or validation — confirming genuine generalization rather than overfitting to the training distribution.

Dataset & Training Details

Domain: EEG Neuroscience / BCI
Input size: 128 × 128 px
Normalisation: ImageNet mean/std
Classes: bad (0) · good (1)
Class split: 678 bad / 500 good
Exposure duration: 6 s design stimulus
Python PyTorch Torchvision Scikit-learn NumPy Pillow Custom CNN BCELoss CUDA EEG / BCI
PROJECT 04

Signature Origin Classification
Human vs. Generative AI

Deep Learning & Sequence Classification

An end-to-end deep learning pipeline for 4-class sequence classification, distinguishing genuine human handwritten signatures from those synthesized by three generative architectures — GAN, SDT, and VAE. The project addresses the growing challenge of AI-generated biometric content attribution, processing raw 2D pen-stroke coordinate sequences from CSV files through a full training and inference pipeline built entirely in PyTorch, achieving 80.21% test accuracy and 85.70% accuracy on the full dataset.

80.21%
Test Accuracy
85.70%
Full Dataset Accuracy
4
Classes
60
Training Epochs
BiGRU
Architecture
150
Sequence Length

Classification Target

Label 0 — Human
Genuine handwritten signature
Label 1 — GAN
Generative Adversarial Network
Label 2 — SDT
Sigma-Delta Transform model
Label 3 — VAE
Variational Autoencoder

Model Architecture

Input (X, Y) sequence · 150 steps
BiGRU 2 layers · 128 units/dir
Concat forward + backward · 256-dim
Dropout p = 0.25
Linear 256 → 4 classes
  • Designed a Bidirectional GRU classifier (2-layer, 128 units/direction) that captures both forward and backward temporal dependencies in pen-stroke sequences — chosen over unidirectional RNNs to exploit the full context of each signature trajectory.
  • Engineered a preprocessing pipeline with per-signature independent Min-Max normalization on X and Y axes, zero-padding/truncation to a fixed 150-timestep window, and graceful handling of degenerate single-point sequences.
  • Conducted a systematic sequence-length hyperparameter search across four values (150, 200, 400, 570 timesteps), selecting 150 as the optimal trade-off — achieving the lowest validation loss (0.3348) with the most efficient training time.
  • Trained with Adam optimizer, ReduceLROnPlateau scheduling (factor 0.5, patience 7 — automatically halving the learning rate at epoch 52), gradient clipping (threshold 1.0), and early stopping (patience 15) across 60 epochs, with best checkpoint saved at epoch 57.
  • Implemented a batched inference API (load_and_predict) with automatic CUDA/CPU device detection, recursive CSV discovery, and configurable batch sizes — enabling the model to be used as a standalone, reusable prediction module.
  • Applied a stratified 70/15/15 train/validation/test split to ensure representative class distribution across all partitions, with the test set kept fully held-out throughout training and hyperparameter tuning.

Training Configuration & Results

Optimizer: Adam (lr = 0.001)
Weight decay: 1e-4 (L2)
Scheduler: ReduceLROnPlateau
Gradient clip: 1.0
Best val loss: 0.3291 (Epoch 57)
Best val accuracy: 82.24%
Data split: 70 / 15 / 15
Loss function: CrossEntropyLoss

Sequence Length Hyperparameter Search

MAX_SEQ_LENGTH Val Loss Val Accuracy Note
150 · selected 0.3348 80.21% Best loss · most efficient
200 0.4132 78.74% Worse loss & accuracy
400 0.3812 80.00% Comparable accuracy
570 0.3520 81.89% Marginal gain, high cost
Python PyTorch Bidirectional GRU Sequence Classification Scikit-learn NumPy Pandas CUDA Biometric Authentication Time-Series
PROJECT 05

Network Data Extraction using LLMs
for Historical Documents

LLM & NLP

An end-to-end NLP pipeline for extracting structured social network data — entities and their relationships — from long-form WWII-era historical memoirs using Large Language Models. The project bridges Digital History and Computer Science, reconstructing survival networks of people who provided shelter, food, false documents, and protection to those living underground during the Nazi persecution. Conducted over ~20 months at the Luxembourg Centre for Contemporary and Digital History (C²DH), University of Luxembourg, and formed the basis of a Master's thesis in Computer Science. Findings are currently being prepared for publication in a peer-reviewed journal.

0.841
Best F1-Score
0.906
Precision
0.785
Recall
+250%
F1 vs. Baseline
43
Methods Tested
176
Ground-Truth Relations

Pipeline Architecture

Text Cleaning
Chunking (LangChain)
Coreference Resolution (LLM)
Relationship Extraction (LLM)
JSON Output
  • Designed and implemented a modular NLP pipeline integrating text cleaning, overlapping chunking (LangChain RecursiveCharacterTextSplitter), LLM-based coreference resolution, and structured JSON extraction via OpenAI APIs (GPT-4o, GPT-4.1, GPT-5, o3).
  • Introduced a dedicated Coreference Resolution (CR) stage that resolves ambiguous pronouns ("he", "they") and collective family references ("the Wendlands") into explicit named individuals before extraction — independently increasing attribution accuracy by ~20% and reducing both false positives and false negatives.
  • Implemented a context window mechanism in the CR stage, providing the model with one preceding chunk as reference to resolve cross-boundary references — a CW of 1 improved recall by ~12–15% with no loss in precision; larger windows introduced noise and degraded results.
  • Benchmarked 43 experimental configurations across 8 systematic phases, varying chunking parameters (2000–10000 chars), context window sizes, extraction strategies (one-step vs. multi-step NER+RE), prompt designs, and model selection; outputs manually reviewed and labeled (TP/FP/FN) against a curated ground-truth dataset of 176 annotated relationships.
  • Evolved prompt engineering from simple to advanced through three iterative stages: simple baseline prompts → refined prompts with 9 explicit relationship type definitions and inclusion/exclusion rules → advanced prompts with borderline case clarifications, reducing false positives by over 80% while maintaining recall.
  • Both the CR and extraction stages use a two-prompt design per chunk: a system prompt carrying permanent rules, relationship definitions, and structural instructions; and a user prompt constructed dynamically per chunk, providing the main text to process and — in the CR stage — the preceding chunk(s) as reference context for resolving cross-boundary references.
  • Validated reproducibility via 10-run stability testing of the three top configurations; the o3-based pipeline achieved mean F1 = 0.819 ± 0.01, confirming consistent and reliable performance across repeated runs.
  • Identified OpenAI o3 as the optimal model for both CR and extraction, outperforming GPT-4.1 (+0.04 F1) and GPT-5 (+0.07 F1) — demonstrating that reasoning depth matters more than model scale for context-heavy historical texts.

9 Extracted Relationship Categories

Shelter & Protection Medical Care Food & Resources Introductions & Connections Employment False Documentation Information & Advice Emotional Support Other Assistance

Top Configuration Results (10-Run Average)

The framework supports independent model selection for each stage — one model for Coreference Resolution and a separate model for Relationship Extraction — enabling direct like-for-like comparison under identical preprocessing and prompt conditions.

CR Model → Extraction Model Precision Recall F1-Score
o3 (CR) → o3 (Extraction) · best overall 0.845 0.795 0.819
o3 (CR) → GPT-4.1 (Extraction) · highest recall 0.764 0.866 0.812
o3 (CR) → GPT-5 (Extraction) 0.772 0.754 0.763

All configurations: chunk 6000/600 · context window 1 · advanced prompts · one-step extraction

Python OpenAI API LangChain GPT-4o GPT-4.1 GPT-5 o3 Reasoning Model Coreference Resolution NER Information Extraction Prompt Engineering JSON Digital Humanities University of Luxembourg
PROJECT 06

Census Income Prediction
using Machine Learning

Data Science & Machine Learning

An end-to-end KDD (Knowledge Discovery in Databases) pipeline applied to the UCI Adult Census dataset (47,621 instances, 14 features) to predict whether an individual's annual income exceeds $50,000. The project covers the full data science lifecycle — cleaning, exploratory analysis, feature engineering, hyperparameter tuning, and comparative model evaluation — demonstrating that gradient-boosted trees outperform linear and ensemble baselines on structured socioeconomic data.

86.95%
Best Accuracy (XGBoost)
0.814
F1 Score (macro)
+2.2pp
Gain over LR Baseline
3
Models Compared
47,621
Instances
14
Features

KDD Pipeline

Data Cleaning
EDA
Preprocessing (Scaling + Encoding)
HP Tuning
Model Evaluation
  • Applied the full KDD process to the UCI Adult Census dataset — normalising inconsistent target labels, dropping rows with missing values (~2.5%), and performing stratified 70/30 train-test splitting to preserve class balance.
  • Conducted Exploratory Data Analysis (EDA) including descriptive statistics, a gender-vs-income stacked bar chart, a numerical-feature correlation heatmap, and an age distribution histogram with KDE — revealing key demographic patterns in income inequality.
  • Built a preprocessing pipeline applying MinMaxScaler to 6 numerical features and one-hot encoding (via pd.get_dummies, drop_first=True) to 8 categorical features, followed by LabelEncoder on the binary target.
  • Performed systematic hyperparameter sensitivity analysis for all three models: swept regularisation C for Logistic Regression, n_estimators and max_depth for Random Forest, and learning_rate and n_estimators for XGBoost — selecting optimal values from accuracy-vs-parameter plots.
  • Evaluated all models on 7 metrics (Accuracy, MSE, RMSE, R², Precision, Recall, F1) with confusion matrices; XGBoost achieved the best results (86.95% accuracy, F1 = 0.814), outperforming Random Forest by 0.97 pp and Logistic Regression by 2.16 pp.
  • Delivered the project as a fully documented Jupyter Notebook with narrative markdown commentary alongside each code section, making the methodology transparent and reproducible — alongside a clean, modular Python script structured for production use.

Model Performance Comparison — Test Set

Model Accuracy Precision Recall F1
Logistic Regression (C = 100) 84.79% 0.810 0.760 0.780
Random Forest (200 trees, depth 20) 85.98% 0.836 0.767 0.793
XGBoost (lr = 0.30, 100 trees) · best overall 86.95% 0.839 0.796 0.814

All models evaluated on 70/30 stratified split · metrics macro-averaged

Python Scikit-learn XGBoost Pandas NumPy Matplotlib Seaborn Jupyter Notebook KDD Pipeline UCI ML Repository
PROJECT 07

Employee Attrition Prediction
via Machine Learning & Explainable AI

Data Science & Machine Learning

A complete, end-to-end machine learning pipeline that predicts whether an employee is at risk of leaving a company — and more importantly, explains why. Built on the IBM HR Analytics dataset (1,470 employees, 35 features), the project covers every stage of a professional data science workflow: exploratory data analysis, preprocessing, feature engineering, multi-model training with cross-validation, and model interpretability using SHAP (SHapley Additive exPlanations). The final pipeline is deployed as a live interactive Streamlit web application, making predictions and explanations accessible to anyone without any code.

0.966
CV ROC-AUC
1,470
Employees
35
Raw Features
7
Engineered Features
3
Models Compared
84 / 16
Class Split (%)

ML Pipeline

IBM HR Dataset 1,470 employees · 35 features
EDA 84/16 imbalance · driver identification
Preprocessing encoding · StandardScaler · SMOTE
Feature Engineering 7 new features · top 30 selected
Model Training LR · RF · XGBoost · 5-fold CV
SHAP Interpretability global + per-employee explanations
Streamlit App live deployment · Streamlit Cloud
  • Conducted thorough Exploratory Data Analysis across all 35 features, identifying key attrition drivers including overtime patterns, compensation gaps relative to experience, satisfaction scores, and early-tenure risk — informing every subsequent modeling decision.
  • Built a rigorous preprocessing pipeline: Label Encoding for binary categoricals, One-Hot Encoding with drop_first=True to eliminate multicollinearity, StandardScaler fitted exclusively on training data, and SMOTE (Synthetic Minority Over-sampling Technique) applied to the training set only — preserving test set integrity and preventing data leakage throughout.
  • Engineered 7 domain-driven features from existing columns — including IncomePerYearExp, SatisfactionScore, CareerGrowthRate, and IsEarlyCareer — capturing business-meaningful signals not present in the raw data; final feature set selected via Random Forest importance ranking (top 30 of 37 total features retained).
  • Trained and compared Logistic Regression, Random Forest, and XGBoost using 5-fold stratified cross-validation; selected ROC-AUC as the primary metric (over accuracy) due to class imbalance — a naive classifier predicting "Stayed" for all employees would achieve 84% accuracy with zero business value.
  • Applied SHAP TreeExplainer for both global and individual-level model interpretability: beeswarm summary plots revealing feature impact direction across all employees, dependence plots exposing interaction effects between top features, and per-employee waterfall charts explaining each prediction step by step — bridging the gap between model output and business action.
  • Deployed the full pipeline as a live Streamlit web application on Streamlit Community Cloud, featuring real-time employee risk scoring, a dynamic SHAP waterfall explanation for every prediction, and a model performance dashboard — making the model accessible to non-technical stakeholders without any local setup.

Model Performance Comparison — Test Set

Model CV ROC-AUC Test ROC-AUC Test F1 Test Recall
Logistic Regression 0.9236 0.7206 0.3762 0.4043
Random Forest · best Test ROC-AUC 0.9658 0.7309 0.3250 0.2766
XGBoost 0.9648 0.7014 0.3291 0.2766

All models evaluated on 80/20 stratified split · ROC-AUC chosen as primary metric due to 84/16 class imbalance · 5-fold stratified cross-validation

Top Attrition Drivers — SHAP Analysis

🔴 Highest risk: OverTime = Yes
🔴 Risk factor: Low SatisfactionScore
🔴 Risk factor: Low IncomePerYearExp
🔴 Risk factor: StockOptionLevel = 0
🔴 Risk factor: MaritalStatus = Single
🟢 Protective: High YearsAtCompany
Python Scikit-learn XGBoost SHAP SMOTE Pandas NumPy Matplotlib Seaborn Streamlit Streamlit Cloud Jupyter Notebook IBM HR Analytics
PROJECT 08

Urban Air Quality Intelligence Platform
via Machine Learning & Live API Integration

Data Science & Machine Learning

An end-to-end data science system that predicts urban air quality using two parallel data sources: the UCI Air Quality dataset (9,357 hourly sensor readings, Italy 2004–2005) and live multi-city measurements fetched programmatically from the OpenAQ REST API v3 (Italy · France · Germany). The project covers the complete ML workflow — exploratory data analysis, time-series feature engineering with lag features, rolling window statistics, and cyclical encodings, dual-track modelling (classification + regression), and model interpretability via SHAP — deployed as a live interactive Streamlit dashboard with real-time AQI prediction.

86.6%
XGBoost Accuracy
0.74
F1 (macro)
0.885
R² (NO₂ Regression)
16.5
RMSE (µg/m³)
46
Engineered Features
9,357
Hourly Readings

Data Pipeline

UCI Dataset 9,357 rows
+
OpenAQ API v3 live IT · FR · DE
EDA + Cleaning
Feature Eng. 46 features
XGBoost cls + reg
SHAP Explainability
Streamlit Live App
  • Combined two heterogeneous data sources — a static UCI sensor dataset and a live REST API (OpenAQ v3) — into a unified analysis pipeline, demonstrating real-world data ingestion across Italy, France, and Germany with programmatic API calls at both acquisition time and dashboard runtime.
  • Engineered 46 features from raw hourly sensor readings: lag features (NO2_lag_1h, NO2_lag_3h, NO2_lag_24h), rolling window statistics (3h/6h/24h mean and std), and cyclical sin/cos encodings for hour-of-day, day-of-week, and month — preventing the model from treating midnight and 11pm as 23 steps apart.
  • Applied time-aware 70/15/15 chronological splitting to prevent data leakage — a critical distinction from random splitting on time-series data that would allow the model to train on the future of its own test set; also handled class imbalance via class_weight="balanced" and per-sample weights in XGBoost.
  • Ran two parallel modelling tracks: AQI category classification (4 classes: Good / Moderate / Poor / Very Poor) and NO₂ concentration regression — training Dummy baseline, Random Forest, and XGBoost for each; XGBoost achieved 86.6% accuracy (+20.7pp over baseline) and R² = 0.885 on the regression task.
  • Applied SHAP TreeExplainer for exact (non-approximate) global and local model interpretability — generated beeswarm, dependence, and per-prediction waterfall charts; top feature NO2GT_rolling_3h_mean (mean |SHAP| = 1.51) confirmed that sustained recent pollution history is the model's strongest predictive signal, physically interpretable and trust-building.
  • Deployed as a 4-page Streamlit web application on Streamlit Community Cloud, featuring real-time AQI prediction with per-class confidence breakdown, an interactive SHAP explainer with waterfall charts for individual predictions, and a live OpenAQ data fetch panel — fully accessible via public URL with no local setup.

Classification — Model Performance Comparison (Test Set)

Model Accuracy F1 (macro) F1 (weighted)
Dummy baseline · majority class 65.9% 0.20 0.52
Random Forest · 200 trees 80.8% 0.63 0.75
XGBoost · 300 rounds, early stopping · best overall 86.6% 0.74 0.84

Regression — NO₂ Concentration (µg/m³) (Test Set)

Model RMSE MAE
Dummy baseline · training mean 70.39 57.60 −1.094
Random Forest · 200 trees 17.18 10.02 0.875
XGBoost · early stopping round 214 · best overall 16.49 9.37 0.885

Chronological 70/15/15 time-aware split · class imbalance handled via balanced class weights · 3 models × 2 tracks

Top SHAP Features — Predicting "Poor" AQI

🔴 Rank 1 (|SHAP| 1.511): NO2GT_rolling_3h_mean
🔴 Rank 2 (|SHAP| 0.548): NO2GT_lag_1h
🔴 Rank 3 (|SHAP| 0.505): NOx(GT) sensor
🔴 Rank 4 (|SHAP| 0.484): NO2GT_change_1h
🔴 Rank 5 (|SHAP| 0.401): NO2GT_change_3h
🟢 Physical insight: Lag + rolling features dominate — pollution evolves gradually
Python Pandas NumPy Scikit-learn XGBoost SHAP Matplotlib Seaborn Streamlit Streamlit Cloud OpenAQ API v3 REST API Time-Series ML Feature Engineering UCI ML Repository Jupyter Notebook
PROJECT 09

Urban Development Analytics Pipeline
Chicago Building Permits 2018–2023

Data Analytics & Engineering

An end-to-end data analytics pipeline processing 833,978 municipal building permit records from the City of Chicago Open Data Portal, built as a freelance-grade analytics engagement for urban planning stakeholders. The project covers every stage of the data lifecycle — raw ingestion, SQL-based cleaning and aggregation, automated Python ETL, statistical analysis, dual BI dashboard delivery across Tableau Public and Power BI, Docker containerization, and a GitHub Actions CI/CD pipeline for automated validation — surfacing actionable insights on construction trends, fee revenue, and permitting efficiency across Chicago's 50 wards.

833,978
Raw Records
819,820
After Cleaning
−20.7%
Permit Drop 2020
+44%
Processing Time 2018→2023
$27.2M
Peak Fee Revenue (2019)
2
BI Dashboards (Tableau + Power BI)

Pipeline Architecture

Raw CSV 833,978 rows
load_to_db.py SQLite ingestion
sql_analysis.py 8 SQL queries
pipeline.py ETL + 6 visualizations
Tableau Public 4-view dashboard
+
Power BI 3-page dashboard
Docker containerized
GitHub Actions CI/CD validation
  • Designed and implemented a multi-stage ETL pipeline ingesting 833,978 raw permit records, programmatically parsing currency-formatted fee columns across 16 fields (stripping $ symbols, converting to float) before loading into a SQLite database via SQLAlchemy — producing a clean table of 819,820 records after null and quality filters.
  • Built a SQL aggregation layer with 8 analytical queries using GROUP BY, COUNT, AVG, SUM, CAST, and date parsing via SUBSTR — covering permit type distribution, ward-level year-over-year trends, community area fee revenue, monthly volume trends, and processing time analysis.
  • Developed an automated Python ETL pipeline (Pandas, NumPy) performing date parsing, feature engineering (days_to_issue, year, month, quarter, year_month), outlier removal at the 99th percentile, and export of 10 analysis-ready CSVs — reducing manual data preparation by ~70%.
  • Produced 6 Matplotlib/Seaborn visualizations: monthly trend line, permit type horizontal bar chart, dual-axis yearly volume vs. processing time, neighborhood-labeled ward fee revenue chart, processing time box plot by year, and a year-over-year diverging bar chart — all saved as high-resolution PNGs.
  • Delivered a 4-view interactive Tableau Public dashboard (monthly trend · permit types · geographic dot map of 238,496 permits · yearly KPI panel) and a complementary 3-page Power BI dashboard covering financial analysis, operational efficiency, and an executive summary — together providing full stakeholder coverage across geographic, financial, and operational dimensions.
  • Containerized the full pipeline using Docker and docker-compose with a minimal Linux image, volume mounts for data and output directories, and a single-command execution pattern — enabling fully reproducible pipeline runs across environments; implemented a GitHub Actions CI/CD workflow that automatically generates a 500-row synthetic test dataset and validates the full pipeline on every push to main, verifying all 10 processed CSVs and 6 visualizations are produced correctly.
  • Surfaced key findings: 20.7% permit volume decline in 2020 (COVID impact), 44% increase in average processing time from 17.9 days (2018) to 25.9 days (2023), Express Permits accounting for 45.7% of total fee revenue, and average reported construction cost rising from $95K (2018) to $194K (2022).

BI Dashboards

Tableau Public (live)
Monthly trend · Permit types · Geographic dot map · Yearly KPIs — cross-filter interactivity
Power BI Desktop (.pbix in repo)
Financial analysis · Operational efficiency · Executive summary — 3-page analytical dashboard

Key Findings & Pipeline Outputs

Permit volume 2020: −20.7% (COVID impact)
Avg processing time 2018→2023: 17.9 → 25.9 days (+44%)
Peak fee revenue: $27.2M in 2019
Avg construction cost 2022: $194,549 (vs $95,302 in 2018)
Express Permit revenue share: 45.7% of total fees
Processed CSVs exported: 10 analysis-ready files

Dataset & Pipeline Details

Source: Chicago Open Data Portal
Raw records: 833,978
Clean records: 819,820
Pipeline range: 2018–2023
Dashboard records: 238,496
License: Public domain
Python SQL SQLite SQLAlchemy Pandas NumPy Matplotlib Seaborn Tableau Power BI Docker docker-compose GitHub Actions CI/CD ETL Pipeline Data Cleaning Feature Engineering Chicago Open Data Jupyter Notebook
PROJECT 10

LexiAssist — Production RAG Chatbot
for AI/ML Research

AI Engineering & MLOps

An end-to-end production-grade Retrieval-Augmented Generation (RAG) system that ingests 145 AI/ML research papers from ArXiv, indexes them into a ChromaDB vector store, and answers natural-language questions with grounded, citation-backed responses and multi-turn conversational memory. The project covers the full AI engineering stack — from automated data ingestion and semantic search through REST API design, pipeline evaluation, containerization, and CI/CD automation — demonstrating how a real LLM application is built and deployed in a production environment.

0.96
Answer Relevancy
0.70
Faithfulness
145
Papers Ingested
394
Indexed Chunks
6
Automated API Tests
3-Stage
CI/CD Pipeline

Pipeline Architecture

ArXiv API 145 papers · 15 topics Chunking LangChain · 394 chunks Embeddings text-embedding-3-small ChromaDB vector store · persisted RAG Chain LangChain · top-5 retrieval FastAPI REST backend · 3 endpoints Streamlit UI chat interface
  • Built a complete RAG pipeline using LangChain: ArXiv paper ingestion across 15 AI/ML topics, RecursiveCharacterTextSplitter chunking (1,000-char chunks, 200-char overlap), OpenAI text-embedding-3-small vector generation, and ChromaDB persistence — producing 394 semantically indexed chunks from 145 papers.
  • Designed a LangChain RAG chain with engineered system prompt for hallucination mitigation via context injection, top-5 semantic retrieval, multi-turn conversational memory using LangChain message history, and structured source citation extraction with deduplication — all wired into a single composable pipeline.
  • Built a production FastAPI REST backend with Pydantic request/response validation, CORS middleware, startup chain preloading, and automatic Swagger documentation — separating the API layer cleanly from the RAG logic, with /chat, /health, and /ingest endpoints.
  • Evaluated RAG pipeline quality using a custom LLM-as-judge framework built from scratch — GPT-4o-mini independently scores faithfulness, answer relevancy, context precision, and context recall across 20 hand-curated question/answer pairs; achieved 0.96 answer relevancy and 0.70 faithfulness with documented analysis of context score limitations.
  • Containerized the full application with Docker and Docker Compose (multi-service: API + frontend), volume mounts for data and vectorstore persistence, and a pre-built image published to Docker Hub — enabling single-command deployment on any machine.
  • Automated the full software delivery lifecycle via a GitHub Actions CI/CD pipeline: flake8 linting and black formatting → pytest API tests using fixture data (no live external API calls in CI) → Docker image build and push to Docker Hub — triggered on every push to main.

RAG Evaluation Results — LLM-as-Judge (20 Samples)

Custom evaluation framework built from scratch — GPT-4o-mini scores each metric independently per sample, mirroring the approach used by production RAG evaluation libraries such as RAGAs.

Metric Score What It Measures
Answer Relevancy 0.96 Does the answer directly address the question asked?
Faithfulness 0.70 Are claims grounded in the retrieved documents?
Context Precision 0.56 Are the retrieved chunks relevant to the query?
Context Recall 0.54 Does the context contain what's needed to answer?

Context scores reflect abstract-only knowledge base — full PDF ingestion is the identified path to improvement.

System & Pipeline Details

LLM: GPT-4o-mini
Embedding model: text-embedding-3-small
Vector store: ChromaDB
Chunk size / overlap: 1,000 / 200 chars
Top-k retrieval: 5 chunks per query
Topics covered: 15 AI/ML domains
API tests: 6 pytest endpoints
CI/CD stages: lint → test → docker push
Container image: Docker Hub (public)
Python LangChain ChromaDB OpenAI API GPT-4o-mini RAG Semantic Search Prompt Engineering FastAPI Pydantic Streamlit Docker Docker Compose GitHub Actions CI/CD pytest LLM-as-Judge Vector Embeddings
PROJECT 11

NYC Taxi Analytics Pipeline
End-to-End Batch Data Engineering

Data Analytics & Engineering

An end-to-end production-style batch data engineering pipeline processing 9.55 million real NYC Yellow Taxi trip records (Q1 2024) through a full Medallion Architecture (Bronze → Silver → Gold), covering every stage of the modern data engineering workflow — PySpark-based ETL, data quality validation, Apache Airflow orchestration, dbt SQL transformation, AWS S3 cloud data lake storage, PostgreSQL warehousing, Docker containerization, GitHub Actions CI/CD, and a 5-page interactive Streamlit analytics dashboard — demonstrating how a real-world data engineering team would build, schedule, validate, and serve a large-scale data pipeline.

9.55M
Raw Trips Ingested
8.47M
Clean Trips (Silver)
$234.8M
Q1 2024 Revenue
17/18
Data Quality Checks
8
dbt SQL Models
17
Dashboard Charts

Pipeline Architecture

NYC TLC Parquet 3 files · Q1 2024
extract.py Bronze · 9.55M rows · AWS S3
transform_silver.py Clean + 10 features · AWS S3
transform_gold.py 4 aggregation tables · AWS S3
validate.py 18 quality checks · HTML report
load.py PostgreSQL via JDBC
dbt 4 staging views + 4 mart tables
Airflow DAG 7 tasks · daily 06:00 UTC
Streamlit 5-page dashboard · 17 charts
  • Implemented a PySpark ETL pipeline with explicit schema definition (faster than inference), reading 9.55M real NYC taxi trip records across 3 Parquet files; removed 1,075,337 invalid records (11.25%) — including negative fares, GPS-glitch distances, and timestamp-corrupted trips — and engineered 10 domain-driven features (trip_duration_minutes, speed_mph, time_of_day bucketing, is_weekend, fare_per_mile, tip_percentage, payment_type_desc) covering the full Silver transformation layer.
  • Integrated AWS S3 as a cloud data lake using boto3 with Hive-style partitioned Parquet storage — Bronze partitioned by VendorID, Silver by pickup_month — uploaded automatically after each pipeline stage; implemented USE_S3=false local fallback so the pipeline runs without AWS credentials for development.
  • Built 8 dbt SQL models (4 staging views + 4 mart tables) on top of PostgreSQL using advanced SQL: window functions (LAG(), NTILE(4), AVG() OVER rolling windows), PARTITION BY for monthly market share, CASE WHEN demand classification, and NULLIF for division-by-zero safety — surfacing findings including 28 zones (10.8%) driving 80% of total Q1 revenue (Pareto) and credit card tip rate of ~20% vs near 0% for cash.
  • Designed an Apache Airflow DAG with 7 tasks, daily cron scheduling (0 6 * * *), 1 retry per task with 5-minute delay, and execution timeouts — orchestrating the full Extract → Transform → Validate → Load workflow end-to-end; loaded Gold layer tables to PostgreSQL via JDBC with an indexed schema (5 indexes) for optimised analytical queries.
  • Implemented 18 automated data quality checks across Silver and Gold layers (17/18 passing) generating an HTML validation report — checks cover null detection, range validation, referential integrity, and statistical plausibility; ran 10 pytest unit tests and flake8 linting in a parallel GitHub Actions CI/CD workflow triggered on every push to main.
  • Delivered a 5-page, 17-chart Streamlit dashboard connecting live to PostgreSQL — featuring a Pareto revenue concentration curve (28 zones = 80% of revenue), a heatmap of trip demand by hour × day of week, rolling revenue averages, NTILE location tier analysis, and payment behaviour insights — containerized with Docker Compose alongside PostgreSQL, Airflow, and Spark in a single-command startup environment.

Key Analytical Findings

Busiest hour: 19:00 — 461,200 trips
Revenue concentration: 28 zones = 80% of $234.8M
Credit card share: 63.9% of all trips
Tip rate — credit card: ~20% vs ~0% for cash
Night avg fare: $20.39 (highest time period)
Invalid data removed: 1,075,337 rows (11.25%)

Pipeline Outputs & Infrastructure

Dataset: NYC TLC Yellow Taxi Q1 2024
Raw records: 9,554,778
Clean records: 8,471,484
Gold tables: 4 (509 + 258 + 96 + 18 rows)
dbt mart tables: 4 (analytics schema)
Cloud storage: AWS S3 · eu-west-1
Services (Docker): PostgreSQL · Airflow · Spark
CI/CD: test + lint · parallel jobs
License: NYC TLC Open Data
Python PySpark Apache Airflow PostgreSQL dbt AWS S3 boto3 Docker Docker Compose GitHub Actions CI/CD Streamlit Plotly SQL Apache Parquet JDBC pytest Medallion Architecture ETL Pipeline Data Quality
PROJECT 12

Olist E-Commerce Analytics Pipeline
Brazilian Market · 2016–2018

Data Analytics & Engineering

An end-to-end data analytics pipeline processing 96,457 delivered orders from the Brazilian Olist e-commerce platform across 9 relational tables, covering the full analytics lifecycle — raw CSV ingestion to AWS S3, PostgreSQL ETL with a Bronze/Silver architecture, 18-check automated data quality validation, SQL-based business analysis across 8 dimensions, 7 Python visualizations, and interactive BI dashboard delivery across three platforms (Tableau Public, Power BI, and Streamlit) — surfacing actionable insights on revenue trends, delivery performance, and customer satisfaction across 27 Brazilian states.

96,457
Orders Analyzed
R$8.7M
Total Revenue
10.9%
Late Delivery Rate
−40%
Review Score Drop (Late)
18
Data Quality Checks
3
BI Dashboards

Pipeline Architecture

9 Raw CSVs Olist dataset
AWS S3 Bronze data lake
PostgreSQL 9 raw tables · FK constraints
ETL + Cleaning 6 clean tables · Silver layer
Validation 18 automated checks
SQL Analysis 8 business queries
Tableau + Power BI + Streamlit 3 dashboards
GitHub Actions CI/CD · 12 tests
  • Designed and implemented a multi-stage ETL pipeline ingesting 9 Olist relational tables into PostgreSQL 15 with enforced foreign key constraints across all child tables; uploaded all raw source files to AWS S3 as a Bronze data lake layer using boto3, then applied a Bronze/Silver architecture — producing 6 clean tables from 9 raw sources with zero orphaned foreign keys confirmed by automated referential integrity checks.
  • Applied three distinct professional missing value strategies across 10 identified data quality issues: contextual filtering (2,479 order items from non-delivered orders removed), meaningful imputation (610 null product categories filled with 'unknown' to preserve real revenue), and intentional null retention (58,247 null review comments kept as meaningful absence of feedback) — each decision documented with an explicit business reason.
  • Built a SQL analytical layer with 8 business queries using multi-table JOINs across 6 related tables, GROUP BY aggregations across revenue, delivery, payment, seller, and retention dimensions, CASE WHEN for conditional aggregation, and COALESCE for null-safe category joins; engineered 2 domain-driven features (delivery_days, is_late) and ran 18 automated data quality checks across raw, clean, referential integrity, and business logic categories.
  • Produced 7 Matplotlib/Seaborn visualizations including a dual-axis monthly revenue trend, late delivery rate bar chart with national average reference line, side-by-side review score impact analysis, and payment method breakdown — all queried directly from PostgreSQL and saved as 150 DPI PNGs; delivered a 4-view interactive Tableau Public dashboard (live URL), a 4-view Power BI dashboard with DAX calculated columns and conditional formatting, and a 4-page Streamlit application.
  • Surfaced key findings: AL state late delivery rate of 24.1% vs 10.9% national average — a geographic concentration pointing to logistics infrastructure gaps in Brazil's northeast; late orders averaging 2.55 stars vs 4.21 for on-time deliveries with 46% of late orders receiving 1-star reviews; Health & Beauty as top revenue category at R$1.24M; and a 3% customer repeat purchase rate flagging a significant retention gap for business strategy.
  • Containerized the PostgreSQL environment using Docker and docker-compose for one-command setup; implemented a GitHub Actions CI/CD pipeline running 12 pytest unit tests covering cleaning logic, revenue calculations, timestamp extraction, and data integrity rules on every push to main — ensuring pipeline correctness is automatically validated with every code change.

BI Dashboards

Tableau Public (live)
Revenue trend · Top categories · Delivery by state · Review vs delivery — live public URL
Power BI Desktop (.pbix in repo)
DAX calculated columns · Conditional formatting · Top N filtering · Dual Y-axis
Streamlit App (4 pages)
Overview KPIs · Revenue & Products · Delivery Performance · Customer Insights

Key Findings & Pipeline Outputs

Top revenue category: Health & Beauty · R$1.24M
Worst late delivery state: AL · 24.1% (vs 10.9% avg)
On-time avg review score: 4.21 / 5
Late delivery avg review score: 2.55 / 5 (−40%)
Credit card transaction share: 74% of all payments
Customer repeat purchase rate: 3.0% — retention gap flagged

Dataset & Pipeline Details

Source: Olist / Kaggle (CC BY-NC-SA 4.0)
Raw tables: 9 relational CSVs
Total orders: 99,441
Analysis scope: 96,457 delivered orders
Dashboard flat table: 110,814 rows · 21 columns
Date range: Sep 2016 – Aug 2018
Python SQL PostgreSQL SQLAlchemy Pandas NumPy Matplotlib Seaborn Tableau Power BI Streamlit AWS S3 boto3 Docker docker-compose GitHub Actions CI/CD ETL Pipeline Data Quality Bronze/Silver Architecture Feature Engineering DAX
More Work

Explore all projects on GitHub

Open Source

All repositories — including experiments, utilities, and additional work — are available on my GitHub profile.