Projects

PROJECT 01

Dog Breed Classification
via Deep Learning & Transfer Learning

Deep Learning & Computer Vision

A fine-grained image classification system capable of identifying 120 dog breeds from the Stanford Dogs Dataset (20,580 images). The project explores the full deep learning lifecycle — from data preprocessing and augmentation design through distributed GPU training to final evaluation — demonstrating strong generalization with a top-1 test accuracy of 84.20%.

84.20%

Top-1 Accuracy

96.54%

Top-3 Accuracy

98.05%

Top-5 Accuracy

0.67%

Generalization Gap

120

Breed Classes

20,580

Total Images

Training Pipeline

Stanford Dogs Dataset 20,580 images · 120 breeds

→

Annotation Parsing XML bounding boxes

→

BBox Crop + Resize 224×224 px · +7% accuracy

→

Stratified Split 70 / 20 / 10

→

On-the-fly Augmentation Albumentations · 10 transforms

→

ResNet50 ImageNet pretrained · frozen backbone

→

AdamW + CosineAnnealingLR 50 epochs · AMP · grad clipping

→

Evaluation 84.20% Top-1 · 98.05% Top-5

Applied transfer learning with a pretrained ResNet50 (ImageNet) backbone, replacing the final classification head for 120-class output on fine-grained dog breed recognition.
Built a robust preprocessing pipeline using bounding-box-guided cropping and 224×224 resizing, yielding ~7% accuracy gain over uncropped baselines.
Implemented aggressive on-the-fly data augmentation via Albumentations (random crops, flips, rotations, colour jitter, Gaussian noise/blur, random erasing) to combat overfitting.
Trained over 50 epochs on a 2-GPU distributed setup (2× Tesla V100, University of Luxembourg HPC) using PyTorch DDP, AdamW optimiser with CosineAnnealingLR scheduling, and gradient clipping.
Achieved a generalization gap of only 0.67% between validation and test accuracy, demonstrating strong model robustness and minimal overfitting.
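
The top-1, top-3, and top-5 figures above all follow the same rule: a prediction counts as correct when the true breed is among the k highest-scoring classes. A minimal sketch of that metric in plain Python is shown here; the function name and the toy scores are illustrative assumptions, not the project's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 4 samples over 5 classes (made-up values for illustration).
toy_scores = [
    [0.10, 0.60, 0.20, 0.05, 0.05],  # true class 1 is ranked 1st
    [0.30, 0.10, 0.40, 0.15, 0.05],  # true class 0 is ranked 2nd
    [0.05, 0.10, 0.15, 0.30, 0.40],  # true class 2 is ranked 3rd
    [0.50, 0.20, 0.15, 0.10, 0.05],  # true class 4 is ranked 5th
]
toy_labels = [1, 0, 2, 4]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
top5 = top_k_accuracy(toy_scores, toy_labels, k=5)
```

Because top-k accuracy can only grow with k, a large jump from top-1 to top-3 usually points to confusion between a few visually similar classes rather than random errors.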

Python · PyTorch · Torchvision · Albumentations · CUDA · PyTorch DDP · ResNet50 · Scikit-learn · NumPy · Pandas · Matplotlib · Seaborn · HPC / Tesla V100

PROJECT 02

Spacecraft & Space Debris Detection
via YOLOv11 Fine-Tuning

Deep Learning & Computer Vision

An object detection system to identify and localize 10 real ESA spacecraft (including SOHO, XMM-Newton, and LISA Pathfinder) alongside space debris in 1024×1024 satellite imagery, built on the SPARK 2022 Challenge dataset (University of Luxembourg SnT). The project fine-tunes a YOLOv11-Medium model with a targeted augmentation pipeline simulating real-world orbital imaging conditions.

~86%

Classification Accuracy

10

ESA Spacecraft Classes

1024px

Image Resolution

YOLOv11-M

Base Architecture

Detection Pipeline

SPARK 2022 Dataset 1024×1024 · 11 classes

→

CSV Annotation Parsing bbox coordinates

→

YOLO Label Conversion normalised x·y·w·h

→

Offline Augmentation Albumentations · orbital conditions

→

YOLOv11-Medium pretrained · fine-tuned

→

SLURM HPC Training 35 epochs · Tesla V100 · Iris cluster

→

Inference Pipeline class + bbox · ~86% accuracy

Fine-tuned a YOLOv11-Medium model on a custom annotated dataset for multi-class detection of ESA spacecraft (SOHO, XMM-Newton, LISA Pathfinder, and others) and space debris.
Converted bounding-box annotations from CSV format to normalized YOLO labels, handling all preprocessing and annotation pipeline engineering end-to-end.
Designed a targeted data augmentation pipeline using Albumentations (Gaussian blur, motion blur, brightness/contrast shifts, horizontal/vertical flips, additive Gaussian noise) to simulate sensor noise, orbital lighting, and motion artifacts.
Configured training with an initial learning rate of 0.01, L2 weight decay regularization, and 35 epochs with early stopping to prevent overfitting on the multi-class detection task.
Deployed training on the University of Luxembourg HPC Iris cluster via SLURM job scheduling; built a full inference pipeline generating class labels and bounding-box coordinates on unseen test imagery.
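
The CSV-to-YOLO conversion step can be sketched as follows. The YOLO label format stores a class id followed by the box centre, width, and height, each divided by the image size. The input layout used here (pixel corners x_min, y_min, x_max, y_max) and the function name are assumptions for illustration; the column order in the actual SPARK CSV files may differ.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w=1024, img_h=1024):
    """Convert a pixel-corner bounding box into one normalised YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical box in a 1024x1024 image.
label_line = to_yolo_label(3, 256, 128, 768, 640)
```

Each image gets one text file with one such line per object, which is the layout Ultralytics training expects.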

Detected Spacecraft Classes

SOHO · XMM-Newton · LISA Pathfinder · + 7 other ESA spacecraft · Space Debris

Python · YOLOv11 · Ultralytics · Albumentations · SLURM · HPC / GPU Cluster · OpenCV

PROJECT 03

EEG Topomap Binary Classification
via Deep Learning

Deep Learning & Computer Vision

A binary image classification system that predicts whether a brain's neurological response to a visual design stimulus is positive or negative, using EEG topographic map (topomap) images as input. The project applies computer vision to a neuroscience domain — replacing manual expert interpretation of EEG patterns with an end-to-end deep learning pipeline. Built entirely from scratch in PyTorch on a small, class-imbalanced dataset of 1,178 images, achieving 90.96% accuracy on a fully held-out test set.

90.96%

Test Accuracy

97.18%

Best Val Accuracy

1,178

Total Images

60

Training Epochs

2

Classes

Custom CNN

Architecture

Model Architecture

Input 3×128×128

→

Conv Block 1 16→32

→

Conv Block 2 64→128

→

Conv Block 3 256

→

Conv Block 4 512

→

AdaptiveAvgPool

→

FC Head 8192→1024→1024→1

→

Sigmoid → {0, 1}

Designed a custom CNN from scratch — no pretrained backbone — with four progressively deepening convolutional blocks (16→32→64→128→256→512 filters), AdaptiveAvgPool2d for input-size robustness, and dual Dropout (p=0.5) on the fully connected head, achieving strong regularization on a small dataset.
Applied stratified 70/15/15 train/val/test splits so that every partition keeps the class ratio of the naturally imbalanced dataset (678 bad / 500 good), ensuring evaluation integrity throughout training and final reporting.
Built a full training pipeline with on-the-fly data augmentation (random horizontal flips, ±10° rotations, brightness/contrast color jitter), BCELoss, Adam optimizer (lr = 0.001), and ReduceLROnPlateau scheduling (factor 0.5, patience 5) — checkpointing on best validation loss across 60 epochs.
Implemented a batched inference API (load_and_predict) with automatic GPU/CPU device detection, non-blocking memory transfers for CUDA overlap, and configurable batch sizes — enabling the model to be deployed as a standalone, reusable prediction module.
The model achieves 90.96% accuracy on the held-out test set — a set the model never saw during training or validation — confirming genuine generalization rather than overfitting to the training distribution.
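
A stratified 70/15/15 split like the one described above can be sketched in plain Python by splitting each class separately and then merging the pieces. The function name, the seed, and the rounding rule are assumptions; the project itself lists Scikit-learn, whose splitter may round partition sizes slightly differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15), seed=42):
    """Return (train, val, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# Same class counts as the dataset: 678 "bad" (0) and 500 "good" (1).
labels = [0] * 678 + [1] * 500
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting per class guarantees that the minority class appears in validation and test in the same proportion as in training, which a plain random split does not.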

Dataset & Training Details

Domain: EEG Neuroscience / BCI

Input size: 128 × 128 px

Normalisation: ImageNet mean/std

Classes: bad (0) · good (1)

Class split: 678 bad / 500 good

Exposure duration: 6 s design stimulus

Python · PyTorch · Torchvision · Scikit-learn · NumPy · Pillow · Custom CNN · BCELoss · CUDA · EEG / BCI

PROJECT 04

Signature Origin Classification
Human vs. Generative AI

Deep Learning & Sequence Classification

An end-to-end deep learning pipeline for 4-class sequence classification, distinguishing genuine human handwritten signatures from those synthesized by three generative architectures — GAN, SDT, and VAE. The project addresses the growing challenge of AI-generated biometric content attribution, processing raw 2D pen-stroke coordinate sequences from CSV files through a full training and inference pipeline built entirely in PyTorch, achieving 80.21% test accuracy and 85.70% accuracy on the full dataset.

80.21%

Test Accuracy

85.70%

Full Dataset Accuracy

4

Classes

60

Training Epochs

BiGRU

Architecture

150

Sequence Length

Classification Target

Label 0 — Human
Genuine handwritten signature

Label 1 — GAN
Generative Adversarial Network

Label 2 — SDT
Sigma-Delta Transform model

Label 3 — VAE
Variational Autoencoder

Model Architecture

Input (X, Y) sequence · 150 steps

→

BiGRU 2 layers · 128 units/dir

→

Concat forward + backward · 256-dim

→

Dropout p = 0.25

→

Linear 256 → 4 classes

Designed a Bidirectional GRU classifier (2-layer, 128 units/direction) that captures both forward and backward temporal dependencies in pen-stroke sequences — chosen over unidirectional RNNs to exploit the full context of each signature trajectory.
Engineered a preprocessing pipeline with per-signature independent Min-Max normalization on X and Y axes, zero-padding/truncation to a fixed 150-timestep window, and graceful handling of degenerate single-point sequences.
Conducted a systematic sequence-length hyperparameter search across four values (150, 200, 400, 570 timesteps), selecting 150 as the optimal trade-off — achieving the lowest validation loss (0.3348) with the most efficient training time.
Trained with Adam optimizer, ReduceLROnPlateau scheduling (factor 0.5, patience 7 — automatically halving the learning rate at epoch 52), gradient clipping (threshold 1.0), and early stopping (patience 15) across 60 epochs, with best checkpoint saved at epoch 57.
Implemented a batched inference API (load_and_predict) with automatic CUDA/CPU device detection, recursive CSV discovery, and configurable batch sizes — enabling the model to be used as a standalone, reusable prediction module.
Applied a stratified 70/15/15 train/validation/test split to ensure representative class distribution across all partitions, with the test set kept fully held-out throughout training and hyperparameter tuning.
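
The preprocessing described above (per-signature Min-Max normalization, a fixed 150-step window, and degenerate sequences) can be illustrated with a short plain-Python sketch. The function name is hypothetical, and mapping a zero-range axis to 0.0 is an assumption about how the degenerate case is handled; the project may treat it differently.

```python
def preprocess_signature(points, max_len=150):
    """Min-max normalise one signature per axis, then pad or truncate to max_len."""
    if not points:
        return [(0.0, 0.0)] * max_len

    def scale(values):
        lo, hi = min(values), max(values)
        span = hi - lo
        # Degenerate axis (e.g. a single-point signature): avoid division by zero.
        if span == 0:
            return [0.0] * len(values)
        return [(v - lo) / span for v in values]

    xs = scale([p[0] for p in points])
    ys = scale([p[1] for p in points])

    normalised = list(zip(xs, ys))[:max_len]               # truncate long sequences
    padding = [(0.0, 0.0)] * (max_len - len(normalised))   # zero-pad short ones
    return normalised + padding


short = preprocess_signature([(10, 5), (20, 15), (30, 10)])
single = preprocess_signature([(7, 7)])
long_seq = preprocess_signature([(i, i * 2) for i in range(200)])
```

Normalizing each signature independently removes differences in writing size and position, so the classifier sees only the shape of the trajectory.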

Training Configuration & Results

Optimizer: Adam (lr = 0.001)

Weight decay: 1e-4 (L2)

Scheduler: ReduceLROnPlateau

Gradient clip: 1.0

Best val loss: 0.3291 (Epoch 57)

Best val accuracy: 82.24%

Data split: 70 / 15 / 15

Loss function: CrossEntropyLoss

Sequence Length Hyperparameter Search

MAX_SEQ_LENGTH	Val Loss	Val Accuracy	Note
150 · selected	0.3348	80.21%	Best loss · most efficient
200	0.4132	78.74%	Worse loss & accuracy
400	0.3812	80.00%	Comparable accuracy
570	0.3520	81.89%	Marginal gain, high cost

Python · PyTorch · Bidirectional GRU · Sequence Classification · Scikit-learn · NumPy · Pandas · CUDA · Biometric Authentication · Time-Series

PROJECT 05

Network Data Extraction using LLMs
for Historical Documents

LLM & NLP

An end-to-end NLP pipeline for extracting structured social network data — entities and their relationships — from long-form WWII-era historical memoirs using Large Language Models. The project bridges Digital History and Computer Science, reconstructing survival networks of people who provided shelter, food, false documents, and protection to those living underground during the Nazi persecution. Conducted over ~20 months at the Luxembourg Centre for Contemporary and Digital History (C²DH), University of Luxembourg, and formed the basis of a Master's thesis in Computer Science. Findings are currently being prepared for publication in a peer-reviewed journal.

0.841

Best F1-Score

0.906

Precision

0.785

Recall

+250%

F1 vs. Baseline

43

Methods Tested

176

Ground-Truth Relations

Pipeline Architecture

Text Cleaning

→

Chunking (LangChain)

→

Coreference Resolution (LLM)

→

Relationship Extraction (LLM)

→

JSON Output

Designed and implemented a modular NLP pipeline integrating text cleaning, overlapping chunking (LangChain RecursiveCharacterTextSplitter), LLM-based coreference resolution, and structured JSON extraction via OpenAI APIs (GPT-4o, GPT-4.1, GPT-5, o3).
Introduced a dedicated Coreference Resolution (CR) stage that resolves ambiguous pronouns ("he", "they") and collective family references ("the Wendlands") into explicit named individuals before extraction — independently increasing attribution accuracy by ~20% and reducing both false positives and false negatives.
Implemented a context window mechanism in the CR stage, providing the model with one preceding chunk as reference to resolve cross-boundary references — a CW of 1 improved recall by ~12–15% with no loss in precision; larger windows introduced noise and degraded results.
Benchmarked 43 experimental configurations across 8 systematic phases, varying chunking parameters (2000–10000 chars), context window sizes, extraction strategies (one-step vs. multi-step NER+RE), prompt designs, and model selection; outputs manually reviewed and labeled (TP/FP/FN) against a curated ground-truth dataset of 176 annotated relationships.
Iterated on prompt design through three stages: simple baseline prompts → refined prompts with 9 explicit relationship type definitions and inclusion/exclusion rules → advanced prompts with borderline case clarifications, reducing false positives by over 80% while maintaining recall.
Both the CR and extraction stages use a two-prompt design per chunk: a system prompt carrying permanent rules, relationship definitions, and structural instructions; and a user prompt constructed dynamically per chunk, providing the main text to process and — in the CR stage — the preceding chunk(s) as reference context for resolving cross-boundary references.
Validated reproducibility via 10-run stability testing of the three top configurations; the o3-based pipeline achieved mean F1 = 0.819 ± 0.01, confirming consistent and reliable performance across repeated runs.
Identified OpenAI o3 as the optimal model for both CR and extraction, outperforming GPT-4.1 (+0.04 F1) and GPT-5 (+0.07 F1) — demonstrating that reasoning depth matters more than model scale for context-heavy historical texts.
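
The interaction between overlapping chunks and the context window of 1 can be sketched as below. This is a simplified fixed-size character splitter, not LangChain's RecursiveCharacterTextSplitter (which also respects separators such as paragraph breaks), and the function names and dictionary keys are illustrative assumptions.

```python
def chunk_text(text, chunk_size=6000, overlap=600):
    """Split text into character chunks where each chunk repeats the last `overlap` chars of the previous one."""
    step = chunk_size - overlap  # assumes overlap < chunk_size
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


def with_context(chunks, context_window=1):
    """Pair each chunk with up to `context_window` preceding chunks as reference context."""
    pairs = []
    for i, chunk in enumerate(chunks):
        context = chunks[max(0, i - context_window):i]
        pairs.append({"context": context, "main_text": chunk})
    return pairs


# Synthetic 15,000-character text standing in for a memoir.
text = "".join(str(i % 10) for i in range(15000))
chunks = chunk_text(text)
pairs = with_context(chunks, context_window=1)
```

In the coreference stage, the `context` entry would be passed to the model as read-only reference material, while only `main_text` is rewritten. The first chunk has no predecessor, so its context is empty.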

9 Extracted Relationship Categories

Shelter & Protection · Medical Care · Food & Resources · Introductions & Connections · Employment · False Documentation · Information & Advice · Emotional Support · Other Assistance

Top Configuration Results (10-Run Average)

The framework supports independent model selection for each stage — one model for Coreference Resolution and a separate model for Relationship Extraction — enabling direct like-for-like comparison under identical preprocessing and prompt conditions.

CR Model → Extraction Model	Precision	Recall	F1-Score
o3 (CR) → o3 (Extraction) · best overall	0.845	0.795	0.819
o3 (CR) → GPT-4.1 (Extraction) · highest recall	0.764	0.866	0.812
o3 (CR) → GPT-5 (Extraction)	0.772	0.754	0.763

All configurations: chunk 6000/600 · context window 1 · advanced prompts · one-step extraction
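
For reference, the scores in the table follow the usual definitions based on the manual TP/FP/FN labels. The sketch below computes them and checks that the reported F1 values agree with the reported precision and recall to three decimals. The TP/FP/FN counts in the example are hypothetical, chosen only so that TP + FN equals the 176 ground-truth relations.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_score(precision, recall)
    return precision, recall, f1


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labelled run: 140 correct, 20 spurious, 36 missed (140 + 36 = 176).
p, r, f1 = precision_recall_f1(tp=140, fp=20, fn=36)

# Consistency check against the precision/recall pairs reported in the table.
table_f1 = [round(f1_score(pr, rc), 3) for pr, rc in [(0.845, 0.795), (0.764, 0.866), (0.772, 0.754)]]
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why the high-recall GPT-4.1 configuration still ranks below the more balanced o3 configuration.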

Python · OpenAI API · LangChain · GPT-4o · GPT-4.1 · GPT-5 · o3 Reasoning Model · Coreference Resolution · NER · Information Extraction · Prompt Engineering · JSON · Digital Humanities · University of Luxembourg

PROJECT 06

Census Income Prediction
using Machine Learning

Data Science & Machine Learning

An end-to-end KDD (Knowledge Discovery in Databases) pipeline applied to the UCI Adult Census dataset (47,621 instances, 14 features) to predict whether an individual's annual income exceeds $50,000. The project covers the full data science lifecycle (cleaning, exploratory analysis, feature engineering, hyperparameter tuning, and comparative model evaluation) and shows that gradient-boosted trees (XGBoost) outperform both the Logistic Regression and Random Forest baselines on structured socioeconomic data.

86.95%

Best Accuracy (XGBoost)

0.814

F1 Score (macro)

+2.2pp

Gain over LR Baseline

3

Models Compared

47,621

Instances

14

Features

KDD Pipeline

Data Cleaning

→

EDA

→

Preprocessing (Scaling + Encoding)

→

HP Tuning

→

Model Evaluation

Applied the full KDD process to the UCI Adult Census dataset — normalising inconsistent target labels, dropping rows with missing values (~2.5%), and performing stratified 70/30 train-test splitting to preserve class balance.
Conducted Exploratory Data Analysis (EDA) including descriptive statistics, a gender-vs-income stacked bar chart, a numerical-feature correlation heatmap, and an age distribution histogram with KDE — revealing key demographic patterns in income inequality.
Built a preprocessing pipeline applying MinMaxScaler to 6 numerical features and one-hot encoding (via pd.get_dummies, drop_first=True) to 8 categorical features, followed by LabelEncoder on the binary target.
Performed systematic hyperparameter sensitivity analysis for all three models: swept regularisation C for Logistic Regression, n_estimators and max_depth for Random Forest, and learning_rate and n_estimators for XGBoost — selecting optimal values from accuracy-vs-parameter plots.
Evaluated all models on 7 metrics (Accuracy, MSE, RMSE, R², Precision, Recall, F1) with confusion matrices; XGBoost achieved the best results (86.95% accuracy, F1 = 0.814), outperforming Random Forest by 0.97 pp and Logistic Regression by 2.16 pp.
Delivered the project as a fully documented Jupyter Notebook with narrative markdown commentary alongside each code section, making the methodology transparent and reproducible — alongside a clean, modular Python script structured for production use.
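
The two preprocessing transforms named above can be sketched without any libraries. Min-max scaling maps each numerical feature to the range [0, 1], and one-hot encoding with the first category dropped turns a categorical column with n values into n - 1 indicator columns. The helper names and sample values are assumptions; the project uses MinMaxScaler and pd.get_dummies for the same purpose.

```python
def min_max_scale(values):
    """Scale values to [0, 1] using the minimum and maximum of the given list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_drop_first(values):
    """One-hot encode, dropping the first category in sorted order (like drop_first=True)."""
    kept = sorted(set(values))[1:]
    return [{category: int(v == category) for category in kept} for v in values]


# Illustrative sample values, not rows from the dataset.
ages = [17, 40, 90]
workclass = ["Private", "Gov", "Self", "Gov"]

scaled_ages = min_max_scale(ages)
encoded = one_hot_drop_first(workclass)
```

Dropping the first category loses no information: a row with zeros in every remaining indicator column belongs to the dropped category, and linear models avoid perfectly collinear inputs.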

Model Performance Comparison — Test Set

Model	Accuracy	Precision	Recall	F1
Logistic Regression (C = 100)	84.79%	0.810	0.760	0.780
Random Forest (200 trees, depth 20)	85.98%	0.836	0.767	0.793
XGBoost (lr = 0.30, 100 trees) · best overall	86.95%	0.839	0.796	0.814

All models evaluated on 70/30 stratified split · metrics macro-averaged

Python · Scikit-learn · XGBoost · Pandas · NumPy · Matplotlib · Seaborn · Jupyter Notebook · KDD Pipeline · UCI ML Repository

PROJECT 07

Employee Attrition Prediction
via Machine Learning & Explainable AI

Data Science & Machine Learning

A complete, end-to-end machine learning pipeline that predicts whether an employee is at risk of leaving a company and, more importantly, explains why. Built on the IBM HR Analytics dataset (1,470 employees, 35 features), the project covers every stage of a professional data science workflow: exploratory data analysis, preprocessing, feature engineering, multi-model training with cross-validation, and model interpretability using SHAP (SHapley Additive exPlanations). The final pipeline is deployed as a live interactive Streamlit web application, making predictions and explanations accessible to anyone without writing any code.

0.966

CV ROC-AUC

1,470

Employees

35

Raw Features

7

Engineered Features

3

Models Compared

84 / 16

Class Split (%)

ML Pipeline

IBM HR Dataset 1,470 employees · 35 features

→

EDA 84/16 imbalance · driver identification

→

Preprocessing encoding · StandardScaler · SMOTE

→

Feature Engineering 7 new features · top 30 selected

→

Model Training LR · RF · XGBoost · 5-fold CV

→

SHAP Interpretability global + per-employee explanations

→

Streamlit App live deployment · Streamlit Cloud

Conducted thorough Exploratory Data Analysis across all 35 features, identifying key attrition drivers including overtime patterns, compensation gaps relative to experience, satisfaction scores, and early-tenure risk — informing every subsequent modeling decision.
Built a rigorous preprocessing pipeline: Label Encoding for binary categoricals, One-Hot Encoding with drop_first=True to eliminate multicollinearity, StandardScaler fitted exclusively on training data, and SMOTE (Synthetic Minority Over-sampling Technique) applied to the training set only — preserving test set integrity and preventing data leakage throughout.
Engineered 7 domain-driven features from existing columns — including IncomePerYearExp, SatisfactionScore, CareerGrowthRate, and IsEarlyCareer — capturing business-meaningful signals not present in the raw data; final feature set selected via Random Forest importance ranking (top 30 of 37 total features retained).
Trained and compared Logistic Regression, Random Forest, and XGBoost using 5-fold stratified cross-validation; selected ROC-AUC as the primary metric (over accuracy) due to class imbalance — a naive classifier predicting "Stayed" for all employees would achieve 84% accuracy with zero business value.
Applied SHAP TreeExplainer for both global and individual-level model interpretability: beeswarm summary plots revealing feature impact direction across all employees, dependence plots exposing interaction effects between top features, and per-employee waterfall charts explaining each prediction step by step — bridging the gap between model output and business action.
Deployed the full pipeline as a live Streamlit web application on Streamlit Community Cloud, featuring real-time employee risk scoring, a dynamic SHAP waterfall explanation for every prediction, and a model performance dashboard — making the model accessible to non-technical stakeholders without any local setup.
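
The argument for preferring ROC-AUC over accuracy can be made concrete with a few lines of arithmetic. The sketch below scores a classifier that predicts "Stayed" for everyone. The attrition count is an approximation derived from the 84/16 split stated above, not the exact dataset figure.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model identified."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in positives) / len(positives) if positives else 0.0


n_total = 1470
n_left = round(n_total * 0.16)  # approximate count, assumed from the 84/16 split
y_true = [1] * n_left + [0] * (n_total - n_left)  # 1 = left, 0 = stayed
y_pred = [0] * n_total                            # naive model: everyone stays

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
```

The naive model reaches about 84% accuracy while finding none of the employees who actually left, which is why recall and ROC-AUC appear in the comparison table instead of accuracy alone.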

Model Performance Comparison — Test Set

Model	CV ROC-AUC	Test ROC-AUC	Test F1	Test Recall
Logistic Regression	0.9236	0.7206	0.3762	0.4043
Random Forest · best Test ROC-AUC	0.9658	0.7309	0.3250	0.2766
XGBoost	0.9648	0.7014	0.3291	0.2766

All models evaluated on 80/20 stratified split · ROC-AUC chosen as primary metric due to 84/16 class imbalance · 5-fold stratified cross-validation

Top Attrition Drivers — SHAP Analysis

🔴 Highest risk: OverTime = Yes

🔴 Risk factor: Low SatisfactionScore

🔴 Risk factor: Low IncomePerYearExp

🔴 Risk factor: StockOptionLevel = 0

🔴 Risk factor: MaritalStatus = Single

🟢 Protective: High YearsAtCompany

Python · Scikit-learn · XGBoost · SHAP · SMOTE · Pandas · NumPy · Matplotlib · Seaborn · Streamlit · Streamlit Cloud · Jupyter Notebook · IBM HR Analytics

PROJECT 08

Urban Air Quality Intelligence Platform
via Machine Learning & Live API Integration

Data Science & Machine Learning

An end-to-end data science system that predicts urban air quality using two parallel data sources: the UCI Air Quality dataset (9,357 hourly sensor readings, Italy 2004–2005) and live multi-city measurements fetched programmatically from the OpenAQ REST API v3 (Italy · France · Germany). The project covers the complete ML workflow: exploratory data analysis, time-series feature engineering (lag features, rolling-window statistics, and cyclical encodings), dual-track modelling (classification + regression), and model interpretability via SHAP. It is deployed as a live interactive Streamlit dashboard with real-time AQI prediction.

86.6%

XGBoost Accuracy

0.74

F1 (macro)

0.885

R² (NO₂ Regression)

16.5

RMSE (µg/m³)

46

Engineered Features

9,357

Hourly Readings

Data Pipeline

UCI Dataset 9,357 rows

+

OpenAQ API v3 live IT · FR · DE

→

EDA + Cleaning

→

Feature Eng. 46 features

→

XGBoost cls + reg

→

SHAP Explainability

→

Streamlit Live App

Combined two heterogeneous data sources — a static UCI sensor dataset and a live REST API (OpenAQ v3) — into a unified analysis pipeline, demonstrating real-world data ingestion across Italy, France, and Germany with programmatic API calls at both acquisition time and dashboard runtime.
Engineered 46 features from raw hourly sensor readings: lag features (NO2_lag_1h, NO2_lag_3h, NO2_lag_24h), rolling window statistics (3h/6h/24h mean and std), and cyclical sin/cos encodings for hour-of-day, day-of-week, and month — preventing the model from treating midnight and 11pm as 23 steps apart.
Applied a time-aware 70/15/15 chronological split to prevent data leakage: a random split of time-series data would let the model train on observations recorded after its own test samples. Also handled class imbalance via class_weight="balanced" and per-sample weights in XGBoost.
Ran two parallel modelling tracks: AQI category classification (4 classes: Good / Moderate / Poor / Very Poor) and NO₂ concentration regression — training Dummy baseline, Random Forest, and XGBoost for each; XGBoost achieved 86.6% accuracy (+20.7pp over baseline) and R² = 0.885 on the regression task.
Applied SHAP TreeExplainer for exact (non-approximate) global and local model interpretability — generated beeswarm, dependence, and per-prediction waterfall charts; top feature NO2GT_rolling_3h_mean (mean |SHAP| = 1.51) confirmed that sustained recent pollution history is the model's strongest predictive signal, physically interpretable and trust-building.
Deployed as a 4-page Streamlit web application on Streamlit Community Cloud, featuring real-time AQI prediction with per-class confidence breakdown, an interactive SHAP explainer with waterfall charts for individual predictions, and a live OpenAQ data fetch panel — fully accessible via public URL with no local setup.
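
Two of the ideas above, cyclical encoding and chronological splitting, are easy to show in isolation. The sketch below encodes an hour of day as a point on the unit circle, so that 23:00 and 00:00 end up adjacent, and splits rows by time order rather than at random. The function names and the `timestamp` key are illustrative assumptions.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (hour, weekday, month) to (sin, cos) on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encoded_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))


def chronological_split(rows, fractions=(0.70, 0.15)):
    """Sort rows by timestamp and cut them into train / validation / test in time order."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    n_train = int(len(ordered) * fractions[0])
    n_val = int(len(ordered) * fractions[1])
    return ordered[:n_train], ordered[n_train:n_train + n_val], ordered[n_train + n_val:]


# 9,357 synthetic hourly rows; integers stand in for real timestamps.
rows = [{"timestamp": t} for t in reversed(range(9357))]
train, val, test = chronological_split(rows)

late_to_midnight = encoded_distance(23, 0, period=24)
midnight_to_one = encoded_distance(0, 1, period=24)
midnight_to_noon = encoded_distance(0, 12, period=24)
```

With the encoding, 23:00 is exactly as close to midnight as 01:00 is, and noon is the farthest point. With the split, every validation row is later than every training row, and every test row is later still.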

Classification — Model Performance Comparison (Test Set)

Model	Accuracy	F1 (macro)	F1 (weighted)
Dummy baseline · majority class	65.9%	0.20	0.52
Random Forest · 200 trees	80.8%	0.63	0.75
XGBoost · 300 rounds, early stopping · best overall	86.6%	0.74	0.84

Regression — NO₂ Concentration (µg/m³) (Test Set)

Model	RMSE	MAE	R²
Dummy baseline · training mean	70.39	57.60	−1.094
Random Forest · 200 trees	17.18	10.02	0.875
XGBoost · early stopping round 214 · best overall	16.49	9.37	0.885

Chronological 70/15/15 time-aware split · class imbalance handled via balanced class weights · 3 models × 2 tracks

Top SHAP Features — Predicting "Poor" AQI

🔴 Rank 1 (|SHAP| 1.511): NO2GT_rolling_3h_mean

🔴 Rank 2 (|SHAP| 0.548): NO2GT_lag_1h

🔴 Rank 3 (|SHAP| 0.505): NOx(GT) sensor

🔴 Rank 4 (|SHAP| 0.484): NO2GT_change_1h

🔴 Rank 5 (|SHAP| 0.401): NO2GT_change_3h

🟢 Physical insight: Lag + rolling features dominate — pollution evolves gradually

Python · Pandas · NumPy · Scikit-learn · XGBoost · SHAP · Matplotlib · Seaborn · Streamlit · Streamlit Cloud · OpenAQ API v3 · REST API · Time-Series ML · Feature Engineering · UCI ML Repository · Jupyter Notebook

PROJECT 09

Urban Development Analytics Pipeline
Chicago Building Permits 2018–2023

Data Analytics & Engineering

An end-to-end data analytics pipeline processing 833,978 municipal building permit records from the City of Chicago Open Data Portal, built as a freelance-grade analytics engagement for urban planning stakeholders. The project covers every stage of the data lifecycle — raw ingestion, SQL-based cleaning and aggregation, automated Python ETL, statistical analysis, dual BI dashboard delivery across Tableau Public and Power BI, Docker containerization, and a GitHub Actions CI/CD pipeline for automated validation — surfacing actionable insights on construction trends, fee revenue, and permitting efficiency across Chicago's 50 wards.

833,978

Raw Records

819,820

After Cleaning

−20.7%

Permit Drop 2020

+44%

Processing Time 2018→2023

$27.2M

Peak Fee Revenue (2019)

2

BI Dashboards (Tableau + Power BI)

Pipeline Architecture

Raw CSV 833,978 rows

→

load_to_db.py SQLite ingestion

→

sql_analysis.py 8 SQL queries

→

pipeline.py ETL + 6 visualizations

→

Tableau Public 4-view dashboard

+

Power BI 3-page dashboard

→

Docker containerized

→

GitHub Actions CI/CD validation

Designed and implemented a multi-stage ETL pipeline ingesting 833,978 raw permit records, programmatically parsing currency-formatted fee columns across 16 fields (stripping $ symbols, converting to float) before loading into a SQLite database via SQLAlchemy — producing a clean table of 819,820 records after null and quality filters.
Built a SQL aggregation layer with 8 analytical queries using GROUP BY, COUNT, AVG, SUM, CAST, and date parsing via SUBSTR — covering permit type distribution, ward-level year-over-year trends, community area fee revenue, monthly volume trends, and processing time analysis.
Developed an automated Python ETL pipeline (Pandas, NumPy) performing date parsing, feature engineering (days_to_issue, year, month, quarter, year_month), outlier removal at the 99th percentile, and export of 10 analysis-ready CSVs — reducing manual data preparation by ~70%.
Produced 6 Matplotlib/Seaborn visualizations: monthly trend line, permit type horizontal bar chart, dual-axis yearly volume vs. processing time, neighborhood-labeled ward fee revenue chart, processing time box plot by year, and a year-over-year diverging bar chart — all saved as high-resolution PNGs.
Delivered a 4-view interactive Tableau Public dashboard (monthly trend · permit types · geographic dot map of 238,496 permits · yearly KPI panel) and a complementary 3-page Power BI dashboard covering financial analysis, operational efficiency, and an executive summary — together providing full stakeholder coverage across geographic, financial, and operational dimensions.
Containerized the full pipeline using Docker and docker-compose with a minimal Linux image, volume mounts for data and output directories, and a single-command execution pattern — enabling fully reproducible pipeline runs across environments; implemented a GitHub Actions CI/CD workflow that automatically generates a 500-row synthetic test dataset and validates the full pipeline on every push to main, verifying all 10 processed CSVs and 6 visualizations are produced correctly.
Surfaced key findings: 20.7% permit volume decline in 2020 (COVID impact), 44% increase in average processing time from 17.9 days (2018) to 25.9 days (2023), Express Permits accounting for 45.7% of total fee revenue, and average reported construction cost rising from $95K (2018) to $195K (2022).
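
Three of the cleaning steps described above (currency parsing, the days_to_issue feature, and 99th-percentile outlier removal) can be sketched in plain Python. The helper names are hypothetical, the comma handling is an extra assumption beyond the `$` stripping mentioned above, and the nearest-rank percentile used here may differ slightly from the interpolated percentile Pandas computes.

```python
from datetime import date


def parse_fee(raw):
    """Convert a currency string such as '$1,250.00' to a float; empty strings become None."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned) if cleaned else None


def days_to_issue(application_date, issue_date):
    """Number of days between permit application and permit issue."""
    return (issue_date - application_date).days


def drop_above_percentile(values, pct=99):
    """Keep values at or below the pct-th percentile (nearest-rank method, integer pct)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100) in integer arithmetic
    cutoff = ordered[rank - 1]
    return [v for v in values if v <= cutoff]


fee = parse_fee("$1,250.00")
waiting_days = days_to_issue(date(2023, 1, 1), date(2023, 1, 27))
filtered = drop_above_percentile(list(range(1, 101)), pct=99)
```

Trimming at the 99th percentile keeps a handful of extreme processing times from dominating averages and box plots while leaving the bulk of the distribution untouched.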

BI Dashboards

Tableau Public (live)
Monthly trend · Permit types · Geographic dot map · Yearly KPIs — cross-filter interactivity

Power BI Desktop (.pbix in repo)
Financial analysis · Operational efficiency · Executive summary — 3-page analytical dashboard

Key Findings & Pipeline Outputs

Permit volume 2020: −20.7% (COVID impact)

Avg processing time 2018→2023: 17.9 → 25.9 days (+44%)

Peak fee revenue: $27.2M in 2019

Avg construction cost 2022: $194,549 (vs $95,302 in 2018)

Express Permit revenue share: 45.7% of total fees

Processed CSVs exported: 10 analysis-ready files

Dataset & Pipeline Details

Source: Chicago Open Data Portal

Raw records: 833,978

Clean records: 819,820

Pipeline range: 2018–2023

Dashboard records: 238,496

License: Public domain

Python · SQL · SQLite · SQLAlchemy · Pandas · NumPy · Matplotlib · Seaborn · Tableau · Power BI · Docker · docker-compose · GitHub Actions · CI/CD · ETL Pipeline · Data Cleaning · Feature Engineering · Chicago Open Data · Jupyter Notebook

PROJECT 10

LexiAssist — Production RAG Chatbot
for AI/ML Research

AI Engineering & MLOps

An end-to-end production-grade Retrieval-Augmented Generation (RAG) system that ingests 145 AI/ML research papers from ArXiv, indexes them into a ChromaDB vector store, and answers natural-language questions with grounded, citation-backed responses and multi-turn conversational memory. The project covers the full AI engineering stack — from automated data ingestion and semantic search through REST API design, pipeline evaluation, containerization, and CI/CD automation — demonstrating how a real LLM application is built and deployed in a production environment.

0.96

Answer Relevancy

0.70

Faithfulness

145

Papers Ingested

394

Indexed Chunks

6

Automated API Tests

3-Stage

CI/CD Pipeline

Pipeline Architecture

ArXiv API 145 papers · 15 topics

→

Chunking LangChain · 394 chunks

→

Embeddings text-embedding-3-small

→

ChromaDB vector store · persisted

→

RAG Chain LangChain · top-5 retrieval

→

FastAPI REST backend · 3 endpoints

→

Streamlit UI chat interface

Built a complete RAG pipeline using LangChain: ArXiv paper ingestion across 15 AI/ML topics, RecursiveCharacterTextSplitter chunking (1,000-char chunks, 200-char overlap), OpenAI text-embedding-3-small vector generation, and ChromaDB persistence — producing 394 semantically indexed chunks from 145 papers.
Designed a LangChain RAG chain with engineered system prompt for hallucination mitigation via context injection, top-5 semantic retrieval, multi-turn conversational memory using LangChain message history, and structured source citation extraction with deduplication — all wired into a single composable pipeline.
Built a production FastAPI REST backend with Pydantic request/response validation, CORS middleware, startup chain preloading, and automatic Swagger documentation — separating the API layer cleanly from the RAG logic, with /chat, /health, and /ingest endpoints.
Evaluated RAG pipeline quality using a custom LLM-as-judge framework built from scratch — GPT-4o-mini independently scores faithfulness, answer relevancy, context precision, and context recall across 20 hand-curated question/answer pairs; achieved 0.96 answer relevancy and 0.70 faithfulness with documented analysis of context score limitations.
Containerized the full application with Docker and Docker Compose (multi-service: API + frontend), volume mounts for data and vectorstore persistence, and a pre-built image published to Docker Hub — enabling single-command deployment on any machine.
Automated the full software delivery lifecycle via a GitHub Actions CI/CD pipeline: flake8 linting and black formatting → pytest API tests using fixture data (no live external API calls in CI) → Docker image build and push to Docker Hub — triggered on every push to main.
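
The chunking settings described above (1,000-character chunks, 200-character overlap) can be illustrated with a minimal sliding-window sketch. This is a simplification and not the project's code: LangChain's RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which the fixed-window function below does not do. The name chunk_text is hypothetical.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Simplified stand-in for a separator-aware splitter; sizes mirror the
    1,000 / 200 settings quoted in the project description.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With these defaults a 2,500-character abstract yields three chunks, and the last 200 characters of each chunk reappear at the start of the next, so a sentence cut at a boundary still appears whole in one of the two neighbours.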

RAG Evaluation Results — LLM-as-Judge (20 Samples)

Custom evaluation framework built from scratch — GPT-4o-mini scores each metric independently per sample, mirroring the approach used by production RAG evaluation libraries such as RAGAs.

Metric	Score	What It Measures
Answer Relevancy	0.96	Does the answer directly address the question asked?
Faithfulness	0.70	Are claims grounded in the retrieved documents?
Context Precision	0.56	Are the retrieved chunks relevant to the query?
Context Recall	0.54	Does the context contain what's needed to answer?

Context scores reflect abstract-only knowledge base — full PDF ingestion is the identified path to improvement.
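
A minimal sketch of how per-sample judge scores could be aggregated into the table above. It assumes each metric is scored on a 0 to 1 scale and that the reported figure is the plain mean over samples; neither detail is confirmed by the source. The judge is passed in as a function, and stub_judge is a hypothetical placeholder for the GPT-4o-mini call.

```python
from statistics import mean
from typing import Callable, Dict, List

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def evaluate(samples: List[dict], judge: Callable[[str, dict], float]) -> Dict[str, float]:
    """Score every sample on every metric independently, then average per metric."""
    if not samples:
        raise ValueError("at least one sample is required")

    per_metric: Dict[str, List[float]] = {metric: [] for metric in METRICS}
    for sample in samples:
        for metric in METRICS:
            score = judge(metric, sample)  # one independent judge call per metric
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{metric} score out of range: {score}")
            per_metric[metric].append(score)

    return {metric: round(mean(scores), 2) for metric, scores in per_metric.items()}


def stub_judge(metric: str, sample: dict) -> float:
    """Hypothetical stand-in for the LLM call; returns fixed scores for illustration."""
    return 1.0 if metric == "answer_relevancy" else 0.5
```

Passing the judge in as a parameter keeps the aggregation testable without any live API call, in the same spirit as the fixture-based CI tests described earlier.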

System & Pipeline Details

LLM: GPT-4o-mini

Embedding model: text-embedding-3-small

Vector store: ChromaDB

Chunk size / overlap: 1,000 / 200 chars

Top-k retrieval: 5 chunks per query

Topics covered: 15 AI/ML domains

API tests: 6 pytest tests

CI/CD stages: lint → test → docker push

Container image: Docker Hub (public)
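
The top-k retrieval setting listed above (5 chunks per query) can be sketched in plain Python. Cosine similarity is an assumption here: the source does not state which distance metric the ChromaDB collection is configured with, and in the real system the vector store performs this search. The names cosine and top_k are hypothetical.

```python
from __future__ import annotations

import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

A brute-force scan like this is adequate for a few hundred chunks; a vector store becomes worthwhile as the index grows.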

Python LangChain ChromaDB OpenAI API GPT-4o-mini RAG Semantic Search Prompt Engineering FastAPI Pydantic Streamlit Docker Docker Compose GitHub Actions CI/CD pytest LLM-as-Judge Vector Embeddings

View on GitHub

PROJECT 11

NYC Taxi Analytics Pipeline
End-to-End Batch Data Engineering

Data Analytics & Engineering

An end-to-end production-style batch data engineering pipeline processing 9.55 million real NYC Yellow Taxi trip records (Q1 2024) through a full Medallion Architecture (Bronze → Silver → Gold), covering every stage of the modern data engineering workflow — PySpark-based ETL, data quality validation, Apache Airflow orchestration, dbt SQL transformation, AWS S3 cloud data lake storage, PostgreSQL warehousing, Docker containerization, GitHub Actions CI/CD, and a 5-page interactive Streamlit analytics dashboard — demonstrating how a real-world data engineering team would build, schedule, validate, and serve a large-scale data pipeline.

9.55M

Raw Trips Ingested

8.47M

Clean Trips (Silver)

$234.8M

Q1 2024 Revenue

17/18

Data Quality Checks

8

dbt SQL Models

17

Dashboard Charts

Pipeline Architecture

NYC TLC Parquet 3 files · Q1 2024

→

extract.py Bronze · 9.55M rows · AWS S3

→

transform_silver.py Clean + 10 features · AWS S3

→

transform_gold.py 4 aggregation tables · AWS S3

→

validate.py 18 quality checks · HTML report

→

load.py PostgreSQL via JDBC

→

dbt 4 staging views + 4 mart tables

→

Airflow DAG 7 tasks · daily 06:00 UTC

→

Streamlit 5-page dashboard · 17 charts

Implemented a PySpark ETL pipeline with explicit schema definition (faster than inference), reading 9.55M real NYC taxi trip records across 3 Parquet files; removed 1,075,337 invalid records (11.25%), including negative fares, GPS-glitch distances, and timestamp-corrupted trips; and engineered 10 domain-driven features for the Silver transformation layer, including trip_duration_minutes, speed_mph, time_of_day bucketing, is_weekend, fare_per_mile, tip_percentage, and payment_type_desc.
Integrated AWS S3 as a cloud data lake using boto3 with Hive-style partitioned Parquet storage — Bronze partitioned by VendorID, Silver by pickup_month — uploaded automatically after each pipeline stage; implemented USE_S3=false local fallback so the pipeline runs without AWS credentials for development.
Built 8 dbt SQL models (4 staging views + 4 mart tables) on top of PostgreSQL using advanced SQL: window functions (LAG(), NTILE(4), AVG() OVER rolling windows), PARTITION BY for monthly market share, CASE WHEN demand classification, and NULLIF for division-by-zero safety — surfacing findings including 28 zones (10.8%) driving 80% of total Q1 revenue (Pareto) and credit card tip rate of ~20% vs near 0% for cash.
Designed an Apache Airflow DAG with 7 tasks, daily cron scheduling (0 6 * * *), 1 retry per task with 5-minute delay, and execution timeouts — orchestrating the full Extract → Transform → Validate → Load workflow end-to-end; loaded Gold layer tables to PostgreSQL via JDBC with an indexed schema (5 indexes) for optimised analytical queries.
Implemented 18 automated data quality checks across Silver and Gold layers (17/18 passing) generating an HTML validation report — checks cover null detection, range validation, referential integrity, and statistical plausibility; ran 10 pytest unit tests and flake8 linting in a parallel GitHub Actions CI/CD workflow triggered on every push to main.
Delivered a 5-page, 17-chart Streamlit dashboard connecting live to PostgreSQL — featuring a Pareto revenue concentration curve (28 zones = 80% of revenue), a heatmap of trip demand by hour × day of week, rolling revenue averages, NTILE location tier analysis, and payment behaviour insights — containerized with Docker Compose alongside PostgreSQL, Airflow, and Spark in a single-command startup environment.
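
A minimal single-record sketch of the Silver-layer cleaning and feature logic described above, written in plain Python rather than PySpark so it runs without a cluster. The input column names follow the public TLC Yellow Taxi schema; the validity rules and the time_of_day bucket boundaries are assumptions chosen for illustration, not the project's actual thresholds.

```python
from datetime import datetime
from typing import Optional


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Return None instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else None


def time_of_day(hour: int) -> str:
    """Bucket an hour of day; the boundaries are hypothetical."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def is_valid(trip: dict) -> bool:
    """Assumed validity rules: no negative fare, positive distance, sane timestamps."""
    return (
        trip["fare_amount"] >= 0
        and trip["trip_distance"] > 0
        and trip["tpep_dropoff_datetime"] > trip["tpep_pickup_datetime"]
    )


def engineer_features(trip: dict) -> dict:
    """Add a subset of the derived Silver columns to one valid trip record."""
    pickup = trip["tpep_pickup_datetime"]
    dropoff = trip["tpep_dropoff_datetime"]
    duration_min = (dropoff - pickup).total_seconds() / 60.0
    return {
        **trip,
        "trip_duration_minutes": duration_min,
        "speed_mph": safe_div(trip["trip_distance"] * 60.0, duration_min),
        "fare_per_mile": safe_div(trip["fare_amount"], trip["trip_distance"]),
        "tip_percentage": safe_div(100.0 * trip["tip_amount"], trip["fare_amount"]),
        "is_weekend": pickup.weekday() >= 5,  # Saturday = 5, Sunday = 6
        "time_of_day": time_of_day(pickup.hour),
    }
```

In PySpark the same logic would normally be expressed as column expressions so it runs in parallel across partitions; the per-record form is used here only to make each rule easy to read and test.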

Key Analytical Findings

Busiest hour: 19:00 — 461,200 trips

Revenue concentration: 28 zones = 80% of $234.8M

Credit card share: 63.9% of all trips

Tip rate — credit card: ~20% vs ~0% for cash

Night avg fare: $20.39 (highest time period)

Invalid data removed: 1,075,337 rows (11.25%)
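
The revenue-concentration finding above rests on a Pareto-style calculation: rank zones by revenue and count how many are needed to reach 80% of the total. A minimal sketch, assuming revenue is available as a zone-to-amount mapping (the function name and sample values are hypothetical):

```python
from __future__ import annotations


def zones_for_share(revenue_by_zone: dict[str, float], share: float = 0.80) -> list[str]:
    """Return the smallest set of top-revenue zones whose combined share reaches `share`."""
    total = sum(revenue_by_zone.values())
    if total <= 0:
        return []

    ranked = sorted(revenue_by_zone.items(), key=lambda item: item[1], reverse=True)
    selected: list[str] = []
    running = 0.0
    for zone, revenue in ranked:
        selected.append(zone)
        running += revenue
        if running / total >= share:
            break  # cumulative share has reached the target
    return selected


# Toy data, not real pipeline output.
sample = {"A": 60.0, "B": 25.0, "C": 10.0, "D": 5.0}
top_zones = zones_for_share(sample)
```

On the toy data, zones A and B together hold 85% of revenue, so two of four zones are returned. Applied to the Gold zone table, the same calculation produces a figure of the "28 zones = 80%" kind.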

Pipeline Outputs & Infrastructure

Dataset: NYC TLC Yellow Taxi Q1 2024

Raw records: 9,554,778

Clean records: 8,471,484

Gold tables: 4 (509 + 258 + 96 + 18 rows)

dbt mart tables: 4 (analytics schema)

Cloud storage: AWS S3 · eu-west-1

Services (Docker): PostgreSQL · Airflow · Spark

CI/CD: test + lint · parallel jobs

License: NYC TLC Open Data
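
The SQL patterns named for the dbt models (LAG(), NTILE(4), a rolling AVG() OVER window, and NULLIF for division-by-zero safety) can be tried on a toy table. The sketch below uses an in-memory SQLite database instead of PostgreSQL so it runs with the Python standard library; it assumes a bundled SQLite of version 3.25 or later (window-function support). The table, columns, and values are hypothetical and are not the project's models.

```python
import sqlite3

QUERY = """
SELECT
    day,
    revenue - LAG(revenue) OVER (ORDER BY day)              AS change_vs_prev_day,
    AVG(revenue) OVER (
        ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    )                                                       AS rolling_avg_3d,
    NTILE(4) OVER (ORDER BY revenue DESC)                   AS revenue_quartile,
    revenue / NULLIF(trips, 0)                              AS revenue_per_trip
FROM daily_revenue
ORDER BY day
"""


def run_demo() -> list:
    """Build a four-row toy table and run the window-function query against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL, trips INTEGER)")
    conn.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [
            ("2024-01-01", 100.0, 10),
            ("2024-01-02", 150.0, 0),  # zero trips: NULLIF turns the division into NULL
            ("2024-01-03", 120.0, 12),
            ("2024-01-04", 180.0, 15),
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

The first row has no previous day, so LAG yields NULL there. The zero-trip day returns NULL for revenue_per_trip, which is the behaviour NULLIF is used for.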

Python PySpark Apache Airflow PostgreSQL dbt AWS S3 boto3 Docker Docker Compose GitHub Actions CI/CD Streamlit Plotly SQL Apache Parquet JDBC pytest Medallion Architecture ETL Pipeline Data Quality

View on GitHub

PROJECT 12

Olist E-Commerce Analytics Pipeline
Brazilian Market · 2016–2018

Data Analytics & Engineering

An end-to-end data analytics pipeline processing 96,457 delivered orders from the Brazilian Olist e-commerce platform across 9 relational tables, covering the full analytics lifecycle — raw CSV ingestion to AWS S3, PostgreSQL ETL with a Bronze/Silver architecture, 18-check automated data quality validation, SQL-based business analysis across 8 dimensions, 7 Python visualizations, and interactive BI dashboard delivery across three platforms (Tableau Public, Power BI, and Streamlit) — surfacing actionable insights on revenue trends, delivery performance, and customer satisfaction across 27 Brazilian states.

96,457

Orders Analyzed

R$8.7M

Total Revenue

10.9%

Late Delivery Rate

−40%

Review Score Drop (Late)

18

Data Quality Checks

3

BI Dashboards

Pipeline Architecture

9 Raw CSVs Olist dataset

→

AWS S3 Bronze data lake

→

PostgreSQL 9 raw tables · FK constraints

→

ETL + Cleaning 6 clean tables · Silver layer

→

Validation 18 automated checks

→

SQL Analysis 8 business queries

→

Tableau + Power BI + Streamlit 3 dashboards

→

GitHub Actions CI/CD · 12 tests

Designed and implemented a multi-stage ETL pipeline ingesting 9 Olist relational tables into PostgreSQL 15 with enforced foreign key constraints across all child tables; uploaded all raw source files to AWS S3 as a Bronze data lake layer using boto3, then applied a Bronze/Silver architecture — producing 6 clean tables from 9 raw sources with zero orphaned foreign keys confirmed by automated referential integrity checks.
Applied three distinct missing-value strategies across 10 identified data quality issues: contextual filtering (2,479 order items from non-delivered orders removed), meaningful imputation (610 null product categories filled with 'unknown' to preserve real revenue), and intentional null retention (58,247 null review comments kept as a meaningful absence of feedback), with each decision documented alongside an explicit business reason.
Built a SQL analytical layer with 8 business queries using multi-table JOINs across 6 related tables, GROUP BY aggregations across revenue, delivery, payment, seller, and retention dimensions, CASE WHEN for conditional aggregation, and COALESCE for null-safe category joins; engineered 2 domain-driven features (delivery_days, is_late) and ran 18 automated data quality checks across raw, clean, referential integrity, and business logic categories.
Produced 7 Matplotlib/Seaborn visualizations including a dual-axis monthly revenue trend, late delivery rate bar chart with national average reference line, side-by-side review score impact analysis, and payment method breakdown — all queried directly from PostgreSQL and saved as 150 DPI PNGs; delivered a 4-view interactive Tableau Public dashboard (live URL), a 4-view Power BI dashboard with DAX calculated columns and conditional formatting, and a 4-page Streamlit application.
Surfaced key findings: a late delivery rate of 24.1% in Alagoas (AL) vs the 10.9% national average, a geographic concentration pointing to logistics infrastructure gaps in Brazil's northeast; late orders averaging 2.55 stars vs 4.21 for on-time deliveries, with 46% of late orders receiving 1-star reviews; Health & Beauty as the top revenue category at R$1.24M; and a 3% customer repeat purchase rate, flagging a significant retention gap for business strategy.
Containerized the PostgreSQL environment using Docker and docker-compose for one-command setup; implemented a GitHub Actions CI/CD pipeline running 12 pytest unit tests covering cleaning logic, revenue calculations, timestamp extraction, and data integrity rules on every push to main — ensuring pipeline correctness is automatically validated with every code change.
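
A minimal sketch of the two engineered features mentioned above, delivery_days and is_late, plus the late-rate metric built on them. The input column names follow the public Olist orders table. The definitions are assumptions, since the source does not spell them out: delivery_days is taken as whole days from purchase to customer delivery, and an order counts as late when it is delivered after the estimated delivery date.

```python
from __future__ import annotations

from datetime import datetime


def add_delivery_features(order: dict) -> dict:
    """Return the order with delivery_days and is_late added (assumed definitions)."""
    purchased = order["order_purchase_timestamp"]
    delivered = order["order_delivered_customer_date"]
    estimated = order["order_estimated_delivery_date"]
    return {
        **order,
        "delivery_days": (delivered - purchased).days,
        "is_late": delivered > estimated,
    }


def late_rate_pct(orders: list[dict]) -> float:
    """Percentage of delivered orders flagged as late (0.0 for an empty list)."""
    if not orders:
        return 0.0
    late = sum(add_delivery_features(order)["is_late"] for order in orders)
    return 100.0 * late / len(orders)
```

Grouping the same flag by customer state would give a per-state late rate of the kind reported in the findings.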

BI Dashboards

Tableau Public (live)
Revenue trend · Top categories · Delivery by state · Review vs delivery — live public URL

Power BI Desktop (.pbix in repo)
DAX calculated columns · Conditional formatting · Top N filtering · Dual Y-axis

Streamlit App (4 pages)
Overview KPIs · Revenue & Products · Delivery Performance · Customer Insights

Key Findings & Pipeline Outputs

Top revenue category: Health & Beauty · R$1.24M

Worst late delivery state: AL · 24.1% (vs 10.9% avg)

On-time avg review score: 4.21 / 5

Late delivery avg review score: 2.55 / 5 (−40%)

Credit card transaction share: 74% of all payments

Customer repeat purchase rate: 3.0% — retention gap flagged
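
The "−40%" review score figure is a rounded relative change between the two averages listed above. A short worked check, using only the numbers already quoted (the helper name is hypothetical):

```python
def relative_change_pct(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


# On-time average 4.21 stars vs late average 2.55 stars.
review_drop = relative_change_pct(4.21, 2.55)

# Alagoas late rate (24.1%) as a multiple of the national average (10.9%).
al_vs_national = 24.1 / 10.9
```

Here review_drop comes out at roughly −39.4%, which rounds to the −40% shown, and the AL late delivery rate is about 2.2 times the national average.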

Dataset & Pipeline Details

Source: Olist / Kaggle (CC BY-NC-SA 4.0)

Raw tables: 9 relational CSVs

Total orders: 99,441

Analysis scope: 96,457 delivered orders

Dashboard flat table: 110,814 rows · 21 columns

Date range: Sep 2016 – Aug 2018

Python SQL PostgreSQL SQLAlchemy Pandas NumPy Matplotlib Seaborn Tableau Power BI Streamlit AWS S3 boto3 Docker docker-compose GitHub Actions CI/CD ETL Pipeline Data Quality Bronze/Silver Architecture Feature Engineering DAX

View on GitHub Live Dashboard ↗

More Work

Explore all projects on GitHub

Open Source

All repositories — including experiments, utilities, and additional work — are available on my GitHub profile.

View GitHub Profile ↗

Dog Breed Classification
via Deep Learning & Transfer Learning

Spacecraft & Space Debris Detection
via YOLOv11 Fine-Tuning

EEG Topomap Binary Classification
via Deep Learning

Signature Origin Classification
Human vs. Generative AI

Network Data Extraction using LLMs
for Historical Documents

Census Income Prediction
using Machine Learning

Employee Attrition Prediction
via Machine Learning & Explainable AI

Urban Air Quality Intelligence Platform
via Machine Learning & Live API Integration

Urban Development Analytics Pipeline
Chicago Building Permits 2018–2023

LexiAssist — Production RAG Chatbot
for AI/ML Research

NYC Taxi Analytics Pipeline
End-to-End Batch Data Engineering

Olist E-Commerce Analytics Pipeline
Brazilian Market · 2016–2018