Applied AI R&D — LLM Pipelines & NLP
Luxembourg Centre for Contemporary and Digital History, University of Luxembourg
- Designed and implemented a modular, end-to-end NLP pipeline for entity and relationship extraction from long-form historical documents, integrating text cleaning, overlapping chunking (LangChain), LLM-based coreference resolution, and structured JSON output via the OpenAI API (GPT-4o, GPT-4.1, GPT-5, o3).
- Introduced a dedicated coreference resolution stage to disambiguate pronouns and collective entity references before extraction, which on its own increased attribution accuracy by ~20%; benchmarked 43 experimental configurations varying chunking, context windows, prompting strategies, and model selection against a curated ground-truth dataset of 176 annotated relationships.
- Achieved a 250% F1-score improvement over the naïve baseline through iterative prompt engineering and pipeline refinement; validated reproducibility via 10-run stability testing and identified OpenAI o3 as the best-performing model tested for reasoning-intensive extraction tasks.
- Research formed the basis of an interdisciplinary Master's thesis at the intersection of NLP and Digital History; findings are being prepared for publication in a peer-reviewed journal in collaboration with supervisors at C²DH.