An end-to-end, production-grade Retrieval-Augmented Generation (RAG) system
that ingests 145 AI/ML research papers (abstracts only) from ArXiv, indexes them in a ChromaDB
vector store, and answers natural-language questions with grounded, citation-backed responses and
multi-turn conversational memory. The project covers the full AI engineering stack, from
automated data ingestion and semantic search through REST API design, pipeline evaluation,
containerization, and CI/CD automation, and demonstrates how a real LLM application is built
and deployed in a production environment.
Pipeline Architecture
ArXiv API (145 papers, 15 topics) → Chunking (LangChain, 394 chunks) → Embeddings (text-embedding-3-small) → ChromaDB (persisted vector store) → RAG Chain (LangChain, top-5 retrieval) → FastAPI (REST backend, 3 endpoints) → Streamlit UI (chat interface)
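The retrieval stage of the pipeline can be illustrated with a minimal sketch of top-k similarity search over embedding vectors. This is not the project's code: in the project, ChromaDB performs the search. The function names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is an assumption, since the distance metric configured in ChromaDB is not stated.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,  # the project retrieves 5 chunks per query
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A vector store replaces the linear scan above with an approximate nearest-neighbour index, but the contract is the same: a query vector goes in, and the indices of the closest chunks come out.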
- Built a complete RAG pipeline using LangChain: ArXiv paper ingestion across 15 AI/ML topics, RecursiveCharacterTextSplitter chunking (1,000-char chunks, 200-char overlap), OpenAI text-embedding-3-small vector generation, and ChromaDB persistence — producing 394 semantically indexed chunks from 145 papers.
- Designed a LangChain RAG chain with an engineered system prompt that mitigates hallucination through context injection, top-5 semantic retrieval, multi-turn conversational memory using LangChain message history, and structured source citation extraction with deduplication, all wired into a single composable pipeline.
- Built a production FastAPI REST backend with Pydantic request/response validation, CORS middleware, startup chain preloading, and automatic Swagger documentation. The API layer is cleanly separated from the RAG logic and exposes /chat, /health, and /ingest endpoints.
- Evaluated RAG pipeline quality using a custom LLM-as-judge framework built from scratch — GPT-4o-mini independently scores faithfulness, answer relevancy, context precision, and context recall across 20 hand-curated question/answer pairs; achieved 0.96 answer relevancy and 0.70 faithfulness with documented analysis of context score limitations.
- Containerized the full application with Docker and Docker Compose (multi-service: API + frontend), volume mounts for data and vectorstore persistence, and a pre-built image published to Docker Hub — enabling single-command deployment on any machine.
- Automated the full software delivery lifecycle via a GitHub Actions CI/CD pipeline triggered on every push to main: flake8 linting and black formatting → pytest API tests using fixture data (no live external API calls in CI) → Docker image build and push to Docker Hub.
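The chunking parameters mentioned above (1,000-character chunks with a 200-character overlap) can be illustrated with a simplified sliding-window sketch. This is not the project's implementation: LangChain's RecursiveCharacterTextSplitter prefers to split on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` below cuts at fixed character offsets purely to show how size and overlap interact.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the default settings, each new chunk begins 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.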
RAG Evaluation Results — LLM-as-Judge (20 Samples)
Custom evaluation framework built from scratch — GPT-4o-mini scores each metric independently per sample, mirroring the approach used by production RAG evaluation libraries such as RAGAs.
| Metric | Score | What It Measures |
| --- | --- | --- |
| Answer Relevancy | 0.96 | Does the answer directly address the question asked? |
| Faithfulness | 0.70 | Are claims grounded in the retrieved documents? |
| Context Precision | 0.56 | Are the retrieved chunks relevant to the query? |
| Context Recall | 0.54 | Does the context contain what's needed to answer? |
Context scores reflect the abstract-only knowledge base; full PDF ingestion is the identified path to improvement.
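One way to turn per-sample judge scores into a table like the one above is sketched here. It assumes each reported figure is a simple mean of per-sample scores in the range 0 to 1; the source does not state the aggregation method. The metric keys and the `aggregate_scores` helper are hypothetical names, not taken from the project.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical metric keys, mirroring the four metrics in the results table.
METRICS = ("answer_relevancy", "faithfulness", "context_precision", "context_recall")


def aggregate_scores(per_sample: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over all samples, rounded to two decimal places."""
    if not per_sample:
        raise ValueError("at least one scored sample is required")
    for sample in per_sample:
        for metric in METRICS:
            if not 0.0 <= sample[metric] <= 1.0:
                raise ValueError(f"{metric} score out of range: {sample[metric]}")
    return {m: round(mean(s[m] for s in per_sample), 2) for m in METRICS}
```

In a setup like this, the judge model would be called once per metric per sample and each returned score stored in that sample's dictionary before aggregation.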
System & Pipeline Details
LLM: GPT-4o-mini
Embedding model: text-embedding-3-small
Vector store: ChromaDB
Chunk size / overlap: 1,000 / 200 chars
Top-k retrieval: 5 chunks per query
Topics covered: 15 AI/ML domains
API tests: 6 pytest endpoint tests
CI/CD stages: lint → test → docker push
Container image: Docker Hub (public)
Python
LangChain
ChromaDB
OpenAI API
GPT-4o-mini
RAG
Semantic Search
Prompt Engineering
FastAPI
Pydantic
Streamlit
Docker
Docker Compose
GitHub Actions
CI/CD
pytest
LLM-as-Judge
Vector Embeddings