City budget documents are public, but in practice they are hard to use. They are long PDFs, full of department codes and financial terms, and most people do not have time to manually scan hundreds of pages just to answer one question.
I wanted to make this easier, so I built a public app that lets people ask plain-language questions about Chicago's FY2026 budget ordinances and get answers tied to exact source pages.
Code: github.com/devthakker/Chicago-budget
The Problem
The core problem was not "can we call an LLM?" It was:
- How do we retrieve the right evidence from large, messy PDFs?
- How do we show users exactly where the answer came from?
- How do we make this reliable enough for public use?
The two source documents are large enough that naive search quality drops quickly, especially with tables and codes:
- chicago_Annual_Appropriation_Ordinance_2026.pdf
- chicago_Grant_Details_Ordinance_2026.pdf
Architecture
The app follows a straightforward RAG pipeline:
- PDF text extraction (pdftotext -layout)
- Chunking with page metadata and section awareness
- Hybrid retrieval (BM25 + optional embeddings)
- Reranking (heuristic, with optional cross-encoder path)
- Answer generation with citations
- Source UX: open exact PDF page in tab or embedded viewer
Stack:
- Backend: FastAPI
- Retrieval/indexing: custom Python engine
- Frontend: server-rendered HTML/CSS templates
- Deployment: Docker
- Model providers: OpenAI, AWS Bedrock, or local Ollama
Retrieval Design Choices That Mattered
1) Hybrid retrieval over vectors-only
Budget queries are often code-heavy (GA00, 925S, ARPA, etc.), where lexical match is strong. I kept BM25 as the dominant signal and blended vectors as a secondary signal.
Default weighting:
RAG_BM25_WEIGHT=0.85
RAG_VECTOR_WEIGHT=0.15
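A minimal sketch of what that blending can look like, assuming each retriever returns per-chunk scores (the function and variable names here are illustrative, not the app's actual API):

```python
# Hypothetical hybrid-scoring sketch: BM25 is the dominant signal,
# vectors are blended in as a secondary signal.
BM25_WEIGHT = 0.85
VECTOR_WEIGHT = 0.15

def normalize(scores):
    """Min-max normalize scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def blend(bm25_scores, vector_scores):
    """Weighted sum over the union of candidates, BM25-dominant."""
    b, v = normalize(bm25_scores), normalize(vector_scores)
    ids = set(b) | set(v)
    return {i: BM25_WEIGHT * b.get(i, 0.0) + VECTOR_WEIGHT * v.get(i, 0.0)
            for i in ids}
```

Normalizing before blending matters: raw BM25 scores and cosine similarities live on different scales, so the weights are only meaningful after both are mapped to a common range.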
2) TOC suppression
One early failure mode was retrieval returning table-of-contents chunks. They look lexically relevant but are poor evidence. I added TOC detection with a score penalty and optional outright suppression.
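One way to detect TOC chunks is the dot-leader pattern: TOC lines tend to end in runs of dots followed by a page number. This is an illustrative heuristic, not the app's exact rule, and the thresholds are placeholders:

```python
import re

# TOC lines typically look like "Department of Finance .......... 12"
TOC_LINE = re.compile(r"\.{3,}\s*\d+\s*$")

def toc_ratio(chunk_text):
    """Fraction of non-empty lines that look like TOC entries."""
    lines = [l for l in chunk_text.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(bool(TOC_LINE.search(l)) for l in lines) / len(lines)

def apply_toc_penalty(score, chunk_text, penalty=0.5, suppress_at=0.8):
    """Penalize likely-TOC chunks; optionally suppress them entirely."""
    r = toc_ratio(chunk_text)
    if r >= suppress_at:
        return 0.0          # optional suppression
    if r >= 0.3:
        return score * penalty
    return score
```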
3) Smaller, section-aware chunking
Large generic chunks blurred unrelated sections. I moved to smaller windows with overlap and section boundaries, which improved precision and citation usefulness.
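The move to smaller windows can be sketched as a sliding window with overlap, where every chunk carries the page and section metadata needed for citation (the sizes below are illustrative, not the app's actual settings):

```python
def chunk_page(text, page_num, section, size=800, overlap=150):
    """Sliding-window chunking with overlap; each chunk keeps page and
    section metadata so answers can cite exact source pages."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({
            "text": text[start:end],
            "page": page_num,
            "section": section,
        })
        if end == len(text):
            break
        start = end - overlap  # overlap so evidence spanning a boundary survives
    return chunks
```

Because chunking never crosses a page boundary here, a chunk maps cleanly back to one page, which is what makes the "Open in Viewer at page N" UX possible.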
Evidence UX: Trust Through Citations
Good answers are not enough for civic data. Users need to verify.
So each result includes:
- source document name
- page range
- direct "Open in Viewer" action
- "Open in New Tab" action to the exact page anchor
I also added an embedded PDF panel so users can inspect evidence without leaving the page.
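The "exact page anchor" relies on the `#page=N` fragment, which browser PDF viewers honor. A minimal sketch of building such a link (the `/pdfs/` path is a placeholder, not the app's real route):

```python
from urllib.parse import quote

def page_anchor_url(base_url, pdf_name, page):
    """Build an 'Open in New Tab' link that jumps to the exact page.
    Browser PDF viewers honor the #page=N fragment."""
    return f"{base_url}/pdfs/{quote(pdf_name)}#page={page}"
```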
Making It Operable in Production
Public tools need controls, not just model output.
I added:
- Dockerized deployment
- Env-based provider switching (openai, bedrock, ollama)
- Rate limiting for public traffic
- Site on/off feature flag with a temporary disabled page linking to the open-source repo
- Export options for user queries (Markdown, JSON, CSV)
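Provider switching can be as simple as resolving an environment variable against a registry and failing fast when required credentials are missing. This is a hedged sketch: the env var name, model names, and required keys below are assumptions, not the app's actual config:

```python
import os

# Illustrative provider registry; model names and required env vars
# are placeholders, not the app's real settings.
PROVIDERS = {
    "openai":  {"needs": ["OPENAI_API_KEY"], "default_model": "gpt-4o-mini"},
    "bedrock": {"needs": ["AWS_REGION"],     "default_model": "claude-3-haiku"},
    "ollama":  {"needs": [],                 "default_model": "llama3"},
}

def resolve_provider(env=os.environ):
    """Pick a provider from LLM_PROVIDER and validate its prerequisites."""
    name = env.get("LLM_PROVIDER", "openai").lower()
    if name not in PROVIDERS:
        raise ValueError(f"unknown provider: {name}")
    cfg = PROVIDERS[name]
    missing = [k for k in cfg["needs"] if k not in env]
    if missing:
        raise RuntimeError(f"{name} requires env vars: {missing}")
    return name, cfg["default_model"]
```

Failing fast at startup beats discovering a missing API key on the first user request.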
Quality Loop: Evaluation and Tuning
I added an evaluation harness (eval_rag.py) with a sample benchmark file (eval/questions.sample.json) so retrieval changes can be measured, not guessed.
Metrics:
- Hit Rate@k
- MRR (mean reciprocal rank)
- first-hit rank per query
I also added a tuning mode to test BM25/vector blends and select the best setting for the current benchmark.
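The metrics above are small enough to sketch directly. Given the 1-based first-hit rank per query (None for a miss), Hit Rate@k and MRR fall out of a couple of lines (a sketch, not the harness's actual code):

```python
def hit_rate_at_k(ranks, k):
    """Fraction of queries whose first relevant result is in the top k.
    `ranks` holds the 1-based first-hit rank per query, or None if missed."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank: 1/rank of the first hit, 0 when missed."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```

A tuning mode then just re-runs these metrics for each candidate BM25/vector weight pair and keeps the best-scoring blend.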
Deployment Notes
The app runs on AWS EC2 with Docker and Caddy for HTTPS. DNS is managed in Vercel and points a subdomain to the EC2 Elastic IP.
This setup kept deployment simple while still giving me:
- HTTPS
- controlled rollout
- easy environment-based config
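For context, the Caddy side of this setup can be a few lines. A minimal sketch, assuming the app container listens on port 8000 and the subdomain is a placeholder (Caddy provisions the HTTPS certificate automatically):

```text
# Illustrative Caddyfile; domain and port are placeholders.
budget.example.com {
    reverse_proxy localhost:8000
}
```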
What I Would Improve Next
If I were taking this to the next level, I would prioritize:
- richer structured extraction for budget tables and fields
- stronger evaluation coverage with a larger real-user query set
- Redis-backed shared rate limiting for multi-instance scale
- filterable UI facets (department, fund, grant code)
Closing
The biggest lesson was that civic RAG quality is mostly a retrieval and product design problem, not just a prompt problem. The model helps summarize, but trust comes from retrieval quality, transparent evidence, and operational discipline.
If you're building something similar and want to compare notes, feel free to reach out.