A specialized Retrieval-Augmented Generation (RAG) search engine built to traverse decades of Tamil news publication archives, enabling journalists and researchers to query archives in natural language and receive verified historical answers.
How user queries flow through local semantic stores and LLM generation pipelines.
A journalist types a natural-language question in Tamil (e.g. about historical elections).
Translates the query into SQL and vector matches against the PostgreSQL archives.
Assembles the most relevant news texts and metadata into a structured context prompt.
Processes the prompt to generate factually accurate Tamil summaries.
Presents the context-grounded response with links to the source news articles.
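The five stages above can be sketched as a small orchestration function. This is a minimal illustration, not the production code: the `Article` record, `build_prompt`, and `answer_query` are hypothetical names, and both the retriever and the LLM call are passed in as plain callables so the example runs without a database or an API key.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Article:
    # Hypothetical record shape; the real archive schema is not shown here.
    article_id: int
    title: str
    body: str
    url: str


def build_prompt(question: str, articles: List[Article]) -> str:
    """Assemble retrieved article text and metadata into one context prompt."""
    context = "\n\n".join(
        f"[{i}] {a.title}\n{a.body}" for i, a in enumerate(articles, start=1)
    )
    return (
        "Answer in Tamil using only the numbered articles below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def answer_query(
    question: str,
    retrieve: Callable[[str], List[Article]],
    generate: Callable[[str], str],
) -> dict:
    """Run retrieval, prompt assembly, and generation, then attach source links."""
    articles = retrieve(question)          # stage 2: SQL/vector lookup (stubbed)
    prompt = build_prompt(question, articles)  # stage 3: context assembly
    return {
        "answer": generate(prompt),        # stage 4: LLM call (stubbed)
        "sources": [{"title": a.title, "url": a.url} for a in articles],  # stage 5
    }
```

In a real deployment, `retrieve` would wrap the PostgreSQL lookup and `generate` would wrap the Gemini API call; keeping them as injected callables also makes each stage easy to test in isolation.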
Dinamalar, one of India's preeminent Tamil daily newspapers, maintains a digital archive covering decades of historical news content.
The Problem: Navigating years of archived print editions, news topics, and historical columns was highly labor-intensive. Editorial and research departments had to rely on exact Boolean keyword searches and manually scan hundreds of documents to verify historical facts.
The Solution: I built the AI-Powered Archival Search System. Through a custom-designed Retrieval-Augmented Generation (RAG) pipeline, users can write natural-language queries in Tamil. The system runs database queries and vector matches against the news repository, builds a prompt context from the matching article text, and calls the Google Gemini API to return a precise, context-grounded answer accompanied by source citation links.
Enables research staff to type questions in their native language without needing complex database search syntax.
Goes beyond exact word matching to fetch news stories based on topics and conceptual synonyms.
Traverses 10+ years of indexed daily news archives, database rows, and editorial print backups.
Synthesizes complex multi-article stories into concise, bulleted summaries to expedite editorial drafts.
Working with RAG in regional languages such as Tamil presents unique tokenization and semantic analysis challenges. By combining semantic matching (which captures sentence intent) with traditional database filters (dateRanges, categoryTags), we achieved 90% retrieval accuracy.
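One way to picture the combination of metadata filters and semantic matching is a filter-then-rank step. The sketch below is an in-memory illustration under stated assumptions: `IndexedArticle`, `hybrid_search`, and the toy embedding vectors are hypothetical, and in a PostgreSQL deployment the same logic would normally run inside the database rather than in Python.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence, Set, Tuple


@dataclass
class IndexedArticle:
    # Hypothetical row: metadata columns plus a precomputed embedding vector.
    article_id: int
    published: date
    category_tags: Set[str]
    embedding: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(
    query_embedding: Sequence[float],
    articles: List[IndexedArticle],
    date_range: Optional[Tuple[date, date]] = None,
    category_tags: Optional[Set[str]] = None,
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Filter on metadata first, then rank the survivors by semantic similarity."""
    candidates = []
    for art in articles:
        # Traditional filter: publication date must fall inside the range.
        if date_range and not (date_range[0] <= art.published <= date_range[1]):
            continue
        # Traditional filter: at least one requested tag must match.
        if category_tags and not (category_tags & art.category_tags):
            continue
        score = cosine_similarity(query_embedding, art.embedding)
        candidates.append((art.article_id, score))
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]
```

Filtering before ranking keeps the similarity computation limited to articles that already satisfy the date and category constraints, which is the main reason to combine the two approaches.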
Grounding Gemini prompts strictly within the article snippets retrieved from the database eliminated hallucination issues, guaranteeing that the generated responses remain 100% faithful to the historically printed news assets.
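A grounding step of this kind can be sketched as a restrictive prompt plus a simple citation check. The prompt wording, the `[id]` citation convention, and the function names below are assumptions for illustration, not the actual prompt sent to Gemini; the check only flags citations to articles that were never retrieved and does not verify the claims themselves.

```python
import re
from typing import Dict, List, Set


def build_grounded_prompt(question: str, snippets: Dict[int, str]) -> str:
    """Build a prompt that restricts the model to the retrieved snippets."""
    sources = "\n".join(f"[{sid}] {text}" for sid, text in snippets.items())
    return (
        "Use ONLY the sources below. Cite each claim as [id]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer (in Tamil):"
    )


def cited_ids(answer: str) -> Set[int]:
    """Extract every [id] citation marker from a generated answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}


def unsupported_citations(answer: str, snippets: Dict[int, str]) -> List[int]:
    """Return cited ids that were never retrieved, a sign of an ungrounded answer."""
    return sorted(cited_ids(answer) - set(snippets))
```

An answer for which `unsupported_citations` returns a non-empty list could be rejected or regenerated before it is shown to the user.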