AI-Powered Archival Search System (RAG)

AI Archival Search Interface (Dinamalar Digital Archive)

System Dataflow

RAG Pipeline Architecture

How user queries flow through local semantic stores and LLM generation pipelines.

User Query

A journalist types a natural-language question in Tamil (e.g. about historical elections).

Archive Retrieval

Translates the query into SQL filters and vector matching against the PostgreSQL archive.

Context Gen

Assembles the most relevant article texts and metadata into a structured context prompt.

Gemini AI

Processes the prompt to generate Tamil summaries grounded in the retrieved articles.

Response

Presents the context-grounded response with links to the source news articles.
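The Context Gen and Response steps above can be sketched as two pure functions. This is a minimal illustration rather than the production code: the Article shape, its field names, the maxChars budget, and the [n] numbering scheme are all assumptions introduced here.

```typescript
// Hypothetical shape of a retrieved archive row; field names are assumptions.
interface Article {
  id: number;
  title: string;
  publishedOn: string; // ISO date string
  url: string;
  body: string;
}

// Context Gen: number each article in rank order and stop at the first one
// that would exceed the character budget, so the prompt stays bounded.
function buildContext(articles: Article[], maxChars: number): string {
  const blocks: string[] = [];
  let used = 0;
  for (let i = 0; i < articles.length; i++) {
    const a = articles[i];
    const block = `[${i + 1}] ${a.title} (${a.publishedOn})\n${a.body}`;
    if (used + block.length > maxChars) break;
    blocks.push(block);
    used += block.length;
  }
  return blocks.join("\n\n");
}

// Response: append a numbered source list so each [n] marker in the answer
// can be resolved to an archive link.
function formatResponse(answer: string, articles: Article[]): string {
  const sources = articles.map((a, i) => `[${i + 1}] ${a.title}: ${a.url}`);
  return `${answer}\n\nSources:\n${sources.join("\n")}`;
}
```

Keeping the numbering identical in both functions is the design point: the model sees [1], [2], ... in its context, and the reader sees the same numbers next to the article links.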

Project Overview

Dinamalar, one of India's preeminent Tamil daily newspapers, maintains a digital archive covering decades of historical news content.

The Problem: Navigating years of archived print editions, news topics, and historical columns was highly labor-intensive. Editorial and research departments had to rely on exact Boolean keyword queries and manually scan hundreds of documents to verify historical facts.

The Solution: I built the AI-Powered Archival Search System around a custom Retrieval-Augmented Generation (RAG) pipeline. Users write natural-language queries in Tamil. The system runs database queries and vector matches against the news repository, builds a prompt context from the matching article text, and calls the Google Gemini API to return a precise, context-grounded answer with source citation links.

Key Features

Natural Language Tamil Search

Enables research staff to type questions in their native language without needing complex database search syntax.

Semantic Retrieval

Goes beyond exact word matching to fetch news stories based on topics and conceptual synonyms.

Historical Discovery

Traverses 10+ years of indexed daily news archives, database rows, and editorial print backups.

AI-Assisted Research

Synthesizes complex multi-article stories into concise, bulleted summaries to expedite editorial drafts.
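The Semantic Retrieval feature listed above depends on comparing embedding vectors rather than words. As a rough illustration, the sketch below ranks documents by cosine similarity in memory. The EmbeddedDoc shape and the topK helper are hypothetical, and the source does not say which embedding model or similarity measure the system actually uses.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical document record: an id plus a precomputed embedding.
interface EmbeddedDoc {
  id: number;
  embedding: number[];
}

// Returns the ids of the k documents most similar to the query embedding.
function topK(query: number[], docs: EmbeddedDoc[], k: number): number[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((d) => d.id);
}
```

Two articles that share no keywords can still score highly here if their embeddings point in a similar direction, which is what lets retrieval match topics and synonyms.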

Lessons Learned

Working with RAG in regional languages presents unique tokenization and semantic analysis challenges. By combining semantic matching (which captures sentence intent) with traditional database filters (dateRanges, categoryTags), we achieved 90% retrieval accuracy.
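One way to combine the two signals is a single parameterized query: relational filters narrow the candidate set, then the surviving rows are ordered by vector distance. The sketch below assumes the pgvector extension (whose `<=>` operator is cosine distance) and a hypothetical articles table with published_on, category, and embedding columns. None of these names come from the actual system.

```typescript
// Hypothetical filter shape mirroring the dateRanges / categoryTags filters.
interface ArchiveFilters {
  dateFrom: string; // inclusive ISO date
  dateTo: string; // inclusive ISO date
  categoryTags: string[]; // empty array means "any category"
}

interface SqlQuery {
  text: string;
  values: unknown[];
}

// Builds one parameterized query. $1 is the query embedding in pgvector's
// text format ("[x,y,...]"); the table and column names are assumptions.
function buildHybridQuery(
  embedding: number[],
  filters: ArchiveFilters,
  limit: number
): SqlQuery {
  const values: unknown[] = [
    `[${embedding.join(",")}]`,
    filters.dateFrom,
    filters.dateTo,
  ];
  const where: string[] = ["published_on BETWEEN $2 AND $3"];
  if (filters.categoryTags.length > 0) {
    values.push(filters.categoryTags);
    where.push(`category = ANY($${values.length})`);
  }
  values.push(limit);
  const text =
    "SELECT id, title, url, published_on, body FROM articles " +
    `WHERE ${where.join(" AND ")} ` +
    `ORDER BY embedding <=> $1::vector LIMIT $${values.length}`;
  return { text, values };
}
```

The returned { text, values } pair matches the query config object that node-postgres accepts, should that be the driver in use. Passing all user-derived values as parameters also keeps the query safe from SQL injection.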

Grounding Gemini prompts strictly in the article snippets retrieved via SQL eliminated the hallucination issues we encountered and kept the generated responses faithful to the historically printed news record.
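A grounding setup of this kind might look like the sketch below: a prompt builder that confines the model to numbered sources, plus a cheap post-check that every citation marker in the answer points at a real source. The instruction wording, the Snippet shape, and the citationsAreValid helper are all hypothetical, and the call to the Gemini API itself is deliberately left out.

```typescript
// Hypothetical retrieved snippet; field names are assumptions.
interface Snippet {
  title: string;
  text: string;
}

// Wraps the retrieved snippets in instructions that confine the model to them.
function buildGroundedPrompt(question: string, snippets: Snippet[]): string {
  const context = snippets
    .map((s, i) => `[${i + 1}] ${s.title}\n${s.text}`)
    .join("\n\n");
  return [
    "Answer the question in Tamil using ONLY the numbered sources below.",
    "Cite each claim with its source number, e.g. [1].",
    "If the sources do not contain the answer, say the archive has no matching record.",
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Post-check: the answer must contain at least one [n] marker, and every
// marker must refer to one of the sources that was actually supplied.
function citationsAreValid(answer: string, sourceCount: number): boolean {
  const markers: string[] = answer.match(/\[\d+\]/g) || [];
  const cited = markers.map((m) => Number(m.slice(1, -1)));
  return cited.length > 0 && cited.every((n) => n >= 1 && n <= sourceCount);
}
```

An answer that fails the post-check can be rejected or regenerated before it reaches the journalist. The check only verifies that citations exist and are in range, not that each cited source supports the claim attached to it.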

System Stack

Search Logic

RAG (Retrieval-Augmented Generation)

AI Engine

Google Gemini API

Database Layer

PostgreSQL, Local News Indices

Server Stack

Node.js, Express

AI Archival Search System (RAG)

Natural Language Semantic Search Engine for Local Language News