
FinRAG

Multimodal Financial Document Intelligence Platform

Python · Google Gemini Embeddings · Google Gemini Flash 2.5 · Qdrant · Cloudflare R2 · Fly.io · Next.js · Vercel

  • <2s query latency (end-to-end)
  • 5+ doc types supported (multimodal)
  • 91.4% retrieval precision (top-3 accuracy)

01. The Challenge

FinRAG was born from a question DataSalt kept running into across financial NLP engagements: why do RAG systems struggle with financial documents?

The answer is structure. Financial filings — 10-Ks, 10-Qs, earnings call transcripts, proxy statements — are among the most information-dense documents in existence. They contain dense prose, nested tables, footnotes that override headline numbers, embedded charts, and cross-references that span hundreds of pages. A standard text-chunking RAG pipeline treats all of this as undifferentiated text and performs poorly on exactly the queries that matter most: "What was the effective tax rate excluding one-time items?", "How did segment margins change year-over-year?", "What did management say about guidance on the Q3 call?"

The second problem is multimodality. Critical data in financial documents lives in tables and charts — not prose. Standard embedding pipelines that chunk raw text miss or corrupt tabular data entirely. A system that can't read a table can't answer a financial question reliably.

We built FinRAG to close that gap: a RAG architecture purpose-built for financial documents, with table-aware parsing, structure-preserving chunking, and a hybrid retrieval strategy that handles both semantic and structured queries. The system is live at finrag.io and serves as a companion to the market-sentiment.io dashboard — together forming DataSalt's financial AI demonstration suite.

02. Our Approach

We built a six-stage pipeline: multimodal document ingestion, structure-aware embedding via Google Gemini, hybrid dense + sparse indexing in Qdrant, reciprocal rank fusion with cross-encoder reranking, LLM synthesis with page-level citations, and streamed responses with TTS audio for earnings calls.

  • Google Gemini Embeddings 2 native multimodal embedding — text, tables, and chart images in a shared 3072-dim vector space with no intermediate captioning step
  • Google Gemini Flash 2.5 (TTS) synthesized audio playback of earnings call summaries and key passages — enabling on-the-go consumption of financial intelligence
  • Qdrant vector database with payload filtering by chunk type (text, table, image) and named vector support for hybrid dense + sparse retrieval (collection setup sketched after this list)
  • Cloudflare R2 object storage for the raw PDF corpus and extracted assets (serialized tables, chart images)
  • Fly.io Python API server (FastAPI) hosting the retrieval pipeline, cross-encoder reranker, and streaming generation endpoint
  • Next.js + Vercel frontend with Server-Sent Events for streamed responses, audio playback for TTS, and full-text citation rendering
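
To make the indexing side concrete, here is a minimal sketch of how such a Qdrant collection could be configured with qdrant-client: a named 3072-dim dense vector, a sparse vector slot, and a keyword payload index on chunk type. The collection name, field names, and placeholder query vector are illustrative, not FinRAG's production configuration.

```python
# Sketch: Qdrant collection with a named dense vector (for 3072-dim Gemini
# embeddings), a sparse vector slot for BM25-style retrieval, and a keyword
# payload index on chunk type. All names here are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="finrag_chunks",
    vectors_config={
        "dense": models.VectorParams(size=3072, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

# Index chunk_type so filters on text / table / image chunks stay fast.
client.create_payload_index(
    collection_name="finrag_chunks",
    field_name="chunk_type",
    field_schema=models.PayloadSchemaType.KEYWORD,
)

# Dense search restricted to table chunks only.
hits = client.search(
    collection_name="finrag_chunks",
    query_vector=models.NamedVector(name="dense", vector=[0.0] * 3072),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="chunk_type", match=models.MatchValue(value="table")
            )
        ]
    ),
    limit=20,
)
```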

Pipeline at a glance:

  • Document Ingestion: PDF parse + chunk by type
  • Multimodal Embedding: Gemini Embeddings 2
  • Hybrid Indexing: Qdrant dense + BM25 sparse
  • Retrieval + Reranking: RRF fusion → cross-encoder
  • LLM Synthesis: cited, grounded answers
  • Streaming Response: SSE + TTS audio

03. Key Findings

Retrieval Precision by Document Type

Top-3 chunk retrieval precision across five document types. Table chunks slightly underperform prose chunks due to serialization edge cases in heavily merged-cell tables — a known area for improvement.

Latency by Query Type

Retrieval and generation time by query complexity. Multi-hop and comparative queries require more retrieved chunks and longer generation context, driving higher latency — both remain within acceptable interactive UX bounds.

Hybrid vs. Dense-Only Retrieval Recall

Recall across a 100-query evaluation set. BM25 provides the largest gains on queries containing exact financial terminology (EBITDA, segment names, specific fiscal quarters) where dense embeddings under-represent surface-form specificity.
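
The mechanism behind this gain is easy to demonstrate in isolation. The sketch below uses rank-bm25 (the sparse retriever named in the technical details) on a toy three-document corpus; exact surface forms like "EBITDA" and "FY2024" match lexically, independent of how an embedding model represents them.

```python
# Why sparse retrieval helps on exact financial terms: BM25 matches surface
# forms like "EBITDA" or "FY2024" directly, with no dependence on how an
# embedding model encodes them. Toy corpus for illustration only.
from rank_bm25 import BM25Okapi

docs = [
    "Adjusted EBITDA for Q3 FY2024 rose 12% year-over-year to $480M.",
    "Management discussed long-term profitability and operating leverage.",
    "Segment revenue grew across all geographies in the third quarter.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])

query = "ebitda q3 fy2024".split()
print(bm25.get_scores(query))  # highest score on the doc with exact term matches
```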

Interactive demo: FinRAG is live at finrag.io. Query a curated corpus of SEC filings and earnings transcripts, with table-aware retrieval and cited answers.

04. Business Impact

  • Query-to-Answer Time: manual (15–30 min) → <2 seconds (~99% faster)
  • Table Data Accessibility: not retrieved → serialized + embedded (full coverage)
  • Answer Grounding: none → page-level citations (100% of responses)
  • Doc Types Handled: text only → text + tables + charts (multimodal)
  • Earnings Call Access: read-only transcript → synthesized audio (TTS playback)

Purpose-built financial RAG with multimodal retrieval, page-level citations, and earnings call TTS.

FinRAG demonstrates what a purpose-built financial RAG system looks like when document structure is treated as a first-class concern — not an afterthought. The hybrid retrieval strategy outperforms dense-only retrieval by 4–8 percentage points on financial-specific queries. Table serialization enables a class of queries that standard chunking pipelines cannot answer. And citation-first generation makes every response auditable — a non-negotiable requirement in financial workflows where hallucination poses real risk.

As a DataSalt portfolio project, FinRAG is designed to be deployed against any corpus of financial documents. It currently indexes a demonstration set of public SEC filings (10-K and 10-Q) for three S&P 500 companies, with earnings call transcripts for the same period. FinRAG pairs with market-sentiment.io — DataSalt's live sentiment regime detection dashboard — to form a complete financial AI demonstration suite.

05. Technical Details

Document Parsing & Ingestion

  • PDF extraction: pdfplumber for structured layout detection; PyMuPDF as fallback
  • Table parsing: pdfplumber table extraction → markdown serialization with column headers preserved (sketched after this list)
  • Chart/figure handling: extracted as images and passed directly to the Gemini embedding model — no intermediate captioning step
  • Document storage: raw PDFs and extracted assets stored in Cloudflare R2
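
A minimal sketch of the table path described above, using pdfplumber's table extraction and a simple markdown serializer. The file name and the assumption that the first extracted row is the header row are illustrative.

```python
# Sketch: extract tables with pdfplumber and serialize each to markdown with
# headers preserved, so the table survives chunking as structured text.
import pdfplumber

def table_to_markdown(table: list[list[str | None]]) -> str:
    rows = [[(cell or "").strip() for cell in row] for row in table]
    header, body = rows[0], rows[1:]  # assumes first row holds column headers
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

with pdfplumber.open("10-K.pdf") as pdf:  # illustrative file name
    for page in pdf.pages:
        for table in page.extract_tables():
            print(table_to_markdown(table))
```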

Embeddings — The Multimodal Core

  • Model: Google Gemini Embeddings 2 preview (gemini-embedding-exp-03-07)
  • Modalities: text prose, serialized table markdown, and chart/figure images in a unified vector space (text and table embedding sketched after this list)
  • Dimension: 3072 — single model handles all content types without modality-specific pipelines
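
A minimal sketch of embedding prose and serialized table markdown with the google-genai SDK, using the model id named above. The task type and output dimensionality follow the documented API but are illustrative of the approach rather than FinRAG's exact settings; the image-embedding path is omitted here.

```python
# Sketch: embed prose and serialized table markdown via the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")  # GEMINI_API_KEY in practice

chunks = [
    "Net revenue increased 8% year-over-year, driven by services growth.",
    "| Segment | Q3 Revenue | YoY |\n| --- | --- | --- |\n| Cloud | $8.4B | +22% |",
]

result = client.models.embed_content(
    model="gemini-embedding-exp-03-07",
    contents=chunks,
    config=types.EmbedContentConfig(
        task_type="RETRIEVAL_DOCUMENT",
        output_dimensionality=3072,
    ),
)
vectors = [e.values for e in result.embeddings]  # one 3072-dim vector per chunk
```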

Retrieval Architecture

  • Vector store: Qdrant with payload filtering by chunk type (text / table / image)
  • Sparse: BM25 via rank-bm25, indexed on tokenized financial corpus
  • Fusion: Reciprocal Rank Fusion (k=60) across dense and sparse ranked lists
  • Reranker: cross-encoder/ms-marco-MiniLM-L-6-v2 for top-20 → top-5 selection (fusion and reranking sketched after this list)
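
Both fusion and reranking are compact enough to sketch directly. The RRF implementation below follows the standard formula score(d) = Σ 1/(k + rank) with k=60, and the reranker is the sentence-transformers cross-encoder named above; the toy ranked lists at the end stand in for real Qdrant and BM25 results.

```python
# Sketch: reciprocal rank fusion (k=60) over dense and sparse ranked id lists,
# then cross-encoder reranking of fused candidates down to the final top-5.
from collections import defaultdict
from sentence_transformers import CrossEncoder

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc id as sum(1 / (k + rank)) across all ranked lists."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, candidates: dict[str, str], top_n: int = 5) -> list[str]:
    """Re-score (query, chunk text) pairs with the MS MARCO cross-encoder."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ids = list(candidates)
    scores = model.predict([(query, candidates[i]) for i in ids])
    ranked = sorted(zip(scores, ids), key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in ranked[:top_n]]

# Toy ranked lists standing in for Qdrant (dense) and BM25 (sparse) results.
dense_ids = ["c3", "c1", "c7"]
sparse_ids = ["c7", "c4", "c1"]
print(rrf_fuse([dense_ids, sparse_ids]))  # -> ['c7', 'c1', 'c3', 'c4']
```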

LLM & Generation

  • Model: Claude Sonnet 4.5 via Anthropic API
  • Context: top-5 reranked chunks (~3,000 tokens) + system prompt
  • Output: answer + cited sources (document name, chunk ID, page number)
  • Streaming: Anthropic streaming API → Next.js Server-Sent Events (sketched after this list)
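
A minimal sketch of the streaming path: Claude's streaming API on the FastAPI side, surfaced as Server-Sent Events. The route shape, system prompt, and model id string are illustrative, and assembly of the top-5 reranked chunks into the prompt is elided.

```python
# Sketch: Claude streaming surfaced as Server-Sent Events from FastAPI.
import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

SYSTEM = "Answer only from the provided chunks; cite document, chunk ID, and page."

@app.get("/query")
def query(q: str) -> StreamingResponse:
    def sse():
        # context assembly from the top-5 reranked chunks is elided here
        with client.messages.stream(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=SYSTEM,
            messages=[{"role": "user", "content": q}],
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {text}\n\n"  # one SSE event per text delta
    return StreamingResponse(sse(), media_type="text/event-stream")
```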

Text-to-Speech (Earnings Calls)

  • Model: Google Gemini Flash 2.5 TTS
  • Synthesized audio playback of RAG-generated summaries and key earnings call passages
  • Generated server-side via Gemini API, streamed to client as audio (sketched after this list)
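
A minimal sketch of server-side synthesis with the google-genai SDK, following Google's published preview TTS usage. The model id, voice name, and summary text are illustrative, and the preview API may change.

```python
# Sketch: server-side TTS with the Gemini 2.5 Flash preview TTS model.
from google import genai
from google.genai import types

client = genai.Client(api_key="...")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Q3 highlights: revenue up 8% year-over-year; margins expanded.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)
pcm = response.candidates[0].content.parts[0].inline_data.data  # raw PCM bytes
```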

Infrastructure

  • Backend: FastAPI (Python), deployed on Fly.io
  • Frontend: Next.js 14 App Router, deployed on Vercel at finrag.io
  • Vector DB: Qdrant
  • Object storage: Cloudflare R2 (PDF corpus and extracted assets)
  • Ingestion: offline batch pipeline (Python), triggered on corpus updates

Facing similar challenges?

Let's discuss how data science can drive results for your business.