


Chandra OCR 2 achieves a state-of-the-art 85.9% on the olmOCR benchmark with layout-aware full-page decoding. Here's why it matters for document processing pipelines.
Head of Growth & Customer Success
Optical Character Recognition has existed for decades. Tesseract, the most widely-used open-source OCR engine, dates back to 1985. So why, in 2026, is a new OCR model trending on GitHub with 4,700+ stars and claiming state-of-the-art results?
Because the documents that matter in business are not simple pages of typed text. They are invoices with merged table cells, medical forms with checkboxes and handwritten notes, research papers with multi-column layouts and LaTeX equations, legal contracts with nested headers and footnotes. Traditional OCR reads the characters but destroys the structure. And without structure, the extracted text is nearly useless for downstream AI processing.
https://x.com/UnstructuredIO/status/2036602696371970195
Chandra OCR, developed by Datalab (a Brooklyn-based AI startup founded by Vik Paruchuri and Sandy Kwon in 2024), takes a fundamentally different approach. Instead of segmenting a page into blocks and processing each independently, Chandra decodes entire pages at once with full layout awareness. The result is structured output (Markdown, HTML, or JSON with bounding boxes) that preserves the document's semantic organization: tables remain tables, columns stay aligned, equations render as LaTeX, and forms retain their checkbox states.
Chandra 2, the latest release, achieves 85.9% overall accuracy on the olmOCR benchmark (the emerging standard for evaluating OCR on complex documents), the highest score of any model tested:
| Model | Overall Score |
|---|---|
| Chandra 2 | 85.9% |
| dots.ocr | 83.9% |
| Chandra 1 | 83.1% |
| olmOCR 2 | 78.5% |
| DeepSeek OCR | 75.4% |
Chandra 2 excels where traditional OCR struggles most:
Tables: 89.9% (handling merged cells, colspan/rowspan, nested tables)
Math equations: 89.3% (output as LaTeX)
Headers/footers: 92.5% (distinguishing structural elements from body text)
Multilingual: 77.8% average across 43 languages (vs. Gemini 2.5 Flash at 67.6%)
The 85.9% overall score represents a meaningful improvement over the next best model (dots.ocr at 83.9%) and a substantial leap over established alternatives like olmOCR 2 and DeepSeek OCR.
The architectural innovation in Chandra is the shift from pipeline-based OCR to full-page decoding using a vision-language model.
| Feature | Chandra OCR | Tesseract | AWS Textract | Azure Document AI |
|---|---|---|---|---|
| Layout awareness | Yes (native) | No | Yes | Yes |
| Table extraction | Yes (high accuracy) | No | Yes | Yes |
| Handwriting | Yes | Limited | Yes | Yes |
| Open source | Yes (free under $2M revenue) | Yes | No | No |
| Cost | Free (self-hosted) | Free | $1.50/1K pages | $1/1K pages |
https://x.com/meraborhan/status/2037308655201362123
Most OCR systems (including Datalab's own earlier products, Marker and Surya) work in stages:
Detect layout regions (text blocks, tables, images, equations)
Segment the page into individual regions
Recognize text within each region independently
Reassemble the results into a structured output
Each stage introduces error. Layout detection may split a table incorrectly. Segmentation may assign a column of text to the wrong block. Recognition may lose context about where a text region fits in the overall document structure. The errors compound.
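The compounding can be made concrete with a toy calculation. The per-stage accuracies below are hypothetical, chosen only to illustrate the effect:

```python
# Hypothetical per-stage accuracies for a staged OCR pipeline; when stages
# run independently, their errors compound multiplicatively.
stages = {
    "layout detection": 0.97,
    "segmentation": 0.97,
    "recognition": 0.97,
    "reassembly": 0.97,
}

end_to_end = 1.0
for accuracy in stages.values():
    end_to_end *= accuracy

# Four 97%-accurate stages yield only ~88.5% end to end.
print(f"end-to-end accuracy: {end_to_end:.1%}")
```

A single-pass model avoids this multiplication entirely: there is one decoding step, so there is no cross-stage error product.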
Chandra uses a vision-language model (built on Qwen3-VL architecture) that processes the entire page as a single image. The model simultaneously:
Identifies all content types (text, tables, images, equations, forms, handwriting)
Extracts and captions images and diagrams
Preserves table structures including colspan and rowspan
Reconstructs form layouts with checkbox states
Handles handwritten text alongside printed text
Outputs structured formats with full layout metadata (bounding boxes for every element)
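Concretely, layout-aware output with bounding boxes might look like the sketch below. The field names here are illustrative, not Chandra's documented schema:

```python
# Hypothetical shape of layout-aware OCR output: every block carries its
# type and bounding box, so document structure survives extraction.
page = {
    "page": 1,
    "blocks": [
        {"type": "heading", "bbox": [72, 40, 540, 72], "text": "Invoice #1042"},
        {"type": "table", "bbox": [72, 120, 540, 360], "cells": [
            {"row": 0, "col": 0, "colspan": 2, "text": "Line item"},
        ]},
        {"type": "equation", "bbox": [72, 400, 300, 430], "latex": r"q \cdot p"},
        {"type": "checkbox", "bbox": [72, 460, 88, 476], "checked": True},
    ],
}

# Downstream code can filter by block type instead of re-parsing flat text.
tables = [b for b in page["blocks"] if b["type"] == "table"]
print(len(tables), "table(s) found")
```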
By processing the full page at once, the model can use global context: understanding that a number in the bottom-right of a table is a total, that a margin annotation relates to the adjacent paragraph, that a multi-column layout should be read left-to-right within each column.
Chandra 2 is a 4-billion parameter model. This is small enough to run on a single GPU but large enough to capture the visual reasoning needed for complex document understanding.
H100 GPU: Up to 4 pages per second (~345,000 pages per day)
vLLM with 96 concurrent streams: 1.44 pages per second per stream
Max output tokens: 8,192 per page
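The daily figure follows directly from the per-second rate; a quick sanity check:

```python
# Back-of-envelope check of the quoted H100 throughput.
pages_per_second = 4
seconds_per_day = 24 * 60 * 60   # 86,400

pages_per_day = pages_per_second * seconds_per_day
print(f"{pages_per_day:,} pages/day")  # 345,600, i.e. the ~345K quoted above
```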
Chandra supports 90+ languages, with 77.8% average accuracy across the 43 languages tested in the multilingual benchmark. It is particularly strong on non-Latin scripts, where traditional OCR often falls apart.
Local via Hugging Face Transformers: Standard pip install, model download, and inference
High-throughput via vLLM: For production batch processing
Hosted API: Datalab offers a managed API for teams that do not want to manage GPU infrastructure
Quantized variants: Available for deployment on lower-end hardware or edge devices
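To gauge which tier of hardware a deployment needs, the weight memory of a 4B-parameter model can be estimated from bytes per parameter. This is a rough sketch that ignores activations and KV cache:

```python
# Approximate GPU memory for the weights alone of a 4B-parameter model.
PARAMS = 4e9
bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{fmt:9s} ~{gib:.1f} GiB")
```

At fp16 the weights come to roughly 7.5 GiB, which fits comfortably on a single data-center or even consumer GPU, and int4 quantization brings them under 2 GiB, which is why quantized variants can target lower-end and edge hardware.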
Chandra is open-source with some restrictions for commercial use above certain revenue thresholds (startups under $2M revenue can use it free). The hosted API provides a fully-licensed option for larger organizations.
The real significance of Chandra is not the benchmark number itself. It is what accurate, layout-aware OCR enables downstream.
A minimal usage sketch (package and model identifiers are illustrative and may differ from the released API):

```python
from chandra_ocr import ChandraOCR

# Load the model and run OCR on a document
ocr = ChandraOCR(model="datalab-to/chandra")
result = ocr.process("invoice.pdf")

for page in result.pages:
    # Tables come back as structured objects, not flat text
    for table in page.tables:
        print(table.to_dataframe())
    # Text blocks keep their reading order and layout metadata
    for paragraph in page.text_blocks:
        print(paragraph.text)
```

For email platforms like Maylee that need to process PDF attachments, extract invoice data, or parse document content for smart search, Chandra OCR's layout awareness means you get structured data instead of a flat text dump. Tables remain tables, headers remain headers, and the reading order follows the visual layout rather than the raw character stream.
The practical impact for business email workflows is significant. A user who receives a multi-page contract as a PDF attachment can have Maylee extract key terms, dates, and obligations automatically. An accountant receiving invoices can have amounts, line items, and due dates parsed into structured fields. These capabilities depend entirely on OCR quality, and Chandra's layout awareness makes them reliable for the first time in an open-source tool.
Retrieval Augmented Generation (RAG) systems are only as good as their document ingestion pipeline. If your OCR strips table structure from a financial report, the RAG system cannot answer questions about specific line items. If it fails to preserve the hierarchy of a legal contract, the system cannot distinguish a clause from a sub-clause.
Chandra's structured output (Markdown with table formatting, HTML with proper tags, JSON with bounding boxes) feeds directly into RAG pipelines. The generation model receives not just text, but text with structural context, dramatically improving answer quality for document-grounded questions.
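A minimal sketch of the ingestion side, assuming markdown output: chunk on headings so each table stays inside its section rather than being split mid-structure (the chunking strategy is illustrative, not part of Chandra):

```python
import re

def chunk_markdown(md: str) -> list[str]:
    """Split a markdown document into heading-delimited chunks."""
    chunks, current = [], []
    for line in md.splitlines():
        # Start a new chunk at each heading so section context is preserved
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = """# Q3 Report
Revenue grew 12%.

## Line items
| Item | Qty | Price |
|---|---|---|
| Widget | 3 | $5 |
"""
chunks = chunk_markdown(doc)
print(len(chunks), "chunks; the table travels intact inside its section")
```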
Finance and accounting teams process thousands of invoices, purchase orders, and forms. Traditional OCR extracts text but loses the table structure that maps line items to quantities and prices. Manual correction is expensive. Chandra's table extraction (89.9% accuracy, with colspan/rowspan support) reduces the correction burden significantly. One user (Purchaser.ai) reported six-figure cost savings from automating invoice processing with Chandra.
Legal documents are notoriously complex: nested sections, cross-references, footnotes, and multi-column formats. Chandra's header/footer detection (92.5%) and layout preservation make it possible to build AI systems that understand document structure, not just content. This enables clause extraction, contract comparison, and compliance checking that was previously unreliable with OCR-based approaches.
Healthcare digitization requires handling mixed content: printed text, handwritten physician notes, checkboxes, and form fields. Chandra's ability to process handwriting alongside printed text, within the context of the full page layout, addresses a long-standing gap in medical document processing.
Global organizations deal with documents in dozens of languages, often with mixed scripts on the same page. Chandra's 77.8% multilingual accuracy across 43 languages, compared to 67.6% for Gemini 2.5 Flash, makes it the strongest option for multilingual document pipelines.
OCR in 2026 is a crowded field, but the contenders occupy different niches:
Models like GPT-4o, Gemini 2.5 Flash, and Qwen3-VL can perform OCR as one of many capabilities. They are convenient but not specialized. Chandra outperforms them on structured document tasks while being far cheaper to run (4B parameters vs. hundreds of billions).
olmOCR 2 (78.5%): Strong open-source alternative, but 7.4 points behind Chandra 2
DeepSeek OCR (75.4%): Competitive on simple documents, weaker on tables and layout
dots.ocr (83.9%): The closest competitor, but Chandra 2 leads by 2 points overall and significantly more on tables and math
Tesseract: Mature and widely deployed, but weak on handwriting, tables, and complex layouts
PaddleOCR: Good table support under Apache license, but lower accuracy on the olmOCR benchmark
Chandra represents a generational leap over Datalab's previous products (Marker for PDF extraction, Surya for OCR). Those used pipeline architectures that Chandra's full-page approach has now superseded. Datalab positions Chandra as their flagship going forward.
For teams building document AI pipelines, Chandra fits at the ingestion layer.
https://x.com/LangChainAI/status/2037576986005061690
The integration with popular document processing frameworks like LangChain and LlamaIndex is straightforward. Chandra OCR outputs structured JSON that can be directly fed into RAG pipelines, vector databases, or any downstream processing system. This interoperability means you can adopt Chandra OCR without changing your existing document processing architecture.
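As an illustration of that interoperability, a small adapter can flatten block-level JSON into the text-plus-metadata records that LangChain or LlamaIndex document objects wrap. The JSON field names here are hypothetical:

```python
# Flatten hypothetical Chandra-style JSON pages into (text, metadata)
# records ready for a vector store or document framework.
def blocks_to_records(pages: list[dict]) -> list[dict]:
    records = []
    for page in pages:
        for block in page.get("blocks", []):
            records.append({
                "text": block.get("text", ""),
                "metadata": {
                    "page": page["page"],      # keep provenance for citations
                    "type": block["type"],     # heading, paragraph, table, ...
                    "bbox": block.get("bbox"),
                },
            })
    return records

pages = [{"page": 1, "blocks": [
    {"type": "heading", "bbox": [0, 0, 100, 20], "text": "Terms"},
    {"type": "paragraph", "bbox": [0, 30, 100, 90], "text": "Net 30 days."},
]}]
records = blocks_to_records(pages)
print(records[0]["metadata"]["type"])
```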
The GitHub repository includes detailed documentation, pre-trained model weights, and example notebooks for common use cases, including invoice processing, contract analysis, and academic paper parsing.
Document arrives (PDF, scanned image, fax, photo)
Chandra processes each page, outputting structured Markdown/HTML/JSON
Downstream systems consume the structured output for:
Embedding and indexing in vector databases for RAG
Structured data extraction (tables to databases)
Classification and routing
Summarization and analysis
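The routing step in that pipeline can be as simple as a dispatch table keyed on a classifier label; the labels and handler descriptions below are placeholders:

```python
# Toy router: send each processed document to a downstream handler
# based on a (hypothetical) classification label.
def route(doc_type: str) -> str:
    handlers = {
        "invoice": "extract line items into the accounting database",
        "contract": "index clauses for retrieval",
        "report": "embed sections for RAG",
    }
    # Unknown document types fall back to manual triage
    return handlers.get(doc_type, "queue for human review")

print(route("invoice"))
print(route("fax_coversheet"))
```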
The structured output is particularly valuable for AI-powered email and communication tools. When an email arrives with attached invoices, contracts, or reports, Chandra can extract the document content with full structure preserved, enabling AI systems to understand and respond to document-grounded questions. Products like Maylee, which auto-classify incoming emails (invoices, price requests, meeting notes) and draft contextual replies, benefit from this kind of structured document understanding: the AI can reference specific line items in an invoice or clauses in a contract when drafting a response.
Chandra's success demonstrates that specialized, moderately-sized models (4B parameters) can outperform both legacy OCR systems and general-purpose frontier LLMs on domain-specific tasks. This is consistent with a broader trend in AI: as the field matures, purpose-built models increasingly beat generalists at specific jobs while costing a fraction to run.
For teams currently using Tesseract, PaddleOCR, or LLM-based OCR, Chandra represents a meaningful upgrade path: better accuracy on the hard cases (tables, handwriting, math, multilingual content), structured output that feeds directly into AI pipelines, and deployment options ranging from local inference to managed API.
The 85.9% SOTA score on olmOCR is impressive, but the real measure is whether it reduces the manual correction burden that makes document AI projects expensive. Based on the benchmark breakdown, particularly the 89.9% table accuracy and 89.3% math accuracy, Chandra removes a significant portion of the error that has historically required human intervention.
For document-heavy industries (finance, legal, healthcare, government), accurate layout-aware OCR has been the missing piece that prevents full automation of document processing workflows. Chandra does not solve the entire problem, but it meaningfully raises the ceiling on what automated pipelines can handle without human review.
Chandra OCR is an open-source OCR model developed by Datalab that uses full-page decoding with layout awareness to extract text, tables, equations, forms, and handwriting from complex documents. The latest version, Chandra 2, achieves 85.9% state-of-the-art accuracy on the olmOCR benchmark.
Traditional OCR segments pages into blocks and processes each independently, losing structural context. Chandra processes entire pages at once using a vision-language model, preserving table structure, column layout, equation formatting, and the relationship between document elements.
Chandra outputs structured Markdown, HTML, or JSON with full layout metadata including bounding boxes for every element. Tables preserve colspan and rowspan, equations are output as LaTeX, and form fields retain their states.
Chandra 2 is a 4B parameter model. On an H100 GPU, it processes up to 4 pages per second. Quantized variants are available for lower-end hardware. Datalab also offers a hosted API for teams without GPU infrastructure.
Chandra supports 90+ languages with 77.8% average accuracy across 43 tested multilingual benchmarks, significantly outperforming alternatives like Gemini 2.5 Flash (67.6%) on multilingual document processing.
Chandra is open-source with some restrictions. Startups under $2M revenue can use it free. Larger organizations can use the hosted API from Datalab for a fully-licensed option. The model weights are available on Hugging Face.
Chandra's full-page decoding approach processes the entire table structure at once, preserving colspan and rowspan attributes. It achieves 89.9% accuracy on table extraction in the olmOCR benchmark, the highest among tested models.
Chandra can process handwritten text alongside printed text within the same page, using the full-page context to interpret handwriting in relation to form fields, annotations, and surrounding printed content. This is particularly valuable for medical records and archived documents.