OCR Format Overview
OCR (Optical Character Recognition) data formats are used to store scanned document content, often including both the raw image and extracted text/annotations. These formats are widely used in document digitization, search indexing, and automated processing pipelines.
1. Key Characteristics
Image + Text Storage
- OCR data typically combines the original image (TIFF, PNG, JPEG) with text results.
- Text may be stored with positional metadata (bounding boxes, coordinates), allowing precise mapping back to the source image.
Structured Annotations
- Modern OCR frameworks often store results in JSON, XML, or proprietary formats.
- Annotations can include:
  - Recognized text
  - Confidence scores
  - Word/line/page bounding boxes
  - Language or font hints
Hierarchical Organization
- Documents can have multiple pages, each with multiple regions or blocks of text.
- This hierarchical structure enables efficient search and retrieval of text in large documents.
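The page → block → word hierarchy above can be sketched as a nested JSON-style structure. The field names (`pages`, `blocks`, `words`, `bbox`, `conf`) are illustrative, not taken from any specific engine's schema:

```python
# A minimal sketch of a hierarchical OCR result. Field names are
# illustrative, not from any specific OCR engine's output schema.
doc = {
    "pages": [
        {
            "number": 1,
            "blocks": [
                {
                    "bbox": [40, 50, 560, 120],  # x0, y0, x1, y1 in pixels
                    "words": [
                        {"text": "Invoice", "bbox": [40, 50, 160, 80], "conf": 0.98},
                        {"text": "#1042",   "bbox": [170, 50, 260, 80], "conf": 0.91},
                    ],
                }
            ],
        }
    ]
}

def iter_words(document):
    """Walk the page -> block -> word hierarchy, yielding (page_no, word)."""
    for page in document["pages"]:
        for block in page["blocks"]:
            for word in block["words"]:
                yield page["number"], word

for page_no, word in iter_words(doc):
    print(page_no, word["text"], word["bbox"], word["conf"])
```

Because each level carries its own bounding box, a consumer can stop at any depth: page-level for retrieval, word-level for highlighting.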
2. Usage Scenarios
Document Digitization
- Converting scanned documents to searchable PDFs or text archives.
- Storing both the original image and extracted content for verification.
Search Indexing
- Index OCR results for full-text search in document management systems.
- Positional data allows highlighting and annotation features.
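Positional highlighting reduces to matching query terms against recognized words and returning their boxes for the rendering layer to draw. A minimal sketch, assuming a word list in the illustrative `{"text", "bbox"}` shape used above:

```python
def find_highlights(words, query):
    """Return bounding boxes of recognized words matching `query`
    (case-insensitive), suitable for drawing highlight rectangles
    over the source page image."""
    q = query.lower()
    return [w["bbox"] for w in words if w["text"].lower() == q]

words = [
    {"text": "Total", "bbox": [30, 400, 90, 420]},
    {"text": "total", "bbox": [30, 700, 90, 720]},
    {"text": "Due",   "bbox": [100, 400, 140, 420]},
]
print(find_highlights(words, "total"))  # two boxes, one per occurrence
```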
Data Extraction
- Extract structured information (invoices, receipts, forms) using OCR results.
- Combine with NLP or entity recognition pipelines.
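As a toy illustration of extracting structured fields from OCR text, the snippet below pulls an invoice total out of recognized line text with a regular expression. Real pipelines would combine layout information, NLP, and validation; the pattern and field here are hypothetical:

```python
import re

def extract_total(lines):
    """Naive sketch: scan OCR line text for an amount following the
    standalone word 'Total'. Hypothetical field logic for illustration."""
    pattern = re.compile(r"\btotal\b[:\s]*\$?([0-9]+(?:\.[0-9]{2})?)", re.IGNORECASE)
    for line in lines:
        m = pattern.search(line)
        if m:
            return float(m.group(1))
    return None

ocr_lines = ["Invoice #1042", "Subtotal: $120.00", "Total: $129.60"]
print(extract_total(ocr_lines))  # 129.6 ('Subtotal' is skipped by \b)
```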
Machine Learning Training
- OCR datasets are often used to train models for text detection and recognition.
- Annotation formats like COCO-Text, ICDAR, or PAGE XML are standard.
3. Popular OCR Data Formats
| Format | Description | Official Link / Notes |
|---|---|---|
| PAGE XML | XML-based standard storing page layout, text, and metadata | https://www.primaresearch.org/page/page-xml |
| hOCR | HTML-based format for OCR results, storing word positions and confidences | https://github.com/tmbdev/hocr-spec |
| ALTO XML | XML format for OCR results, widely used in libraries and archives | https://www.loc.gov/standards/alto/ |
| JSON | Custom or framework-specific JSON annotations | e.g., Tesseract output, Google Vision OCR API |
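To make the hOCR row concrete: hOCR embeds OCR results in HTML, with word-level elements carrying the class `ocrx_word` and a `title` attribute holding the bounding box and `x_wconf` confidence. The fragment below is a hand-written minimal example (real hOCR files carry more markup and an XHTML namespace), parsed with the standard library:

```python
import re
import xml.etree.ElementTree as ET

# Minimal hand-written hOCR fragment; real files include full XHTML
# boilerplate and additional metadata.
HOCR = """<div class="ocr_page">
  <span class="ocr_line" title="bbox 40 50 260 80">
    <span class="ocrx_word" title="bbox 40 50 160 80; x_wconf 96">Invoice</span>
    <span class="ocrx_word" title="bbox 170 50 260 80; x_wconf 91">#1042</span>
  </span>
</div>"""

def parse_hocr_words(hocr):
    """Extract (text, bbox, confidence) triples from ocrx_word spans."""
    root = ET.fromstring(hocr)
    words = []
    for span in root.iter("span"):
        if span.get("class") != "ocrx_word":
            continue
        title = span.get("title", "")
        bbox = [int(n) for n in
                re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title).groups()]
        conf_m = re.search(r"x_wconf (\d+)", title)
        conf = int(conf_m.group(1)) if conf_m else None
        words.append((span.text, bbox, conf))
    return words

for text, bbox, conf in parse_hocr_words(HOCR):
    print(text, bbox, conf)
```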
4. Integration in Kumo
In Kumo Stack, OCR data formats are typically used for:
- Storing and indexing scanned documents for search
- Supporting positional text highlighting in search results
- Feeding downstream pipelines (NLP, classification, entity extraction)
Integration considerations:
- Choose a format compatible with your OCR engine (Tesseract, Google Vision, AWS Textract, etc.)
- Maintain mapping between original images and extracted text
- Consider compression and storage for large document archives (e.g., zipped images + JSON/XML)
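One simple way to keep the image-to-text mapping intact, as suggested above, is to bundle each page image with its OCR sidecar in a single compressed archive. A minimal sketch with the standard library; the file names and JSON layout are illustrative:

```python
import io
import json
import zipfile

# Bundle a page image with its OCR sidecar so the mapping between
# image and extracted text travels together. Names are illustrative.
ocr = {"page": 1, "words": [{"text": "Invoice", "bbox": [40, 50, 160, 80]}]}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("page-0001.png", b"\x89PNG placeholder image bytes")
    zf.writestr("page-0001.json", json.dumps(ocr))

# Reading it back: the shared basename links sidecar to image.
with zipfile.ZipFile(buf) as zf:
    restored = json.loads(zf.read("page-0001.json"))
print(restored["words"][0]["text"])
```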
5. Performance Notes
I/O Efficiency
- Storing text separately from images allows faster search and indexing
- Large image archives benefit from block storage or cloud object storage
Data Size Considerations
- OCR output is usually small relative to raw images
- Compress XML or JSON output (e.g., with gzip) to reduce the disk footprint
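OCR output is repetitive (recurring keys, similar coordinates), so general-purpose compression pays off. A quick sketch comparing raw and gzip-compressed JSON; the record layout is illustrative:

```python
import gzip
import json

# Synthetic OCR output: 500 word records with the same repetitive keys.
result = {"words": [{"text": f"word{i}", "bbox": [i, 0, i + 50, 20], "conf": 0.9}
                    for i in range(500)]}

raw = json.dumps(result).encode("utf-8")
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed is substantially smaller
```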
Parallel Processing
- Multi-page documents or large collections can be processed in parallel per page or region
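Since pages are independent units in the hierarchy, per-page fan-out is straightforward. A sketch with `concurrent.futures`, where `ocr_page` stands in for a real per-page engine call:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_no):
    """Stand-in for a real per-page OCR call (e.g., invoking an engine
    on one page image); returns (page_no, recognized_text)."""
    return page_no, f"text of page {page_no}"

pages = range(1, 6)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order; results can be merged back per page.
    results = dict(pool.map(ocr_page, pages))
print(results[3])
```

For CPU-bound recognition a process pool (or the engine's own parallelism) would be the usual choice; a thread pool suffices when the per-page work is an external call.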
6. Useful Links
- PAGE XML: https://www.primaresearch.org/page/page-xml
- hOCR Specification: https://github.com/tmbdev/hocr-spec
- ALTO XML: https://www.loc.gov/standards/alto/
- Tesseract OCR: https://tesseract-ocr.github.io/