OCR Format Overview

OCR (Optical Character Recognition) data formats are used to store scanned document content, often including both the raw image and extracted text/annotations. These formats are widely used in document digitization, search indexing, and automated processing pipelines.


1. Key Characteristics

Image + Text Storage

  • OCR data typically combines the original image (TIFF, PNG, JPEG) with text results.
  • Text may be stored with positional metadata (bounding boxes, coordinates), allowing precise mapping back to the source image.

Structured Annotations

  • Modern OCR frameworks often store results in JSON, XML, or proprietary formats.
  • Annotations can include:
      • Recognized text
      • Confidence scores
      • Word/line/page bounding boxes
      • Language or font hints
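A minimal sketch of reading such an annotation, assuming a hypothetical JSON layout (field names like `bbox` and `confidence` vary by engine and are illustrative only):

```python
import json

# Hypothetical JSON annotation for one recognized word; real OCR
# engines use their own field names and nesting.
annotation = json.loads("""
{
  "text": "Invoice",
  "confidence": 0.97,
  "bbox": [120, 45, 310, 80],
  "language": "en"
}
""")

# The bounding box maps the word back to pixel coordinates in the image.
x0, y0, x1, y1 = annotation["bbox"]
print(f"'{annotation['text']}' at ({x0},{y0})-({x1},{y1}), "
      f"confidence {annotation['confidence']:.0%}")
```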

Hierarchical Organization

  • Documents can have multiple pages, each with multiple regions or blocks of text.
  • This hierarchical structure enables efficient search and retrieval of text in large documents.
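The page → block → line hierarchy can be walked recursively when indexing; a sketch with an assumed nested-dict shape (the keys are illustrative, not any particular engine's schema):

```python
# Assumed page/block/line hierarchy for one scanned document.
doc = {
    "pages": [
        {"number": 1,
         "blocks": [
             {"lines": ["ACME Corp.", "Invoice #1234"]},
             {"lines": ["Total: $99.00"]},
         ]},
    ]
}

def all_lines(document):
    """Flatten the page -> block -> line hierarchy for search indexing."""
    for page in document["pages"]:
        for block in page["blocks"]:
            yield from block["lines"]

print(list(all_lines(doc)))
```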

2. Usage Scenarios

Document Digitization

  • Converting scanned documents to searchable PDFs or text archives.
  • Storing both the original image and extracted content for verification.

Search Indexing

  • Index OCR results for full-text search in document management systems.
  • Positional data allows highlighting and annotation features.
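How positional data drives highlighting can be sketched as a word-level lookup from token to bounding box (the index and its contents are hypothetical):

```python
# Hypothetical word-level index: lowercase token -> bounding box
# (x0, y0, x1, y1) in image coordinates.
word_boxes = {
    "invoice": (120, 45, 310, 80),
    "total":   (100, 400, 160, 430),
}

def highlight_box(query):
    """Return the image-space rectangle to draw for a matched term,
    or None if the term was not recognized on the page."""
    return word_boxes.get(query.lower())

print(highlight_box("Total"))
```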

Data Extraction

  • Extract structured information (invoices, receipts, forms) using OCR results.
  • Combine with NLP or entity recognition pipelines.
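A simple field extraction over raw OCR text can be done with a pattern match; real pipelines would also use layout and entity recognition, so this is only a sketch (the sample text and pattern are illustrative):

```python
import re

# Sample OCR output for an invoice page (illustrative).
ocr_text = "ACME Corp.\nInvoice #1234\nTotal: $99.00\n"

# Pull one structured field out of the recognized text.
match = re.search(r"Total:\s*\$([\d,]+\.\d{2})", ocr_text)
total = float(match.group(1).replace(",", "")) if match else None
print(total)  # → 99.0
```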

Machine Learning Training

  • OCR datasets are often used to train models for text detection and recognition.
  • Annotation formats like COCO-Text, ICDAR, or PAGE XML are standard.

3. Common OCR Formats

Format     Description                                                                Official Link / Notes
PAGE XML   XML-based standard storing page layout, text, and metadata                 https://www.primaresearch.org/page/page-xml
hOCR       HTML-based format for OCR results, storing word positions and confidence   https://github.com/tmbdev/hocr-spec
ALTO XML   XML format for OCR results, widely used in libraries and archives          https://www.loc.gov/standards/alto/
JSON       Custom or framework-specific JSON annotations                              e.g., Tesseract output, Google Vision OCR API
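As a concrete example of the HTML-based format, a single hOCR word element carries its bounding box and confidence in the `title` attribute; a sketch of extracting them with the standard-library HTML parser (the snippet itself is a made-up fragment):

```python
from html.parser import HTMLParser

# Minimal hOCR fragment: one recognized word with bbox and confidence.
hocr = ('<span class="ocrx_word" '
        'title="bbox 120 45 310 80; x_wconf 97">Invoice</span>')

class WordExtractor(HTMLParser):
    """Collect (text, title) pairs for ocrx_word spans."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._pending = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "ocrx_word":
            self._pending = a.get("title", "")

    def handle_data(self, data):
        if self._pending is not None:
            self.words.append((data, self._pending))
            self._pending = None

parser = WordExtractor()
parser.feed(hocr)
print(parser.words)
```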

4. Integration in Kumo

In Kumo Stack, OCR data formats are typically used for:

  • Storing and indexing scanned documents for search
  • Supporting positional text highlighting in search results
  • Feeding downstream pipelines (NLP, classification, entity extraction)

Integration considerations:

  • Choose a format compatible with your OCR engine (Tesseract, Google Vision, AWS Textract, etc.)
  • Maintain mapping between original images and extracted text
  • Consider compression and storage for large document archives (e.g., zipped images + JSON/XML)
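One way to keep the image-to-text mapping while compressing storage is a zip archive pairing each page image with a JSON sidecar of the same base name; a sketch using an in-memory archive (file names and sidecar contents are illustrative):

```python
import io
import json
import zipfile

# Bundle a page image with its OCR sidecar in one compressed archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("page-001.png", b"\x89PNG placeholder image bytes")
    zf.writestr("page-001.json", json.dumps({"text": "Invoice #1234"}))

# Reading back: the shared base name keeps image and text mapped.
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())
    print(json.loads(zf.read("page-001.json"))["text"])
```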

5. Performance Notes

I/O Efficiency

  • Storing text separately from images allows faster search and indexing
  • Large image archives benefit from block storage or cloud object storage

Data Size Considerations

  • OCR output is usually small relative to raw images
  • Use binary XML or compressed JSON to reduce disk footprint
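Compressing JSON output is a one-liner with the standard library; a sketch with a made-up result object (real OCR output, being repetitive text and coordinates, tends to compress well at scale):

```python
import gzip
import json

# Illustrative OCR result; real output has far more repetition
# (keys, coordinates), which is what gzip exploits.
ocr_result = {"pages": [{"text": "Invoice #1234", "bbox": [0, 0, 612, 792]}]}

raw = json.dumps(ocr_result).encode()
compressed = gzip.compress(raw)

# Round-trips losslessly.
assert json.loads(gzip.decompress(compressed)) == ocr_result
```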

Parallel Processing

  • Multi-page documents or large collections can be processed in parallel per page or region
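Per-page parallelism maps naturally onto a process pool, since OCR is CPU-bound; a sketch where `recognize_page` stands in for a real engine call (e.g., running Tesseract on one page image):

```python
from concurrent.futures import ProcessPoolExecutor

def recognize_page(page_number):
    """Stand-in for a per-page OCR call (e.g., Tesseract on one image)."""
    return page_number, f"text of page {page_number}"

if __name__ == "__main__":
    # Each page is independent, so pages can be recognized concurrently.
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(recognize_page, range(1, 5)))
    print(results[1])
```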