MonkeyOCR Installation & Guide – Fast, Accurate Document Parser with SRR Triplet Framework

A guide to installing and running MonkeyOCR locally or on Google Colab

Introduction

In an age of information overload, documents remain a dominant medium for communicating complex data, from scientific papers to financial reports. Yet parsing such structured content poses significant challenges for traditional OCR systems. This is where MonkeyOCR excels. Built on the Structure-Recognition-Relation (SRR) paradigm, MonkeyOCR transforms document parsing by addressing “Where is it?”, “What is it?”, and “How is it organized?” in a unified yet modular architecture. As a result, it recovers the correct reading order and accurately handles multi-column text and complex paragraph layouts. Developed by Huazhong University of Science and Technology in collaboration with Kingsoft Office, the model unifies layout analysis, content recognition, and region ordering in a single pipeline. Backed by MonkeyDoc, the largest document parsing dataset to date, MonkeyOCR sets a new standard in accuracy, speed, and multilingual flexibility.

What is MonkeyOCR?

MonkeyOCR is a layout-aware document OCR model built on a unified framework for end-to-end document parsing. It detects and recognizes text in document images that contain multiple columns, varied font styles, or complex layouts, which makes it well suited to digitizing scanned documents, invoices, forms, and academic papers.

Key Features of MonkeyOCR

  • Unified Transformer Framework: Combines both detection and recognition into one pipeline.
  • Layout Awareness: Uses positional and spatial embeddings to understand document structure.
  • Multi-Task Learning: Supports auxiliary tasks like span classification for improved text spotting accuracy.
  • Superior Benchmarks: Outperforms previous pipeline-based and multimodal methods on the OmniDocBench document parsing benchmark.

The Structure-Recognition-Relation (SRR) Triplet Paradigm

The SRR paradigm represents a fundamental breakthrough in document processing methodology. Unlike traditional approaches that rely on complex multi-tool pipelines or computationally expensive large multimodal models, MonkeyOCR’s SRR framework operates through three interconnected components:

  1. Structure Detection: The model first identifies and maps the layout structure of documents, including headers, paragraphs, tables, figures, and other elements. This structural understanding forms the foundation for accurate content extraction.
  2. Content Recognition: Once the structure is identified, the model performs precise content recognition, extracting text, mathematical formulas, table data, and other relevant information with high accuracy.
  3. Relationship Prediction: The final component establishes relationships between different document elements, ensuring that the extracted content maintains its logical flow and hierarchical organization.

This integrated approach eliminates the error propagation issues common in pipeline-based methods while avoiding the computational overhead of processing entire document pages through massive language models.
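The three stages above can be sketched as a minimal pipeline. This is an illustrative mental model only, not MonkeyOCR's actual code: the block types, the dummy detector output, and the simple top-to-bottom ordering are stand-ins for the real learned components.

```python
# Illustrative sketch of the Structure-Recognition-Relation (SRR) flow.
# Every stage implementation here is a stand-in, not MonkeyOCR internals.
from dataclasses import dataclass

@dataclass
class Block:
    kind: str          # e.g. "heading", "paragraph", "table", "formula"
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    content: str = ""  # filled in by the recognition stage

def detect_structure(page) -> list[Block]:
    # Stage 1: "Where is it?" - locate layout regions on the page.
    # Dummy output standing in for a real layout detector.
    return [Block("heading", (0, 0, 100, 10)),
            Block("paragraph", (0, 12, 100, 60))]

def recognize_content(page, blocks: list[Block]) -> list[Block]:
    # Stage 2: "What is it?" - recognize content per region, not per page.
    for b in blocks:
        b.content = f"<recognized {b.kind} text>"
    return blocks

def predict_reading_order(blocks: list[Block]) -> list[Block]:
    # Stage 3: "How is it organized?" - order regions into a logical flow.
    # A real model predicts relations; a top-to-bottom sort stands in here.
    return sorted(blocks, key=lambda b: b.bbox[1])

def parse_page(page) -> list[Block]:
    return predict_reading_order(recognize_content(page, detect_structure(page)))
```

Because recognition runs per detected region rather than on the whole page at once, errors stay local to a block instead of propagating through the pipeline.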

Exceptional Performance Benchmarks

MonkeyOCR’s performance metrics demonstrate its superiority across multiple evaluation criteria, making it a game-changer for organizations seeking efficient document processing solutions.

MonkeyOCR Benchmarks vs Top Models

| Model        | Overall Edit ↓ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Reading Order Edit ↓ |
|--------------|----------------|-------------|----------------|---------------|-----------------------|
| MinerU       | 0.150          | 0.061       | 57.3           | 78.6          | 0.079                 |
| Qwen2.5-VL   | 0.312          | 0.157       | 79.0           | 76.4          | 0.149                 |
| GPT-4o       | 0.233          | 0.144       | 72.8           | 72.0          | 0.128                 |
| InternVL3-8B | 0.314          | 0.134       | 78.3           | 66.1          | 0.118                 |
| Nougat       | 0.452          | 0.365       | 15.1           | 39.9          | 0.382                 |
| MonkeyOCR-3B | 0.140          | 0.058       | 78.7           | 80.2          | 0.093                 |

Speed and Efficiency Advantages

One of MonkeyOCR’s most impressive achievements is its processing speed. The model achieves remarkable throughput rates that significantly outperform existing solutions:

  • MonkeyOCR: 0.84 pages per second
  • MinerU: 0.65 pages per second
  • Qwen2.5 VL-7B: 0.12 pages per second

Together, the roughly 7x speedup over Qwen2.5 VL-7B and 29% gain over MinerU translate into substantial time savings for high-volume document processing applications.
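The quoted comparisons follow directly from the throughput figures above; a quick check:

```python
# Derive the speed comparisons from the pages-per-second figures above.
monkeyocr, mineru, qwen = 0.84, 0.65, 0.12

speedup_vs_qwen = monkeyocr / qwen               # ratio of throughputs
gain_vs_mineru = (monkeyocr - mineru) / mineru   # relative improvement

print(f"{speedup_vs_qwen:.1f}x faster than Qwen2.5 VL-7B")  # 7.0x
print(f"{gain_vs_mineru:.0%} faster than MinerU")           # 29%
```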

Accuracy Improvements Across Document Types

MonkeyOCR demonstrates consistent superiority across various document categories and languages. When compared to MinerU, the leading pipeline-based method, MonkeyOCR achieves:

  • Overall improvement: 5.1% average gain across nine document types
  • Formula recognition: 15.0% performance increase
  • Table processing: 8.6% accuracy improvement

These improvements are particularly significant for organizations processing technical documents, financial reports, academic papers, and other content-rich materials that require precise formula and table recognition.

Multilingual Capabilities

MonkeyOCR excels in both Chinese and English document processing, making it ideal for global organizations and multilingual document workflows. The model’s robust performance across language barriers ensures consistent results regardless of document language.

Technical Architecture and Implementation

Model Specifications and Requirements

MonkeyOCR’s efficient architecture allows it to run on modest hardware configurations while delivering enterprise-grade performance:

  • Model Size: 3 billion parameters
  • Minimum GPU Requirements: Runs efficiently on a single NVIDIA RTX 3090
  • Memory Optimization: Optimized for single-GPU deployment
  • Processing Frameworks: Supports both LMDeploy and Transformers backends

Step-by-Step Guide: Install & Run MonkeyOCR

MonkeyOCR is a powerful document parser built on a Structure-Recognition-Relation (SRR) paradigm.

MonkeyOCR Google Colab Installation Guide

Follow this Google Colab guide to install and run MonkeyOCR on your own PDFs.

1: Install Miniconda

!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -p /usr/local/miniconda
!rm miniconda.sh

2: Configure PATH and Initialize Conda

In Colab, each `!` command runs in its own subshell, so `source ~/.bashrc` has no effect on later cells. Export the PATH for the notebook process from Python instead:

import os
os.environ['PATH'] = '/usr/local/miniconda/bin:' + os.environ['PATH']

!conda init bash
!conda --version

3: Create MonkeyOCR Conda Environment

!/usr/local/miniconda/bin/conda create -n monkeyocr python=3.10 -y
!/usr/local/miniconda/bin/conda run -n monkeyocr python --version

4: Clone MonkeyOCR Repository

!git clone https://github.com/Yuliang-Liu/MonkeyOCR.git
%cd MonkeyOCR

5: Install PyTorch and MonkeyOCR

!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
!pip install -e .

6: Install Additional Required Packages

!pip install huggingface_hub pdf2image

7: Hugging Face Authentication

!huggingface-cli login

8: Install LMDeploy (for optimized inference)

!pip install lmdeploy

9: Download Pretrained Models

!python /content/MonkeyOCR/tools/download_model.py

10: Apply LMDeploy Patch

%cd /content/MonkeyOCR
!python tools/lmdeploy_patcher.py patch

11: Upload and Parse a PDF Document

Upload your PDF to Colab using the sidebar or this code:

from google.colab import files
uploaded = files.upload()

Then run the parser from the /content/MonkeyOCR directory so the relative config path resolves (replace your_document.pdf with your filename):

!python /content/MonkeyOCR/parse.py /content/your_document.pdf -c model_configs.yaml

MonkeyOCR Local Installation Guide (GPU | RTX 3090/4090 | Conda)

MonkeyOCR is a powerful open-source document parser built on the SRR (Structure–Recognition–Relation) paradigm. This guide helps you install and run MonkeyOCR locally on Linux/macOS using Python 3.10, Conda, and CUDA 12.4.

1: Set Up Conda Environment

conda create -n MonkeyOCR python=3.10 -y
conda activate MonkeyOCR

2: Clone the GitHub Repository

git clone https://github.com/Yuliang-Liu/MonkeyOCR.git
cd MonkeyOCR

3: Install PyTorch (CUDA 12.4)

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

4: Install MonkeyOCR & Dependencies

pip install -e .
pip install huggingface_hub pdf2image

5: Download Pretrained Models

python tools/download_model.py

(Optional) Fix RTX 3090 / 4090 Shared Memory Error

If using LMDeploy and encountering the following error:

triton.runtime.errors.OutOfResources: out of resource: shared memory

Apply the patch:

python tools/lmdeploy_patcher.py patch

To revert:

python tools/lmdeploy_patcher.py restore

6: Run MonkeyOCR Inference

Replace the placeholder paths with your actual document or image file.

📄 For PDF:

python parse.py path/to/your.pdf

🖼️ For Images:

python parse.py path/to/your/image.jpg

💾 Specify output directory and config:

python parse.py path/to/your.pdf -o ./output -c model_configs.yaml
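For bulk jobs, the same CLI can be driven from a short script. The flags are exactly those shown above; the input folder name `docs` and the output directory are placeholders, and the sketch assumes it runs from the MonkeyOCR repo root so parse.py and model_configs.yaml resolve.

```python
# Batch-parse every PDF in a folder by invoking MonkeyOCR's parse.py CLI.
# Run from the MonkeyOCR repo root; "docs" and "./output" are placeholders.
import subprocess
from pathlib import Path

def build_cmd(pdf, out_dir: str = "./output",
              config: str = "model_configs.yaml") -> list[str]:
    """Assemble the parse.py invocation for one document."""
    return ["python", "parse.py", str(pdf), "-o", out_dir, "-c", config]

def parse_folder(folder: str = "docs") -> None:
    for pdf in sorted(Path(folder).glob("*.pdf")):
        print("Parsing", pdf.name)
        subprocess.run(build_cmd(pdf), check=True)

if __name__ == "__main__":
    parse_folder()
```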

Optional: Switch Inference Backend to Transformers

1: Install Flash Attention 2

Prebuilt wheels matching your CUDA and PyTorch versions are available from the flash-attention project's releases page.

pip install flash-attn==2.7.4.post1 --no-build-isolation

2: Edit model_configs.yaml

Modify the backend config:

chat_config:
  backend: transformers
  batch_size: 10  # Adjust based on your GPU (e.g., 6–12 for RTX 3090)

This bypasses LMDeploy and uses Hugging Face’s transformers for inference, which is often more portable across environments.

MonkeyOCR Docker Deployment Installation Guide

MonkeyOCR supports GPU-accelerated Docker deployment with both a Gradio-based demo and a FastAPI server. Follow the steps below to set up MonkeyOCR using Docker, including support for RTX 30/40 series GPUs.

1: Navigate to the Docker Directory

cd docker

2: Enable NVIDIA GPU Support (If Needed)

If you’re not already using nvidia-docker2, set up the environment for GPU compatibility:

bash env.sh

3: Build the Docker Image

For standard GPUs:

docker compose build monkeyocr

For RTX 30-series or 40-series GPUs (to avoid shared memory errors with LMDeploy):

docker compose build monkeyocr-fix

4: Run the Gradio Web Demo (Port 7860)

Launch the demo UI:

docker compose up monkeyocr-demo

📌 This will open a local Gradio app at http://localhost:7860

5: Launch Development Shell

Start an interactive container for testing or development:

docker compose run --rm monkeyocr-dev

6: Run the FastAPI Inference Service (Port 7861)

To deploy the backend API:

docker compose up monkeyocr-api

📌 Access via http://localhost:7861
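A client can then POST documents to the service. The route name `/parse` and the `file` field below are assumptions for illustration only; FastAPI serves auto-generated interactive documentation at http://localhost:7861/docs, so check there for the service's actual endpoints.

```python
# Minimal client sketch for the FastAPI service. The "/parse" route and
# the "file" form field are hypothetical - verify them against the
# auto-generated docs at http://localhost:7861/docs before use.
BASE_URL = "http://localhost:7861"

def parse_endpoint(base: str = BASE_URL, route: str = "/parse") -> str:
    """Build the full URL for an API route."""
    return base.rstrip("/") + route

if __name__ == "__main__":
    import requests  # pip install requests
    with open("your_document.pdf", "rb") as f:  # placeholder filename
        resp = requests.post(parse_endpoint(), files={"file": f})
    print(resp.status_code, resp.json())
```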

Example Output

MonkeyOCR will generate:

  • Structured text
  • Detected tables, formulas
  • Logical region sequence
  • Output in JSON and optionally images (with bounding boxes)
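A typical consumer walks the JSON in predicted reading order and dispatches on block type. The schema below (keys `blocks`, `type`, `order`, `text`) is a hypothetical example, not MonkeyOCR's guaranteed format; adapt the key names to the files the parser actually emits.

```python
# Walk a parsed-document JSON and reassemble text in reading order.
# The schema (keys "blocks", "type", "order", "text") is a hypothetical
# illustration, not MonkeyOCR's guaranteed output format.
import json

sample = json.loads("""
{"blocks": [
  {"type": "table",     "order": 2, "text": "| a | b |"},
  {"type": "heading",   "order": 0, "text": "Results"},
  {"type": "paragraph", "order": 1, "text": "We observe..."}
]}
""")

def to_markdown(doc: dict) -> str:
    """Concatenate blocks by predicted reading order, marking headings."""
    lines = []
    for block in sorted(doc["blocks"], key=lambda b: b["order"]):
        prefix = "## " if block["type"] == "heading" else ""
        lines.append(prefix + block["text"])
    return "\n\n".join(lines)

print(to_markdown(sample))
```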

Conclusion

MonkeyOCR represents a major step forward in document parsing technology, delivering strong performance through its innovative Structure-Recognition-Relation paradigm. By balancing accuracy, speed, and efficiency, it addresses the fundamental challenges that have long plagued document processing applications.

The model’s ability to outperform industry-leading solutions like Gemini 2.5 Pro and GPT-4o while maintaining a compact 3-billion parameter architecture demonstrates the power of specialized, purpose-built AI models. For organizations seeking to modernize their document processing workflows, MonkeyOCR offers a compelling combination of performance, efficiency, and cost-effectiveness.

As the model continues to evolve with planned enhancements, including photographed document support and improved deployment options, MonkeyOCR is positioned to become the standard for intelligent document processing. Whether you’re managing enterprise document workflows, conducting academic research, or developing document-centric applications, MonkeyOCR provides the tools and performance needed to transform how you handle document processing challenges.

Official Resources for MonkeyOCR


Md Monsur Ali is a tech writer and researcher specializing in AI, LLMs, and automation. He shares tutorials, reviews, and real-world insights on cutting-edge technology to help developers and tech enthusiasts stay ahead.