Xiaomi MiMo-VL-7B-RL Installation Guide & Model Overview

[Figure: Xiaomi MiMo-VL-7B-RL architecture showcasing vision-language fusion and reinforcement learning]

How Xiaomi’s first open-source vision-language model is setting new standards in multimodal AI and human-AI interaction

Introduction

Artificial intelligence is rapidly evolving, and Xiaomi has officially entered the spotlight with the release of Xiaomi MiMo-VL-7B-RL, its first open-source multimodal model tailored for advanced reasoning tasks. Positioned at the intersection of visual understanding and language generation, MiMo-VL-7B-RL integrates cutting-edge components with an innovative training framework to push the boundaries of what vision-language models (VLMs) can achieve.

What is Xiaomi MiMo-VL-7B-RL?

MiMo-VL-7B-RL is built by Xiaomi’s Large Language Model (LLM) Core Team and features a Transformer-based architecture that tightly couples vision and language inputs. At its core, it uses a native resolution Vision Transformer (ViT), which preserves fine-grained spatial details crucial for visual grounding, GUI analysis, and object localization. This is a step ahead of conventional models that downscale images and lose contextual cues.

Instead of relying on heavy fusion modules, Xiaomi has implemented a streamlined Multi-Layer Perceptron (MLP) projector. This lightweight component efficiently aligns the visual embeddings from the ViT with the token embeddings of the language model, enabling a seamless fusion of modalities without performance bottlenecks.
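
To make the idea concrete, here is a minimal sketch of what such a projector could look like in PyTorch. The two-layer design and the dimensions are illustrative assumptions, not Xiaomi's published implementation:

# Illustrative sketch only: layer sizes and the two-layer design are assumptions,
# not Xiaomi's actual projector implementation.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vit_dim=1152, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vit_dim) from the ViT encoder
        return self.proj(vision_features)  # (batch, num_patches, llm_dim)

# Project 256 hypothetical patch embeddings for one image
patch_embeddings = torch.randn(1, 256, 1152)
visual_tokens = VisionProjector()(patch_embeddings)
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])

The projected visual tokens live in the same embedding space as the language model's text tokens, which is what lets the two modalities be processed by a single Transformer stack.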

One of the standout capabilities of MiMo-VL-7B-RL is its long-context handling. Whether parsing dense documents, multi-element GUI screens, or extended visual narratives, the model can maintain coherent reasoning across multiple visual and textual inputs.

Engineering at Scale: The Core Design

The model architecture centers on three pivotal components: the ViT encoder for high-resolution visual input, the MLP projector for modality fusion, and Xiaomi's MiMo-7B language model as the reasoning backbone. That backbone was post-trained for rigorous reasoning tasks and scored 55.4 on the AIME 2025 benchmark, surpassing OpenAI's o1-mini by 4.7 points, a clear indicator of its analytical strength.

A Two-Phase Training Revolution

What makes Xiaomi MiMo-VL-7B-RL even more compelling is its two-phase training process, which balances supervised learning with a novel form of reinforcement learning:

  • Phase 1: Supervised Pretraining – This stage warms up the projector, aligns cross-modal inputs, and tunes the model using general multimodal data. It concludes with supervised fine-tuning on long-context inputs, resulting in the intermediate MiMo-VL-7B-SFT model.
  • Phase 2: Mixed On-Policy Reinforcement Learning (MORL) – This is where Xiaomi breaks new ground. The MORL framework integrates reward signals from four axes: perception accuracy, visual grounding, logical reasoning, and alignment with human preferences. The result is a highly capable model that can juggle competing priorities and remain contextually accurate across domains.
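
Xiaomi does not spell out the exact reward design here, but conceptually MORL blends several reward signals into a single training objective. A hypothetical sketch, where the example scores and equal weights are placeholders rather than Xiaomi's actual values:

# Hypothetical illustration of MORL's mixed reward: a weighted sum over the four
# axes named above. Scores and weights below are placeholders, not Xiaomi's values.
def mixed_reward(scores, weights):
    return sum(weights[axis] * scores[axis] for axis in weights)

scores = {"perception": 0.9, "grounding": 0.7, "reasoning": 0.8, "preference": 0.6}
weights = {"perception": 0.25, "grounding": 0.25, "reasoning": 0.25, "preference": 0.25}
print(mixed_reward(scores, weights))  # a single scalar reward for the on-policy RL update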

Benchmarking Excellence

MiMo-VL-7B-RL has demonstrated top-tier performance across a range of multimodal reasoning benchmarks.

[Figure: MiMo-VL-7B-RL benchmark results]

It excels not only in visual understanding but also in contextual inference, chain-of-thought reasoning, and grounding tasks. Its performance positions it at the forefront of open-source VLMs, especially in applications requiring deep analytical skills and fine-grained perception.

Tackling Reinforcement Learning Challenges

Applying reinforcement learning in multimodal domains is not without challenges. One key issue Xiaomi addressed was domain interference, where conflicting objectives across text, image, and user preference domains can reduce overall performance. The MiMo team’s MORL solution offers a promising direction for future research by showing that reinforcement learning can enhance rather than dilute multi-domain capabilities when implemented correctly.

Step-by-step Installation & Usage Guide for Xiaomi MiMo-VL-7B-RL (Vision + Language Model)

1. Install Required Python Packages

Open a terminal and run the following commands to install the dependencies:

pip install torch torchvision torchaudio einops timm pillow huggingface_hub sentencepiece bitsandbytes protobuf decord numpy
pip install git+https://github.com/huggingface/transformers
pip install git+https://github.com/huggingface/accelerate
pip install git+https://github.com/huggingface/diffusers
pip install "qwen-vl-utils[decord]==0.0.8"

2. Import Required Libraries

Create a Python script (e.g., run_mimo_vl.py) and add these imports:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

3. Load the Pretrained Model and Processor

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "XiaomiMiMo/MiMo-VL-7B-RL",
    torch_dtype="auto",       # Automatically selects appropriate dtype
    device_map="auto"         # Automatically places model parts on available devices (GPU/CPU)
)

processor = AutoProcessor.from_pretrained("XiaomiMiMo/MiMo-VL-7B-RL")
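
If GPU memory is tight, the bitsandbytes package installed in step 1 can be used to load the model in 4-bit precision instead. This is a hedged alternative to the from_pretrained call above (expect a small quality trade-off); the rest of the script stays the same:

# Optional alternative: 4-bit quantized load for GPUs with limited VRAM.
# Requires the bitsandbytes package from step 1 and a CUDA GPU.
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights stay 4-bit, compute runs in bf16
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "XiaomiMiMo/MiMo-VL-7B-RL",
    quantization_config=quant_config,
    device_map="auto",
)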

4. Prepare Your Image and Prompt

image = Image.open('/home/ubuntu/images/1.png').convert('RGB')
prompt = "Are there any people in this image? If yes, what are they doing?"

5. Format the Input Messages

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ],
}]
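
The same message format also accepts video input, which is why qwen-vl-utils was installed with the decord extra. A hedged example with a placeholder local path; if you pass this through steps 6 to 8 in place of messages, process_vision_info returns the decoded frames via video_inputs:

# Optional: video input instead of a still image. The path below is a placeholder.
video_messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///home/ubuntu/videos/demo.mp4"},
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]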

6. Process Input with the Processor

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

inputs = inputs.to("cuda")   # Send to GPU if available

7. Generate the Output

generated_ids = model.generate(**inputs, max_new_tokens=528)

# Trim the input tokens from generated output to get only the new generated part
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

8. Decode and Print the Result

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)
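
Run the script with python run_mimo_vl.py once the image path in step 4 points at a real file. If you would rather watch tokens appear as they are generated instead of waiting for the full answer, transformers' TextStreamer can be passed to generate() as a drop-in tweak to step 7; the output itself is unchanged, only how it is displayed:

# Optional: stream the answer token-by-token instead of decoding at the end.
from transformers import TextStreamer

streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, max_new_tokens=528, streamer=streamer)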

Official GitHub Link: Link

Hugging Face Link: https://huggingface.co/XiaomiMiMo/MiMo-VL-7B-RL

Conclusion

Xiaomi’s MiMo-VL-7B-RL marks a pivotal advancement in the world of open-source vision-language AI. Its thoughtfully designed architecture, long-context awareness, and innovative reinforcement learning strategy have set a new benchmark for multimodal reasoning. As Xiaomi continues to scale its AI ambitions, MiMo-VL-7B-RL not only showcases technical prowess but also hints at what’s next for real-world AI applications—from smart assistants to GUI navigation and beyond.


Md Monsur Ali is a tech writer and researcher specializing in AI, LLMs, and automation. He shares tutorials, reviews, and real-world insights on cutting-edge technology to help developers and tech enthusiasts stay ahead.
