Introduction
Voice technology has become a cornerstone of modern human-computer interaction. Voxtral Mini 3B is an enhancement of Ministral 3B, adding state-of-the-art audio input capabilities while retaining best-in-class text performance. This compact yet robust 3B-parameter model combines the language capabilities of Mistral's Ministral 3B with audio understanding, multilingual transcription, summarization, and even function calling, all in an Apache 2.0 licensed, production-ready package.
With a 32k-token context window and a design aimed at local deployment and edge devices, Voxtral Mini 3B stands out as one of the most versatile and accessible open-source speech models to date.
What Is Voxtral Mini 3B?
Voxtral Mini 3B is a multi-purpose speech-language model trained by Mistral AI, blending audio processing with strong language reasoning in a compact 3-billion-parameter setup. Built on the Ministral 3B decoder, the model features:
- A dedicated 640M audio encoder
- A 25M adapter that projects audio into the token space
- A 400M embedding layer
- A 3.6B decoder (Ministral 3B backbone)
Together, these modules form a speech intelligence model capable of transcription, translation, summarization, Q&A, and audio-to-function workflows.
Core Features of Voxtral Mini 3B
1. Multilingual Audio Transcription
Voxtral Mini can transcribe audio in a wide range of languages, including English, French, German, Spanish, Portuguese, Italian, Dutch, Polish, Romanian, and Hindi. It auto-detects the spoken language and switches into a dedicated transcription mode when prompted.
2. 32k Context Window
With a 32,000-token context window, Voxtral Mini 3B can handle over 30 minutes of audio for transcription or understanding, enabling robust summarization, Q&A, and other context-sensitive workflows from a single audio prompt.
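As a rough sanity check, here is the audio-token budget implied by that window. Note the per-second token rate is an assumption (roughly 12.5 audio tokens per second, i.e. a 4x downsampling of a 50 Hz encoder frame rate), not a figure stated in this article:

```python
# Rough audio-token budget for the 32k context window.
# ASSUMPTION: ~12.5 audio tokens per second after the adapter's downsampling.
TOKENS_PER_SECOND = 12.5
minutes = 30
audio_tokens = int(minutes * 60 * TOKENS_PER_SECOND)
print(audio_tokens, "audio tokens")  # 22500 -- leaves headroom in a 32k window
```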
3. Built-in Q&A & Summarization
Users can directly ask questions about an audio clip or request summaries without external chains or RAG setups. The model natively understands spoken content and responds in structured, concise output.
4. Voice Function Calling
Voxtral introduces spoken command execution. For example, a user saying “Email John the meeting summary” can be routed to a function call API for real-time action. This bridges speech and automation in one step.
5. Strong Language Backbone
Built on the Ministral 3B decoder, the model inherits the inference speed, language reasoning, and low hallucination rates of the original Mistral architecture.
Model Architecture Breakdown
Voxtral Mini pairs a dedicated audio front-end with a language-model backbone, a combination that sets it apart from conventional speech-only models.
| Component | Size | Purpose |
|---|---|---|
| Audio Encoder | 640M | Converts audio into embeddings |
| Adapter | 25M | Projects embeddings into token space |
| Text Embeddings | 400M | Maps tokens to embedding vectors |
| Decoder | 3.6B | Ministral 3B backbone for language output |
| Total Parameters | 4.7B | Full pipeline |
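The per-component sizes in the table add up to the stated total:

```python
# Sum the per-component parameter counts from the table above (in millions).
components = {"audio_encoder": 640, "adapter": 25, "text_embeddings": 400, "decoder": 3600}
total_m = sum(components.values())
print(f"{total_m / 1000:.3f}B total parameters")  # 4.665B, reported as ~4.7B
```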
Voxtral Mini 3B is trained using a combination of:
- Audio-text paired data
- Supervised instruction fine-tuning
- Online DPO (Direct Preference Optimization)
This leads to better factual grounding, less hallucination, and stable inference even in mixed-modality tasks.
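For intuition, the per-pair DPO objective mentioned above can be sketched in a few lines. This is a simplified scalar version; real training operates on batched log-probabilities from the policy and a frozen reference model:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are summed log-probabilities of the chosen and rejected
    responses under the policy being trained and a frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already favors the chosen answer relative to the
# reference, the loss falls below log(2) ≈ 0.693:
print(dpo_loss(-10.0, -20.0, -12.0, -18.0))  # ≈ 0.513
```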
Benchmark Performance of Voxtral Mini 3B
The model’s performance credentials are established through rigorous benchmarking across multiple evaluation criteria.
Transcription
Voxtral outperforms Whisper large-v3, the leading open-source speech transcription model, and Mistral reports that it also beats GPT-4o mini Transcribe and Gemini 2.5 Flash on transcription tasks.
- Outperforms OpenAI Whisper large-v3
- Competitive with closed models like GPT-4o mini Transcribe and Gemini 2.5 Flash
- Strong accuracy on:
- FLEURS (speech benchmark)
- Common Voice
- MLS (Multilingual LibriSpeech)

Translation & Summarization
- BLEU scores on-par with top LLMs
- Can translate audio in multilingual prompts
- Produces concise 2–3 sentence summaries of audio

Real-Time Function Calling
- Converts speech into actionable output via a structured format:

{
  "function_call": {
    "name": "send_email",
    "arguments": {
      "recipient": "John",
      "subject": "Meeting Summary",
      "body": "Here's the key takeaway..."
    }
  }
}
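A minimal sketch of how a host application might route that structured output to real code. The `send_email` handler here is a hypothetical stand-in; your application supplies the actual functions:

```python
import json

# Hypothetical handler -- a real application would send an actual email.
def send_email(recipient, subject, body):
    return f"Email to {recipient}: {subject}"

HANDLERS = {"send_email": send_email}

def dispatch(model_output: str):
    """Parse the model's JSON function call and invoke the matching handler."""
    payload = json.loads(model_output)
    call = payload["function_call"]
    return HANDLERS[call["name"]](**call["arguments"])

result = dispatch(json.dumps({
    "function_call": {
        "name": "send_email",
        "arguments": {
            "recipient": "John",
            "subject": "Meeting Summary",
            "body": "Here's the key takeaway...",
        },
    }
}))
print(result)  # Email to John: Meeting Summary
```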
Step-by-Step: Install and Run Voxtral Mini 3B Locally
Running Voxtral Mini 3B locally with vLLM gives developers full control and fast, GPU-accelerated inference. Below is a step-by-step guide using Python and ngrok.
1. Install Core Dependencies
Start by installing uv, a fast Python package manager, and then install vLLM with audio support:
!pip install uv
!uv pip install -U "vllm" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Check your installation:
!python -c "import mistral_common; print(mistral_common.__version__)"
2. Set Up vLLM API Server
Install pyngrok and launch the vLLM server with optimized settings:
!pip install pyngrok --quiet
!vllm serve mistralai/Voxtral-Mini-3B-2507 \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--max-model-len 4096 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0 \
--port 8000 > server.log 2>&1 &
Adjust --max-model-len to fit your GPU's memory capacity.
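Because the server loads the model in the background (output is redirected to server.log), it helps to wait until it is ready before sending requests. A small polling helper against vLLM's /health endpoint:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:8000/health", timeout=600):
    """Poll vLLM's /health endpoint until the model has finished loading."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass  # server not up yet; retry
        time.sleep(5)
    return False
```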
3. Expose Localhost with Ngrok
Replace "Authentication_Key" with your Ngrok authtoken:
!ngrok config add-authtoken "Authentication_Key"
Then, expose the vLLM server:
from pyngrok import ngrok
public_url = ngrok.connect(8000)
print("🔗 vLLM Public URL:", public_url)
4. Set up OpenAI-Compatible Client
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # vLLM accepts a dummy key
    base_url=public_url.public_url + "/v1",
)
# Fetch the model ID
models = client.models.list()
model = models.data[0].id
print("🎯 Using model:", model)
5. Test with Real Audio Input
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
# Download a sample audio file
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
# Convert audio to AudioChunk
def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
# Create user prompt
text_chunk = TextChunk(text="What is the speaker talking about, and how inspiring is the speech?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), text_chunk]).to_openai()
print("=" * 30 + "USER 1" + "=" * 30)
print(text_chunk.text)
6. Generate and Print AI Response
response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print("=" * 30 + "BOT 1" + "=" * 30)
print(content)
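The transcript below also shows a second turn ("Now summarize the speech."). A follow-up like that reuses the same client: append the assistant's first answer and the new question to the message list. A small helper using plain OpenAI-style dicts, which the endpoint also accepts:

```python
# Build an OpenAI-style message list for a second conversational turn:
# first user message, the assistant's first answer, then the follow-up question.
def followup_messages(first_user, first_answer, question):
    return [
        first_user,
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": question},
    ]
```

Passing `followup_messages(user_msg, content, "Now summarize the speech.")` as `messages` to `client.chat.completions.create` produces the second answer.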
Output
🔗 vLLM Public URL: NgrokTunnel: "https://bcc027d6c2e5.ngrok-free.app" -> "http://localhost:8000"
🎯 Using model: mistralai/Voxtral-Mini-3B-2507
==============================USER 1==============================
What is the speaker talking about, and how inspiring is the speech?
==============================BOT 1==============================
The speaker is delivering his final farewell address to the nation, reflecting on his eight years as president and the impact of the American people on his presidency and personal growth. The speech is inspiring because it highlights the goodness, resilience, and hope of the American people, as well as the importance of self-government and civic engagement. The speaker emphasizes that the success of the nation depends on the participation of its citizens, regardless of political affiliation. He encourages listeners to engage in the work of citizenship, whether through community service, running for office, or simply talking to others in real life. The speech is inspiring because it underscores the power of collective effort and the potential for positive change when people come together.
==============================USER 2==============================
Now summarize the speech.
==============================BOT 2==============================
In his final farewell address, the speaker reflects on his eight years as president and the impact of the American people on his presidency and personal growth. He highlights the goodness, resilience, and hope of the American people, sharing personal anecdotes and achievements, such as the recovery from the economic crisis, the expansion of affordable healthcare, and the rebuilding of communities like Joplin. The speaker also emphasizes the importance of self-government and civic engagement, encouraging listeners to participate in the work of citizenship, regardless of political affiliation. He believes that the success of the nation depends on the participation of its citizens and encourages them to engage in community service, run for office, or simply talk to others in real life. The speech concludes with the speaker expressing his optimism about the country's promise and his commitment to working alongside the American people as a citizen for the rest of his life.
Comparison to Whisper & GPT-4o
| Feature | Voxtral Mini 3B | Whisper v3 | GPT-4o |
|---|---|---|---|
| Open Source | ✅ Yes | ✅ Yes | ❌ No |
| Context Window | 32k tokens | 30-second audio chunks | 128k tokens |
| Function Calling | ✅ Yes | ❌ No | ✅ Yes |
| Translation | ✅ Multilingual | ✅ | ✅ |
| Reasoning Capability | ✅ Ministral | ❌ | ✅ |
| Cost | $0.001/min | $0.006/min | $0.06–0.12/min |
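To put the per-minute rates in perspective, here is the cost of transcribing a one-hour recording at the table's list prices (hosted-API pricing; self-hosted costs depend entirely on your hardware, and the GPT-4o figure uses the low end of its range):

```python
# Cost of transcribing 60 minutes of audio at the per-minute rates above.
rates = {"Voxtral Mini 3B": 0.001, "Whisper v3 (API)": 0.006, "GPT-4o (low end)": 0.06}
minutes = 60
for name, rate in rates.items():
    print(f"{name}: ${minutes * rate:.2f}")
```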
More details: the Voxtral Mini 3B model card on Hugging Face and the official Mistral blog post.
Conclusion
Voxtral Mini 3B represents a major evolution in open-source speech AI. It’s fast, multilingual, accurate, and affordable, perfect for both individual developers and large enterprises. From long-form transcription to voice-triggered automation, its tight integration of speech and text understanding places it ahead of other models in the same weight class.
Md Monsur Ali is a tech writer and researcher specializing in AI, LLMs, and automation. He shares tutorials, reviews, and real-world insights on cutting-edge technology to help developers and tech enthusiasts stay ahead.
