MiMo-Audio-7B-Instruct is Xiaomi’s instruction-tuned audio language model that handles any-to-any tasks across speech and text (ASR, TTS, audio understanding, audio editing/continuation, voice conversion, and style transfer). Built on the MiMo-Audio stack, it uses a 1.2B MiMo-Audio-Tokenizer (25 Hz RVQ) plus a patch encoder/decoder so the LLM reasons on a downsampled 6.25 Hz sequence—unlocking few-shot generalization on new audio tasks without task-specific fine-tuning. Trained on 100M+ hours of audio, the base model reaches open-source SOTA on speech intelligence & audio-understanding benchmarks, while the Instruct variant adds robust “thinking” for both understanding and generation. Runs locally via the provided Gradio demo with CUDA ≥ 12.0 and FlashAttention-2.
GPU Configuration (Rule-of-Thumb)
Assumptions: PyTorch + FlashAttention-2, CUDA ≥ 12.0, BF16 preferred. “Context (audio)” is the rough total duration you can process/generate comfortably per request in the demo scripts before hitting memory pressure. Actual headroom varies with batch size, prompt history, and generation settings.
Scenario | Min VRAM | Comfortable VRAM | Example GPUs | Precision | Typical Use | Context (audio) | Notes |
---|---|---|---|---|---|---|---|
Entry (single-user demo) | 16 GB | 24 GB | RTX 4080/4090 24G, L4 24G | BF16 / FP16 | ASR or short TTS; audio understanding Q&A | ~30–90 s | Works on 16 GB with CPU offload; smoother on 24 GB. If OOM, shorten clips or reduce max new tokens. |
Standard (creator / lab) | 24 GB | 40–48 GB | A100 40G, L40S 48G | BF16 | Mixed tasks (ASR/TTS/conversion), longer refs, light batching (2–4) | ~2–5 min | Best speed-vs-capacity; comfortable for few-shot prompts and longer generations. |
Pro (heavier pipelines) | 48 GB | 80 GB | A100 80G, H100 80G | BF16 | Multi-minute inputs + multi-minute generations, higher sampling quality, bigger batches (4–8) | ~5–15 min | Headroom for longer prompts, higher decode steps, and concurrent users. |
Quant/Offload fallback | 8–12 GB | 16 GB | RTX 3060 12G, T4 16G | 8-bit/CPU offload | Basic ASR or short TTS toy runs | ~10–30 s | Use torch.compile + FlashAttn; expect slower I/O. Keep batch=1 and trim context aggressively. |
Multi-GPU (TP=2) | 2×16–24 GB | 2×24–40 GB | 2×A10 24G, 2×L4 24G | BF16 | Longer contexts with moderate throughput | ~3–8 min | Tensor parallel splits the LLM; tokenizer/decoder can stay on one device or offload to CPU if needed. NVLink helps. |
Practical Tips
- Enable FlashAttention-2 (version noted in the repo) and BF16 if supported.
- For tight VRAM: reduce max new tokens, cut input audio duration, and keep batch=1.
- CPU offload is viable but will slow generation; prefer 24 GB+ for smooth TTS/voice-conversion loops.
- Longer, high-quality TTS (style transfer/continuation) benefits most from 40–80 GB headroom.
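If you're unsure which tier applies, a quick pre-flight check of free VRAM helps; the allocator setting below is a generic PyTorch option (not a MiMo-specific flag) that can reduce fragmentation on tight-VRAM GPUs:
# Check free VRAM before launching the demo
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
# Optional: reduce CUDA memory fragmentation under memory pressure (recent PyTorch versions)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True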
Resources
Link: https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Instruct
Step-by-Step Process to Install & Run MiMo-Audio-7B-Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MiMo-Audio-7B-Instruct, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like MiMo-Audio-7B-Instruct.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MiMo-Audio-7B-Instruct.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that MiMo-Audio-7B-Instruct runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following commands to check the available Python version.
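python3 --version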
If you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
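Optionally, you can confirm that the PPA now offers the newer interpreter before installing it:
apt-cache policy python3.11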
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 -m pip -V
Then, run the following command to check the version of pip:
pip --version
Step 12: Set Up Python Environment
Run the following commands to set up the Python environment:
python3.11 -m venv /opt/mimo-audio
source /opt/mimo-audio/bin/activate
python -V
pip -V
Step 13: Clone the MiMo-Audio Repository
Clone the official MiMo-Audio repository from GitHub and move into the project directory:
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git && cd MiMo-Audio
Step 14: Install Required Dependencies
Run the following command inside the MiMo-Audio folder to install all required Python packages:
pip install -r requirements.txt
Step 15: Upgrade Pip & Install PyTorch (CUDA 12.4 Wheels)
Run the following commands to upgrade pip and install PyTorch (CUDA 12.4 wheels):
# Upgrade packaging tools
python -m pip install -U pip setuptools wheel
# Install Torch stack matching CUDA 12.x (use cu124 wheels)
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
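Once the install finishes, a quick sanity check (plain PyTorch, nothing MiMo-specific) confirms the CUDA build is picked up:
# Expect a 2.6.0+cu124 build, True for CUDA availability, and 12.4
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"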
Step 16: Install FlashAttention (For Faster Attention Kernels)
Install the FlashAttention wheel compatible with Torch 2.6 and CUDA 12.x:
pip install flash-attn==2.7.4.post1
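Before launching the server, you can verify that the kernels import cleanly; if this fails, the flash-attn build doesn't match your Torch/CUDA combination:
python -c "import flash_attn; print(flash_attn.__version__)"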
Step 17: Launch the MiMo-Audio Gradio Server
Start the app (binds to 0.0.0.0:7897):
python run_mimo_audio.py
You should see:
🚀 Launch MiMo-Audio...
🎨 Create Gradio Interface...
🌐 Launch Service - Address: 0.0.0.0:7897
* Running on local URL: http://0.0.0.0:7897
Keep this terminal open while you use the UI.
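If you'd prefer the server to survive SSH disconnects, one common option (not required by the repo) is to run it inside tmux:
# Install tmux if needed, then start a named session
sudo apt install -y tmux
tmux new -s mimo
# Inside the session (from the MiMo-Audio folder): activate the venv, launch the app, then detach with Ctrl+b then d
source /opt/mimo-audio/bin/activate && python run_mimo_audio.py
# Reattach later with:
tmux attach -t mimo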
Step 18: Open the Gradio UI in your Browser (SSH Port-Forwarding)
Run this on your laptop to forward the VM’s port 7897:
ssh -p 50880 -L 7897:127.0.0.1:7897 root@142.189.182.244
Now open: http://127.0.0.1:7897
You should see the MiMo-Audio WebRTC UI. Click “Click to Access Microphone” → speak → stop to get the model’s reply.
Now, interact with the model.
Conclusion
That’s it—you’ve got MiMo-Audio-7B-Instruct running on a GPU VM with CUDA 12, FlashAttention, and a clean Gradio UI. This setup gives you a fast, local, privacy-friendly playground for any-to-any audio tasks—ASR, TTS, voice conversion, continuation, and style transfer—without task-specific fine-tuning. For best results, keep BF16 enabled, trim overly long audio, and scale VRAM (24–48 GB) as your contexts and batches grow. From here, try custom prompts/voices, batch a few clips, or swap in multi-GPU when you need longer, higher-quality generations.