MiMo-Audio-7B-Instruct is Xiaomi’s instruction-tuned audio language model that handles any-to-any tasks across speech and text (ASR, TTS, audio understanding, audio editing/continuation, voice conversion, and style transfer). Built on the MiMo-Audio stack, it uses a 1.2B MiMo-Audio-Tokenizer (25 Hz RVQ) plus a patch encoder/decoder so the LLM reasons on a downsampled 6.25 Hz sequence—unlocking few-shot generalization on new audio tasks without task-specific fine-tuning. Trained on 100M+ hours of audio, the base model reaches open-source SOTA on speech intelligence & audio-understanding benchmarks, while the Instruct variant adds robust “thinking” for both understanding and generation. Runs locally via the provided Gradio demo with CUDA ≥ 12.0 and FlashAttention-2.
GPU Configuration (Rule-of-Thumb)
Assumptions: PyTorch + FlashAttention-2, CUDA ≥ 12.0, BF16 preferred. “Context (audio)” is the rough total duration you can process/generate comfortably per request in the demo scripts before hitting memory pressure. Actual headroom varies with batch size, prompt history, and generation settings.
Scenario | Min VRAM | Comfortable VRAM | Example GPUs | Precision | Typical Use | Context (audio) | Notes |
---|---|---|---|---|---|---|---|
Entry (single-user demo) | 16 GB | 24 GB | RTX 4080/4090 24G, L4 24G | BF16 / FP16 | ASR or short TTS; audio understanding Q&A | ~30–90 s | Works on 16 GB with CPU offload; smoother on 24 GB. If OOM, shorten clips or reduce max new tokens. |
Standard (creator / lab) | 24 GB | 40–48 GB | A100 40G, L40S 48G | BF16 | Mixed tasks (ASR/TTS/conversion), longer refs, light batching (2–4) | ~2–5 min | Best speed-vs-capacity; comfortable for few-shot prompts and longer generations. |
Pro (heavier pipelines) | 48 GB | 80 GB | A100 80G, H100 80G | BF16 | Multi-minute inputs + multi-minute generations, higher sampling quality, bigger batches (4–8) | ~5–15 min | Headroom for longer prompts, higher decode steps, and concurrent users. |
Quant/Offload fallback | 8–12 GB | 16 GB | RTX 3060 12G, T4 16G | 8-bit/CPU offload | Basic ASR or short TTS toy runs | ~10–30 s | Use torch.compile + FlashAttn; expect slower I/O. Keep batch=1 and trim context aggressively. |
Multi-GPU (TP=2) | 2×16–24 GB | 2×24–40 GB | 2×A10 24G, 2×L4 24G | BF16 | Longer contexts with moderate throughput | ~3–8 min | Tensor parallel splits the LLM; tokenizer/decoder can stay on one device or offload to CPU if needed. NVLink helps. |
Practical Tips
- Enable FlashAttention-2 (version noted in the repo) and BF16 if supported.
- For tight VRAM: reduce max new tokens, cut input audio duration, and keep batch=1.
- CPU offload is viable but will slow generation; prefer 24 GB+ for smooth TTS/voice-conversion loops.
- Longer, high-quality TTS (style transfer/continuation) benefits most from 40–80 GB headroom.
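If you're unsure which tier applies, a quick pre-flight check of free VRAM helps; the allocator setting below is a generic PyTorch option (not a MiMo-specific flag) that can reduce fragmentation on tight-VRAM GPUs:
# Check free VRAM before launching the demo
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
# Optional: reduce CUDA memory fragmentation under memory pressure (recent PyTorch versions)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True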
Resources
Link: https://huggingface.co/XiaomiMiMo/MiMo-Audio-7B-Instruct
Step-by-Step Process to Install & Run MiMo-Audio-7B-Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button in the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running MiMo-Audio-7B-Instruct, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like MiMo-Audio-7B-Instruct.
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like MiMo-Audio-7B-Instruct.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that MiMo-Audio-7B-Instruct runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a New Version
Run the following commands to check the available Python version.
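python3 --version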
If you check the Python version, you'll see that the system has Python 3.8.1 available by default. To install a higher version of Python, you'll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
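Optionally, you can confirm that the PPA now offers the newer interpreter before installing it:
apt-cache policy python3.11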
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and update pip:
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 -m pip -V
Then, run the following command to check the version of pip:
pip --version
Step 12: Set Up Python Environment
Run the following commands to set up the Python environment:
python3.11 -m venv /opt/mimo-audio
source /opt/mimo-audio/bin/activate
python -V
pip -V
Step 13: Clone the MiMo-Audio Repository
Clone the official MiMo-Audio repository from GitHub and move into the project directory:
git clone https://github.com/XiaomiMiMo/MiMo-Audio.git && cd MiMo-Audio
Step 14: Install Required Dependencies
Run the following command inside the MiMo-Audio folder to install all required Python packages:
pip install -r requirements.txt
Step 15: Upgrade Pip & Install PyTorch (CUDA 12.4 Wheels)
Run the following commands to upgrade pip and install PyTorch (CUDA 12.4 wheels):
# Upgrade packaging tools
python -m pip install -U pip setuptools wheel
# Install Torch stack matching CUDA 12.x (use cu124 wheels)
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
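Once the install finishes, a quick sanity check (plain PyTorch, nothing MiMo-specific) confirms the CUDA build is picked up:
# Expect a 2.6.0+cu124 build, True for CUDA availability, and 12.4
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"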
Step 16: Install FlashAttention (For Faster Attention Kernels)
Install the FlashAttention wheel compatible with Torch 2.6 and CUDA 12.x:
pip install flash-attn==2.7.4.post1
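Before launching the server, you can verify that the kernels import cleanly; if this fails, the flash-attn build doesn't match your Torch/CUDA combination:
python -c "import flash_attn; print(flash_attn.__version__)"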
Step 17: Launch the MiMo-Audio Gradio Server
Start the app (binds to 0.0.0.0:7897):
python run_mimo_audio.py
You should see:
🚀 Launch MiMo-Audio...
🎨 Create Gradio Interface...
🌐 Launch Service - Address: 0.0.0.0:7897
* Running on local URL: http://0.0.0.0:7897
Keep this terminal open while you use the UI.
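If you'd prefer the server to survive SSH disconnects, one common option (not required by the repo) is to run it inside tmux:
# Install tmux if needed, then start a named session
sudo apt install -y tmux
tmux new -s mimo
# Inside the session (from the MiMo-Audio folder): activate the venv, launch the app, then detach with Ctrl+b then d
source /opt/mimo-audio/bin/activate && python run_mimo_audio.py
# Reattach later with:
tmux attach -t mimo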
Step 18: Open the Gradio UI in your Browser (SSH Port-Forwarding)
Run this on your laptop to forward the VM’s port 7897:
ssh -p 50880 -L 7897:127.0.0.1:7897 root@142.189.182.244
Now open: http://127.0.0.1:7897
You should see the MiMo-Audio WebRTC UI. Click “Click to Access Microphone” → speak → stop to get the model’s reply.
Now, interact with the model.
Conclusion
That’s it—you’ve got MiMo-Audio-7B-Instruct running on a GPU VM with CUDA 12, FlashAttention, and a clean Gradio UI. This setup gives you a fast, local, privacy-friendly playground for any-to-any audio tasks—ASR, TTS, voice conversion, continuation, and style transfer—without task-specific fine-tuning. For best results, keep BF16 enabled, trim overly long audio, and scale VRAM (24–48 GB) as your contexts and batches grow. From here, try custom prompts/voices, batch a few clips, or swap in multi-GPU when you need longer, higher-quality generations.