Qwen3-Omni-30B-A3B-Instruct is a multilingual, any-to-any omni-modal MoE model with a native Thinker–Talker design. It ingests text, image, audio, and video, and can stream back text or natural speech in real time. Thanks to early text-first pretraining, mixed multimodal training, and a multi-codebook audio stack, it delivers SOTA-level ASR and audio-visual understanding while keeping strong unimodal text and vision performance. It supports FlashAttention-2 and long contexts, and runs well with Transformers or vLLM. Use the Instruct variant for end-to-end voice/chat experiences (Thinker + Talker), or the Thinking variant when you only need chain-of-thought text output.
GPU Configuration (Quick Reference)
Assumptions
- Precision: BF16 (FlashAttention-2 enabled)
- Framework: Transformers (for min-VRAM math); vLLM is recommended for serving and may change practical headroom.
- max_model_len ≈ 32k, default image/video preprocessing (e.g., ~2 fps for eval), and single-request unless noted.
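For reference, a minimal Transformers load that matches these assumptions (BF16 weights with FlashAttention-2) might look like the sketch below; the exact kwargs can vary with your Transformers version, and you can switch attn_implementation to "sdpa" if flash-attn isn't installed:
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# BF16 weights + FlashAttention-2, sharded automatically across available GPUs
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)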
Evaluation
Performance of Qwen3-Omni
Qwen3-Omni maintains state-of-the-art performance on text and visual modalities without degradation relative to same-size single-modal Qwen counterparts. Across 36 audio and audio-visual benchmarks, it achieves open-source SOTA on 32 of them and overall SOTA on 22, outperforming strong closed-source systems such as Gemini 2.5 Pro and GPT-4o.
Text -> Text
| Category | Benchmark | GPT-4o-0327 | Qwen3-235B-A22B Non Thinking | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|
| Alignment Tasks | IFEval | 83.9 | 83.2 | 84.7 | 81.0 | 81.7 |
| | Creative Writing v3 | 84.9 | 80.4 | 86.0 | 80.6 | 81.8 |
| | WritingBench | 75.5 | 77.0 | 85.5 | 82.6 | 83.0 |
| Agent | BFCL-v3 | 66.5 | 68.0 | 65.1 | 64.4 | 65.0 |
| Multilingual Tasks | MultiIF | 70.4 | 70.2 | 67.9 | 64.0 | 64.7 |
| | PolyMATH | 25.5 | 27.0 | 43.1 | 37.9 | 39.3 |
| Category | Benchmark | Gemini-2.5-Flash Thinking | Qwen3-235B-A22B Thinking | Qwen3-30B-A3B-Thinking-2507 | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|---|
| General Tasks | MMLU-Redux | 92.1 | 92.7 | 91.4 | 88.8 | 89.7 |
| | GPQA | 82.8 | 71.1 | 73.4 | 73.1 | 73.1 |
| Reasoning | AIME25 | 72.0 | 81.5 | 85.0 | 73.7 | 74.0 |
| | LiveBench 20241125 | 74.3 | 77.1 | 76.8 | 71.8 | 70.3 |
| Code | MultiPL-E | 84.5 | 79.9 | 81.3 | 80.6 | 81.0 |
| Alignment Tasks | IFEval | 89.8 | 83.4 | 88.9 | 85.1 | 85.2 |
| | Arena-Hard v2 | 56.7 | 61.5 | 56.0 | 55.1 | 57.8 |
| | Creative Writing v3 | 85.0 | 84.6 | 84.4 | 82.5 | 83.6 |
| | WritingBench | 83.9 | 80.3 | 85.0 | 85.5 | 85.9 |
| Agent | BFCL-v3 | 68.6 | 70.8 | 72.4 | 63.2 | 64.5 |
| Multilingual Tasks | MultiIF | 74.4 | 71.9 | 76.4 | 72.9 | 73.2 |
| | PolyMATH | 49.8 | 54.7 | 52.6 | 47.1 | 48.7 |
Audio -> Text
| Dataset | Seed-ASR | Voxtral-Mini | Voxtral-Small | GPT-4o-Transcribe | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|---|---|
| EN & ZH ASR (WER) | | | | | | | | |
| Wenetspeech net / meeting | 4.66 / 5.69 | 24.30 / 31.53 | 20.33 / 26.08 | 15.30 / 32.27 | 14.43 / 13.47 | 5.91 / 7.65 | 4.69 / 5.89 | 4.62 / 5.75 |
| LibriSpeech clean / other | 1.58 / 2.84 | 1.88 / 4.12 | 1.56 / 3.30 | 1.39 / 3.75 | 2.89 / 3.56 | 1.74 / 3.45 | 1.22 / 2.48 | 1.27 / 2.44 |
| CV15-en | – | 9.47 | 7.79 | 10.01 | 9.89 | 7.61 | 6.05 | 5.94 |
| CV15-zh | – | 24.67 | 19.30 | 9.84 | 8.00 | 5.13 | 4.31 | 4.28 |
| Fleurs-en | 3.40 | 3.96 | 3.77 | 3.32 | 2.94 | 3.77 | 2.72 | 2.74 |
| Fleurs-zh | 2.69 | 12.22 | 7.98 | 2.44 | 2.71 | 2.54 | 2.20 | 2.19 |
| Multilingual ASR (WER) | | | | | | | | |
| Fleurs-avg (19 lang) | – | 15.67 | 8.09 | 4.48 | 5.55 | 14.04 | 5.33 | 5.31 |
| Lyric ASR (WER) | | | | | | | | |
| MIR-1K (vocal-only) | 6.45 | 23.33 | 18.73 | 11.87 | 9.85 | 8.15 | 5.90 | 5.85 |
| Opencpop-test | 2.98 | 31.01 | 16.06 | 7.93 | 6.49 | 2.84 | 1.54 | 2.02 |
| S2TT (BLEU) | | | | | | | | |
| Fleurs-en2xx | – | 30.35 | 37.85 | – | 39.25 | 29.22 | 37.50 | 36.22 |
| Fleurs-xx2en | – | 27.54 | 32.81 | – | 35.41 | 28.61 | 31.08 | 30.71 |
| Fleurs-zh2xx | – | 17.03 | 22.05 | – | 26.63 | 17.97 | 25.17 | 25.10 |
| Fleurs-xx2zh | – | 28.75 | 34.82 | – | 37.50 | 27.68 | 33.13 | 31.19 |
| Dataset | GPT-4o-Audio | Gemini-2.5-Flash | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Instruct | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|---|---|---|---|
| VoiceBench | | | | | | | | |
| AlpacaEval | 95.6 | 96.1 | 94.3 | 89.9 | 94.8 | 96.4 | 95.4 | 96.8 |
| CommonEval | 89.8 | 88.3 | 88.4 | 76.7 | 90.8 | 90.5 | 91.0 | 90.9 |
| WildVoice | 91.6 | 92.1 | 93.4 | 77.7 | 91.6 | 90.5 | 92.3 | 90.9 |
| SD-QA | 75.5 | 84.5 | 90.1 | 56.4 | 76.9 | 78.1 | 76.8 | 78.5 |
| MMSU | 80.3 | 66.1 | 71.1 | 61.7 | 68.1 | 83.0 | 68.4 | 84.3 |
| OpenBookQA | 89.2 | 56.9 | 92.3 | 80.9 | 89.7 | 94.3 | 91.4 | 95.0 |
| BBH | 84.1 | 83.9 | 92.6 | 66.7 | 80.4 | 88.9 | 80.6 | 89.6 |
| IFEval | 76.0 | 83.8 | 85.7 | 53.5 | 77.8 | 80.6 | 75.2 | 80.8 |
| AdvBench | 98.7 | 98.9 | 98.1 | 99.2 | 99.3 | 97.2 | 99.4 | 98.9 |
| Overall | 86.8 | 83.4 | 89.6 | 73.6 | 85.5 | 88.8 | 85.6 | 89.5 |
| Audio Reasoning | | | | | | | | |
| MMAU-v05.15.25 | 62.5 | 71.8 | 77.4 | 65.5 | 77.5 | 75.4 | 77.6 | 76.5 |
| MMSU | 56.4 | 70.2 | 77.7 | 62.6 | 69.0 | 70.2 | 69.1 | 71.3 |
| Dataset | Best Specialist Models | GPT-4o-Audio | Gemini-2.5-Pro | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|---|
| RUL-MuchoMusic | 47.6 (Audio Flamingo 3) | 36.1 | 49.4 | 47.3 | 52.0 | 52.1 |
| GTZAN Acc. | 87.9 (CLaMP 3) | 76.5 | 81.0 | 81.7 | 93.0 | 93.1 |
| MTG Genre Micro F1 | 35.8 (MuQ-MuLan) | 25.3 | 32.6 | 32.5 | 39.0 | 39.5 |
| MTG Mood/Theme Micro F1 | 10.9 (MuQ-MuLan) | 11.3 | 14.1 | 8.9 | 21.0 | 21.7 |
| MTG Instrument Micro F1 | 39.8 (MuQ-MuLan) | 34.2 | 33.0 | 22.6 | 40.5 | 40.7 |
| MTG Top50 Micro F1 | 33.2 (MuQ-MuLan) | 25.0 | 26.1 | 21.6 | 36.7 | 36.9 |
| MagnaTagATune Micro F1 | 41.6 (MuQ) | 29.2 | 28.1 | 30.1 | 44.3 | 46.8 |
Vision -> Text
| Datasets | GPT-4o | Gemini-2.0-Flash | Qwen2.5-VL 72B | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|
| General Visual Question Answering | | | | | |
| MMStar | 64.7 | 71.4 | 70.8 | 68.5 | 69.3 |
| HallusionBench | 55.0 | 56.3 | 55.2 | 59.7 | 58.5 |
| MM-MT-Bench | 7.7 | 6.7 | 7.6 | 7.4 | 7.6 |
| Math & STEM | | | | | |
| MMMU_val | 69.1 | 71.3 | 70.2 | 69.1 | 69.8 |
| MMMU_pro | 51.9 | 56.1 | 51.1 | 57.0 | 57.6 |
| MathVista_mini | 63.8 | 71.4 | 74.8 | 75.9 | 77.4 |
| MathVision_full | 30.4 | 48.6 | 38.1 | 56.3 | 58.3 |
| Document Understanding | | | | | |
| AI2D | 84.6 | 86.7 | 88.7 | 85.2 | 86.4 |
| ChartQA_test | 86.7 | 64.6 | 89.5 | 86.8 | 87.1 |
| Counting | | | | | |
| CountBench | 87.9 | 91.2 | 93.6 | 90.0 | 90.0 |
| Video Understanding | | | | | |
| Video-MME | 71.9 | 72.4 | 73.3 | 70.5 | 71.4 |
| LVBench | 30.8 | 57.9 | 47.3 | 50.2 | 51.1 |
| MLVU | 64.6 | 71.0 | 74.6 | 75.2 | 75.7 |
| Datasets | Gemini-2.5-Flash-Thinking | InternVL-3.5-241B-A28B | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|
| General Visual Question Answering | | | | |
| MMStar | 75.5 | 77.9 | 74.9 | 75.5 |
| HallusionBench | 61.1 | 57.3 | 62.8 | 63.4 |
| MM-MT-Bench | 7.8 | – | 8.0 | 8.0 |
| Math & STEM | | | | |
| MMMU_val | 76.9 | 77.7 | 75.6 | 75.0 |
| MMMU_pro | 65.8 | – | 60.5 | 60.8 |
| MathVista_mini | 77.6 | 82.7 | 80.0 | 81.2 |
| MathVision_full | 62.3 | 63.9 | 62.9 | 63.8 |
| Document Understanding | | | | |
| AI2D_test | 88.6 | 87.3 | 86.1 | 86.8 |
| ChartQA_test | – | 88.0 | 89.5 | 89.3 |
| Counting | | | | |
| CountBench | 88.6 | – | 88.6 | 92.5 |
| Video Understanding | | | | |
| Video-MME | 79.6 | 72.9 | 69.7 | 69.8 |
| LVBench | 64.5 | – | 49.0 | 49.5 |
| MLVU | 82.1 | 78.2 | 72.9 | 73.9 |
AudioVisual -> Text
| Datasets | Previous Open-source SoTA | Gemini-2.5-Flash | Qwen2.5-Omni | Qwen3-Omni-30B-A3B-Instruct | Qwen3-Omni-Flash-Instruct |
|---|---|---|---|---|---|
| WorldSense | 47.1 | 50.9 | 45.4 | 54.0 | 54.1 |
| Datasets | Previous Open-source SoTA | Gemini-2.5-Flash-Thinking | Qwen3-Omni-30B-A3B-Thinking | Qwen3-Omni-Flash-Thinking |
|---|---|---|---|---|
| DailyOmni | 69.8 | 72.7 | 75.8 | 76.2 |
| VideoHolmes | 55.6 | 49.5 | 57.3 | 57.3 |
Zero-shot Speech Generation
| Model | SEED test-zh | SEED test-en |
|---|---|---|
| Seed-TTS (ICL) | 1.11 | 2.24 |
| Seed-TTS (RL) | 1.00 | 1.94 |
| MaskGCT | 2.27 | 2.62 |
| E2 TTS | 1.97 | 2.19 |
| F5-TTS | 1.56 | 1.83 |
| Spark TTS | 1.20 | 1.98 |
| CosyVoice 2 | 1.45 | 2.57 |
| CosyVoice 3 | 0.71 | 1.45 |
| Qwen2.5-Omni-7B | 1.42 | 2.33 |
| Qwen3-Omni-30B-A3B | 1.07 | 1.39 |
Multilingual Speech Generation
| Language | Content Consistency: Qwen3-Omni-30B-A3B | Content Consistency: MiniMax | Content Consistency: ElevenLabs | Speaker Similarity: Qwen3-Omni-30B-A3B | Speaker Similarity: MiniMax | Speaker Similarity: ElevenLabs |
|---|---|---|---|---|---|---|
| Chinese | 0.716 | 2.252 | 16.026 | 0.772 | 0.780 | 0.677 |
| English | 1.069 | 2.164 | 2.339 | 0.773 | 0.756 | 0.613 |
| German | 0.777 | 1.906 | 0.572 | 0.738 | 0.733 | 0.614 |
| Italian | 1.067 | 1.543 | 1.743 | 0.742 | 0.699 | 0.579 |
| Portuguese | 1.872 | 1.877 | 1.331 | 0.770 | 0.805 | 0.711 |
| Spanish | 1.765 | 1.029 | 1.084 | 0.744 | 0.762 | 0.615 |
| Japanese | 3.631 | 3.519 | 10.646 | 0.763 | 0.776 | 0.738 |
| Korean | 1.670 | 1.747 | 1.865 | 0.778 | 0.776 | 0.700 |
| French | 2.505 | 4.099 | 5.216 | 0.689 | 0.628 | 0.535 |
| Russian | 3.986 | 4.281 | 3.878 | 0.759 | 0.761 | 0.676 |
Cross-Lingual Speech Generation
| Language Pair | Qwen3-Omni-30B-A3B | CosyVoice3 | CosyVoice2 |
|---|---|---|---|
| en-to-zh | 5.37 | 5.09 | 13.5 |
| ja-to-zh | 3.32 | 3.05 | 48.1 |
| ko-to-zh | 0.99 | 1.06 | 7.70 |
| zh-to-en | 2.76 | 2.98 | 6.47 |
| ja-to-en | 3.31 | 4.20 | 17.1 |
| ko-to-en | 3.34 | 4.19 | 11.2 |
| zh-to-ja | 8.29 | 7.08 | 13.1 |
| en-to-ja | 7.53 | 6.80 | 14.9 |
| ko-to-ja | 4.24 | 3.93 | 5.86 |
| zh-to-ko | 5.13 | 14.4 | 24.8 |
| en-to-ko | 4.96 | 5.87 | 21.9 |
| ja-to-ko | 6.23 | 7.92 | 21.5 |
Resources
Link: https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
Step-by-Step Process to Install & Run Qwen3-Omni-30B-A3B-Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, click the Create GPU Node button on the Dashboard, and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Qwen3-Omni-30B-A3B-Instruct, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- The full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based models like Qwen3-Omni-30B-A3B-Instruct
- Compatibility with CUDA 12.1.1, which is required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations — perfect for installing dependencies, running benchmarks, and launching models like Qwen3-Omni-30B-A3B-Instruct.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. The devel variant contains the full CUDA toolkit, including nvcc.
This setup ensures that Qwen3-Omni-30B-A3B-Instruct runs in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python version and Install the new version
Run the following commands to check the available Python version.
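For example:
python3 --version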
If you check the Python version, you'll see that the system ships with Python 3.8.1 by default. To install a newer version, you'll need the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following commands to register the new Python version and select it as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following commands to install and upgrade pip:
python3.11 -m ensurepip --upgrade
python3.11 -m pip install --upgrade pip setuptools wheel
python3.11 -m pip -V
Then, run the following command to check the version of pip:
pip --version
Step 12: Set Up Python Environment
Run the following commands to set up the Python environment:
python3.11 -m venv /opt/py311
source /opt/py311/bin/activate
python -V
pip -V
Step 13: Install Transformers, Accelerate & Qwen Omni Utils
Run the following commands to install transformers, accelerate & qwen omni utils:
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
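Optionally, you can sanity-check that these packages import cleanly before moving on (the qwen_omni_utils module name matches the script used later in this guide):
python -c "import transformers, accelerate, qwen_omni_utils; print(transformers.__version__)"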
Step 14: Install Wheel and Flash Attention
Run the following commands to install wheel and flash attention:
pip install wheel
pip install -U flash-attn --no-build-isolation
Step 15: Install PyTorch with GPU support
Run the following command to install torch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
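Once the install finishes, a quick check like the one below confirms that PyTorch sees the GPU and reports the CUDA version it was built against:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"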
Step 16: Install FFMPEG
Run the following command to install ffmpeg:
apt update
apt install -y ffmpeg libsndfile1
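You can verify the install with:
ffmpeg -version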
Step 17: Connect to Your GPU VM with a Code Editor
Before you start running the model script for Qwen3-Omni-30B-A3B-Instruct, it's a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we're using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
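If you use VS Code or Cursor over Remote-SSH, an entry like the following in ~/.ssh/config makes connecting easier; the host alias, user, IP, and key path below are placeholders, so substitute the values shown on your NodeShift deployment page:
# ~/.ssh/config (placeholder values)
Host nodeshift-gpu
    HostName <your-vm-ip-or-proxy-host>
    User <your-vm-user>
    Port 22
    IdentityFile ~/.ssh/<your-private-key>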
Step 18: Create the Script
Create a file (e.g., app.py) and add the following code:
import os
os.environ["TRANSFORMERS_ATTENTION_IMPLEMENTATION"] = "sdpa" # force SDPA
from transformers import (
Qwen3OmniMoeForConditionalGeneration,
Qwen3OmniMoeProcessor,
logging as hf_logging,
)
from qwen_omni_utils import process_mm_info
import torch
import soundfile as sf
hf_logging.set_verbosity_error()
MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
USE_AUDIO_IN_VIDEO = True
# Prefer SDPA; keep FA2 disabled
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
# Load model/processor
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
MODEL_PATH,
device_map="auto",
torch_dtype="auto",
attn_implementation="sdpa",
low_cpu_mem_usage=True,
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
# Conversation
conversation = [
{
"role": "user",
"content": [
{"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
{"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
{"type": "text", "text": "What can you see and hear? Answer in one short sentence."}
],
},
]
# Prepare inputs
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(
text=text,
audio=audios,
images=images,
videos=videos,
return_tensors="pt",
padding=True,
use_audio_in_video=USE_AUDIO_IN_VIDEO,
)
# Move to device; cast ONLY floating tensors
for k, v in inputs.items():
if isinstance(v, torch.Tensor):
if torch.is_floating_point(v):
inputs[k] = v.to(model.device, dtype=model.dtype)
else:
inputs[k] = v.to(model.device)
# Generate (ask for dict-style outputs)
with torch.inference_mode():
gen_out = model.generate(
**inputs,
speaker="Ethan",
use_audio_in_video=USE_AUDIO_IN_VIDEO,
return_dict_in_generate=True, # <-- ensure .sequences
thinker_return_dict_in_generate=True, # <-- thinker submodule returns dict too
# max_new_tokens=128, # (optional) keep runtime predictable
)
# Some builds return a tuple (text_out, audio); handle both
audio = None
text_out = gen_out
if isinstance(gen_out, tuple):
text_out, audio = gen_out
else:
audio = getattr(gen_out, "audio", None)
# Get sequences tensor whether it's a ModelOutput or a plain tensor
sequences = getattr(text_out, "sequences", text_out)
# Decode just the newly generated tokens
prompt_len = inputs["input_ids"].shape[1]
decoded = processor.batch_decode(
sequences[:, prompt_len:],
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)
print(decoded)
# Save audio if present
if audio is not None:
if isinstance(audio, torch.Tensor):
audio_np = audio.reshape(-1).detach().cpu().numpy()
else:
# already numpy or list
import numpy as np
audio_np = np.asarray(audio).reshape(-1)
sf.write("output.wav", audio_np, samplerate=24000)
print("Saved: output.wav")
What This Script Does
- Forces Transformers to use SDPA attention (not FlashAttention) for compatibility.
- Loads the Qwen3-Omni-30B-A3B-Instruct multimodal model + processor on GPU with torch_dtype="auto".
- Builds a multimodal chat: one image URL, one audio URL, and a short user text prompt.
- Uses qwen_omni_utils.process_mm_info + the processor to prepare tensors for text, audio, image (and optional video).
- Moves inputs to the model's device, casting only floating tensors to the model dtype (keeps integer IDs intact).
- Calls model.generate(...) with dict-style outputs enabled, speaker="Ethan", and thinker outputs on.
- Decodes just the newly generated text (skips the prompt tokens) and prints the response.
- If the model returns audio, saves it to output.wav at 24 kHz.
- Silences most HF warnings for a cleaner log.
- Uses torch.inference_mode() for memory/speed efficiency during generation.
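Since Step 14 already installed flash-attn, you can optionally try FlashAttention-2 instead of SDPA. The snippet below is a sketch of the change, not a guaranteed configuration; if you hit kernel or dtype errors, fall back to the SDPA version above:
# Optional: swap SDPA for FlashAttention-2 (also remove the
# TRANSFORMERS_ATTENTION_IMPLEMENTATION override and the torch.backends.cuda.* toggles)
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    torch_dtype=torch.bfloat16,          # BF16 is the usual pairing with FlashAttention-2
    attn_implementation="flash_attention_2",
    low_cpu_mem_usage=True,
)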
Step 19: Run the Script
Run the script with the following command:
python3 app.py
This will download the model checkpoints on the first run and print the generated response in the terminal (plus save output.wav if audio is returned).
Conclusion
That's it: you've got Qwen3-Omni-30B-A3B-Instruct running end-to-end on a NodeShift GPU VM, with clean SDPA attention (no FlashAttention headaches), multimodal inputs (image + audio), and text plus speech output. This setup is reproducible, stable, and ready for experiments, whether you're testing ASR/AV, building a voice chat demo, or benchmarking against your own datasets.
Next up, try serving with vLLM for throughput, switch to the Thinking variant for chain-of-thought text only, and plug in your own media streams. If you share results, tag NodeShift—we’d love to see what you build.
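If you go the vLLM route, a serving sketch might look like the command below; note that Qwen3-Omni support may require a recent vLLM build (check the model card for the exact install steps), and the flags are illustrative rather than tuned:
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --tensor-parallel-size 1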