Qwen3-VL-235B-A22B-Instruct is a Mixture-of-Experts (MoE) vision-language model with ~235B total parameters and ~22B active per token. It’s designed for image/video + text reasoning, tool-use, and long-context understanding (native 256K, extendable). Highlights:
- Visual agent skills (operate GUIs, invoke tools), visual coding (generate Draw.io/HTML/CSS/JS from media).
- Strong OCR (32 languages), spatial/temporal grounding for images and long videos.
- Uses architectural upgrades like Interleaved-MRoPE, DeepStack, and text–timestamp alignment for better long-horizon and video reasoning.
- Ships as an Instruct chat model with Transformers support; FlashAttention-2 is recommended for multi-image/video efficiency.
GPU Configuration (Inference, Rule-of-Thumb)
Scenario | Precision / Quant | Min VRAM (est.) | Good VRAM (est.) | Batch / Media Notes | Tips |
---|---|---|---|---|---|
Edge / Lightweight | 4-bit (Q4) | 16 GB | 24 GB | Single image, short prompts; small batches | Use AWQ/GPTQ; CPU/NVMe offload for KV if tight |
Affordable Single-GPU | 8-bit (Q8) | 24 GB | 32–40 GB | 1–2 imgs or short clips; modest batch | Enable flash_attention_2 ; reduce max_new_tokens |
Standard Quality (single GPU) | BF16/FP16 | 48 GB | 80 GB | Batch 1–2; multi-image OK; short video | Prefer A100/H100/L40S; paged KV cache |
High Throughput (server) | BF16/FP16 | 2×40–80 GB | 2×80 GB+ | Batch 4–8; multi-image/video stable | Tensor parallel --tensor-parallel-size 2 ; pin memory |
Long-context / Video-heavy | BF16 + TP | 4×80 GB | 4–8×80 GB | Long videos / hundreds of frames | Use paged-KV + chunked frames; cap resolution/FPS |
Notes
- Numbers are rough estimates for inference (not training). The MoE design activates only ~22B parameters per token (≈44 GB in BF16 for the active path), which cuts compute per token, but all ~235B weights (≈470 GB in BF16) still have to be stored somewhere, so the lower tiers above assume aggressive quantization and/or CPU/NVMe offload, plus room for the vision tower & caches.
- KV cache dominates at long contexts—reduce sequence length, frames/resolution, or rely on paged-KV/CPU offload (a rough sizing sketch follows after these notes).
- For Transformers, set attn_implementation="flash_attention_2" when available for big memory wins on multi-image/video.
- Quantized (Q4/Q8) builds are fine for many tasks; validate on your eval set.
- For OpenAI-style serving, prefer vLLM/SGLang with tensor parallel for 48–80 GB+ GPUs.
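To make the KV-cache note concrete, here is a rough sizing sketch. The layer/head numbers are illustrative placeholders, not the model’s confirmed configuration; read the real values from model.config after loading.
# Rough KV-cache sizing (illustrative config values; check model.config for the real ones)
layers, kv_heads, head_dim = 94, 4, 128
seq_len, batch, bytes_per_elem = 32_768, 1, 2   # BF16 = 2 bytes per element
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem   # K and V
print(f"KV cache ≈ {kv_bytes / 1024**3:.1f} GiB at {seq_len} tokens")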
A ready-to-paste HF Transformers snippet (with Qwen3VLMoeForConditionalGeneration + AutoProcessor) appears in Step 14; a minimal vLLM launch sketch follows below.
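For OpenAI-compatible serving, a hedged starting point is a plain vllm serve launch; the flag values are illustrative, and the tensor-parallel degree has to provide enough combined VRAM for the (possibly quantized) weights plus KV cache.
# Illustrative only: set --tensor-parallel-size to your GPU count, --max-model-len to your context needs
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct --tensor-parallel-size 8 --max-model-len 32768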
Model Performance
Multimodal Performance
Category | Benchmark | Score |
---|---|---|
STEM & Puzzle | MMMUVal | 78.7 |
| MMMU_Pro | 68.1 |
| MathVista_mini | 84.9 |
| MathVision | 66.5 |
| MathVerse_mini | 72.5 |
| VisLogic | 29.9 |
General VQA | MMBench_EN_V1.1_dev | 89.9 |
| RealWorldQA | 79.3 |
| MMStar | 78.4 |
| SimpleVQA | 63.0 |
Subjective Experience & Instruction Following | HallusionBench | 63.2 |
| MM_MT_Bench | 8.5 |
| MIAbench | 91.3 |
| MMLongBenchDoc | 57.0 |
Text Recognition & Chart/Document Understanding | DocVQA_TEST | 97.1 |
| InfoVQA_TEST | 89.2 |
| AI2D_TEST | 89.7 |
| OCRBench | 920.0 |
| OCRBenchV2 (en/ch) | 67.1 / 61.8 |
| CC_OCR | 82.2 |
| CharXiv (RQ) | 62.1 |
2D/3D Grounding | RefCOCO_avg | 91.9 |
| CountBench | 93.0 |
| ODinW13 | 48.6 |
| ARKitScenes | 56.9 |
| Hypersim | 13.0 |
| SUNRGBD | 39.4 |
Multi-Image | BLINK | 70.7 |
| MuirBench | 72.8 |
| ERQA | 51.3 |
Embodied & Spatial Understanding | VSI-Bench | 62.6 |
| EmbSpatialBench | 83.1 |
| RefSpatialBench | 65.5 |
| RoboSpatialHome | 69.5 |
Video | VideoMME (w/o sub) | 79.2 |
| MLVU | 84.3 |
| LVBench | 67.7 |
| CharadesSTA | 64.8 |
| VideoMMMU | 74.7 |
Agent | ScreenSpot | 95.4 |
| ScreenSpot Pro | 62.0 |
| OSWorldG | 66.7 |
| AndroidWorld | 63.7 |
Coding | Design2Code | 92.0 |
| ChartMimic_v2_Direct | 80.5 |
| UniSvg | 69.3 |
Pure Text Performance
Category | Benchmark | Score |
---|---|---|
Knowledge | MMLU | 88.8 |
Knowledge | MMLU-Pro | 81.8 |
Knowledge | MMLU-Redux | 92.2 |
Knowledge | SuperGPQA | 60.4 |
Knowledge | SimpleQA | 51.9 |
Knowledge | CSimpleQA | 83.4 |
Reasoning | AIME25 | 74.7 |
Reasoning | HMMT25 | 57.4 |
Reasoning | LiveBench1125 | 74.8 |
Code | LCBV5 | 61.4 |
Code | LCBV6 (25.02–25.05) | 54.3 |
Code | MultiPL-E | 86.1 |
Instruction Following | SIFO | 60.5 |
Instruction Following | SIFO-multiturn | 63.7 |
Instruction Following | IFEval | 87.8 |
Subjective Evaluation | Arena-Hard v2 | 77.4 |
Subjective Evaluation | Creative Writing v3 | 86.5 |
Subjective Evaluation | WritingBench | 85.5 |
Agent | BFCL-v3 | 67.7 |
Multilingual | MultiIF | 76.3 |
Multilingual | MMLU-ProX | 77.8 |
Multilingual | INCLUDE | 80.0 |
Resources
Link: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
Step-by-Step Process to Install & Run Qwen3-VL-235B-A22B-Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving Qwen3-VL-235B-A22B-Instruct.
What you’ll do
- Pick the Jupyter template,
- Pick a CUDA/PyTorch variant if the UI offers it,
- Open JupyterLab in your browser,
- Install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “Qwen3-VL-235B-A22B-Instruct”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up Qwen3-VL-235B-A22B-Instruct.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running Qwen3-VL-235B-A22B-Instruct.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with Qwen3-VL-235B-A22B-Instruct.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run Qwen3-VL-235B-A22B-Instruct.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as:
Qwen3-VL-235B-A22B-Instruct.ipynb
3. Verify the Editor
- You should now see an empty notebook named Qwen3-VL-235B-A22B-Instruct.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and running image/video prompts).
Step 10: Verify the GPU
In a new notebook cell:
!nvidia-smi
You should see your GPU (e.g., H200, H100 80GB, A100 80GB, etc.).
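If you prefer a compact readout, nvidia-smi’s query mode prints just the fields you care about:
!nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv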
Step 11: Install the Vision-Language Stack (In This Jupyter Kernel)
We’ll install PyTorch (CUDA wheels), the latest Transformers with Qwen3-VL support, and a few helpers for image/video I/O. Paste the cell below into your notebook and run it.
import sys
# 1) Core stack (PyTorch CUDA 12.1 wheels)
!{sys.executable} -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 2) Transformers (Qwen3-VL code lives on latest main)
!{sys.executable} -m pip install -U "git+https://github.com/huggingface/transformers"
# 3) Utilities
!{sys.executable} -m pip install -U accelerate huggingface_hub pillow timm einops
# 4) Video support (optional, for .mp4/.webm clips)
!{sys.executable} -m pip install -U decord av
# 5) Build tools (handy for native wheels / optional extras)
!{sys.executable} -m pip install -U pip setuptools wheel ninja packaging
What This Does
- Installs PyTorch with CUDA 12.1 so tensors run on your GPU.
- Pulls Transformers from GitHub (latest Qwen3-VL code path).
- Adds accelerate / huggingface_hub for loading models, pillow/timm/einops for vision ops.
- Adds decord/av if you’ll analyze video.
- Updates build tools (helps when compiling optional packages).
When the cell finishes, you should see Successfully installed … lines. Next, optionally install FlashAttention-2 (Step 12), run the version/CUDA sanity check (Step 13), and then load the model (Step 14).
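Optional check: confirm the video backends imported cleanly (either one is enough for the video example in Step 15).
# Quick check that at least one optional video backend is importable
for name in ("decord", "av"):
    try:
        mod = __import__(name)
        print(name, "OK", getattr(mod, "__version__", ""))
    except Exception as e:
        print(name, "not available:", e)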
Step 12: Install FlashAttention-2 (Optional Speed/VRAM Boost)
FlashAttention-2 makes multi-image and video prompts faster and more memory-efficient. Install it into the current Jupyter kernel (if no prebuilt wheel matches your environment, pip will compile it from source, which can take a while).
Run in a notebook cell:
import sys
# Install (uses your notebook's Python)
!{sys.executable} -m pip install -U flash-attn --no-build-isolation
What success looks like: the cell ends with a Successfully installed flash-attn-... line.
Step 13: Hugging Face Login + Runtime Sanity Check
Log in so the VM can download the weights, then confirm FlashAttention2 and your CUDA/versions are good.
# 13.1 Check FlashAttention2 (optional speed/VRAM saver)
from transformers.utils import is_flash_attn_2_available
print("FlashAttention2 available:", is_flash_attn_2_available())
# 13.2 Authenticate to Hugging Face (use a READ token)
from huggingface_hub import login, whoami
login("hf_your_access_token_here") # ← paste your token
print(whoami()) # shows your HF username if auth worked
# 13.3 Sanity-check CUDA and library versions
import torch, transformers, huggingface_hub
print("Torch:", torch.__version__, "| CUDA ok:", torch.cuda.is_available(), "| CUDA:", torch.version.cuda)
print("Transformers:", transformers.__version__)
print("HF Hub:", huggingface_hub.__version__)
Success looks like
- FlashAttention2 available: True (nice to have; False is fine too).
- whoami() prints your HF username.
- CUDA ok: True with a CUDA version (e.g., 12.1), and recent transformers / huggingface_hub versions.
Tips
- Prefer a read-only token; don’t commit it to git (see the environment-variable sketch below).
- If a token was exposed anywhere, revoke it and create a new one in HF settings.
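A minimal sketch for the environment-variable approach, assuming you exported a (hypothetical) HF_TOKEN variable on the VM before starting Jupyter:
import os
from huggingface_hub import login
# Reads the token from the environment instead of hard-coding it in the notebook
login(token=os.environ["HF_TOKEN"])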
Step 14: Load the Model + Processor (Enable FA2 if Available) and Warm-Up
Paste this whole cell and run it:
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
from transformers.utils import is_flash_attn_2_available
import torch
MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
# Use FlashAttention-2 if present; otherwise fall back to PyTorch SDPA.
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
print("Using attention implementation:", attn_impl)
# Small perf knobs (safe on Ampere/Hopper)
torch.backends.cuda.matmul.allow_tf32 = True
# Load weights onto GPU; bf16/fp16 chosen automatically
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16, # or "auto"; bf16 is ideal on A100/H100/L40S
device_map="auto",
attn_implementation=attn_impl,
low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Optional: show where modules live (should be mostly on cuda:0)
print("Non-CPU parts in device map:",
[k for k, v in getattr(model, "hf_device_map", {}).items() if v != "cpu"])
# Quick warm-up (tiny text-only) to initialize kernels
msgs = [{"role":"user","content":[{"type":"text","text":"Hello!"}]}]
inputs = processor.apply_chat_template(
msgs, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
_ = model.generate(**inputs, max_new_tokens=4, do_sample=False)
print("Model & processor loaded and warmed up.")
Step 15: Interact with Model
Use a Local Image and Short Generation
Remote URLs can stall, so we download the image ourselves and pass a PIL object; keep max_new_tokens small until you’ve confirmed generation is fast.
import io, requests
from PIL import Image
img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
img = Image.open(io.BytesIO(requests.get(img_url, timeout=20).content)).convert("RGB")
messages = [{"role":"user","content":[
{"type":"image","image":img},
{"type":"text","text":"Describe this image in one sentence."}
]}]
inputs = processor.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
out = model.generate(
**inputs,
max_new_tokens=32, # small to start
do_sample=False # deterministic & faster
)
trimmed = [o[len(i):] for i,o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
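Multi-image prompts reuse the same chat format; just add more image entries. A minimal sketch, reusing img from above plus a hypothetical second file at ./second.jpg (swap in your own path):
# Multi-image prompt: same schema, multiple image entries (second path is a placeholder)
img2 = Image.open("./second.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img},
    {"type": "image", "image": img2},
    {"type": "text", "text": "Compare these two images in two sentences."},
]}]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
                                       return_dict=True, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])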
Video Test
Needs decord or av. Keep short clips / few frames to start.
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": "https://huggingface.co/datasets/Narsil/video-demo/resolve/main/short_ski.mp4"},
{"type": "text", "text": "What is happening? Summarize briefly."},
],
}
]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt").to(model.device)
with torch.inference_mode():
out = model.generate(**inputs, max_new_tokens=128)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
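The image example above uses greedy decoding (do_sample=False). If you want more varied wording, sampling also works; the temperature/top_p values here are illustrative, not tuned recommendations.
# Sampled decoding (illustrative settings) on the same prepared inputs
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.8)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])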
Conclusion
You just took Qwen3-VL-235B-A22B-Instruct from zero to working on a Jupyter GPU VM: verified the GPU, installed the VL stack, (optionally) enabled FlashAttention-2, authenticated with Hugging Face, loaded the MoE VL weights, and ran image/video prompts. The model shines on long-context, multi-image/video reasoning, but it’s heavyweight: the full ~235B parameters want multiple 80 GB-class GPUs in BF16, and single-GPU setups lean on quantization and CPU/NVMe offload with shorter contexts and smaller batches. Keep performance smooth by using local images, shorter max_new_tokens, FA2 when available, and watching the device map to avoid unnecessary CPU offload. From here, you can productionize with vLLM/SGLang (tensor parallel for >1 GPU), test Q4/Q8 quantized variants, and evaluate on your own benchmarks to tune quality vs. latency. This playbook gives you a clear, reproducible path to deploy Qwen3-VL for real multimodal work.