Qwen3-VL-235B-A22B-Instruct is a Mixture-of-Experts (MoE) vision-language model with ~235B total parameters and ~22B active per token. It’s designed for image/video + text reasoning, tool-use, and long-context understanding (native 256K, extendable). Highlights:
- Visual agent skills (operate GUIs, invoke tools), visual coding (generate Draw.io/HTML/CSS/JS from media).
- Strong OCR (32 languages), spatial/temporal grounding for images and long videos.
- Uses architectural upgrades like Interleaved-MRoPE, DeepStack, and text–timestamp alignment for better long-horizon and video reasoning.
- Ships as an Instruct chat model with Transformers support; FlashAttention-2 is recommended for multi-image/video efficiency.
GPU Configuration (Inference, Rule-of-Thumb)
Scenario | Precision / Quant | Min VRAM (est.) | Good VRAM (est.) | Batch / Media Notes | Tips |
---|---|---|---|---|---|
Edge / Lightweight | 4-bit (Q4) | 16 GB | 24 GB | Single image, short prompts; small batches | Use AWQ/GPTQ; CPU/NVMe offload for KV if tight |
Affordable Single-GPU | 8-bit (Q8) | 24 GB | 32–40 GB | 1–2 imgs or short clips; modest batch | Enable flash_attention_2 ; reduce max_new_tokens |
Standard Quality (single GPU) | BF16/FP16 | 48 GB | 80 GB | Batch 1–2; multi-image OK; short video | Prefer A100/H100/L40S; paged KV cache |
High Throughput (server) | BF16/FP16 | 2×40–80 GB | 2×80 GB+ | Batch 4–8; multi-image/video stable | Tensor parallel --tensor-parallel-size 2 ; pin memory |
Long-context / Video-heavy | BF16 + TP | 4×80 GB | 4–8×80 GB | Long videos / hundreds of frames | Use paged-KV + chunked frames; cap resolution/FPS |
Notes
- Numbers are rough estimates for inference (not training). The MoE design activates only ~22B parameters per token (≈44 GB in BF16 for the active path), which cuts compute per token, but all ~235B weights (≈470 GB in BF16) still have to be stored somewhere, so the lower tiers above assume aggressive quantization and/or CPU/NVMe offload, plus room for the vision tower & caches.
- KV cache dominates at long contexts—reduce sequence length, frames/resolution, or rely on paged-KV/CPU offload (a rough sizing sketch follows after these notes).
- For Transformers, set attn_implementation="flash_attention_2" when available for big memory wins on multi-image/video.
- Quantized (Q4/Q8) builds are fine for many tasks; validate on your eval set.
- For OpenAI-style serving, prefer vLLM/SGLang with tensor parallel for 48–80 GB+ GPUs.
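To make the KV-cache note concrete, here is a rough sizing sketch. The layer/head numbers are illustrative placeholders, not the model’s confirmed configuration; read the real values from model.config after loading.
# Rough KV-cache sizing (illustrative config values; check model.config for the real ones)
layers, kv_heads, head_dim = 94, 4, 128
seq_len, batch, bytes_per_elem = 32_768, 1, 2   # BF16 = 2 bytes per element
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem   # K and V
print(f"KV cache ≈ {kv_bytes / 1024**3:.1f} GiB at {seq_len} tokens")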
A ready-to-paste HF Transformers snippet (with Qwen3VLMoeForConditionalGeneration + AutoProcessor) appears in Step 14; a minimal vLLM launch sketch follows below.
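For OpenAI-compatible serving, a hedged starting point is a plain vllm serve launch; the flag values are illustrative, and the tensor-parallel degree has to provide enough combined VRAM for the (possibly quantized) weights plus KV cache.
# Illustrative only: set --tensor-parallel-size to your GPU count, --max-model-len to your context needs
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct --tensor-parallel-size 8 --max-model-len 32768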
Model Performance
Multimodal Performance
Category | Benchmark | Score |
---|---|---|
STEM & Puzzle | MMMUVal | 78.7 |
| MMMU_Pro | 68.1 |
| MathVista_mini | 84.9 |
| MathVision | 66.5 |
| MathVerse_mini | 72.5 |
| VisLogic | 29.9 |
General VQA | MMBench_EN_V1.1_dev | 89.9 |
| RealWorldQA | 79.3 |
| MMStar | 78.4 |
| SimpleVQA | 63.0 |
Subjective Experience & Instruction Following | HallusionBench | 63.2 |
| MM_MT_Bench | 8.5 |
| MIAbench | 91.3 |
| MMLongBenchDoc | 57.0 |
Text Recognition & Chart/Document Understanding | DocVQA_TEST | 97.1 |
| InfoVQA_TEST | 89.2 |
| AI2D_TEST | 89.7 |
| OCRBench | 920.0 |
| OCRBenchV2 (en/ch) | 67.1 / 61.8 |
| CC_OCR | 82.2 |
| CharXiv (RQ) | 62.1 |
2D/3D Grounding | RefCOCO_avg | 91.9 |
| CountBench | 93.0 |
| ODinW13 | 48.6 |
| ARKitScenes | 56.9 |
| Hypersim | 13.0 |
| SUNRGBD | 39.4 |
Multi-Image | BLINK | 70.7 |
| MuirBench | 72.8 |
| ERQA | 51.3 |
Embodied & Spatial Understanding | VSI-Bench | 62.6 |
| EmbSpatialBench | 83.1 |
| RefSpatialBench | 65.5 |
| RoboSpatialHome | 69.5 |
Video | VideoMME (w/o sub) | 79.2 |
| MLVU | 84.3 |
| LVBench | 67.7 |
| CharadesSTA | 64.8 |
| VideoMMMU | 74.7 |
Agent | ScreenSpot | 95.4 |
| ScreenSpot Pro | 62.0 |
| OSWorldG | 66.7 |
| AndroidWorld | 63.7 |
Coding | Design2Code | 92.0 |
| ChartMimic_v2_Direct | 80.5 |
| UniSvg | 69.3 |
Pure Text Performance
Category | Benchmark | Score |
---|---|---|
Knowledge | MMLU | 88.8 |
Knowledge | MMLU-Pro | 81.8 |
Knowledge | MMLU-Redux | 92.2 |
Knowledge | SuperGPQA | 60.4 |
Knowledge | SimpleQA | 51.9 |
Knowledge | CSimpleQA | 83.4 |
Reasoning | AIME25 | 74.7 |
Reasoning | HMMT25 | 57.4 |
Reasoning | LiveBench1125 | 74.8 |
Code | LCBV5 | 61.4 |
Code | LCBV6 (25.02–25.05) | 54.3 |
Code | MultiPL-E | 86.1 |
Instruction Following | SIFO | 60.5 |
Instruction Following | SIFO-multiturn | 63.7 |
Instruction Following | IFEval | 87.8 |
Subjective Evaluation | Arena-Hard v2 | 77.4 |
Subjective Evaluation | Creative Writing v3 | 86.5 |
Subjective Evaluation | WritingBench | 85.5 |
Agent | BFCL-v3 | 67.7 |
Multilingual | MultiIF | 76.3 |
Multilingual | MMLU-ProX | 77.8 |
Multilingual | INCLUDE | 80.0 |
Resources
Link: https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
Step-by-Step Process to Install & Run Qwen3-VL-235B-A22B-Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H200s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x H200 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image (Use the Jupyter Template)
We’ll use the Jupyter image from NodeShift’s gallery so you don’t have to install Jupyter Notebook/Lab manually. This image is GPU-ready and comes with a preconfigured Python + Jupyter environment—perfect for testing and serving Qwen3-VL-235B-A22B-Instruct.
What you’ll do
- Pick the Jupyter template,
- Pick a CUDA/PyTorch variant if the UI offers it,
- Open JupyterLab in your browser,
- Install the few project-specific Python packages inside that environment.
How to select it
- In the Create VM flow, go to Choose an Image → Templates.
- Click Jupyter (see screenshot). You’ll see a short description like “A web-based interactive computing platform for data science.”
- If a version/stack dropdown appears, choose the latest CUDA 12.x / PyTorch variant (or “GPU-enabled” build).
- Click Create (or Next) to proceed to sizing and networking.
Why this image
- JupyterLab is already installed and enabled as a service, so the VM boots straight into a working notebook server.
- GPU drivers + CUDA runtime are aligned with the template, so PyTorch will detect your GPU out of the box.
- You can manage everything (terminals, notebooks, file browser) from the Jupyter UI—no extra desktop or VNC needed.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Access Your Deployment
Once your GPU VM is in the RUNNING state, you’ll see a control menu (three dots on the right side of the deployment card). This menu gives you multiple ways to access and manage your deployment.
Available Options
- Edit Name
Rename your deployment for easier identification (e.g., “Qwen3-VL-235B-A22B-Instruct”).
- Open Jupyter Notebook
- Click this to launch the pre-installed Jupyter environment directly in your browser.
- You’ll be taken to JupyterLab, where you can open notebooks, create terminals, and run code cells to set up Qwen3-VL-235B-A22B-Instruct.
- This is the most user-friendly way to start working immediately without additional setup.
- Connect with SSH
- Choose this if you prefer command-line access.
- You’ll get the SSH connection string (e.g., ssh -i <your-key> user@<vm-ip>).
- Use this method for advanced management, server setups (like vLLM/SGLang), or installing additional system packages.
- Show Logs
- View system/service logs for debugging (useful if something isn’t starting correctly).
- Helps verify GPU initialization or catch errors during startup.
- Update Tags
- Add labels or tags to organize multiple deployments.
- Example: tag by project, model type, or experiment.
- Destroy Unit
- This permanently shuts down and deletes your VM.
- Use only when you are done, as this action cannot be undone.
Step 8: Open Jupyter Notebook
Once your VM is running, you can directly access the Jupyter Notebook environment provided by NodeShift. This will be your main workspace for running Qwen3-VL-235B-A22B-Instruct.
1. Click Open Jupyter Notebook
- From the My GPU Deployments panel, click the three-dot menu on your deployment card.
- Select Open Jupyter Notebook.
This will open a new browser tab pointing to your VM’s Jupyter instance.
2. Handle the Browser Security Warning
Since the Jupyter server is running with a self-signed SSL certificate, your browser may show a “Your connection is not private” warning.
- Click Advanced.
- Then, click Proceed to <your-vm-ip> (unsafe).
Don’t worry — this is expected. You’re connecting directly to your VM’s Jupyter server, not a public website.
3. JupyterLab Interface Opens
Once you proceed, you’ll land inside JupyterLab. Here you’ll see:
- Notebook options (Python 3, Python 3.10, etc.)
- Console options (interactive shells)
- Other tools like a Terminal, Text File, and Markdown File.
You can now use the Terminal inside JupyterLab to install dependencies and start working with Qwen3-VL-235B-A22B-Instruct.
Step 9: Open Python 3.10 Notebook and Rename
Now that JupyterLab is running, let’s create a notebook where we will set up and run Qwen3-VL-235B-A22B-Instruct.
1. Open a Python 3.10 Notebook
- In the Launcher screen, under Notebook, click on Python3.10 (python_310).
- This will open a new notebook editor with an empty code cell where you can type commands.
2. Rename the Notebook
- By default, the notebook will open as something like Untitled.ipynb.
- To rename:
- Right-click on the notebook tab name at the top.
- Select Rename Notebook….
- Enter a meaningful name such as:
Qwen3-VL-235B-A22B-Instruct.ipynb
3. Verify the Editor
- You should now see an empty notebook named Qwen3-VL-235B-A22B-Instruct.ipynb with a code cell ready.
- This is where you’ll run all the setup commands (installing dependencies, loading the model, and running image/video prompts).
Step 10: Verify the GPU
In a new notebook cell:
!nvidia-smi
You should see your GPU (e.g., H200, H100 80GB, A100 80GB, etc.).
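If you prefer a compact readout, nvidia-smi’s query mode prints just the fields you care about:
!nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv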
Step 11: Install the Vision-Language Stack (In This Jupyter Kernel)
We’ll install PyTorch (CUDA wheels), the latest Transformers with Qwen3-VL support, and a few helpers for image/video I/O. Paste the cell below into your notebook and run it.
import sys
# 1) Core stack (PyTorch CUDA 12.1 wheels)
!{sys.executable} -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 2) Transformers (Qwen3-VL code lives on latest main)
!{sys.executable} -m pip install -U "git+https://github.com/huggingface/transformers"
# 3) Utilities
!{sys.executable} -m pip install -U accelerate huggingface_hub pillow timm einops
# 4) Video support (optional, for .mp4/.webm clips)
!{sys.executable} -m pip install -U decord av
# 5) Build tools (handy for native wheels / optional extras)
!{sys.executable} -m pip install -U pip setuptools wheel ninja packaging
What This Does
- Installs PyTorch with CUDA 12.1 so tensors run on your GPU.
- Pulls Transformers from GitHub (latest Qwen3-VL code path).
- Adds accelerate / huggingface_hub for loading models, pillow/timm/einops for vision ops.
- Adds decord/av if you’ll analyze video.
- Updates build tools (helps when compiling optional packages).
When the cell finishes, you should see Successfully installed … lines. Next, optionally install FlashAttention-2 (Step 12), run the version/CUDA sanity check (Step 13), and then load the model (Step 14).
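Optional check: confirm the video backends imported cleanly (either one is enough for the video example in Step 15).
# Quick check that at least one optional video backend is importable
for name in ("decord", "av"):
    try:
        mod = __import__(name)
        print(name, "OK", getattr(mod, "__version__", ""))
    except Exception as e:
        print(name, "not available:", e)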
Step 12: Install FlashAttention-2 (Optional Speed/VRAM Boost)
FlashAttention-2 makes multi-image and video prompts faster and more memory-efficient. Install it into the current Jupyter kernel (if no prebuilt wheel matches your environment, pip will compile it from source, which can take a while).
Run in a notebook cell:
import sys
# Install (uses your notebook's Python)
!{sys.executable} -m pip install -U flash-attn --no-build-isolation
What success looks like: the cell ends with a Successfully installed flash-attn-... line.
Step 13: Hugging Face Login + Runtime Sanity Check
Log in so the VM can download the weights, then confirm FlashAttention2 and your CUDA/versions are good.
# 13.1 Check FlashAttention2 (optional speed/VRAM saver)
from transformers.utils import is_flash_attn_2_available
print("FlashAttention2 available:", is_flash_attn_2_available())
# 13.2 Authenticate to Hugging Face (use a READ token)
from huggingface_hub import login, whoami
login("hf_your_access_token_here") # ← paste your token
print(whoami()) # shows your HF username if auth worked
# 13.3 Sanity-check CUDA and library versions
import torch, transformers, huggingface_hub
print("Torch:", torch.__version__, "| CUDA ok:", torch.cuda.is_available(), "| CUDA:", torch.version.cuda)
print("Transformers:", transformers.__version__)
print("HF Hub:", huggingface_hub.__version__)
Success looks like
- FlashAttention2 available: True (nice to have; False is fine too).
- whoami() prints your HF username.
- CUDA ok: True with a CUDA version (e.g., 12.1), and recent transformers / huggingface_hub versions.
Tips
- Prefer a read-only token; don’t commit it to git (see the environment-variable sketch below).
- If a token was exposed anywhere, revoke it and create a new one in HF settings.
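A minimal sketch for the environment-variable approach, assuming you exported a (hypothetical) HF_TOKEN variable on the VM before starting Jupyter:
import os
from huggingface_hub import login
# Reads the token from the environment instead of hard-coding it in the notebook
login(token=os.environ["HF_TOKEN"])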
Step 14: Load the Model + Processor (Enable FA2 if Available) and Warm-Up
Paste this whole cell and run it:
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor
from transformers.utils import is_flash_attn_2_available
import torch
MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"
# Use FlashAttention-2 if present; otherwise fall back to PyTorch SDPA.
attn_impl = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
print("Using attention implementation:", attn_impl)
# Small perf knobs (safe on Ampere/Hopper)
torch.backends.cuda.matmul.allow_tf32 = True
# Load weights onto GPU; bf16/fp16 chosen automatically
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16, # or "auto"; bf16 is ideal on A100/H100/L40S
device_map="auto",
attn_implementation=attn_impl,
low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Optional: show where modules live (should be mostly on cuda:0)
print("Non-CPU parts in device map:",
[k for k, v in getattr(model, "hf_device_map", {}).items() if v != "cpu"])
# Quick warm-up (tiny text-only) to initialize kernels
msgs = [{"role":"user","content":[{"type":"text","text":"Hello!"}]}]
inputs = processor.apply_chat_template(
msgs, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
_ = model.generate(**inputs, max_new_tokens=4, do_sample=False)
print("Model & processor loaded and warmed up.")
Step 15: Interact with Model
Use a Local Image and Short Generation
Remote URLs can stall, so we download the image ourselves and pass a PIL object; keep max_new_tokens small until you’ve confirmed generation is fast.
import io, requests
from PIL import Image
img_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
img = Image.open(io.BytesIO(requests.get(img_url, timeout=20).content)).convert("RGB")
messages = [{"role":"user","content":[
{"type":"image","image":img},
{"type":"text","text":"Describe this image in one sentence."}
]}]
inputs = processor.apply_chat_template(
messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
out = model.generate(
**inputs,
max_new_tokens=32, # small to start
do_sample=False # deterministic & faster
)
trimmed = [o[len(i):] for i,o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
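Multi-image prompts reuse the same chat format; just add more image entries. A minimal sketch, reusing img from above plus a hypothetical second file at ./second.jpg (swap in your own path):
# Multi-image prompt: same schema, multiple image entries (second path is a placeholder)
img2 = Image.open("./second.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img},
    {"type": "image", "image": img2},
    {"type": "text", "text": "Compare these two images in two sentences."},
]}]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
                                       return_dict=True, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])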
Video Test
Needs decord or av. Keep short clips / few frames to start.
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": "https://huggingface.co/datasets/Narsil/video-demo/resolve/main/short_ski.mp4"},
{"type": "text", "text": "What is happening? Summarize briefly."},
],
}
]
inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
return_dict=True, return_tensors="pt").to(model.device)
with torch.inference_mode():
out = model.generate(**inputs, max_new_tokens=128)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
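The image example above uses greedy decoding (do_sample=False). If you want more varied wording, sampling also works; the temperature/top_p values here are illustrative, not tuned recommendations.
# Sampled decoding (illustrative settings) on the same prepared inputs
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.8)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])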
Conclusion
You just took Qwen3-VL-235B-A22B-Instruct from zero to working on a Jupyter GPU VM: verified the GPU, installed the VL stack, (optionally) enabled FlashAttention-2, authenticated with Hugging Face, loaded the MoE VL weights, and ran image/video prompts. The model shines on long-context, multi-image/video reasoning, but it’s heavyweight: the full ~235B parameters want multiple 80 GB-class GPUs in BF16, and single-GPU setups lean on quantization and CPU/NVMe offload with shorter contexts and smaller batches. Keep performance smooth by using local images, shorter max_new_tokens, FA2 when available, and watching the device map to avoid unnecessary CPU offload. From here, you can productionize with vLLM/SGLang (tensor parallel for >1 GPU), test Q4/Q8 quantized variants, and evaluate on your own benchmarks to tune quality vs. latency. This playbook gives you a clear, reproducible path to deploy Qwen3-VL for real multimodal work.