Imagine running a state-of-the-art language model with a 256K context window, hybrid reasoning, and agent-level intelligence, all on your local machine. Meet Hunyuan, Tencent’s powerful new family of open-source models built for versatility, speed, and long-context reasoning. Whether you’re building production-grade AI agents or running inference on edge devices, Hunyuan delivers exceptional performance at every scale, from the lightweight 0.5B to the robust 7B variant. With Grouped Query Attention (GQA) for faster inference, quantization support for low-resource deployment, and benchmarks that rival top-tier models (e.g., 88.25 on GSM8K and 82.95 on BBH for the 7B pretrained variant), Hunyuan models are engineered to handle real-world tasks with ease. What’s more, they come in both pretrained and instruction-tuned formats, optimized for coding, reasoning, science, and agent tasks. If you’re tired of LLMs that falter with long documents or lack local flexibility, Hunyuan is your next must-try.
In this guide, we’ll walk you through how to install and run Hunyuan 7B or 1.8B locally, unlocking cutting-edge AI right from your own machine, no GPU farm required.
Prerequisites
The minimum system requirements for running this model are:
- GPU: 1x RTX 4090 or 1x RTX A6000
- Storage: at least 50 GB (more is preferable)
- VRAM: at least 16 GB
- Anaconda installed
Step-by-step process to install and run Hunyuan 7B or 1.8B
For the purpose of this tutorial, we’ll use a GPU-powered Virtual Machine by NodeShift since it provides high compute Virtual Machines at a very affordable cost on a scale that meets GDPR, SOC2, and ISO27001 requirements. Also, it offers an intuitive and user-friendly interface, making it easier for beginners to get started with Cloud deployments. However, feel free to use any cloud provider of your choice and follow the same steps for the rest of the tutorial.
Step 1: Setting up a NodeShift Account
Visit app.nodeshift.com and create an account by filling in basic details, or continue signing up with your Google/GitHub account.
If you already have an account, log in straight to your dashboard.
Step 2: Create a GPU Node
After accessing your account, you should see a dashboard (see image). Now:
- Navigate to the menu on the left side.
- Click on the GPU Nodes option.
- Click on Start to begin creating your very first GPU node.
These GPU nodes are GPU-powered virtual machines by NodeShift. These nodes are highly customizable and let you control different environmental configurations for GPUs ranging from H100s to A100s, CPUs, RAM, and storage, according to your needs.
Step 3: Selecting configuration for GPU (model, region, storage)
- For this tutorial, we’ll be using a 1x RTX A6000 GPU; however, you can choose any GPU that meets the prerequisites.
- Similarly, we’ll opt for 200GB storage by sliding the bar. You can also select the region where you want your GPU to reside from the available ones.
Step 4: Choose GPU Configuration and Authentication method
1. After selecting your required configuration options, you’ll see the available GPU nodes in your region that match (or come very close to) your configuration. In our case, we’ll choose a 1x RTX A6000 48GB GPU node with 64vCPUs/63GB RAM/200GB SSD.
2. Next, you’ll need to select an authentication method. Two methods are available: Password and SSH Key. We recommend using SSH keys, as they are a more secure option. To create one, head over to our official documentation.
Step 5: Choose an Image
The final step is to choose an image for the VM, which in our case is Nvidia Cuda.
That’s it! You are now ready to deploy the node. Finalize the configuration summary, and if it looks good, click Create to deploy the node.
Step 6: Connect to active Compute Node using SSH
- As soon as you create the node, it will be deployed within a few seconds or a minute. Once deployed, you will see the status Running in green, meaning that your Compute node is ready to use!
- Once your GPU shows this status, navigate to the three dots on the right, click on Connect with SSH, and copy the SSH details that appear.
As you copy the details, follow the below steps to connect to the running GPU VM via SSH:
1. Open your terminal, paste the SSH command, and run it.
2. In some cases, your terminal may ask for your consent before connecting. Enter ‘yes’.
3. A prompt will request a password. Type the SSH password, and you should be connected.
Output:
Next, if you want to check the GPU details, run the following command in the terminal:
nvidia-smi
Step 7: Set up the project environment with dependencies
1. Create a virtual environment using Anaconda.
conda create -n hunyuan python=3.11 -y && conda activate hunyuan
Output:
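Before installing anything else, you can confirm the environment’s interpreter is active with a quick sanity check (it should report Python 3.11.x):

python --version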
2. Once you’re inside the environment, install vllm with its dependencies.
pip install --upgrade vllm
Output:
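Optionally, verify that vllm installed correctly before moving on:

pip show vllm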
3. Also, open a second terminal, connect to the remote server with SSH, and install open-webui.
pip install open-webui
4. Install Transformers.
pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
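This pins Transformers to a specific commit (presumably the one that adds Hunyuan support, per the model’s setup instructions). To confirm the build installed correctly, run a quick import check:

python -c "import transformers; print(transformers.__version__)"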
Step 8: Download the model
1. Download the model with vllm and host the endpoint on port 8000. To run the smaller variant instead, replace tencent/Hunyuan-7B-Instruct in the command below with the Hunyuan 1.8B instruct model.
vllm serve tencent/Hunyuan-7B-Instruct --host 0.0.0.0 --port 8000 --trust_remote_code
Output:
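Before wiring up the frontend, you can sanity-check the endpoint from another terminal on the server. vllm exposes an OpenAI-compatible API, so a minimal request (assuming the default /v1 route and the model name used above) looks like this:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tencent/Hunyuan-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'

If the server is healthy, you’ll get back a JSON response containing the model’s reply.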
2. In the second terminal connected to the GPU host via SSH, serve the open-webui frontend endpoint.
open-webui serve --port 3000
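As a side note, if you’d rather preconfigure the connection than set it later in the UI, Open WebUI can read an OpenAI-compatible endpoint from environment variables (OPENAI_API_BASE_URL and OPENAI_API_KEY, per its documentation; double-check the names against your installed version):

# the API key can be any placeholder, since vllm was started without --api-key
OPENAI_API_BASE_URL=http://localhost:8000/v1 OPENAI_API_KEY=placeholder open-webui serve --port 3000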
3. Forward both ports and tunnel them for access in your local browser.
If you’re on a remote machine (e.g., a NodeShift GPU), you’ll need to do SSH port forwarding in order to access both the vllm and open-webui sessions in your local browser.
Run the following command in your local terminal after replacing:
- <YOUR_SERVER_PORT> with the port allotted to your remote server (for the NodeShift server, you can find it in the deployed GPU details on the dashboard).
- <PATH_TO_SSH_KEY> with the path to the location where your SSH key is stored.
- <YOUR_SERVER_IP> with the IP address of your remote server.
ssh -L 3000:localhost:3000 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
In another local terminal, run the following to forward the port for the vllm endpoint:
ssh -L 8000:localhost:8000 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>
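Alternatively, since ssh accepts multiple -L flags, you can forward both ports through a single tunnel:

ssh -L 3000:localhost:3000 -L 8000:localhost:8000 -p <YOUR_SERVER_PORT> -i <PATH_TO_SSH_KEY> root@<YOUR_SERVER_IP>

Once the tunnel is up, running curl http://localhost:8000/v1/models from your local terminal should list the served Hunyuan model, confirming that the forwarding works.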
Step 9: Run the model via Open WebUI Interface
Once ports are forwarded, you can simply access the model via Open WebUI interface and chat with it.
1. Before running the model, connect the WebUI to the vllm API endpoint in the settings.
2. Select the Hunyuan-7B or 1.8B model on the chat page and run the prompt.
For example, we’re testing the following prompt:
You are an AI assistant embedded in a robot that will help a person relocate to a new city for a remote job. Your goal is to:
Recommend:
- A city in India that fits best for software engineers.
- An ideal living arrangement (rent or buy, apartment or house).
- Internet providers suitable for remote work in that city.
- A weekly schedule that balances productivity, social life, and wellness.
You must:
- Think through the options and tradeoffs.
- Justify your reasoning step-by-step.
- Format the output clearly in sections with bullet points.
Let’s begin: ask me your 5 questions.
Output:
Conclusion
With Hunyuan’s powerful blend of ultra-long context handling, hybrid reasoning, and efficient local inference, deploying a high-performance LLM on your own hardware has never been more accessible. In this article, we explored how to set up the 7B and 1.8B variants locally, highlighting their standout features and real-world capabilities across reasoning, coding, and agent tasks. Backing this seamless experience is NodeShift cloud, which simplifies the deployment process with ready-to-use infrastructure, optimized environments, and hands-on support, ensuring you spend less time configuring and more time building with next-gen AI.