google/gemma-3-270m
(Pre-trained)
A lightweight, open text model from Google DeepMind and the smallest member of the Gemma 3 family. With a 32K context window, it’s suitable for general-purpose text generation, summarization, light reasoning, and task-specific fine-tuning. Trained on diverse multilingual, code, and math datasets, it offers strong performance for its size in resource-constrained environments like laptops or small cloud VMs.
google/gemma-3-270m-it
(Instruction-Tuned)
An instruction-tuned variant of Gemma 3 270M that’s fine-tuned to follow user prompts more reliably. It shares the base model’s architecture and 32K context window but is stronger at conversational AI, question answering, and structured output tasks, making it the more user-friendly choice for chatbots, assistants, and guided content generation.
unsloth/gemma-3-270m-it-GGUF
A GGUF-format build of the instruction-tuned Gemma 3 270M released by Unsloth AI for efficient local inference with llama.cpp and similar tools. Its quantized variants trade a little quality for faster performance and lower memory usage, making it ideal for on-device or low-resource deployment scenarios.
Gemma 3 270M
GPU Configuration Table for Gemma-3-270m, GGUF & Instruct Models
| Model | Parameters | Recommended Precision | Minimum GPU (for inference) | Minimum VRAM | Recommended GPU (for smooth use) | Recommended VRAM |
|---|---|---|---|---|---|---|
| google/gemma-3-270m (Pre-trained) | 270M | FP16 / BF16 | NVIDIA T4 / RTX 3060 | 8 GB | RTX 3090 / A100 40GB | 24–40 GB |
| google/gemma-3-270m-it (Instruction-Tuned) | 270M | FP16 / BF16 | NVIDIA T4 / RTX 3060 | 8 GB | RTX 3090 / A100 40GB | 24–40 GB |
| unsloth/gemma-3-270m-it-GGUF | 270M | Q4_K_M (4-bit) / Q8_0 (8-bit) | NVIDIA GTX 1650 / RTX 3050 | 4–6 GB | RTX 3060 Ti / RTX 4090 | 8–24 GB |
Notes:
- The GGUF version is much lighter because it uses quantization, so it can run even on lower-end GPUs or CPUs.
- The pre-trained (PT) and instruction-tuned (IT) models from Google will require more VRAM if used in FP16 or BF16 formats.
- If you use CPU inference with GGUF, you should have at least 8–16 GB of system RAM for smooth execution.
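As a quick back-of-the-envelope check on the table above, the weight memory follows directly from the parameter count times the bytes per parameter; actual usage is higher once activations, the KV cache, and framework overhead are added. A minimal sketch of that arithmetic:
# Rough weight-memory estimate for a 270M-parameter model.
# Real usage is higher: activations, KV cache, and runtime overhead add on top.
PARAMS = 270e6

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1024**3

print(f"FP16/BF16: {weight_gb(2.0):.2f} GB")   # ~0.50 GB
print(f"Q8_0     : {weight_gb(1.0):.2f} GB")   # ~0.25 GB
print(f"Q4_K_M   : {weight_gb(0.5):.2f} GB")   # ~0.13 GB (quantization metadata adds a little)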
Resources
Link 1: https://huggingface.co/google/gemma-3-270m
Link 2: https://huggingface.co/google/gemma-3-270m-it
Link 3: https://huggingface.co/unsloth/gemma-3-270m-it-GGUF
Step-by-Step Process to Install & Run Gemma-3-270m, GGUF & Instruct Locally
For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice. NodeShift provides the most affordable Virtual Machines at a scale that meets GDPR, SOC2, and ISO27001 requirements.
Step 1: Sign Up and Set Up a NodeShift Cloud Account
Visit the NodeShift Platform and create an account. Once you’ve signed up, log into your account.
Follow the account setup process and provide the necessary details and information.
Step 2: Create a GPU Node (Virtual Machine)
GPU Nodes are NodeShift’s GPU Virtual Machines, on-demand resources equipped with diverse GPUs ranging from H100s to A100s. These GPU-powered VMs provide enhanced environmental control, allowing configuration adjustments for GPUs, CPUs, RAM, and Storage based on specific requirements.
Navigate to the menu on the left side, select the GPU Nodes option, and click the Create GPU Node button in the Dashboard to configure and deploy your first Virtual Machine.
Step 3: Select a Model, Region, and Storage
In the “GPU Nodes” tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.
We will use 1 x RTX A6000 GPU for this tutorial to achieve the fastest performance. However, you can choose a more affordable GPU with less VRAM if that better suits your requirements.
Step 4: Select Authentication Method
There are two authentication methods available: Password and SSH Key. SSH keys are a more secure option. To create them, please refer to our official documentation.
Step 5: Choose an Image
In our previous blogs, we used pre-built images from the Templates tab when creating a Virtual Machine. However, for running Gemma-3-270m & Instruct, we need a more customized environment with full CUDA development capabilities. That’s why, in this case, we switched to the Custom Image tab and selected a specific Docker image that meets all runtime and compatibility requirements.
We chose the following image:
nvidia/cuda:12.1.1-devel-ubuntu22.04
This image is essential because it includes:
- Full CUDA toolkit (including nvcc)
- Proper support for building and running GPU-based applications like Gemma-3-270m & Instruct
- Compatibility with CUDA 12.1.1 required by certain model operations
Launch Mode
We selected:
Interactive shell server
This gives us SSH access and full control over terminal operations, which is perfect for installing dependencies, running benchmarks, and launching tools like Ollama and Open WebUI.
Docker Repository Authentication
We left all fields empty here.
Since the Docker image is publicly available on Docker Hub, no login credentials are required.
Identification
nvidia/cuda:12.1.1-devel-ubuntu22.04
CUDA and cuDNN images from gitlab.com/nvidia/cuda. Devel version contains full cuda toolkit with nvcc.
This setup ensures that the Gemma-3-270m & Instruct models run in a GPU-enabled environment with proper CUDA access and high compute performance.
After choosing the image, click the ‘Create’ button, and your Virtual Machine will be deployed.
Step 6: Virtual Machine Successfully Deployed
You will get visual confirmation that your node is up and running.
Step 7: Connect to GPUs using SSH
NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.
Once your GPU Node deployment is successfully created and has reached the ‘RUNNING’ status, you can navigate to the page of your GPU Deployment Instance. Then, click the ‘Connect’ button in the top right corner.
Now open your terminal and paste the proxy SSH IP or direct SSH IP.
Next, if you want to check the GPU details, run the command below:
nvidia-smi
Step 8: Check the Available Python Version and Install a Newer Version
Run the following command to check the Python version available on the system:
python3 --version
The system has Python 3.8.1 available by default. To install a higher version of Python, you’ll need to use the deadsnakes PPA.
Run the following commands to add the deadsnakes PPA:
sudo apt update
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:deadsnakes/ppa
sudo apt update
Step 9: Install Python 3.11
Now, run the following command to install Python 3.11 or another desired version:
sudo apt install -y python3.11 python3.11-venv python3.11-dev
Step 10: Update the Default Python3 Version
Now, run the following command to link the new Python version as the default python3:
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 2
sudo update-alternatives --config python3
Then, run the following command to verify that the new Python version is active:
python3 --version
Step 11: Install and Update Pip
Run the following command to install and update pip:
curl -O https://bootstrap.pypa.io/get-pip.py
python3.11 get-pip.py
Then, run the following command to check the version of pip:
pip --version
Step 12: Create and Activate a Python 3.11 Virtual Environment
Run the following commands to create and activate a Python 3.11 virtual environment:
apt update && apt install -y python3.11-venv git wget
python3.11 -m venv openwebui
source openwebui/bin/activate
Step 13: Install Open-WebUI
Run the following command to install open-webui:
pip install open-webui
Step 14: Serve Open-WebUI
In your activated Python environment, start the Open-WebUI server by running:
open-webui serve
- Wait for the server to complete all database migrations and set up initial files. You’ll see a series of INFO logs and a large “OPEN WEBUI” banner in the terminal.
- When setup is complete, the WebUI will be available and ready for you to access via your browser.
Step 15: Set up SSH port forwarding from your local machine
On your local machine (Mac/Windows/Linux), open a terminal and run:
ssh -L 8080:localhost:8080 -p 40128 root@38.29.145.10
This forwards:
Local localhost:8080 → Remote VM 127.0.0.1:8080
Step 16: Access Open-WebUI in Your Browser
Go to:
http://localhost:8080
- You should see the Open-WebUI login or setup page.
- Log in or create a new account if this is your first time.
- You’re now ready to use Open-WebUI to interact with your models!
Step 17: Install Ollama
After connecting to the terminal via SSH, it’s now time to install Ollama from the official Ollama website.
Website Link: https://ollama.com/
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Step 18: Serve Ollama
Run the following command to start the Ollama server so that it can be accessed and used efficiently:
ollama serve
Step 19: Pull the Gemma3:270M Model
Run this command to pull the gemma3:270m model:
ollama pull gemma3:270m
Step 20: Run the Gemma3:270M Model for Inference
Now that the model is pulled, you can start running it and interacting with it directly from the terminal.
To run the gemma3:270m model, use:
ollama run gemma3:270m
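Besides the interactive terminal session, Ollama also exposes a local REST API on port 11434, so you can script prompts against the same model. A minimal sketch using the requests library (assuming the default Ollama port and that gemma3:270m is already pulled):
# Minimal sketch: send one chat turn to the local Ollama API.
# Assumes Ollama is serving on the default port 11434 and gemma3:270m is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:270m",
        "messages": [{"role": "user", "content": "Give me two fun facts about octopuses."}],
        "stream": False,   # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])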
Step 21: Chat with Gemma-3-270M in Open WebUI (auto-detected from Ollama)
You’ve already tested the model in the terminal with Ollama and installed Open WebUI earlier. Now we’ll use the Web UI to chat with the same local model.
Make sure Ollama is running
- If you’re in a VM, keep the Ollama service up.
- Quick check:
ollama pull gemma3:270m # if not pulled yet
curl http://localhost:11434/api/tags | jq . # should list gemma3:270m
Open the Web UI
- Visit your Open WebUI URL (e.g., http://<host>:8080).
- Click the model dropdown at the top (“Select a model”).
Pick the model
- You should see gemma3:270m under Local. Select it.
- That’s it—Open WebUI automatically detects any model you’ve pulled with Ollama and shows it in the list.
(Your screen should look like the screenshot: gemma3:270m visible in the model picker.)
Start chatting
- Type your prompt in the chat box and send.
- Use the settings icon (if available) to tweak temperature, max tokens, etc.
If the model doesn’t appear
- Click the refresh icon next to the model list, or go to Settings → Providers → Ollama and confirm the Base URL (usually http://localhost:11434), then Save and Sync Models.
- If Ollama runs on another machine, set the Base URL to that host (make sure the port is reachable).
Step 22: Stress-test the model in Open WebUI (tune settings + quick rubric)
Now that gemma3:270m shows up in Open WebUI and you can chat, do a fast quality check and tune generation so it behaves well.
Open a new chat → pick gemma3:270m
- Click the gear (generation settings) and start with:
- Temperature: 0.6
- Top-p: 0.9
- Max new tokens: 512
- Repeat penalty: 1.1
- (Optional) Seed: 42 for reproducible runs
Paste 3 single-line “hard” prompts to probe reasoning & constraints
If five painters take five hours to paint five walls, how long would 100 painters take to paint 100 walls? Explain without skipping steps.
Summarize the book “The Little Prince” in exactly 7 words, keeping its emotional tone intact.
Translate “La vie est belle” into English, reverse each word, and then write a haiku using the reversed words as the first line.
Grade quickly with a mini-rubric (write notes in the chat or a doc)
- Correctness (math/logic right?)
- Constraint keeping (exact word count, formatting, “no synonyms” rules)
- Clarity (step-by-step, no hand-waving)
- Latency (tokens/sec acceptable?)
- Determinism (does it change across retries? if yes, lower temp)
If it struggles, tweak and retry
- Reasoning tasks: lower Temperature → 0.2–0.4.
- Short answers cut off: raise Max new tokens.
- Add a System message like: “Follow constraints strictly. Show numbered steps.”
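If you also want these runs logged outside the browser, the sketch below sends the same three prompts to the local Ollama API and records wall-clock latency plus an approximate tokens-per-second figure for your rubric notes. It assumes the default endpoint on port 11434; the eval_count and eval_duration fields are what recent Ollama builds return, so treat them as an assumption if your version differs.
# Sketch: run the three stress prompts against Ollama and log rough latency stats.
# Assumes http://localhost:11434 and the gemma3:270m model pulled earlier;
# eval_count / eval_duration are fields reported by recent Ollama builds.
import time
import requests

PROMPTS = [
    "If five painters take five hours to paint five walls, how long would 100 painters take to paint 100 walls? Explain without skipping steps.",
    'Summarize the book "The Little Prince" in exactly 7 words, keeping its emotional tone intact.',
    'Translate "La vie est belle" into English, reverse each word, and then write a haiku using the reversed words as the first line.',
]

for prompt in PROMPTS:
    start = time.perf_counter()
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gemma3:270m",
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.6, "top_p": 0.9, "num_predict": 512,
                        "repeat_penalty": 1.1, "seed": 42},
        },
        timeout=300,
    ).json()
    elapsed = time.perf_counter() - start
    tok_s = data["eval_count"] / (data["eval_duration"] / 1e9) if data.get("eval_duration") else 0.0
    print(f"\n=== {prompt[:50]}... ===\n{data['response']}")
    print(f"[{elapsed:.1f}s wall, ~{tok_s:.1f} tok/s]")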
Up to here, we’ve been interacting with google/gemma-3-270m via Ollama in the terminal and through Open WebUI in the browser (Open WebUI auto-detected the Ollama model, so chatting worked in both places). Now we’ll install the lightweight GGUF variant of this model directly from Hugging Face inside Open WebUI’s Manage Models panel, so you can run the llama.cpp-style build with lower memory usage and switch between the Ollama and GGUF versions from the same model dropdown.
Step 23: Pull the GGUF build from Hugging Face (Unsloth)
Unsloth publishes a ready-to-run GGUF pack for this model: unsloth/gemma-3-270m-it-GGUF.
In Open WebUI → Settings → Models → Manage Models, paste this repo path into “Pull a model from Ollama.com” (it accepts hf.co/... paths too):
hf.co/unsloth/gemma-3-270m-it-GGUF
Click the download icon. When file choices appear, I recommend starting with:
- gemma-3-270m-it.Q4_K_M.gguf (best speed/quality balance)
- Lighter options if RAM/VRAM is tiny: IQ2_XXS / IQ3_XXS
- Higher quality: Q8_0 (or F16 if you want full precision)
After the download finishes, the GGUF model will show up in your model selector alongside the Ollama one, and you can chat with either version directly in Open WebUI.
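As an aside, if you’d rather fetch a GGUF file yourself (for llama.cpp or another local runner) instead of pulling it through the Manage Models panel, a hedged sketch with huggingface_hub is below; the exact filename is an assumption, so check the repo’s file listing first.
# Sketch: download one GGUF file from the Unsloth repo with huggingface_hub.
# The filename below is an assumption; confirm it against the repo's file list.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/gemma-3-270m-it-GGUF",
    filename="gemma-3-270m-it-Q4_K_M.gguf",  # assumed name; pick the quant you want
)
print("Downloaded to:", path)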
Step 24: Chat with the GGUF model in Open WebUI (verify + tune)
Select the GGUF build
Open a new chat and pick hf.co/unsloth/gemma-3-270m-it-GGUF:latest from the model dropdown (you’ll see the full HF path in the header, like in your screenshot).
Use the same stress prompts
Paste the same three single-line tests from Step 22 (the painters puzzle, the 7-word “Little Prince” summary, and the reversed-word haiku). This makes A/B comparison with the Ollama version straightforward.
Tune generation for GGUF
- Temperature 0.4–0.6 (start 0.5)
- Top-p 0.9
- Max new tokens 512
- Repeat penalty 1.1
- Context/window: 8192 (you can go higher if your RAM allows)
Compare vs. Ollama run
- Correctness: does it keep constraints (exact word counts, banned words)?
- Coherence: fewer/random jumps → nudge temp down to 0.3–0.4.
- Latency: if slow on CPU, try a lighter quant (IQ3_XXS) or shorter max tokens. If quality feels thin, bump to Q6_K or Q8_0.
Optional: save a preset
Click … → Save as preset (e.g., “Gemma3-270m-GGUF-Q4KM”) so future chats load your tuned settings instantly.
If something’s off
- Model not loading: re-open Settings → Models → Manage Models → Sync/Refresh.
- Quality too low: switch the file to a higher quant (Q6_K / Q8_0).
- Memory tight: keep quant at Q4_K_M and reduce context or max tokens.
Now you can flip between Ollama (gemma3:270m) and GGUF (hf.co/unsloth/…) in the same UI and capture side-by-side behavior for your write-up.
Up to this point, we’ve been chatting with google/gemma-3-270m, google/gemma-3-270m-it, and the unsloth/gemma-3-270m-it-GGUF build via Ollama in the terminal and Open WebUI in the browser (which auto-detected our Ollama pulls). Now we’ll move beyond the UI and run the original Hugging Face models google/gemma-3-270m (pretrained) and google/gemma-3-270m-it (instruction-tuned) directly via script—downloading them with Transformers using your HF token, so we can control settings programmatically, batch tests, and log clean benchmarks.
Step 25: Install PyTorch
Run the following command to install PyTorch with CUDA support:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
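Once the install finishes, a quick sanity check confirms that the CUDA build of PyTorch can actually see the GPU before moving on:
# Quick sanity check that PyTorch installed with CUDA support and sees the GPU.
import torch

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device        :", torch.cuda.get_device_name(0))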
Step 26: Install Python Dependencies
Run the following command to install the Python dependencies:
python -m pip install -U "transformers>=4.53" accelerate sentencepiece
Step 27: Install/Verify Hugging Face Hub (CLI + token)
Install (or update) the Hub tools:
pip install -U huggingface_hub "transformers>=4.53"
huggingface-cli --version
Authenticate (same account that accepted Gemma access):
huggingface-cli login # paste your hf_... token with read scope
# optional env var so scripts/daemons inherit it
export HF_TOKEN=HF_xxx
echo 'export HF_TOKEN=HF_xxx' >> ~/.bashrc
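Before moving on, you can optionally confirm from Python that the token works and that your account has access to the gated Gemma repos; a small sketch with huggingface_hub (using the same repo id as the script later on):
# Sketch: verify the Hugging Face login and gated-repo access before downloading.
from huggingface_hub import whoami, model_info

print("Logged in as:", whoami()["name"])      # fails if the token is missing or invalid
info = model_info("google/gemma-3-270m-it")   # raises an error if Gemma access wasn't granted
print("Access OK:", info.id)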
Step 28: Connect to Your GPU VM with a Code Editor
Before you start running Python scripts with the Gemma-3-270m & Instruct models and Transformers, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 29: Run Gemma-3-270M Models with Transformers in Python
Now you’re ready to interact with Gemma-3-270M directly in your own Python scripts using the Transformers library.
Here’s an example script (gemma3_run.py) you can use:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
model_id = "google/gemma-3-270m-it" # or "google/gemma-3-27m" for base PT
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",            # GPU if present, else CPU
    attn_implementation="sdpa",   # good default in recent PyTorch
)
streamer = TextStreamer(tok)
inputs = tok("Explain Rust ownership like I'm 12:", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
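Because gemma-3-270m-it is instruction-tuned, you will usually get better-behaved answers by formatting the prompt with the model’s chat template instead of passing raw text. A sketch of that variant, reusing tok, model, and streamer from the script above:
# Variant for the instruction-tuned model: wrap the prompt in the chat template.
# Reuses tok, model, and streamer defined in the script above.
messages = [
    {"role": "user", "content": "Explain Rust ownership like I'm 12."},
]
chat_ids = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn marker
    return_tensors="pt",
).to(model.device)
_ = model.generate(chat_ids, max_new_tokens=200, streamer=streamer)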
Step 30: Run the script and generate a response
Run the script with the following command to load google/gemma-3-270m-it and generate a response:
python3 gemma3_run.py
Step 31: Run the Base Gemma-3-270M Model with Transformers in Python
Next, we’ll interact with the base (pre-trained) google/gemma-3-270m model directly in a Python script using the Transformers library:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
model_id = "google/gemma-3-270m" # or "google/gemma-3-27m-it" for instruct
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",            # GPU if present, else CPU
    attn_implementation="sdpa",   # good default in recent PyTorch
)
streamer = TextStreamer(tok)
inputs = tok("Explain Rust ownership like I'm 12:", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=200, streamer=streamer)
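Since one goal of running the models by script is to log clean benchmarks, you can extend either script with a simple timing wrapper. A minimal sketch, reusing model and tok from above, with tokens per second as a rough figure rather than a rigorous benchmark:
# Sketch: time one generation and report approximate tokens per second.
# Reuses model and tok from the script above.
import time

prompt = "List three practical uses of a 270M-parameter language model."
inputs = tok(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s (~{new_tokens / elapsed:.1f} tok/s)")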
Step 32: Run the script and generate a response
Run the script with the following command to load google/gemma-3-270m and generate a response:
python3 gemma3_run.py
Conclusion
Gemma-3-270M is a perfect example of how cutting-edge AI can be scaled down without losing its versatility. Whether you’re experimenting with the pre-trained variant for raw, general-purpose tasks, the instruction-tuned version for natural conversations, or the GGUF build for low-resource deployments, you get a model that’s fast, flexible, and surprisingly capable for its size.
With this guide, you’ve learned how to set up a GPU-powered environment, run Gemma models through Ollama, Open WebUI, and Transformers, and even optimize them for speed and memory efficiency. You can now seamlessly switch between interactive browser-based chats, terminal sessions, and custom Python scripts, all while taking advantage of the model’s small footprint and speed.
Whether you’re building a chatbot, testing reasoning skills, summarizing content, or just exploring model behavior, Gemma-3-270M gives you the freedom to run it your way—from high-end GPUs to modest local machines. Now, it’s your turn to put it to the test, push its limits, and see what’s possible when big ideas meet small but mighty AI.