K2-Think is a 32B parameter open-weights reasoning model developed by LLM360, purpose-built for tough problem-solving in math, code, and science. It excels in competitive benchmarks like AIME, HMMT, and LiveCodeBench, showcasing strong chain-of-thought reasoning and verifiable step-by-step logic. Optimized for efficiency, K2-Think runs on both typical cloud setups and advanced hardware like Cerebras WSE, making it a powerful yet accessible system for researchers and developers who want high-performance reasoning without proprietary restrictions.
In this blog, we walk you through building a Math Dueler Agent powered by K2-Think—a system where two proposer agents attempt to solve the same math problem using different approaches, and a referee agent steps in to compare, verify, and declare the winner. You’ll learn how to set up the environment, organize your project files, implement proposers and referee logic, integrate math verification with Sympy, and finally wrap everything in a user-friendly Gradio interface. By the end, you’ll not only have K2-Think running locally but also a fully working agent framework that turns complex math problem solving into an interactive, competitive, and verifiable experience.
Before we dive into building the Math Dueler Agent, it’s important to note that we’ve already published a complete step-by-step guide on How to Install & Run K2-Think Locally. That guide covers the full setup process—environment preparation, model installation, and running K2-Think on your own machine. So if you’re starting fresh or haven’t yet run the model locally, make sure to follow that first. Once you have K2-Think up and running, you can come back here and directly jump into building the agent.
Link: https://nodeshift.cloud/blog/how-to-install-run-k2-think-locally
Step-by-Step Guide for Building a Math Dueler Agent with K2-Think
Step 1: Install Dependencies
Run the following command to install model dependencies:
pip install transformers accelerate torch sympy gradio
Step 2: Install Build Tools (For Optional FlashAttention-2 Speedup)
Run the following commands to install build tools:
pip install -U pip setuptools wheel ninja packaging
pip install "flash-attn>=2.5.8" --no-build-isolation --no-cache-dir
Why: flash-attn (and some GPU extras) need modern build tooling to compile or fetch wheels. Upgrading pip/setuptools/wheel plus ninja avoids metadata/build errors and enables the FlashAttention-2 (FA2) acceleration path for much faster attention.
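If you want to confirm that the optional FlashAttention-2 path is actually usable before launching anything, a quick Python check helps (a minimal sketch; it assumes a recent flash-attn release that exposes __version__):
# Optional sanity check: see whether flash-attn imports cleanly on this machine.
# If it doesn't, the app below simply falls back to standard attention.
try:
    import flash_attn
    print("flash-attn available, version:", flash_attn.__version__)
except Exception:
    print("flash-attn not usable; standard attention will be used.")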
Step 3: Connect to Your GPU VM with a Code Editor
Before you start running the agent scripts with the K2-Think model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 4: Pull the Model
Create a script (e.g., pull.py) and add the following code to pull the model:
from transformers import pipeline
import torch

model_id = "LLM360/K2-Think"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)
What This Script Does
- Defines the model to load: LLM360/K2-Think.
- Downloads/caches the weights + tokenizer the first time.
- Creates a Transformers pipeline for text generation.
- Auto-selects GPU/CPU with device_map="auto".
- Sets a fast, mixed-precision dtype (BF16 on GPU) for quicker inference.
Step 5: Run the Script
With the script in place, run it to download the weights and load the model:
python pull.py
This step verifies that all model shards are downloaded correctly and confirms the pipeline initializes without errors. Once you see the progress bars finish and the checkpoint shards load to 100%, the model is ready to use.
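If you’d like an end-to-end sanity check before building the agent, you can append a tiny generation call to pull.py (a minimal sketch; the exact wording of the model’s reply will vary):
# Optional smoke test: add to the end of pull.py and re-run it.
messages = [{"role": "user", "content": "What is 17 * 23? Answer briefly."}]
result = pipe(messages, max_new_tokens=64)
print(result[0]["generated_text"])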
Step 6: Create the App Package (5 files)
6.0. Make Folder
Run the following command to make the app folder:
mkdir -p app
6.1. app/model.py — Single shared model loader (fast, safe)
- Loads K2-Think once and shares the pipeline.
- Uses dtype= (no deprecation), BF16 on GPU, device_map="auto".
- FlashAttention-2 toggles automatically if installed; it can be forced on or off with the K2_USE_FA2 environment variable.
Create: app/model.py
import os, torch
from transformers import pipeline

# Speed-friendly GPU flags
torch.backends.cuda.matmul.allow_tf32 = True
try:
    torch.set_float32_matmul_precision("high")
except Exception:
    pass

_MODEL_ID = os.getenv("K2_MODEL_ID", "LLM360/K2-Think")
_USE_FA2 = os.getenv("K2_USE_FA2", "auto").lower()  # "auto" | "1" | "0"

def _fa2_available() -> bool:
    if _USE_FA2 == "0":
        return False
    if _USE_FA2 == "1":
        return True  # user promises FA2 exists
    try:
        import flash_attn  # noqa: F401
        return True
    except Exception:
        return False

_PIPE = None

# Default fast generation settings (you can override via env)
GEN_KWARGS = dict(
    max_new_tokens=int(os.getenv("K2_MAX_NEW_TOKENS", "256")),
    temperature=0.1,
    do_sample=False,
    top_p=1.0,
)

def get_pipe():
    """Return a singleton text-generation pipeline for K2-Think."""
    global _PIPE
    if _PIPE is not None:
        return _PIPE
    model_kwargs = {}
    if _fa2_available():
        model_kwargs["attn_implementation"] = "flash_attention_2"
    _PIPE = pipeline(
        task="text-generation",
        model=_MODEL_ID,
        dtype=(torch.bfloat16 if torch.cuda.is_available() else "auto"),
        device_map="auto",
        model_kwargs=model_kwargs,  # empty dict if FA2 not present
    )
    return _PIPE
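As a quick illustration of how this loader is meant to be used, and how the environment variables above override its defaults, here is a hypothetical session (the values are only examples, and the variables must be set before app.model is imported):
# Hypothetical usage from another module or a Python shell at the project root.
import os
os.environ["K2_MAX_NEW_TOKENS"] = "512"   # read at import time by app.model
os.environ["K2_USE_FA2"] = "0"            # force standard attention

from app.model import get_pipe, GEN_KWARGS

pipe = get_pipe()        # first call loads K2-Think
same_pipe = get_pipe()   # later calls return the same pipeline object
assert pipe is same_pipe
print(GEN_KWARGS)        # {'max_new_tokens': 512, 'temperature': 0.1, ...}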
6.2. app/tools_math.py — Small verifiers (SymPy)
- Tiny helpers that the referee can reference or you can expand later.
Create: app/tools_math.py
import sympy as sp

def is_prime(n: int) -> bool:
    try:
        return sp.isprime(int(n))
    except Exception:
        return False

def next_prime(n: int) -> int:
    try:
        return int(sp.nextprime(int(n)))
    except Exception:
        return -1

def equal_expr(a: str, b: str) -> bool:
    """Return True if two algebraic expressions are symbolically equal."""
    try:
        return sp.simplify(sp.sympify(a) - sp.sympify(b)) == 0
    except Exception:
        return False
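To see what these helpers do, and to hand-check the demo prompt used later, you can call them directly from the project root; for example:
# Quick check of the verifiers (run from the folder that contains app/).
from app.tools_math import is_prime, next_prime, equal_expr

print(is_prime(2600))    # False (2600 is even)
print(next_prime(2600))  # 2609, the next prime after 2600
print(equal_expr("(x+1)**2", "x**2 + 2*x + 1"))  # True: symbolically equal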
6.3. app/proposers.py — Run A and B in parallel
- Short prompts; parallelized using a small thread pool.
- Reuses the one global pipeline.
Create: app/proposers.py
import asyncio
from concurrent.futures import ThreadPoolExecutor
from app.model import get_pipe, GEN_KWARGS

PROMPT_A = "Role: Proposer A. Solve with numbered steps. End with \\boxed{answer}."
PROMPT_B = "Role: Proposer B. Solve differently. Numbered steps. End with \\boxed{answer}."

_pipe = get_pipe()
_executor = ThreadPoolExecutor(max_workers=2)

def _gen(prompt: str) -> str:
    out = _pipe([{"role": "user", "content": prompt}], **GEN_KWARGS)[0]["generated_text"]
    # With chat-style input the pipeline returns the whole conversation;
    # keep only the assistant's reply as plain text.
    return out[-1]["content"] if isinstance(out, list) else out

async def _agen(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, _gen, prompt)

def propose_parallel(problem: str):
    """Return (solution_A, solution_B) computed concurrently."""
    pa = f"{PROMPT_A}\nProblem: {problem}"
    pb = f"{PROMPT_B}\nProblem: {problem}"
    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(asyncio.gather(_agen(pa), _agen(pb)))
    finally:
        loop.close()
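If you want to exercise the proposers without the UI, a throwaway driver script works; the filename below is just an example, and it should live at the project root next to the app/ folder:
# Hypothetical test_proposers.py, run as: python test_proposers.py
from app.proposers import propose_parallel

sol_a, sol_b = propose_parallel("Find the next prime after 2600")
print("--- Proposer A ---\n", sol_a)
print("--- Proposer B ---\n", sol_b)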
6.4. app/referee.py — Concise verdict JSON
- Asks for {winner, reason, final_answer} to keep the UI tidy.
- You can expand later to include step-by-step checks.
Create: app/referee.py
from app.model import get_pipe, GEN_KWARGS

REFEREE_PROMPT = (
    "Role: Referee. Compare Solution A and B. Be concise. "
    "If a mistake exists, cite the first wrong step and give the corrected final answer. "
    "Return JSON: {winner, reason, final_answer}."
)

_pipe = get_pipe()

def adjudicate(sol_a: str, sol_b: str) -> str:
    prompt = f"{REFEREE_PROMPT}\n\nA:\n{sol_a}\n\nB:\n{sol_b}"
    out = _pipe([{"role": "user", "content": prompt}], **GEN_KWARGS)[0]["generated_text"]
    # Same chat-vs-plain-text handling as in proposers.py.
    return out[-1]["content"] if isinstance(out, list) else out
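The referee is only asked, not forced, to return JSON, so downstream code should not assume the verdict always parses. If you later want to consume the verdict programmatically, a defensive sketch (a hypothetical helper, not part of the app files above) could look like this:
# Hypothetical helper: pull a {winner, reason, final_answer} object out of the
# referee's free-form reply, falling back to the raw text if parsing fails.
import json, re

def parse_verdict(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)  # first {...} span, if any
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"winner": None, "reason": text, "final_answer": None}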
6.5. app/ui_gradio.py — Nice, tall panes + buttons
- Uses gr.Blocks + custom CSS to make outputs tall and readable.
- Wires Submit and Clear.
Create: app/ui_gradio.py
import gradio as gr
from app.proposers import propose_parallel
from app.referee import adjudicate

def duel(problem: str):
    sol_a, sol_b = propose_parallel(problem)
    ref = adjudicate(sol_a, sol_b)
    return sol_a, sol_b, ref

with gr.Blocks(css="""
.bigbox textarea {
    height: 420px !important;
    font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
    font-size: 14px;
    line-height: 1.35;
    white-space: pre-wrap;
}
""") as demo:
    gr.Markdown("# K2-Think Math Dueler\nTwo proposers generate solutions; a Referee compares and verifies. **(Fast mode)**")
    with gr.Row():
        with gr.Column(scale=5, min_width=420):
            problem = gr.Textbox(
                label="Enter a problem (e.g., 'Find the next prime after 2600')",
                placeholder="Find the next prime after 2600",
                lines=3
            )
            with gr.Row():
                clear_btn = gr.Button("Clear", variant="secondary")
                submit_btn = gr.Button("Submit", variant="primary")
        with gr.Column(scale=7, min_width=520):
            out_a = gr.Textbox(label="Proposer A", lines=22, show_copy_button=True, elem_classes=["bigbox"])
            out_b = gr.Textbox(label="Proposer B", lines=22, show_copy_button=True, elem_classes=["bigbox"])
            out_r = gr.Textbox(label="Referee Verdict", lines=22, show_copy_button=True, elem_classes=["bigbox"])

    submit_btn.click(fn=duel, inputs=problem, outputs=[out_a, out_b, out_r])
    clear_btn.click(lambda: ("", "", "", ""), inputs=None, outputs=[problem, out_a, out_b, out_r])

if __name__ == "__main__":
    demo.launch()
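One practical note for GPU VMs: demo.launch() binds to 127.0.0.1 by default, so from your local browser you either need SSH port forwarding or a launch call that binds to all interfaces. A possible variant (assuming port 7860 is open or forwarded on your VM) is:
# Alternative launch for remote access; keep the default and tunnel the port
# over SSH if you prefer not to expose the server.
if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)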
Step 7: Launch the App (FA2 disabled for max compatibility)
Run with FlashAttention-2 off (safe fallback) and start the UI:
export K2_USE_FA2=0 # force standard attention (no FA2 required)
python -m app.ui_gradio # launch the Gradio app
Why this step: Ensures the model runs even if flash-attn isn’t installed or is mismatched with your Torch/CUDA build.
What you should see: Gradio prints a local URL (like http://127.0.0.1:7860). Open it in your browser and try a prompt:
Find the next prime after 2600
Step 8: Play with the Agent
Enter a problem, hit Submit, and compare the two proposers’ solutions side by side with the Referee’s verdict.
Conclusion
With this guide, you’ve gone from pulling the K2-Think model to building a complete Math Dueler Agent that runs locally on your GPU VM. You now have two proposer agents solving problems in different ways and a referee agent verifying and declaring the winner—all wrapped in a clean Gradio interface. This project not only showcases the reasoning power of K2-Think but also gives you a foundation to expand: add new tools, bring in more domains beyond math, or even adapt the referee to handle richer verification pipelines. The key takeaway—K2-Think isn’t just a model you run; it’s a powerful reasoning engine you can turn into interactive, verifiable applications.