K2-Think is a 32B parameter open-weights reasoning model developed by LLM360, purpose-built for tough problem-solving in math, code, and science. It excels in competitive benchmarks like AIME, HMMT, and LiveCodeBench, showcasing strong chain-of-thought reasoning and verifiable step-by-step logic. Optimized for efficiency, K2-Think runs on both typical cloud setups and advanced hardware like Cerebras WSE, making it a powerful yet accessible system for researchers and developers who want high-performance reasoning without proprietary restrictions.
In this blog, we walk you through building a Math Dueler Agent powered by K2-Think—a system where two proposer agents attempt to solve the same math problem using different approaches, and a referee agent steps in to compare, verify, and declare the winner. You’ll learn how to set up the environment, organize your project files, implement proposers and referee logic, integrate math verification with Sympy, and finally wrap everything in a user-friendly Gradio interface. By the end, you’ll not only have K2-Think running locally but also a fully working agent framework that turns complex math problem solving into an interactive, competitive, and verifiable experience.
Before we dive into building the Math Dueler Agent, it’s important to note that we’ve already published a complete step-by-step guide on How to Install & Run K2-Think Locally. That guide covers the full setup process—environment preparation, model installation, and running K2-Think on your own machine. So if you’re starting fresh or haven’t yet run the model locally, make sure to follow that first. Once you have K2-Think up and running, you can come back here and directly jump into building the agent.
Link: https://nodeshift.cloud/blog/how-to-install-run-k2-think-locally
Step-by-Step Guide for Building a Math Dueler Agent with K2-Think
Step 1: Install Dependencies
Run the following command to install model dependencies:
pip install transformers accelerate torch sympy gradio
Step 2: Install Build Tools (For Optional FlashAttention-2 Speedup)
Run the following commands to install build tools:
pip install -U pip setuptools wheel ninja packaging
pip install "flash-attn>=2.5.8" --no-build-isolation --no-cache-dir
Why: flash-attn (and some GPU extras) need modern build tooling to compile or fetch wheels. Upgrading pip/setuptools/wheel plus ninja avoids metadata/build errors and enables the FlashAttention-2 (FA2) acceleration path for much faster attention.
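If you want to confirm that the optional FlashAttention-2 path is actually usable before launching anything, a quick Python check helps (a minimal sketch; it assumes a recent flash-attn release that exposes __version__):
# Optional sanity check: see whether flash-attn imports cleanly on this machine.
# If it doesn't, the app below simply falls back to standard attention.
try:
    import flash_attn
    print("flash-attn available, version:", flash_attn.__version__)
except Exception:
    print("flash-attn not usable; standard attention will be used.")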
Step 3: Connect to Your GPU VM with a Code Editor
Before you start running the agent scripts with the K2-Think model, it’s a good idea to connect your GPU virtual machine (VM) to a code editor of your choice. This makes writing, editing, and running code much easier.
- You can use popular editors like VS Code, Cursor, or any other IDE that supports SSH remote connections.
- In this example, we’re using the Cursor code editor.
- Once connected, you’ll be able to browse files, edit scripts, and run commands directly on your remote server, just like working locally.
Why do this?
Connecting your VM to a code editor gives you a powerful, streamlined workflow for Python development, allowing you to easily manage your code, install dependencies, and experiment with large models.
Step 4: Pull the Model
Create a script (e.g., pull.py) and add the following code to pull the model:
from transformers import pipeline
import torch

model_id = "LLM360/K2-Think"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)
What This Script Does
- Defines the model to load: LLM360/K2-Think.
- Downloads/caches the weights + tokenizer the first time.
- Creates a Transformers pipeline for text generation.
- Auto-selects GPU/CPU with device_map="auto".
- Sets a fast, mixed-precision dtype (BF16 on GPU) for quicker inference.
Step 5: Run the Script
With the script in place, run it to download the weights and load the model:
python pull.py
This step verifies that all model shards are downloaded correctly and confirms the pipeline initializes without errors. Once you see the progress bars finish and the checkpoint shards load to 100%, the model is ready to use.
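If you’d like an end-to-end sanity check before building the agent, you can append a tiny generation call to pull.py (a minimal sketch; the exact wording of the model’s reply will vary):
# Optional smoke test: add to the end of pull.py and re-run it.
messages = [{"role": "user", "content": "What is 17 * 23? Answer briefly."}]
result = pipe(messages, max_new_tokens=64)
print(result[0]["generated_text"])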
Step 6: Create the App Package (5 files)
6.0. Make Folder
Run the following command to make the app folder:
mkdir -p app
6.1. app/model.py — Single shared model loader (fast, safe)
- Loads K2-Think once and shares the pipeline.
- Uses dtype= (no deprecation), BF16 on GPU, device_map="auto".
- FlashAttention-2 toggles automatically if installed; it can be forced on or off with the K2_USE_FA2 environment variable.
Create: app/model.py
import os, torch
from transformers import pipeline

# Speed-friendly GPU flags
torch.backends.cuda.matmul.allow_tf32 = True
try:
    torch.set_float32_matmul_precision("high")
except Exception:
    pass

_MODEL_ID = os.getenv("K2_MODEL_ID", "LLM360/K2-Think")
_USE_FA2 = os.getenv("K2_USE_FA2", "auto").lower()  # "auto" | "1" | "0"

def _fa2_available() -> bool:
    if _USE_FA2 == "0":
        return False
    if _USE_FA2 == "1":
        return True  # user promises FA2 exists
    try:
        import flash_attn  # noqa: F401
        return True
    except Exception:
        return False

_PIPE = None

# Default fast generation settings (you can override via env)
GEN_KWARGS = dict(
    max_new_tokens=int(os.getenv("K2_MAX_NEW_TOKENS", "256")),
    temperature=0.1,
    do_sample=False,
    top_p=1.0,
)

def get_pipe():
    """Return a singleton text-generation pipeline for K2-Think."""
    global _PIPE
    if _PIPE is not None:
        return _PIPE
    model_kwargs = {}
    if _fa2_available():
        model_kwargs["attn_implementation"] = "flash_attention_2"
    _PIPE = pipeline(
        task="text-generation",
        model=_MODEL_ID,
        dtype=(torch.bfloat16 if torch.cuda.is_available() else "auto"),
        device_map="auto",
        model_kwargs=model_kwargs,  # empty dict if FA2 not present
    )
    return _PIPE
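As a quick illustration of how this loader is meant to be used, and how the environment variables above override its defaults, here is a hypothetical session (the values are only examples, and the variables must be set before app.model is imported):
# Hypothetical usage from another module or a Python shell at the project root.
import os
os.environ["K2_MAX_NEW_TOKENS"] = "512"   # read at import time by app.model
os.environ["K2_USE_FA2"] = "0"            # force standard attention

from app.model import get_pipe, GEN_KWARGS

pipe = get_pipe()        # first call loads K2-Think
same_pipe = get_pipe()   # later calls return the same pipeline object
assert pipe is same_pipe
print(GEN_KWARGS)        # {'max_new_tokens': 512, 'temperature': 0.1, ...}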
6.2. app/tools_math.py — Small verifiers (SymPy)
- Tiny helpers that the referee can reference or you can expand later.
Create: app/tools_math.py
import sympy as sp

def is_prime(n: int) -> bool:
    try:
        return sp.isprime(int(n))
    except Exception:
        return False

def next_prime(n: int) -> int:
    try:
        return int(sp.nextprime(int(n)))
    except Exception:
        return -1

def equal_expr(a: str, b: str) -> bool:
    """Return True if two algebraic expressions are symbolically equal."""
    try:
        return sp.simplify(sp.sympify(a) - sp.sympify(b)) == 0
    except Exception:
        return False
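To see what these helpers do, and to hand-check the demo prompt used later, you can call them directly from the project root; for example:
# Quick check of the verifiers (run from the folder that contains app/).
from app.tools_math import is_prime, next_prime, equal_expr

print(is_prime(2600))    # False (2600 is even)
print(next_prime(2600))  # 2609, the next prime after 2600
print(equal_expr("(x+1)**2", "x**2 + 2*x + 1"))  # True: symbolically equal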
6.3. app/proposers.py — Run A and B in parallel
- Short prompts; parallelized using a small thread pool.
- Reuses the one global pipeline.
Create: app/proposers.py
import asyncio
from concurrent.futures import ThreadPoolExecutor
from app.model import get_pipe, GEN_KWARGS

PROMPT_A = "Role: Proposer A. Solve with numbered steps. End with \\boxed{answer}."
PROMPT_B = "Role: Proposer B. Solve differently. Numbered steps. End with \\boxed{answer}."

_pipe = get_pipe()
_executor = ThreadPoolExecutor(max_workers=2)

def _gen(prompt: str) -> str:
    out = _pipe([{"role": "user", "content": prompt}], **GEN_KWARGS)[0]["generated_text"]
    # With chat-style input the pipeline returns the whole conversation;
    # keep only the assistant's reply as plain text.
    return out[-1]["content"] if isinstance(out, list) else out

async def _agen(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, _gen, prompt)

def propose_parallel(problem: str):
    """Return (solution_A, solution_B) computed concurrently."""
    pa = f"{PROMPT_A}\nProblem: {problem}"
    pb = f"{PROMPT_B}\nProblem: {problem}"
    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(asyncio.gather(_agen(pa), _agen(pb)))
    finally:
        loop.close()
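If you want to exercise the proposers without the UI, a throwaway driver script works; the filename below is just an example, and it should live at the project root next to the app/ folder:
# Hypothetical test_proposers.py, run as: python test_proposers.py
from app.proposers import propose_parallel

sol_a, sol_b = propose_parallel("Find the next prime after 2600")
print("--- Proposer A ---\n", sol_a)
print("--- Proposer B ---\n", sol_b)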
6.4. app/referee.py — Concise verdict JSON
- Asks for {winner, reason, final_answer} to keep the UI tidy.
- You can expand later to include step-by-step checks.
Create: app/referee.py
from app.model import get_pipe, GEN_KWARGS

REFEREE_PROMPT = (
    "Role: Referee. Compare Solution A and B. Be concise. "
    "If a mistake exists, cite the first wrong step and give the corrected final answer. "
    "Return JSON: {winner, reason, final_answer}."
)

_pipe = get_pipe()

def adjudicate(sol_a: str, sol_b: str) -> str:
    prompt = f"{REFEREE_PROMPT}\n\nA:\n{sol_a}\n\nB:\n{sol_b}"
    out = _pipe([{"role": "user", "content": prompt}], **GEN_KWARGS)[0]["generated_text"]
    # Same chat-vs-plain-text handling as in proposers.py.
    return out[-1]["content"] if isinstance(out, list) else out
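The referee is only asked, not forced, to return JSON, so downstream code should not assume the verdict always parses. If you later want to consume the verdict programmatically, a defensive sketch (a hypothetical helper, not part of the app files above) could look like this:
# Hypothetical helper: pull a {winner, reason, final_answer} object out of the
# referee's free-form reply, falling back to the raw text if parsing fails.
import json, re

def parse_verdict(text: str) -> dict:
    match = re.search(r"\{.*\}", text, re.DOTALL)  # first {...} span, if any
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"winner": None, "reason": text, "final_answer": None}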
6.5. app/ui_gradio.py — Nice, tall panes + buttons
- Uses gr.Blocks + custom CSS to make outputs tall and readable.
- Wires Submit and Clear.
Create: app/ui_gradio.py
import gradio as gr
from app.proposers import propose_parallel
from app.referee import adjudicate

def duel(problem: str):
    sol_a, sol_b = propose_parallel(problem)
    ref = adjudicate(sol_a, sol_b)
    return sol_a, sol_b, ref

with gr.Blocks(css="""
.bigbox textarea {
    height: 420px !important;
    font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
    font-size: 14px;
    line-height: 1.35;
    white-space: pre-wrap;
}
""") as demo:
    gr.Markdown("# K2-Think Math Dueler\nTwo proposers generate solutions; a Referee compares and verifies. **(Fast mode)**")
    with gr.Row():
        with gr.Column(scale=5, min_width=420):
            problem = gr.Textbox(
                label="Enter a problem (e.g., 'Find the next prime after 2600')",
                placeholder="Find the next prime after 2600",
                lines=3
            )
            with gr.Row():
                clear_btn = gr.Button("Clear", variant="secondary")
                submit_btn = gr.Button("Submit", variant="primary")
        with gr.Column(scale=7, min_width=520):
            out_a = gr.Textbox(label="Proposer A", lines=22, show_copy_button=True, elem_classes=["bigbox"])
            out_b = gr.Textbox(label="Proposer B", lines=22, show_copy_button=True, elem_classes=["bigbox"])
            out_r = gr.Textbox(label="Referee Verdict", lines=22, show_copy_button=True, elem_classes=["bigbox"])

    submit_btn.click(fn=duel, inputs=problem, outputs=[out_a, out_b, out_r])
    clear_btn.click(lambda: ("", "", "", ""), inputs=None, outputs=[problem, out_a, out_b, out_r])

if __name__ == "__main__":
    demo.launch()
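One practical note for GPU VMs: demo.launch() binds to 127.0.0.1 by default, so from your local browser you either need SSH port forwarding or a launch call that binds to all interfaces. A possible variant (assuming port 7860 is open or forwarded on your VM) is:
# Alternative launch for remote access; keep the default and tunnel the port
# over SSH if you prefer not to expose the server.
if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)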
Step 7: Launch the App (FA2 disabled for max compatibility)
Run with FlashAttention-2 off (safe fallback) and start the UI:
export K2_USE_FA2=0 # force standard attention (no FA2 required)
python -m app.ui_gradio # launch the Gradio app
Why this step: Ensures the model runs even if flash-attn isn’t installed or is mismatched with your Torch/CUDA build.
What you should see: Gradio prints a local URL (like http://127.0.0.1:7860). Open it in your browser and try a prompt:
Find the next prime after 2600
Step 8: Play with the Agent
Enter a problem, hit Submit, and compare the two proposers’ solutions side by side with the Referee’s verdict.
Conclusion
With this guide, you’ve gone from pulling the K2-Think model to building a complete Math Dueler Agent that runs locally on your GPU VM. You now have two proposer agents solving problems in different ways and a referee agent verifying and declaring the winner—all wrapped in a clean Gradio interface. This project not only showcases the reasoning power of K2-Think but also gives you a foundation to expand: add new tools, bring in more domains beyond math, or even adapt the referee to handle richer verification pipelines. The key takeaway—K2-Think isn’t just a model you run; it’s a powerful reasoning engine you can turn into interactive, verifiable applications.