Most small models can't reason. They can produce text that looks like reasoning. Long chains of thought, confident-sounding steps, a final answer that's often wrong. If you actually test them on math or logic, they fall apart. The words are there. The thinking isn't.

I spent the last few months trying to fix that on a 2 billion parameter model. Not by scaling it up. Not by training on more data. By changing what reasoning actually looks like inside the output.

The result is Hito 2B. It's open-source, built on Qwen3.5-2B, and it scores 60% on GSM8K where the base model sits at 25%. That's a 35-point jump from a fine-tune, not a bigger model.

The cognitive framework

Normal chain-of-thought lets the model ramble. It writes whatever comes next. Sometimes that's useful. Often it's the model talking itself into the wrong answer because nothing forces it to pause and check.

Hito uses structured nested tags inside a <think> envelope. Each tag represents a specific kind of cognitive move:

<understand> restate the problem in the model's own words
<recall> pull relevant facts
<logic> apply a rule or derivation
<doubt> flag something that might be wrong
<verify> check against known constraints
<commit> lock in the answer

There are more. Tags for comparison, simulation, anticipation, reflection, honesty about limits. The full list covers five categories: comprehension, retrieval, deliberation, verification, and metacognition.

These aren't decoration. They're structural constraints on what the model is allowed to emit next. When the model enters a <verify> tag, it has to produce verification content. It can't skip ahead and commit. That single constraint prevents the most common failure mode in small models: confident shortcut reasoning that doesn't actually check anything.

Self-correction inside one response

The part I care about most is the <doubt> → <verify> → <commit> loop. The model can produce a tentative answer, flag doubt about it, verify against constraints, and update its commitment. All in a single generation.
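
To make that concrete, here's an invented example of the loop on a toy problem. The content is illustrative; only the tag grammar comes from the framework above:

<think>
<understand>A shirt costs $25 with a 20% discount. Find the final price.</understand>
<recall>A 20% discount means paying 80% of the list price.</recall>
<logic>0.80 × 25 = 20.</logic>
<doubt>Did I apply the discount to the right base?</doubt>
<verify>Discount amount: 0.20 × 25 = 5, and 25 - 5 = 20. Both paths agree.</verify>
<commit>$20</commit>
</think>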

That loop matters for two reasons. First, corrections become observable. You can see the moment the model catches its own mistake. Second, that creates a training signal. You can reward responses where doubt leads to verification leads to a better commit. The mechanism that produces correct answers is the same mechanism the reward function rewards.

Most models either never correct themselves or do it invisibly inside hidden activations. Hito does it in the token stream, where you can actually work with it.
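
Because the loop lives in plain tokens, extracting it takes a few lines. A minimal Python sketch (the tag names come from the framework; the function name is mine):

import re

def correction_trace(response: str) -> dict:
    """Pull the doubt -> verify -> commit spans out of one response."""
    spans = {}
    for tag in ("doubt", "verify", "commit"):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        spans[tag] = m.group(1).strip() if m else None
    return spans

Traces like this, run over a batch of generations, are exactly the kind of thing stage two of training can score.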

How it was trained

Two stages. Both use techniques I've written about before.

Stage one is Progressive LoRA Merging. Multiple rounds of LoRA fine-tuning on structured reasoning data. Each adapter gets merged into the base before the next round trains. This is how the cognitive framework grammar gets internalized instead of living on top of the base model as a thin adapter.
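
As a sketch of the merge cycle, assuming the Hugging Face peft library; the paths and round count are placeholders, and each round's actual fine-tuning is elided:

from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained("path/to/base")  # the Qwen3.5-2B base; path is a placeholder

for adapter_dir in ["round1-adapter", "round2-adapter", "round3-adapter"]:
    # ...train a fresh LoRA adapter on structured reasoning data (elided)...
    # Then fold it into the weights so the next round trains on the merged model.
    model = PeftModel.from_pretrained(model, adapter_dir).merge_and_unload()

model.save_pretrained("hito-2b-stage1")  # placeholder output path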

Stage two is GRPO with a custom reward formula. The key signal is reasoning-answer consistency. If the content inside the tags actually supports the final commitment, the response gets rewarded. If the commit doesn't follow from the reasoning, it doesn't. This pushes the model toward producing reasoning chains that are actually load-bearing, not decorative.
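
The actual formula isn't published in this post, so here is only a sketch of the general shape of a reasoning-answer consistency check. The tag names come from the framework; the weights and the substring stand-in for "the reasoning supports the commit" are mine:

import re

def consistency_reward(response: str, gold_answer: str) -> float:
    """Illustrative GRPO-style reward: pay for a correct commit that the
    preceding reasoning actually derives, not for the answer alone."""
    m = re.search(r"<commit>(.*?)</commit>", response, re.DOTALL)
    if m is None:
        return 0.0
    committed = m.group(1).strip()
    body = response[: m.start()]                 # everything before the commit
    correct = committed == gold_answer.strip()
    derived = committed in body                  # crude proxy for "reasoning supports it"
    verified = "<verify>" in body                # the commit followed a verification step
    return 1.0 * correct + 0.3 * (correct and derived) + 0.2 * verified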

What changed in the outputs

Hito produces roughly one quarter the reasoning length of base Qwen3.5-2B on the same prompts. Less text, more correct answers. The base model tends to expand into repetitive verification loops that look rigorous but don't add information. Hito stops when the verification is done.

Median response time on hard items drops from around 33 seconds on the base model to under 10 seconds on Hito. Not because inference is faster. Because the output is shorter.

What it can actually do

Benchmarks I ran under matched conditions (identical prompts, 8K context, 4,000-token budget, temperature 0, n=20 per benchmark):

Benchmark                      Hito   Base     Δ
GSM8K (math word problems)      60%    25%   +35
MATH-500 (competition math)     15%     5%   +10
ARC-Challenge (science)         75%    65%   +10
HumanEval-style (code)          95%    90%    +5

The macro average across reasoning tasks jumped 15 points. And ARC-AGI grid puzzles work at all, which they basically don't on most small models: Hito infers transformation rules from examples and applies them to novel inputs.

On classic traps like the disease-test Bayesian problem, where naive reasoning gives 99% and the correct answer is closer to 50%, Hito lands in the right probability band. Small models usually miss this one entirely.
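
For reference, the arithmetic under the textbook setup (1% prevalence, 99% sensitivity, 99% specificity; the exact prompt may differ):

# Bayes: P(disease | positive) under the classic setup.
p_d = 0.01          # prevalence
sens = 0.99         # P(positive | disease)
fpr = 0.01          # P(positive | healthy) = 1 - specificity

p_pos = sens * p_d + fpr * (1 - p_d)   # total probability of a positive test
print(sens * p_d / p_pos)              # 0.5, not the naive 0.99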

Running it

Two repos. hitonet/hito-2b is safetensors for Transformers. hitonet/hito-2b-GGUF is for Ollama, llama.cpp, and LM Studio. Ollama pulls straight from Hugging Face, so you just pick a quantization tag:

# Recommended default (1.4 GB, Q5_K_M)
ollama run hf.co/hitonet/hito-2b-GGUF:Q5_K_M

# Smaller if you need it (1.2 GB, Q4_K_M)
ollama run hf.co/hitonet/hito-2b-GGUF:Q4_K_M

# Near lossless (1.9 GB, Q8_0)
ollama run hf.co/hitonet/hito-2b-GGUF:Q8_0

Also available: F16 (3.6 GB, lossless), Q6_K (1.5 GB), Q2_K (924 MB), and TQ1_0 (687 MB, ternary, research only).
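
For the safetensors repo, a minimal Transformers sketch; the prompt and generation settings are mine, so check the model card for the intended chat template and sampling defaults:

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("hitonet/hito-2b")
model = AutoModelForCausalLM.from_pretrained("hitonet/hito-2b", device_map="auto")

messages = [{"role": "user", "content": "A shirt costs $25 with a 20% discount. Final price?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=False))  # keep the reasoning tags visible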

Or hit the hosted API at platform.hitonet.com if you just want to try it without running anything locally.

Why this approach

The industry default is: make the model bigger. 2B not enough? Train a 7B. Not enough? 70B. The compute bill keeps growing and the gains keep shrinking.

Structured reasoning is the alternative nobody wants to do because it's harder than scaling. You have to actually design the thinking process. You have to pick which cognitive moves matter, how they nest, how they constrain generation. You have to train the model to use the structure instead of drifting around it. Most teams would rather rent another cluster.

A 2B model that reasons beats a 7B model that doesn't. Hito 2B is what happens when you treat reasoning as an engineering problem instead of an emergent property you hope shows up at scale.

The weights are free for personal and research use. Commercial use needs a license. Everything you need is on the model card.