
I trained a language model that thinks the capital of Japan is Paris

Faris Allafi · July 2026 · Model: hr-diffuse-1-nano on Hugging Face

I am 13, and I spent hours of my time, and my own money, to train a language model that thinks the capital of Japan is Paris. First thing you should know: contrary to common belief, the capital of Japan is in fact Tokyo. Now I know what you are thinking... what is the point of this entire model? You might think I am just building another ChatGPT wrapper, and that could not be farther from the truth.

The transformer architecture, popularized by the paper Attention Is All You Need (Vaswani et al., 2017), is the current SOTA architecture in LLMs. I will not go in depth on how it works, since this is a technical overview of a different architecture, but you are welcome to read the paper. And to be clear, I am not knocking the transformer at all. Without it we would not have the (arguably, I would say at least partially) AGI we have in this day and age. But with great power comes great quadratic complexity: attention costs grow with the square of the context length. And with what we ask of AI today (coding agents holding entire repositories in context, assistants carrying week-long chat histories, retrieval pipelines stuffing dozens of documents into one prompt, and all of it expected to be fast and cheap), the way we currently process text starts to hurt.

That is where DIMBA comes in. Technically this is DIMBA II, the second generation of the architecture. The first generation never made it off the GPU, so as far as the world is concerned, this is DIMBA.

The architecture

DIMBA combines the extreme context efficiency of Mamba-2 (Dao and Gu, 2024) with the parallel generation of diffusion language models. As far as I can tell, nobody has published this combination: every masked diffusion text model I know of (LLaDA, MDLM, Dream) sits on a transformer backbone. DIMBA sits on a bidirectional Mamba spine instead.

Some of the fixes DIMBA II makes over DIMBA I, in short:

DIMBA I used latent-space diffusion, and early DIMBA II builds did too. After testing, this proved too problematic for full text generation (more on that in a moment). We may still bring it back in a larger base train as a "planning mode" that sketches the answer in latent space before the text is generated.
DIMBA I diffused Gaussian noise in a continuous space and then snapped the result to the nearest words. That final snap is where everything fell apart: smooth vectors decode to word salad. DIMBA II switched to what the current frontier uses, masked diffusion, where the model sees text with [MASK] tokens and learns to fill them in directly.
The fine-tuning loss is computed on the response plus exactly one end-of-sequence token, and never on the padding tail. Training on padding silently teaches the model that the best answer is an empty one, while the loss chart looks fantastic. Ask me how I know.
Ten percent of training rows hide the prompt entirely, which unlocks classifier-free guidance at inference time. This turned out to be the single biggest quality lever in the whole project.
An anti-repetition sampler: a frequency penalty that forgives the first use of every word and punishes repeats, plus a ban on committing the same token twice in a row.
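Three of these fixes are small enough to sketch in plain Python. This is an illustration under assumptions, not the DIMBA training code: the function names, the sequence layout (prompt, response, one EOS, then padding), the guidance scale, and the exact penalty formula are all hypothetical. In particular, "forgives the first use" is read here as subtracting penalty * (count - 1), and "twice in a row" as matching an already-committed neighboring position.

```python
from collections import Counter


def response_loss_mask(prompt_len, response_len, total_len):
    """Return 1 where loss is counted: the response plus exactly one EOS.

    Assumed layout: [prompt][response][EOS][PAD, PAD, ...].
    """
    end = prompt_len + response_len + 1  # +1 for the single EOS token
    if end > total_len:
        raise ValueError("sequence too short for response plus EOS")
    return [1 if prompt_len <= i < end else 0 for i in range(total_len)]


def guided_logits(cond, uncond, scale):
    """Classifier-free guidance over two logit vectors.

    cond: logits with the prompt visible; uncond: logits with it hidden
    (the case the prompt-free training rows prepare the model for).
    scale=1 reproduces cond; larger values lean harder on the prompt.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def penalize_repeats(logits, seq, pos, penalty):
    """Adjust logits for position pos given the partly filled sequence seq.

    seq holds token ids, with None for positions that are still masked.
    """
    counts = Counter(t for t in seq if t is not None)
    out = list(logits)
    for tok, n in counts.items():
        out[tok] -= penalty * (n - 1)  # a token used only once costs nothing
    for neighbor in (pos - 1, pos + 1):
        if 0 <= neighbor < len(seq) and seq[neighbor] is not None:
            out[seq[neighbor]] = float("-inf")  # never repeat a neighbor
    return out
```

With prompt_len=2, response_len=3 and total_len=8, the mask is [0, 0, 1, 1, 1, 1, 0, 0]: the two trailing padding slots contribute nothing to the loss, which is the point of the fix.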

What I actually built

We trained a roughly 300M-parameter model (287.9M measured) on the DIMBA II architecture, which pairs LLaDA-style masked diffusion with a Mamba-based mixer. It was cross-architecture distilled from SmolLM-135M over 28B tokens, built on top of the MLPs extracted from that base model.

Now you might be thinking: wait a second, why is the model over double the size of its teacher? Because bidirectionality is expensive. To see context on both sides of a masked token, DIMBA runs a forward stack and a backward stack, which roughly doubles the mixer, and diffusion also pays for timestep conditioning that a normal LLM does not need. The honest label is 288M parameters with 135M-class knowledge capacity, since the two directions mostly end up storing the same facts twice. Keep that "twice" in mind, because one of my favorite results in this post is about deleting it.
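To make the "forward stack plus backward stack" idea concrete, here is a toy sketch. It swaps Mamba-2 for the simplest possible causal recurrence (a decaying running sum), so it only illustrates the wiring: one causal pass left to right, a second causal pass over the reversed sequence, and a sum of the two. The function names and the two decay parameters are hypothetical stand-ins for the two separate weight stacks.

```python
def causal_mix(xs, decay):
    """Toy causal mixer: each output depends only on inputs at or before it."""
    state, out = 0.0, []
    for x in xs:
        state = decay * state + x
        out.append(state)
    return out


def bidirectional_mix(xs, fwd_decay, bwd_decay):
    """Combine a left-to-right pass with a right-to-left pass.

    fwd_decay and bwd_decay play the role of two independent parameter
    sets, which is why the bidirectional mixer costs roughly double.
    """
    fwd = causal_mix(xs, fwd_decay)
    bwd = causal_mix(xs[::-1], bwd_decay)[::-1]  # scan the reversed sequence
    return [f + b for f, b in zip(fwd, bwd)]


signal = [0.0, 1.0, 0.0]
print(causal_mix(signal, 0.5))              # position 0 cannot see the 1.0
print(bidirectional_mix(signal, 0.5, 0.5))  # position 0 now sees it
```

In the causal pass the first position stays at zero because the only nonzero input lies in its future. In the bidirectional version the backward pass carries that information leftward, which is what a masked token needs in order to use context on both sides.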

The model did not train as well as I hoped, because of two specific bugs. The first: during the 28B-token distillation stage, the teacher model was off for effectively the entire run. I paid for a tutor and the tutor never showed up to class. The second was mentioned above: the whole run targeted latent diffusion, and latent diffusion gave me word salad.

By the time I understood both problems, it was too late to restart. I had poured a few hundred dollars into those weights. What I could do was salvage: a repair run of 1.6 billion tokens on the same model with the teacher switched ON, then a conversion stage that taught the model to speak in LLaDA-style masked diffusion, then supervised fine-tuning on about 422k instruction pairs. And voilà: a model that can actually (kind of) speak English.

Small models cannot judge themselves

Most of the large SOTA models you have used are smart enough to judge and correct their own output. What I rediscovered, six different ways, is that at this scale the model simply cannot. I tested:

Perplexity reranking. Generate 8 candidate answers, let the model pick its favorite. It reliably picks degenerate repetition, because loops are easy to predict and the model loves them.
Confidence-based remasking. Find the tokens the model is least sure about, remask them, refill. Accuracy did not move.
The same, with a smarter refill strategy. Same result.
Repair training. Fine-tune on text with randomly planted errors so it learns to spot and fix corruption. Detection went from 0 to 7.1 percent, so it technically works. Then I tried training it on its own sampled mistakes instead, and detection collapsed to zero. Its own mistakes are, by definition, exactly what it finds plausible. Echo chamber.
Falling-confidence remasking. Watch for committed tokens whose probability drops as the surrounding context fills in, and remask those. It almost never fired. The model's confidence in its mistakes only grows as it commits to them. It does not doubt. It rationalizes.
Lookahead verification. Commit tokens, immediately re-score them with the commitments in place, revert whatever got worse. Also nearly never fired, for the same reason.
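For readers who have not seen remasking before, the selection step of two of these strategies can be sketched as below. This is a hedged illustration, not the code used in the experiments: the function names, the use of None as the mask marker, and the min_drop threshold are assumptions, and the refill step (running the model again on the masked positions) is left out.

```python
MASK = None  # hypothetical marker for a masked position


def remask_least_confident(tokens, confidences, k):
    """Confidence-based remasking: mask the k positions the model trusts least."""
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    worst = set(order[:k])
    return [MASK if i in worst else tok for i, tok in enumerate(tokens)]


def falling_confidence_positions(conf_at_commit, conf_now, min_drop):
    """Falling-confidence remasking: find committed tokens whose probability
    dropped by at least min_drop once more context was filled in."""
    return [
        i
        for i, (before, after) in enumerate(zip(conf_at_commit, conf_now))
        if before - after >= min_drop
    ]


tokens = ["the", "capital", "is", "paris"]
print(remask_least_confident(tokens, [0.9, 0.2, 0.8, 0.1], k=2))
print(falling_confidence_positions([0.6, 0.7, 0.5], [0.9, 0.4, 0.5], min_drop=0.2))
```

The failure described above shows up in the second function: if the model's confidence in a wrong token only rises after it is committed, the list it returns is almost always empty.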

One external thing did work: a tiny critic head. It is a 300 thousand parameter add-on that reads the frozen model's internal features and scores every token as right or wrong, trained on planted errors. It flags wrong tokens at 52.5 percent precision where chance is 10 percent, and ranks a wrong token above a correct one 78.9 percent of the time where chance is 50. The model's own confidence performs at chance on the exact same test. So error detection is solvable from the outside. But wiring the critic into a correction loop fails in a textbook Goodhart way: the critic score improves on every pass while the actual answers do not get better, because the thing refilling the flagged tokens is still the same small model.
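The two critic numbers are different metrics, and a small sketch may help keep them apart. Precision asks what share of flagged tokens are really wrong, so random flagging scores the base rate of wrong tokens (the 10 percent above). The ranking number asks how often a wrong token outscores a correct one across all wrong/correct pairs, so chance is 50 percent. The function names, the threshold, and the toy scores below are assumptions for illustration, not the evaluation code behind the reported figures.

```python
def precision_at_threshold(scores, is_wrong, threshold):
    """Share of flagged tokens (score >= threshold) that are truly wrong."""
    flagged = [wrong for s, wrong in zip(scores, is_wrong) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else 0.0


def pairwise_ranking_accuracy(scores, is_wrong):
    """Probability that a wrong token outscores a correct one (ties count half)."""
    wrong = [s for s, w in zip(scores, is_wrong) if w]
    right = [s for s, w in zip(scores, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct token")
    wins = sum((a > b) + 0.5 * (a == b) for a in wrong for b in right)
    return wins / (len(wrong) * len(right))


# Toy data: higher score means the critic thinks the token is wrong.
scores = [0.9, 0.8, 0.3, 0.6, 0.1]
is_wrong = [True, False, False, True, False]
print(precision_at_threshold(scores, is_wrong, 0.5))
print(pairwise_ranking_accuracy(scores, is_wrong))
```

Running the same two functions on the model's own confidence (using one minus confidence as the score) is one way to reproduce the "performs at chance" comparison on a shared test set.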

My takeaway from all six failures plus the critic: at small scale, quality has to come from external constraints, because self-judgment is the first casualty of small scale.

The dial

One of the coolest things I built at the inference level is a single dial between two ends:

high quality (slow · higher cost): 256 steps · 8 candidates
lower quality (very fast · cheaper): 16 steps · 1 candidate

Under the hood the dial controls two things at once: how many diffusion steps we run (16 up to 256; more steps means fewer tokens committed per step, so fewer mistakes get locked in), and how many complete candidate answers we generate (1 up to 8), with a verifier choosing the winner. The verifier works by masking out pieces of each finished candidate and measuring how strongly the model's prompt-conditioned predictions agree with what was actually written, with the critic head as an optional tiebreaker. One knob, and it is native to diffusion: an autoregressive model of this size has no equivalent lever.
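One way to picture the dial is as a pair of small mapping functions. The sketch below is a reconstruction, not the shipped code: geometric interpolation between 16 and 256 steps happens to reproduce the step counts in all six rows of the results table, while the candidate thresholds are hypothetical values picked only because they match those same rows. The function names are made up for the example.

```python
def dial_to_steps(dial, lo=16, hi=256):
    """Geometric interpolation between lo and hi diffusion steps."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0 and 1")
    return round(lo * (hi / lo) ** dial)


def dial_to_candidates(dial):
    """Hypothetical thresholds that match the measured rows, nothing more."""
    if not 0.0 <= dial <= 1.0:
        raise ValueError("dial must be between 0 and 1")
    for threshold, candidates in ((0.9, 8), (0.7, 4), (0.5, 2)):
        if dial >= threshold:
            return candidates
    return 1


for d in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(d, dial_to_steps(d), dial_to_candidates(d))
```

The geometric spacing means each equal turn of the knob multiplies the step count by the same factor, which fits a cost axis that rises smoothly from the cheap end to the expensive end.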

The results we got are, well, humbling:

Dial	Steps	Candidates	QA accuracy	Seconds per answer
0.1	21	1	7.5%	2.2
0.3	37	1	17.5%	3.9
0.5	64	2	10.0%	7.1
0.7	111	4	12.5%	12.2
0.9	194	8	17.5%	21.2
1.0	256	8	20.0%	27.7

Figure: Accuracy vs. latency across the dial.

The cost axis works perfectly: turn the knob and you smoothly trade 2.2 seconds for 27.7. The accuracy axis is noisy and barely moves. In practice: 15 percent at production settings, 20 percent with the dial maxed out, and around 18 at settings you would actually wait for.

The reason is the theme of this entire post: you cannot extract knowledge that was never stored in the model. We assumed the training run did not need the teacher model. It turns out it very much does. The dial is a working engine bolted to a small fuel tank, and the teacher bug means the tank was only partly filled. Fixing that one bug is the single cheapest way to raise every number in this post, and it is already fixed for the next run.

The scoreboard

So how does it stack up against real models of the same size? I benchmarked it against its own teacher and the usual small-model suspects: 40 factual questions, repetition metrics, a fill-in-the-middle test, and latency.

Model	QA accuracy	Loop rate	Infill recovery	Seconds per answer
hr-diffuse-1-nano (us)	15.0%	7.5%	14.0%	13.3
SmolLM-135M (teacher)	82.5%	37.5%	2.9%	0.63
SmolLM-135M-Instruct	60.0%	2.5%	0.0%	0.62
GPT-2 (124M)	20.0%	90.0%	0.0%	0.18
Pythia-160M	10.0%	15.0%	1.7%	0.19

Reading it honestly: hr-diffuse-1-nano loses the knowledge test badly to its own teacher, which saw about 600B tokens of clean pretraining to my broken 28B. It roughly matches GPT-2 and beats Pythia-160M, which saw ten times my data. And there are two structural wins. Infill: DIMBA reconstructs the masked-out middle of a sentence at 14 percent while every autoregressive model scores near zero, because they physically cannot condition on text that comes after the blank. Loops: DIMBA degenerates into repetition on 7.5 percent of answers versus 37.5 for its own teacher and 90 for GPT-2. Those two properties come from the diffusion objective itself, and they scale up for free.

The experiment we ran in one afternoon

Remember the "stores every fact twice" problem? Midway through writing this post, I had an idea: what if both directions share one set of weights, and each direction gets a tiny LoRA adapter on top, so they can still specialize without duplicating the whole stack?
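The idea can be sketched for a single projection matrix. Both directions multiply by the same shared matrix W, and each direction adds its own low-rank correction, so the output is W x + up (down x). This is a minimal plain-Python sketch with hypothetical names and toy shapes. It is not the DIMBA implementation, and the parameter counts cover one matrix rather than the whole model.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(x, shared_w, down, up):
    """Shared weights plus one direction's low-rank update: W x + up (down x)."""
    base = matvec(shared_w, x)
    delta = matvec(up, matvec(down, x))
    return [b + d for b, d in zip(base, delta)]


def param_counts(d_in, d_out, rank):
    """Parameters for one projection under the three variants."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)  # one down matrix plus one up matrix
    return {
        "double_stack": 2 * full,
        "shared": full,
        "shared_plus_lora": full + 2 * adapter,  # one adapter per direction
    }


shared_w = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 3.0]
forward_out = lora_forward(x, shared_w, down=[[1.0, 1.0]], up=[[0.5], [0.0]])
backward_out = lora_forward(x, shared_w, down=[[1.0, -1.0]], up=[[0.0], [1.0]])
print(forward_out, backward_out)
print(param_counts(512, 512, 8))
```

The same input produces different outputs in the two directions even though the large matrix is stored once, which is the specialization the adapters are meant to preserve.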

We tested it the same afternoon, on the same GPU, with three identical from-scratch training runs. Only the architecture differs:

Variant	Params	Loss (lower is better)
Full double stack (current)	287.9M	6.697
Fully shared weights	225.5M	6.968
Shared + per-direction LoRA (2.9M adapters)	228.4M	6.797

Pure sharing saves roughly 62M parameters but costs 0.271 nats of loss. Adding just 2.9M of per-direction LoRA buys back 63 percent of that damage, landing 21 percent smaller than the original at a fraction of the penalty, and its loss curve was still catching up to the full model when the test ended. The honest caveat: each run lasted only 2000 steps from scratch, which measures early learning speed, not converged capacity, so the next run will confirm this with a longer pilot before betting on it. But as of today, shared-plus-LoRA is the best trade-off between parameters and loss that has been measured for this architecture. Proposed at lunch, tested by dinner.
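The percentages in this paragraph follow directly from the table, and the arithmetic is short enough to check. The snippet below uses only the rounded table values, so the last digit can differ slightly from figures derived from exact parameter counts. The variable names are made up for the example.

```python
# Values copied from the table (parameters in millions, loss in nats).
full_params, shared_params, lora_params = 287.9, 225.5, 228.4
full_loss, shared_loss, lora_loss = 6.697, 6.968, 6.797

params_saved = full_params - shared_params        # cost of the second stack
loss_cost = shared_loss - full_loss               # damage from pure sharing
recovered = (shared_loss - lora_loss) / loss_cost # share of damage LoRA undoes
size_cut = 1 - lora_params / full_params          # shrink vs. the double stack

print(f"parameters saved by sharing: {params_saved:.1f}M")
print(f"loss cost of sharing: {loss_cost:.3f} nats")
print(f"damage recovered by LoRA: {recovered:.0%}")
print(f"shared + LoRA is smaller by: {size_cut:.0%}")
```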

What I would do with compute

Everything in this post cost about $500 total, every mistake included, on rented H100s paid for out of my own pocket.

The next run is fully specced. 1.5B-3B parameters: the scale where the LLaDA line of work shows masked diffusion becoming competitive with same-size autoregressive models. Teacher-enabled distillation from a SmolLM series model, with the gating bug fixed and verified this time. The Muon optimizer, which we A/B tested against AdamW on this exact architecture (5.453 versus 5.470 final loss, stable throughout, and to my knowledge the first Muon result on a Mamba diffusion LM). The shared-plus-LoRA bidirectionality above. And a cheap pilot phase at the start that validates every one of those choices with small A/Bs before the real budget is spent, because the most expensive lesson in this post came from not verifying that the expensive thing was doing the expensive thing.

We are looking for capital or compute partners, or GPU sponsorships, to back this 1.5B-3B run.

The scientific payoff is a clean four-point dataset showing how self-correction metrics (specifically planted-error detection, remask fire rate, critic accuracy, and the slope of the inference dial) evolve from the 135M scale up to 3B. If the curves bend upward, we locate the exact boundary where self-judgment emerges. If they stay flat, we publish a highly valuable negative result about small-scale diffusion dynamics.

If you run a compute program, are interested in funding us, have GPU credits gathering dust, or want to collaborate on the infrastructure side, my DMs are open. The full development archive, failed checkpoints, and training code are live on Hugging Face, and everything in this post is fully reproducible.

Faris Allafi, 13, Hamiltonian Research.