I was training a 0.6B model on live data that was growing by about 200 examples an hour. The standard approach says you do all your supervised fine-tuning first, then switch to reinforcement learning after. Two separate stages. Everyone does it that way because that's how the first papers did it and nobody stopped to ask why.
I tried it. The model needed 12 epochs to converge. On hardware that wasn't exactly generous, that meant waiting around a lot. And every time new data came in while training was still running, I had to decide: stop and restart with the updated dataset, or finish the run on stale data and hope the result still held. Neither option felt right.
So I tried something different. Instead of finishing SFT and then switching to GRPO, I alternated between them inside each epoch. Supervised phase, then RL phase, then supervised again. Back and forth from the start.
It converged in 3 epochs.
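To make the structure concrete, here's a minimal sketch of the alternating schedule. The names (`build_schedule`, `blocks_per_epoch`) and the block counts are my illustration, not the paper's exact setup; the point is only that SFT and RL interleave inside every epoch instead of running as two stages.

```python
def build_schedule(epochs, blocks_per_epoch):
    """Interleave supervised and RL phases inside each epoch,
    instead of running all SFT first and all RL after."""
    schedule = []
    for epoch in range(epochs):
        for _ in range(blocks_per_epoch):
            schedule.append(("sft", epoch))  # grounding: imitate good outputs
            schedule.append(("rl", epoch))   # exploration: reward-driven updates
    return schedule
```

The training loop then just walks the schedule and dispatches each entry to your supervised update or your GRPO update. Neither phase ever runs long enough to dominate.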
Why sequential training is wasteful
When you do all your supervised training first, the model gets comfortable. It memorizes your examples. It learns to copy. Then you flip the switch to RL and suddenly the rules change. Now it has to explore, try things, get scored. All those habits it built during SFT? RL has to fight through them. You get instability. You get forgetting. The model unlearns useful stuff because the RL signal is pulling it somewhere else.
Alternation avoids that entirely. The supervised phase keeps the model grounded. It knows what good output looks like. The RL phase gives it space to develop its own strategies. Neither phase runs long enough to take over. The model never fully settles into imitation mode because RL keeps pushing. And it never goes off the rails during RL because the next supervised phase reels it back.
What you end up with is a model that actually understands what it's doing instead of just pattern matching. It maintains formatting compliance while also developing real problem-solving ability. Those two things usually fight each other. Alternation makes them work together.
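One way to see how the two objectives can cooperate is in the reward itself. This is a hypothetical GRPO reward for illustration: the `<answer>` tag format and the 0.2/0.8 weights are my assumptions, not the paper's recipe. It scores format compliance and task correctness separately, so the RL phase rewards both rather than trading one for the other.

```python
import re

def reward(completion, reference):
    """Hypothetical GRPO reward mixing format compliance with correctness.
    The <answer> tags and the 0.2/0.8 weights are illustrative assumptions."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_score = 1.0 if match else 0.0            # did it follow the format?
    correct = match is not None and match.group(1).strip() == reference
    task_score = 1.0 if correct else 0.0            # did it solve the task?
    return 0.2 * format_score + 0.8 * task_score
```

A completion that's well-formatted but wrong still earns partial credit, so the model never has an incentive to abandon the formatting the supervised phase taught it.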
Live data changes everything
The part that made this necessary was the live dataset. Static data is forgiving. You set up your pipeline, run it, come back later. Live data doesn't wait. 200 new examples per hour means your dataset is different at the end of training than it was at the start.
With ASRL, the alternating schedule described above, new data just gets absorbed in the next supervised phase. The RL phase immediately starts working with the updated knowledge. No restart. No checkpoint surgery. The loop keeps running and the model adapts as the data grows.
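The absorption step can be sketched like this. `LiveFeed` is a toy stand-in for whatever the real source is (an API, user submissions), and the 200-example chunk mirrors the growth rate from earlier; none of these names come from the paper.

```python
import itertools

class LiveFeed:
    """Toy stand-in for a live data source (an API, user submissions, ...).
    poll() hands back whatever arrived since the last call."""
    def __init__(self, stream):
        self._stream = iter(stream)

    def poll(self, max_new):
        return list(itertools.islice(self._stream, max_new))

def run_epoch(buffer, feed, max_new=200):
    """One alternating epoch: absorb new examples before the supervised
    phase, then both phases train on the grown buffer. No restart."""
    buffer.extend(feed.poll(max_new))   # supervised phase sees fresh data
    return len(buffer)                  # RL phase uses the same updated buffer
```

Running three epochs against a feed of 500 examples, the buffer grows from 200 to 400 to 500 across epochs while training never stops.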
If you've ever trained on data from APIs, user submissions, anything that updates in real time, you know how painful the alternative is. Freeze the dataset and miss new data, or restart and waste everything you already computed. ASRL sidesteps that choice.
Small model, deliberate constraint
I did this on 600 million parameters. Not because I couldn't get access to bigger models but because I wanted to know if the method held up under real constraints. If a training approach only works when you throw a cluster at it, it's not a good approach. It's an expensive one.
3 epochs to convergence means you can actually experiment. Try something, see what happens, adjust, run again. With 12-epoch training, every experiment is a commitment. You wait hours to find out your idea didn't work. That kills iteration speed, and iteration speed is where good research actually happens.
The full paper is published in IJSET with the exact setup, baseline comparisons, and formatting compliance metrics. Sometimes the answer isn't tuning hyperparameters harder. Sometimes you just need to rethink the order of operations.