From DeepSeek R1 to MiniMax-M2, the largest and most capable open-weight LLMs today remain autoregressive decoder-style transformers, which are built on flavors of the original multi-head attention mechanism. However, we have also seen alternatives to standard LLMs pop up in recent years, from text diffusion models to the most recent linear attention hybrid architectures. Some of them are geared towards better efficiency, and others, like code world models, aim to improve modeling performance.

After I shared my Big LLM Architecture Comparison a few months ago, which focused on the main transformer-based LLMs, I received a lot of questions about what I think of these alternative approaches. (I also recently gave a short talk about this at the PyTorch Conference 2025, where I promised attendees a follow-up write-up of these alternative approaches.) So here it is!

Figure 4: Illustration of the traditional scaled-dot-product attention mechanism in multi-head attention; the quadratic cost in attention is due to the sequence length n.

2.2 Linear Attention

Linear attention variants have been around for a long time, and I remember seeing tons of papers on them in the early 2020s. For example, one of the earliest I recall is the 2020 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention paper, where the researchers approximated the attention mechanism as

softmax(QKᵀ) V ≈ ϕ(Q) (ϕ(K)ᵀ V),

with normalization terms omitted for simplicity. Here, ϕ(⋅) is a kernel feature function, set to ϕ(x) = elu(x) + 1. This approximation is efficient because it avoids explicitly computing the n×n attention matrix QKᵀ.

I don't want to dwell too long on these older attempts. But the bottom line is that they reduced both time and memory complexity from O(n²) to O(n), which makes attention much more efficient for long sequences. However, they never really gained traction because they degraded the model accuracy, and I have not seen one of these variants applied in an open-weight state-of-the-art LLM.

2.3 Linear Attention Revival

In the second half of this year, there has been a revival of linear attention variants, as well as a bit of back-and-forth from some model developers, as illustrated in the figure below.

Figure 6: Qwen3-Next with gated attention and Gated DeltaNet.

As depicted in the figure above, the attention mechanism is implemented either as gated attention or as Gated DeltaNet. This simply means that the 48 transformer blocks (layers) in this architecture alternate between the two. Specifically, as mentioned earlier, they alternate in a 3:1 ratio. For instance, the first transformer blocks are laid out as follows:
──────────────────────────────────
Layer 1 : Linear attention → MoE
Layer 2 : Linear attention → MoE
Layer 3 : Linear attention → MoE
Layer 4 : Full attention → MoE
──────────────────────────────────
Layer 5 : Linear attention → MoE
Layer 6 : Linear attention → MoE
Layer 7 : Linear attention → MoE
Layer 8 : Full attention → MoE
──────────────────────────────────
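The linear attention layers in this table use Gated DeltaNet. To make that more concrete, below is a minimal, single-head sketch of a gated delta-rule state update, written as my own simplified reconstruction based on the description that follows; it is not the actual Qwen3-Next or Kimi Linear code, and names such as gated_deltanet_step, alpha_t, and beta_t are just illustrative:

```python
import torch

def gated_deltanet_step(S, q_t, k_t, v_t, alpha_t, beta_t):
    # S: recurrent memory of shape (d_v, d_k); its size is fixed and does not grow with context length
    # alpha_t in (0, 1): how much of the old memory to keep (decay / forget gate)
    # beta_t in (0, 1): how strongly the current token writes into the memory
    pred_v = S @ k_t                      # what the memory currently associates with k_t
    # Delta rule: decay the old memory, then write the prediction error for the current token
    S = alpha_t * S + beta_t * torch.outer(v_t - pred_v, k_t)
    o_t = S @ q_t                         # read-out for the current query
    return S, o_t

# Toy usage: process a sequence token by token (cost grows linearly with sequence length)
d_k, d_v, seq_len = 8, 8, 5
S = torch.zeros(d_v, d_k)
for t in range(seq_len):
    q_t, k_t, v_t = torch.randn(3, d_k)               # stand-ins for projected token embeddings
    alpha_t, beta_t = torch.rand(()), torch.rand(())  # per-token gates (learned projections in practice)
    S, o_t = gated_deltanet_step(S, q_t, k_t, v_t, alpha_t, beta_t)
```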
This 3:1 pattern continues through the remaining layers; otherwise, the architecture is pretty standard and similar to Qwen3.

The gates in the sketch above control how that recurrent memory (S) changes:

- α (alpha) regulates how much of the old memory to forget (decay).
- β (beta) regulates how much the current token at time step t updates the memory.

(And the final output gate, not shown in the sketch above, is similar to gated attention; it controls how much of the output is kept.)

So, in a sense, this state update in Gated DeltaNet is similar to how recurrent neural networks (RNNs) work. The advantage is that it scales linearly (via the for-loop) instead of quadratically with the context length. The downside of this recurrent state update is that, compared to regular (or gated) attention, it sacrifices the global context modeling ability that comes from full pairwise attention.

Gated DeltaNet can, to some extent, still capture context, but it has to go through the memory (S) bottleneck. That memory has a fixed size and is thus more efficient, but it compresses the past context into a single hidden state, similar to RNNs. That's why the Qwen3-Next and Kimi Linear architectures don't replace all attention layers with DeltaNet layers but use the 3:1 ratio mentioned earlier.

2.7 DeltaNet Memory Savings

In the previous section, we discussed the advantage of DeltaNet over full attention in terms of linear instead of quadratic compute complexity with respect to the context length. Next to the linear compute complexity, another big advantage of DeltaNet is the memory savings, as DeltaNet modules don't grow the KV cache. (For more information, see my earlier article on KV caching.)

Figure 11: Qwen3-Next and Kimi Linear side by side.

Gated DeltaNet is a linear attention variant that takes inspiration from recurrent neural networks, including a gating mechanism.

3. Text Diffusion Models

Figure 13: Illustration of an image diffusion process.

Figure 15: Illustration of the denoising process using the 8B LLaDA model.

As we can see in the animation above, the text diffusion process successively replaces [MASK] tokens with text tokens to generate the answer.

Figure 17: Benchmark performance of a (faster) diffusion LLM (Gemini Diffusion) versus a fast autoregressive LLM (Gemini 2.0 Flash-Lite). Based on the numbers reported in https://deepmind.google/models/gemini-diffusion/#capabilities.

4. World Models

So far, we discussed approaches that focus on improving efficiency and making models faster or more scalable. These approaches usually come at the cost of slightly degraded modeling performance. Now, the topic in this section takes a different angle and focuses on improving modeling performance (not efficiency). This improved performance is achieved by teaching the models an "understanding of the world."

World models have traditionally been developed independently of language modeling, but the recent Code World Model (CWM) brings the two closer together.

Figure 19: Example of code execution tracing in the Code World Model (CWM). The model predicts how variable states evolve step by step as each line of code executes. Here, the model effectively simulates the code's behavior. Annotated figure from https://www.arxiv.org/abs/2510.02387.

At inference time, CWM is still an autoregressive transformer that generates one token at a time, just like GPT-style models. The key difference is that these tokens can encode structured execution traces rather than plain text.
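To give a rough sense of what an execution trace looks like, here is a small, hypothetical Python example that uses sys.settrace to record the local variable states after each executed line. It is only meant to illustrate the general concept of an execution trace, not CWM's actual trace format:

```python
import sys

def trace_locals(frame, event, arg):
    # Called by the interpreter for each traced event; on every executed line,
    # print the line number and a snapshot of the local variables.
    if event == "line":
        print(f"line {frame.f_lineno}: locals = {dict(frame.f_locals)}")
    return trace_locals

def count_up_to(n):
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(trace_locals)   # record variable states step by step
count_up_to(3)
sys.settrace(None)           # stop tracing
```

The important difference is that CWM learns to predict such variable states itself, token by token, rather than obtaining them by actually running the code.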
So, I would maybe not call it a world model, but rather a world-model-augmented LLM. For a first attempt, it performs surprisingly well and is on par with gpt-oss-20b (mid reasoning effort) at roughly the same size. If test-time scaling is used, it even performs slightly better than gpt-oss-120b (high reasoning effort) while being 4x smaller.

Note that their test-time scaling uses a best@k procedure with generated unit tests (think of a fancy majority voting scheme). It would have been interesting to see a tokens/sec or time-to-solution comparison between CWM and gpt-oss, as they use different test-time-scaling strategies (best@k versus more tokens per reasoning effort).

5. Small Recursive Transformers

Figure 21: LLM landscape overview; this section covers small recursive transformers.

More specifically, the HRM (Hierarchical Reasoning Model) developers showed that even very small transformer models (with only 4 blocks) can develop impressive reasoning capabilities (on specialized problems) when trained to refine their answers step by step. This resulted in a top spot on the ARC challenge.

Figure 23: The Tiny Recursive Model (TRM). Annotated figure from https://arxiv.org/abs/2510.04871.

In the remainder of this section, let's take a look at TRM in a bit more detail.

5.1 What Does Recursion Mean Here?

TRM refines its answer through two alternating updates:

1. It computes a latent reasoning state from the current question and answer.
2. It then updates the answer based on that latent state.

The training runs for up to 16 refinement steps per batch. Each step performs several no-grad loops to iteratively refine the answer. This is followed by a gradient loop that backpropagates through the full reasoning sequence to update the model weights.

It's important to note that TRM is not a language model operating on text. However, I decided to include it here because (a) it's a transformer-based architecture, (b) reasoning is now a central focus in LLM research and this model represents a distinctly different take on reasoning, and (c) many readers have asked me to cover HRM (and TRM is its more advanced successor).

While TRM could be extended to textual question-answer tasks in the future, it currently works on grid-based inputs and outputs. In other words, both the "question" and the "answer" are grids of discrete tokens (for example, 9×9 Sudoku or 30×30 ARC/Maze puzzles), not text sequences.

5.2 How Does TRM Differ From HRM?

- HRM consists of two small transformer modules (each with 4 blocks) that communicate across recursion levels. TRM only uses a single 2-layer transformer. (Note that the previous TRM figure shows a 4× next to the transformer block, but that's likely to make it easier to compare against HRM.)
- TRM backpropagates through all recursive steps, whereas HRM only backpropagates through the final few.
- HRM includes an explicit halting mechanism to determine when to stop iterating. TRM replaces this mechanism with a simple binary cross-entropy loss that learns when to stop.

Performance-wise, TRM compares very favorably to HRM, as shown in the figure below.

Build a Large Language Model (From Scratch) is now available on Amazon, and Build a Reasoning Model (From Scratch) is in Early Access at Manning. If you read the book and have a few minutes to spare, I'd really appreciate a brief review. It helps us authors a lot!

Your support means a great deal! Thank you!