From GPT-2 to gpt-oss: Analyzing the Architectural Advances

OpenAI just released their new open-weight LLMs this week: gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019. And yes, thanks to some clever optimizations, they can run locally (but more about this later).

Earlier GPT models showed how the transformer architecture scales. The 2022 ChatGPT release then made these models mainstream by demonstrating concrete usefulness for writing and knowledge (and later coding) tasks. Now OpenAI has finally shared the long-awaited open-weight models, and the architecture has some interesting details.

I spent the past few days reading through the code and technical reports to summarize the most interesting details. (Just days after, OpenAI also announced GPT-5, which I will briefly discuss in the context of the gpt-oss models at the end of this article.)

Below is a quick preview of what the article covers. For easier navigation, I recommend using the Table of Contents on the left of the article page.

- Model architecture comparisons with GPT-2
- MXFP4 optimization to fit gpt-oss models onto single GPUs
- Width versus depth trade-offs (gpt-oss vs Qwen3)
- Attention bias and sinks
- Benchmarks and comparisons with GPT-5

I hope you find it informative!

1. Model Architecture Overview

Before we discuss the architecture in more detail, let's start with an overview of the two models, gpt-oss-20b and gpt-oss-120b, shown in Figure 1 below.

Figure 8: The feed forward module is replaced by a Mixture-of-Experts (MoE) module.

So, replacing a single feed forward module with multiple feed forward modules (as done in an MoE setup) substantially increases the model's total parameter count. However, the key trick is that we don't use ("activate") all experts for every token. Instead, a router selects only a small subset of experts per token.

Because only a few experts are active at a time, MoE modules are often referred to as sparse, in contrast to dense modules that always use the full parameter set. At the same time, the large total number of parameters in an MoE increases the capacity of the LLM, which means it can absorb more knowledge during training. The sparsity keeps inference efficient, though, as we don't use all the parameters at the same time.

(Fun fact: In most MoE models, expert weights account for more than 90% of the total model parameters.)

1.5 Grouped Query Attention Replaces Multi-Head Attention

As mentioned in my previous articles, Grouped Query Attention (GQA) has emerged in recent years as a more compute- and parameter-efficient alternative to Multi-Head Attention (MHA).

In MHA, each head has its own set of keys and values. GQA reduces memory usage by grouping multiple heads to share the same key and value projections. For example, as shown in Figure 9, if there are 2 key-value groups and 4 attention heads, heads 1 and 2 might share one set of keys and values, while heads 3 and 4 share another.

This grouping decreases the total number of key and value computations, leading to lower memory usage and improved efficiency, without noticeably affecting modeling performance according to ablation studies.
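If it helps to see the grouping idea in code, below is a minimal GQA sketch in PyTorch. It is not the gpt-oss implementation: the dimensions are made up for readability, and details such as causal masking, RoPE, and KV caching are omitted. It simply shows 4 query heads sharing 2 key/value groups, as in the example above.

```python
# Minimal Grouped Query Attention (GQA) sketch: 4 query heads, 2 key/value groups.
# Not the gpt-oss code; masking, RoPE, and KV caching are intentionally left out.
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, emb_dim=64, num_heads=4, num_kv_groups=2):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.head_dim = emb_dim // num_heads
        self.W_q = nn.Linear(emb_dim, num_heads * self.head_dim, bias=False)
        # Keys and values are projected once per group, not once per head
        self.W_k = nn.Linear(emb_dim, num_kv_groups * self.head_dim, bias=False)
        self.W_v = nn.Linear(emb_dim, num_kv_groups * self.head_dim, bias=False)
        self.out_proj = nn.Linear(num_heads * self.head_dim, emb_dim, bias=False)

    def forward(self, x):
        b, seq_len, _ = x.shape
        q = self.W_q(x).view(b, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, seq_len, self.num_kv_groups, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, seq_len, self.num_kv_groups, self.head_dim).transpose(1, 2)

        # Each group of query heads shares the same keys and values:
        # heads 0-1 reuse group 0, heads 2-3 reuse group 1
        group_size = self.num_heads // self.num_kv_groups
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, seq_len, -1)
        return self.out_proj(out)

x = torch.randn(1, 8, 64)                # (batch, tokens, embedding dim)
print(GroupedQueryAttention()(x).shape)  # torch.Size([1, 8, 64])
```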
Figure 10: Comparison between regular attention (left) and sliding window attention (right).

Concretely, gpt-oss alternates between GQA layers that attend to the full context and GQA layers with a sliding window limited to 128 tokens. (I covered sliding window attention in more detail in a previous article.)

Figure 12: Code implementations of LayerNorm and RMSNorm showing that RMSNorm is computationally simpler.

1.8 The GPT-2 Legacy

I still think that GPT-2 is an excellent beginner architecture when learning about LLMs. It's simple enough to understand without getting lost in layers of optimization tricks, but still complex enough to give you a solid grasp of how modern transformer models work.

By starting with GPT-2, you can focus on the fundamentals (attention mechanisms, positional embeddings, normalization, and the overall training pipeline) without being overwhelmed by the extra features and tweaks found in newer architectures.

In fact, I think it's worth the time to learn about and even implement GPT-2 first before trying to stack newer changes on top. You will not only have an easier time understanding those changes, but you will likely also appreciate them more, because you will get a better understanding of the limitations or problems they try to solve.

For instance, starting with my GPT-2 code, I recently implemented the Qwen3 architecture from scratch.

Figure 14: Qwen3 has twice as many transformer blocks as gpt-oss-20b.

On the other hand, gpt-oss is a much wider architecture:

- An embedding dimension of 2880 instead of 2048
- An intermediate expert (feed forward) projection dimension of 5760 instead of 768

It's also worth noting that gpt-oss uses twice as many attention heads, but this doesn't directly increase the model's width. The width is determined by the embedding dimension.

Does one approach offer advantages over the other given a fixed number of parameters? As a rule of thumb, deeper models have more flexibility, but they can be harder to train due to instability issues caused by exploding and vanishing gradients (which RMSNorm and shortcut connections aim to mitigate). Wider architectures have the advantage of being faster during inference (higher tokens-per-second throughput) thanks to better parallelization, at the cost of higher memory usage.

When it comes to modeling performance, there's unfortunately no good apples-to-apples comparison I am aware of (where parameter count and datasets are kept constant), except for an ablation study in the Gemma 2 paper.

Figure 16: The two gpt-oss architectures side by side, where the larger 120B model only scales the number of transformer blocks and the number of experts.

The boring explanation for why the 20B and 120B models are so similar is probably that the 120B model was the main focus, and the easiest way to create a smaller model was to make it a bit shorter (fewer transformer blocks) and to reduce the number of experts, because that's where most of the parameters are. However, one might speculate whether they started training the 120B model and then removed some of the transformer blocks and experts for continued pre-training (instead of starting from random weights).

In any case, it's quite unusual to scale only those two aspects (the number of transformer blocks and the number of experts). For instance, when looking at Qwen3 MoE models of multiple sizes (Figure 17 below), they were scaled more proportionally to each other across many more aspects.
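To make this contrast a bit more concrete, here is a small sketch of the two gpt-oss configurations. The embedding dimension is the one discussed above; the layer and expert counts reflect my reading of the published model configurations rather than numbers from this article, so double-check them against the official model card.

```python
# Rough side-by-side of the two gpt-oss configurations (counts taken from the
# published configs, not from this article; verify against the model card).
gpt_oss_20b  = {"emb_dim": 2880, "n_blocks": 24, "n_experts": 32,  "active_experts": 4}
gpt_oss_120b = {"emb_dim": 2880, "n_blocks": 36, "n_experts": 128, "active_experts": 4}

# Only the depth (number of transformer blocks) and the number of experts change:
changed = [k for k in gpt_oss_20b if gpt_oss_20b[k] != gpt_oss_120b[k]]
print(changed)  # ['n_blocks', 'n_experts']
```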
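Two smaller architectural details come next: gpt-oss brings back bias units in the attention weight matrices (as GPT-2 did) and adds learned attention sinks, as shown in Figures 18 and 20 below. The snippet below is only a rough sketch of the general idea, loosely inspired by the Hugging Face implementation; the class, names, and shapes are mine, not the official code, and details such as causal masking and RoPE are omitted.

```python
# Sketch of attention with (1) bias units in the q/k/v projections and
# (2) a learned per-head "sink" logit that joins the softmax and is then
# discarded. This is an illustrative approximation, not the official gpt-oss code.
import torch
import torch.nn as nn

class AttentionWithSink(nn.Module):
    def __init__(self, emb_dim=64, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads
        # Unlike most recent LLMs, gpt-oss keeps bias terms (bias=True), like GPT-2
        self.W_q = nn.Linear(emb_dim, emb_dim, bias=True)
        self.W_k = nn.Linear(emb_dim, emb_dim, bias=True)
        self.W_v = nn.Linear(emb_dim, emb_dim, bias=True)
        self.out_proj = nn.Linear(emb_dim, emb_dim, bias=True)
        # One learned sink logit per attention head
        self.sinks = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x):
        b, t, _ = x.shape
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5  # (b, heads, t, t)

        # Append the sink logit as an extra "column" each query can attend to,
        # then drop it after the softmax; it only absorbs probability mass
        sink = self.sinks.view(1, self.num_heads, 1, 1).expand(b, -1, t, 1)
        probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        probs = probs[..., :-1]

        out = (probs @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)

x = torch.randn(1, 8, 64)
print(AttentionWithSink()(x).shape)  # torch.Size([1, 8, 64])
```

The idea behind the sink is that each query always has somewhere to place attention probability, even when no real token is a good match.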
Figure 18: gpt-oss models use bias units in the attention layers.

Figure 20: The use of attention sinks in gpt-oss; based on the Hugging Face code.

Figure 24: The main benchmark charts are from the official GPT-5 announcement post. The gpt-oss data is taken from the official model card paper and announcement post, and the Qwen3 numbers are taken from the official Qwen3-Coder repository.

All in all, even though some people called the release overhyped, I am glad that we have a new set of really strong open-weight models that are not too far behind the best proprietary ones. Of course, benchmarks often do not accurately reflect real-world use, and it is still too early to tell given the limited usage so far. But I think these are good times for people who like to work with open-weight and local (or privately hosted) models.

Thanks for reading, and for helping support independent research!