Top 10 Open-Weight LLM Architectures from Early 2026

2026-03-17

If you have struggled a bit to keep up with open-weight model releases this month, this article will catch you up on the main themes. We will walk through the ten main releases from January and February 2026 in chronological order, with a focus on architecture similarities and differences.

Based on original research by Sebastian Raschka.


1. Arcee AI’s Trinity Large

A New US-Based Start-Up Sharing Open-Weight Models. On January 27, Arcee AI began releasing versions of their open-weight 400B Trinity Large LLMs, along with two smaller variants.

Their flagship large model is a 400B parameter Mixture-of-Experts (MoE) with 13B active parameters. The two smaller variants are Trinity Mini (26B with 3B active parameters) and Trinity Nano (6B with 1B active parameters).

There are several interesting architectural components in the Trinity models. First, there are alternating local/global attention layers, where the local layers use sliding window attention (SWA). SWA is a sparse attention pattern in which each token attends only to a fixed-size window of recent tokens (for example, 4096) instead of attending to the entire input. This reduces the per-layer attention cost from O(n²) to roughly O(n·w), where w is the window size, making it exceptionally attractive for long-context models.
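A minimal sketch of the attention mask behind SWA (the sequence length and window size here are illustrative, not Trinity's actual settings). Each query row ends up with at most `window` allowed keys, which is where the O(n·w) cost comes from:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: entry (i, j) is True where query i may attend to key j.

    Each token attends to itself and the previous `window - 1` tokens,
    combined with the usual causal constraint (j <= i).
    """
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (n, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, n)
    causal = j <= i
    local = (i - j) < window
    return causal & local

# Each row has at most `window` True entries, so per-layer attention
# cost grows as O(n * window) instead of O(n^2).
mask = sliding_window_mask(seq_len=8, window=3)
```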

2. Moonshot AI’s Kimi K2.5

Kimi K2.5 set a new open-weight performance ceiling at the time of its release on Jan 27. Impressively, it was on par with leading proprietary models, thanks in part to its massive scale: at 1 trillion parameters, it is 2.5x larger than Trinity Large.

Overall, the Kimi K2.5 architecture is a scaled-up multimodal version of the DeepSeek V3 architecture, adding native vision support. During training, it adopted an "early fusion" approach, passing in the vision tokens early on alongside the text tokens.
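In code, the early-fusion idea is simple: image patch embeddings from a vision encoder are concatenated with text token embeddings into one sequence, so the transformer sees both modalities from the first layer on. The sketch below uses random tensors and illustrative dimensions, not Kimi K2.5's actual configuration:

```python
import torch

def early_fusion(text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Early fusion sketch: vision tokens are simply prepended to text tokens,
    forming a single sequence processed by one shared transformer."""
    return torch.cat([image_emb, text_emb], dim=1)

text = torch.randn(1, 12, 64)    # 12 text token embeddings
image = torch.randn(1, 16, 64)   # 16 image-patch embeddings from a vision encoder
fused = early_fusion(text, image)  # one mixed-modality sequence of 28 tokens
```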

3. StepFun’s Step 3.5 Flash

Step 3.5 Flash is a 196B parameter model that outperforms much larger models while delivering impressive speed: it achieves 100 tokens/sec throughput at a 128k context length.

One reason for this performance is the model's small active parameter count (11B parameters active per token). The other is Multi-Token Prediction (MTP), which trains the LLM to predict multiple future tokens at each step rather than a single one. Step 3.5 Flash uses MTP with 3 additional tokens during both training and inference, which speeds up generation by drafting several tokens per forward pass.
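A toy version of the MTP head structure can be sketched as below. This is a deliberate simplification under stated assumptions: real MTP variants (e.g. DeepSeek-V3-style) use small transformer modules per extra token rather than plain linear heads, and the dimensions here are illustrative:

```python
import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Toy multi-token prediction: one output head per future offset.

    Head 0 predicts the next token; heads 1..n predict tokens further ahead.
    """
    def __init__(self, d_model: int, vocab: int, n_future: int):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_future))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) -> logits: (n_future, batch, seq, vocab)
        return torch.stack([head(hidden) for head in self.heads])

# 1 main next-token head + 3 additional future tokens, as in Step 3.5 Flash.
mtp = MTPHead(d_model=16, vocab=100, n_future=4)
logits = mtp(torch.randn(2, 5, 16))
```

At inference, the extra heads act like a built-in draft model: several tokens are proposed per forward pass and verified, similar in spirit to speculative decoding.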

4. Qwen3-Coder-Next

In early February 2026, the Qwen3 team shared the 80B Qwen3-Coder-Next model (3B active parameters). It made headlines for outperforming much larger models such as DeepSeek V3.2 on coding tasks.

Instead of a regular attention mechanism, it uses a Gated DeltaNet + Gated Attention hybrid. This hybrid design enables a native 262k-token context length while keeping memory usage under tight control. The DeltaNet block replaces standard attention with a fast-weight delta-rule update, maintaining a small fast-weight memory that is updated recurrently.
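The delta rule at the heart of DeltaNet-style layers can be sketched in a few lines. This is the bare mechanism only (no gating, no normalization, illustrative dimensions), not Qwen3-Coder-Next's actual layer:

```python
import torch

def delta_rule_step(S: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    beta: float) -> torch.Tensor:
    """One fast-weight update following the delta rule.

    S is a (d_v, d_k) associative memory. The rule retrieves the value
    currently bound to key k, then writes the difference (v - old) back,
    scaled by beta: effectively erasing the old binding and storing v.
    """
    old = S @ k                              # value currently associated with k
    return S + beta * torch.outer(v - old, k)

d_k = d_v = 4
S = torch.zeros(d_v, d_k)                    # empty fast-weight memory
k = torch.tensor([1.0, 0.0, 0.0, 0.0])       # unit-norm key
v = torch.tensor([0.0, 2.0, 0.0, 0.0])       # value to bind to k
S = delta_rule_step(S, k, v, beta=1.0)
out = S @ k                                  # querying with k retrieves v
```

Because the state S has a fixed size regardless of sequence length, memory stays constant as the context grows, unlike a standard KV cache.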

5. z.AI’s GLM-5

Released on February 12th, the GLM-5 model appeared to be on par with major flagship LLM offerings, including top proprietary models. It has 744B total parameters (40B active), doubling the scale of its GLM-4.7 predecessor.

GLM-5 adopted DeepSeek's Multi-Head Latent Attention (MLA) and sparse attention. These modifications are intended to reduce inference costs when working with long contexts. The embedding dimension and expert size increased, while the number of transformer layers was reduced to 78 to improve overall inference latency.
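The core trick of MLA is to cache one small latent vector per token instead of full keys and values, up-projecting at attention time. A rough sketch of that idea (the dimensions are illustrative, not GLM-5's actual configuration, and details like decoupled RoPE are omitted):

```python
import torch
import torch.nn as nn

class LatentKVSketch(nn.Module):
    """MLA-style KV compression sketch: cache a small latent per token,
    reconstruct full keys/values from it on demand."""
    def __init__(self, d_model: int = 64, d_latent: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # compress: only this output is cached
        self.up_k = nn.Linear(d_latent, d_model)  # reconstruct keys
        self.up_v = nn.Linear(d_latent, d_model)  # reconstruct values

    def forward(self, x: torch.Tensor):
        c = self.down(x)                          # (batch, seq, d_latent): the KV cache
        return self.up_k(c), self.up_v(c), c

mla = LatentKVSketch()
k, v, cache = mla(torch.randn(1, 6, 64))
# The cache stores 8 floats per token instead of 2 * 64 for full K and V.
```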

6. MiniMax M2.5

MiniMax M2.5 is a hugely popular model: a strong coder with "only" 230B parameters. Its popularity is owed partly to the fact that it is a smaller, cheaper model with roughly the same modeling performance as trillion-parameter giants.

Architecture-wise, MiniMax M2.5 features a surprisingly classic design: it uses plain Grouped Query Attention (GQA), without sliding windows or complex hybrid attention tricks. It relies heavily on good training recipes rather than architectural hyper-optimization.
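For reference, GQA simply shares each key/value head across a group of query heads, shrinking the KV cache. A minimal non-causal sketch with illustrative head counts (not MiniMax's actual configuration):

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor) -> torch.Tensor:
    """Minimal GQA sketch (non-causal, for brevity).

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), n_kv_heads < n_q_heads.
    """
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head has a matching key/value.
    k = k.repeat_interleave(group, dim=0)
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(8, 6, 16)   # 8 query heads
k = torch.randn(2, 6, 16)   # only 2 KV heads -> 4x smaller KV cache
v = torch.randn(2, 6, 16)
out = grouped_query_attention(q, k, v)
```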

7. Nanbeige 4.1 3B

For on-device or edge deployment, Nanbeige 4.1 3B shines. It aims directly at the "small" LLM use case that models like Qwen3 dominate.

Nanbeige 4.1 3B uses architectural components very similar to Llama 3.2 3B. The performance gains come mostly from extensive post-training with supervised fine-tuning and reinforcement learning.

8. Qwen3.5: Hybrid Attention Continues

The Qwen team returned on Feb 15 with Qwen3.5. Their massive 397B parameter (17B active) MoE variant is a big step up from the largest Qwen3.

Architecture-wise, Qwen3.5 adopts the hybrid attention design (featuring Gated DeltaNet) that Qwen3-Coder-Next pioneered. This suggests that hybrid attention mechanisms are entering mainstream flagship models rather than remaining experimental side projects.

9. Ant Group’s Ling 2.5 1T

Ling 2.5 (and its reasoning counterpart Ring 2.5) is another monstrous 1-trillion-parameter LLM featuring a hybrid attention architecture. Instead of Gated DeltaNet, it uses a recurrent linear attention variant called Lightning Attention.
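Lightning Attention builds on the recurrent form of linear attention, which can be sketched as follows. This shows only the underlying O(n) recurrence with an assumed elu+1 feature map; Lightning Attention's actual contribution is a tiled, hardware-efficient implementation of this idea, which is omitted here:

```python
import torch

def linear_attention(q: torch.Tensor, k: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
    """Causal linear attention sketch: O(n) in sequence length.

    Replacing softmax with a positive feature map (here: elu + 1) lets
    the running state S_t = S_{t-1} + k_t v_t^T be carried token by token,
    so cost and memory stay constant per step.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1
    q, k = phi(q), phi(k)
    S = torch.zeros(q.shape[-1], v.shape[-1])  # running key-value state
    z = torch.zeros(q.shape[-1])               # running normalizer
    outs = []
    for t in range(q.shape[0]):
        S = S + torch.outer(k[t], v[t])
        z = z + k[t]
        outs.append((q[t] @ S) / (q[t] @ z + 1e-6))
    return torch.stack(outs)

out = linear_attention(torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8))
```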

Ling 2.5's main selling point is its exceptional efficiency in long contexts. When compared to models of similar size, it achieves up to 3.5x higher throughput at a sequence length of 32k tokens.

10. Tiny Aya: A 3.35B Model

Released by Cohere, Tiny Aya was dubbed the most capable multilingual open-weight model in the 3B parameter class.

Tiny Aya is a classic decoder-style transformer with one noteworthy highlight: parallel transformer blocks. A parallel transformer block computes attention and an MLP from the same normalized input, then adds both to the residual in a single step to increase computational throughput. Furthermore, it drops QK-Norm to avoid adverse interactions with long-context performance.
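The parallel block layout (familiar from GPT-J and PaLM) can be sketched as below. The dimensions and use of LayerNorm are illustrative assumptions, not Tiny Aya's actual configuration:

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Parallel transformer block sketch: attention and MLP branches
    both read the same normalized input and join the residual together,
    instead of running sequentially. The two branches can execute
    concurrently, improving throughput."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                    # one shared normalization
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out + self.mlp(h)   # both branches added in one step

block = ParallelBlock(d_model=32, n_heads=4)
y = block(torch.randn(2, 10, 32))
```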

Bonus: Sarvam 30B and 105B

Released in early March from India, Sarvam offered two strong reasoning models. The smaller 30B model uses classic Grouped Query Attention (GQA), whereas the larger 105B variant switched to DeepSeek-style Multi-Head Latent Attention (MLA).

Sarvam sets a new bar for regional capability with extremely efficient representations for Indian languages.


Conclusion

The main takeaway from the release sprint of early 2026 is that there isn't just one right way to build a model. However, architectural optimizations like Sliding Window Attention (SWA), Multi-Token Prediction (MTP), and hybrid linear attention are moving from research directly into production models. Efficiency is the name of the game in 2026!