The Huffman Gazette

AI Science

Edition 1, March 22, 2026, 9:26 AM

Flash-MoE: Streaming a 397B MoE Model from SSD at 4.4 Tokens/Second

A new open-source project called Flash-MoE demonstrates that the 397-billion-parameter Qwen3.5-397B-A17B Mixture-of-Experts model can run on a MacBook Pro with just 48GB of unified RAM, achieving 4.4+ tokens/second at 4-bit quantization (HN discussion). The entire 209GB model streams from SSD through a custom C/Metal compute pipeline — no Python, no ML frameworks.

The key insight is that MoE models are naturally suited to SSD streaming: each layer activates only 4 of 512 experts per token, so only ~6.75MB of expert weights need to be loaded per layer. Flash-MoE uses parallel pread() calls with GCD dispatch groups to stream these on demand, while relying entirely on the OS page cache (~35GB, ~71% hit rate) rather than custom caching — a counterintuitive "Trust the OS" design that outperformed every bespoke cache the authors tried, including Metal LRU, malloc caches, and LZ4 compressed caches.
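The fan-out pattern is simple to sketch. The following is a hypothetical illustration, not the project's actual code: Flash-MoE issues its pread() calls through GCD dispatch groups on macOS, while this portable pthreads version shows the same idea, with each selected expert fetched by an independent positioned read so the threads share no file-offset state and repeat loads hit the OS page cache. The function names, expert count, and toy expert size are assumptions for the sketch.

```c
/* Hypothetical sketch of per-layer expert streaming. The real Flash-MoE
 * fans out pread() through GCD dispatch groups; this portable pthreads
 * version shows the same pattern: one positioned read per selected expert,
 * running concurrently, with the OS page cache absorbing repeat loads. */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define NUM_SELECTED 4            /* experts activated per layer (of 512) */
#define EXPERT_BYTES (64 * 1024)  /* toy size; real 4-bit experts are ~MBs */

typedef struct {
    int fd;
    off_t offset;        /* byte offset of this expert in the weights file */
    unsigned char *dst;  /* where to land the weights */
    int ok;
} ExpertRead;

static void *read_expert(void *arg) {
    ExpertRead *r = (ExpertRead *)arg;
    /* pread() takes an explicit offset, so no shared seek state */
    r->ok = pread(r->fd, r->dst, EXPERT_BYTES, r->offset) == EXPERT_BYTES;
    return NULL;
}

/* Stream NUM_SELECTED experts (by id) from the weights file into out. */
int stream_experts(const char *path, const int *ids, unsigned char *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    pthread_t th[NUM_SELECTED];
    ExpertRead req[NUM_SELECTED];
    for (int i = 0; i < NUM_SELECTED; i++) {
        req[i] = (ExpertRead){ fd, (off_t)ids[i] * EXPERT_BYTES,
                               out + (size_t)i * EXPERT_BYTES, 0 };
        pthread_create(&th[i], NULL, read_expert, &req[i]);
    }
    int all_ok = 1;
    for (int i = 0; i < NUM_SELECTED; i++) {
        pthread_join(th[i], NULL);
        all_ok &= req[i].ok;
    }
    close(fd);
    return all_ok ? 0 : -1;
}
```

Because the reads carry their own offsets, the kernel is free to queue all four SSD requests at once, which is where the parallelism pays off.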

The project's technical log of 58+ experiments is remarkably candid about what didn't work. Speculative early routing hurt by 38% due to cache pollution. Prefetching via F_RDADVISE was net zero because SSD DMA competes with GPU bandwidth on Apple Silicon's unified memory controller. An MLP-based expert routing predictor achieved only 31% accuracy. The winning formula was simpler: an FMA-optimized dequantization kernel that rearranges (nibble * scale + bias) * x into fma(nibble, scale*x, bias*x), yielding a 12% speed improvement by letting the GPU's fused multiply-add unit handle dequant and multiply in one instruction.

The HN discussion noted that this isn't the only path to running Qwen 3.5 397B locally — commenters reported ~20 tok/s on M1 Ultra with 128GB using 2.5 BPW GGUF quants via llama.cpp, with respectable benchmark scores (87.86% MMLU, 82.32% GPQA Diamond). But Flash-MoE's contribution is more architectural: it proves that SSD-streaming inference for MoE models is viable on consumer hardware, a technique inspired by Apple's own "LLM in a Flash" paper now taken to its practical extreme.

A Visual Atlas of Attention Variants in Modern LLMs

Sebastian Raschka has published a comprehensive visual guide to attention variants used in current open-weight LLMs, alongside a new LLM architecture gallery with 45+ entries (HN discussion). The guide traces the evolution from standard multi-head attention (MHA) through the efficiency frontier, and serves as a useful snapshot of where the field stands architecturally in early 2026.

The taxonomy is instructive. Grouped-query attention (GQA) remains the workhorse — used in Llama 3, Qwen3, Gemma 3, and many MoE models — reducing KV-cache cost by sharing key-value projections across multiple query heads. It's a spectrum: fewer KV groups mean cheaper inference but can degrade modeling quality. Multi-head latent attention (MLA), introduced in DeepSeek-V2, takes a different approach: instead of reducing the number of KV heads, it compresses what gets cached via a learned latent representation. DeepSeek's ablations showed MLA preserving or even exceeding MHA quality at the same memory budget — a stronger claim than "it's just cheaper." MLA now appears in DeepSeek V3, Kimi K2, GLM-5, and Mistral Large 3, though Raschka notes it reportedly works best above ~100B parameters.
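The memory arithmetic behind the GQA trade-off is easy to make concrete. A back-of-envelope sketch, with illustrative shapes that do not correspond to any specific model:

```c
/* KV-cache sizing: 2 tensors (K and V) per layer, each of
 * kv_heads * seq_len * head_dim elements. GQA shrinks the cache by
 * n_heads / n_kv_heads, because every query head in a group reuses
 * the group's single K/V projection. Shapes are illustrative. */
#include <stddef.h>

size_t kv_cache_bytes(size_t layers, size_t kv_heads, size_t head_dim,
                      size_t seq_len, size_t bytes_per_elem) {
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem;
}
```

For a toy 32-layer model with 32 query heads, head dimension 128, and a 4096-token context in fp16, full MHA caching costs 2 GiB; sharing each K/V head across 4 query heads (8 KV heads) cuts that to 512 MiB.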

Sliding window attention (SWA) limits each token to a fixed local context window, with periodic global attention layers for full-sequence information flow. Gemma 3 pushed from a 1:1 to a 5:1 local-to-global ratio with a 1024-token window, with ablations showing minimal perplexity degradation. DeepSeek Sparse Attention, from V3.2, goes further by learning which past tokens to attend to via a lightning indexer and token selector, rather than hard-coding locality.
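The masking rule behind SWA with periodic global layers can be written as a small predicate. The parameters below follow the Gemma 3 numbers cited above, but the function itself is an illustration, not any implementation's actual code:

```c
/* Sliding-window attention masking sketch: local layers see only the
 * last WINDOW tokens; every sixth layer is global (5:1 local:global)
 * and sees the full causal prefix. Illustrative, not production code. */
#include <stdbool.h>

#define WINDOW 1024          /* local layers see the last 1024 tokens */
#define LOCALS_PER_GLOBAL 5  /* 5 local layers, then 1 global layer   */

bool is_global_layer(int layer) {
    return layer % (LOCALS_PER_GLOBAL + 1) == LOCALS_PER_GLOBAL;
}

/* May query position q attend to key position k in this layer? */
bool can_attend(int layer, int q, int k) {
    if (k > q) return false;                  /* causal mask everywhere */
    if (is_global_layer(layer)) return true;  /* full causal prefix     */
    return q - k < WINDOW;                    /* local window only      */
}
```

The periodic global layers are what let information from outside the window still propagate across the full sequence, which is why the perplexity cost of the aggressive 5:1 ratio stays small.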

The most striking trend is the rise of hybrid architectures that replace most attention layers with cheaper linear or state-space modules. Qwen3-Next pioneered a 3:1 mix of Gated DeltaNet (a linear-attention variant related to Mamba-2) and gated full-attention blocks. Qwen3.5 promoted this from experimental side-branch to main flagship — a strong signal that hybrid attention is production-ready. Kimi Linear swaps in channel-wise gating and gated MLA. Ling 2.5 uses Lightning Attention with MLA. NVIDIA's Nemotron goes further with Mamba-2 as the primary sequence module. Raschka observes that while hybrids offer superior long-context efficiency, their inference stacks aren't yet as optimized as classic GQA setups for local deployment, and the field is still waiting on DeepSeek V4 to set the next trend.

2025 Turing Award Recognizes Quantum Information Pioneers

The ACM has named Charles H. Bennett and Gilles Brassard as co-recipients of the 2025 A.M. Turing Award for their foundational contributions to quantum information science — the first time the prize has recognized quantum computing research (HN discussion). Their most celebrated work is the BB84 quantum key distribution protocol (1984), which guarantees security through the laws of physics rather than mathematical complexity: any eavesdropper measuring quantum-encoded photons necessarily disturbs them, creating a detectable trace.
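The eavesdropper-detection logic can be illustrated with a tiny classical Monte-Carlo simulation of the protocol's sift-and-compare step. This is a sketch of the statistics only — real BB84 requires an actual quantum channel — and all names and parameters are illustrative:

```c
/* Toy BB84 statistics: bits are prepared in one of two bases, and
 * measuring in the wrong basis yields a random result. Alice and Bob
 * keep only positions where their bases matched (sifting). With no
 * eavesdropper the sifted keys agree exactly; an intercept-and-resend
 * Eve must guess the basis, disturbing mismatched-basis photons and
 * leaving ~25% errors in the sifted key -- the detectable trace. */
#include <stdlib.h>

/* measure a qubit prepared as (bit, prep_basis) in meas_basis */
static int measure(int bit, int prep_basis, int meas_basis) {
    return prep_basis == meas_basis ? bit : (rand() & 1);
}

/* Simulate n photons; return the error rate observed on the sifted key. */
double bb84_error_rate(int n, int eve_present) {
    int sifted = 0, errors = 0;
    for (int i = 0; i < n; i++) {
        int a_bit = rand() & 1, a_basis = rand() & 1, b_basis = rand() & 1;
        int bit = a_bit, basis = a_basis;
        if (eve_present) {             /* intercept-and-resend attack */
            int e_basis = rand() & 1;
            bit = measure(bit, basis, e_basis);
            basis = e_basis;           /* Eve re-sends in her basis */
        }
        int b_bit = measure(bit, basis, b_basis);
        if (a_basis == b_basis) {      /* sift: keep matching bases */
            sifted++;
            if (b_bit != a_bit) errors++;
        }
    }
    return sifted ? (double)errors / sifted : 0.0;
}
```

Alice and Bob reveal a random sample of their sifted bits: an error rate near zero means the channel was untouched, while a rate near 25% exposes the interceptor.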

Bennett's earlier contributions are equally foundational. His 1973 proof that computation can be carried out reversibly — run forward and backward with no net energy cost — established a deep connection between physics and information theory, building on Rolf Landauer's 1961 argument that information is fundamentally physical. In 1993, Bennett co-authored the quantum teleportation protocol, demonstrating that quantum states can be transferred between locations using entanglement. The practical urgency of their cryptographic work has only grown since Peter Shor's 1994 demonstration that quantum computers could efficiently break the public-key encryption schemes securing most modern communications — making BB84-style quantum key distribution look less like a theoretical curiosity and more like critical infrastructure for the post-quantum era.

"System 3": How AI May Be Reshaping Human Reasoning

A widely discussed SSRN preprint proposes that AI tools are creating a "System 3" in human cognition — extending Kahneman's dual-process framework of fast intuitive thinking (System 1) and slow deliberate reasoning (System 2) with a third mode: offloading cognitive work to AI (HN discussion, 103 comments). The paper finds that participants with higher trust in AI and lower need for cognition showed greater "surrender" to this System 3, and that time pressure and incentives shifted baseline performance but did not eliminate the pattern.

The HN discussion (181 points) was substantive and divided. Several commenters reported personal experience with cognitive offloading — one noted they used to manually sanity-check financial data from SEC filings but stopped after relying on AI, catching fewer errors. Others argued AI had improved their reasoning by helping them discover solutions to longstanding problems. Skeptics raised concerns about the underlying framework itself, noting that Kahneman's System 1/System 2 model has faced replication challenges. Multiple readers also flagged that parts of the paper appeared to be AI-generated, adding an ironic layer to research about AI's effects on human thinking. The paper remains a preprint and does not appear to have undergone peer review.

ArXiv Declares Independence from Cornell

In a significant governance shift for scientific publishing infrastructure, arXiv has declared independence from Cornell University, its longtime institutional home for the pioneering preprint server founded in 1991 (HN discussion, 272 comments). The story gathered 799 points on HN, reflecting the research community's deep investment in the platform that underpins open-access dissemination across physics, mathematics, computer science, and — critically for AI research — machine learning.

While the full Science article is paywalled, the move represents a maturation of arXiv's institutional status. The preprint server has become the de facto venue for AI and ML research publication, with most significant papers appearing there weeks or months before formal peer review. Any governance change to arXiv has downstream implications for how AI research is disseminated, discovered, and cited.