Computer Architecture Today

https://feeds.feedblitz.com/sigarch-cat
Informing the broad computing community about current activities, advances and future directions in computer architecture.

Episode List

All in on MatMul? Don’t Put All Your Tensors in One Basket!

Oct 8th, 2025 2:00 PM

Have you noticed that some ideas in AI take off not necessarily because they’re better, but because they align better with our machines? That’s the hardware lottery: if your approach happens to align with the dominant hardware/software, you hit the jackpot; otherwise, better luck next time. Yet this lottery isn’t a one-time draw. As Sara Hooker puts it, “we may be in the midst of a present-day hardware lottery”. The catch? Modern chips zero in on DNNs’ commercial sweet spots. They are exceptionally good at cranking through heavy-duty MatMuls and the garnish ops, such as non-linear functions, that keep you from getting hit by Amdahl’s Law. Pitch an off-road idea, and it is at best a high-risk long shot. The “winners” in research often align with what our tools run best, and that tendency can skew the trajectory of technological progress.

This post draws attention to how generality and programmability are often underemphasized on today’s accelerators, a pattern that risks stifling future algorithmic innovation unless it is actively addressed.

The Reign of Matrix Multiplication

It’s no coincidence that almost every AI breakthrough involves some kind of NN crunching numbers on xPUs. These chips have made a particular form of MatMul the de facto currency of AI. Crucially, the performance gains have come not just from accelerating MatMul itself, but from the realization that AI algorithms are resilient to reduced precision. Because ML frameworks are built around tensor operations, any problem reformulated as a sequence of MatMul ops instantly taps decades of compiler optimizations and accelerator infrastructure. In fact, we practice a pragmatic “MatMul-reduction,” much like NP-reductions, converting complex tasks into chained MatMuls. But we haven’t shown that all aspects of intelligence reduce neatly to MatMul, and by rewarding only MatMul-friendly ideas, we risk creating powerful yet brittle approximations that trap us in a local minimum.
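The “MatMul-reduction” practice is easiest to see in the classic im2col trick, which rewrites a convolution as a single matrix multiply. Below is a minimal pure-Python sketch of the toy 1-D case; the function names are illustrative, not from any particular framework:

```python
def im2col_1d(signal, k):
    """Unroll sliding windows of a 1-D signal into a patch matrix.

    Row i holds signal[i : i+k], so convolving with a length-k kernel
    becomes a single matrix-vector product.
    """
    return [signal[i:i + k] for i in range(len(signal) - k + 1)]

def matvec(m, v):
    """Plain matrix-vector product (stand-in for an accelerated MatMul)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def conv1d_direct(signal, kernel):
    """Reference direct (cross-correlation) convolution."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

# Both routes compute the same values; the im2col route spends its time
# in one dense MatMul, exactly the primitive accelerators favor.
assert matvec(im2col_1d(signal, len(kernel)), kernel) == conv1d_direct(signal, kernel)
```

The same reformulation, generalized to 2-D patches, is how deep-learning frameworks map convolutions onto GEMM kernels, trading extra memory for access to decades of MatMul optimization.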
Researchers naturally gravitate to methods that run efficiently on existing hardware and tools, while proposals that stray from the matrix-heavy paradigm face an uphill battle. These ideas must achieve a kind of implementation escape velocity to land in a real chip. In effect, AI-specific chips create technological inertia. If tomorrow someone conceives an AI method that isn’t MatMul-centric, would it ever get a fair evaluation on hardware never designed to support it? None of this is to say MatMul accelerators are “bad”; far from it: they enabled the deep learning revolution. Yet their success has created a textbook example of the Matthew Effect: hardware favors certain algorithms, those algorithms dominate, and we then build even more specialized chips for them. That feedback loop is virtuous if you buy into the current paradigm, but vicious if it causes us to overlook alternative approaches.

Who Holds the Keys to the House?

So who’s running this casino, anyway? ML accelerators now dictate who gets to pursue what research. It’s no longer just about building better algorithms but about securing access to large clusters, and today there is an increasing emphasis on chasing short-term returns on proven approaches rather than risking investments on wild new paradigms. Innovation can suffer, steered by what’s profitable and available rather than what’s intellectually promising. History also provides a grim reminder that “computer architecture history is littered with the corpses of special-purpose machines.” These were often brilliant ideas defeated by the one-two punch of a narrow market and Moore’s Law economies of scale. Today, materializing an idea is not just about access to spare xPUs; it’s about meeting the table stakes required to build chips at scale. Capital has become the ultimate gatekeeper for what gets researched. If Cloud Company X has a thousand xPUs idling, research that fits those machines is more likely to get done.
If Idea Y would require building a whole new kind of chip or using hardware that nobody’s selling, it might never leave the drawing board.

Is the Hardware Lottery Losing Relevance – Or More Relevant Than Ever?

With MatMul machines everywhere, one might ask: is the hardware lottery still a thing? After all, if virtually everybody in AI is playing on the same hardware and running the same applications, maybe it’s not a lottery anymore but a planned economy: “Matrix math won, alternative approaches need not apply.” 2024 Turing Award laureate Rich Sutton epitomized this view in “The Bitter Lesson”: the algorithms that scale best with compute inevitably win, and so far DNNs have scaled very well. By this logic, it’s no accident or luck that deep learning is on top; it won fair and square by delivering results when thrown onto big hardware. In a world where more data + bigger models = better performance, focusing on one ubiquitous hardware type could simply accelerate progress for everyone. If everyone concentrates on improving one kind of platform, you get compounding efficiencies. Maybe we don’t need radically different hardware paradigms; we just need to make the winning hardware cheaper and more accessible. On standardized AI hardware, ideas compete on a more level playing field of implementation. In that scenario, the hardware lottery effectively ends. It ceases to be a game of chance and becomes what Thomas Kuhn called a dominant paradigm. MatMul-centric computing defines the “normal science” of the field, creating a powerful bias towards ideas that fit the existing model and treating those that don’t as mere anomalies.

However, many caution that this very uniformity could be lulling us into a false sense of security. The current domination of matrix-multiply-centric AI might just be masking the hardware lottery’s fangs. We won’t notice the lottery until the day we desperately need a different kind of hardware and realize we haven’t been buying those tickets.
As Hooker notes, today’s specialization makes it “far more costly to stray from accepted building blocks”, implicitly pressuring researchers to stick to ideas that fit the hardware. The danger is that we might overfit to our hardware, optimizing our whole intellectual landscape around what runs fast on a GPU while staying blind to ideas that would require something fundamentally new (e.g., training DistBelief with tens of thousands of CPU cores vs. AlexNet on two GPUs, both in 2012). What if the next breakthrough doesn’t look anything like a giant MatMul? Neurobiology points to sparse, event-driven primitives in the brain, which look nothing like training a deep Transformer model. Kaplan et al. also report that doubling the MatMul compute produces only around a 5% improvement in loss: big hardware effort, modest algorithmic return. If you suspect we will hit a wall in achieving human-level intelligence, then it’s reasonable to expect that somewhere out there is a different algorithmic path, one that might need its own kind of hardware, and attention, to truly flourish.

There are already glimmers of such paradigms. But they’re far from mainstream, partly because the entire ecosystem (funding, talent, compilers, access) orbits around the incumbent tech: no one uses a different hardware paradigm because it’s not supported, and it’s not supported because no one’s using it. If matrix-multiply AI is the only game in town, we risk a kind of innovation monoculture. Monocultures can be efficient, but they’re also brittle: one fundamental limitation of our favored approach could stall progress.

Taking a step back, should we focus our effort more on hardware, or on software/algorithms? From the 1980s until 2020, Moore’s Law delivered an impressive 30,000x speedup (Fig. 2a, Hardware Improvement). However, consider the example of k-d trees, developed to accelerate approximate nearest-neighbor queries.
This one algorithmic breakthrough delivered a speedup comparable to decades’ worth of hardware advancements. Retrospective data from MIT shows that algorithms transitioned from O(N^2) to O(N) at a rate of 0.5% per year. It seems important to enable such a breakthrough for Transformers, but focusing on dense linear algebra and chasing (comparatively modest) gains from hardware is unlikely to get us there.

So, is the hardware lottery less relevant now? Or more relevant than ever? It might be less visible day-to-day because one approach dominates, but that dominance itself could be the biggest lottery effect of all. The fact that everything is so aligned on one type of hardware means that if you’re working on anything else, you’re effectively locked out of the casino. And if the future of AI needs a different casino? We’ll wish we hadn’t put all our chips on one table.

Avoiding Tomorrow’s Hardware Lottery? How?

The conventional wisdom boils down to two options:

Let a thousand flowers bloom. Support research into non-xPU machines and their matching algorithms. A diverse hardware ecosystem makes the field more resilient and full of surprises. Academia is the natural lead here, with its long-horizon research and risk tolerance, and industry providing sustained resources and scale.

Go all-in. This strategy treats the hardware lottery like a casino game where one hand has already won big: push all the chips, both financial and silicon, into making sure everyone can play with that winning hand. That means driving down cost, improving energy efficiency, and expanding access. The risk? We may simply be building a wider, more comfortable road into a potential cul-de-sac.

Both have merit, and Chris Lattner captures this view: if we want AI to keep advancing, we must “expand access to alternative hardware, [maximize] efficiency on existing systems, and [accelerate] software innovation”; otherwise we risk hitting a wall where AI progress is bottlenecked by hardware.
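The k-d tree speedup mentioned earlier can be made concrete with a toy sketch. This is an illustrative pure-Python implementation (not the cited MIT study’s code): an exact nearest-neighbor query that prunes whole subtrees, typically touching O(log N) points on average instead of the O(N) a brute-force scan requires.

```python
import random

def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Exact nearest-neighbor search with branch pruning."""
    if node is None:
        return best
    if best is None or dist2(query, node["point"]) < dist2(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only descend the far side if the splitting plane could hide a closer point.
    if diff ** 2 < dist2(query, best):
        best = nearest(far, query, best)
    return best

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(500)]
tree = build_kdtree(pts)
q = (0.5, 0.5)
assert nearest(tree, q) == min(pts, key=lambda p: dist2(q, p))
```

The algorithmic point stands regardless of constants: one data-structure insight changed the asymptotic cost of the query, the kind of win no amount of MatMul throughput delivers.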
Beyond these two extremes, a pragmatic middle path would be to add generality and programmability to the specialized winners. This approach asks us to evolve the winning design into more general hardware while broadening (or redesigning) the scope of algorithms. History gives precedent. The original GPGPU was a graphics chip with enough generality for scientific computing. Jensen Huang didn’t build a new machine from scratch; he evolved his existing one by betting on programmability, adding generality to a specialized chip, a lottery ticket that paid off greatly when AlexNet emerged. More recently, products like Nvidia’s Grace CPU-GPU combination continue this philosophy: improving a specialized core with general-purpose capabilities.

This generalist route is promising but hardest. It faces the classic chicken-and-egg problem. One path to break the deadlock is to use today’s ML and algorithm-discovery tools to search the co-design space, letting models propose microarchitectures and algorithms that are jointly efficient. This isn’t just about optimizing the current paradigm; it is about asking AI to help us discover the “winning” hardware primitives for the next one. On the software front, YouTube’s Video Coding Unit (VCU) intentionally bakes in “only the computationally expensive infrequently-changing aspects of the system”. On the algorithm front, work such as Fast Feedforward Networks shows you can replace large dense feed-forward MatMuls with log-time, tree-based conditional execution, a different primitive that maps much better to sparse/event-driven or memory-centric hardware. And on the hardware front, designs like Stella Nera demonstrate multiplier-free, lookup/add-based accelerators that recast matrix multiplication into a very different hardware primitive, proof that alternative compute substrates can be both efficient and practical.
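The tree-based conditional execution behind Fast Feedforward Networks can be sketched in a few lines. The sketch below uses random, untrained stand-in weights and toy dimensions purely to show the control flow: each input descends a binary tree of routing hyperplanes and only one tiny leaf “expert” ever runs, so inference cost grows with tree depth rather than layer width.

```python
import random

random.seed(1)
D, DEPTH = 8, 3          # input width; 2**DEPTH leaves

def rand_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical random parameters: one routing hyperplane per internal
# node, one small linear "expert" per leaf (stand-ins for trained weights).
routers = [rand_vec(D) for _ in range(2 ** DEPTH - 1)]
leaves = [rand_vec(D) for _ in range(2 ** DEPTH)]

def fff_forward(x):
    """Descend the tree: DEPTH routing dot products, then one leaf expert."""
    node = 0
    for _ in range(DEPTH):
        go_right = dot(routers[node], x) > 0
        node = 2 * node + (2 if go_right else 1)   # heap-style child indexing
    leaf = node - (2 ** DEPTH - 1)                  # map to leaf index 0..2**DEPTH-1
    return dot(leaves[leaf], x)

y = fff_forward(rand_vec(D))
# Cost: DEPTH + 1 = 4 dot products, vs. 2**DEPTH = 8 if every leaf ran:
# log-time conditional execution in place of one dense, wide MatMul.
```

Note what this does to the hardware mapping: the work is a handful of data-dependent small dot products, a branchy, sparse access pattern that dense MatMul accelerators handle poorly but event-driven or memory-centric designs could exploit.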
This offers a simple litmus test: if hardware cannot adapt to run new approaches as they emerge, it has already become too specialized.

A Winning Bet? Why You Shouldn’t Put All Your Tensors in One Basket

The hardware lottery has taught us that progress is not merely about brilliant ideas, but about the platforms that give those ideas life. We cannot afford to let the inertia of our current success steer us into a monoculture. Rather than choosing between exotic new hardware and wider access to the old, let’s make a strategic bet on evolution. By building accelerators with broader generality, we don’t discard our winning ticket; we hedge our bets by adding new numbers. The true jackpot isn’t raw speed, but the hardware-and-algorithm duo that unlocks the next computing era. Stop spinning the wheel and start redesigning the machine.

Acknowledgements: We want to thank our colleagues at Google DeepMind and across Google for their valuable feedback and insights while developing the ideas for this post.

About the Author: Amir Yazdanbakhsh is a Research Scientist at Google DeepMind, working at the intersection of machine learning and computer architecture. His primary focus is on applying machine learning to design efficient and sustainable computing systems, from leading the development of large-scale distributed training systems on TPUs to shaping the next generation of Google’s ML accelerators. His research on using AI to solve performance challenges in hyper-scale systems received an IEEE Micro Top Picks award.

Jan Wassenberg is a Senior Staff Software Engineer at Google DeepMind. Over the past 20 years, Jan has applied SIMD and vectorization to a wide range of domains.
His work includes founding the open source Highway library for performance-portable SIMD; developing vqsort, the fastest known sort for 64/128-bit integers; devising Randen, a CSPRNG sufficiently efficient to serve as Google’s default RNG; and leading the open source gemma.cpp project for LLM inference on CPU.

Authors’ Disclaimer: Portions of this post were edited with the assistance of AI models. Some references and notes were also compiled using AI tools. The content represents the opinions of the authors and does not necessarily represent the views, policies, or positions of Google DeepMind or its affiliates.

What would an AI have to say about this post? Listen to a two-way conversation generated by NotebookLM by pressing play right here. https://www.sigarch.org/wp-content/uploads/2025/10/The_Hardware_Lottery__Why_AI_s_MatMul_Monoculture_Risks_the_Nex.mp3

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.

Racing Camels in a Quantum Game of Camel Up

Jun 20th, 2025 2:00 PM

Camel Up is a light-hearted board game. Fueled by the randomness of dice, camels race around a cardboard track to be the first to cross the finish line. Throughout the race players place bets, trying to predict which camel will ultimately win. Unfortunately, correct predictions are difficult to make. Camels can ride on each other, benefiting from an opponent’s success, and external forces constantly shuffle the rankings. The result is a game filled with uncertainty. A camel that in one turn lags behind can find itself leading the race in the next.

The board game Camel Up (JIP, CC BY-SA 3.0, via Wikimedia Commons)

Today, the field of quantum computing could be construed as a game of Camel Up. The camels in our race are different qubit technologies: superconducting, trapped ions, neutral atoms, electron spins, photons; our finish line is a demonstration of scalable, practical quantum advantage.

Betting on Camels

Imagine you’re a player in this quantum game of Camel Up, and you’re trying to choose which camel to place a bet on. One strategy you might employ is to benchmark each camel on relevant metrics. If you believe the finish line lies in the Noisy Intermediate Scale Quantum (NISQ) regime, these would be physical system metrics: gate fidelities, measurement error rates, and coherence times. You might even try to calculate a single number to summarize these, such as Quantum Volume, to make comparison easier. And if you had done this six years ago you would have found a race with two camels in the lead: superconducting qubits and trapped-ion qubits. These technologies have features that are attractive for NISQ: they are mature, have good error rates, and exhibit flexible control for implementing general quantum programs. However, quantum hardware is error-prone, much more so than classical hardware, and optimistic projections set expected error rates between 1 in 1,000 and 1 in 10,000.
This is incompatible with large-scale applications in materials science, chemistry, and cryptography that require millions or even trillions of operations. In response, many researchers now believe the finish line lies in the regime of Fault-Tolerant Quantum Computing (FTQC), defined by lower error rates achieved through Quantum Error Correction (QEC).

But if you believe the finish line lies in the FTQC regime, how should you determine which camel to bet on? General physical system metrics are not sufficient. The end goal is logical-level performance, but this changes based on which QEC codes are used, of which there are hundreds.

Co-Designing with Camels

One way to answer this question is to look at the state of hardware today. With the shift towards FTQC, hardware design decisions have changed, and the co-design of quantum hardware with QEC codes has become increasingly popular. Google’s superconducting chips are designed to meet the connectivity requirements of a planar surface code, IBM’s roadmap now incorporates plans to implement the nonlocal c-couplers necessary for a family of quantum LDPC codes, and many companies involved in DARPA’s Quantum Benchmarking Initiative have co-designed quantum hardware with QEC codes, such as those for cat qubits and photonically linked silicon spin qubits. This increase in co-design reflects a fundamental difference in designing systems for FTQC compared to NISQ. For FTQC the role of hardware is not to implement any quantum program, but instead to implement a layer of quantum error correction.

What Makes a Camel Good at QEC

If QEC is critical to a camel’s success, it’s necessary to discuss which hardware features are most important for QEC. Surprisingly, the physical instruction set needed is quite simple: most codes can be implemented with a collection of Controlled-NOT and Hadamard gates. The key challenges are instead scalability and connectivity.
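The simplicity of that instruction set can be illustrated with the smallest QEC example, the 3-qubit bit-flip repetition code. The sketch below is a purely classical toy: the two parity checks play the role that CNOT-based stabilizer measurements (of Z1Z2 and Z2Z3) play on real hardware, and it ignores superposition and phase errors entirely.

```python
def encode(bit):
    """Redundantly encode one logical bit across three 'qubits'."""
    return [bit, bit, bit]

def syndrome(codeword):
    """Parities of neighboring qubits; a nonzero pair flags an error
    without ever reading out the encoded value itself."""
    return (codeword[0] ^ codeword[1], codeword[1] ^ codeword[2])

def correct(codeword):
    """Use the syndrome to locate and undo any single bit flip."""
    s = syndrome(codeword)
    flip = {(1, 0): 0, (1, 1): 1, (0, 1): 2}.get(s)  # syndrome -> error position
    if flip is not None:
        codeword[flip] ^= 1
    return codeword

# Any single flip, on either logical value, is detected and corrected.
for bit in (0, 1):
    for err in range(3):
        word = encode(bit)
        word[err] ^= 1
        assert correct(word) == encode(bit)
```

Real codes additionally protect against phase errors, but the structure is the same: a handful of CNOT/Hadamard-built parity checks, repeated relentlessly, which is why scalability and connectivity, not gate-set richness, dominate the hardware requirements.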
It’s expected that large-scale systems will require hundreds to thousands of simultaneous QEC codes which, depending on their encoding rates, can lead to resource estimates in the millions of physical qubits. This cost can be mitigated through the use of QEC codes with higher encoding rates, but doing so depends on the available connectivity. In particular, it’s known that higher degrees of connectivity are necessary to implement codes with higher encoding rates. Ideal hardware for QEC is therefore hardware that can scale in size effectively while maintaining a high degree of connectivity between physical qubits.

A Camel Case Study: Neutral Atoms

One camel that might fit this bill is neutral atom arrays, a newer hardware platform built from 2D arrays of optically trapped atoms. Although neutral atoms have received interest for NISQ applications, they’ve become increasingly popular in recent years as a platform for QEC. They have high connectivity between atoms: two-qubit interactions can go beyond nearest neighbors, and atoms can even be moved dynamically at runtime with reconfigurable traps. They also scale well: existing demonstrations have shown more than 6,000 trapped atoms, with expectations set at 10,000 atoms for a single device in the future.

However, these features also come with challenges. Individual control of atoms in large arrays can be difficult and impractical, logical operation times have historically been slow, and loss of atoms during execution requires lengthy reloading steps. As a result, co-designs of neutral atoms with QEC have tried to address these challenges. Researchers at Harvard and QuEra realized a system that avoids challenges in control scalability through mid-circuit movement with zoned control, and mitigates the impact of slow cycle times through fast, transversal error-corrected operations.
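For a sense of the arithmetic behind the millions-of-qubits estimates mentioned above, here is a rough, illustrative calculation using the common surface-code approximation of about 2d^2 physical qubits per logical qubit at code distance d. The numbers are assumptions for illustration, not a claim about any specific platform:

```python
def surface_code_physical(logical_qubits, distance):
    """Rough physical-qubit count: ~2*d^2 qubits per logical qubit
    (d^2 data qubits plus about d^2 ancilla) for a distance-d
    surface-code patch."""
    return logical_qubits * 2 * distance ** 2

# An assumed 1,000 logical qubits at an assumed distance of 25
# already lands in the millions of physical qubits.
print(surface_code_physical(1000, 25))  # -> 1250000
```

Higher-rate codes attack exactly this multiplier, packing more logical qubits per physical qubit, which is why the connectivity needed to implement them matters so much.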
Movement-based systems have further been co-designed with high-encoding-rate QEC codes such as Hypergraph Product Codes and Generalized Bicycle Codes (see animation below) for effective quantum memories.

https://www.sigarch.org/wp-content/uploads/2025/06/multiblock_movement_smaller.mp4

Another system, based on trapping two atomic species, was demonstrated at UChicago, leading to our own corresponding co-design with QEC. In that work we studied ways to mitigate control costs by exploiting global control per species. We also studied how to avoid slow cycle times, and we proposed interleaving QEC blocks to enable fast, transversal operations without the need for movement.

The Race Ahead

While neutral atoms are a promising platform for the FTQC regime, they’re only one camel in the race. Today, it’s still difficult to predict which hardware platform will demonstrate scalable, practical quantum advantage. However, a more interim step is to ask which hardware implements QEC best, and answering this requires the continued co-design of new QEC architectures with evolving quantum hardware.

About the Authors

Joshua Viszlai is a PhD student at the University of Chicago advised by Fred Chong. His research studies the co-design of quantum error correction and underlying quantum hardware as well as software systems in the fault-tolerant regime. His work has addressed a range of architectural questions in surface codes, quantum LDPC codes, neutral atom arrays, and QEC decoding. He will be on the job market looking for academic positions this upcoming year.

Fred Chong is the Seymour Goodman Professor of Computer Architecture at the University of Chicago and the Chief Scientist for Quantum Software at Infleqtion. He was the Lead Principal Investigator of EPiQC (Enabling Practical-scale Quantum Computation), an NSF Expedition in Computing, as well as the Lead PI of a Wellcome Leap Q4Bio project. He is also an advisor to Quantum Circuits, Inc.
