Programmable Registers Provide Flexibility Needed in AI Chips

There is probably no hotter topic in electronics right now than artificial intelligence (AI). AI was a fringe technology for decades, of much academic interest but rarely applied in the real world. Seemingly overnight, AI-based applications (apps) are everywhere, changing the way that programmers write code and how apps behave. This evolution has also had major implications for chip design, including the use of programmable registers to provide flexibility within the hardware.

The AI Difference

Explaining computers to non-technical people often relies on analogies such as recipes for baking and cooking. A program is described as a list of steps, much like a recipe, and the hardware’s job is to execute the steps as quickly as possible. This suggests a level of predictability in the calculations: a program might deliver unexpected results but it will do so in a predictable way, by following the steps in the recipe.

AI techniques such as machine learning (ML) take a fundamentally different approach. To stretch the analogy, an ML program might study millions of recipes to learn how they work and what sort of foods humans like. With this knowledge, ML can create new recipes that have never existed before but are still recognizable as such and should be palatable. AI can and does deliver unexpected results, sometimes impressive enough that humans question the true nature of its creativity.

AI computing is adaptive; the computations are not rigid and hard-coded as in a “recipe” program. The flexibility of AI algorithms to produce novel results is due largely to their adaptive nature. Computations evolve based on the input data, including learning/training examples, and the desired outputs. It is clear that software can be flexible enough for such a computing paradigm, but it is less clear for the hardware of the chips running the code.

Flexibility in Chip Design

ASICs and custom chips are “cast in stone” since the architecture and their design structures are fixed at fabrication time. One can imagine using a re-programmable device that can be configured on the fly in mission mode deployment, but this is impractical. Such devices are not large or fast enough for advanced applications, and most AI-based systems can’t possibly shut down for the time needed to reprogram them. The best approach is one that builds flexibility on top of the fixed design.

Architecturally-visible programmable registers provide just such a solution. These are often called configuration and status registers (CSRs) since they allow the software to control how the design operates and gather status on results. Registers form one side of the hardware-software interface (HSI), with low-level software such as drivers and embedded code on the other side. Chips typically support numerous operating profiles and options, and software controls these via the HSI.

Unlike simple register sets for holding data, CSRs are complex. Large system-on-chip (SoC) designs may have many thousands of registers, most with multiple fields. Many of these registers and fields have special attributes, such as read only, write only, locking, and shadowing. Designing and verifying these registers by hand, and then validating that they work properly with the software, is a big task in traditional SoC development.
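To make the field attributes above concrete, here is a minimal software-side sketch of one CSR with read-only, write-only, and lock behavior. The register layout (ENABLE, MODE, BUSY, KEY) and the class names are hypothetical, chosen only for illustration; real hardware implements these rules in RTL, and the exact semantics vary by design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    lsb: int      # bit position of the least significant bit
    width: int    # number of bits in the field
    access: str   # "rw", "ro", or "wo"

    @property
    def mask(self) -> int:
        return ((1 << self.width) - 1) << self.lsb


class Register:
    """Illustrative model of one 32-bit CSR with per-field access rules."""

    def __init__(self, fields, reset=0):
        self.fields = list(fields)
        self.value = reset
        self.locked = False

    def write(self, data: int) -> None:
        """Bus write: only writable fields change, and only when unlocked."""
        if self.locked:
            return
        for f in self.fields:
            if f.access in ("rw", "wo"):
                self.value = (self.value & ~f.mask) | (data & f.mask)

    def read(self) -> int:
        """Bus read: write-only fields read back as zero."""
        result = 0
        for f in self.fields:
            if f.access in ("rw", "ro"):
                result |= self.value & f.mask
        return result

    def hw_update(self, name: str, field_value: int) -> None:
        """Hardware-side update, such as a status bit set by the logic."""
        f = next(f for f in self.fields if f.name == name)
        self.value = (self.value & ~f.mask) | ((field_value << f.lsb) & f.mask)


# Hypothetical control/status register layout (an assumption, not from any real chip).
ctrl = Register([
    Field("ENABLE", 0, 1, "rw"),
    Field("MODE", 1, 3, "rw"),
    Field("BUSY", 4, 1, "ro"),
    Field("KEY", 8, 8, "wo"),
])
```

Even this toy model shows why hand-coding is error prone: every field needs its own mask, access rule, and interaction with the lock, and a large SoC repeats that pattern across thousands of registers.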

Flexibility in AI Chip Design

In a chip designed for AI applications, CSRs allow for the customization of data paths and processing logic. This adaptability is crucial for AI models, such as Transformers, which have several components that are not always strictly “in-order”. Various layers and weights of the model, the numerous mathematical implementations, and the activation functions are too variable to directly implement in the hardware. Instead, CSRs allow the hardware to adapt to the computational demands of the model’s processing logic, optimizing how the hardware supports the execution of complex AI computations for each layer.

As an example, Google’s open source AI model Gemma features several innovative techniques such as knowledge distillation, sliding window attention, and logit soft-capping, all of which demand highly adaptable hardware to efficiently execute these processes. These capabilities are achieved through a rearranging of layers during the training process and critical mathematical applications applied to numerous sections. This flexibility in processing complex models on ‘rigid’ hardware is enabled by CSRs, which allow for the dynamic adjustment of data paths and training parameters to match the model’s requirements.
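As a rough illustration of how a CSR can steer a fixed data path, the sketch below models logit soft-capping controlled by a single register. Soft-capping is commonly written as cap * tanh(x / cap); the register layout (an enable bit plus an 8-bit cap value) and the function names are assumptions made for this example, not a description of any actual chip.

```python
import math

# Hypothetical CSR layout: bit 0 enables soft-capping, bits 15:8 hold the cap.
SOFTCAP_EN_MASK = 0x0000_0001
SOFTCAP_VAL_SHIFT = 8
SOFTCAP_VAL_MASK = 0xFF << SOFTCAP_VAL_SHIFT


def pack_softcap_csr(enable: bool, cap: int) -> int:
    """Driver-side helper: build the value to write to the CSR."""
    if not 0 < cap <= 0xFF:
        raise ValueError("cap must be nonzero and fit in 8 bits")
    return (cap << SOFTCAP_VAL_SHIFT) | int(enable)


def datapath_logits(logits, csr: int):
    """Model of a fixed data path whose behavior depends on the CSR value."""
    cap = (csr & SOFTCAP_VAL_MASK) >> SOFTCAP_VAL_SHIFT
    if not (csr & SOFTCAP_EN_MASK) or cap == 0:
        return list(logits)  # bypass: logits pass through unchanged
    return [cap * math.tanh(x / cap) for x in logits]
```

The same data path serves a model that uses soft-capping and one that does not; only the value the software writes to the register changes.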

Programmable, configurable registers also open up options to trade off performance against power consumption and other metrics. The integration of CSRs into AI computing provides a way to achieve unparalleled performance and efficiency. By allowing dynamic adaptability in processing, they open new possibilities for scaling AI models and optimizing them for specific applications, all while using the same chip design.

AI software typically relies on low-level programming frameworks such as CUDA, OpenCL, or oneAPI to efficiently parallelize its computations. However, when AI-specific hardware is designed, such as custom chips integrating transformers, the design minimizes the need for software-driven parallelization, as the hardware is already tailored to the AI’s specific computational patterns.

Traditional chips do support AI models, but with a more “hammer-and-chisel” approach in which the drivers are forced to compute how to fit the model onto the hardware. This is a less refined approach that relies on intermediary software to figure out how to parallelize the computation.

Software is bound to be slower but more general purpose. Transformer-on-chip and similar solutions are fast because the hardware is already built with the exact plans for parallelization.

Additional Register Requirements

The register blocks in AI chips contain much more than just the registers themselves. The processors that program the registers and retrieve status access them via some form of bus interface. Typical examples include standard protocols such as APB, AHB, AHB-Lite, AXI4, AXI4-Lite, TileLink, Avalon, and Wishbone, plus a wide range of proprietary buses. These generally run on different clocks than the AI hardware, so clock-domain-crossing (CDC) logic is needed to avoid metastability.

CSRs are critical for the proper operation of the new generation of AI chips, and studies have shown that they present a high risk for design safety and security. An alpha particle flipping a register bit or silicon aging effects cannot be allowed to compromise the functionality in many critical applications. The register block must include appropriate safety mechanisms to detect and handle such faults. Typical examples include parity, error-correcting codes (ECCs), cyclic redundancy checks (CRCs), and triple modular redundancy (TMR).
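Two of the mechanisms named above, parity and TMR, are simple enough to sketch in a few lines. The following is a behavioral model only, with hypothetical class and function names; in a real register block these checks are implemented in hardware alongside the storage elements.

```python
def even_parity(word: int) -> int:
    """Return the parity bit that makes the total number of 1s even."""
    return bin(word).count("1") & 1


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote across three redundant copies."""
    return (a & b) | (a & c) | (b & c)


class ProtectedRegister:
    """Illustrative register stored as three copies plus a parity bit."""

    def __init__(self, value: int = 0):
        self.write(value)

    def write(self, value: int) -> None:
        self.copies = [value, value, value]
        self.parity = even_parity(value)

    def read(self):
        """Return (voted value, fault flag)."""
        voted = tmr_vote(*self.copies)
        mismatch = any(copy != voted for copy in self.copies)
        parity_error = even_parity(voted) != self.parity
        return voted, mismatch or parity_error
```

A single flipped bit in one copy is outvoted by the other two, so the read still returns the written value while the fault flag lets the system log or react to the upset.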

It is also important to note that developing registers goes well beyond just the RTL design:

The verification engineers must develop register models, testbenches, and tests compatible with the Universal Verification Methodology (UVM)
The programmers must write the software that configures and controls the design via the HSI
The pre-silicon validation team must run the UVM testbenches and software together to perform hardware-software co-verification
The bringup team must run the software on fabricated chips in the lab to perform post-silicon validation
The technical writers must create accurate, readable documentation

An Automated Solution for AI Chips

Manual register development flows are inefficient, and must be repeated every time that the register specification changes over the course of the project. Anything written by hand is also error prone, subject to typos and differing interpretations of specifications written in natural language. The only solution is a development flow using specification automation. With this approach, a wide range of files are generated automatically from executable register specifications, and re-generated whenever changes are made.
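The idea of generating several outputs from one specification can be sketched with a toy example. The dictionary format, block name, and generator functions below are hypothetical and far simpler than any real register specification format or commercial tool; the point is only that the C header and the documentation come from a single source and cannot drift apart.

```python
# Hypothetical register specification, invented for this sketch.
SPEC = {
    "block": "NPU",
    "base": 0x4000_0000,
    "registers": [
        {"name": "CTRL", "offset": 0x00, "desc": "Control"},
        {"name": "STATUS", "offset": 0x04, "desc": "Status"},
    ],
}


def gen_c_header(spec) -> str:
    """Emit C #define lines for the block base address and register offsets."""
    block = spec["block"]
    lines = [f"#define {block}_BASE 0x{spec['base']:08X}u"]
    for reg in spec["registers"]:
        lines.append(f"#define {block}_{reg['name']}_OFFSET 0x{reg['offset']:02X}u")
    return "\n".join(lines)


def gen_doc_table(spec) -> str:
    """Emit a plain-text register table for documentation."""
    rows = [f"{'Offset':<8}{'Name':<10}Description"]
    for reg in spec["registers"]:
        offset = f"0x{reg['offset']:02X}"
        rows.append(f"{offset:<8}{reg['name']:<10}{reg['desc']}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(gen_c_header(SPEC))
    print(gen_doc_table(SPEC))
```

When a register is added or moved, only SPEC changes and both outputs are regenerated, which is the property that makes specification automation attractive as the register count grows.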

The IDesignSpec™ Suite from Agnisys provides the industry’s best solution for adding CSRs to complex chip designs, including those for AI applications. Users can define the registers in any of several standard formats, including SystemRDL, IP-XACT, JSON, RALF, YAML, and XML, or in an intuitive interactive editor. They can use the PSS standard or another specialized editor to define the sequences required to configure and program the CSRs.

IDesignSpec supports all the bus interfaces and safety mechanisms listed previously. It supports memories and more than 400 special register types. These include indirect, indexed, read-only/write-only, alias, lock, shadow, FIFO, buffer, interrupt, counter, paged, virtual, external, read/write pairs, and combinations of these types. From the register and sequence specifications, IDesignSpec generates:

The RTL design, including bus interfaces, CDC logic, and safety mechanisms
The UVM models, testbenches, and tests required for design verification
The C/C++ tests required for pre-silicon and post-silicon validation
User-quality documentation in Microsoft Word, HTML, PDF, Markdown, or DITA format

Conclusion

AI demands flexibility and adaptability in the hardware as well as the software, and CSRs are a key enabler. They provide the software powering these massive AI models with the hooks needed to optimize the hardware.

Given the importance and complexity of registers in an AI chip, specification automation is required to reduce project resources, shrink schedules, and achieve first-silicon success. The Agnisys IDesignSpec Suite has been proven on countless development projects, including many AI applications.