Architecting Chips For High-Performance Computing

Data center IC designs are evolving, based on workloads, but making the tradeoffs for those workloads is not always straightforward.

The world’s leading hyperscaler cloud data center companies — Amazon, Google, Meta, Microsoft, Oracle, and Akamai — are launching heterogeneous, multi-core architectures specifically for the cloud, and the impact is being felt in high-performance CPU development across the chip industry.

It’s unlikely that any of these chips will ever be sold commercially. They are optimized for specific data types and workloads, with huge design budgets that can be rationalized by the savings from improved performance and lower power. The goal is to pack more compute horsepower into a smaller area with lower cooling costs, and the best way to do that is through customized architectures, with tightly integrated micro-architectures and well-designed data flows.

This trend started nearly a decade ago with AMD’s embrace of heterogeneous architectures and accelerated processing units, which supplanted the old model of homogeneous multi-core CPUs, but it was slow to gain traction. It has since taken off, following in the footsteps of the kinds of designs created for mobile consumer devices, which must contend with a very tight footprint and stringent power and thermal requirements.

“Monolithic silicon from industry stalwarts like Intel has AI NPUs in nearly every product code,” said Steve Roddy, vice president of marketing at Quadric. “And of course, AI pioneer NVIDIA has long utilized a mixture of CPUs, shader (CUDA) cores, and Tensor cores in its enormously successful data center offerings. The move in coming years to chiplets will serve to cement the transition completely, as the system buyer specifying the chiplet assemblage can pick and choose the types of compute and interconnect uniquely tailored to the design socket in question.”

Much of this is due to physics and the resulting economics. With the benefits of scaling shrinking, and the maturation of advanced packaging — which allows many more customized features to be added into a design that in the past were constrained by the size of the reticle — the competitive race for performance per watt and per dollar has kicked into high gear.

“Everyone is building their own architectures these days, especially the data center players, and a lot of the processor architecture comes down to what the workload looks like,” said Neil Hand, director of marketing, IC segment at Siemens EDA. “At the same time, these developers are asking what the best path to acceleration is, because there are many ways to do it. You can go with the parallelism route with lots of cores, which doesn’t work for some things, but works well for others. At the same time, applications are more memory-bandwidth limited, so you’ll start to see some of the high-performance compute companies spending all of their effort on the memory controllers. There are others that are saying, ‘It’s actually a decomposition problem, and we’re going to go the accelerator route, and have individual cores.’ But I don’t think there’s a one size fits all.”

Roddy noted that the CPU cores inside these new superchips still hew to the tried-and-true principles of high-performance CPU design — fast, deep pipelines that are supremely effective at chasing pointers — but that’s no longer the sole focus for design teams. “Those big CPUs now share real estate with other programmable engines — GPUs and general-purpose programmable NPUs that accelerate AI workloads,” he said. “And one notable difference from the highly specialized SoCs found in mass consumer devices is the avoidance of hard-wired logic blocks (accelerators) for tasks such as video transcoding or matrix acceleration in AI workloads. Devices designed for data centers need to maintain programmability to respond to a variety of workloads, not just a single known function in a consumer appliance.”

However, all of this requires much more analysis, and the design community is continuing to push more steps further left in the flow. “Whether it is because of the tooling, or through emulation or virtual prototyping, you have the tools to understand what the data is,” Hand said. “Also, the industry has grown and there is sufficient specialization to justify the expense. The first part talks to reducing the risk of doing new hardware, because you have the tooling to understand and you don’t have to play it safe and build the one size fits all. Now, the market has started to split, such that it is important enough that you can spend the money to do it. Additionally, there are also the ways and means to do it now. In the past, when Intel would bring out a processor, for someone to compete with Intel, it was almost inconceivable. Now, through a combination of the ecosystem, the technology, and everything else, it becomes much easier. The low-hanging fruit for the high-performance compute companies initially was, ‘Let’s just get a good platform that allows us to dimension it the way we want, and put a few accelerators in.’ So we started to see AI accelerators and video accelerators, then some of the more esoteric companies started to go after machine learning. What does that mean? It means they need very high MAC performance, for instance. They would focus their processor architecture on that, and that’s how they’re going to differentiate.”

Add in RISC-V and re-usable chiplets and hard IP, and architectures start to look very different than even a few years ago. “If you look at the data center and the whole software stack in the data center now, putting something in that stack is not as hard as it used to be, where you’d have to rebuild the whole data center,” Hand said. “What has become important today is the ability to do the system-level analysis. The system-level co-design into the application has become important, and it’s become more accessible because high-performance compute is no longer what it used to be. This is a data center on wheels.”

Many contend new architectures should be developed to overcome the memory challenges that have been seen for several CPU generations. “The need for AI/ML will accelerate the process of developing new application-specific architectures,” said Andy Heinig, head of department for efficient electronics in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “The classic CPUs can be part of this revolution if they provide a much better memory interface to solve the memory problem. If CPUs provide such new memory architectures, AI/ML accelerators can be the best solution for data centers alongside the CPU. The CPU is then responsible for classic tasks where flexibility is required, while the accelerator provides the best performance for a very specific task.”

Arm, for example, collaborates directly with multiple hyperscalers on its Neoverse-based compute solutions for high performance, flexibility through customization, and a strong software and hardware ecosystem. This has produced publicly announced silicon like AWS Graviton and Nitro processors, Google’s Mt. Evans DPU, Microsoft Azure’s Cobalt 100, NVIDIA’s Grace CPU Superchip, and Alibaba’s Yitian 710.

“We are learning a lot from these and other design partners,” said Brian Jeff, senior director of product management for Arm‘s Infrastructure line of business. “One of the main ways we are shaping high performance CPU and platform development is through a deeper understanding of infrastructure workloads, leading to specific architectural and microarchitectural enhancements, particularly to the front end of our CPU pipelines and our CMN mesh fabric.”

But capturing that workload and developing a chip architecture for it isn’t always straightforward. This is particularly true for AI training and inferencing, which can shift as algorithms change.

“There are different models that are being trained, such as Meta’s publicly available Llama model, and the ChatGPT model,” said Priyank Shukla, principal product manager, interface IP at Synopsys. “All these models have a pattern, a certain number of parameters. Let’s take the example of GPT-3, which has 175 billion parameters. Each parameter has a 2-byte width, so that is 16 bits. You need to store this much information — 175 billion parameters times 2 bytes, which is equal to 350 billion bytes in memory. That memory needs to be stored in all the accelerators that share that model, which needs to be placed on the fabric of the accelerators, and the parameters need to be placed in the memory associated with this accelerator. So you need a fabric that can take that bigger model and then process it. You can implement that model in a different way — the way you implement that algorithm. You can do some work in serial fashion, and some work you can do in parallel. The work that happens in serial fashion needs to be cache-coherent with minimal latency. That kind of work, which happens in serial, will be divided within a rack so that the latency is minimal. The work that can happen in parallel will be split across racks over a scale-out network. We see the system guys are creating this model, and the algorithm, and implementing it in custom hardware.”
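
Shukla’s arithmetic can be sanity-checked with a quick back-of-the-envelope sketch. The per-accelerator memory capacity below is an illustrative assumption rather than a figure from the article, and the calculation covers weights only; activations, KV caches, and optimizer state would push the numbers considerably higher.

```python
import math

def weight_memory_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the parameters (no activations or optimizer state)."""
    return num_params * bytes_per_param

def min_accelerators(total_bytes: float, mem_per_device_bytes: float) -> int:
    """Smallest number of devices whose combined memory can hold the weights."""
    return math.ceil(total_bytes / mem_per_device_bytes)

# GPT-3 example from the quote above: 175 billion parameters at 2 bytes each.
GPT3_PARAMS = 175e9
BYTES_PER_PARAM = 2
HBM_PER_DEVICE = 80e9   # assumed 80 GB of memory per accelerator (illustrative only)

total = weight_memory_bytes(GPT3_PARAMS, BYTES_PER_PARAM)
print(f"Weights alone: {total / 1e9:.0f} GB")                  # ~350 GB
print(f"Minimum devices just to hold the weights: {min_accelerators(total, HBM_PER_DEVICE)}")
```

Whether those devices sit in one rack or several then comes down to the latency split Shukla describes: the serial, cache-coherent portion of the work stays within a rack, while the parallel portion scales out across racks.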

Fig. 1: ML-optimized server rack. Source: Synopsys

Assembling the various processing elements is non-trivial. “They are heterogeneous multi-core architectures, typically a mix of general-purpose CPUs and GPUs, depending on the type of company, since they have a preference for one or the other,” said Patrick Verbist, product manager for ASIP tools at Synopsys. “Then there are RTL accelerators with a fixed function, which are in the mix of these heterogeneous multi-core architectures. The types of application loads these accelerators run, in general, include data manipulation, matrix multiplication engines, activation functions, compression/decompression of parameters, the weights of the graphs, and other things. But the one thing they all have in common is that they perform these operations at massive scale. Usually, these computations are done on either standard or custom data types. Int 16 is typically supported on many processing architectures, but you don’t want to waste 16 bits out of a 32-bit data path if you only have to process 16 bits of data. You have to make it custom. So instead of only running floating-point 32 data types, the accelerators also need to support int 8 and/or int 16, maybe half-precision float, custom int or custom float data types, and the functional units, the operators, are typically a combination of vector adders, vector multipliers, adder trees, and activation functions. These activation functions are typically transcendental functions like exponentials or hyperbolic functions, square roots, massive divisions, but vectorized and with single-cycle throughput requirements, because every cycle you want to start a new operation on these things. For these kinds of accelerators, in that sweep of heterogeneity, we see many customers using ASIPs (application-specific instruction-set processors) as one of those blocks in this heterogeneous space. ASIPs allow you to customize the operators, so the data path and the instruction set only execute a limited set of operations in a more efficient way than a regular DSP can do.”

“A DSP is typically not performant enough; it’s too general-purpose. On the other hand, fixed-function RTL may not be flexible enough, and that creates the space for, ‘Yes, we need something more flexible than fixed-function RTL and less flexible than a general-purpose DSP.’ That’s where ASIPs have their play,” Verbist said. “If you look at a GPU, a GPU is also, to a certain extent, general-purpose. It has to support a variety of workloads, but not all workloads. And that’s where ASIPs come into play to support the flexibility and the programmability. You need that flexibility to support a range of computation algorithms to adapt to changing software or AI graph requirements, as well as the changing requirements of the AI algorithms themselves.”
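
As a rough illustration of why those narrow, custom data types matter (this is a generic NumPy sketch, not any particular ASIP toolflow), the code below quantizes fp32 weights to int8. The 4x smaller footprint translates directly into memory bandwidth saved, and a fixed-width vector datapath fits four times as many int8 lanes per cycle.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: q = round(w / scale), scale = max|w| / 127."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate fp32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())

print(f"fp32 footprint: {w.nbytes / 1e6:.1f} MB, int8 footprint: {q.nbytes / 1e6:.1f} MB")
print(f"max quantization error: {max_err:.4f}")
print(f"lanes per 512-bit vector: fp32 = {512 // 32}, int8 = {512 // 8}")
```

An ASIP takes this a step further by baking exactly the needed widths and operators — the vector MACs, adder trees, and vectorized activation functions Verbist describes — into the instruction set, rather than emulating them on a general-purpose datapath.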

Siemens’ Hand sees accounting for workload as a tough challenge.

“To address this, the vertically integrated companies are investing in high performance compute in this way, because high performance compute is not that different from AI, and you can only work off the data patterns you see,” Hand said. “If you’re the likes of an Amazon or a Microsoft, then you’ve got a huge amount of trace data without snooping on any data, and you know where the bottlenecks are in your machines. You can use that information and say, ‘We are seeing that we’re hitting memory bandwidth limits, and we have to do something with that, or it’s a network bandwidth issue, or it’s an AI throughput issue, and we’re being caught in these areas.’ It’s really no different than the challenges that are happening on the edge. The goal is different at the edge, where we’re often looking at it saying, ‘What can I get rid of? What don’t I need?’ Or, ‘Where can I shrink the power envelope?’ Whereas in the data center, you’re asking, ‘How can I push more data through, and how can I do it in a way that I do not burn up the devices? As the devices get bigger and bigger, how can I do it in a scalable way?’”

Hand believes the move to multi-die is going to drive a lot of interesting development, and it is already being used by the likes of AMD and Nvidia. “Now you can start to have some interesting plug-and-play assemblies for these high performance compute applications, where, at a very large level, you can start to say, ‘What is my interconnect die for this application? What is my processing die for this application?’ It gives a middle ground. On one end, you’re building a standard computer without much to change: What can I do? I can put in different processors, different network cards, different DIMMs. There’s a limit to what I can do as a cloud provider to differentiate. On the other end, you get the big cloud providers like Microsoft Azure that say, ‘I can build my own full SoCs and do whatever I like. And I can go build that.’ But you can now get this medium ground where, let’s say, you decide there’s a market for a biological compute data center, and there’s enough people getting into this that you can make some money. Can you assemble a 3D IC and make it work in that environment? It’s going to be interesting to see what pops up, because it will lower the barrier to entry. We’re already seeing it being used by the likes of Apple, Intel, AMD, and Nvidia as a way of getting faster times on their spins and more variety without having to test humongous dies, and I think that’s going to have a much bigger impact on high performance compute than people realize. When you start to combine them with things like a full digital twin of the environment, you can start to understand the workloads in the environment, understand the bottlenecks, then try different partitioning and then push down.”

Arm’s Jeff also sees data center chip architectures changing to accommodate AI/ML functionality. “On-CPU inference is very important, and we are seeing our partners take advantage of our SVE pipes and matrix math enhancements and data types to run inference. We are also seeing close coupling of AI accelerators through high-speed coherent interfaces come into play, and DPUs are expanding their bandwidth and intelligence to link nodes together.”

Multi-die is inevitable
The chip industry is well aware that a single-die solution is becoming unrealistic for many compute-intensive applications. The big question over the last decade was when the shift to a multi-chiplet solution would become more mainstream. “The whole industry is at an inflection point where you cannot avoid this anymore,” said Sutirtha Kabir, R&D director at Synopsys. “We talk about Moore’s Law and ‘SysMoore’ in the background, but the designers have to add more functionality in the CPUs and GPUs and there’s just no way they can do that because of reticle size limits, yield limits, and all that in one chip. Multi-die chips are an inevitability here, which brings up some interesting considerations. Number one, take a piece of paper and fold it. That’s basically one example of what going multi-die looks like. You take one chip, you fold it over, and if you can design this smartly, you can think that instead of having long timing paths, you can actually shorten the timing drastically. If you’re going from the top chip to the bottom chip, all you’re going through possibly is a small amount of routing in the chips, but they’re mostly bump-to-bump or bond-to-bump.”
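
Kabir’s “fold the paper” point can be made concrete with a toy model. The wire-delay and bump-crossing numbers below are placeholder assumptions, not characterization data, but the geometry shows why stacking shortens the worst-case lateral route even after paying for the vertical hop.

```python
import math

# Toy model of "folding" a die: same total silicon area, split across two stacked
# tiers connected face-to-face. All delay figures are placeholder assumptions.
DIE_EDGE_MM = 20.0            # assumed edge length of the monolithic square die
WIRE_DELAY_PS_PER_MM = 60.0   # assumed buffered global-wire delay
BUMP_HOP_PS = 20.0            # assumed delay of one bump/bond crossing between tiers

# Worst-case corner-to-corner Manhattan route on the single large die.
mono_route_mm = 2 * DIE_EDGE_MM
mono_delay_ps = mono_route_mm * WIRE_DELAY_PS_PER_MM

# Fold into two tiers: each tier has half the area, so its edge shrinks by sqrt(2).
stacked_edge_mm = DIE_EDGE_MM / math.sqrt(2)
stacked_route_mm = 2 * stacked_edge_mm
stacked_delay_ps = stacked_route_mm * WIRE_DELAY_PS_PER_MM + BUMP_HOP_PS

print(f"Monolithic worst-case route: {mono_route_mm:.0f} mm  (~{mono_delay_ps:.0f} ps)")
print(f"Stacked worst-case route:    {stacked_route_mm:.1f} mm  (~{stacked_delay_ps:.0f} ps)")
```

How much of that theoretical win a real design captures depends on how the floorplan is actually partitioned, which is exactly the 3D floor-planning problem described below.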

Multi-die design challenges include figuring out how many paths need to be synchronous, whether timing should be closed across the two chips together or for each chip individually, and whether the L1 cache should sit on the top chip or the bottom chip — and whether an L4 can be added.

“The floor-planning now becomes very interesting from a 3D standpoint,” Kabir explained. “You’re taking a single-story house and making it three or four stories. But then there are other design challenges. You cannot ignore thermal anymore. Thermal used to be something for the PCB and the system designers, but now these chips are extremely hot. Jensen Huang said recently at SNUG that you send room-temperature water in at one end and it comes out Jacuzzi temperature at the other. He was joking, but the fact is, these are hot chips from a temperature standpoint, and if you don’t take that into account during your floor-planning, you’re going to fry your processors. That means you have to start doing this much earlier. In terms of 3D floor-planning, when it comes to workloads, how do you know that you have analyzed the different workloads for multi-die and made sure that critical effects like IR, thermal, and timing are accounted for when you don’t even have a netlist? We call this the zero-netlist phase. These considerations are all becoming very interesting, and because you cannot avoid doing multi-die anymore, they are all front and center in the ecosystem from the foundry standpoint, from the EDA standpoint, with the designers in the middle.”

Closely connected to the thermal concerns of data center chips are the low-power design aspects.

“These data centers use a huge amount of power,” said Marc Swinnen, director of product marketing at Ansys. “I was at ISSCC in San Francisco and we were in a booth right next to NVIDIA, which was showing one of its AI training boxes — a big, bold box with eight chips, and scads and scads of fans and heat sinks. We asked how much power it uses, and they said, ‘Oh, 10,000 watts at the top, but on average 6,000 watts.’ Power is really becoming crazy.”
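
Some rough arithmetic puts those numbers in perspective. Only the 6,000 W average and 10,000 W peak come from the quote above; the utilization and electricity price below are illustrative assumptions, and cooling overhead (PUE) is ignored.

```python
# Back-of-the-envelope energy math for the 8-accelerator training box described above.
AVG_POWER_W = 6_000        # average draw cited in the quote
PEAK_POWER_W = 10_000      # peak draw cited in the quote
NUM_ACCELERATORS = 8

HOURS_PER_YEAR = 8_760
UTILIZATION = 0.90         # assumed fraction of the year the box is busy
PRICE_PER_KWH = 0.10       # assumed electricity price in USD (cooling/PUE not included)

energy_kwh = (AVG_POWER_W / 1_000) * HOURS_PER_YEAR * UTILIZATION
print(f"Peak-to-average power ratio: {PEAK_POWER_W / AVG_POWER_W:.2f}")
print(f"Average power per accelerator: {AVG_POWER_W / NUM_ACCELERATORS:.0f} W")
print(f"Energy per box per year: {energy_kwh:,.0f} kWh")
print(f"Electricity cost per box per year: ${energy_kwh * PRICE_PER_KWH:,.0f}")
```

Multiplied across thousands of such boxes in a single facility, that electricity bill is a large part of why performance per watt dominates these architecture decisions.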

Arm’s Jeff agreed that the best way to address the new challenges in data center chips is a complete system approach that comprehends the instruction set architecture, the software ecosystem and specific optimizations, the CPU micro-architecture, the fabric, system memory management and interrupt control, as well as in-package and off-chip I/O. “A complete system approach enables us to work with our partner to tailor SoC designs to modern workloads and process nodes, while taking advantage of chiplet-based design approaches.”

This approach to custom chip design enables data center operators to optimize their power costs and computing efficiency. “The high efficiency of our Neoverse N-series enables very high core counts, from 128 to 192 cores and above per socket,” Jeff said. “These same N-series products can scale down to DPUs, 5G L2 designs, and edge servers in smaller footprints. Our V-series line targets the cloud with higher performance per thread and higher vector performance (for workloads like AI inference and video transcoding), while still offering high efficiency. A wide array of accelerator-attach options enables our partners to bring the right mix of custom processing and cloud-native compute together in an SoC tailored to their workloads.”

Conclusion
Where all of this ends up is nearly impossible to predict, given the evolutionary nature of high-performance computing, especially since there are different aspects as to how the data center can be optimized. “Toward the start of the explosion of the web stuff, people started to make north-south and east-west routing within the data center, and that changed all the network switching architectures, because that was the big bottleneck,” said Siemens’ Hand. “That led to a whole rethinking of what a data center looked like. Similar things have been happening on the memory side of things, and it’s going to be really interesting to see, when you start to get into integrating optics and some of the smarter memories, how that changes it.”

Hand pointed to an Intel Developer Forum years ago when the company explained how it was using surface-emitting optics in silicon photonics to separate the memory from the storage in a data center rack. “They had a unified memory structure that could share between servers and could allocate memory from different servers,” he said. “So the topology of a data center starts to look really interesting. Even in the rack, you look at the NVIDIAs that have an AI system structure that doesn’t look like a traditional server rack. The big shift is the fact that people can look at it, and if there’s a market, you can build it. We always think about the architecture being about whether the core is fast. We went from, ‘Is the core fast?’ to, ‘Do I have enough cores?’ But it goes deeper than that. Once you start breaking von Neumann architectures and start using different memory flows and start looking at in-memory compute, it gets really cool. And then you say, ‘What does high-performance compute even mean?’ It’s going to depend on the application.”



