AMD: Addressing the Challenge of Energy-efficient Computing

Advanced Micro Devices set an ambitious "25x20" goal: a 25-fold improvement in energy efficiency by 2020. The company exceeded that objective and has now established a new "30x25" goal: 30 times greater energy efficiency by 2025 in machine learning and high-performance computing in data centers.

Sam Naffziger, AMD's senior vice president, corporate fellow, and product technology architect, spoke about this ambition in an interview. Over the previous few generations, AMD's graphics processing units (GPUs) and central processing units (CPUs) have undergone significant changes as the company strives to meet the needs of avid gamers and data center computing, along with the demand for better performance-per-watt.

It's a recognition that performance isn't the only valuable metric to pursue. If our data centers melt the polar ice caps, they're no longer very useful. According to Naffziger, the chip industry is struggling against the limits of Moore's Law.

Here's an edited transcript of our conversation.

Sam Naffziger: I have been with AMD for 16 years. For most of that time, I have been leading our power efficiency and power technology. For the last few years, I have been in a product architecture role across the company, optimizing all of our products to make them the finest in the world. I started in late 2017 to lead an effort to restore competitiveness and leadership in the graphics division.

We've now established an extremely solid track record that we are quite ecstatic about. Everything from servers to high-performance computing to gaming is up and to the right. It's a great time to focus on efficiency improvements, since it's been a long time since we launched the 25 by 20 initiative.

AMD's way of doing things is to be very transparent and not to set broad, unmeasurable goals that sound compelling but that you can't be held accountable for. We were very clear about the measurement methodology over time. By the 2020 product deployment, we had met and exceeded that 25X goal, which was difficult to accomplish. It required driving performance up and power down simultaneously, and significant engineering innovation.

We wanted to continue to improve upon that success. Notebooks are fantastic, and certainly efficiency and battery life drive a lot of consumer experience improvements there. We focused on the data center as well, with the 30 by 25 strategy we announced last year to achieve 30X efficiency gains in the machine learning and high-performance computing industries. That's the first step on the journey to 30X efficiency.

The CDNA products that followed are compatible with RDNA. They share a common core of graphics IP and components. These methodologies and approaches apply to both. That's where we've been focusing on the gaming side as well. We created a long-term strategy back when I joined the graphics group. The Big Navi, or Navi 21, was a successor to the first RDNA generation.

We are using AMD's unique strengths in having excellent CPU and GPU technology, and nobody else has both, at least not yet. We just thrive on innovating, solving difficult problems, and working together across the company. Our CPU designers had done a fantastic job with the Zen architecture and delivery.

Graphics architecture is a completely different design area. It handles textures and pixels, and its clock frequency has historically been around 1 GHz. We conducted a lot of research and design reviews to see how we could draw on our CPU capabilities to dramatically improve what graphics could deliver for efficiency. That's where many of the RDNA 2 improvements came from.

Naffziger: There are many games that can be played. A dual GPU may be able to operate at a higher efficiency, delivering more performance-per-watt. It is a matter of focus. We are certainly not discounting Nvidia's work, because they do have very powerful designs. We made a strategic decision to never fall behind again on performance-per-watt.

Power efficiency allows for more flexibility in design. We can choose to maximize performance while still using a lot of power, or optimize for efficiency. That is another component that we have exploited and invested heavily in: power management. We have pushed frequency to an all-time high, but we have done that in a power-efficient manner.

Frequency has the reputation of resulting in high power. However, if we redesign the methods and eliminate huge gates, additional pipe stages, and things like that, we can get the work done quicker. Nvidia might be able to reach 2.5 GHz too, but only by taking the voltage up to very high levels, 1.2 or 1.3 volts. That's a squared effect on power.
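
The "squared effect" Naffziger mentions follows from the standard first-order model of CMOS dynamic power, P ≈ a·C·V²·f, where a is switching activity, C is switched capacitance, V is supply voltage, and f is clock frequency. The sketch below is a minimal illustration of that model; the operating points (1.0 V at 2.0 GHz versus 1.3 V at 2.5 GHz) are assumptions chosen for the arithmetic, not figures from AMD or Nvidia.

```python
def dynamic_power(voltage, freq_ghz, capacitance=1.0, activity=1.0):
    """First-order CMOS dynamic power, P = a * C * V^2 * f, in arbitrary units."""
    return activity * capacitance * voltage ** 2 * freq_ghz


# Hypothetical operating points, used only to show the scaling.
baseline = dynamic_power(voltage=1.0, freq_ghz=2.0)
high_voltage = dynamic_power(voltage=1.3, freq_ghz=2.5)

freq_ratio = 2.5 / 2.0                   # 1.25x the clock
power_ratio = high_voltage / baseline    # about 2.11x the dynamic power

print(f"frequency x{freq_ratio:.2f}, dynamic power x{power_ratio:.2f}")
```

Under these assumed numbers, a 25% frequency gain bought with a 30% voltage increase roughly doubles dynamic power, which is why reaching the same frequency at lower voltage matters for performance-per-watt.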

With smart power management, we can distinguish whether we're in a phase of a game that benefits from much higher frequency or one that is limited by memory bandwidth. There's no need to run the processor at maximum frequency if you're waiting for memory access. We developed very high-speed microcontrollers that tap into the performance monitors deep in the design to get insights into what's going on in the engine.
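
The idea of lowering the clock while the engine waits on memory can be sketched as a simple frequency-selection policy. This is an illustrative toy, not AMD's firmware: the function name, the two counter inputs, and the frequency range are all hypothetical.

```python
def choose_frequency(compute_util, memory_stall_frac, f_min=1.0, f_max=2.5):
    """Pick a clock in GHz from two hypothetical performance-monitor readings.

    compute_util:      fraction of cycles doing useful compute work (0..1)
    memory_stall_frac: fraction of cycles spent waiting on memory (0..1)
    """
    # When most cycles are memory stalls, extra frequency buys little,
    # so scale the clock by the share of time that is not stalled.
    headroom = compute_util * (1.0 - memory_stall_frac)
    headroom = max(0.0, min(1.0, headroom))  # clamp to a valid fraction
    return f_min + (f_max - f_min) * headroom


compute_bound = choose_frequency(compute_util=1.0, memory_stall_frac=0.0)
memory_bound = choose_frequency(compute_util=1.0, memory_stall_frac=1.0)
mixed = choose_frequency(compute_util=0.8, memory_stall_frac=0.5)
```

A real controller would sample such counters continuously and also account for voltage, temperature, and power limits; the sketch shows only the decision the paragraph describes.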

The other thing is just switching capacitance optimizations. Most of my background is in computer design. I designed a lot of the power improvements there that culminated in the Zen architecture. There are a lot of detailed engineering metrics that we monitor to evaluate the performance of the architecture. We should only switch the gates that are producing useful work.

We examine our design pre-silicon, as we are developing it, to see what we can switch off. It's a mentality change: analyzing each implementation to see whether or not it's required for performance. If it is not, shut it off. We took those kinds of approaches and that thinking from our CPU side and drove a pretty dramatic improvement in all of those switching metrics.

Raja is a visionary. He paints a wonderful and compelling picture of the gaming future and the capabilities that are required to take the gaming experience to the next level. He was an integral part of AMD's software development team until he left. He spent a year working with AMD on software development.

Naffziger: That's a very important point. The underlying manufacturing technology is critical. In fact, we typically break out the percentage gains that we got from each dimension: performance-per-watt, power efficiency optimizations, and process technology. That's important. Nvidia has the freedom to choose TSMC as well. Intel's new Arc line has the same process technology as our GPUs.

The other thing to note is that from RDNA 1 to RDNA 2, we managed to get a doubling of performance and a 50% increase in performance-per-watt. We were pleasantly surprised by that. The Infinity Cache in particular is very powerful. We believe Nvidia will follow suit with larger last-level caches. But no one's at 128MB yet.
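
As a back-of-the-envelope check on those two figures: performance-per-watt is performance divided by power, so the power ratio is the performance ratio divided by the performance-per-watt ratio. The calculation below assumes both figures refer to the same pair of products and workloads, which the interview does not state.

```python
def implied_power_ratio(perf_ratio, perf_per_watt_ratio):
    """Power ratio implied by a performance gain and a perf-per-watt gain."""
    return perf_ratio / perf_per_watt_ratio


# 2x performance with 1.5x performance-per-watt (figures quoted above).
ratio = implied_power_ratio(perf_ratio=2.0, perf_per_watt_ratio=1.5)
print(f"implied power: x{ratio:.2f}")
```

Under that assumption, doubling performance cost roughly a third more power, where holding efficiency flat would have required twice the power.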

Naffziger: It's been a real engineering challenge. We made a strategic choice to split our graphics line. They share a lot of common components, but they are different architecture lines: Compute DNA and Radeon DNA. We reduced the overhead for 3D rendering because of the smaller space available.

Once we had that separate sandbox, if you will, where it's all about a compute optimized layout, let's just kill it for that market space. And the same strategies of optimizing the switching, the clocking, the power management, everything else, could be leveraged between gaming and compute. That's fantastic. It's a constant learning process.

The other item we announced during our financial analyst day, which we are looking forward to delivering later this year, is RDNA 3. We're not going to let our momentum let up with the efficiency improvements. That's three generations of compounded efficiency gains there, 1.5X or more. One component is leveraging our chiplet expertise to unlock the full capabilities of the silicon we can purchase. It's going to be fun as we get more of that out.
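
To illustrate how such gains compound, the sketch below assumes exactly 1.5x performance-per-watt per generation over three generations. That is one reading of "1.5 or more"; the actual per-generation figures may differ.

```python
from math import prod

# Assumed per-generation perf-per-watt gains (hypothetical: 1.5x each).
generation_gains = [1.5, 1.5, 1.5]

compounded = prod(generation_gains)  # 1.5 * 1.5 * 1.5
print(f"compounded efficiency gain: x{compounded}")
```

Because the gains multiply, three 1.5x steps give about 3.4x overall instead of the 2.5x that adding three 50% increases would suggest.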

Naffziger: I'm concerned in the sense that it is pushing new directions of innovation to achieve the desired results. We've seen this coming for a long time. We've been investing in things like the Infinity Cache, chiplet architecture, and all these approaches that exploit new possibilities to keep the gains coming. So yes, but for those who prepare in advance and invest in the right technology, we still have a lot of opportunities.

Naffziger: It's difficult to speculate. Nvidia certainly hasn't joined the chiplet bandwagon yet. We have a big advantage there and we see great opportunities with that. They'll be forced to do so. Look for it when they deploy it. Intel's Ponte Vecchio is the poster child for chiplet extremes. However, those who innovate in the right space gain an immediate advantage.