Formula M1: Why the Apple chip is so fast

Above all, a clever cache hierarchy helps Apple’s ARM chips reach a level of performance that AMD and Intel have not yet achieved in compact devices.

Since November 2020, MacBook Air, MacBook Pro, and Mac mini have been shipped with Apple’s proprietary M1 chip. But why is the performance so much higher than in the respective predecessors with Intel processors? The usual answer at Apple is of course: Through the perfect coordination of hardware and software. For the first time, the company has the entire system in its own hands, apart from a few interface modules. But in order for an optimized macOS to work properly, the hardware on which it runs must also be powerful enough.

In the case of the M1, this consists of a system-on-chip (SoC), i.e. an almost complete computer on a single component. This was made possible because Apple has been working towards it consistently for 30 years: the company was one of the first investors in Advanced RISC Machines Ltd in 1990, and later used its ARM architecture – on which the M1 is based – in the Newton, the iPod, and all of its other handheld devices.

ARM only became a great success for Apple with the iPhone in 2007. In 2012, the next logical step followed: Apple was one of the few companies to acquire an architecture license and has since been allowed to make changes to the arithmetic units and the overall structure of the processors. The latter in particular has now happened with the M1, because its cache system goes beyond what other mobile ARM chips offer.

SHORT AND SWEET

  • The interaction of performance and efficiency cores ensures high performance and low energy consumption.
  • The M1’s cache system goes far beyond other Apple SoCs and x86 processors.
  • The fabric and the system-level cache work hand in hand with the unified memory, whose chips sit on the processor package, and with flash components that are attached without PCIe.

Even from the outside, the M1 differs from most current notebook SoCs: it is a system-in-package (SiP). The chip itself sits on a substrate, and on top of that are two conventional memory chips in their own packages. This is very compact, but not expandable at all and also somewhat harder to cool than a design with external RAM. SiPs have been in use for more than 20 years, for graphics cards among other things, but Apple now dares to sacrifice expandability for the performance advantages of short signal paths and a compact design.

Incidentally, the package looks as if it had only half a heat spreader. The solution turns out to be well thought out, however: the M1 die sits on the left of the package and is completely covered by the heat spreader, while the two cooler-running Hynix RAM chips are connected directly to the heat sink. A full-surface heat spreader that also covered the RAM would have required an additional layer above the memory chips, which would worsen the cooling.

As with the last SoCs of the A series, there are also two classes of ARM cores in the M1: the particularly fast ones, called Firestorm, with a clock frequency of up to 3.2 GHz, and the particularly frugal ones, called Icestorm, with up to 2.1 GHz. Internally, Apple also calls these P and E cores, for performance and efficiency. Compared to the A14 Bionic from the iPhone 12, that’s twice as many Firestorms (4 instead of 2) and a slightly higher clock rate than the A14’s 3 GHz. Apple is silent about the exact design of the arithmetic units.

Because iOS apps are binary compatible, one can assume that the full 64-bit ARMv8-A instruction set is active. All cores can work together, so the M1 is a mobile 8-core processor. In the x86 world, eight cores with a total core power consumption of around 20 watts have so far only been available with the AMD Ryzen 7 4000U. Incidentally, macOS decides which core takes on which tasks. Experiments by developers so far indicate that manual load distribution is probably not possible at all, or only with great difficulty.

The M1 is a system-in-package in which the almost square die of the SoC, on the left, is the only part covered by a heat spreader. On the right are the two LPDDR4X memory chips.

The fastest caches, the dedicated L1 caches of each core, are quite large: Firestorm has 192 KB for instructions and 128 KB for data, and Icestorm still has 128 and 64 KB. The L2 caches, on the other hand, are downright huge for a mobile SoC: the Firestorms have 12 MB and the Icestorms 4 MB. In addition, there is a System Level Cache (SLC) of apparently 16 MB, which we will look at in more detail in a moment. If the SLC is counted as an L3 cache, the cores in the M1 have a whopping 32 MB of cache together with the L2 memories. Intel can only keep up with its server processors; only AMD builds even more cache into its Ryzen 5000 desktop processors, with 64 MB.
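The cache figures above can be added up in a few lines. This is only a sketch of the arithmetic, using the sizes quoted in the text; treating the L2 figures as one pool per four-core cluster is an assumption, inferred from the 32 MB total that the article itself gives.

```python
# Cache sizes as quoted in the article.
# L1 is per core: (instruction KB, data KB).
L1_KB = {"Firestorm": (192, 128), "Icestorm": (128, 64)}
# Assumption: each L2 figure is one pool for a whole four-core cluster.
L2_MB = {"Firestorm": 12, "Icestorm": 4}
SLC_MB = 16  # "apparently 16 MB" according to the article
CORES_PER_CLUSTER = 4  # 4 Firestorm + 4 Icestorm cores


def l1_total_kb():
    """Sum of all dedicated L1 caches across the eight cores."""
    return sum((instr + data) * CORES_PER_CLUSTER for instr, data in L1_KB.values())


def l2_plus_slc_mb():
    """L2 of both clusters plus the SLC, i.e. the article's 32 MB figure."""
    return sum(L2_MB.values()) + SLC_MB


print("L1 total:", l1_total_kb(), "KB")
print("L2 + SLC:", l2_plus_slc_mb(), "MB")
```

On top of the 32 MB of L2 and SLC, the eight cores therefore carry a further 2 MB of L1 between them.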

Despite all of this, battery life is excellent: the MacBook Air runs for over 21 hours under light load, and the MacBook Pro lasts for 13 hours when playing videos. The Pro predecessor with an Intel CPU only manages 7.5 hours with the display set to the same brightness. There is no secret behind this, just the basic property of ARM CPUs and handheld GPUs: they use almost no energy when doing nothing. A side glance at the iPhone: when it is not in use and no self-updating apps are running, it can last for days. An Intel notebook with a comparatively large battery doesn’t even manage that in energy-saving mode (ACPI S3), from which it first has to wake up for a few seconds.

If, to stay with the video example, an image only has to be decoded and transported to the frame buffer 50 times per second, the cores can go to sleep billions of times. Most of the work is done by the video units of the SoC, so the eight cores can largely switch off in the meantime. Apple has a lot of experience with this from the iPhone and iPad.
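A quick calculation gives a sense of how much idle time lies between two video frames. The sketch below uses only the 50 frames per second and the maximum core clocks quoted in the article; the function name is made up for illustration, and the real cores would of course not run at full clock while idling.

```python
def cycles_per_frame(clock_hz, frames_per_second):
    """Clock cycles that elapse between two consecutive frames."""
    return clock_hz // frames_per_second


FPS = 50                          # frame rate used in the article's example
FIRESTORM_HZ = 3_200_000_000      # up to 3.2 GHz
ICESTORM_HZ = 2_100_000_000       # up to 2.1 GHz

frame_interval_ms = 1000 / FPS
firestorm_cycles = cycles_per_frame(FIRESTORM_HZ, FPS)
icestorm_cycles = cycles_per_frame(ICESTORM_HZ, FPS)

print(f"Frame interval: {frame_interval_ms} ms")
print(f"Firestorm cycles per frame interval: {firestorm_cycles:,}")
print(f"Icestorm cycles per frame interval: {icestorm_cycles:,}")
```

Each 20 ms frame interval corresponds to tens of millions of clock cycles per core, almost all of which a core can spend asleep while the video units do the decoding.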

In addition, macOS can ensure that only the economical Icestorm cores are used for such simple tasks. This also works for Intel with the Core i5-L16G7 aka Lakefield, but there is only one fast Sunny Cove core and four individually very slow Atom cores, not 4+4 cores like in the M1. In addition, this chip has only been available since mid-2020, when Apple’s M1 was already finished. Intel should have pursued this idea vigorously years earlier, and also gotten to grips with its 10-nanometer fabrication – which the Lakefields are based on – to convince Apple to stay.

Behind the caches, which are huge for ARM chips, is a conscious design decision: you only build such large areas of static memory (SRAM) when you really need them. After all, they take up a lot of space on a chip. For example, the L3 cache of the Xbox Series S/X SoCs, which are also new, is only 4 MB, because a large GPU is more important for gaming machines than particularly high CPU performance.

Especially for a RISC processor like the M1, which has fewer instruction routines stored in the chip, large caches are ideal for keeping the execution units constantly fed. Although ARM code (RISC) can be much more compact than Intel code (CISC) for simple algorithms such as small loops, the advantage of the reduced instruction set is reversed for complex tasks: the programs then often take up a lot of space in memory.

The fact that Apple makes the caches so much larger in the M1 than in the A14 is due to their constant hunger for power when reading and writing: the MacBooks and the Mac mini have a much higher energy budget than the iPhone or iPad. The large caches also offer an advantage in Rosetta 2 emulation, where x86 programs are translated into native ARM code. This happens to a large extent during installation or when the programs are started for the first time. However, Apple points out in its developer documentation that parts of the code must be translated at runtime, just as with a just-in-time (JIT) compiler. Large caches are very useful for such continuously running conversions.

In addition, there is the system-level cache, which Apple does not describe in detail. In previous designs, it would be considered an L3 cache, but its naming as SLC and its placement in the middle of the die suggest extended functionality: it most likely serves as a direct connection between all functional units, including the neural engine. For example, if a graphic element is changed by the CPU, the GPU can pull the new data directly from the SLC for display without going through the much slower route via RAM.

Similar mechanisms had previously been used under the name Crossbar, for example in DEC’s Alpha processor, but proved to be difficult to implement with the manufacturing techniques of the time and required a lot of power. Apparently, Apple got it right for the first time, because their own A-SoCs don’t offer a pure L3 cache.

All the caches can only be afforded in Cupertino because the M1 is manufactured at TSMC using the 5-nanometer process – one of the most modern processes currently used in semiconductor production. In total, there are 16 billion transistors on the die, not counting the RAM chips. Even Nvidia’s current high-end GPU, the GA104-300 for the RTX 3070, has hardly more at 17.4 billion. So Apple already put a lot of effort into this with its first ARM SoC for Macs.

Apple speaks of cores for its self-developed GPUs as for the CPUs. However, this is unusual in the industry. It would be more comparable to specify execution units (EUs) or the number of shaders/ALUs as individual arithmetic units. After all, there is concrete information about the number of cores, which has doubled to eight compared to the four of the A14. Apple specifies the theoretical computing power at 2.6 teraflops, which speaks for one of the fastest integrated GPUs in mobile computers. Our benchmarks with games, even in the Rosetta 2 emulation, prove that.

The GPUs of the M1 in the MacBook Pro and Mac mini are identical. In the cheaper of the two MacBook Air configurations, Apple only activated seven of the eight GPU cores that are physically available, which slightly reduces the power consumption and thus the heat development in the fanless system.
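To get a rough feel for what the deactivated GPU core costs, the 2.6 teraflops that Apple specifies for eight cores can be scaled down. This is only a sketch: it assumes that the peak figure scales linearly with the number of active cores and that both variants run at the same clock, neither of which Apple states. The function name is hypothetical.

```python
M1_GPU_CORES = 8        # full configuration, per the article
M1_PEAK_TFLOPS = 2.6    # Apple's figure for the full configuration


def scaled_peak_tflops(active_cores):
    """Peak throughput for a given core count.

    Assumption: linear scaling with core count at identical clock rates.
    """
    return M1_PEAK_TFLOPS * active_cores / M1_GPU_CORES


per_core = scaled_peak_tflops(1)
seven_core = scaled_peak_tflops(7)

print(f"Per GPU core: {per_core:.3f} TFLOPS")
print(f"7-core MacBook Air variant: {seven_core:.3f} TFLOPS")
```

Under these assumptions, the seven-core variant gives up one eighth of the theoretical peak, which leaves it at roughly 2.3 teraflops.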

With an SoC with integrated graphics, performance depends heavily on the RAM, and here, too, Apple went all out rather than cutting corners: the LPDDR4X chips from Hynix are connected to the M1 die via eight memory channels and effectively work at 4266 MHz. Apart from very expensive plug-in overclocker modules, such high clock rates are only possible with extremely short connections, which is probably why the chips sit on the M1 package. According to synthetic tests by Anandtech, they almost reach the theoretical bandwidth of the components: reading runs at almost 60 GB per second, writing at up to 36 GB/s. Copies within the memory are made at up to 62 GB/s. Comparable x86 chips such as AMD’s Ryzen 4000 or Intel’s Tiger Lake have not yet offered such values. As with the caches, high bandwidth was Apple’s design goal.
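The theoretical bandwidth that the measurements are compared against can be estimated from the figures above. The sketch assumes that each of the eight channels is 16 bits wide, which is the usual LPDDR4X channel width but is not stated in the article, and it treats the effective 4266 MHz as 4266 million transfers per second.

```python
TRANSFERS_PER_SECOND = 4266e6   # effective 4266 MHz, per the article
CHANNELS = 8                    # eight memory channels, per the article
BITS_PER_CHANNEL = 16           # assumption: standard LPDDR4X channel width


def peak_bandwidth_gb_s():
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = CHANNELS * BITS_PER_CHANNEL / 8
    return TRANSFERS_PER_SECOND * bytes_per_transfer / 1e9


# Measured values quoted in the article (Anandtech, synthetic tests).
MEASURED_GB_S = {"read": 60, "write": 36, "copy": 62}

peak = peak_bandwidth_gb_s()
efficiency = {name: value / peak for name, value in MEASURED_GB_S.items()}

print(f"Theoretical peak: {peak:.1f} GB/s")
for name, share in efficiency.items():
    print(f"{name}: {share:.0%} of peak")
```

Under this assumption the peak is a little over 68 GB/s, so the quoted read and copy figures come to roughly nine tenths of it.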

The company is also highlighting Unified Memory for the M1, a shared memory for the CPU, GPU, and all other units on the SoC. Intel chips with integrated graphics already offered this in a similar form, but the likely functionality of the SLC could bring something new here; this has not yet been researched.

Apple may also have achieved a certain degree of coherence of RAM – not just caches – for different memory areas through the system-level cache. However, such advantages only become visible with software that has been optimized precisely for this purpose. There is still a lack of that since the developer kit in the Mac mini case was only equipped with the A12Z, which does not offer these functions.

This also applies to the 16-core neural engine in every M1, which is intended to accelerate inferencing in machine learning aka AI. So far, you could only use the GPU for such tasks, but this is orders of magnitude slower in many applications.

One detail clearly shows that Apple has not merely replaced the Intel cores with ARM cores: the previous T2 controller is missing. Among other things, it was responsible for connecting the SSD, which is now attached directly to the Apple fabric. The literal meaning of the manufacturer’s designation reflects the construction well: it is a woven fabric. Layers of horizontal and vertical conductors are stacked in the silicon layers of the M1, like several sheets of checkered notepaper lying on top of each other. They connect all the units of the SoC, and the flash components of the SSD are now also attached to them. With M1 Macs, PCIe only serves Thunderbolt and the WLAN/Bluetooth module.

When it comes to Thunderbolt, Apple can’t do without Intel entirely, because the technology belongs to the now spurned chip manufacturer. Consequently, there are two of Intel’s JHL8040R chips on board. These are retimers that restore signal integrity on long traces from the actual controller to the ports. And indeed, the new MacBook Pro has the M1 and the retimers on opposite sides of the logic board. The Intel chip is priced at $2.40, so a little money goes to Intel with every M1 Mac.

Already in the M1, with the SLC and the fabric that is also used for the SSD, there are indications of how Apple could further develop this platform. The most obvious innovation compared to the A14 is the four instead of two Firestorm cores. For an iMac, that could easily be doubled again, if necessary by omitting the Icestorm units, because the power consumption here can be several times as high as in the three previously available devices, the MacBook Air, MacBook Pro, and Mac mini.

However, Apple will only bring new M-SoCs if TSMC’s 5-nanometer production continues to run smoothly.

With growing experience in this process alone, higher clock rates can be expected; the previous maximum of 3.2 GHz for the Firestorm cores can be regarded as quite conservative.

By the end of 2022, all Macs should run exclusively on Apple Silicon. And what about the relatively new Mac Pro? Although there are initial rumors about an Apple chip with 32 cores, plug-in graphics cards and other accelerators are mandatory here, and new cards, like very fast M.2 SSDs, are only available with PCI Express 4.0 (PCIe 4.0). A Mac Pro with Apple Silicon would need a few dozen more lanes than the M1. Some ARM-based mainframes already offer this today.
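A small lane budget shows how quickly "a few dozen" lanes come together. The sketch below uses the PCIe 4.0 link parameters (16 GT/s per lane with 128b/130b encoding); the slot configuration is purely hypothetical and not based on any announced Mac Pro.

```python
PCIE4_GT_PER_S = 16e9        # PCIe 4.0 raw rate per lane and direction
ENCODING = 128 / 130         # 128b/130b line encoding overhead


def lane_bandwidth_gb_s():
    """Usable bandwidth of one PCIe 4.0 lane per direction, in GB/s."""
    return PCIE4_GT_PER_S * ENCODING / 8 / 1e9


# Hypothetical workstation configuration: device name -> (count, lanes each).
HYPOTHETICAL_SLOTS = {
    "graphics card": (2, 16),
    "M.2 SSD": (2, 4),
}

total_lanes = sum(count * lanes for count, lanes in HYPOTHETICAL_SLOTS.values())

print(f"Per lane: {lane_bandwidth_gb_s():.2f} GB/s")
for name, (count, lanes) in HYPOTHETICAL_SLOTS.items():
    print(f"{count} x {name} (x{lanes}): {lanes * lane_bandwidth_gb_s():.1f} GB/s each")
print(f"Total lanes needed: {total_lanes}")
```

Even this modest example of two graphics cards and two SSDs already calls for 40 lanes, before any further accelerators or Thunderbolt controllers are counted.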

It seems doubtful that Apple’s own GPUs can outperform the recent generation changes from AMD and Nvidia. Consequently, the bus system and the associated expansion of the fabric and the SLC are probably the greatest challenge. On the software side, ARM drivers for Radeon/FirePro and Geforce/Quadro are required, which could also enable eGPUs for mobile devices. The question remains whether Apple really wants the latter.

In addition, the ARM macOS and the native applications are still so new that the M1 Macs are likely to gain significant speed in the next one to two years through software optimization alone. With the freedom of the entire platform from a single source, Apple has also taken on more responsibility. But even for users who don’t use Macs, that’s a good thing: AMD and Intel are now forced to thoroughly rethink their designs.
