Discontiguous Memory

Understanding why we have discontiguous memory requires some understanding of the chipsets that power Itanium systems. The ZX1 chipset, which powers the HP Integrity line of servers, is one of the more common ones. Other chipsets follow similar principles.

With the ZX1, the processor is not connected directly to the RAM chips, but goes through a memory and I/O controller, or MIO for short. It looks something like

+------------------------+   Processor Bus                       Memory Bus    +-----------------------+
|  Itanium 2 Processor 1 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+          |------> |   ZX1 MIO     |<------|         +-----------------------+
|  Itanium 2 Processor 2 | <--------+        +---------------+       +-------> |       RAM DIMM        |
+------------------------+                    | | | | | | | |                  +-----------------------+
                                     Ropes->  | | | | | | | |
                                             +---------------+
                                             |    ZX1 IOA    |
                                             +---------------+
                                                 |   |   |
                                                AGP PCI PCI-X

From a top-level view, the chipset is broadly divided into the System Bus Adapter (SBA) and the Lower Bus Adapter (LBA). You can think of the SBA as an interface between the bus that the processors sit on and the ropes bus (described below). The LBA is the PCI host bus adapter, and contains the I/O SAPIC (Streamlined Advanced Programmable Interrupt Controller). This chip handles interrupts from PCI devices, and can be thought of as a smart "helper" for the CPU that controls when it sees interrupts.

A rope is described as a "fast, narrow point-to-point connection between the zx1 mio and a rope guest device (the zx1 ioa). The rope guest device is responsible for bridging from the I/O rope to an industry standard I/O bus (PCI, AGP, and PCI-X)". The ZX1 chipset can support up to 8 ropes, which can be bundled.

This is distinct from the typical north bridge/south bridge arrangement used in x86 architectures, where the north bridge connects to the CPU, memory and the south bridge, and the south bridge connects to the north bridge and slower I/O peripherals. It does, however, introduce some more interesting timing problems.

A simplified map of the memory layout presented by the ZX1 chipset MIO looks like

0xFFF FFFF FFFF +-------------------------------------+
                |               Unused                |
0xF80 0000 0000 +-------------------------------------+
                |  Greater than 4GB MMIO :            |
                |   (not currently used by anything)  |
0xF00 0000 0000 +-------------------------------------+
                |               Unused                |
                |                                     |
0x041 0000 0000 +-------------------------------------+
                |           Memory 1 (3GB)            |
                |                                     |
                |                                     |
0x040 4000 0000 +-------------------------------------+
                |               Unused                |
0x040 0000 0000 +-------------------------------------+
                |           Memory 2 (252GB)          |
                |                                     |
0x001 0000 0000 +-------------------------------------+
                |  Less than 4GB MMIO :               |
                |   Access to firmware, processor     |
                |   reserved space, chipset registers |
                |   etc (3GB)                         |
0x000 8000 0000 +-------------------------------------+
                |           Virtual I/O (1GB)         |
0x000 4000 0000 +-------------------------------------+
                |           Memory 0 (1GB)            |
0x000 0000 0000 +-------------------------------------+

You can see it implements a 44-bit address space that is divided into a range of different sections. The maximum memory it can theoretically support is 256GB (i.e. Memory 0 + Memory 1 + Memory 2), although physically most boxes only come with enough slots for around 16GB of memory. The ZX1 has an extension called the ZX1 SME, or Scalable Memory Extension, that allows more memory to be dealt with.

Looking at the layout in more depth

Virtual I/O

The ZX1 chipset implements an IO-MMU, or I/O Memory Management Unit, for DMA. This is very similar to the standard MMU, or memory management unit, in that it converts a virtual address into a physical address. This is particularly important for a 32-bit card that can not understand a 64-bit address. By setting up the Virtual I/O region under 4GB (i.e. below the maximum address a 32-bit card can deal with), the IO-MMU can translate addresses in this region back to full 64-bit addresses and so give the 32-bit card access to all memory. The process goes something like

  1. driver issues DMA request that is in an area above 4GB that a 32 bit card can not deal with
  2. IO-MMU sets up a virtual address in the Virtual I/O region that points to the original address, and issues the DMA request with this low address
  3. The device sends its data back to the low address, which the IO-MMU then translates back to the high address.

The IOTLB in the ZX1 chipset is only big enough to map 1GB at a time, which is why this area is only 1GB. This means there can be contention for this space if many drivers are requesting large DMA transfers; note, however, that 64-bit capable cards can bypass this altogether, reducing the problem.

As you can see from the layout above, system RAM is not presented to the processor in a contiguous manner; that is, as all the memory in one large block. It is split up into three different sections (Memory 0, 1 and 2) which are placed in different areas of the address space. Memory 0 needs to be around for legacy applications that expect certain things at low addresses. The rest of the < 4GB area is taken up by the less than 4GB MMIO region; the physical memory that would have been mapped here is moved into Memory 1, which is located at 0x040 4000 0000 (just above 256GB). Should there be more than 4GB of memory (i.e. Memory 0 + Memory 1 = 4GB), the remainder is allocated to Memory 2 at 0x001 0000 0000 (4GB).

The other thing you need to know about is how the Linux kernel looks at memory. The kernel keeps track of every single physical frame of memory in an array of struct pages in a global variable called mem_map. Now, the kernel has to be able to easily (i.e. quickly) translate a struct page to a physical address in memory. The simplest way is to say, in effect, "the difference between the address of the struct page I am interested in and the address of mem_map (remember, it's a linear array, so mem_map is the base) is the index of the physical frame in memory. I know that frames are X bytes long, so the physical address is X * index".

If we keep mem_map as a linear array as above, we hit a serious problem with our memory layout from the ZX1 chipset. Say we have 4 gigabytes of RAM, pushing us into Memory 1. To use the simple indexing scheme of mem_map described above, there would need to be a struct page for each and every frame of memory between Memory 0 and Memory 1; that is, the entire range 0x000 0000 0000 - 0x041 0000 0000. Say each struct page in the array handles a frame of 16 kilobytes: we need 17,039,360 entries. Say each entry is around 40 bytes: that makes 650 megabytes of memory required for the mem_map array! That's a fair chunk of our 4 gigabytes wasted on mapping empty space.

This is also an issue for systems that may not have such a large gap imposed by the chipset, but receive one by participating in a NUMA (non-uniform memory access) system. In a NUMA system, many individual nodes "pool" their memory into one large memory image. In this case, each node of the system will have gaps in its memory map for memory that physically resides on other, remote, nodes. The upshot is that they will suffer from the same wasted space.

VIRTUAL_MEM_MAP and discontiguous memory support are kernel options that get around this. By mapping mem_map virtually, you can leave the mem_map sparse and only fill in what you need. Discontiguous memory support is a further option that allocates the memory map efficiently over a number of nodes.