The translation lookaside buffer

In an earlier post, I briefly discussed the oprofile system profiler. I was going somewhere with that; here’s another step along the path.

Modern CPUs that use virtual memory have to turn virtual addresses into physical addresses efficiently. While a userspace process operates on virtual addresses, accessing any part of the memory hierarchy requires physical addresses. This means that when your CPU goes to touch even the first-level (L1) cache, it must perform this virtual-to-physical translation. The CPU does this efficiently using a tiny cache called the translation lookaside buffer, or TLB. Each entry in the TLB maps a virtual page to a physical page, where a page is a range of consecutive addresses.

A typical TLB has between 32 and 2048 entries, and a CPU will have a hierarchy of TLBs to go along with its hierarchy of caches. For example, an AMD Opteron has a 32-entry first-level TLB and a 512-entry second-level TLB. (In fact, it has separate TLBs for handling instruction and data addresses; these are called the iTLB and dTLB.)

The number of pages that a TLB can translate at once is called its reach. With a 4KB page size and 512 entries, a TLB has a reach of 2MB. The 2MB doesn’t have to be contiguous; I just mean that such a TLB can translate 2MB of virtual addresses to physical addresses at one time.

If the CPU needs to translate a virtual address that is not currently covered by a TLB entry, this is called a TLB miss. (If there is an entry in the TLB, a successful lookup is called a hit.) Because TLB lookup time is critical to the performance of the CPU, the normal case of lookups resulting in hits is handled entirely in hardware. In fact, a TLB lookup is performed in parallel with an L1 cache lookup, so that both can be done in a single cycle. But when a TLB miss occurs, the hardware throws up its hands and causes a trap.
This suspends normal execution of the CPU so that the operating system can load a new value into the TLB entry where the miss occurred. The component of the kernel that handles this step is called the TLB miss handler, and it’s critical to performance, too. Since PC-class systems running Linux always use a 4KB page size, the reach of the TLB is tiny compared to the size of physical memory. For example, my desktop system has 2GB of RAM, and 1024 TLB entries; this gives its TLB a reach of only 4MB, or 0.2% of RAM. I’ll return to the subjects of TLB size, page size, and oprofile soon.
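The translation and reach arithmetic above can be sketched in a few lines. This is only an illustration, not how any real CPU stores its TLB: the page numbers below are made up, and the TLB is modeled as a plain dictionary mapping virtual page numbers to physical page numbers under the 4KB page size the post assumes.

```python
PAGE_SHIFT = 12                      # 4KB pages, as on PC-class Linux
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = PAGE_SIZE - 1            # low 12 bits are the offset within a page

tlb = {0x00400: 0x1A2B3}             # hypothetical single entry: VPN -> PPN

def translate(vaddr):
    """Return the physical address for vaddr, or None on a TLB miss."""
    vpn = vaddr >> PAGE_SHIFT        # virtual page number
    offset = vaddr & PAGE_MASK       # offset within the page
    ppn = tlb.get(vpn)
    if ppn is None:
        return None                  # miss: the TLB entry must be refilled
    return (ppn << PAGE_SHIFT) | offset

print(hex(translate(0x00400123)))    # hit: prints 0x1a2b3123
print(translate(0x00999000))         # miss: prints None

# Reach = entries x page size: 512 entries cover 2MB, and 1024 entries
# cover 4MB, about 0.2% of a 2GB machine's RAM, as in the post.
assert 512 * PAGE_SIZE == 2 * 1024 * 1024
print(100 * (1024 * PAGE_SIZE) / (2 * 1024**3))  # ~0.2 (percent)
```

Note that a hit costs only a dictionary lookup here; the whole point of a real TLB is that the hardware does the equivalent lookup in a fraction of a cycle.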
Posted in hardware, linux
3 comments on “The translation lookaside buffer”
  1. What you’re describing is a software-filled TLB, where the TLB is filled by the OS after a trap. AFAICR, both Intel and AMD have hardware-filled TLBs, where a hardware page table walker scans the page tables and fills the TLB if it can. Am I missing something?

  2. Jeremy Fitzhardinge says:

    I don’t think the TLB lookup and the L1 lookup can be concurrent if the cache is physically tagged; surely you need the physical address before you can look up the cache entry?

    Also, on x86 processors the TLB miss and associated pagetable walk are handled entirely in hardware, which makes the time overhead harder to observe directly. MIPS and some other architectures do it in software, which has the nice property of making the TLB misses much more measurable, and also allows a lot more flexibility – it would be nice to have it on more architectures.
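    [Editor’s note: the standard resolution to the concurrency question above is a virtually indexed, physically tagged (VIPT) cache: the set index is taken from the page-offset bits, which are identical in the virtual and physical address, so the cache set can be selected while the TLB translates; only the final tag compare waits for the physical address. A sketch, with illustrative numbers (4KB pages, a 32KB 8-way L1 with 64-byte lines) that are assumptions, not taken from the post:]

    ```python
    PAGE_SHIFT = 12                    # 4KB pages -> 12 untranslated offset bits
    LINE_SHIFT = 6                     # 64-byte cache lines
    SETS = (32 * 1024) // (64 * 8)     # 32KB, 8-way, 64B lines -> 64 sets

    INDEX_BITS = SETS.bit_length() - 1 # 64 sets -> 6 index bits

    # The index bits occupy address bits [6, 12), entirely inside the page
    # offset, so they survive translation unchanged:
    assert LINE_SHIFT + INDEX_BITS <= PAGE_SHIFT

    def cache_set(addr):
        """Cache set selected by an address (virtual or physical)."""
        return (addr >> LINE_SHIFT) % SETS

    # Any virtual/physical pair sharing the low 12 bits picks the same set:
    vaddr = 0xDEAD6FC0
    paddr = 0x00012FC0                 # hypothetical translation, same offset
    assert cache_set(vaddr) == cache_set(paddr)
    ```

    This is also why L1 caches of that era tended to stay at (page size × associativity) or smaller: growing the cache without growing associativity pushes index bits above the page offset, breaking the overlap.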

  3. Lotta says:

    That’s more than sensible! That’s a great post!
