Description of Simulation Infrastructure

The provided simulation framework is based on the Data Prefetching Championship 2 simulator. The framework models a simple out-of-order core with the following basic parameters:

  • 256-entry instruction window with no scheduling restrictions (i.e., any instruction that is ready in the window can be scheduled out-of-order).
  • The processor has a 6-wide pipeline. At most two loads and one store can be issued each cycle.
  • Perfect branch prediction (i.e., no front-end or fetch hazards).
  • The memory model consists of a three-level cache hierarchy: an L1 data cache, an L2 data cache, and an L3 data cache. Instruction caching is not modeled. The L1 data cache is 16KB, 8-way set-associative with LRU replacement. The L2 data cache is 128KB, 8-way set-associative with LRU replacement. The L3 data cache is 16-way set-associative with LRU replacement; see Configurations Used for Evaluation for its size.
  • The L2 data cache is inclusive of the L1, and the L3 data cache is inclusive of both the L1 and L2.
  • All instructions have a one-cycle latency except for memory accesses. L1 cache hits have a 4-cycle latency, L2 hits add another 10 cycles, and L3 hits add another 20 cycles (so a load that hits in the L3 sees 4 + 10 + 20 = 34 cycles of cache latency).
  • The main memory is modeled in some detail (data bus contention, bank contention, write-to-read bus turnaround delays, and more) and has a configurable amount of bandwidth (see Configurations Used for Evaluation). All configurations use a single 64-bit memory channel with 1 rank of 8 banks.
  • Each cache has a read queue for storing outstanding requests to that cache. The L1 read queue is processed at a maximum rate of 2 reads/cycle. The L2 and L3 read queues are processed at a maximum rate of 1 read/cycle. There is no prioritization in these queues between prefetch and demand requests, and they are processed in FIFO order.
  • The main memory read queue can be processed out of order, according to a modified Open Row FR-FCFS policy. The priority is, from highest to lowest: demand row hit > prefetch row hit > demand row miss > prefetch row miss. Rows are left open after the last access to them has been processed. (See the sketch after this list.)
  • The DRAM core access latency for row hits is approximately 13.5ns, and for row misses is approximately 40.5ns. Many other timing constraints, such as tFAW and DRAM refresh, are not modeled.
  • Prefetches are issued for whole cache blocks.
  • The prefetcher works in the physical address space only, and is restricted from crossing 4 KB physical page boundaries.
  • The prefetcher can only see the stream of L1 cache misses / L2 cache reads (including those L2 reads caused by an L1 write miss). In other words, the prefetcher lives at the L2 level of the cache hierarchy.
  • When a block comes into the cache, it replaces the LRU block in its set, whether it is a demand request or a prefetch request.
  • All cache fills from a lower level of the hierarchy/main memory must travel through a series of fill buffers between each level of the hierarchy, at a maximum rate of 1 fill/cycle. This means data will be available in the L1 at least 2 cycles after it is available in the L3.
  • The L2 read queue is 32 entries. Both L1 misses and all prefetch requests are inserted into this queue.
  • There are a limited number of request tracking registers (MSHRs) in the L1 and L2 caches. The L1 is limited to 8 outstanding L1 requests (even L1 hits occupy a register), and the L2 is limited to 16 outstanding L3 requests (only L2 misses occupy a register).
  • If the request tracking registers are all full, all further misses at that level will be stalled until a request tracking register is freed.
  • The prefetcher is invoked on each L2 cache access, AFTER its 10 cycle L2 access latency, and after it has been determined to be an L2 hit or miss.
  • Prefetches are added to the back of the L2 read queue, which is processed in FIFO order, and does not prioritize demands or prefetches.
  • The prefetcher is invoked ONLY on demand requests, and is not invoked recursively on prefetch addresses.
  • Prefetches may specify which level of the cache hierarchy they want the cache line to be prefetched into. For example, even though the prefetcher lives at the L2 level, it may prefetch data directly into the L2, or into the L3.
  • Prefetches into the L3 have the large advantage of not occupying an L2 request tracking register. Prefetches into the L2 must occupy an L2 request tracking register.
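
To make the memory scheduler's four-level priority ordering concrete, here is a minimal sketch of how it could be encoded. It is illustrative only; request_priority() and its arguments are not part of the simulator's contestant-visible API.

    /* Priority = (row_hit << 1) | is_demand reproduces the stated order:
     * demand row hit (3) > prefetch row hit (2) > demand row miss (1) >
     * prefetch row miss (0). Row-buffer hits outrank demand status, and
     * ties are broken by age (first-come, first-served). */
    static inline int request_priority(int is_demand, int row_hit)
    {
        return (row_hit << 1) | is_demand;
    }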

We will distribute the framework as a standalone simulator. Contestants will add their prefetching code to the file prefetcher.c by modifying the functions l2_prefetcher_initialize() and l2_prefetcher_operate(), and adding any helper functions they need to that file. Within l2_prefetcher_operate(), contestants will add calls to l2_prefetch_line(), which performs the prefetch of a single cache line. Contestants will have access to the following information to help them make prefetching decisions (a minimal skeleton illustrating this interface appears after the list):

  • The address of the current L2 read request.
  • The address of the instruction that generated the current L2 read request (called the Instruction Pointer, or IP).
  • Whether or not the current L2 read request was a hit or a miss in the L2.
  • The current occupancy of the L2 read queue (how many requests are waiting to lookup the L2).
  • The current occupancy of the L2 MSHRs (how many requests are currently outstanding to the L3).
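
As a concrete starting point, here is a minimal sketch of what prefetcher.c might contain, under the assumption that l2_prefetcher_operate() receives the request address, the triggering IP, and a hit/miss flag. The parameter names, the cpu_num argument, the header name, and the FILL_L2 constant are assumptions based on this description; the distributed prefetcher.h is authoritative.

    #include "prefetcher.h"   /* assumed framework header */

    void l2_prefetcher_initialize(int cpu_num)
    {
        /* Reset any prefetcher state here (all state counts against the
         * 32 KB storage budget described below). */
    }

    void l2_prefetcher_operate(int cpu_num, unsigned long long int addr,
                               unsigned long long int ip, int cache_hit)
    {
        /* addr      - address of the current L2 read request
         * ip        - address of the instruction that generated it
         * cache_hit - nonzero if the request hit in the L2
         * Read queue and MSHR occupancies are assumed to be available
         * through framework helpers. */
        (void)ip; (void)cache_hit;   /* unused by this trivial policy */

        /* Trivial next-line policy: prefetch the next 64-byte block into
         * the L2. Real contest entries should be smarter than this. */
        l2_prefetch_line(cpu_num, addr, addr + 64, FILL_L2);
    }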

In addition to this information, the contestant is provided with 32 KBytes (256 Kbits) of storage to implement state machines to manage the prefetching algorithm. The contestant will not be penalized for the complexity of the algorithm logic. The 32 Kbyte storage can be allocated as the contestant sees fit.
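
As a hypothetical illustration of how that budget might be spent, the structure below allocates most of it to an IP-indexed stride table. The field widths and entry count are arbitrary choices, and the accounting is done on the logical bit widths in the comments (not sizeof(), since C types are wider than the bits a hardware table would actually need).

    /* Hypothetical budget accounting against the 32 KB (262,144-bit) limit:
     *   per entry: 16-bit IP tag + 26-bit last block address
     *            + 7-bit stride + 2-bit confidence = 51 bits
     *   4096 entries x 51 bits = 208,896 bits, leaving 53,248 bits
     *   (about 6.5 KB) for any other state. */
    #define TABLE_ENTRIES 4096

    struct stride_entry {
        unsigned short ip_tag;      /* counts as 16 bits */
        unsigned int   last_block;  /* counts as 26 bits */
        signed char    stride;      /* counts as 7 bits  */
        unsigned char  confidence;  /* counts as 2 bits  */
    };

    static struct stride_entry stride_table[TABLE_ENTRIES];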

There is no built-in functionality to differentiate between hits to demand-fetched and prefetched cache lines; however, this functionality can be added using part of the prefetcher's hardware budget (NOTE: if you add this, or any other functionality, make sure all changes appear only in prefetcher.c). Also, prefetches can fail if either the L2 read queue is full or the prefetch address is in a different 4KB physical page than the current L2 read request. You can check for failed prefetches by checking the return value of l2_prefetch_line(), as the sketch below illustrates.
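
The fragment below sketches one defensive pattern for issuing prefetches. It assumes l2_prefetch_line() takes (cpu_num, base_addr, pf_addr, fill_level) and returns zero on failure, and that FILL_L2/FILL_LLC select the fill level; these names, like the MSHR threshold of 12, are assumptions based on this description rather than confirmed API.

    /* Issue up to two next-line prefetches, backing off on failure. */
    static void issue_prefetches(int cpu_num, unsigned long long int addr,
                                 int l2_mshr_occupancy)
    {
        int d;
        for (d = 1; d <= 2; d++) {
            unsigned long long int pf_addr = addr + d * 64ULL;

            /* Do not cross the 4 KB physical page boundary; the framework
             * would reject the prefetch anyway. */
            if ((pf_addr >> 12) != (addr >> 12))
                break;

            /* When the 16 L2 MSHRs are nearly full, fill into the L3 so
             * the prefetch does not occupy an L2 request tracking
             * register. */
            int fill_level = (l2_mshr_occupancy >= 12) ? FILL_LLC : FILL_L2;

            /* Assumed: returns 0 if the L2 read queue is full or the page
             * check fails; stop issuing for this trigger on failure. */
            if (l2_prefetch_line(cpu_num, addr, pf_addr, fill_level) == 0)
                break;
        }
    }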