Computer Science and Engineering, Department of

 

Date of this Version

9-2009

Comments

University of Nebraska–Lincoln, Computer Science and Engineering Technical Report TR-UNL-CSE-2009-0013 Issued October 15, 2009. Also online at http://ponca.unl.edu/facdb/csefacdb/TechReportArchive/TR-UNL-CSE-2009-0013.pdf

Abstract

As the Memory Wall remains a bottleneck for Chip Multiprocessors (CMP), the effective management of CMP last level caches becomes of paramount importance in minimizing expensive off-chip memory accesses. For the CMPs with private last level caches, Cooperative Caching (CC) has been proposed to enable capacity sharing among private caches by spilling an evicted block from one cache to another. But this eviction-driven CC does not necessarily promote cache performance since it implicitly favors the applications full of block evictions regardless of their real capacity demand. The recent Dynamic Spill-Receive (DSR) paradigm improves cooperative caching by prioritizing applications with higher benefit from extra capacity in spilling blocks. However, the DSR paradigm only exploits the coarse-grained application-level difference in capacity demand, making it less effective as the non-uniformity exists at a much finer level.

This paper (i) highlights the observation of cache set-level non-uniformity of capacity demand, and (ii) presents a novel L2 cache design, named SNUG (Set-level Non-Uniformity identifier and Grouper), to exploit the fine-grained non-uniformity to further enhance the effectiveness of cooperative caching. By utilizing a per-set shadow tag array and saturating counter, SNUG can identify whether a set should either spill or receive blocks; by using an index-bit flipping scheme, SNUG can group peer sets for spilling and receiving in an flexible way, capturing more opportunities for cooperative caching. We evaluate our design through extensive execution-driven simulations on Quald-core CMP systems. Our results show that for 6 classes of workload combinations our SNUG cache can improve the CMP throughput by up to 22.3%, with an average of 13.9%, over the baseline configuration, while the state-of-the-art DSR scheme can only achieve an improvement by up to 14.5% and 8.4% on average.