Reading from Address:
(primitive datatypes should not span multiple cache lines, but this can happen)
Direct Mapped: there is 1 line per set
E-Way Associative: there are E lines per set (no two lines within a set have the same tag)
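The set/tag lookup above can be sketched as address arithmetic. A minimal sketch, assuming a cache with `num_sets` sets and `block_size`-byte blocks (both powers of two); the function name and parameters are illustrative, not from the source:

```python
def decompose(addr, num_sets=64, block_size=64):
    """Split an address into (tag, set index, block offset).

    Assumed geometry: 64 sets x 64-byte blocks (typical L1-like values).
    """
    offset = addr % block_size                      # which byte within the block
    set_index = (addr // block_size) % num_sets     # which set to search
    tag = addr // (block_size * num_sets)           # compared against stored tags
    return tag, set_index, offset

# Example: address 0x12345 with the default geometry
tag, s, off = decompose(0x12345)   # -> (18, 13, 5)
```

A direct-mapped cache checks the single line in set `s`; an E-way cache compares the tag against all E lines in that set.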
Background: copies of data exist in L1, L2, L3, Main-Mem, Disk
Write-hit Policy
Write-through: update all copies in all locations
Write-back: defer the write until the line is replaced (evicted)
Dirty bit: one per line, indicating whether one or more bytes have been updated
Write-miss Policy
Write-allocate: load into cache, then update
No-write-allocate: write straight to memory, without loading to cache
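The traffic difference between the two write-hit policies can be sketched with a toy trace. A minimal sketch under stated assumptions: a single cached line, a hypothetical trace of writes followed by an eviction; function and trace names are illustrative:

```python
def memory_writes(trace, policy):
    """Count writes that reach main memory for one cached line.

    policy: "through" (write-through) or "back" (write-back).
    Assumes every "w" in the trace hits the same cached line.
    """
    dirty = False
    writes = 0
    for op in trace:
        if op == "w":
            if policy == "through":
                writes += 1        # write-through: memory updated on every write
            else:
                dirty = True       # write-back: just mark the line dirty
        elif op == "evict" and dirty:
            writes += 1            # write-back pays a single write at eviction
            dirty = False
    return writes

trace = ["w", "w", "w", "w", "evict"]
# write-through: 4 memory writes; write-back: 1 (only at eviction)
```

This is why write-back pairs naturally with write-allocate: repeated writes to a cached line cost only one memory write.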
Typical Combination:
i-cache: stores code, which doesn't change (so writes are not an issue)
d-cache: stores data
Miss Rate: percentage of accesses that miss
Hit Time: time to check and deliver a line to the CPU (typically 4 cycles for L1, 10 cycles for L2)
Miss Penalty: typically 50-200 cycles for main memory
(Therefore, 99% hits is twice as good as 97% hits: average access time is dominated by the miss penalty, which is why we talk about miss rate rather than hit rate)
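The "twice as good" claim follows from the standard average memory access time formula. A quick sketch, assuming a 1-cycle hit time and 100-cycle miss penalty (illustrative values consistent with the ranges above):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: every access pays the hit time,
    and a fraction miss_rate additionally pays the miss penalty."""
    return hit_time + miss_rate * miss_penalty

# 99% hits -> 1 + 0.01 * 100 = 2 cycles on average
# 97% hits -> 1 + 0.03 * 100 = 4 cycles on average: twice as slow
a99 = amat(1, 0.01, 100)
a97 = amat(1, 0.03, 100)
```

Since the miss penalty dwarfs the hit time, a small change in miss rate moves the average a lot, so the miss rate is the number worth tracking.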
Read throughput: read bandwidth
Memory Mountain: measured read throughput as a function of spatial and temporal locality
Aggressive Pre-fetching: guessing the access pattern to fetch data before the CPU requests it
Spatial Locality: throughput slopes downward as stride increases
Temporal Locality: throughput steps down from L1 to L2 to L3... as more bytes need to be accessed (the working set grows)
We only care about the innermost loop
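The stride effect on spatial locality can be estimated by counting how many distinct cache lines a strided scan touches. A minimal sketch, assuming 8-byte elements and 64-byte lines (illustrative values; the function name is not from the source):

```python
def stride_miss_rate(n_elems, stride, elem_size=8, line_size=64):
    """Estimate the miss rate of a cold strided scan over an array:
    the first touch of each cache line is a miss, later touches hit."""
    lines = set()
    accesses = 0
    for i in range(0, n_elems, stride):
        lines.add(i * elem_size // line_size)   # which line this element lives in
        accesses += 1
    return len(lines) / accesses

# stride 1: 8 elements share each 64-byte line -> miss rate 1/8
# stride 8: every access lands on a fresh line  -> miss rate 1
```

This is exactly the downward slope along the stride axis of the memory mountain: larger strides waste more of each fetched line.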
From a YouTube visualization:
red indicates a cache miss
yellow indicates a cache hit
Optimized
No Blocking: $\frac{9}{8}n^3$ misses
Blocking: $\frac{1}{4B}n^3$ misses
Use the largest block size $B$ such that $3B^2 < C$ (fit three blocks in the cache!)
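The blocking idea above can be sketched as a tiled matrix multiply: work on $B \times B$ tiles so that three tiles (one each from A, B, and C) stay resident in the cache. A minimal sketch, assuming $n$ is a multiple of the block size; plain Python lists for clarity, not speed:

```python
def blocked_matmul(A, Bmat, n, B=2):
    """Blocked (tiled) n x n matrix multiply, C = A * Bmat.

    Assumes n is a multiple of the block size B. The three inner loops
    touch only one B x B tile of each matrix, so all the data they need
    fits in cache when 3*B^2 elements fit (the 3B^2 < C rule above).
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):           # tile row of C
        for jj in range(0, n, B):       # tile column of C
            for kk in range(0, n, B):   # tile along the shared dimension
                for i in range(ii, ii + B):
                    for j in range(jj, jj + B):
                        s = C[i][j]
                        for k in range(kk, kk + B):
                            s += A[i][k] * Bmat[k][j]
                        C[i][j] = s
    return C
```

The unblocked version streams entire rows and columns through the cache each iteration; the blocked version reuses each loaded tile $B$ times, which is where the $\frac{1}{4B}n^3$ miss count comes from.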