# Cache Optimization

## Memory Hierarchy

In the simplest model of a computer, we write programs that process data in a linearly-addressed memory. In reality, to achieve good runtime performance, the memory is hierarchical. The programs we write still treat it as a linearly-addressed space, but by understanding the underlying hierarchical structure, we can get the most performance out of the system.

When it comes to performance, we may be tempted to make all memory as fast as possible. That is impractical because fast memory is very difficult to manufacture. The economically feasible approach is to keep the most frequently used data in the fastest memory, and the infrequently used data in slow, inexpensive memory. When the work on some data is finished, the data are removed from the fast memory, and other data are loaded into it.

Memory comes in many kinds in the hierarchy. I put them into 4 categories:

1. CPU register file

   The registers reside in the CPU circuits. Instructions (machine code or assembly) operate directly on them, and the electronic signals flow through the CPU circuits. There is no delay in the access time to registers. They are the fastest memory.

1. CPU cache

   CPU cache memory works as a “buffer” between the registers and the main memory. It usually uses fast and expensive static random-access memory (SRAM). It is called static because the circuit stays in one of two stable states and access doesn’t change the state. The circuit takes more transistors than slower types of memory.

   The cache memory may be part of the CPU circuit or outside it. A CPU usually has multiple levels of cache. The cache can be a couple of MBs, or as large as tens or hundreds of MBs.

1. Main memory

   Main memory is away from the CPU chip package. It usually uses less expensive dynamic random-access memory (DRAM). The circuit takes more time to access the data, but the lower cost allows a much larger space.

   Data in the main memory are lost when the system is powered off.

   A mainstream PC uses DDR4 (double data rate, generation 4) SDRAM (synchronous dynamic random-access memory) in the DIMM (dual in-line memory module) form factor and its variants, e.g., RDIMM (registered DIMM) and LRDIMM (load-reduced DIMM). Depending on the bandwidth of the CPU memory controller, the data throughput may be around 60 GB/s or higher.

   A powerful server may have up to 6 TB of main memory.

1. Storage

   Data on storage cannot be directly accessed by CPU instructions. They need to be loaded to main memory first, and then the instructions can touch them. The loading and saving operations are considered input and output (I/O). Data on storage persist when the system is powered off.

   Storage is usually called the “disks”, because it used to be hard-disk drives (HDD). Recent systems have moved to solid-state drives (SSD), which use flash memory instead of spinning disks.

Here is a table (excerpted from Figure 6.23 in CS:APP (3/e)) [1] showing the latency of each type of memory, measured in CPU cycles (each cycle is less than 1 ns):

| Type           | Latency (cycles) |
| -------------- | ---------------- |
| CPU registers  | 0                |
| L1 cache       | 4                |
| L2 cache       | 10               |
| L3 cache       | 50               |
| Main memory    | 200              |
| Storage (disk) | 100,000          |

Nothing is fast without cache.

## How Cache Works

There are 3 ways to implement caches: (i) direct-mapped caches, (ii) set-associative caches, and (iii) fully-associative caches. Direct-mapped caches are the simplest, and I will use them to show the basic concepts of caches.

If the accessed byte is already in the cache, it’s a hit, and the CPU gets the byte directly from the cache. If the byte is not in the cache, it’s a miss, and the memory controller goes to the main memory to fetch the byte into the cache before the CPU gets it.

According to the previous table, a cache miss is expensive. When the CPU can get data from the cache, the latency is only a couple of cycles. When there is a miss, hundreds of additional cycles are required to get the data. The cost compounds quickly: with a 4-cycle hit latency and a 200-cycle miss penalty, even a 2% miss rate makes the average access time $$4 + 0.02 \times 200 = 8$$ cycles, twice the hit latency.

There are two kinds of misses. A cold miss happens when the requested byte has never been in the cache before. The second kind is the conflict miss, which happens when multiple pieces of cachable data map to the same cache block. One evicts the other, and vice versa, so each access is a miss.

Assume we have a main memory of 64 bytes (a 6-bit address is sufficient) and a cache of 8 bytes (3 bits for addressing). The following example demonstrates how a cache works:

| Access # | Decimal addr | Binary addr | Hit or miss     | Cache block |
| -------- | ------------ | ----------- | --------------- | ----------- |
| 1        | 22           | 010 110     | miss (cold)     | 110         |
| 2        | 26           | 011 010     | miss (cold)     | 010         |
| 3        | 22           | 010 110     | hit             | 110         |
| 4        | 26           | 011 010     | hit             | 010         |
| 5        | 16           | 010 000     | miss (cold)     | 000         |
| 6        | 18           | 010 010     | miss (cold)     | 010         |
| 7        | 26           | 011 010     | miss (conflict) | 010         |
| 8        | 18           | 010 010     | miss (conflict) | 010         |

This is a direct-mapped cache: the low 3 bits of the address select the cache block, and the high 3 bits are kept as the tag. To reduce conflict misses, we may use multi-way set associativity. 2-, 4-, or 8-way set-associative caches are commonly used. Full associativity is too expensive to implement in circuits.
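
The bookkeeping is easy to replay in code. Below is a minimal sketch (not part of the course sources) of the 8-block direct-mapped cache above; it reproduces the hit/miss pattern in the table:

```cpp
#include <array>
#include <cstdio>

int main()
{
    // Direct-mapped cache: 8 one-byte blocks. A 6-bit address splits into
    // a 3-bit tag (high bits) and a 3-bit block index (low bits).
    std::array<bool, 8> valid{};  // whether each cache block holds data
    std::array<int, 8> tag{};     // the tag currently held by each block
    std::array<bool, 64> seen{};  // whether an address was ever cached before

    const int addrs[] = {22, 26, 22, 26, 16, 18, 26, 18};
    for (int addr : addrs)
    {
        const int index = addr & 0b111; // low 3 bits select the cache block
        const int t = addr >> 3;        // high 3 bits form the tag
        if (valid[index] && tag[index] == t)
        {
            std::printf("addr %2d: hit\n", addr);
        }
        else
        {
            // A first-ever reference is a cold miss; a re-reference after
            // eviction by a conflicting address is a conflict miss.
            std::printf("addr %2d: miss (%s)\n", addr,
                        seen[addr] ? "conflict" : "cold");
            valid[index] = true;
            tag[index] = t;
            seen[addr] = true;
        }
    }
    return 0;
}
```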

## Cache Block (Line) Size Determines Speed

A cache block usually contains more than one byte or word. In x86, the block (line) size is 64 bytes. When loading data from main memory to cache, it’s done block by block, rather than byte by byte.
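
Consequently, two addresses land in the same line whenever they share the same 64-byte-aligned region, i.e., the same value of addr / 64. A tiny sketch (for illustration only):

```cpp
#include <cstdio>

int main()
{
    // Two addresses fall in the same 64-byte cache line iff addr / 64 matches.
    for (unsigned addr : {0u, 63u, 64u, 130u})
    {
        std::printf("address %3u -> line %u\n", addr, addr / 64);
    }
    return 0;
}
```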

I will be using an example of “skip access” to demonstrate that, with cache, doing more work doesn’t necessarily take more time.

Before the example, here is the C++11-based timer used in the experiments:

```cpp
#pragma once

#include <chrono>

class StopWatch
{

private:

    using clock_type = std::chrono::high_resolution_clock;
    using time_type = std::chrono::time_point<clock_type>;

public:

    /// A singleton.
    static StopWatch & me()
    {
        static StopWatch instance;
        return instance;
    }

    StopWatch() : m_start(clock_type::now()), m_stop(m_start) {}

    StopWatch(StopWatch const & ) = default;
    StopWatch(StopWatch &&) = default;
    StopWatch & operator=(StopWatch const & ) = default;
    StopWatch & operator=(StopWatch &&) = default;
    ~StopWatch() = default;

    /**
     * Return seconds between laps.
     */
    double lap()
    {
        m_start = m_stop;
        m_stop = clock_type::now();
        return std::chrono::duration<double>(m_stop - m_start).count();
    }

    /**
     * Return seconds between end and start.
     */
    double duration() const
    {
        return std::chrono::duration<double>(m_stop - m_start).count();
    }

    /**
     * Return resolution in second.
     */
    static constexpr double resolution()
    {
        return double(clock_type::period::num) / double(clock_type::period::den);
    }

private:

    time_type m_start;
    time_type m_stop;

}; /* end struct StopWatch */
```
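
A measurement brackets the code of interest with calls to lap(). For example (a sketch; the header name StopWatch.hpp is an assumption):

```cpp
#include <iostream>

#include "StopWatch.hpp" // hypothetical header holding the class above

int main()
{
    StopWatch sw; // timing starts at construction
    volatile double x = 0.0;
    for (int i = 0; i < 100000000; ++i) { x += 1.0; } // the work to measure
    std::cout << "took " << sw.lap() << " sec" << std::endl;
    return 0;
}
```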

### Benchmark by Skip Access

This is an example of skip access. We allocate a memory buffer of $$128 \times 1024 \times 1024$$ int elements (512 MB) and operate on the contents with different “skips”. Skip 1 means accessing every element. Skip 2 accesses half of the elements, and so on.

```cpp
constexpr const size_t nelem = 128 * 1024 * 1024;
int * arr = new int[nelem];
// Touch everywhere to make sure the memory is actually allocated.
for (size_t i=0; i<nelem; ++i) { arr[i] = i; }
StopWatch sw;

// Sequential.
sw.lap();
for (size_t i=0; i<nelem; ++i) { arr[i] *= 3; }
std::cout << "Sequential takes: " << sw.lap() << " sec" << std::endl;

// Skipping 2, 4, ..., 1024 elements at a time.
for (size_t skip=2; skip<=1024; skip*=2)
{
    sw.lap();
    for (size_t i=0; i<nelem; i+=skip) { arr[i] *= 3; }
    std::cout << "Skipping " << skip << " takes: " << sw.lap() << " sec" << std::endl;
}
```

Without knowing the effect of cache, we might intuitively think that the larger the skip, the shorter the runtime. That’s not wrong, but not completely correct either. In the experiment, we see that skipping shortens the runtime only beyond a certain value.

Execution Results

code/01_skip_access.cpp

Build 01_skip_access.cpp:

```console
$ g++ 01_skip_access.cpp -o 01_skip_access -std=c++17 -O3 -g -m64
```

Run 01_skip_access:

```console
$ ./01_skip_access
Sequential takes: 0.0909938 sec
Skipping 2 takes: 0.0858447 sec
Skipping 4 takes: 0.075287 sec
Skipping 8 takes: 0.0734199 sec
Skipping 16 takes: 0.0762235 sec
Skipping 32 takes: 0.0581277 sec
Skipping 64 takes: 0.0449813 sec
Skipping 128 takes: 0.0307075 sec
Skipping 256 takes: 0.0125121 sec
Skipping 512 takes: 0.00623866 sec
Skipping 1024 takes: 0.00230463 sec
```

## Locality

While coding we usually don’t have a lot of time to do a detailed cache analysis. Instead, we keep in mind that code runs faster when its memory accesses are more compact. This is the concept of locality.

There are two kinds of locality: temporal and spatial. Temporal locality means that a fixed address is reused in the near future. Spatial locality means that addresses close to the current address are used in the near future. Better locality of either kind improves performance, and the cache hierarchy is why locality works.
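
Even the simplest loop exhibits both kinds. A small sketch (sum_all is an illustrative name, not from the course sources):

```cpp
#include <cstddef>

// `sum` is reused on every iteration (temporal locality), while
// arr[0], arr[1], ... are adjacent addresses (spatial locality).
double sum_all(double const * arr, std::size_t nelem)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < nelem; ++i) { sum += arr[i]; }
    return sum;
}
```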

To take advantage of locality, programmers analyze code using “strides”. A stride is the number of element indices advanced between consecutive accesses to an array. The most basic striding is sequential access, or the 1-stride. Similarly, we may have n-strides. The larger the stride, the worse the locality.

Recall that x86 uses 64-byte cache blocks and that a double-precision floating-point number takes 8 bytes, so one cache block holds 8 doubles: a unit-stride pass uses every element of each fetched block, while a stride of 8 or more doubles touches a different block on every access.
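
Expressed in code, the n-stride access looks like this (a sketch; strided_sum is an illustrative name):

```cpp
#include <cstddef>
#include <vector>

// Sum every `stride`-th element. stride == 1 is sequential access with the
// best spatial locality; the locality degrades as the stride grows.
double strided_sum(std::vector<double> const & arr, std::size_t stride)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < arr.size(); i += stride)
    {
        sum += arr[i];
    }
    return sum;
}
```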

## Matrix Population in C++

To demonstrate that the data layout, i.e., majoring or striding, affects the runtime, we use an example of populating $$1024 \times 1024 \times 64$$ double-precision elements as a matrix. The following shapes are benchmarked (the total number of elements remains unchanged):

- $$(1024\times1024\times64) \times 1$$, i.e., one-dimension
- $$(1024\times1024\times32) \times 2$$
- $$(1024\times1024\times16) \times 4$$
- $$(1024\times1024\times8) \times 8$$
- $$(1024\times1024\times4) \times 16$$
- $$(1024\times1024\times2) \times 32$$
- $$(1024\times1024) \times 64$$
- $$(1024\times512) \times 128$$
- $$(1024\times256) \times 256$$
- $$(1024\times128) \times 512$$
- $$(1024\times64) \times 1024$$
- $$(1024\times32) \times (1024\times2)$$
- $$(1024\times16) \times (1024\times4)$$
- $$(1024\times8) \times (1024\times8)$$

We populate the matrices along two axes. First we make the last index vary fastest (populating row by row):

```cpp
// Populate by last index.
for (size_t i=0; i<nrow; ++i) // the i-th row
{
    for (size_t j=0; j<ncol; ++j) // the j-th column
    {
        buffer[i*ncol + j] = i*ncol + j;
    }
}
```

Then we make the first index vary fastest (populating column by column):

```cpp
// Populate by first index.
for (size_t j=0; j<ncol; ++j) // the j-th column
{
    for (size_t i=0; i<nrow; ++i) // the i-th row
    {
        buffer[i*ncol + j] = i*ncol + j;
    }
}
```

The speed varies between the two orders. To get the benchmark results right, before the first benchmarked population we should access everywhere in the buffer to make sure the memory is actually allocated:

```cpp
// Prepopulation to cancel the effect of overcommit or delayed allocation.
for (size_t i=0; i<nelem; ++i) { buffer[i] = nelem-i; }
```


While writing programs, it’s much easier to know the stride than to analyze the cache behavior. The latter, in many scenarios, is prohibitively difficult.

Since we know the cache line is 64 bytes wide, we expect the cache performance to drop significantly when the stride reaches around that value (8 double elements), as the benchmark below shows.

Execution Results

code/02_locality.cpp

Build 02_locality.cpp:

```console
$ g++ 02_locality.cpp -o 02_locality -std=c++17 -O3 -g -m64
```

Run 02_locality:

```console
$ ./02_locality
# of elements: 67108864 = 67108864 x 1
populate double flatly takes: 0.075598 sec
populate double along last axis takes: 0.138065 sec
populate double along first axis takes: 0.0613524 sec
ratio: 0.444374

# of elements: 67108864 = 33554432 x 2
populate double flatly takes: 0.0840775 sec
populate double along last axis takes: 0.0971987 sec
populate double along first axis takes: 0.134753 sec
ratio: 1.38637

# of elements: 67108864 = 16777216 x 4
populate double flatly takes: 0.0890497 sec
populate double along last axis takes: 0.0765863 sec
populate double along first axis takes: 0.242477 sec
ratio: 3.16606

# of elements: 67108864 = 8388608 x 8
populate double flatly takes: 0.0852944 sec
populate double along last axis takes: 0.0892297 sec
populate double along first axis takes: 0.481189 sec
ratio: 5.3927

# of elements: 67108864 = 4194304 x 16
populate double flatly takes: 0.0859778 sec
populate double along last axis takes: 0.0951305 sec
populate double along first axis takes: 0.626653 sec
ratio: 6.58731

# of elements: 67108864 = 2097152 x 32
populate double flatly takes: 0.0960501 sec
populate double along last axis takes: 0.0732025 sec
populate double along first axis takes: 0.787803 sec
ratio: 10.762

# of elements: 67108864 = 1048576 x 64
populate double flatly takes: 0.081576 sec
populate double along last axis takes: 0.095843 sec
populate double along first axis takes: 0.89465 sec
ratio: 9.33454

# of elements: 67108864 = 524288 x 128
populate double flatly takes: 0.0805293 sec
populate double along last axis takes: 0.0841207 sec
populate double along first axis takes: 0.883303 sec
ratio: 10.5004

# of elements: 67108864 = 262144 x 256
populate double flatly takes: 0.0876071 sec
populate double along last axis takes: 0.0827943 sec
populate double along first axis takes: 0.899938 sec
ratio: 10.8696

# of elements: 67108864 = 131072 x 512
populate double flatly takes: 0.0812722 sec
populate double along last axis takes: 0.0807163 sec
populate double along first axis takes: 0.816387 sec
ratio: 10.1143

# of elements: 67108864 = 65536 x 1024
populate double flatly takes: 0.0882161 sec
populate double along last axis takes: 0.0807104 sec
populate double along first axis takes: 0.821201 sec
ratio: 10.1747

# of elements: 67108864 = 32768 x 2048
populate double flatly takes: 0.0900379 sec
populate double along last axis takes: 0.0750308 sec
populate double along first axis takes: 0.586014 sec
ratio: 7.81031

# of elements: 67108864 = 16384 x 4096
populate double flatly takes: 0.0865169 sec
populate double along last axis takes: 0.0932342 sec
populate double along first axis takes: 0.558974 sec
ratio: 5.99537

# of elements: 67108864 = 8192 x 8192
populate double flatly takes: 0.0772652 sec
populate double along last axis takes: 0.0819846 sec
populate double along first axis takes: 0.589144 sec
ratio: 7.18603
```

## Array Majoring in Numpy

Array majoring is directly related to locality. The difference in the performance of matrix-vector multiplication is shown below for row- and column-major arrays.

```python
%%time
import numpy as np

dim = 10000
float_rmajor = np.arange(dim*dim, dtype='float64').reshape((dim,dim))
float_cmajor = float_rmajor.T.copy().T
vec = np.arange(dim, dtype='float64')
```

```
CPU times: user 1.17 s, sys: 388 ms, total: 1.56 s
Wall time: 1.56 s
```

```python
%%time
res_float_rmajor = np.dot(float_rmajor, vec)
```

```
CPU times: user 64.2 ms, sys: 1.26 ms, total: 65.5 ms
Wall time: 64.1 ms
```

```python
%%time
res_float_cmajor = np.dot(float_cmajor, vec)
```

```
CPU times: user 138 ms, sys: 1.47 ms, total: 139 ms
Wall time: 138 ms
```


### Integer Matrix-Vector Multiplication

```python
%%time
dim = 10000
int_rmajor = np.arange(dim*dim, dtype='int64').reshape((dim,dim))
int_cmajor = int_rmajor.T.copy().T
vec = np.arange(dim, dtype='int64')
```

```
CPU times: user 1.13 s, sys: 390 ms, total: 1.52 s
Wall time: 1.52 s
```

```python
%%time
res_int_rmajor = np.dot(int_rmajor, vec)
```

```
CPU times: user 81.6 ms, sys: 1.09 ms, total: 82.7 ms
Wall time: 81.4 ms
```

```python
%%time
res_int_cmajor = np.dot(int_cmajor, vec)
```

```
CPU times: user 815 ms, sys: 2.01 ms, total: 817 ms
Wall time: 816 ms
```


The performance difference between row- and column-major arrays is much larger for integers than for floating-point numbers, although double and int64 both take 8 bytes. The reason is that the floating-point matrix-vector multiplication is dispatched to an optimized linear-algebra library (a LAPACK/BLAS implementation such as MKL or vecLib), while the integer version falls back to generic code.

For the same reason, the floating-point multiplication is also slightly faster than the integer one for row-major arrays.

## Tiling

This is a naive implementation of matrix-matrix multiplication:

```cpp
for (size_t i=0; i<nrow1; ++i)
{
    for (size_t k=0; k<ncol2; ++k)
    {
        double v = 0;
        for (size_t j=0; j<ncol1; ++j)
        {
            v += m1[i*ncol1 + j] * m2[j*ncol2 + k];
        }
        m3[i*ncol2 + k] = v;
    }
}
```

The matrices are row-major. In the innermost loop, the second matrix is accessed with a stride of ncol2 elements, so it doesn’t have good locality. This naive implementation is clear, but its performance is bad.

Matrix-matrix multiplication is one of the most important problems in numerical calculation, and there are many techniques available for making it fast. Most, if not all, of them are about hiding the memory-access latency. Tiling is a basic technique that delivers an impressive speed-up by reordering the calculation to make it cache-friendly.
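
Below is a minimal sketch of the idea (not the benchmarked 03_matrix_matrix.cpp; tiled_matmul and tsize are illustrative names), assuming square n × n row-major matrices with n a multiple of the tile size:

```cpp
#include <cstddef>
using std::size_t;

// Tiled multiplication of square n x n row-major matrices: m3 = m1 * m2.
// Each step of the three outer loops works on tsize x tsize tiles that fit
// in cache, so the elements of m2 loaded into cache are reused many times
// before eviction.
void tiled_matmul(double const * m1, double const * m2, double * m3,
                  size_t n, size_t tsize)
{
    for (size_t i = 0; i < n * n; ++i) { m3[i] = 0.0; }
    for (size_t it = 0; it < n; it += tsize)
    {
        for (size_t kt = 0; kt < n; kt += tsize)
        {
            for (size_t jt = 0; jt < n; jt += tsize)
            {
                // Naive multiplication restricted to the current tiles;
                // results accumulate across the jt tiles.
                for (size_t i = it; i < it + tsize; ++i)
                {
                    for (size_t k = kt; k < kt + tsize; ++k)
                    {
                        double v = m3[i * n + k];
                        for (size_t j = jt; j < jt + tsize; ++j)
                        {
                            v += m1[i * n + j] * m2[j * n + k];
                        }
                        m3[i * n + k] = v;
                    }
                }
            }
        }
    }
}
```

With tsize = 64 and 8-byte doubles, one tile takes 32 KB; the results below indeed show the tile size of 64 performing best on this machine.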

Execution Results

code/03_matrix_matrix.cpp

Build 03_matrix_matrix.cpp:

```console
$ g++ 03_matrix_matrix.cpp -o 03_matrix_matrix -std=c++17 -O3 -g -m64 \
    -I/opt/intel/mkl/include \
    /opt/intel/mkl/lib/libmkl_intel_lp64.a \
    /opt/intel/mkl/lib/libmkl_sequential.a \
    /opt/intel/mkl/lib/libmkl_core.a \
    -lpthread -lm -ldl
```

Run 03_matrix_matrix:

```console
$ ./03_matrix_matrix
Timing mkl: 0.0489621 second, 1.07374 Gflo, 21.9301 Gflops
Timing indirect: 3.1151 second, 1.07374 Gflo, 0.344689 Gflops
Timing direct: 3.04632 second, 1.07374 Gflo, 0.352472 Gflops
Timing tiled 32: 1.15064 second, 1.07374 Gflo, 0.933171 Gflops
Timing tiled 64: 0.452658 second, 1.07374 Gflo, 2.37208 Gflops
Timing tiled 128: 0.748723 second, 1.07374 Gflo, 1.4341 Gflops
Timing tiled 256: 0.833207 second, 1.07374 Gflo, 1.28869 Gflops
Timing tiled 512: 0.764348 second, 1.07374 Gflo, 1.40478 Gflops
Timing tiled 1024: 0.934039 second, 1.07374 Gflo, 1.14957 Gflops
```

## Exercises

1. Consult the data sheet of one x86 CPU and one Arm CPU. Make a table for the line size of each of the cache levels, and compare the difference between the two CPUs.
2. Write a program that uses tiling to speed up matrix-matrix multiplication and does not require the matrix dimension to be a multiple of the tile size.

## References

 [1] Computer Systems: A Programmer’s Perspective, Chapter 6 The Memory Hierarchy, Randal E. Bryant and David R. O’Hallaron: https://csapp.cs.cmu.edu/
 [2] Gallery of Processor Cache Effects: http://igoro.com/archive/gallery-of-processor-cache-effects/
 [3] Lecture Notes of Applications of Parallel Computers by David Bindel: https://www.cs.cornell.edu/~bindel/class/cs5220-s10/slides/lec03.pdf