Graphic Processing Units (Section 6.6 and Appendix C)

History of GPUs

- VGA (Video graphic array) in early 90’s -- A memory controller and display generator connected to some (video) RAM
- By 1997, VGA controllers were incorporating some acceleration functions
  - mapping, rasterization
- In 2000, a single chip graphics processor incorporated almost every detail of the traditional high-end workstation graphics pipeline
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization

Contemporary PC architecture

- More recently, processor instructions and memory hardware were added to support general-purpose programming languages
- OpenGL: A standard specification defining an API for writing applications that produce 2D and 3D computer graphics
- CUDA (compute unified device architecture): A scalable parallel programming model and language for GPUs based on C/C++
Basic GPU architecture

- **SM (Streaming Multiprocessor)**
- **ROP (Raster Operations Pipeline)**
- **TPC (Texture Processing Cluster)**
- **SFU (Special Function Unit)**

Streaming Processor (SP)

- Also called “Shader core”

The CPU+GPU architecture

- **CPU**
- **GPU**
- **L2 cache**
- **CPU (Host) memory**
- **Global (GPU) memory**
- **PCIe bus**
- **IPC/ID**
- **L1 cache**
- **Shared memory**
The programming model

```c
int total_hits = 0;
sample_points_per_thread = sample_points / num_threads;

for (int i = 0; i < num_threads; i++){
    my_arg[i].t_seed = i; /* can chose any seed – here i is chosen*/
    pthread_create(&p_threads[i], &attr, compute_pi, &my_arg[i]);
}

for (i = 0; i < num_threads; i++){
    pthread_join(p_threads[i], NULL);
    total_hits += my_arg[i].hits;
}

computed_pi = 4.0*(double) total_hits / ((double) (sample_points));
```

Copy data from CPU memory to GPU memory

Launch the kernel

Copy data from GPU memory to CPU memory
The programming model

CPU program
(serial code)

\[ \text{CUDAMemcpy (…)} \]

Launch a kernel with \( nb \) blocks, each with \( nt \) (up to 512) threads

\[ \text{Function } \langle \langle nb, nt \rangle \rangle \]

Copy data from CPU memory to GPU memory

\[ \text{CUDAMemcpy (…)} \]

Copy results from GPU memory to CPU memory

\[ \text{global\_Function (…)} \]

Definition of a kernel
(the function executed by each GPU thread)

A kernel is a C function with the following restrictions:
• Cannot access host memory
• Must have “void” return type
• No variable number of arguments or static variables
• No recursion

The execution model

Blocks not dispatched initially are dispatched when some blocks finish execution

• The thread blocks are dispatched to SMs
• The number of blocks dispatched to an SM depends on the SM’s resources (registers, shared memory, …).

• When a block is dispatched to an SM, each of its threads executes on an SP in the SM.
The execution model

- Each block (up to 512 threads) is divided into groups of 32 threads (called warps) – empty threads are used as fillers.
- The 32 threads of a warp execute in SIMD mode on the SM.
- The (up to 16) warps of a block are “coarse grain multithreaded”

- Depending on the number of SPs per SM:
  - If 32 SP per SM → one thread of a warp executes on one SP (32 lanes of execution, one thread per lane)
  - If 16 SP per SM → every two threads of a warp are time multiplexed (fine grain multithreading) on one SP (16 lanes of execution, 2 threads per lane)
  - If 8 SP per SM → every four threads of a warp are time multiplexed (fine grain multithreading) on one SP (8 lanes of execution, 4 threads per lane)

All threads execute the same code

Assume one block with 64 threads – launched using Kernel <<<1, 64>>>

- Each thread in a thread block has a unique “thread index” → threadIdx.x
- The same sequence of instructions can apply to different data
Blocks of threads

Assume two block with 32 threads each – launched using Kernel \texttt{<<2,32>>}

- Each thread block has a unique “block index” \(\text{blockIdx.x}\)
- Each thread has a unique threadIdx.x within its own block
- Can compute a global index from the blockIdx.x and threadIdx.x

\[
\text{int } i = 32 \times \text{blockIdx.x} + \text{threadIdx.x};
\]
\[
\text{B}[i] = \text{A}[63-i];
\]
\[
\text{C}[i] = \text{B}[i] + \text{A}[i]
\]

Two-dimensions grids and blocks

Can launch a 2x2 array of blocks, each consisting of 4x8 array of threads using Kernel \texttt{<<(2,2),(4,8)>>}

- Each block has two indices (blockIdx.x, blockIdx.y)
- Each thread in a thread block has two indices (threadIdx.x, threadIdx.y)
Scheduling warps on SMs

Block diagram of a 16-lanes SM: A scoreboard keeps track of the current PCs (instruction address) of up to NW independent threads of SIMD instructions (NW warps). NW = 48 in older GPUs and 64 for more recent GPUs.

Up to 48 (64 in more recent GPUs) warps to be scheduled on the SM (may be from different blocks)

Scheduling warps on SMs

Scheduling threads of SIMD instructions (warps): The scheduler selects a ready instruction from some warp and issues it synchronously to all the SIMD lanes. Because warps are independent, the scheduler may select an instruction from a different warp at every issue.
SMs with multiple warp schedulers

Newer SMs have large number of SPs and may have multiple warp schedulers.

Example: SM with Dual warp Scheduler.

- Each SM has 32 SPs (cores) and is divided into two groups of 16 lanes each.
- One warp scheduler is responsible for scheduling a warp on one of the two 16-lanes groups.
- A warp is issued to each 16-lanes group every two cycles.

Sharing the resources of an SM

A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

A single SRAM (ex. 16KB) is partitioned among the dispatched blocks. All the threads of a block can access the partition assigned to that block.
The memory architecture

Can copy data between the CPU memory and the global memory using “cudaMemcpy( )”

CPU memory (DRAM) → PCIe bus → GPU Global memory (DRAM)

Shared by all the threads of a kernel.

A variable allocated by the CPU using “cudaMalloc” or declared “_device_” in the kernel function is allocated in global memory and is shared by all the threads in the kernel.

A variable declared “_shared_” in the kernel function is allocated in shared memory and is shared by all the threads in a block.

getting data into GPU memory

cudaMalloc (void **pointer, size_t nbytes); /* malloc in GPU global memory */
cudaMemset (void **pointer, int value, size_t count);
cudaMemcpy(void *dest, void *src, size_t nbytes, enum cudaMemcpyKind dir)
cudaFree(void **pointer);

enum cudaMemcpyKind
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

Notes:
  • cudaMemcpy() blocks CPU thread until copy is complete
  • cudaMemcpy() does not start copying until previous CUDA calls complete
**Data Movement Example**

```c
int main(void) {
    float *a_h, *b_h; // host data
    float *a_d, *b_d; // device data
    int N = 14, nBytes, i;
    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);
    for (i=0; i<N; i++) a_h[i] = 100.f + i;
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    GPUcomp<<<1,14>>>(a_d, b_d, N);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
    for (i=0; i<N; i++) assert( a_h[i] == b_h[i] );
    free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
    return 0;
}
```

**Variable qualifiers in CUDA kernel functions**

```c
int main (void) {
    //
    kernel <<<16,64 >>>
    //
    //
    //
    //
}

__global__ void kernel (...) {
    int i;
    __device__ float y[1024];
    __shared__ int x[64];
    //
    x[threadIdx.x] = ...
    ...
    = y[blockIdx.x]
}```
Function qualifiers

• The functional qualifier “_global_” is used to designate a kernel
  ➢ Called from Host and executed on Device
  ➢ Must return “void”

• The functional qualifier “_device_” designates a function
  ➢ Called from Device and executed on Device
  ➢ Cannot be called from Host

• The functional qualifier “_host_” designates a function
  ➢ Called from Host and executed on Host (default)

“_host_” and “_device_” can be combined to generate code that can execute on both Host and Device

Compiling

nvcc outputs either
➢ C code for the CPU
➢ PTX or object code for GPU

An executable requires linking to:
➢ “cudart” run time library
➢ “cuda” code library
Launching a kernel

- A kernel is launched using the syntax:
  \[ \text{kernel} \llll \text{dim3 dG, dim3 dB} \rrrr (\ldots) ; \]

- Execution configuration:
  - \( \text{dG} \rightarrow \) dimension and size of grids in blocks
  - 3-dimensions, \( \text{dG.x, dG.y and dG.z} \) (default \( \text{dG.y = dG.z = 1} \))
  - Number of blocks in the launched grid = \( \text{dG.x \times dG.y \times dG.z} \)
  - \( \text{dB} \rightarrow \) dimension and size of blocks in threads
  - 3-dimensions, \( \text{dB.x, dB.y and dB.z} \) (default \( \text{dB.y = dB.z = 1} \))
  - Number of threads per block = \( \text{dB.x \times dB.y \times dB.z} \)

Note: kernel launch is non-blocking \( \rightarrow \) control returns to CPU immediately

gridDim, blockDim, blockIdx and threadIdx are built in variables of type dim3 \( \rightarrow \) accessible by the kernel function.
Example:

```c
void main ()
{  cudaMalloc (int* &a, 20*sizeof(int));
    cudaMalloc (int* &b, 20*sizeof(int));
    cudaMalloc (int* &c, 20*sizeof(int));
    ...
    kernel<<<4,5>>>(a, b, c);
    ...
}

_global_ void kernel(int *a, *b, *c)
{  int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = i;
    b[i] = blockIdx.x;
    c[i] = threadIdx.x;
}
```

NOTE: Each block will consist of one warp – only 5 threads in the warp will do useful work and the other 27 threads will execute no-ops.

Global Memory

<table>
<thead>
<tr>
<th>a[]</th>
<th>0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19</th>
</tr>
</thead>
<tbody>
<tr>
<td>b[]</td>
<td>0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3</td>
</tr>
<tr>
<td>c[]</td>
<td>0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4</td>
</tr>
</tbody>
</table>

Example: increment the elements of an array

C program (on CPU)

```c
void inc_cpu(int *a, int n)
{  int i;
    for(i=0; i<n; i++)
        a[i] = a[i] + 1;
}

void main ()
{  ...
    inc_cpu(a,n);
    ...
}
```

CUDA program (on CPU+GPU)

```c
_global_ void inc_gpu(int *A, int n)
{  int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
}

void main ()
{  ...
    blocksize = 64;
    // cudaMalloc array A[n]
    // cudaMemcpy data to A
    dim3 dimB (blocksize);
    dim3 dimG(ceil(n/blocksize));
    inc_gpu<<<dimG, dimB>>>(A, n);
    ...
}
```

n/64 64
Example: computing \( y = ax + y \)

**C program (on CPU)**

```c
void saxpy_serial(int n, float a, float *x, float *y)
{
    for(int i = 0; i<n; i++)
        y[i] = a * x[i] + y[i];
}
```

**CUDA program (on CPU+GPU)**

```c
void main()
{
    ...
    saxpy_serial(n, 2.0, x, y);
    ...
}
```

---

Example: computing \( y = ax + y \)

**CUDA program (on CPU+GPU)**

```c
__global__ void saxpy_gpu(int n, float a, float *X, float *Y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) Y[i] = a * X[i] + Y[i];
}
```

```c
void main()
{
    ...
    // cudaMalloc arrays X and Y
    // cudaMemcpy data to X and Y
    // blocksize = 256;
    int NB = (n + 255) / 256;
    saxpy_gpu<<<NB, 256>>>(n, 2.0, X, Y);
    // cudaMemcpy data from Y
}
```
Global Memory

- Global memory is the off-chip DRAM memory
  - Accesses must go through interconnect and memory controller
- Many concurrent threads generate memory requests → coalescing is necessary
  - Combining memory accesses made by threads in a warp into fewer transactions – each memory transaction is for 64 bytes (16, 4-bytes words).
  - E.g. if each thread of a warp are accessing consecutive 4-byte sized locations in memory, send two 64-byte request to DRAM instead of 32 4-byte requests
- Coalescing is achieved for any pattern of addresses that fits into a segment of size 128B.

Coalescing (cont.)

A warp may issue up to 32 memory accesses. They can be completed in
- Two transactions, each for 16 coalesced accesses (if perfectly coalesced)
- 32 separate transactions (if addresses cannot be coalesced)
- Fewer than 32 transactions (if partial coalescing is possible).
Shared Memory

- A memory address subspace in each SM (at least 48KB in Nvidia GPUs)
  - As fast as register files if no bank conflict
  - May be used to reduce global memory traffic (called scratchpad)
- Managed by the code (programmer)

- Many threads accessing shared memory → Highly banked
  - Successive 32-bit words assigned to successive banks
- Each bank serves one address per cycle
  - Shared Memory can service as many simultaneous accesses as it has banks
- Multiple concurrent accesses to a bank result in a bank conflict (has to be serialize)

Bank Addressing Example
Synchronization

- `_syncthreads()`; a barrier for threads within a thread block
- Allowed in conditional code only if all condition is uniform in all the threads of a block
- Used to avoid hazards when using shared memory

To synchronize threads across different thread blocks, need to use **atomic operations** on variables in global memory

- Add, sub, min, max, …
- And, or, xor
- Increment, decrement
- Exchange, compare and swap

`cudaThreadSynchronize()`: called in CPU code to block until all previously issued cuda calls complete.
Maximize the use of shared memory

Shared memory is hundreds of times faster than global memory

To take advantage of shared memory
- Partition the data sets into subsets that fit into shared memory
- Handle each subset with one thread block
  - Load the subset from global memory to shared memory
  - `__syncthreads()`
  - Perform the computation while data is in shared memory
  - `__syncthreads()`
  - Copy results back from shared to global memory

The code examples in the next slides are copied and modified from:

Transposing an nxn matrix

- $B[j][i] = A[i][j]$ for $i=0,...,n-1$ and $j=0,...,n-1$
- Partitions $A$ into $k^2 = k \times k$ tiles, each with $n/k$ rows and $n/k$ columns
- Launch a 2-D grid of $(k,k)$ thread blocks, each with $(n/k,n/k)$ threads
- Assign $(threadIdx.x, blockIdx.x)$ in $(blockIdx.x, blockIdx.y)$ to copy element $(row, col)$ of matrix $A$ into position $(col, row)$ of matrix $B$.

```c
__global__ void transpose(float* A, float* B, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    B[col][row] = A[row][col];
}
```

Naïve implementation exhibits non-coalesced access to global memory
Transposing with coalesced memory access

- Each block copies the rows of its tile from global to shared memory (coalesced)
- Transpose in shared memory
- Copy rows of the transposed tile from shared to global memory (coalesced)

```c
__global__ void transpose(float* A, float* B, int n)
{   _shared_ float C[tiledim][tiledim], D[tiledim][tiledim];  // defined parameter “tiledim” = n/k
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    C[threadIdx.y][threadIdx.x] = A[row][col]; // coalesced load from A
    _synchthreads(); // complete loading before transposing
    D[threadIdx.x][threadIdx.y] = C[threadIdx.y][threadIdx.x] ;  // complete transpose in shared memory
    _synchthreads();
    int row = blockIdx.x * blockDim.y + threadIdx.y; // note that blockDim.x = blockDim.y
    int col  = blockIdx.y * blockDim.x + threadIdx.x;
    B[row][col] = D[threadIdx.y][threadIdx.x] // coalesced storing to B
}
```

Multiplication \( c = A \times b \) of an \( n \times n \) matrix, A, by a vector b

- Each element \( c_{ij} \) will be computed by one thread
- Partitions \( c \) into \( k \) parts, each with \( n/k \) elements
- Launch \( k \) thread blocks, each with \( n/k \) threads
- Thread \( threadIdx.x \) in block \( blockIdx.x \) computes \( c[idx] \) where \( idx = blockIdx.x \times blockDim.x + threadIdx.x \)

```c
__global__ void mv(float* A, float* b, float* c, int n)
{   int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float temp = 0 ;
        for (int k = 0; k < n; k++)
            temp += A[row][k] * b[k] ;
    }
    c[row] = temp ;
}
```

```c
#define blocksize = 128; // or any other block size
#define n = 2048 ; // or any other matrix size
int nblocks = (n + blocksize - 1) / blocksize;
mv<<<nblocks, blocksize>>>(A, b, c, n);
```
Multiplication $C = A \times B$ of two $n \times n$ matrices.

- Partitions each matrix into $k^2 = k \times k$ tiles, each with $M$ rows and $M$ columns ($M = N/k$)
- Launch a 2-D grid of $(k,k)$ thread blocks, each with $(M,M)$ threads
- Assign thread (threadIdx.x, threadIdx.y) in block (blockIdx.x, blockIdx.y) to handle element (row, col) of the matrix $C$.

```c
__global__ void mm_simple(float* C, float* A, float* B, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += A[row][k] * B[k][col];
    C[row][col] = sum;
}
```

Matrix-matrix multiplication using shared memory

- The thread that computes $C[row][col]$ accesses row $A[row][]$ and column $B[[]][col]$ from the global memory
- Hence, each row of $A$ and each column of $B$ is accessed $M$ times by different threads in the same block.
- To avoid repeated access to global memory, the block that computes a tile $C[i,j]$, where $0 \leq i < k$ and $0 \leq j < k$ executes:
  ```
  for q = 0, ..., k-1 // $k = 3$ in the example shown */
  {
      load tiles $A_{i,q}$ and $B_{q,j}$ into the shared memory
      $C[i,j] = C[i,j] + A_{i,q} \times B_{q,j}$ // accumulate the product of $A_{i,q} \times B_{q,j}$
  }
  ```
Parallel reduction

```c
__global__ void reduce(int *input, int n, int *total_sum) {
    int tid = threadIdx.x;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    _shared_ int x[blocksize];
    x[tid] = input[i]; // load elements into shared memory
    _syncthreads();
    // Tree reduction over elements of the block.
    for(int half = blockDim.x/2; half > 0; half = half/2) {
        if(tid < half) x[tid] += x[tid + half];
        _syncthreads();
    }
    // Thread 0 adds the partial sum to the total sum
    if( tid == 0 ) atomicAdd(total_sum, x[tid]);
}
```

```c
#define blocksize = 128; // or any other block size
#define n = 2048; // or any other array size
int nblocks = n / blocksize;
reduce <<<nblocks, blocksize>>> (input, &total_sum);
```

Static Vs dynamic shared arrays

- When declaring a shared array using "_shared_ float C[tiledim][tiledim]",
tiledim has to be statically defined. For example, in the matrix transpose
example, the main should look like

```c
#define tiledim 16 ;      // blocksize = 16*16 = 256
n = 2048 ;              // or any other array size
int nblocks = n / blocksize;
reduce <<<nblocks, blocksize>>> (input, &total_sum);
```

- Can declare an unsized shared array using "extern __shared__ float C[]" (note
the [ ] ) and then at kernel launch time, use a third argument to specify the size
of the shared array:

```c
n = 2048 ;
tiledim = *** ;        // can determine dynamically
dim3 threads(tiledim, tiledim) ;
dim3 grid(n/tiledim, n/tiledim) ;        // assuming n is a multiple of 16
transpose<<<grid, threads>>>( A, B, n ) ;
```
Dealing with data dependency

- Instructions in a warp are scheduled “in order”
- No pipeline forwarding → a dependent instruction cannot be issued
- Latency due to dependencies are hidden by issuing instructions from other warps (similar to multithreading – call it multi-warping)

```
add.f32  $f3, $f1, $f2
add.f32  $f5, $f3, $f4
ld.shared.f32 $f6, 0($r31)
add.f32  $f6, $f6, $f7
sw.shared.f32 $f6, 0($r31)
```

- Each SP is a 6-stage pipeline with no forwarding
- Two dependent instructions have to be separated by 6 other instructions
- Can completely hide the latency if the scheduler can issue instructions from 6 different warps (192 threads). They may be from different blocks.

Blocks per grid heuristics

The number of blocks should be larger than the number of SMs

The number of threads per SM is limited by the number of registers per SM – note that local variables not assigned to registers are stored in global memory. May use `--maxregcount = n` flag when compiling to restrict the number of registers per thread to `n`.

Example: If kernel uses 8 registers per thread, and register file is 16K registers per SM, then each SM can support at most 2048 threads)

The number of Blocks per SM is determined by the shared memory declared in each block and the number of threads per block.

Example: If 2KB shared memory is declared per block and each SM has 16KB of shared memory, then each SM can support at most 8 blocks).

Example: If each block has 256 threads (8 warps), and the GPU can support 48 warps per SM, then each SM can support at most 6 blocks.
Unified CPU/GPU memory (in CUDA 6 and later)

- CPU and GPU share the same virtual memory (UVM)
- If physical memory is integrated (shared), then CPU and GPU use the same page table.
- If each of the CPU and the GPU has its own physical memory, pages are copied on demand (triggered by page faults).

Unified CPU/GPU memory

**CPU code (not using a GPU)**

```c
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}
```

**CUDA 6 code with unified memory**

```c
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
```

The pointer is passed as argument to the kernel → copying occurs on demand.

No need for
```
malloc(&cpudata, N);
cudaMalloc(&data, N);
cudaMemcpy(&cpudata, &data ..);
```
Overlapping Data Transfers and Computation

cudaMallocHost() allows allocation of page locked (pinned) host memory – which leads to faster cudaMemcpy().

Cuda stream = sequence of operations that execute in order on GPU
- Operations from different streams can be interleaved.
- Stream ID used as argument to async calls and kernel launches.
- Stream 0 is the default stream and any call in that stream blocks until all previous calls are complete (cannot overlap).

cudaMemcpyAsync(dst, src, size, dir, stream);
- A non-blocking call
- requires pinned host memory (allocated using cudaMallocHost())

Overlapping CPU computation and data transfer

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kERNEL<<<grid, block>>>(a_d);
cpuFunction();

Can overlap GPU computation and data transfer using different non-zero streams:

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kERNEL<<<grid, block, 0, stream2>>>(…);

Sparse matrix representation

\[ A = \begin{bmatrix}
3 & 0 & 9 & 0 & 0 \\
0 & 5 & 0 & 2 \\
0 & 0 & 7 & 0 & 0 \\
0 & 0 & 5 & 8 & 4 \\
0 & 0 & 6 & 0 & 0 \\
\end{bmatrix} \]

- \( Av = [3 \ 9 \ 5 \ 2 \ 7 \ 5 \ 8 \ 4 \ 6] \) = non zero elements
- \( Aj = [0 \ 2 \ 4 \ 2 \ 3 \ 4 \ 2] \) = column indices of elements
- \( Ap = [0 \ 2 \ 4 \ 5 \ 8 \ 9] \) = pointers to the first element in each row

Serial sparse matrix/vector multiplication

```c
void csrmul_serial(int *Ap, int *Aj, float *Av, int num_rows, float *x, float *y)
{
    for(int row=0; row<num_rows; ++row)
    {
        int row_begin = Ap[row];
        int row_end = Ap[row+1];
        y[row] = multiply_row(row_end - row_begin, Aj[row_begin], Av[row_begin], x);
    }
}
```

```c
float multiply_row(int rowsize, int *Aj, float *Av, float *x)
{
    int j;
    float sum = 0;
    for(int col=0; col < rowsize; ++col)
    {
        j = Aj[col];
        // note that effectively, j = Aj[row_begin + col]
        sum += Av[col] * x[j];
    }
    return sum;
}
```
Parallel sparse matrix/vector multiplication

```c
__global__ void csrmul_kernel(int *Ap, int *Aj, float *Av, int num_rows, float *x, float *y)
{
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    if( row<num_rows )
    {
        int row_begin = Ap[row];
        int row_end = Ap[row+1];
        float sum = 0;
        for(int col=row_begin; col<row_end; ++col)
        { int j = Aj[col];
            sum += Av[col] * x[j];
        }
        y[row] = sum;
    }
}
```

The code to launch the above parallel kernel is:

```c
unsigned int blocksize = 128; // or any size up to 512
unsigned int nblocks = (num_rows + blocksize - 1) / blocksize;
csrmul_kernel<<<nblocks,blocksize>>>(Ap, Aj, Av, num_rows, x, y);
```

Caching in shared memory

- Block_begin
- A thread
- Block_end
- Cache in shared memory
- Expect most of the non-zero elements here (around the diagonal)
- the “row” executed by a thread
global void csrMul_cached(int *Ap, int *Aj, float *Av, int num_rows, const float *x, float *y)
{
    _shared_float cache[blocksize];       // Cache the rows of x[] corresponding to this block.
    int block_begin = blockIdx.x * blockDim.x;
    int row = block_begin + threadIdx.x;
    int block_end = block_begin + blockDim.x;
    // Fetch and cache our window of x[].
    if(row<num_rows) cache[threadIdx.x] = x[row];
    __syncthreads();
    if(row<num_rows)
    {
        int row_begin = Ap[row];
        int row_end = Ap[row+1];
        float x_j, sum = 0;
        for(int col=row_begin; col<row_end; ++col)
        {
            int j = Aj[col];
            if(j>=block_begin && j<block_end) // Fetch x_j from our cache when possible
                x_j = cache[j-block_begin];
            else
                x_j = x[j];
            sum += Av[col] * x_j;
        }
        y[row] = sum;
    }
}