# Third session

## Finite difference example

In the meeting we discussed a simple finite difference code. We looked at a straightforward GPU version, written much as the corresponding CPU code would be, and at a version that makes simple use of shared memory.

### Problem

We want to take one timestep in a heat equation solver. We assume that the solution lives on a grid, and use the discretization:

```
u'(i,j) := u(i,j) + D*(u(i-1,j) + u(i,j-1) - 4*u(i,j) + u(i,j+1) + u(i+1,j))
```

where u' is the solution at time t+dt, u is the solution at time t and D is the diffusion coefficient times dt divided by h^2 (h is the grid spacing).
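For reference, a plain CPU version of the same timestep might look as follows. This is a sketch; the function name `fdstep_cpu` is our own, but it uses the same row-major array layout (element (i,j) at index j*nx + i) as the GPU kernels below:

```c
/* One explicit timestep of the 2D heat equation on an nx-by-ny grid,
 * stored row by row (element (i,j) at index j*nx + i). Boundary
 * points are copied to the output unchanged. */
static void fdstep_cpu(int nx, int ny, float D, const float *X, float *Y)
{
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            float x = X[j*nx + i];
            if (i > 0 && i < nx-1 && j > 0 && j < ny-1)
                x = x + D*(X[(j-1)*nx + i] + X[j*nx + (i-1)] +
                           X[j*nx + (i+1)] + X[(j+1)*nx + i] -
                           4.0f*x);
            Y[j*nx + i] = x;
        }
    }
}
```

Per interior point this performs the same 6 reads/writes and 7 flops as the stencil formula; the GPU versions below differ only in how the work is mapped to threads and where the reads come from.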

### First GPU implementation

```
#define NTX 16   // threads per block in x
#define NTY 16   // threads per block in y

// 6 global memory accesses, 7 flops per interior point.
__global__ static void
fdstep_gm(int nx, int ny, float D, float *X, float *Y) {
  const int ix = blockIdx.x*(NTX-2) + threadIdx.x;
  const int iy = blockIdx.y*(NTY-2) + threadIdx.y;
  float x;

  if(ix < nx && iy < ny) {
    x = X[iy*nx + ix];
    if(ix > 0 && ix < nx-1 && iy > 0 && iy < ny-1)
      x = x + D*(X[(iy-1)*nx + (ix+0)] + X[(iy+0)*nx + (ix-1)] +
                 X[(iy+0)*nx + (ix+1)] + X[(iy+1)*nx + (ix+0)] -
                 4.0f*x);
    Y[iy*nx + ix] = x;
  }
}
```

The outer if-statement can be removed if in each dimension, the number of gridpoints is a multiple of the number of threads per block in that dimension.
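Because neighbouring blocks overlap by two points in each dimension (a block of NTX threads produces only NTX-2 interior results), the launch grid must be sized from the interior count, not from nx itself. A small host-side helper for one dimension (a sketch; the name `nblocks` is our own):

```c
/* Number of thread blocks needed along one dimension when each block
 * has nt threads but, because of the one-point halo on each side,
 * contributes only nt-2 interior results; n is the number of grid
 * points in that dimension. */
static int nblocks(int n, int nt)
{
    /* n-2 interior points, nt-2 produced per block: round up. */
    return (n - 2 + (nt - 2) - 1) / (nt - 2);
}
```

For example, with nx = 32 and NTX = 16 there are 30 interior points and each block covers 14 of them, so 3 blocks are needed along x.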

### Shared memory version

```
#define NTX 16
#define NTY 16

// 2 global memory accesses, 7 flops per interior point.
__global__ static void
fdstep_shm(int nx, int ny, float D, float *X, float *Y) {
  const int tix = threadIdx.x;
  const int tiy = threadIdx.y;
  const int ix = blockIdx.x*(NTX-2) + tix;
  const int iy = blockIdx.y*(NTY-2) + tiy;
  float x;

  __shared__ float Xs[NTY][NTX];

  x = X[iy*nx + ix];
  Xs[tiy][tix] = x;
  __syncthreads();   // the whole tile must be loaded before anyone reads it

  if(tix > 0 && tix < NTX-1 && tiy > 0 && tiy < NTY-1) {
    x = x + D*(Xs[tiy-1][tix+0] + Xs[tiy+0][tix-1] +
               Xs[tiy+0][tix+1] + Xs[tiy+1][tix+0] -
               4.0f*x);

    Y[iy*nx + ix] = x;
  }
  if(ix == 0 || ix == nx-1 || iy == 0 || iy == ny-1)
    Y[iy*nx + ix] = x;
}
```

It is assumed that the number of grid points in each dimension is a multiple of the number of threads per block in that dimension.

### Multiple time step version

```
#define NTX 16
#define NTY 16
#define NSTEPS 4

__global__ static void
fdstep_multi(int nx, int ny, float D, float *X, float *Y) {
  const int tix = threadIdx.x;
  const int tiy = threadIdx.y;
  const int ix = blockIdx.x*(NTX-2*NSTEPS) + tix;
  const int iy = blockIdx.y*(NTY-2*NSTEPS) + tiy;
  float x;
  int i;

  __shared__ float Xs[NTY][NTX];

  x = X[iy*nx + ix];
  Xs[tiy][tix] = x;

  for(i = 0; i < NSTEPS; i++) {
    __syncthreads();   // all writes to Xs must be visible before reading
    x = Xs[tiy][tix];
    if(tix > 0 && tix < NTX-1 && tiy > 0 && tiy < NTY-1)
      x = x + D*(Xs[tiy-1][tix+0] + Xs[tiy+0][tix-1] +
                 Xs[tiy+0][tix+1] + Xs[tiy+1][tix+0] -
                 4.0f*x);
    __syncthreads();   // all reads must finish before the tile is overwritten
    Xs[tiy][tix] = x;
  }

  if((tix >= NSTEPS && tix < NTX-NSTEPS && tiy >= NSTEPS && tiy < NTY-NSTEPS) ||
     (ix < NSTEPS || ix >= nx-NSTEPS || iy < NSTEPS || iy >= ny-NSTEPS))
    Y[iy*nx + ix] = x;
}
```

In this code, the boundary elements are copied unchanged to the output vector.

#### Multiple timestep analysis

Assume that each thread block has p threads in each of D dimensions. Let the number of timesteps be s. The algorithm for a block of threads is as follows:

1. All the threads read their grid point from global memory into shared memory.
2. All the threads perform s timesteps using only shared memory.
3. Those threads that hold interior points (points that have been computed from valid data) write out their results.

- The number of global memory accesses is p^D reads and (p-2s)^D writes, so the total is p^D + (p-2s)^D.
- The number of useful flops is a*s*(p-2s)^D, where a is the number of flops per grid point per timestep.

We want to maximize f(s) = a*s*(p-2s)^D / (p^D + (p-2s)^D), i.e. the number of useful flops per global memory access. The picture below shows the case of the 2D heat equation timestep, with a=7 and p=32.
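The optimum is also easy to tabulate numerically. The sketch below (our own code, not part of the GPU implementation) evaluates f(s) for the 2D case and reports the maximizing s:

```c
/* Useful flops per global memory access for s fused timesteps:
 * f(s) = a*s*(p-2s)^D / (p^D + (p-2s)^D), here with D = 2. */
static double f2d(int a, int p, int s)
{
    double w = (double)(p - 2*s) * (p - 2*s);  /* (p-2s)^2 */
    return (double)a * s * w / ((double)p * p + w);
}

/* The s in 1 <= s < p/2 that maximizes f2d. */
static int best_s(int a, int p)
{
    int sbest = 1;
    for (int s = 2; 2*s < p; s++)
        if (f2d(a, p, s) > f2d(a, p, sbest))
            sbest = s;
    return sbest;
}
```

For a = 7 and p = 32 this gives s = 6, with f(6) = 7·6·20² / (32² + 20²) ≈ 11.8 useful flops per global memory access, compared with 7/6 ≈ 1.2 for the naive kernel.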