# GPU08-session3

### From NA-Wiki


# Third session

## Finite difference example

In the meeting we discussed a simple finite difference code. A simple GPU version, written much like a normal CPU code, was shown, together with a version that makes simple use of shared memory.

### Problem

We want to take one timestep in a heat equation solver. We assume that the solution lives on a grid, and use the discretization:

u'(i,j) := u(i,j) + D*(u(i-1,j) + u(i,j-1) - 4*u(i,j) + u(i,j+1) + u(i+1,j))

where u' is the solution at time t+dt, u is the solution at time t and D is the diffusion coefficient times dt divided by h^2 (h is the grid spacing).
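
Before looking at the GPU kernels, it may help to see the discretization as plain CPU code. The following is a minimal sketch (the function name `heat_step_cpu` is ours, not from the session); it uses the same row-major layout `u[iy*nx+ix]` as the kernels below and copies boundary points unchanged.

```c
/* CPU reference for one timestep of the discretization above.
   nx, ny and D have the same meaning as in the text; boundary
   rows and columns are copied unchanged to the output. */
static void heat_step_cpu(int nx, int ny, float D,
                          const float *u, float *unew)
{
    for (int iy = 0; iy < ny; iy++)
        for (int ix = 0; ix < nx; ix++) {
            float x = u[iy*nx + ix];
            if (ix > 0 && ix < nx-1 && iy > 0 && iy < ny-1)
                x = x + D*(u[(iy-1)*nx + ix] + u[iy*nx + (ix-1)]
                         + u[iy*nx + (ix+1)] + u[(iy+1)*nx + ix]
                         - 4.0f*x);
            unew[iy*nx + ix] = x;
        }
}
```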

### First GPU implementation

```c
// 6 global memory accesses, 7 flops.
// NTX, NTY: threads per block in x and y (defined below, e.g. 16).
__global__ static void fdstep_gm(int nx, int ny, float D, float *X, float *Y)
{
    const int ix = blockIdx.x*(NTX-2) + threadIdx.x;
    const int iy = blockIdx.y*(NTY-2) + threadIdx.y;
    float x;
    if (ix < nx && iy < ny) {
        x = X[iy*nx + ix];
        if (ix > 0 && ix < nx-1 && iy > 0 && iy < ny-1)
            x = x + D*(X[(iy-1)*nx + (ix+0)] +
                       X[(iy+0)*nx + (ix-1)] +
                       X[(iy+0)*nx + (ix+1)] +
                       X[(iy+1)*nx + (ix+0)] - 4.0f*x);  // 4.0f keeps the constant single precision
        Y[iy*nx + ix] = x;
    }
}
```

The outer **if**-statement can be removed if the blocks exactly tile the
grid, i.e. if in each dimension the number of gridpoints minus 2 is a
multiple of the number of threads per block in that dimension minus 2
(adjacent blocks overlap by two points because of the indexing
`blockIdx.x*(NTX-2) + threadIdx.x`).
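
This tiling condition can be checked mechanically. The following host-side sketch (all names ours) counts, for the x dimension, how many threads land on each grid point under the mapping `ix = bx*(NTX-2) + tx`, and verifies that every point is covered and the last thread hits exactly `nx-1`.

```c
#define NTX 16  /* threads per block in x, as in the kernels */

/* Returns 1 if nbx = (nx-2)/(NTX-2) blocks of NTX threads cover
   exactly the indices 0..nx-1 under ix = bx*(NTX-2) + tx. */
static int tile_cover_ok(int nx)
{
    int nbx = (nx - 2) / (NTX - 2);
    int count[1024] = {0};
    for (int bx = 0; bx < nbx; bx++)
        for (int tx = 0; tx < NTX; tx++)
            count[bx*(NTX-2) + tx]++;          /* tally each ix hit */
    for (int i = 0; i < nx; i++)
        if (count[i] == 0) return 0;           /* a point nobody touches */
    /* the highest thread index must be exactly nx-1, no overshoot */
    return (nbx - 1)*(NTX - 2) + NTX - 1 == nx - 1;
}
```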

### Shared memory version

```c
#define NTX 16
#define NTY 16

// 2 global memory accesses, 7 flops.
__global__ static void fdstep_shm(int nx, int ny, float D, float *X, float *Y)
{
    const int ix = blockIdx.x*(NTX-2) + threadIdx.x;
    const int iy = blockIdx.y*(NTY-2) + threadIdx.y;
    const int tix = threadIdx.x, tiy = threadIdx.y;
    float x;
    __shared__ float Xs[NTY][NTX];

    x = X[iy*nx + ix];
    Xs[tiy][tix] = x;
    __syncthreads();
    if (tix > 0 && tix < NTX-1 && tiy > 0 && tiy < NTY-1) {
        x = x + D*(Xs[tiy-1][tix+0] +
                   Xs[tiy+0][tix-1] +
                   Xs[tiy+0][tix+1] +
                   Xs[tiy+1][tix+0] - 4.0f*x);  // 4.0f keeps the constant single precision
        Y[iy*nx + ix] = x;
    }
    if (ix == 0 || ix == nx-1 || iy == 0 || iy == ny-1)
        Y[iy*nx + ix] = x;
}
```

It is assumed that the blocks exactly tile the grid: in each dimension,
the number of gridpoints minus 2 must be a multiple of the number of
threads per block in that dimension minus 2, since every thread loads
one element without a bounds check.

### Multiple time step version

```c
#define NTX 16
#define NTY 16
#define NSTEPS 4

__global__ static void fdstep_multi(int nx, int ny, float D, float *X, float *Y)
{
    const int ix = blockIdx.x*(NTX-2*NSTEPS) + threadIdx.x;
    const int iy = blockIdx.y*(NTY-2*NSTEPS) + threadIdx.y;
    const int tix = threadIdx.x, tiy = threadIdx.y;
    float x;
    int i;
    __shared__ float Xs[NTY][NTX];

    x = X[iy*nx + ix];
    Xs[tiy][tix] = x;
    for (i = 0; i < NSTEPS; i++) {
        __syncthreads();
        x = Xs[tiy][tix];
        if (tix > 0 && tix < NTX-1 && tiy > 0 && tiy < NTY-1)
            x = x + D*(Xs[tiy-1][tix+0] +
                       Xs[tiy+0][tix-1] +
                       Xs[tiy+0][tix+1] +
                       Xs[tiy+1][tix+0] - 4.0f*x);  // 4.0f keeps the constant single precision
        __syncthreads();
        Xs[tiy][tix] = x;
    }
    if ((tix >= NSTEPS && tix < NTX-NSTEPS && tiy >= NSTEPS && tiy < NTY-NSTEPS) ||
        (ix < NSTEPS || ix >= nx-NSTEPS || iy < NSTEPS || iy >= ny-NSTEPS))
        Y[iy*nx + ix] = x;
}
```

In this code, the boundary elements are copied unchanged to the output vector.

#### Multiple timestep analysis

Assume that each thread block has p threads in each of D dimensions. Let the number of timesteps be s. The algorithm for a block of threads is as follows:

- Each thread of the **p^D** threads loads one data point into shared memory.
- All the threads perform **s** timesteps using only shared memory.
- The threads that hold interior points (points that have been computed using valid data) write out their results.
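
The claim in the last step (that only the inner points are valid after s tile-local steps) can be checked with a small CPU emulation in one dimension. The sketch below is ours, not from the session: it takes NSTEPS steps inside a tile of NTX points that was loaded once, then verifies that the inner NTX - 2*NSTEPS points agree with a reference that steps the full array.

```c
#include <math.h>

#define NTX    16
#define NSTEPS 4
#define N      64   /* full 1D array; the tile is taken from its interior */

/* One step of the 1D analogue u'(i) = u(i) + D*(u(i-1) - 2u(i) + u(i+1)),
   endpoints copied unchanged (like the tile-edge threads in the kernel). */
static void step(const float *u, float *v, int n, float D)
{
    v[0] = u[0];
    v[n-1] = u[n-1];
    for (int i = 1; i < n-1; i++)
        v[i] = u[i] + D*(u[i-1] - 2.0f*u[i] + u[i+1]);
}

static int multi_step_tile_ok(void)
{
    float a[N], b[N], tile[NTX], tnew[NTX];
    const float D = 0.2f;
    const int off = 20;                   /* tile window well inside the array */
    for (int i = 0; i < N; i++) a[i] = (float)(i % 7);
    for (int j = 0; j < NTX; j++) tile[j] = a[off + j];

    for (int s = 0; s < NSTEPS; s++) {    /* reference: step the full array */
        step(a, b, N, D);
        for (int i = 0; i < N; i++) a[i] = b[i];
    }
    for (int s = 0; s < NSTEPS; s++) {    /* tile: shared-memory analogue */
        step(tile, tnew, NTX, D);
        for (int j = 0; j < NTX; j++) tile[j] = tnew[j];
    }
    /* the valid region shrinks by one point per step from each side */
    for (int j = NSTEPS; j < NTX - NSTEPS; j++)
        if (fabsf(tile[j] - a[off + j]) > 1e-5f) return 0;
    return 1;
}
```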

- The number of global memory accesses is p^D reads and (p-2s)^D writes, so the total is p^D + (p-2s)^D.
- The number of useful flops is a*s*(p-2s)^D, where a is the number of flops per grid point per timestep.

We want to maximize the number of useful flops per global memory access, f(s) = a*s*(p-2*s)^D / (p^D + (p-2*s)^D). The picture below shows the case of the 2D heat equation timestep, with a=7 and p=32.
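
The curve can also be evaluated numerically. The sketch below (the function name `best_steps` is ours) tabulates f(s) for the 2D case and returns the maximizing s; for a = 7 and p = 32 the model peaks near s = 6.

```c
/* f(s) = a*s*(p-2s)^2 / (p^2 + (p-2s)^2): useful flops per global
   memory access for the 2D (D = 2) multi-step scheme above.
   Returns the s that maximizes f; the maximum value goes in *fmax. */
static int best_steps(int p, int a, double *fmax)
{
    int best = 1;
    double bestf = 0.0;
    for (int s = 1; 2*s < p; s++) {
        double w = (double)(p - 2*s);          /* side of the valid region */
        double f = a * s * w * w / ((double)p*p + w*w);
        if (f > bestf) { bestf = f; best = s; }
    }
    if (fmax) *fmax = bestf;
    return best;
}
```

Note that the model ignores occupancy and register pressure, which can favor a smaller s (such as the NSTEPS = 4 used in the kernel) in practice.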