GPU08-session3

Third session

Finite difference example

In the meeting we discussed a simple finite difference code. A straightforward GPU version, written much as it would look in a normal CPU code, was shown, followed by a version that makes simple use of shared memory and a version that takes several timesteps per kernel launch.

Problem

We want to take one timestep in a heat equation solver. We assume that the solution lives on a grid, and use the discretization:

     u'(i,j) := u(i,j) + D*(u(i-1,j) + u(i,j-1) -
                            4*u(i,j) + u(i,j+1) + u(i+1,j))

where u' is the solution at time t+dt, u is the solution at time t, and D is the diffusion coefficient times dt divided by h^2 (h is the grid spacing).
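For reference, a one-timestep CPU version of this update (not from the session code, just a straightforward transcription of the formula above) might look as follows; u_old and u_new are nx*ny arrays stored row by row, and the boundary values are copied through unchanged.

// CPU reference sketch of one timestep (hypothetical helper, names are ours).
void fdstep_cpu(int nx, int ny, float D, const float *u_old, float *u_new)
{
  for (int iy = 0; iy < ny; iy++) {
    for (int ix = 0; ix < nx; ix++) {
      float x = u_old[iy*nx + ix];
      if (ix > 0 && ix < nx-1 && iy > 0 && iy < ny-1)
        x = x + D*(u_old[(iy-1)*nx + ix] + u_old[iy*nx + (ix-1)] +
                   u_old[iy*nx + (ix+1)] + u_old[(iy+1)*nx + ix] -
                   4.0f*x);
      u_new[iy*nx + ix] = x;   // boundary points are copied unchanged
    }
  }
}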

First GPU implementation

// 6 global memory accesses, 7 flops.
// NTX and NTY are the thread-block dimensions (#defined as 16 below).
__global__ static void
fdstep_gm(int nx,int ny,float D,float *X,float *Y) {
  const int ix = blockIdx.x*(NTX-2) + threadIdx.x;
  const int iy = blockIdx.y*(NTY-2) + threadIdx.y;
  float x;

  if(ix<nx && iy<ny) {
    x = X[iy*nx+ix];
    if(ix>0 && ix<nx-1 && iy>0 && iy<ny-1)
      x = x + D*(X[(iy-1)*nx + (ix+0)] + X[(iy+0)*nx + (ix-1)] +
                 X[(iy+0)*nx + (ix+1)] + X[(iy+1)*nx + (ix+0)] -
                 4.0f*x);
    Y[iy*nx+ix] = x;
  }
}

The outer if-statement can be removed if, in each dimension, the number of grid points is a multiple of the number of threads per block in that dimension.
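The launch configuration was not part of the session code; a minimal host-side sketch, assuming the NTX x NTY block size defined below and device buffers d_X and d_Y, could look like this. The grid is rounded up so that the overlapping blocks (offset NTX-2 per block) cover every point, and the bounds check above discards the excess threads.

// Hypothetical host-side launch of fdstep_gm (not shown in the session);
// d_X and d_Y are device buffers of nx*ny floats.
void step_gm(int nx, int ny, float D, float *d_X, float *d_Y)
{
  dim3 threads(NTX, NTY);
  dim3 blocks((nx + NTX - 3)/(NTX - 2),   // ceil(nx/(NTX-2))
              (ny + NTY - 3)/(NTY - 2));  // ceil(ny/(NTY-2))
  fdstep_gm<<<blocks, threads>>>(nx, ny, D, d_X, d_Y);
}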

Shared memory version

#define NTX 16
#define NTY 16

// 2 global memory accesses, 7 flops
__global__ static void
fdstep_shm(int nx,int ny,float D,float *X,float *Y) {
  const int ix = blockIdx.x*(NTX-2) + threadIdx.x;
  const int iy = blockIdx.y*(NTY-2) + threadIdx.y;
  const int tix = threadIdx.x, tiy = threadIdx.y;
  float x;

  __shared__ float Xs[NTY][NTX];

  // Each thread loads one point into shared memory; block-edge threads
  // provide the halo.
  x = X[iy*nx + ix];
  Xs[tiy][tix] = x;
  __syncthreads();

  // Threads interior to the block compute the stencil from shared memory.
  if(tix>0 && tix<NTX-1 && tiy>0 && tiy<NTY-1) {
    x = x + D*(Xs[tiy-1][tix+0] + Xs[tiy+0][tix-1] +
               Xs[tiy+0][tix+1] + Xs[tiy+1][tix+0] -
               4.0f*x);

    Y[iy*nx+ix] = x;
  }
  // Points on the global boundary are copied to the output unchanged.
  if(ix==0 || ix==nx-1 || iy==0 || iy==ny-1)
    Y[iy*nx+ix] = x;
}

It is assumed that the number of grid points in each dimension is a multiple of the number of threads per block in that dimension.
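The session notes do not include the host-side time loop; one possible sketch, which swaps the input and output buffers between steps (ping-pong), is given below. The grid here is sized so that the overlapping blocks tile the domain exactly, which corresponds to nx = NTX + kx*(NTX-2) and ny = NTY + ky*(NTY-2); other launch configurations are of course possible.

// Hypothetical driver for fdstep_shm (not from the session code).
// d_u0 and d_u1 are device buffers of nx*ny floats; the return value
// points at the buffer holding the solution after nsteps steps.
float *run_shm(int nx, int ny, float D, int nsteps, float *d_u0, float *d_u1)
{
  dim3 threads(NTX, NTY);
  dim3 blocks((nx - 2)/(NTX - 2), (ny - 2)/(NTY - 2));

  for (int step = 0; step < nsteps; step++) {
    fdstep_shm<<<blocks, threads>>>(nx, ny, D, d_u0, d_u1);
    float *tmp = d_u0; d_u0 = d_u1; d_u1 = tmp;   // swap input and output
  }
  cudaDeviceSynchronize();
  return d_u0;
}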

Multiple time step version

#define NTX 16
#define NTY 16
#define NSTEPS 4

__global__ static void
fdstep_multi(int nx,int ny,float D,float *X,float *Y) {
  const int ix = blockIdx.x*(NTX-2*NSTEPS) + threadIdx.x;
  const int iy = blockIdx.y*(NTY-2*NSTEPS) + threadIdx.y;
  const int tix = threadIdx.x, tiy = threadIdx.y;
  float x;
  int i;

  __shared__ float Xs[NTY][NTX];

  x = X[iy*nx + ix];
  Xs[tiy][tix] = x;

  // Take NSTEPS timesteps entirely in shared memory.
  for(i = 0; i<NSTEPS; i++) {
    __syncthreads();
    x = Xs[tiy][tix];
    if(tix>0 && tix<NTX-1 && tiy>0 && tiy<NTY-1)
      x = x + D*(Xs[tiy-1][tix+0] + Xs[tiy+0][tix-1] +
                 Xs[tiy+0][tix+1] + Xs[tiy+1][tix+0] -
                 4.0f*x);
    __syncthreads();
    Xs[tiy][tix] = x;
  }

  // Write out the points that were computed from valid data only, plus the
  // points near the global boundary handled by this block.
  if((tix>=NSTEPS && tix<NTX-NSTEPS && tiy>=NSTEPS && tiy<NTY-NSTEPS) ||
     (ix<NSTEPS || ix>=nx-NSTEPS || iy<NSTEPS || iy>=ny-NSTEPS))
    Y[iy*nx+ix] = x;
}

In this code, the boundary elements are copied unchanged to the output vector.
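The launch configuration for this kernel was likewise not given; a hedged sketch, sized so that blocks with a useful width of NTX-2*NSTEPS (and height NTY-2*NSTEPS) tile the grid exactly, could be the following. One call advances the solution NSTEPS timesteps; as before, repeated launches would swap d_X and d_Y.

// Hypothetical launch of fdstep_multi (not shown in the session).
// Assumes nx = NTX + kx*(NTX-2*NSTEPS) and ny = NTY + ky*(NTY-2*NSTEPS),
// so that the blocks tile the grid exactly; one call advances NSTEPS steps.
void step_multi(int nx, int ny, float D, float *d_X, float *d_Y)
{
  dim3 threads(NTX, NTY);
  dim3 blocks((nx - 2*NSTEPS)/(NTX - 2*NSTEPS),
              (ny - 2*NSTEPS)/(NTY - 2*NSTEPS));
  fdstep_multi<<<blocks, threads>>>(nx, ny, D, d_X, d_Y);
}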

Multiple timestep analysis

Assume that each thread block has p threads in each of D dimensions. Let the number of timesteps be s. The algorithm for a block of threads is as follows:

  1. Each of the p^D threads loads one data point into shared memory.
  2. All the threads perform s timesteps using only shared memory.
  3. The threads that hold interior points (points that have been computed using valid data) write out their results.
  • The number of global memory accesses is p^D reads and (p-2s)^D writes, so the total is p^D + (p-2s)^D.
  • The number of useful flops is a*s*(p-2s)^D, where a is the number of flops per grid point per timestep.

We want to maximize f(s) = a*s*(p-2*s)^D / (p^D + (p-2*s)^D). The picture below shows the case of the 2D heat equation timestep, with a=7 and p=32.

Image:multi-opt.png
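The curve in the figure can be reproduced with a few lines of host code; the sketch below simply tabulates f(s) for the parameters above (a = 7, p = 32, two dimensions) and is not part of the session material.

#include <stdio.h>
#include <math.h>

// Tabulate f(s) = a*s*(p-2s)^D / (p^D + (p-2s)^D), i.e. useful flops per
// global memory access, for the 2D heat equation stencil.
int main(void)
{
  const double a = 7.0;    // flops per grid point per timestep
  const double p = 32.0;   // threads per block in each dimension
  const int    D = 2;      // number of dimensions

  for (int s = 1; 2*s < p; s++) {
    double inner = pow(p - 2*s, D);              // (p-2s)^D valid outputs
    double f = a*s*inner / (pow(p, D) + inner);  // flops per memory access
    printf("s = %2d   f(s) = %6.2f\n", s, f);
  }
  return 0;
}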