From NA-Wiki

Revision as of 11:12, 3 June 2008 by Tomaso (Talk | contribs)


First session

The major part of this session will be Erik Lindahl's introduction to streaming architectures. There will also be an outline of the course, and the expectations on the participants will be presented.

!! This page is still not in final form !!

Introductory example

Below you will find example code for a simple program that computes a vector addition using CUDA. It solves the following problem:

 Given two vectors a and b, both of length n,
 and a scalar gamma, compute the vector c
 according to:
    c = a + gamma*b

A routine written in C for solving this problem may look as follows:

static void
cpu_saxpy(int n,float gamma,float a[],float b[],float c[]) {
  int i;

  /* c = a + gamma*b, one element at a time */
  for(i = 0; i<n; i++)
    c[i] = a[i] + gamma*b[i];
}

In CUDA, we can think of having many processes available to do work, so each process does only part of the total work. An example of CUDA code solving the same problem as above follows below. In the code, np denotes the total number of processes (threads) working on the problem, and pid is the process id, which ranges from 0 to np-1. Here is the code:

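__global__ static void
gpu_saxpy(int n,float gamma,float a[],float b[],float c[]) {
  int np  = gridDim.x * blockDim.x;              /* total number of threads */
  int pid = blockIdx.x*blockDim.x + threadIdx.x; /* global thread id */
  int i;

  /* thread pid handles elements pid, pid+np, pid+2*np, ... */
  for(i = pid; i<n; i+=np)
    c[i] = a[i] + gamma*b[i];
}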

The __global__ qualifier tells the CUDA compiler that this routine is to be called from the CPU and executed on the graphics card. In CUDA, the threads are arranged into blocks. np and pid are calculated from built-in variables that hold the number of blocks (gridDim.x), the block id (blockIdx.x), the number of threads per block (blockDim.x) and the thread id within the block (threadIdx.x). The loop is set up so that each vector element is visited and computed exactly once, by exactly one thread.
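
To see why the strided loop touches every element exactly once, the loop can be replayed on the CPU, with an ordinary outer loop standing in for the threads. The following is only a sequential sketch in plain C, not CUDA code; the function name count_visits and the visits array are made up for this illustration.

```c
#include <assert.h>

/* Replays the loop of gpu_saxpy sequentially: the outer loop plays the
   role of the np threads, the inner loop is the loop each thread runs.
   On return, visits[i] holds how many times element i was touched.
   (count_visits is a hypothetical helper, not part of the course code.) */
void
count_visits(int n, int np, int visits[]) {
  int pid, i;

  assert(np > 0);
  for(i = 0; i<n; i++)
    visits[i] = 0;

  for(pid = 0; pid<np; pid++)   /* one iteration per simulated thread */
    for(i = pid; i<n; i+=np)    /* same loop as in gpu_saxpy */
      visits[i]++;
}
```

For example, with n = 10 and np = 4, thread 0 handles elements 0, 4 and 8, thread 1 handles 1, 5 and 9, thread 2 handles 2 and 6, and thread 3 handles 3 and 7, so every entry of visits ends up as 1. If np is larger than n, the threads with pid >= n simply do no work.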

To call a CUDA function, the following syntax is used:

 gpu_saxpy<<<NBLOCKS,NTHREADS>>>(n,gamma,a,b,c);

This indicates that the function gpu_saxpy should be run using NBLOCKS thread blocks of NTHREADS threads each.

The full code, which also contains memory allocation and timing of the CPU and graphics card versions of the vector addition, can be found here:

Suggested homework

There is no mandatory homework in this course, but to get credits, a project must be completed and presented. Below follows a set of optional homework exercises to get you started and to give you hands-on experience with CUDA and the kinds of problems that arise in massive multi-threading.

Homework 1a

The first task is to download the code above, compile it with nvcc, and run it. It can be run either in emulation mode on any Linux or Windows box with CUDA installed, or on actual GPU hardware on na46 or the dedicated graphics cluster.

To compile and run in emulated mode, you can do the following:

 na46> export PATH="$PATH:/opt/cuda/bin"
 na46> export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/opt/cuda/lib"
 na46> nvcc --device-emulation -o example1
 na46> ./example1

To run on the graphics card and enable optimization of the code that runs on the CPU, you can replace the compilation line by:

 na46> nvcc -O3 -o example1

Note: If the "export" command does not work, you are probably using a shell other than bash (e.g. tcsh or csh). You can then replace the lines with "export" by:

 na46> setenv PATH "${PATH}:/opt/cuda/bin"
 na46> setenv LD_LIBRARY_PATH "${LD_LIBRARY_PATH}:/opt/cuda/lib"

Homework 1b

The second task is to write a program yourself. It should sum all the elements of a vector using CUDA. Below follows a description of the complications you may run into and tips on how to solve them. Your solution does not have to use more than one block of threads.

You are welcome to use the vector addition code above as a template, and edit it to solve this task.

In this task, the threads will need to communicate. This can be done through shared memory and synchronization.

The idea is as follows: Just like in the last example, let each thread work on a few elements, and compute the sum of those, S.

Now each thread has its own S, and these need to be added together. Assuming we have a vector V with np elements that all threads can access, we can compute the sum of all S's as follows:

 assume the number of processes, np, is a power of 2.
 store local S into V(pid).
 synchronize all processes, so that the writes to the shared
 vector V are all completed.
 if pid < np/2, set V(pid) = V(pid) + V(pid+np/2)
 synchronize again
 if pid < np/4, set V(pid) = V(pid) + V(pid+np/4)
 synchronize again
 repeat until all S's are accumulated into V(0).
 let the first (pid=0) process store V(0)
 at the result location.
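
The steps above can be checked on the CPU before writing any CUDA code. The following is a minimal sequential model in plain C, not a CUDA solution: each loop over pid stands for what the np threads would do in parallel, and the end of each such loop stands for a synchronization point. The names sum_model and MAXNP are assumptions made for this sketch.

```c
#include <assert.h>

#define MAXNP 1024   /* assumed upper bound on np, chosen for this sketch */

/* Sequential model of the summation scheme described above.
   (sum_model and MAXNP are hypothetical names used for illustration.) */
float
sum_model(int n, const float x[], int np) {
  float V[MAXNP];    /* plays the role of the shared vector V */
  int pid, i, half;

  /* np must be a power of 2 and fit into V */
  assert(np > 0 && np <= MAXNP && (np & (np-1)) == 0);

  /* each "thread" sums its own elements into S and stores it in V(pid) */
  for(pid = 0; pid<np; pid++) {
    float S = 0.0f;
    for(i = pid; i<n; i+=np)
      S += x[i];
    V[pid] = S;
  }
  /* synchronization point */

  /* halve the number of active "threads" in every step */
  for(half = np/2; half>0; half/=2) {
    for(pid = 0; pid<half; pid++)
      V[pid] = V[pid] + V[pid+half];
    /* synchronization point */
  }

  return V[0];       /* thread 0 would store this at the result location */
}
```

With x = 1, 2, ..., 10 and np = 4, the partial sums are V = (15, 18, 10, 12). The first step gives V(0) = 25 and V(1) = 30, and the second step gives V(0) = 55. In an actual CUDA kernel, V would be declared in shared memory with __shared__, and each synchronization point would be a call to __syncthreads().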

Below is a link to a pdf-file with slides showing how to solve and optimize this problem. In the pdf, the description of this example starts on page 45, with the first code example on page 50. The first 45 pages explain some of the architecture of CUDA, and contain two other example applications. It is not necessary to read those for this task. For this task, it is enough to create a working program; no optimization is necessary. It should be able to handle vectors of arbitrary size, but need only use one block of threads (which is easier, since different blocks cannot communicate through shared memory). The easiest route is probably to take the vector addition example, and then add code from the code example on page 50 in the pdf.

 * Here is the link: CUDA_Optimization

If you have any questions, please send an e-mail. Good luck!

CUDA documentation on na46

If you log on to na46, you will find the CUDA reference manual and the nvcc (compiler) manual, as well as documentation for the FFT and BLAS libraries, in the directory /opt/cuda/doc.
