From NA-Wiki

Revision as of 12:06, 12 June 2008 by Tomaso (Talk | contribs)


Second session

This meeting covers the execution model in CUDA, gives an introduction to the software development kit, and looks at what libraries are available.

Execution model in CUDA

Software model

  • A kernel is a computational routine to be run on the graphics card.
  • The kernel is executed by threads, typically thousands.
  • The threads are divided into thread blocks.
  • Threads in the same block can synchronize and communicate through shared memory.
  • Different blocks must be independent of each other; they cannot communicate or synchronize.
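
The points above can be sketched in a small kernel. This is only an illustration, not code from the lecture: the kernel name blockSum, the block size of 256 and the input size are assumptions chosen for the example. Each block sums its own slice of the input using shared memory and __syncthreads(); the blocks never talk to each other, so the per-block partial sums are combined on the CPU afterwards.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed block size; must be a power of two here

// Each block reduces its own slice of `in` to one partial sum.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float cache[THREADS_PER_BLOCK];   // shared by the threads of one block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    cache[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                             // wait until the whole block has loaded

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];          // one result per block
}

int main()
{
    const int n = 100000;
    const int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_partial = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch `blocks` independent thread blocks.
    blockSum<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_partial, n);

    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Blocks cannot synchronize with each other, so combine their results on the CPU.
    double total = 0.0;
    for (int b = 0; b < blocks; ++b) total += h_partial[b];
    printf("sum = %.1f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_partial);
    free(h_in); free(h_partial);
    return 0;
}
```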

nVidia hardware

  • A graphics card typically has 2-16 multiprocessors.
  • All threads in a block execute concurrently on the same multiprocessor.
  • A multiprocessor may run more than one block.
  • Different blocks of the same kernel may execute on different multiprocessors.
  • Each multiprocessor has 8192 registers and 16 KB of shared memory.
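
Since all threads of a block run on one multiprocessor, the register and shared memory figures above limit how many blocks can be resident at once. The following sketch does that arithmetic for a hypothetical kernel (256 threads per block, 10 registers per thread, 4 KB of shared memory per block; all three numbers are assumptions). It deliberately ignores other hardware limits, such as the maximum number of threads per multiprocessor, so treat the result as an upper bound. It also prints what the installed card reports through cudaGetDeviceProperties.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on resident blocks per multiprocessor, considering only
// registers and shared memory (other hardware limits are ignored).
int blocksPerMultiprocessor(int regsAvailable, int sharedAvailable,
                            int threadsPerBlock, int regsPerThread,
                            int sharedPerBlock)
{
    int byRegisters = regsAvailable / (threadsPerBlock * regsPerThread);
    int byShared    = sharedAvailable / sharedPerBlock;
    return std::min(byRegisters, byShared);
}

int main()
{
    // Hypothetical kernel resource usage (assumed for this example).
    const int threadsPerBlock = 256;
    const int regsPerThread   = 10;
    const int sharedPerBlock  = 4096;   // bytes

    // Figures quoted above: 8192 registers and 16 KB shared memory.
    int blocks = blocksPerMultiprocessor(8192, 16 * 1024, threadsPerBlock,
                                         regsPerThread, sharedPerBlock);
    printf("resident blocks per multiprocessor (upper bound): %d\n", blocks);

    // What the installed card reports for device 0.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("%s: %d multiprocessors, %d registers per block, "
               "%zu bytes shared memory per block\n",
               prop.name, prop.multiProcessorCount,
               prop.regsPerBlock, prop.sharedMemPerBlock);
    }
    return 0;
}
```

With these assumed numbers the registers are the tighter limit: 8192 / (256 * 10) gives 3 blocks, while shared memory alone would allow 4.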

CUDA software development kit

Available libraries


CUBLAS contains an implementation of a subset of the BLAS level 1, 2, and 3 routines. It can be used in two ways. The first is as a transparent package, where data is copied to the graphics card and the result is copied back in each call. This is the simplest way of changing an existing BLAS program to use the GPU. However, the copying incurs a lot of overhead and will hurt performance except for large matrix-matrix operations (e.g. matrix-matrix multiplication, GEMM), where the amount of computation grows faster than the amount of data transferred.

The second is to allocate memory on the card, copy data there, and have CUBLAS operate on the data already on the card. If many BLAS calls are done before the data has to be transferred back to the CPU, one can expect good performance.
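
A minimal sketch of the second approach follows. Note the assumption: it uses the handle-based cublas_v2.h interface from later CUDA releases, which differs from the CUBLAS interface that was current when this page was written. The matrix (2 times the identity), the size and the iteration count are made up for the example. The point is only that the matrix and vector are copied to the card once, ten matrix-vector products run on the card, and a single vector is copied back.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>   // assumption: the newer handle-based CUBLAS interface

int main()
{
    const int n = 512;
    const int iterations = 10;

    // Host data: A = 2*I (column-major), x = all ones.
    float *h_A = (float *)calloc((size_t)n * n, sizeof(float));
    float *h_x = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        h_A[i + (size_t)i * n] = 2.0f;
        h_x[i] = 1.0f;
    }

    float *d_A, *d_x, *d_y;
    cudaMalloc((void **)&d_A, (size_t)n * n * sizeof(float));
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Copy the data to the card once.
    cublasSetMatrix(n, n, sizeof(float), h_A, n, d_A, n);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

    // Repeated y = A*x on the card, with no host transfers in between.
    const float alpha = 1.0f, beta = 0.0f;
    for (int it = 0; it < iterations; ++it) {
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, d_A, n,
                    d_x, 1, &beta, d_y, 1);
        float *tmp = d_x; d_x = d_y; d_y = tmp;   // result becomes next input
    }

    // Copy the final vector back once.
    cublasGetVector(n, sizeof(float), d_x, 1, h_x, 1);
    printf("x[0] = %.1f (expected 1024.0 = 2^10)\n", h_x[0]);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);
    free(h_A); free(h_x);
    return 0;
}
```

As with the example further down, this needs to be linked with -lcublas.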

CUFFT is an FFT library that uses the graphics card. It is very fast, and supports 1D, 2D, and 3D transforms. It uses an interface similar to FFTW. The documentation can be found in /opt/cuda/doc on na46, or on nVidia's website.

Example: Using CUBLAS and CUFFT is fairly straightforward. Below is a link to an example program that computes a 2D FFT of a matrix, inverts the transform and compares the result to the original matrix. The program uses both CUFFT and CUBLAS.

It can be compiled and run with:

computer> nvcc -o fft-test -lcublas -lcufft
computer> ./fft-test

Here is some sample output:

computer> ./fft-test
* 2D fourier transform and inverse transform of 128x128 matrix
norm = 1.04531e+02      residual = 7.91525e-05      relative error = 7.57218e-07
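
For readers without access to the linked program, here is a rough sketch of the same round trip. It is not the program from the lecture: the test data is made up, and the norm and residual are computed on the CPU rather than with CUBLAS. CUFFT transforms are unnormalized, so the result of forward plus inverse has to be divided by the number of elements before comparing.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int n = 128;                    // n x n matrix, as in the sample output
    const int count = n * n;
    const size_t bytes = count * sizeof(cufftComplex);

    // Made-up test data (an assumption, not the lecture's input).
    cufftComplex *h_in  = (cufftComplex *)malloc(bytes);
    cufftComplex *h_out = (cufftComplex *)malloc(bytes);
    for (int i = 0; i < count; ++i) {
        h_in[i].x = sinf(0.01f * i);
        h_in[i].y = 0.0f;
    }

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    if (cufftPlan2d(&plan, n, n, CUFFT_C2C) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan2d failed\n");
        return 1;
    }

    // In-place forward transform followed by the inverse transform.
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);

    cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);

    // Scale by 1/(n*n) and compare with the original on the CPU.
    double norm2 = 0.0, res2 = 0.0;
    for (int i = 0; i < count; ++i) {
        double re = h_out[i].x / count, im = h_out[i].y / count;
        double dr = re - h_in[i].x, di = im - h_in[i].y;
        norm2 += (double)h_in[i].x * h_in[i].x + (double)h_in[i].y * h_in[i].y;
        res2  += dr * dr + di * di;
    }
    printf("norm = %e  residual = %e  relative error = %e\n",
           sqrt(norm2), sqrt(res2), sqrt(res2 / norm2));

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_in); free(h_out);
    return 0;
}
```

The numbers printed will differ from the sample output above, since the input data is different. Link with -lcufft.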


CUDPP stands for CUDA Data Parallel Primitives. It is a C++ library with routines for reduction (e.g. summing or finding the maximum of all elements in a vector), sorting, and sparse matrix-vector multiplication.
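
To make the last of these primitives concrete, the sketch below shows what a sparse matrix-vector multiplication computes. It is a hand-written kernel for a matrix stored in compressed sparse row (CSR) form with one thread per row, and it does not use the CUDPP interface or claim to match CUDPP's implementation. The kernel name spmvCsr and the 3x3 matrix are assumptions for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y = A*x for a CSR matrix, one thread per row (illustrative, not CUDPP code).
__global__ void spmvCsr(int rows, const int *rowPtr, const int *colIdx,
                        const float *val, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows) {
        float sum = 0.0f;
        // Nonzeros of this row are val[rowPtr[row]] .. val[rowPtr[row+1]-1].
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
            sum += val[k] * x[colIdx[k]];
        y[row] = sum;
    }
}

int main()
{
    // Example matrix:  [ 2 0 1 ]
    //                  [ 0 3 0 ]
    //                  [ 4 0 5 ]
    const int rows = 3, nnz = 5;
    int   h_rowPtr[rows + 1] = {0, 2, 3, 5};
    int   h_colIdx[nnz]      = {0, 2, 1, 0, 2};
    float h_val[nnz]         = {2.0f, 1.0f, 3.0f, 4.0f, 5.0f};
    float h_x[rows]          = {1.0f, 1.0f, 1.0f};
    float h_y[rows];

    int *d_rowPtr, *d_colIdx;
    float *d_val, *d_x, *d_y;
    cudaMalloc((void **)&d_rowPtr, sizeof(h_rowPtr));
    cudaMalloc((void **)&d_colIdx, sizeof(h_colIdx));
    cudaMalloc((void **)&d_val, sizeof(h_val));
    cudaMalloc((void **)&d_x, sizeof(h_x));
    cudaMalloc((void **)&d_y, sizeof(h_y));
    cudaMemcpy(d_rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colIdx, h_colIdx, sizeof(h_colIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    spmvCsr<<<1, 32>>>(rows, d_rowPtr, d_colIdx, d_val, d_x, d_y);

    cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);
    printf("y = [%.1f, %.1f, %.1f]\n", h_y[0], h_y[1], h_y[2]);  // row sums: 3, 3, 9

    cudaFree(d_rowPtr); cudaFree(d_colIdx);
    cudaFree(d_val); cudaFree(d_x); cudaFree(d_y);
    return 0;
}
```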

It can be downloaded from, under the developer section. A direct link is here: CUDPP homepage

To install it under Linux, download the .tar.gz file, and do the following:

# Unpack
  tar xzf cudpp_1.0a.tar.gz
  cd cudpp_1.0a

# Compile; change /opt/cuda to wherever CUDA
# is installed (it is in /opt/cuda on na46)
  echo "The make process takes some time, and uses a lot of memory."
  echo "Please be patient..."
  cd common ; make cuda-install=/opt/cuda
  cd ..
  cd cudpp ; make cuda-install=/opt/cuda
  cd ..

# Build test program
  cd apps/cudpp_testrig ; make cuda-install=/opt/cuda
  cd ../..
  cd bin/linux/release

# Try some tests...
  ./cudpp_testrig --scan --iterations=100 --n=100000

Note that the make process takes some time and uses a lot of memory, so please be patient!

The test program shown in the lecture is here: If the cudpp_1.0a directory is in the same directory as, you can compile and run with:

computer> nvcc -I cudpp_1.0a/cudpp/include -o sptest -L cudpp_1.0a/lib -lcudpp
computer> ./sptest

The test program uses a fairly large matrix (10000x10000), and the setup time is significant. A run takes roughly 10 seconds. Here is some sample output:

computer> time ./sptest
(CPU) flops = 1.651e+08, time per iteration = 12.114ms
      flops = 8.686e+08, time per iteration = 2.303ms

real    0m8.176s
user    0m6.820s
sys     0m1.329s