Launching kernels

CUDA kernels should be defined in their own source files with a .cu file extension, and registered with the library in the same file as they are defined by using the SDP_CUDA_KERNEL macro. This is done to avoid the need to compile the calling code as well as the kernels with nvcc, since CUDA is an optional dependency.

Once registered, a kernel can be launched by calling sdp_launch_cuda_kernel(), providing the name of the kernel as a string, and an array for its function arguments. (This is essentially a thin wrapper around cudaLaunchKernel).

The following example registers two versions of a simple templated kernel:

#include "ska-sdp-func/utility/sdp_device_wrapper.h"

template<typename T>
__global__
void vector_add (
    const int64_t num_elements,
    const T *const __restrict__ input_a,
    const T *const __restrict__ input_b,
    T *__restrict__ output)
{
    const int64_t i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < num_elements)
    {
        output[i] = input_a[i] + input_b[i];
    }
}

SDP_CUDA_KERNEL(vector_add<float>)
SDP_CUDA_KERNEL(vector_add<double>)

Include the header “ska-sdp-func/utility/sdp_device_wrapper.h” to use these functions.

void sdp_launch_cuda_kernel(const char *name, const uint64_t num_blocks[3], const uint64_t num_threads[3], uint64_t shared_mem_bytes, sdp_CudaStream *stream, const void **args, sdp_Error *status)

Launches a CUDA kernel.

The kernel name must have been registered with the processing function library using the SDP_CUDA_KERNEL macro.

The triple-angle-bracket syntax for launching a kernel is an nvcc extension, so the calling code would need to be compiled with nvcc as well if this was used. This function provides an isolation layer, so that the kernels can be compiled separately when the library is built with CUDA support.

Parameters:

name – Name of the kernel to launch, as provided to SDP_CUDA_KERNEL.
num_blocks – Number of thread blocks in 3D.
num_threads – Number of threads per block in 3D.
shared_mem_bytes – Amount of dynamic shared memory required, in bytes.
stream – CUDA stream. A null pointer will use the default stream.
args – Array of pointers to kernel arguments.
status – Error status.

void sdp_cuda_set_device(int device)

Sets the CUDA device to use.

This is a wrapper for cudaSetDevice(). It exists to allow the processing function library to be compiled independently of CUDA, if required.

sdp_CudaStream *sdp_cuda_stream_create()

Creates a CUDA stream.

This is a wrapper for cudaStreamCreate(). It exists to allow the processing function library to be compiled independently of CUDA, if required.

void sdp_cuda_stream_destroy(sdp_CudaStream *stream)

Destroys a CUDA stream.

This is a wrapper for cudaStreamDestroy(). It exists to allow the processing function library to be compiled independently of CUDA, if required.

void sdp_cuda_stream_synchronize(sdp_CudaStream *stream)

Synchronize a CUDA stream.

Blocks until operations in a CUDA stream complete. This is a wrapper for cudaStreamSynchronize(). It exists to allow the processing function library to be compiled independently of CUDA, if required.

SDP_CUDA_KERNEL(...)

Registers a CUDA kernel with the processing function library.

This allows the kernel to be called without needing to compile host code with nvcc. It should be placed in the same source file as the kernel, after it has been defined.

The macro takes a single argument, which is simply the name of the kernel. (It is implemented as a variadic macro to allow for templated kernels that take multiple template parameters, where the commas between type names would otherwise cause problems.)