GPU Porting in Axom

Axom uses the following two libraries as the main workhorses for GPU porting:

  • RAJA: Handles software abstractions that enable architecture and programming model portability for HPC applications.

  • Umpire: Resource management library that allows the discovery, provision, and management of memory on machines with multiple memory devices like NUMA and GPUs.

From RAJA and Umpire, Axom derives a set of convenience macros and function wrappers in the axom namespace that encapsulate commonly used RAJA and Umpire functions, along with preset execution spaces for host/device execution.

For the user’s guide on using GPU utilities, see also Core Acceleration.

Macros

Axom’s macros can be found in the file axom/core/Macros.hpp.

Most of the GPU-related macros are used to guard device code for compilation.

For guarding device code:

/*!
 * \def AXOM_USE_GPU
 *
 * \brief Convenience macro used for GPU-enabled checks
 *
 * \note AXOM_USE_CUDA is defined if Axom is built with CUDA.
 *       AXOM_USE_HIP is defined if Axom is built with HIP.
 */
#if defined(AXOM_USE_CUDA) || defined(AXOM_USE_HIP)
  #define AXOM_USE_GPU
#endif

/*!
 * \def AXOM_DEVICE_CODE
 *
 * \brief Convenience macro used for kernel code
 */
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
  #define AXOM_DEVICE_CODE
#endif

/*!
 * \def AXOM_GPUCC
 *
 * \brief Convenience macro for compiling CUDA/HIP source files
 */
#if defined(__CUDACC__) || defined(__HIPCC__)
  #define AXOM_GPUCC
#endif

Note

  • Functions called in CUDA or HIP GPU device code require the __device__ annotation.

  • Functions that will be called in device code and CPU host code require the __host__ __device__ annotation.

The following code shows the macros used in Axom to apply these annotations:

/*!
 * \def AXOM_DEVICE
 * \def AXOM_HOST_DEVICE
 *
 * \brief CUDA or HIP host/device macros for decorating functions/lambdas
 *
 * \note These will expand to the corresponding CUDA/HIP decorations when
 *  compiled with -DAXOM_USE_CUDA or -DAXOM_USE_HIP
 */
#if defined(__CUDACC__) || defined(__HIPCC__)
  #define AXOM_DEVICE __device__
  #define AXOM_HOST_DEVICE __host__ __device__
  #define AXOM_HOST __host__
#else
  #define AXOM_DEVICE
  #define AXOM_HOST_DEVICE
  #define AXOM_HOST
#endif

/*!
 * \def AXOM_LAMBDA
 *
 * \brief Convenience macro used for lambda capture by value.
 * \note When CUDA or HIP is used, the macro always expands to a host/device lambda.
 */
#if defined(AXOM_USE_CUDA) || defined(AXOM_USE_HIP)
  #define AXOM_LAMBDA [=] AXOM_HOST_DEVICE
  #define AXOM_DEVICE_LAMBDA [=] AXOM_DEVICE
  #define AXOM_HOST_LAMBDA [=] AXOM_HOST
#else
  #define AXOM_LAMBDA [=]
  #define AXOM_DEVICE_LAMBDA [=]
  #define AXOM_HOST_LAMBDA [=]
#endif
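
In a host-only build these lambda macros reduce to a plain capture-by-value lambda. The following self-contained sketch reproduces the #else branch of the macro definitions and exercises it through a hypothetical helper function (scaled is not an Axom function; it exists only for illustration):

```cpp
// Host-only expansions, matching the #else branch of the macros above
#define AXOM_HOST_DEVICE
#define AXOM_LAMBDA [=]

// In a host-only build, a kernel body written with AXOM_LAMBDA behaves
// exactly like an ordinary capture-by-value lambda
inline int scaled(int i)
{
  const int scale = 3;
  auto kernel = AXOM_LAMBDA(int idx) { return scale * idx; };
  return kernel(i);
}
```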

Below is a function that uses Axom macros to apply a __host__ __device__ annotation and to guard the use of CUDA and HIP intrinsics so they are only compiled in device code:

/*!
 * \brief Counts the number of leading zeros in \a word
 * \accelerated
 * \return The number of zero bits to the left of the most significant set bit
 * in \a word, or 32 if \a word is zero.
 */
AXOM_HOST_DEVICE inline std::int32_t countl_zero(std::int32_t word) noexcept
{
  /* clang-format off */
#if defined(__CUDA_ARCH__) && defined(AXOM_USE_CUDA)
  // Use CUDA intrinsic for count leading zeros
  return __clz(word);
#elif defined(__HIP_DEVICE_COMPILE__) && defined(AXOM_USE_HIP)
  // Use HIP intrinsic for count leading zeros
  return __clz(word);
#elif defined(_AXOM_CORE_USE_INTRINSICS_MSVC)
  unsigned long cnt;
  return _BitScanReverse(&cnt, word) ? 31 - cnt : 32;
#elif defined(_AXOM_CORE_USE_INTRINSICS_GCC) || defined(_AXOM_CORE_USE_INTRINSICS_PPC)
  return word != std::int32_t(0) ? __builtin_clz(word) : 32;
#else
  std::int32_t y {};
  std::int32_t n = 32;
  y = word >> 16; if(y != 0) { n -= 16; word = y;}
  y = word >>  8; if(y != 0) { n -=  8; word = y;}
  y = word >>  4; if(y != 0) { n -=  4; word = y;}
  y = word >>  2; if(y != 0) { n -=  2; word = y;}
  y = word >>  1; if(y != 0) { return std::int32_t(n - 2); }
  return std::int32_t(n - word);
#endif
  /* clang-format on */
}
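
The portable fallback branch needs no intrinsics or device support, so it can be exercised on any host compiler. A self-contained copy of just that path (the fallback name is hypothetical; in Axom the logic lives inside countl_zero itself):

```cpp
#include <cstdint>

// Portable software implementation of count-leading-zeros: the final
// #else branch of countl_zero above, with no intrinsics required.
inline std::int32_t countl_zero_fallback(std::int32_t word) noexcept
{
  std::int32_t y {};
  std::int32_t n = 32;
  // Binary search for the most significant set bit, narrowing the
  // candidate range by 16, 8, 4, 2, then 1 bits
  y = word >> 16; if(y != 0) { n -= 16; word = y; }
  y = word >>  8; if(y != 0) { n -=  8; word = y; }
  y = word >>  4; if(y != 0) { n -=  4; word = y; }
  y = word >>  2; if(y != 0) { n -=  2; word = y; }
  y = word >>  1; if(y != 0) { return std::int32_t(n - 2); }
  return std::int32_t(n - word);
}
```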

Memory

Axom’s memory management routines can be found in the file axom/core/memory_management.hpp.

Memory Management Routines

Umpire has the concept of “allocators” associated with each memory resource type (umpire::resource::MemoryResourceType).

To allocate memory on a particular resource, you use the ID for the allocator associated with the umpire::resource::MemoryResourceType.

You can set a default allocator; unless otherwise specified, all memory allocations then go to the resource associated with that allocator:

/// \name Memory Management Routines
/// @{

#ifdef AXOM_USE_UMPIRE

/*!
 * \brief Returns the ID of the predefined allocator for a given resource.
 * \param [in] resource_type the Umpire resource type
 * \return ID the id of the predefined umpire allocator.
 */
inline int getUmpireResourceAllocatorID(
  umpire::resource::MemoryResourceType resource_type)
{
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator alloc = rm.getAllocator(resource_type);
  return alloc.getId();
}

/*!
 * \brief Sets the default memory allocator to use.
 * \param [in] resource_type the Umpire resource type
 */
inline void setDefaultAllocator(umpire::resource::MemoryResourceType resource_type)
{
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator allocator = rm.getAllocator(resource_type);
  rm.setDefaultAllocator(allocator);
}

#endif

/*!
 * \brief Sets the default memory allocator to use.
 * \param [in] allocId the Umpire allocator id
 * 
 * \note This function has no effect when Axom is not compiled with Umpire.
 */
inline void setDefaultAllocator(int allocId)
{
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator allocator = rm.getAllocator(allocId);
  rm.setDefaultAllocator(allocator);
#else
  AXOM_UNUSED_VAR(allocId);
#endif
}

/*!
 * \brief Returns the ID of the current default allocator.
 * \return ID the ID of the current default allocator.
 * \post ID != INVALID_ALLOCATOR_ID
 */
inline int getDefaultAllocatorID()
{
#ifdef AXOM_USE_UMPIRE
  return umpire::ResourceManager::getInstance().getDefaultAllocator().getId();
#else
  return 0;
#endif
}

/*!
 * \brief Allocates a chunk of memory of type T.
 *
 * \param [in] n the number of elements to allocate.
 * \param [in] allocID the Umpire allocator to use (optional)
 *
 * \tparam T the type of pointer returned.
 *
 * \note By default allocate() will use the current default allocator. The
 *  caller may explicitly specify a different allocator to use by supplying the
 *  second, optional argument, or change the default allocator by calling
 *  axom::setDefaultAllocator().
 *
 * \return p pointer to the new allocation or a nullptr if allocation failed.
 */
template <typename T>
inline T* allocate(std::size_t n, int allocID = getDefaultAllocatorID()) noexcept;

/*!
 * \brief Frees the chunk of memory pointed to by the supplied pointer, p.
 * \param [in/out] p a pointer to memory allocated with allocate/reallocate or a
 * nullptr.
 * \post p == nullptr
 */
template <typename T>
inline void deallocate(T*& p) noexcept;

/*!
 * \brief Reallocates the chunk of memory pointed to by the supplied pointer.
 *
 * \param [in] p pointer to memory allocated with allocate/reallocate, or a
 * nullptr.
 * \param [in] n the number of elements to allocate.
 * \param [in] allocID the ID of the allocator to use if pointer is null
 * (optional)
 *
 * \tparam T the type pointer p points to.
 *
 * \return p pointer to the new allocation or a nullptr if allocation failed.
 *
 * \note When n == 0, this function returns a valid pointer (of size 0) in the
 * current allocator's memory space. This follows the semantics of
 * Umpire's reallocate function.
 * \note When p is a null pointer, allocID is used to allocate the data.
 * Otherwise, it is unused.
 */
template <typename T>
inline T* reallocate(T* p,
                     std::size_t n,
                     int allocID = getDefaultAllocatorID()) noexcept;

/*!
 * \brief Copies memory from the source to the destination.
 *
 * \param [in/out] dst the destination to copy to.
 * \param [in] src the source to copy from.
 * \param [in] numbytes the number of bytes to copy.
 *
 * \note When using Umpire if either src or dst is not registered with the
 *  ResourceManager then the default host allocation strategy is assumed for
 *  that pointer.
 */
inline void copy(void* dst, const void* src, std::size_t numbytes) noexcept;

/// @}

Note

When Axom is built without Umpire, the Umpire-specific getters and setters shown above become no-ops or are compiled out entirely, while the memory routines fall back to C++ standard library functions that allocate only on the host (CPU):

  • axom::allocate calls std::malloc

  • axom::deallocate calls std::free

  • axom::reallocate calls std::realloc

  • axom::copy calls std::memcpy
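
These fallbacks can be sketched directly in terms of the C standard library. The "_fallback" names below are hypothetical stand-ins for the real axom:: routines in memory_management.hpp:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Host-only sketch of axom::allocate's non-Umpire path
template <typename T>
T* allocate_fallback(std::size_t n) noexcept
{
  return static_cast<T*>(std::malloc(n * sizeof(T)));
}

// Host-only sketch of axom::deallocate's non-Umpire path
template <typename T>
void deallocate_fallback(T*& p) noexcept
{
  std::free(p);
  p = nullptr;  // matches the documented post-condition p == nullptr
}

// Host-only sketch of axom::copy's non-Umpire path
inline void copy_fallback(void* dst, const void* src, std::size_t numbytes) noexcept
{
  std::memcpy(dst, src, numbytes);
}
```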

MemorySpace

/*! 
 * \brief Memory spaces supported by Array-like types
 *
 * This abstraction is not implemented using Umpire's MemoryResourceType enum
 * in order to also include a "Dynamic" option as a default template parameter
 * for Array-like types
 */
enum class MemorySpace
{
  Dynamic,
#ifdef AXOM_USE_UMPIRE
  Host,
  Device,
  Unified,
  Pinned,
  Constant
#endif
};

Axom provides the axom::MemorySpace enum type to define values indicating the memory space where data in axom::Array and axom::ArrayView lives.

Dynamic lets you select the memory space at runtime, with some caveats (see Core Containers for more details and examples).

Kernels

axom::for_all

axom::for_all can be found in the file axom/core/execution/for_all.hpp.

axom::for_all is a wrapper around RAJA::forall, which is used to execute simple for-loop kernels.

This is used in Axom to execute for-loop style kernels that run on a GPU device, or on both a GPU device and a CPU host. It has the following two overloads:

template <typename ExecSpace, typename KernelType>
void axom::for_all(const IndexType& N, KernelType&& kernel)

template <typename ExecSpace, typename KernelType>
void axom::for_all(const IndexType& begin, const IndexType& end, KernelType&& kernel)

Note

When Axom is built without RAJA, axom::for_all becomes a for-loop on host (CPU).
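
That fallback behavior can be sketched as a plain serial loop. The names below (for_all_serial, sum_of_squares) are hypothetical and simplified; the real wrapper is templated on an execution space and uses axom::IndexType:

```cpp
// Serial sketch of axom::for_all's no-RAJA fallback: just a for loop
// invoking the kernel once per index in [begin, end)
template <typename KernelType>
void for_all_serial(int begin, int end, KernelType&& kernel)
{
  for(int i = begin; i < end; ++i)
  {
    kernel(i);
  }
}

// Example usage: accumulate 0^2 + 1^2 + ... + (n-1)^2
inline int sum_of_squares(int n)
{
  int sum = 0;
  for_all_serial(0, n, [&](int i) { sum += i * i; });
  return sum;
}
```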

RAJA::kernel

RAJA::kernel is used to execute kernels implemented using nested loops.

It is used infrequently in Axom, appearing mainly in a few unit tests.

Your general go-to will be axom::for_all.

Useful Links

RAJA Loops - Covers RAJA::forall, RAJA::kernel, RAJA::launch kernel execution methods.

Execution Spaces & Policies

Axom’s execution spaces can be found in the file axom/core/execution/execution_space.hpp.

Axom’s execution spaces are derived from an axom::execution_space<ExecSpace> traits class containing RAJA execution policies and default Umpire memory allocators associated with each space.

Axom currently supports four execution spaces, each one a type with its own specialization of the execution_space traits class:

  • SEQ_EXEC - Sequential execution policies on host

  • OMP_EXEC - OpenMP execution policies on host

  • CUDA_EXEC - CUDA execution policies in Unified Memory (host + device)

  • HIP_EXEC - HIP execution policies in Unified Memory (host + device)

Additionally, the HIP_EXEC and CUDA_EXEC types are templated on the number of threads per block and on SYNCHRONOUS or ASYNC execution:

/*!
 * \brief Indicates parallel execution on the GPU with CUDA.
 *
 * \tparam BLOCK_SIZE the number of CUDA threads in a block.
 * \tparam ExecutionMode indicates synchronous or asynchronous execution.
 */
template <int BLOCK_SIZE, ExecutionMode EXEC_MODE = SYNCHRONOUS>
struct CUDA_EXEC
{ };

Each execution space provides:

  • Axom policies that are type aliases of RAJA policies to be used with kernels, RAJA types, and RAJA operations

    • loop_policy - For RAJA scans and other operations; axom::for_all uses the loop_policy from the templated execution space.

    • reduce_policy - For RAJA reduction types that perform reduction operations:

      using reduce_pol = typename axom::execution_space<ExecSpace>::reduce_policy;
      RAJA::ReduceSum<reduce_pol, axom::IndexType> totalSum(0);
    
      // Sum integers [0,99]
      axom::for_all<ExecSpace>(
        100,
        AXOM_LAMBDA(axom::IndexType i) { totalSum += i; });
    
      std::cout << "\nTotal Reduction Sum ("
                << axom::execution_space<ExecSpace>::name()
                << ") :" << totalSum.get() << std::endl;
    
    • atomic_policy - For RAJA atomic operations, which safely update a shared value from concurrent threads:

      using atomic_pol = typename axom::execution_space<ExecSpace>::atomic_policy;
    
      int *sum = axom::allocate<int>(1, allocator_id);
      *sum = 0;
    
      // Increment sum 100 times
      axom::for_all<ExecSpace>(
        100,
        AXOM_LAMBDA(axom::IndexType) { RAJA::atomicAdd<atomic_pol>(sum, 1); });
    
      std::cout << "\nTotal Atomic Sum (" << axom::execution_space<ExecSpace>::name()
                << ") :" << sum[0] << std::endl;
    
      axom::deallocate(sum);
    
    • sync_policy - For Axom’s synchronize function, which is a wrapper around RAJA::synchronize(). Synchronizes execution threads when using an asynchronous loop_policy:

    /*!
     * \brief Synchronizes all execution threads when using an ASYNC policy with
     *  the specified execution space.
     *
     * \tparam ExecSpace the execution space
     */
    template <typename ExecSpace>
    inline void synchronize() noexcept
    {
      AXOM_STATIC_ASSERT(execution_space<ExecSpace>::valid());
    
    #ifdef AXOM_USE_RAJA
      using sync_policy = typename execution_space<ExecSpace>::sync_policy;
      RAJA::synchronize<sync_policy>();
    #endif
    }
    
  • Umpire allocator defaults

    • memory_space - The memory space abstraction for use by Core Containers like axom::Array.

    • allocatorID() - Gets the allocator ID for the Umpire resource to use in this execution space.

  • General information on the execution space

    • name() - Name of the execution space

    • onDevice() - Is the execution space on device? (True/False)

    • valid() - Is the execution space valid? (True)

    • async() - Is the execution space asynchronous? (True/False)
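
The traits-class pattern behind execution_space<ExecSpace> can be sketched in plain C++. This is a simplified stand-in, not Axom's actual definition: the real class also carries the RAJA policy aliases and Umpire allocator IDs described above:

```cpp
// Tag type standing in for one of Axom's execution spaces
struct SEQ_EXEC { };

// Primary template is declared but never defined, so using an
// unsupported execution space fails at compile time
template <typename ExecSpace>
struct execution_space;

// Each supported space supplies its own specialization with the
// per-space queries
template <>
struct execution_space<SEQ_EXEC>
{
  static const char* name() { return "[SEQ_EXEC]"; }
  static bool onDevice() { return false; }
  static bool valid() { return true; }
  static bool async() { return false; }
};
```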

The Mint component also provides a set of nested execution policies, located in axom/mint/execution/internal/structured_exec.hpp, to be used with RAJA::kernel, e.g., when iterating over Mint meshes.

Note

When Axom is built without RAJA, only SEQ_EXEC is available for host (CPU) execution. When Axom is built with RAJA but without Umpire for memory management on device, only SEQ_EXEC and OMP_EXEC are available for host (CPU) execution.

General, Rough Porting Tips

  • Start by figuring out what memory you need on device, and use axom::Array, axom::ArrayView, and the memory_management routines to do the allocations:

    // Allocate 100 2D Triangles in unified memory
    using cuda_exec = axom::CUDA_EXEC<256>;
    using TriangleType = axom::primal::Triangle<double, 2>;
    axom::Array<TriangleType> tris(100, axom::execution_space<cuda_exec>::allocatorID());
    axom::ArrayView<TriangleType> tris_view(tris);
    
    // Allocate the sum of Triangle areas
    using reduce_pol = typename axom::execution_space<cuda_exec>::reduce_policy;
    RAJA::ReduceSum<reduce_pol, double> totalArea(0);
    
  • Using an axom::for_all kernel with a device policy, attempt to access and/or manipulate the memory on device:

    axom::for_all<cuda_exec>(
      100,
      AXOM_LAMBDA(int idx) {
        // Set values on device
        tris_view[idx] = TriangleType();
        totalArea += 0;
      });
    
  • Add the functions you want to call on device to the axom::for_all kernel:

    axom::for_all<cuda_exec>(
      100,
      AXOM_LAMBDA(int idx) {
        tris_view[idx] = TriangleType();
        totalArea += 0;

        // Call area() method on device
        double area = tris_view[idx].area();
      });
    
  • Apply a __host__ __device__ annotation to your functions if you see the following error or similar:

    error: reference to __host__ function 'area' in __host__ __device__ function
    
    • Recompiling will likely surface complaints about more functions, namely the undecorated functions that your newly decorated functions call:

      error: reference to __host__ function 'abs<double>' in __host__ __device__ function
      
      error: reference to __host__ function 'signedArea<2>' in __host__ __device__ function
      

      Keep decorating until all the complaints are gone.

    • Most of the C++ standard library is not available on device. Your options are to use Axom's equivalent functions or classes where they exist, to add your own, or to rewrite the code to avoid the standard library.

  • With no more decorating complaints from the compiler, write the logically correct kernel:

    // Computes the total area of 100 triangles
    axom::for_all<cuda_exec>(
      100,
      AXOM_LAMBDA(int idx) {
        totalArea += tris_view[idx].area();
      });
    
    • If at this point your kernel is not working or is segfaulting, the cause is hopefully a logic error that you can debug without diving into device debugging tools.

    • Use printf() for debugging output; std::cout is not available in device code

    • Try using the SEQ_EXEC execution space
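
On host, without RAJA or Umpire, the steps above collapse to a plain serial reduction. The sketch below uses a minimal hand-written Triangle stand-in rather than axom::primal::Triangle, and a plain accumulator in place of RAJA::ReduceSum:

```cpp
#include <cmath>

// Minimal stand-in for axom::primal::Triangle<double, 2>
struct Triangle
{
  double x[3], y[3];

  // Area of a 2D triangle via the shoelace formula
  double area() const
  {
    return 0.5 * std::fabs((x[1] - x[0]) * (y[2] - y[0]) -
                           (x[2] - x[0]) * (y[1] - y[0]));
  }
};

// Serial analogue of the reduction kernel from the tips above;
// totalArea plays the role of the RAJA::ReduceSum variable
inline double total_area(const Triangle* tris, int n)
{
  double totalArea = 0.0;
  for(int idx = 0; idx < n; ++idx)
  {
    totalArea += tris[idx].area();
  }
  return totalArea;
}
```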

Useful Links