GPU Porting in Axom¶

Axom uses the following two libraries as the main workhorses for GPU porting:

RAJA: Handles software abstractions that enable architecture and programming model portability for HPC applications

Umpire: Resource management library that allows the discovery, provision, and management of memory on machines with multiple memory devices like NUMA and GPUs.

From RAJA and Umpire, Axom derives a set of convenience macros and function wrappers in the axom namespace encapsulating commonly-used RAJA and Umpire functions, and preset execution spaces for host/device execution.

For the user’s guide on using GPU utilities, see also Core Acceleration.

Macros¶

Axom’s macros can be found in the file axom/core/Macros.hpp.

Most of the GPU-related macros are used to guard device code for compilation.

For guarding device code:

/*!
 * \def AXOM_USE_GPU
 *
 * \brief Convenience macro used for GPU-enabled checks
 *
 * \note AXOM_USE_CUDA is defined if Axom is built with CUDA.
 *       AXOM_USE_HIP is defined if Axom is built with HIP.
 */
#if defined(AXOM_USE_CUDA) || defined(AXOM_USE_HIP)
  #define AXOM_USE_GPU
#endif

/*!
 * \def AXOM_DEVICE_CODE
 *
 * \brief Convenience macro used for kernel code
 */
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__)
  #define AXOM_DEVICE_CODE
#endif

/*!
 * \def AXOM_GPUCC
 *
 * \brief Convenience macro for compiling CUDA/HIP source files
 */
#if defined(__CUDACC__) || defined(__HIPCC__)
  #define AXOM_GPUCC
#endif

Note

Functions called in CUDA or HIP GPU device code require the __device__ annotation.
Functions that will be called in device code and CPU host code require the __host__ __device__ annotation.

The following code shows the macros used in Axom to apply these annotations:

/*!
 * \def AXOM_DEVICE
 * \def AXOM_HOST_DEVICE
 *
 * \brief CUDA or HIP host/device macros for decorating functions/lambdas
 *
 * \note These will expand to the corresponding CUDA/HIP decorations when
 *  compiled with -DAXOM_USE_CUDA or -DAXOM_USE_HIP
 */
#if defined(__CUDACC__) || defined(__HIPCC__)
  #define AXOM_DEVICE __device__
  #define AXOM_HOST_DEVICE __host__ __device__
  #define AXOM_HOST __host__
#else
  #define AXOM_DEVICE
  #define AXOM_HOST_DEVICE
  #define AXOM_HOST
#endif

/*!
 * \def AXOM_LAMBDA
 *
 * \brief Convenience macro used for lambda capture by value.
 * \note When CUDA or HIP is used, the macro always expands to a host/device lambda.
 */
#if defined(AXOM_USE_CUDA) || defined(AXOM_USE_HIP)
  #define AXOM_LAMBDA [=] AXOM_HOST_DEVICE
  #define AXOM_DEVICE_LAMBDA [=] AXOM_DEVICE
  #define AXOM_HOST_LAMBDA [=] AXOM_HOST
#else
  #define AXOM_LAMBDA [=]
  #define AXOM_DEVICE_LAMBDA [=]
  #define AXOM_HOST_LAMBDA [=]
#endif

Below is a function that uses Axom macros to apply a __host__ __device__ annotation and guard the use of a CUDA intrinsic to inside a kernel:

/*!
 * \brief Counts the number of leading zeros in \a word
 * \accelerated
 * \return The number of zeros to the left of the first set bit in \word,
 * starting with the least significant bit.
 */
AXOM_HOST_DEVICE inline std::int32_t countl_zero(std::int32_t word) noexcept
{
  /* clang-format off */
#if defined(__CUDA_ARCH__) && defined(AXOM_USE_CUDA)
  // Use CUDA intrinsic for count leading zeros
  return __clz(word);
#elif defined(__HIP_DEVICE_COMPILE__) && defined(AXOM_USE_HIP)
  // Use HIP intrinsic for count leading zeros
  return __clz(word);
#elif defined(_AXOM_CORE_USE_INTRINSICS_MSVC)
  unsigned long cnt;
  return _BitScanReverse(&cnt, word) ? 31 - cnt : 32;
#elif defined(_AXOM_CORE_USE_INTRINSICS_GCC) || defined(_AXOM_CORE_USE_INTRINSICS_PPC)
  return word != std::int32_t(0) ? __builtin_clz(word) : 32;
#else
  std::int32_t y {};
  std::int32_t n = 32;
  y = word >> 16; if(y != 0) { n -= 16; word = y;}
  y = word >>  8; if(y != 0) { n -=  8; word = y;}
  y = word >>  4; if(y != 0) { n -=  4; word = y;}
  y = word >>  2; if(y != 0) { n -=  2; word = y;}
  y = word >>  1; if(y != 0) { return std::int32_t(n - 2); }
  return std::int32_t(n - word);
#endif
  /* clang-format off */
}

Memory¶

Axom’s memory management routines can be found in the file axom/core/memory_management.hpp.

Memory Management Routines¶

Umpire has the concept of “allocators” associated with each memory resource type (umpire::resource::MemoryResourceType).

To allocate memory on a particular resource, you use the ID for the allocator associated with the umpire::resource::MemoryResourceType.

You are able to set a default allocator, whereby all your memory allocations will go on the resource associated with the allocator unless otherwise specified:

/// \name Memory Management Routines
/// @{

#ifdef AXOM_USE_UMPIRE

/*!
 * \brief Returns the ID of the predefined allocator for a given resource.
 * \param [in] resource_type the Umpire resource type
 * \return ID the id of the predefined umpire allocator.
 */
inline int getUmpireResourceAllocatorID(umpire::resource::MemoryResourceType resource_type)
{
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator alloc = rm.getAllocator(resource_type);
  return alloc.getId();
}

/*!
 * \brief Sets the default memory allocator to use.
 * \param [in] resource_type the Umpire resource type
 */
inline void setDefaultAllocator(umpire::resource::MemoryResourceType resource_type)
{
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator allocator = rm.getAllocator(resource_type);
  rm.setDefaultAllocator(allocator);
}

#endif

/*!
 * \brief Sets the default memory allocator to use.
 * \param [in] allocId the Umpire allocator id
 * 
 * \note This function has no effect when Axom is not compiled with Umpire.
 */
inline void setDefaultAllocator(int allocId)
{
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator allocator = rm.getAllocator(allocId);
  rm.setDefaultAllocator(allocator);
#else
  AXOM_UNUSED_VAR(allocId);
#endif
}

/*!
 * \brief Returns the ID of the current default Umpire allocator
 * or MALLOC_ALLOCATOR_ID if Umpire is not used.
 *
 * \return ID of the current Umpire default allocator or MALLOC_ALLOCATOR_ID.
 */
inline int getDefaultAllocatorID()
{
#ifdef AXOM_USE_UMPIRE
  return umpire::ResourceManager::getInstance().getDefaultAllocator().getId();
#else
  return MALLOC_ALLOCATOR_ID;
#endif
}

/*!
 * \brief Get the allocator id from which data has been allocated.
 * \return Allocator id.  If Umpire doesn't have an allocator for the
 * pointer, or if Axom wasn't configured with Umpire, assume the
 * non-null pointers are from a malloc and return \c
 * axom::MALLOC_ALLOCATOR_ID.
 *
 * \pre ptr has a valid pointer value.
 */
inline int getAllocatorIDFromPointer(const void* ptr)
{
#ifdef AXOM_USE_UMPIRE
  umpire::ResourceManager& rm = umpire::ResourceManager::getInstance();
  if(rm.hasAllocator(const_cast<void*>(ptr)))
  {
    umpire::Allocator allocator = rm.getAllocator(const_cast<void*>(ptr));
    return allocator.getId();
  }
#endif
  return ptr == nullptr ? INVALID_ALLOCATOR_ID : MALLOC_ALLOCATOR_ID;
}

/*!
 * \brief Determines whether an allocator id is for shared memory.
 *
 * \param allocID An allocator id.
 *
 * \return True if the allocator id is for shared memory; false otherwise.
 */
bool isSharedMemoryAllocator(int allocID);

/*!
 * \brief Get the allocator ID for Axom's shared memory allocator.
 *
 * \param [in] minSegmentSize Minimum desired shared-memory segment size in bytes (0 to use defaults).
 *             This value is treated as a minimum; the implementation will use the maximum
 *             of this value and Umpire's default shared-memory segment size when creating the allocator.
 *             This minimum is applied when creating the allocator and is ignored if the allocator
 *             already exists (except for validation).
 *
 * \note The shared-memory segment size cannot be increased after creation. If the allocator already
 * exists and \a minSegmentSize is larger than its existing segment size, this function aborts with
 * an explanatory message.
 *
 * \return The allocator ID for Axom's shared memory allocator (if Axom is built with Umpire shared memory support),
 *         or INVALID_ALLOCATOR_ID otherwise.
 */
int getSharedMemoryAllocatorID(std::size_t minSegmentSize = 0);

/*!
 * \brief Allocates a chunk of memory of type T.
 *
 * \param [in] n the number of elements to allocate.
 * \param [in] allocID the Umpire allocator to use (optional)
 *
 * \tparam T the type of pointer returned.
 *
 * \note By default allocate() will use the current default allocator. The
 *  caller may explicitly specify a different allocator to use by supplying the
 *  second, optional argument, or change the default allocator by calling
 *  axom::setDefaultAllocator().
 *
 * \return p pointer to the new allocation or a nullptr if allocation failed.
 */
template <typename T>
inline T* allocate(std::size_t n, int allocID = getDefaultAllocatorID()) noexcept;

/*!
 * \brief Allocates a chunk of memory of type T with a user-supplied allocation name.
 *
 * \param [in] n the number of elements to allocate.
 * \param [in] name allocation name (must be non-empty for shared memory allocators)
 * \param [in] allocID the Umpire allocator to use (optional)
 *
 * \return pointer to the new allocation or a nullptr if allocation failed.
 */
template <typename T>
inline T* allocate(std::size_t n,
                   const std::string& name,
                   int allocID = getDefaultAllocatorID()) noexcept;

/*!
 * \brief Frees the chunk of memory pointed to by the supplied pointer, p.
 * \param [in/out] p a pointer to memory allocated with allocate/reallocate or a
 * nullptr.
 * \post p == nullptr
 */
template <typename T>
inline void deallocate(T*& p) noexcept;

/*!
 * \brief Reallocates the chunk of memory pointed to by the supplied pointer.
 *
 * \param [in] p pointer to memory allocated with allocate/reallocate, or a
 * nullptr.
 * \param [in] n the number of elements to allocate.
 * \param [in] allocID the ID of the allocator to use if pointer is null
 * (optional)
 *
 * \tparam T the type pointer p points to.
 *
 * \return p pointer to the new allocation or a nullptr if allocation failed.
 *
 * \note When n == 0, this function returns a valid pointer (of size 0) in the
 * current allocator's memory space. This follows the semantics of
 * Umpire's reallocate function.
 * \note When p is a null pointer, allocID is used to allocate the data.
 * Otherwise, it is unused.
 */
template <typename T>
inline T* reallocate(T* p, std::size_t n, int allocID = getDefaultAllocatorID()) noexcept;

/*!
 * \brief Copies memory from the source to the destination.
 *
 * \param [in/out] dst the destination to copy to.
 * \param [in] src the source to copy from.
 * \param [in] numbytes the number of bytes to copy.
 *
 * \note When using Umpire if either src or dst is not registered with the
 *  ResourceManager then the default host allocation strategy is assumed for
 *  that pointer.
 */
inline void copy(void* dst, const void* src, std::size_t numbytes) noexcept;

/*!
 * \brief Fills memory with a value.
 *
 * \param [in/out] dst the destination to copy to.
 * \param [in] n the number of items to copy.
 * \param [in] The value to copy. It must be trivially copyable for use with GPU.
 *
 * \note When using Umpire if dst is not registered with the
 *  ResourceManager then the default host allocation strategy is assumed for
 *  that pointer.
 */
template <typename T>
inline void fill(void* dst, std::size_t n, const T& value) noexcept;

/// @}

Note

When Axom is built without Umpire, the getters and setters shown above become no-ops or are undefined, while the memory allocation functions default to C++ standard library functions with only allocation on the host (CPU):

axom::allocate calls std::malloc
axom::deallocate calls std::free
axom::reallocate calls std::realloc
axom::copy calls std::memcpy

MemorySpace¶

/*!
 * \brief Memory spaces supported by Array-like types
 *
 * This abstraction is not implemented using Umpire's MemoryResourceType enum
 * in order to also include
 * - a "Malloc" option that uses malloc and free.
 * - a "Dynamic" option as a default template parameter
 *   for use in Array-like types (see axom::Array).  If using
 *   Umpire, "Dynamic" refers to the default Umpire allocator.
 *   If not using Umpire, "Dynamic" falls back on "Malloc".
 *   (See axom::setDefaultAllocator() and axom::getDefaultAllocator().)
 */
enum class MemorySpace
{
  Malloc,   //!< Host memory using malloc, free and realloc
  Dynamic,  //!< Refers to Umpire's current default allocator
#ifdef AXOM_USE_UMPIRE
  Host,     //!< Umpire's host memory space
  Device,   //!< Umpire's device memory space
  Unified,  //!< Umpire's unified memory space
  Pinned,   //!< Umpire's pinned memory space
  Constant  //!< Umpire's constant memory space
#endif
};

Axom provides the axom::MemorySpace enum type to define values indicating the memory space where data in axom::Array and axom::ArrayView lives.

Dynamic allows you to define the location at run time, with some caveats (see Core Containers for more details and examples).

Useful Links¶

Umpire Tutorial - First two sections cover Allocators and Resources.

Kernels¶

axom::for_all¶

axom::for_all can be found in the file axom/core/execution/for_all.hpp.

axom::for_all is a wrapper around RAJA forall, which is used to execute simple for-loop kernels.

This is used in Axom to execute for-loop style kernels that will be run on a GPU device, or on both a GPU device and a CPU host. For example:

template <typename ExecSpace, typename KernelType>
void axom::for_all(const IndexType& N, KernelType&& kernel)

template <typename ExecSpace, typename KernelType>
void axom::for_all(const IndexType& begin, const IndexType& end, KernelType&& kernel)

Note

When Axom is built without RAJA, axom::for_all becomes a for-loop on host (CPU).

Portability¶

Adherence to the GPU porting guidelines generally results in code that will compile and run on multiple backends. However, backends such as CUDA require additional guidelines.

Do not use ``auto`` lambda parameters with axom::for_all or the code will not compile under nvcc.

Do this:

axom::for_all<ExecSpace>(n, AXOM_LAMBDA(axom::IndexType index) { /* body */});

Do NOT do this:

axom::for_all<ExecSpace>(n, AXOM_LAMBDA(auto index) { /* body */});

Pass ArrayView by value. Data in Axom is often contained in useful containers such as axom::Array. Use axom::ArrayView to access array data from within kernels. Views contain a pointer to the data and memory shape information and can be constructed within device code to access data arrays. Views are captured by kernels and passed to the device code; this copies the pointer and shape information into an object that the device code can access. If the view was passed into a method where the axom::for_all function is calling a kernel, be sure to pass the view by value so the compiler does not capture a host reference to the view, causing the kernel to fail.

Do this:

template <typename ExecSpace>
void doSomething(axom::ArrayView<int> dataView)
{
  axom::for_all<ExecSpace>(dataView.size(), AXOM_LAMBDA(axom::IndexType index)
  {
    /* body uses dataView[index] */
  });
}

Do NOT do this:

template <typename ExecSpace>
void doSomething(axom::ArrayView<int> &dataView)
{
  axom::for_all<ExecSpace>(dataView.size(), AXOM_LAMBDA(axom::IndexType index)
  {
    /* body uses dataView[index] */
    /* It will crash on GPU devices because the host reference was
       captured rather than the object.
     */
  });
}

If pass by reference is used, create a new view inside the method and then use that “device” view in the kernel so the device code uses an object captured by value.

Do not call for_all from protected or private methods. The nvcc compiler requires the method containing the kernel to be publicly accessible. Conditional compilation can be used to make methods public when they should otherwise be private or protected.

Do this:

class Algorithm
{
public:
  // stuff
#if !defined(__CUDACC__)
private:
#endif
  void helperMethod()
  {
    axom::for_all<ExecSpace>(n, AXOM_LAMBDA(axom::IndexType index) { /* body */});
  }
};

Do NOT do this:

class Algorithm
{
public:
  // stuff
private:
  void helperMethod()
  {
    axom::for_all<ExecSpace>(n, AXOM_LAMBDA(axom::IndexType index) { /* body */});
  }
};

Avoid for_all within a lambda. When calling a kernel via axom::for_all from a surrounding lambda function, consider calling axom::for_all from a class member method instead. The nvcc compiler will not compile kernel invocations inside lambda functions. This pattern comes up when an intermediate function is supplied a lambda that uses axom::for_all such as when handling many data types.

Do this:

template <typename ExecSpace>
class Algorithm
{
public:
  void execute(conduit::Node &data)
  {
    // Handle any data type
    Node_to_ArrayView(data, [&](auto dataView)
    {
      // Delegate operation to a template member method
      handleData(dataView);
    });
  }
  template <typename DataView>
  void handleData(DataView dataView)
  {
    // Call the kernel here in the member method
    axom::for_all<ExecSpace>(AXOM_LAMBDA(axom::IndexType) { /* body */ });
  }
};

Do NOT do this:

template <typename ExecSpace>
class Algorithm
{
public:
  void execute(conduit::Node &data)
  {
    // Handle any data type
    Node_to_ArrayView(data, [&](auto dataView)
    {
      // nvcc will not compile this
      axom::for_all<ExecSpace>(AXOM_LAMBDA(axom::IndexType) { /* body */ });
    });
  }
};

Avoid calling lambdas from kernels. This can work on some systems and not on others. For best odds at a portable algorithm, design your kernel so it is “one level deep”, and does not result in calling other functions that are also marked AXOM_LAMBDA.

Specialize templates outside other classes. It is necessary to extract a nested class/struct from the containing class before specializing it.

Do this:

namespace internal
{
  template <int NDIMS>
  struct B { static void method() { } };

  template <>
  struct B<2> { static void method() { /* 2D-specific method*/ } };
}

template <typename ViewType>
struct A
{
  void method() { internal::B<ViewType::dimension()>::method(); }
};

Do NOT do this:

template <typename ViewType>
struct A
{
  template <int NDIMS>
  struct B { static void method() { } };

  template <>
  struct B<2> { static void method() { /* 2D-specific method*/ } };

  void method() { B<ViewType::dimension()>::method(); }
};

RAJA::kernel¶

RAJA::kernel is used to execute kernels implemented using nested loops.

This is used infrequently, mainly seen only in a few unit tests.

Your general go-to will be axom::for_all.

Useful Links¶

RAJA Loops - Covers RAJA::forall, RAJA::kernel, RAJA::launch kernel execution methods.

Execution Spaces & Policies¶

Axom’s execution spaces can be found in the file axom/core/execution/execution_space.hpp.

Axom’s execution spaces are derived from an axom::execution_space<ExecSpace> traits class containing RAJA execution policies and default Umpire memory allocators associated with each space.

Axom currently supports four execution spaces, each one a type with the following specialization of the execution_space class:

SEQ_EXEC - Sequential execution policies on host
OMP_EXEC - OpenMP execution policies on host
CUDA_EXEC - CUDA execution policies in Unified Memory (host + device)
HIP_EXEC - HIP execution policies in Unified Memory (host + device)

Additionally, HIP_EXEC and CUDA_EXEC types are templated by the number of threads and SYNCHRONOUS or ASYNC execution:

/*!
 * \brief Indicates parallel execution on the GPU with CUDA.
 *
 * \tparam BLOCK_SIZE the number of CUDA threads in a block.
 * \tparam ExecutionMode indicates synchronous or asynchronous execution.
 */
template <int BLOCK_SIZE, ExecutionMode EXEC_MODE = SYNCHRONOUS>
struct CUDA_EXEC
{ };

Each execution space provides:

Axom policies that are type aliases of RAJA policies to be used with kernels, RAJA types, and RAJA operations

loop_policy - For RAJA scans and other operations; axom::for_all uses the loop_policy from the templated execution space.
reduce_policy - For RAJA reduction types that perform reduction operations:

  axom::ReduceSum<ExecSpace, axom::IndexType> totalSum(0);

  // Sum integers [0,99]
  axom::for_all<ExecSpace>(100, AXOM_LAMBDA(axom::IndexType i) { totalSum += i; });

  std::cout << "\nTotal Reduction Sum (" << axom::execution_space<ExecSpace>::name()
            << ") :" << totalSum.get() << std::endl;

atomic_policy - For RAJA atomic operations that avoid race conditions when updating data values:

  int *sum = axom::allocate<int>(1, allocator_id);
  *sum = 0;

  // Increment sum 100 times
  axom::for_all<ExecSpace>(100, AXOM_LAMBDA(axom::IndexType) { axom::atomicAdd<ExecSpace>(sum, 1); });

  std::cout << "\nTotal Atomic Sum (" << axom::execution_space<ExecSpace>::name() << ") :" << sum[0]
            << std::endl;

  axom::deallocate(sum);

sync_policy - For Axom’s synchronize function, which is a wrapper around RAJA::synchronize(). Synchronizes execution threads when using an asynchronous loop_policy:

/*!
 * \brief Synchronizes all execution threads when using an ASYNC policy with
 *  the specified execution space.
 *
 * \tparam ExecSpace the execution space
 */
template <typename ExecSpace>
inline void synchronize() noexcept
{
  AXOM_STATIC_ASSERT(execution_space<ExecSpace>::valid());

#ifdef AXOM_USE_RAJA
  using sync_policy = typename execution_space<ExecSpace>::sync_policy;
  RAJA::synchronize<sync_policy>();
#endif
}

Umpire allocator defaults
- memory_space - The memory space abstraction for use by Core Containers like axom::Array.
- allocatorID() - Gets the allocator ID for the Umpire resource to use in this execution space.
General information on the execution space
- name() - Name of the execution space
- onDevice() - Is the execution space on device? (True/False)
- valid() - Is the execution space valid? (True)
- async() - Is the execution space asynchronous? (True/False)

The Core component provides a set of nested execution policies located at axom/axom/execution/nested_for_exec.hpp to be used with RAJA::kernel e.g. for iterating over mint meshes. (These generic policies formerly resided in Mint and have been moved to Core.)

Note

When Axom is built without RAJA, only SEQ_EXEC is available for host (CPU) execution. When Axom is built with RAJA but without Umpire for memory management on device, only SEQ_EXEC and OMP_EXEC is available for host (CPU) execution.

General, Rough Porting Tips¶

Start with figuring out what memory you need on device, and use axom::Array, axom::ArrayView, and memory_managment routines to do the allocations:

// Allocate 100 2D Triangles in unified memory
using cuda_exec = axom::CUDA_EXEC<256>;
using TriangleType = axom::primal::Triangle<double, 2>;
axom::Array<Triangle> tris (100, axom::execution_space<cuda_exec>::allocatorID()));
axom::ArrayView<Triangle> tris_view(tris);

// Allocate the sum of Triangle areas
using reduce_pol = typename axom::execution_space<cuda_exec>::reduce_policy;
RAJA::ReduceSum<reduce_pol, double> totalArea(0);

Using an axom::for_all kernel with a device policy, attempt to access and/or manipulate the memory on device:

axom::for_all<cuda_exec>(
100,
AXOM_LAMBDA(int idx) {
  // Set values on device
  tris_view[idx] = Triangle();
  totalArea = 0;
});

Add the functions you want to call on device to the axom::for_all kernel:

axom::for_all<cuda_exec>(
100,
AXOM_LAMBDA(int idx) {
  tris_view[idx] = Triangle();
  totalArea = 0;

  // Call area() method on device
  double area = tris_view[idx].area();
});

Apply a __host__ __device__ annotation to your functions if you see the following error or similar:
```
error: reference to __host__ function 'area' in __host__ __device__ function
```
- Recompiling will likely introduce complaints about more functions (the functions being the non-decorated functions your newly-decorated functions are calling):
```
error: reference to __host__ function 'abs<double>' in __host__ __device__ function

error: reference to __host__ function 'signedArea<2>' in __host__ __device__ function
```
  Keep decorating until all the complaints are gone.
- Most of the C++ standard library is not available on device. Your options are Axom’s equivalent functions/classes if it exists, or to add your own or rewrite the code to not use standard library.
- It may not be possible to remove all such warnings on some platforms that support both CPU/GPU backends since AXOM_LAMBDA will expand to __host__ __device__ and then the compiler will issue warnings about host functions such as RAJA::ReduceSum::~ReduceSum for the OpenMP backend being called from __host__ __device__ code. This warning can be ignored.
With no more decorating complaints from the compiler, write the logically correct kernel:
```
// Computes the total area of a 100 triangles
axom::for_all<cuda_exec>(
  100,
  AXOM_LAMBDA(int idx) {
    totalArea += tris_view[idx].area();
});
```
- If at this point your kernel is not working/segfaulting, it is hopefully a logical error, and you can debug the kernel without diving into debugging tools.
- Utilize printf() for debugging output
- Try using the SEQ_EXEC execution space
If your kernel compiles/links but fails at runtime, the cause is often:
- The data arrays referenced in the kernel are not in the correct memory space for the kernel.
- A reference to host memory has been captured for use in the kernel.
- There is nothing wrong with your code and the tooling has failed. In this case, try separating the axom::for_all and your kernel into a separate method or function. This can limit the available objects that the compile will attempt to capture and increase the likelihood that the kernel will run correctly.

Useful Links¶

List of debugging tools:
- Totalview (CUDA)
- Nvidia Nsight Developer Tools (CUDA)
  - LLNL Guide for Sierra
  - ORNL Guides for Nsight Compute and Nsight Systems
- HPCToolkit (CUDA, HIP)
- ROCprof (HIP)
- ROCgdb (HIP)
ORNL Training Archives
LLNL HPC Tutorials