Core acceleration

Axom’s core component provides several utilities for user applications that support execution on hardware accelerators. Axom lets users control the execution space (i.e., where code runs) using RAJA, and movement between memory spaces using Umpire. As the RAJA documentation notes, developers of high-performance computing applications have many options for running code: sequentially on the CPU, on the CPU with OpenMP, or on GPU hardware accelerators, and the options are constantly evolving.

Note

  • Axom’s memory management and execution space APIs have default implementations when Axom is not configured with Umpire and RAJA, and can be used in all Axom configurations.

  • GPU execution support for Axom features is continually expanding. Axom Core offers an interface that uses RAJA and Umpire internally and provides easy access to for-loop level acceleration via the parallel-for idiom, which invokes a given lambda function for every index in a range.

The memory management API allows users to leverage either C++ memory functions or Umpire, depending on whether Umpire is available at compilation. It supports a simple operation set: allocate, deallocate, reallocate, and copy. If Umpire is used, an allocator can be specified; see the Umpire documentation for details. Note that the fall-back to C++ memory functions is automatic, so the same piece of code can handle standard C++ or C++ with Umpire. However, to use advanced features such as unified memory, Umpire must be enabled, otherwise compilation errors will occur.

Note

We do not include the output of the examples in this section in the documentation, since the output is verbose. If you are curious, please run the examples yourself.

Here is an example of using Axom’s memory management tools:

  int *dynamic_memory_array;
  int *dyn_array_dst;
  int len = 20;

  //Allocation looks similar to use of malloc() in C -- specify the
  //type as a template parameter instead of casting the result.
  dynamic_memory_array = axom::allocate<int>(len);

  for(int i = 0; i < len; i++)
  {
    dynamic_memory_array[i] = i;
  }

  //Print array values after initialization
  for(int i = 0; i < len; i++)
  {
    std::cout << i << " Current value: " << dynamic_memory_array[i] << std::endl;
  }

  dyn_array_dst = axom::allocate<int>(len);

  //Now, a copy operation. It's used exactly like memcpy --
  //destination, source, number of bytes.
  axom::copy(dyn_array_dst, dynamic_memory_array, sizeof(int) * len);

  //Print array values and compare to copy
  for(int i = 0; i < len; i++)
  {
    std::cout << i << " Current value: " << dyn_array_dst[i] << std::endl;
    std::cout << "Matches old value? " << std::boolalpha
              << (dynamic_memory_array[i] == dyn_array_dst[i]) << std::endl;
  }

  //Deallocate is exactly like free. Of course, you cannot access the
  //now-deallocated memory after this:
  axom::deallocate(dyn_array_dst);

  //Reallocate is like realloc -- copies existing contents into new
  //memory allocation.
  //Slight deviation from realloc() in that second arg is item count,
  //rather than bytes.
  dynamic_memory_array = axom::reallocate(dynamic_memory_array, len * 2);
  for(int i = len; i < len * 2; i++)
  {
    dynamic_memory_array[i] = i;
  }

  for(int i = 0; i < len * 2; i++)
  {
    std::cout << i << " Current value: " << dynamic_memory_array[i] << std::endl;
  }

Here is an Axom example showing sequential execution:

  //This part of the code works regardless of Umpire's presence, allowing for generic
  //use of axom::allocate in C++ code.
  int *A = axom::allocate<int>(N);
  int *B = axom::allocate<int>(N);
  int *C = axom::allocate<int>(N);

  for(int i = 0; i < N; i++)
  {
    A[i] = i * 5;
    B[i] = i * 2;
    C[i] = 0;
  }

  //Axom provides an API for the most basic usage of RAJA, the for_all loop.
  axom::for_all<axom::SEQ_EXEC>(
    0,
    N,
    AXOM_LAMBDA(axom::IndexType i) { C[i] = A[i] + B[i]; });

  std::cout << "Sums: " << std::endl;
  for(int i = 0; i < N; i++)
  {
    std::cout << C[i] << " ";
    C[i] = 0;
  }

  axom::deallocate(A);
  axom::deallocate(B);
  axom::deallocate(C);

Here’s the same loop from the above snippet, this time with CUDA or HIP:

  //This example requires Umpire to be enabled and unified memory to be available.
  const int allocator_id = axom::getUmpireResourceAllocatorID(
    umpire::resource::MemoryResourceType::Unified);
  A = axom::allocate<int>(N, allocator_id);
  B = axom::allocate<int>(N, allocator_id);
  C = axom::allocate<int>(N, allocator_id);

  for(int i = 0; i < N; i++)
  {
    A[i] = i * 5;
    B[i] = i * 2;
    C[i] = 0;
  }

  #if defined(__CUDACC__)
  using ExecSpace = axom::CUDA_EXEC<256>;
  #elif defined(__HIPCC__)
  using ExecSpace = axom::HIP_EXEC<256>;
  #else
  using ExecSpace = axom::SEQ_EXEC;
  #endif

  axom::for_all<ExecSpace>(
    0,
    N,
    AXOM_LAMBDA(axom::IndexType i) { C[i] = A[i] + B[i]; });

  std::cout << "\nSums (" << axom::execution_space<ExecSpace>::name()
            << "): " << std::endl;
  for(int i = 0; i < N; i++)
  {
    std::cout << C[i] << " ";
  }
  std::cout << std::endl;

  axom::deallocate(A);
  axom::deallocate(B);
  axom::deallocate(C);

For more advanced functionality, users can directly call RAJA and Umpire. See the RAJA documentation and the Umpire documentation.
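As a rough sketch of what direct use looks like, the snippet below allocates with Umpire and iterates with RAJA, bypassing the Axom wrappers. This assumes RAJA and Umpire are installed and linked; "HOST" is Umpire's built-in CPU allocator name, and `RAJA::seq_exec` is RAJA's sequential execution policy.

```cpp
#include "RAJA/RAJA.hpp"
#include "umpire/Umpire.hpp"

int main()
{
  constexpr int N = 16;

  // Ask Umpire's ResourceManager for its built-in host allocator.
  // Other allocator names (e.g. "DEVICE", "UM") select other memory spaces.
  auto& rm = umpire::ResourceManager::getInstance();
  umpire::Allocator host_alloc = rm.getAllocator("HOST");
  double* data = static_cast<double*>(host_alloc.allocate(N * sizeof(double)));

  // RAJA::forall is the primitive that axom::for_all wraps: it applies
  // the lambda to every index in the range under the chosen policy.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N),
                               [=](int i) { data[i] = 2.0 * i; });

  host_alloc.deallocate(data);
  return 0;
}
```

Changing the execution policy (e.g. to a CUDA or OpenMP policy) and the allocator name together moves the loop and its data to another execution and memory space.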