Core acceleration

Axom’s core component provides several utilities for user applications that intend to support execution on hardware accelerators. Axom lets users control the execution space (where code runs) using RAJA and the memory space (where data lives) using Umpire. As noted in the RAJA documentation, developers of high-performance computing applications have many options for running code: on the CPU, using OpenMP, or on GPU hardware accelerators, and the list of options is constantly evolving. In short, RAJA controls where code runs, and Umpire moves data between memory spaces.

Note

Axom’s memory management and execution space APIs have default implementations for when Axom is not configured with Umpire and RAJA, respectively, so they can be used in all Axom configurations.

The memory management API lets the user rely on either standard C++ memory functions or Umpire, depending on whether Umpire is available at compile time. It supports a simple set of operations: allocate, deallocate, reallocate, and copy. If Umpire is in use, an allocator can be specified; see the Umpire documentation for details. The fallback to C++ memory functions is automatic, so the same piece of code works with standard C++ or with Umpire. However, advanced features, such as accessing unified memory, require Umpire to be enabled; otherwise, compilation errors will occur.
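Conceptually, the no-Umpire fallback path behaves like the following sketch, in which allocation maps onto the standard C memory functions with a templated return type. The `fallback_*` helper names are hypothetical, for illustration only; Axom's actual implementation differs.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical sketch of the no-Umpire fallback (not Axom's actual
// implementation): typed wrappers over the standard C memory functions.
template <typename T>
T* fallback_allocate(std::size_t n)
{
  // Templated return type removes the explicit cast malloc() would need.
  return static_cast<T*>(std::malloc(n * sizeof(T)));
}

template <typename T>
T* fallback_reallocate(T* p, std::size_t n)
{
  // Like realloc(), but sized in items rather than bytes.
  return static_cast<T*>(std::realloc(p, n * sizeof(T)));
}

template <typename T>
void fallback_deallocate(T* p)
{
  std::free(p);
}
```

This also illustrates why the byte/item distinction noted in the example below matters: the templated wrappers take item counts and multiply by `sizeof(T)` internally.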

Here is an example of using Axom’s memory management tools:

  int *dynamic_memory_array;
  int *dyn_array_dst;
  int len = 20;

  //Allocation looks similar to use of malloc() in C -- just template
  //return type instead of casting.
  dynamic_memory_array = axom::allocate<int>(len);

  for(int i = 0; i < len; i++)
  {
    dynamic_memory_array[i] = i;
  }

  for(int i = 0; i < len; i++)
  {
    std::cout << i << " Current value: " << dynamic_memory_array[i] << std::endl;
  }

  dyn_array_dst = axom::allocate<int>(len);

  //Now, a copy operation. It's used exactly like memcpy -- destination, source, number of bytes.
  axom::copy(dyn_array_dst, dynamic_memory_array, sizeof(int) * len);

  for(int i = 0; i < len; i++)
  {
    std::cout << i << " Current value: " << dyn_array_dst[i] << std::endl;
    std::cout << "Matches old value? " << std::boolalpha
              << (dynamic_memory_array[i] == dyn_array_dst[i]) << std::endl;
  }

  //Deallocate is exactly like free. Of course, we won't try to access the now-deallocated
  //memory after this:
  axom::deallocate(dyn_array_dst);

  //Reallocate is like realloc -- copies existing contents into a larger memory space.
  //Slight deviation from realloc() in that it asks for item count, rather than bytes.
  dynamic_memory_array = axom::reallocate(dynamic_memory_array, len * 2);
  for(int i = len; i < len * 2; i++)
  {
    dynamic_memory_array[i] = i;
  }
  for(int i = 0; i < len * 2; i++)
  {
    std::cout << i << " Current value: " << dynamic_memory_array[i] << std::endl;
  }

Acceleration is increasingly supported throughout Axom. To serve both internal development and user applications, Axom's core component offers an interface that uses RAJA and Umpire internally to provide easy for-loop-level acceleration via the parallel-for idiom, which applies a given lambda function to every index in a range.
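Conceptually, the sequential flavor of this idiom can be sketched in plain C++ as follows. This is a simplified stand-in for `axom::for_all`, not Axom's implementation; the `sketch_for_all` and `elementwise_sum_demo` names are illustrative only.

```cpp
#include <vector>

// Simplified stand-in for axom::for_all (a sketch, not Axom's
// implementation): apply a kernel to every index in [begin, end).
template <typename Kernel>
void sketch_for_all(int begin, int end, Kernel&& kernel)
{
  for(int i = begin; i < end; ++i)
  {
    kernel(i);
  }
}

// Element-wise sum of two sequences, mirroring the Axom example:
// a[i] = 5i, b[i] = 2i, so c[i] = 7i.
std::vector<int> elementwise_sum_demo(int n)
{
  std::vector<int> a(n), b(n), c(n);
  sketch_for_all(0, n, [&](int i) {
    a[i] = i * 5;
    b[i] = i * 2;
  });
  sketch_for_all(0, n, [&](int i) { c[i] = a[i] + b[i]; });
  return c;
}
```

The value of the idiom is that the loop body is expressed independently of the execution policy: in Axom, swapping the template parameter (e.g. `axom::SEQ_EXEC` for a CUDA policy) changes where the kernel runs without changing the kernel itself.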

Here is an Axom example showing sequential execution:

  //This part of the code works regardless of Umpire's presence, allowing for generic
  //use of axom::allocate in C++ code.
  int *A = axom::allocate<int>(N);
  int *B = axom::allocate<int>(N);
  int *C = axom::allocate<int>(N);

  for(int i = 0; i < N; i++)
  {
    A[i] = i * 5;
    B[i] = i * 2;
    C[i] = 0;
  }

  //Axom provides an API for the most basic usage of RAJA, the for_all loop.
  axom::for_all<axom::SEQ_EXEC>(
    0,
    N,
    AXOM_LAMBDA(axom::IndexType i) { C[i] = A[i] + B[i]; });

  std::cout << "Sums: " << std::endl;
  for(int i = 0; i < N; i++)
  {
    std::cout << C[i] << " ";
    C[i] = 0;
  }

Here’s the same loop from the above snippet, this time with CUDA:

  //This example requires Umpire to be enabled and unified memory to be available.
  const int allocator_id = axom::getUmpireResourceAllocatorID(
    umpire::resource::MemoryResourceType::Unified);
  A = axom::allocate<int>(N, allocator_id);
  B = axom::allocate<int>(N, allocator_id);
  C = axom::allocate<int>(N, allocator_id);

  for(int i = 0; i < N; i++)
  {
    A[i] = i * 5;
    B[i] = i * 2;
    C[i] = 0;
  }

  axom::for_all<axom::CUDA_EXEC<256>>(
    0,
    N,
    AXOM_LAMBDA(axom::IndexType i) { C[i] = A[i] + B[i]; });

  std::cout << "Sums: " << std::endl;
  for(int i = 0; i < N; i++)
  {
    std::cout << C[i] << " ";
  }
  std::cout << std::endl;

For more advanced functionality, users can directly call RAJA and Umpire. See the RAJA documentation and the Umpire documentation.