Kernel Execution

This is a set of simple examples showcasing how the HiCR API can be used to execute kernels or functions on multiple available processing devices (CPU, GPU, NPU, etc.) using the same API. The code is structured as follows:

  • source/ contains the different variants of this example corresponding to different backends

    • ascend.cpp corresponds to the Ascend backend implementation. The available kernel is compiled for Ascend 910. Run kernels/compile.sh to call ATC and compile the kernel.

    • opencl.cpp corresponds to the OpenCL backend implementation. The available kernel is compiled for OpenCL.

    • nosv.cpp corresponds to the nOS-V backend implementation.

    • pthreads.cpp corresponds to the Pthreads backend implementation.

All the examples perform a GEMM operation, and follow the same pattern:

Identify Compute Topology

The first step is to discover the local compute topology, to find out the available compute devices and use one of them for running the example.

// Asking backend to check the available devices
auto topology = topologyManager.queryTopology();

// Getting first device found in the topology
auto device = *topology.getDevices().begin();

// Getting compute resources
auto computeResources = device->getComputeResourceList();

// Selecting the first compute resource found
auto firstComputeResource = *computeResources.begin();
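
When more than one device is present, the same calls can be used to enumerate all of them before choosing one. The sketch below is not part of the example; it assumes the containers returned by getDevices() and getComputeResourceList() can be traversed with a range-based for loop:

// Hypothetical sketch: inspect every device before choosing one
for (const auto &d : topology.getDevices())
{
  // Count the compute resources exposed by this device
  size_t resourceCount = 0;
  for (const auto &r : d->getComputeResourceList()) { (void)r; resourceCount++; }

  // A real example could pick a device based on resourceCount here
}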

Creating an execution state

The HiCR Core API defines execution units as an abstract description of a function or kernel for any kind of device. In the pthreads.cpp example, we use a lambda function to describe the work to be performed on the CPU and create an execution unit around it:

auto executionUnit = computeManager.createExecutionUnit([=](void *arg) {
  gemm((double *)input1Host->getPointer(),
       (double *)input2Host->getPointer(),
       (double *)input3Host->getPointer(),
       (double *)alphaHost->getPointer(),
       (double *)betaHost->getPointer());
});
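
For reference, the gemm routine called above is not listed here. A plain CPU version consistent with the call signature could look like the following sketch; the row-major layout, the use of the third matrix as both input and output, and the availability of M, N and K as constants are all assumptions:

// Hypothetical CPU GEMM sketch: C = alpha * A * B + beta * C
// A is [M, K], B is [K, N], C is [M, N]; all buffers are assumed row-major
void gemm(double *A, double *B, double *C, double *alpha, double *beta)
{
  for (size_t i = 0; i < M; i++)
    for (size_t j = 0; j < N; j++)
    {
      double acc = 0.0;
      for (size_t k = 0; k < K; k++) acc += A[i * K + k] * B[k * N + j];
      C[i * N + j] = (*alpha) * acc + (*beta) * C[i * N + j];
    }
}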

The execution unit can be instantiated many times into execution state objects. Each execution state holds its own, independent progress through the execution unit. We invoke the compute manager to create a new execution state:

auto executionState = computeManager.createExecutionState(executionUnit);
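
Because the execution unit only describes the work, the same unit can back several independent execution states. A minimal sketch, reusing only the createExecutionState call shown above:

// Two independent execution states created from the same execution unit;
// each one tracks its own progress and can be run separately
auto executionStateA = computeManager.createExecutionState(executionUnit);
auto executionStateB = computeManager.createExecutionState(executionUnit);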

Creating a processing unit

A processing unit represents an active compute resource that can be employed to run an execution state to completion. The example creates one such processing unit out of the first available CPU core, as found by the topology manager.

auto processingUnit = computeManager.createProcessingUnit(firstComputeResource);

After its creation, the processing unit is initialized (this creates and starts the corresponding pthread):

computeManager.initialize(processingUnit);

Execution and Completion

To run the execution state, we assign it to the processing unit via the start function:

computeManager.start(processingUnit, executionState);

And then wait for completion with the await function:

computeManager.await(processingUnit);
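
Putting the steps together, the whole pattern of the pthreads.cpp variant condenses into the calls already shown above:

// Condensed recap of the pattern, built only from the calls shown above
auto topology             = topologyManager.queryTopology();
auto device               = *topology.getDevices().begin();
auto firstComputeResource = *device->getComputeResourceList().begin();

auto executionUnit  = computeManager.createExecutionUnit([=](void *arg) { /* kernel body */ });
auto executionState = computeManager.createExecutionState(executionUnit);

auto processingUnit = computeManager.createProcessingUnit(firstComputeResource);
computeManager.initialize(processingUnit);

computeManager.start(processingUnit, executionState);
computeManager.await(processingUnit);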

The expected result of running this example with M = 4, N = 2, K = 8 is:

First matrix [M, K]
1.0 ...

Second matrix [K, N]
1.0 ...

Third matrix [M, N]
1.0 ...

Output matrix [M, N]
9.0 ...
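
Assuming alpha = beta = 1.0 in the example, each output element follows directly from the GEMM definition: 1.0 * (8 * 1.0 * 1.0) + 1.0 * 1.0 = 9.0, since K = 8 and all input matrices are filled with 1.0.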