# ACTA: AUTOMATIC CONFIGURATION OF THE TENSOR MEMORY ACCELERATOR FOR HIGH-END GPUS

Nicolas Meseguer<sup>1</sup>, Yifan Sun<sup>2</sup>, Michael Pellauer<sup>3</sup>, Jose L. Abellan<sup>1</sup>, Manuel E. Acacio<sup>1</sup>

> <sup>1</sup>University of Murcia, Spain <sup>2</sup>William & Mary, USA <sup>3</sup>NVIDIA, USA

Nicolas Meseguer, et al.

SARTECO'25

June 26th, 2025 1 / 21




- GPUs are the fundamental compute platform in data centers (HPC, DL, Big Data, etc.). El Capitan ranks #1 on the Top500 list (June 2025)<sup>1</sup>.
- They include more and more specialized hardware units (TC, RT, AMP, TMA, ...).
- It is difficult to harness their full potential, especially in kernels that are highly sensitive to memory latency.
- The trend is to give programmers more tools to overlap memory operations with computation.

<sup>1</sup>43,808 AMD MI300A GPUs

### INTRODUCTION



- Techniques like Warp Specialization:
  - Each warp does one very specific job (divergence on a GPU is costly).
  - Usually a producer-consumer scheme.
  - Synchronization is fine-grained and very difficult to get right.
  - Usually implemented as busy-loop waiting, which consumes GPU resources and further degrades performance.

### INTRODUCTION



- A new NVIDIA accelerator, the Tensor Memory Accelerator (TMA), can transfer large blocks of data asynchronously.
  - TMA descriptor: a new data structure located in shared memory (SMEM) that stores the transfer parameters: memory addresses, data length, offsets...
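
To make the descriptor idea concrete, here is a minimal Go sketch of a record holding the parameters listed above. The type `TMADescriptor`, its field names and widths, and both helper methods are hypothetical illustrations; they do not reproduce NVIDIA's actual descriptor layout or the authors' TMA-Like model.

```go
package main

import "fmt"

// TMADescriptor is a hypothetical, simplified descriptor for a 1D block copy.
// It only mirrors the parameters named on the slide: addresses, length, offsets.
type TMADescriptor struct {
	GlobalAddr uint64 // base address of the source buffer in global memory
	Offset     uint64 // byte offset from GlobalAddr where the block starts
	SmemAddr   uint32 // destination byte offset inside shared memory
	Length     uint32 // number of bytes to transfer
}

// SourceRange returns the half-open byte range [start, end) read from global memory.
func (d TMADescriptor) SourceRange() (start, end uint64) {
	start = d.GlobalAddr + d.Offset
	return start, start + uint64(d.Length)
}

// FitsInSMEM reports whether the destination block lies inside an SMEM of smemBytes.
func (d TMADescriptor) FitsInSMEM(smemBytes uint32) bool {
	return uint64(d.SmemAddr)+uint64(d.Length) <= uint64(smemBytes)
}

func main() {
	d := TMADescriptor{GlobalAddr: 0x1000, Offset: 256, SmemAddr: 0, Length: 1024}
	start, end := d.SourceRange()
	fmt.Printf("copy [%#x, %#x) to SMEM+%d, fits=%v\n",
		start, end, d.SmemAddr, d.FitsInSMEM(64*1024))
}
```

Even in this reduced form, every transfer needs several consistent values filled in by hand, which hints at why descriptor management adds programming effort.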



TMA Example




- A new problem arises: the TMA is very complex to use.
- Queue mechanisms (OperandQueues in CUDA or Pipes in OpenCL) are one way to reduce that complexity.
- They help the programmer, but many details are still left to them (number of queues, SMEM addresses, management of the queues).
- Precedent: the Cell BE processor was a failure due to the complexity of its DMA engine.



- Our goal: use the TMA with the lowest complexity and the highest performance.






# ACTA

- Queue configuration is highly kernel-dependent and architecture-dependent.
- ACTA is a novel software library that infers optimal tile sizes and queue slot configurations for the TMA, based on the kernel and the GPU architecture.
- It is paired with our new hardware unit, the GPU Specification Table (GST).





 ACTA extends the GPU SDK by providing an API for dynamically configuring kernel queues.

```
driver.MemCopyH2D(b.device_A, b.MatrixA)
driver.MemCopyH2D(b.device_B, b.MatrixB)
driver.CreateCommandQueue()

// Init ACTA for configuring the queues
driver.InitACTA(MEDIUM, 8, 64)

// Register the queues
driver.RegisterQueue(K, 4, TYPE_STREAMING)
driver.RegisterQueue(K, 4, TYPE_STATIONARY)

// Obtain the sized queues in FIFO order (same order as registration)
a_queue = driver.SizeQueue()
b_queue = driver.SizeQueue()

// Load kernel arguments using the QuCo
kernArg := KernelArgs{
    b.device_A, b.device_B, b.device_Z, M, K, N,
    K0, a_queue.TileSize, b_queue.TileSize, K2, M0, M1, M2,
    a_queue.QueueTiles, b_queue.QueueTiles, ConsumerWfs,
}
driver.EnqueueLaunchKernel(binary, kernArg)
```




# ACTA

- Optimal Tile Size Calculation
  - Inputs: the arithmetic intensity, the number of consumer wavefronts, the GST, and a tile size range (e.g., 64 to 2048).
  - A merit factor balances processing time against memory time for each tile.
  - Within the given range, we select the most suitable tile size according to the merit factor.
  - Further adjustments are applied based on a scaling factor and the number of CUs.
- Optimal Number of Tiles Calculation
  - Streaming and stationary queues are treated differently, so that each is assigned more or less SMEM space.
  - Little's Law: the fundamental relation in queuing systems, linking the average number of items in a system, their arrival rate (memory time), and their residence time (processing time).
  - Arithmetic intensity drives the result: the higher the arithmetic intensity, the more slots; the lower, the fewer slots.
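
The tile size selection described above can be sketched as a sweep over candidate sizes. This is only an illustration under stated assumptions: candidates are taken to be powers of two, "most suitable" is read as "merit factor closest to 1" (processing and memory time balanced), and `toyMerit` uses invented constants in place of a real GST. The scaling-factor and CU adjustments are omitted because the slides do not specify them.

```go
package main

import (
	"fmt"
	"math"
)

// pickTileSize sweeps power-of-two tile sizes in [lo, hi] and returns the one
// whose merit factor is closest to 1. Both the power-of-two stepping and the
// "closest to 1" criterion are assumptions of this sketch.
func pickTileSize(lo, hi int, merit func(tile int) float64) int {
	if lo < 1 {
		lo = 1 // guard: doubling from 0 would never terminate
	}
	best, bestDist := lo, math.Inf(1)
	for t := lo; t <= hi; t *= 2 {
		if d := math.Abs(merit(t) - 1); d < bestDist {
			best, bestDist = t, d
		}
	}
	return best
}

// toyMerit is a made-up merit model: processing time grows linearly with the
// tile, memory time is a fixed latency plus a linear transfer term.
func toyMerit(tile int) float64 {
	procTime := float64(tile) / 8
	memTime := 50 + float64(tile)/16
	return procTime / memTime
}

func main() {
	fmt.Println("chosen tile size:", pickTileSize(64, 2048, toyMerit))
}
```

With these toy constants the sweep settles on 1024, the candidate where processing and memory time are closest to balanced; a real configuration would plug in a merit function driven by the GST.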


### **EVALUATION METHODOLOGY**



- Implemented on MGPUSim, a microarchitectural cycle-level simulator that accurately models the AMD R9 Nano GPU, a solid baseline.
- We extended the simulator with a TMA model inspired by the functionality of NVIDIA Hopper's TMA; we refer to ours as TMA-Like.
- Linear algebra kernels implemented: elementwiseK, elementwise, dot-product, sumvectors, matrix-vector and matrix-matrix.
- Using ACTA for matrix-matrix operations reduces the total iterations required for complete design space exploration from $2.6 \times 10^{14}$ to just 1.




### RESULTS





- Release the programmer from the low-level details of the TMA.
- Achieve near-optimal performance (within **2.78%** of exhaustive tuning) with a single execution.
- Suitable across multiple GPU architectures.


n.mesegueriborra@um.es

Thank you for your attention!





Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia


### **OPTIMAL TILE SIZE CALCULATION**



### Algorithm 1: Optimal Tile Size Calculation


### MERIT FACTOR



### Algorithm 2: Function for calculating the Merit Factor

**Input:** Tile Size, GST

**Output:** Merit Factor

Function evaluate()

- Step 1: Compute the best-case scheduling time for processing the tile.
  $bestScheduling \leftarrow \frac{\text{TileSize}}{\text{SIMDMulsPerCycle} \times \min(\text{ConsumerWfs}, 4)}$
- Step 2: Calculate processing time, including scheduling roundtrip overhead.
  $procTime \leftarrow bestScheduling + (bestScheduling - 1) \times \min(\text{ConsumerWfs} - 1, \text{WfPools})$
- Step 3: Compute memory transfer latency and times.
  $latencyTotal \leftarrow \text{TMACycles} + \text{DRAMLatency} + \text{L2Latency}$
  $memTransferTime \leftarrow \frac{\text{TileSize} \times \text{ElementSize}}{\text{Bandwidth}}$
  $cacheTransferTime \leftarrow 2 \times \frac{\text{TileSize} \times \text{ElementSize}}{\text{CacheLineSize}}$
- Step 4: Aggregate memory transfer time.
  $memTime \leftarrow latencyTotal + memTransferTime + cacheTransferTime$
- Step 5: Return the merit factor as the ratio of processing time to memory time.
  **return** $\frac{procTime}{memTime}$

end
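
As a worked illustration, the five steps of Algorithm 2 translate almost line by line into Go. The `GST` struct below is a hypothetical container for the hardware parameters named in the algorithm; its layout, the choice of `float64` for every quantity, and the numbers used in `main` are assumptions made for this sketch, not values from the evaluation.

```go
package main

import (
	"fmt"
	"math"
)

// GST is a hypothetical stand-in for the GPU Specification Table; it holds
// only the parameters that Algorithm 2 refers to.
type GST struct {
	SIMDMulsPerCycle float64 // multiplications a SIMD unit issues per cycle
	WfPools          float64 // number of wavefront pools
	TMACycles        float64 // cycles spent inside the TMA
	DRAMLatency      float64 // cycles
	L2Latency        float64 // cycles
	Bandwidth        float64 // bytes per cycle
	CacheLineSize    float64 // bytes
	ElementSize      float64 // bytes per element
}

// meritFactor follows Algorithm 2: the ratio of processing time to memory time.
func meritFactor(tileSize, consumerWfs float64, g GST) float64 {
	// Step 1: best-case scheduling time for the tile.
	bestScheduling := tileSize / (g.SIMDMulsPerCycle * math.Min(consumerWfs, 4))
	// Step 2: processing time including scheduling roundtrip overhead.
	procTime := bestScheduling + (bestScheduling-1)*math.Min(consumerWfs-1, g.WfPools)
	// Step 3: fixed latency plus the two transfer terms.
	latencyTotal := g.TMACycles + g.DRAMLatency + g.L2Latency
	memTransferTime := tileSize * g.ElementSize / g.Bandwidth
	cacheTransferTime := 2 * tileSize * g.ElementSize / g.CacheLineSize
	// Step 4: aggregate memory time.
	memTime := latencyTotal + memTransferTime + cacheTransferTime
	// Step 5: merit factor.
	return procTime / memTime
}

func main() {
	// Illustrative numbers only.
	g := GST{SIMDMulsPerCycle: 16, WfPools: 4, TMACycles: 10, DRAMLatency: 200,
		L2Latency: 46, Bandwidth: 64, CacheLineSize: 64, ElementSize: 4}
	fmt.Printf("merit factor: %.4f\n", meritFactor(256, 4, g))
}
```

For the sample inputs, `procTime` is 13 cycles and `memTime` is 304 cycles, so the merit factor is 13/304: the tile is strongly memory-bound, and a larger tile would move the ratio toward balance.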


### **OPTIMAL NUM TILES CALCULATION**



### Algorithm 3: Optimal Number of Slots Calculation

```
Input: Streaming and stationary queues, Ar.I., Compute Units
Output: Optimal number of slots for each queue
Function optimal_num_slots()
    count streaming and stationary queues;
    if there are streaming queues then
        numSlots ← useLittlesLaw();
        numSlots ← roundToPowerOfTwo(numSlots);
        numSlots ← roundBasedOnCUs(numSlots);
        if sufficient space in Shared Memory then
            allocateSpace(streaming queues);
        else
            numSlots ← useArithmeticIntensity();
            reduce numSlots if necessary to fit the data;
            allocateSpace(streaming queues);
        end
    end
    if there are stationary queues then
        calculate available space for each stationary queue;
        determine how many slots can fit into the remaining space;
        numSlots ← roundToPowerOfTwo(numSlots);
        numSlots ← roundBasedOnCUs(numSlots);
        reduce numSlots if necessary to fit the data;
        allocateSpace(stationary queues);
    end
end
```
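
A reduced Go sketch of Algorithm 3 follows, under several assumptions. Little's Law (L = λ × W) is applied with the mapping given in the slides: tiles arrive once per memory time and reside for one processing time, so the estimate is `procCycles / memCycles`. Rounding goes up for streaming queues and down for stationary ones, a direction the slides do not state. The `roundBasedOnCUs` and `useArithmeticIntensity` steps are left out because they are not specified; the fallback here simply halves the slot count until it fits. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// roundUpPow2 returns the smallest power of two >= n (at least 1).
func roundUpPow2(n int) int {
	p := 1
	for p < n {
		p *= 2
	}
	return p
}

// roundDownPow2 returns the largest power of two <= n, or 0 when n < 1.
func roundDownPow2(n int) int {
	if n < 1 {
		return 0
	}
	p := 1
	for p*2 <= n {
		p *= 2
	}
	return p
}

// streamingSlots estimates slots with Little's Law (arrival rate 1/memCycles,
// residence time procCycles), rounds up to a power of two, then halves the
// count until the queue fits in smemBytes. tileBytes must be positive.
func streamingSlots(procCycles, memCycles float64, tileBytes, smemBytes int) int {
	slots := roundUpPow2(int(math.Ceil(procCycles / memCycles)))
	for slots > 1 && slots*tileBytes > smemBytes {
		slots /= 2
	}
	return slots
}

// stationarySlots fits as many power-of-two slots as the remaining SMEM allows.
func stationarySlots(tileBytes, remainingBytes int) int {
	return roundDownPow2(remainingBytes / tileBytes)
}

func main() {
	const smem, tile = 64 * 1024, 4 * 1024 // illustrative sizes in bytes
	s := streamingSlots(900, 300, tile, smem)
	st := stationarySlots(tile, smem-s*tile)
	fmt.Println("streaming slots:", s, "stationary slots:", st)
}
```

Sizing the streaming queue first and giving the stationary queue whatever space remains mirrors the order of the two `if` blocks in Algorithm 3.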
