DX12 multi-engine capabilities of recent AMD and Nvidia hardware

I will discuss the current status of multi-engine and queue support in DX12 on recent AMD and Nvidia hardware. I have covered GCN 1.0, 1.1 and 1.2 for AMD. For Nvidia, I have covered Kepler, Maxwell v1 and Maxwell v2 explicitly. Some of the findings regarding pre-GK110 GPUs also apply to Fermi.

I provide a best practice guide for tuning specifically to each vendor's hardware, based on the hardware and driver capabilities found. Further, I present a hybrid approach which caters to both vendors' hardware equally despite the differences.

What are engines and queues?

When you look at classic APIs such as OpenGL or DirectX before version 12, only the 3D queue was exposed. All commands submitted to a single queue are restricted by the same synchronisation points. Commands in between two synchronisation points form batches within which the order is undefined. There can be some concurrency between commands due to pipelining and superscalar architectures, but the order in which command batches start execution is fixed.

The effect of this is that all commands on a single queue have to wait for the same barriers and fences. Fences are typically used to synchronize the work on the GPU with other operations, such as CPU-side work or resource transfers between CPU and GPU. Barriers refer to blocking operations on the GPU, such as switching the memory access mode of a resource in between draw calls, or waiting for the end of one batch.
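
As a minimal D3D12 sketch, a typical barrier of this kind could look as follows; the render target "gBuffer" and the command list are hypothetical and assumed to be set up elsewhere:

    // Sketch: a transition barrier on the 3D queue. Commands recorded after the
    // barrier wait until "gBuffer" (hypothetical) is usable as a shader resource.
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = gBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    commandList->ResourceBarrier(1, &barrier);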

With Mantle and DX12, additional compute queues have been introduced into 3D applications. Each of these additional queues runs asynchronously alongside the others. The term "asynchronous" refers to the fact that the order of execution in relation to each other is not defined. Jobs submitted to different queues may start or complete in a different order than they were issued. Fences and barriers also only apply to a single queue each. When a queue is blocked by a fence, other queues may still be running regardless. Synchronisation points in between queues can be defined and enforced by using fences.
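
A minimal sketch of such a setup in D3D12, assuming "device", "graphicsQueue" and two recorded command lists already exist; the fence value is arbitrary:

    // Sketch: an additional compute queue running asynchronously to the 3D queue,
    // with a fence as the only explicit synchronisation point between the two.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;         // accepts compute and copy commands only
    ID3D12CommandQueue* computeQueue = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ID3D12Fence* fence = nullptr;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    ID3D12CommandList* const computeWork[]  = { computeList };   // hypothetical, recorded elsewhere
    ID3D12CommandList* const graphicsWork[] = { graphicsList };  // hypothetical, recorded elsewhere

    computeQueue->ExecuteCommandLists(1, computeWork);
    computeQueue->Signal(fence, 1);                // marks completion of the compute batch
    graphicsQueue->Wait(fence, 1);                 // the 3D queue starts this work only after the signal
    graphicsQueue->ExecuteCommandLists(1, graphicsWork);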

Similar features have been exposed before in OpenCL and CUDA. DX12 fences and signals map to a subset of the event system in these two environments. Barriers in DX12 have additional wait conditions which are supported in neither OpenCL nor CUDA, and a write-through of dirty buffers needs to be requested explicitly.

These additional queues differ from the classic 3D queue. While the 3D queue can be fed with compute commands, copy commands and draw calls alike, these additional queues only accept compute and copy commands. For this reason, they are called compute queues.

In hardware, these queues are handled by dedicated engines of the corresponding type. The engine responsible for the 3D queue is commonly referred to as the "graphics command processor" by the hardware vendors, or the "3D engine" in Microsoft's terminology. AMD calls the implementation of its compute engine "ACE", short for Asynchronous Compute Engine. There is also a copy engine with a dedicated queue type, but it's not relevant for this discussion.

Compute jobs have gained a lot of significance in modern game engines. The obvious use is to offload work previously done on the CPU to the GPU, such as particle physics and the like. Less intuitive is the use of compute shaders in deferred rendering. With this technique, the geometry and classic draw calls are only involved in the generation of a few 2D buffers. Most of the following operations work on 2D images instead, which can be done with compute shaders as well. This comes with reduced resource usage and the option to use compute-specific features to gain additional speed.
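
As a rough sketch, a full-screen post-processing pass expressed as a compute dispatch rather than a draw call could look like this; the pipeline state, root signature, descriptor table and dimensions are hypothetical:

    // Sketch: a full screen post processing pass as a compute dispatch instead of
    // a draw call with proxy geometry. One thread per pixel, 8x8 threads per group.
    // "tonemapPSO", "computeRootSig", "hdrInputLdrOutput", "width" and "height"
    // are hypothetical and assumed to be set up elsewhere.
    commandList->SetPipelineState(tonemapPSO);
    commandList->SetComputeRootSignature(computeRootSig);
    commandList->SetComputeRootDescriptorTable(0, hdrInputLdrOutput);
    commandList->Dispatch((width + 7) / 8, (height + 7) / 8, 1);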

There are multiple possible gains from using these compute queues. Which of them are actually realised depends on the hardware and driver capabilities. These differences are not directly visible via the DX12 API; only the support for compute queues in general is indicated.

Depending on which of these features the hardware and driver make use of, different platforms can react entirely differently to the same type of workload. Not providing a feature is a legitimate option. Every additional feature is a compromise between flexibility and effectiveness on one side, and chip size and power efficiency on the other.

Feature Support

                                              AMD                     Nvidia
3D queue support                              Yes                     Yes
Compute queue support                         Yes                     Yes
3D queue limit                                N/A [1]                 N/A [1]
Compute queue limit                           64 (GCN 1.2)            N/A [1][4]
                                              64 (GCN 1.1)
                                              2 (GCN 1.0)
Multi engine concurrency                      1+8 [2][3] (GCN 1.2)    1 [4]
                                              1+8 [2][3] (GCN 1.1)
                                              1+2 [2] (GCN 1.0)
Compute shader concurrency on 3D engine       64/128 [5] (GCN 1.2)    1 [6][7] (GM2xx)
                                              32/64 [5] (GCN 1.1)     1 [6] (GKxxx)
                                              64 [5] (GCN 1.0)
3D shader concurrency on 3D engine            64/128 [5] (GCN 1.2)    31 [7] (GM2xx)
                                              32/64 [5] (GCN 1.1)     31 [7] (GK110)
                                              64 [5] (GCN 1.0)        5/10/15 [7] (GK10x)
Compute shader concurrency on compute engine  32/64 [8] (GCN 1.2)     32 (GM2xx)
                                              32 [8] (GCN 1.1)        32 (GK110)
                                              64 (GCN 1.0)            6/11/16 (GK10x)
Mixed 3D/Compute wavefront interleaving       Yes                     Limited [9]

[1] Additional queues are scheduled in software. Only memory limits apply.

[2] One 3D engine plus up to 8 compute engines running concurrently.

[3] Since GCN 1.1, each compute engine can seamlessly interleave commands from 8 asynchronous queues.

[4] The compute and 3D engines cannot be active at the same time, as they utilize a single function unit. The Hyper-Q interface used for CUDA does in fact support concurrent execution, but it is not compatible with the DX12 API. If it were used, there would be a hardware limit of 32 asynchronous compute queues in addition to the 3D engine.

[5] Execution slots dynamically shared between all command types.

[6] Execution slots reserved for compute commands.

[7] Execution slots are reserved for use by the graphics command processor. According to Nvidia, GM20x chips should be able to lift the reservation dynamically; this behaviour appears to be limited to CUDA and Hyper-Q.

[8] Since GCN 1.1, execution slots are dynamically shared among the 8 compute queues of each engine.

[9] SMX/SMM units can only execute one type of wavefront at a time. A full flush of L1, local shared memory and the scheduler is required to switch modes. This is most likely due to a single shared memory block providing both L1 and LSHM in compute mode.

Best Practice

The differences between GCN and Maxwell/Kepler are quite significant, and so is the effect of using the compute engines, or multiple queues in general. I will address possible approaches to utilize each platform to its full extent and to avoid common misconceptions.

AMD

Compute engines can be used for multiple different purposes on GCN hardware:

Long running compute jobs can be offloaded to a compute queue.
If a job is known to potentially waste a lot of time in stalls, it can be outsourced from busy queues. This has the benefit of achieving better shader utilization, as 3D and compute workloads can be interleaved on every level in hardware, from the scheduler down to actual execution on the compute units.
High priority jobs can be scheduled to a dedicated compute queue.
They will go into the next free execution slot on the corresponding ACE. They cannot preempt running shaders, but they will skip any queued ones. Make proper use of the priority setting on the compute queue to achieve this behaviour (see the sketch after this list).
Get around the execution slot limit.
When executing compute shaders with tiny grids, down to the minimum of 64 threads per thread group, you would underutilize the GPU using only a single engine.
By utilizing all 8 ACEs together with the 3D engine, you can achieve up to 640 active grids on Fiji. This is precisely the upper occupancy limit and maximizes utilization, even if each grid only yields a single wavefront.
You should still prefer issuing fewer commands with larger grids instead. Pushing the hardware to its limits like this can expose other unexpected bottlenecks.
Create more back pressure.
By providing additional jobs on a compute engine, the impact of blocking barriers in other queues can be avoided. Barriers or fences placed on other queues do not cause any interference.
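
A minimal sketch of the dedicated high priority compute queue mentioned above, assuming an existing "device":

    // Sketch: a dedicated compute queue with raised priority for latency sensitive jobs.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    ID3D12CommandQueue* highPriorityQueue = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&highPriorityQueue));
    // Work submitted here skips past queued jobs on the ACE, but cannot preempt
    // shaders that are already executing.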

GCN is still perfectly happy to accept compute commands in the 3D queue.
There is no penalty for mixing draw calls and compute commands in the 3D queue. In fact, compute commands have approximately the same performance as draw calls with proxy geometry [10].
Compute commands should still be preferred for any non-geometry-related operation for practical reasons, such as utilizing the local shared memory and increasing possible concurrency.
Offloading compute commands to the compute queue is a good opportunity to increase GPU utilization.

[10] Proxy geometry refers to a technique where a simple piece of geometry, such as a single screen-filling quad, is used to apply post-processing effects and the like to 2D buffers.

Nvidia

Due to the possible performance penalties from using compute commands concurrently with draw calls, compute queues should mostly be used to offload and execute compute commands in batch.

There are multiple points to consider when doing this:

The workload on a single queue should always be sufficient to fully utilize the GPU.
There is no parallelism between the 3D and the compute engine, so you should not try to split workload between regular draw calls and compute commands arbitrarily. Make sure to always properly batch both draw calls and compute commands (see the sketch after this list).
Pay close attention not to stall the GPU with solitary compute jobs limited by texture sample rate, memory latency or the like. Other queues can't become active as long as such a command is running.
Compute commands should not be scheduled on the 3D queue.
Doing so will hurt performance measurably. The 3D engine not only enforces sequential execution, but the reconfiguration of the SMM units will impair performance even further.
Consider the use of a draw call with a proxy geometry instead when batching and offloading is not an option for you. This will still save you a few microseconds as opposed to interleaving a compute command.
Make 3D and compute sections long enough.
Switching between compute and 3D queues results in a full flush of all pipelines. The GPU should have spent enough time in one mode to justify the penalty for switching.
Beware that there is no active preemption; a long-running shader in either engine will stall the transition.
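
As a sketch of the batching advice above, a frame could be split into long, explicitly fenced 3D and compute sections; all command lists and fence values below are hypothetical:

    // Sketch: long 3D and compute sections instead of interleaved draw calls and
    // compute commands. All names and fence values are hypothetical.
    ID3D12CommandList* const gfxSectionA[]    = { shadowAndGBufferList };
    ID3D12CommandList* const computeSection[] = { lightingAndPostList };
    ID3D12CommandList* const gfxSectionB[]    = { forwardAndUiList };

    graphicsQueue->ExecuteCommandLists(1, gfxSectionA);
    graphicsQueue->Signal(fence, 1);

    computeQueue->Wait(fence, 1);              // switch to compute only once the 3D section is complete
    computeQueue->ExecuteCommandLists(1, computeSection);
    computeQueue->Signal(fence, 2);

    graphicsQueue->Wait(fence, 2);             // switch back to 3D only after the whole compute batch
    graphicsQueue->ExecuteCommandLists(1, gfxSectionB);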

Despite the limitations, the use of compute shaders should still be considered. The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.
Additional care is required to cleanly separate the render pipeline into batches.

If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.

With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware, as you would expect.

Ask your personal Nvidia engineer for how to share GPU side buffers between DX12 and CUDA.

Safe Approach

For a safe bet, go with the batched approach recommended for Nvidia hardware:

Choose sufficiently large batches of short-running shaders.
Long-running shaders can complicate scheduling on Nvidia's hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidia's hardware; AMD will adapt just fine.
Use multiple compute engines when applicable.
If the result of a compute job or an entire chain isn't needed until much later, offload it. This will start execution early on with AMD, while Nvidia gets a chance to batch multiple command lists.
Be careful with GCN 1.0 cards, as they can only allocate two compute engines.
Signal early, signal often.
Nvidia will only update signals and fences at the end of each command list, but AMD will do so much sooner. By using additional signals, the compute engine can be given a head start on GCN. This is also true for synchronizing multiple compute engines.
Don't worry about the overhead of additional fences for synchronisation purposes. The cost for waiting on more than one signal is just the same as waiting for a single one.
While signaling eagerly, be conservative about waiting. Each additional synchronisation point comes at a cost [11].
Commit early and be responsive.
Consider that AMD needs a higher level of concurrency, and many of your jobs are in fact independent. Don't wait for milestones in your render loop. Commit your work early and place fences instead.
Signals may arrive in a different order depending on the hardware. Make sure that your CPU-side code is aware of that. An event-driven approach will work better than a classic procedural one (see the sketch after this list).
Keep 3D related jobs central.
Make especially sure that no compute jobs are required in between.
Your application may still be heavy on draw calls or utilize complex fragment shaders, as long as these are happening in bulk and dependent compute jobs have already been scheduled.
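
A sketch of the eager-signal, conservative-wait pattern described in this list; the command lists, fence values and constants are hypothetical:

    // Sketch: signal after every independent compute batch, wait only where a
    // result is actually consumed, and handle CPU side completion via events.
    computeQueue->ExecuteCommandLists(1, particleWork);
    computeQueue->Signal(computeFence, kParticlesDone);     // eager signal

    computeQueue->ExecuteCommandLists(1, occlusionWork);
    computeQueue->Signal(computeFence, kOcclusionDone);     // another eager signal

    graphicsQueue->Wait(computeFence, kParticlesDone);      // conservative wait, right before use
    graphicsQueue->ExecuteCommandLists(1, particleDrawWork);

    // CPU side: react to completion via an event instead of polling in submission order.
    HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    computeFence->SetEventOnCompletion(kOcclusionDone, done);
    WaitForSingleObject(done, INFINITE);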

[11] The actual cost differs with the hardware capabilities of the engine. For GCN, for example, a synchronization point on a dedicated hardware queue provided by an ACE is cheaper than one on the 3D engine. This is because engines without multiple hardware queues require cooperative software scheduling by the operating system, while dedicated queues may simply stall.

Conclusion

It's a complete mess at first sight.

The best case scenario for AMD is the absolute worst case scenario for Nvidia. Exactly the same applies vice versa, if you count in the use of CUDA.

Both hardware vendors profit from compute engine usage. They achieve this by different means, but the potential gains are significant for both.

The situation is currently unsatisfying, as Nvidia's hardware requires a number of non-intuitive concessions. In return, Nvidia is offering comparable capabilities on hardware up to five years old. The lower feature set provided by Nvidia also pays off clearly in reduced power consumption, so the motives are understandable.

It is still possible to provide a common code path for both vendors. This works by pure DX12 means and should also scale well to future GPU generations.

Perspective

For the future, I hope that Nvidia will get on par with AMD regarding multi-engine support. AMD is currently providing a far more intuitive approach which aids developers directly.

This will come at an increased power consumption as the flexibility naturally requires more redundancy in hardware, but will most likely increase GPU utilization throughout the industry while accelerating development. The ultimate goal is still a common standard where you don't have to care much about hardware implementation details, the same way as x86 CPUs have matured over the course of the past 25 years.

Appendix

Future driver hacks

The following chapter is based only on vague assumptions about the hardware platforms. Don't take these as given, and read carefully which assumptions they are based on. These features might not be implemented at all, or may even be plain impossible to provide. There might also be other technical difficulties which prevent this from happening.

Some of the weaknesses of Nvidia's platform can be avoided by using additional resources in the driver. This is especially true for the second Maxwell generation, which has some as yet unused additional features.

Using the grid management unit to boost the 3D engine.
Since GK110, Nvidia's cards have been equipped with something called a "Grid Management Unit". This unit is comparable to the "Asynchronous Compute Engines" found on GCN cards, with a few differences. Unlike the ACE, the GMU has no kernel execution slots of its own but shares them with the 3D engine. The GMU has 32 truly async compute queues, but it is incompatible with DX12 for unknown reasons [12].
Prior to the second Maxwell generation, the GMU could only access all execution slots if the 3D engine was inactive. Otherwise it was limited to a single execution slot. This limitation was supposed to be lifted on GM2xx, so that the GMU could access all slots not in active use by the graphics command processor.
In order to avoid the low concurrency between compute commands in the 3D engine, it would be possible to extract compute commands from the command lists and offload them to the GMU. This only works for sets of compute commands not separated by any split barriers, regular barriers or fences. It's only worth the effort if these sets contain more than a single compute command.
Doing so does not come free of cost. The analysis and extraction of compute commands from a command list causes additional work upfront. Additional effort is required to ensure synchronisation between the GMU and the 3D engine. This requires additional signals and fences [13] to be inserted into the 3D queue, which are then resolved on the CPU side as the synchronisation capabilities of the GMU are limited.
This has nothing to do with async compute, despite the GMU being used. It is merely a method for offloading compute commands from the main queue to avoid the hardware limitations. The GMU is explicitly synchronized with the 3D engine by the newly introduced signals and fences.
For game engines making heavy use of compute commands in the 3D queue, this can help bring the runtime behaviour of the 3D engine closer to AMD's equivalent. It's not a generic optimization though, as the increased CPU load can backfire. There is also still no guarantee that compute commands and draw calls will run on the same SMM.
Use the grid management unit for barrier free command lists.
This is based on the same assumptions as the previous hack.
Even though command lists submitted to a compute queue may include resource barriers, they do not necessarily do so. If they don't include any barriers, it should be safe to schedule them on the GMU instead. This avoids deactivating the 3D engine and allows for true async and concurrent execution.
Similar to the previous approach, this one also comes at a slightly increased CPU load for synchronisation. It is also not applicable to every application, as barriers can't always be avoided.

[12] There is strong evidence that this is due to compute queues having support for resource barriers. The DX12 feature differs from the CUDA equivalent which this function unit was originally designed for. Barrier support only exists in the 3D engine and its compute mode.

[13] An additional signal is required after the last preceding barrier end to trigger the GMU. An additional fence is required before the next trailing barrier start in order to wait for the GMU. The search scope is limited to the same command list.

Disclaimer

Many of the numbers regarding actual concurrency capabilities are based on data extracted from this collection: nubleh.github.io/async/. This is a synthetic benchmark and may have yielded incorrect results. I can't guarantee that my interpretation of the data is correct either. I also had access to non-synthetic results which I can't disclose.

Actual hardware specs, for both Nvidia's and AMD's hardware, are scarce and often oversimplified or plainly misleading. This is even more true for data reinterpreted by someone else. There is also an alarming rate of repeatedly quoted, wrong and contradicting data. I have attempted to overcome this issue by working on a far more detailed hardware model locally, and extrapolating unknown features and function units across multiple hardware revisions. The full model is not included here, and it might still be faulty. It has however proven to be a good predictor for non-synthetic tests as well. The concepts in this article are also just a simplification for the sake of comprehension. Numbers and implementation details not deemed relevant by the author have been omitted.

Only data freely available on the Internet or deducible from the common knowledge base has been included. All recommendations in the best practice section are based on my knowledge of hard- and software design, applied to the raw hardware specifications. I mostly disregarded possible or actual driver optimizations.

List of changes

19.10.2015
Rephrase terminology to better match Microsoft's documentation on D3D12.
24.02.2015
Rephrase misleading advice about fences being free of cost. Clearly state that only signals should be sent eagerly, while the number of synchronisation points should be kept low.
Add remark that the cost of synchronization points can differ based on the hardware capabilities.