DX12 Multi engine capabilities of recent AMD and Nvidia hardware
I will discuss the current status of multi engine and queue support in DX12 on recent AMD and Nvidia hardware. I have covered GCN 1, 2 and 3 for AMD. For Nvidia I have covered Kepler, Maxwell v1 and Maxwell v2 explicitly, as well as a few remarks on Pascal. Some of the findings regarding pre-GK110 GPUs also apply to Fermi.
I provide a best practice guide for tuning specifically to each vendor's hardware, based on the hardware and driver capabilities found. Further, I present a hybrid approach which caters to both vendors' hardware equally despite the differences.
What are engines and queues?
When you look at classic APIs such as OpenGL or DirectX before version 12, only the 3D queue was exposed. All commands submitted to a single queue are restricted by the same synchronisation points. Commands in between two synchronisation points form batches within which the order of execution is undefined. There can be some concurrency between commands due to pipelining and superscalar architectures, but the order in which command batches start execution is fixed.
The effect of this is that all commands on a single queue have to wait for the same barriers and fences. Fences are typically used to synchronize the work on the GPU with other operations, such as CPU side work or resource transfers between CPU and GPU. Barriers refer to blocking operations on the GPU, such as switching the memory access mode in between draw calls, or waiting for the end of one batch.
With Mantle and DX12, additional compute queues have been introduced into 3D applications. Each of these additional queues runs asynchronously alongside the others. The term "asynchronous" refers to the fact that the order of execution in relation to each other is not defined. Jobs submitted to different queues may start or complete in a different order than they were issued. Fences and barriers each apply only to a single queue as well. When a queue is blocked by a fence, other queues may still be running regardless. Synchronisation points between queues can be defined and enforced by using fences.
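In DX12 terms, such an additional queue is just another `ID3D12CommandQueue` of type `D3D12_COMMAND_LIST_TYPE_COMPUTE`, and inter-queue synchronisation points are expressed with an `ID3D12Fence`. A minimal sketch, assuming an existing `device` and `graphicsQueue`; `gbufferPass` and `lightingPass` are placeholder names for pre-recorded `ID3D12CommandList*` variables, and error handling is omitted:

```cpp
// Create an asynchronous compute queue alongside the default 3D queue.
D3D12_COMMAND_QUEUE_DESC desc = {};
desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
ID3D12CommandQueue* computeQueue = nullptr;
device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

ID3D12Fence* fence = nullptr;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

// Both queues run asynchronously; the fence defines the only ordering point:
graphicsQueue->ExecuteCommandLists(1, &gbufferPass);
graphicsQueue->Signal(fence, 1);  // raised once the G-buffer pass retires
computeQueue->Wait(fence, 1);     // GPU-side wait, the CPU is not blocked
computeQueue->ExecuteCommandLists(1, &lightingPass);
```

Without the `Wait`, the lighting pass could start or complete in any order relative to the G-buffer pass.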
Similar features have been exposed before in OpenCL and CUDA. DX12 fences and signals map to a subset of the event system in these two environments. Barriers in DX12 have additional wait conditions which are supported by neither OpenCL nor CUDA, and a write through of dirty buffers needs to be requested explicitly.
These additional queues differ from the classic 3D queue. While the 3D queue can be fed with draw calls, compute commands and copy commands alike, these additional queues only accept compute and copy commands. For this reason, they are called compute queues.
In hardware, these queues may be handled by dedicated engines of the corresponding type. The engine responsible for the 3D queue is commonly referred to as the "graphics command processor" by the hardware vendors, or the "3D engine" in Microsoft's terminology. AMD calls the implementation of its compute engine "ACE", short for asynchronous compute engine. There is also a copy engine with a dedicated queue type, but it's not relevant for this discussion.
If the hardware platform does not provide a dedicated unit for a given queue type, the work may be offloaded to the corresponding super type instead. E.g. if the compute engine is missing, the application may still request a compute queue. This queue will then be scheduled for execution on the 3D engine instead.
Compute jobs have gained a lot of significance in modern game engines. The obvious use is to offload work previously done on the CPU to the GPU, such as particle physics and the like. Less intuitive is the use of compute shaders in deferred rendering. With this technique, the geometry and classic draw calls are only involved in the generation of a few 2D buffers. Most of the following operations work on 2D images instead, which can be done with compute shaders as well. This comes at a reduced resource usage, with the option to use compute specific features to gain additional speed.
There are multiple possible gains from using these compute queues.
- The most trivial one is to keep the GPU utilized with jobs from one queue while jobs scheduled to a different queue are still waiting for a fence or barrier to be released.
- It is also possible to tune the hardware for the execution of compute jobs while a compute queue is active, as regular draw calls can not occur inside such a queue and therefore some features are not required. Resources required for classic 3D operations can also be re-provisioned for compute specific tasks.
- More sophisticated hardware can execute multiple queues in parallel to increase parallelism. This can be used to hide latencies inside the hardware, such as texture accesses, cache misses and the like.
Which of these options are actually used depends on the hardware and driver capabilities. These differences are not directly visible via the DX12 API; only the support for compute queues in general is indicated.
Depending on which features the hardware and driver make use of, different platforms can react entirely differently to the same type of workload. Not providing a feature is a legitimate option. Every additional feature is always a compromise between flexibility and effectiveness on the one side, and chip size and power efficiency on the other.
| | AMD | Nvidia |
| --- | --- | --- |
| 3D queue support | Yes | Yes |
| Compute queue support | Yes | Yes |
| 3D queue limit | N/A ¹ | N/A ¹ |
| Compute queue limit | N/A ³ (GCN 3)<br>64 (GCN 2)<br>2 (GCN 1) | |
| Multi engine parallelism | 1+4 ²·³ (GCN 3)<br>1+8 ²·³ (GCN 2)<br>1+2 ² (GCN 1) | 1 ⁴ |
| Compute shader parallelism on 3D engine | 64/128 ⁵ (GCN 3)<br>32/64 ⁵ (GCN 2)<br>64 ⁵ (GCN 1) | |
| 3D shader parallelism on 3D engine | 64/128 ⁵ (GCN 3)<br>32/64 ⁵ (GCN 2)<br>64 ⁵ (GCN 1) | |
| Compute shader parallelism on compute engine | 32/64 ⁸ (GCN 3)<br>32 ⁸ (GCN 2)<br>64 (GCN 1) | |
| Mixed 3D/Compute wavefront interleaving | Yes | Limited ⁹ |
¹ Additional queues are scheduled in software instead. Only memory limits apply.
² One 3D engine plus up to 8 compute engines running in parallel.
³ Since GCN 2, each ACE can seamlessly interleave wavefronts from 8 asynchronous queues. Since GCN 3, there are 4 ACE units which are only responsible for actual wavefront dispatching. The queues are now handled by a programmable, dual threaded "HWS" processor, which schedules queues for execution on the ACE units. There is no hard limit on the number of queues supported.
⁴ Compute and 3D engine can not be active at the same time as they utilize a single function unit. The Hyper-Q interface used for CUDA does in fact support concurrent execution, but it's not exposed via the DX12 API. If it were used, there would be a hardware limit of 32 asynchronous compute queues in addition to the 3D engine. Starting with Pascal, the Hyper-Q interface is used to provide the compute engine.
⁵ Execution slots dynamically shared between all command types.
⁶ Execution slots reserved for compute commands.
⁷ Execution slots reserved for use by the graphics command processor. According to Nvidia, GM20x chips should be able to lift the reservation dynamically. This behaviour appears to be limited to CUDA and Hyper-Q.
⁸ Since GCN 2, execution slots are dynamically shared between the 8 compute queues of each ACE.
⁹ SMX/SMM units can only execute either type of wavefront. All SMM units must be stopped to switch execution mode, but the GPU can partition the SMM units arbitrarily. Beginning with Pascal, SMM units can switch operation mode without a full shutdown.
The differences between GCN and Maxwell/Kepler are quite significant, and so is the effect of using the compute engines, or multiple queues in general. I will address possible approaches to utilize each platform to its full extent and to avoid common misconceptions.
Compute engines can be used for multiple different purposes on GCN hardware:
- Long running compute jobs can be offloaded to a compute queue.
- If a job is known to possibly waste a lot of time in stalls, it can be outsourced from busy queues. This has the benefit of achieving better shader utilization, as 3D and compute workloads can be interleaved at every level in hardware, from the scheduler down to actual execution on the compute units.
- High priority jobs can be scheduled to a dedicated compute queue.
- They will go into the next free execution slot on the corresponding ACE. They can not preempt running shaders, but they will skip any queued ones. Make proper use of the priority setting on the compute queue to achieve this behaviour.
- Create more back pressure.
- By providing additional jobs on a compute engine, the impact of blocking barriers in other queues can be avoided. Barriers or fences placed on other queues do not cause any interference.
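A dedicated high priority queue as described above is requested via the `Priority` member of the queue descriptor. A minimal sketch, assuming an existing `device` and omitting error handling:

```cpp
// A dedicated high priority compute queue for latency sensitive jobs.
D3D12_COMMAND_QUEUE_DESC desc = {};
desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
ID3D12CommandQueue* highPriorityQueue = nullptr;
device->CreateCommandQueue(&desc, IID_PPV_ARGS(&highPriorityQueue));
// On GCN, jobs submitted here skip queued work on the ACE, but they
// can not preempt wavefronts that are already executing.
```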
GCN is still perfectly happy to accept compute commands in the 3D queue.
There is no penalty for mixing draw calls and compute commands in the 3D queue. In fact, compute commands have approximately the same performance as draw calls with proxy geometry ¹⁰.
Compute commands should still be preferred for any non-geometry related operation for practical reasons, such as utilizing the local shared memory and increasing possible concurrency.
Offloading compute commands to the compute queue is a good chance to increase GPU utilization.
¹⁰ Proxy geometry refers to a technique where you use a simple geometry, like a single screen filling square, to apply post processing effects and the like to 2D buffers.
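On the command list level, the two variants of such a fullscreen pass look as follows. This is a hedged sketch: the 8x8 thread group size, the `width`/`height` variables and a vertex shader which derives the triangle from `SV_VertexID` are assumptions, not part of the original text.

```cpp
// Fullscreen pass via proxy geometry: a draw without vertex buffers,
// where the vertex shader generates a screen filling triangle.
commandList->DrawInstanced(3, 1, 0, 0);

// The same pass as a compute command: one thread per pixel, assuming a
// thread group size of 8x8 declared in the compute shader.
commandList->Dispatch((width + 7) / 8, (height + 7) / 8, 1);
```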
On Nvidia hardware, due to the possible performance penalties from using compute commands concurrently with draw calls, compute queues should mostly be used to offload compute commands and execute them in batch.
There are multiple points to consider when doing this:
- The workload on a single queue should always be sufficient to fully utilize the GPU.
- There is no parallelism between the 3D and the compute engine, so you should not try to split workload between regular compute and 3D queues arbitrarily.
- Pay close attention not to stall the GPU with solitary compute jobs limited by texture sample rate, memory latency or the like. Other queues can't become active as long as such a command is running.
- Compute commands should not be scheduled together with draw calls in a single batch.
- Doing so will hurt the performance measurably. The reconfiguration of the SMM units will impair performance significantly.
- In the worst case, the GPU will even be partitioned inefficiently. This can result e.g. in the SMM units dedicated to graphics running idle, while the compute shader running in parallel starves due to the lower number of SMM units available.
- Make sure to always properly batch both draw calls and compute commands. Avoiding a mixture ensures that all resources are allocated to either type.
- Consider the use of a draw call with a proxy geometry instead when batching and offloading is not an option for you. This will still save you a few microseconds as opposed to interleaving a compute command.
- Make 3D and compute sections long enough.
- Switching SMM units between compute and 3D mode results in a full flush of all pipelines. The GPU should have spent enough time in one mode to justify the penalty for switching.
- Beware that there is no active preemption: a long running shader in either engine will stall the transition.
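The points above boil down to submitting one long, pure 3D section and one long, pure compute section per frame. A minimal sketch of such a batched submission, assuming an existing `fence`, a per-frame counter `frame`, and pre-recorded command lists (`allDrawCalls` and `allComputeBatched` are placeholder names):

```cpp
// Keep the 3D and compute sections long and strictly separated, so the
// SMM partitioning has to switch as rarely as possible.
graphicsQueue->ExecuteCommandLists(1, &allDrawCalls);      // pure 3D section
graphicsQueue->Signal(fence, frame * 2 + 1);
computeQueue->Wait(fence, frame * 2 + 1);                  // GPU-side ordering only
computeQueue->ExecuteCommandLists(1, &allComputeBatched);  // pure compute section
computeQueue->Signal(fence, frame * 2 + 2);
```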
Despite the limitations, the use of compute shaders should still be considered.
The reduced overhead and effectively higher level of concurrency compared to classic draw calls with proxy geometry can still yield remarkable performance gains.
Additional care is required to cleanly separate the render pipeline into batches.
If async compute with support for high priority jobs and independent scheduling is a hard requirement, consider the use of CUDA for these jobs instead of the DX12 API.
With GK110 and later, CUDA bypasses the graphics command processor and is handled by a dedicated function unit in hardware which runs uncoupled from the regular compute or graphics engine. It even supports multiple asynchronous queues in hardware as you would expect.
Beware that the limitations regarding the partitioning of the SMM units still apply when using CUDA. There is no possibility to avoid the performance penalty resulting from this design flaw.
For Pascal, many of these limitations no longer apply. Preemption is possible with Pascal, and idle SMM units can switch mode at runtime. Be aware that Maxwell cards still make up a huge share of the installed hardware base though.
For a safe bet, go with the batched approach recommended for Nvidia hardware:
- Choose sufficiently large batches of short running shaders.
- Long running shaders can complicate scheduling on Nvidia's hardware. Ensure that the GPU can remain fully utilized until the end of each batch. Tune this for Nvidia's hardware; AMD will adapt just fine.
- Use multiple compute engines when applicable.
- If the result of a compute job or an entire chain isn't needed until much later, offload it. This will start execution early on with AMD, while Nvidia still gets a chance to profit from scheduling.
- Signal early, signal often.
- Nvidia will only update signals and fences at the end of each command list, but AMD will do so much sooner. By using additional signals, the compute engine can be given a head start on GCN. This is also true for synchronizing multiple compute engines.
- Don't worry about the overhead of additional fences for synchronisation purposes. The cost for waiting on more than one signal is just the same as waiting for a single one.
- While signaling eagerly, be conservative about waiting. Each additional synchronisation point comes at a cost ¹¹.
- Commit early and be responsive.
- Consider that AMD needs a higher level of concurrency, and many of your jobs are in fact independent. Don't wait for milestones in your render loop. Commit your work early and place fences instead.
- Signals may arrive in a different order depending on the hardware. Make sure that your CPU side code is aware of that. An event driven approach will work better than a classic procedural one.
- Keep 3D related jobs central.
- Make especially sure that no compute jobs are required in between.
- Your application may still be heavy on draw calls or utilize complex fragment shaders, as long as these are happening in bulk and dependent compute jobs have already been scheduled.
- Be aware of the cost of barriers.
- While both fences and barriers trigger scheduling, barriers are the ones which cause "bubbles" in the schedule. If one queue is potentially stalled on a barrier, even when already using the split variant, try to schedule work on another queue to fill the gap.
- Check for cache thrashing.
- The parallel execution can result in a performance penalty from unexpected memory access patterns. This is especially true for GCN, where this can even occur on the L1 cache. The increased level of parallelism can have negative side effects!
¹¹ The actual cost differs with the hardware capabilities of the engine. E.g. for GCN, a synchronization point on a dedicated hardware queue provided by an ACE is cheaper than one on the shared 3D engine. Every synchronization point still involves the operating system, so there is a baseline cost which can't be avoided.
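The signaling advice above can be condensed into a short sketch of a frame using eager signals and an event driven CPU side. The queue and command list names as well as the monotonically increasing fence values `kShadowsDone < kGBufferDone` are assumptions for illustration only:

```cpp
// Signal early, signal often: raise the fence after every dependency,
// not just once at the end of the frame.
graphicsQueue->ExecuteCommandLists(1, &shadowPass);
graphicsQueue->Signal(frameFence, kShadowsDone);  // compute gets a head start on GCN
graphicsQueue->ExecuteCommandLists(1, &gbufferPass);
graphicsQueue->Signal(frameFence, kGBufferDone);

// Wait conservatively: only on the value actually required.
computeQueue->Wait(frameFence, kShadowsDone);
computeQueue->ExecuteCommandLists(1, &lightCulling);

// Event driven CPU side: block on a completion event instead of polling
// milestones in a fixed order.
HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);
frameFence->SetEventOnCompletion(kGBufferDone, done);
WaitForSingleObject(done, INFINITE);
CloseHandle(done);
```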
At first sight, it's a complete mess: the best case scenario for AMD is the absolute worst case scenario for Nvidia.
Both hardware vendors do profit from compute engine usage. They achieve this by different means, but the potential gains are significant for both.
The situation is currently unsatisfying, as Nvidia's hardware requires a number of non-intuitive restrictions. In return, Nvidia is offering comparable capabilities on hardware up to 5 years old. The lower feature set provided by Nvidia also pays off clearly in reduced power consumption, so the motives are understandable.
It is still possible to provide a common code path for both vendors. This works by pure DX12 means and should also scale well to future GPU generations.
For the future, I hope that Nvidia will get on par with AMD regarding multi engine support. AMD is currently providing a far more intuitive approach which aids developers directly.
This will come at an increased power consumption as the flexibility naturally requires more redundancy in hardware, but will most likely increase GPU utilization throughout the industry while accelerating development. The ultimate goal is still a common standard where you don't have to care much about hardware implementation details, the same way as x86 CPUs have matured over the course of the past 25 years.
For the Pascal architecture, the biggest problem of Maxwell is already solved. The SMM units can now be reassigned to graphics or compute mode at runtime. As a result, there is no longer a penalty from having SMM units running idle.
The effects of using compute commands still differ significantly between Pascal and GCN. You can achieve synergy effects by increasing the utilization of GPU wide shared units, such as ROPs and the like. It is however not possible to benefit on a per SMM level, so ALU utilization can't be improved.
Many of the numbers regarding actual concurrency capabilities are based on data extracted from this collection: nubleh.github.io/async/. This is a synthetic benchmark and may have yielded incorrect results. I can't guarantee that my interpretation of the data is correct either. I also had access to non-synthetic results which I can't disclose.
Actual hardware specs, regarding both Nvidia's and AMD's hardware, are scarce and often oversimplified or outright misleading. This is even more so true for data reinterpreted by someone else. There is also an alarming rate of repeatedly quoted, wrong and contradicting data. I have attempted to overcome this issue by working on a far more detailed hardware model locally, and extrapolating unknown features and function units across multiple hardware revisions. The full model is not included here, and it might still be faulty. It has however proven to be a good prediction for non-synthetic tests as well. The concepts in this article are also just a simplification for the sake of comprehension. Numbers and implementation details not deemed relevant by the author have been omitted.
Only data freely available on the Internet or deducible from the common knowledge base has been included. All recommendations in the best practice section are based on my knowledge of hard- and software design, applied to the raw hardware specifications. I mostly disregarded possible or actual driver optimizations.
List of changes
- Change GCN 1.0 - 1.2 naming pattern to GCN 1 - 3 as used by AMD.
- Add new insights on GCN 3 architecture, regarding HWS units.
- Remove invalidated speculations regarding Maxwell.
- Clarify causes for possible performance penalties.
- Add remarks on which aspects Pascal improved.
- Rephrase terminology to better match Microsoft's documentation on D3D12.
- Rephrase misleading advice about fences being free of cost. Clearly state that only signals should be sent eagerly, but the number of synchronisation points should be kept low.
- Add remark that the cost of synchronization points can differ based on the hardware capabilities.