High DX11 CPU overhead, very low performance.

Sunday, 3 May 2015

We all know what happened the last time that a thread like this (dis)appeared, so this one is not directed at anyone. It is merely a place for me to link my findings regarding something that I believe is the greatest problem with AMD drivers; a problem that cannot be attributed to anyone but AMD (unlike bad multi-GPU optimization or GameWorks libraries, for which you can blame developers).

It all started with the now-infamous DirectX 11 Command Lists. Believe it or not, when DX11 was presented, it was the solution that would finally bring multi-threading to the graphics pipeline and deliver tremendous improvements on parallel workloads. To quote the explanation of the new API from Anandtech:
Quote:

We are especially hopeful about a faster shift to DX11 because of the added advantages it will bring even to DX10 hardware. The major benefit I'm talking about here is multi-threading. Yes, eventually everything will need to be drawn, rasterized, and displayed (linearly and synchronously), but DX11 adds multi-threading support that allows applications to simultaneously create resources or manage state and issue draw commands, all from an arbitrary number of threads [bold added by me].
If that reminds you of some recent new API announcements, it is because the wording is virtually the same.

The way that DX11 was supposed to achieve this was by using Deferred Contexts in combination with Command Lists. NVIDIA supports both; AMD does not support Command Lists, which results in comical situations like this one in the AMD developer support forum, where a question about it has remained unanswered for five years now:
[spoiler][/spoiler]
Surprisingly, Firaxis was one of the first developers to leverage that power from DX11. Apparently Civilization V supported all the multithreading features of DX11, resulting in CPU utilization that looked like this:
[spoiler][/spoiler]
on twelve threads.
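To make the record-then-submit idea concrete, here is a minimal toy model in Python. It is not Direct3D code: the class and function names (`DeferredContext`, `CommandList`, `ImmediateContext`, `render_frame`) are hypothetical stand-ins that only loosely mirror the roles of `ID3D11Device::CreateDeferredContext`, `ID3D11DeviceContext::FinishCommandList` and `ID3D11DeviceContext::ExecuteCommandList`. Python threads also do not run bytecode in parallel, so this only shows the structure of the pattern, not any speedup.

```python
import threading


class CommandList:
    """Immutable recording of commands (toy stand-in for a D3D11 command list)."""

    def __init__(self, commands):
        self.commands = tuple(commands)


class DeferredContext:
    """Records commands without executing them (hypothetical, for illustration)."""

    def __init__(self):
        self._pending = []

    def draw(self, mesh_id):
        # Nothing is submitted here; the call is only recorded.
        self._pending.append(("draw", mesh_id))

    def finish_command_list(self):
        # Hand over what was recorded and reset the context for reuse.
        recorded, self._pending = self._pending, []
        return CommandList(recorded)


class ImmediateContext:
    """The single context that actually 'submits' work, in order."""

    def __init__(self):
        self.executed = []

    def execute_command_list(self, command_list):
        self.executed.extend(command_list.commands)


def record_chunk(meshes, results, slot):
    """Worker body: record one chunk of draws into its own deferred context."""
    context = DeferredContext()
    for mesh_id in meshes:
        context.draw(mesh_id)
    results[slot] = context.finish_command_list()


def render_frame(meshes, worker_count):
    """Record on several threads, then replay every list on the main thread."""
    chunks = [meshes[i::worker_count] for i in range(worker_count)]
    results = [None] * worker_count
    workers = [
        threading.Thread(target=record_chunk, args=(chunk, results, slot))
        for slot, chunk in enumerate(chunks)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    immediate = ImmediateContext()
    for command_list in results:
        immediate.execute_command_list(command_list)
    return immediate.executed
```

The point of the design is that the expensive part (building the commands) happens on the worker threads, while the immediate context only replays finished lists. Whether that replay is cheap depends on the driver supporting command lists natively, which is exactly the part that is optional.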
Unfortunately, support for Deferred Contexts and Command Lists is optional, and AMD chose not to support Command Lists. Ryan Smith had contact with NVIDIA during the release of Civilization V, and he explains everything they told him in this post. Apparently AMD had activated some kind of Command Lists support for the 7970/GCN architecture, but only for Civilization V specifically. Another quote, from the original Anandtech 7970 review:
Quote:

[...]Because of the fact that Civilization V uses driver command lists, we were originally not going to include it in this benchmark suite as a gaming benchmark.[...]Next to DCL CivV’s other killer feature is its use of compute shaders, and GCN is a compute architecture. To that extent we believe at this point that while AMD is still facing some kind of DCL bottleneck, they have completely opened the floodgates on whatever compute shader bottleneck was standing in their way before.[...]
You will ask, why does this matter so much? The answer is that it is a good indicator of the general state of multithreading and API overhead in AMD's DX11 drivers. And that state is bad.

In this GDC presentation given by people from NVIDIA, AMD and even Intel, you will see that it is the driver scheduler that is really responsible for almost everything regarding CPU optimization. So, you ask, how bad is it?

The answer is really bad.

Unfortunately, Eurogamer (Digital Foundry) is the only outlet I have seen doing any real work on GPU performance with slower CPUs; it would be really nice to see more work like this from other sites. Digital Foundry's findings have not been disputed by either NVIDIA or AMD, so I will use them as a general guide here with regard to actual gameplay experience. The new 3DMark API overhead test will serve as the synthetic benchmark basis.

This article (which is a recommendation for a console-killer) says it all. For people with lower-end CPUs, they recommend the GTX 750 Ti over the R9 270x!
I'll assume that people reading here have an idea of the relative hardware in each card, but for the sake of readability keep in mind that the R9 270x is almost double the card that the GTX 750 Ti is (and it shows in the tests with better CPUs). The actual competitor of the GTX 750 Ti should have been the R7 260x, but apparently its performance is so atrocious with lower-end CPUs that they cannot recommend it.
From the article:
Quote:

[...]we've re-run some of our benchmark tests, comparing AMD and Nvidia performance on the high-end quad-core i7 with the middle-of-the-road dual-core i3 processor. The results are stark. Both 260X and 270X lose a good chunk of their performance, while Nvidia's GTX 750 Ti is barely affected. The situation is even more of an issue on the mainstream enthusiast 1080p section on the next page. There's no reason why a Core i3 shouldn't be able to power an R9 280, for example, as the Nvidia cards work fine with this CPU. However, the performance hit is even more substantial there.[...]
Click below for some screenshots and percentages from the article:
[spoiler]Far Cry 4.
GTX 750 Ti Loss: 0%
R9 270x Loss: -27.27%


Far Cry 4 again.
GTX 750 Ti Loss: -10.52%
R9 270x Loss: -46.66%


Now the same game, with the R7 260x.

Far Cry 4.
GTX 750 Ti Loss: -10.52%
R7 260x Loss: -13.63%

You will notice that the lower-end an AMD GPU is, the less it is bottlenecked. Keep that in mind until we get to the R9 280 numbers.

Far Cry 4 again.
GTX 750 Ti Loss: 0%
R7 260x Loss: -4.54%

Same here, almost no bottleneck, but you will notice that the 750 Ti is almost steadily at zero.

Ryse: Son of Rome
Back to the R9 270x now.
GTX 750 Ti Loss: -7.69%
R9 270x Loss: -35.29%


More Ryse: Son of Rome
This time with the R7 260x. The frame rate loss is not large, but notice the horrible frame spikes.
GTX 750 Ti Loss: -7.69%
R7 260x Loss: -8.33%


Call of Duty Advanced Warfare
The R9 270x again.
GTX 750 Ti Loss: 0%
R9 270x Loss: -34.14%


Call of Duty Advanced Warfare.
R7 260x.
GTX 750 Ti Loss: 0%
R7 260x Loss: -20.58%


Call of Duty Advanced Warfare moar.
This time the lower-end processor is an A10 5800K/Athlon X4 750K. The NVIDIA GPU stutters more with the AMD CPU, something that the AMD GPUs do not do. The higher-end R9 270x still loses more than 50% of its high-end-CPU frame rate though, which is (de)impressive.
GTX 750 Ti Loss: -16.66%
R9 270x Loss: -52.38%


Call of Duty Advanced Warfare final.
Still with the lower end AMD processors, the R7 260x. Although it gets GPU-bound much faster, it still manages to lose an impressive 41.17% with the lower CPU.
GTX 750 Ti Loss: -16.66%
R7 260x Loss: -41.17%
[/spoiler]
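For anyone who wants to reproduce percentages like the ones above from their own captures, here is a minimal sketch. The frame rates and millisecond costs in it are hypothetical placeholders, not values taken from the Digital Foundry article. The second function is a deliberately simple model, assuming that a frame takes as long as the slower of the CPU side (game plus driver) and the GPU side.

```python
def percent_change(fast_cpu_fps, slow_cpu_fps):
    """Frame rate change (in percent) when moving from the fast CPU to the slow one."""
    return round((slow_cpu_fps - fast_cpu_fps) / fast_cpu_fps * 100, 2)


def frame_rate(cpu_ms, gpu_ms):
    """Simplified model: the frame is limited by whichever side takes longer."""
    return 1000.0 / max(cpu_ms, gpu_ms)


# Hypothetical average frame rates, only to show the arithmetic.
example_loss = percent_change(44, 32)

# Hypothetical per-frame costs in milliseconds.
FAST_CPU_MS, SLOW_CPU_MS = 8.0, 20.0   # CPU side, including driver work
FAST_GPU_MS, SLOW_GPU_MS = 12.0, 24.0  # GPU side

fast_gpu_loss = percent_change(
    frame_rate(FAST_CPU_MS, FAST_GPU_MS), frame_rate(SLOW_CPU_MS, FAST_GPU_MS)
)
slow_gpu_loss = percent_change(
    frame_rate(FAST_CPU_MS, SLOW_GPU_MS), frame_rate(SLOW_CPU_MS, SLOW_GPU_MS)
)
```

In this model the slower GPU stays GPU-bound with both CPUs and loses nothing, while the faster GPU becomes CPU-bound and loses a large share of its frame rate. That is consistent with the pattern in the list above, where the R9 270x loses more than the R7 260x. Anything that inflates the CPU-side cost per frame, driver overhead included, widens that loss.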

I will stop referring to the Digital Foundry now, after I quote them one last time from their GTA V PC performance article:
Quote:

[...]On the plus side, the GTX 750 Ti fares much better when paired with a budget CPU than the rivalling AMD R9 280. Despite costing £30-40 more, the card is a write-off for 60fps performance at 1080p, even with all settings and sliders at their lowest. Left at high settings, spikes down to 35fps are common, again pointing to an issue with AMD cards when paired with weaker CPUs. Unlike the Nvidia 750 Ti, a 30fps lock is needed here when targeting 1080p and anything close to current-gen consoles[...][bold added by me]
They are basically telling us that the driver is SO BAD that the GTX 750 Ti performs better with an i3 than the R9 280 does. If any of you have any concept of relative GPU performance, you understand what kind of bottleneck we are talking about here.

The CPU utilization performance gap was always there, but it has reached these tragic proportions since NVIDIA launched their 337.50 beta driver set, which promised (and apparently delivered) this:
[spoiler][/spoiler]

There is no concrete proof about it, but it seems that NVIDIA is using Command Lists and other API optimizations in their driver, even if the game/app does not support them. This is pure speculation for now, but the results are there and they explain the vast gap in overhead performance under DX11.

From the Anandtech test for GPU overhead, we can see that all the AMD cards in DX11 hit a bottleneck that stops them at 1.1 million draw calls. They don't scale at all between them, and they don't scale at all regardless of whether the load is multithreaded or not, which clearly indicates that the bottleneck is not in the hardware.

[spoiler][/spoiler]
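The reasoning behind "the bottleneck is not in the hardware" can be sketched with a tiny model. Only the 1.1 million figure comes from the test discussed above. The GPU limits, the thread counts and the `driver_scales_with_threads` switch are hypothetical, and the function assumes that throughput is simply the smaller of what the GPU can consume and what the CPU side can submit.

```python
def draw_calls_per_second(gpu_limit, per_thread_rate, threads, driver_scales_with_threads):
    """Draw-call throughput as the smaller of GPU capacity and CPU-side submission."""
    # If the driver serializes submission, extra threads do not help.
    effective_threads = threads if driver_scales_with_threads else 1
    return min(gpu_limit, per_thread_rate * effective_threads)


PER_THREAD_RATE = 1_100_000   # the cap observed in the DX11 results above
SMALL_GPU_LIMIT = 5_000_000   # hypothetical hardware limits
BIG_GPU_LIMIT = 10_000_000

# Serialized driver: same result for every card and every thread count.
serialized = [
    draw_calls_per_second(gpu, PER_THREAD_RATE, threads, False)
    for gpu in (SMALL_GPU_LIMIT, BIG_GPU_LIMIT)
    for threads in (1, 4, 6)
]

# Driver that scales with threads: results now depend on threads and on the card.
scaling_small = draw_calls_per_second(SMALL_GPU_LIMIT, PER_THREAD_RATE, 6, True)
scaling_big = draw_calls_per_second(BIG_GPU_LIMIT, PER_THREAD_RATE, 6, True)
```

When every card lands on the same number no matter how many threads are used, the limiting term in this model has to be the CPU-side one, which is the conclusion drawn from the benchmark.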

Now is the time when people say that "DX12 is the future, why should we care" etc. My answer to that is b u l l s h i t . 99.99% of the games in the PC catalogue are going to be DX9/DX11 for at least two years after DX12 appears. DX11 optimization matters, because the PC is the only gaming platform that offers backwards compatibility. If some idiot (because they would only be that) at AMD suddenly decides that DX11 is not important any more, then we're royally screwed, gentlemen.
You might say that with that advertisement for the CPU/GPU optimization guy some months ago, something might happen. Well, I have news for you. The advertisement is still up, which most likely means that they haven't hired anybody yet. Unfortunately, not being able to find someone to hire is almost expected, especially if you read this blog post from Valve's Rich Geldreich. Although he refers to the state of the OpenGL driver stack, he gives insights into the situation at both companies. About "Vendor B" (which is AMD) he says:
Quote:

[...]This vendor can't get key stuff like queries or syncs to work reliably. So any extension that relies on syncs for CPU/GPU synchronization aren't workable. The driver devs remaining at this vendor pine to work at Vendor A.[...]
where Vendor A = NVIDIA.

And this brings us to today. The main reason for this post is that I don't want everybody to forget the importance of a good DX11 driver stack amidst the enthusiasm for DX12/Vulkan.

AMD does have a leg up on the lower level API driver development (as they should), but the extra resources from that leg up should be used to add more features and make the DX11/DX11.3 driver leaner and faster, and close the driver gap with NVIDIA that is apparently widening instead of closing.

Indications of that are situations like the whole VSR/frame limiter fiasco (does anyone still remember the frame limiter?), the Vsync/Antialiasing controls in CCC working with almost nothing, no driver-side Triple Buffering support, no frame pacing in single-GPU configurations, no double/dynamic vsync. This could go on and on and on...

There was a time when there wasn't any driver gap (around 2010). Now it is back with a vengeance, and something should be done about it instead of burying our heads in the sand and crying about "teh DX12s".

