Warning: This is a long rant. I’d like to share my personal thoughts and opinions on graphics APIs like Vulkan, Direct3D 12.
Some time ago I came up with a diagram showing how the graphics software technologies evolved over last decades – see my blog post “Lower-Level Graphics API - What Does It Mean?”. The new graphics APIs (Direct3D 12, Vulkan, Metal) are not only a clean start, so they abandon all the legacy garbage going back to ‘90s (like
glVertex), but they also take graphics programming to a new level. It is a lower level – they are more explicit, closer to the hardware, and better match how modern GPUs work. At least that’s the idea. It means simpler, more efficient, and less error-prone drivers. But they don’t make the game or engine programming simpler. Quite the opposite – more responsibilities are now moved to engine developers (e.g. memory management/allocation). Overall, it is commonly considered a good thing though, because the engine has higher-level knowledge of its use cases (e.g. which textures are critically important and which can be unloaded when GPU memory is full), so it can get better performance by doing it properly. All this is hidden in the engines anyway, so developers making their games don’t notice the difference.
Those of you, who – just like me – deal with those low-level graphics APIs in their everyday work, may wonder if these APIs provide the right level of abstraction. I know it will sound controversial, but sometimes I get a feeling they are at the exactly worst possible level – so low they are difficult to learn and use properly, while so high they still hide some implementation details important for getting a good performance. Let’s take image/texture barriers as an example. They were non-existent in previous APIs. Now we have to do them, which is a major pain point when porting old code to a new API. Do too few of them and you get graphical corruptions on some GPUs and not on the others. Do too many and your performance can be worse than it has been on DX11 or OGL. At the same time, they are an abstract concept that still hides multiple things happening under the hood. You can never be sure which barrier will flush some caches, stall the whole graphics pipeline, or convert your texture between internal compression formats on a specific GPU, unless you use some specialized, vendor-specific profiling tool, like Radeon GPU Profiler (RGP).
It’s the same with memory. In DX11 you could just specify intended resource usage (
D3D11_USAGE_DYNAMIC) and the driver chose preferred place for it. In Vulkan you have to query for memory heaps available on the current GPU and explicitly choose the one you decide best for your resource, based on low-level flags like
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT etc. AMD exposes 4 memory types and 3 memory heaps. Nvidia has 11 types and 2 heaps. Intel integrated graphics exposes just 1 heap and 2 types, showing the memory is really unified, while AMD APU, also integrated, has same memory model as the discrete card. If you try to match these to what you know about physically existing video RAM and system RAM, it doesn’t make any sense. You could just pick the first
DEVICE_LOCAL memory for the fastest GPU access, but even then, you cannot be sure your resource will stay in video RAM. It may be silently migrated to system RAM without your knowledge and consent (e.g. if you go out of memory), which will degrade performance. What is more, there is no way to query for the amount of free GPU memory in Vulkan, unless you do hacks like using DXGI.
Hardware queues are no better. Vulkan claims to give explicit access to the pieces of GPU hardware, so you need to query for queues that are available. For example, Intel exposes only a single graphics queue. AMD lets you create up to 3 additional compute-only queues and 2 transfer queues. Nvidia has 8 compute queues and 1 transfer queue. Do they all really map to silicon that can work in parallel? I doubt it. So how many of them to use to get the best performance? There is no way to tell by just using Vulkan API. AMD promotes doing compute work in parallel with 3D rendering while Nvidia diplomatically advises to be “conscious” with it.
One could say the new APIs don’t deliver to their promise of being low level, explicit, and having predictable performance. It is impossible to deliver, unless the API is specific to one GPU, like there is on consoles. A common API over different GPUs is always high level, things happen under the hood, and there are still fast and slow paths. Isn’t all this complexity just for nothing? It may be true that comparing to previous generation APIs, drivers for the new ones need not launch additional threads in the background or perform shader compilation on first draw call, which greatly reduces chances of major hitching. (We will see how long this state will persist as the APIs and drivers evolve.) Still there is no way to predict or ensure minimum FPS/maximum frame time. We are talking about systems where multiple processes compete for resources. On modern PCs there is even no way to know how many cycles will a single instruction take! Cache memory, branch prediction, out-of-order execution – all of these mechanisms are there in the CPU to speed up average cases, but there can always be cases when it works slowly (e.g. cache miss). It’s the same with graphics. I think we should abandon the false hope of predictable performance as a thing of the past, just like rendering graphics pixel-perfect. We can optimize for the average, but we cannot ensure the minimum. After all, games are “soft real-time systems”.
Based on that, I am thinking if there is a room for a new graphics API or top of DX12 or Vulkan. I don’t mean whole game engine with physical simulation, handling sound, input controllers and all, like Unity or UE4. I mean an API just like DX11 or OGL, on a similar or higher abstraction level (if higher level, maybe the concept of persistent “frame graph” with explicit pass and resource dependencies is the way to go?). I also don’t think it’s enough to just reimplement any of those old APIs. The new one should take advantage of features of the explicit APIs (like parallel command buffer recording), while hiding the difficult parts (e.g. queues, memory types, descriptors, barriers), so it’s easier to use and harder to misuse. (An existing library similar to this concept is V-EZ from AMD.) I think it may still have good performance. The key thing needed for creation of such library is abandoning the assumption that developer must define everything up-front, with nothing allocated, created, or transferred on first use.