My own little DirectX FAQ
Wednesday, April 12, 2006
Why is Present so slow?
Well - it isn't! On its own, Present does very little except tell the GPU that the current frame is done, and it should display that to the screen. It might also do a few blits and clears, but those are very quick operations on today's cards.
The reason you're seeing Present on your profile is that Present is also when the CPU and GPU "sync up" with each other. They are two separate units operating asynchronously in parallel (well, ideally :-), and you need to be aware that at certain times, one may need to wait for the other. Obviously the GPU can't get ahead of the CPU (because the CPU generates commands for the GPU), but it is fairly easy to give the GPU so much work to do that you can generate rendering commands with the CPU faster than the GPU can complete those commands.
Let's say the GPU is managing to render a frame every 30ms. But what if the CPU only takes 10ms to generate the data for those frames? Obviously you can put some buffering in there to ensure smooth progress (and DirectX does this for you, unless you break it), but at some point the CPU is going to have a lot of frames queued up and the GPU can't finish them fast enough.
You could let this happen indefinitely, but then your CPU is generating data that won't be rendered and displayed for seconds - or even minutes. So you move the mouse left, and seconds later your view actually goes left. That's not very good for games. So DirectX limits the CPU to getting at most two frames ahead of the GPU (in practice, because of the way screen updates are handled, the time until you actually see the image on your screen might be three frames). This is considered a good balance between responsiveness and keeping the pipeline flowing to ensure a high framerate - the more buffering in the system, the less the GPU and CPU have to wait for each other.
So after a few frames of rendering, the CPU is now two frames ahead, and calls Present. At this point, DirectX notices that the CPU is too far ahead, and deliberately waits for the GPU to catch up and finish rendering its current frame - which will take 20ms (30ms - 10ms). This is why you're seeing such a long time spent in Present - the call itself is very simple, and doesn't do all that much. The thing that is taking the time is that the CPU is waiting for the GPU to catch up.
If you want to use this spare CPU time for something else, then you can use the D3DPRESENT_DONOTWAIT flag. What this means is that if Present would have stalled and waited for the GPU, instead it returns with the error D3DERR_WASSTILLDRAWING to say "the GPU is still busy". You can then do some work and try doing the Present again later.
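The retry pattern looks something like the following minimal sketch. Note this is a toy simulation, not real D3D code: FakePresent and g_gpuBusyPolls are hypothetical stand-ins for a Present call made with D3DPRESENT_DONOTWAIT, which returns D3DERR_WASSTILLDRAWING while the GPU is still catching up.

```cpp
#include <cassert>

// Hypothetical return codes standing in for S_OK and D3DERR_WASSTILLDRAWING.
enum HResult { kOk, kWasStillDrawing };

// Pretend the GPU needs three more polls before the current frame is done.
static int g_gpuBusyPolls = 3;

// Stand-in for Present(...) called with the D3DPRESENT_DONOTWAIT flag:
// instead of blocking, it reports "still busy" until the GPU catches up.
HResult FakePresent() {
    if (g_gpuBusyPolls > 0) {
        --g_gpuBusyPolls;
        return kWasStillDrawing;
    }
    return kOk;
}

// Keep doing small, interruptible chunks of CPU work (streaming, AI, audio
// decompression...) until the frame actually presents. Returns how many
// chunks of otherwise-stalled CPU time we reclaimed.
int PresentWithWork() {
    int workDone = 0;
    while (FakePresent() == kWasStillDrawing) {
        ++workDone;  // do one small chunk of useful work, then retry
    }
    return workDone;
}
```

The important design point is that each chunk of work must be small - if a chunk takes too long, you are now making the GPU wait for the CPU instead.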
Note that in general it is quite tricky to do any useful work in this period. You have to remember that on some systems with slower CPUs and faster GPUs, or in other parts of the game where the CPU has a lot to do and the GPU doesn't have as much, the CPU may be the slower of the two and you will never get a D3DERR_WASSTILLDRAWING result - the GPU will always be waiting for the CPU. Which means that any time reclaimed using this method is completely unreliable - you can't count on it at all. So be careful with this functionality!
Why is DrawPrimitive so slow?
Well - it isn't! If you simply do a whole bunch of DrawPrim calls in a row and time them, you will find they run very fast indeed. However, if you insert some rendering state changes (almost any of the Set* calls) in between those calls, that's when they start to get slow. Note that your profiler will not show the extra time being taken in the Set* calls themselves - they are always very fast.
What is happening is that the Set* calls don't actually do anything. For example, when you call SetTexture, it doesn't actually change the texture, it just remembers that you wanted the texture changed (it just writes a pointer into a memory location). Then when you actually call DrawPrim, it looks through all the changes you made since the last DrawPrim call and actually performs them. In the case of a texture change, it makes sure all the mipmap levels are loaded from disk, have been sent to video memory, are in the right format, and then tells the video card to update its internal pointers (in reality it's even more complex - but you get the idea). This is often called "lazy state changes" - the change is not made when you tell the change to happen, it happens when it absolutely finally has to happen (just before you draw triangles).
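A toy illustration of that lazy-state-change idea (not real D3D or driver code - just the caching pattern described above): SetTexture only records a pointer, and the expensive "commit to hardware" work happens inside DrawPrim, and only if the state actually changed since the last draw.

```cpp
#include <cassert>

// Toy model of lazy state changes. The real driver does far more validation
// (mip levels, formats, video memory residency), but the shape is the same.
struct LazyDevice {
    const void* pending = nullptr;    // what the app last asked for
    const void* committed = nullptr;  // what the "hardware" is really using
    int hardwareUpdates = 0;          // counts the expensive operations

    // Cheap: just remember the request - one pointer write, nothing more.
    void SetTexture(const void* tex) { pending = tex; }

    // The deferred change is applied here, just before triangles are drawn,
    // and only if something actually changed.
    void DrawPrim() {
        if (pending != committed) {
            committed = pending;
            ++hardwareUpdates;  // this is the part that really costs time
        }
        // ... issue triangles ...
    }
};
```

This is also why redundant Set* calls (setting the same texture twice in a row) cost almost nothing - the lazy check filters them out before any real work happens.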
By the way, when people talk about "DrawPrim" calls, they usually mean any of DrawPrimitive, DrawIndexedPrimitive, DrawPrimitiveUP and DrawIndexedPrimitiveUP. Note however that the *UP calls can be a bit slower than the other two - in general you should try to avoid them and use Vertex Buffers. However, if you spend too much CPU time avoiding the *UP calls, it becomes counter-productive, so don't avoid them like the plague - just be aware that they are slightly more expensive - they are still appropriate for some operations.
People always ask "which state changes are most expensive?" Well, it depends. A lot. On all sorts of things - mainly what video card is in the system. So any list is bound to be a bit fuzzy and imprecise, but here's a general guide, listed most expensive to least expensive.
-Pixel shader changes.
-Pixel shader constant changes.
-Vertex shader changes.
-Vertex shader constant changes.
-Render target changes.
-Vertex format changes (SetVertexDeclaration).
-Sampler state changes.
-Vertex and index buffer changes (without changing the format).
-Misc. render state changes (alpha-blend mode, etc.)
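One common way to exploit this ordering is to sort your draw calls so the most expensive state changes happen least often. A minimal sketch, with a hypothetical packed sort key (the field widths and which states you include are entirely up to your engine):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical draw-call record. The most expensive state (pixel shader)
// goes in the highest bits of the key, the cheapest in the lowest, so
// sorting groups draws by expensive state first.
struct DrawCall {
    uint16_t pixelShader;
    uint16_t vertexShader;
    uint16_t renderTarget;
    uint16_t miscState;

    uint64_t Key() const {
        return (uint64_t(pixelShader) << 48) |
               (uint64_t(vertexShader) << 32) |
               (uint64_t(renderTarget) << 16) |
               uint64_t(miscState);
    }
};

// Sort so that consecutive draws share as much expensive state as possible;
// combined with lazy state changes, repeated states then cost nothing.
void SortByStateCost(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.Key() < b.Key();
              });
}
```

Real engines usually fold more into the key (render-target and transparency ordering constraints, depth for front-to-back sorting), but the principle is the same: expensive changes in the high bits.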
I haven't talked about any of the older DX8-style states like SetTextureStageState - they're emulated by shaders on most modern cards, and so count as the equivalent shader call, for the purposes of profiling.
Why are cards so bad at shader changes? It's because many of them can only be running one set of shaders and shader constants at a time. Remember that video cards are not like a CPU - they are a very long pipeline - it may take hundreds of thousands of clock cycles for a triangle to get from being given to the card to being rendered. GPUs are so fast because they can have huge numbers of triangles and pixels being processed at once, so although they have very poor "latency" (the time taken from an input to go all the way through to the end and finish processing), they have extremely impressive "throughput" (how many things you can finish processing in a second).
So when you change shader, many cards have to wait for their entire pipeline to drain fully empty, then upload the new shader or set of shader constants into a completely idle pipeline, and then start feeding work into the start of the pipeline again. This draining is often called a pipeline "bubble" (though it's a slightly inaccurate term from the point of view of a hardware engineer), and in general you want to avoid them.
Newer cards can change shaders in a pipelined way, so changing shader or shader state is much cheaper - at least as long as you don't do it too often (they have a certain small number of active shaders at any one time - so you should still try to avoid changing frequently).
Be aware that some seemingly innocent changes can cause "hidden" shader changes. For example, a lot of hardware has to do special things inside the shader to deal with cube maps as opposed to standard 2D texture maps. So if you change from a cube map to a 2D map or vice versa, even though you think you didn't change the shader, the driver has to upload a new version anyway. The same applies to SetVertexDeclaration and SetSamplerState - for some cards, these also require a new shader. And lastly, changing some states such as stencil mode, Z mode or alpha-test mode can cause pipeline flushes and/or shader changes as well. So it's all very tricky - as I said, the list above is only for very general guidance.
Just to complicate matters further, in most cases DrawPrim calls are also put into a pipeline (a software one this time, not a hardware one), and only after a certain number of them are batched does the card's driver actually get called. This means that timing individual DrawPrim calls is basically pointless - you won't see how expensive that call was, you'll see some fairly arbitrary chunk of time being taken. In some cases, the driver was not even called - all that happened was the command was added to a queue. In other cases, the driver was called with a big queue of lots of DrawPrim calls and render state changes at once. In general, trying to profile your GPU by timing the CPU is going to be confusing and misleading. The only way to diagnose a GPU's true performance is to use something like PIX or some of the hardware vendor's own tools - they are the only things that can tell you what the graphics card is really doing.
Another general hint - when profiling, never run anything that renders faster than 10ms per frame (100fps). Always do more work to get the speed below 100fps (e.g. render the scene multiple times, or render more objects), and then measure how much extra work you did. For example, saying one object rendered in 2ms is not very useful - because you might infer from that that you could render only ten in 20ms. But because of overhead per frame and suchlike, you may actually be able to render a hundred or a thousand objects in 20ms.
If you get the framerate down below 100fps, i.e. your frame takes more than 10ms to render, then you have mostly got rid of the overhead and you are using the full power of pipelining. The lower you get the speed, the more reliable the timings (usually!). 10ms per frame is about when you can start to draw straight lines on graphs. When I see people saying things like "I did XYZ and my framerate dropped from 2378fps to 1287fps - wow, that's a really expensive operation!" - it makes me cringe. They're just chasing phantoms - it's not useful data at all.
Another error is to measure frames per second. Always convert to microseconds or milliseconds. If I render ten object As and it takes me 10ms, and ten object Bs and it takes me 20ms, you can be fairly certain that rendering ten of each will take me 30ms - it's simple addition. Whereas if you say ten As runs at 100fps and ten Bs runs at 50fps, then what does ten of each take - 0fps? 25fps? 150fps? It's all very unintuitive (the correct answer would be 33fps). Think in seconds per frame, not frames per second.
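The arithmetic from that example, as a pair of trivial helpers (the function names are mine, just for illustration): times in milliseconds add linearly, frame rates don't, so convert to milliseconds before doing anything else.

```cpp
#include <cassert>

// Frame rates and frame times are reciprocals of each other.
double FpsToMs(double fps) { return 1000.0 / fps; }
double MsToFps(double ms)  { return 1000.0 / ms; }

// Milliseconds compose by simple addition - the whole point of the argument.
// There is no equivalently simple rule for composing fps numbers directly.
double CombinedMs(double msA, double msB) { return msA + msB; }
```

So ten As at 100fps (10ms) plus ten Bs at 50fps (20ms) is 30ms, which converts back to roughly 33fps - the unintuitive answer from the paragraph above falls straight out of the unit conversion.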
Questions prompted by this post:
"Why do you say pixel shaders are more expensive than vertex shaders?" Well, because they are very slightly more expensive to change on some cards, simply because the pipeline has to drain further before the upload can start. But the difference is slight, and on other cards, the two are basically the same cost.
"Isn't changing textures a lot more expensive? Especially if you have to transfer them to video memory?" Certainly if the texture is not in video memory and has to be uploaded to the card, that is going to be a huge cost. I'm not talking about that here - these priorities assume that the texture is already in video memory (because you used it in a previous frame and you're not low on video memory). Also note that it really only applies to changing from one "simple" texture format to another. An example of a simple texture format is a 2D mipmapped DXT1 or 8888 texture - cards are pretty good about changing between these fairly quickly. More complex texture formats such as cube maps or floating-point formats may need some shader assistance, and changing those can be more expensive.