My own little DirectX FAQ
Saturday, December 24, 2005
Vertex cache optimisation
Most graphics cards have a vertex cache that sits after the Vertex Shader or Fixed-Function Transform & Light stage and caches the results. (they also have a pre-VS cache, I'll discuss that below). The idea of this cache is to make sure the card never processes any vertex twice. However, getting it to work perfectly is impossible - the best you can hope for is that the number of vertices that are re-processed is kept to a minimum. By sending triangles in the correct order to the graphics card, you can keep the vertex cache nice and warm, and the card won't re-process a significant number of vertices. The crucial thing is to choose a triangle ordering that hits the cache a lot. It is so important, that when finding an ordering, you should ignore whether your eventual primitive type is lists, strips or fans (see the post below) - the only important thing is the order you draw the triangles in.
The efficiency of a vertex cache is usually measured in "number of vertices processed for each triangle drawn", or verts/tri. On most meshes that we use in games, a ratio of 0.5 is the best you can possibly get on a totally regular mesh with an infinitely-sized cache. In practice, we don't have perfectly regular meshes or infinite caches, so this is nearly unobtainable. Getting somewhere between 0.6 and 0.8 verts/tri is far more usual.
There are many ways to choose a good ordering. Here are five common ones, roughly ordered from worst to best:
1 - "Stripping" your mesh - making it into triangle-strip primitives as long as possible. This is what people have done since the 80s with programs such as "Stripe" and a paper on "Tunneling for Triangle Strips" (by A. James Stewart), and it's not too bad. However, stripping was designed for hardware that did not have indexed primitives, just strips, so it has rather different goals. While still acceptable on indexed hardware, it only gets a ratio of around 1 vert/tri, which is a long way off the practical best of 0.6. However, it is still better than a totally random order!
Let me say this just once, because academics are still spending time and money researching this subject. You're wasting your time. Strips are obsolete - they are optimising for hardware that no longer exists. Indexed vertex caches are much better at this than non-indexed strips, and the hardware is totally ubiquitous. Please go and spend your time and money researching something more interesting.
I'll get off my soapbox now.
2 - Hugues Hoppe's paper "Optimization of mesh locality for transparent vertex caching" (available at http://research.microsoft.com/~hoppe/). This has an interesting discussion of the theoretical limits of vertex caching, and also has a nice algorithm to optimise for a given size cache. However, there is a hidden problem with it - it assumes you know how big your cache is, and how it works. On the PC, you usually don't, because each user will have a different sort of graphics card, and they all have different types and sizes. Some even change their size depending on the vertex format. The problem with Hoppe's method is that if you optimise for a cache of 16 vertices, and the real one is only 12 vertices, you will "thrash" the cache badly, and actually get very bad performnce. Conversely, if you optimise for a cache of 8 vertices and the real one has 24 entries, you will get very little benefit from the extra space. So this method is useful for known targets (e.g. consoles), but not that useful for PCs, or for cross-platform optimisation.
3 - Bogomjakov and Gotsman's paper "Universal Rendering Sequences for Transparent Vertex Caching of Progressive Meshes" (http://www.cs.technion.ac.il/~gotsman/caching/). This is the best paper I have seen about finding an ordering that works well on all caches. I didn't really understand the second part of the paper about Progressive Meshes, because that bit relies on a very specific implementation of Progressive Meshes that I don't like much (very CPU-hungry). So just read the first bit. Essentially, what you want is an ordering that continually twists around on itself, rather than the long zig-zag patterns that Hoppe's version tends to generate. They start with the ideal of a Hilbert curve (which constantly curves in on itself), and try to generate that on more arbitrary meshes. Unfortunately, while their results look good, the algorithm is very complex (i.e. I don't actually understand how to implement it :-), and rather slow.
4 - The D3DXOptimizeFaces function. This is an implementation of Hoppe's algorithm with several improvements that you can just use directly, which is handy. The previous weaknesses apply though - this function gets around them by picking a fairly small vertex cache and optimising for that. Keeping the target cache small means you are unlikely to start thrashing the real cache. While not completely optimal, it is better than nothing, and much better than a stripper.
5 - The optimiser in the Granny Preprocessor (included in Granny: http://www.radgametools.com/gramain.htm). This is one I wrote recently (late 2005). It generates Hilbert-ish patterns that give good measured coherency for caches from 4 to 64 vertices on a wide range of meshes, so it's very general-purpose. However, it is a much simpler algorithm than the above ones, it's just been tweaked a lot. The advantage is that (a) you get source, (b) you can probably understand the source, if you want to modify it for your own purposes or platform oddities, and (c) it's very fast, which means you can run it at load-time rather than actually at export time. Speed doesn't sound like a big deal, because most of the time you will be doing offline processing, but it does solve one tricky problem. When exporting an animated mesh, you frequently need to chop it into sub-meshes so that each one only uses a certain number of bones (because they must all fit in the Vertex Shader constant space at once). But because different graphics cards have different-sized constant spaces, it would be nice to be able to do this splitting at load time, when you actually know how big the constant space of the user's card is (the Granny preprocessor does this for you as well!). Once you've split the mesh up into submeshes, you'd like to re-order each of the submeshes for best vertex cache use. This is where having a really fast vertex-cache optimiser comes in handy - you can do this when the user loads the game, rather than having to do it at export time.
So if you're buying Granny, the answer is obvious! If you're not, then I would use D3DXOptimizeFaces, because it's pretty good. If you can't do that, maybe because your asset pipeline doesn't like using D3D functions, then things like Stripe are at least better than nothing - though personally I would be tempted to write my own. There is a discussion on the DirectXDev list about writing good strippers (titled simply "Stripping", started on 19th December 2005) - certainly worth a read.
One final point - there is another cache, the pre-Vertex Shader cache. This is usually just a simple cache of the Vertex Buffer memory. Fortunately, the way to optimise for this is easy. When using indexed primitives, your vertices can be in any order in the Vertex Buffer you wish - you simply renumber the indices. So once you have found the right triangle ordering that makes the post-VS cache happy, scan the list of tris and as you get a vertex that has not been referenced before, add it to the vertex buffer. This way, as you draw the mesh, you will be picking up vertices in a basically linear memory order, start to finish. This keeps the pre-VS cache happy (and also the AGP/PCIe bus happy). It's a simple reordering, and unlike the optimisations above, works basically the same on all hardware. D3DXOptimizeVertices will do this for you, but it's not difficult to write your own.
Strips, lists, fans, indexed or not?
D3D has a wide array of primitive types, and they each have different performance characteristics. So which one should you use? To lay down some ground rules, these are the things you need to optimise for when making this choice - most important first:
1: Number of DrawPrimitive/DrawIndexedPrimitive calls.
2: Number of vertices processed by the vertex shader (or transformed and lit by the fixed-function pipeline).
3: Total bandwidth of data sent to the chip.
Generally, each point is far more important than the ones after it - often by a factor of 10. This means that we rarely if ever have to make any trade-offs between different points - the earlier-listed ones are always more important, even if it affects the ones further down.
In D3D, there's six type of triangle primitive. There's strips, fans and lists, and each comes in indexed and non-indexed flavours. First of all - ignore fans. They're not even worth thinking about in D3D, because you have to keep restarting them, which means an extra DrawPrimitivie call each time, and as this is factor number 1, that's never ever a good idea - don't even think it!
For similar reasons, you can ignore non-indexed strips, because either you have to keep restarting them with a DrawPrim call, or you have to use duplicate vertices to create degenerate triangles. But unlike the indexed case, those duplicate vertices are actual fat 20-40byte vertices, and each one is another vertex to be vertex shaded, violating factors 2 and 3. The same argument can be used against non-indexed lists, but even moreso - far too many vertices to be sensible.
So now we're down to two choices - indexed strips and indexed lists. If you don't know what indices are, look them up, because they're essential. There's very few cases where you don't need to use indices (e.g. particle systems), and even then I tend to use indices anyway just because it means all of my code uses the same code-path, and because in the case of particles, whether you use indices or not is irrelevant to performance in practice (it's fillrate that kills you on particles).
OK - so you've read up on indices? Good. Major fact to know about indices - they're the only way the vertex cache can do its job. If you don't have indices, the vertex cache is switched off, and this is a Very Bad Thing, because by using the vertex cache, you can dramatically reduce factor #2 - how many vertices actually get shaded. The vertex cache is typically a small cache on the GPUthat avoids re-shading a vertex if it already shaded it recently. It is usually between 8 and 32 entries, which isn't large, but it turns out that 32 is about the maximum useful size anyway.
Whichever primtitive type you use, your priority is to first optimise the order tris are drawn in to keep the vertex cache happy. For this, you need to just ignore what primitive type you're going to use - the order of the triangles is all that matter to the vertex cache. See the entry above (Vertex Cache Optimisation) for a discussion of this.
Once you've optimised your triangle ordering for good vertex-cache coherency, whether you then choose to use indexed lists, strips or fans is a fairly minor point. Some hardware prefers one type of primitive slightly over another, but in general you can use whatever method you like. The best one to use is the one that minimises the number of indices you generate, simply because more indices = more memory. On the face of it, strips should be better than lists, right? Well, not really - if you need to keep making degenerate tris to join "striplets" together, then you'll generate more indices than lists (which don't need degenerates). Although degenerate triangles are cheap with indexed primitives, because the hardware simply checks whether two indices are the same and throws the triangle away, they are not completely free - this culling does take a bit of time.
Some APIs (not PC DX5-DX9, but some consoles, and possibly future PC APIs) also allow a "restart" index, which means you don't need to use degenerate triangles, and they can make strips better than lists, and also make fans a reasonable option as well. My personal choice is to always use indexed lists, because the difference between them and indexed strips is tiny, they are far easier to construct on the fly (e.g. for particle effects or text rendering or HUDs), and it makes the code far simpler to use a single primitive type everywhere. But using indexed strips is a perfectly valid choice as well.
The important thing is - when choosing a primitive type, DO NOT change the order of the triangles, or you'll start hurting vertex cache efficiency, and that is never a good thing.
NOTE - the above advice may change in DX10, because I hear the primitive types might get some tweaking there. If you're reading this and DX10 has been released, then email me and remind me to update this entry for DX10!
Friday, December 16, 2005
Cunning ways to avoid clearing the framebuffer.
Or rather, why you shouldn't try to do this! In ye olden tymes of the Voodoo2, filling pixels was expensive, and you could avoid having to do a clear by using a convoluted method (use only half the Z-range, flip it every even/odd frame, never actually clear any pixels).
But that is all silly nowdays, and indeed it's usually slower. There's two big things to remember with modern cards:
(1) Clearing pixels to constant colours is fast. Amazingly fast. They can set constant colours or Z values at up to 256 times the speed of the fastest shaders! But you only get this if you're clearing an entire surface to a single value. You get this speed whether you're clearing Z/stencil or a colour.
(2) Z and framebuffer compression. Almost all modern cards can have pixels in either "compressed" or "uncompressed" states. Obviously, compressed pixels are faster to render to. Pixels can change from compressed to uncompressed by rendering triangles to them, but on many DX8 and DX9 cards, they can only change back to being compressed if you do a clear. If you never clear the framebuffer or Z/stencil buffers, then pixels can never become compressed again, and after a few frames of rendering, all your pixels will be uncompressed, and rendering speed will suffer.
So, taking these two facts, you can see that it is important to always clear both framebuffer and Z/stencil every frame if you can. Obviously there are some funky effects that involve not clearing them, but unless that's deliberately what you intend, always do clears. Note that this applies to both the framebuffer and any auxilliary buffers you render to - shadowbuffers, environment cubemaps, etc.
Also, if you are only rendering to a portion of a buffer (e.g. you have a 1024x1024 texture handy, but you only want a 512x512 shadowbuffer), make sure you clear the whole thing anyway, not just the region you're going to use. Cards want to do the equivalent of a C memset() - a single contiguous memory wipe. If you specify a subrectangle, sometimes they can't do this. Some schemes (such as rendering impostors) require you to only do subrect clears. If that's what you need, then do it - it's still quite fast. Sometimes a renderer's gotta do what a renderer's gotta do. But if you can clear the whole buffer, do so.
Also remember about "SLI" setups - multiple cards in one machine. Ideally, you want to stop the cards having to exchange data. To do this, these cards frequently run "Alternate Frame Rendering", so card 1 does even frames while card 2 does odd ones. But if you don't clear the framebuffer&Z/stencil, they might think you need the data from one frame to be used in the next, and card 1 has to send its FB/Z to card 2, and this is slow. By doing a full-screen clear of both buffers, you are telling the cards they don't need to transfer the data, because you're just going to set it all to black anyway. Drivers are smart enough to spot this, and your framerates will improve.
Also, tile-based renderers (PowerVR and intel cards) will thank you for doing clears, for lots of reasons.
Finally, remember to always clear Z and stencil, even if you don't use stencil. Because they're packed together, if you only clear Z, the card has to carefully preserve the stencil buffer, and you won't get the fast clears. Obviously if that's what you want to happen, because you deliberately want to preserve the stencil value, then that's what has to happen. But if you don't use or care about the stencil, clear it, and then the card can do its fast clears.
So, to summarise - if you have an indoor scene with a skybox outside, the best sequence of rendering is this:
1 - Clear the framebuffer and Z/stencil.
......This is very fast and sets all pixels to their compressed states.
2 - Enable Z write & Z test.
3 - Render the indoors scene, trying to do so mostly front to back.
...... This makes sure you lay down a Z buffer that will reject as many hidden pixels as possible.
4 - Disable Z writes, keep Z tests enabled.
5 - Render the skybox.
...... This rejects as many skybox pixels as possible. Since the skybox can't have anything behind it, there's no point in writing Z.
6 - Render any translucent objects (glass, particles, etc).
There are many other orders to do this in, but this is generally the most efficient on all cards.