My own little DirectX FAQ
Tuesday, June 18, 2002
Why is software T&L so slow?
Well, it's slower than hardware, but not that slow. If you're finding massive speed differences (i.e. more than 10x), you may be doing something wrong. You need to be aware of the fundamental difference between software T&L and hardware T&L.
Hardware T&L only transforms vertices as-and-when you actually use one by referencing it with an index. To prevent it from transforming vertices multiple times (when used in different triangles), hardware has a vertex cache. Keeping this vertex cache happy is the key to getting good HWT&L speed.
Software T&L doesn't work this way - chasing individual indices on the CPU is inefficient. What is far more efficient is for the T&L engine to start at one vertex in a VB and T&L the next n vertices, ignoring which ones are actually referenced by indices. So to keep it happy, you need to make sure that when drawing a sub-section of your VB, all the vertices that are used lie in a single contiguous chunk. You tell it where this chunk is using the MinIndex and NumVertices arguments to the DrawIndexedPrimitive call (see the DrawIndexedPrimitive FAQ entry below for details).
For software T&L, an index list of (1,2,99, 99,2,100) is terrible. You're drawing two triangles using four vertices, but the software T&L engine is transforming vertices 1-100, so 96 of the 100 transforms are wasted. It is worth noting that hardware T&L doesn't like it much either because of memory-cache efficiency, but it's nowhere near as lethal as the software case. You need to move vertices 99 and 100 down to slots 3 and 4, then draw (1,2,3, 3,2,4) instead.
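The "move the vertices down" step can be sketched in plain C++, independent of any D3D types. This is only an illustration: the Vertex struct, the CompactedMesh struct and the CompactVertices function are hypothetical names made up for this sketch, and a real version would copy whatever vertex format you actually use. It packs the referenced vertices into a fresh array in first-use order and rewrites the indices to match. Note that it numbers slots from 0, so the example list above comes out as (0,1,2, 2,1,3) rather than (1,2,3, 3,2,4).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical vertex layout, for illustration only.
struct Vertex { float x, y, z; };

struct CompactedMesh {
    std::vector<Vertex> vertices;        // only the vertices actually referenced
    std::vector<std::uint16_t> indices;  // rewritten to point into `vertices`
};

// Copies every referenced vertex into a contiguous block (first-use order)
// and remaps the index list so it refers to the new, tightly packed slots.
CompactedMesh CompactVertices(const std::vector<Vertex>& srcVertices,
                              const std::vector<std::uint16_t>& srcIndices)
{
    CompactedMesh out;
    std::unordered_map<std::uint16_t, std::uint16_t> remap;  // old index -> new index
    out.indices.reserve(srcIndices.size());

    for (std::uint16_t oldIndex : srcIndices) {
        assert(oldIndex < srcVertices.size());
        auto it = remap.find(oldIndex);
        if (it == remap.end()) {
            // First time this vertex is seen: give it the next free slot.
            const auto newIndex = static_cast<std::uint16_t>(out.vertices.size());
            it = remap.emplace(oldIndex, newIndex).first;
            out.vertices.push_back(srcVertices[oldIndex]);
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

You would normally do this once, offline or at load time, and then fill the real VB and IB from the compacted arrays.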
So you need to make sure that (a) all vertices used in a single draw command are near each other in the VB and (b) that you set the MinIndex and NumVertices values to accurately tell the software T&L engine which ones you are using - no more, no less. The DX debug runtime will warn you if you try to use MinIndex and NumVertices values that don't cover all the vertices you are using, but it cannot usually spot that you are T&Ling too many vertices. So you need to check this yourself.
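One way to do that check yourself is to scan the index list before the draw call. The sketch below is plain C++ with no D3D dependency; VertexRange, ComputeVertexRange and RangeUsage are hypothetical helper names made up for this illustration. The first function finds the tightest range that covers every referenced vertex, which gives the two values you would pass as MinIndex and NumVertices. The second reports what fraction of that range is actually referenced, which is the thing the debug runtime cannot tell you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct VertexRange {
    std::uint32_t minIndex;     // lowest vertex index referenced
    std::uint32_t numVertices;  // highest index - lowest index + 1
};

// Tightest contiguous range covering every index in the list.
VertexRange ComputeVertexRange(const std::vector<std::uint16_t>& indices)
{
    assert(!indices.empty());
    const auto mm = std::minmax_element(indices.begin(), indices.end());
    const std::uint32_t lo = *mm.first;
    const std::uint32_t hi = *mm.second;
    return { lo, hi - lo + 1 };
}

// Fraction (0..1] of the vertices in the range that are actually referenced.
// A low value means software T&L would transform many vertices for nothing.
double RangeUsage(const std::vector<std::uint16_t>& indices)
{
    const VertexRange range = ComputeVertexRange(indices);
    std::vector<bool> used(range.numVertices, false);
    std::size_t usedCount = 0;
    for (std::uint16_t index : indices) {
        const std::size_t slot = index - range.minIndex;
        if (!used[slot]) {
            used[slot] = true;
            ++usedCount;
        }
    }
    return static_cast<double>(usedCount) / static_cast<double>(range.numVertices);
}
```

For the index list (1,2,99, 99,2,100) this gives a range starting at 1 and spanning 100 vertices with only 4 of them used. For (1,2,3, 3,2,4) the range spans 4 vertices and all of them are used. Something like this is probably best kept in debug builds, with a warning when the usage drops below a threshold you choose.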
Software T&L can be fast, and with a fast CPU and a slower video card, it can be faster than hardware T&L (depending on what you are doing). If your video card does not support Vertex Shaders, it is certainly worth trying out software Vertex Shaders, rather than throwing up your hands in despair - they are pretty quick. But they work in a very different way to hardware T&L, and you need to make sure you are driving both efficiently.