Coding for Speed

Concepts

I bet you think you know exactly how to optimize code. Most of us do, until we get a hard reality check. For example, many programmers, to this day, continue to believe that writing raw pointer arithmetic produces faster code than using array notation; when in fact, that’s the exact opposite of the truth.

Take for instance:

 char s[100];
 char z[100];
//.......
for( int i = 0; i<100; ++i )
    s[i] = z[i];

That will execute faster than

for( int i = 0; i<100; ++i )
   *(s+i) = *(z+i);

Why? Because in the first case, the compiler knows there can’t be aliasing, so it parallelizes execution. In the second case it only sees magical addresses and it cannot assume no aliasing, so each operation must wait until the previous one retires.

Another example of the lack of understanding that prevails is the amount of branches one finds in most code.

Branches are expensive.

In order for a modern cpu to keep its pipelines full, it must be starting execution of instructions a few instructions ahead of the instructions being retired. But what does it do at a branch, if the condition is based on the result of previous computations that have not yet emerged from the other end of the pipeline?

What modern cpu’s do is try to predict branches. This is called “speculative execution”. The cpu takes a guess at which way a branch is going to go, and 99% of the cases (figuratively speaking) it’s right. So it continues executing instructions on the side of the branch it assumes will be taken. When the prediction is right, all is hankydoory. If the prediction proves wrong, then it will have to “flush the pipeline”, and start executing the right side of the branch.

The prediction or guess is based on past behaviors at that branch. But to keep track of those past behaviors, the cpu must keep a log of what branches do. This log is implemented as a small memory cache called the “branch prediction cache”. It only has room for a few thousand branches, and here’s the crux of the matter: Every new branch the cpu meets in code causes the eviction of some other branch from this prediction cache. Branches should therefore be as few as possible: Branch prediction correctness decreases as the count of total branches in a program grows above the number of available entries in the branch prediction cache.

But branches also increase code-size. And code-size, all other things being equal, decreases performance, as cache coherence decreases as the size of code and data grow above the chache size.

I got a perfect example to show: I was running a test app for a per-thread pool allocator under AMD’s Code Analyst. I wanted to know, first of all, if there were any bottlenecks I should be aware of. The pool allocator code comes from the boost::pool library. The test app does like half a million allocations and deallocations; and the pool code uses arrays and vectors, rather that heaps, to avoid complex, recursive functions. My modification was to make the pools per-thread, by making the pool singleton’s pointer thread-local.

Anyways here’s Code Analyst at work:

Would have been hard to imagine that most of the cycles were spent on the size() function for std::vector<>, but it appears so...

What could possibly be wrong with size()? Let’s double-click on it:

Single line of code; sorry it’s going off screen in the pic. What it says is,

return (_Myfirst == 0 ? 0 : _Mylast - _Myfirst);

I’m going to be hypothetically critical, here: Hpypothetically because I don’t know for sure that _Mylast - Myfirst must be zero when the vector is empty; but I will assume that hypothesis for illustration.

If I make the above assumption, and assuming the programmer was aware of it, one can only theorize that the programmer thought he or she was “optimizing” the code for the empty case: “hey, we can spare the subtraction!”

However, and no matter how short the notation may look, the above code causes a branch:

which increases code size, occupies an entry in the branch prediction cache, and then gets mispredicted, to boot. And what’s the benefit? None at all. The Arithmetic pipe is usually sitting idle. A little subtraction is an escape from boredom to it; and how often are vectors empty, anyways?

But the worst part of it is, this size() function could have been inlined. But MSVC, correctly I must say, decided not to inline it. Why? Not sure, but I wouldn’t inline it as is, either. Inlining it would cause the branch to be repeated at each instantiation of the function; stealing multiple entries from the branch prediction cache.

The better way to code that would have been,

return _Mylast - _Myfirst;

period.

My assumption is probably wrong, though, and testing _Myfirst for zero must be related to some uninitialized condition. But my criticism doesn’t necessarily die; it merely transmutes: Such uninitialized –or whatever– condition should have been prevented some other way, to avoid deferring a descision to a potentially frequently called function.

(NOTE: This is the STL that comes with MSVC. I think we should switch to using STLport, which is open-source and allegedly the best STL by far; and if it needs fixing we can fix it, and submit patches to the maintainers.)

Conclusion

Just memorize this document:

http://developer.amd.com/htmlhelp/optimization/wwhelp/wwhimpl/java/html/wwhelp.htm

And then download Code Analyst from AMD, and have fun with it.

And then...

...there are a gazillion more issues relating to app performance that aren’t part of that AMD document, so we’ll be adding stuff below, under subtitles.

Destructors

I don’t know why this is, but I’ve heard it from two reliable sources: C++ code is much better optimized for classes that DON’T spell out a destructor. If a class has a destructor, even as trivial as an inlined, non-virtual, do nothing destructor, the code will run slower than if you comment it out and let the compiler supply a default destructor. And apparently this is not just true of one compiler, but of most, if not all. So, whenever possible, write the following in a class:

class foo
{
    int bar;
    foo( int bar ) : bar(bar) {}
    //~foo(); ** Compiler supplied dtor ok; do NOT inherit with data augmentation.
    int get_bar() const { return bar; }
};

That way, the fact that the dtor is missing is documented and justified. Kind of the same thing goes for copy constructor and assignment operator. If you can, let the compiler supply them; that way, they’ll be optimized to fast, ::memcpy()-like code.

Inheritance

Most people think inheritance slows down your code. That’s true, but only if you use concrete inheritance. In a vertical hierarchy of concrete classes meant to be instantiated and used polymorphically, one has to swallow hard and pay the price. But if the purpose of inheritance is to build a class out of smaller chunks, or to tag a class, as “displayable”, “printable”, “collidable” or whatever, then there’s a way of avoiding inheritance’s cost:

Just make sure your mix-in classes are abstract, that is, make sure they have at least one pure virtual function. You can derive abstract classes from other abstract classes, use multiple inheritance, etc. all you want. Then, at the end, once your mixing and building are done, derive a final concrete class.

The compiler won’t build a VTable for a class whose ancestors are abstract and which has no descendents; a hierarchy with a single concrete class gets compiled as efficiently as a single class with no inheritance.

Trivia: What’s the fastest compiled code OO language?

“Sather” it’s called. One in its bag of many tricks is, it disallows concrete inheritance, period. In Sather you use inheritance a lot (it even has 3 types of inheritance); but you build your huge inheritance graphs in the abstract, and only the leafs are concrete classes.

Compiler and compiler Options

Perhaps not last, but cerainly not the least of concerns is what compiler to use. The fastest compiler by far is Digital Mars. Well, fastest at compiling, that is. The fastest-code compiler is Metrowerks, and I’d say, eventually I’ll buy it. No self-respecting game design company would use anything but Metrowerks. MSVC and most other compilers are good enough for desktop apps; but games –and game engines– are a whole different story.

For now, tho, we’ll have to make do with what we got, and make the best of it. And compiler optimizations are a way to achieve the latter. Where data + code »> cache, one should be optimizing for code size rather than code speed. Size is everything. Only routines that get called a lot, and inner loops, should be optimized for speed, to the detriment of code size, such as loop-unrolling. Anywhere else, code size IS code speed: The smaller the app, the greater the cache coherence. This applies to data, too; we should consider keeping data “compressed”, not as in “zip-compressed” but rather in the sense of using the smallest representations we can. If a number will be between 1 and 20, let’s use a char to represent it, rather than waste 3.5 bytes by using an int.

In most cases, code size and speed optimizations come together, anyhow. Take for example member alignment in classes, which you should have read about by now in that AMD optimization guide in the link above. Just sorting the data members of a class from largest to smallest, and padding the end of the class so that its size is a multiple of the size of its largest member. In most cases it actually reduces the size of a class, by avoiding compiler default alignment, and can result in doubling or tripling the speed of member access. And using floats instead of doubles, except where doubles are needed.

So, yeah, our optimization settings should be for code size; and then we can override them for strategic functions via pragmas.

 
ref/codetune.txt · Last modified: 2006/10/18 21:19 by chuck_starchaser