
Post a Comment On: cbloom rants

"07-26-10 - Virtual Functions"

8 Comments

Blogger won3d said...

I wonder how much space is taken up by vtables, and how often they incur cache misses? That being said, increasing your load chain dependency might be annoying, even if you are hitting the cache when you are in-order.

Folks (like Herb Sutter) recommend the non-virtual forwarding-method approach, going so far as to recommend making your virtual methods private. It has the nice benefit of giving you distinct "caller" and "inheritor" interfaces. One thing that would help the latter compiler issue might be to force-inline the forwarding methods. Also, it's an interesting idea to use Hungarian notation for the virtual methods.
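A minimal sketch of that non-virtual interface (NVI) idiom, with hypothetical class and method names (not from the post):

```cpp
#include <string>

// Callers use the public non-virtual forwarder; inheritors override the
// private virtual. The forwarder is a plain call the compiler can inline.
class Entity {
public:
    std::string Describe() const {
        return "Entity: " + DescribeImpl();   // forward to the virtual
    }
    virtual ~Entity() {}
private:
    // Inheritor interface: overridable, but not directly callable from outside.
    virtual std::string DescribeImpl() const { return "base"; }
};

class Npc : public Entity {
private:
    std::string DescribeImpl() const override { return "npc"; }
};
```

The pre/post work (logging, checks) lives once in the forwarder instead of in every override.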

Do you use profile-guided or link-time optimizations? I've actually found that recent GCC will inline virtual methods (maybe these are only leaf methods). In the near future, though, I would expect that they will be able to do things like partial inlining. The compiler should figure out that one virtual method is more likely than others and emit the normal-branch/if()-style polymorphism for you. Unfortunately, it doesn't sound like you can wait 1-2 versions of GCC.

Re: implementation hiding

I suppose opaque pointers (which may be wrapped in a class to be more C++ idiomatic) might be better than vtables for implementation hiding/dependency minimization. Or were you suggesting something else?
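For reference, a minimal pimpl sketch of that opaque-pointer style (names are illustrative; normally the two halves live in a .h and a .cpp):

```cpp
#include <memory>

// Public header half: only an opaque pointer is exposed, so the
// implementation can change without recompiling callers, and no
// virtual dispatch is involved.
class Renderer {
public:
    Renderer();
    ~Renderer();
    int FrameCount() const;   // plain non-virtual call
private:
    struct Impl;              // defined only in the implementation file
    std::unique_ptr<Impl> impl_;
};

// Implementation-file half:
struct Renderer::Impl { int frames = 0; };
Renderer::Renderer() : impl_(new Impl) {}
Renderer::~Renderer() {}
int Renderer::FrameCount() const { return impl_->frames; }
```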

July 26, 2010 at 9:52 PM

Blogger ryg said...

A very important element missing from your coarse cost model is cache footprint, both I$ and D$. Let's go with the Xenon/PS3 again for a minute. There's 32k of I$ and 32k of D$, with 128-byte cache lines - i.e. 256 cache lines each (and suddenly it feels quite a bit smaller). It's not unrealistic to assume that no two different vtables end up sharing the same cache line (that's virtually guaranteed if the two types are implemented in different source files). So in a loop that iterates over objects of different classes, each class costs you one cache line (or more, depending on what you do). Even with a low number of distinct classes and careful coding style, you can quickly lose 15-16 cache lines to this: about 6% of your L1 cache, to buffer less than a cache line's worth of data you're actually interested in right now! It's not a big hit, but a little bit of friction everywhere.

I$ is worse. 1k of code = 256 instructions = not very much. How many unique instructions do you need per object in a typical update loop? It's very easy to run out of I$. Not all-out thrashing, but a bit of extra work on every call.

These two things have a notable effect on performance in typical virtual-heavy code. And it's often trivial to fix, without removing the virtuals or any big redesign efforts: if possible, just keep your entity list sorted by type. Easy fix as long as there's no particular order that things have to happen in.
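A sketch of the "sort by type" fix, assuming a hypothetical Entity base class with derived types:

```cpp
#include <algorithm>
#include <typeinfo>
#include <vector>

// Hypothetical base class; names are illustrative, not from the post.
struct Entity {
    virtual void Update() {}
    virtual ~Entity() {}
};
struct Soldier : Entity {};
struct Medic   : Entity {};

// Sort a heterogeneous entity list so objects of the same dynamic type
// are adjacent: the update loop then reuses the same vtable cache line
// and the same Update() code for a whole run of objects.
void SortByType(std::vector<Entity*>& entities) {
    std::sort(entities.begin(), entities.end(),
              [](const Entity* a, const Entity* b) {
                  // type_info::before() gives an arbitrary but
                  // consistent ordering of dynamic types.
                  return typeid(*a).before(typeid(*b));
              });
}
```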

"First of all, let's be clear than any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes."
The big shift in perspective is not in changing the mechanism of dynamic dispatch, but the granularity. Instead of dispatching per object, you group everything into coherent batches and do one dispatch per batch (that's an incremental change from the "sort by type" version). In fact, you can do it all incrementally:

- Sort by type (reduce D$, I$ misses)
- Group into batches, dispatch per batch (reduce function call overhead, both virtual and non-virtual)
- Allocate as batches, not individual objects (D$ again)
- Turn inheritance into composition and segregate storage. Adds extra ptrs and indirections, but transforms code from "one loop that does everything" into "10 loops that each do 1/10th of the work". D$, I$ benefits. Code that only cares about base class fields never needs to access cache lines containing derived class data.

You can't always do this. But it's definitely worth trying. As a bonus, if you get through it all, your code is in a form where it's relatively easy to efficiently use SIMD (or move work to a SPU, or both) if it's still too slow.
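The "dispatch per batch" step above might look like this sketch (Particle and the updater names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct Particle { float x, vx; };

// One virtual call covers a whole batch instead of one per object;
// the dispatch cost is amortized over n objects.
struct BatchUpdater {
    virtual void UpdateBatch(Particle* p, std::size_t n, float dt) = 0;
    virtual ~BatchUpdater() {}
};

struct EulerUpdater : BatchUpdater {
    void UpdateBatch(Particle* p, std::size_t n, float dt) override {
        // Inner loop is straight-line code the compiler can unroll
        // or vectorize; no per-object indirection.
        for (std::size_t i = 0; i < n; ++i)
            p[i].x += p[i].vx * dt;
    }
};
```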

July 26, 2010 at 9:59 PM

Blogger   said...

@ryg, sorting your game entities is a fantastic idea!

July 26, 2010 at 10:30 PM

Blogger jeskola said...

"No C++ compiler that I know of actually does this, because it requires knowledge of the class heirarchy in the whole program"

The Intel compiler does it.

July 27, 2010 at 7:16 AM

Blogger cbloom said...

"The big shift in perspective is not in changing the mechanism of dynamic dispatch, but the granularity. Instead of dispatching per object, you group everything into coherent batches and do one dispatch per batch (that's an incremental change from the "sort by type" version). In fact, you can do it all incrementally:

- Sort by type (reduce D$, I$ misses)
- Group into batches, dispatch per batch (reduce function call overhead, both virtual and non-virtual)"

Yeah, this is a very extreme example of exactly what I'm talking about.

First of all, you're putting the virtual on Update rather than Query. You're assuming that your Update() on object type A doesn't go off and call a bunch of Query calls on objects of various types. That's what really kills you.

If A::Update() calls B->Query(), C->Query(), etc., the sort by type doesn't really do the magic you want.

But you're mainly talking about components that only work on themselves: e.g. Skeleton::Animate() works very well with the method you describe, but NPC::Think() doesn't, because it has to touch a bunch of other objects.

July 27, 2010 at 10:26 AM

Blogger cbloom said...

"I suppose opaque pointers (which may be wrapped in a class to be more C++ idiomatic) might be better than vtables for implementation hiding/dependency minimization. Or were you suggesting something else?"

Yeah, you have to use something like opaque pointers.

But the main point is to not use dynamic dispatch when it could be static. This doesn't just apply to virtual calls either; e.g. don't use function pointers for a pluggable malloc or stdio or whatever when in fact they are always set to the same thing.
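The malloc case might look like this sketch (names are illustrative):

```cpp
#include <cstddef>
#include <cstdlib>

// Dynamic dispatch: always an indirect call through a pointer the
// compiler can't see through, so it's never inlined -- even though in
// practice it only ever points at one function.
void* (*g_alloc)(std::size_t) = std::malloc;

// Static dispatch: the target is fixed at compile time, so the call is
// direct and inlinable. Swapping allocators means recompiling, which is
// fine when the choice never changes at runtime.
inline void* GameAlloc(std::size_t n) { return std::malloc(n); }
```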

July 27, 2010 at 10:30 AM

Blogger cbloom said...

"Let's go with the Xenon/PS3 again for a minute. There's 32k of I$ and 32k of D$, with 128-byte cache lines - i.e. 256 cache lines each (and suddenly it feels quite a bit smaller)."

BTW when you actually have two hyperthreads doing different things, this is pretty catastrophic. And to make it worse the time to load from L2 to L1 is actually quite long (40 clocks or so). I'm not sure how most people deal with this.

July 27, 2010 at 10:33 AM

Blogger ryg said...

"First of all you're putting the virtual on Update rather than Query. You're assuming that your Update() on object type A doesn't go off and call a bunch of Query calls on objects of various types. That's what really kills you.

If A::Update() calls
B->Query() , C->Query(), etc.

the sort by type doesn't really do the magic you want."
The purpose is to reduce effective cache pressure, and it still does. A typical "which class calls into which other classes at runtime" matrix is relatively sparse. So you still get way better I$/D$ usage out of it than if the order is just random.

Then there's another thing: if possible, you should split "Update"-style functions into two phases nowadays. First phase just gathers data from other entities and caches it (but leaves visible object state alone), second phase updates visible object state only using existing object state and the cached values. That's purely to make it safe to distribute the update step over multiple cores. But it also ensures that the second-phase function is in the form you need for later transforms.

You can need arbitrarily many phases if you have expensive queries (e.g. collision tests) that you want to start based on the results of other queries. If you really need this in a lot of places, this design is a bad idea. If you can tolerate some latency, "pipeline" the work over multiple frames: results of certain queries get reported one frame late.
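The gather/commit split described above, as a sketch with a hypothetical Entity (field names are made up for illustration):

```cpp
#include <vector>

struct Entity {
    float x = 0.0f;
    float cached_target = 0.0f;   // scratch, written only in phase 1

    // Phase 1: gather. Reads other entities, writes only the cache,
    // leaves visible state (x) alone.
    void Gather(const std::vector<Entity>& all) {
        cached_target = all.empty() ? 0.0f : all[0].x;
    }
    // Phase 2: commit. Reads only own state plus the cache, touches no
    // other entity -- so all commits can run in parallel.
    void Commit() {
        x += 0.5f * (cached_target - x);
    }
};
```

Since phase 1 never writes visible state and phase 2 never reads another entity, the result is independent of update order within each phase.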

"BTW when you actually have two hyperthreads doing different things, this is pretty catastrophic."
Hardware threads are generally a different beast to design for than "real" threads: on real threads it can be a win to do work multiple times if it decreases comm/sync overhead and increases concurrency. On HW threads it's a loss, always. HW threads can also stall each other easily (e.g. microcoded instructions on PPU/Xenon stall both threads until they're done). Small cache size means you pretty much have to work on similar data or you're screwed. On the plus side, there's no false sharing between two HW threads, so none of the associated slowdowns. And communication between two threads on the same core is fast since you can avoid most (all?) of the processor read/write barriers, though that's a very risky game to play.

"And to make it worse the time to load from L2 to L1 is actually quite long (40 clocks or so). I'm not sure how most people deal with this."
Shake fist at the sky, yell "CURSE YOU, IBM!", and get on with their lives.

July 27, 2010 at 11:33 AM
