
Post a Comment On: cbloom rants

"07-13-11 - Good threading design for games"

11 Comments -

Blogger Ash said...

Interesting read, Charles. Thanks for sharing. Are you aware of any books or series of articles that shed more light on this kind of practical threading knowledge (as opposed to the academic kind, which I already have)?

Thanks.

July 14, 2011 at 12:13 PM

Blogger cbloom said...

Not that I know of. There are some good talks by the Killzone guys and maybe some by the Insomniac guys. Those are mostly about PS3 SPU task stuff, but they're still relevant because the ideal way to thread CPUs for games is a lot like how you run SPUs.

July 14, 2011 at 12:24 PM

Anonymous Anonymous said...

I am not aware of any alternative to calling Sleep() for an audio thread that feeds audio data out to the audio API. At least not if the API doesn't have some way to signal or interrupt you. So I wouldn't say never, but yes, it's very close to never. (Iggy has three sleeps, all involving the audio thread, and two of them shouldn't be in there and I need to set up some sort of signal to avoid it.)

Your point #7 got cut off in the middle.

July 14, 2011 at 3:36 PM

Blogger gerhans said...

Thanks for the post Charles. Would you be willing to expound on strategies for avoiding Sleep() under Windows?

July 14, 2011 at 4:35 PM

Blogger cbloom said...

"I am not aware of any alternative to calling Sleep() for an audio thread that feeds audio data out to the audio API."

Yeah, that is a special case. I'm not sure how the really good low-latency audio apps are written. I think some of them actually implement a DPC device for doing audio processing; that way you actually get called by the system, like an interrupt, when it needs some buffer.

One idea : make one thread that's devoted to filling audio buffers. Make it super high priority and put it on a waitable timer so it pumps at the required interval to fill buffers. Of course a waitable timer is equivalent to a Sleep, but it's more reliable because it gives you a signal on a steady interval. The main thing is to make that thread super-high priority, so that the timer (or sleep) durations actually are what you think they are, and to make it do nothing but fill the buffer. It can use a semaphore or some other mechanism to notify other threads when it needs them to give it more audio data.

So the idea is that the one thread that talks to a non-computational resource (the audio hardware) has to use periodic polling, but then it translates its "need buffer" condition into a signallable event that other threads can wait on. So the sleeping part of the system is isolated and has no risk of cascading.
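As a rough illustration of that structure (not code from this thread), here is a portable C++17 sketch rather than Win32. `std::this_thread::sleep_until` on a steady clock stands in for the periodic waitable timer, and `Semaphore` and `run_pump_demo` are hypothetical names invented for the example. Raising the pump thread's priority is platform-specific and is not shown.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical counting semaphore built from a mutex and a condition
// variable, standing in for an OS semaphore.
class Semaphore {
public:
    void release(int n = 1) {
        assert(n > 0);
        {
            std::lock_guard<std::mutex> lock(m_);
            count_ += n;
        }
        cv_.notify_all();
    }
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Runs a pump thread for `ticks` periods and returns how many buffers the
// producer filled. Only the pump thread sleeps on a timer; the producer
// blocks on the semaphore and never polls.
int run_pump_demo(int ticks, std::chrono::milliseconds period) {
    Semaphore need_buffer;
    int filled = 0;

    std::thread producer([&] {
        for (int i = 0; i < ticks; ++i) {
            need_buffer.acquire();  // wait for a "need buffer" signal
            ++filled;               // stand-in for mixing one buffer
        }
    });

    std::thread pump([&] {
        auto next = std::chrono::steady_clock::now();
        for (int i = 0; i < ticks; ++i) {
            next += period;                       // fixed schedule, no drift
            std::this_thread::sleep_until(next);  // the only timed wait
            need_buffer.release();                // turn the tick into a signal
        }
    });

    pump.join();
    producer.join();
    return filled;
}
```

Advancing `next` by a fixed period, instead of sleeping for a fixed duration each time, keeps the tick schedule from accumulating drift, which is the property the waitable timer is being used for above.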

July 14, 2011 at 5:19 PM

Blogger cbloom said...

"Thanks for the post Charles. Would you be willing to expound on strategies for avoiding Sleep() under Windows?"

Win32 is actually a great API for avoiding sleep.

Maybe I'll do a longer example and make it a post.

July 14, 2011 at 5:20 PM

Blogger ryg said...

"Yeah that is a special case. I'm not sure how the really good low-latency audio apps are written."
The really low-latency stuff uses either ASIO or the Kernel Streaming interface (direct pipe into the low-level audio mixer right in front of the driver).

Not sure about KS, but ASIO can actually give you callbacks directly from the interrupt DPC (regular ASIO is exclusive for that reason). And then there's ASIO4ALL, which implements ASIO on top of KS.

July 14, 2011 at 8:43 PM

Blogger cbloom said...

I once spent some time trying to figure out how to get the digital bits of an audio file played directly out of my digital audio out port without Windows fucking anything up with volume multipliers or whatever nonsense. It can be done with foobar2000 and Kernel Streaming, but it's a pain in the ass so I gave up. Maybe I'll try to revive that some day because that's how I want my music to play.

July 14, 2011 at 9:02 PM

Anonymous Anonymous said...

ASIO is pretty nice for musicians, because it locks the record and playback buffers to each other in a really nice way for making them lockstep and minimizing latency.

For just audio playback, it seems like overkill.

July 14, 2011 at 10:00 PM

Blogger Josh Greifer said...

Thanks for the post, Charles, and thanks to @visualc for tweeting.

On point 2, I certainly agree about Sleep(n), but there is a very useful, even vital, variant, which is

SleepEx(0, TRUE);

The above call lets the system process any pending queued asynchronous procedures, and handle pending IO callbacks, *without* waiting.

Under Windows, we have direct OS support for the "proactor" design pattern, which in my experience leads to the fastest, and in some ways simplest, design. In this pattern, the application runs in a single user thread, yet never waits for anything, but proceeds by spinning off all time-consuming tasks and I/O requests to OS-managed threads (which will usually be hyperthreads).
A definition of "time-consuming" on a modern system could be as little as 10 nanoseconds. When building apps that use the proactor pattern, you can either use the boost::asio library (no relation to the ASIO developed by Steinberg), or, under Windows, you can call the OS directly, using APCs and the calls that end in "Ex":

QueueUserAPC(), WaitForMultipleObjectsEx, SleepEx, ReadFileEx, etc.
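To make the shape of that pattern concrete, here is a portable C++17 analogue. It is only a sketch and does not use the Win32 APC mechanism itself: `CompletionQueue`, `post`, `drain`, and `sum_via_completions` are hypothetical names. `post` plays the role of queuing a completion for the owning thread, and `drain` plays the role of a non-waiting alertable call that runs whatever is pending and returns immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-thread completion (APC-like) queue.
class CompletionQueue {
public:
    // Called from any thread: queue a callback for the owning thread.
    void post(std::function<void()> fn) {
        std::lock_guard<std::mutex> lock(m_);
        pending_.push(std::move(fn));
    }

    // Called by the owning thread: run everything queued so far and return
    // the number of callbacks run. Never blocks waiting for more work.
    std::size_t drain() {
        std::queue<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(m_);
            std::swap(batch, pending_);
        }
        const std::size_t ran = batch.size();
        while (!batch.empty()) {
            batch.front()();  // callbacks run outside the lock
            batch.pop();
        }
        return ran;
    }

private:
    std::mutex m_;
    std::queue<std::function<void()>> pending_;
};

// Spins off `tasks` jobs and folds their results back in on the calling
// thread via completion callbacks.
int sum_via_completions(int tasks) {
    CompletionQueue q;
    int total = 0;  // only ever touched on the calling thread
    std::vector<std::thread> workers;
    for (int i = 1; i <= tasks; ++i) {
        workers.emplace_back([&q, &total, i] {
            const int result = i * i;  // stand-in for slow work or I/O
            q.post([&total, result] { total += result; });
        });
    }
    // Joined here only so the demo is deterministic; a real main loop would
    // call drain() once per iteration between other work.
    for (auto& w : workers) w.join();
    const std::size_t ran = q.drain();
    assert(ran == static_cast<std::size_t>(tasks));
    (void)ran;
    return total;
}
```

Because every callback runs on the owning thread, `total` needs no lock of its own, which is much of the appeal of this style.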

On point 6, I use macros which allow me at design-time to easily switch between spinning off a task into an APC or running it in the main thread. I make the final choice after profiling the application.

On point 7, you seem to imply that mutexes and critical sections are the same thing, which is quite wrong! Mutexes, as you say, are slow heavyweight systemwide objects which carry a lot of bookkeeping overhead, but are vital (and efficient) for interprocess synchronization. But critical sections are relatively lightweight and easy (but unnecessary) to implement oneself. Basically all you need to roll your own critical section is one "Bit Test and Set" machine code instruction in a tight loop.

Semaphores are useful constructs, particularly when you use the proactor pattern. If you need to process five audio buffers, for example, raising a semaphore with a count of five will instruct the OS to, in effect, queue five calls to a buffer handling routine, rather than having to raise a semaphore once for each buffer.

All in all, my advice on getting blinding performance is to delegate as much of the threading logic as possible to the OS (and to the GPU where available). Which is not to say that it isn't also very important to try to code all this logic by hand, if only to get a deep knowledge of how the OS does it.

July 25, 2011 at 2:29 AM

Blogger cbloom said...

Hey Josh, a couple of points -

A. In general I agree that APC's in Windows are a nice system. I wrote an earlier post about how lots of async work is better done as a little callback on an already running thread (rather than sending a message over to another thread). Windows as of Vista+ also has a very nice generalized worker thread pool system for doing APC's.

While that's all very nice, it's not portable.

I would be a little worried about using SleepEx as a way to pump APC's, but maybe it's okay. It seems like there might be unexpected side effects to that and I wonder if there is a more obviously side-effect free way to pump APC's.

Anyway, I like the APC way of writing code a lot, and to some extent Oodle has become a cross-platform way of doing Windows-like APC's. (eg. things like "run this snippet of code after this IO finishes")

B. As for CriticalSections, I disagree.

It's crucial to distinguish between the Windows specifics and the general concepts.

Generally if I say "mutex" I mean it in the algorithmic sense, as in "something that provides mutual exclusion".

In that sense, a CriticalSection is a form of mutex.

You say a few things about CriticalSection that are just not true :

"But critical sections are relatively lightweight and easy (but unnecessary) to implement oneself."

CS is most definitely NOT easy to implement yourself. It uses a bit-test-and-set spin loop, but then automatically allocates an OS event and will wait on that to actually put the thread to sleep. (see for example my "Event mutex that makes event on demand" in an earlier post).

Windows CS also has deadlock debugging support, recursion, priority inversion protection, etc. It's quite complex actually.

"Basically all you need to roll your own critical section is one "Bit Test and Set" machine code instruction in a tight loop."

Not true, I'm not sure if you're simplifying or misunderstanding CS, but in fact CS becomes just like a kernel "Mutex" when there is contention. It simply has a fast path for the no-contention case. This is similar to fast mutexes on most platforms these days.

And it is because of this fact that CS is not actually any better about thread blocking than "Mutex" under windows. Yes, CS has a much faster best case, but it has the exact same worst case, which is threads going to sleep and waking up.

For example a "lock convoy" with CS's is just as bad as a lock convoy with Mutexes.
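A minimal sketch of the "fast path, then really sleep" structure, in portable C++17. This is not how Windows implements CRITICAL_SECTION. The class name `SpinThenBlockLock`, the spin count, and `contended_count` are assumptions made for illustration, and a `std::condition_variable` stands in for the OS event. Recursion, on-demand event allocation, and the debugging support are all left out.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical lock: an atomic test-and-set fast path, with a blocking slow
// path once spinning has failed.
class SpinThenBlockLock {
public:
    bool try_lock() { return !locked_.exchange(true); }

    void lock() {
        // Fast path: a bounded number of test-and-set attempts.
        for (int i = 0; i < kSpinCount; ++i) {
            if (try_lock()) return;
        }
        // Slow path: register as a waiter and actually go to sleep.
        std::unique_lock<std::mutex> guard(m_);
        ++waiters_;
        cv_.wait(guard, [this] { return try_lock(); });
        --waiters_;
    }

    void unlock() {
        assert(locked_.load());  // caller must hold the lock
        locked_.store(false);
        // Only pay for a wakeup when somebody is sleeping.
        if (waiters_.load() > 0) {
            std::lock_guard<std::mutex> guard(m_);
            cv_.notify_one();
        }
    }

private:
    static constexpr int kSpinCount = 100;  // arbitrary illustrative value
    std::atomic<bool> locked_{false};
    std::atomic<int> waiters_{0};
    std::mutex m_;
    std::condition_variable cv_;
};

// Hammers the lock from several threads and returns the final counter.
long contended_count(int threads, int iters) {
    SpinThenBlockLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Even in this toy version the uncontended case is a single atomic exchange, while the contended case ends with threads sleeping and being woken, which is the same worst case discussed above.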

July 25, 2011 at 9:54 AM
