BTW I think this could be done much more neatly using condition variables.
October 7, 2010 at 5:43 PM
So we've seen the problem, let's look at some solutions.
First of all let me try to convince you that various hacky solutions won't work. Say you're using Semaphores.
Instead of doing Up() to signal the event, you could do Increment(1000); That will cause the Down() to just
run through immediately until it's pulled back down, so the next bunch of tests will succeed. This might
in fact make the race never happen in real life, but the race is still there.
1. Put a Mutex around the Wait. The mutex just ensures that only one thread at a time is actually in the wait
on the event. In particular we do :
WaitOnMessage_4( GUID )
{
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
return;
}
for(;;)
{
MutexInScope( WaiterMutex );
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
return;
}
Wait( ReceiveEvent );
}
}
note that the WaiterMutex is never taken by the Worker thread, only by "main" threads. Also note that
it must be held around the whole check for the GUID being done or you have another race. Also we have
the separate ReceiveMutex for the pop/mark phase so that when we are in the wait we don't block
other threads from checking status.
Now when multiple threads try to wait, the first will get in the mutex and actually wait on the Event,
the second will try to lock the mutex and stall out there and wait that way.
The problem with this is that threads that can proceed don't always get to. It's not broken, you will
make progress, but it's nowhere near optimal. Consider if N threads come in and wait on different GUIDs.
Now the worker finishes something, so that wakes up a Wait and he goes around the loop which releases
the Mutex - if you just so happened to wake up the one guy who could proceed, then it's great, he returns,
but if you woke up the wrong guy, he will take the mutex and go back to sleep. (also note when the mutex
unlocks it might immediately switch thread execution to one of the people waiting on locking it because
of the Window fair scheduling heuristics). So you only have a 1/N chance of making optimal progress,
and in the worst case you might make all N threads wait until all N items are done. That's bad.
So let's look at other solutions :
2. Put the Event/Semaphore on each message instead of associated with the whole Receive queue. You wait
on message->event , and the worker signals each message's event when it's done.
This also works fine. It is only race-free if GUIDs can only be used from one thread, if multiple threads
can wait on the same GUID you are back to the same old problem. This also requires allocating a Semaphore
per message, which may or may not be a big deal depending on how fine grained your work is.
On the plus side, this lets the Wait() be on the exact item you want being done, not just anything being done,
so your thread doesn't wake up unnecessarily to check if its item was done.
WaitOnMessage_5A( GUID )
{
Event ev;
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
return;
ev = message(GUID)->event
}
Wait( ev );
}
we're glossing over the tricky part of this one which is the maintenance of the event on each message.
Let's say we know that GUIDs are only touched by one thread at a time. Then we can do the event
destruction/recycling when we receive a done message. That way we know that event is always safe for
us to wait on outside of the mutex because the only person who will ever Clear or delete that event is
us.
I think you can make this safe for multiple threads thusly :
WaitOnMessage_5B( GUID )
{
Semaphore ev;
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
return;
message(GUID)->users ++;
ev = message(GUID)->semaphore;
}
Down( ev );
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
ASSERT( the GUID I wanted is done );
message(GUID)->users --;
if ( message(GUID)->users == 0 )
delete message(GUID)->semaphore;
}
}
and you make worker signal the event by doing Up(1000); you don't have to worry about that causing weird
side effects, since each event is only ever signalled once, and the maximum number of possible waits on
each semaphore is the number of threads. Since the semaphore accesses are protected by
mutex, I think you could also do lazy semaphore creation, eg. don't make them on messages by default,
just make them when somebody tries to wait on that message.
3. Make the Event per thread. You either put the ReceiveEvent in the TLS, or you just make it a local on
the stack or somewhere for each thread that wants to wait. Then we just use WaitOnMessage_3 , but the
ReceiveEvent is now for the thread. The tricky bit is we need to let the worker thread know what events
it needs to signal.
The easiest/hackiest way is just to have the worker thread always signal N events that you determine
at startup. This way will in fact work fine for many games. A slightly cleaner way is to have a list
of the events that need signalling. But now you need to protect access to that with a mutex, something
like :
WaitOnMessage_6( GUID , LocalEvent )
{
{
MutexInScope(EventListMutex)
add LocalEvent to EventList
}
for(;;)
{
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
return;
}
Wait( LocalEvent );
}
{
MutexInScope(EventListMutex)
remove LocalEvent from EventList
}
}
and worker does :
do work
push message to receive FIFO
{
MutexInScope(EventListMutex)
signal everything in EventList
}
one nice thing about this is that you can push GUID with LocalEvent and then only signal events that
want to hear about that specific message.
Note with the code written as-is it works but has very nasty thread thrashing. When the worker signals
the events in the list, it immediately wakes up the waiting threads, which then try to lock the
EventListMutex and immediately go back to sleep, which wakes the worker back up again. That's pretty
shitty so a slightly better version is if the worker does :
do work
push message to receive FIFO
TempList;
{
MutexInScope(EventListMutex)
copy everything in EventList to TempList
}
-*-
signal everything in TempList
this fixes our thread thrashing, but it now gives us the usual lock-free type of restrictions - events in
TempList are no longer protected by the Mutex, so that memory can't be reclaimed until we are sure that
the worker is no longer touching them. (in practice the easiest way to do this is just to use recycling
pooled allocators which don't empty their pool until application exit). Note that if an event is recycled
and gets a different id, this might signal it, but that's not a correctness error because extra signals
are benign. (extra signals just cause a wasteful spin around the loop to check if your GUID is done,
no big deal, not enough signals means infinite wait, which is a big deal).
NOTE : you might think there's a race at -*- ; the apparent problem is that the worker has got the event list in TempList, and it
gets swapped out, then on some main thread I add myself to EventList, run through and go to sleep in the Wait. Then worker wakes
up at -*- and signals TempList - and I'm not in the list to signal! Oh crap! But this can't happen because if that was the work
item I needed, it was already put in the receive FIFO and I should have seen it and returned before going into the Wait.
We can also of course get rid of the Mutex on EventList completely by doing the usual lockfree gymnastics; instead
make it a message passing thing :
WaitOnMessage_6B( GUID , LocalEvent )
{
Send Message {Add,LocalEvent,GUID}
for(;;)
{
{
MutexInScope(ReceiveMutex);
pop everything on the receive FIFO
mark everything I received as done
if ( the GUID I wanted is done )
break;
}
Wait( LocalEvent );
}
Send Message {Remove,LocalEvent,GUID}
}
and worker does :
pop message from "send FIFO" to get work
do work
push message to "receive FIFO"
pop all event messages
process Add/Remove commands
signal everything in EventList
but this isn't tested and I have very little confidence in it. I would want to Relacey this before using it in real code.
4. Just don't do it. Sometimes domain-specific solutions are best. In particular there's a very simple solution I can
use for the current problem I'm having.
The thing that made me hit this issue is that I made my work-stealing Worklet system able to wait on IO's. In
this case the "worker" is the IO service thread and the messages are IO requests. So now the main thread can wait
on IO and so can the Worklets. It's important that they really go to a proper sleep if they are blocked on IO.
But I can solve it in a simple way in that case. The Worklets already have a way to go sleep when they have no work
to do. So all I do is if the Worklets only have work which depends on an uncompleted IO, they push their Work back onto
the main work dispatch list and go to sleep on the "I have no work" event. Then, whenever the main thread receives any
IO wakeup event, it immediately goes and checks the Worklet dispatch list and sees if anything was waiting on an IO
completion. If it was, then it hands out the work and wakes up some Worklets to do it.
This solution is more direct and also has the nice property of not waking up unnecessary threads and so on.
"09-21-10 - Waiting on Thread Events Part 2"
1 Comment -
BTW I think this could be done much more neatly using condition variables.
October 7, 2010 at 5:43 PM