The Linux kernel internally uses a FIFO spinlock that it calls a "ticket lock". The ticket (or "bakery") algorithm
is quite a common pattern, so let's have a glance at it.
The analogy is the easiest way to understand it. There's an atomic ticket machine, when you walk into the shop
you grab a ticket (and the machine increments itself). On the wall is a "now serving" sign that counts up as
people turn in their tickets.
This can be implemented most obviously using two ints :
struct ticket_mutex2
{
// (*0)
std::atomic<unsigned int> m_ticket;
std::atomic<unsigned int> m_serving;
ticket_mutex2()
{
m_ticket($).store( 0 );
m_serving($).store( 0 );
}
~ticket_mutex2()
{
}
void lock()
{
unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);
rl::linear_backoff bo;
for(;;)
{
unsigned int cur = m_serving($).load(std::mo_acquire);
if ( me == cur )
return;
bo.yield($);
// (*1)
}
}
void unlock()
{
// (*2)
// my ticket must match m_serving
// (*3)
//m_serving($).fetch_add(1,std::mo_release);
unsigned int cur = m_serving($).load(std::mo_relaxed);
m_serving($).store(cur+1,std::mo_release);
}
};
*0 : obviously you could put the two counters into 16-bit words and mush them into one int (Linux on x86 used
to put them into bytes and mush them into one word), but it's actually a
better demonstration of the algorithm to have them separate, because it's a weaker constraint. Lockfree
algorithms always continue to work if you mush together variables into larger atomic pieces, but rarely continue to work if you
separate them into smaller independent atomic pieces. So when you're trying to show the fundamental requirements
of an algorithm you should use the minimum mushing-together required.
(BTW I don't remotely claim that any of
the things I've posted have the minimum synchronization constraints required by the algorithm, but that is
always the goal).
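As an aside, the mushed-together variant that *0 alludes to can be sketched in plain C++11 like this (my sketch, not code from the post) : both counters live in one 32-bit atomic, ticket in the low word, "now serving" in the high word.

```cpp
#include <atomic>
#include <thread>

// ticket lock with both counters mushed into one 32-bit atomic :
// low word = next ticket to hand out , high word = now serving
struct ticket_mutex_packed
{
    std::atomic<unsigned int> state{0};

    void lock()
    {
        // atomically grab a ticket from the low word only :
        unsigned int old = state.load(std::memory_order_relaxed);
        while ( ! state.compare_exchange_weak(old,
                    ((old + 1) & 0xFFFF) | (old & 0xFFFF0000),
                    std::memory_order_acq_rel) ) { }
        unsigned int me = old & 0xFFFF;
        // spin until the high word says my ticket is being served :
        while ( (state.load(std::memory_order_acquire) >> 16) != me )
            std::this_thread::yield();
    }
    void unlock()
    {
        // only the lock holder bumps "now serving" :
        state.fetch_add(0x10000, std::memory_order_release);
    }
};
```

Both counters wrap mod 65536 consistently, so this is fine for fewer than 65536 simultaneous waiters.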
*1 : you might be tempted to put a Wait here using eventcount or something, but you can't. The problem is
if multiple threads go to sleep there, only the one thread that has the next ticket will be able to take
the lock. So if you use a generic waitset, you might wake the wrong thread, it won't be able to get in,
and you will deadlock. More on this in a moment.
*2 : m_serving is actually protected by the mutex, it is only ever modified by the mutex holder. m_ticket
is actually the barrier variable for acquiring the lock. When you get the lock you could store your ticket
id as a member in the lock struct and at unlock it will be equal to m_serving.
*3 : you can of course use an atomic increment on serving but because of *2 it's not necessary, and a simple
load & inc is cheaper on some architectures (and as per *1, it's a weaker constraint so we prefer to demonstrate
its correctness here).
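For readers without Relacy, here is a plain C++11 transcription of the lock above, with the standard std::memory_order names spelled out (my transcription, not the original code) :

```cpp
#include <atomic>
#include <thread>

struct ticket_mutex_std
{
    std::atomic<unsigned int> m_ticket{0};
    std::atomic<unsigned int> m_serving{0};

    void lock()
    {
        // take a ticket ; acq_rel matches the original code
        unsigned int me = m_ticket.fetch_add(1, std::memory_order_acq_rel);
        // spin until "now serving" reaches my ticket :
        while ( m_serving.load(std::memory_order_acquire) != me )
            std::this_thread::yield();
    }
    void unlock()
    {
        // per *2/*3 : only the lock holder writes m_serving, so load+store suffices
        unsigned int cur = m_serving.load(std::memory_order_relaxed);
        m_serving.store(cur + 1, std::memory_order_release);
    }
};
```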
Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair) which
is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if
the thread that is next in line gets swapped out, then no currently running thread can get the lock, and we
don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is
okay in the kernel because the kernel is controlling which threads are awake or asleep, so obviously it won't
put a thread to sleep that is currently spinning trying to get the lock.
So if we want to turn this into a FIFO lock that works in user space, we have to have a sleep/wakeup mechanism.
I don't think this is actually an awesome way to write your own FIFO lock, but it's a nice demonstration of the
usefulness of NT's Keyed Events, so I'm going to do that.
You need to get the secret functions :
template <typename ret,typename T1,typename T2,typename T3,typename T4>
ret CallNtInternal(const char * funcName,T1 arg1,T2 arg2,T3 arg3,T4 arg4)
{
typedef ret NTAPI t_func(T1,T2,T3,T4);
t_func * pFunc = (t_func*) GetProcAddress( LoadLibrary( TEXT("ntdll.dll") ), funcName );
ASSERT_RELEASE( pFunc != NULL );
return (*pFunc) (arg1,arg2,arg3,arg4);
}
#define MAKE_NTCALL_4(ret,func,type1,type2,type3,type4) ret func(type1 arg1,type2 arg2,type3 arg3,type4 arg4) { return CallNtInternal<ret>(#func,arg1,arg2,arg3,arg4); }
MAKE_NTCALL_4( LONG,NtCreateKeyedEvent,OUT PHANDLE, IN ACCESS_MASK, IN PVOID, IN ULONG );
MAKE_NTCALL_4( LONG,NtReleaseKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );
MAKE_NTCALL_4( LONG,NtWaitForKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );
and then you can make the lock :
struct ticket_mutex2_keyed
{
// NOTE : the original source for this struct was eaten by the blog's HTML
// (everything between angle brackets vanished) ; this is a reconstruction
// from the surviving fragment and the surrounding text, not the exact original.
// low word = next ticket , high word = now serving :
std::atomic<unsigned int> m_state;
HANDLE m_keyedEvent;
enum { WAITKEY_SHIFT = 1 }; // keyed event keys must have the bottom bit clear
ticket_mutex2_keyed()
{
m_state($).store( 0 );
NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
}
~ticket_mutex2_keyed()
{
CloseHandle(m_keyedEvent);
}
void lock()
{
// take a ticket from the low word :
unsigned int prev = fetch_add_low_word(m_state,1);
unsigned int me = prev & 0xFFFF;
// if my ticket is already being served, I have the lock :
if ( me == (prev>>16) )
return;
// otherwise sleep on my ticket number :
NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(me<<WAITKEY_SHIFT),FALSE,NULL);
}
void unlock()
{
// bump "now serving" in the high word ; only the lock holder does this :
unsigned int prev = m_state($).fetch_add(0x10000,std::mo_acq_rel);
unsigned int next = prev + 0x10000;
// if the ticket counter is ahead of serving, somebody is waiting - wake him :
if ( ! top_word_matches_bottom(next) )
{
unsigned int waitKey = (((next>>16)&0xFFFF)<<WAITKEY_SHIFT);
NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
}
}
};
Note that we have had to push the two state variables together now, because previously unlock only touched the
"now serving" counter, but now it has to also check against the ticket counter to see if there are any people
waiting.
Also note that we are taking advantage of the fact that ReleaseKeyedEvent is blocking. If the Release happens
before the Wait, the signal is not lost - the unlocking thread blocks until the Wait is entered.
Exercise for the reader : make it possible for lock to spin a while before going into the wait.
I made use of these self-explanatory helpers :
bool top_word_matches_bottom( unsigned int x )
{
unsigned int t = _lrotl(x,16);
return t == x;
}
unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
{
unsigned int old = state($).load(std::mo_relaxed);
while ( ! state($).compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::mo_acq_rel) ) { }
return old;
}
which do what they do.
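Dropping the Relacy `($)` notation, the two helpers look like this in plain C++11 (with a portable rotate substituted for MSVC's _lrotl) ; note that fetch_add_low_word wraps the low word at 16 bits without disturbing the high word :

```cpp
#include <atomic>

// rotate a 32-bit value left by 16 : swaps its two 16-bit halves
inline unsigned int rotl16(unsigned int x)
{
    return (x << 16) | (x >> 16);
}

// true if the top 16 bits equal the bottom 16 bits (serving == ticket)
inline bool top_word_matches_bottom(unsigned int x)
{
    return rotl16(x) == x;
}

// add inc to the low 16-bit word only, leaving the high word untouched ;
// returns the pre-increment value
inline unsigned int fetch_add_low_word(std::atomic<unsigned int> & state, int inc)
{
    unsigned int old = state.load(std::memory_order_relaxed);
    while ( ! state.compare_exchange_weak(old,
                ((old + inc) & 0xFFFF) | (old & 0xFFFF0000),
                std::memory_order_acq_rel) ) { }
    return old;
}
```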
Obviously on Linux you could use futex, but there are too many platforms that have neither KeyedEvent nor futex,
which makes using them not very attractive.
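As a portable stand-in (my sketch, not code from the post), the same take-a-ticket / wake-the-next-ticket structure can be faked with a condition variable ; the broadcast wakeup makes it much less efficient than keyed events or futex, but it shows the shape :

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// FIFO ticket lock built on a condition variable ; every unlock broadcasts
// and each waiter re-checks whether its ticket is now being served.
struct ticket_mutex_cv
{
    std::mutex m;
    std::condition_variable cv;
    unsigned int ticket = 0;
    unsigned int serving = 0;

    void lock()
    {
        std::unique_lock<std::mutex> lk(m);
        unsigned int me = ticket++;          // grab a ticket
        cv.wait(lk, [&]{ return serving == me; });
    }
    void unlock()
    {
        { std::lock_guard<std::mutex> lk(m); ++serving; }
        cv.notify_all();  // wakes everyone ; only the next ticket proceeds
    }
};
```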
Some links :
Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
spinlocks XXXKSE What to do
Linux x86 ticket spinlock
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git commit
futex(2) - Linux manual page
"07-16-11 - Ticket FIFO Mutex"
2 Comments -
Very nice, but... after fixing the std::memory_order constants to make it compile, it turns out that, ironically, the presumably much faster ticket mutex performs about the same as a simple W32 API critical section with a concurrency of 1 and 2, and is roughly 8 times *slower* with a concurrency of 4 (on a 4-core Intel).
February 1, 2012 at 4:40 AM
"Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair) which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread who is next in line get swapped out, then no currently running threads can get the lock, and we don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is okay in the kernel because the kernel is controlling which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock. "
February 1, 2012 at 9:18 AM