Bartosz made an interesting post
about extending D for automatic multithreading with some type system additions.
It made me think about how you could do guaranteed-safe multithreading in C++. I think it's actually pretty simple.
First of all, every variable must be owned by a specific "lock". It can only be accessed if the current thread owns that
lock. Many of the ownership relationships could be implicit. For example, there is an implicit lock for every
thread, for stuff that is exclusively owned by that thread. That thread always holds its own implicit lock, so you never actually
generate a lock/unlock, but conceptually those variables still have a single lock that owns them.
So, stack variables for example are implicitly automatically owned by the thread they are made on. Global variables
are implicitly owned by the "main thread" lock if not otherwise specified. If some other thread tries to touch a
global, it tries to take a lock that it doesn't own and you fail.
Lock gate1;
int shared : gate1; // specifies "shared" is accessed under gate1
int global; // implicitly owned by the main thread
void thread1()
{
    int x; // owned by thread1
    x = shared; // fail! you must lock gate1
    {
        lock(gate1);
        x = shared; // ok
    }
    int y = global; // fail! you don't own global
}
Mkay, that's nice and all. Single-threaded programs still just work without any changes; everything is
owned by the main thread. Another goal of threading syntax additions, I think, should be that going from
a large single-threaded code base to adding a few threading bits is easy. It is here.
There are a few things you would need to make this really work. One is a clean ownership-transfer
mechanism, as Bartosz talks about. Something like auto_ptr or unique_ptr, but actually enforced by the language,
so that you can pass objects from one owner to another and ensure that no refs leak out during the passage.
You can also of course extend this if you don't want the constraint that everything is protected by a lock.
For example, you could easily add "atomic" as a qualifier instead of a lock owner. If something is marked
atomic, then it can be accessed without taking a lock, but only through atomic
operations like CMPXCHG/CAS.
This is a nice start, but it doesn't prevent deadlocks and still requires a lot of manual markup.
I also finally read a bit about Sun's Rock. It's very interesting, I encourage you to read more about it
at the Sun Scalable Synchronization page.
Rock is actually a lot like LRB in many ways. It has 16 lightweight cores, each of which has 2 hardware threads (they
call them strands). It's basically a simple in-order core, but it can do a sort of state-save push/pop and
speculative execution. They have cleverly multi-purposed that functionality for both the Transactional Memory
and also just for improving performance of single-threaded code. The state-save push/pop is a ghetto form
of the out-of-order execution that we all know and love so much. It means the chip can execute past a branch,
and then if the branch goes the other way, it pops the state back to the branch. This is just
like checkpointing a transaction and then failing the commit!
The key thing for the transactional memory is that Rock is NOT a full hardware transactional memory chip. It
provides optimistic hardware transactions, with some well-designed support to help software transactional memory
implementations. The optimistic hardware transactions basically work by failing to commit if any other core
touches a cache line you're writing. Thus if you do your work in cache lines that you own, you can read data, write it
out, it gets flushed out of the cache to the globally consistent view, and there are no problems. If someone touches
that cache line it invalidates the transaction - even though it might not necessarily need to. That's what makes
it optimistic and not fully general (there are other limitations too). If it allows a transaction through, then it
definitely was okay to do, but it can fail a transaction that didn't need to fail.
There are a lot of problems with that: it can fail in cases
that are perfectly valid transactions, so obviously you cannot rely on it as a complete TM implementation. However, it
does let a lot of simple transactions complete quickly. In particular, a simple transaction with
no contention completes no problemo. If it fails, you need a fallback
mechanism - in which case you should fall back to your real STM implementation, which should have forward-progress
guarantees. So one way of looking at the HTM in Rock is as a common-case optimization for your STM software.
The commit in Rock has a "jump on fail" so that you can provide the failure handling code block; you could jump
back to your checkpoint and try again, but eventually you have to do something else.
Perhaps more interestingly, the HTM in Rock is useful in other ways too. It gives you a way to do a more general LL/SC (load-linked/store-conditional)
kind of thing, so even if you're not using STM, you can build up larger "atomic" operations for your
traditional lock/atomic C++ style multithreading. It can also be used for "lock elision" - avoiding actually taking
locks in traditional lock-based code. For example, if you have code like:
lock(CS)
x = y;
unlock(CS)
that can be transparently converted to something like :
checkpoint; // begin hardware transaction attempt
x = y;
commit // try to end hardware transaction; if it fails:
failure {
    lock(CS)
    x = y;
    unlock(CS)
}
So you avoid stalling if it's not needed. There are of course tons of scary subtleties with all this stuff. I'll let the Sun guys
work it out.
It's also actually a more efficient way of doing the simple "Interlocked" type atomic ops. On x86 or in a strongly ordered language
(such as Java's volatiles) the Interlocked ops are fully sequentially consistent, which means they go in order against each other.
That actually is a pretty hard operation. We think of CMPXCHG or CAS as being very lightweight, but it has to sync the current core
against all other cores to ensure the ops are ordered. (I wrote about some of this stuff before, in more detail, in the threading articles.)
The "transaction" hardware with a checkpoint/commit lets your core flow through its ops without doing a big sync on interlocked ops.
Now, obviously the checkpoint/commit itself needs to be synchronized against all other cores, but it's done on a per cache-line basis,
and it uses the cache line dirtying communication hardware that we need anyway. In particular in the case of no contention or the
case of doing multiple interlocked ops within one cache line, it's a big win.
I'm sure that Sun will completely cock things up somehow as they always do, but it is very interesting, and I imagine that most
future chips will have models somewhat like this, as we all go forward and struggle with the issue of writing software for these
massively parallel architectures.
"05-26-09 - Automatic Multithreading"
7 Comments -
Pending thread annotations in GCC:
http://docs.google.com/Doc?id=ddqtfwhb_0c49t6zgr
The architecture folks realized pretty early on that transactions and speculation are basically the same thing. I think it is also fairly obvious that you don't need to do a fully hardware implementation of transaction memory, just like you don't need to do a fully hardware implementation of virtual memory.
I like the idea of a processor being able to dynamically choose to speculate or hyperthread, or just save power. It would be interesting to see what the "branch predictor" (or whatever would make that decision) would look like. I wonder if the additional complexity would be worthwhile. I suppose the problem is that to actually do speculation well you also need to go out-of-order, and things quickly get big and messy.
May 26, 2009 at 4:25 PM
"The architecture folks realized pretty early on that transactions and speculation are basically the same thing."
Yeah I didn't see it; it's cool.
"I like the idea of a processor being able to dynamically choose to speculate or hyperthread, or just save power."
Apparently Rock can be controlled from software to use both hardware hyperthreads to execute a single software thread. In that case one of the hyperthreads runs ahead and speculatively executes and prefetches stuff, while the other one hangs behind and retires what actually happened.
It's cool to be able to control stuff like that from the OS so you have control over power use, whether you target fewer or more threads, etc.
May 26, 2009 at 4:46 PM
"Yeah I didn't see it; it's cool."
It's interesting how working on hardware, or even just thinking about how you would specify something if it was meant to be implemented as hardware, changes your perspective.
It changes your focus from "what am I trying to accomplish" to "what am I actually doing", which is very liberating sometimes. I remember reading some time ago about how LLVM used to have signed and unsigned integers as distinct types, which they later switched to just integer types with distinct signed/unsigned add, sub and multiply instructions, which apparently solved several design problems for them.
This kind of actually peeling high-level information away where it doesn't contribute anything anymore is just as tricky as getting the high-level info to where you need it.
May 26, 2009 at 6:27 PM
Does your example of lock avoidance actually work? It seems like the hardware transaction needs to be blocked by the lock; i.e. you don't want to fail a transaction, grab the lock, do part of the locked code, then pause and another thread comes along and runs the transaction code and completes it successfully.
But I may not be understanding it correctly.
May 29, 2009 at 5:58 PM
I think the key to that is that the light transaction version still checks a bool in the lock. That does two things - if the bool is set it's locked and you don't do the transaction, but it also makes you dependent on the cache line of the lock, so if anyone else is changing the lock at the same time you fail the transaction.
May 29, 2009 at 6:33 PM
So the transaction version is something like:
checkpoint
if ( lock.set )
    goto fallback
lock.set = true
... do stuff ...
lock.set = false
commit
on failure goto fallback
So the commit only goes through when I was the only person who got their hands on the lock during that time.
Or something, I dunno, it's more complicated than that I'm sure.
May 29, 2009 at 6:45 PM
and of course to be clear on the context, this isn't a general transaction thing, it's just needed when you're trying to replace locks and/or use locks as the fallback mechanism.
May 29, 2009 at 6:46 PM