
Post a Comment On: cbloom rants

"07-18-10 - Mystery - Does the Cell PPU need Memory Control -"

5 Comments
Blogger malte said...

Afaik the PPU still has a load-miss queue and the same store-gather buffer mechanism as Xenon, which can execute loads/stores out of program order even on a single core.

July 18, 2010 at 2:02 PM

Blogger ryg said...

It's a bit more subtle than that.

Yes, the loads and stores can "execute" out of order, but in this particular case that doesn't mean much. It doesn't matter which order you execute loads in if the memory doesn't change in the meantime. Similarly, the completion order of stores doesn't matter if nobody does any loads until the last store is retired.

This simplifies matters considerably if there are no external agents modifying memory.

Processors do have mechanisms in place to ensure causality within a single thread (kinda obvious, after all we don't need to place fences in single-threaded programs, ever). If you walk through the cases, the only problematic operation in this scenario is when a load accesses memory at an address that is being accessed by an in-flight store that's not yet completed - and we know how the CPU reacts in that case: that's a load-hit-store, which causes the PPU to stall until the store buffer has been written to L1.

Anyway, the actual issue: multiple hardware threads. It heavily depends on how the particular implementation looks - in particular, how the store buffers were adapted to multiple HW threads. The only reasonable implementation I can think of on PPC is having one shared set of store buffers used by all HW threads.

In this implementation, as far as loads/stores are concerned, the HW multithreading boils down to picking instructions alternately from thread A and B and interleaving them into one instruction stream. Loads and stores are sequentially consistent within that core, as any dangerous sequences are automatically serialized by triggering LHS stalls.

So if they're doing that, then you're indeed home free without explicit memory barriers as long as only the PPU touches memory. When you're communicating with SPUs (e.g. job queues in main memory), you definitely need memory barriers.
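As a portable sketch of the PPU-to-SPU case (the `job_slot`/`publish_job` names are hypothetical, and real PS3 code would use `lwsync` between filling the job and setting the flag), C11 release/acquire atomics express the same ordering requirement:

```c
#include <stdatomic.h>

/* Hedged sketch of the producer side of a job queue shared with
   another agent (an SPU on PS3; any other thread here). The
   release store plays the role of lwsync + store: all earlier
   stores (the payload) become visible before the flag flips. */
typedef struct {
    int        payload;
    atomic_int ready;
} job_slot;

void publish_job(job_slot *slot, int payload)
{
    slot->payload = payload;
    /* Release: payload is guaranteed visible before ready == 1. */
    atomic_store_explicit(&slot->ready, 1, memory_order_release);
}

int try_consume_job(job_slot *slot)
{
    /* Acquire pairs with the release above. */
    if (atomic_load_explicit(&slot->ready, memory_order_acquire))
        return slot->payload;
    return -1; /* not ready yet */
}
```

The point of the comment is that between two PPU hardware threads sharing one set of store buffers this fence would be unnecessary; it's only when an external agent like an SPU observes memory that the ordering must be enforced explicitly.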

July 18, 2010 at 3:01 PM

Blogger cbloom said...

" The only reasonable implementation I can think of on PPC is having one shared set of store buffers used by all HW threads.

In this implementation, as far as loads/stores are concerned, the HW multithreading boils down to picking instructions alternately from thread A and B and interleaving them into one instruction stream. Loads and stores are sequentially consistent within that core, as any dangerous sequences are automatically serialized by triggering LHS stalls."

Yep, this is exactly my suspicion. It would be nice to get some real confirmation that this is the case, but it's hard to imagine anything else.

Note that this doesn't mean you don't have to do anything for shared variables: they still have to use "atomic" ops to be consistent, and you have to make sure that accesses to them actually go to memory. They just don't need lwsync or whatever to enforce cache line timing order.
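A minimal sketch of that distinction, using C11 atomics as a stand-in (on PPC the atomic RMW would be an lwarx/stwcx. loop): the shared counter still needs an atomic increment so updates aren't lost, and must really be read from memory rather than a cached register copy, but `memory_order_relaxed` expresses "no fence needed":

```c
#include <stdatomic.h>

/* Shared between two PPU hardware threads in the scenario above.
   Declaring it atomic forces real memory accesses (no stale
   register copies) and makes the increment a single RMW. */
atomic_int shared_count = 0;

void bump(void)
{
    /* Atomic increment with no ordering fence -- the relaxed
       ordering is the point: consistency without lwsync. */
    atomic_fetch_add_explicit(&shared_count, 1, memory_order_relaxed);
}

int read_count(void)
{
    return atomic_load_explicit(&shared_count, memory_order_relaxed);
}
```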

July 18, 2010 at 3:31 PM

Anonymous Anonymous said...

I've written a small sample program which provides at least partial evidence of your theory. You can check it out here.

May 15, 2012 at 11:26 AM

Blogger cbloom said...

Cool, nice work. I need to go ahead and try my test suite without ordering on PS3. I finally wrote my own Relacy-like simulator that I can run on any of our platforms, so I should be able to test this stuff somewhat exhaustively in the future.

May 15, 2012 at 5:33 PM
