Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"07-10-10 - PowerPC Suxors"

17 Comments -

1 – 17 of 17
Blogger   said...

The top 32-bit clears is due to the fact that the heritage of the PS3 is 64-bit, not 32-bit.

July 12, 2010 at 11:57 AM

Blogger   said...

Rolling your own assembly from scratch via an assembler is also very painful. There are a very many hard to remember scheduling rules to get right.

July 12, 2010 at 12:00 PM

Anonymous Anonymous said...

Welcome to the world of PPC!

I'm not sure how much of your post might be under NDA (also I'm not sure how much the constructors would acttualy care).

Yep, manually pre-fetching quickly gets a must to do, otherwise a simple stupid loop on an array will keep hitting L2-miss. Also virtuals are overkill on this platform: PC CPU are more than able to run fastly virtual calls thx to their architecture, but PPC just get hanged by these.
I end up re-writing several stuff as templated/metaprogrammed code to get rid of these virtuals, which somehow some people liked to put inside expensive loops :(

GCC/Xenon: y-m-may-vary. A dev compared asm of our engine and he found the Xenon's version to be ~2-5% faster somehow.

I disagree with your comment about stack variable not getting registered: most of my primitive type variables do end-up in registers: loop counter, ptr, simd-vector class... (maybe an issue with your compiler flags?)
On the other hand PPC compilers are especially afraid of optimizing member variables into registers due to memory aliasing, so putting often-read members in local variables is a must.

July 12, 2010 at 12:14 PM

Blogger MaciejS said...

Oh yeah, PPC compilers seem to be really bad at utilizing registers/paranoid about aliasing sometimes. I've seen plenty of LHSes just because compiler decided to store something to memory, then load it back few instructions later (there was no reason to do that, no aliasing, no volatile, nothing). Usually it's possible to 'fix' it by modifying C++ code a little bit, but it's still a major PITA. (That's mostly MSVC, can't really comment on GCC/SNC, didn't use it that much).

July 12, 2010 at 1:35 PM

Blogger cbloom said...

"The top 32-bit clears is due to the fact that the heritage of the PS3 is 64-bit, not 32-bit."

Yeah, that's fine; the registers are 64 bit, so you mostly want to work on 64 bit junk. The pointers are 32 bit (or maybe the TLB is 32 bit or something?). Anyway, it seems that any time you do math to compute an address it wants to clear the top 32 before letting you deref. That's pretty fucked, it seems like there are many ways to avoid that. Like the memory subsystem could have just always ignored the top 32 bits. Or they could have just not done the masking and relied on you to generate valid addresses. Duh. (maybe it's a hacky way to keep you from stomping in the kernel, I dunno).

July 12, 2010 at 3:20 PM

Blogger   said...

Yeah, it was never explained to me why this was the case. At one point I started writing a pre-processor that would go through the assembly code before actual assembling and remove the clears when it was "safe" to do so. That never made it to the light of day, though I still think its a good idea. If the compiler people can't do it due to whatever reasons, no reason why you can't do it for them.

July 12, 2010 at 4:48 PM

Blogger ryg said...

* Top-clearing: On PS3, the SN systems compiler has an optimizer option where you basically tell it "yeah, I'm not doing any weird stuff with address wraparound". Turn that on, and all of the clearing disappears. This can easily be an instant 10% improvement on loops that process several data streams but don't have much work per element.

* Top-clearing 2: Explicitly making integer variables 64-bit (even if 32 bits would be fine) can be a big win because it avoids the extra clears. Note this goes for all 64-bit architectures, even x64. Also, C/C++ typing semantics can be full of surprises in that regard. One thing I recently encountered in x64 code:
"U8 *SomePtr = BasePtr + (Offset & 0xffff) * 4;" compiled into a shl+add. Well, Offset was a U32, so the whole expression gets evaluated as "U8 *SomePtr = BasePtr + U32((Offset & 0xffff) * 4);" and you get the 32-bit shift because it might overflow - nevermind that it can't because of the masking. Fix: Write it explicitly as "U8 *SomePtr = BasePtr + U64(Offset & 0xffff)*4;" and it gets folded into the addressing mode.

* Really important: don't take the address of an important local or it will end up on the stack and stay there (not sure if that's what you mean with "stack variables", or if you mean extra arguments that are pushed onto the stack). Extra-important when you have a "vertex buffer lock"-like primitive. Always use it like this:

Vertex *vp_;
vb->Lock(..., &vp_);
Vertex * restrict vp = vp_;


and then use vp.

* If you have code that processes data sequentially, you want to end up using the load-update and store-update instructions if possible. These correspond to "x = *(ptr += dist);" and "*(ptr += dist) = x;", respectively. Because of the lack of indexed addressing modes, you'll often end up with a bunch of pointers, and stepping them individually can be a big time sink in tight loops. Note the pre-increment semantics, which are annoying. It would be nice if the compilers always used this automatically (introducing a pointer adjustment at the top of the loop if necessary), but they don't. Extra-bad because converting the code to explicitly do the preincrement makes it a lot harder to read and understand.

* Damn variable-distance shifts being microcoded makes bitstream parsing a royal pain. That, and the lack of a an integer select/cmov instruction. And the high mispredicted branch penalty. Argh!

July 13, 2010 at 2:39 AM

Comment deleted

This comment has been removed by the author.

July 13, 2010 at 2:40 AM

Blogger ryg said...

* Register spills are bad, way worse than on x86. Main problem is the long relatively LHS window. A short-time spill can get really costly if it causes a LHS on every iteration!

* Bitfield structs: just say no on PPC (or check the code very closely). In theory PPC has all the nice bitfield-specialized "rotate and do masked insertion" ops and stuff like that. In practice they tend to get sandwiched between loads and stores to make sure you hit every LHS opportunity on the way. Had a nice instance of that in some in-game GUI code where it checked if a widget had focus, set some other internal state based on that, then checked another flag, and so on. A bunch of bools for every widget, nicely bitpacked to save space. Code looked completely innocent but was a total LHS-fest.

* Short functions: You really want them inlined because stack frame setup will cause LHSs too. On PC I sometimes like to have very small (2-3 line) functions in the .cpp because they have unwieldy dependencies (e.g. system headers) that you really don't want to put into public headers included from everywhere. PPC makes this painful. Now you can just turn on LTCG and that gets sorted out automatically, but this is way too heavyweight in terms of turnaround time to use for every single compile - it's usually something you use for "outside" versions. Not a big issue, but annoying.

* Debug builds, goddamnit. PPC compilers seem to think that "debug" means "just do codegen in the most stupid way possible". Yeah, I don't want scheduling, unrolling, partial redundancy elimination or high-level loop transforms in my debug builds. But forcing everything to memory after every statement and not extracting induction variables in loops? That's complete suicide on an architecture with huge LHS penalties and no nontrivial addressing modes. A debug build for a game stops being useful when it's so slow that it's impossible to play.

July 13, 2010 at 2:40 AM

Anonymous Anonymous said...

>> Bitfield structs: just say no on PPC

Indeed. Still it can be a good tradeof versus L2 cache misses sometimes.

Also the Xenon compiler with full optimization is often able to get rid of all these LHS on bitfield.

Doing the work manually (loading the bitfield on the stack, doing all operations locally, then storing the result once for all) will obviously get us rid of LHS on both platforms.

PS: I'm the same "Sly" as in the 3rd comment ; just wanted to get rid of the -id suffix, so I changed my OpenId.

July 13, 2010 at 5:36 AM

Blogger   said...

Great to hear that the compiler people have finally implemented a good work-around for the top clears. :)

July 13, 2010 at 9:38 AM

Blogger ryg said...

"Also the Xenon compiler with full optimization is often able to get rid of all these LHS on bitfield."
Yeah, the problem with multiplatform dev is that it's always the worst implementation of a feature on any of your target platform/compiler combos that ultimately makes the decision.

"Doing the work manually (loading the bitfield on the stack, doing all operations locally, then storing the result once for all) will obviously get us rid of LHS on both platforms."
Yep, basically all of these problems are solvable by looking at the generated code and rephrasing the C++ code until it produces decent results.

But that's a disaster in terms of maintainability. You have your code written in a high-level language but there's tons of invisible (in the source code) low-level constraints that are functionally equivalent but make a big difference in performance.

It's far too easy to cause a significant performance penalty just by changing code in a way that should be completely innocent.

"Great to hear that the compiler people have finally implemented a good work-around for the top clears. :)"

Oh, SNC has a bunch of good stuff.

They'll also automatically move single-precision FP computations over to VMX if that avoids load-hit-stores for int<->float conversions, for example.

SNC also tends to compile significantly faster than GCC. And produces code that's both faster and smaller. And their linker is way quicker than GNU ld. Okay, enough advertisement for now. :)

July 13, 2010 at 10:51 AM

Blogger cbloom said...

"
Yep, basically all of these problems are solvable by looking at the generated code and rephrasing the C++ code until it produces decent results.

But that's a disaster in terms of maintainability. You have your code written in a high-level language but there's tons of invisible (in the source code) low-level constraints that are functionally equivalent but make a big difference in performance."

Yeah this is what I was trying to get at about the code gen being a disaster.

Sure you can tweak on the C until you get nice ASM, but then its bad on the other platform, or you make some little innocuous change and it completely mucks up the code gen. It's just so fragile and you have to constantly keep an eye on it.

It's a total nightmare for robust cross-platform dev.

July 13, 2010 at 11:03 AM

Blogger cbloom said...

"One thing I recently encountered in x64 code:
"U8 *SomePtr = BasePtr + (Offset & 0xffff) * 4;""

BTW a useful note on this :

On x64 the behavior of code like this is different in Debug and Release.

In the MSufSort 64 bit bugs I was looking at, it has a bunch of address math like that.

In Debug builds, the compiler would do the (U32) conversion on the math and it would wrap, so the code would work.

In Release builds, it would sometimes NOT do the U32 conversion, just promote everything to 64 bit. Which is sort of nice, that's usually what you want, but in this case it caused bugs.

So with MSVC on x64 when you are doing mixed 32/64 maths, it will sometimes just promote everything and sometimes not.

July 13, 2010 at 11:12 AM

Blogger cbloom said...

"* Really important: don't take the address of an important local or it will end up on the stack and stay there ....

Vertex *vp_;
vb->Lock(..., &vp_);
Vertex * restrict vp = vp_;
"

Yeah, what we were seeing was cases like this. The problem is I would do something like your copy to vp , but the compiler would completely eliminate the copy to vp and generate later code as if I was just using vp_.

It's like some early pass of the parse had figured out the copy did nothing and so eliminated the temporary.

eg. I had cases like

int array[32];
FillArray(array);
const register int x = array[7];

and then when I would use "x" later down in the function it would gen a bunch of code to lookup array[7] instead of putting it in a register and leaving it there.

July 13, 2010 at 11:15 AM

Blogger Arseny Kapoulkine said...

?: compilation heavily depends on the compiler.
However, I don't think I often saw MSVC generate a cmov or something else for PC.
GCC (and SNC), on the other hand, sometimes do this.

Of course the robust thing is to make a select macro (along with the corresponding version for floats).

July 16, 2010 at 10:21 AM

Blogger cbloom said...

" ?: compilation heavily depends on the compiler. However, I don't think I often saw MSVC generate a cmov or something else for PC. GCC (and SNC), on the other hand, sometimes do this."

Yeah, for sure. VC wouldn't do it much on x86 because their code gen was pretty much 586 compatible. For x64 they do generate cmov pretty reliably.

July 16, 2010 at 1:58 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.