
Post a Comment On: cbloom rants

"07-21-10 - x86"

10 Comments -

Blogger Tom Forsyth said...

Load-op on x86 is frequently not free. An instruction like "add rax, [rbx]" is usually broken into two uops - "mov temp, [rbx]" and "add rax, temp". These are scheduled separately, they execute in different units, and they take register-file bandwidth just the same as if they were two real instructions. The two advantages over just using two real instructions are (1) you don't have to use an x86 register for the temporary. This used to be exciting, but in 64-bit we have 16 integer registers and it's not a big deal any more. (2) the combined instruction is slightly shorter to decode than the two separate ones, so it saves a bit of I$ space.
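A sketch of the split Tom describes (the "temp" register and the crack into exactly these two uops are illustrative; real cores name and fuse uops differently):

```
; macro-op as written in the instruction stream:
add rax, [rbx]

; what the core actually schedules - two separate uops:
mov temp, [rbx]   ; load uop, issued to the load unit
add rax, temp     ; ALU uop, issued to an integer unit, dependent on the load
```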

That's on the current out-of-order cores. On the in-order cores (Atom, Larrabee, probably Rock Creek) they're almost always a bad idea and are slower than breaking the instructions apart.

It gets even more complex with instructions like "add [rbx], rax", i.e. with the memory as destination. These can cause a lot of problems for OOO cores, because even though it's broken into three uops, it still has to behave as if it is a single one. There are subtleties there to do with fault handling and how many times you're allowed to read or write memory.

As for OOO vs in-order, you've made a fairly standard mistake of thinking the PowerPC cores are shit because they're in-order. No, they're shit because they're shit. They're just badly made. They built them for a clock speed that was too high, so their pipelines have way too many stages, they don't have enough bypasses to handle that many stages, and so they stall like crazy. If they'd designed them to run at half the clock speed the pipelines would be far shorter, much simpler, and they'd run a lot better.

July 24, 2010 at 7:52 PM

Blogger cbloom said...

"As for OOO vs in-order, you've made a fairly standard mistake of thinking the PowerPC cores are shit because they're in-order. No, they're shit because they're shit"

Sigh. No, that's not my point at all. My point is not that they are slow, but how they affect coding practice.

In particular, the issue is that on x86, complex C++ is maybe only 25-50% slower than simple C code (no structs, only locals). On in-order PPC the C++ might be 100-200% slower than simplified C.

Certainly the reason this is so important at the moment is because these PPC chips really suck. But even if everything was faster it would still favor the old plain-C style of coding.

Basically it gives credibility to all the curmudgeons and dinosaurs who think that allocations during your frame and polymorphism and such like are terrible. On x86 you can just roll your eyes and get on with your C++, but on in-order PPC you have to grudgingly admit that they are in fact correct.

Now an actual interesting topic is how a tiny bit of out-of-orderness might fix this, which I guess is what they're doing in Atom2. But I don't know the details about that.

July 24, 2010 at 8:34 PM

Blogger Tom Forsyth said...

It looks like most of your gripes are to do with the compiler, not with the low-level ASM architecture.

In practice, good code for x86 looks very similar to good code for PPC. Yes, the PPC kicks you in the nuts much much harder when you (or rather, the compiler) screw up - that's because it's a shitty design.

None of your points are anything to do with the perpetual in-order vs slightly-OOO vs fully-OOO religious wars. They're everything to do with how awful C is to compile.

July 24, 2010 at 9:29 PM

Blogger Brian said...

I do have a gripe about x86. Why aren't the instructions that do moves (rep movs*) the fastest way to copy memory? Instead you have to do crazy stuff with mmx/sse instructions to reach the peak bandwidth....

July 24, 2010 at 11:18 PM

Blogger   said...

yup, PPC is a different beast which requires dumb coding styles. It's also very hard to convince some people of this who have typically only worked with x86 processors in the past.

July 25, 2010 at 7:38 AM

Blogger cbloom said...

"None of your points are anything to do with the perpetual in-order vs slightly-OOO vs fully-OOO religious wars. They're everything to do with how awful C is to compile."

I don't know if you're just being difficult or if you have a point that I just can't see. Maybe you should write an article about how to run C++ fast on an in-order core?

The issue with OOO is memory and branch stalls.

If your memory stalls always took exactly the same amount of time and you always took the branches the same way, then yes a compiler could statically schedule the code perfectly. (* I would also argue that while this scheduling is theoretically possible it is neither realistic nor even desirable in practice, because it would have to be at link time and be very slow).

But more importantly, the stalls are *not* predictable, so no in-order execution can ever keep up with an OOO unit that is running ahead on whatever it can manage.

For example, say you execute a loop of complicated C++ code 3 times. The first time there will be branch misses and memory stalls. The OOO core will run as much code as it can while it waits for the stalls. The next time through it will run full speed without stalls.

You just can't beat dynamic scheduling for typical C++ish workloads.

(and recall I'm not talking about tight inner loops and super optimized stuff; I'm talking about the 90% of game code that you want to be decently fast but you don't want to have to hand massage).

July 25, 2010 at 10:14 AM

Blogger Tom Forsyth said...

Your opening statement is that x86 is greater than any other ISA. And then the reasons you give are largely misunderstandings of what happens to x86 when it is actually executed.

But now you've switched to the completely orthogonal OOO vs in-order argument. Of course OOO gives higher performance per clock cycle (higher perf per square mm or per watt is a much more complex argument). But that's got nothing to do with the ISA used.

The actual reason x86 is better than PPC is the fairly subtle one that for rarely-executed code, the CISC encoding means you get fewer I$ misses. And that's pretty much it.

July 28, 2010 at 9:24 AM

Blogger vv said...

"...PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway"

The PPU is 1/3 of a K8 per cycle on branch- and microcode-heavy code. This is comparable to the issue-width difference. That means "x86s" are not as efficient as you imagine. Xenon/PPU performs mediocrely even on good code. But that's not a big deal - they're good enough.

You failed to realize that fast code will be fast on either architecture, MUCH faster than typical C++ shit-code.

"Basically it gives credibility to all the curmudgeons and dinosaurs who think that allocations during your frame and polymorphism and such like are terrible"

On a system with limited memory, going crazy with dynamic heap allocations (no guarantees, fragmentation) is a terrible sin. I have to deal with a legacy PC codebase where many places like that caused major headaches and crashes. And I really hate it.

Allocation is evil everywhere, even in Java, where it is cheap. You can't be fast until you place your dynamic data in pools and ring-buffers.

August 20, 2010 at 2:44 AM

Blogger cbloom said...

"You failed to realize that fast code will be fast on either architecture, MUCH faster than typical C++ shit-code."

Oh yeah it never occurred to me that fast code would be fast.

And it's just not true. Super-optimized very low level tweaked C on the in-order-PPC is significantly slower than incredibly high level elegant simple C++ on the x86/Core chips.

"On a system with limited memory, going crazy with dynamic heap allocations (no guarantees, fragmentation) is a terrible sin. ... You can't be fast until you place your dynamic data in pools and ring-buffers."

This is just not true. Yes, it might be easier to manage your limited memory if you do fixed-size pools, but that is not a more efficient use of the memory, nor is it necessarily faster.

August 20, 2010 at 8:48 AM

Blogger vv said...

"Super-optimized very low level tweaked C on the in-order-PPC is significantly slower than incredibly high level elegant simple C++ on the x86/Core chips."

That doesn't mean the high-level "elegant" C++ is actually fast. I'd say C++ is not a high-level language at all. Have fun with boost http://yfrog.com/htboosterrorp

By your logic, engineers are wasting their time trying to SIMD-ize and low-level tune libraries like MKL etc. They might just use high-level C++.

"...but that is not a more efficient use of the memory, nor is it necessarily faster."

So you claim that pieces of data scattered through memory are faster?

Maybe it's just me who thinks in terms of 128B-aligned DMA blocks?

August 20, 2010 at 10:29 AM
