
Post a Comment On: cbloom rants

"09-06-10 - Cross Platform SIMD"

24 Comments -

1 – 24 of 24
Blogger Mojo said...

A struct containing a single SIMD value actually works well. It's been quite a while since I've noticed any of the extraneous loads & stores that older compilers used to generate.

Casting does bork the compiler sometimes, though. An inline method that returns a reference to the underlying type works pretty well; the temporary copies can be optimized away.
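The wrapper-struct approach described above might look something like this minimal sketch (names and layout are hypothetical, not Mojo's actual code):

```cpp
#include <immintrin.h>

// Hypothetical minimal wrapper in the style described above: a struct
// holding a single SIMD value, with an inline accessor that returns a
// reference to the underlying register type.
struct float4
{
    __m128 v;

    float4() {}
    explicit float4(__m128 m) : v(m) {}

    // Accessor returning the underlying type by reference; the
    // temporary copies this implies are typically optimized out.
    __m128 &      raw()       { return v; }
    const __m128 &raw() const { return v; }
};

inline float4 operator+(const float4 &a, const float4 &b)
{
    return float4(_mm_add_ps(a.raw(), b.raw()));
}

// Small helper for checking a lane from scalar code.
inline float lane0(const float4 &a)
{
    float f[4];
    _mm_storeu_ps(f, a.raw());
    return f[0];
}
```

The strong type is what makes the overloaded `operator+` possible, which is the main ergonomic win over free functions on a raw `__m128`.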

September 7, 2010 at 2:23 AM

Blogger jeskola said...

VC generates very good code for my float4 and int4 structs most of the time. It can handle this too without using memory:

inline int4 float4::reinterpret_int4() const { return int4(*(__m128i *)&x); }

Sometimes it doesn't seem to believe my __restricts though.

September 7, 2010 at 7:24 AM

Blogger castano said...

It's been a long time, but IIRC one problem is that msvc does not align struct function arguments properly when passed by value, so you have to be very careful if you rely on that. However, if you use the __m128 data type, the compiler does the right thing. You would think that the align keyword would do the same, but instead it simply gives you an infuriating error when passing aligned structs by value.

September 7, 2010 at 10:29 AM

Blogger cbloom said...

"inline int4 float4::reinterpret_int4() const { return int4(*(__m128i *)&x); }"

Yeah this piece concerns me, but I guess I don't have much of a choice for that; have to just do it and cross my fingers.

There are alternatives :

I could call an instruction like OR-with-self and hope that gets optimized out.

I could also use the move-through-union method.
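The move-through-union method mentioned here might be sketched as follows (formally this is implementation-defined in C++, but the compilers of this era handle it as intended; the function name is made up):

```cpp
#include <emmintrin.h>

// Sketch of the "move-through-union" reinterpret: instead of casting
// pointers, shuttle the value through a union so the compiler sees a
// plain value copy rather than an aliasing pointer cast.
inline __m128i reinterpret_float_to_int(__m128 f)
{
    union
    {
        __m128  f;
        __m128i i;
    } u;
    u.f = f;   // write as float vector...
    return u.i; // ...read back as integer vector
}
```

Note that SSE headers on reasonably recent compilers also provide `_mm_castps_si128`, which is a pure compile-time reinterpret and avoids the question entirely.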

September 7, 2010 at 11:19 AM

Blogger cbloom said...

"Sometimes it doesn't seem to believe my __restricts though."

Ugh I get this in MSVC and it's infuriating. I spent all of yesterday trying various tricks to make it stop storing temporaries to memory after each loop iteration and couldn't get it to stop.

September 7, 2010 at 11:21 AM

Blogger cbloom said...

"msvc does not align struct function arguments properly when passed by value, so you have to be very careful if you rely on that. However, if you use the __m128 data type, the compiler does the right thing."

Yeah, there is some problem with this. Also x64 has weird rules about passing __m128s.

But I think all this goes away if I just make all my functions FORCEINLINE.

Of course that's not really what you want for more complex functions.

September 7, 2010 at 11:24 AM

Blogger cbloom said...

Another few little open questions to me :

do I make separate simdU32 and simdS32 ?

how about variable names? vecU32 ? quadU32 ?

September 7, 2010 at 11:27 AM

Blogger jeskola said...

At least simple loops like this usually work well:

void test(float4 * __restrict pf, int4 * __restrict pi, int n)
{
    for (int i = 0; i < n; i++)
        pf[i] = ((pf[i] * 123.0f).reinterpret_int4() ^ pi[i]).reinterpret_float4();
}

loop:
movaps xmm0, XMMWORD PTR [eax]
movdqa xmm2, XMMWORD PTR [ecx+eax]
mulps xmm0, xmm1
pxor xmm0, xmm2
movdqa XMMWORD PTR [eax], xmm0
add eax, 16
dec edx
jne SHORT loop

This looks close to optimal.

September 7, 2010 at 12:52 PM

Blogger won3d said...

Alignment is a pain.

Note that there can be hidden state in SSE registers. I believe this is true for K8 and K10, and it might be true for Core i (Core 2 doesn't seem to be affected). I think it has to do with the subnormal state (or some other floating point sub type) which can be cleared if you do certain integer operations, so there might be a penalty to the next floating point op you do.

You'd really have to work hard to do this, though. I happened to be writing some fast r^(-3/2) code for a gravity simulator.

Does MSVC do autovectorization? GCC's has improved greatly recently. I would even consider implementing SIMD primitives as unrolled loops and depending on the vectorizer for that.

September 7, 2010 at 2:21 PM

Blogger cbloom said...

"Does MSVC do autovectorization? GCC's has improved greatly recently. I would even consider implementing SIMD primitives as unrolled loops and depending on the vectorizer for that. "

Relying on the compiler to do anything complex is not really viable IMO without some ability to compile-time-assert that it is happening.

September 7, 2010 at 2:29 PM

Blogger ryg said...

"I think it has to do with the subnormal state (or some other floating point sub type) which can be cleared if you do certain integer operations, so there might be a penalty to the next floating point op you do."
At some point AMD had shadow/tag bits for this that needed to be recalculated (at a small penalty) on a data type switch. The Core i series does have a penalty for mixing data types too, but for a different reason: SIMD int and FP units are separate and there's a 1 cycle bypass delay to move data across the chip.

"Does MSVC do autovectorization?"
Not that I'm aware of. Not a big fan of this kind of optimization anyway - it tends to work well on simple loops but is very brittle and easy to break by changes that shouldn't make a difference. That's the worst kind of optimization to work with - high variance in execution time between similar versions of source code, unpredictable at the source level, and with lots of external requirements (e.g. alignment restrictions) that are easy to break from a distance without noticing it.

September 7, 2010 at 9:12 PM

Blogger ryg said...

Correction: Core2 was the one with the 1-cycle data bypass delay for mixing types, Core i has 2-cycle delays between some units.

September 7, 2010 at 9:15 PM

Blogger Jeff Roberts said...

Charles, radvec4.h has a bunch of this awkwardly abstracted...

September 7, 2010 at 9:51 PM

Blogger Sam Martin said...

About 4-5 years ago I built a simd vector library for Lionhead using the typedef approach (I believe they still use it) and we also take the same approach at Geomerics.

I spent quite a while looking at the other options, but the generated code on the 3 platforms by the compilers at that point was shocking for anything other than a typedef. This may have changed since, but my gut feeling is that typedefs are still the way forward.

IMO, the lack of some type safety is not really that big a thing in practice - not worth the additional upheaval at any rate.

There are other pros and cons though:

+ you can write fairly decent simd vector code in a nice cross platform style. It doesn't replace platform-specific optimisation, but it's a good first pass.

+ there is a surprising amount of common functionality between the 3 main simd targets beyond the usual */+-. Many minor differences can be abstracted.

+ alignment is painful. You have to fallback to other vector types for unaligned data.

- there are some cross platform hurdles. Xbox declares all the operators for __vector4 in the global namespace for example. Plus minor compiler bugs/quibbles.

- it's way too easy to write poorly performing code on platforms without a unified register set by transferring things between floats/ints/vectors. But avoiding this can lead to obfuscated code.

So in summary it's great for simd-ising loops and so on, but I'm not sure the (potential) performance gains are worth the costs of using it as a general-purpose vector library. In retrospect I think a straightforward 4-element float array still has the advantage. Not an obvious call though.

September 8, 2010 at 2:58 AM

Blogger won3d said...

ryg, thanks for the info! When are you going to start your blog?

September 8, 2010 at 8:58 AM

Blogger castano said...

"ryg, thanks for the info! When are you going to start your blog?"

The ryg blog

September 8, 2010 at 11:00 AM

Blogger cbloom said...

Sam, thanks for the notes!

"IMO, the lack of some type safety is not really that a big thing in practice - not worth the additional upheaval at any rate."

Well, it does one huge thing, which is to let me use operator+. Without strong types I have to do Add4I , Add4F , etc.

"+ alignment is painful."

I wish I could disable loading my simd types from pointers, e.g.

val = *ptr;

is forbidden and you have to manually call LoadAligned() or LoadUnaligned().
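C++ can't fully forbid dereferencing a pointer to the type itself, but keeping raw pointers out of the API and routing every load through a named function gets most of the way there. A minimal sketch of that interface (all names hypothetical):

```cpp
#include <emmintrin.h>

// A SIMD type whose only sanctioned ways in and out of memory are
// named load/store functions that state their alignment requirement.
struct simd4f
{
    __m128 v;
};

inline simd4f LoadAligned(const float *p)   // p must be 16-byte aligned
{
    simd4f r; r.v = _mm_load_ps(p); return r;
}

inline simd4f LoadUnaligned(const float *p) // any alignment
{
    simd4f r; r.v = _mm_loadu_ps(p); return r;
}

inline void StoreUnaligned(float *p, simd4f a)
{
    _mm_storeu_ps(p, a.v);
}

// Usage example: load, add, store, check a couple of lanes.
inline float demo()
{
    float buf[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    simd4f a = LoadUnaligned(buf);
    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a.v, a.v));
    return out[0] + out[3]; // 2 + 8
}
```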

September 8, 2010 at 12:08 PM

Blogger ryg said...

Did something like Sam too (albeit more recently), works just fine. I ended up only supporting floats (+logical/compare ops) which sidesteps the type safety issue entirely.

For float the architectures are fairly close to each other, enough to paper over the differences by just exposing the important primitives and emulating them with multi-instruction sequences if necessary (e.g. madd -> mul+add on x86, unaligned loads on Xenon/PS3, or a "splat individual element" primitive that generates the necessary shuffle/permutation masks on x86/PS3).

Integer is more of an issue. Xenon has severely gutted integer SIMD (no int multiplies at all!) and the instructions have bigger differences between the architectures in general. For example, PPC vector shifts are always variable shifts with separate shift amount per vector element, x86 has either an immediate operand or a register parameter, but the shift amount is always the same for all elements. The PPC shifts are sufficiently more expressive to make me want to use them, but that doesn't map well to x86 at all. For 32-bit elements shufps (x86) / vpermwi (Xenon) is usually enough to get by, but for 8- and 16-bit I often really want to use vperm and you don't get that on x86, which usually means a very different dataflow. Most of the integer min/max stuff is only available in fairly recent x86 processors, and same for "horizontal" ops.

Once you take all that out, you're basically down to add/sub (both without carry-out) and shifts with compile-time constant amounts. That's a useless enough subset for me to just not bother :)

September 8, 2010 at 9:22 PM

Blogger ryg said...

...although if you throw in some unpacks it's enough to get through most of the pixel processing in H.264. But that's an exception :)

September 8, 2010 at 9:25 PM

Blogger cbloom said...

"Once you take all that out, you're basically down to add/sub (both without carry-out) and shifts with compile-time constant amounts. That's a useless enough subset for me to just not bother :)"

Eh, I sort of thought that, but when I was writing the exact same code for the 4th time it occurred to me that this is not the way it should be.

The common stuff is enough to do SIMD hashes, SIMD PNG filters, various simple pixel processing, etc. I think it's probably enough that I can SIMD almost everything I need to in Oodle in a cross-platform way (DXTC encoder, lossless PNG-alike, lossy DCT image compressor, hash, etc.)

If nothing else, I think just having the common typedef for your function protos and for loads & stores and all that basic stuff would save massive amounts of duplication. It would let you do the "#if X86" on the inside of the function where it really matters rather than duplicating the whole code flow path for each platform (which not only is more typing but creates fragile code that is hard to maintain and prone to bugs).
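The "common typedef with the #if on the inside" scheme might look like this sketch. The macro tests and the Xenon intrinsic name are illustrative, written from memory, not checked against an actual 360 toolchain:

```cpp
// One shared type name and one shared function prototype; the
// platform split happens only where the instruction choice matters.
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
  #include <emmintrin.h>
  typedef __m128i simdU32;
#elif defined(_XENON)
  typedef __vector4 simdU32;   // illustrative
#endif

inline simdU32 simd_add_u32(simdU32 a, simdU32 b)
{
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
    return _mm_add_epi32(a, b);
#elif defined(_XENON)
    return __vadduwm(a, b);    // name assumed from the VMX mnemonic
#endif
}

// Helper to pull lane 0 back into scalar code (x86 path shown).
inline unsigned lane0_u32(simdU32 a)
{
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
    return (unsigned)_mm_cvtsi128_si32(a);
#else
    return 0; // platform-specific extraction goes here
#endif
}
```

Everything above the `#if` inside the function body - the prototype, the typedef, the call sites - is shared, which is exactly the duplication this saves.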

September 8, 2010 at 10:31 PM

Blogger ryg said...

"The common stuff is enough to do SIMD hashes, SIMD PNG filters, various simple pixel processing, etc."
Okay, that's way more integer-heavy than the stuff I dealt with. I mostly wanted this for some rendering / animation / collision stuff, and for all that you don't really need integer beyond logical ops anyway.

"If nothing else, I think just having the common typedef for your function protos and for loads & stores and all that basic stuff would save massive amounts of duplication."
Yeah it does, and I used that in several places (e.g. use the dot product instrs on Xenon where you have them, otherwise do a 4x4 transpose + mul/3x madd). It's very nice to be able to drop a couple platform-specific intrinsics in there when they're the best choice, without duplicating the whole thing.

Not too fond of "typesafe" vector stuff in general. It sounds like a good idea, but both the default AltiVec intrinsics and the typed (spu_*) SPU intrinsics are just a PITA to use. It just gets messy, particularly with compare results ("vector bool short"? Yeah right) and unpack-style operations when you don't want to change the data type, just interleave two halves.

September 8, 2010 at 11:31 PM

Blogger cbloom said...

"It sounds like a good idea, but both the default AltiVec intrinsics and the typed (spu_*) SPU intrinsics are just a PITA to use."

Yeah I hated the spu_ stuff so much that I mostly just used the raw si_ stuff.

But I'm not sure if that's because it's a bad idea or because it's just a bad implementation.

I think maybe I could have the best of both worlds.

Make an untyped generic simd and Add4F() blah blah calls. Also make a typed simd and provide reinterpret ops to the generic. Let them interop painlessly.
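That hybrid might be sketched as follows - an untyped generic register with free-function ops, plus a thin typed wrapper that can reinterpret to and from it (all names hypothetical):

```cpp
#include <emmintrin.h>

typedef __m128i simd;  // untyped generic register

// Untyped free-function ops, named by width and element type.
inline simd Add4I(simd a, simd b) { return _mm_add_epi32(a, b); }

// Typed wrapper layered on top, with painless reinterpret both ways.
struct simdI32
{
    simd v;
    explicit simdI32(simd m) : v(m) {}
    simd reinterpret_generic() const { return v; }
};

inline simdI32 operator+(simdI32 a, simdI32 b)
{
    return simdI32(Add4I(a.v, b.v));
}
```

The typed layer buys `operator+`; the untyped layer stays available for the ops that don't care about element type.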

Time to write some code...

September 8, 2010 at 11:56 PM

Anonymous Anonymous said...

>> Ryg: Integer is more of an issue.
>> Xenon has severely gutted integer
>> SIMD [...]

Indeed. Usually in such cases I end up developing an algorithm using plain C, using it for the PC version and writing the SIMD-ed version for both consoles.

September 9, 2010 at 6:13 AM
