
Post a Comment On: cbloom rants

"11-20-10 - Function approximation by iteration , Part 3 - Reciprocal Square Root"

10 Comments -

1 – 10 of 10
Blogger ryg said...

"Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless"
Maybe pointless on x86 where rsqrtss gives you an initial estimate accurate to 12 bits, but not on PPC where frsqrte only gives you 5 bits (curiously, AltiVec vrsqrtefp is much better, I think it's either 10 or 12 bits).

November 20, 2010 at 5:13 PM

Blogger ryg said...

Small error: the B2 you give for "second step optimal" is missing its sign.

November 20, 2010 at 7:36 PM

Blogger cbloom said...

Indeed!

November 20, 2010 at 8:18 PM

Blogger Ben Garney said...

Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source. ;)

November 21, 2010 at 11:47 AM

Blogger cbloom said...

"Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source. ;)"

Well I'm not claiming anything is faster, that's not really the point. The "cb_method1" is exactly the same speed as the old Carmack/Walsh/Lomont method.

By far the fastest way is to use your platform's reciprocal sqrt estimate. As ryg points out, if your platform's rsqrte is poor then maybe a "method1" type step after that would be ideal.
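A minimal sketch of that kind of refinement step, with the hardware estimate stood in by an arbitrary rough guess (the function name and the placeholder estimate are mine, not from the post):

```c
#include <assert.h>
#include <math.h>

/* One Newton step for 1/sqrt(x): given a rough estimate y0,
   y1 = y0 * (1.5 - 0.5*x*y0*y0) roughly doubles the number of
   correct bits.  y0 would come from rsqrtss/frsqrte/vrsqrtefp
   on a real target; here it's just passed in. */
static float rsqrt_refine(float x, float y0)
{
    return y0 * (1.5f - 0.5f * x * y0 * y0);
}
```

With a 5-bit estimate like PPC's frsqrte, one such step gets you to roughly 10-11 bits; with a 12-bit estimate like rsqrtss, to roughly full single precision.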

Just for completeness; on Core i7:

x86 :

carmack/cb1 : 11 clocks
babylonian : 34 clocks
sse rsqrte : 7 clocks

x64 :

carmack/cb1 : 8 clocks
babylonian : 14 clocks
sse rsqrte : 3 clocks

(note that these should be taken with a grain of salt because I haven't carefully analyzed the assembly to make sure they are being treated equally; in particular I measured cb1 to be slightly faster than Walsh (7.5 vs 8.25 clocks on x64), which is just an anomaly of code generation)

November 21, 2010 at 12:28 PM

Blogger cbloom said...

Hmm... looking at the ASM, MSVC appears to be automatically unrolling some of my test loops (!!). That's kind of awesome, but also annoying: it creates a huge inconsistency in timing, since it depends on what else is going on in the function.

Anyway, here are some more x64 times with what appears to be 4X unrolling being done by the compiler :

rsqrtf_clib : 9.208541
rsqrtf_cb_method1 : 4.134571
rsqrtf_walsh_newton : 4.658854
rsqrtf_walsh_newton_twice : 6.542103
rsqrtf_walsh_halley : 7.093772
rsqrtf_babylonian : 6.710390
rsqrtf_sse_est : 1.986016

x64-Corei7 is so ridiculously fucking fast that it's impossible to time anything on it.

The only way to really test speed reliably these days is "in situ" - you have to put it in your real game and measure it there, because there are too many code generation issues and pipeline issues. (eg. in the real world of consoles, the Walsh floating point trick method is totally impractical due to pipeline stalls)

November 21, 2010 at 12:44 PM

Blogger won3d said...

Yeah, SSE's rsqrt is quite awesome, but I've found a pretty fun use for "magic" floating point tricks in my n-body gravity simulator. If you're doing an inverse square law and you have to normalize a vector, you end up doing something like G * pow(r_dot_r + epsilon, -3/2). You can do this with rsqrt and a bunch of multiplies, but the bit bashing trick is definitely faster. Obviously, the results aren't particularly accurate, but you get the right effect.

December 1, 2010 at 2:48 AM

Blogger Jan Řrřola Kadlec said...

> Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no.

<a href="http://rrrola.wz.cz/inv_sqrt.html">That's not right</a> - the optimal constant for one Newton is 0x5F1FFFF9 (for f(u) = 0.703952253 * (2.38924456 - u)).

The max relative error for this is 0.0650197%.
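For concreteness, Jan's constant and modified first step drop into the classic bit-hack like this (my own sketch of what's described above and on the linked page, not code from it):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* rsqrt via Jan Kadlec's jointly-optimized constants:
   magic 0x5F1FFFF9 plus the modified step
   y *= 0.703952253 * (2.38924456 - x*y*y),
   max relative error ~0.0650%. */
static float rsqrt_kadlec(float x)
{
    uint32_t u;
    float y;
    memcpy(&u, &x, sizeof u);      /* bit-cast without aliasing UB */
    u = 0x5F1FFFF9u - (u >> 1);
    memcpy(&y, &u, sizeof y);
    return 0.703952253f * y * (2.38924456f - x * y * y);
}
```

Note this is not a plain Newton step; the two coefficients are co-optimized with the magic constant rather than fixed at 1.5 and 0.5.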

December 1, 2010 at 5:19 AM

Blogger cbloom said...

Jan Kadlec, I presume?

"That's not right - the optimal constant for one Newton is 0x5F1FFFF9"

Mmm indeed! Nice work.

It looks like I failed to find that because I was only doing greedy optimization and got stuck in a local minimum; it looks like you have a nice little implementation of the DIRECT optimizer.

I've always been a pedant about the danger of local minima, so shame on me for falling for it.

Addendum will be added to the original post.

December 1, 2010 at 9:17 AM

Blogger Jan Řrřola Kadlec said...

Yeah, that's me.

If I remember correctly, the magic constant has 4 periodic local minima. Floating point "noise" is evil - I had to make a 2048³ exhaustive search around the best DIRECT result to find these values.

December 4, 2010 at 3:45 PM
