y = x^{-\frac{1}{p}},
\] which comes up when using $L_p$ normalization.
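Concretely, normalizing a vector $v$ by its $L_p$ norm means scaling it by \[
\left( \sum_i |v_i|^p \right)^{-1/p},
\] i.e., evaluating $x^{-1/p}$ with $x = \sum_i |v_i|^p$.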
When $p = 2$, the famous inverse square root hack applies, so I modified the derivation on Wikipedia for $p \neq 2$, resulting in \[
I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L,
\] where $I_x$ and $I_y$ are the integer reinterpretations of the bits of $x$ and $y$, $L = 2^{23}$, $B = 127$ is the exponent bias, and $\sigma$ is the tuning constant from the inverse square root derivation; this led to the code below.
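To spell out where this comes from (my own sketch of the algebra, using the standard bit-level approximation $\log_2 (x) \approx I_x / L - (B - \sigma)$ from the Wikipedia derivation): since $y = x^{-1/p}$ implies $\log_2 (y) = -\frac{1}{p} \log_2 (x)$, \[
\frac{I_y}{L} - (B - \sigma) \approx -\frac{1}{p} \left( \frac{I_x}{L} - (B - \sigma) \right)
\quad \Longrightarrow \quad
I_y \approx -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L.
\]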
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };
  unsigned int R = 0x5F375A84U;  // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
  unsigned int sub = v.i / p;

  v.i = pR - sub;

  return v.f;
}

This routine does a reasonable job, but lacks the Newton's method polishing step available to the inverse square root hack. Since the point was to avoid calling powf, I applied Newton's method to $f (y) = \log (y) + \frac{1}{p} \log (x)$; because $f' (y) = 1 / y$, the update $y_{n+1} = y_n - f (y_n) / f' (y_n)$ simplifies to the iterates \[
y_{n+1} = y_n \left( 1 - \log (2) \left( \frac{1}{p} \log_2 (x) + \log_2 (y_n) \right) \right).
\] The idea is to use this along with a cheap approximation for $\log_2$ that leverages the floating point representation. The first approximation I tried was a rational approximation, which resulted in the following.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_one (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);

  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };
  // equivalent to: mx = frexp (x, &ex), ex += 126
  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float log2x = ex - 123.52964741f - 4.29802897f / (0.73958333f + mx.f);
  float log2y = ey - 123.52964741f - 4.29802897f / (0.73958333f + my.f);

  return y * (1.0f - 0.69314718f * (log2x / p + log2y));
}

That is somewhat more accurate and somewhat more expensive. For higher accuracy, I went with a polynomial approximation and two Newton iterations.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_two (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);

  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };
  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float powmx = mx.f;
  float log2x = ex - 129.78533457081f + 10.07763081376f * powmx;
  powmx *= mx.f; log2x -= 13.90982855008f * powmx;
  powmx *= mx.f; log2x += 12.64852864324f * powmx;
  powmx *= mx.f; log2x -= 6.39532347130f * powmx;
  powmx *= mx.f; log2x += 1.364335673877f * powmx;

  log2x /= p;

  float powmy = my.f;
  float log2y = ey - 129.78533457081f + 10.07763081376f * powmy;
  powmy *= my.f; log2y -= 13.90982855008f * powmy;
  powmy *= my.f; log2y += 12.64852864324f * powmy;
  powmy *= my.f; log2y -= 6.39532347130f * powmy;
  powmy *= my.f; log2y += 1.364335673877f * powmy;

  union { float f; uint32_t i; } vz = { y * (1.0f - 0.69314718f * (log2x + log2y)) };

  int ez = ((vz.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mz = { (vz.i & 0x007FFFFF) | (0x7e << 23) };

  float powmz = mz.f;
  float log2z = ez - 129.78533457081f + 10.07763081376f * powmz;
  powmz *= mz.f; log2z -= 13.90982855008f * powmz;
  powmz *= mz.f; log2z += 12.64852864324f * powmz;
  powmz *= mz.f; log2z -= 6.39532347130f * powmz;
  powmz *= mz.f; log2z += 1.364335673877f * powmz;

  return vz.f * (1.0f - 0.69314718f * (log2x + log2z));
}

This is more accurate and more expensive.
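For illustration, the accuracy numbers reported below come from a loop roughly like this one (my own sketch, not the actual harness: the grid of inputs is made up, it assumes the three routines above are in scope, and it needs -lm to link).

#include <math.h>
#include <stdio.h>

// Sketch of an accuracy check: compare each approximation against powf
// over a grid of (x, p) values and report mean relative error.
int
main (void)
{
  double err0 = 0, err1 = 0, err2 = 0;
  int n = 0;
  float x, p;

  for (x = 1.0f; x < 1000.0f; x += 1.0f)
    for (p = 0.5f; p < 10.0f; p += 0.5f)
      {
        float exact = powf (x, -1.0f / p);

        err0 += fabsf (fast_inverse_p (x, p) - exact) / exact;
        err1 += fabsf (fast_inverse_p_one (x, p) - exact) / exact;
        err2 += fabsf (fast_inverse_p_two (x, p) - exact) / exact;
        ++n;
      }

  printf ("err0 = %g\nerr1 = %g\nerr2 = %g\n",
          err0 / n, err1 / n, err2 / n);

  return 0;
}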
Here are some example values along with some relative error estimates averaged over a more exhaustive range of inputs.
x        p      fast0     fast1     fast2     powf
7.000    0.870  0.10889   0.10663   0.10681   0.10681
7.000    2.488  0.45789   0.45704   0.45744   0.45744
7.000    4.106  0.63713   0.62284   0.62256   0.62256
7.000    5.724  0.73334   0.71151   0.71181   0.71181
7.000    7.342  0.78715   0.76660   0.76718   0.76718
7.000    8.960  0.82151   0.80411   0.80480   0.80479
69.000   0.870  0.00749   0.00769   0.00770   0.00770
69.000   2.488  0.18674   0.18229   0.18236   0.18236
69.000   4.106  0.36593   0.35650   0.35659   0.35658
69.000   5.724  0.47131   0.47700   0.47726   0.47726
69.000   7.342  0.56050   0.56189   0.56176   0.56176
69.000   8.960  0.63580   0.62382   0.62341   0.62341
211.000  0.870  0.00217   0.00213   0.00213   0.00213
211.000  2.488  0.11642   0.11628   0.11636   0.11636
211.000  4.106  0.27032   0.27150   0.27161   0.27161
211.000  5.724  0.40273   0.39225   0.39260   0.39260
211.000  7.342  0.47678   0.48220   0.48243   0.48243
211.000  8.960  0.54817   0.55025   0.55031   0.55030

err0 = 0.021138
err1 = 0.000680451
err2 = 7.20003e-06

I did a timing test, compiling with -O3 -finline-functions -ffast-math, on a box running Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6).
fast_inverse_p million calls per second = 206.677
fast_inverse_p_one million calls per second = 77.9861
fast_inverse_p_two million calls per second = 26.531
powf million calls per second = 9.04992

Coincidentally, each increase in accuracy costs about a factor of 3 in throughput.
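The throughput measurement is just a tight loop over many calls; something along these lines reproduces the general setup (a reconstruction, not the exact harness: the call count, arguments, and use of gettimeofday are my own choices, and the volatile sink keeps the optimizer from discarding the calls).

#include <stdio.h>
#include <sys/time.h>

// Sketch of a throughput test: time many calls and report millions of
// calls per second.  Writing to a volatile prevents the compiler from
// eliminating the calls under -O3 -ffast-math.
static volatile float sink;

static double
now_seconds (void)
{
  struct timeval tv;
  gettimeofday (&tv, NULL);
  return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int
main (void)
{
  const int calls = 10000000;
  int i;
  double start = now_seconds ();

  for (i = 0; i < calls; ++i)
    sink = fast_inverse_p (7.0f + (i & 0xFF), 2.5f);

  printf ("fast_inverse_p million calls per second = %g\n",
          1e-6 * calls / (now_seconds () - start));

  return 0;
}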
How does p = 0.5 compare to the fast inverse square root hack in your test harness? (I'm interested in both timing and err.)
Yes, I'll test that. However, my immediate response is: if you know you are taking an inverse integral power, the Newton step can be done faster and more accurately with y = x^p as the objective function.
The postfix indicates the number of Newton steps done.
ReplyDeletersqrtf0 relative accuracy 0.0236779
rsqrtf1 relative accuracy 0.000969781
rsqrtf2 relative accuracy 1.86903e-06
rsqrtf3 relative accuracy 2.66092e-08
rsqrtf0 million calls per second = 233.094
rsqrtf1 million calls per second = 182.67
rsqrtf2 million calls per second = 154.079
rsqrtf3 million calls per second = 121.4
So without any Newton steps they have similar speed and (in)accuracy. However, the superior form of the Newton step means the square root hack has a much better computation vs. accuracy tradeoff.
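To make that superior form concrete (my own illustration of the p = 2 case, not necessarily the exact rsqrtf routines timed above): taking f(y) = y^{-2} - x as the objective instead of the logarithmic one gives the classic division-free polish y <- y (3/2 - (x/2) y^2), which needs only multiplications and no log2 approximation. A minimal sketch:

float
fast_rsqrt_one (float x)
{
  union { float f; uint32_t i; } v = { x };
  float y;

  v.i = 0x5F375A84U - (v.i >> 1);   // p = 2 case of the bit hack above
  y = v.f;

  // division-free Newton polish from f (y) = y^{-2} - x: no log2 needed
  return y * (1.5f - 0.5f * x * y * y);
}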