What should I worry about if I compress float64 array to float32 in numpy?
Rise to the top 3% as a developer or hire one of them at Toptal: https://topt.al/25cXVn
--------------------------------------------------
Music by Eric Matyas
https://www.soundimage.org
Track title: Magical Minnie Puzzles
--
Chapters
00:00 What Should I Worry About If I Compress Float64 Array To Float32 In Numpy?
01:31 Answer 1 Score 2
01:53 Answer 2 Score 7
04:57 Accepted Answer Score 6
07:54 Thank you
--
Full question
https://stackoverflow.com/questions/1100...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #numpy #floatingpoint #compression
#avk47
ANSWER 1
Score 7
The following assumes you are using standard IEEE-754 floating-point operations (which are common, with some exceptions) in the usual round-to-nearest mode.
If a double value is within the normal range of float values, then the only change that occurs when the double is rounded to a float is that the significand (the fraction portion of the value) is rounded from 53 bits to 24 bits. This causes an error of at most 1/2 ULP (unit of least precision). The ULP of a float is 2^-23 times the greatest power of two not greater than the float. E.g., if a float is 7.25, the greatest power of two not greater than it is 4, so its ULP is 4*2^-23 = 2^-21, about 4.77e-7. So the error when a double in the interval [4, 8) is converted to float is at most 2^-22, about 2.38e-7. For another example, if a float is about .03, the greatest power of two not greater than it is 2^-6, so the ULP is 2^-29, and the maximum error when converting to float is 2^-30.
Those are absolute errors. The relative error is less than 2^-24, which is 1/2 ULP divided by the smallest the value could be (the smallest value in the interval for a particular ULP, i.e., the power of two that bounds it below). E.g., for each number x in [4, 8), we know the number is at least 4 and the error is at most 2^-22, so the relative error is at most 2^-22/4 = 2^-24. (The error cannot be exactly 2^-24 because there is no error when converting an exact power of two from double to float, so there is an error only if x is greater than four, and hence the relative error is less than, not equal to, 2^-24.) When you know more about the value being converted, e.g., that it is nearer 8 than 4, you can bound the error more tightly.
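These bounds are easy to check empirically in numpy (a sketch; `np.spacing` returns the ULP at a value):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(4.0, 8.0, 100_000)            # doubles in the binade [4, 8)
y = x.astype(np.float32).astype(np.float64)   # round to float, then widen back

abs_err = np.abs(y - x)
ulp = np.spacing(np.float32(4.0))             # ULP in [4, 8) is 4 * 2**-23 = 2**-21

assert abs_err.max() <= ulp / 2               # absolute error at most 1/2 ULP
assert (abs_err / x).max() < 2.0**-24         # relative error below 2**-24
```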
If the number is outside the normal range of a float, errors can be larger. The maximum finite float value is 2^128 - 2^104, about 3.40e38. When you convert a double that exceeds that by 1/2 ULP (of a float; doubles have finer ULPs) or more to float, infinity is returned, which is, of course, an infinite absolute error and an infinite relative error. (A double that is greater than the maximum finite float but exceeds it by less than 1/2 ULP is converted to the maximum finite float and has the same errors discussed in the previous paragraph.)
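The overflow behavior can be verified directly (a sketch; the ULP of a float in the top binade is 2^104, so 1/2 ULP is 2^103):

```python
import numpy as np

fmax = float(np.finfo(np.float32).max)        # 2**128 - 2**104, about 3.40e38
half_ulp = 2.0**103                           # half of the top-binade ULP, 2**104

over  = np.array([fmax + half_ulp]).astype(np.float32)[0]
under = np.array([fmax + 0.4 * half_ulp]).astype(np.float32)[0]

assert np.isinf(over)                         # 1/2 ULP or more beyond the max: infinity
assert under == np.float32(fmax)              # less than 1/2 ULP beyond: max finite float
```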
The minimum positive normal float is 2^-126, about 1.18e-38. Numbers within 1/2 ULP of this (inclusive) are converted to it, but smaller numbers are converted to a special denormalized format, where the ULP is fixed at 2^-149. The absolute error will be at most 1/2 ULP, 2^-150. The relative error will depend significantly on the value being converted.
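For example (a sketch; the input value below is arbitrary):

```python
import numpy as np

x = 1.3 * 2.0**-140                           # well below the normal float range 2**-126
y = float(np.float32(x))                      # rounds in the subnormal (denormal) format

assert 0 < y < np.finfo(np.float32).tiny      # the result is a subnormal float
assert abs(y - x) <= 2.0**-150                # at most 1/2 of the fixed ULP 2**-149
assert abs(y - x) / x > 2.0**-24              # relative error far exceeds the normal-range bound
```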
The above discusses positive numbers. The errors for negative numbers are symmetric.
If the value of a double can be represented exactly as a float, there is no error in conversion.
Mapping the input numbers to a new interval can reduce errors in specific situations. As a contrived example, suppose all your numbers are integers in the interval [2^48, 2^48+2^24). Then converting them to float would lose all information that distinguishes the values; they would all be converted to 2^48. But mapping them to [0, 2^24) would preserve all information; each different input would be converted to a different result.
Which map would best suit your purposes depends on your specific situation.
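The contrived example above is easy to reproduce:

```python
import numpy as np

rng = np.random.default_rng(1)
k = rng.integers(0, 2**24, 1000).astype(np.float64)
x = 2.0**48 + k                               # integers in [2**48, 2**48 + 2**24)

direct = x.astype(np.float32)
shifted = (x - 2.0**48).astype(np.float32)    # map to [0, 2**24) first

assert np.all(direct == np.float32(2.0**48))  # direct conversion collapses every value
assert np.unique(shifted).size == np.unique(x).size  # mapped values stay distinct
```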
ACCEPTED ANSWER
Score 6
It is unlikely that a simple transformation will reduce error significantly, since your distribution is centered around zero.
Scaling can have an effect in only two ways. One, it moves values away from the denormal interval of single-precision values, (-2^-126, 2^-126). (E.g., if you multiply by, say, 2^123, values that were in [2^-249, 2^-126) are mapped to [2^-126, 2^-3), which is outside the denormal interval.) Two, it changes where values lie in each "binade" (the interval from one power of two to the next). E.g., your maximum value is 20; the ULP for that binade is 16*2^-23 = 2^-19, so the relative error may be up to 1/2 * 2^-19 / 20, about 4.77e-8. Suppose you scale by 32/20, so values just under 20 become values just under 32. Then, when you convert to float, the relative error is at most 1/2 * 2^-19 / 32 (or just under 32), about 2.98e-8. So you may reduce the error slightly.
With regard to the former, if your values are nearly normally distributed, very few are in (-2^-126, 2^-126), simply because that interval is so small. (A trillion samples of your normal distribution almost certainly have no values in that interval.) You say these are scientific measurements, so perhaps they are produced with some instrument. It may be that the machine does not measure or calculate finely enough to return values that range from 2^-126 to 20, so it would not surprise me if you have no values in the denormal interval at all. If you have no values in the single-precision denormal range, then scaling to avoid that range is of no use.
With regard to the latter, we see a small improvement is available at the end of your range. However, elsewhere in your range, some values are also moved to the high end of a binade, but some are moved across a binade boundary to the small end of a new binade, resulting in increased relative error for them. It is unlikely there is a significant net improvement.
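The binade effect described above can be measured (a sketch; the scale factor 32/20 from the example, undone after conversion):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(19.0, 20.0, 100_000)          # doubles near the top of the range

scale = 32.0 / 20.0
plain  = x.astype(np.float32).astype(np.float64)
scaled = (x * scale).astype(np.float32).astype(np.float64) / scale

rel_plain  = np.abs(plain - x) / x
rel_scaled = np.abs(scaled - x) / x

assert rel_scaled.max() < rel_plain.max()     # a slight improvement near the binade edge
```

Note the improvement applies only to values pushed toward the high end of a binade; values pushed across a binade boundary get worse, which is why the net gain over the whole range is small.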
On the other hand, we do not know what is significant for your application. How much error can your application tolerate? Will the change in the ultimate result be unnoticeable if random noise of 1% is added to each number? Or will the result be completely unacceptable if a few numbers change by as little as 2^-200?
What do you know about the machinery producing these numbers? Is it truly producing numbers more precise than single-precision floats? Perhaps, although it produces 64-bit floating-point values, the actual values are limited to a population that is representable in 32-bit floating-point. Have you performed a conversion from double to float and measured the error?
There is still insufficient information to rule out these or other possibilities, but my best guess is that there is little to gain by any transformation. Converting to float will either introduce too much error or it will not, and transforming the numbers first is unlikely to alter that.
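Measuring the round-trip error on the actual data answers the question above directly (a sketch; the normal sample here is a stand-in for your measurements):

```python
import numpy as np

def roundtrip_error(a):
    """Max absolute and relative error incurred by a float64 -> float32 round trip."""
    b = a.astype(np.float32).astype(np.float64)
    abs_err = np.abs(b - a)
    with np.errstate(divide="ignore", invalid="ignore"):
        rel_err = np.where(a != 0, abs_err / np.abs(a), 0.0)
    return abs_err.max(), rel_err.max()

rng = np.random.default_rng(3)
data = rng.normal(0.0, 5.0, 1_000_000)        # stand-in for scientific measurements
max_abs, max_rel = roundtrip_error(data)
assert max_rel < 2.0**-24                     # all values here are in the normal float range
```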
ANSWER 3
Score 2
The exponent range of float32 is much smaller than that of float64 (both for very large and for very tiny values), but assuming all your numbers stay within it, you only need to worry about the loss of precision: float32 is only good to about 7 significant decimal digits.
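A quick illustration of that precision limit (a sketch with an arbitrary constant):

```python
import numpy as np

x = 0.123456789
y = float(np.float32(x))

assert y != x                                 # digits beyond float32 precision are lost
assert abs(y - x) / x < 1e-7                  # but about 7 significant digits survive
```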