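Floating-Point Types
1 What Is a Floating-Point Number?
A floating-point type represents numbers that have a fractional part. There are infinitely many real numbers between 0 and 1 alone, yet a 64-bit value has at most 2^64 distinct bit patterns, so only a finite subset of those numbers can be represented exactly. How is that subset chosen? Nearly every programming language follows the IEEE 754 standard.
This article uses Python and Rust to check how floating-point numbers are implemented. The conclusions first:
- Python's float (taking CPython as the reference) is implemented with the C double type, i.e. float64
- This is the same format as Rust's f64
We start from one question: is 0.1 + 0.2 equal to 0.3?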
1.1 Python check: 0.1 + 0.2 vs 0.3
print(f"type(0.1+0.2) -> {type(0.1+0.2)}" )
print(f"0.1 + 0.2 == 0.3 -> {0.1+0.2==0.3}")
print(f"0.1 + 0.2 -> {0.1+0.2}")
print(f"0.3 -> {0.3}")
print(f"format(0.1+0.2, '.100f') -> {format(0.1+0.2, '.100f')}")
print(f"format(0.3, '.100f') -> {format(0.3, '.100f')}")
print(f"(0.1+0.2).hex() -> {(0.1+0.2).hex()}")
print(f"0.3.hex() -> {0.3.hex()}")
print(f"(0.1+0.2).as_integer_ratio() -> {(0.1+0.2).as_integer_ratio()}")
print(f"0.3.as_integer_ratio() -> {0.3.as_integer_ratio()}")
print(f"0.125.hex() -> {0.125.hex()}")
print(f"0.125.as_integer_ratio() -> {0.125.as_integer_ratio()}")
Output:
type(0.1+0.2) -> <class 'float'>
0.1 + 0.2 == 0.3 -> False
0.1 + 0.2 -> 0.30000000000000004
0.3 -> 0.3
format(0.1+0.2, '.100f') -> 0.3000000000000000444089209850062616169452667236328125000000000000000000000000000000000000000000000000
format(0.3, '.100f') -> 0.2999999999999999888977697537484345957636833190917968750000000000000000000000000000000000000000000000
(0.1+0.2).hex() -> 0x1.3333333333334p-2
0.3.hex() -> 0x1.3333333333333p-2
(0.1+0.2).as_integer_ratio() -> (1351079888211149, 4503599627370496)
0.3.as_integer_ratio() -> (5404319552844595, 18014398509481984)
0.125.hex() -> 0x1.0000000000000p-3
0.125.as_integer_ratio() -> (1, 8)
1.2 Rust check: 0.1_f64 + 0.2_f64 vs 0.3_f64
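fn main() {
    // Direct equality comparison of the two f64 values
    println!(
        "0.1_f64+0.2_f64==0.3_f64 -> {}",
        0.1_f64 + 0.2_f64 == 0.3_f64
    );
    // Default formatting prints the shortest decimal that round-trips
    println!("0.1_f64+0.2_f64 -> {}", 0.1_f64 + 0.2_f64);
    println!("0.3_f64 -> {}", 0.3_f64);
    // Expand to 100 decimal places to expose the stored values
    println!("0.1_f64+0.2_f64 -> {:.100}", 0.1_f64 + 0.2_f64);
    println!("0.3_f64 -> {:.100}", 0.3_f64);
    println!("0.125_f64 -> {:.100}", 0.125_f64);
}
Output:
0.1_f64+0.2_f64==0.3_f64 -> false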
0.1_f64+0.2_f64 -> 0.30000000000000004
0.3_f64 -> 0.3
0.1_f64+0.2_f64 -> 0.3000000000000000444089209850062616169452667236328125000000000000000000000000000000000000000000000000
0.3_f64 -> 0.2999999999999999888977697537484345957636833190917968750000000000000000000000000000000000000000000000
0.125_f64 -> 0.1250000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
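Key points to note:
Note
- Expanded to 100 decimal places, 0.1+0.2 clearly differs from 0.3 (see lines 5 and 6 of the Python output); their hexadecimal representations differ only in the last digit, i.e. by one unit in the last place (see lines 7 and 8 of the Python output)
- The trailing digits of every expansion are zeros: a float64 stores a finite binary fraction, so its decimal expansion terminates. Values that need infinitely many digits, including irrational numbers such as pi and sqrt(2), can therefore only be approximated
- Python's default float is float64, and its results match Rust's f64 digit for digit, because both follow the same standard
In other words, Python's float and Rust's f64 are both implementations of one standard, IEEE 754.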
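The gap between 0.1+0.2 and 0.3 can also be checked with exact rational arithmetic. The sketch below is an illustration added here, not part of the original experiment; it assumes only the standard-library fractions.Fraction, which converts a float to its exact stored value, and the variable names are chosen for this example.

```python
from fractions import Fraction

# Fraction(float) converts the stored binary value exactly, with no rounding.
a, b, c = Fraction(0.1), Fraction(0.2), Fraction(0.3)

# The stored 0.1 and 0.2 lie slightly above 1/10 and 1/5;
# the stored 0.3 lies slightly below 3/10.
print(a > Fraction(1, 10), b > Fraction(1, 5), c < Fraction(3, 10))

exact_sum = a + b                  # exact sum of the two stored values
rounded_sum = Fraction(0.1 + 0.2)  # what float64 addition actually returns

# The exact sum is not representable in float64, so the addition rounds it.
print(exact_sum == rounded_sum)    # False

# The rounded sum sits exactly one float64 step (2**-54 in this range)
# above the stored 0.3.
gap = rounded_sum - c
print(gap == Fraction(1, 2**54))   # True
```

So the inequality comes from two sources: the inputs are already rounded, and the addition rounds once more.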
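2 IEEE 754 Theory
A floating-point number is stored in w+p bits:
- Sign (S): 1 bit, with value 0 or 1; it contributes the factor \((-1)^{S}\), so 0 means positive and 1 means negative
- Exponent (E): w bits
  - \(e_{max} = bias = 2^{w-1}-1 = \frac{2^w}{2}-1\)
  - \(e_{min} = 1-e_{max} = 2-2^{w-1}\)
- Trailing significand (T): p-1 bits (the significand conceptually has p bits, but for normal numbers the leading bit is always 1 and need not be stored)
For the exponent, take w=8 as an example. The exponent has to cover both negative and positive values, so the stored field is mapped by subtracting the bias: E-bias = E-127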
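| Stored E | Mapped value: E-127 (bias) | Meaning |
|---|---|---|
| 0 | -127 | Reserved: zero and subnormal numbers close to 0 (their effective exponent is \(e_{min}\) = -126) |
| 1 | -126 | |
| 2 | -125 | |
| ... | ... | |
| 253 | 126 | |
| 254 | 127 | |
| 255 | 128 | Reserved: infinity or NaN |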
For a normal number (\(0 < E < 2^{w}-1\)), the value is \(v=(-1)^{S} \times 2^{E-bias} \times (1 + T \times 2^{-(p-1)})\), where \(bias=2^{w-1}-1\)
Floating-point formats at different precisions (bit widths):
| S (1 bit) | E (w bits) | T (p-1 bits) | bias | Type |
|---|---|---|---|---|
| 1 | 8 | 23 | 127 | float32 |
| 1 | 11 | 52 | 1023 | float64 |
| 1 | 15 | 112 | 16383 | float128 |
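The bias column follows directly from \(bias = 2^{w-1}-1\). As a quick check, the sketch below recomputes the bias and the exponent limits for each format; the FORMATS list and the helper name exponent_params are hypothetical names introduced for this illustration.

```python
# (name, w, p): exponent width and precision for each format in the table
FORMATS = [("float32", 8, 24), ("float64", 11, 53), ("float128", 15, 113)]

def exponent_params(w):
    """Return (bias, e_min, e_max) for an exponent field of w bits."""
    bias = 2 ** (w - 1) - 1
    e_max = bias
    e_min = 1 - e_max
    return bias, e_min, e_max

for name, w, p in FORMATS:
    bias, e_min, e_max = exponent_params(w)
    total_bits = 1 + w + (p - 1)  # sign + exponent + trailing significand
    print(f"{name}: {total_bits} bits, bias={bias}, e_min={e_min}, e_max={e_max}")
```

For float64 this gives a bias of 1023 and an exponent range of -1022 to 1023.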
Meaning of each encoding:
| E | T | v |
|---|---|---|
| 0 | 0 | \((-1)^{S} \times (+0)\) |
| 0 | nonzero | \((-1)^{S} \times 2^{e_{min}} \times (0 + T \times 2^{1-p})\) |
| 1 to \(2^{w}-2\) | any | \((-1)^{S} \times 2^{E - bias} \times (1 + T \times 2^{1-p})\) |
| \(2^{w}-1\) | 0 | \((-1)^{S} \times (+\infty)\) |
| \(2^{w}-1\) | nonzero | NaN |
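Each row of this table can be reproduced by assembling a float64 from hand-picked field values. This is a minimal sketch assuming float64 (w=11, p=53); the helper from_fields is a hypothetical name, and only the standard-library struct and math modules are used.

```python
import math
import struct

W, P = 11, 53  # float64: exponent width and precision

def from_fields(s, e, t):
    """Assemble a float64 from sign, exponent, and trailing-significand fields."""
    bits = (s << (W + P - 1)) | (e << (P - 1)) | t
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

E_ALL_ONES = 2 ** W - 1  # exponent field with every bit set

pos_zero = from_fields(0, 0, 0)             # E = 0, T = 0: zero
neg_zero = from_fields(1, 0, 0)             # same fields with S = 1: negative zero
min_subnormal = from_fields(0, 0, 1)        # E = 0, T nonzero: subnormal
min_normal = from_fields(0, 1, 0)           # smallest normal number
pos_inf = from_fields(0, E_ALL_ONES, 0)     # E all ones, T = 0: infinity
nan = from_fields(0, E_ALL_ONES, 1 << (P - 2))  # E all ones, T nonzero: NaN

print(pos_zero, neg_zero)
print(min_subnormal == 2.0 ** -1074)  # 2**e_min * 2**(1-p) = 2**-1022 * 2**-52
print(min_normal == 2.0 ** -1022)     # 2**e_min
print(pos_inf, math.isnan(nan))
```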
2.1 The float64 representation of 0.1
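import struct
w = 11  # exponent width in bits (float64)
p = 53  # precision: 52 stored bits plus the implicit leading 1
# Reinterpret the 8 bytes of the double as an unsigned 64-bit integer, then format it as 64 binary digits
bits = "{:064b}".format(struct.unpack(">Q", struct.pack(">d", 0.1))[0])
print(f"bits = {bits}")
# Split into the three fields: 1 sign bit, w exponent bits, p-1 trailing significand bits
S = int(bits[0], 2)
print(f"S = {bits[0]}")
E = int(bits[1:(w+1)], 2)
print(f"E = {bits[1:(w+1)]}")
T = int(bits[(w+1):], 2)
print(f"T = {bits[(w+1):]}")
bias = 2**(w-1) - 1
# Rebuild the value with the formula for normal numbers
v = ((-1)**S) * (2**(E-bias)) * (1+T*2**(1-p))
print(f"float{w+p} = 1 + {w} + {p-1} bits, bias={bias}, v={v}")
Output:
bits = 0011111110111001100110011001100110011001100110011001100110011010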
S = 0
E = 01111111011
T = 1001100110011001100110011001100110011001100110011010
float64 = 1 + 11 + 52 bits, bias=1023, v=0.1
After changing 0.1 to -0.1, only the most significant bit (the sign bit) changes, from 0 to 1.
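The sign-bit claim can be checked by XOR-ing the two bit patterns. This is a small sketch; the helper float64_bits is a hypothetical name introduced here.

```python
import struct

def float64_bits(x):
    """Return the 64-bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

pos = float64_bits(0.1)
neg = float64_bits(-0.1)

# XOR leaves a 1 only at the positions where the two patterns differ.
diff = pos ^ neg
print(f"{diff:064b}")
print(diff == 1 << 63)  # True: only the most significant bit differs
```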
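3 What If the Precision Is Not Enough?
Floating-point numbers have many pitfalls, though most of the time they need not worry us. But suppose we push the limits: one program has to represent both something as tiny as a quantum-scale quantity and something as vast as the Milky Way, so it needs a large range and high precision at the same time. What then?
At that point the built-in primitive types are no longer sufficient and an arbitrary-precision library is needed, for example the decimal module in Python's standard library or the third-party bigdecimal crate in Rust. Both provide decimal arithmetic with user-selectable precision.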
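from decimal import Decimal, getcontext
import math
import numpy as np
# Decimal: compute sqrt(2) to 1024 significant digits
getcontext().prec = 1024
sqrt_2 = Decimal(2).sqrt()
print(type(sqrt_2))
print(sqrt_2)
print()
# NumPy: the result is a float64, so the digits stop after the stored value is exhausted
sqrt_2_approx = np.sqrt(2)
print(type(sqrt_2_approx))
print("{:.1024f}".format(sqrt_2_approx))
print()
# math.sqrt: a built-in float, also float64
sqrt_2_approx = math.sqrt(2)
print("{}".format(type(sqrt_2_approx)))
print("{:.1024f}".format(sqrt_2_approx))
print()
Output (numeric lines only):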
1.414213562373095048801688724209698078569671875376948073176679737990732478462107038850387534327641572735013846230912297024924836055850737212644121497099935831413222665927505592755799950501152782060571470109559971605970274534596862014728517418640889198609552329230484308714321450839762603627995251407989687253396546331808829640620615258352395054745750287759961729835575220337531857011354374603408498847160386899970699004815030544027790316454247823068492936918621580578463111596668713013015618568987237235288509264861249497715421833420428568606014682472077143585487415565706967765372022648544701585880162075847492265722600208558446652145839889394437092659180031138824646815708263010059485870400318648034219489727829064104507263688131373985525611732204024509122770022694112757362728049573810896750401836986836845072579936472906076299694138047565482372899718032680247442062926912485905218100445984215059112024944134172853147810580360337107730918286931471017111168391658172688941975871658215212822951848847208969463386289156288277
1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
1.4142135623730951454746218587388284504413604736328125000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
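Back to the opening question, decimal also makes 0.1 + 0.2 equal to 0.3, provided the operands are built from strings rather than from floats. The sketch below is an added illustration that uses only the standard-library decimal module with its default context.

```python
from decimal import Decimal

# Built from strings, the operands are exactly 0.1, 0.2, and 0.3 in decimal.
from_strings = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
print(from_strings)  # True

# Built from floats, each operand carries the binary rounding error of the float.
from_floats = Decimal(0.1) + Decimal(0.2) == Decimal(0.3)
print(from_floats)   # False

# Decimal(float) shows the exact value the float stores.
print(Decimal(0.1))
```

The design point is that Decimal does not repair a float after the fact; the exact decimal value has to enter the computation as a string or an integer.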
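4 Pitfalls in Practice
Because of their underlying format, floating-point numbers can cause real bugs and counterintuitive behavior when used carelessly.
In Python, a float can be used as a dict key, but this is not recommended. See the counterintuitive code below: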
surprise_map = dict()
surprise_map[0.3] = "GPT3"
surprise_map[0.1+0.2] = "SKY"
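print(surprise_map)
Output:
{0.3: 'GPT3', 0.30000000000000004: 'SKY'}
Rust goes one step further and rejects floating-point keys at compile time. The mechanism: the standard library's HashMap requires the key type K to implement the std::cmp::Eq and std::hash::Hash traits, and f32 and f64 implement neither (they implement only PartialEq, because NaN is not equal to itself). As a result, a floating-point number cannot be used directly as a HashMap key to store key-value pairs.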
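The NaN behavior behind Rust's restriction can be observed from Python as well. The sketch below is an added illustration using only built-ins and the standard-library math module; the dictionary name lookup is hypothetical. It also shows math.isclose as one common way to compare floats with a tolerance, which is a suggestion here rather than something the original text prescribes.

```python
import math

nan = float("nan")

# NaN is not equal to anything, including itself, so float equality is not reflexive.
print(nan == nan)  # False

# A key that is not equal to itself cannot be found again through a fresh NaN value.
lookup = {float("nan"): "first"}
found = float("nan") in lookup
print(found)                       # False
lookup[float("nan")] = "second"    # adds another entry instead of overwriting
print(len(lookup))                 # 2

# For ordinary values, compare with a tolerance instead of relying on exact equality.
close = math.isclose(0.1 + 0.2, 0.3)
print(close)                       # True
```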