Number of bits required for floating point representation

  • #1
CGandC
Relevant equations (a quote from the book A Concise Introduction to Numerical Analysis by A. C. Faul explaining what floating point representation is):
We live in a continuous world with infinitely many real numbers. However, a computer has only a finite number of bits. This requires an approximate representation. In the past, several different representations of real numbers have been suggested, but now the most widely used by far is the floating point representation. Each floating point representation has a base ##\beta## (which is always assumed to be even) which is typically 2 (binary), 8 (octal), 10 (decimal), or 16 (hexadecimal), and a precision ##p## which is the number of digits (of base ##\beta##) held in a floating point number. For example, if ##\beta=10## and ##p=5##, the number 0.1 is represented as ##1.0000 \times 10^{-1}##. On the other hand, if ##\beta=2## and ##p=20##, the decimal number 0.1 cannot be represented exactly but is approximately ##1.1001100110011001100 \times 2^{-4}##. We can write the representation as ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, where ##d_0 . d_1 \cdots d_{p-1}## is called the significand (or mantissa) and has ##p## digits and ##e## is the exponent. If the leading digit ##d_0## is non-zero, the number is said to be normalized. More precisely ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e## is the number
##
\pm\left(d_0+d_1 \beta^{-1}+d_2 \beta^{-2}+\cdots+d_{p-1} \beta^{-(p-1)}\right) \beta^e, 0 \leq d_i<\beta
##
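The two examples in the quote can be checked with a short sketch. It is only an illustration: the function names `to_digits` and `format_digits` are made up here, and it assumes a positive input and truncation (not rounding) to ##p## digits, which is what reproduces the binary expansion quoted above.

```python
from fractions import Fraction

def to_digits(x, beta, p):
    """Return (digits, e) with x ~ d_0.d_1...d_{p-1} * beta**e, truncated to p digits.

    Hypothetical helper: assumes x > 0 and uses exact rational arithmetic.
    """
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    e = 0
    # Normalize so that 1 <= x < beta, i.e. the leading digit d_0 is non-zero.
    while x >= beta:
        x /= beta
        e += 1
    while x < 1:
        x *= beta
        e -= 1
    digits = []
    for _ in range(p):
        d = int(x)            # next digit, always 0 <= d < beta
        digits.append(d)
        x = (x - d) * beta    # shift the remainder one place to the left
    return digits, e

def format_digits(digits):
    """Render the digit list as d_0.d_1...d_{p-1} (bases up to 16)."""
    symbols = "0123456789abcdef"
    return symbols[digits[0]] + "." + "".join(symbols[d] for d in digits[1:])

digits, e = to_digits(Fraction(1, 10), 2, 20)
print(format_digits(digits), e)   # 1.1001100110011001100 -4
digits, e = to_digits(Fraction(1, 10), 10, 5)
print(format_digits(digits), e)   # 1.0000 -1
```

Using `Fraction` keeps the digit extraction exact, so the only approximation is the cut after ##p## digits.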
__________________________________________________________________________________________________________

I've been reading A Concise Introduction to Numerical Analysis by A. C. Faul and I'm trying to work out the number of bits required to represent a number in floating point representation with base ##\beta##, precision ##p## and maximum and minimum exponents ##e_{\max}, e_{\min}##.

Here's the author's calculation:

The largest and smallest allowable exponents are denoted ##e_{\max }## and ##e_{\min }##, respectively. Note that ##e_{\max }## is positive, while ##e_{\min }## is negative. Thus there are ##e_{\max }-e_{\min }+1## possible exponents, the +1 standing for the zero exponent. Since there are ##\beta^p## possible significands, a floating-point number can be encoded in ##\left\lceil\log _2\left(e_{\max }-e_{\min }+1\right)\right\rceil+\left\lceil\log _2\left(\beta^p\right)\right\rceil+1## bits where the final +1 is for the sign bit.

My question: how did the author arrive at ##\left\lceil\log _2\left(e_{\max }-e_{\min }+1\right)\right\rceil+\left\lceil\log _2\left(\beta^p\right)\right\rceil+1## ?

I tried as follows but didn't succeed: the number is ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, each ##d_i## is at most ##\beta##, and since the largest exponent is ##e_{\max}##, the largest possible number is ##\beta . \beta \cdots \beta \times \beta^{e_{\max}}##. Hence the number of bits (adding one for the plus/minus sign) is ##\lfloor \log_2(\beta^p \cdot \beta^{e_{\max}}) \rfloor + 1 = \lfloor \log_2(\beta^p) + \log_2(\beta^{e_{\max}}) \rfloor + 1##.
 
  • #2
You have neglected to take account of how the number is actually encoded.

The number [tex](-1)^S \times d_0.d_1d_2\dots d_{p-1} \times \beta^{e_{\mathrm{min}} + (e - e_{\mathrm{min}})}[/tex] is encoded as the [itex]N_1 + N_2 + 1[/itex] digit binary integer [tex]S\;D_1D_2\dots D_{N_2}\;E_1E_2\dots E_{N_1}[/tex] where [itex]S[/itex] is the sign bit, [itex]D_1D_2 \dots D_{N_2}[/itex] is the binary representation of [itex]0 \leq d_0d_1\dots d_{p-1} \leq \beta^p - 1[/itex] and [itex]E_1 \dots E_{N_1}[/itex] is the binary representation of [itex]0 \leq e - e_{\mathrm{min}} \leq e_{\mathrm{max}} - e_{\mathrm{min}}[/itex]. The maximum integer which can be represented with [itex]N[/itex] binary digits is [tex]1 + 2 + \dots + 2^{N-1} = 2^N - 1[/tex] so we need [itex]2^{N_1} - 1 \geq e_{\mathrm{max}} - e_{\mathrm{min}}[/itex] and [itex]2^{N_2} - 1 \geq \beta^p - 1[/itex], that is, [tex]\begin{split}
N_1 &\geq \log_2(e_{\mathrm{max}} - e_{\mathrm{min}} + 1), \\
N_2 &\geq \log_2(\beta^p). \end{split}[/tex] The minimum necessary numbers of binary digits are therefore [tex]\begin{split}
N_1 &= \lceil \log_2(e_{\mathrm{max}} - e_{\mathrm{min}} + 1) \rceil, \\
N_2 &= \lceil \log_2(\beta^p) \rceil, \end{split}[/tex] where [itex]\lceil x \rceil[/itex] is the smallest integer greater than or equal to [itex]x[/itex].
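The encoding described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's or any standard's layout: the names `ceil_log2`, `field_widths` and `encode` are invented here, and the decimal format with ##e_{\min} = -9##, ##e_{\max} = 9## in the usage lines is a made-up example.

```python
def ceil_log2(n: int) -> int:
    """Smallest N with 2**N >= n (for n >= 1), computed with exact integers."""
    return (n - 1).bit_length()

def field_widths(beta: int, p: int, e_min: int, e_max: int):
    """Return (N_1, N_2): bits for the exponent field and the significand field."""
    n1 = ceil_log2(e_max - e_min + 1)   # e - e_min ranges over 0 .. e_max - e_min
    n2 = ceil_log2(beta ** p)           # significand integer ranges over 0 .. beta**p - 1
    return n1, n2

def encode(sign: int, digits, e: int, beta: int, e_min: int, e_max: int):
    """Pack S | D_1..D_{N_2} | E_1..E_{N_1} into one integer; return (word, total bits)."""
    if sign not in (0, 1) or not e_min <= e <= e_max:
        raise ValueError("sign or exponent out of range")
    if any(not 0 <= d < beta for d in digits):
        raise ValueError("every digit must satisfy 0 <= d < beta")
    n1, n2 = field_widths(beta, len(digits), e_min, e_max)
    significand = 0
    for d in digits:                    # read d_0 d_1 ... d_{p-1} as an integer in base beta
        significand = significand * beta + d
    word = (sign << (n1 + n2)) | (significand << n1) | (e - e_min)
    return word, n1 + n2 + 1

# Hypothetical decimal format: beta = 10, p = 5, exponents from -9 to 9.
print(field_widths(10, 5, -9, 9))                    # (5, 17)
print(encode(0, [1, 0, 0, 0, 0], -1, 10, -9, 9))     # (320008, 23)

# Parameters of IEEE 754 double precision: beta = 2, p = 53, e_min = -1022, e_max = 1023.
n1, n2 = field_widths(2, 53, -1022, 1023)
print(n1 + n2 + 1)                                   # 65
```

The count of 65 for the double precision parameters is one more than the 64 bits actually stored, because IEEE 754 does not store the leading bit of a normalized binary significand. Using `bit_length` instead of `math.log2` avoids floating point rounding when ##\beta^p## is large.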
 
  • #3
If a specific example would help, you might look up how IEEE 754 does things.
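For a concrete look at IEEE 754, here is a small sketch that splits a Python float into its stored fields. It assumes Python's `float` is an IEEE 754 binary64 value (true on common platforms), and the function name `decompose_binary64` is just for illustration.

```python
import struct

def decompose_binary64(x: float):
    """Split a binary64 value into (sign, biased exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = word >> 63                        # 1 bit
    biased_exponent = (word >> 52) & 0x7FF   # 11 bits, stored with a bias of 1023
    fraction = word & ((1 << 52) - 1)        # 52 bits, leading 1 is not stored
    return sign, biased_exponent, fraction

sign, biased, fraction = decompose_binary64(0.1)
print(sign, biased - 1023, hex(fraction))    # 0 -4 0x999999999999a
```

The exponent -4 matches the ##2^{-4}## in the book's example for 0.1, and the field widths 1 + 11 + 52 add up to 64 bits.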
 
  • #4
CGandC said:
I tried as follows but didn't succeed: the number is ##\pm d_0 . d_1 \cdots d_{p-1} \times \beta^e##, each of ## d_i ## is at most ## \beta ##
The "is at most ##\beta##" part is incorrect. Whatever the base ##\beta## is, each digit ##d_i## must be strictly less than ##\beta##, so the largest digit is ##\beta - 1##.
In base-2, the digits are 0 and 1.
In base-8, the digits are 0, 1, 2, 3, 4, 5, 6, and 7.
In decimal (base-10), the largest digit is 9.
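As a quick check of what this bound implies, assume all ##p## digits take the largest value ##\beta - 1##. Read as an integer in base ##\beta##, the largest significand is then
##
(\beta-1)\left(1+\beta+\beta^2+\cdots+\beta^{p-1}\right)=\beta^p-1
##
so the significands run from 0 to ##\beta^p - 1##, which gives the ##\beta^p## possible significands counted in the book.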
 
  • #5
Thanks for the help! I understand now.
 

What is a floating point representation?

A floating point representation is a way of representing real numbers in a computer using a fixed number of bits. Because only finitely many values can be encoded, most real numbers are stored approximately, with limited precision.

Why is it important to know the number of bits required for floating point representation?

Knowing the number of bits required for floating point representation is important because it determines the range and precision of numbers that can be represented. It also affects the performance and accuracy of calculations involving these numbers.

How is the number of bits for floating point representation determined?

The number of bits required for floating point representation is determined by the format or standard used for representing floating point numbers. Different formats allocate different numbers of bits to the significand (mantissa), which holds the digits of the number, and to the exponent, which sets its magnitude, plus one bit for the sign.

What is the most commonly used format for floating point representation?

The most commonly used format for floating point representation is the IEEE 754 standard, which specifies the number of bits allocated for the mantissa and exponent, as well as special values for representing infinity and NaN (not a number).

What is the relationship between the number of bits and the precision of a floating point number?

The precision of a floating point number is directly related to the number of bits allocated for the mantissa. The more bits allocated, the higher the precision and the smaller the rounding error when representing a real number. However, this also means that more bits are required, leading to larger memory usage and potentially slower performance.
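The effect of the mantissa width can be seen with a small sketch that rounds a value to IEEE 754 single precision (24 significand bits) and compares it with Python's own float. It assumes Python's `float` is binary64, and `round_to_binary32` is a name made up for this example.

```python
import struct
import sys

def round_to_binary32(x: float) -> float:
    """Round x to the nearest single precision value and return it as a float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

x = 0.1                                   # already rounded to 53 significand bits
err32 = abs(round_to_binary32(x) - x)     # extra error from keeping only 24 bits

print(sys.float_info.mant_dig)            # 53
print(err32 > 0, err32 <= x * 2.0 ** -24) # True True
print(round_to_binary32(0.5) == 0.5)      # True
```

The value 0.5 survives unchanged because it is a power of two, while 0.1 picks up an additional error bounded by a relative ##2^{-24}##.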
