<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.bsaiki.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.bsaiki.com/" rel="alternate" type="text/html" /><updated>2026-04-10T22:23:54+00:00</updated><id>https://www.bsaiki.com/feed.xml</id><title type="html">Brett Saiki</title><subtitle>Personal Website</subtitle><entry><title type="html">Round to Odd</title><link href="https://www.bsaiki.com/blog/2025/11/19/round-to-odd.html" rel="alternate" type="text/html" title="Round to Odd" /><published>2025-11-19T00:00:00+00:00</published><updated>2025-11-26T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2025/11/19/round-to-odd</id><content type="html" xml:base="https://www.bsaiki.com/blog/2025/11/19/round-to-odd.html"><![CDATA[<p>When writing numerical programs,
  rounding is a subtle but important aspect
  that can significantly affect the accuracy
  and stability of computations.
In particular,
  rounding twice at different precisions
  might introduce unexpected errors.
For example,
  under the nearly universal
  round to nearest, ties to even (RNE) rounding mode,
  adding <code class="language-plaintext highlighter-rouge">1.00000011</code> and <code class="language-plaintext highlighter-rouge">5.96046447e-8</code>,
  and rounding directly to a single-precision floating-point
  value yields <code class="language-plaintext highlighter-rouge">1.00000012</code>; but first rounding to a
  double-precision floating-point value then to
  a single-precision floating-point value yields <code class="language-plaintext highlighter-rouge">1.00000024</code>.
This phenomenon is known as <em>double rounding</em>.</p>
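
<p>To make the effect concrete,
  here is a minimal Python sketch using exact rational arithmetic.
It assumes positive inputs and an unbounded exponent range,
  the helper name <code class="language-plaintext highlighter-rouge">rne</code> is hypothetical,
  and the input is not the pair of decimal values quoted above:
  it is chosen so that the tie is easy to see.</p>

```python
from fractions import Fraction

def rne(x: Fraction, p: int) -> Fraction:
    """Round a positive rational to p significant bits, ties to even."""
    # e = floor(log2(x)), computed exactly from bit lengths.
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1
    ulp = Fraction(2) ** (e - (p - 1))          # spacing at precision p
    scaled = x / ulp                            # 2**(p-1) <= scaled < 2**p
    c = scaled.numerator // scaled.denominator  # truncated significand
    rem = scaled - c
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and c % 2 == 1):
        c += 1
    return c * ulp

# Slightly above the midpoint of two adjacent 24-bit values.
x = 1 + Fraction(1, 2**24) + Fraction(1, 2**60)

direct = rne(x, 24)           # round once, to 24 bits
twice = rne(rne(x, 53), 24)   # round to 53 bits, then to 24 bits

print(direct == 1 + Fraction(1, 2**23))
print(twice == 1)
```

<p>Rounding directly sees that the input lies just above the midpoint and rounds up.
Rounding to 53 bits first discards the \(2^{-60}\) term,
  landing exactly on the midpoint,
  and the second rounding then breaks the tie toward the even significand.</p>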

<p>The double rounding problem means that
  a high-precision reference cannot be used to verify
  the correctness of lower-precision implementations,
  even for a single operation.
For example,
  naively using double-precision arithmetic
  to verify single-precision results
  is not enough to ensure correctness:
  double rounding may cause the reference result,
  rounded to single-precision,
  to disagree with the correctly-rounded
  single-precision result.
When developing correctly-rounded
  implementations of floating-point functions,
  double rounding implies that
  we cannot simply compute the result
  at higher precision and then round
  it to the target precision.
In the worst case,
  a correctly-rounded implementation
  must be specifically designed for each
  target precision, adding significant time and complexity
  to the implementation and verification effort.</p>

<p>Across the literature on floating-point arithmetic,
  one rounding mode offers a solution to double rounding:
  <em>round to odd</em> (RTO).
This blog post covers the common rounding modes,
  and the definition and properties of round to odd.
Finally,
  it concludes with the essential property
  of round to odd: safe re-rounding.</p>

<!-- When rounding real numbers to floating-point,
  fixed-point, or integer numbers,
  _rounding modes_ determine how to handle cases
  where the real number cannot be represented exactly.
Changing the rounding mode can
  significantly change the accuracy and
  numerical stability of a numerical computation.
Due to the nuanced effects of rounding modes,
  they are almost never exposed to programmers
  in general purpose programming languages,
  with rare exceptions:
  C (and C++) provide some support via
  the `<fenv.h>` (`<cfenv>` in C++) header.

There are several rounding modes in use today.
For example,
  the IEEE 754 standard [1] defines five rounding modes:
  round to nearest, ties to even (RNE),
  round to nearest, ties away from zero (RNA),
  round to positive infinity (RTP),
  round to negative infinity (RTN), and
  round toward zero (RTZ).
In most programming environments,
  the default rounding mode is RNE.
However,
  across the literature on floating-point arithmetic,
  one rounding mode stands out:
  round to odd (RTO).

This blog post intends to explain
  common rounding modes,
  round to odd and its properties,
  and the essential application of
  round to odd: safe re-rounding. -->

<h2 id="floating-point-numbers">Floating-Point Numbers</h2>

<p>To begin,
  I’ll briefly review floating-point numbers.
Floating-point numbers
  are numbers represented in the form:</p>

\[(-1)^s \cdot c \cdot 2^{exp}\]

<p>where \(s \in \{0, 1\}\) is the sign,
  \(c \in \mathbb{Z}_{\geq 0}\) is the significand,
  and \(exp \in \mathbb{Z}\) is the (unnormalized) exponent.
The smallest number of binary digits
  that can represent \(c\) is called
  the <em>precision</em>, \(p\), of the number:
  if \(c &gt; 0\), then \(2^{p - 1} \leq c &lt; 2^p\).
Alternatively,
  we can represent floating-point numbers
  in normalized form:</p>

\[(-1)^s \cdot m \cdot 2^e\]

<p>where \(1 \leq m &lt; 2\) is the mantissa
  and \(e \in \mathbb{Z}\) is the (normalized) exponent.
The relationship between the two forms
  is given by the equations:</p>

\[c = m \cdot 2^{p - 1}\]

<p>and</p>

\[exp = e - (p - 1).\]

<p>For example,
  we can represent \(1.25\) using 3 bits of precision by
  \(101 \cdot 2^{-2}\) and \(1.01 \cdot 2^{0}\)
  in unnormalized and normalized forms, respectively.</p>
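
<p>The relationship between the two forms can be checked mechanically.
The following is a small sketch, not part of any library:
  the function name <code class="language-plaintext highlighter-rouge">decompose</code> is hypothetical,
  and it assumes a positive input that fits exactly in \(p\) bits.</p>

```python
from fractions import Fraction

def decompose(x: Fraction, p: int):
    """Return (c, exp, m, e) for a positive x that fits in p bits.

    Unnormalized form: x = c * 2**exp with 2**(p-1) <= c < 2**p.
    Normalized form:   x = m * 2**e   with 1 <= m < 2.
    """
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1                       # e = floor(log2(x))
    exp = e - (p - 1)                # exp = e - (p - 1)
    c = x / Fraction(2) ** exp
    if c.denominator != 1:
        raise ValueError("x is not representable with p bits")
    m = Fraction(c.numerator, 2 ** (p - 1))   # c = m * 2**(p - 1)
    return c.numerator, exp, m, e

c, exp, m, e = decompose(Fraction(5, 4), 3)
print(bin(c), exp, m, e)
```

<p>For \(1.25\) with \(p = 3\),
  this recovers the significand \(101\) with exponent \(-2\),
  and the mantissa \(5/4\) (that is, \(1.01\) in binary) with exponent \(0\).</p>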

<p>The IEEE 754 standard
  extends floating-point numbers
  to include special values:
  negative zero, \(-0\);
  positive infinity, \(+ \infty\);
  negative infinity, \(- \infty\);
  and Not a Number, \(\mathrm{NaN}\).
Number formats define
  discrete sets of floating-point numbers
  that approximate real numbers.</p>

<p>A <em>rounding</em> operation
  maps real numbers to representable values
  of a number format according to rounding modes.
Rounding modes determine which
  floating-point value to choose in cases
  where the real number cannot be represented exactly.
There are several rounding modes in use today.
For example,
  the IEEE 754 standard [1] defines five rounding modes:
  round to nearest, ties to even (RNE);
  round to nearest, ties away from zero (RNA);
  round to positive infinity (RTP);
  round to negative infinity (RTN); and
  round toward zero (RTZ).</p>

<!-- When every value in the number format
  is represented by a fixed exponent,
  we say that the values are _fixed-point_ numbers. -->

<h2 id="correctly-rounded-functions">Correctly-Rounded Functions</h2>

<p>Rounding extends naturally to real-valued functions
  through <em>correctly-rounded</em> functions.
We say that an implementation \(f^{*}\) of
  a real-valued function \(f\)
  is correctly-rounded if its result is the infinitely precise
  result of \(f\), rounded to the target number format
  according to a specified rounding mode.
We’ll only consider floating-point numbers
  with a fixed precision of \(p\) digits,
  so I’ll denote the rounding operation in the style
  of Boldo and Melquiond [2] as \(\mathrm{rnd}^{p}_{rm}\).
Using this notation,
  the correctly-rounded implementation of \(f\)
  in precision \(p\) under rounding mode \(rm\)
  can be expressed as:</p>

\[f^{*} = \mathrm{rnd}^{p}_{rm} \circ f.\]

<p>Requiring correct rounding has strong arguments [3]:
  it provides a clear specification for
  the behavior of \(f^{*}\);
  it ensures reproducibility of results
  across different implementations of \(f^{*}\);
  and it bounds the numerical error
  introduced by rounding.</p>

<p>We can now formalize the double rounding problem.
In general,
  for precisions \(p_2 &lt; p_1\) and
  rounding modes \(rm_1, rm_2\):</p>

\[\mathrm{rnd}^{p_2}_{rm_2} \circ f \neq
  \mathrm{rnd}^{p_2}_{rm_2} \circ
  (\mathrm{rnd}^{p_1}_{rm_1} \circ f).\]

<p>More succinctly,
  rounding operations do not compose:</p>

\[\mathrm{rnd}^{p_2}_{rm_2} \neq
  \mathrm{rnd}^{p_2}_{rm_2} \circ
  \mathrm{rnd}^{p_1}_{rm_1}.\]

<p>We will see that round to odd
  provides a solution to this problem.</p>

<h2 id="rounding-modes">Rounding Modes</h2>

<p>Once we compute
  the infinitely precise result of \(f(x)\)
  for a real number \(x\),
  we need to round the result to get
  \(f^{*}(x) = \mathrm{rnd}^{p}_{rm}(f(x))\).
To illustrate the different rounding modes,
  we’ll consider three rounding modes:
  round to nearest, ties to even (RNE);
  round toward zero (RTZ); and
  round away from zero (RAZ).</p>

<p>If \(f(x)\) is representable in the target number format,
  then no more work is needed: \(f^{*}(x) = f(x)\).
Otherwise,
  \(f(x)\) lies between two representable floating-point numbers,
  which we’ll denote as \(y_1 &lt; f(x) &lt; y_2\).
To simplify the discussion,
  let’s assume that \(y_1\) and \(y_2\) are both positive,
  and that \(y_1\) has an even significand \(1XXX\ldots0\)
  and \(y_2\) has an odd significand \(1XXX\ldots1\).</p>

<p>The rules for each rounding mode are as follows:</p>

<ul>
  <li>RNE: round to the <em>nearest</em> representable number;
if \(f(x)\) is exactly halfway between \(y_1\) and \(y_2\),
round to the one with an <em>even</em> significand \(c\);</li>
  <li>RTZ: round to the number in the direction of zero;</li>
  <li>RAZ: round to the number in the opposite direction of zero.</li>
</ul>

<p>Let’s first assume that \(f(x)\) is closer to \(y_1\)
  than to \(y_2\).</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-1.png" alt="rounding when $$f(x)$$ is closer to $$y_1$$" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>In this case,
  \(f(x)\) rounds to</p>
<ul>
  <li>RNE: \(y_1\) since it is the nearest representable number,</li>
  <li>RTZ: \(y_1\) since it is in the direction of zero,</li>
  <li>RAZ: \(y_2\) since it is in the opposite direction.</li>
</ul>

<p>Now, let’s assume that \(f(x)\) is closer to \(y_2\)
  than to \(y_1\).</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-2.png" alt="rounding when $$f(x)$$ is closer to $$y_2$$" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>The only difference is that \(f(x)\) rounds to \(y_2\)
  under RNE, since \(y_2\) is now the nearest representable number.</p>

<p>Finally,
  let’s consider the case where \(f(x)\)
  is exactly halfway between \(y_1\) and \(y_2\).</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-3.png" alt="rounding when $$f(x)$$ is equidistant to $$y_1$$ and $$y_2$$" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Rounding under RNE
  will tie-break to the even significand,
  in this case, \(y_1\).
Under RTZ and RAZ,
  the result is the same as before:
  RTZ rounds to \(y_1\) and RAZ rounds to \(y_2\).</p>

<p>Other rounding modes like
  round to nearest, ties away from zero (RNA);
  round to positive infinity (RTP);
  and round to negative infinity (RTN)
  can be analyzed similarly.
RNA is similar to RNE,
  except that it tie-breaks to the representable
  value away from zero.
RTP rounds toward positive infinity
  and RTN rounds toward negative infinity:
  for positive numbers, RTP is the same as RAZ
  and RTN is the same as RTZ;
  for negative numbers, the roles are swapped.</p>
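
<p>The three rules walked through above fit in a few lines of Python.
This is only a sketch under simplifying assumptions:
  positive rational inputs, an unbounded exponent range,
  and a hypothetical helper named <code class="language-plaintext highlighter-rouge">round_positive</code>.
With \(p = 3\), it uses \(y_1 = 1\) (significand \(100\))
  and \(y_2 = 1.25\) (significand \(101\)).</p>

```python
from fractions import Fraction

def round_positive(x: Fraction, p: int, mode: str) -> Fraction:
    """Round a positive rational to p significant bits under RNE, RTZ, or RAZ."""
    if mode not in ("RNE", "RTZ", "RAZ"):
        raise ValueError(f"unknown rounding mode: {mode}")
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1                                  # e = floor(log2(x))
    ulp = Fraction(2) ** (e - (p - 1))          # distance between y1 and y2
    scaled = x / ulp
    c = scaled.numerator // scaled.denominator  # significand of y1
    rem = scaled - c                            # position between y1 and y2
    if rem == 0:
        return x                                # already representable
    half = Fraction(1, 2)
    if mode == "RAZ":
        c += 1                                  # always pick y2
    elif mode == "RNE" and (rem > half or (rem == half and c % 2 == 1)):
        c += 1                                  # nearer to y2, or tie with odd y1
    return c * ulp                              # RTZ falls through and keeps y1

# Closer to y1, closer to y2, and exactly halfway.
for fx in (Fraction(11, 10), Fraction(6, 5), Fraction(9, 8)):
    print(fx, [str(round_positive(fx, 3, m)) for m in ("RNE", "RTZ", "RAZ")])
```

<p>Only the RNE column changes across the three inputs:
  it yields \(y_1\), then \(y_2\), then \(y_1\) again for the tie,
  while RTZ always yields \(y_1\) and RAZ always yields \(y_2\).</p>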

<h2 id="round-to-odd">Round-to-Odd</h2>

<p>Round to odd is only a slight adaptation
  of the rounding modes we’ve seen so far.
Like before,
  if \(f(x)\) is representable in the target number format,
  then no rounding is required: \(f^{*}(x) = f(x)\).
Otherwise,
  \(f(x)\) lies between two representable floating-point numbers,
  which we’ll again denote as \(y_1 &lt; f(x) &lt; y_2\).
Under RTO,
  we round to the representable number
  with an <em>odd</em> significand \(c\).
In our example,
  this means that we always round to \(y_2\).</p>

<p>Round to odd should not be confused with
  round to nearest, ties to odd (RNO),
  the rounding mode that uses the opposite tie-breaking rule
  of round to nearest, ties to even (RNE).
RTO is <em>not</em> a nearest rounding mode.
For RNE and RNO, the “even” (and “odd”)
  refers to the tie-breaking rule when \(f(x)\) is exactly
  halfway between \(y_1\) and \(y_2\).
For RTO,
  the “odd” refers to the rule <em>whenever</em> \(f(x)\)
  is not representable.</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-4.png" alt="rounding $$f(x)$$ under RTO" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Like RTZ and RAZ,
  any value between \(y_1\) and \(y_2\)
  rounds to the same result,
  in this case, \(y_2\).
In this example,
  round to odd doesn’t seem too interesting.</p>

<p>However,
  examining its behavior on a different
  example reveals a key property of RTO.
Let’s zoom out
  and consider the next representable floating-point number
  after \(y_2\), which we’ll denote as \(y_3\).
In a floating-point number format
  with one fewer bit of precision \(p - 1\),
  \(y_1\) and \(y_3\) would be adjacent representable numbers,
  and \(y_2\) would be the midpoint between them.
When rounding at the original precision \(p\),
  <em>any</em> \(f(x)\) strictly between \(y_1\) and \(y_3\)
  rounds to \(y_2\) under RTO.</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-5.png" alt="rounding $$f(x)$$ between $$y_1$$ and $$y_3$$ under RTO" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Notice that under precision \(p\),
  \(y_1\) and \(y_3\) have even significands,
  while \(y_2\), not representable at precision \(p - 1\),
  has an odd significand.
Thus,
  the parity of the significand encodes
  whether the infinitely precise result \(f(x)\)
  is representable in the lower precision \(p - 1\):
  the significand of \(f^{*}(x)\) is odd if and only if
  \(f(x)\) is not representable in precision \(p - 1\).
This is a key feature of round-to-odd:
  parity encodes representability,
  also called <em>exactness</em>.
In floating-point literature,
  the lowest digit of the significand is often called
  the <em>sticky bit</em>.</p>
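
<p>The parity property is easy to observe experimentally.
The sketch below assumes positive rational inputs and an unbounded exponent range;
  the helper names <code class="language-plaintext highlighter-rouge">rto</code> and
  <code class="language-plaintext highlighter-rouge">representable</code> are hypothetical.
It rounds several values to odd at \(p = 3\)
  and compares the parity of the result with representability at \(p - 1 = 2\).</p>

```python
from fractions import Fraction

def rto(x: Fraction, p: int):
    """Round a positive rational to odd at precision p.

    Returns (c, exp) such that the rounded value is c * 2**exp.
    """
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1                                  # e = floor(log2(x))
    exp = e - (p - 1)
    scaled = x / Fraction(2) ** exp
    c = scaled.numerator // scaled.denominator  # truncate toward zero
    if scaled != c:
        c |= 1                                  # inexact: pick the odd neighbor
    return c, exp

def representable(x: Fraction, p: int) -> bool:
    """True if x is exactly representable with p significant bits."""
    c, exp = rto(x, p)
    return c * Fraction(2) ** exp == x

p = 3
for x in (Fraction(1), Fraction(11, 10), Fraction(5, 4), Fraction(7, 5), Fraction(3, 2)):
    c, exp = rto(x, p)
    print(x, bin(c), exp, c % 2 == 1, not representable(x, p - 1))
```

<p>Every input strictly between \(1\) and \(1.5\) lands on the significand \(101\),
  and the last two columns agree on every row:
  the result is odd exactly when the input is not representable with two bits.</p>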

<p>There are a few interpretations of the sticky bit.
If we expand the (possibly infinite) significand
  of \(f(x)\) as \(1X\ldots XYYY\ldots\)
  where \(1X\ldots X\) are the first \(p - 1\) bits
  and \(YYY\ldots\) are the trailing digits,
  then the sticky bit \(S\) summarizes the trailing digits:</p>

\[S = \begin{cases}
0 &amp; \text{if } YYY\ldots = 0, \\
1 &amp; \text{if } YYY\ldots \neq 0;
\end{cases}\]

<p>and the significand
  may be written as \(1X\ldots XS\),
  which is the significand of either
  \(y_1\) (if \(S = 0\)) or \(y_2\) (if \(S = 1\)).
Alternatively,
  we may view the simplified significand as an interval \(I\):
  if \(S = 0\), then \(c = 1X\ldots X0\)
  and \(I = [c, c]\);
  if \(S = 1\), then \(c = 1X\ldots X1\)
  and \(I = (c - 1, c + 1)\),
  the open interval between the two neighboring even significands,
  which are adjacent representable values
  at precision \(p - 1\).
Or,
  rather than an interval,
  we can choose an unknown real value \(c \in I\);
  we cannot capture the exact value of \(f(x)\)
  since the sticky bit only indicates
  whether there are trailing digits.</p>

<p>Sticky bits are widely used
  when implementing floating-point arithmetic
  in both hardware and software due to their
  ability to summarize discarded trailing digits.
Encoding whether a result has
  non-zero trailing digits
  at some precision is essential for correct rounding.
In the next section,
  we’ll see how round to odd preserves enough information
  through the sticky bit to safely re-round under any
  standard rounding mode at lower precision,
  avoiding the double rounding problem.</p>
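
<p>On integer significands,
  truncating with a sticky bit is a shift followed by a bitwise OR,
  and the result is precisely a round to odd.
Here is a minimal sketch;
  the function name <code class="language-plaintext highlighter-rouge">drop_bits_sticky</code> is hypothetical
  and the significand is assumed to be a non-negative integer.</p>

```python
def drop_bits_sticky(c: int, shift: int) -> int:
    """Discard the low `shift` bits of c, folding them into a sticky bit.

    The discarded bits are OR-reduced into the lowest kept bit, so the
    result is odd whenever any discarded bit was non-zero.
    """
    if shift <= 0:
        return c << -shift            # nothing is discarded
    kept = c >> shift                 # truncate toward zero
    discarded = c & ((1 << shift) - 1)
    sticky = 1 if discarded != 0 else 0
    return kept | sticky

print(bin(drop_bits_sticky(0b10100000, 4)))  # exact: low bits are all zero
print(bin(drop_bits_sticky(0b10100001, 4)))  # inexact, kept part is even
print(bin(drop_bits_sticky(0b10110010, 4)))  # inexact, kept part is odd
```

<p>The first call returns the even significand \(1010\) unchanged,
  while the other two both return the odd significand \(1011\):
  a single set bit anywhere in the discarded tail is enough to make the result odd.</p>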

<!-- This sticky bit is essential
  when implementing correct rounding
  in both hardware and software.
Depending on the rounding mode,
  we must _approximate_ the infinitely precise result $$f(x)$$
  with $$p$$ significant digits;
  sufficiently many additional digits,
  usually one or two extra digits;
  and a sticky bit to summarize
  the remaining trailing digits.
Notice that the initial truncation
  with a sticky bit may be viewed
  as a round to odd operation with higher precision;
  this observation hints at a key property
  of round to odd, highlighted in the next section. -->

<h2 id="properties-of-round-to-odd">Properties of Round-to-Odd</h2>

<p>Boldo and Melquiond [2]
  identify several properties of round to odd.
It’s worth summarizing them here.
The first four are properties shared
  with other standard rounding modes:</p>

<ul>
  <li>round to odd is <em>entire</em> : every real number can be rounded to odd;</li>
  <li>round to odd is <em>unique</em> : every real number has a unique rounding to odd;</li>
  <li>round to odd is <em>monotonic</em> : if \(x_1 \leq x_2\),
then \(\mathrm{rnd}^{p}_{\text{RTO}}(x_1) \leq \mathrm{rnd}^{p}_{\text{RTO}}(x_2)\);</li>
  <li>round to odd is <em>faithful</em> : if \(y_1\) and \(y_2\) are the two representable numbers
of precision \(p\) surrounding \(x\), with \(y_1 &lt; x &lt; y_2\),
then \(\mathrm{rnd}^{p}_{\text{RTO}}(x)\) is either \(y_1\) or \(y_2\);</li>
</ul>

<p>The first three properties are fairly straightforward:
  we want to round any real number;
  we want the rounding to be deterministic;
  and we want rounding to preserve order.
The fourth property, faithfulness,
  ensures that rounding does not introduce
  large errors:
  if \(\varepsilon\) is the distance
  between \(y_1\) and \(y_2\),
  then the absolute error is less than \(\varepsilon\).
In addition,</p>

<ul>
  <li>round to odd is <em>symmetric</em> : 
for any real number \(x\),
\(\mathrm{rnd}^{p}_{\text{RTO}}(-x) = -\mathrm{rnd}^{p}_{\text{RTO}}(x).\)</li>
</ul>

<p>Boldo and Melquiond [2] prove a key property
  that distinguishes round to odd from other rounding modes:
  round to odd permits <em>safe re-rounding</em>.
Rounding with round to odd first,
  then re-rounding under any standard rounding mode
  at lower precision yields the same result as rounding
  directly under that standard rounding mode
  at lower precision, specifically,
  \(k \geq 2\) digits lower.</p>

<p><strong>Theorem 1.</strong>
Let \(p, k \geq 2\) be integers;
 and \(rm\) be a standard rounding mode.
Then,</p>

\[\mathrm{rnd}^{p}_{rm} = \mathrm{rnd}^{p}_{rm} \circ \mathrm{rnd}^{p+k}_{\text{RTO}}.\]

<p>To understand this statement,
  we return to previous examples
  covering the different rounding modes.
Let \(y_1\) and \(y_3\) be two adjacent
  floating-point values that are representable with precision \(p\),
  and let \(y_2\) be the midpoint between them
  at precision \(p + 1\).
For simplicity,
  let’s assume that \(y_1\) is positive,
  so \(y_2\) and \(y_3\) are also positive.
Consider rounding an arbitrary real number \(x \in [y_1, y_3)\)
  under RNE, RTZ, RAZ, and RTO.</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-6.png" alt="rounding $$x$$ between $$y_1$$ and $$y_3$$" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>For each rounding mode,
  we color the interval \([y_1, y_3)\) with
  each segment (or tick) colored according
  to the rounding result:
  blue for \(y_1\) and orange for \(y_3\).
Dual-coloring indicates a conditional rounding
  that depends on the value of \(y_1\) (and \(y_3\)).
RTO is dual-colored throughout;
  the midpoint \(y_2\) is also dual-colored for RNE.</p>

<p>Notice that, 
  for these rounding modes,
  we can distinguish four cases:</p>

<ul>
  <li>\(x = y_1\): \(x\) is representable,
so all rounding modes produce \(y_1\);</li>
  <li>\(y_1 &lt; x &lt; y_2\): \(x\) is closer to \(y_1\),
so RNE and RTZ produce \(y_1\),
RAZ produces \(y_3\);
and RTO chooses based on parity of \(y_1\);</li>
  <li>\(x = y_2\): \(x\) is halfway,
 so RNE must tie-break based on parity,
 RTZ produces \(y_1\),
 RAZ produces \(y_3\);
 and RTO chooses based on parity of \(y_1\);</li>
  <li>\(y_2 &lt; x &lt; y_3\): \(x\) is closer to \(y_3\),
so RNE and RAZ produce \(y_3\),
RTZ produces \(y_1\);
and RTO chooses based on parity of \(y_1\).</li>
</ul>

<p>Analyzing these regions at precision \(p + 1\),
  the significands of \(y_1\) and \(y_3\)
  are even, since increasing precision
  adds a trailing zero to their significands.
For example,
  if \(y_1 = 5/32\) with \(p = 3\), then:</p>

\[y_1 = 101 \cdot 2^{-5} = 1010 \cdot 2^{-6}.\]

<p>By contrast,
  the significand of \(y_2\) is odd.
For the same example,
  the midpoint \(y_2 = 11/64\)
  of \(y_1\) and \(y_3\) has the form</p>

\[y_2 = 101.1 \cdot 2^{-5} = 1011 \cdot 2^{-6}.\]

<p>The regions \((y_1, y_2)\) and \((y_2, y_3)\)
  are the intervals between adjacent
  representable numbers at precision \(p + 1\).
Recalling the discussion from earlier,
  these regions are <em>exactly</em> the intervals represented
  by the sticky bit at precision \(p + 2\).
Therefore,
  rounding to odd at precision \(p + 2\)
  results in four cases:</p>

<ul>
  <li>
    <p>\(x\) is exactly the endpoint \(y_1\),
so the last two digits of the rounded significand
are \(RS = 00\);</p>
  </li>
  <li>
    <p>\(x\) lies in \((y_1, y_2)\),
so the last two digits of the rounded significand
are \(RS = 01\);</p>
  </li>
  <li>
    <p>\(x\) is exactly the midpoint \(y_2\),
so the last two digits of the rounded significand
are \(RS = 10\);</p>
  </li>
  <li>
    <p>\(x\) lies in \((y_2, y_3)\),
so the last two digits of the rounded significand
are \(RS = 11\).</p>
  </li>
</ul>

<p>Notice that these cases correspond
  exactly to the four cases we identified earlier
  for standard rounding modes at precision \(p\).
After applying round to odd at precision \(p + 2\),
  representable values (at precision \(p\)) are still representable;
  midpoints (at precision \(p\)) are still midpoints;
  and all other values, either on \((y_1, y_2)\) or \((y_2, y_3)\),
  are rounded to the midpoint of those intervals,
  preserving their nearness to one endpoint.
Therefore,
  re-rounding under any standard rounding mode
  at precision \(p\) yields the same result
  as rounding directly at precision \(p\).</p>

<p><img src="/assets/posts/2025-11-18-round-to-odd/round-7.png" alt="rounding $$x$$ between $$y_1$$ and $$y_3$$" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>Visually,
  we can indicate the round to odd step
  by overlaying gray arrows, representing
  round to odd at precision \(p + 2\),
  over the previous figure.
Layering the two rounding steps,
  we see that round to odd at precision \(p + 2\)
  rounds values in \((y_1, y_2)\) or \((y_2, y_3)\)
  to the midpoints at precision \(p + 1\);
  representable values and midpoints remain unchanged.
Safe re-rounding corresponds
  to the coloring of the initial value \(x\)
  being preserved after rounding to odd.</p>

<p>Therefore,
  round to odd at precision \(p + 2\) preserves
  sufficient information so that we can safely re-round
  under any standard rounding mode at precision \(p\).
For precision \(p + k\) where \(k &gt; 2\),
  the same reasoning applies:
  values that are neither representable nor midpoints
  may be rounded differently at precision \(p + k\),
  but their nearness to one endpoint is preserved.</p>
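
<p>Theorem 1 can also be spot-checked by brute force.
The sketch below is a sanity check rather than a proof:
  it assumes exact rational inputs and an unbounded exponent range,
  tests only a finite grid of values in \([1, 4)\),
  and uses hypothetical helper names.
It also tries \(k = 1\),
  which the theorem does not cover.</p>

```python
from fractions import Fraction

def round_fraction(x: Fraction, p: int, mode: str) -> Fraction:
    """Round x to p significant bits under RNE, RTZ, RAZ, or RTO."""
    if mode not in ("RNE", "RTZ", "RAZ", "RTO"):
        raise ValueError(f"unknown rounding mode: {mode}")
    if x == 0:
        return Fraction(0)
    sign, a = (-1 if x < 0 else 1), abs(x)
    e = a.numerator.bit_length() - a.denominator.bit_length()
    if a < Fraction(2) ** e:
        e -= 1                                  # e = floor(log2(a))
    ulp = Fraction(2) ** (e - (p - 1))
    scaled = a / ulp
    c = scaled.numerator // scaled.denominator
    rem = scaled - c
    half = Fraction(1, 2)
    if rem != 0:
        if mode == "RAZ":
            c += 1
        elif mode == "RNE" and (rem > half or (rem == half and c % 2 == 1)):
            c += 1
        elif mode == "RTO":
            c |= 1                              # pick the odd neighbor
    return sign * c * ulp

def rerounding_is_safe(p: int, k: int, mode: str, xs) -> bool:
    """Check rnd_p == rnd_p . rto_(p+k) on every value in xs."""
    return all(
        round_fraction(x, p, mode)
        == round_fraction(round_fraction(x, p + k, "RTO"), p, mode)
        for x in xs
    )

xs = [Fraction(n, 256) for n in range(256, 1024)]   # grid over [1, 4)
for mode in ("RNE", "RTZ", "RAZ"):
    print(mode, rerounding_is_safe(3, 2, mode, xs), rerounding_is_safe(3, 5, mode, xs))
print("RNE with k = 1:", rerounding_is_safe(3, 1, "RNE", xs))
```

<p>With \(k = 2\) and \(k = 5\), every check on this grid passes.
With \(k = 1\), the RNE check fails:
  round to odd at precision \(p + 1\) sends every value strictly between \(y_1\) and \(y_3\)
  to the midpoint \(y_2\),
  so the information about which side of the midpoint the value was on is lost.</p>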

<h2 id="applications">Applications</h2>

<p>Boldo and Melquiond [2, 4]
  provide potential applications for round-to-odd.
They include
  emulation of FMA,
  correctly-rounded addition of 3 terms,
  correctly-rounded sum of \(n\) terms
  (under certain conditions),
  compiling the same constant under
  multiple precisions and rounding modes,
  and more.</p>

<p>More recent work
  relies on a corollary of Theorem 1.
Combining the definition
  of a correctly-rounded function
  with Theorem 1,
  we get the following result:</p>

\[\begin{align*}
f^{*} &amp;= \mathrm{rnd}^{p}_{rm} \circ f\\
&amp;= (\mathrm{rnd}^{p}_{rm} \circ \mathrm{rnd}^{p+k}_{\text{RTO}}) \circ f\\
&amp;= \mathrm{rnd}^{p}_{rm} \circ (\mathrm{rnd}^{p+k}_{\text{RTO}} \circ f)\\
&amp;= \mathrm{rnd}^{p}_{rm} \circ f_{\text{RTO}}^{*}.
\end{align*}\]

<p>Stated otherwise,
  a correctly-rounded implementation of \(f\)
  is the composition of
  (i) a round-to-odd implementation of \(f\); followed by
  (ii) re-rounding under the desired rounding mode.</p>

<p><strong>Corollary 2.</strong>
Let \(f\) be a real-valued function,
  and \(f^{*}\) be a correctly-rounded implementation
  of \(f\) at precision \(p\) under rounding mode \(rm\).
If \(f_{\text{RTO}}^{*}\) is a correctly-rounded implementation
  of \(f\) at precision \(p + k\) under round to odd (\(k \geq 2\)),
  then</p>

\[f^{*} = \mathrm{rnd}^{p}_{rm} \circ f_{\text{RTO}}^{*}.\]

<p>One successful application of this corollary
  is found in the RLibm project [5],
  which automatically generates efficient, correctly-rounded elementary functions.
It generates a polynomial approximation with additional bits
  of precision using round-to-odd arithmetic,
  so that the result is correctly rounded
  when re-rounded under the desired rounding mode.
The general principle of Corollary 2
  suggests a modular approach
  to designing correctly-rounded functions
  for multiple precisions and rounding modes:
  first, design a round-to-odd implementation
  at higher precision;
  then, apply a re-rounding step
  to obtain the desired result.</p>
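
<p>As a toy illustration of this two-step recipe
  (and not of how RLibm itself works),
  the sketch below computes the square root of a positive integer
  under round to odd at precision \(p + 2\) using integer arithmetic,
  then re-rounds to \(p = 53\) bits under RNE.
The function names are hypothetical.
The comparison with <code class="language-plaintext highlighter-rouge">math.sqrt</code> assumes
  that the platform's double-precision square root is correctly rounded,
  as IEEE 754 requires.</p>

```python
import math

def sqrt_rto(n: int, q: int) -> tuple[int, int]:
    """Return (c, exp) with c * 2**exp equal to sqrt(n) rounded to odd at q bits."""
    # Scale n by 4**s so that the integer square root has at least q bits.
    s = max(0, q - n.bit_length() // 2 + 1)
    big = n << (2 * s)
    r = math.isqrt(big)                   # floor(sqrt(n) * 2**s)
    inexact = r * r != big                # non-zero digits below r
    shift = r.bit_length() - q            # low bits of r to discard (>= 0)
    c = r >> shift
    if inexact or (r & ((1 << shift) - 1)) != 0:
        c |= 1                            # sticky bit: force an odd significand
    return c, shift - s

def rne_drop(c: int, k: int) -> int:
    """Discard the k low bits of c (k >= 1), rounding to nearest, ties to even."""
    kept, rem = c >> k, c & ((1 << k) - 1)
    half = 1 << (k - 1)
    if rem > half or (rem == half and kept % 2 == 1):
        kept += 1
    return kept

def sqrt_double(n: int) -> float:
    """Correctly-rounded double sqrt built as RNE re-rounding of an RTO result."""
    p, k = 53, 2
    c, exp = sqrt_rto(n, p + k)           # step (i): round to odd at p + k bits
    return math.ldexp(rne_drop(c, k), exp + k)   # step (ii): re-round to p bits

print(all(sqrt_double(n) == math.sqrt(n) for n in range(1, 2000)))
```

<p>Only <code class="language-plaintext highlighter-rouge">rne_drop</code> knows about the target rounding mode;
  swapping it for a different re-rounding step would reuse
  <code class="language-plaintext highlighter-rouge">sqrt_rto</code> unchanged.</p>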

<h2 id="conclusion">Conclusion</h2>

<p>This blog post covered round to odd,
  including its definition, properties, and applications.
Along the way,
  we learned about floating-point numbers,
  correctly-rounded functions,
  and how rounding works.
The key property of round to odd,
  safe re-rounding,
  avoids double rounding and suggests
  a method for designing correctly-rounded functions
  by separating the concerns of
  approximating the infinitely precise result
  and rounding to the target number format.
While round to odd is not widely supported
  in hardware or programming languages today,
  its unique properties make it a valuable technique
  that should be better studied and
  more widely adopted.</p>

<h2 id="references">References</h2>

<ol>
  <li>
    <p>IEEE. 2019. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), 1–84. DOI: <a href="https://doi.org/10.1109/IEEESTD.2019.8766229">https://doi.org/10.1109/IEEESTD.2019.8766229</a>.</p>
  </li>
  <li>
    <p>Sylvie Boldo, Guillaume Melquiond. When double rounding is odd. 17th IMACS World Congress,
Jul 2005, Paris, France. pp.11. ⟨inria-00070603v2⟩.</p>
  </li>
  <li>
    <p>Nicolas Brisebarre, Guillaume Hanrot, Jean-Michel Muller, and Paul Zimmermann. 2025. Correctly Rounded
Evaluation of a Function: Why, How, and at What Cost?. ACM Comput. Surv. 58, 1, Article 27 (September 2025),
34 pages. <a href="https://doi.org/10.1145/3747840">https://doi.org/10.1145/3747840</a>.</p>
  </li>
  <li>
    <p>Sylvie Boldo and Guillaume Melquiond. 2008. Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. IEEE Transactions on Computers 57, 4 (2008), 462–471. <a href="https://doi.org/10.1109/TC.2007.70819">https://doi.org/10.1109/TC.2007.70819</a>.</p>
  </li>
  <li>
    <p>Jay P. Lim and Santosh Nagarakatte. 2022. One Polynomial Approximation to Produce Correctly Rounded
Results of an Elementary Function for Multiple Representations and Rounding Modes. Proc. ACM Program.
Lang. 6, POPL, Article 3 (January 2022), 28 pages.
<a href="https://doi.org/10.1145/3498664">https://doi.org/10.1145/3498664</a>.</p>
  </li>
</ol>]]></content><author><name></name></author><category term="blog" /><category term="floating-point" /><category term="rounding" /><summary type="html"><![CDATA[When writing numerical programs, rounding is a subtle but important aspect that can significantly affect the accuracy and stability of computations. In particular, rounding twice at different precisions might introduce unexpected errors. For example, under the nearly universal round to nearest, ties to even (RNE) rounding mode, adding 1.00000011 and 5.96046447e-8, and rounding directly to a single-precision floating-point value yields 1.00000012; but first rounding to a double-precision floating-point value then to a single-precision floating-point value yields 1.00000024. This phenomenon is known as double rounding.]]></summary></entry><entry><title type="html">Composable, Correctly-Rounded Number Libraries</title><link href="https://www.bsaiki.com/blog/2025/11/14/numbers-libraries.html" rel="alternate" type="text/html" title="Composable, Correctly-Rounded Number Libraries" /><published>2025-11-14T00:00:00+00:00</published><updated>2025-11-14T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2025/11/14/numbers-libraries</id><content type="html" xml:base="https://www.bsaiki.com/blog/2025/11/14/numbers-libraries.html"><![CDATA[<p><em>This blog post was split into two parts.</em>
<em>This <a href="/blog/2025/11/19/round-to-odd.html">blog post</a></em>
  <em>covers rounding in detail and discusses some theory;</em>
  <em>this blog post focuses on design principles for number libraries.</em></p>

<p>Number libraries simulate number systems —
  floating-point, fixed-point, posits, and more —
  beyond those offered by standard hardware or
  language runtimes.
Well-known examples of these libraries
  include <a href="https://www.mpfr.org/">MPFR</a>
  for arbitrary-precision floating-point numbers,
  <a href="https://gmplib.org/">GMP</a>
  for arbitrary-precision integers and rationals,
  <a href="http://www.jhauser.us/arithmetic/SoftFloat.html">SoftFloat</a>
  for emulating IEEE 754 floating-point numbers in software,
  and <a href="https://github.com/stillwater-sc/universal">Universal</a>
  for supporting formats across the machine learning landscape.
They are essential tools for
  analyzing numerical error in multi-precision numerical algorithms,
  exploring different implementation trade-offs,
  and verifying the correctness of
  numerical software and hardware.</p>

<p>This flexibility over number systems comes at a cost:
  number libraries are complex,
  requiring expert knowledge to build and maintain,
  and significant effort to provide useful features
  and ensure correctness.
Maintainers of these libraries face
  significant challenges as their tools
  serve as trusted reference implementations
  but must also be efficient for simulation and verification tasks.
Due to intense demand from machine learning,
  there has been a proliferation of new number formats
  and rounding behaviors.
Each new combination of operation,
  number format, rounding mode,
  overflow behavior, and special value handling
  requires careful implementation and testing,
  multiplying the complexity and maintenance burden.
Developers must choose:
  stick with a smaller set of features
  or invest significant effort to meet user demand.</p>

<!-- As new number formats and rounding behaviors proliferate,
  developers are faced with an explosion of different
  combinations of operations, formats, rounding modes,
  overflow behavior, special value handling, and more.


Each new feature requires careful implementation and testing,
  multiplying the complexity, verification effort,
  and maintenance burden. -->

<p>I maintain number libraries that fall into the latter category:
  they aim to support a wide variety of number formats
  and rounding behaviors.
I want to reflect on two design principles
  that have helped mitigate some of these challenges.</p>

<p>First, 
  the <em>round-to-odd</em> rounding mode [1] allows decoupling
  of arithmetic operations from rounding.
Applying this insight to number libraries,
  a number library may consist of two independent parts:
  a core arithmetic engine performing round-to-odd operations,
  and a core rounding library that safely re-rounds under
  the desired number format or rounding mode.
As a result,
  each operation provided by the number library
  is the composition of an operation in the arithmetic engine
  and the <code class="language-plaintext highlighter-rouge">round</code> method of a rounding context instance from
  the core rounding library.
This trick was pioneered by Bill Zorn [2]
  in his number library found in the Titanic repo [3].</p>

<p>Second,
  the core rounding library should be organized
  into <em>rounding contexts</em> which encapsulate number format,
  rounding mode, and other rounding information,
  providing a single <code class="language-plaintext highlighter-rouge">round</code> method.
These rounding contexts are often composable:
  a rounding context for an IEEE 754 floating-point number
  can reuse the same logic as a rounding context for
  a <code class="language-plaintext highlighter-rouge">p</code>-bit floating-point number,
  with added checks for overflow.
Thus,
  the rounding library can be broken down
  further into composable components that can
  be assembled to implement rounding for a
  variety of number formats.</p>

<p>Put together,
  these two design principles enable significant
  modularity and code reuse within number libraries,
  resulting in several benefits.
They dramatically improve <em>maintainability:</em>
  the library is smaller;
  it consists of smaller, composable components
  rather than a monolithic implementation where each
  combination of operation and rounding mode requires separate code.
They ensure <em>extensibility:</em>
  adding features is easier and less error-prone;
  adding a new operation requires implementing it only once
  in the arithmetic engine,
  while adding a new format or rounding mode
  requires implementing it only once in the rounding library.
They improve <em>correctness:</em>
  testing can be done compositionally;
  components can be extensively tested in isolation,
  and their composition inherits the correctness guarantees.</p>

<p>I will explore these two design principles in greater detail
  and illustrate their benefits through examples.
<!-- Research on number libraries is rarely published,
  and principles behind their design are rarer still. -->
As someone working in this space,
  I hope this blog post provides some insight
  into the challenges in this domain
  and possible solutions.</p>

<h2 id="separating-rounding-from-arithmetic">Separating Rounding from Arithmetic</h2>

<p>The first design principle
  is to separate rounding from arithmetic operations
  using round-to-odd arithmetic [1].
My <a href="/blog/2025/11/19/round-to-odd.html">blog post</a>
  covers this topic in detail.
I’ll summarize the key ideas here
  and illustrate how they apply to number libraries.</p>

<h3 id="theory">Theory</h3>

<p>For a real-valued function \(f\),
  we say that an implementation of \(f\), say \(f^{*}\),
  is <em>correctly-rounded</em> when it produces
  the infinitely precise result of \(f\) rounded
  to the target number format.
This rounding operation is usually parameterized
  by a <em>rounding mode</em> that specifies which representable value
  to choose when the result is not exactly representable
  in the target number format; the IEEE 754 standard
  specifies several such rounding modes.
For now,
  we’ll only consider floating-point numbers
  with a fixed precision of \(p\) digits,
  so I’ll denote the rounding operation in the style
  of Boldo and Melquiond [1] as \(\mathrm{rnd}^{p}_{rm}\).
Using this notation,
  the correctly-rounded implementation of \(f\)
  in precision \(p\) under rounding mode \(rm\)
  can be expressed as:</p>

\[f^{*} = \mathrm{rnd}^{p}_{rm} \circ f.\]

<p>Boldo and Melquiond [1] describe
  a non-standard rounding mode called
  <em>round to odd</em> and prove that
  the rounding mode permits <em>safe re-rounding</em>.
Rounding with round to odd at precision \(p + k\)
  (for \(k \geq 2\)) followed by re-rounding under
  a standard rounding mode at precision \(p\)
  yields the same result as rounding directly
  at precision \(p\) under the desired rounding mode.</p>

<p><strong>Theorem 1.</strong>
Let \(p, k \geq 2\) be integers
 and let \(rm\) be a standard rounding mode.
Then,</p>

\[\mathrm{rnd}^{p}_{rm} = \mathrm{rnd}^{p}_{rm} \circ \mathrm{rnd}^{p+k}_{\text{RTO}}.\]

<p>Applying this result to the
  definition of a correctly-rounded implementation,
  we can derive that
  \(f^{*}\) is the composition of
  (i) a round-to-odd implementation of \(f\) at higher precision,
  followed by (ii) re-rounding under the desired
  rounding mode:</p>

\[f^{*} = \mathrm{rnd}^{p}_{rm} \circ f_{\text{RTO}}^{*}.\]
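
<p>Spelling out the intermediate step may help.
Assuming \(f_{\text{RTO}}^{*}\) is shorthand for
  \(\mathrm{rnd}^{p+k}_{\text{RTO}} \circ f\) with \(k \geq 2\)
  (the subscript is this post's notation, not taken from [1]),
  the middle equality below is exactly Theorem 1:</p>

\[f^{*} = \mathrm{rnd}^{p}_{rm} \circ f = \mathrm{rnd}^{p}_{rm} \circ \left(\mathrm{rnd}^{p+k}_{\text{RTO}} \circ f\right) = \mathrm{rnd}^{p}_{rm} \circ f_{\text{RTO}}^{*}.\]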

<p>The result is not novel:
  one successful application is found in the RLibm project [4],
  which automatically generates efficient, correctly-rounded elementary functions.
RLibm generates a polynomial approximation that produces
  a round-to-odd result with additional bits of precision;
  re-rounding that result under the desired rounding mode
  yields the correctly-rounded result.</p>

<p>For number libraries,
  the result has significant implications.
It suggests that arithmetic and rounding
  can be decoupled: a number library can consist
  of two <em>independent</em> components:
  an <em>arithmetic engine</em> that implements each
  mathematical operation using round-to-odd arithmetic,
  and a <em>rounding library</em> that implements
  rounding operations for various number formats
  and rounding modes.
The only interaction between the two components
  is that the rounding library must provide the precision \(p + k\)
  required for safe re-rounding.
<!-- In a program,
  we may write `f<p, rm>(x)` to evaluate $$f^{*}$$
  on an input $$x$$ with precision $$p$$ and rounding mode $$rm$$. --></p>

<h3 id="application">Application</h3>

<p>To illustrate this approach in practice,
  consider a number library providing
  an implementation of multiplication.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">module</span> <span class="n">Engine</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
        <span class="p">...</span>

<span class="n">module</span> <span class="n">Round</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span>
        <span class="p">...</span>

    <span class="k">def</span> <span class="nf">rto_prec</span><span class="p">(</span><span class="n">p</span><span class="p">):</span>
        <span class="p">...</span>

<span class="k">def</span> <span class="nf">mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span>
    <span class="n">rto_p</span> <span class="o">=</span> <span class="n">Round</span><span class="p">.</span><span class="n">rto_prec</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
    <span class="n">result</span> <span class="o">=</span> <span class="n">Engine</span><span class="p">.</span><span class="n">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">rto_p</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">Round</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">)</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">Engine</code> module implements arithmetic operations
  that produce round-to-odd results with at least precision <code class="language-plaintext highlighter-rouge">p</code>.
The <code class="language-plaintext highlighter-rouge">Round</code> module handles rounding operations
  and precision calculations:
  <code class="language-plaintext highlighter-rouge">round</code> rounds <code class="language-plaintext highlighter-rouge">x</code> to precision <code class="language-plaintext highlighter-rouge">p</code> using the specified rounding mode,
  while <code class="language-plaintext highlighter-rouge">rto_prec</code> calculates the precision needed for safe re-rounding.
The <code class="language-plaintext highlighter-rouge">mul</code> function provided by the number library
  composes these functions by
  performing round-to-odd multiplication,
  and re-rounding to the desired precision and rounding mode.</p>

<p>To implement <code class="language-plaintext highlighter-rouge">rto_mul</code>,
  we can use existing libraries like MPFR
  that have been extensively tested.
MPFR provides a narrower interface
  that performs floating-point arithmetic at
  a specified precision and rounding mode.
Although MPFR does not directly support round-to-odd,
  we can implement it as described by Boldo and Melquiond [1]:
  we use MPFR’s implementation at precision <code class="language-plaintext highlighter-rouge">p - 1</code> with
  round towards zero, followed by an additional step to adapt
  the result to be round-to-odd at precision <code class="language-plaintext highlighter-rouge">p</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
  <span class="n">r</span> <span class="o">=</span> <span class="n">MPFR</span><span class="p">.</span><span class="n">mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">MPFR</span><span class="p">.</span><span class="n">RTZ</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">rto_fixup</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
</code></pre></div></div>

<p>Note that, by Theorem 1,
  a round-to-odd result at precision <code class="language-plaintext highlighter-rouge">p</code> (or higher)
  can be safely re-rounded at precision <code class="language-plaintext highlighter-rouge">p - 2</code>.</p>
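
<p>To make the fix-up step and Theorem 1 concrete,
  here is a minimal sketch on non-negative Python integers rather than MPFR values.
The helpers <code class="language-plaintext highlighter-rouge">round_to_odd</code> and <code class="language-plaintext highlighter-rouge">round_nearest_even</code>
  are hypothetical names used only for illustration;
  they are not part of MPFR or of any library mentioned in this post.
The first truncates to <code class="language-plaintext highlighter-rouge">p - 1</code> bits and then appends
  a sticky bit recording whether any non-zero bit was discarded.</p>

```python
def round_to_odd(c: int, p: int) -> int:
    """Round a non-negative integer to p significant bits using round-to-odd."""
    shift = c.bit_length() - (p - 1)
    if shift <= 1:
        return c  # c already fits in p bits, so it is exact
    hi = c >> shift  # round towards zero at precision p - 1
    inexact = (c & ((1 << shift) - 1)) != 0  # any non-zero bits discarded?
    m = (hi << 1) | int(inexact)  # fix-up: the sticky bit becomes the p-th bit
    return m << (shift - 1)


def round_nearest_even(c: int, p: int) -> int:
    """Round a non-negative integer to p significant bits, ties to even."""
    shift = c.bit_length() - p
    if shift <= 0:
        return c
    hi, lo = c >> shift, c & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if lo > half or (lo == half and hi & 1):
        hi += 1  # round away from zero
    return hi << shift


# Rounding twice with RNE at an intermediate precision can go wrong ...
assert round_nearest_even(41, 2) == 48
assert round_nearest_even(round_nearest_even(41, 4), 2) == 32
# ... while a round-to-odd intermediate with two extra bits is safe.
assert round_nearest_even(round_to_odd(41, 4), 2) == 48
```

<p>The sketch ignores signs, exponents, and special values;
  it only illustrates why the extra sticky bit preserves enough information
  for the second rounding to make the right decision.</p>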

<p>Since the arithmetic engine and rounding library are separate,
  the exact implementation of <code class="language-plaintext highlighter-rouge">rto_mul</code> can be changed
  without affecting the correctness of any <code class="language-plaintext highlighter-rouge">mul</code> implementation,
  as long as it produces correct round-to-odd results.
For example,
  assume that floating-point numbers in our library
  have the following structure:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Float</span><span class="p">:</span> <span class="c1"># (-1)^sign * c * 2^exp
</span>  <span class="n">sign</span><span class="p">:</span> <span class="nb">bool</span> <span class="c1"># sign
</span>  <span class="n">exp</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># exponent
</span>  <span class="n">c</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># significand (c &gt;= 0)
</span></code></pre></div></div>

<p>We can implement <code class="language-plaintext highlighter-rouge">rto_mul</code> manually as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
  <span class="n">s</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">sign</span> <span class="o">!=</span> <span class="n">y</span><span class="p">.</span><span class="n">sign</span>
  <span class="n">exp</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">exp</span> <span class="o">+</span> <span class="n">y</span><span class="p">.</span><span class="n">exp</span>
  <span class="n">c</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">c</span> <span class="o">*</span> <span class="n">y</span><span class="p">.</span><span class="n">c</span>
  <span class="k">return</span> <span class="n">Float</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">exp</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>
</code></pre></div></div>

<p>Notice that this implementation does not
  perform any rounding, so <code class="language-plaintext highlighter-rouge">p</code> is unused.
Since the result is infinitely precise,
  it can be re-rounded under any precision.
Extending this implementation to handle special values
  (\(+\infty\), \(-\infty\), NaN, etc.)
  just requires careful case analysis.</p>

<p>For the <code class="language-plaintext highlighter-rouge">Round</code> module,
  we need to implement two functions:</p>
<ul>
  <li>
    <p><code class="language-plaintext highlighter-rouge">rto_prec(p)</code> computes the precision required
for safe re-rounding to precision <code class="language-plaintext highlighter-rouge">p</code>; and</p>
  </li>
  <li>
    <p><code class="language-plaintext highlighter-rouge">round(x, p, rm)</code> rounds <code class="language-plaintext highlighter-rouge">x</code> to precision <code class="language-plaintext highlighter-rouge">p</code>
using the specified rounding mode <code class="language-plaintext highlighter-rouge">rm</code>.</p>
  </li>
</ul>

<p>The implementation of <code class="language-plaintext highlighter-rouge">rto_prec</code> is straightforward:
  it returns <code class="language-plaintext highlighter-rouge">p + k</code> for a constant <code class="language-plaintext highlighter-rouge">k &gt;= 2</code>.
Implementing the <code class="language-plaintext highlighter-rouge">round</code> function
  requires care as it’s the core method of the rounding library.
The exact implementation is too verbose
  to include here, but I’ll outline its basic structure.
For this example,
  we’ll ignore special values like NaN and infinity.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span>
  <span class="n">n</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">e</span> <span class="o">-</span> <span class="n">p</span>  <span class="c1"># compute where to round off digits
</span>  <span class="n">hi</span><span class="p">,</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="c1"># split into significant and leftover digits
</span>  <span class="n">rbits</span> <span class="o">=</span> <span class="n">lo</span><span class="p">.</span><span class="n">round_bits</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="c1"># summarize leftover digits for rounding decision
</span>  <span class="n">increment</span> <span class="o">=</span> <span class="n">decide_increment</span><span class="p">(</span><span class="n">hi</span><span class="p">,</span> <span class="n">rbits</span><span class="p">,</span> <span class="n">rm</span><span class="p">)</span> <span class="c1"># round away from zero based on rounding mode?
</span>  <span class="k">if</span> <span class="n">increment</span><span class="p">:</span> <span class="c1"># adjust hi if needed
</span>    <span class="n">hi</span> <span class="o">=</span> <span class="n">hi</span><span class="p">.</span><span class="n">increment</span><span class="p">()</span>
  <span class="k">return</span> <span class="n">hi</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">round</code> function first determines
  where to round off digits based on the
  normalized exponent <code class="language-plaintext highlighter-rouge">x.e</code> of the argument and
  the target precision <code class="language-plaintext highlighter-rouge">p</code>.
Any digit above this point is significant
  while digits below must be rounded off.
A method <code class="language-plaintext highlighter-rouge">split</code> divides the number based on this point,
  and the lower part is summarized into rounding bits.
  <!-- using a round-sticky (RS) or round-guard-sticky (RGS) scheme. -->
Since <code class="language-plaintext highlighter-rouge">hi</code> represents the round-towards-zero result,
  we decide whether to round away from zero to the next representable
  value based on the rounding bits and the specified rounding mode.
The correctly-rounded result is <code class="language-plaintext highlighter-rouge">hi</code>.</p>
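
<p>The outline above can be turned into a small runnable sketch.
Everything here is an assumption made for illustration:
  the <code class="language-plaintext highlighter-rouge">Float</code> layout follows the structure shown earlier,
  the round bits are modeled as a half bit plus a sticky bit,
  only three sign-independent rounding modes are included,
  and the function is named <code class="language-plaintext highlighter-rouge">round_float</code>
  to avoid shadowing Python's built-in <code class="language-plaintext highlighter-rouge">round</code>.</p>

```python
from dataclasses import dataclass
from enum import Enum


class RoundingMode(Enum):
    RNE = "nearest, ties to even"
    RTZ = "towards zero"
    RAZ = "away from zero"


@dataclass(frozen=True)
class Float:
    """Represents (-1)^sign * c * 2^exp with c >= 0."""
    sign: bool
    exp: int
    c: int

    @property
    def e(self) -> int:
        """Normalized exponent: position of the most significant bit."""
        return self.exp + self.c.bit_length() - 1


def decide_increment(hi: int, half_bit: bool, sticky: bool, rm: RoundingMode) -> bool:
    if rm is RoundingMode.RTZ:
        return False
    if rm is RoundingMode.RAZ:
        return half_bit or sticky
    # RNE: above the midpoint, or exactly at it with an odd truncated significand
    return half_bit and (sticky or hi % 2 == 1)


def round_float(x: Float, p: int, rm: RoundingMode) -> Float:
    n = x.e - p  # digits at positions <= n are rounded off
    shift = (n + 1) - x.exp  # number of low bits of c below the rounding point
    if shift <= 0:
        return x  # nothing to round off
    hi, lo = x.c >> shift, x.c & ((1 << shift) - 1)  # split at position n
    half_bit = bool(lo >> (shift - 1))  # first discarded bit
    sticky = (lo & ((1 << (shift - 1)) - 1)) != 0  # any later discarded bit set?
    if decide_increment(hi, half_bit, sticky, rm):
        hi += 1  # move to the next representable value away from zero
    return Float(x.sign, n + 1, hi)
```

<p>For instance, rounding 45 to four bits with
  <code class="language-plaintext highlighter-rouge">round_float(Float(False, 0, 45), 4, RoundingMode.RNE)</code>
  keeps the significand 11 at exponent 2, i.e. 44.
Sign-dependent modes (towards positive or negative infinity)
  would only need extra cases in <code class="language-plaintext highlighter-rouge">decide_increment</code>.</p>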

<p>A number library built in this style
  can be extended to support new operations
  and rounding modes easily.
For a new operation \(g\),
  we must only implement \(g^{*}_{\text{RTO}}\)
  in the arithmetic engine,
  either from existing libraries or manually.
The function <code class="language-plaintext highlighter-rouge">round</code> in the rounding library
  implements the function \(\mathrm{rnd}^{p}_{rm}\).
Their composition is then
  the correctly-rounded implementation \(g^{*}\) of \(g\).
For a new rounding mode,
  only the <code class="language-plaintext highlighter-rouge">round</code> function needs to be updated;
  the arithmetic engine remains unchanged.
Compare this approach to a monolithic implementation.
Each new operation requires implementing
  operation-specific rounding logic;
  each new rounding mode requires updating
  every operation.</p>

<p>To briefly summarize,
  separating arithmetic from rounding
  achieves the following benefits:</p>

<ul>
  <li>
    <p><em>Maintainability:</em> The arithmetic engine implements each operation only once,
while the rounding library implements rounding logic separately.</p>
  </li>
  <li>
    <p><em>Extensibility:</em> Any new mathematical operation
can be composed with the existing rounding library.
Similarly, any new rounding logic can be composed with
the existing arithmetic engine.</p>
  </li>
  <li>
    <p><em>Correctness:</em> Once a mathematical operation is verified
in the arithmetic engine, its correctness applies to any
composition with the rounding library.
Similarly, once a rounding context is verified,
its correctness applies to any mathematical operation.
Thus, testing can be done modularly and the results reused.</p>
  </li>
</ul>

<h2 id="rounding-contexts">Rounding Contexts</h2>

<p>While separating rounding from arithmetic
  allows developers to split the number library
  into two independent components,
  the rounding library must support a large number
  of number formats and rounding behaviors.
To manage this complexity,
  we can organize the rounding library
  into <em>rounding contexts</em>.</p>

<p>A rounding context encapsulates all information
  required to round a number correctly:
  the number format (precision, exponent range, etc.),
  the rounding mode,
  overflow behavior (saturating, wrapping, exceptions, etc.),
  and special value handling (NaN, infinity, etc.).
Rather than having a single <code class="language-plaintext highlighter-rouge">round</code> function
  that consumes all rounding information as
  an exhaustive list of parameters,
  we convert <code class="language-plaintext highlighter-rouge">Round</code> into an interface;
  each rounding context will implement the interface.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">interface</span> <span class="n">Round</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">rto_prec</span><span class="p">():</span>
    <span class="p">...</span>

  <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="p">...</span>

  <span class="k">def</span> <span class="nf">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span> <span class="c1"># default method
</span>    <span class="p">...</span>


<span class="k">def</span> <span class="nf">mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">ctx</span><span class="p">):</span>
  <span class="n">rto_p</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">rto_prec</span><span class="p">()</span>
  <span class="n">result</span> <span class="o">=</span> <span class="n">Engine</span><span class="p">.</span><span class="n">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">rto_p</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">ctx</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div></div>

<p>The two interface methods are similar to before:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">rto_prec</code> computes the precision required to safely re-round
a number <em>under</em> the rounding context; and</li>
  <li><code class="language-plaintext highlighter-rouge">round</code> rounds <code class="language-plaintext highlighter-rouge">x</code> <em>under</em> the rounding context.</li>
</ul>

<p>The previous <code class="language-plaintext highlighter-rouge">round</code> implementation is now <code class="language-plaintext highlighter-rouge">round_core</code>.
  <!-- which can be reused as the core rounding primitive
  by each rounding context implementation. -->
The <code class="language-plaintext highlighter-rouge">mul</code> function now accepts
  a rounding context instance rather than explicit rounding parameters,
  so its implementation changes slightly.</p>

<p>One strategy for implementing rounding contexts is
  to organize each implementation of <code class="language-plaintext highlighter-rouge">Round</code>
  based on families of <em>number formats</em>.
For example,
  we can implement <code class="language-plaintext highlighter-rouge">p</code>-digit floating-point numbers
  as a family of rounding contexts.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MPFloat</span><span class="p">(</span><span class="n">Round</span><span class="p">):</span>
  <span class="n">p</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># p &gt;= 1
</span>  <span class="n">rm</span><span class="p">:</span> <span class="n">RoundingMode</span>

  <span class="k">def</span> <span class="nf">rto_prec</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span> <span class="o">+</span> <span class="mi">2</span>

  <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">p</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">rm</span><span class="p">)</span>
</code></pre></div></div>

<p>Notice that we completely
  reuse the <code class="language-plaintext highlighter-rouge">round_core</code> implementation for the rounding logic.</p>

<p>Critically,
  we add support for new number formats
  by simply implementing a new class that
  implements the <code class="language-plaintext highlighter-rouge">Round</code> interface.
For example,
  consider supporting the IEEE 754 floating-point numbers
  parameterized by <code class="language-plaintext highlighter-rouge">es</code>, the size of the exponent field,
  and <code class="language-plaintext highlighter-rouge">nbits</code>, the total number of bits of the representation.
While IEEE 754 numbers have restrictions
  on exponent ranges, their core rounding behavior
  is similar to <code class="language-plaintext highlighter-rouge">p</code>-digit floating-point numbers:
  we can reuse the <code class="language-plaintext highlighter-rouge">MPFloat</code> rounding logic.
The exact implementation is too verbose to include here,
  but an outline of the implementation is as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">IEEEFloat</span><span class="p">(</span><span class="n">Round</span><span class="p">):</span>
  <span class="n">es</span><span class="p">:</span> <span class="nb">int</span>  <span class="c1"># exponent size (es &gt;= 2)
</span>  <span class="n">nbits</span><span class="p">:</span> <span class="nb">int</span>  <span class="c1"># total number of bits (nbits &gt;= es + 2)
</span>  <span class="n">rm</span><span class="p">:</span> <span class="n">RoundingMode</span>

  <span class="k">def</span> <span class="nf">emin</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="c1"># minimum (normalized) exponent
</span>    <span class="k">return</span> <span class="mi">2</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">es</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>

  <span class="k">def</span> <span class="nf">rto_prec</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">p</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">nbits</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">es</span>
    <span class="k">return</span> <span class="n">MPFloat</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">rm</span><span class="p">).</span><span class="n">rto_prec</span><span class="p">()</span> <span class="c1"># need at least this much precision
</span>
  <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="n">max_p</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">nbits</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">es</span> <span class="c1"># maximum allowable precision
</span>    <span class="n">e_diff</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">e</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">emin</span><span class="p">()</span> <span class="c1"># e_diff &lt; 0 if subnormal
</span>    <span class="n">p</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">max_p</span><span class="p">,</span> <span class="n">max_p</span> <span class="o">+</span> <span class="n">e_diff</span><span class="p">)</span> <span class="c1"># adjust precision for subnormals
</span>    <span class="n">r</span> <span class="o">=</span> <span class="n">MPFloat</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">rm</span><span class="p">).</span><span class="nb">round</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># re-use rounding logic
</span>    <span class="c1"># handle overflow based on rounding mode
</span>    <span class="p">...</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>The implementation of <code class="language-plaintext highlighter-rouge">round</code> is more complicated.
First,
  we must deal with subnormal numbers,
  i.e., numbers with magnitude below \(2^{emin}\)
  which have reduced precision:
  the <code class="language-plaintext highlighter-rouge">min(max_p, ...)</code> expression
  computes the effective precision.
We adjust the precision accordingly,
  construct the appropriate <code class="language-plaintext highlighter-rouge">MPFloat</code> context,
  and reuse its <code class="language-plaintext highlighter-rouge">round</code> method to round
  without exponent bounds.
Finally,
  we handle overflow based on the specified rounding mode.</p>
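
<p>As a rough check of this composition,
  here is a self-contained sketch that works on exact
  <code class="language-plaintext highlighter-rouge">Fraction</code> inputs and supports only round to nearest, ties to even.
Several details are assumptions for the sake of a short example:
  the helper <code class="language-plaintext highlighter-rouge">exponent</code> is hypothetical,
  <code class="language-plaintext highlighter-rouge">round_core</code> takes the rounding position directly,
  <code class="language-plaintext highlighter-rouge">emin</code> is taken to be IEEE 754's \(1 - emax\),
  and overflow simply raises an exception instead of following the rounding mode.
For inputs below the smallest subnormal the computed precision can drop below one,
  which this sketch happens to tolerate.</p>

```python
from dataclasses import dataclass
from fractions import Fraction


def exponent(x: Fraction) -> int:
    """Normalized exponent floor(log2(|x|)) of a non-zero x, computed exactly."""
    a = abs(x)
    e = a.numerator.bit_length() - a.denominator.bit_length()
    return e if Fraction(2) ** e <= a else e - 1


def round_core(x: Fraction, n: int) -> Fraction:
    """Round x to a multiple of 2^(n+1), ties to even."""
    ulp = Fraction(2) ** (n + 1)
    return round(x / ulp) * ulp  # round() on a Fraction rounds half to even


@dataclass(frozen=True)
class MPFloat:
    p: int  # precision in bits, no exponent bounds

    def rto_prec(self) -> int:
        return self.p + 2

    def round(self, x: Fraction) -> Fraction:
        return x if x == 0 else round_core(x, exponent(x) - self.p)


@dataclass(frozen=True)
class IEEEFloat:
    es: int  # exponent field size
    nbits: int  # total number of bits

    @property
    def max_p(self) -> int:
        return self.nbits - self.es

    @property
    def emin(self) -> int:
        return 2 - (1 << (self.es - 1))  # -1022 for binary64

    def rto_prec(self) -> int:
        return MPFloat(self.max_p).rto_prec()

    def round(self, x: Fraction) -> Fraction:
        if x == 0:
            return x
        e_diff = exponent(x) - self.emin  # negative in the subnormal range
        p = min(self.max_p, self.max_p + e_diff)  # subnormals lose precision
        r = MPFloat(p).round(x)  # reuse the unbounded rounding logic
        emax = 1 - self.emin
        max_finite = (2 - Fraction(2) ** (1 - self.max_p)) * Fraction(2) ** emax
        if abs(r) > max_finite:
            raise OverflowError("rounded result exceeds the largest finite value")
        return r


binary64 = IEEEFloat(es=11, nbits=64)
assert binary64.round(Fraction(1, 10)) == Fraction(0.1)  # matches Python's double
assert binary64.round(Fraction(5, 2**1075)) == Fraction(1, 2**1073)  # subnormal tie
```

<p>Comparing against Python's own doubles works because
  <code class="language-plaintext highlighter-rouge">Fraction(0.1)</code> is the exact value of the double nearest to one tenth.</p>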

<p>What about fixed-point formats?
To support fixed-point numbers,
  we first alter <code class="language-plaintext highlighter-rouge">round_core</code> to accept a parameter <code class="language-plaintext highlighter-rouge">n</code>
  that specifies the rounding position directly
  instead of deriving it from the precision <code class="language-plaintext highlighter-rouge">p</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span> <span class="c1"># added n parameter
</span>  <span class="p">...</span>
</code></pre></div></div>

<p>The caller must specify either <code class="language-plaintext highlighter-rouge">p</code> or <code class="language-plaintext highlighter-rouge">n</code>.
<!-- If both are specified, then the option
  that will preserve the fewest digits is chosen. -->
The arithmetic engine must also be altered
  to support a stopping point <code class="language-plaintext highlighter-rouge">n</code> rather than
  precision <code class="language-plaintext highlighter-rouge">p</code> when performing fixed-point operations.
For example,
  if we want to round with <code class="language-plaintext highlighter-rouge">n=-1</code>, keeping only
  integer digits, then the arithmetic engine must produce
  a round-to-odd result with enough precision
  to preserve all integer digits plus extra digits
  for safe re-rounding.
<!-- One method of supporting fixed-point style computation
  is making an initial precision guess,
  re-computing with the correct precision based
  on the result only when needed. -->
Like before,
  one of <code class="language-plaintext highlighter-rouge">p</code> or <code class="language-plaintext highlighter-rouge">n</code> must be specified.</p>
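
<p>The fixed-point case can be sketched the same way.
The snippet below is an illustration under stated assumptions, not the library's implementation:
  values are exact <code class="language-plaintext highlighter-rouge">Fraction</code>s,
  <code class="language-plaintext highlighter-rouge">rto_fixed</code> and <code class="language-plaintext highlighter-rouge">rne_fixed</code> are hypothetical helpers,
  and both drop every digit at position <code class="language-plaintext highlighter-rouge">n</code> or below.
It shows that a round-to-odd result computed at <code class="language-plaintext highlighter-rouge">n - 2</code>
  re-rounds safely at <code class="language-plaintext highlighter-rouge">n</code>,
  whereas rounding to nearest twice does not.</p>

```python
import math
from fractions import Fraction


def rto_fixed(x: Fraction, n: int) -> Fraction:
    """Round-to-odd, dropping digits at positions <= n."""
    ulp = Fraction(2) ** (n + 1)
    q = math.floor(x / ulp)
    if q * ulp == x:
        return x  # exactly representable
    return (q | 1) * ulp  # inexact: pick whichever neighbor is odd


def rne_fixed(x: Fraction, n: int) -> Fraction:
    """Round to nearest, ties to even, dropping digits at positions <= n."""
    ulp = Fraction(2) ** (n + 1)
    return round(x / ulp) * ulp


x = Fraction(5, 2) + Fraction(1, 1000)  # slightly above the tie 2.5
assert rne_fixed(x, -1) == 3  # direct rounding to an integer
assert rne_fixed(rto_fixed(x, -3), -1) == 3  # round-to-odd at n - 2, then re-round
assert rne_fixed(rne_fixed(x, -3), -1) == 2  # RNE twice: double rounding error
```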

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">module</span> <span class="n">Engine</span><span class="p">:</span>
  <span class="k">def</span> <span class="nf">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="p">...</span>

<span class="k">def</span> <span class="nf">mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">ctx</span><span class="p">):</span>
  <span class="n">p</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="n">ctx</span><span class="p">.</span><span class="n">rto_params</span><span class="p">()</span>
  <span class="n">result</span> <span class="o">=</span> <span class="n">Engine</span><span class="p">.</span><span class="n">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">ctx</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
</code></pre></div></div>
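<p>As a minimal sketch of how a precision <code class="language-plaintext highlighter-rouge">p</code>
  or a stopping point <code class="language-plaintext highlighter-rouge">n</code>
  could be resolved inside such an engine,
  the code below models values as exact <code class="language-plaintext highlighter-rouge">fractions.Fraction</code> numbers.
The helpers <code class="language-plaintext highlighter-rouge">normalized_exponent</code>,
  <code class="language-plaintext highlighter-rouge">stopping_point</code>, and <code class="language-plaintext highlighter-rouge">rto</code>
  are hypothetical names for illustration,
  not part of FPy or MPFR.</p>

```python
from fractions import Fraction

def normalized_exponent(x):
    """Return e such that |x| lies in [2**e, 2**(e + 1)), for x != 0."""
    a, b = abs(x.numerator), x.denominator
    e = a.bit_length() - b.bit_length()
    # the bit-length estimate is either exact or one too large
    if Fraction(2) ** e > abs(x):
        e -= 1
    return e

def stopping_point(x, p=None, n=None):
    """Resolve (p, n) to the position of the first dropped binary digit."""
    if (p is None) == (n is None):
        raise ValueError("specify exactly one of p or n")
    if n is not None:
        return n
    return normalized_exponent(x) - p  # keep p digits below the leading one

def rto(x, p=None, n=None):
    """Round x to odd, dropping every binary digit at position <= n."""
    if x == 0:
        return x
    n = stopping_point(x, p, n)
    unit = Fraction(2) ** (n + 1)      # weight of the last kept digit
    q, rem = divmod(abs(x), unit)      # truncate the magnitude toward zero
    if rem != 0:
        q |= 1                         # inexact: force the last kept digit to be odd
    return (-1 if x < 0 else 1) * q * unit

print(rto(Fraction(5, 2), n=-1))   # keep only integer digits
print(rto(Fraction(1, 3), p=4))    # keep four significant digits
```

<p>Under these assumptions,
  both calls share the same truncate-then-set-sticky logic;
  the only difference is how the stopping point is derived.</p>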

<p>Consider implementing a rounding context
  for an arbitrary-precision fixed-point number
  that must round any digit less significant than
  the <code class="language-plaintext highlighter-rouge">n+1</code>-th digit:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MPFixed</span><span class="p">(</span><span class="n">Round</span><span class="p">):</span>
  <span class="n">n</span><span class="p">:</span> <span class="nb">int</span>  <span class="c1"># first insignificant digit (drop digits &lt;= n)
</span>  <span class="n">rm</span><span class="p">:</span> <span class="n">RoundingMode</span>

  <span class="k">def</span> <span class="nf">rto_params</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n</span> <span class="o">-</span> <span class="mi">2</span>  <span class="c1"># 2 additional bits for safe re-rounding
</span>
  <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">rm</span><span class="p">)</span>
</code></pre></div></div>
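<p>To see why two extra digits are requested,
  the following sketch re-rounds a round-to-odd intermediate
  and compares it against rounding directly.
It models values as exact <code class="language-plaintext highlighter-rouge">fractions.Fraction</code> numbers;
  the functions <code class="language-plaintext highlighter-rouge">round_fixed</code>
  and <code class="language-plaintext highlighter-rouge">round_via_rto</code>
  and the string mode names are assumptions made for this example,
  not FPy's API.</p>

```python
from fractions import Fraction

def round_fixed(x, n, mode):
    """Round x, dropping binary digits at positions <= n.
    mode is one of "RTZ", "RNE", "RTO" (illustrative names)."""
    unit = Fraction(2) ** (n + 1)
    q, rem = divmod(abs(x), unit)
    if rem != 0:
        if mode == "RTO":
            q |= 1                      # sticky bit: make the last digit odd
        elif mode == "RNE":
            half = unit / 2
            if rem > half or (rem == half and q % 2 == 1):
                q += 1                  # round up; ties go to the even neighbor
        # "RTZ": keep the truncated q
    sign = -1 if x < 0 else 1
    return sign * q * unit

def round_via_rto(x, n, mode, extra=2):
    """Round to odd with `extra` additional digits, then re-round at n."""
    wide = round_fixed(x, n - extra, "RTO")
    return round_fixed(wide, n, mode)

# two extra digits: re-rounding agrees with direct rounding
for k in range(-200, 201):
    x = Fraction(k, 37)
    for mode in ("RTZ", "RNE"):
        assert round_via_rto(x, -1, mode) == round_fixed(x, -1, mode)

# one extra digit is not enough: 13/4 re-rounds to 4, but rounds directly to 3
print(round_via_rto(Fraction(13, 4), -1, "RNE", extra=1))
print(round_fixed(Fraction(13, 4), -1, "RNE"))
```

<p>With a single extra digit,
  the sticky bit and the half-way digit collide,
  so an inexact value can masquerade as a tie;
  the second extra digit keeps them apart.</p>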

<p>I could continue to list implementations of
  rounding contexts for other number formats,
  but I believe the pattern is clear:
  to support new number formats, we implement
  a rounding context class that can often reuse
  existing rounding logic.
Machine integers and fixed-point formats
  can compose <code class="language-plaintext highlighter-rouge">MPFixed</code> and apply the appropriate
  overflow behavior.
Floating-point formats like those
  described in the OCP MX standard [5] might
  benefit from a similar approach to
  IEEE 754 floating-point numbers.
Posit numbers [6] are floating-point numbers
  with <em>tapered</em> precision;
  one approach might be to use <code class="language-plaintext highlighter-rouge">MPFloat</code>
  with variable precision based on the magnitude of the number.</p>
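<p>As a rough illustration of composing a fixed-point context
  with overflow behavior,
  the sketch below builds a saturating signed integer
  on top of a stand-in for <code class="language-plaintext highlighter-rouge">MPFixed</code>.
Both classes (<code class="language-plaintext highlighter-rouge">MPFixedSketch</code>
  and <code class="language-plaintext highlighter-rouge">SaturatingInt</code>) are hypothetical,
  operate on <code class="language-plaintext highlighter-rouge">fractions.Fraction</code> values,
  and hard-code round to nearest, ties to even;
  saturation is just one possible overflow policy.</p>

```python
from fractions import Fraction

class MPFixedSketch:
    """Stand-in for MPFixed: round to nearest, ties to even, at stopping point n."""
    def __init__(self, n):
        self.n = n

    def round(self, x):
        unit = Fraction(2) ** (self.n + 1)
        q, rem = divmod(abs(x), unit)
        half = unit / 2
        if rem > half or (rem == half and q % 2 == 1):
            q += 1
        return (-1 if x < 0 else 1) * q * unit

class SaturatingInt:
    """Hypothetical signed nbits-wide integer: round like MPFixed, then clamp."""
    def __init__(self, nbits):
        self.lo = -(1 << (nbits - 1))
        self.hi = (1 << (nbits - 1)) - 1
        self.mp_ctx = MPFixedSketch(-1)  # n = -1 keeps only integer digits

    def round(self, x):
        r = self.mp_ctx.round(Fraction(x))   # reuse the unbounded rounding logic
        return max(self.lo, min(self.hi, r)) # format-specific overflow behavior

ctx = SaturatingInt(8)
print(ctx.round(Fraction(7, 2)), ctx.round(1000), ctx.round(-1000))
```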

<p>Organizing the rounding library into rounding contexts achieves the following benefits:</p>

<ul>
  <li>
    <p><em>Maintainability:</em> Rounding contexts encapsulate
rounding logic and often compose existing
rounding contexts to implement their rounding behavior.</p>
  </li>
  <li>
    <p><em>Extensibility:</em> Implementing a new number format
only requires implementing a new rounding context class,
often reusing existing rounding logic; any instance
of the new rounding context can be used with
all existing mathematical operations.</p>
  </li>
  <li>
    <p><em>Correctness:</em> Rounding can be decomposed
into many smaller, often reusable components;
verifying each component in isolation
increases confidence in the correctness
of each rounding context implementation.</p>
  </li>
</ul>

<h2 id="evaluation">Evaluation</h2>

<p>To demonstrate the benefits of these design principles,
  I applied them to the number library 
  underlying the runtime of the FPy language [7].
FPy is an embedded Python DSL for specifying numerical algorithms,
  with explicit control over rounding via rounding contexts,
  including first-class rounding context values.
The number library supports many families of number formats,
  including IEEE 754 floating-point numbers,
  OCP MX floating-point numbers,
  fixed-point numbers, and more.
The core arithmetic engine uses MPFR [8]
  to implement round-to-odd arithmetic operations.</p>

<h3 id="maintainability">Maintainability</h3>

<p>Due to its modular design,
  the FPy number library remains compact,
  composing core components to support a wide variety
  of number formats and operations.</p>

<p>The FPy number library consists
  of four major components:</p>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">Number</code> module defines FPy’s number representation;</li>
  <li>the <code class="language-plaintext highlighter-rouge">Rounding</code> module implements the core rounding procedure <code class="language-plaintext highlighter-rouge">round_core</code>;</li>
  <li>the <code class="language-plaintext highlighter-rouge">Arithmetic</code> module implements round-to-odd arithmetic using MPFR;</li>
  <li>the <code class="language-plaintext highlighter-rouge">Contexts</code> module implements various rounding contexts.</li>
</ul>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>LOC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Number</td>
      <td>1725</td>
    </tr>
    <tr>
      <td>Rounding</td>
      <td>400</td>
    </tr>
    <tr>
      <td>Arithmetic</td>
      <td>1500</td>
    </tr>
    <tr>
      <td>Contexts</td>
      <td>4350</td>
    </tr>
  </tbody>
  <tbody>
    <tr>
      <td>Total</td>
      <td>7975</td>
    </tr>
  </tbody>
</table>

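<p>The total code size is approximately 8,000 lines of code (LOC),
  with 4,750 lines (the <code class="language-plaintext highlighter-rouge">Rounding</code> and <code class="language-plaintext highlighter-rouge">Contexts</code> modules) dedicated to rounding alone.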
Since FPy’s arithmetic engine uses MPFR,
  the arithmetic engine is relatively small at 1500 lines of code.
Rounding contexts in FPy implement more
  than the essential <code class="language-plaintext highlighter-rouge">round</code> method:
  they also provide methods for encoding and decoding bit patterns,
  constructing special values,
  and more.
The largest rounding context is almost 850 lines of code,
  while the smallest is only 60 lines of code.</p>

<p>An exhaustive list of rounding contexts is below:</p>

<table>
  <thead>
    <tr>
      <th>Rounding Context</th>
      <th>Parameters</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">MPFloat</code></td>
      <td>p</td>
      <td>p-digit floating-point number</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">MPSFloat</code></td>
      <td>p, emin</td>
      <td><code class="language-plaintext highlighter-rouge">MPFloat</code> with minimum exponent</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">MPBFloat</code></td>
      <td>p, emin, max</td>
      <td><code class="language-plaintext highlighter-rouge">MPSFloat</code> with maximum value</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">IEEEFloat</code></td>
      <td>es, nbits</td>
      <td>IEEE 754 floating-point number</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">EFloat</code></td>
      <td>es, nbits, I, O, E</td>
      <td>generalized IEEE 754 format [9]</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">MPFixed</code></td>
      <td>n</td>
      <td>unbounded fixed-point number</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">MPBFixed</code></td>
      <td>n, max</td>
      <td>fixed-point with maximum value</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">Fixed</code></td>
      <td>scale, nbits</td>
      <td><code class="language-plaintext highlighter-rouge">nbits</code> fixed-point with scale \(2^{scale}\)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">SMFixed</code></td>
      <td>scale, nbits</td>
      <td>sign-magnitude <code class="language-plaintext highlighter-rouge">nbits</code> fixed-point</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">ExpFloat</code></td>
      <td>nbits</td>
      <td><code class="language-plaintext highlighter-rouge">nbits</code> exponential floating-point number</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">MPFloat</code> and <code class="language-plaintext highlighter-rouge">MPFixed</code> rounding contexts
  call the core rounding logic directly,
  while other rounding contexts
  compose existing rounding contexts
  to implement their rounding behavior.
Every rounding context may be used
  with any arithmetic operation
  provided by the arithmetic engine.
This compositional design
  keeps the codebase relatively small:
  it is composed of modular components
  rather than monolithic implementations
  where small changes require large modifications
  or modifications across many parts of the codebase.</p>

<h3 id="extensibility">Extensibility</h3>

<p>To demonstrate the extensibility of FPy,
  I will showcase a correctly-rounded
  implementation of <code class="language-plaintext highlighter-rouge">1/x^2</code>
  and the implementation of the <code class="language-plaintext highlighter-rouge">ExpFloat</code> number format.</p>

<h4 id="implementing-1x2">Implementing <code class="language-plaintext highlighter-rouge">1/x^2</code></h4>

<p>A correctly-rounded implementation of <code class="language-plaintext highlighter-rouge">1/x^2</code>
  provides additional accuracy compared
  to composing <code class="language-plaintext highlighter-rouge">1/x</code> with <code class="language-plaintext highlighter-rouge">x^2</code>, each rounded separately.
To implement this operation in FPy,
  we only need to provide its round-to-odd implementation.
To illustrate one such implementation,
  I wrote a digit-recurrence algorithm
  that iteratively computes the significant digits of <code class="language-plaintext highlighter-rouge">1/x^2</code>.</p>

<p>The implementation adapts the classic
  reciprocal digit-recurrence algorithm.
Assuming that</p>

\[x = {(-1)}^s * 1.m * 2^{e} = {(-1)}^s * c * 2^{exp},\]

<p>we know the result is of the form
  \(q * 2^{-2e}\) where \(q = 1/(1.m)^2\).
The algorithm first computes the square of the argument:
  adding the exponents and squaring the significand.
Then,
  it computes the expected exponent of the result,
  in both normalized (\(e\)) and unnormalized (\(exp\)) form,
  and checks for the special case where the (fractional)
  significand is exactly <code class="language-plaintext highlighter-rouge">1.0</code>, in which case the result
  is just \(2^{-2e}\).
Otherwise,
  it performs digit-recurrence to compute all
  but the last significant digit of the result.
Finally,
  the last digit is determined by round-to-odd:
  if we have a non-zero remainder,
  we need to round up to the result with
  odd significand.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">rto_recip_sqr</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
  <span class="k">assert</span> <span class="n">x</span> <span class="o">!=</span> <span class="mi">0</span>

  <span class="c1"># square the argument
</span>  <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x</span>

  <span class="n">e</span> <span class="o">=</span> <span class="o">-</span><span class="n">x</span><span class="p">.</span><span class="n">e</span> <span class="c1"># result normalized exponent
</span>  <span class="n">exp</span> <span class="o">=</span> <span class="n">e</span> <span class="o">-</span> <span class="n">p</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># result unnormalized exponent
</span>
  <span class="n">m</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">c</span> <span class="c1"># argument significand (in 1.M)
</span>  <span class="n">one</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">p</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># representation of 1.0 (fixed-point)
</span>
  <span class="k">if</span> <span class="n">m</span> <span class="o">==</span> <span class="n">one</span><span class="p">:</span>
    <span class="c1"># special case: m = 1 =&gt; q = 1.0
</span>    <span class="n">q</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
  <span class="k">else</span><span class="p">:</span>
    <span class="c1"># general case: m &gt; 1 =&gt; q \in (0.5, 1.0)
</span>    <span class="c1"># step 1. digit recurrence algorithm for 1/m
</span>    <span class="c1"># trick: skip first iteration since we always extract 0
</span>    <span class="n">q</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># quotient
</span>    <span class="n">r</span> <span class="o">=</span> <span class="n">one</span> <span class="o">&lt;&lt;</span> <span class="mi">1</span> <span class="c1"># remainder (constant fold first iter)
</span>    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span> <span class="c1"># compute p - 1 bits
</span>      <span class="n">q</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span>
      <span class="k">if</span> <span class="n">r</span> <span class="o">&gt;=</span> <span class="n">m</span><span class="p">:</span>
        <span class="n">q</span> <span class="o">|=</span> <span class="mi">1</span>
        <span class="n">r</span> <span class="o">-=</span> <span class="n">m</span>
      <span class="n">r</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span>

    <span class="c1"># step 2. generate last digit by inexactness
</span>    <span class="n">q</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span>
    <span class="k">if</span> <span class="n">r</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
      <span class="n">q</span> <span class="o">|=</span> <span class="mi">1</span>

    <span class="c1"># step 3. adjust exponent so that q \in [1.0, 2.0)
</span>    <span class="n">exp</span> <span class="o">-=</span> <span class="mi">1</span>

  <span class="c1"># result (1/x^2 is always positive, so the sign is False)
</span>  <span class="k">return</span> <span class="n">Float</span><span class="p">(</span><span class="bp">False</span><span class="p">,</span> <span class="n">exp</span><span class="p">,</span> <span class="n">q</span><span class="p">)</span>
</code></pre></div></div>

<p>The implementation is clearly naive,
  but it verifiably satisfies the round-to-odd contract
  of the arithmetic engine.
Composing this implementation
  with any rounding context in FPy
  yields a correctly-rounded implementation of <code class="language-plaintext highlighter-rouge">1/x^2</code>
  under that number format and rounding mode.
For example,
  when computing in single-precision floating-point,
  the correctly-rounded result is <code class="language-plaintext highlighter-rouge">0.101321185</code>,
  compared to <code class="language-plaintext highlighter-rouge">0.101321183</code> when computed
  separately as <code class="language-plaintext highlighter-rouge">1/x</code> and <code class="language-plaintext highlighter-rouge">x^2</code>.</p>
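<p>The digit-recurrence core can be checked in isolation
  using plain Python integers.
The sketch below is a standalone variant of the loop above,
  assuming the (already squared) significand is given as an integer
  <code class="language-plaintext highlighter-rouge">m</code> with exactly
  <code class="language-plaintext highlighter-rouge">mbits</code> bits;
  the name <code class="language-plaintext highlighter-rouge">rto_recip_digits</code>
  and its parameters are hypothetical.
It compares the recurrence against exact integer division.</p>

```python
def rto_recip_digits(m, mbits, p):
    """Round-to-odd, p-bit significand q of 1 / (m / 2**(mbits - 1)).
    The result is scaled so that q / 2**p lies in (1/2, 1).
    Assumes m has exactly mbits bits and is not a power of two."""
    one = 1 << (mbits - 1)        # fixed-point representation of 1.0
    assert one < m < 2 * one
    q = 0
    r = one << 1                  # first iteration folded away (digit is always 0)
    for _ in range(1, p):         # p - 1 exact quotient bits
        q <<= 1
        if r >= m:
            q |= 1
            r -= m
        r <<= 1
    q <<= 1
    if r > 0:                     # non-zero remainder: sticky bit makes q odd
        q |= 1
    return q

# compare against truncated exact division followed by a sticky bit
for m in (3, 5, 7, 0b1011, 0b110101):
    for p in (2, 8, 24):
        one = 1 << (m.bit_length() - 1)
        expected = ((one << (p - 1)) // m) * 2 + 1
        assert rto_recip_digits(m, m.bit_length(), p) == expected
```

<p>Because the reciprocal of a significand that is not a power of two
  never terminates in binary,
  the remainder is always non-zero in the general case
  and the result significand is always odd.</p>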

<h4 id="implementing-expfloat">Implementing <code class="language-plaintext highlighter-rouge">ExpFloat</code></h4>

<p>The <code class="language-plaintext highlighter-rouge">ExpFloat</code> number format
  encodes exponential numbers,
  that is, positive power-of-two values of the form <code class="language-plaintext highlighter-rouge">2^{e}</code>,
  where <code class="language-plaintext highlighter-rouge">e</code> is an integer exponent.
Unlike standard floating-point numbers,
  <code class="language-plaintext highlighter-rouge">ExpFloat</code> numbers cannot be negative,
  zero, infinity, or NaN.
In the OCP MX standard [5],
  the “E8M0” format is an 8-bit exponential number
  encoding the possible values of the exponent field
  of a single-precision IEEE 754 floating-point number.
The <code class="language-plaintext highlighter-rouge">ExpFloat</code> rounding context
  is parameterized by <code class="language-plaintext highlighter-rouge">nbits</code>, the total number of bits
  in the representation.
To implement <code class="language-plaintext highlighter-rouge">ExpFloat</code> in FPy,
  we compose with the <code class="language-plaintext highlighter-rouge">MPFloat</code> rounding context implementation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ExpFloat</span><span class="p">(</span><span class="n">Round</span><span class="p">):</span>
  <span class="n">nbits</span><span class="p">:</span> <span class="nb">int</span>  <span class="c1"># total number of bits (nbits &gt;= 2)
</span>  <span class="n">rm</span><span class="p">:</span> <span class="n">RoundingMode</span> <span class="c1"># rounding mode
</span>  <span class="p">...</span> <span class="c1"># overflow/underflow behavior
</span>
  <span class="k">def</span> <span class="nf">mp_ctx</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="c1"># corresponding MPFloat context
</span>    <span class="k">return</span> <span class="n">MPFloat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">rm</span><span class="p">)</span>

  <span class="k">def</span> <span class="nf">rto_params</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">mp_ctx</span><span class="p">().</span><span class="n">rto_params</span><span class="p">()</span>

  <span class="k">def</span> <span class="nf">round</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
    <span class="c1"># user-defined behavior for negative, zero, Inf, or NaN
</span>    <span class="k">if</span> <span class="n">x</span><span class="p">.</span><span class="n">is_nar</span><span class="p">()</span> <span class="ow">or</span> <span class="n">x</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
      <span class="p">...</span>

    <span class="n">r</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">mp_ctx</span><span class="p">().</span><span class="nb">round</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># re-use rounding logic
</span>    <span class="c1"># handle overflow/underflow based on rounding mode
</span>    <span class="p">...</span>
    <span class="k">return</span> <span class="n">r</span>
</code></pre></div></div>

<p>The implementation of <code class="language-plaintext highlighter-rouge">ExpFloat</code> is highly compact:
  it reuses the <code class="language-plaintext highlighter-rouge">MPFloat</code> rounding logic entirely,
  wrapping it with custom behavior for invalid values,
  overflow, and underflow.
In FPy,
  the actual implementation of the <code class="language-plaintext highlighter-rouge">round</code> method
  of the <code class="language-plaintext highlighter-rouge">ExpFloat</code> context comes out to 50 logical lines
  of code (110 in total), with most lines dedicated
  to overflow and underflow handling based on
  the specified rounding mode;
  the full context implementation with encoding logic,
  value constructors, and predicates comes out to 500 lines.</p>
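<p>To make the effect of one-digit precision concrete,
  here is a small sketch that rounds a positive value
  to a power of two and then saturates the exponent.
It works on exact <code class="language-plaintext highlighter-rouge">fractions.Fraction</code> values;
  the function <code class="language-plaintext highlighter-rouge">round_pow2</code>,
  its exponent bounds, the saturating overflow policy,
  and the choice to resolve ties upward
  (what ties-to-even gives if the one-bit significand <code class="language-plaintext highlighter-rouge">1</code> is treated as odd)
  are all assumptions for illustration,
  not a description of FPy's <code class="language-plaintext highlighter-rouge">ExpFloat</code>.</p>

```python
from fractions import Fraction

def round_pow2(x, emin, emax):
    """Round a positive x to the nearest power of two (one significant digit)
    and return its exponent, saturated to [emin, emax]."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("only positive values are representable")
    # normalized exponent e: x lies in [2**e, 2**(e + 1))
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # the candidates are 2**e and 2**(e + 1); their midpoint is 1.5 * 2**e
    if x >= Fraction(3, 2) * Fraction(2) ** e:
        e += 1                     # nearer to the upper candidate (ties go up)
    return max(emin, min(emax, e)) # hypothetical saturating overflow/underflow

print(round_pow2(Fraction(5, 4), -8, 8))  # exponent of the nearest power of two
print(round_pow2(2 ** 200, -8, 8))        # saturates at emax
```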

<h3 id="correctness">Correctness</h3>

<p>Rather than having to test each
  operator implementation in its entirety,
  FPy’s design allows testing to be done compositionally.
Each arithmetic operation in the arithmetic engine
  can be tested independently of the rounding library;
  the rounding library can be tested
  piecewise by verifying each rounding context
  or core rounding logic in isolation.
In particular,
  I use property-based testing
  with the Hypothesis library [10].</p>

<p>Ensuring correctness of the core
  rounding procedure <code class="language-plaintext highlighter-rouge">round_core</code> is especially important,
  as it is the basis for all rounding contexts.
To test <code class="language-plaintext highlighter-rouge">round_core</code>,
  we can verify that it satisfies
  the expected properties of rounding.
For example,
  we expect <code class="language-plaintext highlighter-rouge">round_core</code> to satisfy two properties:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">round_core(x, p, None, rm)</code> has at most <code class="language-plaintext highlighter-rouge">p</code> significant digits;</li>
  <li><code class="language-plaintext highlighter-rouge">round_core(x, None, n, rm)</code> has no significant digits below the <code class="language-plaintext highlighter-rouge">n+1</code>-th digit.</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">),</span> <span class="n">rounding_modes</span><span class="p">())</span>
<span class="k">def</span> <span class="nf">test_round_p</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">rm</span><span class="p">)</span>
    <span class="k">assert</span> <span class="mi">0</span> <span class="o">&lt;=</span> <span class="n">y</span><span class="p">.</span><span class="n">p</span> <span class="o">&lt;=</span> <span class="n">p</span>

<span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=-</span><span class="mi">1024</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">),</span> <span class="n">rounding_modes</span><span class="p">())</span>
<span class="k">def</span> <span class="nf">test_round_n</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">rm</span><span class="p">):</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">rm</span><span class="p">)</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">lo</span> <span class="o">==</span> <span class="mi">0</span>
</code></pre></div></div>

<p>The code above checks the two properties.
The methods <code class="language-plaintext highlighter-rouge">floats</code> and <code class="language-plaintext highlighter-rouge">rounding_modes</code>
  are generators for FPy floating-point number values
  and rounding modes, respectively.
We could also test the claim of the <code class="language-plaintext highlighter-rouge">round</code> method
  that, for <code class="language-plaintext highlighter-rouge">hi, lo = x.split(n)</code>,
  the value of <code class="language-plaintext highlighter-rouge">hi</code> is the round-to-zero result.
Assuming that <code class="language-plaintext highlighter-rouge">x</code> is finite,
  we can verify this property as follows:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=-</span><span class="mi">1024</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">test_round_rtz</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="n">hi</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">round_core</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">RoundingMode</span><span class="p">.</span><span class="n">RTZ</span><span class="p">)</span>
    <span class="k">assert</span> <span class="n">y</span> <span class="o">==</span> <span class="n">hi</span>
</code></pre></div></div>

<p>While these three properties aren’t explicitly
  checking the <em>numerical</em> correctness of <code class="language-plaintext highlighter-rouge">round_core</code>,
  they provide confidence that the implementation
  behaves as expected.
Similar property tests can be devised
  to further test the correctness of <code class="language-plaintext highlighter-rouge">round_core</code>
  and combined with concrete test cases
  or other testing methods to increase confidence
  in its correctness.
Clearly,
  we would also want to verify the <code class="language-plaintext highlighter-rouge">split</code> helper method:
  that it does in fact split the number into two non-overlapping
  groups of digits at the specified point.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=-</span><span class="mi">1024</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">test_split</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">):</span>
    <span class="n">hi</span><span class="p">,</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
    <span class="c1"># check we did not lose information
</span>    <span class="k">assert</span> <span class="n">hi</span> <span class="o">+</span> <span class="n">lo</span> <span class="o">==</span> <span class="n">x</span><span class="p">,</span> <span class="s">"split parts must sum to original"</span>
    <span class="c1"># check we did not gain information
</span>    <span class="k">assert</span> <span class="n">hi</span><span class="p">.</span><span class="n">p</span> <span class="o">&lt;=</span> <span class="n">x</span><span class="p">.</span><span class="n">p</span><span class="p">,</span> <span class="s">"hi must not have more precision than x"</span>
    <span class="k">assert</span> <span class="n">lo</span><span class="p">.</span><span class="n">p</span> <span class="o">&lt;=</span> <span class="n">x</span><span class="p">.</span><span class="n">p</span><span class="p">,</span> <span class="s">"lo must not have more precision than x"</span>
    <span class="c1"># check non-overlapping property
</span>    <span class="k">assert</span> <span class="n">hi</span><span class="p">.</span><span class="n">exp</span> <span class="o">&gt;</span> <span class="n">n</span><span class="p">,</span> <span class="s">"LSB of hi must be &gt; n"</span>
    <span class="k">assert</span> <span class="n">lo</span><span class="p">.</span><span class="n">e</span> <span class="o">&lt;=</span> <span class="n">n</span><span class="p">,</span> <span class="s">"MSB of lo must be &lt;= n"</span>
</code></pre></div></div>
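<p>For readers without FPy at hand,
  the same properties can be exercised on a toy model.
The sketch below defines a hypothetical <code class="language-plaintext highlighter-rouge">split</code>
  over exact <code class="language-plaintext highlighter-rouge">fractions.Fraction</code> values
  and checks it with seeded random inputs
  from the standard <code class="language-plaintext highlighter-rouge">random</code> module
  rather than Hypothesis, to stay dependency-free;
  it is not FPy's implementation.</p>

```python
from fractions import Fraction
import random

def split(x, n):
    """Split x into (hi, lo): hi holds the binary digits above position n,
    lo holds the digits at or below position n."""
    unit = Fraction(2) ** (n + 1)
    q, _ = divmod(abs(x), unit)
    sign = -1 if x < 0 else 1
    hi = sign * q * unit          # truncation toward zero
    lo = x - hi
    return hi, lo

rng = random.Random(0)            # fixed seed for reproducibility
for _ in range(1000):
    x = Fraction(rng.randint(-10**6, 10**6), 2 ** rng.randint(0, 20))
    n = rng.randint(-24, 24)
    hi, lo = split(x, n)
    unit = Fraction(2) ** (n + 1)
    assert hi + lo == x           # no information lost
    assert hi % unit == 0         # hi has no digits at or below n
    assert abs(lo) < unit         # lo has no digits above n
    assert hi == 0 or lo == 0 or (hi > 0) == (lo > 0)  # parts share the sign of x
```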

<p>Critically,
  once <code class="language-plaintext highlighter-rouge">round_core</code> is verified,
  the effort of verifying each rounding context
  mostly involves verifying the <em>unique</em> behavior
  of that rounding context;
  confidence in the core rounding logic
  carries over to each rounding context implementation.</p>

<p>Testing of the arithmetic engine
  can be done independently of the rounding library.
FPy relies on MPFR for round-to-odd arithmetic,
  which has been extensively tested;
  the RLibm system also uses MPFR for verifying
  its transcendental function implementations [4].
Fully verifying custom transcendental function
  implementations is difficult work,
  far beyond the scope of property-based testing.</p>

<p>However,
  we can still verify that the round-to-odd implementation
  of each arithmetic operation meets the round-to-odd contract.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">test_rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
  <span class="n">r_rtz</span> <span class="o">=</span> <span class="n">MPFR</span><span class="p">.</span><span class="n">mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">MPFR</span><span class="p">.</span><span class="n">RTZ</span><span class="p">)</span>
  <span class="n">r_rto</span> <span class="o">=</span> <span class="n">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
  <span class="k">assert</span> <span class="n">r_rtz</span><span class="p">.</span><span class="n">p</span> <span class="o">==</span> <span class="n">p</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"MPFR result must have p - 1 precision"</span>
  <span class="k">assert</span> <span class="n">r_rto</span><span class="p">.</span><span class="n">c</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="p">(</span><span class="mi">1</span> <span class="k">if</span> <span class="n">r_rtz</span><span class="p">.</span><span class="n">inexact</span> <span class="k">else</span> <span class="mi">0</span><span class="p">),</span> <span class="s">"inexact iff LSB is odd"</span>
</code></pre></div></div>

<p>For arithmetic,
  our custom <code class="language-plaintext highlighter-rouge">rto_mul</code> implementation
  can be verified against Python’s native <code class="language-plaintext highlighter-rouge">Fraction</code> implementation
  for finite inputs.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">given</span><span class="p">(</span><span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">floats</span><span class="p">(</span><span class="n">prec_max</span><span class="o">=</span><span class="mi">256</span><span class="p">),</span> <span class="n">st</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">min_value</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">max_value</span><span class="o">=</span><span class="mi">1024</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">test_rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">):</span>
  <span class="n">r_ref</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">as_rational</span><span class="p">()</span> <span class="o">*</span> <span class="n">y</span><span class="p">.</span><span class="n">as_rational</span><span class="p">()</span>
  <span class="n">r_impl</span> <span class="o">=</span> <span class="n">rto_mul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
  <span class="k">assert</span> <span class="n">r_ref</span> <span class="o">==</span> <span class="n">r_impl</span>
</code></pre></div></div>
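
<p>One way to obtain a reference result at a given precision
  is to round the exact product with round-to-odd directly over <code class="language-plaintext highlighter-rouge">Fraction</code>.
The following is a minimal sketch, assuming a hypothetical helper <code class="language-plaintext highlighter-rouge">round_to_odd</code>
  (not part of FPy) that returns an integer significand <code class="language-plaintext highlighter-rouge">c</code>
  and an exponent <code class="language-plaintext highlighter-rouge">exp</code> such that the rounded value is <code class="language-plaintext highlighter-rouge">c * 2**exp</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fractions import Fraction

def round_to_odd(q, p):
    """Round the Fraction q to p significant bits using round-to-odd.

    Hypothetical reference helper: returns (c, exp) with value c * 2**exp.
    """
    a = abs(q)
    # k is within one of floor(log2(a)); the floor division corrects it
    k = a.numerator.bit_length() - a.denominator.bit_length()
    e = k - 1 + int(a // Fraction(2) ** k)
    exp = e - (p - 1)                  # exponent of the least significant bit
    scaled = a / Fraction(2) ** exp
    c = scaled.numerator // scaled.denominator   # truncate toward zero
    if scaled.denominator != 1:
        c |= 1                         # inexact: force the LSB to be odd
    if q != a:
        c = -c                         # restore the sign
    return c, exp
</code></pre></div></div>

<p>For example,
  <code class="language-plaintext highlighter-rouge">round_to_odd(Fraction(5), 2)</code> returns <code class="language-plaintext highlighter-rouge">(3, 1)</code>:
  5 is not representable in 2 bits,
  so the truncated significand 2 is forced to the odd value 3, giving 6.</p>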

<p>Verifying both the arithmetic engine
  and rounding library in a compositional manner
  increases confidence in the correctness
  of the overall number library.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Number libraries face an explosion of complexity
  with new formats, rounding modes, and operations.
The two design principles presented here —
  separating arithmetic from rounding via round-to-odd,
  and organizing rounding logic into composable contexts —
  offer a practical solution to this challenge.
By decoupling these concerns,
  number library developers can achieve
  smaller, more maintainable codebases
  that are more extensible and easier to verify
  compared to traditional monolithic designs.
As the numerical computing landscape continues to evolve,
  these principles are one approach to providing a foundation
  for building maintainable, correct, and extensible
  number libraries that can adapt to future requirements
  without overwhelming their maintainers.</p>

<h2 id="references">References</h2>

<ol>
  <li>
    <p>Sylvie Boldo, Guillaume Melquiond. When double rounding is odd. 17th IMACS World Congress,
Jul 2005, Paris, France. pp. 11. inria-00070603v2</p>
  </li>
  <li>
    <p>Bill Zorn. 2021. Rounding. Ph.D. Dissertation. University of Washington, USA.
<a href="https://hdl.handle.net/1773/48230">https://hdl.handle.net/1773/48230</a></p>
  </li>
  <li>
    <p>Bill Zorn. 2017. Titanic [GitHub].
<a href="https://github.com/billzorn/titanic">https://github.com/billzorn/titanic</a>.
Accessed on November 11, 2025.</p>
  </li>
  <li>
    <p>Jay P. Lim and Santosh Nagarakatte. 2022. One Polynomial Approximation to Produce Correctly Rounded
Results of an Elementary Function for Multiple Representations and Rounding Modes. Proc. ACM Program.
Lang. 6, POPL, Article 3 (January 2022), 28 pages.
<a href="https://doi.org/10.1145/3498664">https://doi.org/10.1145/3498664</a>.</p>
  </li>
  <li>
    <p>Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng,
Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, Stosic Dusan, Venmugil Elango, Maximilian Golub,
Alexander Heinecke, Phil James-Roxby, Dharmesh Jani, Gaurav Kolhe, Martin Langhammer, Ada Li, Levi Melnick,
Maral Mesmakhosroshahi, Andres Rodriguez, Michael Schulte, Rasoul Shafipour, Lei Shao, Michael Siu, Pradeep Dubey,
Paulius Micikevicius, Maxim Naumov, Colin Verrilli, Ralph Wittig, Doug Burger, and Eric Chung. 2023.
Microscaling Data Formats for Deep Learning. arXiv preprint arXiv:2310.10537 (2023).
<a href="https://arxiv.org/abs/2310.10537">https://arxiv.org/abs/2310.10537</a></p>
  </li>
  <li>
    <p>John L. Gustafson. 2017. Beating Floating Point at its Own Game: Posit Arithmetic. Supercomputing Frontiers and Innovations 4, 2 (2017), 71–86.
<a href="https://doi.org/10.14529/jsfi170206">https://doi.org/10.14529/jsfi170206</a></p>
  </li>
  <li>
    <p>Brett Saiki. 2025. FPy [GitHub].
<a href="https://github.com/bksaiki/fpy">https://github.com/bksaiki/fpy</a>.
Accessed on November 12, 2025.</p>
  </li>
  <li>
    <p>MPFR Team. 2025. The GNU MPFR Library.
<a href="https://www.mpfr.org/">https://www.mpfr.org/</a>.
Accessed on November 14, 2025.</p>
  </li>
  <li>
    <p>Brett Saiki. 2025. Taxonomy of Small Floating-Point Formats.
<a href="https://uwplse.org/2025/02/17/Small-Floats.html">https://uwplse.org/2025/02/17/Small-Floats.html</a>.
Accessed on November 12, 2025.</p>
  </li>
  <li>
    <p>Hypothesis Team. 2025. Hypothesis.
<a href="https://hypothesis.works/">https://hypothesis.works/</a>.
Accessed on November 12, 2025.</p>
  </li>
</ol>]]></content><author><name></name></author><category term="blog" /><category term="floating-point" /><category term="rounding" /><summary type="html"><![CDATA[This blog post was split into two parts. This blog post covers rounding in detail and discusses some theory; this blog post focuses on design principles for number libraries.]]></summary></entry><entry><title type="html">Rearchitecting Herbie’s improvement loop</title><link href="https://www.bsaiki.com/blog/2024/01/30/herbie-rearch.html" rel="alternate" type="text/html" title="Rearchitecting Herbie’s improvement loop" /><published>2024-01-30T00:00:00+00:00</published><updated>2024-01-30T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2024/01/30/herbie-rearch</id><content type="html" xml:base="https://www.bsaiki.com/blog/2024/01/30/herbie-rearch.html"><![CDATA[<p>Herbie’s improvement loop has five basic phases:</p>

<ol>
  <li>Expression selection</li>
  <li>Local error analysis</li>
  <li>Rewriting
    <ol type="a">
 <li>Taylor polynomial approximation ("taylor")</li>
 <li>rule-based rewriting ("rr")</li>
 </ol>
  </li>
  <li>Simplify</li>
  <li>Prune</li>
</ol>

<p>First,
  Herbie selects a few expressions (1) from
  its database as a starting point for rewriting.
Then, 
  it analyzes (2) the expression to find AST nodes
  that exhibit high local error.
Based on the analysis,
  the tool utilizes a couple of rewriting techniques
  including polynomial approximation and equality saturation
  via the <a href="https://egraphs-good.github.io/">egg</a> library.
Finally,
  a second rewriting pass simplifies expressions using egg
  before merging the new expressions with the current set
  of alternative implementations, keeping only the best.</p>

<p>This sequential process is repeated for a fixed number
  of iterations before Herbie extracts and fuses expressions
  through an algorithm called regime inference.
All versions of Herbie utilize this architecture
  in one way or another.
In the last couple of versions of the tool,
  I have noticed a tension building between
  the rewriting phases within the improve loop.</p>

<p>Herbie employs two primary IRs which we will call <em>SpecIR</em>
  (programs with real number semantics) and <em>ProgIR</em>
  (programs with floating-point semantics).
We can convert from ProgIR to SpecIR without additional information
  since we simply replace the rounding operation for every
  mathematical operator with the identity function.
However,
  translating from SpecIR to ProgIR requires assigning 
  a finite-precision rounding operation to each mathematical function.</p>
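
<p>The asymmetry between the two directions can be sketched in a few lines.
This is only an illustration with a made-up encoding, not Herbie’s actual IR:
  ProgIR nodes are assumed to be tuples <code class="language-plaintext highlighter-rouge">(op, format, args...)</code>,
  SpecIR nodes are tuples <code class="language-plaintext highlighter-rouge">(op, args...)</code>,
  and <code class="language-plaintext highlighter-rouge">lift</code> and <code class="language-plaintext highlighter-rouge">lower</code> are hypothetical names.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def lift(expr):
    """ProgIR to SpecIR: erase the number format on every operator."""
    if not isinstance(expr, tuple):
        return expr                    # variables and literals are unchanged
    op, _fmt, *args = expr             # drop the rounding information
    return (op, *[lift(a) for a in args])

def lower(expr, fmt):
    """SpecIR to ProgIR: the caller must supply a format (here, one for all)."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    return (op, fmt, *[lower(a, fmt) for a in args])

prog = ("+", "binary64", ("sqrt", "binary32", "x"), 1)
spec = lift(prog)                      # ("+", ("sqrt", "x"), 1)
relowered = lower(spec, "binary64")    # the binary32 choice is not recovered
</code></pre></div></div>

<p>Lifting needs no extra input,
  while lowering has to be told which format to use;
  the round trip does not reproduce the original program.</p>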

<p>Returning to Herbie’s rewriting phases,
  observe the approximate type signatures of the three rewriting steps:</p>

<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">val</span> <span class="n">taylor</span> <span class="o">:</span> <span class="nc">SpecIR</span> <span class="o">-&gt;</span> <span class="nc">SpecIR</span> <span class="kt">list</span>
<span class="k">val</span> <span class="n">rr</span> <span class="o">:</span> <span class="nc">ProgIR</span> <span class="o">-&gt;</span> <span class="nc">ProgIR</span> <span class="kt">list</span>
<span class="k">val</span> <span class="n">simplify</span> <span class="o">:</span> <span class="nc">ProgIR</span> <span class="o">-&gt;</span> <span class="nc">ProgIR</span> <span class="kt">list</span>
</code></pre></div></div>

<p>Why the difference? In truth, engineering challenges.
The egg-based rewriters <em>rr</em> and <em>simplify</em> are newer and
  an important part of the Pareto-Herbie (Pherbie) design from 2021.
On the other hand,
  the first version of Herbie included <em>taylor</em> as far back as 2014.
Clearly,
  ProgIR is richer than SpecIR since it contains number format information.
For this reason,
  both rewriting and precision tuning are performed in <em>rr</em>
  (see <a href="https://herbie.uwplse.org/arith21-paper.pdf">Pareto-Herbie</a>)
  and a shim is required for <em>taylor</em>.</p>

<p>But in light of a recent conversation I had with a fellow PLSE member,
  I hypothesize that this architecture is likely incorrect,
  or at the least, problematic.
Of course,
  we could dream of the “sufficiently smart” numerical compiler
  that resembles the following architecture:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SpecIR, error bound ~~~~~~ magic ~~~~~~&gt; C, machine code, etc.
</code></pre></div></div>

<p>But just as a compiler achieves its goal via numerous lowering passes,
  so too should a system like Herbie.
In fact,
  Herbie really is just a messy version of this compiler:
  it takes a mathematical specification and number representations
  and produces floating-point code.
In light of these observations,
  I propose rearchitecting Herbie’s improvement loop.
Expanding on the numerical compiler diagram,
  we would implement</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SpecIR ~~~~~~ rewrite ~~~~~~&gt; SpecIR ~~~~~~ lowering ~~~~~~&gt; ProgIR
</code></pre></div></div>
<p>There are two important phases:</p>
<ul>
  <li><em>rewrite</em>: either equivalent rewrites or
 approximations producing a program over real numbers</li>
  <li><em>lowering</em>: assignment of rounding operations
 (e.g. precision tuning), operator selection, and other decisions
 required for finite-precision computation.</li>
</ul>

<p>Of course,
  just as in Herbie’s current architecture,
  we use this process potentially multiple times with expression selection
  at the beginning and pruning at the end.
Therefore,
  the whole process looks something like the following:</p>

<ol>
  <li>Expression selection</li>
  <li>Local error analysis</li>
  <li>Lifting from ProgIR to SpecIR</li>
  <li>Rewriting
    <ol type="a">
 <li>Taylor polynomial approximation ("taylor")</li>
 <li>equivalent (real-number semantics) rewrites ("rr")</li>
 </ol>
  </li>
  <li>Lowering from SpecIR to ProgIR
    <ol type="a">
 <li>operator selection, e.g. reciprocal vs. division</li>
 <li>precision tuning</li>
 </ol>
  </li>
  <li>Pruning</li>
</ol>

<p>How (4) and (5) interact is an open question.
An interesting proposal would be to seed an egraph with
  the starting subexpressions from (3) and the approximations from (4a),
  run rewrites from (4b), and extract from the same egraph in (5a)
  and possibly (5b).
Additionally,
  it is unclear if separating (4) and (5) will prevent us from
  finding certain rewrites.</p>

<p>The new design requires that egg be used for both SpecIR and ProgIR
  which suggests splitting the rules into those over SpecIR and
  those over ProgIR.
The majority of rewriting will be done over the reals,
  but a small set of rewrites may be used in (5) for operator selection,
  e.g. posit-quire operations.
Constant folding should only be done over SpecIR
  since these programs are over the reals.
If rewriting ProgIR is done in an egraph,
  it will be minimal (no constant folding) and will just be
  for extraction via Herbie’s cost model.</p>

<p>Based on this new design,
  we have a number of insights and engineering improvements:</p>
<ul>
  <li>if we wish to rewrite expressions over the reals,
  then the rewrites should be performed over real number programs</li>
  <li>operator (implementation) selection <em>is</em> precision tuning;
  choosing to approximate <code class="language-plaintext highlighter-rouge">1/x</code> (real number program) with <code class="language-plaintext highlighter-rouge">recip_f64(x)</code>
  is both a choice of syntax and rounding.</li>
  <li><em>egg</em> becomes a pure syntactic rewriter;
  it need not know about representations;
  Herbie’s egg interface can be slimmed down,
  e.g. no rule expansion and a simpler <em>EggIR</em>.</li>
</ul>
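
<p>The second insight can be made concrete with a small sketch.
It assumes the tuple encoding <code class="language-plaintext highlighter-rouge">(op, args...)</code> for real-number programs
  and uses hypothetical operator names
  (<code class="language-plaintext highlighter-rouge">div_f64</code> and <code class="language-plaintext highlighter-rouge">div_f32</code> are invented for illustration);
  it is not how Herbie enumerates implementations.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def lower_candidates(expr):
    """List candidate floating-point implementations of a real division."""
    op, a, b = expr
    if op != "/":
        raise ValueError("only division is handled in this sketch")
    # each candidate fixes both the syntax and the rounding
    candidates = [("div_f64", a, b), ("div_f32", a, b)]
    if a == 1:
        candidates.append(("recip_f64", b))   # only valid for 1/x
    return candidates

choices = lower_candidates(("/", 1, "x"))
</code></pre></div></div>

<p>Every entry in <code class="language-plaintext highlighter-rouge">choices</code> picks an operator and a precision at once,
  so a cost model that extracts one of them is performing precision tuning as a side effect.</p>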

<p>To summarize,
  I propose a new design for Herbie’s improvement loop.
Problematic expressions should be identified via
  the normal expression selection and local error analysis phases.
These expressions should be lifted to purely real-number programs
  and passed through two phases. 
The first phase applies equivalent rewrites (over the real numbers)
  and polynomial approximations.
The second phase lowers to actual floating-point operations
  via operator selection and precision tuning.
Finally,
  the same pruning operation is performed to shrink the set of
  useful alternative implementations.</p>]]></content><author><name></name></author><category term="blog" /><category term="herbie" /><summary type="html"><![CDATA[Herbie’s improvement loop has five basic phases:]]></summary></entry><entry><title type="html">A One-Year Retrospective on Minim</title><link href="https://www.bsaiki.com/blog/2021/10/02/minim-retrospective.html" rel="alternate" type="text/html" title="A One-Year Retrospective on Minim" /><published>2021-10-02T00:00:00+00:00</published><updated>2021-10-02T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2021/10/02/minim-retrospective</id><content type="html" xml:base="https://www.bsaiki.com/blog/2021/10/02/minim-retrospective.html"><![CDATA[<p>September 20th marked the one year anniversary of Minim’s creation.
It initially started as a pandemic-fueled side project based on
  an online guide of making a small Lisp language.
Since then, Minim has undergone significant changes in design
  and scope.
Today Minim is a fully-interpreted language, complete with a small
  standard library, syntax macros (not quite R6RS compliant),
  and many more features.
To celebrate the past year of development, I have decided to describe
  my experience designing and implementing the language.</p>

<h4 id="the-early-days">The Early Days</h4>

<p>The initial versions of Minim were hacky at best.
The goal of development during Minim 0.1.x was expanding
  the number of available types.
By version 0.1.2, the language supported booleans, exact
  and inexact numbers, symbols, strings, pairs, lists,
  user-defined functions, hash tables, vectors,
  and sequences.
Nearly all of these types since then have undergone
  structural changes.
The set of procedures to create and modify these
  types was minimal at the time, just enough
  to be useful.</p>

<p>The worst feature of this era was the owner-reference
  system that kept track of objects, so it could free
  resources when objects went out of scope.
Initially, new copies were created every time objects were
  referenced, but with an “improvement”, objects
  could be set as owners of their data or references
  of another object’s data.
This required unnecessary amounts of copying objects,
  annoying equality checks, and hours of debugging
  segmentation faults.
What I didn’t know at the time was Scheme implementations
  usually never copy immutable objects and use
  garbage collectors to know when to free memory.
This issue was not fixed until I finally added my own
  garbage collector in version 0.3.0.</p>

<h4 id="expansion">Expansion</h4>

<p>With a plethora of types in the language, the focus for
  Minim 0.2.x shifted to increasing the number of
  procedures available.
First, file reading was added to support a standard
  library that was not hard-coded in C.
In these versions, file reading occurred separately
  from the existing parser, tracking parentheses
  and ignoring comments to extract a hopefully
  parsable string.
Initially, reading source code was rough
  since the reader would spontaneously fail, leading
  to hours of searching for and fixing bugs, and often
  to my frustration, the creation of new bugs.
In the same release, I added errors with backtracing
  and syntax with source locations which proved to be
  helpful when modifying the standard library.</p>

<p>Minim 0.2.1 and 0.2.2 were expansions of the math
  and list libraries.
I put as many of the new procedures as I could into the standard
  library rather than adding them to the already large
  set of primitive procedures.
I also added arity and type errors so that these new
  procedures could reject arguments upfront and
  print a more descriptive error.
By then, issues with parsing motivated me to remake the
  parser from scratch which turned out to be a huge
  improvement for the language.
Since then the parser has barely changed and no longer
  hinders development.</p>

<h4 id="standardization">Standardization</h4>

<p>Development of Minim 0.3.x felt quite different from the
  previous versions.
Primarily, I began reading the existing standards on Scheme
  including R4RS, R5RS, and R6RS.
For the most part, I ignored these standards before because
  I wasn’t interested in sticking to an existing blueprint
  for Minim.
However, my design choices began to feel flawed and haphazard
  and the way forward was becoming unclear.
Implementing procedures and features described in the standards
  became the main goal during this era of development.</p>

<p>The first few changes were performance-related: the
  addition of a garbage collector and the implementation
  of tail calling.
I detailed the design of the garbage collector in
  my previous blog post which you can read <a href="./2021-07-14-minim-gc.html">here</a>.
These changes caused a significant slowdown in performance,
  but accelerated the pace of development since I didn’t have to
  worry about memory management, aside from a few bugs that
  needed patching.</p>

<p>With garbage collection in place, I turned to important features
  of any Scheme language: quoting and syntax macros.
Syntax macros turned out to be quite the headache;
  my initial implementation continually caused Minim to crash.
I eventually reimplemented macros from scratch, but I found out that even my
  most recent attempt is still not compliant with the standard.
As of today, I consider that part of the project to be
  “good enough”, but fixing it is still definitely on my to-do list.</p>

<p>After the syntax macro mess, I moved on and added
  types like characters, records, and file ports; additional
  procedures for strings and lists; and more features
  like multi-valued expressions, continuations, and multi-signature
  functions with <code class="language-plaintext highlighter-rouge">case-lambda</code>.</p>

<h4 id="today">Today</h4>

<p>And with that, we have finally reached the present day.
As you have just read, the development of Minim has been a
  long and winding path, from basic and hacky beginnings
  to a much more robust implementation.
If there’s anything I’ve learned, it’s that a small idea can
  be fully realized with time and effort.</p>

<p>As for those following in my footsteps: I’d highly recommend reading
  standards for an existing language.
You might not be able to implement everything, but I found that
  following the Scheme standards made Minim a much more robust and
  sensical language.
Standards are an important part of language design no matter how
  long and dense they might seem.</p>

<h4 id="the-future">The Future</h4>

<p>What’s next for Minim?
As I’ve mentioned before, syntax macros are not Scheme-compliant,
  but there are many more features of Minim that have deviated
  from the standard, usually because of my naive choices.
A couple examples include the use of <code class="language-plaintext highlighter-rouge">def</code> instead of <code class="language-plaintext highlighter-rouge">define</code>
  and the syntax of function definitions.
I am still weighing whether or not to resolve these design differences.</p>

<p>More recently, I have implemented caching for Minim source code files: syntax
  macros are applied and the resulting desugared code is emitted for later use.
Testing shows that this decreases the number of expressions executed on boot significantly.
On top of caching, I have implemented constant folding since certain expressions
  can be resolved before runtime.
In particular, my implementation of <code class="language-plaintext highlighter-rouge">case-lambda</code> is egregious in its use of
  constant expressions for resolving arity.</p>
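
<p>As a rough illustration of the idea,
  here is a minimal constant-folding pass over nested lists standing in for S-expressions.
It is written in Python for brevity rather than in Minim’s C,
  and the <code class="language-plaintext highlighter-rouge">fold</code> function and its table of foldable operators are assumptions,
  not Minim’s implementation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import operator

# binary integer operators that are safe to evaluate ahead of time
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Replace constant subexpressions with their values, bottom-up."""
    if not isinstance(expr, list) or not expr:
        return expr                           # atoms and () are left alone
    head, *args = expr
    args = [fold(a) for a in args]            # fold the arguments first
    if (isinstance(head, str) and head in FOLDABLE and len(args) == 2
            and all(isinstance(a, int) for a in args)):
        return FOLDABLE[head](*args)          # both operands are constants
    return [head, *args]

folded = fold(["+", "x", ["*", 2, 3]])        # ["+", "x", 6]
</code></pre></div></div>

<p>Only the subexpression with constant operands is evaluated;
  anything that mentions a variable is rebuilt unchanged.</p>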

<p>My target goal is to implement a native-code compiler for Minim,
  but having just begun a compilers course this quarter,
  I have a feeling this may be a long ways away.
Until then, I will focus on implementing more of the standard.
Please check out the source <a href="https://github.com/bksaiki/Minim">repository</a>
  for Minim to see my progress and give the language a try!</p>]]></content><author><name></name></author><category term="blog" /><category term="minim" /><summary type="html"><![CDATA[September 20th marked the one year anniversary of Minim’s creation. It initially started as an pandemic-fueled side project based on an online guide of making a small Lisp language. Since then, Minim has undergone significant changes in design and scope. Today Minim is a fully-interpreted language, complete with a small standard library, syntax macros (not quite R6RS compliant), and many more features. To celebrate the past year of development, I have decided to describe my experience designing and implementing the language.]]></summary></entry><entry><title type="html">Garbage Collection in Minim</title><link href="https://www.bsaiki.com/blog/2021/07/14/minim-gc.html" rel="alternate" type="text/html" title="Garbage Collection in Minim" /><published>2021-07-14T00:00:00+00:00</published><updated>2021-07-14T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2021/07/14/minim-gc</id><content type="html" xml:base="https://www.bsaiki.com/blog/2021/07/14/minim-gc.html"><![CDATA[<p>Currently, one of the key issues in Minim is the lack of garbage collection.
To handle precise memory management, I implemented a weird owner-and-reference
  system that has been quite tiresome to keep track of.
To solve this, I am developing a garbage collector for Minim, and I recently
  <a href="https://github.com/bksaiki/Minim/pull/5">merged</a>
  a working version into the project for testing.
The garbage collector greatly simplifies the code base by getting rid
  of calls to <code class="language-plaintext highlighter-rouge">free()</code> and removes most instances of copying objects.
In terms of performance, Minim is now much slower, anywhere between 2x and 4x.
Much of this slowdown is because allocated objects are no
  longer freed precisely, and there is additional overhead
  from tracking allocations.</p>

<p>The Minim GC is a generational, conservative, mark-and-sweep garbage
  collector implemented in C based on the
  <a href="https://github.com/orangeduck/tgc">Tiny Garbage Collector</a>,
  a minimal garbage collector.
It expands on the TGC by separating allocations into two generations,
  young and old.
The younger generation is marked and swept every cycle, by default
  every time new allocations total 8 MB in size.
Any surviving allocations are moved to the older generation.
The older generation is marked and swept every 15 cycles.</p>
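
<p>The collection schedule can be modeled with a small simulation.
This is a sketch in Python, not the C implementation:
  the class and method names are hypothetical,
  the set of reachable objects is passed in rather than found by scanning,
  and the 8 MB allocation trigger is left out.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class GenerationalHeap:
    OLD_PERIOD = 15     # the old generation is collected every 15 cycles

    def __init__(self):
        self.young = set()
        self.old = set()
        self.cycles = 0

    def alloc(self, obj_id):
        self.young.add(obj_id)          # new objects start out young

    def collect(self, reachable):
        """Run one cycle; reachable is the set of ids found by marking."""
        self.cycles += 1
        # sweep the young generation and promote the survivors
        self.old.update(self.young.intersection(reachable))
        self.young.clear()
        # the old generation is only swept on every OLD_PERIOD-th cycle
        if self.cycles % self.OLD_PERIOD == 0:
            self.old.intersection_update(reachable)
</code></pre></div></div>

<p>An object that survives one cycle and then becomes unreachable
  stays in the old generation until the next multiple of 15 cycles,
  which is the price paid for not scanning old objects every time.</p>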

<p>The stack is swept from the “bottom” of the stack, provided during
  initialization, to a local address within the sweeping function.
A neat trick for forcing pointers held in registers onto the stack, no matter
  the optimization level, is to use the <code class="language-plaintext highlighter-rouge">setjmp(jmp_buf env)</code> function
  from <code class="language-plaintext highlighter-rouge">setjmp.h</code>, which saves the values of registers to set a jump point.
It makes sense that this function exists, but I honestly didn’t realize
  it could be used for this mechanism.</p>

<p>Like the TGC, the Minim GC provides a way to associate a destructor with
  an allocated block in case memory allocated outside of the garbage
  collector needs to be freed upon sweeping.
However, it also provides a way to associate a “marking” function with each
  memory block.
These functions precisely mark internal pointers so that the garbage collector
  doesn’t naively mark every possible pointer within the memory block.
These functions can cause issues if a developer changes a struct but not its
  respective marking function; I ran into this issue on a few occasions
  when switching Minim to use the garbage collector.
In addition, atomic allocation macros are provided to allocate data not containing
  any pointers.</p>
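
<p>The role of the marking functions can be sketched as follows.
This is a simplified Python model with hypothetical names,
  where the heap is a dictionary from ids to <code class="language-plaintext highlighter-rouge">(kind, payload)</code> pairs
  and each kind may register a function that lists the ids its payload refers to.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def mark(roots, heap, markers):
    """Return the set of ids reachable from roots.

    heap maps id to (kind, payload); markers maps kind to a function
    returning the ids referenced by a payload of that kind.
    """
    reachable = set()
    stack = list(roots)
    while stack:
        obj_id = stack.pop()
        if obj_id in reachable or obj_id not in heap:
            continue                        # already visited, or not a heap id
        reachable.add(obj_id)
        kind, payload = heap[obj_id]
        marker = markers.get(kind)
        if marker is not None:              # atomic kinds have no marker
            stack.extend(marker(payload))
    return reachable

heap = {1: ("pair", (2, 3)), 2: ("int", 5), 3: ("pair", (2, 2)), 4: ("int", 0)}
markers = {"pair": lambda payload: list(payload)}
live = mark([1], heap, markers)             # object 4 is unreachable
</code></pre></div></div>

<p>If the layout of a pair changed without updating its entry in <code class="language-plaintext highlighter-rouge">markers</code>,
  live objects would be missed,
  which mirrors the struct and marking function mismatch described above.</p>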

<p>Although naive, the Minim GC has made working with the Minim code base much easier
  since I don’t have to spend hours dealing with segmentation faults.
If I had to make Minim all over again, I definitely would have stuck with using a
  garbage collector from the beginning.
It eliminates my owner-and-reference system that was causing quite
  a headache when dealing with copies of objects.
Now the same object can be in a list, hash table, and vector, all at the same time,
  since immutability is respected.</p>

<p>As for performance issues, I managed to claw back almost all of the slowdown by
  adding tail call optimization which ensures the amount of allocated
  memory and the size of the stack remain quite small.
This garbage collector along with tail call optimization, quasiquoting, and
  syntax macros will be in the next release, and is currently in the main branch.</p>]]></content><author><name></name></author><category term="blog" /><category term="minim" /><summary type="html"><![CDATA[Currently, one of the key issues in Minim is the lack of garbage collection. To handle precise memory management, I implemented a weird owner-and-reference system that has been quite tiresome to keep track of. To solve this, I am developing a garbage collector for the Minim, and I recently merged a working version into the project for testing. The garbage collector greatly simplifies the code base by getting rid of calls to free() and removes most instances of copying objects. In terms of performance, Minim is now much slower, anywhere between 2x and 4x. Much of this slow down is beacuse allocated objects are no longer freed precisely, and there is additional overhead from tracking allocations.]]></summary></entry><entry><title type="html">Developing Minim</title><link href="https://www.bsaiki.com/blog/2021/03/07/minim.html" rel="alternate" type="text/html" title="Developing Minim" /><published>2021-03-07T00:00:00+00:00</published><updated>2021-03-28T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2021/03/07/minim</id><content type="html" xml:base="https://www.bsaiki.com/blog/2021/03/07/minim.html"><![CDATA[<p>Recently, I released version 0.2.0 of <a href="https://github.com/bksaiki/Minim">Minim</a>, my hobby-language that I’ve been developing since last fall. It’s inspired by my time working with Racket (now more than a year). Despite having no formal experience with programming languages, I’ve made quite a bit of progress, although I still have lots to learn. Here are a few of my thoughts from developing the language.</p>

<h4 id="from-static-to-dynamic-types">From Static to Dynamic Types</h4>

<p>My language of choice for developing Minim is C.
Not the best choice for a smooth experience, but it’s been quite interesting.
The most annoying part is obviously dealing with pointers;
  nearly every bug I’ve run into is a segmentation fault.
The most interesting concept, however, is the interaction of Minim’s dynamic
  type system with C’s static type system.</p>

<p>Mainly, how do you define a dynamic type system in a language
  that is statically typed, especially without classes like
  in C++ or Java?
The best solution that I’ve found is void pointers and enums.
  Intuitively, store a void pointer and an enumerated value such
  that the value hints at what is stored at the pointer.
With this method, we can store anything from an integer to
  a string in a unified object, or the <code class="language-plaintext highlighter-rouge">MinimObject</code> in the
  case of the Minim language.
Minim can easily verify the type
  of inputs when invoking a function and can throw an error when necessary.</p>
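
<p>As a rough illustration of the idea (not Minim’s actual source),
  a tagged object might look like the sketch below.
The names <code class="language-plaintext highlighter-rouge">Object</code>, <code class="language-plaintext highlighter-rouge">ObjType</code>,
  and the helper functions are hypothetical,
  and only two types are shown.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tags; a real interpreter would have many more. */
typedef enum { OBJ_INT, OBJ_STRING } ObjType;

/* A unified object: the tag says how to interpret `data`. */
typedef struct {
    ObjType type;
    void *data;
} Object;

Object *make_int(long value) {
    Object *obj = malloc(sizeof *obj);
    long *slot = malloc(sizeof *slot);
    assert(obj != NULL && slot != NULL);
    *slot = value;
    obj->type = OBJ_INT;
    obj->data = slot;
    return obj;
}

Object *make_string(const char *text) {
    Object *obj = malloc(sizeof *obj);
    char *copy = malloc(strlen(text) + 1);
    assert(obj != NULL && copy != NULL);
    strcpy(copy, text);
    obj->type = OBJ_STRING;
    obj->data = copy;
    return obj;
}

/* Type predicates, in the spirit of number? and string?. */
int is_int(const Object *obj) { return obj->type == OBJ_INT; }
int is_string(const Object *obj) { return obj->type == OBJ_STRING; }

/* Checked accessor: returns 0 and leaves *out alone on a type mismatch. */
int get_int(const Object *obj, long *out) {
    if (!is_int(obj)) return 0;
    *out = *(const long *)obj->data;
    return 1;
}

void free_object(Object *obj) {
    free(obj->data);
    free(obj);
}
```

<p>The checked accessor is where the dynamic type check happens:
  the tag is inspected at run time before the void pointer is cast,
  so a built-in function can report an error instead of reading the wrong kind of data.</p>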

<p>Of course, all of this is abstracted away when running Minim.
The user can still identify the type of an object with procedures
  such as <code class="language-plaintext highlighter-rouge">number?</code> or <code class="language-plaintext highlighter-rouge">string?</code>, but they need not worry about
  its initialization, storage, and deletion.
An added benefit is that objects of different types can be stored
  in the same list, such as <code class="language-plaintext highlighter-rouge">(list 1 'a "str")</code>.
This concept may be trivial, but it is quite interesting to see
  that a simple construct can do away with the restrictions
  of statically-typed code.</p>

<h4 id="parsing-parsing-parsing">Parsing, Parsing, Parsing</h4>

<p>The worst experience, by far, has been figuring out how
  to parse an input string without any fancy libraries.
For the most part, if we can munge the string,
  it’s not too awful.
Split the string by looking for spaces and parentheses/brackets.
Newlines and tabs are really just spaces in Minim,
  so, unlike in Python, they carry no special meaning.
All we have to do is keep track of quoted strings,
  Lisp-style quotes, and comments.</p>

<p>Unfortunately, as of version 0.2.0, Minim supports executing
  expressions from a file, and errors without syntax
  information are quite unhelpful.
Therefore, we need to store the syntax information of where
  expressions and functions are located.
We can’t munge the string before parsing since we lose
  information about row and column numbers of characters.
We must (a) track row and column information,
  (b) ignore distracting whitespace, and
  (c) still tokenize words.
I managed to pull it off with a reader that’s around
  200 lines long and a parser of similar length,
  but it’s quite a mess.</p>
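
<p>To make requirements (a) through (c) concrete,
  here is a minimal sketch of a reader that records where each token starts.
It is not Minim’s reader:
  the <code class="language-plaintext highlighter-rouge">Reader</code> and <code class="language-plaintext highlighter-rouge">Token</code> types are made up for illustration,
  and quoted strings, Lisp-style quotes, and comments are left out to keep it short.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *src;   /* input text */
    size_t pos;        /* index of the next unread character */
    size_t row, col;   /* 1-based position of that character */
} Reader;

typedef struct {
    char text[64];     /* token text, truncated if longer */
    size_t row, col;   /* where the token starts */
} Token;

void reader_init(Reader *r, const char *src) {
    assert(src != NULL);
    r->src = src;
    r->pos = 0;
    r->row = 1;
    r->col = 1;
}

/* Consume one character, updating row and column as we go. */
static char advance(Reader *r) {
    char c = r->src[r->pos++];
    if (c == '\n') { r->row++; r->col = 1; }
    else           { r->col++; }
    return c;
}

static int is_space(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }
static int is_paren(char c) { return c == '(' || c == ')' || c == '[' || c == ']'; }

/* Read the next token; returns 0 (and leaves *tok alone) at end of input. */
int next_token(Reader *r, Token *tok) {
    size_t len = 0;
    while (is_space(r->src[r->pos])) advance(r);
    if (r->src[r->pos] == '\0') return 0;
    tok->row = r->row;
    tok->col = r->col;
    if (is_paren(r->src[r->pos])) {
        tok->text[len++] = advance(r);
    } else {
        while (r->src[r->pos] != '\0' && !is_space(r->src[r->pos])
               && !is_paren(r->src[r->pos])) {
            char c = advance(r);
            if (len + 1 < sizeof tok->text) tok->text[len++] = c;
        }
    }
    tok->text[len] = '\0';
    return 1;
}

int token_is(const Token *tok, const char *text) {
    return strcmp(tok->text, text) == 0;
}
```

<p>Because whitespace is skipped one character at a time rather than stripped up front,
  the row and column counters stay accurate,
  which is exactly what is lost when the string is munged before parsing.</p>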

<p>Nevertheless, the results have been impressive.
Backtraces from errors are quite detailed and they
  print out “stack frames” with the following format:
  <code class="language-plaintext highlighter-rouge">&lt;file&gt; &lt;row&gt;:&lt;col&gt; &lt;name&gt;</code>.
Its crudeness will definitely be a problem in the future,
  but for now it works.</p>
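
<p>For a sense of what producing one such line involves,
  here is a small sketch that formats a single frame as
  <code class="language-plaintext highlighter-rouge">&lt;file&gt; &lt;row&gt;:&lt;col&gt; &lt;name&gt;</code>.
The <code class="language-plaintext highlighter-rouge">Frame</code> struct and its field names are assumptions for illustration,
  not taken from Minim.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical frame record; the field names are illustrative. */
typedef struct {
    const char *file;
    size_t row, col;
    const char *name;
} Frame;

/* Write "<file> <row>:<col> <name>" into buf; returns 1 if it fit. */
int format_frame(const Frame *f, char *buf, size_t size) {
    int n;
    assert(f != NULL && buf != NULL);
    n = snprintf(buf, size, "%s %zu:%zu %s", f->file, f->row, f->col, f->name);
    return n >= 0 && (size_t)n < size;
}
```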

<h4 id="problems-to-come">Problems to Come</h4>

<p>The most broken part of Minim is the blurry
  distinction between owners and references,
  and the lack of separation between mutable
  and immutable objects.</p>

<p>The first problem borrows a concept from C++.
In brief, certain objects are the original owners
  of their information.
It’s better to pass a reference to that object
  to a function rather than an entire copy, since
  a reference takes less space and is less cumbersome than
  the raw pointers that dominate C.
Minim also implements this system, since it’s less
  resource intensive to <em>not</em> copy lists every time
  we use them for read-only purposes.
However, without a static type system, the use for
  this is more subtle.
Every built-in function in Minim needs to have
  two separate cases for owners and references,
  and this parallel strategy causes many issues.
  It’s not ideal, but it seems to work.</p>
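
<p>A minimal sketch of the owner/reference idea, assuming a simple flag on each handle.
This is only a guess at the general shape, not Minim’s implementation;
  the <code class="language-plaintext highlighter-rouge">List</code> type and function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    long *items;   /* storage, shared between an owner and its references */
    size_t len;
    int owner;     /* 1: this handle frees `items`; 0: it only borrows them */
} List;

List make_list(size_t len) {
    List l;
    assert(len > 0);
    l.items = calloc(len, sizeof *l.items);
    assert(l.items != NULL);
    l.len = len;
    l.owner = 1;
    return l;
}

/* Borrow without copying; valid only while the owner is alive. */
List make_ref(const List *src) {
    List r = *src;
    r.owner = 0;
    return r;
}

/* Deep copy; the result owns separate storage. */
List copy_list(const List *src) {
    List c = make_list(src->len);
    for (size_t i = 0; i < src->len; i++) c.items[i] = src->items[i];
    return c;
}

/* Even cleanup needs both cases: owners free, references just let go. */
void release(List *l) {
    if (l->owner) free(l->items);
    l->items = NULL;
    l->len = 0;
}
```

<p>Even in this tiny example, <code class="language-plaintext highlighter-rouge">release</code> has to branch on the flag,
  and nothing stops a reference from outliving its owner.
Multiply that by every built-in function and the maintenance burden becomes clear.</p>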

<p>The second problem is a distinction that Racket makes clear:
  there are immutable objects and there are mutable objects.
In Racket, the two different types of objects each have
  their own set of procedures.
Initially, I chose not to care since it seemed
  cumbersome to have two sets of procedures.
Invoking an in-place update of <em>any</em> hash
  table seemed reasonable, but it’s problematic
  with function calls and references.
There are a number of examples that I can think of
  that will break Minim.</p>

<h4 id="conclusion-and-future-work">Conclusion and Future Work</h4>

<p>Although I spent much of this blog talking about
  what is bad, there has been a lot of good.
Most importantly,
  I’ve learned quite a bit about developing something
  as complex as a “lightweight” programming language.
As of the time I’m writing this, the repository is
  well over 10,000 lines of code and the language
  contains over 120 procedures and numerous types
  like lists, strings, hash tables, vectors, and more.</p>

<p>In the next update, there will be considerably more
  procedures in the “standard library” I’ve been developing.
They will mostly include math functions like <code class="language-plaintext highlighter-rouge">gcd</code> and <code class="language-plaintext highlighter-rouge">lcm</code>.
More list procedures are a must since lists form the backbone
  of any Lisp/Scheme language.
Additionally, I need to resolve the issues mentioned above
  and make procedures proper closures
  (storing the environment in which they were created).</p>
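
<p>For readers unfamiliar with the term,
  a closure pairs a procedure’s code with the environment it was defined in.
The sketch below shows one way that could look in C;
  the <code class="language-plaintext highlighter-rouge">Env</code> and <code class="language-plaintext highlighter-rouge">Closure</code> types are hypothetical
  and hold a single binding per frame to stay short.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One binding per frame keeps the sketch short; real environments hold many. */
typedef struct Env {
    const char *name;
    long value;
    const struct Env *parent;   /* enclosing scope, NULL at the top level */
} Env;

/* A closure pairs code with the environment it was created in. */
typedef struct {
    const char *param;               /* name of the single parameter */
    long (*body)(const Env *env);    /* code to run */
    const Env *env;                  /* captured defining environment */
} Closure;

/* Walk outward through enclosing scopes; returns 1 if the name is bound. */
int env_lookup(const Env *env, const char *name, long *out) {
    for (; env != NULL; env = env->parent) {
        if (strcmp(env->name, name) == 0) {
            *out = env->value;
            return 1;
        }
    }
    return 0;
}

/* Bind the argument in a fresh frame whose parent is the captured
   environment, not the caller's environment. */
long call_closure(const Closure *c, long arg) {
    Env frame = { c->param, arg, c->env };
    assert(c->body != NULL);
    return c->body(&frame);
}

/* Body of a hypothetical (lambda (x) (+ x n)), where n is a free variable. */
long add_n_body(const Env *env) {
    long x = 0, n = 0;
    env_lookup(env, "x", &x);
    env_lookup(env, "n", &n);
    return x + n;
}
```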

<p>This blog was long-winded mostly because there
  was a lot to talk about.
I hope to write more about Minim in the near future as
  it develops from the small fledgling it is today to
  a language that is full and robust.
Stay tuned for more.</p>]]></content><author><name></name></author><category term="blog" /><category term="minim" /><summary type="html"><![CDATA[Recently, I released version 0.2.0 of Minim, my hobby-language that I’ve been developing since last fall. It’s inspired by my time working with Racket (now more than a year). Despite having no formal experience with programming languages, I’ve made quite a bit of progress, although I still have lots to learn. Here are a few of my thoughts from developing the language.]]></summary></entry><entry><title type="html">Publishing the “generic-flonum” package</title><link href="https://www.bsaiki.com/blog/2021/01/21/generic-flonum.html" rel="alternate" type="text/html" title="Publishing the “generic-flonum” package" /><published>2021-01-21T00:00:00+00:00</published><updated>2021-03-07T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2021/01/21/generic-flonum</id><content type="html" xml:base="https://www.bsaiki.com/blog/2021/01/21/generic-flonum.html"><![CDATA[<p>As a side effect of recent work, I created an alternate MPFR interface in Racket. I mentioned in the previous post that I was planning on extracting that code into a package for public use. As of today, that library has officially been cleaned up, documented, and published in the Racket Package Index. To try it out, install Racket and run <code class="language-plaintext highlighter-rouge">raco pkg install generic-flonum</code>. Here is an excerpt from the documentation.</p>

<blockquote>
  <p>While the <a href="https://docs.racket-lang.org/math/bigfloat.html">math/bigfloat</a> interface is sufficient for most high-precision computing, it is lacking in a couple areas. Mainly, it does not properly emulate subnormal arithmetic or allow the exponent range to be changed.</p>

  <p>Normally, neither of these problems causes concern. For example, if a user intends to find an approximate value for some computation on the reals, then subnormal arithmetic or a narrower exponent range is not particularly useful. However, if a user wants to know the result of a computation specifically in some format, say half-precision, then math/bigfloat is insufficient.</p>

  <p>At half-precision, <code class="language-plaintext highlighter-rouge">(exp -10)</code> and <code class="language-plaintext highlighter-rouge">(exp 20)</code> evaluate to <code class="language-plaintext highlighter-rouge">4.5419e-05</code> and <code class="language-plaintext highlighter-rouge">+inf.0</code>, respectively. On the other hand, evaluating <code class="language-plaintext highlighter-rouge">(bfexp (bf -10))</code> and <code class="language-plaintext highlighter-rouge">(bfexp (bf 20))</code> with <code class="language-plaintext highlighter-rouge">(bf-precision 11)</code> returns <code class="language-plaintext highlighter-rouge">(bf "4.5389e-5")</code> and <code class="language-plaintext highlighter-rouge">(bf "#e4.8523e8")</code>. While the latter results are certainly more accurate, they do not reflect proper behavior in half-precision. The standard bigfloat library does not subnormalize the first result (no subnormal arithmetic), nor does it recognize the overflow in the second result (fixed exponent range).</p>

  <p>This library fixes the issues mentioned above by automatically emulating subnormal arithmetic when necessary and providing a way to change the exponent range. In addition, the interface is quite similar to math/bigfloat, so it will feel familiar to anyone who has used the standard bigfloat library before. There are also a few extra operations from the C math library such as gflfma, gflmod, and gflremainder that the bigfloat library does not support.</p>

  <p>See <a href="https://docs.racket-lang.org/math/bigfloat.html">math/bigfloat</a> for more information on bigfloats.</p>
</blockquote>
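
<p>The numbers in the excerpt can be checked by hand.
The smallest normal half-precision value is 2^-14 (about 6.1035e-5),
  so exp(-10), roughly 4.53999e-5, falls in the subnormal range,
  where representable values are multiples of 2^-24.
Dividing gives about 761.68, which rounds to 762,
  and 762 × 2^-24 is about 4.5419e-5.
Likewise, exp(20) is about 4.85e8, far above the largest half-precision value of 65504,
  so it overflows to infinity.
The sketch below does the same arithmetic in C for positive, finite inputs.
The function <code class="language-plaintext highlighter-rouge">round_to_half</code> is written for this post
  and is not part of generic-flonum or MPFR.</p>

```c
#include <assert.h>
#include <math.h>   /* only for the INFINITY macro; no libm functions are called */

/* Round a small non-negative value to the nearest integer, ties to even. */
static double round_ties_even(double q) {
    long long n = (long long)q;          /* truncate toward zero */
    double frac = q - (double)n;
    if (frac > 0.5 || (frac == 0.5 && (n & 1))) n += 1;
    return (double)n;
}

/* Round a positive, finite double to the nearest half-precision value. */
double round_to_half(double x) {
    const double max_half = 65504.0;     /* largest finite half-precision value */
    double ulp = 1.0 / 16777216.0;       /* 2^-24, the subnormal spacing */
    double r;
    assert(x > 0.0 && x < INFINITY);
    /* Grow the spacing until x fits in 11 significand bits. Subnormal inputs
       never enter the loop, so they keep the fixed 2^-24 spacing. */
    while (x >= ulp * 2048.0) ulp *= 2.0;
    r = round_ties_even(x / ulp) * ulp;
    return r > max_half ? INFINITY : r;
}
```

<p>The fixed starting spacing is what emulates subnormal arithmetic,
  and the final comparison is what emulates the limited exponent range;
  these are the two behaviors the excerpt says math/bigfloat lacks.</p>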

<p>To read more of the documentation, please visit <a href="https://docs.racket-lang.org/generic-flonum/index.html">here</a>. The source code for the package can be found at <a href="https://github.com/bksaiki/generic-flonum">this</a> repository. In the future, I plan on integrating this into the FPBench reference interpreter, so we can finally emulate subnormal arithmetic for various floating-point formats correctly. This has been an outstanding issue for a long time.</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[As a side effect of recent work, I created an alternate MPFR interface in Racket. I mentioned in the previous post that I was planning on extracting that code into a package for public use. As of today, that library has officially been cleaned up, documented, and published in the Racket Package Index. To try it out, install Racket and run raco pkg install generic-flonum. Here is an excerpt from the documentation.]]></summary></entry><entry><title type="html">First Entry</title><link href="https://www.bsaiki.com/blog/2020/11/11/update.html" rel="alternate" type="text/html" title="First Entry" /><published>2020-11-11T00:00:00+00:00</published><updated>2020-12-16T00:00:00+00:00</updated><id>https://www.bsaiki.com/blog/2020/11/11/update</id><content type="html" xml:base="https://www.bsaiki.com/blog/2020/11/11/update.html"><![CDATA[<p>Note: This is my first entry for my blog, a compendium of my thoughts, articles, references, etc. Is blog even the correct word? I’m still trying to figure out what exactly this will be. Hopefully, I’ll be able to write here often.</p>

<p>Life Update: I’m currently in the first quarter of my sophomore year and things are going fairly well considering the state of the world. My work in Herbie has started moving much faster with my exploration of multi-precision expressions and cost/accuracy Pareto curves. Today, I began work on a generic IEEE-754 floating-point plugin for Herbie which will be very useful in the future.</p>

<p>One of the problems I had run into was the lack of subnormals in Racket’s bigfloat library, the language’s MPFR interface. Fortunately, the FFI procedures buried in the math library overall proved to be quite useful, so now I have a parallel set of operations that correctly handles subnormals and checks the range on the result. I emphasize the fact that it is parallel, since I still have access to the original, arguably more practical, set of operations that allow floats to have exponents somewhere on the order of 2^15 and ignore subnormals entirely. On a side note, it would be excellent to see support for limiting the exponent size and doing subnormal arithmetic in Racket’s library; however, I may be the only person in the world currently in need of such features, so… maybe not?</p>

<p>Another problem is the mapping of ordinals since such procedures are done on the Racket side of things, but they have no regard for subnormal numbers and seem to be causing issues on the Herbie side of things. Now that I think about it, it seems that all my problems have to do with subnormal floating point numbers. Go figure.</p>

<p>I see some possibility in separating out this new set of MPFR bindings and creating a generic floating point package that allows a user to specify a float with a given significand and exponent. It would be useful in FPBench since some of the tooling is very broken.</p>

<p>On a separate note: My website is quite new. It’s only been up for one week!! I’ve been working on it recently, and things are looking good. I began using Racket to generate my html pages, so everything is much easier to work with.</p>

<p>Anyways, that’s all my thoughts for today. Maybe next time I’ll write something more technical rather than rambly… Stay tuned for more!!</p>]]></content><author><name></name></author><category term="blog" /><summary type="html"><![CDATA[Note: This is my first entry for my blog, a compendium of my thoughts, articles, references, etc. Is blog even the correct word? I’m still trying to figure out what exactly this will be. Hopefully, I’ll be able to write here often.]]></summary></entry></feed>