How to make gcc on SUN calculate floating points the same way as in Linux

I have a project where I have to perform some mathematics calculations with double variables.
The problem is that I get different results on SUN Solaris 9 and Linux.
There are a lot of ways (explained here and on other forums) to make Linux behave like Sun, but not the other way around.
I cannot touch the Linux code, so it is only SUN I can change.
Is there any way to make SUN behave like Linux?
The code I run (compiled with gcc on both systems):
int hash_func(char *long_id)
{
    double product, lnum = 0.0, gold;   /* lnum must start at 0 for the digit accumulation below */

    while (*long_id)
        lnum = lnum * 10.0 + (*long_id++ - '0');
    printf("lnum => %20.20f\n", lnum);

    lnum = lnum * 10.0E-8;
    printf("lnum => %20.20f\n", lnum);

    gold = 0.6125423371582974;
    product = lnum * gold;
    printf("product => %20.20f\n", product);
    ...
}
If the input is 339886769243483, the output on Linux is:
lnum => 339886769243**483**.00000000000000000000
lnum => 33988676.9243**4829473495483398**
product => 20819503.600158**59827399253845**
When on SUN:
lnum => 339886769243483.00000000000000000000
lnum => 33988676.92434830218553543091
product => 20819503.600158**60199928283691**
Note: the results are not always different; most of the time they are the same. Only 10 of the 60000 15-digit numbers have this problem.
Please help!!!

The real answer here is another question: why do you think you need this? There may be a better way to accomplish what you're trying to do that doesn't depend on intricate details of platform floating-point. Having said that...
It's unfortunate that you can't change the Linux code, since it's really the Linux results that are deficient here. The SUN results are as good as they could possibly be: they're correctly rounded; each multiplication gives the unique (in this case) C double that's closest to the result. In contrast, the first Linux multiplication does not give a correctly rounded result.
Your Linux results come from a 32-bit system on x86 hardware, right? The results you show are consistent with, and likely caused by, the phenomenon of 'double rounding': the result of the first multiplication is first rounded to 64-bit precision (the precision used internally by the Intel x87 FPU), and then re-rounded to the usual 53-bit precision of a double. Most of the time (around 1999 times out of 2000 or so on average) this double round has the same effect as a single round to 53-bit precision would have had, but occasionally it can produce a different result, and that's what you're seeing here.
As you say, there are ways to fix the Linux results to match the Solaris ones: one of these is to use appropriate compiler flags to force the use of SSE2 instructions for floating-point operations if possible. The recent 4.5 release of gcc also fixes the difference by means of a new -fexcess-precision flag, though the fix may impact performance when not using SSE2.
[Edit: after several rereads of the gcc manuals, the gcc-patches mailing list thread at http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00105.html, and the related gcc bug report, it's still not clear to me whether use of -fexcess-precision=standard does in fact eliminate double rounding on x87 systems; I think the answer depends on the value of FLT_EVAL_METHOD. I don't have a 32-bit Linux/x86 machine handy to test this on.]
But I don't know how you'd fix the Solaris results to match the Linux ones, and I'm not sure why you'd want to: you'd be making the Solaris results less accurate instead of making the Linux results more accurate.
[Edit: caf has a good suggestion here. On Solaris, try deliberately using long double for intermediate results, then forcing back to double. If done right, this should reproduce the double rounding effect that you're seeing in Linux.]
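A minimal C sketch of that idea (untested on Solaris, and only meaningful if long double is actually wider than double on the target, e.g. 80-bit x87 extended; on SPARC long double is typically a 128-bit quad, in which case the product of two doubles is exact and no double rounding is introduced):
/* Hedged sketch of caf's suggestion: round each intermediate product to
 * long double first, then narrow it to double, to mimic the two-step
 * rounding a 32-bit x87 Linux build performs. */
#include <stdio.h>

int main(void)
{
    volatile double lnum = 339886769243483.0;   /* exact as a double */
    volatile double gold = 0.6125423371582974;

    long double wide = (long double)lnum * 10.0E-8L;   /* first rounding  */
    volatile double narrowed = (double)wide;           /* second rounding */
    printf("lnum    => %20.20f\n", narrowed);

    wide = (long double)narrowed * (long double)gold;
    printf("product => %20.20f\n", (double)wide);
    return 0;
}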
See David Monniaux's excellent paper The pitfalls of verifying floating-point computations for a good explanation of double rounding. It's essential reading after the Goldberg article mentioned in an earlier answer.

Two things:
you need to read this: http://docs.sun.com/source/806-3568/ncg_goldberg.html
you need to decide what your requirements are for numerical precision in your application

These values differ by less than one part in 2^52. But a double-precision number has just 52 bits behind the radix point. So they differ by just the last bit. It may be that one machine isn't always rounding correctly, but you're doing two multiplications, so the answer is going to have that much error anyway.
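To make "the last bit" concrete: the two product values above appear to be adjacent doubles, which you can check with nextafter. A small sketch, assuming the 20-decimal-place printouts round-trip back to the original doubles (they comfortably should):
/* Check whether the Linux and SUN products differ by exactly one ULP.
 * Compile with the math library, e.g.: gcc ulp_check.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double linux_product = 20819503.60015859827399253845;
    double sun_product   = 20819503.60015860199928283691;

    /* nextafter() returns the next representable double toward +infinity. */
    if (nextafter(linux_product, INFINITY) == sun_product)
        printf("The two results are adjacent doubles (1 ULP apart).\n");
    else
        printf("The two results differ by more than 1 ULP.\n");
    return 0;
}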

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, an example being something like a dot product - I pass in arrays A, B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating-point (FP) operations are not associative (though they are commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers, but the widely used IEEE-754 standard is: it specifies the precision of 32-bit and 64-bit operations, including rounding modes. x86-64 processors are IEEE-754 compliant, and mainstream compilers (e.g. GCC, Clang and MSVC) are IEEE-754 compliant by default. ICC is not compliant by default, since it assumes FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make the same assumption in order to speed up code; these are generally combined with others, such as the assumption that no FP value is NaN (e.g. -ffast-math). Such flags break IEEE-754 compliance, but they are often used in the 3D and video-game industries to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very high overhead (see this for more information).
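For reference, a minimal C sketch of changing the (per-thread) rounding mode via the standard <fenv.h> interface; the exact code in the linked answer may differ:
/* The rounding mode is thread-local FP state; <fenv.h> (C99) lets you change it. */
#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON   /* tell the compiler we touch the FP environment */

int main(void)
{
    volatile double x = 1.0, y = 3.0;   /* volatile: keep the division at run time */

    fesetround(FE_DOWNWARD);
    printf("round toward -inf: %.17g\n", x / y);

    fesetround(FE_UPWARD);
    printf("round toward +inf: %.17g\n", x / y);

    fesetround(FE_TONEAREST);           /* restore the default */
    return 0;
}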
Assuming IEEE-754 compliance is not broken, the rounding mode is the same, and the threads do the operations in the same order, the results should be identical to within at least 1 ULP. In practice, if both threads run code compiled by the same mainstream compiler, the results should be exactly the same.
The catch is that using multiple threads often results in a non-deterministic order of the applied FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such issues, because the order of the operations changes from run to run. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables, or more generally any atomic operations that could result in a different ordering. The same applies to locks or any other synchronization mechanism.
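As an illustration of what "static partitioning" means here, a hypothetical POSIX-threads sketch of a reproducible parallel sum: each thread reduces a fixed slice into its own slot, with no atomics, and the partial sums are combined in a fixed order, so the FP operation order (and hence the result) is the same on every run. Compile with something like gcc -O2 -pthread.
#include <stdio.h>
#include <pthread.h>

#define N        1000000
#define NTHREADS 4

static float data[N];
static double partial[NTHREADS];

struct slice { int begin, end, slot; };

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double acc = 0.0;
    for (int i = s->begin; i < s->end; i++)
        acc += data[i];              /* fixed order within the slice */
    partial[s->slot] = acc;          /* no atomics, no shared accumulator */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0f / (float)(i + 1);

    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    int chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        s[t].slot  = t;
        pthread_create(&tid[t], NULL, sum_slice, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];         /* combined in a fixed order */
    }
    printf("sum = %.17g\n", total);
    return 0;
}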
The same is true for GPUs. In fact, this problem is very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing a fast reduction is complex (though it is more deterministic) and atomic operations are pretty fast on modern GPUs (since they use dedicated, efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

FORTRAN 77 Read from .mtx file [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 6 years ago.
I've been trying to use Fortran for my research project, with the GNU Fortran compiler (gfortran), latest version,
but I've been encountering some problems in the way it processes real numbers. If you have for example the code:
program test
implicit none
real :: y = 23.234, z
z = y * 100000
write(*,*) y, z
end program
You'll get as output:
23.23999 2323400.0
I find this really strange.
Can someone tell me what's exactly happening here? Looking at z I can see that y does retain its precision, so for calculations that shouldn't be a problem I suppose. But why is the output of y not exactly the same as the value that I've specified, and what can I do to make it exactly the same?
This is not a problem - all you see is floating-point representation of the number in the computer. The computer cannot handle real numbers exactly, but only approximations of them. A good read about this can be found here: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
Simply by replacing real with double precision, you can increase the number of significant decimal places from about six to about 15 on most platforms.
The general issue is not specific to Fortran; it is the representation of base-10 real numbers in another base of finite precision. This computer science question has been asked many times here.
For the specifically Fortran aspects, the declaration "real" will likely give you a single precision floating point. As will expressing a constant as "23.234" without a type qualifier. The constant "100000" without a decimal point is an integer so the expression "y * 100000" is causing an implicit conversion of an integer to a real because "y" is a real variable.
For some previous discussions of these issues, see Extended double precision, Fortran: integer*4 vs integer(4) vs integer(kind=4) and Is There a Better Double-Precision Assignment in Fortran 90?
The problem here is not with Fortran, in fact it is not a problem at all. This is just a feature of floating-point arithmetic. If you think about how you would represent 23.234 as a 'single float' in binary, you would see that the number has to be saved to only so many decimals of precision.
The thing to remember about floating-point numbers is: numbers that look round and even in base 10 probably won't be in binary.
For a brief overview of floating-point topics, check the Wikipedia article. And for a VERY thorough explanation, check out the canonical paper by Goldberg (PDF).

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating point numbers give exactly the same results on any modern PC and in most common programming languages? (C++, Java, C#, etc.). We can assume that we are operating on numbers and the result is also a number (no NaNs, INFs and so on).
I know there are two very similar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
Your compiler might have switches to control behavior like this.
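For example, here is a hedged C sketch contrasting a separate multiply-then-add with C99's fma(); whether a compiler contracts b*c + d into a fused operation on its own is exactly the kind of behavior such switches (e.g. gcc's -ffp-contract) control:
/* b*c + d done with two roundings vs. one fused rounding (C99 fma()).
 * Compile with the math library, e.g.: gcc fma_demo.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Values chosen so that b*c is not exactly representable as a double. */
    volatile double b = 1.0 + 0x1p-27;   /* 1 + 2^-27, exact as a double */
    volatile double c = 1.0 + 0x1p-27;
    volatile double d = -1.0;

    double separate = b * c + d;     /* b*c rounded to double, then d added
                                        (unless the compiler contracts this itself) */
    double fused    = fma(b, c, d);  /* single rounding of the exact b*c + d */

    printf("separate: %.17g\n", separate);
    printf("fused:    %.17g\n", fused);
    return 0;
}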
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen e.g. on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register, that is stored in memory in the unoptimized build. Since Intel FPU registers are 80 bit, and double variables are 64 bit, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.).
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
Assuming you are talking about applying multiple operations, I do not think you will get exact numbers. CPU architecture, the compiler used, and optimization settings will change the results of your computations.
If you mean the exact order of operations (at the assembly level), I think you will still get variations. For example, Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated.)
The same C# program can bring out different numerical results on the same PC, once compiled in debug mode without optimization, second time compiled in release mode with optimization enabled. That's my personal experience. We did not regard this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.

Multiplying 23 bit datatypes in a system with no long long

I am trying to implement floating point operations in a microcontroller and so far I have had ample success.
The problem lies in the way I do the multiplication. On my computer the following works fine:
unsigned long long gig, mm1, mm2;
unsigned long m, m1, m2;

mm1 = f1.float_parts.mantissa;
mm2 = f2.float_parts.mantissa;
m1  = f1.float_parts.mantissa;
m2  = f2.float_parts.mantissa;

gig = mm1 * mm2; // this works fine, I get all the bits I need since they are all long long, but it won't work on the mcu
gig = m1 * m2;   // this does not work; to be precise it gives only the 32 least significant bits, but it works on the mcu
So you can see that my problem is that the microcontroller will throw an undefined reference to __muldi3 if I try the gig = mm1*mm2 there.
And if I try with the smaller data types, it only keeps the least significant bits, which I don't want. I need the 23 most significant bits of the product.
Does anyone have any ideas as to how I can do this?
Apologies for the short answer, I hope that someone else will take the time to write a fuller explanation, but basically you do exactly what you do when you multiply two big numbers by hand on paper! It's just that instead of working in base 10, you work in base 256. That is, treat your numbers as byte vectors, and do with each byte what you do to a digit when you "hand multiply".
The comments in the FreeBSD implementation of __muldi3() have a good explanation of the required procedure, see muldi3.c. If you want to go straight to the source (always a good idea!), according to the comments this code was based on an algorithm described in Knuth's The Art of Computer Programming vol. 2 (2nd ed), section 4.3.3, p. 278. (N.B. the link is for the 3rd edition.)
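A small C sketch of that technique (hypothetical helper name; it assumes an unsigned 32-bit type such as uint32_t from <stdint.h>): split each operand into 16-bit halves, form the four partial products, and recombine them with carries, exactly as in the hand multiplication described above. From the two 32-bit halves of the 48-bit mantissa product you can then pick out the most significant 24 bits.
/* Full 32x32 -> 64-bit multiply using only 32-bit arithmetic,
 * by working in base 2^16 (four partial products plus carries). */
#include <stdio.h>
#include <stdint.h>

static void mul32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* middle column plus the carry out of the low 16 bits */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (p0 & 0xFFFFu) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}

int main(void)
{
    uint32_t hi, lo;
    mul32_wide(0x00FFFFFFu, 0x00FFFFFFu, &hi, &lo);  /* two 24-bit mantissas */
    printf("product = 0x%04lX%08lX\n",
           (unsigned long)hi, (unsigned long)lo);    /* expect 0xFFFFFE000001 */
    return 0;
}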
Back on the Intel 8088 (the original PC CPU and the last CPU I wrote assembly code for) when you multiplied two 16 bit numbers (32 bits? whoow) the CPU would return 2 16 bit numbers in two different registers - one with the 16 msb and one with the lsb.
You should check the hardware capabilities of your micro-controller; maybe it has a similar setup (obviously you'll need to code this in assembly if it does).
Otherwise you'll have to implement multiplication on your own.

What programming languages support arbitrary precision arithmetic?

What programming languages support arbitrary precision arithmetic and could you give a short example of how to print an arbitrary number of digits?
Some languages have this support built in. For example, take a look at java.math.BigDecimal in Java, or decimal.Decimal in Python.
Other languages frequently have a library available to provide this feature. For example, in C you could use GMP or other options.
The "Arbitrary-precision software" section of this article gives a good rundown of your options.
Mathematica.
N[Pi, 100]
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117068
Not only does Mathematica have arbitrary precision, but by default it has infinite precision. It keeps things like 1/3 as rationals, and it even maintains expressions involving things like Sqrt[2] symbolically until you ask for a numeric approximation, which you can have to any number of decimal places.
In Common Lisp,
(format t "~D~%" (expt 7 77))
"~D~%" in printf format would be "%d\n". Arbitrary precision arithmetic is built into Common Lisp.
Smalltalk supports arbitrary precision Integers and Fractions from the beginning.
Note that the GNU Smalltalk implementation uses GMP under the hood.
I'm also developing ArbitraryPrecisionFloat for various dialects (Squeak/Pharo, VisualWorks and Dolphin), see http://www.squeaksource.com/ArbitraryPrecisionFl.html
Python has such ability. There is an excellent example here.
From the article:
from math import log as _flog
from decimal import getcontext, Decimal

def log(x):
    if x < 0:
        return Decimal("NaN")
    if x == 0:
        return Decimal("-inf")
    getcontext().prec += 3
    eps = Decimal("10")**(-getcontext().prec+2)
    # A good initial estimate is needed
    r = Decimal(repr(_flog(float(x))))
    while 1:
        # exp() here must be a Decimal-aware exponential, defined alongside log() in the linked article
        r2 = r - 1 + x/exp(r)
        if abs(r2-r) < eps:
            break
        else:
            r = r2
    getcontext().prec -= 3
    return +r
Also, the Python decimal quick-start tutorial discusses arbitrary precision: http://docs.python.org/lib/decimal-tutorial.html
and describes getcontext:
the getcontext() function accesses the current context and allows the settings to be changed.
Edit: Added clarification on getcontext.
Many people recommended Python's decimal module, but I would recommend using mpmath over decimal for any serious numeric uses.
COBOL
77 VALUE PIC S9(4)V9(4).
a signed variable with 4 integer digits and 4 decimals.
PL/1
DCL VALUE DEC FIXED (4,4);
:-) I can't remember the other old stuff...
Jokes apart, as my examples show, I think you shouldn't choose a programming language based on a single feature. Virtually all decent and recent languages support fixed precision in some dedicated classes.
Scheme (a variation of Lisp) has a capability called 'bignum'. There are many good Scheme implementations available, both full language environments and embeddable scripting options.
A few I can vouch for:
MitScheme (also referred to as gnu scheme)
PLTScheme
Chezscheme
Guile (also a gnu project)
Scheme 48
In Ruby, whole numbers and floating point numbers (mathematically speaking: rational numbers) are by default not strictly tied to the classical CPU-related limits. In Ruby, integers and floats are automatically and transparently switched to "bignum" types if their size exceeds the maximum of the classical sizes.
One probably wants to use some reasonably optimized and "complete", multifarious, math library that uses the "bignums". This is where the Mathematica-like software truly shines with its capabilities.
As of 2011, Mathematica is extremely expensive and terribly restricted from a hacking and reshipping point of view, especially if one wants to ship the math software as a component of a small, low-priced web application or an open source project. If one needs to do only raw number crunching, where visualizations are not required, then there exists a very viable alternative to Mathematica and Maple: the REDUCE Computer Algebra System, which is Lisp based, open source, mature (for decades) and under active development (in 2011). Like Mathematica, REDUCE uses symbolic calculation.
To give Mathematica its due, as of 2011 it seems to me that Mathematica is the best at interactive visualizations, but I think that from a programming point of view there are more convenient alternatives, even if Mathematica were an open source project. To me Mathematica also seems a bit slow and not suitable for working with huge data sets. It seems to me that the niche of Mathematica is theoretical math, not real-life number crunching. On the other hand, the publisher of Mathematica, Wolfram Research, hosts and maintains one of the highest quality, if not THE highest quality, free-to-use math reference sites on planet Earth: http://mathworld.wolfram.com/
The online documentation system that comes bundled with the Mathematica is also truly good.
When talking about speed, it's worth mentioning that REDUCE is said to run even on a Linux router. REDUCE itself is written in Lisp, but it comes with two of its very own, specific Lisp implementations: one implemented in Java and the other in C. Both of them work decently, at least from a math point of view. REDUCE has two modes: the traditional "math mode" and a "programmers mode" that allows full access to all of the internals through the language REDUCE itself is written in: Lisp.
So, my opinion is that if one looks at the amount of work it takes to write math routines, not to mention all of the symbolic calculations that are MATURE in REDUCE, then one can save an enormous amount of time (decades, literally) by doing most of the math part in REDUCE, especially given that it has been tested and debugged by professional mathematicians over a long period of time, used for symbolic calculations on old-era supercomputers for real professional tasks, and works wonderfully, and truly fast, on modern low-end computers. Nor has it crashed on me, unlike at least one commercial package that I don't want to name here.
http://www.reduce-algebra.com/
To illustrate where symbolic calculation is essential in practice, consider the example of solving a system of linear equations by matrix inversion. To invert a matrix, one needs to find determinants. The rounding that takes place with the directly CPU-supported floating-point types can turn a matrix that theoretically has an inverse into a matrix that does not have one. This in turn introduces a situation where most of the time the software might work just fine, but if the data is a bit "unfortunate" the application crashes, despite the fact that algorithmically there's nothing wrong with the software other than the rounding of floating-point numbers.
Absolute-precision rational numbers do have a serious limitation: the more computation is performed with them, the more memory they consume. As of 2011 I don't know of any solution to that problem other than being careful, keeping track of the number of operations that have been performed on the numbers, and then rounding them to save memory; but one has to do the rounding at a very precise stage of the calculation to avoid the aforementioned problems. If possible, the rounding should be done at the very end of the calculation, as the very last operation.
In PHP you have BCMath. You don't need to load any DLL or compile any module.
It supports numbers of any size and precision, represented as strings:
<?php
$a = '1.234';
$b = '5';
echo bcadd($a, $b); // 6
echo bcadd($a, $b, 4); // 6.2340
?>
Apparently Tcl also has them, from version 8.5, courtesy of LibTomMath:
http://wiki.tcl.tk/5193
http://www.tcl.tk/cgi-bin/tct/tip/237.html
http://math.libtomcrypt.com/
There are several Javascript libraries that handle arbitrary-precision arithmetic.
For example, using my big.js library:
Big.DP = 20; // Decimal Places
var pi = Big(355).div(113)
console.log( pi.toString() ); // '3.14159292035398230088'
In R you can use the Rmpfr package:
library(Rmpfr)
exp(mpfr(1, 120))
## 1 'mpfr' number of precision 120 bits
## [1] 2.7182818284590452353602874713526624979
You can find the vignette here: Arbitrarily Accurate Computation with R:
The Rmpfr Package
Java natively can do bignum operations with BigDecimal. GMP is the de facto standard bignum library for C/C++.
If you want to work in the .NET world you can still use the java.math.BigDecimal class. Just add a reference to vjslib (in the framework) and then you can use the Java classes.
The great thing is, they can be used from any .NET language. For example in C#:
using System;
using java.math;

namespace MyNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            BigDecimal bd = new BigDecimal("12345678901234567890.1234567890123456789");
            Console.WriteLine(bd.ToString());
        }
    }
}
The (free) basic program x11 basic ( http://x11-basic.sourceforge.net/ ) has arbitrary precision for integers. (and some useful commands as well, e.g. nextprime( abcd...pqrs))
IBM's interpreted scripting language Rexx provides custom precision settings with Numeric: https://www.ibm.com/docs/en/zos/2.1.0?topic=instructions-numeric.
The language runs on mainframes and PC operating systems and has very powerful parsing and variable handling, as well as extension packages. Object Rexx is the most recent implementation. Links from https://en.wikipedia.org/wiki/Rexx
Haskell has excellent support for arbitrary-precision arithmetic built in, and using it is the default behavior. At the REPL, with no imports or setup required:
Prelude> 2 ^ 2 ^ 12
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336
(try this yourself at https://tryhaskell.org/)
If you're writing code stored in a file and you want to print a number, you have to convert it to a string first. The show function does that.
module Test where

main = do
  let x = 2 ^ 2 ^ 12
  let xStr = show x
  putStrLn xStr
(try this yourself at code.world: https://www.code.world/haskell#Pb_gPCQuqY7r77v1IHH_vWg)
What's more, Haskell's Num abstraction lets you defer deciding what type to use as long as possible.
-- Define a function to make big numbers. The (inferred) type is generic.
Prelude> superbig n = 2 ^ 2 ^ n
-- We can call this function with different concrete types and get different results.
Prelude> superbig 5 :: Int
4294967296
Prelude> superbig 5 :: Float
4.2949673e9
-- The `Int` type is not arbitrary precision, and we might overflow.
Prelude> superbig 6 :: Int
0
-- `Double` can hold bigger numbers.
Prelude> superbig 6 :: Double
1.8446744073709552e19
Prelude> superbig 9 :: Double
1.3407807929942597e154
-- But it is also not arbitrary precision, and can still overflow.
Prelude> superbig 10 :: Double
Infinity
-- The Integer type is arbitrary-precision though, and can go as big as we have memory for and patience to wait for the result.
Prelude> superbig 12 :: Integer
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336
-- If we don't specify a type, Haskell will infer one with arbitrary precision.
Prelude> superbig 12
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336
