Why is a simple RGB to Lab conversion much slower using compute capability 1.3 than 1.0 even with -use_fast_math flag?

Why is a simple RGB to Lab conversion much slower using compute capability 1.3 than 1.0 even with -use_fast_math flag? - visual-studio-2012

I am using GT 740M (CC 3.5) and I have a RGB to Lab conversion kernel. Using compute capability 1.0 - 1.2 the whole kernel is executed in 924 microseconds however using the compute capability of 1.3 or higher (up to 3.5) the kernel is executed in around 3 ms. According to the table from wikipedia http://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications I found out that it could be caused by the double-precision floating-point operations so I used the -use_fast_math flag, but it did not help.
What can be the reason of the performance hit?
The whole source code can be seen in http://pastebin.com/JjhH101y

cc 1.0 - 1.2 devices do not support double-precision floating point operations. Those operations will be "demoted" to single precision floating point operations on those devices.
At first glance, all of your variables are float not double, but your constants are all double-precision constants.
Therefore arithmetic like this:
a=(x-y)*500.0;
will involve a double-precision floating point multiply on compile targets that support it (which will be subsequently reduced to a float). On compile targets that don't support it, the above operation will be handled entirely via single-precision math.
The --use-fast-math option does not affect conversion between double and float as discussed above.
I would suggest that you start by decorating all your constants as floating-point constants:
a=(x-y)*500.0f;
You might also want to carefully review the CUDA math api to be sure that you get what you want from operations like this:
exp(log(x)/3.0)
in terms of single or double-precision arithmetic.

Related

What's the term for "default rounding mode" for arithmetic ops in IEEE 754?

I know that IEEE 754 is the specification for floating-point number data types, their representation and the semantics of operations on them; but - I'm not knowledgeable in the standard itself.
I also know that IEEE 754 defines the following four rounding modes: to nearest, up, down, to zero; and that in typical environments supporting IEEE 754 you can say "please round x using rounding mode m" (e.g. here's how it's done with glibc). That's all well and good.
However, rounding also happens when you simply perform arithmetic: If you have a single-precision 0.9999999 (or something close to that) and you add this to 100000, you will either get 100001, or something close to 100000.99 .
My question: Is this rounding supposed to be an application of a "default rounding mode"? Is that the IEEE 754 term, or is there another, specific term for what I've described? And are the four rounding modes supposed to be supported for implicit rounding arithmetic, as well?

IEEE 754-2008 does not define “modes” for rounding; it changed the terminology from the 1985 standard to imply more flexibility than a global mode. In clause 4.3.3, it specifies round-to-nearest-ties-to-even as the default method for binary formats. For decimal formats, it says a default is not defined but should be round-to-nearest-ties-to-even.
IEEE 754-2008 subclause 4.3 specifies several rounding-direction attributes. It does not say there has to be a mode that provides a rounding-direction setting. A computing environment could, for example, including the rounding-direction attribute as a parameter on each individual operation. For example, a processor architecture could have separate instructions for add with rounding-to-nearest-ties-to-even, add with rounding toward +∞, add with rounding toward zero, and so on. Or it could include the rounding-direction attribute as a operand to the instruction. Or the instruction could be affected by a global mode set in some special processor register.
Many processor architectures provide rounding mode as a setting in a floating-point control register.
The IEEE 754-1985 standard described rounding modes and defined mode as “A variable that a user may set, sense, save, and restore to control the execution of subsequent arithmetic operations.” This may have influenced the development of rounding modes in control registers, but that has a detrimental effect on performance, as global registers cause dependencies between instructions: Every floating-point instruction depends on that global register, so any change to that register interferes with parallel execution of instructions.
However, it is desirable that the rounding direction be flexible, as one might want to alternately use different modes when implementing interval arithmetic or evaluating complicated routines such as implementations of sine or exponentiation. So the IEEE 754-2008 committee changed the standard not to define global modes. Clause 4 specifies some semantics for attributes, including that languages should provide ways to specify the attributes for all standard operations in a “block.” A block may be an entire program or a single operation; it is language-defined. Subclause 4.2 says languages should provide dynamic ways of specifying the attributes, so they can be determined at run-time.

Why does System.Numerics.Complex use doubles instead of decimals?

I've been working with System.Numerics.Complex recently, and I've started to notice the typical floating-point "drift" where the value stored gets calculated a tenth of a millionth off or something like that, which is well-known and common with the float type and even the double type. I looked into the Complex struct, and sure enough, it used double variables. Why does it use double values to store its data and not decimal values, which are designed to prevent this? How do I work around this?

To answer your question:
doubles are several orders of magnitude faster, as operations are done at the hardware level
base-2 floats can actually be more accurate for large computations, as there is less "wobble" when shifting up and down exponents: 1 bit of precision is less than 1 decimal digit. Moreover, base-2 can use an implicit leading bit, which means they can represent more numbers than other bases.
complex numbers are typically used for scientific/engineering applications, where small relative errors of approx 10-16 are outweighed by other sources of error (e.g. due to measurement or the model).
decimals on the other hand are typically used for "accounting" type operations, where round-off error is typically negligible (i.e. addition of small numbers, multiplication by integers, etc.)

Does decimal math use the FPU

Does decimal math use the FPU?
I would think that the answer is yes, but I'm not sure, since a decimal is not a floating point, but a fixed precision number.
I'm looking mostly for .NET, but a general answer would be useful too.

With regards to .NET and more specifically C#, no, System.Decimal does not use the FPU because the type is emulated in software.
Also, System.Decimal is a floating point number, not a fixed precision number like commonly found in a database. The type is actually a decimal floating point that uses 10 for its base as opposed to a binary floating point (i.e. System.Single or System.Double) which uses 2 as its base. It still has the same precision problems if you attempt to store a fraction that cannot be exactly represented, for example, 1/3.

Yes, modern languages in general support floating point math and integer math, and that's it; there's no direct support for fixed point, BCD, etc. Floating-point math is going to be done using floating-point processor instructions; modern architectures don't include a separate FPU.

Does 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether i can assume that same operations on same 64-bit floating point numbers gives exactly the same results on any modern PC and in most common programming languages? (C++, Java, C#, etc.). We can assume, that we are operating on numbers and result is also a number (no NaNs, INFs and so on).
I know there are two very simmilar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However I don't know how it is in practice.

Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = bc + d" in C, where all variables are declared double, the compiler is free to compute "bc" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "bc + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes bc and then adds d.
Your compiler might have switches to control behavior like this.
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.

If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen e.g. on a computer with an Intel processor, if the optimizer keeps a variable for an intermediate result in a register, that is stored in memory in the unoptimized build. Since Intel FPU registers are 80 bit, and double variables are 64 bit, the intermediate result will be stored with greater precision in the optimized build, causing different values in later results.).
In practice, however, you may often get the same results, but you shouldn't rely on it.

Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.

assuming you are talking about applying multiple operations, I do not think you will get exact numbers. CPU architecture, compiler use, optimization settings will change the results of your computations.
if you mean the exact order of operations (at the assembly level), I think you will still get variations.for example Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated)

The same C# program can bring out different numerical results on the same PC, once compiled in debug mode without optimization, second time compiled in release mode with optimization enabled. That's my personal experience. We did not regard this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.

For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes

For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.

What programming languages support arbitrary precision arithmetic?

What programming languages support arbitrary precision arithmetic and could you give a short example of how to print an arbitrary number of digits?

Some languages have this support built in. For example, take a look at java.math.BigDecimal in Java, or decimal.Decimal in Python.
Other languages frequently have a library available to provide this feature. For example, in C you could use GMP or other options.
The "Arbitrary-precision software" section of this article gives a good rundown of your options.

Mathematica.
N[Pi, 100]
3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117068
Not only does mathematica have arbitrary precision but by default it has infinite precision. It keeps things like 1/3 as rationals and even expressions involving things like Sqrt[2] it maintains symbolically until you ask for a numeric approximation, which you can have to any number of decimal places.

In Common Lisp,
(format t "~D~%" (expt 7 77))
"~D~%" in printf format would be "%d\n". Arbitrary precision arithmetic is built into Common Lisp.

Smalltalk supports arbitrary precision Integers and Fractions from the beginning.
Note that gnu Smalltalk implementation does use GMP under the hood.
I'm also developping ArbitraryPrecisionFloat for various dialects (Squeak/Pharo Visualworks and Dolphin), see http://www.squeaksource.com/ArbitraryPrecisionFl.html

Python has such ability. There is an excellent example here.
From the article:
from math import log as _flog
from decimal import getcontext, Decimal
def log(x):
if x < 0:
return Decimal("NaN")
if x == 0:
return Decimal("-inf")
getcontext().prec += 3
eps = Decimal("10")**(-getcontext().prec+2)
# A good initial estimate is needed
r = Decimal(repr(_flog(float(x))))
while 1:
r2 = r - 1 + x/exp(r)
if abs(r2-r) < eps:
break
else:
r = r2
getcontext().prec -= 3
return +r
Also, the python quick start tutorial discusses the arbitrary precision: http://docs.python.org/lib/decimal-tutorial.html
and describes getcontext:
the getcontext() function accesses the
current context and allows the
settings to be changed.
Edit: Added clarification on getcontext.

Many people recommended Python's decimal module, but I would recommend using mpmath over decimal for any serious numeric uses.

COBOL
77 VALUE PIC S9(4)V9(4).
a signed variable witch 4 decimals.
PL/1
DCL VALUE DEC FIXED (4,4);
:-) I can't remember the other old stuff...
Jokes apart, as my example show, I think you shouldn't choose a programming language depending on a single feature. Virtually all decent and recent language support fixed precision in some dedicated classes.

Scheme (a variation of lisp) has a capability called 'bignum'. there are many good scheme implementations available both full language environments and embeddable scripting options.
a few I can vouch for
MitScheme (also referred to as gnu scheme)
PLTScheme
Chezscheme
Guile (also a gnu project)
Scheme 48

Ruby whole numbers and floating point numbers (mathematically speaking: rational numbers) are by default not strictly tied to the classical CPU related limits. In Ruby the integers and floats are automatically, transparently, switched to some "bignum types", if the size exceeds the maximum of the classical sizes.
One probably wants to use some reasonably optimized and "complete", multifarious, math library that uses the "bignums". This is where the Mathematica-like software truly shines with its capabilities.
As of 2011 the Mathematica is extremely expensive and terribly restricted from hacking and reshipping point of view, specially, if one wants to ship the math software as a component of a small, low price end, web application or an open source project. If one needs to do only raw number crunching, where visualizations are not required, then there exists a very viable alternative to the Mathematica and Maple. The alternative is the REDUCE Computer Algebra System, which is Lisp based, open source and mature (for decades) and under active development (in 2011). Like Mathematica, the REDUCE uses symbolic calculation.
For the recognition of the Mathematica I say that as of 2011 it seems to me that the Mathematica is the best at interactive visualizations, but I think that from programming point of view there are more convenient alternatives even if Mathematica were an open source project. To me it seems that the Mahtematica is also a bit slow and not suitable for working with huge data sets. It seems to me that the niche of the Mathematica is theoretical math, not real-life number crunching. On the other hand the publisher of the Mathematica, the Wolfram Research, is hosting and maintaining one of the most high quality, if not THE most high quality, free to use, math reference sites on planet Earth: the http://mathworld.wolfram.com/
The online documentation system that comes bundled with the Mathematica is also truly good.
When talking about speed, then it's worth to mention that REDUCE is said to run even on a Linux router. The REDUCE itself is written in Lisp, but it comes with 2 of its very own, specific, Lisp implementations. One of the Lisps is implemented in Java and the other is implemented in C. Both of them work decently, at least from math point of view. The REDUCE has 2 modes: the traditional "math mode" and a "programmers mode" that allows full access to all of the internals by the language that the REDUCE is self written in: Lisp.
So, my opinion is that if one looks at the amount of work that it takes to write math routines, not to mention all of the symbolic calculations that are all MATURE in the REDUCE, then one can save enormous amount of time (decades, literally) by doing most of the math part in REDUCE, specially given that it has been tested and debugged by professional mathematicians over a long period of time, used for doing symbolic calculations on old-era supercomputers for real professional tasks and works wonderfully, truly fast, on modern low end computers. Neither has it crashed on me, unlike at least one commercial package that I don't want to name here.
http://www.reduce-algebra.com/
To illustrate, where the symbolic calculation is essential in practice, I bring an example of solving a system of linear equations by matrix inversion. To invert a matrix, one needs to find determinants. The rounding that takes place with the directly CPU supported floating point types, can render a matrix that theoretically has an inverse, to a matrix that does not have an inverse. This in turn introduces a situation, where most of the time the software might work just fine, but if the data is a bit "unfortunate" then the application crashes, despite the fact that algorithmically there's nothing wrong in the software, other than the rounding of floating point numbers.
The absolute precision rational numbers do have a serious limitation. The more computations is performed with them, the more memory they consume. As of 2011 I don't know any solutions to that problem other than just being careful and keeping track of the number of operations that has been performed with the numbers and then rounding the numbers to save memory, but one has to do the rounding at a very precise stage of the calculations to avoid the aforementioned problems. If possible, then the rounding should be done at the very end of the calculations as the very last operation.

In PHP you have BCMath. You not need to load any dll or compile any module.
Supports numbers of any size and precision, represented as string
<?php
$a = '1.234';
$b = '5';
echo bcadd($a, $b); // 6
echo bcadd($a, $b, 4); // 6.2340
?>

Apparently Tcl also has them, from version 8.5, courtesy of LibTomMath:
http://wiki.tcl.tk/5193
http://www.tcl.tk/cgi-bin/tct/tip/237.html
http://math.libtomcrypt.com/

There are several Javascript libraries that handle arbitrary-precision arithmetic.
For example, using my big.js library:
Big.DP = 20; // Decimal Places
var pi = Big(355).div(113)
console.log( pi.toString() ); // '3.14159292035398230088'

In R you can use the Rmpfr package:
library(Rmpfr)
exp(mpfr(1, 120))
## 1 'mpfr' number of precision 120 bits
## [1] 2.7182818284590452353602874713526624979
You can find the vignette here: Arbitrarily Accurate Computation with R:
The Rmpfr Package

Java natively can do bignum operations with BigDecimal. GMP is the defacto standard library for bignum with C/C++.

If you want to work in the .NET world you can use still use the java.math.BigDecimal class. Just add a reference to vjslib (in the framework) and then you can use the java classes.
The great thing is, they can be used fron any .NET language. For example in C#:
using java.math;
namespace MyNamespace
{
class Program
{
static void Main(string[] args)
{
BigDecimal bd = new BigDecimal("12345678901234567890.1234567890123456789");
Console.WriteLine(bd.ToString());
}
}
}

The (free) basic program x11 basic ( http://x11-basic.sourceforge.net/ ) has arbitrary precision for integers. (and some useful commands as well, e.g. nextprime( abcd...pqrs))

IBM's interpreted scripting language Rexx, provides custom precision setting with Numeric. https://www.ibm.com/docs/en/zos/2.1.0?topic=instructions-numeric.
The language runs on mainframes and pc operating systems and has very powerful parsing and variable handling as well as extension packages. Object Rexx is the most recent implementation. Links from https://en.wikipedia.org/wiki/Rexx

Haskell has excellent support for arbitrary-precision arithmetic built in, and using it is the default behavior. At the REPL, with no imports or setup required:
Prelude> 2 ^ 2 ^ 12
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336
(try this yourself at https://tryhaskell.org/)
If you're writing code stored in a file and you want to print a number, you have to convert it to a string first. The show function does that.
module Test where
main = do
let x = 2 ^ 2 ^ 12
let xStr = show x
putStrLn xStr
(try this yourself at code.world: https://www.code.world/haskell#Pb_gPCQuqY7r77v1IHH_vWg)
What's more, Haskell's Num abstraction lets you defer deciding what type to use as long as possible.
-- Define a function to make big numbers. The (inferred) type is generic.
Prelude> superbig n = 2 ^ 2 ^ n
-- We can call this function with different concrete types and get different results.
Prelude> superbig 5 :: Int
4294967296
Prelude> superbig 5 :: Float
4.2949673e9
-- The `Int` type is not arbitrary precision, and we might overflow.
Prelude> superbig 6 :: Int
0
-- `Double` can hold bigger numbers.
Prelude> superbig 6 :: Double
1.8446744073709552e19
Prelude> superbig 9 :: Double
1.3407807929942597e154
-- But it is also not arbitrary precision, and can still overflow.
Prelude> superbig 10 :: Double
Infinity
-- The Integer type is arbitrary-precision though, and can go as big as we have memory for and patience to wait for the result.
Prelude> superbig 12 :: Integer
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336
-- If we don't specify a type, Haskell will infer one with arbitrary precision.
Prelude> superbig 12
1044388881413152506691752710716624382579964249047383780384233483283953907971557456848826811934997558340890106714439262837987573438185793607263236087851365277945956976543709998340361590134383718314428070011855946226376318839397712745672334684344586617496807908705803704071284048740118609114467977783598029006686938976881787785946905630190260940599579453432823469303026696443059025015972399867714215541693835559885291486318237914434496734087811872639496475100189041349008417061675093668333850551032972088269550769983616369411933015213796825837188091833656751221318492846368125550225998300412344784862595674492194617023806505913245610825731835380087608622102834270197698202313169017678006675195485079921636419370285375124784014907159135459982790513399611551794271106831134090584272884279791554849782954323534517065223269061394905987693002122963395687782878948440616007412945674919823050571642377154816321380631045902916136926708342856440730447899971901781465763473223850267253059899795996090799469201774624817718449867455659250178329070473119433165550807568221846571746373296884912819520317457002440926616910874148385078411929804522981857338977648103126085903001302413467189726673216491511131602920781738033436090243804708340403154190336

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string