Computation in fixed point or int - python-3.x

I am using fixed-point numbers within my network, which is based on the Keras framework. My concern is that when the network performs multiplication operations on Theano variables, the result is float32 (even if the numbers supplied are in fixed point). Is there any intrinsic way to get the result in fixed-point format, or even int?
If not, what are the alternative approaches?
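There is no fixed-point tensor type in Theano itself; a common workaround is to keep the storage float32 and re-quantise the result after each operation, which simulates fixed-point arithmetic inside the graph. A minimal numpy sketch of that idea, assuming a hypothetical format with 8 fractional bits (FRAC_BITS and quantize are illustrative names, not Keras/Theano API):

import numpy as np

FRAC_BITS = 8                  # assumed number of fractional bits
SCALE = 2.0 ** FRAC_BITS

def quantize(x):
    # round to the nearest multiple of 2**-FRAC_BITS; storage stays float32
    return (np.round(x * SCALE) / SCALE).astype(np.float32)

a = quantize(np.float32(0.7071))
b = quantize(np.float32(1.4142))
prod = quantize(a * b)         # re-quantise after every multiply
print(a, b, prod)

The same rounding can be expressed with Theano tensor operations and applied after each multiplication in the network.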

Related

Why are float numbers rounded up in Cassandra?

I created a Cassandra table with a column of type DataType.FLOAT.
I execute my CQL using CqlSession:
CqlSessionBuilder builder = CqlSession.builder();
builder.addContactPoint(new InetSocketAddress(properties.getHost(), properties.getPort()));
builder.withLocalDatacenter(properties.getDatacenter());
builder.withAuthCredentials(properties.getUsername(), properties.getPassword());
CqlSession session = builder.build();
But when I insert float numbers, they are rounded:
12334.9999 -> 12335.0
0.999999 -> 0.999999
12345.9999 -> 12346.0
It seems like Cassandra rounds the float based on the total number of significant digits, not only those after the decimal point.
What are the options to solve this problem? I know that I can use the Decimal datatype, but maybe you have another solution?
I actually covered this issue with Apache Cassandra and DataStax Astra DB in an article I wrote last month:
The Guerilla Guide to Building E-commerce Product Services with DataStax Astra DB
So the problem here is that FLOAT is a fixed-precision floating point type. When a numeric value is converted from base-10 (decimal) to base-2 (binary), it must fit into 32 binary digits, and it is during this conversion between base-10 and base-2 that rounding errors occur. A 32-bit float holds only about 7 significant decimal digits, which is why 12334.9999 (9 significant digits) rounds to 12335.0 while 0.999999 (6 significant digits) survives intact. The likelihood of a rounding error therefore grows with the number of significant digits (on either side of the decimal point).
What are the options to solve this problem? I know that I can use the Decimal datatype, but maybe you have another solution?
Well, you mentioned the best solution (IMO), which is to use a DECIMAL to store the value. This works because DECIMAL is an arbitrary-precision type. The values in a DECIMAL are stored in base-10, so there's no conversion necessary and only the required precision is used.
Before arbitrary precision types came along, we used to use INTEGERs for things that had to be accurate. The first E-commerce team I worked on stored product prices in the DB as pennies, to prevent the rounding issue.
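For instance, the pennies trick looks like this in Python (prices are illustrative):

from decimal import Decimal

print(0.1 + 0.2)                             # 0.30000000000000004 -- classic float drift
print(1999 + 2999)                           # 4998 pennies == $49.98, exact integer arithmetic
print(Decimal("19.99") + Decimal("29.99"))   # 49.98 -- DECIMAL-style base-10 arithmetic, also exact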
Yes, both INT and FLOAT are fixed-precision types, but an INT stores whole numbers, and all of its precision can be devoted to them. The usage patterns of the bits are therefore quite different. While both INT and FLOAT allocate a bit for the "sign" (+/-), with floating point numbers the remaining 31 bits are pre-allocated, split between the significand (23 bits) and the exponent (8 bits).
So your example of 12334.9999 is essentially stored in Cassandra like this:
123349999 x 10^-4
And of course, that's stored in binary, which I won't include here for brevity.
tl;dr;
Basically, FLOATs use a fixed number of bits to store values as a formula (significand and exponent) in base-2, and the conversion from base-10 into base-2 makes rounding errors likely.
You're right, use a DECIMAL type. When you need to be exact, that's the only real solution.
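You can reproduce the exact behaviour from the question with numpy's float32 and Python's Decimal (a quick sketch, not Cassandra-specific):

import numpy as np
from decimal import Decimal

print(np.float32(12334.9999))   # 12335.0  -- 9 significant digits don't fit in a 32-bit float
print(np.float32(0.999999))     # 0.999999 -- 6 significant digits survive
print(Decimal("12334.9999"))    # 12334.9999 -- base-10 storage, no conversion, no rounding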
If you're interested, here are two additional SO answers which provide more detail on this topic:
Double vs. BigDecimal?
What is the difference between the float and integer data type when the size is the same?

Eigenvectors in Julia vs Numpy

I'm currently working to diagonalize a 5000x5000 Hermitian matrix. When I use Julia's eigen function from the LinearAlgebra module, which produces both the eigenvalues and eigenvectors, I get different results for the eigenvectors than when I solve the same problem with numpy's np.linalg.eigh function. I believe both of them use BLAS, but I'm not sure what else they may be using that is different.
Has anyone else experienced this/knows what is going on?
numpy.linalg.eigh(a, UPLO='L') is a different algorithm. It assumes the matrix is symmetric (or Hermitian) and uses only the lower triangle (the default) to compute the decomposition more efficiently.
The equivalent of Julia's LinearAlgebra.eigen() is numpy.linalg.eig. You should get the same result if you wrap your matrix in Julia in Symmetric(A, :L) (or Hermitian(A, :L) for a complex Hermitian matrix) before feeding it into LinearAlgebra.eigen().
Check out numpy's docs on eig and eigh; Julia's standard LinearAlgebra capabilities are documented here. If you go down to the special matrices section, it details which special methods are used for each type of special matrix, thanks to multiple dispatch.
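As a quick numpy-side sanity check (a small sketch, not the asker's actual 5000x5000 problem): the symmetric and general solvers agree on the eigenvalues, while the eigenvectors of a Hermitian matrix are only unique up to a complex phase, so they should be compared through the magnitude of their overlaps.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
H = (A + A.conj().T) / 2                 # Hermitian test matrix
w_h, v_h = np.linalg.eigh(H)             # symmetric/Hermitian solver, ascending eigenvalues
w_g, v_g = np.linalg.eig(H)              # general solver, unsorted eigenvalues
order = np.argsort(w_g.real)
print(np.allclose(w_h, w_g.real[order]))            # True: eigenvalues match
overlaps = np.abs(np.sum(v_h.conj() * v_g[:, order], axis=0))
print(np.allclose(overlaps, 1.0))                   # True: eigenvectors match up to phase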

Resize float32 array with nearest-neighbour interpolation in the same way as scipy.misc.imresize or tf.image.resize

I am creating a network using much of the same characteristics as pix2pix: https://github.com/affinelayer/pix2pix-tensorflow.
My adjustment is that I will not be using images, but matrices with float32 values. This introduces a lot of problems, and there is a lot to rewrite. Most of the code can easily be rewritten, but I've encountered a problem.
The network has a separable convolutional layer where the image is resized using tf.image.resize. This function offers different resize methods, such as nearest neighbour, and I don't want to lose that feature. Both scipy.misc.imresize and tf.image.resize are limited to int values and support nothing higher than uint16. If I were to transform the data to those formats, I would lose precision.
Is there a way to create this efficiently in numpy (or any equivalent) supporting float32?
Sorry for not including any code, but I hope the problem more or less explains itself without it.
Try scipy.ndimage.zoom (scipy.ndimage.interpolation.zoom in older SciPy versions). It works for float-valued images, and passing order=0 gives nearest-neighbour interpolation.
Use it as below:
image = scipy.ndimage.zoom(image, 0.5, order=0)
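A slightly fuller sketch (array shape and zoom factors are arbitrary) showing that the dtype is preserved and that order=0 matches the nearest-neighbour mode of tf.image.resize:

import numpy as np
from scipy import ndimage

arr = np.random.rand(64, 64).astype(np.float32)  # a float32 matrix standing in for an image
half = ndimage.zoom(arr, 0.5, order=0)           # order=0 -> nearest-neighbour interpolation
twice = ndimage.zoom(arr, 2.0, order=0)
print(half.shape, half.dtype)                    # (32, 32) float32
print(twice.shape, twice.dtype)                  # (128, 128) float32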

directx local space coordinates float accuracy

I'm a bit confused about the local space coordinate system. Suppose I have a complex object in local space. I know that when I want to put it in world space I have to multiply it by the Scale, Rotate and Translate matrices. But the problem is that local coordinates only range from -1.0f to 1.0f, so when I want to have a vertex like (1/500, 1/100, 1/100), things will not work: everything will become 0 due to the float accuracy problem.
The only solution I can see now is to split the object into lots of separate local space systems and ProjectView each of them individually, then put them together. That doesn't seem like the correct way to solve the problem. I've checked lots of books, but none of them mention this issue. I really want to know how to solve it.
when I want to have vertex like (1/500,1/100,1/100) things will not work
What makes you think that? The float accuracy problem does not mean a value coerces to 0 if it can't be accurately represented. It just means it coerces to the floating point number closest to the intended figure.
It's the very same as writing down, e.g., 3/9 with at most 6 significant decimal digits: 0.333333 – it didn't coerce to 0. And the very same goes for floating point.
Now you may be familiar with scientific notation: x·10^y – this is essentially decimal floating point, a mantissa x and an exponent y which specifies the order of magnitude. In binary floating point it becomes x·2^y. In either case the significant digits are in the mantissa. Your typical 32-bit float has a mantissa of 23 bits which, together with the implicit leading bit, gives 24 significant binary digits – about 7 decimal digits.
I really want to know how to solve it.
The real trouble with floating point numbers arises when you have to mix and merge numbers across a large range of orders of magnitude. As long as the numbers are of similar orders of magnitude, everything happens within the mantissa. And that one last change in order of magnitude into the [-1, 1] range will not hurt you; heck, this can be done by "normalizing" the floating point value and then simply dropping the exponent.
Recommended read: http://floating-point-gui.de/
Update
One further thing: if you write 1/500 in a language like C, you're performing an integer division, which will of course round down to 0. If you want this to be a floating point operation, you either have to write floating point literals or cast to float, i.e.
1./500.
or
(float)1/(float)500
Note that casting one of the operands to float suffices to make this a floating point division.
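Both points are easy to check interactively; here is a small Python/numpy sketch (Python's // mirrors C's integer division):

import numpy as np
from decimal import Decimal

print(1 // 500)                          # 0 -- integer division truncates, like C's 1/500
x = np.float32(1.0) / np.float32(500.0)
print(x)                                 # 0.002 -- float division does not collapse to 0
print(Decimal(float(x)))                 # the exact stored value: slightly off 0.002, nowhere near 0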

How to implement exponential with fixed point numbers?

How can I implement code in Verilog that evaluates an exponential equation using numbers that must be represented in fixed point?
For example, I have this equation in C++ and wish to convert it to Verilog or VHDL:
double y = 0.1+0.75*(1.0/(1.0+exp((x[i]+40.5)/6.0)));
Here 'y' and 'x' must be fixed-point numbers, and 'x' is also a vector.
I looked for modules and libraries that support fixed point, but none of them provide exponentials.
Verilog has a real data type that provides simulation-time support for floating-point numbers. It also has an exponentiation operator, e.g., a ** b computes a to the power of b.
However, code written using the real datatype is generally not synthesizable. Instead, in real hardware designs, support for fixed and floating point numbers is generally achieved by implementing arithmetic logic units that implement, e.g., the IEEE floating point standard.
Most of the time, such a design will require at least a couple of cycles even for basic operations like addition and multiplication. More complex operations like division, sine, cosine, etc., are generally implemented using approximating polynomials.
If you really want to understand how to represent and manipulate fixed point and floating point numbers, you should probably get a textbook for a mathematics course such as Numerical Methods, or an EE course on Computer Arithmetic.
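One common synthesizable approach is to precompute the whole expression into a lookup table (a ROM in hardware) indexed by the fixed-point value of x. A minimal Python sketch of the table generation, assuming a hypothetical Q8.8 format and an input range of [-128, 128) (both are assumptions you would adjust to your design):

import numpy as np

FRAC = 8                    # assumed Q8.8: 16-bit signed, 8 fractional bits
SCALE = 1 << FRAC

def to_fixed(v):
    return int(round(v * SCALE))

# Tabulate y = 0.1 + 0.75/(1 + exp((x + 40.5)/6)) for every representable x;
# in hardware this table becomes a ROM that the Verilog design indexes.
x_vals = np.arange(-128.0, 128.0, 1.0 / SCALE)
y_vals = 0.1 + 0.75 / (1.0 + np.exp((x_vals + 40.5) / 6.0))
lut = np.round(y_vals * SCALE).astype(np.int16)   # quantised Q8.8 outputs

def y_fixed(x_fix):
    # look up y for a Q8.8 input (an integer); the offset maps x to a table index
    return lut[x_fix - to_fixed(-128.0)]

print(y_fixed(to_fixed(-40.5)) / SCALE)   # ~0.475 (exactly 0.1 + 0.75/2 before quantisation)

Since y only spans roughly (0.1, 0.85) here, the table could be narrowed further; for finer x resolution, interpolating between table entries is the usual compromise between ROM size and accuracy.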
