I am performing some operations in PyTorch that require a very high degree of precision. I am currently using torch.float64 as my default datatype. All my numbers are real. I would like to use a float128 datatype (since memory is not an issue for this simulation). However, the only 128-bit datatype I can find in the documentation is torch.complex128, where the real and imaginary parts are each 64 bits. Is there a datatype, or some other way, that lets me use all 128 bits for my real numbers?
Thank you
I'm using Python 3's decimal module. Is the underlying arithmetic done using the processor's floating-point types, or does it use integers? The notion that the results are 'exact' and of arbitrary precision suggests to me that integer maths is used below the surface.
It is indeed integer math, not floating-point math. Roughly speaking, every Decimal is stored as a sign, an integer coefficient (the decimal digits), and an exponent that says where the decimal point goes. The calculations are done with integer arithmetic on those digits, so values such as 0.1 are represented exactly, and the sum of a very large value and a very small fraction stays precise as long as the result fits within the context precision (28 significant digits by default).
This comes at a price: far more operations are needed, and that much precision is not always necessary. That is why most calculations are done using floating-point arithmetic, which can lose precision when many operations are chained or when the operands differ greatly in magnitude (e.g. a ratio of 10^10 or more). A separate field of computer science, numerical analysis (or numerical methods), studies clever ways to get the most speed out of floating-point calculations while maintaining the highest possible precision.
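To make the difference concrete, here is a minimal sketch using only the standard library. The variable names are illustrative, and the precision of 10 digits in the last step is an arbitrary choice made just to show that Decimal results are still rounded to the context precision.

```python
from decimal import Decimal, localcontext

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum is slightly off.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)                      # False

# Decimal keeps a sign, a tuple of decimal digits and an exponent.
print(Decimal("1.25").as_tuple())
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))         # True

# A very large value plus a very small one: the float silently drops the
# small term, while the Decimal keeps it (26 digits fit in the default 28).
print(1e20 + 1e-5 == 1e20)                   # True
big_plus_small = Decimal("1e20") + Decimal("1e-5")
print(big_plus_small)

# Precision is configurable but not unlimited: results are rounded to the
# context precision (10 digits here, chosen only for illustration).
with localcontext() as ctx:
    ctx.prec = 10
    third = Decimal(1) / Decimal(3)
print(third)
```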
How can I implement Verilog code that evaluates an exponential equation whose values must be represented as fixed-point numbers?
For example, I have this equation in C++ and wish to convert it to Verilog or VHDL:
double y = 0.1+0.75*(1.0/(1.0+exp((x[i]+40.5)/6.0)));
Here 'y' and 'x' must be fixed-point numbers, and 'x' is a vector.
I looked for modules and libraries that support fixed point, but none of them provide exponentials.
Verilog has a real data type that provides simulation-time support for floating-point numbers. It also has an exponentiation operator, e.g., a ** b computes a to the power of b.
However, code written using the real datatype is generally not synthesizable. Instead, in real hardware designs, support for fixed- and floating-point numbers is generally achieved by building arithmetic logic units that implement, e.g., the IEEE floating-point standard.
Most of the time, such a design will require at least a couple of cycles even for basic operations like addition and multiplication. More complex operations like division, sine, cosine, etc., are generally implemented using algorithms based on approximating polynomials.
If you really want to understand how to represent and manipulate fixed point and floating point numbers, you should probably get a textbook for a mathematics course such as Numerical Methods, or an EE course on Computer Arithmetic.
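As a rough illustration of how such an approximation can be modelled before writing any RTL, here is a Python sketch that evaluates the equation from the question with integers only. It uses a lookup table with linear interpolation (a piecewise first-degree polynomial), which is one common hardware-friendly choice and not something the answer above prescribes. The bit widths, the input range, and the names `ROM` and `fixed_point_y` are all assumptions made for this example.

```python
import math

FRAC_BITS = 8              # assumed: x is signed fixed point with 8 fractional bits
OUT_BITS = 16              # assumed: y is unsigned fixed point with 16 fractional bits
X_MIN, X_MAX = -128, 127   # assumed range of the integer part of x

def reference(x):
    """Floating-point version of the equation from the question."""
    return 0.1 + 0.75 * (1.0 / (1.0 + math.exp((x + 40.5) / 6.0)))

# Table contents: one entry per integer value of x, plus one extra entry so
# that interpolation at the top of the range has a right-hand neighbour.
ROM = [round(reference(x) * (1 << OUT_BITS)) for x in range(X_MIN, X_MAX + 2)]

def fixed_point_y(x_fixed):
    """x_fixed holds x * 2**FRAC_BITS; the result holds y * 2**OUT_BITS."""
    index = (x_fixed >> FRAC_BITS) - X_MIN      # integer part selects the table address
    frac = x_fixed & ((1 << FRAC_BITS) - 1)     # fractional part drives the interpolation
    y0, y1 = ROM[index], ROM[index + 1]
    # Linear interpolation using only integer multiply, shift and add.
    return y0 + (((y1 - y0) * frac) >> FRAC_BITS)

for x in (-100.0, -40.5, 0.25, 50.75):
    approx = fixed_point_y(round(x * (1 << FRAC_BITS))) / (1 << OUT_BITS)
    print(x, approx, reference(x))
```

In hardware, the table would become a ROM and the interpolation a single multiplier plus an adder. Comparing the model against the floating-point reference shows how many table entries and fractional bits are needed before committing to a design.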
Is there a way to multiply a register/wire by a value?
e.g.
... input wire [13:0] setpoint ...
if (timer>(setpoint*0.95))
Yes, this is possible, but multipliers can be quite large, so use them with caution. In this case one operand is a constant, so the logic will reduce quite a lot.
In RTL a real (as in 0.95) does not have much meaning. You need to multiply by a fixed-point number instead, which also limits the precision with which you can represent 0.95.
Allowing 10 binary places gives a scaling of 2^10. In binary, 0.95 is then approximated as 0.1111001100, which equals 972/1024 = 0.94921875. For the comparison you need to keep track of how the result of the multiply grows:
a_int_bits.a_frac_bits * b_int_bits.b_frac_bits =
(a_int_bits + b_int_bits) . (a_frac_bits + b_frac_bits)
Therefore the timer in the comparison would need to be LSB padded for the fractional bits that are added to the representation of 0.95.
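The bookkeeping can be checked with a small integer-only model. This is a sketch in Python, not RTL, and the names `COEFF` and `timer_exceeds` are made up for the example.

```python
FRAC_BITS = 10                            # 10 binary places, scaling of 2**10
COEFF = int(0.95 * (1 << FRAC_BITS))      # 972 == 0b1111001100, i.e. 0.94921875

def timer_exceeds(timer, setpoint):
    """Integer-only model of `timer > setpoint * 0.95`."""
    # setpoint * COEFF carries 10 fractional bits, so the timer is padded
    # with 10 zero LSBs to line up the binary points before comparing.
    return (timer << FRAC_BITS) > (setpoint * COEFF)

print(bin(COEFF), COEFF / (1 << FRAC_BITS))
print(timer_exceeds(949, 1000))   # False
print(timer_exceeds(950, 1000))   # True, although 950 > 950.0 is False
```

The second call shows the cost of the limited precision: because the constant is really 0.94921875, the threshold sits slightly lower than an exact 0.95 would put it. Adding fractional bits narrows that gap at the price of a wider multiplier.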
I am trying to use Excel's (2007) built in FFT feature, however, it requires that I have 2^n data points - which I do not have.
I have tried two things, both give different results:
1. Pad the data values with zeros so that N (the number of data points) reaches the next power of 2
2. Use a divide-and-conquer approach, i.e. if I have 112 data points, then I do an FFT for 64, then 32, then 16 (112=64+32+16)
Which is the better approach? I am comfortable writing VBA macros but I am looking for an algorithm which does not require the constraint of N being power of 2. Can anyone help?
Splitting your data into smaller bits will result in erroneous output, especially for smaller numbers of data points.
Padding with zeroes is a much better idea, and the general approach for FFTs. If you are interested in an alternative way of doing the FFT, Octave will do it for you, and most of the MATLAB documentation applies, so you should have no trouble with it.
Padding with zeros is the right direction, but keep in mind that if you're doing the transform in order to estimate frequency content, you will need a window function, and that should be applied to the short block (i.e., if you have 2000 points, apply a 2000 point Hann window, then pad to 2048 and calculate the transform).
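The order of operations (window the short block first, then pad) can be sketched as follows. The function name `hann_then_pad` is hypothetical, and the symmetric form of the Hann window is an assumption; a periodic variant would divide by n instead of n - 1.

```python
import math

def hann_then_pad(samples):
    """Apply a Hann window over the original block, then zero-pad to 2**k."""
    n = len(samples)
    # Symmetric Hann window spanning the n real samples only (needs n >= 2).
    windowed = [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    # Smallest power of two that is >= n.
    padded_len = 1 << (n - 1).bit_length()
    return windowed + [0.0] * (padded_len - n)

block = hann_then_pad([1.0] * 2000)
print(len(block))   # 2048
```

Windowing after padding would instead taper the zeros and leave a sharp step at the end of the real data, which is what the window is meant to avoid.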
If you're developing an add-in, you might consider using one of the many FFT libraries out there. I'm a big fan of KISS FFT by Mark Borgerding. It offers fast transforms for many block sizes, essentially any block size that can be factored into the numbers 2, 3, 4, and/or 5. It doesn't handle prime-number-sized blocks though. It's written in very plain C, so it should be easy to port to C#. Or, this SO question suggests some libraries that can be used in .NET.
Pad out with zeros.
2^n is a requirement of the radix-2 FFT algorithm.
Maybe test with a known time series (e.g., a simple sine or cosine of a single frequency). When you FFT that, you should get a peak at that single frequency (ideally a Dirac delta function, provided the block holds a whole number of periods). Anything else is an error. Do it with an integer power of two, padded with zeroes, etc.
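A sanity check along these lines might look like the sketch below. It uses a naive O(N^2) DFT so that it has no dependencies; the block length of 64 and the 8 cycles are arbitrary choices. Note that a real-valued cosine shows up in two bins, k and N - k, because its spectrum is mirrored.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, used here only as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

N, CYCLES = 64, 8   # a whole number of periods fits in the block
signal = [math.cos(2 * math.pi * CYCLES * i / N) for i in range(N)]
magnitudes = [abs(v) for v in dft(signal)]

# Bins whose magnitude is clearly above rounding noise.
peaks = [k for k, m in enumerate(magnitudes) if m > 1e-6]
print(peaks)   # [8, 56]
```

If the same signal is zero-padded, or does not contain a whole number of periods, energy leaks into neighbouring bins. That is expected behaviour and not a bug in the transform.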
You can pad with zeros, or you can use an FFT library that supports arbitrary sizes. One such library is https://github.com/altomani/XL-FFT.
It implements the FFT as a pure formula with LAMBDA functions (i.e. without any VBA code).
For power-of-two lengths it uses a recursive radix-2 Cooley-Tukey algorithm,
and for other lengths a version of Bluestein's algorithm that reduces the calculation to a power-of-two case.
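For readers who want to see how the two pieces fit together, here is a compact Python sketch of the same idea. It is an independent illustration, not a transcription of the XL-FFT formulas, and the function names are made up for the example.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft_radix2(x):
    """Inverse FFT via the conjugation trick."""
    n = len(x)
    return [v.conjugate() / n for v in fft_radix2([u.conjugate() for u in x])]

def fft_bluestein(x):
    """FFT of any length >= 1, via a power-of-two circular convolution."""
    n = len(x)
    if (n & (n - 1)) == 0:
        return fft_radix2(x)
    m = 1 << (2 * n - 2).bit_length()        # power of two >= 2n - 1
    # Chirp w[k] = exp(-i*pi*k^2/n); k*k is reduced mod 2n to keep angles small.
    w = [cmath.exp(-1j * cmath.pi * ((k * k) % (2 * n)) / n) for k in range(n)]
    a = [x[k] * w[k] for k in range(n)] + [0j] * (m - n)
    b = [0j] * m
    b[0] = w[0].conjugate()
    for k in range(1, n):
        b[k] = b[m - k] = w[k].conjugate()   # chirp is symmetric in k
    # Circular convolution of a and b, computed with power-of-two FFTs.
    product = [p * q for p, q in zip(fft_radix2(a), fft_radix2(b))]
    conv = ifft_radix2(product)
    return [w[k] * conv[k] for k in range(n)]

spectrum = fft_bluestein([float(i % 5) for i in range(112)])
print(len(spectrum))   # 112
```

The rewrite rests on the identity jk = (j^2 + k^2 - (k - j)^2) / 2, which turns the DFT sum into a convolution with a chirp. Unlike zero-padding the data, this returns exactly the N-point transform of the original samples.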