Changing sign of a 64 bit floating point number in verilog? - verilog

So I am working with 64 bit floating point numbers in Verilog for synthesis, ideally I would like to do -A*B, where A and B are the two numbers. I have got past doing A*B, so is it okay now if I just change the value of the first bit 0 to 1 or 1 to 0 to make it represent -A*B.
kinda like,
A[0]=~A[0];
Thanks in advance for any suggestion.

Yes! That's all there is to it.
Keep in mind that negating 0 will give you -0. (They're different floating-point bit patterns.) Whether this matters to you will depend on your application.

Related

Why in python 12.76/106 is 0.1200000000001 and not 0.12?

I used pythonista to divide 12.76 and the result is 0.12000000001 instead of 0.12, why?
You may want to have a look at http://floating-point-gui.de . The answer you are getting is not wrong, everything depends on how much precision you want.
Because computers do the calculation in binary. But they also have limited bits to represent the numbers. there was probably an overflow in the base-2 so the machine had to round up by 1 bit or something. Then when it translated it to decimal, u get that .00000000000001

High precision floating point numbers in Haskell?

I know Haskell has native data types which allow you to have really big integers so things like
>> let x = 131242358045284502395482305
>> x
131242358045284502395482305
work as expected. I was wondering if there was a similar "large precision float" native structure I could be using, so things like
>> let x = 5.0000000000000000000000001
>> x
5.0000000000000000000000001
could be possible. If I enter this in Haskell, it truncates down to 5 if I go beyond 15 decimal places (double precision).
Depending on exactly what you are looking for:
Float and Double - pretty much what you know and "love" from Floats and Doubles in all other languages.
Rational which is a Ratio of Integers
FixedPoint - This package provides arbitrary sized fixed point values. For example, if you want a number that is represented by 64 integral bits and 64 fractional bits you can use FixedPoint6464. If you want a number that is 1024 integral bits and 8 fractional bits then use $(mkFixedPoint 1024 8) to generate type FixedPoint1024_8.
EDIT: And yes, I just learned about the numbers package mentioned above - very cool.
Haskell does not have high-precision floating-point numbers naitively.
For a package/module/library for this purpose, I'd refer to this answer to another post. There's also an example which shows how to use this package, called numbers.
If you need a high precision /fast/ floating point calculations, you may need to use FFI and long doubles, as the native Haskell type is not implemented yet (see https://ghc.haskell.org/trac/ghc/ticket/3353).
I believe the standard package for arbitrary precision floating point numbers is now https://hackage.haskell.org/package/scientific

scale 14 bit word to an 8 bit word

I'm working on a project where I sample a signal with an ADC, that represents values as 14 bit words. I need to scale the values to 8 bit words. What's a good way to go about this in general. By the way, I'm using an FPGA so I'd like to do it in "hardware" rather than a software solution. Also in case you're wondering the chain of events will be: sample analog signal, represent sample value with 14 bit word, scale 14 bit word to 8 bit word, transmit 8 bit word with UART to PC COM1.
I've never done this before. I was assuming you use quantization levels, but I'm not sure what an efficient circuit for this operation would be. Any help would be appreciated.
Thanks
You just need an add and a shift:
val_8 = (val_14 + 32) >> 6;
(The + 32 is necessary to get correct rounding - you can omit it but you will get more truncation noise in your signal if you do.)
I think you just drop the six lowest resolution bits and call it good, right? But I might not fully understand the problem statement.
Paul's algorithm is correct, but you'll need some bounds checking.
assign val_8 = (&val_14[13:5]) ? //Make sure your sum won't overflow
8'hFF : //Assign all 1's if it will
val_14[13:6] + val_14[5];

Excel Floating Point Arihmetic - What Type Does It Actually Use

From a previous questions someone indicated that Excel uses a 64 bit (8 byte) double-precision floating.
Is that correct - Is there any material on this at all?
I am trying to tie off numbers and this is killing me!
According to this article yes, it uses 64 bit double precision floating point numbers. The article also describes the rounding errors etc associated with this format.
Have a look at
Floating-point arithmetic
Under the section Precision
A floating-point number is stored in
binary in three parts within a 65-bit
range: the sign, the exponent, and the
mantissa.
Here is another article
Understanding Floating Point Precision, aka “Why does Excel Give Me Seemingly Wrong Answers?”
Have a look at the section Structure of a Floating Point Number

Multiplying 23 bit datatypes in a system with no long long

I am trying to implement floating point operations in a microcontroller and so far I have had ample success.
The problem lies in the way I do multiplication in my computer and it works fine:
unsigned long long gig,mm1,mm2;
unsigned long m,m1,m2;
mm1 = f1.float_parts.mantissa;
mm2 = f2.float_parts.mantissa;
m1 = f1.float_parts.mantissa;
m2 = f2.float_parts.mantissa;
gig = mm1*mm2; //this works fine I get all the bits I need since they are all long long, but won't work in the mcu
gig = m1*m2//this does not work, to be precise it gives only the 32 least significant bits , but works on the mcu
So you can see that my problem is that the microcontroller will throw an undefined refence to __muldi3 if I try the gig = mm1*mm2 there.
And If I try with the smaller data types, it only keeps the least significant bits, which I don't want it to. I need the 23 msb bits of the product.
Does anyone have any ideas as to how I can do this?
Apologizes for the short answer, I hope that someone else will take the time to write a fuller explanation, but basically you do exactly as when you multiply two big numbers by hand on a paper! It's just that instead of working with base 10, you work in base 256. That is, treat your numbers as a byte vectors, and do with each byte what you do to a digit when you "hand multiply".
The comments in the FreeBSD implementation of __muldi3() have a good explanation of the required procedure, see muldi3.c. If you want to go straight to the source (always a good idea!), according to the comments this code was based on an algorithm described in Knuth's The Art of Computer Programming vol. 2 (2nd ed), section 4.3.3, p. 278. (N.B. the link is for the 3rd edition.)
Back on the Intel 8088 (the original PC CPU and the last CPU I wrote assembly code for) when you multiplied two 16 bit numbers (32 bits? whoow) the CPU would return 2 16 bit numbers in two different registers - one with the 16 msb and one with the lsb.
You should check the hardware capabilities of your micro-controller, maybe it has a similar setup (obviously you'll need the code this in assembly if it does).
Otherwise you'll have to implement multiplication on your own.

Resources