I am curious about why we use addition to carry out subtraction in digital design systems.
I have a hunch that it has something to do with cost and performance but I don't understand how it would save money or be more efficient.
This really breaks down into two questions:
Why not use dedicated circuits for each operation?
Every additional circuit in a design costs money and time to design, validate, and allocate budgets for space, power and timing. We will (almost) always come out ahead if we can design a single circuit that can handle multiple operations.
In digital circuits, "a + b" and "a - b" are two different operations that require different circuits. Fortunately though, in arithmetic, "a + b" is the same as "a + (-b)", so the math doesn't care which one we use. Some very smart people came up with a clever representation for signed numbers that takes advantage of this mathematical fact, called Two's Complement. This representation allows positive and negative numbers to be added and inverted easily.
Since we can now implement both addition and subtraction with one circuit, there's no point to creating a second one.
Okay, so why addition?
If the operations are now interchangeable, why did we implement everything in addition instead of subtraction? I can only speculate on this, but I suspect the following reasons:
An adder circuit is less complicated than a subtractor:
Half Adder:
Half Subtractor:
Note that the subtractor has an added inverter. Every transistor counts in a digital design, so all other things being equal, we should use the circuit with fewer components.
Furthermore, addition is simply a more common operation. Incrementing instruction pointers & other memory addresses, iterator/loop indexes, and other regularly formed values is much more common than stepping backwards. If we're choosing to either handle subtraction or addition natively (which we are, since we are only going to dedicate one circuit to both), then it makes sense to choose the most common one. Now we've added an extra step to subtraction operations (convert to two's complement), but it's preferable to adding an extra step to the more common addition operations.
Further Reading
There is a very similar quesiont on EE.SE: Addition and subtraction in digital electronics
Here on SO: What is “2's Complement”?
Related
Suppose I have a function that runs calculations, example being something like a dot product - I pass in an arrays A, B of vectors and a float array C, and the functions assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both the threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a a bit more worried that CPU cores might not have the same instruction sets, even on consumer level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, c++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating point (FP) operations are not associative (but it is commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant and mainstream compilers (eg. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes the FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make such assumption so to speed up codes. It is generally combined with other ones like the assumption that all FP values are not NaN (eg. -ffast-math). Such flags break the IEEE-754 compliance, but they are often used in the 3D or video game industry so to speed up codes. IEEE-754 is not required by the C++ standard, but you can check this with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming the IEEE-754 compliance is not broken, the rounding mode is the same and the threads does the operations in the same order, then the result should be identical up to at least 1 ULP. In practice, if they are compiled using a same mainstream compiler, the result should be exactly the same.
The thing is using multiple threads often result in a non-deterministic order of the applied FP operations which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use a static partitioning, avoid atomic operations on FP variables or more generally atomic operations that could result in a different ordering. The same thing applies for locks or any synchronization mechanisms.
The same thing is true for GPUs. In fact, such problem is very frequent when developers use atomic FP operations for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations as pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.
I have been working on a multi-GPU project where I have had problems with obtaining non-deterministic results. I was surprised when it turned out that I obtained non-deterministic results due to a reduction clause executed on the CPU.
In the book Using OpenMP - The Next Step it is written that
"[...] the order in which threads combine their value to construct the
value for the shared result is non-deterministic."
Maybe I just don't understand how the reduction clauses are implemented. Does it mean that if I use schedule(monotonic:static) in combination with a reduction clause each thread will execute its chunk of the iterations in a deterministic order, but that the order in which the partial results are combined at the end of the parallel region is non-deterministic?
Does it mean that if I use schedule(monotonic:static) in combination
with a reduction clause each thread will execute its chunk of the
iterations in a deterministic order, but that the order in which the
partial results are combined at the end of the parallel region is
non-deterministic?
It is known that the end result is non-determinist, detailed information can be found in:
What Every Computer Scientist Should Know about Floating Point Arithmetic. For instance:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the expression (x+y)+z has a totally different answer than x+(y+z) when x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the latter).
Now regarding the order in which the threads perform the reduction action, as far as I know, the OpenMP standard does not enforce any order, or requires that the order has to be deterministic. Hence, this is an implementation detail that is left up to the compiler that is implementing the OpenMP standard to decide, and consequently, it is something that your code should not reply upon.
Programming language semantics usually declares that a+b+c+d is evaluated as ((a+b)+c)+d. This is not parallel, so an OpenMP reduction is probably evaluated as (a+b)+(c+d). And so on for larger numbers of summands.
So you immediately have that, because of the non-associativity of floating point arithmetic, the result may be subtly different from the sequential value.
But more importantly, the exact value will depend on precisely how the combination is done. Is it a+(b+c) (on 2 threads) or (a+b)+c? So the result is at least "indeterministic" in the sense that you can not reconstruct how it was formed. It could probably even be done in two different ways, if you run the same reduction twice. That's what I would call "non-deterministic", but look in the standard for the exact definition of the term.
By the way, if you want to get some idea of how OpenMP actually does it, write your own reduction operator, and let each invocation print out what it computes. Here is a decent illustration: https://victoreijkhout.github.io/pcse/omp-reduction.html#Initialvalueforreductions
By the way, the standard actually doesn't use the word "non-deterministic" for this case. The following passage explains the issue:
Furthermore, using different numbers of threads may result in
different numeric results because of changes in the association of
numeric operations. For example, a serial addition reduction may have
a different pattern of addition associations than a parallel
reduction.
I wrote an application in Haskell that calls Z3 solver to solve constrains with some complex formulas. Thanks to Haskell I can quickly switch the data type I'm working with.
When using SBV's AlgReal type for computations, I get results in sensible time, however switching to Float or Double types makes Z3 consume ~2Gb of RAM and doesn't result even in 30 minutes.
Is this expected that producing floating point solutions require much more time, or it is some mistake on my side?
As with any question regarding solver performance, it is impossible to make generalizations. Christoph Wintersteiger (https://stackoverflow.com/users/869966/christoph-wintersteiger) would be the expert on this to opine, but I'm not sure how closely he follows this group. Chris: If you're reading this, I'd love to hear your thoughts!
There's also the risk of comparing apples-to-oranges here: Reals and floats are two completely different logics, with different decision procedures, heuristics, algorithms, etc. I'm sure you can find problems where one outperforms the other, with no clear "performance" winner.
Having said all that, here are some things that make floating-point (FP) tricky:
Rewriting is almost impossible with FP, since rules like associativity simply
don't hold for FP addition/multiplication. So, there are fewer opportunities for
simplification before bit-blasting.
Similarly a * 1/a == 1 doesn't hold for floats. Neither does x + 1 /= x or (x + a == x) -> (a == 0) and many other "obvious" simplifications that you'd love to be able to make. All of this complicates reasoning.
Existence of NaN values make equality non-reflexive: Nothing compares equal to NaN including itself. So, substitution of equals-for-equals is also problematic and requires side conditions.
Likewise, the existence of +0 and -0, which compare equal but behave differently due to rounding complicate matters. The typical example is x == 0 -> fma(a, b, x) == a * b doesn't hold (where fma is fused multiply-add), because depending on the sign of zero these two expressions can produce different values for different rounding modes.
Combination of floats with integers and reals introduce non-linearity, which is always a soft-spot for solvers. So, if you're using FP, it is advisable not to mix it with other theories as the combination itself creates extra complexity.
Operations like multiplication, division, and remainder (yes, there's a floating-point remainder operation!) are inherently very complex and lead to extremely large SAT formulas. In particular, multiplication is a known operation that challenges most SAT/BDD engines, due to lack of any good variable ordering and splitting heuristics. Solvers end-up bit-blasting fairly quickly, resulting in extremely large state-spaces. I have observed that solvers have a hard time dealing with FP division and remainder operations even when the input is completely concrete, imagine what happens when they are fully symbolic!
The logic of reals have a decision procedure with a double-exponential complexity. However, techniques like Fourier-Motzkin elimination (https://en.wikipedia.org/wiki/Fourier%E2%80%93Motzkin_elimination), while remaining exponential, perform really well in practice.
FP solvers are relatively new, and it's a niche area with nascent research. So existing solvers tend to be quite conservative and quickly bit-blast and reduce the problem to bit-vector logic. I would expect them to improve over time, just like all the other theories did.
Again, I want to emphasize comparing solver performance on these two different logics is misguided as they are totally different beasts. But hopefully, the above points illustrate why floating-point is tricky in practice.
A great paper to read about the treatment of IEEE754 floats in SMT solvers is: http://smtlib.cs.uiowa.edu/papers/BTRW14.pdf. You can see the myriad of operations it supports and get a sense of the complexity.
Cirdec's answer to a largely unrelated question made me wonder how best to represent natural numbers with constant-time addition, subtraction by one, and testing for zero.
Why Peano arithmetic isn't good enough:
Suppose we use
data Nat = Z | S Nat
Then we can write
Z + n = n
S m + n = S(m+n)
We can calculate m+n in O(1) time by placing m-r debits (for some constant r), one on each S constructor added onto n. To get O(1) isZero, we need to be sure to have at most p debits per S constructor, for some constant p. This works great if we calculate a + (b + (c+...)), but it falls apart if we calculate ((...+b)+c)+d. The trouble is that the debits stack up on the front end.
One option
The easy way out is to just use catenable lists, such as the ones Okasaki describes, directly. There are two problems:
O(n) space is not really ideal.
It's not entirely clear (at least to me) that the complexity of bootstrapped queues is necessary when we don't care about order the way we would for lists.
As far as I know, Idris (a dependently-typed purely functional language which is very close to Haskell) deals with this in a quite straightforward way. Compiler is aware of Nats and Fins (upper-bounded Nats) and replaces them with machine integer types and operations whenever possible, so the resulting code is pretty effective. However, that's not true for custom types (even isomorphic ones) as well as for compilation stage (there were some code samples using Nats for type checking which resulted in exponential growth in compile-time, I can provide them if needed).
In case of Haskell, I think a similar compiler extension may be implemented. Another possibility is to make TH macros which would transform the code. Of course, both of options aren't easy.
My understanding is that in basic computer programming terminology the underlying problem is you want to concatenate lists in constant time. The lists don't have cheats like forward references, so you can't jump to the end in O(1) time, for example.
You can use rings instead, which you can merge in O(1) time, regardless if a+(b+(c+...)) or ((...+c)+b)+a logic is used. The nodes in the rings don't need to be doubly linked, just a link to the next node.
Subtraction is the removal of any node, O(1), and testing for zero (or one) is trivial. Testing for n > 1 is O(n), however.
If you want to reduce space, then at each operation you can merge the nodes at the insertion or deletion points and weight the remaining ones higher. The more operations you do, the more compact the representation becomes! I think the worst case will still be O(n), however.
We know that there are two "extremal" solutions for efficient addition of natural numbers:
Memory efficient, the standard binary representation of natural numbers that uses O(log n) memory and requires O(log n) time for addition. (See also Chapter "Binary Representations" in the Okasaki's book.)
CPU efficient which use just O(1) time. (See Chapter "Structural Abstraction" in the book.) However, the solution uses O(n) memory as we'd represent natural number n as a list of n copies of ().
I haven't done the actual calculations, but I believe for the O(1) numerical addition we won't need the full power of O(1) FIFO queues, it'd be enough to bootstrap standard list [] (LIFO) in the same way. If you're interested, I could try to elaborate on that.
The problem with the CPU efficient solution is that we need to add some redundancy to the memory representation so that we can spare enough CPU time. In some cases, adding such a redundancy can be accomplished without compromising the memory size (like for O(1) increment/decrement operation). And if we allow arbitrary tree shapes, like in the CPU efficient solution with bootstrapped lists, there are simply too many tree shapes to distinguish them in O(log n) memory.
So the question is: Can we find just the right amount of redundancy so that sub-linear amount of memory is enough and with which we could achieve O(1) addition? I believe the answer is no:
Let's have a representation+algorithm that has O(1) time addition. Let's then have a number of the magnitude of m-bits, which we compute as a sum of 2^k numbers, each of them of the magnitude of (m-k)-bit. To represent each of those summands we need (regardless of the representation) minimum of (m-k) bits of memory, so at the beginning, we start with (at least) (m-k) 2^k bits of memory. Now at each of those 2^k additions, we are allowed to preform a constant amount of operations, so we are able to process (and ideally remove) total of C 2^k bits. Therefore at the end, the lower bound for the number of bits we need to represent the outcome is (m-k-C) 2^k bits. Since k can be chosen arbitrarily, our adversary can set k=m-C-1, which means the total sum will be represented with at least 2^(m-C-1) = 2^m/2^(C+1) ∈ O(2^m) bits. So a natural number n will always need O(n) bits of memory!
I am teaching myself verilog. The book I am following stated in the introduction chapters that to perform division we use the '/' operator or '%' operator. In later chapters it's saying that division is too complex for verilog and cannot be synthesized, so to perform division it introduces a long algorithm.
So I am confused, can't verilog handle simple division? is the / operator useless?
It all depends what type of code you're writing.
If you're writing code that you intend to be synthesised, that you intend to go into an FPGA or ASIC, then you probably don't want to use the division or modulo operators. When you put any arithmetic operator in RTL the synthesiser instances a circuit to do the job; An adder for + & -; A multiplier for *. When you write / you're asking for a divider circuit, but a divider circuit is a very complex thing. It often takes multiple clock cycles, and may use look up tables. It's asking a lot of a synthesis tool to infer what you want when you write a / b.
(Obviously dividing by powers of 2 is simple, but normally you'd use the shift operators)
If you're writing code that you don't want to be synthesised, that is part of a test bench for example, then you can use division all you want.
So to answer your question, the / operator isn't useless, but you have be concious of where and why you're using it. The same is true of *, but to a lesser degree. Multipliers are quite expensive, but most synthesisers are able to infer them.
You have to think in hardware.
When you write a <= b/c you are saying to the synthesis tool "I want a divider that can provide a result every clock cycle and has no intermediate pipline registers".
If you work out the logic circuit required to create that it's very complex, especially for higher bit counts. Generally FPGAs won't have specialist hardware blocks for division so it would have to be implemented out of generic logic resources. It's likely to be both big (lots of luts) and slow (low fmax).
Some synthesisers may implement it anyway (from a quick search it seems quartus will), others won't bother because they don't think it's very useful in practice.
If you are dividing by a constant and can live with an approximate result then you can do tricks with multipliers. Take the reciprocal of what you wanted to divide by, multiply it by a power of two and round to the nearest integer.
Then in your verilog you can implement your approximate divide by multiply (which is not too expensive on modern FPGAS) followed by shift (shifting by a fixed number of bits is essentially free in hardware). Make sure you allow enough bits for the intermediate result.
If you need an exact answer or if you need to divide by something that is not a pre-defined constant you will have to decide what kind of divider you want. IF your throughput is low then you can use a state machine based approach which does one division every n clock cycles. If your throughput is high and you can afford the device area then a pipelined approach which does a division per clock cycle (but requires multiple cycles for the result to flow through) may be more appropriate.
Often tool vendors will provide pre-made blocks (altera calls them megafunctions) for this kind of stuff. The advantage of these is that the tool vendor will likely have carefully optimised them for the device. The downside is they can bring vendor lockin, if you want to move to a different device vendor you will most likely have to swap out the block and the block you swap it with may have different characteristics.
So im confused. cant verilog handle simple division? is the / operator
useless?
The verilog synthesis spec (IEEE 1364.1) actually indicates all arithmetic operators with integer operands should be supported but nobody follows this spec. Some synthesis tools can do integer division but others will reject it(I think XST still does) because combinational division is typically very area inefficient. Multicycle implementations are the norm but these cannot be synthesized from '/'.
Division and modulo are never "simple". Avoid them if you can do so, e.g. through bit masks or shift operations. Especially a variable divisor is really complicated to implement in hardware.
"Verilog the language" handles division and modulo just fine - when you are using a computer to simulate your code you have full access to all it's abilities.
When you are synthesising your code to a particular chip, there are limitations. The limitations tend to be based on what the tool-vendor thinks is "sensible" rather than what is feasible.
In the old days, division by anything other than a power-of-two was deemed to be non-sensible for silicon as it took up a lot of space and ran very slowly. At the moment, some synthesisers with create "divide by a constant" circuits for you.
In future, I see no reason why the synthesiser shouldn't create you a divider (or make use of one that is in the DSP blocks of a potential future architecture). Whether it will or not remains to be seen, but witness the progression of multipliers (from "only powers of two" to "one input constant" to "full implementation" in just a few years)
circuits including only division by 2 : just shift the bit :)
other than 2 .... see you should always think at circuit level verilog is NOT C or C++
/ and % is not synthesizable or if it becomes( in new versions) i believe you should keep your own division circuit this is because the ip they provide will be general ( most probably they will make for floating not fixed)
i bet you had gone through morris mano computer architechure book , there in some last chapters the whole flow is given along with algo , go through it follow it and make your own
see now if your works go for only logic verification and no real circuit is needed , sure go for / and % . no problem it will work for simulation
Division using '/' is possible in verilog. But it is not a synthesizable operator. Same is the case for multiplication using '*'. There are certain algorithms to perform these operations in verliog, and they are used if the code needs to be synthesizable. ie. if you require an equivalent hardware for it.
I am not aware of any algorithms for division, but for multiplication, i have used Booth's algorithm.
if you want the synthesizable code you can use the Divison_IP or you can use the right shifting operator for some divisions like 64/8=8 same 64>>3 = 8.
Division isn't simple in hardware as people spent a lot of time in an efficient
and fast multiplier as an example. However, you can do divid by 2 easily by right shifting one bit in hardware.
Actually your point is very valid and I was also confused in my initial days of learning HDLs.
When you synthesise a division operator, it consumes a lot of resources on FPGA or during logic synthesis for ASIC. Try following instead.
You can also perform division(and multiplication) by shifting some vector(right = division, left = multiplication). But that will be multiplication(and divion) by 2.
Example 0100 = 4
Shift right 0010 = 2(which is 4/2)
Shift left 1000 = 8(which is 4*2).
We use >> operator for shift right, and << for shift left.
But we can also produce variations out of it.
For example multiplication by 3.
So if we have 0100 (4 dec) then also will be
shift left and add one at each step. ((0100 << 1)+1)
Similarly division by 3
shift right and subtract one at each step. ((0100 >> 1) - 1)
These methods were made because to be honest, resources in FPGA are limited, and when it comes to ASICs, your manager tries to kill you for any additional logic. :)
The division operator / is not useless in Verilog/System Verilog. It works in case of simulations as usual mathematical operator.
Some synthesis tools like Xilinx Vivado synthesize the division operator also because it is having a pre-built algorithm in it (though takes more hardware gates).
In simple words, you can do division in Verilog but have to take care of tools and simulators.
Using result <= a/b and works perfectly.
Remember when using the <= operator, the answer is calculated immediately but the answer is entered inside the "result" register at next clock positive edge.
If you don't want to wait till next clock positive edge use result = a/b.
Remember, any arithmetic operation circuit needs some time to finish the operation, and during this time the circuit generates random numbers (bits).
Its like when A-10 warthog attack airplane attacks a tank it shoots lots of bullets. That's how the divider circuit acts while dividing,it spits random bits. After couple of nanoseconds it will finish dividing and return a stable good result.
This is why we wait until next clock cycle for the "result" register. We try to protect it from random garbage numbers.
Division is the most complex operation, so it will have a delay in calculation. For 16bit division the result will be calculated in approximately 6 nanoseconds.