Division in verilog - verilog

I am teaching myself Verilog. The book I am following stated in the introduction chapters that to perform division we use the '/' operator or '%' operator. In later chapters it says that division is too complex for Verilog and cannot be synthesized, so to perform division it introduces a long algorithm.
So I am confused: can't Verilog handle simple division? Is the / operator useless?

It all depends on what type of code you're writing.
If you're writing code that you intend to be synthesised, that you intend to go into an FPGA or ASIC, then you probably don't want to use the division or modulo operators. When you put any arithmetic operator in RTL the synthesiser instantiates a circuit to do the job: an adder for + and -, a multiplier for *. When you write / you're asking for a divider circuit, but a divider circuit is a very complex thing. It often takes multiple clock cycles, and may use look-up tables. It's asking a lot of a synthesis tool to infer what you want when you write a / b.
(Obviously dividing by powers of 2 is simple, but normally you'd use the shift operators.)
If you're writing code that you don't want to be synthesised, that is part of a test bench for example, then you can use division all you want.
So to answer your question, the / operator isn't useless, but you have to be conscious of where and why you're using it. The same is true of *, but to a lesser degree. Multipliers are quite expensive, but most synthesisers are able to infer them.

You have to think in hardware.
When you write a <= b/c you are saying to the synthesis tool "I want a divider that can provide a result every clock cycle and has no intermediate pipeline registers".
If you work out the logic circuit required to create that, it's very complex, especially for higher bit counts. Generally FPGAs won't have specialist hardware blocks for division, so it would have to be implemented out of generic logic resources. It's likely to be both big (lots of LUTs) and slow (low Fmax).
Some synthesisers may implement it anyway (from a quick search it seems Quartus will); others won't bother because they don't think it's very useful in practice.
If you are dividing by a constant and can live with an approximate result then you can do tricks with multipliers. Take the reciprocal of what you wanted to divide by, multiply it by a power of two and round to the nearest integer.
Then in your Verilog you can implement your approximate divide as a multiply (which is not too expensive on modern FPGAs) followed by a shift (shifting by a fixed number of bits is essentially free in hardware). Make sure you allow enough bits for the intermediate result.
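For example, here is a minimal sketch of that trick for an approximate divide-by-10 (the constant, the 16-bit width and the module name are arbitrary choices for illustration, and the result can be off by one for large inputs unless more fractional bits are used):

    // Approximate division by the constant 10 using multiply-and-shift.
    // round(2^16 / 10) = 6554, so q ~= (x * 6554) >> 16.
    module div10_approx (
        input  wire [15:0] x,
        output wire [15:0] q
    );
        wire [31:0] scaled = x * 32'd6554;   // 32-bit intermediate so nothing overflows
        assign q = scaled[31:16];            // ">> 16" is just a bit-select here
    endmodule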
If you need an exact answer, or if you need to divide by something that is not a pre-defined constant, you will have to decide what kind of divider you want. If your throughput is low then you can use a state-machine based approach which does one division every n clock cycles. If your throughput is high and you can afford the device area, then a pipelined approach which does a division per clock cycle (but requires multiple cycles for the result to flow through) may be more appropriate.
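As a rough illustration of the low-throughput, state-machine style, here is a sketch of a classic restoring divider that produces one quotient bit per clock (the start/done handshake, names and widths are my own assumptions, not a standard interface; the divisor is assumed to be non-zero):

    // Restoring divider sketch: one quotient bit per clock, N cycles per divide.
    module restoring_div #(parameter N = 16) (
        input  wire         clk,
        input  wire         rst,
        input  wire         start,
        input  wire [N-1:0] dividend,
        input  wire [N-1:0] divisor,     // assumed non-zero
        output wire [N-1:0] quotient,
        output wire [N-1:0] remainder,
        output reg          done         // high for one cycle when the result is ready
    );
        reg [N-1:0] quot, dvd, rem;
        reg         busy;
        integer     count;

        // shift the next dividend bit into the partial remainder and try subtracting
        wire [N:0] shifted = {rem, dvd[N-1]};
        wire [N:0] trial   = shifted - {1'b0, divisor};

        always @(posedge clk) begin
            done <= 1'b0;
            if (rst) begin
                busy <= 1'b0;
            end else if (start && !busy) begin
                busy  <= 1'b1;
                rem   <= 0;
                quot  <= 0;
                dvd   <= dividend;
                count <= N;
            end else if (busy) begin
                if (trial[N]) begin                 // subtraction went negative: restore
                    rem  <= shifted[N-1:0];
                    quot <= {quot[N-2:0], 1'b0};
                end else begin
                    rem  <= trial[N-1:0];
                    quot <= {quot[N-2:0], 1'b1};
                end
                dvd   <= {dvd[N-2:0], 1'b0};
                count <= count - 1;
                if (count == 1) begin
                    busy <= 1'b0;
                    done <= 1'b1;                   // quot/rem hold the result next cycle
                end
            end
        end

        assign quotient  = quot;
        assign remainder = rem;
    endmodule

A pipelined divider unrolls these N steps into N register stages instead, trading area for one result per clock cycle.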
Often tool vendors will provide pre-made blocks (Altera calls them megafunctions) for this kind of thing. The advantage of these is that the tool vendor will likely have carefully optimised them for the device. The downside is they can bring vendor lock-in: if you want to move to a different device vendor you will most likely have to swap out the block, and the block you swap it with may have different characteristics.

So I'm confused. Can't Verilog handle simple division? Is the / operator useless?
The Verilog synthesis spec (IEEE 1364.1) actually indicates that all arithmetic operators with integer operands should be supported, but nobody follows this spec. Some synthesis tools can do integer division but others will reject it (I think XST still does), because combinational division is typically very area-inefficient. Multi-cycle implementations are the norm, but these cannot be synthesized from '/'.

Division and modulo are never "simple". Avoid them if you can, e.g. through bit masks or shift operations. A variable divisor in particular is really complicated to implement in hardware.

"Verilog the language" handles division and modulo just fine - when you are using a computer to simulate your code you have full access to all it's abilities.
When you are synthesising your code to a particular chip, there are limitations. The limitations tend to be based on what the tool-vendor thinks is "sensible" rather than what is feasible.
In the old days, division by anything other than a power of two was deemed to be non-sensible for silicon, as it took up a lot of space and ran very slowly. At the moment, some synthesisers will create "divide by a constant" circuits for you.
In the future, I see no reason why the synthesiser shouldn't create a divider for you (or make use of one in the DSP blocks of a potential future architecture). Whether it will or not remains to be seen, but witness the progression of multipliers (from "only powers of two" to "one input constant" to "full implementation" in just a few years).

Circuits involving only division by 2: just shift the bits. :)
For anything other than 2, you should always think at the circuit level; Verilog is NOT C or C++.
/ and % are not synthesizable, and even if they become so in newer tool versions, I believe you should keep your own division circuit, because the IP the tools provide will be general purpose (most probably for floating point, not fixed point).
I bet you have gone through the Morris Mano computer architecture book; in some of the last chapters the whole flow is given along with the algorithm. Go through it, follow it, and make your own.
If your work calls only for logic verification and no real circuit is needed, then sure, go for / and %. No problem, it will work in simulation.

Division using '/' is possible in Verilog, but it is not a synthesizable operator. The same is the case for multiplication using '*'. There are certain algorithms to perform these operations in Verilog, and they are used if the code needs to be synthesizable, i.e. if you require equivalent hardware for it.
I am not aware of any algorithms for division, but for multiplication I have used Booth's algorithm.

If you want synthesizable code you can use a division IP core, or you can use the right-shift operator for some divisions, e.g. 64/8 = 8 is the same as 64 >> 3 = 8.
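A trivial sketch of the shift version (names and widths are placeholders):

    // A right shift by 3 divides an unsigned value by 8, so 64 >> 3 == 64 / 8 == 8.
    module div8 (
        input  wire [7:0] x,
        output wire [7:0] q
    );
        assign q = x >> 3;   // same result as x / 8 for unsigned x
    endmodule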

Division isn't simple in hardware; as an example, people have spent a lot of time just getting an efficient and fast multiplier, and division is harder still. However, you can divide by 2 easily by right-shifting one bit in hardware.

Actually your point is very valid and I was also confused in my initial days of learning HDLs.
When you synthesise a division operator, it consumes a lot of resources on an FPGA or during logic synthesis for an ASIC. Try the following instead.
You can also perform division (and multiplication) by shifting a vector (right = division, left = multiplication), but that will be division (and multiplication) by 2.
Example: 0100 = 4
Shift right: 0010 = 2 (which is 4/2)
Shift left: 1000 = 8 (which is 4*2).
We use the >> operator for shift right, and << for shift left.
But we can also produce variations out of it.
For example, multiplication by 3: shift left once and add the original value.
So if we have 0100 (4 decimal), then
(0100 << 1) + 0100 = 1100 (12).
Division by a non-power-of-two constant such as 3 cannot be done with shifts alone; it is usually approximated by multiplying by a scaled reciprocal and then shifting, as described in an earlier answer.
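A minimal sketch of both ideas, with arbitrary 4-bit widths (the constant 22 is just 2^6/3 rounded up; it happens to give exact results for 4-bit inputs, while wider inputs need more fractional bits):

    // Multiply by 3 with a shift and an add, and approximate division by 3
    // with a scaled-reciprocal multiply followed by a shift.
    module times3_div3 (
        input  wire [3:0] x,
        output wire [5:0] times3,   // 3*x = (x << 1) + x, e.g. 3*4 = 8 + 4 = 12
        output wire [3:0] div3      // x/3 ~= (x * 22) >> 6, since 22 ~= 2^6 / 3
    );
        assign times3 = (x << 1) + x;
        assign div3   = (x * 22) >> 6;
    endmodule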
These methods exist because, to be honest, resources in an FPGA are limited, and when it comes to ASICs your manager will try to kill you for any additional logic. :)

The division operator / is not useless in Verilog/SystemVerilog. In simulation it works as a normal mathematical operator.
Some synthesis tools, like Xilinx Vivado, synthesize the division operator as well, because they have a built-in algorithm for it (though it takes more hardware gates).
In simple words, you can do division in Verilog, but you have to take care of what your tools and simulators support.

Using result <= a/b works perfectly.
Remember that when using the <= operator, the answer is calculated combinationally, but it is loaded into the "result" register only at the next positive clock edge.
If you don't want to wait until the next positive clock edge, use result = a/b.
Remember, any arithmetic circuit needs some time to finish the operation, and during this time the circuit outputs garbage bits.
It's like when an A-10 Warthog attack plane attacks a tank: it shoots lots of bullets. That's how the divider circuit acts while dividing, it spits out garbage bits. After a couple of nanoseconds it will finish dividing and return a stable, good result.
This is why we wait until the next clock cycle for the "result" register: we are protecting it from those garbage values.
Division is the most complex operation, so it has the longest calculation delay; for a 16-bit division the result will settle in approximately 6 nanoseconds.
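A minimal sketch of the registered pattern this answer describes, assuming the synthesis tool accepts / at all (names and widths are arbitrary):

    // The combinational divider gets a full clock period to settle; 'result'
    // only samples its output at the rising edge.
    module registered_div (
        input  wire        clk,
        input  wire [15:0] a,
        input  wire [15:0] b,
        output reg  [15:0] result
    );
        always @(posedge clk)
            result <= a / b;   // non-blocking: captured only at the clock edge
    endmodule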

Related

Are floating point operations deterministic when running in multiple threads?

Suppose I have a function that runs calculations, an example being something like a dot product - I pass in arrays A, B of vectors and a float array C, and the function assigns:
C[i] = dot(A[i], B[i]);
If I create and start two threads that will run this function, and pass in the same three arrays to both threads, under what circumstances is this type of action (perhaps using a different non-random mathematical operation etc.) not guaranteed to give the same result (running the same application without any recompilation, and on the same machine)? I'm only interested in the context of a consumer PC.
I know that float operations are in general deterministic, but I do wonder whether perhaps something weird could happen and maybe on one thread the calculations will use an intermediate 80 bit register, but not in the other.
I would assume it's pretty much guaranteed the same binary code should run in both threads (is there some way this could not happen? The function being compiled multiple times for some reason, the compiler somehow figuring out it will run in multiple threads, and compiling it again, for some reason, for the second thread?).
But I'm a bit more worried that CPU cores might not have the same instruction sets, even on consumer-level PCs.
Side question - what about GPUs in a similar scenario?
//
I'm assuming x86_64, Windows, C++, and dot is a.x * b.x + a.y * b.y. Can't give more info than that - using Unity IL2CPP, don't know how it compiles/with what options.
Motivation for the question: I'm writing a computational geometry procedure that modifies a mesh - I'll call this the "geometric mesh". The issue is that it could happen that the "rendering mesh" has multiple vertices for certain geometric positions - it's needed for flat shading for example - you have multiple vertices with different normals. However, the actual computational geometry procedure only uses purely geometric data of the positions in space.
So I see two options:
Create a map from the rendering mesh to the geometric mesh (example - duplicate vertices being mapped to one unique vertex), run the procedure on the geometric mesh, then somehow modify the rendering mesh based on the result.
Work with the rendering mesh directly. Slightly more inefficient as the procedure does calculations for all vertices, but much easier from a code perspective. But most of all I'm a bit worried that I could get two different values for two vertices that actually have the same position and that shouldn't happen. Only the position is used, and the position would be the same for both such vertices.
Floating point (FP) operations are not associative (though they are commutative). As a result, (x+y)+z can give different results than x+(y+z). For example, (1e-13 + (1 - 1e-13)) == ((1e-13 + 1) - 1e-13) is false with 64-bit IEEE-754 floats. The C++ standard is not very restrictive about floating-point numbers. However, the widely-used IEEE-754 standard is. It specifies the precision of 32-bit and 64-bit number operations, including rounding modes. x86-64 processors are IEEE-754 compliant and mainstream compilers (e.g. GCC, Clang and MSVC) are also IEEE-754 compliant by default. ICC is not compliant by default since it assumes FP operations are associative for the sake of performance. Mainstream compilers have compilation flags to make the same assumption in order to speed up code. Such flags are generally combined with others, like the assumption that no FP values are NaN (e.g. -ffast-math). These flags break IEEE-754 compliance, but they are often used in the 3D or video game industry to speed up code. IEEE-754 is not required by the C++ standard, but you can check for it with std::numeric_limits<T>::is_iec559.
Threads can have different rounding modes by default. However, you can set the rounding mode using the C code provided in this answer. Also, please note that denormal numbers are sometimes disabled on some platforms because of their very-high overhead (see this for more information).
Assuming IEEE-754 compliance is not broken, the rounding mode is the same and the threads do the operations in the same order, then the result should be identical up to at least 1 ULP. In practice, if they are compiled using the same mainstream compiler, the result should be exactly the same.
The thing is, using multiple threads often results in a non-deterministic order of the applied FP operations, which causes non-deterministic results. More specifically, atomic operations on FP variables often cause such an issue because the order of the operations often changes at runtime. If you want deterministic results, you need to use static partitioning and avoid atomic operations on FP variables, or more generally any atomic operations that could result in a different ordering. The same thing applies to locks or any synchronization mechanism.
The same thing is true for GPUs. In fact, this problem is very frequent when developers use atomic FP operations, for example to sum values. They often do that because implementing fast reductions is complex (though it is more deterministic) and atomic operations are pretty fast on modern GPUs (since they use dedicated efficient units).
According to the accepted answer to floating point processor non-determinism?, C++ floating point is not non-deterministic. The same sequence of instructions will give the same results.
There are a few things to take into account though:
Firstly, the behavior (i.e. the result) of a particular piece of C++ source code doing a FP calculation may depend on the compiler and the chosen compiler options. For example, it may depend on whether the compiler chooses to emit 64 or 80 bit FP instructions. But this is deterministic.
Secondly, similar C++ source code may give different results; e.g. due to non-associative behavior of certain FP instructions. This also is deterministic.
Determinism won't be affected by multi-threading by default. The C++ compiler will probably be unaware of whether the code is multi-threaded or not. And it definitely has no reason to emit different FP code.
Admittedly, FP behavior depends on the rounding mode selected, and that can be set on a per-thread basis. However, for this to happen, something (application code) would have to explicitly set different rounding modes for different threads. Once again, that is deterministic. (And a pretty daft thing for the application code to do, IMO.)
The idea that a PC would use different FP hardware with different behavior for different threads seems far-fetched to me. Sure, a PC could have (say) an Intel chipset and an ARM chipset, but it is not plausible that different threads of the same C++ application (executable) would simultaneously run on both chipsets.
Likewise for GPUs. Indeed, given that you need to program GPUs in a way that is radically different to ordinary (or threaded) C++, I would doubt that they could even share the same source code.
In short, I think that you are worrying about a hypothetical problem that you are unlikely to encounter in reality ... given the current state of the art in hardware and C++ compilers.

verilog synthesis not converging after 2000 iterations

I have written the below code for a simple multiplication of two n-bit numbers (here n=16). It simulates with the desired output waveform, but the problem is that it does not synthesize in Vivado 17.2 even though I have written a static 'for' loop (i.e., the loop iteration count is constant). I am getting the error mentioned below.
[Synth 8-3380] loop condition does not converge after 2000 iterations
Note: I have written
for(i=i;i<n;i=i+1)
instead of
for(i=0;i<n;i=i+1)
because the latter one was executing once again after i reached n. So that is not a mistake. Kindly someone help. Thank you for your time
//unsigned integer multiplier
module multiplication(product,multiplier,multiplicand,clk,rset);
    parameter n = 16;
    output reg [(n<<1)-1:0] product;
    input [n-1:0] multiplier, multiplicand;
    input clk, rset;
    reg [n:0] i = 'd0;

    always @(posedge clk or posedge rset)
    begin
        if (rset) product <= 'd0;
        else
        begin
            for (i=i; i<n; i=i+1)
            begin
                product = (multiplier[i] == 1'b1) ? product + (multiplicand << i) : product;
                $display("product =%d,i=%d", product, i);
            end
        end
    end
endmodule
First of all, it is not good practice to use for/while style loops if you really want to implement your design on an FPGA (Vivado is optimized for implementing designs on FPGAs). Even if you can successfully synthesize your design, you might face timing problems or unexpected bugs.
I think you can find your answer here.
Edit: I just wanted to point out that controlling timing is generally very important in HW design, especially when you want to integrate your design with other systems; loops can be a nightmare for that.
Go back to using your original for (i=0; ...) loop.
Your error is that you assume i=0 because of reg [n:0] i = 'd0; That is only true the very first time, i.e. only once, at the start of the simulation.
because the latter one was executing once again after i reached n.
Yes, the loop will repeat again and again for every clock cycle. That is what @(posedge clk ...) does.
More errors:
You are using blocking assignment in the clock section, use non-blocking:
product <= (multiplier[i] == 1'b1) ? product + (multiplicand << i) : product;
Your product is only correct the first time after a reset (when it starts at zero). The second clock cycle after a reset you do the multiplication again but start with the previous value of product.
Your i is a bit big: you use 17 bits to count to 16. Also, global loop variables have pitfalls. I suggest you use the SystemVerilog syntax: for (int i=0; ...).
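Putting those fixes together, a corrected version might look roughly like this (my own reworking, untested against the asker's tool; it keeps the fully combinational shift-and-add, just restarted from zero every cycle so the result no longer accumulates across clocks):

    // unsigned integer multiplier: combinational shift-and-add, registered output
    module multiplication #(parameter n = 16) (
        output reg  [2*n-1:0] product,
        input  wire [n-1:0]   multiplier,
        input  wire [n-1:0]   multiplicand,
        input  wire           clk,
        input  wire           rset
    );
        integer i;
        reg [2*n-1:0] acc;

        always @(posedge clk or posedge rset) begin
            if (rset)
                product <= 'd0;
            else begin
                acc = 'd0;                            // blocking: acc is a loop temporary
                for (i = 0; i < n; i = i + 1)
                    if (multiplier[i])
                        acc = acc + (multiplicand << i);
                product <= acc;                       // non-blocking for the real register
            end
        end
    endmodule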
The error message is not about the iterations of your loop, but about the iterations of the internal algorithm of your synthesis program. It tells you that Vivado failed to create a circuit that could actually be implemented on your FPGA of choice at your clock speed of choice and which actually does what you're asking for.
Let me elaborate, after I mention two items of general import: don't use blocking assignments (=) inside always @(posedge clk) blocks. They almost never do what you want. Only non-blocking assignments (<=) should be clocked. Secondly, the correct way to synthesize a for loop is with generate, even though Vivado seems to accept the simple for.
You are building a fairly large block of combinatorial logic here. Remember that when you synthesize combinatorial logic you are asking for a circuit that can evaluate the expression that you've written within a single clock cycle. The actual expression taken into consideration is the one after the for loop has been unrolled. I.e., you are (conditionally) adding a 16-bit number to a 32-bit number 16 times, each time shifted one bit further to the left. Each of these additions has a carry bit. So each of these additions will actually have to look at all of the upper 16 bits of the result of the previous addition, and each will depend on all but the lowest bit of the previous addition. An adder for one bit plus carry needs around five gates. Each addition is conditional, which adds at least one more gate for each bit. So you are asking for a minimum of 16*16*6 ≈ 1500 interdependent gates that all have to stabilize within a single clock cycle. Vivado is telling you that it cannot fulfill these requirements.
One way of mitigating this problem will be to synthesize for a lower clock frequency, where the gates have more time to stabilize, and you can thus build longer logic chains. Another option would be to pipeline the operation, say by only evaluating what corresponds to four iterations of your loop within a single clock cycle and then building the result over several clock cycles. This will add some bookkeeping logic to your code, but it is inevitable if one wants to evaluate complex expressions with finite resources at high clock frequencies. It would also introduce you to synchronous logic, which you will have to learn anyway if you want to do anything non-trivial with an FPGA. Note that this kind of pipelining wouldn't affect throughput significantly, because your FPGA would then be doing several multiplications in parallel.
You may also be able to rewrite the expression to handle the carry bits and interdependencies in a smarter way that allows Vivado to find its way through the expression (there probably is such a way, I assume it synthesizes if you simply write the multiplication operator?).
Finally, many FPGAs come with dedicated multiplier units because multiplication is a common operation, but implementing it in logic gates wastes a lot of resources. As you found out.
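In practice, as the previous paragraph hints, the simplest route is often to write the multiplication directly and let the tool map it onto those dedicated multiplier/DSP blocks (a sketch, with no guarantee for any particular device or tool):

    // Let the synthesiser infer a multiplier (often a DSP block on modern FPGAs).
    module mult16 (
        input  wire        clk,
        input  wire [15:0] a,
        input  wire [15:0] b,
        output reg  [31:0] p
    );
        always @(posedge clk)
            p <= a * b;   // registering the output helps the tool meet timing
    endmodule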

Synthesizable arithmetic shift in Verilog

I recently came across this answer on stackoverflow.
With Verilog, once you take a part-select, the result is unsigned. Use the $signed system task on the part select to make it signed.
Is this method synthesizable (i.e. the $signed system function)?
If it is not synthesizable, is there a different way to perform arithmetic shift on variables like a <= a>>>2 (this should give the quotient when a is divided by 4).
It certainly is synthesizable. Whether your particular tool supports it is another question.
You can also use a cast in SystemVerilog.
res = signed'(registers[0][0]) >>> 2;
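For reference, the plain-Verilog form with the $signed system function looks like this (signal names and widths are placeholders):

    // A part-select is unsigned, so >>> alone would shift in zeros.
    // Wrapping it in $signed makes >>> an arithmetic (sign-extending) shift.
    module arith_shift_example (
        input  wire [31:0]       data,
        output wire signed [7:0] res
    );
        assign res = $signed(data[15:8]) >>> 2;
    endmodule

Note that an arithmetic shift rounds towards negative infinity, while / truncates towards zero, so for negative values that are not multiples of 4 the results of a >>> 2 and a / 4 differ by one.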

Why perform subtraction using addition?

I am curious about why we use addition to carry out subtraction in digital design systems.
I have a hunch that it has something to do with cost and performance but I don't understand how it would save money or be more efficient.
This really breaks down into two questions:
Why not use dedicated circuits for each operation?
Every additional circuit in a design costs money and time to design, validate, and allocate budgets for space, power and timing. We will (almost) always come out ahead if we can design a single circuit that can handle multiple operations.
In digital circuits, "a + b" and "a - b" are two different operations that require different circuits. Fortunately though, in arithmetic, "a + b" is the same as "a + (-b)", so the math doesn't care which one we use. Some very smart people came up with a clever representation for signed numbers that takes advantage of this mathematical fact, called Two's Complement. This representation allows positive and negative numbers to be added and inverted easily.
Since we can now implement both addition and subtraction with one circuit, there's no point to creating a second one.
Okay, so why addition?
If the operations are now interchangeable, why did we implement everything in addition instead of subtraction? I can only speculate on this, but I suspect the following reasons:
An adder circuit is less complicated than a subtractor:
(Half adder and half subtractor circuit diagrams omitted.)
Note that the subtractor has an added inverter. Every transistor counts in a digital design, so all other things being equal, we should use the circuit with fewer components.
Furthermore, addition is simply a more common operation. Incrementing instruction pointers & other memory addresses, iterator/loop indexes, and other regularly formed values is much more common than stepping backwards. If we're choosing to either handle subtraction or addition natively (which we are, since we are only going to dedicate one circuit to both), then it makes sense to choose the most common one. Now we've added an extra step to subtraction operations (convert to two's complement), but it's preferable to adding an extra step to the more common addition operations.
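A classic illustration of that sharing is the combined adder/subtractor: XOR the second operand with the subtract flag and feed the same flag in as the carry-in, so a single adder performs either a + b or a + (~b) + 1 (a generic sketch, not tied to any particular device):

    // One adder handles both a + b and a - b via two's complement.
    module add_sub #(parameter W = 8) (
        input  wire [W-1:0] a,
        input  wire [W-1:0] b,
        input  wire         sub,   // 0: a + b, 1: a - b
        output wire [W-1:0] y
    );
        wire [W-1:0] b_sel = b ^ {W{sub}};   // conditionally invert b
        assign y = a + b_sel + sub;          // 'sub' doubles as the "+1" carry-in
    endmodule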
Further Reading
There is a very similar question on EE.SE: Addition and subtraction in digital electronics
Here on SO: What is “2's Complement”?

Verilog Best Practice - Incrementing a variable

I'm by no means a Verilog expert, and I was wondering if someone knew which of these ways to increment a value was better. Sorry if this is too simple a question.
Way A:
In a combinational logic block, probably in a state machine:
//some condition
count_next = count + 1;
And then somewhere in a sequential block:
count <= count_next;
Or Way B:
Combinational block:
//some condition
count_en = 1;
Sequential block:
if (count_en == 1)
count <= count + 1;
I have seen Way A more often. One potential benefit of Way B is that if you are incrementing the same variable in many places in your state machine, perhaps it would use only one adder instead of many; or is that false?
Which method is preferred and why? Do either have a significant drawback?
Thank you.
One potential benefit of Way B is that if you are incrementing the same variable in many places in your state machine, perhaps it would use only one adder instead of many; or is that false?
Any synthesis tool will attempt automatic resource sharing. How well they do so depends on the tool and code written. Here is a document that describes some features of Design Compiler. Notice that in some cases, less area means worse timing.
Which method is preferred and why? Do either have a significant drawback?
It depends. Verilog(for synthesis) is a means to implement some logic circuit but the spec does not specify exactly how this is done. Way A may be the same as Way B on an FPGA but Way A is not consistent with low power design on an ASIC due to the unconditional sequential assignment. Using reset nets is almost a requirement on an ASIC but since many FPGAs start in a known state, you can save quite a bit of resources by not having them.
I use Way A in my Verilog code. My sequential blocks have almost no logic in them; they just assign registers based on the values of the "wire regs" computed in the combinational always blocks. There is just less to go wrong this way. And with Verilog we need all the help we can get.
What is your definition of "better" ?
It can be better performance (faster maximum frequency of the synthesized circuit), smaller area (less logic gates), or faster simulation execution.
Let's consider the smaller-area case for Xilinx and Altera FPGAs. Registers in those FPGA families have an enable input. In your "Way B", count_en will be mapped directly onto that register enable input, which will result in fewer logic gates. Essentially, "Way B" provides more "hints" to the synthesis tool about how to better synthesize the circuit. It's also possible that most FPGA synthesis tools (I'm talking about Xilinx XST, Altera MAP, Mentor Precision, and Synopsys Synplify) will correctly infer a register enable input from "Way A".
If count_en is synthesized as a register enable input, that will result in better performance of the circuit, because your counter increment logic will have fewer logic levels.
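For instance, the "Way B" pattern that maps onto the flip-flop's enable pin looks roughly like this (a sketch; some_condition stands in for the state-machine condition, and the actual mapping depends on the device and tool version):

    // The combinational block decides *whether* to count; the sequential block
    // only adds when enabled, which maps naturally onto a register clock-enable.
    module counter_en (
        input  wire       clk,
        input  wire       some_condition,   // placeholder for the FSM condition
        output reg  [7:0] count
    );
        reg count_en;

        always @* begin
            count_en = 1'b0;
            if (some_condition)
                count_en = 1'b1;
        end

        always @(posedge clk)
            if (count_en)
                count <= count + 1'b1;
    endmodule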
Thanks
