Carry/auxiliary flag functions in x86 ALU - verilog

I'm trying to create a 8086 processor in Verilog, and I have a better-than-average fundamental understanding of most of the architecture (and can get along happily once I get past this point), but I can't seem to wrap my head around how the Carry and Auxiliary flags function in the ALU.
I understand that CF is triggered upon an addition or subtraction (in which case it's called borrow) that would cause the result to be larger than the bit width of the ALU.
But, how would I write Verilog code for addition and subtraction that would allow me to write to the FLAGS[0] (CF) bit and then re-access it to continue the operation? Can anyone give me examples that I can deconstruct?
Also, this is even more of a n00b question, but how is it that an ALU with carry operation can support creation of a 17-bit number if the SI and DI registers are only 16 bits in width? Where does that extra bit go or what gets done with it? What happens if multiplication creates that same bit overflow?
Many apologies for the novice-level questions. I almost feel like I'm set to get yelled at for some obvious ignorance or lack of understanding with regards to this. Many thanks to anyone who can help and give code lines for understanding to elucidate this for me.

I don't quite know what you mean and then re-access it to continue the operation, but if you're just asking how you can generate a carry bit from a 16 bit addition/subtraction, this is one way to do it (use concatenation to write result to two different registers):
always # posedge clk begin
if(add_with_carry)
{CF[0], result[15:0]} <= a[15:0] + b[15:0];
else if(sub_with_carry)
{CF[0], result[15:0]} <= a[15:0] - b[15:0];
else if(add_without_carry)
result[15:0] <= a[15:0] + b[15:0];
else if(sub_without_carry)
result[15:0] <= a[15:0] - b[15:0];
end
This is also basically the same thing as writing the result to a 17 bit register, and then just designating result[16] as the carry flag.

Related

verilog synthesis not converging after 2000 iterations

I have written the below code for a simple multiplication of 2 n-bit numbers(here n=16). It is getting simulated with desired output waveform but the problem is, it is not getting synthesized in vivado 17.2 even though I have written a static 'for loop'(ie., loop iteration is constant). I am getting below mentioned error.
[Synth 8-3380] loop condition does not converge after 2000 iterations
Note: I have written
for(i=i;i<n;i=i+1)
instead of
for(i=0;i<n;i=i+1)
because the latter one was executing once again after i reached n. So that is not a mistake. Kindly someone help. Thank you for your time
//unsigned integer multiplier
module multiplication(product,multiplier,multiplicand,clk,rset);
parameter n = 16;
output reg [(n<<1)-1:0]product;
input [n-1:0]multiplier, multiplicand;
input clk,rset;
reg [n:0]i='d0;
always #( posedge clk or posedge rset)
begin
if (rset) product <= 'd0;
else
begin
for(i=i;i<n;i=i+1)
begin
product =(multiplier[i] == 1'b1)? product + ( multiplicand << i ): product;
$display("product =%d,i=%d",product,i);
end
end
end
endmodule
First of all, it is not a good practice to use for, while kind of loops if you really want to implement your design on FPGA (Vivado is optimized to be used to implement your design on FPGA). Even if you can successfully synthesize your design, you might face with timing problems or unexpected bugs.
I think you can find your answer here.
Edit: I just wanted to inform you that generally controlling timing is very important in HW design, especially when you want to integrate your design with other system, loops can be nightmare for that.
Go back to using your original for (i=0 loop.
Your error is that you assume i=0 because of reg [n:0]i='d0; That is only true the very first time. Thus only once, at the start of the simulation.
because the latter one was executing once again after i reached n.
Yes, the loop will repeat again and again for every clock cycle. That is what #( posedge clk ...) does.
More errors:
You are using blocking assignment in the clock section, use non-blocking:
product <=(multiplier[i] == 1'b1)? product + ( multiplicand << i ): product;
Your product is only correct the first time after a reset (when it starts at zero). The second clock cycle after a reset you do the multiplication again but start with the previous value of product.
Your i is a bit big you use 17 bits to count to 16. Also global loop variables have pitfalls. I suggest you use the system Verilog syntax of: `for (int i=0; ....)
The error message is not about the iterations of your loop, but about the iterations of the internal algorithm of your synthesis program. It tells you that Vivado failed to create a circuit that could actually be implemented on your FPGA of choice at your clock speed of choice and which actually does what you're asking for.
Let me elaborate, after I mention two items of general import: don't use blocking assignments (=) inside always #(posedge clk) blocks. They almost never do what you want. Only non-blocking assignments (<=) should be clocked. Secondly, the correct way to synthesize a for loop is with generate, even though Vivado seems to accept the simple for.
You are building a fairly large block of combinatorial logic here. Remember that when you synthesize combinatorial logic you are asking for a circuit that can evaluate the expression that you've written within a single clock cycle. The actual expression taken into consideration is the one after the for loop has been unrolled. I.e., you are (conditionally) adding a 16bit number to a 32bit number 16 times, each times shifted one bit further to the left. Each of these additions has a carry bit. So each of these additions will actually have to look at all of the upper 16bits of the result of the previous addition, and each will depend on all but the lowest bit of the previous addition. An adder for one bit + carry needs O(5) gates. Each addition is conditional, which adds at least one more gate for each bit. So you are asking for at minimum of 16*16*6 = 1300 interdependent gates that all have to stabilize within a single clock cycle. Vivado is telling you that it cannot fulfill these requirements.
One way of mitigating this problem will be to synthesize for a lower clock frequency, where the gates have more time to stabilize, and you can thus build longer logic chains. Another option would be to pipeline the operation, say by only evaluating what corresponds to four iterations of your loop within a single clock cycle and then building the result over several clock cycles. This will add some bookkeeping logic to your code, but it is inevitable if one wants to evaluate complex expressions with finite resources at high clock frequencies. It would also introduce you to synchronous logic, which you will have to learn anyway if you want to do anything non-trivial with an FPGA. Note that this kind of pipelining wouldn't affect throughput significantly, because your FPGA would then be doing several multiplications in parallel.
You may also be able to rewrite the expression to handle the carry bits and interdependencies in a smarter way that allows Vivado to find its way through the expression (there probably is such a way, I assume it synthesizes if you simply write the multiplication operator?).
Finally, many FPGAs come with dedicated multiplier units because multiplication is a common operation, but implementing it in logic gates wastes a lot of resources. As you found out.

verilog coding style for synthesizable code

I coded something like the following:
always #(state or i1 or i2 or i3 or i4) begin
next = 5'bx;
err = 0; n_o1 = 1;
o2 = 0; o3 = 0; o4 = 0;
case (state) // synopsys full_case parallel_case
IDLE: begin
if (!i1) next = IDLE;
else if ( i2) next = S1;
else if ( i3) next = S2;
else next = ERROR;
end
S1: begin
if (!i2) next = S1;
else if ( i3) next = S2;
else if ( i4) next = S3;
else next = ERROR;**strong text**
...
My manager, of course I don't want to argue with him before I have some strong argument, but he reviewed my code and said writing
next = 5'bx;
err = 0; n_o1 = 1;
o2 = 0; o3 = 0; o4 = 0;
in a combinational logic without putting the right side in the sensitivity list will cause problems in synthesis.By not having these 3 lines, I need to explicitly write the else part inside each individual case, and he said yes.
I am wondering is there anything wrong with this coding style? And will it cause a synthesis problem or any sort of problem(maybe some version or old synthesis tool won't synthesize?) by initializing these values in the combinational logic? What he said does make sense to me and I actually never thought about it because he said this is software logic, and every wire gets its initial value from the logic before it with the initial condition. I told him school taught us this, he was like school cares less any synthesis but industry does.
Thank you for your help! Guess I am not trying to convince him anything even I have a answer, since the team need to stick with one style anyways, but i am confused by him since I have seen others doing this all the time and he is also a guy with tons of experience, so...confused
First of all, you should be using always #(*) from Verilog-2001 or even better always_comb from SystemVerilog so the sensitivity list get constructed automatically for you.
The problem with your code is the use of the full case synthesis pragma as described in this paper. Your coding style removes the need for full case as long as you are sure you made assignments to every variable in the always block for all possible flows through the block.
I think what your boss means by "software logic" is that your coding style requires the designer to think sequentially. In other words, when I read your always block, I am first forced to think about all the values being initialized to their default values, then I must evaluate the case logic. In reality, the logic will synthesize into the equivalent of a default case. This causes a disparity between the logic that represents the RTL and the logic of how I evaluate your expression in my mind. If you know what your doing, then this should be fine most of the time. But you are working for a company, so your code should be considerate of the other engineers working on a project. Each different team in the design flow is going to view the same logic through a potentially different lens (for example, the physical design team is not concerned with Verilog but the synthesized RTL). If we write our Verilog to reflect the final RTL (i.e. "hardware logic"), then everyone is analyzing the logic in a similar fashion. If I look at an output in a circuit and I know all the values of the inputs at a given time step, then I can visually trace the output through the circuit and determine its value without giving any consideration to the other logic. Your Verilog code should be written in the same way.
To summarize, your initialization statements are nothing more than another case in the selection mux in the RTL. So, you should write it that way. Use a default case and explicitly assign every output of the block in each case. This is generally considered best practice. It may not be the most clever or elegant way to write Verilog, but it is the most readable and results in far fewer errors (and in industry, people are more concerned with design verification to reduce costs than the cleverness of Verilog).
Also, as #dave_59 brought up, if you use the full_case Synopsis directive, then it will create default output drivers for you where the outputs are set to "don't cares." This is not a result that anyone wants, and it would be flagged by the verification team. To fix it, you would need to make sure every output is being assigned by adding them to all the cases like your boss mentioned. If you are forced to do this anyways, then full_case is redundant because you have explicitly made the case statement full. As for older synthesis tools, I don't see that being as big of an issue for this particular subject, but it is a consideration that is always given in industry. What would be more of an issue is if your company has configured downstream tools to force older constructs in an effort to reduce verification costs.
Trust your manager's experience on this issue. Coding style in industry is largely affected by collaboration with other engineers, costs, and legacy than by technical details. This is where your manager's experience will be valuable.

Want help in understanding verilog constructs

I am new to understanding Verilog as this language requires thinking in terms of synthesis.
While doing some program I found that:
begin
buf_inm[row][col] =temp_data;
#1 mux_data=buf_inm[row][col];
end
gave correct results than
begin
buf_inm[row][col] =temp_data;
mux_data=buf_inm[row][col];
end
in terms of assignments of variables.
Can anybody explain what is the difference between these two?
In any other higher level languages construct 2 (without delay) would have given correct assignments.
Thanking you,
Yours sincerely,
R. Ganesan.
First of all #1 is a one time scale delay that you can use in simulation, but it is not synthesizeable. When you say "gave correct results", it might help if you explained what results you are seeing, but my guess is that you assigned something to the 2 dimensional vector, and then you assign that to the max_data and are expecting mux_data to be what temp data was. Is it retaining old data? Is it undefined?
That said, I would say the difference is a synthesizeable assignment versus a non-sythesizeble one. In the first case you should see two state changes separated by a timescale unit. The second case, because it is blocking, should see both values update in zero-time, and in the order as written top to bottom. If you aren't seeing that, something else is potentially at foot here. What tool are you using? Tool version? Maybe you can share your full code and test bench so we can see why your results are unexpected.

Division in verilog

I am teaching myself verilog. The book I am following stated in the introduction chapters that to perform division we use the '/' operator or '%' operator. In later chapters it's saying that division is too complex for verilog and cannot be synthesized, so to perform division it introduces a long algorithm.
So I am confused, can't verilog handle simple division? is the / operator useless?
It all depends what type of code you're writing.
If you're writing code that you intend to be synthesised, that you intend to go into an FPGA or ASIC, then you probably don't want to use the division or modulo operators. When you put any arithmetic operator in RTL the synthesiser instances a circuit to do the job; An adder for + & -; A multiplier for *. When you write / you're asking for a divider circuit, but a divider circuit is a very complex thing. It often takes multiple clock cycles, and may use look up tables. It's asking a lot of a synthesis tool to infer what you want when you write a / b.
(Obviously dividing by powers of 2 is simple, but normally you'd use the shift operators)
If you're writing code that you don't want to be synthesised, that is part of a test bench for example, then you can use division all you want.
So to answer your question, the / operator isn't useless, but you have be concious of where and why you're using it. The same is true of *, but to a lesser degree. Multipliers are quite expensive, but most synthesisers are able to infer them.
You have to think in hardware.
When you write a <= b/c you are saying to the synthesis tool "I want a divider that can provide a result every clock cycle and has no intermediate pipline registers".
If you work out the logic circuit required to create that it's very complex, especially for higher bit counts. Generally FPGAs won't have specialist hardware blocks for division so it would have to be implemented out of generic logic resources. It's likely to be both big (lots of luts) and slow (low fmax).
Some synthesisers may implement it anyway (from a quick search it seems quartus will), others won't bother because they don't think it's very useful in practice.
If you are dividing by a constant and can live with an approximate result then you can do tricks with multipliers. Take the reciprocal of what you wanted to divide by, multiply it by a power of two and round to the nearest integer.
Then in your verilog you can implement your approximate divide by multiply (which is not too expensive on modern FPGAS) followed by shift (shifting by a fixed number of bits is essentially free in hardware). Make sure you allow enough bits for the intermediate result.
If you need an exact answer or if you need to divide by something that is not a pre-defined constant you will have to decide what kind of divider you want. IF your throughput is low then you can use a state machine based approach which does one division every n clock cycles. If your throughput is high and you can afford the device area then a pipelined approach which does a division per clock cycle (but requires multiple cycles for the result to flow through) may be more appropriate.
Often tool vendors will provide pre-made blocks (altera calls them megafunctions) for this kind of stuff. The advantage of these is that the tool vendor will likely have carefully optimised them for the device. The downside is they can bring vendor lockin, if you want to move to a different device vendor you will most likely have to swap out the block and the block you swap it with may have different characteristics.
So im confused. cant verilog handle simple division? is the / operator
useless?
The verilog synthesis spec (IEEE 1364.1) actually indicates all arithmetic operators with integer operands should be supported but nobody follows this spec. Some synthesis tools can do integer division but others will reject it(I think XST still does) because combinational division is typically very area inefficient. Multicycle implementations are the norm but these cannot be synthesized from '/'.
Division and modulo are never "simple". Avoid them if you can do so, e.g. through bit masks or shift operations. Especially a variable divisor is really complicated to implement in hardware.
"Verilog the language" handles division and modulo just fine - when you are using a computer to simulate your code you have full access to all it's abilities.
When you are synthesising your code to a particular chip, there are limitations. The limitations tend to be based on what the tool-vendor thinks is "sensible" rather than what is feasible.
In the old days, division by anything other than a power-of-two was deemed to be non-sensible for silicon as it took up a lot of space and ran very slowly. At the moment, some synthesisers with create "divide by a constant" circuits for you.
In future, I see no reason why the synthesiser shouldn't create you a divider (or make use of one that is in the DSP blocks of a potential future architecture). Whether it will or not remains to be seen, but witness the progression of multipliers (from "only powers of two" to "one input constant" to "full implementation" in just a few years)
circuits including only division by 2 : just shift the bit :)
other than 2 .... see you should always think at circuit level verilog is NOT C or C++
/ and % is not synthesizable or if it becomes( in new versions) i believe you should keep your own division circuit this is because the ip they provide will be general ( most probably they will make for floating not fixed)
i bet you had gone through morris mano computer architechure book , there in some last chapters the whole flow is given along with algo , go through it follow it and make your own
see now if your works go for only logic verification and no real circuit is needed , sure go for / and % . no problem it will work for simulation
Division using '/' is possible in verilog. But it is not a synthesizable operator. Same is the case for multiplication using '*'. There are certain algorithms to perform these operations in verliog, and they are used if the code needs to be synthesizable. ie. if you require an equivalent hardware for it.
I am not aware of any algorithms for division, but for multiplication, i have used Booth's algorithm.
if you want the synthesizable code you can use the Divison_IP or you can use the right shifting operator for some divisions like 64/8=8 same 64>>3 = 8.
Division isn't simple in hardware as people spent a lot of time in an efficient
and fast multiplier as an example. However, you can do divid by 2 easily by right shifting one bit in hardware.
Actually your point is very valid and I was also confused in my initial days of learning HDLs.
When you synthesise a division operator, it consumes a lot of resources on FPGA or during logic synthesis for ASIC. Try following instead.
You can also perform division(and multiplication) by shifting some vector(right = division, left = multiplication). But that will be multiplication(and divion) by 2.
Example 0100 = 4
Shift right 0010 = 2(which is 4/2)
Shift left 1000 = 8(which is 4*2).
We use >> operator for shift right, and << for shift left.
But we can also produce variations out of it.
For example multiplication by 3.
So if we have 0100 (4 dec) then also will be
shift left and add one at each step. ((0100 << 1)+1)
Similarly division by 3
shift right and subtract one at each step. ((0100 >> 1) - 1)
These methods were made because to be honest, resources in FPGA are limited, and when it comes to ASICs, your manager tries to kill you for any additional logic. :)
The division operator / is not useless in Verilog/System Verilog. It works in case of simulations as usual mathematical operator.
Some synthesis tools like Xilinx Vivado synthesize the division operator also because it is having a pre-built algorithm in it (though takes more hardware gates).
In simple words, you can do division in Verilog but have to take care of tools and simulators.
Using result <= a/b and works perfectly.
Remember when using the <= operator, the answer is calculated immediately but the answer is entered inside the "result" register at next clock positive edge.
If you don't want to wait till next clock positive edge use result = a/b.
Remember, any arithmetic operation circuit needs some time to finish the operation, and during this time the circuit generates random numbers (bits).
Its like when A-10 warthog attack airplane attacks a tank it shoots lots of bullets. That's how the divider circuit acts while dividing,it spits random bits. After couple of nanoseconds it will finish dividing and return a stable good result.
This is why we wait until next clock cycle for the "result" register. We try to protect it from random garbage numbers.
Division is the most complex operation, so it will have a delay in calculation. For 16bit division the result will be calculated in approximately 6 nanoseconds.

Verilog Best Practice - Incrementing a variable

I'm by no means a Verilog expert, and I was wondering if someone knew which of these ways to increment a value was better. Sorry if this is too simple a question.
Way A:
In a combinational logic block, probably in a state machine:
//some condition
count_next = count + 1;
And then somewhere in a sequential block:
count <= count_next;
Or Way B:
Combinational block:
//some condition
count_en = 1;
Sequential block:
if (count_en == 1)
count <= count + 1;
I have seen Way A more often. One potential benefit of Way B is that if you are incrementing the same variable in many places in your state machine, perhaps it would use only one adder instead of many; or is that false?
Which method is preferred and why? Do either have a significant drawback?
Thank you.
One potential benefit of Way B is that if you are incrementing the same variable in many places in your state machine, perhaps it would use only one adder instead of many; or is that false?
Any synthesis tool will attempt automatic resource sharing. How well they do so depends on the tool and code written. Here is a document that describes some features of Design Compiler. Notice that in some cases, less area means worse timing.
Which method is preferred and why? Do either have a significant drawback?
It depends. Verilog(for synthesis) is a means to implement some logic circuit but the spec does not specify exactly how this is done. Way A may be the same as Way B on an FPGA but Way A is not consistent with low power design on an ASIC due to the unconditional sequential assignment. Using reset nets is almost a requirement on an ASIC but since many FPGAs start in a known state, you can save quite a bit of resources by not having them.
I use Way A in my Verilog code. My sequential blocks have almost no logic in them; they just assign registers based on the values of the "wire regs" computed in the combinational always blocks. There is just less to go wrong this way. And with Verilog we need all the help we can get.
What is your definition of "better" ?
It can be better performance (faster maximum frequency of the synthesized circuit), smaller area (less logic gates), or faster simulation execution.
Let's consider smaller area case for Xilinx and Altera FPGAs. Registers in those FPGA families have enable input. In your "Way B", *count_en* will be directly mapped into that enable register input, which will result in less logic gates. Essentially, "Way B" provides more "hints" to a synthesis tool how to better synthesize that circuit. Also it's possible that most FPGA synthesis tools (I'm talking about Xilinx XST, Altera MAP, Mentor Precision, and Synopsys Synplify) will correctly infer register enable input from the "Way A".
If *count_en* is synthesized as enable register input, that will result in better performance of the circuit, because your counter increment logic will have less logic levels.
Thanks

Resources