What is the maximum wire bit-width in Verilog/SystemVerilog?

I am trying to synthesize a design using Intel Quartus software. During the synthesis flow, I got a warning stating "Verilog declaration warning: vector has more than 2**16 bits". Due to the project specifications, the wire width exceeds 2^16 bits. Do I really need to worry about this warning? Is there any constraint in Verilog/SystemVerilog regarding the maximum bit-width of wires?

Yes, you should worry about the message. It means you are probably miscoding something, as I do not believe you really need an integral vector (packed array) with more than 2^16 bits. That would mean you want to represent a number as large as 2^(2^16), roughly 10^19728. You probably want an unpacked array of bits or bytes instead.
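For example, a rough sketch of the difference, with made-up sizes since I don't know your real design:

// One giant packed vector -- this is what triggers the "more than 2**16 bits" warning:
logic [99999:0] huge_bus;

// The same 100,000 bits as an unpacked array of bytes, which maps much better
// onto memories and is usually what you actually want:
logic [7:0] byte_buffer [0:12499];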

The Verilog/SystemVerilog languages themselves do not define a maximum vector width. I vaguely remember a minimum for simulation being mentioned somewhere in the IEEE 1364 / IEEE 1800 LRMs.
Synthesizers can have their own limitations for various reasons, e.g. software complexity, physical mapping/routing, business tactics (“buy this other tool”), etc. The fact that you are getting warnings probably points to physical mapping/routing, as most synthesizers try to keep all the bits in a vector together, thereby making overall routing and clock-tree buffering easier.
2^16 is a lot of bits and your design exceeds that. In theory you could split the whole vector into more manageable chunks (e.g. left and right halves), which might give you better routing and timing. If your design is meeting all its requirements, then the warning can be ignored. Just be ready to make changes when it suddenly can no longer be ignored.

Yes, according to the SystemVerilog LRM it can be limited, and since your synthesis tool is throwing a warning, you should keep this in mind if any issues arise later in your project.
For packed arrays (Section 7.4.1), which is most likely what the warning refers to, the aforementioned LRM reads:
The maximum size of a packed array can be limited, but shall be at least 65 536 (2^16) bits.
For unpacked arrays (Section 7.4.2), it says:
Implementations may limit the maximum size of an array, but they shall allow at least 16 777 216 (2^24) elements.
But do you actually need a 2^16-bit wide vector? With a 16-bit variable (e.g., logic [15:0]) you can handle unsigned values up to 2^16-1 = 65535. With a 2^16-bit wide vector (e.g., logic [65535:0]) you could represent values up to 2^(2^16)-1 ≈ 2.00 × 10^19728.
Could you not split up your wire into a couple of smaller wires?
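For instance, a hedged sketch of the splitting idea (the widths here are purely illustrative):

// Rather than one logic [131071:0] wide_bus, keep two halves that each stay
// at or below the 2**16-bit threshold and handle them separately:
logic [65535:0] wide_bus_lo;   // would-be bits [65535:0]
logic [65535:0] wide_bus_hi;   // would-be bits [131071:65536]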


creating a constant vector of variable width in verilog

I'm writing some synthesizable Verilog. I need to create a value to use as a mask in a larger expression. This value is a sequence of 1's whose length is stored in a register:
buffer & {offset{1'h1}};
where buffer and offset are both registers. What I expect is for buffer to be ANDed with 11111... of width offset. However, the compiler says this is illegal in Verilog, since offset needs to be constant.
Instead, I wrote the following:
buffer & ~({WIDTH{1'h1}} << offset)
where WIDTH is a constant. This works. Both expressions are equivalent in terms of values, but obviously not in terms of the hardware that would be synthesized.
What's the difference?
The difference comes from the rules for context-determined expressions (detailed in sections 11.6 and 11.8 of the IEEE 1800-2017 LRM), which require the width of all operands of an expression to be known at compile time.
Your example is too simple to show where the complication arises, but let's say buffer was a 16-bit signed variable. To perform a bitwise AND (&), you first need to know the size of both operands, then extend the smaller operand to match the size of the larger one. If we don't know the size of {offset{1'h1}}, we don't know whether it needs to be zero-extended or buffer needs to be sign-extended.
Of course, the language could be defined to allow this so that it works, but synthesis tools would end up creating a lot of unnecessary additional hardware. And if we start applying this to more complex expressions, trying to determine how the bit-widths propagate becomes unmanageable.
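As a concrete sketch of the legal form (the module and port names here are mine, not from your code):

module mask_example #(parameter int WIDTH = 16) (
  input  logic [WIDTH-1:0]           buffer,
  input  logic [$clog2(WIDTH+1)-1:0] offset,   // 0 .. WIDTH
  output logic [WIDTH-1:0]           masked
);
  // {WIDTH{1'b1}} has a width known at compile time, so shifting it by a
  // run-time offset and inverting is legal and synthesisable; the low
  // `offset` bits of the mask are 1, the rest are 0.
  assign masked = buffer & ~({WIDTH{1'b1}} << offset);
endmodule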
Both parts of your question involve the replication operator. The operator requires a replication constant that says how many times to replicate. So the first part of your example is illegal: offset must be a constant, not a reg.
Also, the constant is not the width of something, but the number of times the 1'h1 is repeated. So the second part of the example is syntactically correct.

Why is it unsafe to use Linear Congruential Generator to shuffle cards in online casino? How long will it take to crack this system?

I know that a Linear Congruential Generator is not recommended where high randomness and a high security level are needed, but I don't know why. If I use a Linear Congruential Generator to generate a random number for a shuffle algorithm, is it easy to crack? If it is, how long would it take to crack?
Linear congruential generators have several major flaws:
They have very little internal state. Some generators may have as few as 64 thousand possible starting states -- as a result, using one of these would mean that there are only 64 thousand possible shuffled decks. This makes it very easy to identify which one of those decks is being used at any given point.
Their future and past behavior can be perfectly determined based on their state. Once an attacker is able to gather enough information to guess the state of a LCRNG being used to shuffle cards, they can determine both what all the future shuffles will be, and what any past shuffles were.
They frequently suffer from statistical biases. One common flaw is that the low-order bits follow a short cycle -- for instance, with a power-of-two modulus and an odd multiplier and increment, the least significant bit of the raw output simply alternates between 1 and 0 on successive outputs.
The first two issues are what will cause you the most trouble here. The exact severity will depend on the size of the LCRNG's state. That being said, if a 32-bit LCRNG is being used, its state can probably be guessed from about 32 bits of output. The position of one card reveals roughly lg(52) ≈ 5.7 bits, so the complete state can probably be guessed after seeing about 6 cards (32 ÷ 5.7 ≈ 5.6).

Division in verilog

I am teaching myself Verilog. The book I am following stated in the introduction chapters that to perform division we use the '/' or '%' operator. In later chapters it says that division is too complex and cannot be synthesized, so to perform division it introduces a long algorithm.
So I am confused: can't Verilog handle simple division? Is the / operator useless?
It all depends on what type of code you're writing.
If you're writing code that you intend to be synthesised, that you intend to go into an FPGA or ASIC, then you probably don't want to use the division or modulo operators. When you put an arithmetic operator in RTL, the synthesiser instantiates a circuit to do the job: an adder for + and -, a multiplier for *. When you write /, you're asking for a divider circuit, but a divider circuit is a very complex thing. It often takes multiple clock cycles and may use look-up tables. It's asking a lot of a synthesis tool to infer what you want when you write a / b.
(Obviously dividing by powers of 2 is simple, but normally you'd use the shift operators)
If you're writing code that you don't want to be synthesised, that is part of a test bench for example, then you can use division all you want.
So to answer your question, the / operator isn't useless, but you have to be conscious of where and why you're using it. The same is true of *, but to a lesser degree: multipliers are quite expensive, but most synthesisers are able to infer them.
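To illustrate the power-of-two shortcut mentioned above, here's a minimal sketch (signal names are made up):

module div_by_8 (
  input  logic [15:0] value,
  output logic [15:0] value_div8
);
  // Dividing an unsigned value by 8 is just a right shift by 3: no divider
  // is inferred, the synthesiser only rewires bits. For signed values use
  // the arithmetic shift >>>, and note that it rounds toward negative
  // infinity while / truncates toward zero.
  assign value_div8 = value >> 3;
endmodule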
You have to think in hardware.
When you write a <= b/c you are saying to the synthesis tool "I want a divider that can produce a result every clock cycle and has no intermediate pipeline registers".
If you work out the logic circuit required to do that, it's very complex, especially for higher bit counts. FPGAs generally don't have specialist hardware blocks for division, so it would have to be implemented out of generic logic resources. It's likely to be both big (lots of LUTs) and slow (low fmax).
Some synthesisers will implement it anyway (from a quick search it seems Quartus will); others won't bother because they don't think it's very useful in practice.
If you are dividing by a constant and can live with an approximate result then you can do tricks with multipliers. Take the reciprocal of what you wanted to divide by, multiply it by a power of two and round to the nearest integer.
Then in your Verilog you can implement your approximate division as a multiply (which is not too expensive on modern FPGAs) followed by a shift (shifting by a fixed number of bits is essentially free in hardware). Make sure you allow enough bits for the intermediate result.
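For instance, a hedged sketch of dividing a 16-bit unsigned value by the constant 3 (the names and constant below are mine; re-check the precision for your own operand widths):

// Approximate divide-by-constant via reciprocal multiply and shift:
// x / 3  ~=  (x * round(2**18 / 3)) >> 18.
module div3_by_reciprocal (
  input  logic [15:0] x,
  output logic [15:0] x_div_3
);
  // 2**18 / 3 rounded up; with 18 fractional bits this appears to be exact
  // for all 16-bit unsigned inputs, but verify it exhaustively in simulation.
  localparam logic [17:0] RECIP_3 = 18'd87382;

  logic [33:0] product;                 // 16-bit x 18-bit intermediate product

  assign product = x * RECIP_3;         // one multiplier, no divider
  assign x_div_3 = product[33:18];      // the shift by 18 is just wiring
endmodule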
If you need an exact answer, or if you need to divide by something that is not a pre-defined constant, you will have to decide what kind of divider you want. If your throughput is low, you can use a state-machine-based approach which does one division every n clock cycles. If your throughput is high and you can afford the device area, then a pipelined approach which does a division per clock cycle (but requires multiple cycles for the result to flow through) may be more appropriate.
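A hedged sketch of the low-throughput, state-machine option -- one classic choice is a restoring divider that retires one quotient bit per cycle (module, port and signal names below are mine; treat it as a starting point, not a drop-in block):

module serial_divider #(parameter int WIDTH = 16) (
  input  logic              clk,
  input  logic              rst,
  input  logic              start,
  input  logic [WIDTH-1:0]  dividend,
  input  logic [WIDTH-1:0]  divisor,
  output logic [WIDTH-1:0]  quotient,
  output logic [WIDTH-1:0]  remainder,
  output logic              done
);
  logic [WIDTH-1:0]           quot;   // holds dividend bits, then quotient bits
  logic [WIDTH:0]             rem;    // one spare bit for the trial subtract
  logic [$clog2(WIDTH+1)-1:0] count;
  logic                       busy;
  logic [WIDTH:0]             trial;

  // Trial step: shift the next dividend bit into the partial remainder and
  // subtract the divisor. The MSB of trial tells us if it went negative.
  assign trial = {rem[WIDTH-1:0], quot[WIDTH-1]} - {1'b0, divisor};

  always_ff @(posedge clk) begin
    if (rst) begin
      busy <= 1'b0;
      done <= 1'b0;
    end else if (start && !busy) begin
      quot  <= dividend;
      rem   <= '0;
      count <= WIDTH;
      busy  <= 1'b1;
      done  <= 1'b0;
    end else if (busy) begin
      if (trial[WIDTH]) begin
        // Subtraction went negative: keep ("restore") the old partial
        // remainder and shift in a 0 quotient bit.
        rem  <= {rem[WIDTH-1:0], quot[WIDTH-1]};
        quot <= {quot[WIDTH-2:0], 1'b0};
      end else begin
        rem  <= trial;
        quot <= {quot[WIDTH-2:0], 1'b1};
      end
      count <= count - 1;
      if (count == 1) begin
        busy <= 1'b0;
        done <= 1'b1;   // quotient/remainder are valid while done is high
      end
    end
  end

  assign quotient  = quot;
  assign remainder = rem[WIDTH-1:0];
endmodule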
Often tool vendors will provide pre-made blocks (Altera calls them megafunctions) for this kind of thing. The advantage of these is that the tool vendor has probably carefully optimised them for the device. The downside is that they can bring vendor lock-in: if you want to move to a different device vendor, you will most likely have to swap out the block, and the replacement may have different characteristics.
So I'm confused. Can't Verilog handle simple division? Is the / operator useless?
The Verilog synthesis spec (IEEE 1364.1) actually indicates that all arithmetic operators with integer operands should be supported, but nobody follows this spec. Some synthesis tools can do integer division, but others will reject it (I think XST still does) because combinational division is typically very area-inefficient. Multi-cycle implementations are the norm, but these cannot be synthesized from '/'.
Division and modulo are never "simple". Avoid them if you can, e.g. through bit masks or shift operations. A variable divisor in particular is really complicated to implement in hardware.
"Verilog the language" handles division and modulo just fine - when you are using a computer to simulate your code you have full access to all it's abilities.
When you are synthesising your code to a particular chip, there are limitations. The limitations tend to be based on what the tool-vendor thinks is "sensible" rather than what is feasible.
In the old days, division by anything other than a power of two was deemed non-sensible for silicon, as it took up a lot of space and ran very slowly. At the moment, some synthesisers will create "divide by a constant" circuits for you.
In future, I see no reason why the synthesiser shouldn't create a divider for you (or make use of one in the DSP blocks of a potential future architecture). Whether it will or not remains to be seen, but witness the progression of multipliers (from "only powers of two" to "one input constant" to "full implementation" in just a few years).
For circuits that only divide by 2, just shift the bits. :)
For anything other than 2, you should always think at the circuit level; Verilog is NOT C or C++.
/ and % are generally not synthesizable, and even if newer tool versions support them, I believe you should keep your own division circuit, because the IP the tools provide will be generic (most probably built for floating point rather than fixed point).
I bet you have gone through Morris Mano's computer architecture book; in some of the last chapters the whole flow is given along with the algorithm. Go through it, follow it, and make your own.
If your work is only for logic verification and no real circuit is needed, then sure, go for / and %. No problem, it will work for simulation.
Division using '/' is possible in Verilog, but it is often not a synthesizable operator, and the same can be true of multiplication using '*' on some tools. There are certain algorithms to perform these operations in Verilog, and they are used if the code needs to be synthesizable, i.e. if you require equivalent hardware for it.
I am not aware of any algorithms for division, but for multiplication, I have used Booth's algorithm.
If you want synthesizable code, you can use a division IP core, or you can use the right-shift operator for divisions by powers of two, e.g. 64/8 = 8 is the same as 64 >> 3 = 8.
Division isn't simple in hardware; as an example, people have spent a lot of time just making multipliers efficient and fast, and division is harder still. However, you can divide by 2 easily by right-shifting one bit in hardware.
Actually your point is very valid, and I was also confused in my initial days of learning HDLs.
When you synthesise a division operator, it consumes a lot of resources on an FPGA or during logic synthesis for an ASIC. Try the following instead.
You can also perform division (and multiplication) by shifting a vector (right = division, left = multiplication). But that will be division (and multiplication) by 2.
Example: 0100 = 4
Shift right: 0010 = 2 (which is 4/2)
Shift left: 1000 = 8 (which is 4*2)
We use the >> operator for shift right, and << for shift left.
But we can also produce variations of it.
For example, multiplication by 3.
If we have 0100 (4 decimal), we shift left by one and add the original value: (0100 << 1) + 0100 = 1100 (12 decimal). A small sketch of this follows at the end of this answer.
Division by a constant such as 3 is not as simple; it is usually done by multiplying by a scaled reciprocal and then shifting, as described in an earlier answer.
These methods exist because, to be honest, resources in an FPGA are limited, and when it comes to ASICs, your manager tries to kill you for any additional logic. :)
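A minimal sketch of the shift-and-add idea above (the module and signal names are just illustrative):

// Multiply by 3 using a shift and an add: 3*x = (x << 1) + x.
// No multiplier or divider block is inferred, only an adder.
module times3 #(parameter int WIDTH = 8) (
  input  logic [WIDTH-1:0] x,
  output logic [WIDTH+1:0] x_times_3   // two extra bits avoid overflow
);
  assign x_times_3 = ({2'b00, x} << 1) + {2'b00, x};
endmodule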
The division operator / is not useless in Verilog/SystemVerilog. In simulation it works as the usual mathematical operator.
Some synthesis tools, such as Xilinx Vivado, can also synthesize the division operator, because they have a built-in algorithm for it (though it takes a lot of hardware).
In simple words, you can do division in Verilog, but you have to take care of what your tools and simulators support.
Using result <= a/b works perfectly.
Remember that with the <= (non-blocking) assignment, the right-hand side is evaluated immediately, but the answer is loaded into the "result" register at the next positive clock edge.
If you don't want to wait until the next positive clock edge, use result = a/b.
Remember, any arithmetic circuit needs some time to finish its operation, and during this time the combinational logic produces transient, invalid bits.
It's like when an A-10 Warthog attack airplane attacks a tank: it shoots lots of bullets. That's how the divider circuit acts while dividing, it spits out changing bits. After a couple of nanoseconds it finishes dividing and returns a stable, correct result.
This is why we wait until the next clock edge for the "result" register: we protect it from capturing those garbage values.
Division is the most complex operation, so it has the longest delay; for a 16-bit division the result might take roughly 6 nanoseconds to settle.
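If your tool does infer a combinational divider, the registered form described above looks roughly like this (names are illustrative, and b is assumed non-zero):

// Registering the result of an inferred combinational divider. The divide
// itself happens in the logic between the registers; the non-blocking
// assignment just captures whatever that logic has settled to at the edge.
module registered_divide (
  input  logic        clk,
  input  logic [15:0] a,
  input  logic [15:0] b,
  output logic [15:0] result
);
  always_ff @(posedge clk) begin
    result <= a / b;   // only works if the synthesiser supports inferring '/'
  end
endmodule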

How is integer overflow exploitable?

Does anyone have a detailed explanation of how integer overflows can be exploited? I have been reading a lot about the concept, and I understand what an integer overflow is, and I understand buffer overflows, but I don't understand how one could modify memory reliably, or in a way that modifies application flow, by making an integer larger than its defined memory...
It is definitely exploitable, but depends on the situation of course.
Old versions of SSH had an integer overflow which could be exploited remotely. The exploit caused the SSH daemon to create a hash table of size zero and overwrite memory when it tried to store some values in there.
More details on the ssh integer overflow: http://www.kb.cert.org/vuls/id/945216
More details on integer overflow: http://projects.webappsec.org/w/page/13246946/Integer%20Overflows
I used APL/370 in the late 60s on an IBM 360/40. APL is a language in which essentially everything is a multidimensional array, and there are amazing operators for manipulating arrays, including reshaping from N dimensions to M dimensions, etc.
Unsurprisingly, an array of N dimensions had index bounds of 1..k, with a different positive k for each axis, and k was legally always less than 2^31 (positive values in a 32-bit signed machine word). Now, an array of N dimensions has a location assigned in memory. Any attempt to access an array slot using an index too large for an axis is checked against the array upper bound by APL. And of course this applies to an array of N dimensions where N == 1.
APL didn't check whether you did something incredibly stupid with the RHO (array reshape) operator. APL only allowed a maximum of 64 dimensions. So, you could make an array of 1-64 dimensions, and APL would do it if the array dimensions were all less than 2^31. Or, you could try to make an array of 65 dimensions. In this case, APL goofed, and surprisingly gave back a 64-dimension array, but failed to check the axis sizes.
(This is in effect where the "integer overflow" occurred.) This meant you could create an array with axis sizes of 2^31 or more... but, being interpreted as signed integers, they were treated as negative numbers.
The right RHO operator incantation applied to such an array could reduce the dimensionality to 1, with an upper bound of, get this, "-1". Call this array a "wormhole" (you'll see why in a moment). Such a wormhole array has a place in memory, just like any other array. All array accesses are checked against the upper bound... but the bound check turned out to be done by APL with an unsigned compare. So, you could access WORMHOLE[1], WORMHOLE[2], ... WORMHOLE[2^32-2] without objection. In effect, you could access the entire machine's memory.
APL also had an array assignment operation, in which you could fill an array with a value.
WORMHOLE[]<-0 thus zeroed all of memory.
I only did this once, as it erased the memory containing my APL workspace, the APL interpreter, and obviously the critical part of APL that enabled timesharing (in those days it wasn't protected from users)... The terminal room went from its normal state of mechanically very noisy (we had 2741 Selectric APL terminals) to dead silent in about 2 seconds.
Through the glass into the computer room I could see the operator look up, startled, at the lights on the 370 as they all went out. Lots of running around ensued.
While it was funny at the time, I kept my mouth shut.
With some care, one could obviously have tampered with the OS in arbitrary ways.
It depends on how the variable is used. If you never make any security decisions based on integers you have computed from attacker-supplied input (where an adversary could provoke an overflow), then I can't think of how you would get in trouble (but this kind of stuff can be subtle).
Then again, I have seen plenty of code like this that doesn't validate user input (although this example is contrived):
int pricePerWidgetInCents = 3199;
int numberOfWidgetsToBuy = int.Parse(/* some user input string */);
int totalCostOfWidgetsSoldInCents = pricePerWidgetInCents * numberOfWidgetsToBuy; // KA-BOOM!
// potentially much later
int orderSubtotal = whatever + totalCostOfWidgetsSoldInCents;
Everything is hunky-dory until the day you sell 671,299 widgets for -$21,474,817.95. Boss might be upset.
A common case would be code that tries to prevent a buffer overflow by asking for the number of inputs that will be provided, and then trying to enforce that limit. Consider a situation where I claim to be providing 2^30+10 integers. The receiving system allocates a buffer of 4*(2^30+10) bytes, which overflows a 32-bit size to just 40 bytes (!). Since the memory allocation succeeded, I'm allowed to continue. The input-count check won't stop me when I send my 11th input, since 11 < 2^30+10. Yet I will overflow the actually allocated buffer.
I just wanted to sum up everything I have found out about my original question.
The reason things were confusing to me was because I know how buffer overflows work, and I can understand how you can easily exploit those. An integer overflow is a different case - you can't exploit the integer overflow itself to inject arbitrary code or directly force a change in the flow of an application.
However, it is possible to overflow an integer which is used - for example - to index an array, and thereby access arbitrary parts of memory. From there, it could be possible to use that mis-indexed array to overwrite memory and alter the execution of the application to your malicious intent.
Hope this helps.

Do 64-bit floating point numbers behave identically on all modern PCs?

I would like to know whether I can assume that the same operations on the same 64-bit floating point numbers give exactly the same results on any modern PC and in most common programming languages (C++, Java, C#, etc.). We can assume that we are operating on numbers and the result is also a number (no NaNs, INFs and so on).
I know there are two very similar standards of computation using floating point numbers (IEEE 854-1987 and IEEE 754-2008). However, I don't know how it is in practice.
Modern processors that implement 64-bit floating-point typically implement something that is close to the IEEE 754-1985 standard, recently superseded by the 754-2008 standard.
The 754 standard specifies what result you should get from certain basic operations, notably addition, subtraction, multiplication, division, square root, and negation. In most cases, the numeric result is specified precisely: The result must be the representable number that is closest to the exact mathematical result in the direction specified by the rounding mode (to nearest, toward infinity, toward zero, or toward negative infinity). In "to nearest" mode, the standard also specifies how ties are broken.
Because of this, operations that do not involve exception conditions such as overflow will get the same results on different processors that conform to the standard.
However, there are several issues that interfere with getting identical results on different processors. One of them is that the compiler is often free to implement sequences of floating-point operations in a variety of ways. For example, if you write "a = b*c + d" in C, where all variables are declared double, the compiler is free to compute "b*c" in either double-precision arithmetic or something with more range or precision. If, for example, the processor has registers capable of holding extended-precision floating-point numbers and doing arithmetic with extended-precision does not take any more CPU time than doing arithmetic with double-precision, a compiler is likely to generate code using extended-precision. On such a processor, you might not get the same results as you would on another processor. Even if the compiler does this regularly, it might not in some circumstances because the registers are full during a complicated sequence, so it stores the intermediate results in memory temporarily. When it does that, it might write just the 64-bit double rather than the extended-precision number. So a routine containing floating-point arithmetic might give different results just because it was compiled with different code, perhaps inlined in one place, and the compiler needed registers for something else.
Some processors have instructions to compute a multiply and an add in one instruction, so "b*c + d" might be computed with no intermediate rounding and get a more accurate result than on a processor that first computes b*c and then adds d.
Your compiler might have switches to control behavior like this.
There are some places where the 754-1985 standard does not require a unique result. For example, when determining whether underflow has occurred (a result is too small to be represented accurately), the standard allows an implementation to make the determination either before or after it rounds the significand (the fraction bits) to the target precision. So some implementations will tell you underflow has occurred when other implementations will not.
A common feature in processors is to have an "almost IEEE 754" mode that eliminates the difficulty of dealing with underflow by substituting zero instead of returning the very small number that the standard requires. Naturally, you will get different numbers when executing in such a mode than when executing in the more compliant mode. The non-compliant mode may be the default set by your compiler and/or operating system, for reasons of performance.
Note that an IEEE 754 implementation is typically not provided just by hardware but by a combination of hardware and software. The processor may do the bulk of the work but rely on the software to handle certain exceptions, set certain modes, and so on.
When you move beyond the basic arithmetic operations to things like sine and cosine, you are very dependent on the library you use. Transcendental functions are generally calculated with carefully engineered approximations. The implementations are developed independently by various engineers and get different results from each other. On one system, the sin function may give results accurate within an ULP (unit of least precision) for small arguments (less than pi or so) but larger errors for large arguments. On another system, the sin function might give results accurate within several ULP for all arguments. No current math library is known to produce correctly rounded results for all inputs. There is a project, crlibm (Correctly Rounded Libm), that has done some good work toward this goal, and they have developed implementations for significant parts of the math library that are correctly rounded and have good performance, but not all of the math library yet.
In summary, if you have a manageable set of calculations, understand your compiler implementation, and are very careful, you can rely on identical results on different processors. Otherwise, getting completely identical results is not something you can rely on.
If you mean getting exactly the same result, then the answer is no.
You might even get different results for debug (non-optimized) builds vs. release builds (optimized) on the same machine in some cases, so don't even assume that the results might be always identical on different machines.
(This can happen, e.g., on a computer with an Intel processor, if the optimizer keeps an intermediate result in a register that would be stored to memory in the unoptimized build. Since Intel x87 FPU registers are 80-bit and double variables are 64-bit, the intermediate result is kept with greater precision in the optimized build, causing different values in later results.)
In practice, however, you may often get the same results, but you shouldn't rely on it.
Modern FPUs all implement IEEE754 floats in single and double formats, and some in extended format. A certain set of operations are supported (pretty much anything in math.h), with some special instructions floating around out there.
Assuming you are talking about applying multiple operations, I do not think you will get exact numbers. CPU architecture, compiler use, and optimization settings will change the results of your computations.
Even if you mean the exact order of operations (at the assembly level), I think you will still get variations. For example, Intel chips use extended precision (80 bits) internally, which may not be the case for other CPUs. (I do not think extended precision is mandated.)
The same C# program can produce different numerical results on the same PC when compiled once in debug mode without optimization and a second time in release mode with optimization enabled. That's my personal experience. We did not account for this when we set up an automatic regression test suite for one of our programs for the first time, and were completely surprised that a lot of our tests failed without any apparent reason.
For C# on x86, 80-bit FP registers are used.
The C# standard says that the processor must operate at the same precision as, or greater than, the type itself (i.e. 64-bit in the case of a 'double'). Promotions are allowed, except for storage. That means that locals and parameters could be at greater than 64-bit precision.
In other words, assigning a member variable to a local variable could (and in fact will under certain circumstances) be enough to give an inequality.
See also: Float/double precision in debug/release modes
For the 64-bit data type, I only know of "double precision" / "binary64" from the IEEE 754 (1985 and 2008 don't differ much here for common cases) being used.
Note: The radix types defined in IEEE 854-1987 are superseded by IEEE 754-2008 anyways.
