I'm estimating my resource usage by counting the number of flip flop I need for a component. For example, when I estimate it for ins_controldata (simple counter and some I/O), I use 32 flipflops. When I look at the detailed map report of this component, Section 13 - Utilization by Hierarchy, I see that my estimations are close to the number of slice registers used for this component. Every slice has 4 LUTs and 8 flipflops.
Now when I do the same for my finite state machine, inst_xtm640, I estimate my flipflop usage around 43 (including the 3 flipflops needed for the 6 states). When I look at the map report, I see my estimation is more or less correct (+-10% error). But the number of slices needed are much higher than the slice register and LUTs needed. It's 40 when you look at the used LUTs, it should only be around 20.
Why are extra slices used for this component? Is it for speed optimization ?
+----------------------------------------------------------------------------------+
| Module | Partition | Slices* | Slice Reg | LUTs |
+----------------------------------------------------------------------------------+
| ++inst_controldata | | 6/6 | 35/35 | 20/20 |
| +++inst_xtm640 | | 40/40 | 57/57 | 88/88 |
+----------------------------------------------------------------------------------+
Edit:
I think I found it myself but other inputs are always welcome:
Not all slices are fully used. So it doesn't mean if I use 88 LUTS and 57 slices registers, I will be using 22 slices. Some slices aren't fully used, so this will make slice usage go higher. Also extra slices will be used to meet timing constraints.
You are indeed correct in your edit.
The synthesizer is sometimes a very unpredictable thing, and especially what your constraints are and what you optimize for is key in getting results.
If you were to optimize for size (area) your estimates might be closer.
The bigger and more complex your systems are, the greater this will effect your final place and route. This is why, when you look at your small example, results are closer.
This is also a fine reason why you shouldn't "fill up" your FPGA.
The more space you have left, the more efficient in closing timing the optimizer can be.
This is of course a "waste" of resources, but might lead to better results.
One advice I have received regarding FPGA layout is to not go into such detail that the synthesizer doesn't have room to operate. E.g. Don't make a flip flop out of logic, use the abstraction of the language instead. As long as you are aware of how the results are reached let the synthesizer do its job of finding out what is necessary and how to optimize it, it usually does a better job
\Paul
In the Xilinx toolchain, the build step in which LUTs and registers are assigned to specific slices is called "map". By default, and especially if your design occupies only a small amount of the device's resources, map will not try to pack as many LUTs and registers into a slice as it possibly could. This is why you're seeing a higher slice count than expected.
You can force map to pack slices more agressively by setting the -c (Pack Slices) option to 1:
-c [packfactor ]
The packfactor (for non-zero values) is the target slice density
percentage.
A packfactor value of 0 specifies that only related logic (logic having signals in common) should be packed into a single Slice, and
yields the least densely packed design.
A packfactor of 1 results in maximum packing density as the packer is attempting 1% slice utilization.
A packfactor of 100 means that only enough unrelated packs will occur to fit the device with 100% utilization. This results in minimum
packing density.
Related
I have 2 different clocks, one for reading and one for writing. I am using gray-code to synchronize the pointers with an additional 2 flip-flops for synchronization on the differnt clock of the input signal.
The articles that I have read indicate how to determine the full and empty signal using gray code by comparing the 2MSB for full state and equality for empty state.
However, I need to get the number of elements in the buffer and not just the full or empty signals. Is this possible to do with gray code?
In a comment you ask about the common clock and mentioned that your depth is not a power of two.
First : Edit your original post and add that question and the information.
Second: In an a-synchronous FIFO there is no common clock. The write operations are all run from the write clock. The read operations are all run from the read clock. The critical part is to exchange information between the clock domains. That is where the gray code comes in.
Third: An a-synchronous FIFO uses gray code because only one bit changes at a time. Important there is that the process is circular. Thus the difference between your last and your first value also only differs by one bit:
Counter Gray-code
000 000
001 001
010 011
011 010
100 110
101 111
110 101
111 100 <-- Last
000 000 <-- First again
This works if and only if the depth (and thus the counters) are a power of two. Therefore an a-synchronous FIFO always has a depth which is a power of two.
If you must have a different depth you can add a synchronous FIFO to the beginning or the end. However if you think about it: a FIFO is just an elastic buffer. The behavior if it is e.g. 16 entries deep or 12 entries is not different, other then that you have the potential to store more values.
Last: As supercat said: You convert from binary to Gray code, cross to the other clock domain, then convert Gray code to binary again.
In the end clock domain you can safely compare read and write counters to determine the fill-level of the FIFO.
If the level is needed on both read and write side you have to implement this process twice, once in each clock domain.
The most understandable way to compute the difference between two gray-code values is to synchronize them with a common clock, convert them to binary, and then do an ordinary binary subtraction on them. While it may be possible to design a fully-combinatorial circuit that would compute the difference between two gray-code values in such a way that if all bits of one particular value are stable, and one bit in the other value changes, only one bit in the output would change and all others would remain stable, such a design would be much more complicated than one which simply synchronizes both counters, converts to binary, and subtracts.
Linux Kernel : 4.10.0-20-generic (also tried this on 4.11.3)
Ubuntu : 17.04
I have been trying to collect stats of memory-accesses using perf stat. I am able to collect stats for memory-stores but the count for memory-loads return me a 0 value.
The below is the details for memory-stores :-
perf stat -e cpu/mem-stores/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
158,115,510 cpu/mem-stores/u
0.559922797 seconds time elapsed
For memory-loads, I get a 0 count as can be seen below :-
perf stat -e cpu/mem-loads/u ./libquantum_base.arnab 100
N = 100, 37 qubits required
Random seed: 33
Measured 3277 (0.200012), fractional approximation is 1/5.
Odd denominator, trying to expand by 2.
Possible period is 10.
100 = 4 * 25
Performance counter stats for './libquantum_base.arnab 100':
0 cpu/mem-loads/u
0.563806170 seconds time elapsed
I cannot understand why this does not count properly. Should I use a different event in any way to get proper data ?
The mem-loads event is mapped to the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_3 performance monitoring unit event on Intel processors. The events MEM_TRANS_RETIRED.LOAD_LATENCY_* are special and can only be counted by using the p modifier. That is, you have to specify mem-loads:p to perf to use the event correctly.
MEM_TRANS_RETIRED.LOAD_LATENCY_* is a precise event and it only makes sense to be counted at the precise level. According to this Intel article (emphasis mine):
When a user elects to sample one of these events, special hardware is
used that can keep track of a data load from issue to completion.
This is more complicated than simply counting instances of an event
(as with normal event-based sampling), and so only some loads are
tracked. Loads are randomly chosen, the latency determined for each,
and the correct event(s) incremented (latency >4, >8, >16, etc). Due
to the nature of the sampling for this event, only a small percentage
of an application's data loads can be tracked at any one time.
As you can see, MEM_TRANS_RETIRED.LOAD_LATENCY_* by no means count the total number of loads and it is not designed for that purpose at all.
If you want to to determine which instructions in your code are issuing load requests that take more than a specific number of cycles to complete, then MEM_TRANS_RETIRED.LOAD_LATENCY_* is the right performance event to use. In fact, that is exactly the purpose of perf-mem and it achieves its purpose by using this event.
If you want to count the total number of load uops retired, then you should use L1-dcache-loads, which is mapped to the MEM_UOPS_RETIRED.ALL_LOADS performance event on Intel processors.
On the other hand, mem-stores and L1-dcache-stores are mapped to the exact same performance event on all current Intel processors, namely, MEM_UOPS_RETIRED.ALL_STORES, which does count all retired store uops.
So in summary, if you are using perf-stat, you should (almost) always use L1-dcache-loads and L1-dcache-stores to count retired loads and stores, respectively. These are mapped to the raw events you have used in the answer you posted, only more portable because they also work on AMD processors.
I have used a Broadwell(CPU e5-2620) server machine to collect all of the below events.
To collect memory-load events, I had to use a numeric event value. I basically ran the below command -
./perf record -e "r81d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
Here r81d0 represents the raw event for counting "memory loads amongst all instructions retired". "u" as can be understood represents user-space.
The below command, on the other hand,
./perf record -e "r82d0:u" -c 1 -d -m 128 ../../.././libquantum_base 20
has "r82d0:u" as a raw event representing "memory stores amongst all instructions retired in userspace".
I am doing some performance profiling for part of my program. And I try to measure the execution with the following four methods. Interestingly they show different results and I don't fully understand their differences. My CPU is Intel(R) Core(TM) i7-4770. System is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function, result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Method 3 and 4 exploits Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (I don't quite understand):
Computes the number core clock cycles when signal on a specific core is running (not halted)
Returns number of used cycles (halted cyles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two samples runs generate result as follows:
(Method 1 is in seconds. Methods 2~4 are divided by a (same) number to show a per-item cost).
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 over Method 2 is very consistent, while the others are not. i.e., 0.016489/0.533603 = 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yep I read from Internet that current generation of Intel CPU has this feature for rdtsc.)
Methods 3 reports higher than Method 2. I guess its somehow different from the TSC. But what is it?
Methods 4 is the most confusing one. It reports an order of magnitude larger number than Methods 2 and 3. Shouldn't it be also kind of cycle counts? Let alone it carries the "Invariant" name.
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
PCM API is great for fine tuned performance measurement when you know exactly what you are doing. Which is generally obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure what exact services you need from it that you can't get from clock_gettime.
I am using the Intel Xeon Phi coprocessor, which has up to 240 threads, and I am working on minimizing the number of threads used for a particular application (or maximize performance) while being within a percentage of the best execution time. So for example if I have the following measurements:
Threads | Execution time
240 100 s
200 105 s
150 107 s
120 109 s
100 120 s
I would like to select a number of threads between 120 and 150, since the "performance curve" there seems to stabilize and the reduction in execution time is not that significant (in this case around 15% of the best measured time. I did this using an exhaustive search algorithm (measuring from 1 to 240 threads), but my problem is that it takes too long for smaller number of threads (obviously depending on the size of the problem).
To try to reduce the number of measurements, I developed a sort of "binary search" algorithm. Basically I have an upper and lower limit (beginning at 0 and 240 threads), I take the value in the middle and measure it and at 240. I get the percent difference between both values and if it is within 15% (this value was selected after analyzing the results for the exhaustive search) I assign a new lower or upper bound. If the difference is larger than 15% then this is a new lower bound (120-240) and if it is smaller then it is a new upper bound (0-120), and if I get a better execution time I store it as the best execution time.
The problem with this algorithm is that first of all this is not necessarily a sorted array of execution times, and for some problem sizes the exhaustive search results show two different minimum, so for example in one I get the best performance at 80 threads and at 170, and I would like to be able to return 80, and not 170 threads as a result of the search. However, for the other cases where there is only one minimum, the algorithm found a value very close to the one expected.
If anyone has a better idea or knows of an existing search algorithm or heuristic that could help me I would be really grateful.
I'm taking it that your goal is to get the best relative performance for the least amount of threads, while still maintaining some limit on performance based on a coefficient (<=1) of the best possible performance. IE: If the coefficient is 0.85 then the performance should be no less than 85% of the performance using all threads.
It seems like what you should be trying to do is simply find the minimium number of threads required to obtain the performance bound. Rather than looking at 1-240 threads, start at 240 threads and reduce the number of threads until you can place a lower bound on the performance limit. You can then work up from the lower bound in such a way that you can find the min without passing over it. If you don't have predefined performance bound, then you can calculate one on the fly based on diminishing returns.
As long as the performance limit has not been exceeded, half the number of threads (start with max number of threads). The number that exceeds the performance limit is a lower bound on the number of threads required.
Starting at the lower bound on the number of threads, Z, add m threads if can be added without getting within the performance limit. Repeatedly double the number of threads added until within the performance limit. If adding the threads get within the performance limit, subtract the last addition and reset the number of threads to be added to m. If even just adding m gets within the limit, then add the last m threads and return the number of threads.
It might be clearer to give an example of what the process looks like step by step. Where Passed means that the number of threads are outside of the performance limits, and failed means they are either on the performance limit or inside of it.
Try adding 1m (Z + 1m). Passed. Threads = Z + m.
Try adding 2m (Z + 3m). Passed. Threads = Z + 3m.
Try adding 4m (Z + 7m). Failed. Threads = Z + 3m. Reset.
Try adding 1m. Passed. Threads = Z + 4m.
Try adding 2m. Passed. Threads = Z + 6m.
Z + 7m failed earlier so reset.
Comparisons/lookups are cheap, use them to prevent duplication of work.
Try adding 1m. Failed. Threads = Z + 6m. Reset.
Cannot add less than 1m and still in outside of performance limit.
The solution is Z + 7m threads.
Since Z + 6m is m threads short of the performance limit.
It's a bit inefficient, but it does find the minimium number of threads (>= Z) required to obtain the performance bound to within an error of m-1 threads and requiring only O(log (N-Z)) tests. This should be enough in most cases, but if it isn't just skip step 1 and use Z=m. Unless increasing the number of threads rapidly decreases the run-time causing very slow run times when Z is very small. In which case, doing step 1 and using interpolation can get an idea of how quickly the run-time increases as the number of threads decrease, which is also useful for determining a good performance limit if none is given.
I have some serial code that I have started to parallelize using Intel's TBB. My first aim was to parallelize almost all the for loops in the code (I have even parallelized for within for loop)and right now having done that I get some speedup.I am looking for more places/ideas/options to parallelize...I know this might sound a bit vague without having much reference to the problem but I am looking for generic ideas here which I can explore in my code.
Overview of algo( the following algo is run over all levels of the image starting with shortest and increasing width and height by 2 each time till you reach actual height and width).
For all image pairs starting with the smallest pair
For height = 2 to image_height - 2
Create a 5 by image_width ROI of both left and right images.
For width = 2 to image_width - 2
Create a 5 by 5 window of the left ROI centered around width and find best match in the right ROI using NCC
Create a 5 by 5 window of the right ROI centered around width and find best match in the left ROI using NCC
Disparity = current_width - best match
The edge pixels that did not receive a disparity gets the disparity of its neighbors
For height = 0 to image_height
For width = 0 to image_width
Check smoothness, uniqueness and order constraints*(parallelized separately)
For height = 0 to image_height
For width = 0 to image_width
For disparity that failed constraints, use the average disparity of
neighbors that passed the constraints
Normalize all disparity and output to screen
Just for some perspective, it may not always be worthwhile to parallelize something.
Just because you have a for loop where each iteration can be done independently of each other, doesn't always mean you should.
TBB has some overhead for starting those parallel_for loops, so unless you're looping a large number of times, you probably shouldn't parallelize it.
But, if each loop is extremely expensive (Like in CirrusFlyer's example) then feel free to parallelize it.
More specifically, look for times where the overhead of the parallel computation is small relative to the cost of having it parallelized.
Also, be careful about doing nested parallel_for loops, as this can get expensive. You may want to just stick with paralellizing the outer for loop.
The silly answer is anything that is time consuming or iterative. I use Microsoft's .NET v4.0 Task Parallel Library and one of the interesting things about their setup is its "expressed parallelism." An interesting term to describe "attempted parallelism." Though, your coding statements may say "use the TPL here" if the host platform doesn't have the necessary cores it will simply invoke the old fashion serial code in its place.
I have begun to use the TPL on all my projects. Any place there are loops especially (this requires that I design my classes and methods such that there are no dependencies between the loop iterations). But any place that might have been just good old fashion multithreaded code I look to see if it's something I can place on different cores now.
My favorite so far has been an application I have that downloads ~7,800 different URL's to analyze the contents of the pages, and if it finds information that it's looking for does some additional processing .... this used to take between 26 - 29 minutes to complete. My Dell T7500 workstation with dual quad core Xeon 3GHz processors, with 24GB of RAM, and Windows 7 Ultimate 64-bit edition now crunches the entire thing in about 5 minutes. A huge difference for me.
I also have a publish / subscribe communication engine that I have been refactoring to take advantage of TPL (especially on "push" data from the Server to Clients ... you may have 10,000 client computers who have stated their interest in specific things, that once that event occurs, I need to push data to all of them). I don't have this done yet but I'm REALLY LOOKING FORWARD to seeing the results on this one.
Food for thought ...