AXI4 throughput on FPGA - Verilog

I'm working on a 7-series FPGA and am planning to use the MIG memory controller to interface with DDR3, with an AXI4 interface between the memory controller and the other modules inside the FPGA. What sort of throughput efficiency will I get, say, if I run it at some clock X with 64-bit data? What I mean is that assuming the full 64·X is unrealistic. What fraction of it is lost to handshaking in burst mode and in non-burst mode? I'm just looking for rough values, not exact ones. Something in the ballpark.
Thank you.

Per Xilinx's XAPP792, 70% efficiency is a reasonable number. That figure is for video, which generally has highly burstable, DDR-SDRAM-friendly access patterns. Random memory access will probably achieve far less.
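As a back-of-the-envelope check (not a measurement), here is a minimal sketch of the arithmetic, assuming an example 200 MHz AXI clock as the X, a 64-bit bus, and the ~70% efficiency figure from XAPP792; the clock value is just a placeholder.

    /* Back-of-the-envelope estimate of usable AXI4/DDR3 bandwidth.
     * The 0.70 efficiency figure is the rough number from XAPP792 for
     * bursty (video-like) traffic; the clock and width are example values,
     * not a recommendation. */
    #include <stdio.h>

    int main(void)
    {
        double clk_hz     = 200e6;  /* example AXI clock (the X above) */
        double width_bits = 64.0;   /* AXI data width */
        double efficiency = 0.70;   /* ~70% for burst-friendly traffic,
                                       often much lower for random access */

        double peak_MBps      = clk_hz * width_bits / 8.0 / 1e6;
        double effective_MBps = peak_MBps * efficiency;

        printf("peak:      %.0f MB/s\n", peak_MBps);
        printf("effective: %.0f MB/s (at %.0f%% efficiency)\n",
               effective_MBps, efficiency * 100.0);
        return 0;
    }

With those placeholder numbers the peak is 1600 MB/s and the usable figure is roughly 1100 MB/s; substitute your actual clock and measured efficiency.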

Related

What is the difference between a soft core like Nios and a hard core?

I have been learning FPGAs recently. I tried to use SDRAM, and somebody recommended accessing it through a Nios II. But I have seen articles saying that using an IP core from Nios II (in C/C++) may be slower than writing it yourself in Verilog. Why? Is it because hardware is fast and parallel while software is not?
What is a soft CPU? FPGAs are composed of, among other things, reconfigurable logic blocks (LUTs), memory, and multipliers/DSPs. A soft CPU is a CPU built out of the FPGA's configurable logic. Nios II is Altera/Intel's flavour of a soft CPU. This differs from a hardened CPU like the ARM cores included in many Altera/Intel and Xilinx SoC FPGAs. In those cases, the ARM cores are made of fixed transistors instead of FPGA fabric, and cannot be reconfigured for other purposes.
Why have hardened CPUs? They're typically faster than soft CPUs, take up less space, and don't consume any of the valuable FPGA routing. Since many designs use some sort of CPU, hardening one (as is done with many popular I/O interfaces) produces an overall net gain. (If you don't need a CPU, you can simply buy a non-SoC FPGA.)
As for using a CPU vs. pure logic/hardware, there are also tradeoffs. Writing software is typically easier than writing Verilog, and the CPU will be set up to manage things like response times and other memory quirks. However, you'll be limited by the CPU clock speed (Nios typically runs at 100-200 MHz, depending on your FPGA), the extra latency of interfacing with a CPU, and the CPU's instruction execution speed.
In a similar vein to why FPGAs are gaining popularity, pure-hardware circuits have a degree of specialization that can allow them to operate faster than a more general-purpose CPU (either soft or hardened). The tradeoff for that speed boost is the extra work involved in writing timing-accurate Verilog.

FreeRTOS vs Linux against single event upsets

I am working on the on-board computer for a CubeSat. Our computer will be exposed to radiation, so single event upsets (e.g. bit flips) are likely to occur. Would a lighter, smaller OS like FreeRTOS bring more stability, robustness and a lower probability of failure than a full-blown Linux operating system?
The probability of a bit error in RAM is a function of time, memory size and radiation density, so a larger memory has a greater probability of being hit, and you can fit a FreeRTOS system in much less memory (say 10 KB instead of 4 MB); the toy calculation below illustrates that scaling. However, the usage rate of the smaller memory is likely much higher: in a FreeRTOS application, most of the code and data are accessed relatively frequently, while in a Linux deployment much of it is redundant and, if corrupted, may never be accessed at all.
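For illustration only, here is a toy calculation of that memory-size scaling; the per-bit upset rate is a made-up placeholder, since real rates depend on the orbit, shielding and the memory technology.

    /* Rough illustration of the memory-size argument above.
     * The per-bit upset rate is an assumed placeholder, not a measured
     * figure; only the relative scaling between footprints is the point. */
    #include <stdio.h>

    int main(void)
    {
        double upsets_per_bit_per_day = 1e-10;  /* assumed, not measured */
        double mission_days           = 365.0;

        double freertos_bits = 10e3 * 8.0;      /* ~10 KB footprint */
        double linux_bits    = 4e6  * 8.0;      /* ~4 MB footprint  */

        printf("expected upsets, FreeRTOS-sized image: %.4f\n",
               freertos_bits * upsets_per_bit_per_day * mission_days);
        printf("expected upsets, Linux-sized image:    %.4f\n",
               linux_bits    * upsets_per_bit_per_day * mission_days);
        return 0;
    }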
However the question makes little sense for a number of reasons, such as:
The effect of a bit-flip event is entirely non-deterministic; any single event may be benign or catastrophic. It is impossible to say that a system can tolerate one error when you don't know when or where the error will occur.
If your system can be implemented on FreeRTOS, why would you even consider Linux? They are chalk and cheese. If you need the extensive networking, filesystem, memory management, POSIX API and device support etc. provided by Linux, FreeRTOS is not suited to your application in any case, as you would have to add all that yourself from your own or additional third-party code. FreeRTOS is only a scheduling kernel, with threading, synchronisation and IPC support and little else. Conversely if you need hard real-time deterministic behaviour, Linux is unsuited to your application.
Where you might benefit from using an RTOS kernel like FreeRTOS is that it will execute from ROM, which may be less prone to the bit-flipping cosmic ray issue (although the availability of ECC/radiation-hardened Flash memory may indicate otherwise). You still need RAM for R/W data, but at least the code itself will be robust. A typical FreeRTOS system might run in SRAM (possibly in on-chip RAM on a microcontroller). I don't know whether low-density SRAM is less prone to bit-flipping than high-density SDRAM, but I am willing to believe it is. It is also possible to source radiation-hardened SRAM in any case.
The solution for a system using SDRAM in such an environment is to use ECC RAM which may largely overcome the problem of data corruption from radiation and non-deterministic system behaviour. However I would not imagine that even that would be sufficient for space or high-atmosphere applications.
In short the solution is not in the software, it has to be in the hardware, and the lengths you need to go to will depend on the radiation environment your system will be subjected to. However the selection of a small RTOS kernel allows the selection of hardware to be potentially much wider since it will run on a much wider range of architectures in much smaller memory, perform deterministically, respond to events in fewer cycles and is ROMable.

FPGA interface protocol

I am looking for a very fast protocol to implement interface communication between FPGAs (at the moment I am using an emulated Virtex-7 FPGA).
Actually, my requirements for the project I work on are really tight. I need to transfer data on the order of gigabytes per microsecond. The data I need to transfer does not need any kind of protocol overhead, so just a few control signals are enough.
In the past, I have designed interfaces based on the AXI protocol for a ZedBoard FPGA, but I am not sure that is enough.
I am sorry if I am not totally clear about what I am looking for, but it is hard for me too to figure out this part of my project.
Gigabytes per microsecond?? That's quite a bit... let's do some math...
I'll assume you want 2 gigabytes per microsecond, which I think is the LEAST amount you would need based on your wording. I'll also assume that for transmitting the data you're using only GPIO pins capable of transmitting at 1 Gb/s. 2 gigabytes per SECOND would require 16 GPIO pins. 2 gigabytes per MICROSECOND would require 16,000,000 GPIO pins! SIXTEEN MILLION PINS!
Your requirements are unobtanium.
According to Xilinx's Virtex-7 product page, the Virtex-7 HT has sixteen 28 Gb/s transceivers and can provide a total bandwidth of 2.78 Tb/s. Converting the total bandwidth to bytes gives 347.5 GB/s, which is 347.5 kB/µs. Around 3,000 Virtex-7 devices would be needed to achieve 1 gigabyte per microsecond, and that assumes no more than 4.25% overhead is added and that peak performance can be sustained.
Technology has not advanced far enough to satisfy these requirements. Either relax the requirements or wait for technology to catch up. If Moore's Law holds true, 16 petabits per second (2 GB/µs) on a single FPGA should be available by around 2031.
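For completeness, a small sketch that redoes the arithmetic above for an assumed target; the 1 Gb/s-per-GPIO-pin and 28 Gb/s-per-transceiver figures are the same ones used in this answer.

    /* Sanity-check of the arithmetic above: how many 1 Gb/s GPIO pins or
     * 28 Gb/s transceivers a given GB/us target would need. The target and
     * per-lane rates mirror the assumptions used in the answer. */
    #include <stdio.h>

    int main(void)
    {
        double target_GB_per_us = 2.0;                           /* requirement */
        double target_bps = target_GB_per_us * 1e9 * 8.0 * 1e6;  /* bits/second */

        double gpio_bps = 1e9;   /* assumed 1 Gb/s per GPIO pin        */
        double gt_bps   = 28e9;  /* Virtex-7 HT 28 Gb/s transceiver    */

        printf("target: %.3g b/s\n", target_bps);
        printf("1 Gb/s GPIO pins needed:     %.0f\n", target_bps / gpio_bps);
        printf("28 Gb/s transceivers needed: %.0f\n", target_bps / gt_bps);
        return 0;
    }

That works out to 16 million GPIO pins or roughly 570,000 of the 28 Gb/s transceivers, which is the same conclusion in different units.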

Something similar to RAPL for non-Sandy Bridge/Xeon processors

First post ever here.
I wanted to know if there is something similar to the Running Average Power Limit for other processors (an Intel i7) that aren't Sandy Bridge or Xeon processors, as that is the machine I'm working on in the lab.
For those who do not know, I pulled this description to bring you up to speed:
"RAPL(Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumptions."
What I am looking for in particular is to acquire energy consumption measurements of a processor's individual cores after running some code like matrix multiplication or vector addition. Temperature would be excellent too, but that's another question for another day (lm-sensors is a bit puzzling to me).
Thanks and Take Care.
Late answer on this: there's PowerTOP on Linux, but that works for laptops only, as it needs the battery discharge rate. It can display watts per process, but don't ask me how accurate that is (personally I think there might be some problems with it). IIRC it counts the number of CPU wakeups from a sleep state to estimate the energy consumption per process. Also, for AMD processors there's the fam15h_power driver in the lm-sensors software package. For fairly new (2011 and newer) Bulldozer AMD CPUs you can get the energy consumption that way.
Note that RAPL does not provide energy consumption per core on a multicore CPU, but only for the whole CPU. You can get the energy consumption of the cores and the non-core parts (like integrated graphics) separately, but per-core measurements are not possible.
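If you do end up on a RAPL-capable CPU, one common way to read those package-level counters on Linux is the powercap sysfs interface. A minimal sketch, assuming /sys/class/powercap/intel-rapl:0/energy_uj exists on your machine (it only will with the intel_rapl driver loaded, and reading it may require root):

    /* Minimal sketch: read the package energy counter before and after a
     * workload via the Linux powercap sysfs interface. Requires a
     * RAPL-capable CPU with the intel_rapl driver loaded; the path and
     * permissions may differ on your system. */
    #include <stdio.h>
    #include <unistd.h>

    static long long read_energy_uj(const char *path)
    {
        FILE *f = fopen(path, "r");
        long long uj = -1;
        if (f) {
            if (fscanf(f, "%lld", &uj) != 1)
                uj = -1;
            fclose(f);
        }
        return uj;
    }

    int main(void)
    {
        const char *path = "/sys/class/powercap/intel-rapl:0/energy_uj";

        long long before = read_energy_uj(path);
        if (before < 0) {
            fprintf(stderr, "could not read %s\n", path);
            return 1;
        }

        /* ... run the workload of interest here (e.g. matrix multiply) ... */
        sleep(1);

        long long after = read_energy_uj(path);
        /* The counter wraps eventually; a real tool should handle that. */
        printf("package energy: %.3f J\n", (after - before) / 1e6);
        return 0;
    }

Again, this gives whole-package (or cores vs. graphics) energy, not per-core figures.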

Multicore processor core communication speeds

I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates; I just need some average value. I'm basically needing to write something up at the moment and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is about 32 nanoseconds. Network latency is measured in milliseconds, and there are 1,000,000 nanoseconds in a millisecond. "Good" network latency is considered around or under 100 ms, so given that, inter-core latency is on the order of a million times faster.
Besides latency, there's also bandwidth to consider. Again based on the linked benchmark, inter-core bandwidth for that particular configuration is about 14 GB/s, whereas according to this: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, a real-world test of a Gigabit Ethernet connection shows about 35.8 MB/s. So the difference there is smaller, on the order of only 400 times in bandwidth as opposed to roughly 1,000,000 times in latency. Depending on which is more important to your application, that might change your numbers.
Network latencies are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware like Dolphin or Myrinet (~$1k/port). Inter-core speeds are measured in nanoseconds, as the data is copied from one memory area to another and then some signal is sent from one CPU to another (the data being protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
the network layer imposes a massive overhead in terms of packets, addressing, contention handling, etc.
the physical distances involved add a sizeable delay
Measuring inter-core communication speed would be very difficult, but given the above I think it's a redundant calculation to make.
This is a non-trivial thing to find. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, and so on. In short, run a benchmark or you'll be guessing in the dark.
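If you do want a number for your own machine, a rough sketch of such a benchmark is below: two threads pinned to different cores ping-pong a flag through a shared atomic, and the average round trip approximates inter-core signalling latency. The core numbers and iteration count are arbitrary choices, and this only measures one narrow aspect of "communication speed".

    /* Rough inter-core round-trip "ping-pong" benchmark: two threads pinned
     * to different cores bounce a flag back and forth through a shared
     * atomic. Build with: gcc -O2 -pthread pingpong.c
     * (Linux/glibc assumed for the CPU-affinity calls; cores 0 and 1 are
     * arbitrary choices.) */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000

    static atomic_int flag = 0;   /* 0: ping's turn, 1: pong's turn */

    static void pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *pong(void *arg)
    {
        (void)arg;
        pin_to_core(1);
        for (int i = 0; i < ITERS; i++) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                ;                               /* spin until ping's message */
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct timespec t0, t1;

        pin_to_core(0);
        pthread_create(&t, NULL, pong, NULL);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            atomic_store_explicit(&flag, 1, memory_order_release);
            while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                ;                               /* spin until pong replies */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg round trip: %.1f ns over %d iterations\n", ns / ITERS, ITERS);
        return 0;
    }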
