I am looking for a very fast protocol for interface communication between FPGAs (at the moment I am targeting an emulated Virtex-7 FPGA).
The requirements for the project I am working on are really demanding: I need to transfer data on the order of gigabytes per microsecond. The data does not need any processing on top of the transfer, so just a few control signals are enough.
In the past I have designed AXI-based interfaces for a ZedBoard FPGA, but I am not sure that is enough.
I am sorry if I am not totally clear about what I am looking for; this part of the project is hard for me to pin down as well.
Gigabytes per microsecond?? That's quite a bit... let's do some math...
I'll assume you want 2 Gigabytes per microsecond, which I think is the LEAST amount you would need based on your wording. I'll assume for transmitting the data you're using only GPIO pins that are capable of transmitting data at 1 Gbps. 2 Gigabytes per SECOND would require 16 GPIO pins. 2 Gigabytes per MICROSECOND would require 16,000,000 GPIO Pins! SIXTEEN MILLION PINS!
Your requirements are unobtainium.
According to Xilinx's Virtex-7 product page, the Virtex-7 HT has sixteen 28 Gb/s transceivers and can provide a total bandwidth of 2.78 Tb/s. Converting that total bandwidth to bytes gives 347.5 GB/s, or 347.5 kB/us. Roughly 3,000 Virtex-7s would be needed to achieve 1 gigabyte per microsecond, and that assumes no more than 4.25% overhead and that every device maintains peak performance.
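If you want to redo the arithmetic yourself, here is the same back-of-the-envelope calculation as a small Python snippet (the 1 Gb/s per GPIO pin and the 2.78 Tb/s aggregate figure are the same assumptions and quoted figures used above):

    # Back-of-the-envelope check of the numbers above.
    target_bytes_per_us = 2e9                              # 2 GB per microsecond
    target_bits_per_s   = target_bytes_per_us * 8 * 1e6    # = 1.6e16 b/s = 16 Pb/s

    gpio_rate_bps = 1e9                                    # assume 1 Gb/s per GPIO pin
    print(target_bits_per_s / gpio_rate_bps)               # ~16,000,000 pins

    v7ht_total_bps    = 2.78e12                            # 2.78 Tb/s aggregate serial bandwidth
    v7ht_bytes_per_us = v7ht_total_bps / 8 / 1e6
    print(v7ht_bytes_per_us)                               # ~347,500 bytes/us = 347.5 kB/us
    print(1e9 / v7ht_bytes_per_us)                         # ~2,878 devices for 1 GB/us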
Technology has not advanced far enough to satisfy these requirements. Either relax the requirements or wait for technology to catch up. If Moore's Law holds true, a 16 petabit-per-second (2 GB/us) FPGA should be available around 2031.
I have a Linux machine with two PCIe RS-485 cards (XR17V354 & XR17V352). I have one port on one of the cards hardwired to one port on the other card. These cards are driven by the generic serial driver (serial8250).
I am running a test and measuring latency. One Linux process sends two bytes out the port and then listens for two incoming bytes. The other process receives two bytes and immediately sends two bytes back.
I'm measuring this round-trip latency to be around 1500 microseconds, with a standard deviation of about 40 microseconds. I am trying to understand the source of this latency. Specifically, I'd like to understand the time between the hard IRQ firing to signal that data is ready to read and the bytes being made available to the user-space process.
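For context, the measurement loop is conceptually along the lines of the sketch below (not my actual test program; a minimal pyserial version with a hypothetical device node and an assumed baud rate):

    import serial, time, statistics   # pyserial

    # Hypothetical device node and baud rate -- substitute the real ones.
    port = serial.Serial("/dev/ttyXR0", baudrate=115200, timeout=1)

    samples = []
    for _ in range(1000):
        t0 = time.monotonic()
        port.write(b"\x55\xaa")              # send two bytes
        echo = port.read(2)                  # wait for the two echoed bytes
        if len(echo) == 2:
            samples.append((time.monotonic() - t0) * 1e6)   # microseconds

    print("mean %.1f us, stdev %.1f us" % (statistics.mean(samples),
                                           statistics.stdev(samples)))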
I am aware of the ftrace feature, but I am not sure how best to utilize it, or if there are other, more suitable tools. Thanks.
What kind of driver is this? I assume it's a driver in kernel space and not UIO.
Independent of your issue, you could start by looking at how long it takes to get from a hardware interrupt to the kernel driver and from there to user space.
Here [1] is an ancient test case which can be hacked a bit so you can compare interrupt latencies on "standard" Linux, preempt-rt patched Linux, and maybe something like Xenomai as well (although the Xenomai route would require you to rewrite your driver).
You might also want to have a look at [2], cyclictest and friends, and maybe try to drill into your system with perf to see more details system-wide.
Last but not least, have a look at LTTng [3], which lets you instrument code and already ships with many instrumentation points.
[1] http://www.denx.de/wiki/DULG/AN2008_03_Xenomai_gpioirqbench
[2] http://cgit.openembedded.org/openembedded-core/tree/meta/recipes-rt/rt-tests/
[3] http://lttng.org/
I'm working on a 7-series FPGA and am planning to use the MIG memory controller to interface with DDR3, with an AXI4 interface between the memory controller and the other modules inside the FPGA. What sort of throughput efficiency will I get if, say, I run it at some clock X with 64-bit data? Clearly assuming the full 64×X is unrealistic. What fraction of it is lost to handshaking, in burst mode and in non-burst mode? I'm just looking for rough, ballpark values, not exact figures.
Thank you.
Per Xilinx's xapp792, 70% efficiency is a reasonable number. That figure is for video, which generally has very burstable, DDR-SDRAM-friendly access patterns. Random memory access will probably fare far worse.
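As a rough worked example of what that means in numbers (the 70% figure is from xapp792; the 400 MHz user-interface clock is just an assumed value to plug in, and the 64-bit width comes from the question):

    # Rough DDR3 throughput estimate: raw rate x assumed efficiency.
    clk_mhz    = 400       # example MIG user-interface clock (assumption)
    width_bits = 64        # AXI data width from the question
    efficiency = 0.70      # xapp792 figure for burst-friendly traffic

    raw_MBps = clk_mhz * 1e6 * width_bits / 8 / 1e6
    print("raw: %.0f MB/s, usable: %.0f MB/s" % (raw_MBps, raw_MBps * efficiency))
    # raw: 3200 MB/s, usable: 2240 MB/s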
I am seeking help, most importantly from VMEbus experts.
I am working on a project that aims to set up a communication channel from a real-time PowerPC controller (Emerson MVME4100), running vxWorks 6.8, to an Intel Linux computer (Xembedded XVME6300), running Debian 6 with kernel 2.6.32.
This channel runs over the VME bus; both computers are in a VME enclosure and both use the Tundra Tsi148 chipset. The Intel computer is explicitly configured as the system controller; the real-time computer explicitly is not.
Setup:
For the Intel computer I wrote a custom driver that creates a 4 MB kernel buffer and shares it over the VME bus by means of a slave window;
For the real-time computer I set up a DMA transfer that repeatedly forwards blocks of exactly 48640 bytes, filled with test data (zeros, ones, twos, etc.), in quick succession (once every 32 milliseconds, if possible);
On the Intel computer I read the kernel buffer from the driver with a hand-started Python program, to see whether the data arrives correctly.
Expectation:
I am expecting to see the same data (zeros, ones etc) from the Python program.
I am expecting transmission times roughly corresponding to the chosen bus speed (typically 290 us or 145 us, depending on the bus speed), plus a reasonable DMA setup overhead (up to 10 us? I am willing to accept larger numbers, say hundreds of microseconds, if that is what the bus normally needs)
Result:
Sometimes the data does not arrive at all, and the "transmission" time is ~2000 us
Sometimes the data arrives reliably, but the transmission time is ~98270 us or ~98470 us, depending on the chosen bus speed.
Questions:
How could I make the transmission reliable and bring down these awful latencies?
What general direction should I search next?
(I would like to tag with VMEbus if I could)
Many thanks
My comments on the question describe how I got the bus working:
- ensure 2eSST320 on both sides of the bus
- ensure that the DMA transaction uses a valid block size (the largest valid size was 4096 bytes)
I achieved an effective speed of 150 MB/s (the bus can achieve 320 MB/s, but the Tsi148 chip is known for adding significant overhead). This is good enough for me.
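For completeness, the numbers work out roughly as follows (plain arithmetic from the 4096-byte block size above and the 48640-byte transfers in the question):

    # Expected per-transfer numbers for one 48640-byte block, given 4096-byte DMA chunks.
    block_bytes = 48640
    chunk_bytes = 4096
    chunks = -(-block_bytes // chunk_bytes)     # ceiling division -> 12 descriptors
    print(chunks)                               # 11 full chunks plus one 3584-byte tail

    for rate_MBps in (320, 150):                # theoretical 2eSST320 vs. what I measured
        print(rate_MBps, block_bytes / (rate_MBps * 1e6) * 1e6, "us")
    # 320 MB/s -> ~152 us, 150 MB/s -> ~324 us, i.e. far below the ~98000 us seen before the fix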
I have an embedded Linux device and I would like to know the baseline for various operations on the device, e.g. memory read, memory write, moviNAND read and write, etc. Is there a way to find the baseline speeds for these operations on the device?
Read the datasheets for the various devices and do some math.
For example, if you have 32 bit SDRAM running at 50MHz, with CAS latencies of 4-1-1-1, and you can burst, then you know it will take 7 clocks to transfer 4 words (16 bytes). There will probably be an idle period of 1 clock also, so it's really 8 clocks for the 16 bytes, or 2 bytes for every clock, which at 50MHz is 100MB/s. Now, you should really subtract the time it spends refreshing the memory, etc.
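As a quick sanity check of that arithmetic:

    # Burst throughput for the 32-bit, 50 MHz SDRAM example above.
    clock_hz        = 50e6
    burst_clocks    = 4 + 1 + 1 + 1      # 4-1-1-1 timing for a 4-word burst
    idle_clocks     = 1                  # assumed turnaround/idle clock
    bytes_per_burst = 4 * 4              # four 32-bit words

    bytes_per_clock = bytes_per_burst / (burst_clocks + idle_clocks)   # 2 bytes per clock
    print(bytes_per_clock * clock_hz / 1e6, "MB/s")                    # 100 MB/s (before refresh)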
There's really no way to know what the performance should be without reading specs and doing the math.
I would like to find the speed of communication between two cores of a computer.
I'm in the very early stages of planning to massively parallelise a sequential program and I need to think about network communication speeds vs. communication between cores on a single processor.
Ubuntu Linux probably provides some way of seeing this sort of information? I would have thought the speed fluctuates; I just need some average value. I'm basically writing something up at the moment and it would be good to talk about these ratios.
Any ideas?
Thanks.
According to this benchmark: http://www.dragonsteelmods.com/index.php?option=com_content&task=view&id=6120&Itemid=38&limit=1&limitstart=4 (Last image on the page)
On an Intel Q6600, inter-core latency is about 32 nanoseconds. Network latency is measured in milliseconds, and one millisecond is 1,000,000 nanoseconds. "Good" network latency is considered around or under 100 ms, so given that, inter-core latency is on the order of a few million times lower (100 ms / 32 ns ≈ 3×10^6).
Besides latency, there is also bandwidth to consider. Again based on the linked benchmark, for that particular configuration inter-core bandwidth is about 14 GB/s, whereas this real-world test of a Gigabit Ethernet connection, http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321-3.html, shows about 35.8 MB/s. The difference there is smaller, only on the order of 400 times in bandwidth as opposed to millions of times in latency. Depending on which matters more to your application, that may change your numbers.
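Putting those quoted figures side by side:

    # Latency and bandwidth ratios from the figures quoted above.
    inter_core_latency_ns = 32
    network_latency_ns    = 100e-3 * 1e9            # 100 ms "good" network latency
    print(network_latency_ns / inter_core_latency_ns)    # ~3.1 million

    inter_core_bw_MBps = 14000                      # ~14 GB/s from the benchmark
    gigabit_bw_MBps    = 35.8                       # measured real-world GigE throughput
    print(inter_core_bw_MBps / gigabit_bw_MBps)          # ~390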
Network latencies are measured in milliseconds for Ethernet ($5-$100/port), or microseconds for specialized MPI hardware such as Dolphin or Myrinet (~$1k/port). Inter-core latencies are measured in nanoseconds, as the data is copied from one memory area to another and then some signal is sent from one CPU to another (with the data protected from simultaneous access by a mutex or a full-bodied queue).
So, using a back-of-the-napkin calculation, the ratio is about 1:10^6.
Inter-core communication is going to be massively faster. Why?
- the network layer imposes a massive overhead in terms of packets, addressing, handling contention, etc.
- the physical distances involved impose a sizeable impact
Measuring inter-core communication speed precisely would be very difficult, but given the above I think it's a redundant calculation to attempt.
This is a non-trivial thing to find. The speed of data transfer between two cores depends entirely on the application. It could depend on any (or all) of: the speed of register access, the clock speed of the cores, the system bus speed, the latency of your cache, the latency of your memory, and so on. In short, run a benchmark or you'll be guessing in the dark.
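If you do want a quick number for your own machine rather than spec-sheet estimates, something like the sketch below gives a crude inter-core round-trip figure (it uses Python's multiprocessing; pin the two processes to specific cores, e.g. with taskset, if you want to be strict about which cores are talking):

    # Crude inter-process (and hence roughly inter-core) round-trip latency benchmark.
    import multiprocessing as mp
    import time

    def echo(conn):
        while True:
            msg = conn.recv()
            if msg is None:
                break
            conn.send(msg)

    if __name__ == "__main__":
        parent, child = mp.Pipe()
        p = mp.Process(target=echo, args=(child,))
        p.start()

        n = 100000
        t0 = time.monotonic()
        for _ in range(n):
            parent.send(b"x")
            parent.recv()
        elapsed = time.monotonic() - t0

        parent.send(None)                     # tell the echo process to exit
        p.join()
        print("round trip: %.2f us" % (elapsed / n * 1e6))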