I'm working on control system stuff, where the NIC interrupt is used as a trigger. This works very well for cycle times greater than 3 microseconds. Now I want to run some performance tests and measure the transmission time, i.e. the shortest possible time between two interrupts.
The sender is sending 60-byte packets as fast as possible. The receiver should generate one interrupt per packet. I'm testing with 256 packets, the size of the Rx descriptor ring. The packet data isn't handled during the test; only the interrupt is of interest.
The trouble is that reception is very fast, down to less than 1 microsecond between two interrupts, but only for around 70 interrupts/descriptors. Then the NIC sets the RDU (Rx Descriptor Unavailable) bit and stops receiving before reaching the end of the ring. The confusing thing is that when I increase the size of the Rx descriptor ring to 2048 (for example), the number of interrupts increases too (to around 800). I don't understand this behavior; I thought it should stop again after 70 interrupts.
It seems to be a timing problem, but why? I'm overlooking something, but what? Can somebody help me?
Thanks in advance!
What I think is that, due to the large RX packet rate, you are missing receive interrupts. Don't count interrupts to see how many packets were received; rely on the "own" bit of the receive descriptors.
Receive Descriptor Unavailable will be set only when you reach the end of the ring, unless you have made some error in programming the RX descriptors (e.g. forgot to set the ownership bit).
So if your RX ring has 256 descriptors, I think you should receive 256 packets without recycling RX descriptors.
If you are doubtful whether you are reaching the end of the ring or not, try setting the interrupt-on-completion bit of only the last RX descriptor. That way you receive only one interrupt, at the end of the ring.
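To illustrate the suggestion, here is a minimal, hypothetical sketch of counting received frames by descriptor ownership rather than by interrupts; the descriptor layout, field names and OWN bit position are placeholders for whatever your NIC's datasheet actually defines:

#include <stdint.h>

#define RX_RING_SIZE 256
#define DESC_OWN     (1u << 31)          /* hypothetical: set while the NIC still owns the descriptor */

struct rx_desc {                         /* hypothetical layout */
    volatile uint32_t opts1;             /* OWN bit, status, frame length */
    uint32_t opts2;
    uint64_t buf_addr;
};

/* Count frames the NIC has handed back to the host, even if their
 * interrupts were coalesced or missed. */
static unsigned int count_received(const struct rx_desc *ring)
{
    unsigned int i, n = 0;

    for (i = 0; i < RX_RING_SIZE; i++)
        if (!(ring[i].opts1 & DESC_OWN)) /* OWN cleared => frame received */
            n++;
    return n;
}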
I'm currently working with termios for serial communication in Linux.
I need to set an intercharacter timeout to 5ms.
I found a way to set an intercharacter timeout using VMIN and VTIME, where both VMIN > 0 and VTIME > 0 have to hold.
The problem is that I need to set VTIME to 5 ms, but VTIME is expressed in tenths of a second.
VTIME's data type is unsigned char, so I can't just set it to 0.05.
Does anyone know if there is some way around this?
I need to set an intercharacter timeout to 5ms.
...
Does anyone know if there is some way around this?
No, there is no way to set a shorter termios timeout than 100 ms.
Depending on your hardware and kernel configuration, this timeout may not be reliable at all, especially if you are trying to detect time-separated messages.
The termios handling is at least a full layer above the UART device driver (see Linux serial drivers).
Unless your kernel is configured to ensure that the bottom-half of the UART driver and the kworker threads for termios are high priority and low latency, then short intercharacter intervals cannot be accurately or reliably determined.
If the UART utilizes a FIFO to buffer incoming data, then that hardware obscures the intercharacter spacing that the software can detect.
Similarly when the UART driver is using DMA to store the received data, intercharacter timing will be obscured.
With DMA the CPU is not involved with handling the received data until the DMA operation is complete, and all temporal information about any intercharacter separation is gone.
(Crucial information such as framing error and/or parity error is difficult/impossible to pinpoint to a specific byte when using DMA.)
Even without DMA, termios will only be able to use timing based on the transfer of data through the tty flip buffers (which is a layer removed from the timing on the wire).
Some UARTs do have hardware that assist in detecting the end-of-message by idle line.
For example Atmel/Microchip ATSAMA5 and AT91SAM9 SoCs have USARTs with a Receiver Timeout feature that measures the idle time after each received frame.
When this idle line time exceeds a specified value, an interrupt can be generated.
The Linux driver for the Atmel USART typically uses the receiver-timeout interrupt to (prematurely) terminate the current DMA receive operation, and copy the contents of the DMA buffer to the tty flip buffer.
In summary, you cannot (or at least should not) rely solely on VMIN and VTIME settings to detect time-separated messages. See Parsing time-delimited UART data.
The message packets need to have delimiter/sentinel characters/bytes so that messages can be reliably parsed and validated.
See parsing complete messages from serial port for an example of efficient use of syscalls with a local buffer.
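As a rough illustration of delimiter-based framing (not taken from the linked answers), here is a sketch that assumes messages end with a sentinel byte, here '\n', and that raw bytes read from the port have been appended to a local accumulation buffer:

#include <string.h>

#define MSG_END '\n'                     /* assumed sentinel byte */

/* Copy one complete message (including the sentinel) from the
 * accumulation buffer buf (holding *len raw bytes) into msg.
 * Returns the message length, or 0 if no complete message yet. */
static size_t extract_message(char *buf, size_t *len, char *msg, size_t msgsz)
{
    char *end = memchr(buf, MSG_END, *len);
    size_t consumed, mlen;

    if (end == NULL)
        return 0;                        /* need more bytes from read() */

    consumed = (size_t)(end - buf) + 1;  /* message plus its sentinel */
    mlen = consumed > msgsz ? msgsz : consumed;
    memcpy(msg, buf, mlen);

    /* shift any remaining bytes to the front of the buffer */
    memmove(buf, buf + consumed, *len - consumed);
    *len -= consumed;
    return mlen;
}

The caller appends whatever read() returns to buf and *len, then calls this in a loop until it returns 0; validation (checksum, length field) happens on each extracted message.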
I'm implementing a protocol over serial ports on Linux. The protocol is based on a request/answer scheme, so the throughput is limited by the time it takes to send a packet to a device and get an answer. The devices are mostly ARM based and run Linux >= 3.0. I'm having trouble reducing the round-trip time below 10 ms (115200 baud, 8 data bits, no parity, 7 bytes per message).
Which I/O interfaces will give me the lowest latency: select, poll, epoll, or polling by hand with ioctl? Does blocking or non-blocking I/O impact latency?
I tried setting the low_latency flag with setserial. But it seemed like it had no effect.
Are there any other things I can try to reduce latency? Since I control all devices it would even be possible to patch the kernel, but I'd prefer not to.
---- Edit ----
The serial controller used is a 16550A.
Request/answer schemes tend to be inefficient, and it shows quickly on a serial port. If you are interested in throughput, look at windowed protocols, like the Kermit file-sending protocol.
Now if you want to stick with your protocol and reduce latency, select, poll, read will all give you roughly the same latency, because as Andy Ross indicated, the real latency is in the hardware FIFO handling.
If you are lucky, you can tweak the driver behaviour without patching, but you still need to look at the driver code. However, having the ARM handle a 10 kHz interrupt rate will certainly not be good for the overall system performance...
Another option is to pad your packets so that you hit the FIFO threshold every time. It will also confirm whether or not it is a FIFO threshold problem.
10 msec @ 115200 is enough to transmit 100 bytes (assuming 8N1), so what you are seeing is probably because the low_latency flag is not set. Try
setserial /dev/<tty_name> low_latency
It will set the low_latency flag, which is used by the kernel when moving data up in the tty layer:
void tty_flip_buffer_push(struct tty_struct *tty)
{
        unsigned long flags;

        spin_lock_irqsave(&tty->buf.lock, flags);
        if (tty->buf.tail != NULL)
                tty->buf.tail->commit = tty->buf.tail->used;
        spin_unlock_irqrestore(&tty->buf.lock, flags);

        if (tty->low_latency)
                flush_to_ldisc(&tty->buf.work);  /* push to the line discipline immediately */
        else
                schedule_work(&tty->buf.work);   /* defer to a workqueue (adds latency) */
}
The schedule_work call might be responsible for the 10 msec latency you observe.
Having talked to some more engineers about the topic, I came to the conclusion that this problem is not solvable in user space. Since we need to cross the bridge into kernel land, we plan to implement a kernel module which talks our protocol and gives us latencies < 1 ms.
--- edit ---
Turns out I was completely wrong. All that was necessary was to increase the kernel tick rate. The default 100 Hz tick added the 10 ms delay; 1000 Hz and a negative nice value for the serial process give me the timing behavior I wanted to reach.
Serial ports on Linux are "wrapped" into Unix-style terminal constructs, which hits you with one tick of lag, i.e. 10 ms. See whether stty -F /dev/ttySx raw low_latency helps; no guarantees though.
On a PC, you can go hardcore and talk to standard serial ports directly: issue setserial /dev/ttySx uart none to unbind the Linux driver from the serial port hardware, and control the port via inb/outb to the port registers. I've tried that; it works great.
The downside is you don't get interrupts when data arrives, and you have to poll the register. Often.
You should be able to do the same on the ARM device side; it may be much harder with exotic serial port hardware.
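For the PC case, a rough sketch of what that raw polling can look like, assuming a standard port at I/O base 0x3F8, x86 with glibc's <sys/io.h>, and root privileges for ioperm(); register offsets follow the 16550 datasheet:

#include <sys/io.h>                      /* inb(), ioperm(); x86 + glibc */

#define UART_BASE 0x3F8                  /* assumed: first PC serial port */
#define UART_RBR  (UART_BASE + 0)        /* receive buffer register */
#define UART_LSR  (UART_BASE + 5)        /* line status register */
#define LSR_DR    0x01                   /* "data ready" bit */

/* Call once (as root) before polling: grant access to the port's I/O range. */
static int uart_io_init(void)
{
    return ioperm(UART_BASE, 8, 1);
}

/* Busy-wait until a byte is available, then read it. */
static int poll_rx_byte(void)
{
    while (!(inb(UART_LSR) & LSR_DR))
        ;                                /* spin; in practice bound this loop */
    return inb(UART_RBR);
}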
Here's what setserial does to set low latency on a file descriptor of a port:
struct serial_struct serial;             /* from <linux/serial.h> */
ioctl(fd, TIOCGSERIAL, &serial);         /* read the current settings */
serial.flags |= ASYNC_LOW_LATENCY;       /* ask the driver for low-latency handling */
ioctl(fd, TIOCSSERIAL, &serial);         /* write them back */
In short: Use a USB adapter and ASYNC_LOW_LATENCY.
I've used an FT232RL-based USB adapter on Modbus at 115.2 kbps.
I get about 5 transactions (to 4 devices) in about 20 ms total with ASYNC_LOW_LATENCY. This includes two transactions to a slow-poke device (4 ms response time).
Without ASYNC_LOW_LATENCY the total time is about 60 ms.
With FTDI USB adapters, ASYNC_LOW_LATENCY sets the inter-character timer on the chip itself to 1 ms (instead of the default 16 ms).
I'm currently using a home-brewed USB adapter and I can set the latency for the adapter itself to whatever value I want. Setting it at 200 µs shaves another millisecond off that 20 ms.
None of those system calls have an effect on latency. If you want to read and write one byte as fast as possible from userspace, you really aren't going to do better than a simple read()/write() pair. Try replacing the serial stream with a socket from another userspace process and see if the latencies improve. If they don't, then your problems are CPU speed and hardware limitations.
Are you sure your hardware can do this at all? It's not uncommon to find UARTs with a buffer design that introduces many bytes worth of latency.
At those line speeds you should not be seeing latencies that large, regardless of how you check for readiness.
You need to make sure the serial port is in raw mode (so you do "noncanonical reads") and that VMIN and VTIME are set correctly. You want to make sure that VTIME is zero so that an inter-character timer never kicks in. I would probably start with setting VMIN to 1 and tune from there.
The syscall overhead is nothing compared to the time on the wire, so select() vs. poll(), etc. is unlikely to make a difference.
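A minimal sketch of the raw-mode settings described above, assuming fd is an already-open serial port descriptor (cfmakeraw() is the glibc/BSD helper; the same flags can also be cleared by hand):

#include <termios.h>

/* Put fd into raw (noncanonical) mode with VMIN = 1, VTIME = 0:
 * read() returns as soon as one byte is available and the
 * inter-character timer never runs. */
static int set_raw_min1(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) < 0)
        return -1;

    cfmakeraw(&tio);                     /* no canonical processing, echo or signals */
    tio.c_cc[VMIN]  = 1;                 /* block until at least one byte arrives */
    tio.c_cc[VTIME] = 0;                 /* never start the inter-character timer */

    return tcsetattr(fd, TCSANOW, &tio);
}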
I have a C application which transmits a UDP stream. It works well on most servers, but it goes crazy on a few servers.
I have a 100 Mbps network connection, say eth1, on the server. Over this connection I usually transmit (TX) around 10-30 Mbps of UDP streams, and the same connection sees around 100-300 Kbps of RX traffic to the server. I have another network connection, say eth0, on the server from which the C application receives UDP streams and forwards them to the 100 Mbps connection, eth1.
My application uses the blocking sendto() function to transmit UDP packets on eth1. Packets are of variable length, from 17 bytes to a maximum of 1333 bytes, but most of the time more than 1000 bytes.
The problem is: sometimes the sendto function blocks on eth1 for a huge time, around 1 second. This happens once every 30 seconds to 3 minutes. While sendto is blocked, a lot of UDP packets from eth0 get buffered by the kernel in the UDP receive buffer, from where the C application receives them. Once sendto returns from the long blocking call on eth1, the application has a lot of buffered packets from eth0 to transmit, and it sends them all with the next sendto calls. This creates a spike in rate at the other endpoint which receives the UDP stream from eth1, producing a Z-like rate graph there. So this Z-like spike in rate is my problem.
I have tried increasing wmem_default from around 131 KB to 5 MB in the kernel settings to overcome the spike, and that resolves the spike issue: I no longer get the Z-like spike in rate at the other endpoint. But I have a new issue: I get a lot of packet losses in place of the spike. I think it may be because the send buffer of eth1 accumulates a lot of packets while sending the current packet from eth1 takes a long time (which may be why sendto blocks so long). Then, when the NIC sends all the accumulated packets from the send buffer in a short time, this may cause network congestion and I may get a lot of packet losses instead of the spike.
So this is the second problem. But I think the root cause is: why does the NIC sometimes pause for a long time while sending traffic, once every 30 seconds to 3 minutes?
Maybe I need to look into the TX ring buffer of the eth1 driver? When the socket send buffer gets full because the NIC is not transmitting everything in time (due to the random long TX pauses), the next call to sendto blocks waiting for room in the socket send buffer; does it also block waiting for room in the driver's TX ring buffer?
Please don't tell me that UDP is unreliable and we can't control packet losses. I know it's unreliable and UDP packets can be lost, but I am sure we can still do something to minimize them.
EDIT
I have tried increasing wmem_default from around 131 KB to 5 MB in the kernel settings to overcome the spike, and I have also removed the blocking sendto call. Now I use sendto(sockfd, buf, len, MSG_DONTWAIT, dest_addr, addrlen); with a large send buffer via wmem_default. With the large send buffer I am not getting any EAGAIN or EWOULDBLOCK errors from sendto, but packets are still being lost in place of the spike.
EDIT
With the non-blocking sendto call and the huge wmem_default, and with no EAGAIN or EWOULDBLOCK errors from sendto, the spikes are gone because packets no longer accumulate in the receive buffer of eth0. I think that is a possible solution on the application side. But the main problem remains: why does the NIC stall every few moments? What could the possible reasons be? When it resumes from a long TX pause, it may have a lot of packets accumulated in the send buffer, which are then sent as a burst the next moment, congesting the network and causing a lot of packet losses.
More update
I use this same C application to transmit locally on the machine (127.0.0.1), and I never get any spike or packet loss problems locally.
The problem is: sometime sendto function blocks on eth1 for huge time around 1 second.
Blocking sendto may block, surprisingly.
The problem is: sometime sendto function blocks on eth1 for huge time around 1 second.
It could be that the IP stack is performing path MTU discovery:
While MTU discovery is in progress, initial packets from datagram sockets may be dropped. Applications using UDP should be aware of this and not take it into account for their packet retransmit strategy.
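If path MTU discovery turns out to be the culprit, one option (an assumption on my part, not something confirmed in the question) is to disable it per socket so datagrams are fragmented instead of being held or dropped during discovery; see IP_MTU_DISCOVER in ip(7):

#include <netinet/in.h>
#include <sys/socket.h>

/* Turn off path MTU discovery for this socket: don't set the DF bit,
 * let the kernel fragment large datagrams instead. */
static int disable_pmtu_discovery(int sockfd)
{
    int val = IP_PMTUDISC_DONT;
    return setsockopt(sockfd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val));
}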
I have tried to increase wmem_default from around 131 KB to 5 MB in kernel setting to overcome spike.
Be careful with increasing buffer sizes. Beyond a certain limit, increasing buffer sizes only increases the amount of queuing and hence delay, leading to the infamous bufferbloat.
You may also play around with NIC queueing disciplines, which are responsible for dropping outgoing packets.
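On the application side, a hedged sketch of the non-blocking approach described in the question's edits: sendto() with MSG_DONTWAIT plus a bounded poll() for writability instead of an unbounded block; sockfd, dest_addr and addrlen are assumed to be set up elsewhere:

#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Try a non-blocking send; if the socket send buffer is momentarily
 * full, wait (bounded) for writability instead of blocking for an
 * unbounded time inside sendto(). */
static ssize_t send_udp(int sockfd, const void *buf, size_t len,
                        const struct sockaddr *dest_addr, socklen_t addrlen)
{
    for (;;) {
        ssize_t n = sendto(sockfd, buf, len, MSG_DONTWAIT, dest_addr, addrlen);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;                    /* sent, or a real error */

        /* send buffer full: wait up to 100 ms for room */
        struct pollfd pfd = { .fd = sockfd, .events = POLLOUT };
        if (poll(&pfd, 1, 100) <= 0)
            return -1;                   /* timeout/error: drop or retry later */
    }
}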
I'm trying to write a driver for the RTL8139 for Linux 2.6 from scratch. I've already written the TX path, but I have some problems with RX.
I put the receiver into promiscuous mode and I'm getting RX IRQs. I set RBSTART to the physical address of memory allocated with kmalloc.
I don't know how to find out how many packets have been received and how long they are.
I thought the ERBCR, CAPR, and CBR registers would tell me, but they are all 0.
Maybe I'm doing something wrong? How can I find out anything about the received packets?
I'll answer my own question.
The received packets are located starting at RBSTART. The first two bytes of a received packet are status bytes, and the next two are the length of the frame plus 4 bytes of CRC.
Maybe someone will find this info helpful.
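A rough sketch of walking that layout, based on the description above; rx_buf and rx_offset are placeholders for the driver's own bookkeeping, and the ROK bit value should be checked against the RTL8139 datasheet:

#include <linux/types.h>

#define RX_STATUS_ROK 0x0001             /* per-packet "receive OK" status bit (verify in datasheet) */

/* Parse one packet header at rx_offset inside the RX buffer. */
static void handle_rx(u8 *rx_buf, unsigned int rx_offset)
{
    u16 status = rx_buf[rx_offset] | (rx_buf[rx_offset + 1] << 8);
    u16 len    = rx_buf[rx_offset + 2] | (rx_buf[rx_offset + 3] << 8);

    if (status & RX_STATUS_ROK) {
        u8 *frame = rx_buf + rx_offset + 4;      /* payload starts after the 4-byte header */
        unsigned int frame_len = len - 4;        /* length field includes the 4-byte CRC */

        /* ... allocate an skb, copy frame/frame_len into it, hand it to the stack ... */
        (void)frame;
        (void)frame_len;
    }
}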
On receiving a packet, the data from the line is stored in the receive FIFO. When the Early Receive Threshold is met, the data is moved from the FIFO to the receive buffer.
So, once you get an interrupt, you need to check the Interrupt Status Register (ISR) for ROK, then check the Early RX Status Register (ERSR), which gives you the status of the received packet. If EROK is set, then check the receive buffer status for ROK. Check for any errors in the ISR and ERSR. Also check your RX Configuration Register for the RX FIFO threshold and RX buffer length configuration.
I would like to have UDP packets copied directly from the ethernet adapter into my userspace buffer.
Some details on my setup:
I am receiving data from a pair of gigabit ethernet cameras. Combined I am receiving 28800 UDP packets per second (1 packet per line * 30FPS * 2 cameras * 480 lines). There is no way for me to switch to jumbo frames, and I am already looking into tuning driver level interrupts for reduced CPU utilization. What I am after here is reducing the number of times I am copying this ~40MB/s data stream.
This is the best source I have found on this, but I was hoping there was a more complete reference or proof that such an approach worked out in practice.
This article may be useful:
http://yusufonlinux.blogspot.com/2010/11/data-link-access-and-zero-copy.html
Your best avenues are recvmmsg and increasing RX interrupt coalescing.
http://lwn.net/Articles/334532/
You can move lower and do what Wireshark/tcpdump do, but it becomes futile to attempt any serious processing on top of that, since you would have to decode everything yourself.
At only 30,000 packets per second I wouldn't worry too much about copying packets; those problems arise when dealing with 3,000,000 messages per second.
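For completeness, a hedged sketch of the recvmmsg() batching suggested above: one syscall drains up to a batch of datagrams into preallocated buffers, cutting per-packet syscall overhead. BATCH and the buffer size are arbitrary choices:

#define _GNU_SOURCE                      /* for recvmmsg() and struct mmsghdr */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   64
#define PKT_MAX 2048

/* Drain up to BATCH datagrams in one syscall; returns the number
 * received (msgs[i].msg_len holds each datagram's length) or -1 on error. */
static int drain_socket(int sockfd)
{
    static char bufs[BATCH][PKT_MAX];
    struct iovec iovs[BATCH];
    struct mmsghdr msgs[BATCH];
    int i;

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = PKT_MAX;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    return recvmmsg(sockfd, msgs, BATCH, MSG_DONTWAIT, NULL);
}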