BitTorrent client transmission is low efficiency - bittorrent

I settle a BitTorrent test enviroment with: opentracker + BitTornado seed + Transmission-daemon (11 servers) and with a 40G datafile. The OS is CentOS6, Transmission 2.13 (11501).
the BitTornado cost 40% CPU, but transmission daemon cost 100% CPU, and it only run on one CPU, the bandwidth cost only 200Mbps.
As a contract, SCP command can consumer 7 times the bandwidth.
Is there any problem? Why the transmission is so low efficiency, even poor than Python(BitTornado).
update info
other client is hard to test, the reason I choose transmission-daemon is that I can control from one client through transmission-remote.
And I find transmission-remote does not work for some command, for example:
transmission-remote $host -t attr.dat --remove-and-delete

Related

unreasonable netperf benchmark results

I used netperf benchmark with the next commands:
server side:
netserver -4 -v -d -N -p
client side:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
And I received the results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 9147.83
16384 131072
But when I changed the client to single CPU (same machine) by adding "maxcpus=1 nr_cpus=1" to kernel command line.
And I ran the next command:
netperf -H -p -l 60 -t TCP_RR
I received the next results:
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.28 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 131072 1 1 60.00 10183.33
16384 131072
Q: I don't understand how the performance has been improved when I decreased the CPUs number from 64 to 1 CPU?
Some technique information: I used Standard_L64s_v3 instance type of Azure; OS: sles:15:sp2
• The ‘netperf’ utility command executed by you on the client side is as follows and is the same after changing the number of CPUs on the client side but you can see an improvement in performance after decreasing the number of vCPUs on the client VM: -
netperf -H -p -l 60 -I 1,1 -t TCP_RR
The above command implies that you want to test the network connectivity performance between the host ‘Server’ and ‘Client’ for TCP Request/Response and get the results in a default directory path where pipes will be created for a period of 60 seconds.
• The CPU utilization measurement mechanism uses ‘proc/stat’ on Linux OS to record the time spent for such command executions. The code for this mechanism can be found in ‘src/netcpu_procstat.c’. Thus, you can check the configuration file accordingly.
Also, the CPU utilization mechanism in a virtual guest environment, i.e., a virtual machine may not reflect the actual utilization as in a bare metal environment because much of the networking processing happens outside the context of the virtual machine. Thus, as per the below documentation link by Hewlett-Packard: -
https://hewlettpackard.github.io/netperf/doc/netperf.html
If one is looking to measure the added overhead of a virtualization mechanism, rather than rely on CPU utilization, one can rely instead on netperf _RR tests - path-lengths and overheads can be a significant fraction of the latency, so increases in overhead should appear as decreases in transaction rate. Whatever you do, DO NOT rely on the throughput of a _STREAM test. Achieving link-rate can be done via a multitude of options that mask overhead rather than eliminate it.
As a result, I would suggest you rely on other monitoring tools available in Azure, i.e., Azure Monitor, Application insights, etc.
Looking more closely at your netperf command line:
netperf -H -p -l 60 -T 1,1 -t TCP_RR
The -H option expects to take a hostname as an argument. And the -p option expects to take a port number as an argument. As written the "-p" will be interpreted as a hostname. And when I tried it at least will fail. I assume you've omitted some of the command line?
The -T option will bind where netperf and netserver will run (in this case on vCPU 1 on the netperf side and vCPU 1 on the netserver side) but it will not necessarily control where at least some of the network stack processing will take place. So, in your 64-vCPU setup, the interrupts for the networking traffic and perhaps the stack will run on a different vCPU. In your 1-vCPU setup, everything will be on the one vCPU. It is quite conceivable you are seeing the effects of cache-to-cache transfers in the 64-vCPU case leading to lower transaction/s rates.
Going to multi-processor will increase aggregate performance, but it will not necessarily increase single thread/stream performance. And single thread/stream performance can indeed degrade.

How to modify timing in readyRead of QextSeriaport [duplicate]

I'm implementing a protocol over serial ports on Linux. The protocol is based on a request answer scheme so the throughput is limited by the time it takes to send a packet to a device and get an answer. The devices are mostly arm based and run Linux >= 3.0. I'm having troubles reducing the round trip time below 10ms (115200 baud, 8 data bit, no parity, 7 byte per message).
What IO interfaces will give me the lowest latency: select, poll, epoll or polling by hand with ioctl? Does blocking or non blocking IO impact latency?
I tried setting the low_latency flag with setserial. But it seemed like it had no effect.
Are there any other things I can try to reduce latency? Since I control all devices it would even be possible to patch the kernel, but its preferred not to.
---- Edit ----
The serial controller uses is an 16550A.
Request / answer schemes tends to be inefficient, and it shows up quickly on serial port. If you are interested in throughtput, look at windowed protocol, like kermit file sending protocol.
Now if you want to stick with your protocol and reduce latency, select, poll, read will all give you roughly the same latency, because as Andy Ross indicated, the real latency is in the hardware FIFO handling.
If you are lucky, you can tweak the driver behaviour without patching, but you still need to look at the driver code. However, having the ARM handle a 10 kHz interrupt rate will certainly not be good for the overall system performance...
Another options is to pad your packet so that you hit the FIFO threshold every time. It will also confirm that if it is or not a FIFO threshold problem.
10 msec # 115200 is enough to transmit 100 bytes (assuming 8N1), so what you are seeing is probably because the low_latency flag is not set. Try
setserial /dev/<tty_name> low_latency
It will set the low_latency flag, which is used by the kernel when moving data up in the tty layer:
void tty_flip_buffer_push(struct tty_struct *tty)
{
unsigned long flags;
spin_lock_irqsave(&tty->buf.lock, flags);
if (tty->buf.tail != NULL)
tty->buf.tail->commit = tty->buf.tail->used;
spin_unlock_irqrestore(&tty->buf.lock, flags);
if (tty->low_latency)
flush_to_ldisc(&tty->buf.work);
else
schedule_work(&tty->buf.work);
}
The schedule_work call might be responsible for the 10 msec latency you observe.
Having talked to to some more engineers about the topic I came to the conclusion that this problem is not solvable in user space. Since we need to cross the bridge into kernel land, we plan to implement an kernel module which talks our protocol and gives us latencies < 1ms.
--- edit ---
Turns out I was completely wrong. All that was necessary was to increase the kernel tick rate. The default 100 ticks added the 10ms delay. 1000Hz and a negative nice value for the serial process gives me the time behavior I wanted to reach.
Serial ports on linux are "wrapped" into unix-style terminal constructs, which hits you with 1 tick lag, i.e. 10ms. Try if stty -F /dev/ttySx raw low_latency helps, no guarantees though.
On a PC, you can go hardcore and talk to standard serial ports directly, issue setserial /dev/ttySx uart none to unbind linux driver from serial port hw and control the port via inb/outb to port registers. I've tried that, it works great.
The downside is you don't get interrupts when data arrives and you have to poll the register. often.
You should be able to do same on the arm device side, may be much harder on exotic serial port hw.
Here's what setserial does to set low latency on a file descriptor of a port:
ioctl(fd, TIOCGSERIAL, &serial);
serial.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serial);
In short: Use a USB adapter and ASYNC_LOW_LATENCY.
I've used a FT232RL based USB adapter on Modbus at 115.2 kbs.
I get about 5 transactions (to 4 devices) in about 20 mS total with ASYNC_LOW_LATENCY. This includes two transactions to a slow-poke device (4 mS response time).
Without ASYNC_LOW_LATENCY the total time is about 60 mS.
With FTDI USB adapters ASYNC_LOW_LATENCY sets the inter-character timer on the chip itself to 1 mS (instead of the default 16 mS).
I'm currently using a home-brewed USB adapter and I can set the latency for the adapter itself to whatever value I want. Setting it at 200 µS shaves another mS off that 20 mS.
None of those system calls have an effect on latency. If you want to read and write one byte as fast as possible from userspace, you really aren't going to do better than a simple read()/write() pair. Try replacing the serial stream with a socket from another userspace process and see if the latencies improve. If they don't, then your problems are CPU speed and hardware limitations.
Are you sure your hardware can do this at all? It's not uncommon to find UARTs with a buffer design that introduces many bytes worth of latency.
At those line speeds you should not be seeing latencies that large, regardless of how you check for readiness.
You need to make sure the serial port is in raw mode (so you do "noncanonical reads") and that VMIN and VTIME are set correctly. You want to make sure that VTIME is zero so that an inter-character timer never kicks in. I would probably start with setting VMIN to 1 and tune from there.
The syscall overhead is nothing compared to the time on the wire, so select() vs. poll(), etc. is unlikely to make a difference.

Why TCP/IP speed depends on the size of sending data?

When I sent small data (16 bytes and 128 bytes) continuously (use a 100-time loop without any inserted delay), the throughput of TCP_NODELAY setting seems not as good as normal setting. Additionally, TCP-slow-start appeared to affect the transmission in the beginning.
The reason is that I want to control a device from PC via Ethernet. The processing time of this device is around several microseconds, but the huge latency of sending command affected the entire system. Could you share me some ways to solve this problem? Thanks in advance.
Last time, I measured the transfer performance between a Windows-PC and a Linux embedded board. To verify the TCP_NODELAY, I setup a system with two Linux PCs connecting directly with each other, i.e. Linux PC <--> Router <--> Linux PC. The router was only used for two PCs.
The performance without TCP_NODELAY is shown as follows. It is easy to see that the throughput increased significantly when data size >= 64 KB. Additionally, when data size = 16 B, sometimes the received time dropped until 4.2 us. Do you have any idea of this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share with me your thinking. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket library are winsock2.h and sys/socket.h for Windows and Linux, respectively. The binary file is copied to an array in Windows before sending, and the received data are stored in an array in Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all data properly. However, it seems to me that the transfer time depends on the size of sending data. When data size is small, the received throughput is quite low, as shown below.
Could you please shown me some documents explaining this problem? Thank you in advance.
To establish a tcp connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender will start to send some data. How much depends on the initial congestion window (configurable on linux, don't know on windows). As long as the sender receives timely ACKs, it will continue to send, as long as the receivers advertised window has the space (use socket option SO_RCVBUF to set). Finally, to close the connection also requires a FIN, FIN-ACK, ACK.
So my best guess without more information is that the overhead of setting up and tearing down the TCP connection has a huge affect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much affect as long as the writer is effectively writing quickly. It only prevents sending less than full MSS segements, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less than full MSS segment might need to wait for an ACK, which again would have more impact on the short transfers as compared to the longer transfers.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2KB/second. I think you can guess that if I sent 2 bytes, it would take essentially the same amount of time for an effective rate of 8.2KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.
When you make your test with a significative amount of data, for example your bigger test (512Mib, 536 millions bytes), the following happens.
The data is sent by TCP layer, breaking them in segments of a certain length. Let assume segments of 1460 bytes, so there will be about 367,000 segments.
For every segment transmitted there is a overhead (control and management added data to ensure good transmission): in your setup, there are 20 bytes for TCP, 20 for IP, and 16 for ethernet, for a total of 56 bytes every segment. Please note that this number is the minimum, not accounting the ethernet preamble for example; moreover sometimes IP and TCP overhead can be bigger because optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512Mib, you also transmit 56*367,000 = 20M bytes on the line. The total number of bytes becomes 536+20 = 556 millions of bytes, or 4.448 millions of bits. If you divide this number of bits by the time elapsed, 4.6 seconds, you get a bitrate of 966 megabits per second, which is higher than what you calculated not taking in account the overhead.
From the above calculus, it seems that your ethernet is a gigabit. It's maximum transfer rate should be 1,000 megabits per second and you are getting really near to it. The rest of the time is due to more overhead we didn't account for, and some latencies that are always present and tend to be cancelled as more data is transferred (but they will never be defeated completely).
I would say that your setup is ok. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, latencies of the protocol and other nice things get more and more important. For example, if you transmit 16 bytes in 165 microseconds (first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bitrate would be about 31 Mbps (40 times bigger). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16+56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also bigger. But, you see, transmitting 16 bytes on a TCP/IP link is the same as measuring the flow rate of an empty acqueduct by dropping some tears of water in it: the tears get lost before they reach the other end. This is because TCP/IP and ethernet are designed to carry much more data, with reliability.
Comments and answers in this page point out many of those mechanisms that trade bitrate and reactivity for reliability: the 3-way TCP handshake, the Nagle algorithm, checksums and other overhead, and so on.
Given the design of TCP+IP and ethernet, it is very normal that, for little data, performances are not optimal. From your tests you see that the transfer rate climbs steeply when the data size reaches 64Kbytes. This is not a coincidence.
From a comment you leaved above, it seems that you are looking for a low-latency communication, instead than one with big bandwidth. It is a common mistake to confuse different kind of performances. Moreover, in respect to this, I must say that TCP/IP and ethernet are completely non-deterministic. They are quick, of course, but nobody can say how much because there are too many layers in between. Even in your simple setup, if a single packet get lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want something with low latency, you should use something else, for example a CAN. Its design is exactly what you want: it transmits little data with high speed, low latency, deterministic time (just microseconds after you transmitted a packet, you know if it has been received or not. To be more precise: exactly at the end of the transmission of a packet you know if it reached the destination or not).
TCP sockets typically have a buffer size internally. In many implementations, it will wait a little bit of time before sending a packet to see if it can fill up the remaining space in the buffer before sending. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but due to the fact that the TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or function called something like TcpNoDelay which can be false (default) or true. I would try messing with that and seeing if that affects your throughput. Essentially these flags will enable/disable Nagle's algorithm.

How could uptime command show a cpu time greater than 100%?

Running an application with an infinite loop (no sleep, no system calls inside the loop) on a linux system with kernel 2.6.11 and single core processor results in a 98-99% cputime consuming. That's normal.
I have another single thread server application normally with an average sleep of 98% and a maximum of 20% of cputime. Once connected with a net client, the sleep average drops to 88% (nothing strange) but the cputime (1 minute average) rises constantly but not immediately over the 100%... I saw also a 160% !!?? The net traffic is quite slow (one small packet every 0.5 seconds). Usually the 15 min average of uptime command shows about 115%.
I ran also gprof but I did not find it useful... I have a similar scenario:
IDLE SCENARIO
%time name
59.09 Run()
25.00 updateInfos()
5.68 updateMcuInfo()
2.27 getLastEvent()
....
CONNECTED SCENARIO
%time name
38.42 updateInfo()
34.49 Run()
10.57 updateCUinfo()
4.36 updateMcuInfo()
3.90 ...
1.77 ...
....
None of the listed functions is directly involved with client communication
How can you explain this behaviour? Is it a known bug? Could be the extra time consumed in kernel space after a system call and lead to a calculus like
%=100*(apptime+kerneltime)/(total apps time)?

Traffic shaping with tc is inaccurate with high bandwidth and delay

I'm using tc with kernel 2.6.38.8 for traffic shaping. Limit bandwidth works, adding delay works, but when shaping both bandwidth with delay, the achieved bandwidth is always much lower than the limit if the limit is >1.5 Mbps or so.
Example:
tc qdisc del dev usb0 root
tc qdisc add dev usb0 root handle 1: tbf rate 2Mbit burst 100kb latency 300ms
tc qdisc add dev usb0 parent 1:1 handle 10: netem limit 2000 delay 200ms
Yields a delay (from ping) of 201 ms, but a capacity of just 1.66 Mbps (from iperf). If I eliminate the delay, the bandwidth is precisely 2 Mbps. If I specify a bandwidth of 1 Mbps and 200 ms RTT, everything works. I've also tried ipfw + dummynet, which yields similar results.
I've tried using rebuilding the kernel with HZ=1000 in Kconfig -- that didn't fix the problem. Other ideas?
It's actually not a problem, it behaves just as it should. Because you've added a 200ms latency, the full 2Mbps pipe isn't used at it's full potential. I would suggest you study the TCP/IP protocol in more detail, but here is a short summary of what is happening with iperf: your default window size is maybe 3 packets (likely 1500 bytes each). You fill your pipe with 3 packets, but now have to wait until you get an acknowledgement back (this is part of the congestion control mechanism). Since you delay the sending for 200ms, this will take a while. Now your window size will double in size and you can next send 6 packets, but will again have to wait 200ms. Then the window size doubles again, but by the time your window is completely open, the default 10 second iperf test is close to over and your average bandwidth will obviously be smaller.
Think of it like this:
Suppose you set your latency to 1 hour, and your speed to 2 Mbit/s.
2 Mbit/s requires (for example) 50 Kbit/s for TCP ACKs. Because the ACKs take over a hour to reach the source, then the source can't continue sending at 2 Mbit/s because the TCP window is still stuck waiting on the first acknowledgement.
Latency and bandwidth are more related than you think (in TCP at least. UDP is a different story)

Resources