How to make a server respond faster on First Byte? - linux

My website has seen ever-decreasing traffic, so I've been working to increase speed and usability. On WebPageTest.org I've worked most of my grades up, but First Byte is still horrible.
F First Byte Time
A Keep-alive Enabled
A Compress Transfer
A Compress Images
A Progressive JPEGs
B Cache static
First Byte Time (back-end processing): 0/100
1081 ms First Byte Time
90 ms Target First Byte Time
I use the Rackspace Cloud Server system ("Next Generation Server"):
CentOS 6.4, 2 GB of RAM, 80 GB hard drive
Linux 2.6.32-358.18.1.el6.x86_64
Apache/2.2.15 (CentOS)
MySQL 5.1.69
PHP 5.3.3 / Zend 2.3.0
Website system: TomatoCart shopping cart.
Any help would be much appreciated.
Traceroute #1 to 198.61.171.121
Hop  Time (ms)  IP Address       FQDN
1    0.855      199.193.244.67   -
2    0.405      184.105.250.41   gige-g2-14.core1.mci3.he.net
3    15.321     184.105.222.117  10gigabitethernet1-4.core1.chi1.he.net
4    12.737     206.223.119.14   bbr1.ord1.rackspace.NET
5    14.198     184.106.126.144  corea.ord1.rackspace.net
6    14.597     50.56.6.129      corea-core5.ord1.rackspace.net
7    13.915     50.56.6.111      core5-aggr1501a-1.ord1.rackspace.net
8    16.538     198.61.171.121   mail.aboveallhousplans.com
Following JXH's advice, I did a packet capture and analyzed it using Wireshark.
During a hit-and-leave visit to the site I got six "Bad TCP" lines (around packets 28-33) warning of TCP Retransmission and TCP Dup ACK: two of each of these warnings, three times.
Expanding a Retransmission's TCP analysis flags shows "retransmission suspected", severity level Note, and an RTO of 1.19 seconds.
Expanding a TCP Dup ACK's TCP analysis flags shows "duplicate ACK", severity level Note, and an RTT of 0.09 seconds.
This is all gibberish to me...
I don't know whether it's wise to do or not, but I've uploaded my packet capture dump file, in case anyone cares to take a look at the flags and let me know what they think.
I wonder if the retransmission warnings mean that the HTTP configuration is sending duplicate information? I have a few things in my Apache config twice, which seems a little redundant; for example, the Vary: User-Agent header is appended twice:
# Set header information for proxies
Header append Vary User-Agent
# Set header information for proxies
Header append Vary User-Agent
The retransmission and dup ACK problems were fixed on the server a few days ago, but the lag in the initial server response remains.
http://www.aboveallhouseplans.com/images/firstbyte001.jpg
http://www.aboveallhouseplans.com/images/firstbyte002.jpg
A First Byte Time of about 600 ms remains...
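For reference, First Byte Time can also be measured directly, outside WebPageTest. Below is a minimal C sketch (not from the original post; the hostname and path are placeholders) that times the gap between sending a GET request and receiving the first byte of the response, which is roughly the back-end processing component of the grade above:

/* ttfb.c - rough time-to-first-byte measurement (illustrative sketch only).
 * The hostname and request path below are placeholders.
 * Build with: cc ttfb.c -o ttfb   (add -lrt on older glibc) */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    const char *host = "www.example.com";   /* placeholder hostname */
    const char *request =
        "GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: close\r\n\r\n";

    struct addrinfo hints = {0}, *res;
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    struct timespec start, first;
    clock_gettime(CLOCK_MONOTONIC, &start);        /* request goes out here  */
    send(fd, request, strlen(request), 0);

    char byte;
    if (recv(fd, &byte, 1, 0) == 1) {              /* block until first byte */
        clock_gettime(CLOCK_MONOTONIC, &first);
        double ms = (first.tv_sec - start.tv_sec) * 1000.0 +
                    (first.tv_nsec - start.tv_nsec) / 1e6;
        printf("first byte after %.1f ms\n", ms);
    }

    close(fd);
    freeaddrinfo(res);
    return 0;
}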

Related

Need to measure pairwise (IP-based) bandwidth used over time when using TC

I need to measure the data rates of packets between multiple servers. I need pairwise bandwidths between the servers (if possible even per port), not the overall data rate per interface on each server.
Example output:

Timestamp | Server A to B | Server B to A | Server A to C | Server C to A
0         | 1             | 2             | 1             | 5
1         | 5             | 3             | 7             | 1
What I tried or thought of
tcpdump - I was capturing all the packets and looking at ip.len to compute the data rates. It worked quite well until I started testing along with TC.
It turns out tcpdump captures packets at a lower layer than TC, so the bandwidths I measure this way cannot see the limits set by TC.
netstat - I tried grepping the output and looking at the Recv-Q and Send-Q columns, but I later found out that these report bytes that have been received and are buffered, waiting for the local process using the connection to read and consume them. I can't use them to get the bandwidth being used.
iftop - Amazing UI with everything I need, but no good way to get its output in a form that can be processed. Logging it might also overwhelm storage because of the amount of extra text it emits alongside the numbers.
bwm-ng - Gives the overall data rate per interface on each server, but not pairwise.
Please let me know if there are any other ways to achieve what I need.
Thanks in advance for your help.

ICMP timestamps added to ping echo requests in Linux: how are they represented as bytes? Or, how do I convert UNIX epoch time to that timestamp format?

I encountered this issue while trying to write a ping program in C myself and used Wireshark to dig further into the problem. On Linux, ping, which sends echo requests to a destination IP, also adds an 8-byte timestamp field (a TOD timestamp) after the ICMP header. Ping on Windows doesn't add that timestamp; I think it does the time calculations locally instead. Now, my question is: how do you convert the time from Unix epoch format (the number of seconds since 1970, which you get with the time() function in C) to that 8-byte TOD format? I got to this question because, after quite a bit of research, my ping.c program finally sends the ICMP echo request to the destination; in a test with two hosts I noticed that the request arrives, but I get no echo reply back, while the native Linux ping works properly. I can only imagine two possible causes:
1. I didn't fill in the fields of the ICMP and IP headers correctly. To be honest, I rather doubt this possibility, because Wireshark shows the message arriving at the destination and being recognized as an echo request, yet it doesn't trigger any echo reply. If this is the cause, though, the only thing I can think of is that timestamp, which I don't know how to convert into TOD form so that it occupies at most 8 bytes.
2. There might be a firewall at the destination, or some other system-dependent factor.
https://www.ibm.com/docs/en/zos/2.2.0?topic=problems-using-ping-command
The Ping command does not use the ICMP/ICMPv6 header sequence number field (icmp_seq or icmp6_seq) to correlate requests with ICMP/ICMPv6 Echo Replies. Instead, it uses the ICMP/ICMPv6 header identifier field (icmp_id or icmp6_id) plus an 8-byte TOD time stamp field to correlate requests with replies. The TOD time stamp is the first 8-bytes of data after the ICMP/ICMPv6 header.
Finally, to repeat the initial question:
How do you convert UNIX epoch time to the TOD timestamp form which Linux ping adds at the end of the ICMP header / beginning of the data field?
A useful, but I think insufficient, explanation is here:
https://towardsdatascience.com/3-tips-to-handle-timestamps-in-c-ad5b36892294
I should probably mention I'm working on Ubuntu 20.04 (Focal Fossa).
I found a related post here. The book "Principles of Operation" is mentioned in the comments. I skimmed through it, but it seems to be at a generally lower level than C, so if anyone knows another place or way to answer the question, that would be better.
Background
Including the UNIX timestamp of the time of transmission in the first data bytes of the ICMP Echo message is a trick/optimization the original ping by Mike Muuss used to avoid keeping track of it locally. It exploits the following guarantee made by RFC 792's Echo or Echo Reply Message description:
The data received in the echo message must be returned in the echo reply message.
Many (if not all) BSD ping implementations are based on Mike Muuss' original implementation and therefore kept this behavior. On Linux systems, ping is typically provided by iputils, GNU inetutils, or Busybox. All exhibit the same behavior. fping is a notable exception, which stores a mapping from target host and sequence number to timestamp locally.
Implementations typically store the timestamp in the sender's native precision and byte order, as opposed to a well-defined precision in the conventional network byte order (big endian) normally used for data transmitted over the network, because the timestamp is intended to be interpreted only by the original sender; everyone else should just treat it as an opaque stream of bytes.
Because this is so common, however, the Wireshark ICMP dissector (as of v3.6.2) tries to be clever and heuristically decodes it nonetheless, if the first 8 data bytes look like a struct timeval timestamp with 32-bit precision for seconds and microseconds in either byte order. Note that if the sender was actually using big-endian 64-bit precision, this will fail, and if it was using little-endian 64-bit precision, it will truncate the microseconds before the Epochalypse and fail after it.
Obtaining and serializing epoch time
To answer your actual question:
How do you convert the UNIX epoch time to the TOD timestamp form which Linux ping adds at the end of the ICMP header / beginning of the data field?
Most implementations use the obsolescent gettimeofday(2) instead of the newer clock_gettime(2). The following snippet is taken from iputils:
struct icmphdr *icp /* = pointer to ICMP packet */;
struct timeval tmp_tv;
gettimeofday(&tmp_tv, NULL);
memcpy(icp + 1, &tmp_tv, sizeof(tmp_tv));
memcpy from a temporary variable, instead of passing icp + 1 directly to gettimeofday as the target, is used to avoid potential improper-alignment, effective-type, and strict-aliasing violations.
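For completeness, here is a minimal sketch (mine, not taken from iputils) of the other direction: when the echo reply comes back, the same 8 bytes are copied out again and subtracted from the current time to get the round-trip time. As above, icp is assumed to point at the ICMP header of the received reply.

#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <linux/icmp.h>

/* Sketch: recover the echoed timestamp from a received ICMP echo reply and
 * compute the round-trip time. Assumes the sender stored a struct timeval
 * right after the ICMP header, in its own native precision and byte order,
 * exactly as in the iputils snippet above. */
static void print_rtt(const struct icmphdr *icp)
{
    struct timeval sent, now, rtt;

    memcpy(&sent, icp + 1, sizeof(sent));   /* copy the echoed bytes back out */
    gettimeofday(&now, NULL);
    timersub(&now, &sent, &rtt);            /* rtt = now - sent               */
    printf("rtt = %ld.%06ld s\n", (long)rtt.tv_sec, (long)rtt.tv_usec);
}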
I appreciate the clear answers. I actually managed to solve the problem and make the ping program work, and I have to say the problem was certainly not the timestamp. Yes, I was indeed talking about the "Echo or Echo Reply Message"; one way of implementing ping is to use ICMP Echo and Echo Reply messages. The fact is, when I posted this question I was stuck, probably because I wasn't clearly separating the main aspects of the problem. So I started examining the packet sent by the native ping on my Ubuntu 20.04 (Focal Fossa) with Wireshark, trying to get a better grasp of how to fill in the headers of the packet sent by my program (the IP and ICMP headers). The question simply arose from noticing that on this version of Ubuntu, ping adds a timestamp (indeed, in 32-bit precision) after the end of the ICMP header, i.e. in the data/payload section. I also used Wireshark on Windows 10 and saw that no timestamp is added after the header there. So yes, it may come down to different versions of the program being used.
The main point I want to emphasize is that my final version of ping has nothing to do with any timestamps. So yes, they are not a crucial requirement for ping to work.

Why does TCP/IP speed depend on the size of the data being sent?

When I send small chunks of data (16 bytes and 128 bytes) continuously (in a 100-iteration loop without any inserted delay), the throughput with TCP_NODELAY set seems no better than with the normal setting. Additionally, TCP slow start appeared to affect the transmission at the beginning.
The background is that I want to control a device from a PC via Ethernet. The device's processing time is only a few microseconds, but the large latency of sending a command affects the entire system. Could you share some ways to solve this problem? Thanks in advance.
Previously, I measured the transfer performance between a Windows PC and a Linux embedded board. To verify TCP_NODELAY, I set up a system with two Linux PCs connected directly to each other, i.e. Linux PC <--> Router <--> Linux PC. The router was used only for these two PCs.
The performance without TCP_NODELAY is shown below. It is easy to see that the throughput increased significantly when the data size was >= 64 KB. Additionally, when the data size was 16 B, the receive time sometimes dropped to 4.2 us. Do you have any explanation for this observation?
The performance with TCP_NODELAY seems unchanged, as shown below.
The full code can be found in https://www.dropbox.com/s/bupcd9yws5m5hfs/tcpip_code.zip?dl=0
Please share your thoughts. Thanks in advance.
I am doing socket programming to transfer a binary file between a Windows 10 PC and a Linux embedded board. The socket libraries are winsock2.h and sys/socket.h on Windows and Linux, respectively. The binary file is copied to an array on Windows before sending, and the received data is stored in an array on Linux.
Windows: socket_send(sockfd, &SOPF->array[0], n);
Linux: socket_recv(&SOPF->array[0], connfd);
I could receive all the data properly. However, it seems to me that the transfer time depends on the size of the data being sent. When the data size is small, the received throughput is quite low, as shown below.
Could you please show me some documents explaining this problem? Thank you in advance.
To establish a TCP connection, you need a 3-way handshake: SYN, SYN-ACK, ACK. Then the sender starts to send some data. How much depends on the initial congestion window (configurable on Linux; I don't know about Windows). As long as the sender receives timely ACKs and the receiver's advertised window has space (set with the SO_RCVBUF socket option), it will keep sending. Finally, closing the connection also requires a FIN, FIN-ACK, ACK.
So my best guess, without more information, is that the overhead of setting up and tearing down the TCP connection has a huge effect on the overhead of sending a small number of bytes. Nagle's algorithm (disabled with TCP_NODELAY) shouldn't have much effect as long as the writer is writing quickly; it only prevents sending less-than-full MSS segments, which should increase transfer efficiency in this case, where the sender is simply sending data as fast as possible. The only effect I can see is that the final less-than-full MSS segment might need to wait for an ACK, which again has more impact on short transfers than on longer ones.
To illustrate this, I sent one byte using netcat (nc) on my loopback interface (which isn't a physical interface, and hence the bandwidth is "infinite"):
$ nc -l 127.0.0.1 8888 >/dev/null &
[1] 13286
$ head -c 1 /dev/zero | nc 127.0.0.1 8888 >/dev/null
And here is a network capture in wireshark:
It took a total of 237 microseconds to send one byte, which is a measly 4.2 KB/second. You can guess that if I sent 2 bytes, it would take essentially the same amount of time, for an effective rate of about 8.4 KB/second, a 100% improvement!
The best way to diagnose performance problems in networks is to get a network capture and analyze it.
When you run your test with a significant amount of data, for example your biggest test (512 MiB, about 536 million bytes), the following happens.
The data is sent by the TCP layer, which breaks it into segments of a certain length. Let's assume segments of 1460 bytes, so there will be about 367,000 segments.
Every segment transmitted carries overhead (control and management data added to ensure good transmission): in your setup, 20 bytes for TCP, 20 for IP, and 16 for Ethernet, for a total of 56 bytes per segment. Note that this is the minimum, not accounting for the Ethernet preamble, for example; moreover, the IP and TCP overhead can sometimes be bigger because of optional fields.
Well, 56 bytes for every segment (367,000 segments!) means that when you transmit 512 MiB, you also put 56 * 367,000 = about 20 million extra bytes on the line. The total becomes 536 + 20 = 556 million bytes, or about 4,448 million bits. If you divide this number of bits by the elapsed time, 4.6 seconds, you get a bit rate of about 966 megabits per second, which is higher than what you calculated without taking the overhead into account.
From the above calculation, it seems that your Ethernet link is gigabit. Its maximum transfer rate is 1,000 megabits per second, and you are getting really close to it. The rest of the time is due to overhead we didn't account for, and to latencies that are always present and tend to be amortized as more data is transferred (but they are never eliminated completely).
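As a quick sanity check, here is a small program (mine, not the original poster's) that redoes the arithmetic above with the same assumed numbers: 1460-byte segments and 56 bytes of per-segment overhead.

#include <stdio.h>

/* Back-of-the-envelope check of the overhead calculation above.
 * The segment size and per-segment overhead are the assumptions used
 * in the answer, not measured values. */
int main(void)
{
    double payload_bytes = 536e6;   /* ~512 MiB transferred             */
    double elapsed_s     = 4.6;     /* measured transfer time (seconds) */
    double mss           = 1460.0;  /* assumed TCP payload per segment  */
    double overhead      = 56.0;    /* TCP + IP + Ethernet per segment  */

    double segments   = payload_bytes / mss;                /* ~367,000 */
    double wire_bytes = payload_bytes + segments * overhead;
    double wire_mbps  = wire_bytes * 8.0 / elapsed_s / 1e6;

    printf("segments:      %.0f\n", segments);
    printf("bytes on wire: %.0f\n", wire_bytes);
    printf("wire bit rate: %.0f Mbit/s\n", wire_mbps);  /* ~968, close to the ~966 above */
    return 0;
}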
I would say that your setup is OK. But this is for big data transfers. As the size of the transfer decreases, the overhead in the data, the latencies of the protocol, and other such things become more and more important. For example, if you transmit 16 bytes in 165 microseconds (the first of your tests), the result is 0.78 Mbps; if it took 4.2 us, about 40 times less, the bit rate would be about 31 Mbps (40 times higher). These numbers are lower than expected.
In reality, you don't transmit 16 bytes, you transmit at least 16 + 56 = 72 bytes, which is 4.5 times more, so the real transfer rate of the link is also higher. But, you see, transmitting 16 bytes over a TCP/IP link is like measuring the flow rate of an empty aqueduct by dropping a few drops of water into it: the drops get lost before they reach the other end. This is because TCP/IP and Ethernet are designed to carry much more data, with reliability.
Comments and answers on this page point out many of the mechanisms that trade bit rate and reactivity for reliability: the 3-way TCP handshake, Nagle's algorithm, checksums and other overhead, and so on.
Given the design of TCP/IP and Ethernet, it is entirely normal that performance is not optimal for small amounts of data. Your tests show the transfer rate climbing steeply when the data size reaches 64 KB. This is not a coincidence.
From a comment you left above, it seems that you are looking for low-latency communication rather than high bandwidth. It is a common mistake to confuse these different kinds of performance. Moreover, in this respect, TCP/IP and Ethernet are completely non-deterministic. They are quick, of course, but nobody can say exactly how quick, because there are too many layers in between. Even in your simple setup, if a single packet gets lost or corrupted, you can expect delays of seconds, not microseconds.
If you really want low latency, you should use something else, for example a CAN bus. Its design is exactly what you want: it transmits small amounts of data with high speed, low latency, and deterministic timing (just microseconds after you transmit a frame, you know whether it has been received; to be more precise, exactly at the end of the transmission you know whether it reached the destination).
TCP sockets typically have an internal buffer. In many implementations, the stack waits a little before sending a packet, to see if it can fill the remaining space in the buffer first. This is called Nagle's algorithm. I assume that the times you report above are not due to overhead in the TCP packet, but to the fact that TCP waits for you to queue up more data before actually sending.
Most socket implementations therefore have a parameter or option called something like TcpNoDelay, which can be false (the default) or true. I would try toggling that and seeing whether it affects your throughput. Essentially, this flag enables or disables Nagle's algorithm.
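For what it's worth, here is a minimal sketch of how that looks with BSD/Linux sockets (on Winsock the option has the same name, but the value is passed as a char/BOOL); it is illustrative, not taken from the question's code:

#include <stdio.h>
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_NODELAY */
#include <sys/socket.h>

/* Disable Nagle's algorithm on an existing TCP socket so small writes are
 * sent immediately instead of being coalesced. Returns 0 on success. */
static int disable_nagle(int sockfd)
{
    int one = 1;
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}

The option is per socket, so it has to be set on every connection that carries the small command packets.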

How much data can I send through a socket.emit?

So I am using node.js and socket.io. I have this little program that takes the contents of a text box and sends it to the node.js server. Then, the server relays it back to other connected clients. Kind of like a chat service but not exactly.
Anyway, what if the user were to type 2-10 kB worth of text and try to send that? I know I could just try it out and see for myself, but I'm looking for a practical, best-practice limit on how much data I can send through an emit.
As of v3, socket.io has a default message limit of 1 MB. If a message is larger than that, the connection will be killed.
You can change this default by specifying the maxHttpBufferSize option, but consider the following (which was originally written over a decade ago, but is still relevant):
Node and socket.io don't have any built-in limits. What you do have to worry about is the relationship between the size of the message, number of messages being send per second, number of connected clients, and bandwidth available to your server – in other words, there's no easy answer.
Let's consider a 10 kB message. When there are 10 clients connected, that amounts to 100 kB of data that your server has to push out, which is entirely reasonable. Add in more clients, and things quickly become more demanding: 10 kB * 5,000 clients = 50 MB.
Of course, you'll also have to consider the amount of protocol overhead: per packet, TCP adds ~20 bytes, IP adds 20 bytes, and Ethernet adds 14 bytes, totaling 54 bytes. Assuming an MTU of 1500 bytes, you're looking at 8 packets per client (disregarding retransmits). This means you'll send 8*54 = 432 bytes of overhead + 10 kB payload = 10,672 bytes per client over the wire.
10.4 kB * 5000 clients = 50.8 MB.
On a 100 Mbps link, you're looking at a theoretical minimum of 4.3 seconds to deliver a 10 kB message to 5,000 clients if you're able to saturate the link. Of course, in the real world of dropped packets and corrupted data requiring retransmits, it will take longer.
Even with a very conservative estimate of 8 seconds to send 10 kB to 5,000 clients, that's probably fine in a chat room where a message comes in every 10-20 seconds.
So really, it comes down to a few questions, in order of importance:
How much bandwidth will your server(s) have available?
How many users will be simultaneously connected?
How many messages will be sent per minute?
With those questions answered, you can determine the maximum size of a message that your infrastructure will support.
The limit is 1 MB by default.
Use this example config to specify your custom limit for maxHttpBufferSize:
const io = require("socket.io")(server, {
  maxHttpBufferSize: 1e8,
  pingTimeout: 60000
});
1e8 = 100,000,000 bytes, which should be enough for any large-scale response/emit.
pingTimeout: a large emit takes time to transfer, so you may need to increase the ping timeout as well.
Read more in the Socket.IO docs.
If your problem remains after these settings, check the configuration of any proxy in front of the Socket.IO app (for example client_max_body_size and limit_rate in nginx), as well as client/server firewall rules.
2-10 kB is fine; there aren't any enforced limits or anything, it just comes down to bandwidth and practicality. 10 kB is small in the grand scheme of things, so you should be fine if that's roughly your upper bound.

How do I stress test a web form file upload?

I need to test a web form that takes a file upload.
The filesize in each upload will be about 10 MB.
I want to test whether the server can handle over 100 simultaneous uploads and still remain responsive for the rest of the site.
Repeated form submissions from our office will be limited by our local DSL line.
The server is offsite with higher bandwidth.
Answers based on experience would be great, but any suggestions are welcome.
Use the ab (ApacheBench) command-line tool that is bundled with Apache (I have just discovered this great little tool). Unlike cURL or wget, ApacheBench was designed for performing stress tests on web servers (any type of web server!). It generates plenty of statistics too. The following command sends an HTTP POST request, with the file test.jpg as the request body, to http://localhost/ 100 times, with up to 4 concurrent requests:
ab -n 100 -c 4 -p test.jpg http://localhost/
It produces output like this:
Server Software:
Server Hostname: localhost
Server Port: 80
Document Path: /
Document Length: 0 bytes
Concurrency Level: 4
Time taken for tests: 0.78125 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Non-2xx responses: 100
Total transferred: 2600 bytes
HTML transferred: 0 bytes
Requests per second: 1280.00 [#/sec] (mean)
Time per request: 3.125 [ms] (mean)
Time per request: 0.781 [ms] (mean, across all concurrent requests)
Transfer rate: 25.60 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 2.6 0 15
Processing: 0 2 5.5 0 15
Waiting: 0 1 4.8 0 15
Total: 0 2 6.0 0 15
Percentage of the requests served within a certain time (ms)
50% 0
66% 0
75% 0
80% 0
90% 15
95% 15
98% 15
99% 15
100% 15 (longest request)
Automate Selenium RC using your favorite language. Start 100 Selenium threads, each typing the path of the file into the input and clicking submit.
You could generate 100 sequentially named files to make looping over them easy, or just use the same file over and over again.
I would perhaps point you towards cURL: submit random data (say, 10 MB read from /dev/urandom and encoded into base32) in a POST request, manually fabricating the body as a file upload (it's not rocket science).
Fork that script 100 times, perhaps across a few servers. Just make sure the sysadmins don't think you are running a DDoS or something :)
Unfortunately this answer remains a bit vague, but hopefully it helps by nudging you onto the right track.
Continued as per Liam's comment:
If the server receiving the uploads is not on the same LAN as the clients that will connect to it, it is better to use test nodes that are as remote as possible, if only to simulate behavior as authentically as possible. But if you don't have access to computers outside the local LAN, the local LAN is still better than nothing.
Stress testing from the same hardware would be a bad idea, since you would be putting double load on the server: generating the random data, packing it, and sending it through the TCP/IP stack (although probably not over Ethernet), and only then can the server do its real work. If the sending part is offloaded elsewhere, you get roughly double (take that with an arbitrarily sized grain of salt) performance at the receiving end.
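If you do go the cURL route but want it inside a program rather than a shell script, here is a minimal sketch using libcurl's multipart API (curl_mime_*, available in reasonably recent libcurl). The URL and the form field name "upload" are placeholders for whatever the real form expects, and you would run many copies or threads of this to create the concurrency:

#include <stdio.h>
#include <curl/curl.h>

/* Sketch: upload one file as a multipart/form-data POST with libcurl.
 * The URL and the "upload" field name are placeholders; adjust them to
 * match the real form. Run many instances in parallel for a stress test. */
int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);

    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    curl_mime *form = curl_mime_init(curl);
    curl_mimepart *part = curl_mime_addpart(form);
    curl_mime_name(part, "upload");           /* form field name (assumed) */
    curl_mime_filedata(part, "test.jpg");     /* ~10 MB file to send       */

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/upload");
    curl_easy_setopt(curl, CURLOPT_MIMEPOST, form);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "upload failed: %s\n", curl_easy_strerror(rc));

    curl_mime_free(form);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}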
