How to temporarily buffer incoming network traffic for latency-sensitive HFT application? - linux

We are running a Java-based trading application, and there are certain periods where we want to prioritize outgoing network traffic as much as possible for about 10 ms. Is there a way to temporarily buffer all incoming network traffic during a short time period, either on the network card or via a process or buffer on our Redhat Linux box?
The rationale behind this is that the incoming network traffic spikes during this same period, and the application processing this traffic is stealing CPU cycles from the process we are trying to prioritize. We do not have fine-grained control over the application treating the incoming network traffic.
We're on a 1 Gbps connection so a buffer of about 1 MB should be sufficient. We would prefer not dropping the incoming traffic and requesting retransmission as this would increase load on our network during quite busy periods.

Possible using Qos on the router, or using trickle to control your bandwidth by a sample configuration of :
/etc/trickled.conf.
see example in url.

I am not sure whether I understand your problem correctly. Your concern is sometimes you have priority to deal with output network traffic and at this time the incoming traffic will build up and finally might cause package drop or retransmission which you don't want. Therefore, you want to buffer your incoming traffic.
If my understanding is correct and your are using TCP, try to make your tcp buffer bigger.
http://kaivanov.blogspot.com/2010/09/linux-tcp-tuning.html and then Use netstat to check whether your change is effective.

Adrian, have you tried setting the priority of your outgoing communication process to be higher than that of the process receiving the incoming data? Using the nice command this can be achieved. Note that in Unix/Linux the lower the number the higher the priority.
Otherwise I am not sure this is possible without having a direct tie in between the two applications that are sending / receiving, allowing you to effectively ignore the incoming connections that are ready to read from until any data you have is sent out.

Related

Node script - Failover from one server to another server

I have a nodejs script - lets call it "process1" on server1, and same script is running on server2 - "process2" (just with flag=false).
Process1 will be preforming actions and will be in "running" state at the beginning. process2 will be running but in "block" state with flag programmed within it.
What i want to acomplish is to, implement failover/fallback for this process. If process1 goes down flag on process2 will change, and process2 will take over all tasks from process1 (and vice versa when process1 cames back - fallback).
What is the best approach to do this? TCP connection between those?
NOTE: Even its not too much relevant, but i want to mention that these processes are going to work internally, establishing tcp connection with third server and parsing data we are getting from that server. Both of the processes will be running on both of the servers, but only ONE process at the time can be providing services - running with flag true (and not both of them)
Update: As per discussions bellow and internal research/test and monitoring of solution, using reverse proxy will save you a lot of time. Programming fail-over based on 2 servers only will cover 70% of the cases related with the internal process which is used on the both machines - but you will not be able to detect others 30% of the issues caused because of the issues with the network (especially if you are having a lot of traffic towards DATA RECEIVER).
This is more of an infrastructure problem than it is a Node one, and the same situation can be applied to almost any server.
What you basically need is some service that monitors Server 1 and determines whether it's "healthy" or "alive" and if so continue to direct traffic to it. If the service determines that the server is no longer in a stable condition (e.g. it takes too long to respond, returns an error) it will redirect any incoming traffic to Server 2. When it's happy Server 1 has returned to normal operating conditions it will redirect the traffic back onto it.
In most cases, the "service" in this scenario is a reverse proxy like Nginx or CloudFlare. In your situation, this server would act as a buffer between Data Reciever and your network (Server 1 / Server 2) and route the incoming traffic to the relevant server.
That looks like a classical use case for a reverse proxy. Using a well tested server such as nginx should provide plenty reliability the proxy won't fail (other than hardware failure) and you could put that infront of whatever cluster size you want. You'd even get the benefit of load-balancing if that is applicable and configured properly.
Alternatively and also leaning towards a load-balancing solution, you could have a front server push requests into a queue (ZMQ for example) and either push from the queue to the app server(s) or have your app-server(s) pull tasks from the queue independently.
In both solutions, if it's a requirement not to "push" 2 simultaneous results to your data receiver, you could use an outbound queue that all app-servers push into.

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.

When does a single JMS connection with multiple producing sessions start becoming a bottleneck?

I've recently read a lot about best practices with JMS, Spring (and TIBCO EMS) around connections, sessions, consumers & producers
When working within the Spring world, the prevailing wisdom seems to be
for consuming/incoming flows - to use an AbstractMessageListenerContainer with a number of consumers/threads.
for producing/publishing flows - to use a CachingConnectionFactory underneath a JmsTemplate to maintain a single connection to the broker and then cache sessions and producers.
For producing/publishing, this is what my (largeish) server application is now doing, where previously it was creating a new connection/session/producer for every single message it was publishing (bad!) due to use of the raw connection factory under JmsTemplate. The old behaviour would sometimes lead to 1,000s of connections being created and closed on the broker in a short period of time in high peak periods and even hitting socket/file handle limits as a result.
However, when switching to this model I am having trouble understanding what the performance limitations/considerations are with the use of a single TCP connection to the broker. I understand that the JMS provider is expected to ensure it can be used in the multi-threaded way etc - but from a practical perspective
it's just a single TCP connection
the JMS provider to some degree needs to co-ordinate writes down the pipe so they don't end up an interleaved jumble, even if it has some chunking in its internal protocol
surely this involves some contention between threads/sessions using the single connection
with certain network semantics (high latency to broker? unstable throughput?) surely a single connection will not be ideal?
On the assumption that I'm somewhat on the right track
Am I off base here and misunderstanding how the underlying connections work and are shared by a JMS provider?
is any contention a problem mitigated by having more connections or does it just move the contention to the broker?
Does anyone have any practical experience of hitting such a limit they could share? Either with particular message or network throughput, or even caused by # of threads/sessions sharing a connection in parallel
Should one be concerned in a single-connection scenario about sessions that write very large messages blocking other sessions that write small messages?
Would appreciate any thoughts or pointers to more reading on the subject or experience even with other brokers.
When thinking about the bottleneck, keep in mind two facts:
TCP is a streaming protocol, almost all JMS providers use a TCP based protocol
lots of the actions from TIBCO EMS client to EMS server are in the form of request/reply. For example, when you publish a message / acknowledge a receive message / commit a transactional session, what's happening under the hood is that some TCP packets are sent out from client and the server will respond with some packets as well. Because of the nature of TCP streaming, those actions have to be serialised if they are initiated from the same connection -- otherwise say if from one thread you publish a message and in the exact same time from another thread you commit a session, the packets will be mixed on the wire and there is no way server can interpret the right message from the packets. [ Note: the synchronisation is done from the EMS client library level, hence user can feel free to share one connection with multiple threads/sessions/consumers/producers ]
My own experience is multiple connections always output perform single connection. In a lossy network situation, it is definitely a must to use multiple connections. Under best network condition, with multiple connections, a single client can nearly saturate the network bandwidth between client and server.
That said, it really depends on what is your clients' performance requirement, a single connection under good network can already provides good enough performance.
Even if you use one connection and 100 sessions it means finally you
are using 100threads, it is same as using 10connections* 10 sessions =
100threads.
You are good until you reach your system resource limits

I need an advice about interval setting in small pinging app

I'm creating an application which let me to ping IP or IP range using time interval between each ping. My concern here is that if the interval will be allowed to be too small then my program would appear to do a ping flood.
What should I allow the minimum interval in milliseconds to be in my small app?
I would think that pinging a publicly available IP address more that once per second would look highly suspicious.
In general you should not ping any more frequently than is useful, it will only lead to needless network traffic and congestion. For example if the purpose of your app were to notify a user visually of a network issue, pinging more frequently that a user can respond serves no purpose.
Perhaps a better solution would be to use a statistical based algorithm that takes into account packet loss, response times and network loading. The algorithm could be adaptive in that it would trade off network loading against the value of the information being collected.

HTTP download and multiple threads

This may be a duplicate but I have not seen this being fully answered.
Does HTTP download throughput increase when using threads?
My thinking is that when the TCP stack on the server is waiting for a ack from the receiver before sending the next chunk of data, another thread is sending out a request for data which is then serviced, leading to an increase in throughput.
Is this correct?
Yes, that is pretty much correct. Threading HTTP requests would increase throughput, up until the maximum number of connections was met on the server, and then this increase would plateau. The performance increase would be limited to both the server and the client computers threading abilities, of course.
It's correct only at the startup, during the transfer TCP has a dynamic window of data that can be sent without receiving an ACK.
So while the data transfer is going on, in most situations every chunk of data that can be sent is sent, resulting in maximum throughput.
When you use multiple threads you reduce the dead time in the TCP handshaking.
It can also be useful if you have to download files froms different servers or if the server limits the bandwidth per connection.

Resources