I have just started using beanstalkd and pheanstalk and I am curious whether the following situation is a security issue (and if not, why not?):
When designing a queue that will contain jobs for an eventual worker script to pick up and preform SQL database queries, I asked a friend what I could do to prevent an online user from going into port 11300 of my server, and inserting a job into the queue himself and hence causing the job to be executed with malicious code. I was told that I could include a password inside the job being sent.
Though after some time passed, I recognized that someone could preform a few simple commands on a terminal and obtain the job inside the queue, and hence find the password, and then create jobs with the password included:
telnet thewebsitesipaddress 11300 //creating a telnet connection
list-tubes //finding which tubes are currently being used
use a_tube_found //using one of the tubes found
peek-ready //see whats inside one of the jobs and find the password
What could be done to make sure this does not happen and my queue doesn't get hacked / controlled?
Thanks in advance!
You can avoid those situations by placing beanstalkd behind a firewall or in a private network.
DigitalOcean (for example) offers such a service where you have a private network IP address which can be accessed only from servers of the same location.
We've been using beanstalkd in our company for more than a year, and we haven't had any of those issues yet.
I see, but what if the producer was a page called index.php, where when someone entered it, a job would be sent to the queue. In this situation, wouldn't the server have to be an open network?
The browser has no way to get in contact with the job server, it only access the resources /you/ allow them to, that is the view page. Only the back-end is allowed to access the job server. Also, if you build the web application in a certain way that the front-end is separated from the back-end, you're going to have even less potential security issues.
I have a nodejs script - lets call it "process1" on server1, and same script is running on server2 - "process2" (just with flag=false).
Process1 will be preforming actions and will be in "running" state at the beginning. process2 will be running but in "block" state with flag programmed within it.
What i want to acomplish is to, implement failover/fallback for this process. If process1 goes down flag on process2 will change, and process2 will take over all tasks from process1 (and vice versa when process1 cames back - fallback).
What is the best approach to do this? TCP connection between those?
NOTE: Even its not too much relevant, but i want to mention that these processes are going to work internally, establishing tcp connection with third server and parsing data we are getting from that server. Both of the processes will be running on both of the servers, but only ONE process at the time can be providing services - running with flag true (and not both of them)
Update: As per discussions bellow and internal research/test and monitoring of solution, using reverse proxy will save you a lot of time. Programming fail-over based on 2 servers only will cover 70% of the cases related with the internal process which is used on the both machines - but you will not be able to detect others 30% of the issues caused because of the issues with the network (especially if you are having a lot of traffic towards DATA RECEIVER).
This is more of an infrastructure problem than it is a Node one, and the same situation can be applied to almost any server.
What you basically need is some service that monitors Server 1 and determines whether it's "healthy" or "alive" and if so continue to direct traffic to it. If the service determines that the server is no longer in a stable condition (e.g. it takes too long to respond, returns an error) it will redirect any incoming traffic to Server 2. When it's happy Server 1 has returned to normal operating conditions it will redirect the traffic back onto it.
In most cases, the "service" in this scenario is a reverse proxy like Nginx or CloudFlare. In your situation, this server would act as a buffer between Data Reciever and your network (Server 1 / Server 2) and route the incoming traffic to the relevant server.
That looks like a classical use case for a reverse proxy. Using a well tested server such as nginx should provide plenty reliability the proxy won't fail (other than hardware failure) and you could put that infront of whatever cluster size you want. You'd even get the benefit of load-balancing if that is applicable and configured properly.
Alternatively and also leaning towards a load-balancing solution, you could have a front server push requests into a queue (ZMQ for example) and either push from the queue to the app server(s) or have your app-server(s) pull tasks from the queue independently.
In both solutions, if it's a requirement not to "push" 2 simultaneous results to your data receiver, you could use an outbound queue that all app-servers push into.
I have been asked to create a message processing system as following. As I am not sure if this is the right place to post this, feel free to move it to any other appropriate SC group.
Problem
Server have about 100 to 500 clients connected at every moment. When a client connects to server, server loads part of their data and cache it in memory for faster access. Server will receive between 200~1000 messages per second for all clients. These messages are relatively small (about 500 bytes). Any changes to data in cache should be saved to disk as soon as possible. When client disconnects all their data is saved to disk and removed from cache. each message contains some instruction and a text message which will be saved as file. Instructions should be executed as fast as possible (near instant) and all clients using that file should get the update. Only writing the modified message to disk can be delayed.
Here is my solution in a diagram
My solution consists of a web server (http or socket) a message queue and two or more instances of file server and instruction server.
Web server grabs client messages and if there is message available for client in message queue, pushes it back to client.
Instruction processor grabs instructions from queue and creates necessary message to be processed by file server (Get/set file) and waits for the file to be available in queue and more process to create another message for client.
File server only provides the files, either from cache or physical file depending on type of file.
Concerns:
There are peak times that total connected clients might go over 10000 at once and total messages received from clients increase to 10~15K.
I should be able to clear the queue and go back to normal state as soon as possible (with processing requests obviously).
I should be able to add extra instruction processors and file servers on the fly without having to shut down the other instances.
In case file server crashes it shouldn’t lose files so it has to write files to disk as soon as there are any changes and process time is available.
File system should be in b+ tree format so some applications (local reporting apps) could easily access files without having to go through queue server
My Solution
I am thinking of using node.js for socket/web server. And may be a NoSQL database for file server and a queue server such as rabbitMQ or Node_Redis and Redis.
Questions:
Is there a better way of structuring this system?
What are my other options for components of this system?
is it possible to run all the instances in same server machine or even in same application (in different thread)?
You have a couple of holes here, mostly around the web server "pushing" the message back to the client. That doesn't really work in a web-based world. You can try and use websockets, but generally, this ends up being polling based.
I don't know what the "instructions" are to be executed, but saving 1000 500byte messages is trivial. Many NoSQL solutions boast million+ write per second capacity. Especially if you let committing to disk to lag.
Don't bother with the queue for the return of the file. A good NoSQL solution will scale better. Build out a Cassandra cluster, load test it until it can handle your peak load.
This simplifies your architecture into a 1 or more web servers, clients polling that server for file updates, a queue for submitting "messages" to the "instruction server" (also known as an application server in web-developer terms), and a no-sql database for the instruction server to write files to.
This makes scaling easy, you can always add more web servers, and with a decent cluster size for your no-sql server, you should get to scale horizontally there as well. Your only real bottleneck is your instruction server queue, which you could always throw more instruction servers at.
In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection, however people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, not even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it's normally running at 5% CPU load and it was even lower at this time. We do almost no disk i/o.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does tomcat output a warning message anywhere once it hits the "maxConnections" message? (We didn't notice anything).
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of Tcp/Ip connections? Could there be a limit elsewhere?)
What happens when tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people having their browsers on it, sending the keep-alive's to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even connecting my browser to the server within the network, or with the hot standby server - also in the same network - unable to do database replication messages which normally happens over HTTP
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release, it is quite possible your server hits the max connection limit even there is no communications.
The non-responding state can also be reached via various ways such as bugs in the server-side code.
The hardware conditions should not be ignored.
To locate the cause of this problem, you'd better try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited:
Unit tests, esp. logic blocks using transactions, threading and synchronizations.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with and their combinations and test them in a massive batch mode. (ref)
More specified Logging. Trace client behaviors and analysis what happened exactly before the server stopped responding.
Replace a server machine and see if it will still happen.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 tcp connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to debug the application using Apache JMeter for number of connection and use Jconsole or Zabbix to look for heap space or thread dump for tomcat server.
Nio Connector of Apache tomcat can have maximum connections of 10000 but I don't think thats a good idea to provide that much connection to one instance of tomcat better way to do this is to run multiple instance of tomcat.
In my view best way for Production server: To Run Apache http server in front and point your tomcat instance to that http server using AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double check that you are not getting an inflood of requests. Log files can usually help you on this, or you can use wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one, if you have many HTTP/1.1 requests going across one connection, and intermediate proxy may be splitting them into multiple connections for load balancing purposes. Sounds crazy I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or wireshark can help you determine this.
Overall, run more experiments with more changes. maxThreads, https, etc. before jumping to conclusions with maxConnections.
I am running redis-benchmark tool to send N number of requests From server A to B.
This tools generates TCP requests and receives response.
Some how when number requests reach to 51000, it stops and not exceeding above that.
I have tried the same using different machine and I got almost 100000 requests proccessed per second.
What sort of factors can limit these number of requests ??
A major factor would be the number of open file descriptors the process is allowed to create. This would be true for both the server and client side.
http://redis.io/topics/clients and http://redis.io/topics/benchmarks both have the information you should work through to determine where exactly your problem is. Without the details of your setup it is unlikely we can be more specific.
Check your ulimits and your server configuration to ensure you've configured your respective systems to the limits you intend to benchmark to and you'll be able to get more usable data.