Threads in application server and connections in HttpServer - multithreading

I'm building a system that will use two servers and one load balancer.
The company has more than 60,000 users and expects 10,000 concurrent users, and all transactions will occur within 5 seconds.
I'm not sure how to choose the following for each server:
Amount of connections in HttpServer
Amount of threads in application server
I understand that I will find out these numbers once the system is in production, but I need to start with something.
Any suggestions or advice?

This is about capacity planning. I can give some suggestions below, but everything depends on the technical feasibility and business requirements of your system.
Try to find out the maximum capacity you need to support; run the required stress tests to figure this out.
Make sure the system can improve performance horizontally, with clustering etc.
Decide on a predicted capacity requirement (CR); the CR may cover hardware, bandwidth, etc.
Predicted CR = Current CR + 30% * Current CR (see the small worked example after this list)
Finally, this is about continuous improvement, so keep an eye on changes.
Check how reliable the system is and decide on changes to hardware, software, architecture, etc.
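As a rough illustration of the 30% headroom rule above, here is a minimal sketch; the starting numbers are simply the ones from the question, and the 30% figure is just this rule of thumb, not a measured value:

```python
# Back-of-the-envelope headroom: Predicted CR = Current CR + 30% * Current CR
current_concurrent_users = 10_000      # expected concurrency from the question
growth_headroom = 0.30                 # the 30% rule of thumb above

predicted_concurrent_users = current_concurrent_users * (1 + growth_headroom)
per_server = predicted_concurrent_users / 2   # two servers behind the load balancer

print(f"Plan for ~{predicted_concurrent_users:.0f} concurrent users "
      f"(~{per_server:.0f} per server) before the next capacity review.")
```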
Hope this adds some value for you.

Set up a test server and extrapolate the numbers from there (take some time to do the research and come up with an educated guess).
For example, the "amount of threads in application server" depends on what kind of HTTP-server you use. Some servers can process hundreds of connections with a single thread, but badly programmed/configured servers might end up using 1 thread per connection.
The performance requirement "all transactions will occur within 5 seconds" needs further detailing. Showing web pages with data (from a database) to a user should not take more than 3 seconds (if it takes longer, users get irritated), and ideally should take less than 1 second (what an average user expects). On the other hand, it might be OK to take 10 seconds to store data from a complex form (as long as the form is not used too often).
I would be skeptical about the load requirement "they expect 10,000 concurrent users". That would mean 1 in 6 company employees is actively using the web-application. I think this would be 'peak usage' and not 'average usage'. This is important to know with respect to the performance requirements and costs: if the system must adhere to the performance requirements during peak usage, you need more money for better and/or more hardware.
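As a first guess at the thread and connection counts, a Little's Law style estimate is a reasonable starting point. The sketch below is only that: the think time and response time are assumptions, to be replaced with measurements from your test server.

```python
# Little's Law: requests in flight = arrival rate * response time.
# 10,000 concurrent *users* are not 10,000 concurrent *requests*: most users
# pause ("think time") between requests.

concurrent_users = 10_000
think_time_s = 30.0       # assumed average pause between a user's requests
response_time_s = 1.0     # assumed average response time, well under the 5 s limit

requests_per_second = concurrent_users / (think_time_s + response_time_s)
in_flight = requests_per_second * response_time_s

servers = 2
threads_per_server = in_flight / servers   # for a thread-per-request server

print(f"~{requests_per_second:.0f} req/s, ~{in_flight:.0f} requests in flight, "
      f"~{threads_per_server:.0f} worker threads per server as a starting point")
```

An event-driven HTTP server needs far fewer threads than this, but the requests-in-flight figure is still a useful lower bound on the concurrent connections each box has to hold open.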

Is ZeroMQ with RT Linux (RT-PREEMPT patch) real time?

I am considering setting up ZeroMQ as message broker on a Linux kernel patched up with RT-PREEMPT (to make it real time).
Basically I want to publish/subscribe short events that are serialized using Google protocol buffers.
1. Event Model Object (App #1) --->
2. Serialize Google protobuf --->
3. ZeroMQ --->
4. Deserialize Google protobuf -->
5. Event Model object (App #2)
From #1 to #5, and perhaps back to #1, how will the real-time guarantees of Linux RT-PREEMPT be affected by ZeroMQ?
I am specifically looking for real time features of ZeroMQ. Does it provide such guarantees?
To put the question in perspective, let's say I want to know whether ZeroMQ is worthy of deploying on time-critical systems such as Ballistic Missile Defense or a Boeing 777 autopilot.
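For reference, the wiring I have in mind maps onto a small PUB/SUB sketch with the Python bindings (pyzmq); the event_pb2 module below is a stand-in for whatever protoc would generate from my .proto file, and nothing here says anything about latency bounds, it only shows where steps 2-4 sit:

```python
import time
import zmq
# import event_pb2   # hypothetical module generated by protoc from the .proto file

ctx = zmq.Context()

# App #1 side: PUB socket (step 3 of the pipeline)
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

# App #2 side: SUB socket
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to all messages
time.sleep(0.2)                      # crude slow-joiner workaround for this demo

# Step 2: serialize the event model object
# payload = event_pb2.Event(id=1, name="example").SerializeToString()
payload = b"serialized-protobuf-bytes"   # placeholder for the protobuf bytes

pub.send(payload)

# Step 4: deserialize on the receiving side
raw = sub.recv()
# event = event_pb2.Event(); event.ParseFromString(raw)
print(len(raw), "bytes received")
```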
Firstly, PREEMPT_RT reduces the maximum latency for a task, but overall system throughput will be lower and the average latency probably higher.
Real time is a broad subject; to avoid confusion, my definition of real time here is on the order of tens of milliseconds per frame, running at 30 Hz or higher.
Does it provide such (real time features) guarantees?
As already answered, no it does not, and that isn't what PREEMPT_RT is really about.
Is ZeroMQ worthy of deploying on time-critical systems?
Time critical is a loose definition, but with a correctly designed protocol ZeroMQ gives you options in how messages are transported (memory, TCP, UDP / multicast), and rest assured that ZeroMQ does what it does really well.
In my experience ZeroMQ typically delivers fast on a local high-speed network, but that speed will drop over a wide-area network, and it may also degrade as endpoints are added, depending on which model you are using within ZeroMQ.
For real time systems it's not just about transmission speed, there is also latency and time synchronisation to consider.
It's worth reading the article 0MQ: Measuring messaging performance; note that there is a high (1.5 ms) latency at the start of message transmission that settles quickly. That is probably fine if you need to transmit a lot of small messages at high frequency, but not as good if you are transmitting a few larger messages at a lower rate.
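If you want to reproduce that warm-up effect yourself, a rough pyzmq sketch follows. It runs everything in one process over inproc://, so the absolute numbers will not match the article's network figures; the only point is that the first round trip is measurably slower than the steady state.

```python
import threading
import time
import zmq

ctx = zmq.Context()

def echo_server():
    rep = ctx.socket(zmq.REP)
    rep.bind("inproc://bench")
    for _ in range(1000):
        rep.send(rep.recv())          # echo each request straight back

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.1)                       # give the REP socket time to bind

req = ctx.socket(zmq.REQ)
req.connect("inproc://bench")

rtts = []
for _ in range(1000):
    t0 = time.perf_counter()
    req.send(b"x" * 64)               # small 64-byte message
    req.recv()
    rtts.append(time.perf_counter() - t0)

steady = sorted(rtts[1:])[len(rtts) // 2]
print(f"first round trip: {rtts[0] * 1e6:.0f} us, median afterwards: {steady * 1e6:.0f} us")
```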
Is ZeroMQ worthy of deploying on time-critical systems such as Ballistic Missile Defense or a Boeing 777 autopilot?
It's important to understand the topology of what you're connecting and also how the latency will affect things and design a protocol accordingly.
So, to take the case of the 777 autopilot: almost certainly ZeroMQ would be suitable, simply because there is a lot of inertia in a stable aircraft, so the airframe takes time to respond, and the cleverness is inside the autopilot rather than depending on incoming sensor data at a high rate. On a 777 there is an ARINC 429 bus connecting the avionics, and it runs at a maximum of 100 kbit/s between the limited number of endpoints that can sit on any given bus.
Q: Does it provide such (real time features) guarantees?
No, it does not, and it never tried to. The Zen of Zero guarantees that either a complete message payload is delivered or nothing at all, which means your real-time code never needs to test for damaged message integrity (once a message has been delivered). Yet it does not guarantee delivery as such, and the application domain is free to build any additional service layer on top of the smart and ultimately performant ZeroMQ layer.
ZeroMQ has been, since its earliest days, a smart, asynchronous, brokerless tool for designing almost linearly scalable distributed-system messaging, with low resource needs and an excellent latency/performance envelope.
That said, anyone can build on this and design add-on features that match application-domain-specific needs (N+1 robustness, N+M failure resilience, adaptive node rediscovery, ...), which are deliberately neither hard-coded nor pre-wired into the minimum-footprint, low-latency, highly scalable messaging core engine.
Any design in which the rules of the Zen of Zero, coded so wisely into the ZeroMQ core engine, can safely meet the hard real-time limits of the real-time-constrained system under review will enjoy ZeroMQ and its support for inproc://, ipc://, tipc://, tcp://, pgm://, epgm://, udp://, vmci:// and other wire-level protocols.
Q: Is ZeroMQ worthy of deploying on time-critical systems?
That depends on many things.
I remember the days when an F-16 avionics simulator modelled the onboard network internally with an isolated, high-speed, deterministic and rather low-latency (thanks to static packet/payload sizes) 155+ Mbit/s ATM fabric as its on-plane switching network, precisely to enjoy those benefits for real-time control needs; technology always matches some set of needs. Once your real-time system design criteria are defined, anyone can confirm or deny whether a given tool is feasible for designing towards such real-time goals.
The factors go way beyond just ZeroMQ's smart features:
properties of your critical system
external constraints (sector-specific authoritative regulations)
the project's planning
the project's external ecosystem of cooperating parties (greenfield systems have one too)
your team's knowledge of ZeroMQ, or its lack of experience using it in distributed-system designs
...
and, last but not least, the ceilings on financing and on showing the RTO demonstrator live and rolling it out for acceptance tests.
Good luck with your forthcoming BMD or any other MCS. The ZeroMQ style of thinking, the Zen of Zero, can help you in many aspects of doing the right things right.

Synthetic performance AB test

I have deployed two versions of our single-page web app: one from master (A) and one from a branch with some changes that could somehow affect load time (B). The change is usually some new front-end feature, a refactoring, a small performance optimization, etc. The difference is not big, and the load time varies much more for other reasons (the load on the testing machines, the load on the servers, the network, etc.). So webpagetest.org, even with 9 runs, varies much more (14-20 s SpeedIndex) than the real difference could be (0.5 s on average, for example).
Basically, I need one number which tells me whether this feature increases or decreases load time.
Is there some tool which could measure such differences?
My idea was to deploy WebPageTest to a server with minimal load and run it against both versions at the same time, in random order, so I avoid most of the noise; then collect a lot of samples (1000+) and check the average (or median) value.
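For clarity, what I have in mind is roughly the sketch below; measure_speed_index is just a placeholder for whatever would drive the private WebPageTest instance, and the bootstrap at the end is one simple way to turn the samples into a single number with an error bar:

```python
import random
import statistics

def measure_speed_index(version: str) -> float:
    """Placeholder: trigger one WebPageTest run against `version`
    and return its SpeedIndex in seconds."""
    raise NotImplementedError

# Interleave A and B in random order so slow periods hit both versions equally.
samples = {"A": [], "B": []}
for _ in range(1000):
    for version in random.sample(["A", "B"], 2):
        samples[version].append(measure_speed_index(version))

delta = statistics.median(samples["B"]) - statistics.median(samples["A"])
print(f"median delta (B - A): {delta:+.2f}s")

# Rough bootstrap confidence interval for the difference of medians.
diffs = sorted(
    statistics.median(random.choices(samples["B"], k=len(samples["B"])))
    - statistics.median(random.choices(samples["A"], k=len(samples["A"])))
    for _ in range(2000)
)
print(f"~95% CI for the delta: [{diffs[50]:+.2f}s, {diffs[-51]:+.2f}s]")
```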
But before I start working on that I would like to ask if there is some service which solves that problem.

Should I go for faster queries or less CPU consumption / faster processing?

I have to choose between performing a query for data of size X and not processing it, just sending it to the client,
OR
performing a query for data of half that size, doing a little processing, and then sending it to the client.
Now, in my life as a programmer I have met the storage vs. speed problem quite a bit, but in this case I have to choose between "fast query + processing" and "slow query + no processing".
If it matters, I am using nodejs for the server and mongodb for the database.
If you care: I am holding non-intersecting map areas, and I am testing whether a given area intersects any map area or none. All areas are boxes. If I keep them as origin points, it's only one pair of coordinates, and I have to process each point into an area (all map areas have the same size). If I store them as areas directly, I don't have to process them anymore, but it's 4 pairs of coordinates now: 4 times the size and, I think, a slower query.
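To be concrete, the "origin point + a little processing" option is only a couple of lines; the sketch below is in Python just to show the shape of it (my server is actually Node.js), and the box size constants are placeholders for my fixed map-area size:

```python
# Expand a stored origin point into a fixed-size box, then test intersection.
BOX_W, BOX_H = 10.0, 10.0   # placeholder for the fixed map-area size

def expand(origin):
    """(x, y) origin point -> (x_min, y_min, x_max, y_max) box."""
    x, y = origin
    return (x, y, x + BOX_W, y + BOX_H)

def intersects(a, b):
    """Axis-aligned box overlap test."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

# In the real system the origins would come back from the database query.
origins = [(0, 0), (25, 25), (40, 5)]
candidate = expand((8, 3))
print(any(intersects(candidate, expand(o)) for o in origins))   # True: overlaps the box at (0, 0)
```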
There is no right answer to this question; it all depends on your infrastructure. If you are, for example, using Amazon Web Services for this, it depends on the transaction price. If you've got your own infrastructure, it depends on the load on the DB and web servers. If they are on the same server, it's a matter of the underlying hardware whether the I/O from the DB starts to limit before the CPU/memory become the bottleneck.
The only way to determine the right answer for your specific situation is to set it up and do a stress test, for example using Load Impact or one of the tons of other good tools for this. While it is getting hammered, monitor your system load using top and watch the wa column specifically: if it consistently goes above 50% you are I/O-limited, and you should offload work from the DB to the CPU.
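If you prefer to watch that same signal from a script rather than in top, here is a small sketch (assuming the psutil package; the iowait field is only reported on Linux):

```python
import psutil

# Sample the same "wa" (iowait) figure that top shows, once per second.
for _ in range(10):
    cpu = psutil.cpu_times_percent(interval=1)
    iowait = getattr(cpu, "iowait", 0.0)     # attribute exists on Linux only
    print(f"iowait {iowait:5.1f}%  user {cpu.user:5.1f}%  system {cpu.system:5.1f}%")
    if iowait > 50:
        print("-> consistently above 50% means I/O-bound; favour the processing option")
```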

Is bcrypt viable for large web sites?

I've been on the bcrypt bandwagon for a while now, but I'm having trouble answering a simple nagging question.
Imagine I have a reasonably successful web site in the U.S... about 100,000 active users that each have activity patterns requiring 2 to 3 authentication attempts on average over the course of a typical American business day (12 hours when you include timezones). That's 250,000 authentication requests per day, or about 5.8 authentications per second.
One of the neat things about bcrypt is that you tune it, so that over time it scales as hardware does, to stay ahead of the crackers. A common tuning is to get it to take just over 1/10 of a second per hash creation... let's say I get it to 0.173 seconds per hash. I chose that number because it just so happens that 0.173 seconds per hash works out to about 5.8 hashes per second. In other words, my hypothetical web server is literally spending all its time doing nothing but authenticating users, never mind actually doing any useful work.
To address this issue, I would have to either tune bcrypt way down (not a good idea) or get a dedicated server just to do authentications, and nothing else. Now imagine that the site grows and adds another 100,000 users. Suddenly I need two servers: again, doing nothing but authentication. Don't even start thinking about load spikes, as you have light and busy periods throughout a day.
As I see it right now, this is one of those problems that would be nice to have, and bcrypt would still be worth the trouble. But I'd like to know whether I'm missing something obvious here, or something subtle. Or can anyone out there actually point to a well-known web site running a whole server farm just for the authentication portion of their site?
Even if you tune bcrypt to take only, say, 1/1000 of a second, that's still quite a bit slower than simple hashing — a quick and dirty Perl benchmark says my not-so-new computer can calculate about 300,000 SHA-256 hashes per second.
Yes, the difference between 1000 and 300,000 is only about 8 bits, but that's still 8 bits of security margin you wouldn't have otherwise, and that difference is only going to increase as CPUs get faster.
Also, if you use scrypt instead of bcrypt, it will retain its memory-hardness property even if the iteration count is lowered, which will still make brute forcing it harder to parallelize.
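If you want to see where your own hardware lands on that curve, here is a minimal timing sketch (assuming the Python bcrypt package; the cost values are just examples, and each +1 in cost roughly doubles the time per hash):

```python
import time
import bcrypt   # pip install bcrypt

password = b"correct horse battery staple"
for cost in (8, 10, 12):
    start = time.perf_counter()
    bcrypt.hashpw(password, bcrypt.gensalt(rounds=cost))
    elapsed = time.perf_counter() - start
    print(f"cost={cost}: {elapsed:.3f}s per hash -> ~{1 / elapsed:.1f} authentications/sec/core")
```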

Is there any modern review of solutions to the 10000 client/sec problem

(Commonly called the C10K problem)
Is there a more contemporary review of solutions to the c10k problem (Last updated: 2 Sept 2006), specifically focused on Linux (epoll, signalfd, eventfd, timerfd..) and libraries like libev or libevent?
Something that discusses all the solved and still unsolved issues on a modern Linux server?
The C10K problem generally assumes you're trying to optimize a single server, but as your referenced article points out "hardware is no longer the bottleneck". Therefore, the first step to take is to make sure it isn't easiest and cheapest to just throw more hardware in the mix.
If we've got a $500 box serving X clients per second, it's a lot more efficient to just buy another $500 box to double our throughput than to let an employee gobble up who knows how many hours and dollars trying to figure out how to squeeze more out of the original box. Of course, that's assuming our app is multi-server friendly, that we know how to load balance, etc., etc...
Coincidentally, just a few days ago, Programming Reddit or maybe Hacker News mentioned this piece:
Thousands of Threads and Blocking IO
In the early days of Java, my C programming friends laughed at me for doing socket IO with blocking threads; at the time, there was no alternative. These days, with plentiful memory and processors it appears to be a viable strategy.
The article is dated 2008, so it pulls your horizon up by a couple of years.
To answer the OP's question, you could say that today the equivalent document is not about optimizing a single server for load, but about optimizing your entire online service for load. From that perspective, the number of combinations is so large that what you are looking for is not a document; it is a live website that collects such architectures and frameworks. Such a website exists, and it's called www.highscalability.com.
Side Note 1:
I'd argue against the belief that throwing more hardware at it is a long term solution:
Perhaps the cost of an engineer that "gets" performance is high compared to the cost of a single server. But what happens when you scale out? Let's say you have 100 servers. A 10 percent improvement in server capacity can save you 10 servers a month.
Even if you have just two machines, you still need to handle performance spikes. The difference between a service that degrades gracefully under load and one that breaks down is that someone spent time optimizing for the load scenario.
Side Note 2:
The subject of this post is slightly misleading. The C10K document does not try to solve the problem of 10k clients per second. (The number of clients per second is irrelevant unless you also define a workload along with sustained throughput under bounded latency. I think Dan Kegel was aware of this when he wrote that doc.) Look at it instead as a compendium of approaches for building concurrent servers, and micro-benchmarks for the same. Perhaps what has changed between then and now is that at one point you could assume the service was a website serving static pages; today the service might be a NoSQL datastore, a cache, a proxy or one of hundreds of pieces of network infrastructure software.
You can also take a look at this series of articles:
http://www.metabrew.com/article/a-million-user-comet-application-with-mochiweb-part-3
He shows a fair amount of performance data and the OS configuration work he had to do in order to support 10K and then 1M connections.
It seems like a system with 30GB of RAM could handle 1 million connected clients on a sort of social network type of simulation, using a libevent frontend to an Erlang based app server.
libev publishes some benchmarks of itself against libevent...
I'd recommend reading Zed Shaw's poll, epoll, science, and superpoll [1]. It covers why epoll isn't always the answer, why it is sometimes even better to go with poll, and how to bring the best of both worlds together.
[1] http://sheddingbikes.com/posts/1280829388.html
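For a feel of what the readiness-based approach looks like in practice, here is a minimal single-threaded echo server sketch using Python's selectors module, which picks epoll on Linux and falls back to kqueue/poll/select elsewhere. It is only an illustration of the event-loop style these articles discuss, not a production server:

```python
import selectors
import socket

sel = selectors.DefaultSelector()          # epoll on Linux, kqueue/poll/select elsewhere

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.send(data)                    # naive echo; a real server would buffer writes
    else:                                  # empty read means the peer closed the connection
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 8080))
server.listen(1024)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                                # single-threaded event loop
    for key, _ in sel.select():
        key.data(key.fileobj)              # call the handler registered for this socket
```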
Have a look at the RamCloud project at Stanford: https://ramcloud.atlassian.net/wiki/display/RAM/RAMCloud
Their goal is 1,000,000 RPC operations/sec/server. They have numerous benchmarks and commentary on the bottlenecks that are present in a system which would prevent them from reaching their throughput goals.
