Hardware importance on asynchronous JVM server performance - multithreading

I am running a Finatra server (https://github.com/capotej/finatra), which is a Sinatra-inspired web framework for Scala built on top of Finagle (an asynchronous RPC system).
The application should be designed to handle between 10 and 50 concurrent requests. Each request is quite CPU intensive, mostly due to parsing and serializing large JSONs and operations on arrays, like sorting, grouping, etc.
Now I am wondering what the impact of the following parameters on performance is, and how to combine them:
RAM of the server
Number of cores of the server
JVM heap size
Number of threads running in parallel in my Future Pool
As a partial answer, I would say:
JVM heap size should be tuned depending on the available RAM.
Having multiple cores improves performance under a concurrent workload but does not really speed up processing of a single request.
Having large RAM, by contrast, can notably speed up execution of a single request.
Number of threads in my Future Pool must be tuned according to my number of cores.
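A minimal sketch of what I mean by tying the Future Pool to the core count (assuming an Executors-backed FuturePool; the object name and handler body are just placeholders):

```scala
import java.util.concurrent.Executors

import com.twitter.util.{Future, FuturePool}

object CpuBoundPool {
  // Assumption: the JSON parsing/sorting work is CPU-bound, so a pool roughly
  // the size of the core count keeps every core busy without oversubscribing.
  private val cores = Runtime.getRuntime.availableProcessors()

  val pool: FuturePool = FuturePool(Executors.newFixedThreadPool(cores))

  // Hypothetical request handler: the heavy work runs on the pool so the
  // Finagle I/O threads are never blocked.
  def handle(rawJson: String): Future[Int] =
    pool {
      rawJson.length // placeholder for the real parse/sort/group step
    }
}
```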
EDIT
I want to compare performance regardless of the code, focusing only on the hardware/threading model. Let's assume the code is already optimized. Additional information:
I am building a data reporting API. Processing time of a request largely depends on the dataset I am manipulating. For big datasets, it can take up to 10 seconds.
I retrieve most of the data from third-party APIs, but I am also accessing a MySQL database with a c3p0 connection pooling mechanism. Execution of the request is additionally delegated to a Future Pool to prevent blocking.
No disk IO excluding MySQL
I don't want to cache anything on the server side because I need to work with fresh data.
Thanks !!!

The performance and overall behaviour will still depend on your own code, outside of the framework you are using. In other words, you have correctly listed the major factors which will influence performance, but your own code will have such a significant impact on it that it's almost impossible to tell in advance.
Offhand, I'd say that you need to characterize some things about your application in more detail:
You say that each request will be CPU intensive, but what do you mean by it? Will each request take 1 ms? 10 ms? 100 ms?
Do you access a database? What are the characteristics of your database?
Either with the database or without it, do you have any disk IO? How significant is it?
... but if your application is really simple, does not hit the disk much (or at all; your requests may be read-only and everything gets cached), and you are CPU-bound, simply sticking enough CPU cores in your server will be the most significant thing you can do.

Related

How to find optimal size of connection pool for single mongo nodejs driver

I am using the official mongo nodejs driver with default settings, but was digging deeper into the options today and apparently there is a maxPoolSize option that is set to 100 by default.
My understanding is that a single nodejs process can establish up to 100 connections, thus allowing mongo to handle 100 reads/writes simultaneously in parallel?
If so, it seems that setting this number higher could only benefit the performance, but I am not sure, hence I decided to ask here.
Assuming a default setup with no indexes, is there a way to determine (based on the CPUs and memory of the DB) what the optimal connection pool size should be?
We can also assume that the nodejs process itself is not a bottleneck (i.e. it can be scaled horizontally).
Good question =)
it seems that setting this number higher could only benefit the performance
It does indeed seem that way, and it would be the case for an abstract nodejs process in a vacuum with unlimited resources. Connections are not free, though, so there are things to consider:
Limited connection quota on the server. Atlas in particular, but even a self-hosted cluster has only 65k sockets. Remember the driver keeps them open for reuse, and the default timeout per cursor is 30 minutes of inactivity.
A single thread client-side. BSON serialisation blocks the event loop and is quite expensive; see e.g. the flamechart in this answer: https://stackoverflow.com/a/72264469/1110423 . By blocking the loop, you increase the time the cursors from the previous point remain open, and in the worst case you get performance degradation.
Limited RAM. Each connection requires ~1 MB server-side.
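For reference, maxPoolSize is a standard connection-string option in the official drivers, so capping it explicitly is a one-liner. A sketch using the JVM Scala driver for illustration (the Node driver accepts the same URI option; host and values are made up):

```scala
import org.mongodb.scala.MongoClient

// Cap the pool explicitly instead of relying on the default of 100.
// maxPoolSize (and minPoolSize) are plain connection-string options,
// so the same URI works with the Node driver as well.
val client: MongoClient =
  MongoClient("mongodb://localhost:27017/?maxPoolSize=50&minPoolSize=5")
```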
Assuming default setup with no indexes
You have at least the _id index, and you should have more if we are talking about performance.
is there a way to determine what the optimal connection number for pool should be?
I'd love to know that too. There are too many factors to consider, not only CPU/RAM, but also data shape, query patterns, etc. This is what DB ops people are for. A Mongo cluster requires some attention, monitoring and adjustment for optimal operation. In many cases it's more cost-efficient to scale up the cluster than to optimise the app.
We can also assume that nodejs process itself is not a bottleneck (i.e can be scaled horizontally).
This is quite a wild assumption. The process cannot scale horizontally; it exists at the OS level. Once you have a process descriptor, the pool is locked to it until the process dies. You can use a node cluster to utilise all CPU cores, and you can even have multiple servers running the same nodejs app and balance the load across them, but none of them will share connections from the pool. The pool is local to the nodejs process.

Is it good practice to use multithreading to handle requests in bulk in a microservices architecture?

Requirement:
I have to design a microservice which performs search queries against a SQL DB multiple times (say 7 calls) along with multiple third-party HTTP calls (say 8 calls), in a sequential and interleaved manner, to complete an order. By sequential I mean that before the next call to the DB or a third party, the previous call to the DB or third party must have completed, because the results of these calls are used in further third-party calls or DB search operations.
Resources:
I) CPU: 4 cores (per instance)
II) RAM: 4 GB (per instance)
III) It can be auto-scaled up to a maximum of 4 pods or instances.
IV) Deployment: OpenShift (own cloud architecture)
V) Framework: Spring Boot
My Solution:
Using Java's ThreadPoolExecutor, I've created a fixed thread pool of 5 threads (the size of the blocking queue is not configured; there are also another 20 fixed-pool threads running apart from these 5 for creating orders of other types, i.e. 25 threads in total per instance). So when multiple requests are sent to this microservice, I keep submitting the jobs, and the JVM schedules them using its scheduling algorithms and completes them.
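For concreteness, a minimal sketch of that setup (names are placeholders and the real order work is stubbed out):

```scala
import java.util.concurrent.{ExecutorService, Executors}

object OrderWorkers {
  // Fixed pool of 5 order workers. Executors.newFixedThreadPool backs onto an
  // unbounded LinkedBlockingQueue, so submitted orders queue up behind the 5
  // in-flight ones with no upper bound.
  val orderPool: ExecutorService = Executors.newFixedThreadPool(5)

  // Hypothetical submission path from the controller layer.
  def submitOrder(orderId: String): Unit =
    orderPool.submit(new Runnable {
      override def run(): Unit = processOrder(orderId)
    })

  // Placeholder for the real chain of 7 DB queries and 8 third-party HTTP calls.
  private def processOrder(orderId: String): Unit = ()
}
```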
Problem:
I'm not able to achieve the expected throughput. Using the above approach, the microservice achieves only 3 to 5 TPS (orders per second), which is very low. Sometimes Tomcat also gets choked and we have to restart the service to bring the system back to a responsive state.
Observation:
I've observed that even when orders are being processed very slowly by the thread pool executor, if I call the orders API through JMeter at the same time, those requests, which land directly on the controller layer, are processed faster than the requests going through the thread pool executor.
My Questions
I) What changes should I make at the architectural level to bring throughput up to 50 to 100 TPS?
II) What changes should be made so that if traffic on this service increases in future, the service can either be auto-scaled or a justification for increasing hardware resources can be given easily?
III) Is this how tech giants (Amazon, PayPal) solve scaling problems like these, using multithreading to optimise the performance of their code?
You can assume that third parties are responding as expected and query optimisation is already done with proper indexing.
Tomcat already has a very robust thread pooling algorithm. Making your own thread pool is likely causing deadlocks and slowing things down. The Java threading model is non-trivial, and you are likely causing more problems than you are solving. This is further evidenced by the fact that you get better performance relying on Tomcat's scheduling when you hit the controller directly.
High-volume services generally solve problems like this by scaling wide, keeping things as stateless as possible. This allows you to allocate many small servers to solve the problem much more efficiently than a single large server.
Debugging multi-threaded executions is not for the faint of heart. I would highly recommend you simplify things as much as possible. The most important bit about threading is to avoid mutable state. Mutable state is the bane of shared executions: moving memory around and forcing reads through to main memory can be very expensive, often costing far more than the savings due to threading.
Finally, the way you are describing your application, it's all I/O bound anyway. Why are you bothering with threading when it's likely I/O that's slowing it down?

IIS - Worker threads not increasing beyond a certain number even though CPU usage is less than 40 percent

We are running a web API hosted in IIS 10 on an 8-core machine with 16 GB of memory running Windows 10, and throwing a load of, say, 100 to 200 requests per second at the server through JMeter.
Individual transactions take less than 500 milliseconds. When we first apply the load, IIS threads grow to around the 150-160 mark (monitored through Resource Monitor and Performance Monitor) and throughput increases up to 22-24 transactions per second, but throughput and the number of threads stop growing beyond this point even though CPU usage is less than 40 per cent and we have enough physical memory available at the peak; Resource Monitor does not show any choking at the network or I/O level.
The web API is making calls to the Oracle database (3-4 select calls and 2-3 inserts/updates).
We fail to understand what is stopping IIS from growing its thread pool further to process more requests in parallel while all the resources, including processing power, memory and network, are available.
We have set up many performance counters as well; there is no queue build-up (that's probably because JMeter works in synchronous mode).
Also, we have tried to set the min and max thread settings through machine.config as well as the ThreadPool.SetMinThreads and SetMaxThreads APIs, but no difference was observed and it seems those settings are not taking effect.
It is important to mention that we are using synchronous calls/operations (no async/await). Someone has advised converting all our blocking I/O calls, e.g. database calls, to asynchronous mode to achieve more throughput, but my understanding is that if threads can't grow beyond this level, then making async calls might not help, or may even negatively impact throughput. Since our code base is huge, that would be a very costly activity in terms of time and effort, and we don't want to invest in it until we are sure it would really help. If someone has anything to share on these two problems, please do.
Below is a screenshot of the performance monitor.

Steps to improve throughput of Node JS server application

I have a very simple nodejs application that accepts JSON data (approx. 1 KB) via a POST request body. The response is sent back immediately to the client and the JSON is posted asynchronously to an Apache Kafka queue. The number of simultaneous requests can go as high as 10000 per second, which we are simulating using Apache JMeter running on three different machines. The target is to achieve an average response time of less than one second with no failed requests.
On a 4-core machine, the app handles up to 4015 requests per second without any failures. However, since the target is 10000 requests per second, we deployed the node app in a clustered environment.
Both clustering on the same machine and clustering across two different machines (as described here) were implemented. Nginx was used as a load balancer to round-robin the incoming requests between the two node instances. We expected a significant improvement in throughput (as documented here), but the results were the opposite.
The number of successful requests dropped to around 3100 requests per second.
My questions are:
What could have gone wrong in the clustered approach?
Is this even the right way to increase the throughput of Node application?
We also did a similar exercise with a Java web application in a Tomcat container, and it performed as expected: 4000 requests with a single instance and around 5000 successful requests in a cluster with two instances. This contradicts our belief that nodejs performs better than Tomcat. Is Tomcat generally better because of its thread-per-request model?
Thanks a lot in advance.
Per your request, I'll put my comments into an answer:
Clustering is generally the right approach, but whether or not it helps depends upon where your bottleneck is. You will need to do some measuring and some experiments to determine that. If you are CPU-bound and running on a multi-core computer, then clustering should help significantly. I wonder if your bottleneck is something besides CPU such as networking or other shared I/O or even Nginx? If that's the case, then you need to fix that before you would see the benefits of clustering.
Is tomcat generally better because of its thread per request model?
No. That's not a good generalization. If you are CPU-bound, then threading can help (and so can clustering with nodejs). But, if you are I/O bound, then threads are often more expensive than async I/O like nodejs because of the resource overhead of the threads themselves and the overhead of context switching between threads. Many apps are I/O bound which is one of the reasons node.js can be a very good choice for server design.
I forgot to mention that for http, we are using express instead of the native http provided by node. Hope it does not introduce an overhead to the request handling?
Express is very efficient and should not be the source of any of your issues.
As jfriend said, you need to find the bottlenecks.
One thing you can try is to reduce the per-request overhead/bandwidth by using sockets to pass the JSON, in particular with this library: https://github.com/uNetworking/uWebSockets.
The main reason is that an HTTP request is significantly heavier than a socket connection.
Good example: https://webcheerz.com/one-million-requests-per-second-node-js/
Lastly, you can also compress the JSON via HTTP gzip or a third-party module.
work on the weight ^^
Hope it helps!

How to determine the best number of threads in Tomcat?

How does one determine the best values for maxThreads, minSpareThreads, maxSpareThreads, acceptCount, etc. in Tomcat? Are there existing best practices?
I do understand this needs to be based on hardware (e.g. per core) and can only be a basis for further performance testing and optimization on specific hardware.
the "how many threads problem" is quite a big and complicated issue, and cannot be answered with a simple rule of thumb.
Considering how many cores you have is useful for multi threaded applications that tend to consume a lot of CPU, like number crunching and the like. This is rarely the case for a web-app, which is usually hogged not by CPU but by other factors.
One common limitation is lag between you and other external systems, most notably your DB. Each time a request arrive, it will probably query the database a number of times, which means streaming some bytes over a JDBC connection, then waiting for those bytes to arrive to the database (even is it's on localhost there is still a small lag), then waiting for the DB to consider our request, then wait for the database to process it (the database itself will be waiting for the disk to seek to a certain region) etc...
During all this time, the thread is idle, so another thread could easily use that CPU resources to do something useful. It's quite common to see 40% to 80% of time spent in waiting on DB response.
The same happens also on the other side of the connection. While a thread of yours is writing its output to the browser, the speed of the CLIENT connection may keep your thread idle waiting for the browser to ack that a certain packet has been received. (This was quite an issue some years ago, recent kernels and JVMs use larger buffers to prevent your threads for idling that way, however a reverse proxy in front of you web application server, even simply an httpd, can be really useful to avoid people with bad internet connection to act as DDOS attacks :) )
Considering these factors, the number of threads should be usually much more than the cores you have. Even on a simple dual or quad core server, you should configure a few dozens threads at least.
So, what is limiting the number of threads you can configure?
First of all, each thread consumes (or used to consume) a lot of resources. Each thread has a stack, which consumes RAM. Moreover, each thread will allocate objects on the heap to do its work, consuming more RAM, and the act of switching between threads (context switching) is quite heavy on the JVM/OS kernel.
This makes it hard to run a server with thousands of threads "smoothly".
Given this picture, there are a number of techniques (mostly: try, fail, tune, try again) to determine more or less how many threads your app will need:
1) Try to understand where your threads spend their time. There are a number of good tools; even the jvisualvm profiler can be a great help, as can a tracing aspect that produces summary timing stats. The more time they spend waiting for something external, the more threads you can spawn to use the CPU during those idle times.
2) Determine your RAM usage. Given that the JVM will use a certain amount of memory (most notably the permgen space, usually up to a hundred megabytes; again, jvisualvm will tell you) independently of how many threads you use, try running with one thread, then with ten, then with one hundred, while stressing the app with jmeter or whatever, and see how heap usage grows. That can pose a hard limit.
3) Try to determine a target. Each user request needs a thread to be handled. If your average response time is 200 ms per "get" (it is better not to count the loading of images, CSS and other static resources), then each thread is able to serve 4-5 pages per second. If each user is expected to "click" every 3-4 seconds (it depends: is it a browser game or a site with a lot of long texts?), then one thread will "serve 20 concurrent users", for whatever that is worth. If in the peak hour you have 500 users hitting your site within a minute, then you need enough threads to handle that (a back-of-the-envelope sketch follows this list).
4) Crash-test the high limit. Use jmeter, configure a server with a lot of threads on a spare virtual machine, and see how response time gets worse when you go over a certain limit. More than the hardware, the thread implementation of the underlying OS is important here, but no matter what, it will hit a point where the CPU spends more time trying to figure out which thread to run than actually running it, and that number is not so incredibly high.
5) Consider how threads will impact other components. Each thread will probably use one (or maybe more than one) connection to the database: is the database able to handle 50/100/500 concurrent connections? Even if you are using a sharded cluster of NoSQL servers, does the server farm offer enough bandwidth between those machines? What else will run on the same machine as the web-app server? Apache httpd? squid? the database itself? a local caching proxy to the database like mongos or memcached?
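To make the arithmetic of step 3 concrete, here is a back-of-the-envelope sketch (the numbers are the illustrative ones from the steps above, not measurements, and it reads the 500 peak-hour users as concurrently active):

```scala
object ThreadEstimate {
  def main(args: Array[String]): Unit = {
    val avgResponseSec  = 0.2                             // 200 ms per "get"
    val pagesPerThread  = 1 / avgResponseSec              // ~5 pages/second per thread
    val secondsPerClick = 4.0                             // each user clicks every 3-4 seconds
    val usersPerThread  = pagesPerThread * secondsPerClick // ~20 concurrent users per thread

    val peakUsers     = 500.0                             // peak-hour users, taken as concurrently active
    val threadsNeeded = math.ceil(peakUsers / usersPerThread).toInt

    println(s"~$threadsNeeded request threads to cover the peak (plus headroom)")
  }
}
```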
I've seen systems in production with only 4 threads + 4 spare threads, because the work done by that server was merely to resize images, so it was nearly 100% CPU intensive, and others configured on more or less the same hardware with a couple of hundred threads, because the webapp was doing a lot of SOAP calls to external systems and spending most of its time waiting for answers.
Once you've determined the approximate minimum and maximum number of threads that are optimal for your webapp, I usually configure it this way:
1) Based on the constraints on RAM, other external resources and experiments on context switching, there is an absolute maximum which must not be reached. So, use maxThreads to limit it to about half or 3/4 of that number.
2) If the application is reasonably fast (for example, it exposes REST web services that usually send a response in a few milliseconds), then you can configure a large acceptCount, up to the same number as maxThreads. If you have a load balancer in front of your web application server, set a small acceptCount instead; it's better for the load balancer to see unaccepted requests and switch to another server than to put users on hold on an already busy one.
3) Since starting a thread is (still) considered a heavy operation, use minSpareThreads to have a few threads ready when peak hours arrive. This again depends on the kind of load you are expecting. It's even reasonable to set minSpareThreads, maxSpareThreads and maxThreads so that an exact number of threads is always ready, never reclaimed, and performance is predictable. If you are running Tomcat on a dedicated machine, you can raise minSpareThreads and maxSpareThreads without any danger of hogging other processes; otherwise tune them down, because threads are resources shared with the rest of the processes running on the OS.
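For illustration only, the same knobs can be set programmatically on an embedded Tomcat; a sketch (the values are placeholders, not recommendations, and a classic installation would set the equivalent attributes on the Connector element in server.xml):

```scala
import org.apache.catalina.startup.Tomcat
import org.apache.coyote.AbstractProtocol

object TunedTomcat {
  def main(args: Array[String]): Unit = {
    val tomcat = new Tomcat()
    tomcat.setPort(8080)

    // Same knobs as the server.xml <Connector> attributes discussed above.
    tomcat.getConnector.getProtocolHandler match {
      case p: AbstractProtocol[_] =>
        p.setMaxThreads(200)     // hard cap derived from RAM / context-switch experiments
        p.setMinSpareThreads(25) // threads kept warm for peak hours
        p.setAcceptCount(100)    // backlog of not-yet-accepted connections
      case _ => // unusual protocol handler: fall back to server.xml
    }

    tomcat.start()
    tomcat.getServer.await()
  }
}
```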
