I have my own TensorFlow Serving server for multiple neural networks. Now I want to estimate the load on it. Does anybody know how to get the current number of requests in the queue in TensorFlow Serving? I tried using Prometheus, but there is no such option.
Actually, TF Serving doesn't have a request queue, which means it won't queue up requests when too many arrive at once.
The only thing TF Serving does is allocate a thread pool when the server is initialized.
When a request comes in, TF Serving uses a free thread to handle it. If there are no free threads, it returns an unavailable error, and the client should retry later.
You can find this information in the comments of tensorflow_serving/batching/streaming_batch_scheduler.h
What's more, you can set the number of threads with --rest_api_num_threads, or leave it unset and let TF Serving configure it automatically.
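To make the reject-and-retry behaviour described above concrete, here is a minimal sketch in Python. It is a toy model, not TF Serving code: `FixedPoolServer`, `Unavailable`, and `call_with_retry` are hypothetical names, and the slot count only stands in for the server's thread pool. With a real gRPC client the rejection would presumably surface as a `grpc.StatusCode.UNAVAILABLE` status instead of this exception.

```python
import threading
import time


class Unavailable(Exception):
    """Stand-in for an 'unavailable' error returned by the server."""


class FixedPoolServer:
    """Toy server: a fixed number of worker slots and no request queue."""

    def __init__(self, num_threads):
        self._slots = threading.Semaphore(num_threads)

    def handle(self, payload, work_seconds=0.05):
        # Non-blocking acquire: reject right away instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise Unavailable("no free worker thread")
        try:
            time.sleep(work_seconds)  # pretend to run inference
            return f"ok:{payload}"
        finally:
            self._slots.release()


def call_with_retry(server, payload, max_attempts=20, base_delay=0.01):
    """Retry on Unavailable with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return server.handle(payload)
        except Unavailable:
            time.sleep(min(base_delay * 2 ** attempt, 0.2))
    raise Unavailable(f"gave up after {max_attempts} attempts")


def run_clients(num_clients=8, num_threads=2):
    """Start num_clients concurrent callers against a small pool."""
    server = FixedPoolServer(num_threads)
    results = [None] * num_clients

    def client(i):
        results[i] = call_with_retry(server, i)

    threads = [threading.Thread(target=client, args=(i,)) for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Since the server keeps no queue to inspect, counting rejections and retries on the client side is one way to estimate how saturated the server is.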
Related
Using Ktor and Kotlin 1.5 to implement a REST service backed by Netty. A couple of things about this service:
"Work" takes non-trivial amount of time to complete.
A unique client endpoint sends multiple requests in parallel to this service.
There are only a handful of unique client endpoints.
The service is not scaling as expected. We ran a load test with parallel requests coming from a single client and we noticed that we only have two threads on the server actually processing the requests. It's not a resource starvation problem - there is plenty of network, memory, CPU, etc. and it doesn't matter how many requests we fire up in parallel - it's always two threads keeping busy, while the others are sitting idle.
Is there a parameter we can configure to increase the number of threads available to process requests for specific endpoints?
Netty uses what is called a non-blocking IO model (http://tutorials.jenkov.com/java-concurrency/single-threaded-concurrency.html).
In this model a single event-loop thread can handle many requests concurrently, as long as you follow best practices (i.e. do not block the event-loop thread).
You might need to check the following configuration options for Netty https://ktor.io/docs/engines.html#configure-engine
connectionGroupSize = x
workerGroupSize = y
callGroupSize = z
Default values usually are set rather low and tweaking them could be useful for the time-consuming 'work'. The exact values might vary depending on the available resources.
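The point about not blocking the event loop can be illustrated with a small sketch. This uses Python's asyncio as a stand-in for the Netty event loop, so it is an analogy and not Ktor code; `blocking_work` and the handler names are made up for the example. In Ktor the rough counterpart of the second handler would be moving blocking work onto another dispatcher, for example with `withContext(Dispatchers.IO)`.

```python
import asyncio
import time


def blocking_work(seconds):
    """Stand-in for slow, blocking 'work' (a blocking driver call, heavy computation)."""
    time.sleep(seconds)


async def handler_blocking(seconds):
    # Anti-pattern: the blocking call runs on the event-loop thread,
    # so no other handler can make progress until it returns.
    blocking_work(seconds)


async def handler_offloaded(seconds):
    # The blocking call runs on a worker thread; the event loop stays free.
    await asyncio.to_thread(blocking_work, seconds)


async def serve(handler, num_requests, seconds):
    """Run num_requests handlers concurrently and return elapsed wall time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(seconds) for _ in range(num_requests)))
    return time.perf_counter() - start


def compare(num_requests=4, seconds=0.05):
    """Return (elapsed when blocking the loop, elapsed when offloading)."""
    blocked = asyncio.run(serve(handler_blocking, num_requests, seconds))
    offloaded = asyncio.run(serve(handler_offloaded, num_requests, seconds))
    return blocked, offloaded
```

With the blocking handler the requests run one after another even though they were started together. A service that shows only a couple of busy threads regardless of load may be in a similar situation.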
Our infra for the web application looks like this:
Nodejs Web application -> GraphQL + Nodejs as middleware (BE for FE) -> Lots of BE services in ROR -> DB/ES etc.
We have seen the whole GraphQL + Nodejs middleware layer slow down whenever any of the crucial BE services slows down, and requests start queuing. When we compared this with the number of requests during the slow period, it was <1k requests, which is much lower than the claimed 10k concurrent requests that Nodejs can handle. Looking for pointers to debug this issue further.
Analysis done so far from our end:
As per Datadog and the other APM tools we use to monitor system health, CPU and memory usage show no abnormal behaviour when the servers slow down.
We are using various request tracking methods from the topmost layer to the last layer, and they confirm that request queuing happens only in this middleware layer.
How many simultaneous requests can I make with the request package?
I am expecting data back from every request confirming the request was received and processed successfully. Is this hardware or OS dependent? Where do I start looking?
One of the more recent versions of node.js does not enforce a limit on outgoing requests (older versions did). If you were literally trying to make millions of outgoing connections at the same time, then you would probably hit a limit on your own node.js server that would be OS specific. But, the practical limit is more likely going to be determined by the target host.
Since all your requests are being sent to the same host, the more likely limit will be determined by the server you are making the requests to. It will have some sort of limit for how many simultaneous requests it can have "in-flight" at the same time before it starts refusing new connections. What that number is depends entirely upon how the server is configured and built. For http://www.google.com, the number is probably hundreds of thousands or millions of requests because they have a huge server farm and requests are balanced across all of them. For some simple single CPU server, the limit would obviously be much smaller than that.
In addition, there will be little use in sending zillions of requests to a single CPU server because it won't be able to work on all of them at once anyway.
So, if you want to know what would work best for a given target host, you would have to set up an adjustable test harness so you could test scenarios where you send from 1, 2, 5, 10, 50, 100, 200, 500, 1000 at a time and see what the average response time is and where you start to get errors (if any).
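As a rough illustration of such an adjustable harness, here is a sketch in Python with asyncio (not Node.js). The target is simulated: `ToyTarget` and its `capacity` are assumptions standing in for a real host, and in practice `request()` would be replaced by an actual HTTP call.

```python
import asyncio
import time


class ToyTarget:
    """Simulated host that refuses requests beyond `capacity` in flight."""

    def __init__(self, capacity, service_seconds=0.001):
        self.capacity = capacity
        self.service_seconds = service_seconds
        self.in_flight = 0

    async def request(self):
        self.in_flight += 1
        try:
            if self.in_flight > self.capacity:
                raise ConnectionError("target refused the request")
            await asyncio.sleep(self.service_seconds)
        finally:
            self.in_flight -= 1


async def timed_request(target):
    """Return the latency in seconds, or None if the request failed."""
    start = time.perf_counter()
    try:
        await target.request()
    except ConnectionError:
        return None
    return time.perf_counter() - start


async def measure(target, level):
    """Fire `level` requests at once; report mean latency and error count."""
    outcomes = await asyncio.gather(*(timed_request(target) for _ in range(level)))
    latencies = [t for t in outcomes if t is not None]
    mean = sum(latencies) / len(latencies) if latencies else None
    return {"level": level, "mean_latency": mean, "errors": len(outcomes) - len(latencies)}


async def sweep(target, levels=(1, 2, 5, 10, 50, 100, 200, 500, 1000)):
    """Measure each concurrency level in turn."""
    return [await measure(target, level) for level in levels]
```

Running `asyncio.run(sweep(ToyTarget(capacity=100)))` returns one row per level. The level where errors first appear, or where mean latency starts climbing, marks the practical limit for that target.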
If you don't want to do any of that type of testing, then a reasonably safe choice that doesn't attempt to fully optimize things is to put no more than 5 requests in flight at the same time.
You can either build something yourself to keep N requests in flight at a time or you can use one of the existing libraries that will do that for you. The Bluebird promise library has a concurrency option on some of its functions such as Promise.map() which will automatically do that for you for whatever concurrency value you set. The async library also has something similar.
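For the build-it-yourself route, the core idea is a counting gate around each request. The sketch below shows it in Python with `asyncio.Semaphore`, not Bluebird's `Promise.map()`; `map_with_concurrency` and `fake_request` are hypothetical names, and the fake request only sleeps where a real one would make an HTTP call.

```python
import asyncio


async def map_with_concurrency(func, items, concurrency=5):
    """Apply async `func` to each item with at most `concurrency` calls in flight.

    Results are returned in the same order as `items`.
    """
    gate = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with gate:  # waits here while `concurrency` calls are running
            return await func(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def demo(num_requests=20, concurrency=5):
    """Run fake requests and report the results plus the peak overlap seen."""
    in_flight = 0
    peak = 0

    async def fake_request(i):
        # Stand-in for an HTTP call; tracks how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.005)
        in_flight -= 1
        return i * 2

    results = await map_with_concurrency(fake_request, range(num_requests), concurrency)
    return results, peak
```

The `peak` value returned by `demo()` never exceeds the chosen concurrency, which is the property the limiter is meant to guarantee.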
If you want more specific help crafting the code to manage how many requests are in flight at a time or to build a test harness for it, please show us some of your code for the source of all the requests so we have some idea how that works (if it's a giant array of requests or what the source of the URLs is).
What will happen:
If I write a server application backed with a thread pool of millions of threads and it gets millions of requests per second
I have worked on developing web services. The web service was deployed on 1000's of computers with a front end load balancer. The load balancer's job was to distribute the traffic amongst the servers that actually process the web requests.
So my question is: since the process running inside the load balancer itself HAS to be single threaded to listen for web requests on a port, how does it handle accepting millions of requests per second? The load balancer might be busy delegating a task, so what happens to a request that arrives at that instant?
In my opinion, not all clients will be handled, since there will only be a single request handler thread to pass incoming requests on to the thread pool.
This way no multi-threaded server should ever work.
I wonder how Facebook/Amazon handle millions of requests per second.
You are right, it won't work. There is a limit to how much a single computer can process, which is nothing to do with how many threads it is running.
The way Amazon and Facebook etc handle it is to have hundreds or thousands of servers spread throughout the world and then they pass the requests out to those various servers. This is a massive subject though so if you want to know more I suggest you read up on distributed computing and come back if you have specific questions.
With the edit, the question makes much more sense. It is not hard to distribute millions of requests per second. A distribution operation should take somewhere in the vicinity of tens of nanoseconds and would merely consist of pushing the received socket into the queue. No biggie.
As soon as it's done, the balancer is ready to accept the next request.
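The accept-then-enqueue pattern described here can be sketched in a few lines of Python. This is only an illustration with threads and an in-memory queue: the "sockets" are plain integers, `run_balancer` is a made-up name, and a real balancer would hand off accepted connections to other machines, not to local threads.

```python
import queue
import threading


def run_balancer(num_requests=1000, num_workers=4):
    """One acceptor thread enqueues requests; a worker pool processes them."""
    pending = queue.Queue()
    handled_by = [None] * num_requests
    STOP = object()  # sentinel telling a worker to exit

    def acceptor():
        # Accepting is cheap: just hand the "socket" (an int here) to the queue.
        for request_id in range(num_requests):
            pending.put(request_id)
        for _ in range(num_workers):
            pending.put(STOP)

    def worker(worker_id):
        while True:
            request_id = pending.get()
            if request_id is STOP:
                return
            handled_by[request_id] = worker_id  # the slow part would go here

    workers = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in workers:
        t.start()
    accept_thread = threading.Thread(target=acceptor)
    accept_thread.start()
    accept_thread.join()
    for t in workers:
        t.join()
    return handled_by
```

The acceptor never waits for a request to be processed, so its throughput is bounded by the cost of a queue push and not by the cost of the work itself.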
Based on the IIS architecture, a request from a client hitting IIS passes through the HTTP pipeline, specifically through each HttpModule, and finally reaches the respective HttpHandler and then the worker process. Does this happen serially, one after the other?
Say 10,000 requests hit the web server concurrently within a second: is each request processed one by one? If the web server has a multi-core CPU and plenty of memory, does this help IIS handle the requests simultaneously?
Is there any web server capable of handling requests in parallel?
I just replied to this guy's question, which is very similar, and the answer is the same:
IIS and HTTP pipelining, processing requests in parallel