Node.js request limit per second while using cluster module

I'm using the cluster module to run multiple workers that fetch data from an API, process it and write an aggregate to the DB. The problem is that the API limits the number of requests per second. I'm now looking for a way to enforce that limit across all workers.
I'm thankful for every hint to solve this.

If the limit is a number of requests per second, you could keep track of how many requests are left in the master process. Before sending a request, each worker asks the master whether it may send, and the master grants the request only while requests are still available for the current second. The cluster module gives the master and its workers a built-in message channel for this (process.send in the worker, worker.on('message') and worker.send in the master).
At the end of each second, the master resets its counter to the full number of requests allowed.
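The counting logic the master would hold can be sketched independently of the messaging. The following is a minimal sketch in Python rather than Node.js, and the names RequestBudget and try_acquire are made up for illustration. It assumes a fixed one-second window that resets when the clock moves into a new second, which is one possible reading of "reset at the end of each second".

```python
import time


class RequestBudget:
    """Fixed-window counter held by the coordinating (master) process."""

    def __init__(self, limit_per_second, clock=time.monotonic):
        self.limit = limit_per_second
        self.clock = clock
        self.window = int(clock())        # the current one-second window
        self.remaining = limit_per_second

    def try_acquire(self):
        """Return True if a worker may send one request right now."""
        now = int(self.clock())
        if now != self.window:            # a new second has started: reset
            self.window = now
            self.remaining = self.limit
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


# Usage with a fake clock so the behaviour is easy to follow.
fake_now = [100.0]
budget = RequestBudget(limit_per_second=3, clock=lambda: fake_now[0])
first_second = [budget.try_acquire() for _ in range(4)]  # 4th is refused
fake_now[0] = 101.2                                       # next second
next_second = budget.try_acquire()
```

In a real cluster setup the master would call something like try_acquire when a worker's message arrives and reply with the result. A worker that is refused would have to wait and ask again.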
This approach gets closest to the maximum. A much simpler approach is to start N workers and allow each of them to make K requests per second, where K * N is just under the number of requests allowed per second. The safest way to do this, and the least likely to hit the limit, is a setTimeout between the end of one request and the start of the next, but that does not account for the time each request takes, so the actual rate ends up below K. The next best option is for each worker to fire its K requests at the start of the second and not fire again until the next second.
The safest solution overall is to stay well away from the limit, for example by sticking to at most half of the allowed number of requests per second.
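As a small worked example of the K * N rule, here is a sketch in Python. The function name and the headroom_percent parameter are assumptions made for illustration; a limit of 100 requests per second and 4 workers are made-up numbers.

```python
def per_worker_quota(limit_per_second, workers, headroom_percent=90):
    """Requests per second (K) that each of N workers may send.

    headroom_percent below 100 keeps K * N under the limit;
    50 is the conservative half-the-limit setting.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Integer arithmetic so the result is always rounded down.
    return limit_per_second * headroom_percent // (100 * workers)


k = per_worker_quota(limit_per_second=100, workers=4)          # 22
k_safe = per_worker_quota(100, 4, headroom_percent=50)         # 12
```

Rounding down means K * N can be a little lower than the chosen headroom (88 instead of 90 here), which errs on the safe side.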

Related

Why is Python consistently struggling to keep up with constant generation of asyncio tasks?

I have a Python project with a server that distributes work to one or more clients. Each client is given a number of assignments which contain parameters for querying a target API. This includes a maximum number of requests per second they can make with a given API key. The clients process the response and send the results back to the server to store into a database.
Both the server and clients use Tornado for asynchronous networking. My initial implementation for the clients relied on the PeriodicCallback to ensure that n-number of calls to the API would occur. I thought that this was working properly as my tests would last 1-2 minutes.
I added some telemetry to collect statistics on performance and noticed that the clients were actually having issues after almost exactly 2 minutes of runtime. I had set the API requests to 20 per second (the maximum allowed by the API itself) which the clients could reliably hit. However, after 2 minutes performance would fluctuate between 12 and 18 requests per second. The number of active tasks steadily increased until it hit the maximum amount of active assignments (100) given from the server and the HTTP request time to the API was reported by Tornado to go from 0.2-0.5 seconds to 6-10 seconds. Performance is steady if I only do 14 requests per second. Anything higher than 15 requests will experience issues 2-3 minutes after starting. Logs can be seen here. Notice how the column of "Active Queries" is steady until 01:19:26. I've truncated the log to demonstrate
I believed the issue was the use of a single process on the client to handle both communication to the server and the API. I proceeded to split the primary process into several different processes. One handles all communication to the server, one (or more) handles queries to the API, another processes API responses into a flattened class, and finally a multiprocessing Manager for Queues. The performance issues were still present.
I thought that, perhaps, Tornado was the bottleneck and decided to refactor. I chose aiohttp and uvloop. I split the primary process in a similar manner to that in the previous attempt. Unfortunately, performance issues are unchanged.
I took both refactors and enabled them to split work into several querying processes. However, no matter how much you split the work, you still encounter problems after 2-3 minutes.
I am using both Python 3.7 and 3.8 on MacOS and Linux.
At this point, it does not appear to be a limitation of a single package. I've thought about the following:
Python's asyncio library cannot handle more than 15 coroutines/tasks being generated per second
I doubt that this is true given that different libraries claim to be able to handle several thousand messages per second simultaneously. Also, we can hit 20 requests per second just fine at the start with very consistent results.
The API is unable to handle more than 15 requests from a single client IP
This is unlikely as I am not the only user of the API and I can request 20 times per second fairly consistently over an extended period of time if I over-subscribe processes to query from the API.
There is a system configuration causing the limitation
I've tried both MacOS and Debian which yield the same results. It's possible that it's a *nix problem.
Variations in responses cause a backlog which grows linearly until it cannot be tackled fast enough
Sometimes responses from the API grow and shrink between 0.2 and 1.2 seconds. The number of active tasks returned by asyncio.all_tasks remains consistent in the telemetry data. If this were true, we wouldn't be consistently encountering the issue at the same time every time.
We're overtaxing the hardware with the number of tasks generated per second and causing thermal throttling
Although CPU temperatures spike, neither MacOS nor Linux report any thermal throttling in the logs. We are not hitting more than 80% CPU utilization on a single core.
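One way to sanity-check the backlog idea is Little's law (requests in flight = request rate x mean response time). The post itself does not apply it. The sketch below only plugs in the figures quoted above (20 requests per second, a cap of 100 active assignments, response times of 0.2-0.5 seconds rising to 6-10 seconds), and the function names are made up.

```python
def in_flight(rate_per_second, latency_seconds):
    """Little's law: average number of requests in progress."""
    return rate_per_second * latency_seconds


def max_rate(cap_in_flight, latency_seconds):
    """Highest sustainable rate when at most cap_in_flight requests are open."""
    return cap_in_flight / latency_seconds


healthy = in_flight(20, 0.5)        # open requests at 0.5 s per response
degraded_fast = max_rate(100, 6)    # rate ceiling at 6 s per response
degraded_slow = max_rate(100, 10)   # rate ceiling at 10 s per response
```

Under these assumptions, about 10 requests are open while responses take 0.5 seconds. Once responses take 6-10 seconds, the cap of 100 active assignments alone would limit throughput to roughly 10-17 requests per second. This does not say why the response time rises in the first place.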
At this point, I'm not sure what's causing it and have considered refactoring the clients into a different language (perhaps C++ with Boost libraries). Before I dive into something so foolish, I wanted to ask if I'm missing something simple.
Conclusion
Performance appears to vary wildly depending on time of day. It's likely to be the API.
How this conclusion was made
I created a new project to demonstrate the capabilities of asyncio and determine whether it's the bottleneck. This project takes two websites, one acting as the baseline and the other being the target API, and runs through different methods of testing:
Spawn one process per core, pass a semaphore, and query up to n-times per second
Create a single event loop and create n-number of tasks per second
Create multiple processes with an event loop each to distribute the work, with each loop performing (n-number / processes) tasks per second
(Note that spawning processes is incredibly slow and often commented out unless using high-end desktop processors with 12 or more cores)
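The second method (a single event loop creating n tasks per second) could look roughly like the sketch below. This is not the code from the test project. paced_run and fake_fetch are hypothetical names, and fake_fetch merely sleeps where a real HTTP client call would go.

```python
import asyncio


async def paced_run(fetch, per_second, total):
    """Start `total` calls to `fetch`, `per_second` of them each second."""
    loop = asyncio.get_running_loop()
    interval = 1.0 / per_second
    start = loop.time()
    tasks = []
    for i in range(total):
        # Sleep until this task's slot so the start rate does not drift
        # with the time spent creating earlier tasks.
        delay = start + i * interval - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(fetch(i)))
    # Results come back in the order the tasks were created.
    return await asyncio.gather(*tasks)


async def fake_fetch(i):
    """Stand-in for an HTTP request; a real client call would go here."""
    await asyncio.sleep(0.01)
    return i


results = asyncio.run(paced_run(fake_fetch, per_second=100, total=10))
```

Tasks are started on a fixed schedule regardless of how long earlier ones take, so slow responses show up as a growing number of open tasks rather than as a lower start rate.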
The baseline website would be queried up to 50 times per second. asyncio could complete 30 tasks per second reliably for an extended period, with each task completing its run in 0.01 to 0.02 seconds. Responses were very consistent.
The target website would be queried up to 20 times per second. Sometimes asyncio would struggle despite circumstances being identical (JSON handling, dumping response data to queue, returning immediately, no CPU-bound processing). However, results varied between tests and could not always be reproduced. Responses would be under 0.4 seconds initially but quickly increase to 4-10 seconds per request. 10-20 requests would return as complete per second.
As an alternative method, I chose a parent URI for the target website. This URI wouldn't have a large query to their database but instead be served back with a static JSON response. Responses bounced between 0.06 seconds to 2.5-4.5 seconds. However, 30-40 responses would be completed per second.
Splitting requests across processes with their own event loop would decrease response time in the upper-bound range by almost half, but still took more than one second each to complete.
The inability to reproduce consistent results every time from the target website would indicate that it's a performance issue on their end.

Occasional duplicate request using jmeter

I'm using JMeter 4.0 trying to create a stress test. The purpose is to emulate the types of requests we receive in production, which is generally an array of requests of different types with a certain frequency and occasionally (1 in 1000) duplicate requests of the same type within milliseconds of each other.
I've managed to create a thread group emulating frequent requests of different types and a second thread group emulating duplicate requests (using synchronizing timer to ensure the requests fire off together).
I'm almost finished. My only problem is that there is no relationship between the thread groups whatsoever. If I wanted to perform a duplicate request once every 1000 requests, I'd need to know how long it takes to perform an average request (which is complicated by the fact that there are several request types) and calculate the time it would require for roughly 1000 requests to be made, and add an appropriate constant timer in the other thread group.
This isn't ideal. I'll settle for this if I must, but I was hoping the bright minds of Stack Overflow could shed some light on my issue.
Some ideas I've had:
Add a run counter which cycles every 1000 normal requests and once run counter hits 1000, I perform a second request (though it would be under the same thread and after I've received the response from the first). Could this be made to work using a synchronized timer?
Use a constant throughput timer with "all active threads (shared)" set and its samples per minute set to 1000.
Is there a better way still? The actual requests are HTTP requests, though there are several steps prior in preparation of the message to send. I'm already using a constant throughput timer in the first thread group (random service requests) to maintain a specific amount of requests per minute, so I'm not sure if adding a second constant throughput timer in the other thread group would create issues.
Thank you for your time.
You can add an If Controller whose condition is true for one in every 1000 threads:
${__jexl3(${__threadNum} % 1000 == 0)}
and execute your duplicate HTTP Request inside the If Controller.
__threadNum returns the current thread (user) number.

How does Cassandra handle a blocking execute statement in the DataStax Java driver

Blocking execute method from com.datastax.driver.core.Session
public ResultSet execute(Statement statement);
Comment on this method:
This method blocks until at least some result has been received from
the database. However, for SELECT queries, it does not guarantee that
the result has been received in full. But it does guarantee that some
response has been received from the database, and in particular
guarantee that if the request is invalid, an exception will be thrown
by this method.
Non-blocking execute method from com.datastax.driver.core.Session
public ResultSetFuture executeAsync(Statement statement);
This method does not block. It returns as soon as the query has been
passed to the underlying network stack. In particular, returning from
this method does not guarantee that the query is valid or has even
been submitted to a live node. Any exception pertaining to the failure
of the query will be thrown when accessing the {@link
ResultSetFuture}.
I have two questions about them, and it would be great if you could help me understand them.
Let's say I have 1 million records and I want all of them to arrive in the database (without any loss).
Question 1: Suppose I have n threads, each with the same number of records to send to the database. All of them keep sending insert queries to Cassandra using the blocking execute call. If I increase n, will that also reduce the time I need to insert all records into Cassandra?
Will this cause performance problems for Cassandra? Does Cassandra have to make sure that, for every inserted record, all the nodes in the cluster know about the new record immediately in order to keep the data consistent? (I assume a Cassandra node won't even consider using the local machine time to control the record insertion time.)
Question 2: With non-blocking execute, how can I make sure that all of the insertions succeed? The only way I know is to wait on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is non-blocking execute more likely to fail than blocking execute?
Thank you very much for your help.
Suppose I have n threads, each with the same number of records to send to the database. All of them keep sending insert queries to Cassandra using the blocking execute call. If I increase n, will that also reduce the time I need to insert all records into Cassandra?
To some extent. Let's set the client implementation details aside for a moment and look at this in terms of the number of concurrent requests, since you don't need a thread for each ongoing request if you use executeAsync. In my testing I have found that while there is a lot of value in having a high number of concurrent requests, there is a threshold beyond which returns diminish or performance starts to degrade. My general rule of thumb is (number of nodes * native_transport_max_threads (default: 128) * 2), but you may find better results with more or less.
The idea here is that there is not much value in enqueuing more requests than Cassandra will handle at a time. By reducing the number of in-flight requests, you limit unnecessary congestion on the connections between your driver client and Cassandra.
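The rule of thumb is simple arithmetic. Here it is as a short Python sketch, with a three-node cluster as a made-up example; the function name is an assumption, and the default of 128 is the one quoted above.

```python
def max_inflight_requests(nodes, native_transport_max_threads=128, factor=2):
    """Rule-of-thumb ceiling for concurrent requests against a cluster."""
    return nodes * native_transport_max_threads * factor


three_node_cluster = max_inflight_requests(3)   # 3 * 128 * 2 = 768
```

This is a starting point to tune from, not a hard limit.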
Question 2: With non-blocking execute, how can I make sure that all of the insertions succeed? The only way I know is to wait on the ResultSetFuture to check the execution of the insert query. Is there a better way? Is non-blocking execute more likely to fail than blocking execute?
Waiting on the ResultSetFuture via get is one route, but if you are developing a fully async application, you want to avoid blocking as much as possible. Using Guava, your two best weapons are Futures.addCallback and Futures.transform.
Futures.addCallback allows you to register a FutureCallback that gets executed when the driver has received the response. onSuccess gets executed in the success case, onFailure otherwise.
Futures.transform allows you to effectively map the returned ResultSetFuture into something else. For example if you only want the value of 1 column you could use it to transform ListenableFuture<ResultSet> to a ListenableFuture<String> without having to block in your code on the ResultSetFuture and then getting the String value.
In the context of writing a dataloader program, you could do something like the following:
1. To keep things simple, use a Semaphore or some other construct with a fixed number of permits (that will be your maximum number of in-flight requests). Whenever you go to submit a query using executeAsync, acquire a permit. You should really only need 1 thread (but may want to introduce a pool sized to the number of CPU cores) that acquires the permits from the Semaphore and executes queries. It will just block on acquire until there is an available permit.
2. Use Futures.addCallback for the future returned from executeAsync. The callback should call Semaphore.release() in both the onSuccess and onFailure cases. Releasing a permit allows your thread in step 1 to continue and submit the next request.
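The two steps can be illustrated with a runnable analogue in Python. This is not the DataStax driver API: a ThreadPoolExecutor stands in for executeAsync, add_done_callback stands in for Futures.addCallback, and fake_insert, MAX_INFLIGHT and the simulated failures are all made up for the sketch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_INFLIGHT = 4                             # number of permits (assumed value)
permits = threading.BoundedSemaphore(MAX_INFLIGHT)
pool = ThreadPoolExecutor(max_workers=16)    # stands in for the driver's I/O

lock = threading.Lock()
stats = {"inflight": 0, "peak": 0, "ok": 0, "failed": 0}


def fake_insert(row):
    """Hypothetical stand-in for one INSERT; fails on every 10th row."""
    with lock:
        stats["inflight"] += 1
        stats["peak"] = max(stats["peak"], stats["inflight"])
    time.sleep(0.005)
    with lock:
        stats["inflight"] -= 1
    if row % 10 == 9:
        raise RuntimeError("simulated write failure for row %d" % row)


def on_done(future):
    """Plays the role of onSuccess/onFailure: always release the permit."""
    with lock:
        if future.exception() is None:
            stats["ok"] += 1
        else:
            stats["failed"] += 1
    permits.release()


# Step 1: a single submitting thread that blocks on acquire.
for row in range(50):
    permits.acquire()                        # waits while MAX_INFLIGHT are pending
    # Step 2: the callback releases the permit on success and on failure.
    pool.submit(fake_insert, row).add_done_callback(on_done)

pool.shutdown(wait=True)                     # wait for the remaining requests
```

Because the permit is released in both outcomes, a failed write cannot leak a permit and stall the submitting thread. Counting failures in the callback is also where retries or logging would hook in.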
To further improve throughput, you might want to consider using BatchStatement and submitting requests in batches. This is a good option if you keep your batches small (50-250 is a good number) and if your inserts in a batch all share the same partition key.
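Grouping rows so that each batch shares one partition key and stays small can be sketched as follows. This is plain Python with no driver calls; batches_by_partition, the row layout and the sensor names are hypothetical, and the default batch size of 100 is just a value inside the 50-250 range mentioned above.

```python
from collections import defaultdict


def batches_by_partition(rows, key, max_batch_size=100):
    """Group rows by partition key, then cut each group into small batches."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    for partition_rows in groups.values():
        for start in range(0, len(partition_rows), max_batch_size):
            yield partition_rows[start:start + max_batch_size]


# Hypothetical rows: (partition_key, value)
rows = [("sensor-a", i) for i in range(5)] + [("sensor-b", i) for i in range(2)]
batches = list(batches_by_partition(rows, key=lambda r: r[0], max_batch_size=2))
sizes = [len(b) for b in batches]            # [2, 2, 1, 2]
```

Each yielded list would then become one BatchStatement. Note that this version holds all rows in memory before batching, which may not suit a very large load.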
Besides the above answer,
It looks like execute() calls executeAsync(statement).getUninterruptibly(). So whether you manage your own pool of n threads that call execute() and block until each execution completes, or call executeAsync() on all records, Cassandra-side performance should be roughly the same, depending on execution time/count and timeouts.
The executions all run on connections borrowed from a pool. Each execution has a streamId on the client side, and you are notified through the future when the response for that streamId comes back. Concurrency is limited by the total requests per connection on the client side and by the read threads on each node that was picked to execute your request. Anything above that is buffered in a queue (not blocked), limited by the connection's maxQueueSize and maxRequestsPerConnection; anything beyond that should fail. The beauty of this is that executeAsync() does not run on a new thread per request/execution.
So there has to be a limit on how many requests can run via execute() or executeAsync(); with execute() and a bounded thread pool you avoid going beyond these limits.
Performance-wise, you will start seeing a penalty beyond what each node can handle, so execute() with a well-sized pool makes sense to me. Even better, use a reactive architecture to avoid creating many threads that do nothing but wait; a large number of threads causes wasted context switching on the client side. For a smaller number of requests, executeAsync() will be better because it avoids thread pools.
DefaultResultSetFuture future = new DefaultResultSetFuture(..., makeRequestMessage(statement, null));
new RequestHandler(this, future, statement).sendRequest();

What is a reasonable amount of time to wait when making concurrent requests?

I'm working on a crawler and I've noticed that setting the wait time to 1 minute per request has made the application more reliable: I now get fewer connection resets. Can you recommend a reasonable amount of time to wait? I think 1 minute is very much a belt-and-braces approach, and ideally I would like to reduce it.

jMeter adding threads/users (read from CSV Data) to a running thread group

My problem is quite complex.
The goal is to test how our web site responds to an increasing number of requests from different users.
I can take users/passwords from a CSV data file and launch an HTTP request (with variables read from the file).
However, I don't want to run the thread group with all users at the same time. Instead I want to loop and, at every iteration, add another user from the file to the running thread group (after some delay).
This seems very difficult to do with JMeter. Perhaps I'd need to call a custom Java class?
If I understand you correctly, you should just use Ramp-up. This parameter controls how fast your test reaches its maximum thread count.
As explained in JMeter documentation,
The ramp-up period tells JMeter how long to take to "ramp-up" to the
full number of threads chosen. If 10 threads are used, and the ramp-up
period is 100 seconds, then JMeter will take 100 seconds to get all 10
threads up and running. Each thread will start 10 (100/10) seconds
after the previous thread was begun. If there are 30 threads and a
ramp-up period of 120 seconds, then each successive thread will be
delayed by 4 seconds.
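The arithmetic in the quoted documentation can be checked with a few lines of Python. The function name is made up, and the sketch assumes the first thread starts at time 0 and the rest follow at equal intervals.

```python
def thread_start_offsets(threads, ramp_up_seconds):
    """Start time, in seconds, of each thread under a linear ramp-up."""
    step = ramp_up_seconds / threads
    return [i * step for i in range(threads)]


ten = thread_start_offsets(10, 100)      # one new thread every 10 seconds
thirty = thread_start_offsets(30, 120)   # one new thread every 4 seconds
```

To add one user every d seconds for U users from the CSV file, the ramp-up period would therefore be set to U * d.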
The Throughput Shaping Timer may also be helpful. You can use it to schedule the request load over time.
As Jay stated, you can use ramp up to try to control this, though I am not sure the result will be what you are after...though it will add the startup delay. If you have a single thread then each row of the CSV will be processed one at a time, in order.
You can set the thread group to 1 thread and loop forever. In the CSV config you can set a single pass and to terminate the thread on EOF.
CSV Data Set Config-->Recycle on EOF = False
CSV Data Set Config-->Stop thread on EOF = True
Thread Group-->Loop Count = Forever
Also keep in mind that by using BSF and Beanshell you can exert a great deal of control over JMeter.
You should check out UltimateThreadGroup from jmeter-plugins.
