Bash - cURL GET requests are slow (relatively) - Linux

I'm constantly querying a server for a list of items. Usually the response is "offers:" with an empty list; 99.999% of the time that is all I get back, so the payload is quite small.
My ping to the server is a rock-solid 35 ms, with jitter of about 0.2 ms.
But when running a single loop of requests, I only get an update about every 300 ms.
My goal is to make this as fast as I possibly can, so I parallelized it, running 8 of the loops. But now I see that four or more of the loops frequently seem to run simultaneously, giving me 4 requests within 10 or 20 ms of each other and leaving periods of up to 200 ms with no requests being processed.
The server I'm interfacing with is of unknown specs, but it belongs to a large company, and I'd assume it is more than capable of handling anything I could possibly throw at it.
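The gap between a 35 ms round trip and a ~300 ms update interval is usually dominated by per-request connection setup (DNS lookup, TCP handshake, plus TLS negotiation if the endpoint is HTTPS), which a fresh curl invocation pays every time. Below is a minimal sketch of the same polling pattern with a reused connection, written in Python rather than bash purely for illustration; the endpoint URL and polling interval are placeholders, not the asker's actual setup.

    # Minimal sketch (not the asker's script): poll one hypothetical endpoint
    # over a single reused connection instead of paying connection setup on
    # every request.
    import time
    import requests  # assumes the 'requests' package is installed

    URL = "https://example.com/api/offers"  # hypothetical endpoint

    def poll_forever(interval_s: float = 0.05) -> None:
        # A Session keeps the TCP/TLS connection alive between requests,
        # so each poll costs roughly one round trip instead of several.
        with requests.Session() as session:
            while True:
                start = time.monotonic()
                resp = session.get(URL, timeout=5)
                elapsed = time.monotonic() - start
                print(f"{elapsed * 1000:.1f} ms -> {resp.json()}")
                # Pace the polls deliberately; if several pollers run in
                # parallel, give each a different phase offset to avoid the
                # "4 requests in 10 ms, then a 200 ms gap" bunching effect.
                time.sleep(interval_s)

    if __name__ == "__main__":
        poll_forever()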

Related

Node's epoll behaviour on socket

I wrote a simple node.js program that sends out 1000 HTTP requests, records when these requests come back, and just increments a counter by 1 on each response. The endpoint is very lightweight and just returns a simple HTTP response without any heavy HTML. I recorded that it returns around 200-300 requests per second for 3 seconds. On the other hand, when I start this same process 3 times (4 total processes = the number of my available cores), I notice that it performs 4x faster, so I receive approximately 300 * 4 requests per second back. I want to understand what happens when epoll gets triggered upon the kernel notifying the poll about a new file descriptor being ready (a new TCP payload arrived). Does V8 take this file descriptor and read/manipulate the data, and where is the actual bottleneck? Is it in computing and unpacking the payload? It seems that only 1 core is working on sending/receiving these requests for this 1 process, and when I start multiple (matching my core count), it performs faster.
where is the actual bottleneck?
Sounds like you're interested in profiling your application. See https://nodejs.org/en/docs/guides/simple-profiling/ for the official documentation on that.
What I can say up front is that V8 does not deal with file descriptors or epoll, that's all Node territory. I don't know Node's internals myself (only V8), but I do know that the code is on https://github.com/nodejs/node, so you can look up how any given feature is implemented.

Why is Python consistently struggling to keep up with constant generation of asyncio tasks?

I have a Python project with a server that distributes work to one or more clients. Each client is given a number of assignments which contain parameters for querying a target API. This includes a maximum number of requests per second they can make with a given API key. The clients process the response and send the results back to the server to store into a database.
Both the server and clients use Tornado for asynchronous networking. My initial implementation for the clients relied on the PeriodicCallback to ensure that n-number of calls to the API would occur. I thought that this was working properly as my tests would last 1-2 minutes.
I added some telemetry to collect statistics on performance and noticed that the clients were actually having issues after almost exactly 2 minutes of runtime. I had set the API requests to 20 per second (the maximum allowed by the API itself), which the clients could reliably hit. However, after 2 minutes performance would fluctuate between 12 and 18 requests per second. The number of active tasks steadily increased until it hit the maximum number of active assignments (100) given by the server, and the HTTP request time to the API was reported by Tornado to go from 0.2-0.5 seconds to 6-10 seconds. Performance is steady if I only do 14 requests per second. Anything higher than 15 requests per second will experience issues 2-3 minutes after starting. Logs can be seen here. Notice how the "Active Queries" column is steady until 01:19:26. I've truncated the log to demonstrate the change.
I believed the issue was the use of a single process on the client to handle both communication to the server and the API. I proceeded to split the primary process into several different processes. One handles all communication to the server, one (or more) handles queries to the API, another processes API responses into a flattened class, and finally a multiprocessing Manager for Queues. The performance issues were still present.
I thought that, perhaps, Tornado was the bottleneck and decided to refactor. I chose aiohttp and uvloop. I split the primary process in a similar manner to that in the previous attempt. Unfortunately, performance issues are unchanged.
I took both refactors and enabled them to split work into several querying processes. However, no matter how much you split the work, you still encounter problems after 2-3 minutes.
I am using both Python 3.7 and 3.8 on MacOS and Linux.
At this point, it does not appear to be a limitation of a single package. I've thought about the following:
Python's asyncio library cannot handle more than 15 coroutines/tasks being generated per second
I doubt that this is true given that different libraries claim to be able to handle several thousand messages per second simultaneously. Also, we can hit 20 requests per second just fine at the start with very consistent results.
The API is unable to handle more than 15 requests from a single client IP
This is unlikely as I am not the only user of the API and I can request 20 times per second fairly consistently over an extended period of time if I over-subscribe processes to query from the API.
There is a system configuration causing the limitation
I've tried both MacOS and Debian, which yield the same results. It's possible that it's a *nix problem.
Variations in responses cause a backlog which grows linearly until it cannot be tackled fast enough
Response times from the API sometimes swing between 0.2 and 1.2 seconds. The number of active tasks returned by asyncio.all_tasks remains consistent in the telemetry data. If this were true, we wouldn't be consistently encountering the issue at the same point in time every run.
We're overtaxing the hardware with the number of tasks generated per second and causing thermal throttling
Although CPU temperatures spike, neither MacOS nor Linux report any thermal throttling in the logs. We are not hitting more than 80% CPU utilization on a single core.
At this point, I'm not sure what's causing it and have considered refactoring the clients into a different language (perhaps C++ with Boost libraries). Before I dive into something so foolish, I wanted to ask if I'm missing something simple.
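For reference, the client pattern described above (spawn roughly 20 request tasks per second, dump each response on a queue, and keep CPU-bound work out of the task) can be reproduced in a few dozen lines. The following is a minimal sketch with aiohttp, not the asker's actual client; the API URL, rate, and duration are placeholders.

    # Minimal sketch of "spawn n request tasks per second" with aiohttp.
    # The URL and rate are placeholders, not the asker's actual API or limits.
    import asyncio
    import aiohttp

    API_URL = "https://example.com/api/query"  # hypothetical
    REQUESTS_PER_SECOND = 20

    async def fetch(session: aiohttp.ClientSession, results: asyncio.Queue) -> None:
        # Dump the parsed response on a queue and return immediately,
        # keeping the task free of CPU-bound work.
        async with session.get(API_URL) as resp:
            await results.put(await resp.json())

    async def main(duration_s: int = 180) -> None:
        results: asyncio.Queue = asyncio.Queue()
        async with aiohttp.ClientSession() as session:
            tasks = []
            for _ in range(duration_s * REQUESTS_PER_SECOND):
                tasks.append(asyncio.create_task(fetch(session, results)))
                # One new task every 1/20 s; watching len(asyncio.all_tasks())
                # here shows whether tasks start piling up after ~2 minutes.
                await asyncio.sleep(1 / REQUESTS_PER_SECOND)
            await asyncio.gather(*tasks, return_exceptions=True)
        print(f"completed responses: {results.qsize()}")

    if __name__ == "__main__":
        asyncio.run(main())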
Conclusion
Performance appears to vary wildly depending on time of day. It's likely to be the API.
How this conclusion was made
I created a new project to demonstrate the capabilities of asyncio and determine if it's the bottleneck. This project takes two websites, one to act as the baseline and the other is the target API, and runs through different methods of testing:
Spawn one process per core, pass a semaphore, and query up to n-times per second
Create a single event loop and create n-number of tasks per second
Create multiple processes, each with its own event loop, to distribute the work, with each loop performing (n-number / processes) tasks per second (a sketch of this variant follows the note below)
(Note that spawning processes is incredibly slow and often commented out unless using high-end desktop processors with 12 or more cores)
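A minimal sketch of that multi-process variant follows, again with a placeholder URL, rate, and process count rather than the original test harness; each worker process runs its own event loop and spawns its share of the per-second tasks.

    # Sketch of the "multiple processes, one event loop each" variant:
    # each worker spawns (total rate / number of processes) tasks per second.
    # URL, rate, and process count are placeholders, not the original harness.
    import asyncio
    import multiprocessing as mp

    import aiohttp

    URL = "https://example.com/api/query"  # hypothetical target
    TOTAL_RATE = 20        # tasks per second across all workers
    NUM_PROCESSES = 4

    async def fetch(session: aiohttp.ClientSession) -> None:
        async with session.get(URL) as resp:
            await resp.read()  # no CPU-bound processing, mirroring the tests

    async def worker(rate: float, duration_s: int = 60) -> None:
        async with aiohttp.ClientSession() as session:
            tasks = []
            for _ in range(int(duration_s * rate)):
                tasks.append(asyncio.create_task(fetch(session)))
                await asyncio.sleep(1 / rate)
            await asyncio.gather(*tasks, return_exceptions=True)

    def run_worker() -> None:
        asyncio.run(worker(TOTAL_RATE / NUM_PROCESSES))

    if __name__ == "__main__":
        # Process startup itself is slow, as the note above points out.
        procs = [mp.Process(target=run_worker) for _ in range(NUM_PROCESSES)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()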
The baseline website would be queried up to 50 times per second. asyncio could complete 30 tasks per second reliably for an extended period, with each task completing its run in 0.01 to 0.02 seconds. Responses were very consistent.
The target website would be queried up to 20 times per second. Sometimes asyncio would struggle despite circumstances being identical (JSON handling, dumping response data to a queue, returning immediately, no CPU-bound processing). However, results varied between tests and could not always be reproduced. Responses would be under 0.4 seconds initially but quickly increase to 4-10 seconds per request, with 10-20 requests completing per second.
As an alternative method, I chose a parent URI for the target website. This URI wouldn't trigger a large query against their database but would instead be served back as a static JSON response. Responses bounced between 0.06 seconds and 2.5-4.5 seconds; however, 30-40 responses would be completed per second.
Splitting requests across processes with their own event loop would decrease response time in the upper-bound range by almost half, but still took more than one second each to complete.
The inability to reproduce consistent results every time from the target website would indicate that it's a performance issue on their end.

Querying multiple sensors regularly using NodeJS

I need to fetch the values of about 200 sensors every 15 seconds or so. To fetch the values I simply need to make an HTTP call with basic authentication and parse the response. The catch is that these sensors might be on a slow connection, so I need to allow at least 5 seconds per sensor (usually they respond a lot quicker, but there are always some that are slow or time out).
So right now I have the following setup for that:
There is a NodeJS process that is connected to my DB and knows all about the sensors. It checks regularly to see if there are new ones or if some have been deleted. It spawns a child process for every sensor, restarts the child process if it dies, and kills it if the sensor gets deleted. Each child process makes the HTTP call to its sensor with a 5-second timeout and, if it receives a value, saves it to Redis; it runs in an infinite loop with a 15-second setTimeout. A third process copies all the values from Redis to the main MySQL DB.
So that has been a working solution for half a year, but after a major system upgrade (from Ubuntu 14.04 to 18.04 and thus every package upgraded as well) it seems to leak some memory and I can't seem to figure out where.
Right after startup, the processes together take about 1.5GB of memory, but after a day or so this grows to 3GB and keeps climbing; before running out of memory I have to kill all the node processes and restart the whole thing.
So now I am trying to figure out more efficient methods to achieve the same result (query around 200-300 URLs every 15 seconds and store the results in MySQL). At the moment I'm thinking of ditching Redis: the child processes would communicate with their master process, and the master process would write to MySQL directly. That way I wouldn't need to load the Redis library into every child process, which might save me some time.
So I need ideas on how to reduce memory usage for this application (I'm limited to PHP and NodeJS, mainly because of my knowledge, so writing a native daemon might be out of the question).
Thanks!
The solution was easier than I thought: I rewrote the child process as a native bash script, which brought the memory usage down to almost zero.
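For comparison only (the accepted fix above was the bash rewrite, and the asker's stack is Node/PHP), the whole workload of roughly 200 basic-auth requests with a 5-second timeout every 15 seconds also fits comfortably in a single event-loop process. A minimal sketch in Python, with made-up sensor URLs and credentials and a print standing in for the MySQL write:

    # Minimal sketch (not the asker's setup): poll ~200 sensor URLs every 15 s
    # with basic auth and a 5 s per-request timeout, all from one process.
    # The URLs, credentials, and storage step are placeholders.
    import asyncio
    import aiohttp

    SENSORS = [f"http://sensor-{i}.example.local/value" for i in range(200)]  # hypothetical
    AUTH = aiohttp.BasicAuth("user", "secret")                                # hypothetical

    async def read_sensor(session: aiohttp.ClientSession, url: str):
        try:
            async with session.get(url, auth=AUTH) as resp:
                return url, await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None  # slow or dead sensor; skip it this round

    async def poll_loop() -> None:
        timeout = aiohttp.ClientTimeout(total=5)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            while True:
                results = await asyncio.gather(
                    *(read_sensor(session, url) for url in SENSORS)
                )
                # The values would be written to MySQL here; printing the
                # count of successful reads stands in for that step.
                print(sum(1 for _, v in results if v is not None), "sensors read")
                await asyncio.sleep(15)

    if __name__ == "__main__":
        asyncio.run(poll_loop())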

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also over TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the "in-order" requirement based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js
Can operate in a HA environment (i.e. multiple instances of it running on the same message queue at once w/o inconsistencies.)
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
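A minimal sketch of that two-set scheme, using Redis sorted sets scored by arrival time, is below. It is written in Python with redis-py purely for illustration (equivalent commands exist in Node Redis clients); the key names are made up, and the release check is deliberately left non-atomic, where a production version would wrap it in a Lua script or WATCH/MULTI/EXEC so multiple instances don't race each other.

    # Sketch of the resequencer described above: two sorted sets scored by
    # arrival time; a message is released only when it is the oldest entry in
    # both. Key names are hypothetical and the check is not yet atomic.
    import time
    import redis

    r = redis.Redis()

    IN_PROGRESS = "resequencer:in-progress"
    READY = "resequencer:ready"

    def message_received(msg_id: str) -> float:
        arrival = time.time()
        r.zadd(IN_PROGRESS, {msg_id: arrival})
        return arrival

    def message_processed(msg_id: str, arrival: float) -> None:
        r.zadd(READY, {msg_id: arrival})

    def release_in_order(send) -> None:
        # Keep sending while the oldest in-progress message has also finished
        # processing; stop as soon as the head of the line is still in flight.
        while True:
            oldest_in_progress = r.zrange(IN_PROGRESS, 0, 0)
            oldest_ready = r.zrange(READY, 0, 0)
            if not oldest_ready or oldest_in_progress != oldest_ready:
                break
            msg_id = oldest_ready[0]
            send(msg_id)
            r.zrem(IN_PROGRESS, msg_id)
            r.zrem(READY, msg_id)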
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely:
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving this by resequencing them in redis, which is an ok solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process set, and then into the done set, no node will see the message in the done set without it also being in the to-process set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running a message more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses, and puts them back in order before sending them on. You said you expect the processing to take 750ms. If a slave hasn't responded within, say, 2 seconds, dispatch the request again to another randomly chosen node after a random 0-1 second delay. The first response to come back is the one you use. Beware of duplicate responses.
If the retry also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and it prevents a sudden spike in requests from taking down the entire cluster. Without the random interval, retrying after exactly n seconds would effectively mount a DoS attack on your own cluster every n seconds until it dies, if it ever gets a big enough load spike.
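A minimal sketch of that retry policy follows, doubling the wait cap after each failure and sleeping a random interval below it; the dispatch callable and the 5-attempt limit are placeholders rather than anything from the original system.

    # Sketch of retry with exponential backoff and random jitter, as described
    # above: double the cap after each failure, give up after ~5 attempts.
    # 'dispatch' stands in for sending the request to a randomly chosen worker.
    import random
    import time

    def dispatch_with_backoff(dispatch, max_attempts: int = 5, base_wait_s: float = 1.0):
        wait_cap = base_wait_s
        for attempt in range(max_attempts):
            try:
                return dispatch()
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # looks permanent: time to ask for human intervention
                # Sleep a random interval up to the current cap, then double the
                # cap. The randomness spreads retries out so a load spike doesn't
                # turn into a synchronized, self-inflicted DoS every n seconds.
                time.sleep(random.uniform(0, wait_cap))
                wait_cap *= 2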
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. Javascript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.

Why does latency vary over a web socket when it's a static (persistent) connection?

HTTP creates a new connection for each piece of data transferred over the network, whereas web sockets are static: the connection is made once at the start and stays open until the transmission is done. But if web sockets are static, why does the latency differ for each data packet?
The latency test app I created shows a different time lag each time. So what is the advantage of a web socket being a static connection, or is this a common issue with web sockets?
Do I need to create a buffer to control the flow of data, given that the data transmission is continuous?
Does the latency increase when data transmission is continuous?
There is no overhead to establish a new connection with a statically open web socket (the connection is already open and established), but when you're talking to a server halfway around the world, the network transit itself takes time, so there is still latency on every request.
That's just how networking works.
You get a near immediate response from a server on your own LAN, and the further away the server gets (in terms of network topology), the more routers each packet must transit through and the more total delay there is. As you witnessed in your earlier question related to this topic, when you do a tracert from your location to your server location, you saw a LOT of different hops that each packet has to traverse. The time for each of these hops adds up, and busy routers may each add a small delay if they aren't instantly processing your packet.
The latency between when you send a packet and get a response is just 2x the packet transit time plus whatever your server takes to respond plus perhaps a tiny little overhead for TCP (since it's a reliable protocol, it needs acknowledgements). You cannot speed up the transit time unless you pick a server that is closer or somehow influence the route the packet takes to a faster route (this is mostly not under your control once you've selected a local ISP to use).
No amount of buffering on your end will decrease the roundtrip time to your server.
In addition, the more hops in the network there are between your client and server, the more variation you may get from one moment to the next in the transit time. Each one of the routers the packet traverses and each one of the links it travels on has their own load, congestion, etc... that varies with time. There will likely be a minimum transit time that you will ever observe (it will never be faster than x), but many things can influence it over time to make it slower than that in some moments. There could even be a case of an ISP taking a router offline for maintenance which puts more load on the other routers handling the traffic or a route between hops going down so a temporary, but slower and longer route is substituted in its place. There are literally hundreds of things that can cause the transit time to vary from moment to moment. In general, it won't vary a lot from one minute to the next, but can easily vary through the day or over longer periods of time.
You haven't said whether this is relevant or not, but when you have poor latency on a given roundtrip or when performance is very important, what you want to do is to minimize the number of roundtrips that you wait for. You can do that a couple of ways:
1. Don't sequence small pieces of data. The slowest way to send lots of data is to send a little bit of data, wait for a response, send a little more data, wait for a response, etc... If you had 100 bytes to send and you sent the data 1 byte at a time waiting for a response each time and your roundtrip time was X, you'd have 100X as your total time to send all the data. Instead, collect up a larger piece of the data and send it all at once. If you send the 100 bytes all at once, you'd probably only have a total delay of X rather than 100X.
2. If you can, send data in parallel. As explained above, the pattern of send data, wait for response, send more data, wait for response is slow when the roundtrip time is poor. If your data can be tagged such that each piece stands on its own, then you can sometimes send data in parallel without waiting for prior responses. In the above example, it was very slow to send 1 byte, wait for a response, send the next byte, wait for a response. But if you send 1 byte, then the next byte, then the next byte, and only some time later process all the responses, you get much, much better throughput. Obviously, if you already have 100 bytes of data, you may as well just send it all at once, but if the data is arriving in real time, you may want to send it out as it arrives and not wait for prior responses. Whether you can do this depends entirely upon the data protocol between your client and server.
3. Send bigger pieces of data at a time. If you can, send bigger chunks of data at once. Depending upon your app, it may or may not make sense to actually wait for data to accumulate before sending it, but if you already have 100 bytes of data, then try to send it all at once rather than sending it in smaller pieces.
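As a small illustration of points 1 and 3, the sketch below contrasts sending 100 small pieces one round trip at a time with sending them as one batched message, using the Python 'websockets' package against a hypothetical echo endpoint; with a round-trip time of X, the first loop costs roughly 100X while the batched send costs roughly X.

    # Sketch contrasting per-item round trips with one batched send over a
    # web socket. The echo URI is hypothetical; with a round trip of X, the
    # first loop costs ~100X and the batched version ~X.
    import asyncio
    import time

    import websockets

    ECHO_URI = "ws://echo.example.com/ws"  # hypothetical echo server

    async def compare(pieces: list) -> None:
        async with websockets.connect(ECHO_URI) as ws:
            # Slow pattern: send each small piece and wait for its echo.
            start = time.monotonic()
            for piece in pieces:
                await ws.send(piece)
                await ws.recv()
            per_item = time.monotonic() - start

            # Fast pattern: batch everything into one message, one round trip.
            start = time.monotonic()
            await ws.send("".join(pieces))
            await ws.recv()
            batched = time.monotonic() - start

        print(f"per-item: {per_item:.3f}s  batched: {batched:.3f}s")

    if __name__ == "__main__":
        asyncio.run(compare(["x"] * 100))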
