I'm writing a bot for myself, which could, on request, find torrents and download them to my home media center.
I receive an error with my webhook: a request lives for only about 5 seconds.
The parsers take 1-10 seconds, and my home server (a Hackberry board) is very slow.
With this, my requests die at 50%.
How can I send a query and receive an answer after more than 5 seconds?
An action is expected to respond within 5 seconds. This does not necessarily have to be the exact answer, but you'll need to have something to let the user know that your action is still processing.
This could be as simple as giving an intermediary state like, "Okay, I'm going to start. Do you want anything else?", or playing a short MediaResponse as "hold music". Then you can store the state in a short-term and quick to access database which is easy to poll and give as a status update when the user asks.
This can be done simply through followUpEvents. You can call any intent through the webhook's followUpEvent. So, to solve your problem, you have to maintain states in your web application such as "searching", "found", "downloading" and "downloaded"; which states you use is completely up to you.
Now, once an initial intent is called, you initiate the process on your server, then hold for 3-3.5 seconds and send a followUpEvent to call another intent, which does nothing but wait another 3-3.5 seconds while polling your server each second for an updated status. You can keep calling follow-up intents until you get the desired status from the server.
So if your requests die at 50% on a single intent, it should work fine with two follow-up intents.
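A minimal sketch of such a follow-up chain, assuming a Dialogflow ES v2 webhook, where the response field is named `followupEventInput`. The event name `CHECK_STATUS`, the in-memory `jobs` map, and the `hopsLeft` guard are illustrative assumptions, not something the platform prescribes:

```javascript
// Hypothetical in-memory job store; in practice use a fast shared store.
const jobs = new Map();

// Build the webhook reply for the current job state.
// hopsLeft is an assumed guard so the chain of follow-up events cannot loop forever.
function buildWebhookResponse(jobId, hopsLeft) {
  const job = jobs.get(jobId);
  if (!job) {
    return { fulfillmentText: 'I could not find that request.' };
  }
  if (job.status === 'downloaded') {
    return { fulfillmentText: 'Your download has finished.' };
  }
  if (hopsLeft <= 0) {
    // Give an intermediate answer instead of chaining further.
    return { fulfillmentText: `Still ${job.status}. Ask me for the status later.` };
  }
  // Not done yet: hand over to the next intent, which waits and polls again.
  return {
    followupEventInput: {
      name: 'CHECK_STATUS',
      languageCode: 'en-US',
      parameters: { jobId, hopsLeft: hopsLeft - 1 },
    },
  };
}
```

The intent triggered by `CHECK_STATUS` would wait its 3-3.5 seconds, poll the job state, and call the same function again with the `hopsLeft` it received.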
I'm currently developing a Shopify app with Node/Express and a Postgres database. When a user registers an account and connects their Shopify store, I'll need to download all of their store's orders. They could have 100,000s of orders, so I'd like to use a Shopify GraphQL Bulk Operation. While Shopify is handling this, my Node server will need to poll the Shopify server to check on the progress, and when the operation is complete, Shopify will send me a link where I can download all of the data. Once the data is processed and stored in my database, I'll send the user an email to say that their account is now set up.
How should I handle polling the Shopify server? The process could take anywhere from a few minutes to hours. Using setInterval() would be a bad idea, right? Because if the server restarts for whatever reason, it will lose the interval? So, should I use some sort of background task? And would I need to store anything in my database? I've researched cron jobs, child processes, worker threads, the bull package -- and it's left me a little confused.
(I also know that I could use a webhook, but Shopify offers no guarantees that my app will receive the webhook.)
Upon installation, launch a background job labeled "GetCustomerOrders". As you know, background jobs are mature, and nicely handle problems. For example, they can retry themselves if something goes wrong.
The Background job itself just sets up the Bulk Download and then settles into Poll. Polling is no big deal and just happens. As you said, could be minutes, could take hours. Nevertheless, a poll gets status on a bulk download, and that can even be hot-rodded. For example, you poll with an ID. So you poll till that ID completes. Regardless of restarts.
At the end of that rather simple setup, you get a URL to download and parse JSON. Spawn another job even for that. Endless fun. Why sweat it? Background jobs are the way to go.
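As a rough sketch of the polling step inside such a job: the function below only decides what to do next, so it can be plugged into whichever queue library you pick. The job fields (`shop`, `bulkOperationId`, `attempts`), the action shapes, and the follow-up job name `DownloadOrders` are hypothetical; the status names are what I understand Shopify's bulk operation statuses to be, so check them against the current docs. Because the queue persists the job data, including the operation ID, a restart simply resumes polling.

```javascript
// Decide what a polling job should do next, given the bulk operation's state.
// `op` is assumed to look like { status, url }.
function nextStep(job, op) {
  if (op.status === 'COMPLETED') {
    // Hand the download and parsing to a separate job.
    return { type: 'enqueue', name: 'DownloadOrders', payload: { shop: job.shop, url: op.url } };
  }
  if (['FAILED', 'CANCELED', 'EXPIRED'].includes(op.status)) {
    return { type: 'fail', reason: op.status };
  }
  // Still in progress: poll again later, backing off up to 5 minutes.
  const delayMs = Math.min(5 * 60 * 1000, 5000 * 2 ** job.attempts);
  return { type: 'retry', delayMs, job: { ...job, attempts: job.attempts + 1 } };
}
```

The worker would fetch the operation status by `job.bulkOperationId`, call `nextStep`, and either re-enqueue itself with the returned delay, enqueue the download job, or mark the job as failed.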
The Webhook idea is OK but as the documentation says, they are not 100% and CRON is bush-league in that it misses out on the mature development of jobs in queues and is more like a simple trigger. Relying on CRON to start something is fine, but gives you zero management over what it starts.
I am guessing NodeJS has a decent background job system by this time. When you look at Sidekiq for Ruby you realize what awesome is. Surely you can find a copycat in Node that comes close anyway.
The HTTP status code 429 tells the client making the request to back off and retry the request after a period specified in the response's Retry-After header.
In a single-threaded client, it is obvious that the thread getting the 429 should wait as told and then retry. But the RFC explicitly states that
this specification does not define how the origin server identifies
the user, nor how it counts requests.
Consequently, in a multi-threaded client, the conservative approach would stop all threads from sending requests until the Retry-After point in time. But:
Many threads may already be past the point where they can note the information from the one rejected thread and will send at least one more request.
The global synchronization between the threads can be a pain to implement and get right
If the setup runs not only several threads but several clients, potentially on different machines, backing off all of them on one 429 becomes non-trivial.
Does anyone have specific data from the field on how servers of cloud providers actually handle this? Will they get immediately aggravated if I don't globally hold back all threads? Microsoft's advice is
Wait the number of seconds specified in the Retry-After field.
Retry the request.
If the request fails again with a 429 error code, you are still being throttled. Continue to use the recommended Retry-After delay and retry the request until it succeeds.
It twice says 'the request' not 'any requests' or 'all requests', but this is legal-type interpretation I am not confident about.
To be sure this is not an opinion question, let me phrase it as fact-based as possible:
Are there more detailed specifications for cloud APIs (Microsoft, Google, Facebook, Twitter) than the example above that allow me to make an informed decision whether global back-off is necessary or whether it suffices to back off with the specific request that got the 429?
Servers know that it's tough to synchronize, or to expect programmers to do this. So I doubt there is a penalty unless they get an ocean of requests that do not back off at all after a 429.
Each thread should wait, but each would, after being told individually.
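A minimal sketch of this per-request back-off in Node, where each concurrent task retries only its own request. The header may carry either a number of seconds or an HTTP date, so both are handled. `send` and `sleep` are injected, hypothetical helpers, and the one-second fallback and the attempt limit are assumptions:

```javascript
// Convert a Retry-After header (delta-seconds or an HTTP date) into milliseconds.
function retryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 1000) {
  if (headerValue == null || headerValue === '') return fallbackMs;
  if (/^\s*\d+\s*$/.test(headerValue)) return Number(headerValue) * 1000;
  const dateMs = Date.parse(headerValue);
  return Number.isNaN(dateMs) ? fallbackMs : Math.max(0, dateMs - nowMs);
}

// Retry one request until it stops returning 429 or the attempts run out.
// send() resolves to { status, headers }; sleep(ms) resolves after ms milliseconds.
async function sendWithBackoff(send, sleep, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await send();
    if (res.status !== 429) return res;
    if (attempt < maxAttempts) {
      await sleep(retryAfterMs(res.headers['retry-after']));
    }
  }
  throw new Error(`still throttled after ${maxAttempts} attempts`);
}
```

Here `sleep` could be as simple as `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`.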
A good system would know what its rate is and stay within that. One way to implement this is to have a sleepFor variable between requests. The exact production value can be found by trial and error; the actual pause is the sleep time minus the time the previous request took.
So suppose one request ends, and it took x milliseconds. Then:
If the sleep time is 0 or less, move immediately to the next request.
If it is 1 or more, compute sleepTime - x. If this is less than 1, go to the next request immediately; otherwise sleep for that many milliseconds and then move to the next request.
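The rule above fits in a few lines. This is only a sketch; the function and parameter names are mine:

```javascript
// Milliseconds to pause before the next request.
// sleepTime: configured spacing between requests (ms).
// elapsedMs: how long the previous request took (x in the text).
function pauseBeforeNext(sleepTime, elapsedMs) {
  if (sleepTime <= 0) return 0; // no pacing configured
  const remaining = sleepTime - elapsedMs;
  return remaining < 1 ? 0 : remaining; // request already used up the spacing
}
```

For example, with a sleep time of 200 ms and a request that took 50 ms, the thread pauses 150 ms; if the request took 250 ms, it moves on immediately.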
Another way would be to set a timeCountStarted at request 1 and count requests per period of 5 minutes or so. After every request, check whether the current thread's request count exceeds the configured limit. If yes, the current thread sleeps until the 5 minutes are up before moving to the next request. Here 5 can be configured as the timePeriod. If after a request the count does not exceed the limit but the time elapsed since timeCountStarted is more than 5 minutes, then set timeCountStarted to the current time and the request count to 0.
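A sketch of this counting scheme, with the clock passed in so it can be tried without waiting. The class name is mine, and one detail is an assumption: the sleep is triggered as soon as the count reaches the limit, so the limit is never exceeded.

```javascript
class WindowLimiter {
  constructor(maxRequests, timePeriodMs) {
    this.maxRequests = maxRequests;
    this.timePeriodMs = timePeriodMs;
    this.timeCountStarted = null; // set when the first request is counted
    this.count = 0;
  }

  // Call after each request; returns how many ms the caller should sleep.
  afterRequest(nowMs = Date.now()) {
    if (this.timeCountStarted === null) this.timeCountStarted = nowMs;
    this.count += 1;
    const elapsed = nowMs - this.timeCountStarted;
    if (elapsed >= this.timePeriodMs) {
      // The period is over: start counting again from now.
      this.timeCountStarted = nowMs;
      this.count = 0;
      return 0;
    }
    if (this.count >= this.maxRequests) {
      // Budget used up: sleep until the period ends, then start fresh.
      const waitMs = this.timePeriodMs - elapsed;
      this.timeCountStarted = nowMs + waitMs;
      this.count = 0;
      return waitMs;
    }
    return 0;
  }
}
```

Each thread (or, in Node, each independent request loop) would keep its own instance and sleep for the returned number of milliseconds before its next request.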
What we do is keep these configuration values in a database but cache them at run time.
We also have a page to invalidate the caches, so if we like we can update the database from an admin page, then invalidate the caches, and the clients pick up the new values on the fly. This helps to tune the values so that we stay within API limits and still get enough jobs done.
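A small sketch of that cache-and-invalidate idea. `loadFromDb` is a hypothetical loader you supply, and the admin page would simply call `invalidate()`:

```javascript
class CachedConfig {
  constructor(loadFromDb) {
    this.loadFromDb = loadFromDb; // hypothetical: returns the config or a promise of it
    this.cached = null;
  }

  // Returns a promise for the config; hits the database only on a cache miss.
  get() {
    if (this.cached === null) {
      this.cached = Promise.resolve(this.loadFromDb());
    }
    return this.cached;
  }

  // Called after the database was updated; the next get() reloads.
  invalidate() {
    this.cached = null;
  }
}
```

Request loops would call `await config.get()` before each request, so a changed sleep time or request limit takes effect without a restart.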
I am serving my users with data fetched from an external API. Now, I don't know when this API will have new data, how would be the best approach to do that using Node, for example?
I have tried setInterval's and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it could have new data every five minutes or more.
The thing is, this external API isn't run by me. Would the only way to check for updates be hitting it every minute? Is there any module that can do that in Node or any approach that fits better?
Use case 1 : Call a weather API for every city of the country and just save data to my db when it is going to rain in a given city.
Use case 2 : Send notification to the user when a given Philips Hue lamp is turned on at the time it is turned on without having to hit the endpoint to check if it is on or not.
I appreciate the time to discuss this.
If this external API has no means of notifying you when there's new data, then the only thing you can do is to "poll" it to check for new data.
You will have to decide what an "efficient design" for polling is in your specific application and given the type of data and the needs of the client (what is an acceptable latency for new data).
You also need to be sure that your service is not violating any terms of service with your polling scheme or running afoul of rate limiting that may deny you access to the server if you use it "too much".
Would the only way to check for updates be hitting it every minute?
Unless the API offers some notification feature, there is no scheme other than polling at some interval. Polling every minute is fairly quick. Do your clients really need information that is less than a minute old? Or would it really make no difference if the information was as much as 5 minutes old?
For example, in your example of weather, a client wouldn't really need temperature updates more often than probably every 10-15 minutes.
Is there any module that can do that in Node or any approach that fits better?
No. Not really. You'll probably just use some sort of timer (either repeated setTimeout() or setInterval()) in a node.js app to repeatedly carry out your API operations.
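For illustration, here is one way to write the repeated-setTimeout() variant, so that a slow check never overlaps the next one. `check` is a hypothetical async function that calls the external API and processes the result:

```javascript
// Poll with repeated setTimeout; the next run is scheduled only after
// the current check has finished.
function startPolling(check, intervalMs) {
  let timer = null;
  let stopped = false;

  async function tick() {
    try {
      await check();
    } catch (err) {
      // Keep polling even if one check fails.
      console.error('poll failed:', err.message);
    }
    if (!stopped) timer = setTimeout(tick, intervalMs);
  }

  timer = setTimeout(tick, 0);

  return function stop() {
    stopped = true;
    clearTimeout(timer);
  };
}
```

Calling `const stop = startPolling(checkWeather, 5 * 60 * 1000)` would poll every five minutes after each check completes (`checkWeather` being your own function); `stop()` ends the loop.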
Use case: Call a weather API for every city of the country and just save data to my db when it is going to rain in a given city.
Trying to pre-save every possible piece of data from an external API is probably a losing proposition. You're essentially trying to "scrape" all the data from the external API. That is likely against the terms of service and will likely also run afoul of rate limits. And, it's just not very practical.
Instead, you will probably want to fetch data upon demand (when a client requests data for Phoenix, then, and only then, do you start collecting data for Phoenix) and then once a demand for a certain type of data (temperatures in a particular city) is established, then you might want to pre-cache that data more regularly so you can notify clients of changes. If, after awhile, no clients are asking for data from Phoenix, you stop requesting updates for Phoenix any more until a client establishes demand again.
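One way to sketch this demand-driven approach is to remember when each city was last asked for and poll only the cities with recent demand. The class and method names, and the idle limit, are assumptions for illustration:

```javascript
class DemandTracker {
  constructor(idleLimitMs) {
    this.idleLimitMs = idleLimitMs;
    this.lastAsked = new Map(); // city -> time of the most recent client request
  }

  // Call whenever a client asks for a city's data.
  recordRequest(city, nowMs = Date.now()) {
    this.lastAsked.set(city, nowMs);
  }

  // Cities that still have demand; cities idle for too long are dropped.
  citiesToPoll(nowMs = Date.now()) {
    const active = [];
    for (const [city, askedAt] of this.lastAsked) {
      if (nowMs - askedAt <= this.idleLimitMs) {
        active.push(city);
      } else {
        this.lastAsked.delete(city);
      }
    }
    return active;
  }
}
```

The polling timer would call `citiesToPoll()` on each run and fetch only those cities, so a city nobody asks about any more quietly falls out of the rotation.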
I have tried setInterval's and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint to check for new data every minute, but it could have new data every five minutes or more.
Making a remote network request is not a CPU intensive operation, even if you're doing it every minute. node.js uses non-blocking networking so most of the time during a network request, node.js isn't doing anything and isn't using the CPU at all. The only time the CPU would be briefly used is when you first send the API request and then when you receive back the result from the API call and need to process it.
Whether you really need to "poll" every minute depends upon the data and the needs of the client. I'd ask yourself if your app will work just fine if you check for new data every 5 minutes.
The method I would use to update would be contained outside of the code in a scheduled batch/PowerShell/bash file. In Windows you can schedule tasks based upon time of day or duration since last run, so what you could do is run a simple command that will kill your application for five minutes, run npm update, and then restart your application before closing the shell.
That way you're staying out of your API and keeping code to a minimum, and if your code is inside that Node package in the update, it'll be there and ready once you make serious application changes or you need to take the server down for maintenance and updates to the low-level code.
This is a light-weight solution for you and it's a method I've used once or twice at my workplace. There are lots of options out there, and if this isn't what you're looking for I can keep looking out for you.
I have an application where users are running a geospatial query against a mongo database. The query can return many thousands of results (~50k). These results are then streamed to the client over a websocket. However, users can abort a request mid result set and execute a new query. Users will frequently start, abort, and re-start requests on the order of several times per minute. Sometimes they even cancel/restart every couple of seconds.
The question is, when a user aborts a request, how do I cancel the query on the server so it doesn't continue to tie up resources streaming back thousands of unneeded results? I'm currently calling destroy() on the cursor, but it's not clear that this is actually stopping the query from executing on the server.
What's the best practice in this case?
Have you tried this?
db.currentOp()    // lists in-progress operations; note the opid of your query
db.killOp(opid)   // opid as returned by db.currentOp()
This is a good example.
The answer is it depends upon a lot of your implementation details.
If your server is in the middle of streaming results (e.g. still hasn't sent or queued everything) when the server receives some sort of other message that the previous results should be cancelled, then it is possible for you to communicate with that other stream and tell it to stop sending. How exactly you would do that depends entirely upon your code and you would have to show us your code for us to know.
Chances are the db query is long since complete and what is going on is the server is in the process of streaming results to the client. So, if that's the case, then it isn't the db you're looking for, it's the code that streams the response to the client. Since node.js JS is single threaded, the only time another request would actually get run on the server would be while the streaming code was in some async write operation, waiting for that to finish. You would probably have to set some flag that was uniquely associated with a particular user and then your stream code would have to check for that flag before each chunk of data was sent. If it saw the cancel flag, it could abandon sending the rest of the results.
You could make things more cancellable by explicitly chunking your results (say 500 at a time) and checking for a cancel flag between the sending of each chunk.
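A sketch of that chunked approach. `send` and `isCancelled` are hypothetical callbacks: `send` would write one chunk to the websocket, and `isCancelled` would read the per-user cancel flag:

```javascript
// Send results in chunks, checking a cancel flag between chunks.
// Resolves to the number of results actually sent.
async function streamInChunks(results, send, isCancelled, chunkSize = 500) {
  let sent = 0;
  for (let i = 0; i < results.length; i += chunkSize) {
    if (isCancelled()) break; // abandon the rest of the results
    const chunk = results.slice(i, i + chunkSize);
    await send(chunk);
    sent += chunk.length;
    // Yield to the event loop so an incoming "cancel" message can be handled.
    await new Promise((resolve) => setImmediate(resolve));
  }
  return sent;
}
```

The websocket message handler for a cancel (or for a new query) would set the flag that `isCancelled` reads, and the loop stops before the next chunk.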
If, on the other hand, all the data has already been buffered up by the TCP layer on the server, then the only way to stop that from being sent is to tear down the webSocket and force the client to reconnect.
This is a Brain-Question for advice on which scenario is a smarter approach to tackle situations of heavy lifting on the server end but with a responsive UI for the User.
The setup;
My System consists of two services (written in node): a Frontend Service that listens for Requests from the user, and a Background Worker that does heavy lifting and won't be finished within 1-2 seconds (e.g. video conversion, image resizing, gzipping, spidering etc.). The User is connected to the Frontend Service via WebSockets (and normal POST Requests).
Scenario 1;
When a User e.g. uploads a video, the Frontend Service only does some simple checks, creates a job in the name of the User for the Background Worker to process, and directly responds with status 200. Later on the Worker sees it has work, does the work and finishes the job. It then finds the socket the user is connected to (if any) and sends a "hey, job finished" with the data related to the video conversion job (url, length, bitrate, etc.).
Pros I see: Quick User feedback of successful upload (e.g. ProgressBar can be hidden)
Cons I see: User will get a fake "success" response with no data to handle/display and needs to wait till the job finishes anyway.
Scenario 2;
Like Scenario 1 but that the Frontend Service doesn't respond with a status 200 but rather subscribes to the created job "onComplete" event and lets the Request dangle till the callback is fired and the data can be sent down the pipe to the user.
Pros I see: "onSuccess", all data is at the User
Cons I see: Depending on the job's weight and active job count, the User's request could time out
While writing this question things are getting clearer to me by the minute (Scenario 1, but with smart success and update events sent). Regardless, I'd like to hear about other Scenarios you use or further Pros/Cons towards my Scenarios!?
Thanks for helping me out!
Some unnecessary info; For websockets I'm using socket.io, for job creating kue and for pub/sub redis
I just wrote something like this and I use both approaches for different things. Scenario 1 makes most sense IMO because it matches the reality best, which can then be conveyed most accurately to the user. By first responding with a 200 "Yes I got the request and created the 'job' like you requested", you can accurately update the UI to reflect that the request is being dealt with. You can then use the push channel to notify the user of updates such as progress percentage, error, and success as needed, but without the UI 'hanging' (obviously you wouldn't hang the UI in scenario 2, but it's an awkward situation that things are happening and the UI just has to 'guess' that the job is being processed).
Scenario 1 -- but instead of responding with 200 OK, you should respond with 202 Accepted. From Wikipedia:
https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
202 Accepted The request has been accepted for processing, but the
processing has not been completed. The request might or might not
eventually be acted upon, as it might be disallowed when processing
actually takes place.
This leaves the door open for the possibility of worker errors. You are just saying that you accepted the request and are trying to do something with it.
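A framework-free sketch of Scenario 1 with a 202 response. The array `queue` and the EventEmitter `notifier` are stand-ins for the real job queue and socket layer, and the function names, the `Location` path, and the message shapes are assumptions:

```javascript
const { EventEmitter } = require('events');

// Stand-ins: `queue` for the job queue, `notifier` for the push channel to users.
const queue = [];
const notifier = new EventEmitter();
let nextJobId = 1;

// Frontend service: do simple checks, create the job, answer at once with 202.
function handleUpload(userId, fileName) {
  if (!fileName) return { status: 400, body: { error: 'missing file' } };
  const job = { id: nextJobId++, userId, fileName };
  queue.push(job);
  return { status: 202, headers: { Location: `/jobs/${job.id}` }, body: { jobId: job.id } };
}

// Background worker: process one job, then push success or failure to the user.
function workOnce(convert) {
  const job = queue.shift();
  if (!job) return false;
  let message;
  try {
    message = { jobId: job.id, state: 'finished', result: convert(job) };
  } catch (err) {
    message = { jobId: job.id, state: 'failed', error: err.message };
  }
  notifier.emit(`user:${job.userId}`, message);
  return true;
}
```

The client treats the 202 as "job created", shows a pending state keyed by `jobId`, and updates it when the `finished` or `failed` message arrives on its socket.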