Backoff Strategy after hitting rate limits - node.js

When you hit the rate limits on getstream, the APIs start responding with errors.
What is the recommended backoff strategy for handling those failures and recovering afterwards? I thought about logging all the failed requests and sending them again after a minute or an hour.
But what if a user creates a post (which fails to be created on getstream and is waiting for a backoff) and then deletes it in the meantime? The backoff script will still send the post to getstream even though the user deleted it.
What does getstream recommend, and has anyone handled a situation like this?

As you point out, API rate-limit errors are typically handled with (exponential) backoff solutions.
This often involves additional application logic (flow control and queues) and special-purpose data services/storage (message queues, async workers, etc.), which can add quite a bit of complexity to an application.
When it comes to the Stream service, being rate-limited is usually an indication of either a flaw/deficiency in the implementation (much like a performance bug) or that the application has reached a scale beyond what the current plan is intended to support.
It'd be wise to contact Stream support directly about this.
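For what it's worth, here is a minimal node.js sketch of that pattern, which also addresses the deleted-post concern by re-checking your own database before each replay. The helpers postStillExists and sendToStream are hypothetical, and the assumption that rate-limit errors carry a 429 status code is mine, not getstream's documented behaviour:

declare function postStillExists(postId: string): Promise<boolean>; // lookup against your own DB
declare function sendToStream(post: { id: string }): Promise<void>; // the actual getstream call

async function retryWithBackoff(post: { id: string }, maxAttempts = 5): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Skip the replay entirely if the user has deleted the post in the meantime.
    if (!(await postStillExists(post.id))) return;
    try {
      await sendToStream(post);
      return;
    } catch (err: any) {
      if (err?.statusCode !== 429) throw err; // only back off on rate-limit errors
      // Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
      const delayMs = 2 ** attempt * 1000 + Math.random() * 1000;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`gave up after ${maxAttempts} attempts for post ${post.id}`);
}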

Related

Rate limit a pubsub queue worker

There is a scenario where a Google Pub/Sub worker will call a third-party API. This third-party API has a limit of 500 requests per minute.
How can we handle this scenario?
- Rate-limit the Google Pub/Sub worker (if that's possible, how can we achieve it?)
- Is there any other way to check the limit before making the call to the third-party API?
Please share if there is another option. Thanks
Cloud Tasks is the tool designed for this. Instead of publishing your messages to a Pub/Sub topic, create a task in a Cloud Tasks queue with the target URL.
In the task queue configuration, define the rate limit and, out of the box, your feature is done.
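As a rough illustration with the Node.js client (@google-cloud/tasks): the project, location, queue name, and target URL below are placeholders, and the 500-requests-per-minute cap itself lives in the queue's rate-limit configuration (maxDispatchesPerSecond), not in this code:

import { CloudTasksClient } from '@google-cloud/tasks';

const client = new CloudTasksClient();

async function enqueueThirdPartyCall(payload: object): Promise<void> {
  // Placeholder identifiers; the queue's rate limits enforce ~8.3 dispatches/sec.
  const parent = client.queuePath('my-project', 'us-central1', 'third-party-queue');
  await client.createTask({
    parent,
    task: {
      httpRequest: {
        httpMethod: 'POST',
        url: 'https://third-party.example.com/api', // placeholder target URL
        headers: { 'Content-Type': 'application/json' },
        body: Buffer.from(JSON.stringify(payload)),
      },
    },
  });
}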
The simplest solution might be using API Gateway: configure push delivery from the subscription to your rate-limited gateway: https://cloud.google.com/api-gateway/docs/quotas-overview
When you reach the threshold, the messages will fail and go back to the queue. Be sure to configure monitoring alerts on your Pub/Sub unacked message count, and a dead-letter queue to avoid losing information.
Another solution might be a custom implementation on App Engine: pull messages with a single instance of your subscriber app and keep the request count per time window in memory, though that's not very resilient. For a more resilient option, consider using Redis (Memorystore) to share the request rate across several instances; then you can use something like Cloud Functions.
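To sketch the Memorystore variant: a naive fixed-window counter shared by all workers, shown here with the ioredis client (key naming and limits are illustrative; a production limiter would likely want a sliding window or token bucket):

import Redis from 'ioredis';

const redis = new Redis(); // point this at your Memorystore instance

// Returns true if another call is allowed within the current window.
async function allowRequest(limit = 500, windowSeconds = 60): Promise<boolean> {
  // One counter key per window; every worker instance shares it.
  const key = `third-party-calls:${Math.floor(Date.now() / 1000 / windowSeconds)}`;
  const count = await redis.incr(key); // atomic increment across instances
  if (count === 1) await redis.expire(key, windowSeconds * 2); // let old windows expire
  return count <= limit;
}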

Azure Durable Functions as Message Queue

I have a serverless function that receives orders, about ~30 per day. This function is depending on a third-party API to perform some additional lookups and checks. However, this external endpoint isn't 100% reliable and I need to be able to store order requests if the other API isn't available for a couple of hours (or more..).
My initial thought was to split the function into two, the first part would receive orders, do some initial checks such as validating the order, then post the request into a message queue or pub/sub system. On the other side, there's a consumer that reads orders and tries to perform the API requests, if the API isn't available the orders get posted back into the queue.
However, someone suggested to me to simply use an Azure Durable Function for the requests and store the current backlog in the function state, using the Aggregator Pattern (especially since the API will be working fine 99.99...% of the time). This would make the architecture a lot simpler.
What are the advantages/disadvantages of using one over the other, am I missing any important considerations?
I would appreciate any insight or other suggestions you have. Let me know if additional information is needed.
You could solve this problem with the Durable Task Framework, or with Azure Storage or Service Bus queues, but at your transaction volume, I think that's overcomplicating the solution.
If you're dealing with ~30 orders per day, consider one of the simpler solutions:
- Use Polly, a well-supported resilience and fault-tolerance framework.
- Write request information to your database and have an Azure Function timer trigger read it occasionally and finish processing any orders that aren't marked as complete (sketched below).
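To make the second option concrete, here is a rough sketch using the Azure Functions v4 Node.js programming model (in TypeScript; the data-access helpers are hypothetical placeholders for your own persistence layer):

import { app, InvocationContext, Timer } from '@azure/functions';

interface Order { id: string }

// Hypothetical helpers backed by your own database.
declare function getIncompleteOrders(): Promise<Order[]>;
declare function tryProcessOrder(order: Order): Promise<boolean>;

app.timer('retryPendingOrders', {
  schedule: '0 */5 * * * *', // every five minutes
  handler: async (timer: Timer, context: InvocationContext): Promise<void> => {
    for (const order of await getIncompleteOrders()) {
      // tryProcessOrder marks the order complete on success and returns
      // false while the third-party API is still unavailable.
      if (!(await tryProcessOrder(order))) {
        context.log(`Order ${order.id} still pending; will retry next run`);
      }
    }
  },
});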
Durable Task Framework is great when you get into serious volume. But there's a non-trivial learning curve for the framework.

How can Azure Service Bus client-side batching guarantee messages aren't lost?

Azure Service Bus has a feature called "client-side batching" (implemented by the AMQP and SBMP protocols). I am reading what the documentation says about it, and it makes a rather bold claim (emphasis mine):
There is no risk of losing messages with batching, even if there is a Service Bus failure at the end of a 20ms batching interval.
How can this be true? If messages were sent synchronously, I'd know that after the "send" method returns, the message is already on the bus and I no longer need to worry about it. But if the purpose of client-side batching is to delay sending of the message for 20ms after the method returns, so that subsequent calls to this method can add messages to the same batch, then, at least in my mind, surely there must be a risk that something bad can happen within these 20ms and the whole batch will be lost.
The only possible workaround I can think of is if the batching happened on the server instead, but then the feature wouldn't be called client-side batching.
Is this claim true? Do I misunderstand that sentence, or what the client-side batching is meant to achieve? Or did clever people at Microsoft come up with a technical solution that hasn't occurred to me?
The note that you are referring to is under "Batching store access", which is talking about batching on the service end, not the client.

Instagram real-time API POST rate

I'm building an application using tag subscriptions in the real-time API and have a question related to capacity planning. We may have a large number of users posting to a subscribed hashtag at once, so the question is how often will the API actually POST to our subscription processing endpoint? E.g., if 100 users post to #testhashtag within a second or two, will I receive 100 POSTs or does the API batch those together as one update? A related question: is there a maximum rate at which POSTs can be sent (e.g., one per second or one per ten seconds, etc.)?
The Instagram API seems to lack detailed information about both how many updates are sent and what the rate limits are. From the API docs:
Limits
Be nice. If you're sending too many requests too quickly, we'll send back a 503 error code (server unavailable).
You are limited to 5000 requests per hour per access_token or client_id overall. Practically, this means you should (when possible) authenticate users so that limits are well outside the reach of a given user.
In other words, you'll need to check for a 503 and throttle your application accordingly. I've seen no information on how long they might block you, but it's best to avoid that completely. I would advise you to manage this by placing a rate-limiting mechanism in your own code, such as pushing your API requests through a queue with rate control. That will also give you the benefit of a retry if you're throttled, so you won't lose any of the updates.
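A minimal sketch of such a rate-controlled queue: the 5000-requests-per-hour budget works out to roughly one request every 720 ms, and the spacing below is chosen on that assumption:

class RateLimitedQueue {
  private queue: Array<() => Promise<void>> = [];
  private draining = false;

  constructor(private intervalMs: number) {}

  // Enqueue a task; it runs once earlier tasks have been spaced out.
  push<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push(() => task().then(resolve, reject));
      void this.drain();
    });
  }

  private async drain(): Promise<void> {
    if (this.draining) return;
    this.draining = true;
    while (this.queue.length > 0) {
      await this.queue.shift()!();                              // one call at a time...
      await new Promise((r) => setTimeout(r, this.intervalMs)); // ...spaced apart
    }
    this.draining = false;
  }
}

const instagramQueue = new RateLimitedQueue(720); // ~5000 requests/hour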
Moreover, a mechanism such as a queue in the case of real-time updates is further relevant because of the following from the API docs:
You should build your system to accept multiple update objects per payload - though often there will be only one included. Also, you should acknowledge the POST within a 2 second timeout--if you need to do more processing of the received information, you can do so in an asynchronous task.
Regarding the number of updates, the API can send you one update or many. The problem is that you can absolutely murder your API quota, because I don't think you can batch calls to specific media items, at least not using the official Python or Ruby clients or the API console as far as I have seen.
This means that if you receive 500 updates, whether as one request to your server or split across many, it won't matter: either way, you need to go and fetch those items. From what I observed in a real application, these seemed to count against our quota; however, the quota itself seemed to be consumed erratically. That is, sometimes we saw no calls consumed at all; other times the available calls dropped by far more than we actually made. My advice is to be conservative and treat the 5000 as a best guess rather than an absolute. You can check the remaining calls by parsing one of the headers they send back.
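For instance, with a fetch-style client (the header name here is an assumption; inspect a real response to confirm what Instagram actually sends):

async function remainingQuota(url: string): Promise<number | null> {
  const response = await fetch(url);
  // 'x-ratelimit-remaining' is a guess at the header name, not documented fact.
  const remaining = response.headers.get('x-ratelimit-remaining');
  return remaining === null ? null : Number(remaining);
}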
Use common sense, don't be stupid, and use a rate-limiting mechanism; that should keep you safe and has the added benefit of dealing with failures due to outages (which happen more often than you may think), network hiccups, and accidental rate limiting. You could try to be tricky and use different API keys in a pooling mechanism, but this is likely a violation of the TOS, and if they are doing anything by IP, you'd have to split this across different machines with different IPs.
My final advice would be to restructure your application to not rely completely on the subscription mechanism. It's less than reliable and very expensive API-wise. It's only truly useful if you just need to do something in your app that doesn't require calling back to Instagram, your number of items is small, or you can filter out the majority of items to avoid calling back to Instagram except when a specific business rule is matched.
Instead, you can do things like query the tag or the user (e.g. recent media) and scale it out that way. Normally this allows you to grab 100 items with one request rather than 100 items with 100 requests. If you really want to be cute, you could at least merge the subscription notifications asynchronously and combine similar ones into a single batched request, bucketing by shared characteristics such as the tag. Sort of like a map/reduce, but on a small data set. You could of course run an actual map/reduce on your own data from time to time as another way of keeping things in sync. Again, be careful not to thrash Instagram; rather, use map/reduce to batch out your calls in a way that's useful to your app.
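A rough sketch of that merge step: buffer notifications briefly and collapse duplicates by tag, so 500 notifications for #testhashtag become a single fetch (fetchRecentMediaForTag is a hypothetical placeholder):

declare function fetchRecentMediaForTag(tag: string): Promise<void>;

const pendingTags = new Set<string>();
let flushTimer: NodeJS.Timeout | null = null;

function onSubscriptionNotification(tag: string): void {
  pendingTags.add(tag); // duplicate tags collapse into a single entry
  if (flushTimer === null) {
    flushTimer = setTimeout(async () => {
      const tags = [...pendingTags];
      pendingTags.clear();
      flushTimer = null;
      // One API call per distinct tag, regardless of how many notifications arrived.
      for (const tag of tags) await fetchRecentMediaForTag(tag);
    }, 5000);
  }
}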
Hope that helps.

Windows Azure Service Bus Queues: Throttling and TOPAZ

Today, at a customer, we analysed the logs of the previous weeks and found the following issue with Windows Azure Service Bus queues:
The request was terminated because the entity is being throttled.
Please wait 10 seconds and try again.
After verifying the code, I told them to use the Transient Fault Handling Application Block (TOPAZ) to implement a retry policy like this one:
// Retry up to 5 times: first delay 1s, adding 2s each attempt (1s, 3s, 5s, ...).
var retryStrategy = new Incremental(5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(2));
// Detects Service Bus transient faults, including throttling exceptions.
var retryPolicy = new RetryPolicy<ServiceBusTransientErrorDetectionStrategy>(retryStrategy);
The customer answered:
"Ah that's great, so it will also handle the fact that it should wait
for 10 seconds when throttled."
Come to think of it, I never verified whether this was the case; I always assumed it was. In the Microsoft.Practices.EnterpriseLibrary.WindowsAzure.TransientFaultHandling assembly, I looked for code that would wait 10 seconds in case of throttling but didn't find anything.
Does this mean that TOPAZ isn't sufficient for creating resilient applications? Should it be combined with custom code to handle throttling (i.e., wait 10 seconds in case of a specific exception)?
As far as throttling is concerned, TOPAZ provides a set of built-in retry strategies, including:
- Fixed interval
- Incremental intervals
- Random exponential back-off intervals
You can also write your own custom retry strategy and plug it into TOPAZ.
Also, as Brent indicated, the 10-second wait is not mandatory. In many cases, retrying immediately may succeed without the need to wait. By default, TOPAZ performs the first retry immediately, before applying the retry intervals defined by the strategy.
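To illustrate the combination the question asks about, here is a language-neutral sketch (in TypeScript rather than .NET): standard incremental retries with an immediate first attempt, plus a 10-second floor whenever the error is specifically a throttle. isThrottlingError is a hypothetical stand-in for your own detection logic:

declare function isThrottlingError(err: unknown): boolean;

async function withRetries<T>(operation: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(); // the first attempt runs immediately, as TOPAZ does
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Incremental backoff (1s, 3s, 5s, ...) with a 10s floor for throttles.
      let delayMs = 1000 + attempt * 2000;
      if (isThrottlingError(err)) delayMs = Math.max(delayMs, 10_000);
      await new Promise((r) => setTimeout(r, delayMs));
    }
  }
}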
For more info, see Ch.6 of the "Building Elastic and Resilient Cloud Apps" Developer's Guide, also available as epub/mobi/pdf from here.
If you have suggestions or feature requests for TOPAZ, please submit them via UserVoice.
As I recall, the "10 second" wait isn't a requirement. Additionally, I believe TOPAZ also has backoff capabilities, which would help you overcome this.
On a personal note, I'd argue that simply utilizing something like TOPAZ is not sufficient to create a truly resilient solution. Resiliency goes beyond handling throttling on a single connection point; you'll also need to be able to fail over to a redundant endpoint, which TOPAZ won't do for you.
