Automatically renewing locks correctly on Azure Service Bus

So I'm trying to understand Service Bus timings, especially how the locks work. One can choose to manually call CompleteAsync, which is what we're doing. It could also be the case that the processing takes some time. In these cases we want to make sure we don't get unnecessary MessageLockLostException errors.
Seems there are a couple of numbers to relate to:
Lock duration (found in the Azure portal on the bus, currently set to 1 minute, which I think is the default)
AutoRenewTimeout (property on OnMessageOptions, currently set to 1 minute)
AutoComplete (property on OnMessageOptions, currently set to false)
Assume the processing runs for around 2 minutes and then either succeeds or crashes (it doesn't matter which for now). Let's say this is the normal scenario, so processing takes roughly 2 minutes for each message.
Also, it's indeed a queue and not a topic. We only have one consumer, which asynchronously processes the messages with MaxConcurrentCalls set to 100. We're using OnMessageAsync with ReceiveMode.PeekLock.
What should my settings now be as a single consumer to robustly process all messages?
I'm thinking that leaving Lock duration at 1 minute would be fine, as that's the default, and setting my AutoRenewTimeout to 5 minutes for safety, because as I've understood it this value should be the maximum time it takes to process a message (at least according to this answer). Performance is not critical for this system, so I'm reasoning that leaving a message locked for an unnecessary 1, 2 or 3 minutes is not evil, as long as we don't get LockedException, because these give no real value.
This thread and this thread give great examples of how to manually renew the locks, but I thought there was a way to automatically renew the locks.

What should my settings now be as a single consumer to robustly process all messages?
Aside from LockDuration, MaxConcurrentCalls, AutoRenewTimeout, and AutoComplete there are some configurations of the Azure Service Bus client you might want to look into. For example, create not a single client with MaxConcurrentCalls set to 100, but a few clients with total concurrency level distributed among the clients. Note that you'd want to use different MessagingFactory instances to create those clients to ensure you have more than a single "pipe" to receive messages. And even with that, it would be way better to scale out and have competing consumers rather than having a single consumer handling all the load.
Now back to the settings. If your normal processing time is 2 minutes, it's better to set MaxLockDuration on the entities to this time and not 1 minute. This will remove unnecessary lock extension calls to the broker and eliminate MessageLockLostException.
Also, keep in mind that AutoRenewTimeout is a client-based operation, not a broker one, and is therefore not guaranteed. You will run into cases where the lock is lost even though the AutoRenewTimeout has not elapsed yet.
AutoRenewTimeout should always be set longer than MaxLockDuration, as it is counterproductive to have them equal. Make it somewhat larger than MaxLockDuration, since this is the client's "insurance" that the message lock won't be lost when processing takes longer than MaxLockDuration. Having the two equal, in essence, disables this fallback.
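To make the relationship between these numbers concrete, here is a minimal sketch in Python. It does not use the Service Bus client at all: `renewals_needed` and `check_settings` are hypothetical helpers, and the model assumes that each renewal extends the lock by exactly one lock duration and is issued just before expiry.

```python
import math


def renewals_needed(processing_seconds: float, lock_seconds: float) -> int:
    """Renewals required so the lock outlives processing.

    Simplified model (an assumption): every renewal extends the lock by
    exactly one lock duration and is issued just before expiry.
    """
    if lock_seconds <= 0:
        raise ValueError("lock_seconds must be positive")
    return max(0, math.ceil(processing_seconds / lock_seconds) - 1)


def check_settings(lock_seconds, auto_renew_seconds, processing_seconds):
    """Return a list of warnings for a hypothetical settings combination."""
    warnings = []
    if auto_renew_seconds <= lock_seconds:
        warnings.append("auto-renew timeout does not exceed the lock duration")
    if auto_renew_seconds < processing_seconds:
        warnings.append("auto-renew timeout is shorter than processing time")
    if lock_seconds < processing_seconds:
        warnings.append("normal processing relies on lock renewal")
    return warnings


# Settings from the question: 1-minute lock, 1-minute auto-renew, 2-minute work.
print(renewals_needed(120, 60))
print(check_settings(lock_seconds=60, auto_renew_seconds=60, processing_seconds=120))

# Settings suggested in this answer: 2-minute lock, 5-minute auto-renew.
print(renewals_needed(120, 120))
print(check_settings(lock_seconds=120, auto_renew_seconds=300, processing_seconds=120))
```

Under this model the suggested settings need no renewal call in the normal case, and auto-renewal only comes into play when processing runs long.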

Related

How to set the number of retries for Azure DocumentDB output binding in Azure Function?

Based on this question, it seems like writing to Azure DocDB output binding in Azure Function will be retried 10 times if throttled (HTTP 429). I haven't verified this myself though.
I would like to increase this limit on the number of retries. My data comes in big chunks in a small amount of time and then with a very long period of downtime, which means that getting 429 and waiting for a bit is okay for my purpose. I must guarantee though, that no data is dropped.
One way for me to solve this is to increase the RTU limit in Document DB to make sure I don't get 429 while the big chunks of data come in, but it's already at about 2.5 times what I need during the downtime period. Is there any way to make the retries run indefinitely until they succeed, or, less ideally, to increase the number of retries to something more than 10?
Why not change the approach: instead of inserting documents right away, you could make use of Service Bus and implement a dead-letter queue. Here are some links:
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-dead-letter-queues
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus
https://blog.jeroenmaes.eu/2017/01/process-service-bus-dead-letter-message-with-azure-functions/
The idea is having something like this:
The current function, instead of saving the data in DocumentDB, sends it to the service bus (you just change the output binding)
Another function processes every message from the service bus; if processing fails (you can manage a timeout in the function), it moves the message to a dead-letter queue
Another function processes any message in the dead-letter queue
You just need to make a small change in the first function and create two more. It might sound too complicated, but you'll have strong consistency in the data. In all of the above links there's an example of what I mentioned here.
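The flow described above can be sketched in plain Python with an in-memory queue. This is only a toy model, not Azure Functions or Service Bus code: `drain`, `flaky_store` and the limit of 3 deliveries are all assumptions made for illustration.

```python
from collections import deque

MAX_DELIVERIES = 3  # assumed limit before a message is dead-lettered


def drain(queue, handler, max_deliveries=MAX_DELIVERIES):
    """Process every message; return (succeeded, dead_lettered)."""
    succeeded, dead_letter = [], []
    pending = deque((msg, 0) for msg in queue)
    while pending:
        msg, deliveries = pending.popleft()
        deliveries += 1
        try:
            handler(msg)
            succeeded.append(msg)
        except Exception:
            if deliveries >= max_deliveries:
                dead_letter.append(msg)            # kept for later handling, not dropped
            else:
                pending.append((msg, deliveries))  # redeliver later
    return succeeded, dead_letter


attempts = {}


def flaky_store(doc):
    """Stand-in for the database write: 'b' is throttled twice, 'c' always."""
    attempts[doc] = attempts.get(doc, 0) + 1
    if doc == "c" or (doc == "b" and attempts[doc] < 3):
        raise RuntimeError("throttled (HTTP 429)")


done, dead = drain(["a", "b", "c"], flaky_store)
print("stored:", done)
print("dead-lettered:", dead)
```

The point of the design is that a throttled write is never silently lost: it is either retried or parked in the dead-letter list, where the third function can pick it up.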

"Resequencing" messages after processing them out-of-order

I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
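To put numbers on that delay, here is a small worked calculation in Python using the figures above (750 ms per message, one arrival every 250 ms). The function names are made up for this sketch, and it assumes perfectly regular arrivals and a single worker that never idles.

```python
import math

SERVICE_MS = 750   # processing time per message (from the text)
ARRIVAL_MS = 250   # one message arrives every 250 ms (from the text)


def sequential_latency_ms(k: int) -> int:
    """Latency of the k-th message (0-based) with one sequential worker.

    Message k arrives at k * ARRIVAL_MS and, since the worker never idles,
    finishes at (k + 1) * SERVICE_MS.
    """
    return (k + 1) * SERVICE_MS - k * ARRIVAL_MS


def min_parallel_workers() -> int:
    """Fewest workers whose combined throughput keeps up with arrivals."""
    return math.ceil(SERVICE_MS / ARRIVAL_MS)


for k in (0, 100, 1000):
    print(k, sequential_latency_ms(k), "ms")
print("workers needed:", min_parallel_workers())
```

Each additional message adds 500 ms of latency in the sequential case, which is why the backlog grows without bound, while three parallel workers would be enough to keep up at this arrival rate.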
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the "in-order" requirements based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js
Can operate in a HA environment (i.e. multiple instances of it running on the same message queue at once w/o inconsistencies.)
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
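As a rough, single-process illustration of the three rules above (in Python rather than Node, and without Redis), the idea looks something like this. The `Resequencer` class is hypothetical, and it uses an arrival counter instead of arrival time to order messages.

```python
import heapq


class Resequencer:
    """Releases results in arrival order, whatever order processing finishes in."""

    def __init__(self):
        self._next_seq = 0      # arrival counter
        self._in_progress = []  # min-heap of sequence numbers not yet sent
        self._ready = {}        # sequence number -> processed result

    def receive(self) -> int:
        """Register an arriving message and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        heapq.heappush(self._in_progress, seq)
        return seq

    def complete(self, seq: int, result) -> list:
        """Mark a message as processed; return every result now safe to send."""
        self._ready[seq] = result
        sendable = []
        # Send while the oldest in-progress message is also ready.
        while self._in_progress and self._in_progress[0] in self._ready:
            head = heapq.heappop(self._in_progress)
            sendable.append(self._ready.pop(head))
        return sendable


r = Resequencer()
first, second, third = r.receive(), r.receive(), r.receive()
print(r.complete(third, "C"))   # nothing can be sent yet
print(r.complete(first, "A"))   # the head of the line is ready
print(r.complete(second, "B"))  # unblocks everything behind it
```

In the Redis version the heap and the dictionary would presumably become two sorted sets, and the check-and-pop loop would have to run atomically so that several instances can share the same state.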
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely:
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving this by resequencing them in redis, which is an ok solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within say 2 seconds, dispatch it again to another node randomly within 0-1 seconds. The first one responding is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and prevents a sudden spike in requests from taking down the entire cluster. Not using a random interval, and retrying after n seconds would probably cause a DOS-attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
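A minimal sketch of that retry policy in Python (the names `backoff_delays` and `call_with_retries` are made up here, and the defaults of 5 retries, a 1 second initial window and a doubling factor are taken from the description above):

```python
import random
import time


def backoff_delays(max_retries=5, initial_max_wait=1.0, factor=2.0, rng=random):
    """Yield one random wait per retry; the upper bound grows by `factor` each time."""
    max_wait = initial_max_wait
    for _ in range(max_retries):
        yield rng.uniform(0, max_wait)
        max_wait *= factor


def call_with_retries(request, max_retries=5, sleep=time.sleep):
    """Try once, then retry with backoff; raise once retries are exhausted."""
    last_error = None
    delays = backoff_delays(max_retries)
    for _ in range(max_retries + 1):
        try:
            return request()
        except Exception as error:
            last_error = error
            delay = next(delays, None)
            if delay is None:
                break          # no retries left
            sleep(delay)
    raise RuntimeError("permanent failure, needs human intervention") from last_error


# Show one possible schedule of waits (seeded so the run is repeatable).
print([round(d, 2) for d in backoff_delays(rng=random.Random(42))])
```

Because every wait is drawn at random from a growing window, clients that failed at the same moment do not all retry at the same moment.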
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. Javascript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.

how to process hundreds of JMS message from 2 queues, response time of 1 second and 1 minute respectively

I have a business requirement where I have to process messages according to a certain priority, say priority1 and priority2.
We have decided to use 2 JMS queues where priority1 messages will be sent to priority1Queue and priority2 messages will be sent to priority2Queue.
Response time for priority1Queue messages is that the moment message is in Queue, I need to read, process and send the response back to say another queue in 1 second. This means I should immediately process these messages the moment they are in priority1Queue, and I will have hundreds of such messages coming in per second on priority1Queue so I will definitely need to have multiple concurrent consumers consuming messages on this queue so that they can be processed immediately when they are in the queue(consumed and processed within 1 second).
Response time for priority2Queue messages is that I need to read, process and send the response back to, say, another queue within 1 minute. So the response time for priority2 messages is less strict than for priority1 messages; however, I still need to respond within a minute.
Can you suggest the best possible approach for this, so that I can concurrently read messages from both queues and give higher priority to priority1 messages, so that each priority1 message can be read and processed within 1 second?
Mainly how it can be read and fed to a processor so that the next message can be read and so on.
I need to write a java based component that does the reading and processing.
I also need to ensure this component is highly available and doesn't result in an OutOfMemory error. This component will run across multiple JVMs and multiple application servers, so I can have multiple clusters running it.
First off, the requirement to process within 1 second is not going to be dependent on your messaging approach, but more about the actual processing of the message and the raw CPUs available. Picking up 100s of messages per second from a queue is child's play, the JMS provider is most likely not the issue. Depending on your deployment platform (Tomcat, Mule, JEE, whatever), there should be a way to have n listeners to scale up appropriately. Because the messages exist on the queue until you pick it up, doubtful you'll run out of memory. I've done these apps, processed many more messages without problems.
Second, there are a number of strategies for prioritizing messages that don't necessarily require different queues, for example using message priorities. I'm leaning towards using message priorities and message filters, where one group of listeners takes care of the highest priority messages and another listener filters off lower priority ones but makes sure it does enough to get them out within a minute.
You could also do something where a lower priority message gets rewritten back to the same queue with a higher priority, based on how close to 1 minute you are. I know that sounds wrong, but reading/writing from JMS has very little overhead (at least compared to doing the equivalent, column-driven database transactions), and the listener for lower priority messages could just continually increase the priority until the message has to be processed.
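To illustrate the escalation idea, here is a small sketch of how the new priority might be computed (in Python for brevity rather than Java). JMS priorities run from 0 to 9; the linear ramp and the function name `escalated_priority` are assumptions, not something prescribed by JMS.

```python
JMS_MAX_PRIORITY = 9  # JMS priorities range from 0 (lowest) to 9 (highest)


def escalated_priority(base_priority: int, age_seconds: float,
                       deadline_seconds: float = 60.0) -> int:
    """Raise the priority linearly from base_priority to 9 as the deadline nears."""
    if deadline_seconds <= 0:
        raise ValueError("deadline_seconds must be positive")
    # Fraction of the deadline already used up, clamped to [0, 1].
    fraction = min(max(age_seconds / deadline_seconds, 0.0), 1.0)
    span = JMS_MAX_PRIORITY - base_priority
    return base_priority + int(span * fraction)


for age in (0, 30, 45, 60):
    print(age, "s ->", escalated_priority(base_priority=4, age_seconds=age))
```

A listener would compare the result with the message's current priority and only rewrite the message when the value has actually gone up.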
Or simpler, just have more listeners on the high priority queue/messages than on the lower priority ones; an imbalance in the number of listeners per priority might be all it needs.
Lots of possibilities, time for a PoC.

oracle row contention causing deadlock errors in high throughput JMS application

Summary:
I am interested in knowing what's the best practice for high throughput applications that have bulk messages trying to update the same row and get oracle deadlock errors. I know you cannot avoid those errors but how do you recover from them gracefully without getting bogged down by such deadlock errors happening over and over again.
Details:
We are building a high throughput JMS messaging application. Production environment will be two weblogic 11g nodes (running 6 MDB listener instances each). We were getting Oracle deadlock errors (ORA-00060) when we get around 1000 messages all trying to update the same row in oracle database. Java synchronization across nodes is not possible in standard java threading API (unless there's no other solution we don't want to use any 3rd party solutions like terracotta etc).
We were hoping Oracle "select for update WAIT n secs" statement will help because that will essentially make the competing threads (for the same row) wait few seconds before the first thread (who got the lock on the row first) gets done with it.
First issue with "SELECT FOR UPDATE WAIT n" is it doesn't allow using milliseconds for wait times. This starts negatively affecting our application's throughput because putting 1 sec WAIT (least wait time) causes delays on the messages.
Second, we are fiddling with the WebLogic queue redelivery delay parameter (30 secs in our case). Whenever a thread bounces back because of the deadlock error, it will wait 30 seconds before being retried.
In our experience 1000 competing messages, in a lot of situations take forever to get processed because the deadlock keeps on happening over and over.
I understand that with the current architecture we are supposed to get deadlock errors regardless ( in case of 1000 competing messages) but application should be resilient enough to recover from these errors after retrying the looping messages.
Any idea what we are missing here? Has anybody dealt with similar issues before?
I am looking for some design ideas that can make this work resiliently so that it recovers from this deadlock situation and eventually processes all messages in reasonable amount of time without using much additional hardware.
COMPUTATION DETAILS:
These 1000 messages will EACH create 4 objects of 4 different position types, each having a quantity associated with it. These quantities will have to be merged into those 4 different slots (depending on the position type). The deadlock is happening when those 4 individual slots are being updated by each individual thread. We have already ordered those individual updates in a specific order before being applied to the database rows to avoid any possible race conditions.
A deadlock implies that each thread is trying to update multiple rows in a single transaction and that those updates are being done in a different order across threads. The simplest possible answer, therefore, would be to modify the code so that messages within the same transaction are applied in some defined order (i.e. in order of the primary key). That would ensure that you would never get a deadlock though you'd still get blocking locks while one thread waits for another thread to commit its transaction.
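The same principle can be demonstrated outside the database. The sketch below uses Python threads and in-memory locks as stand-ins for transactions and row locks (an analogy only, and `run_demo` is a made-up name); every thread takes its locks in sorted key order, so no two threads can each hold a lock the other is waiting for.

```python
import threading


def run_demo(messages: int = 200) -> dict:
    """Run many conflicting updates concurrently and return the final totals."""
    keys = ("A", "B", "C", "D")
    locks = {key: threading.Lock() for key in keys}   # stand-ins for row locks
    totals = {key: 0 for key in keys}                 # stand-ins for the rows

    def apply_update(quantities: dict) -> None:
        ordered = sorted(quantities)       # same global order in every thread
        for key in ordered:
            locks[key].acquire()
        try:
            for key in ordered:
                totals[key] += quantities[key]
        finally:
            for key in reversed(ordered):
                locks[key].release()

    threads = []
    for i in range(messages):
        # Messages list their rows in opposite orders, as in the deadlock case.
        rows = ["A", "B", "C", "D"] if i % 2 == 0 else ["D", "C", "B", "A"]
        update = {row: 1 for row in rows}
        threads.append(threading.Thread(target=apply_update, args=(update,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals


print(run_demo())
```

If the `sorted` call were removed, threads listing the rows in opposite orders could block each other forever, which is the in-memory equivalent of ORA-00060.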
Taking a step back, though, it seems unlikely that you would really want many threads updating the same row in a table when you can't predict the order of the updates. It seems highly likely that would lead to lots of lost updates and some rather unpredictable behavior. What, exactly, is your application doing that would make this sort of thing sensible? Are you doing something like updating aggregate tables after inserting rows into a detail table (i.e. updating the count of the number of views a post has in addition to logging information about a particular view)? If so, do those operations really need to be synchronous? Or could you update the view count periodically in another thread by aggregating the views over the past N seconds?
As for the MDB
Let it consume the messages, and update instance variables which contain the delta of the quantities of the processed messages (an MDB can carry state in its instance variables across multiple messages).
A @Schedule method in the same MDB persists the quantities in a single database transaction using a single SQL statement every second (for example)
update x set q1 = q1 + delta1, q2 = q2 + delta2, ...
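The accumulate-then-flush idea can be sketched independently of EJB. The Python class below is only an illustration: `DeltaAccumulator` is a hypothetical name, `persist` stands in for the single UPDATE statement, and something else (a timer, or the scheduled method in the MDB case) is assumed to call `flush` once a second.

```python
import threading


class DeltaAccumulator:
    """Collects per-slot quantity deltas and hands them over in one batch."""

    def __init__(self, persist):
        self._persist = persist          # callable receiving {slot: delta}
        self._deltas = {}
        self._lock = threading.Lock()

    def on_message(self, slot: str, quantity: int) -> None:
        """Called for every consumed message; touches memory only."""
        with self._lock:
            self._deltas[slot] = self._deltas.get(slot, 0) + quantity

    def flush(self) -> None:
        """Called periodically, and once more at shutdown."""
        with self._lock:
            pending, self._deltas = self._deltas, {}
        if pending:
            self._persist(pending)       # one write for the whole batch


writes = []
acc = DeltaAccumulator(writes.append)
for _ in range(1000):
    for slot in ("q1", "q2", "q3", "q4"):
        acc.on_message(slot, 1)
acc.flush()
print(writes)
```

A thousand messages turn into a single write here. Note that in this sketch a failing `persist` loses the batch, so it is not a transactionally safe design.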
I have done some tests:
It takes 6s to create 1000 messages (JBoss 7 using HornetQ)
During that time, 840 messages were already persisted.
It takes another 2s to persist the remaining ones (the scheduled method ran every second)
This required seven SQL update commands in seven DB transactions
The load is completely caused by creating the messages; there is no real load on the DB
Notes
You need another @PreDestroy method to persist the pending deltas to make sure that nothing gets lost
If you must guarantee transactional correctness, this approach is not suitable. In that case I suggest using a normal queue receiver (= no MDB), transacted session and receive(timeout) to collect 100 - 10000 messages (or until a timeout), do one DB transaction, and right after that the commit on the queue session. This is better, but it's still not XA transactional. If you need this, both commits need to be coordinated by a single XA transaction.

Do multiple Azure worker roles polling the same queue cause deadlocks or poison messages?

Scenario:
I've spun up multiple Worker roles, or ONE Worker role with multiple threads, which poll for new messages in an Azure Queue.
Could someone please confirm whether this is the correct design approach? The reason I would like to have many worker roles is to speed up the PROCESSJOB. Our application should be near real time, i.e. as soon as there are messages we should get them, apply complex business rules and commit to AZURE DB. We are expecting 11,000 messages per 3 minutes.
Thank you.
You may have as many queue-readers as you like. It's very common to scale out worker role instances, as they can all read from the same queue, giving you much greater work throughput.
When you read a queue message, it's marked "invisible" for a period of time, to prevent others from reading and doing the same work. The owner of the message must delete it before the time period expires, otherwise the message becomes visible again, and an exception will be thrown when the original reader attempts to delete it. This means your operations must be idempotent.
There's no direct poison-message handling, but it's easy to implement, as each message has a dequeue count. Just check it and remove poison messages after being read 3-4 times. You can also dynamically adjust the timeout period based on dequeue count, as maybe the processing fails due to too-short a time window.
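A rough sketch of that check in Python. A plain dict stands in for the queue message, and the threshold of 4 reads, the 30 second base timeout and the function names are assumptions for illustration, not storage SDK calls.

```python
MAX_DEQUEUE_COUNT = 4        # assumed threshold; the text suggests 3-4 reads
BASE_TIMEOUT_SECONDS = 30    # assumed starting invisibility timeout


def visibility_timeout(dequeue_count: int) -> int:
    """Double the timeout on each redelivery in case the window was too short."""
    return BASE_TIMEOUT_SECONDS * 2 ** max(dequeue_count - 1, 0)


def handle(message: dict, process, poison_store: list) -> str:
    """Decide what to do with a message that has just been dequeued.

    `message` is a plain dict with 'body' and 'dequeue_count' keys, standing
    in for a real queue message.
    """
    if message["dequeue_count"] > MAX_DEQUEUE_COUNT:
        poison_store.append(message)     # set aside for inspection
        return "poison"
    try:
        process(message["body"])
    except Exception:
        return "retry"                   # leave it; it becomes visible again
    return "delete"


poison = []
print(handle({"body": "job-1", "dequeue_count": 1}, print, poison))
print(handle({"body": "job-2", "dequeue_count": 5}, print, poison))
print(visibility_timeout(3))
```

The caller would delete the message on "delete", do nothing on "retry", and delete it from the main queue after storing it elsewhere on "poison".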
Here's the MSDN documentation for DequeueCount.
EDIT: As far as processing 11,000 messages in 3 minutes: the scalability target for queues is 2,000 TPS, or up to 360,000 transactions in 3 minutes (far beyond the 11,000 message requirement you have). You can speed things up further by combining messages into a single queue message, as well as reading multiple messages at a time, which will also reduce your transaction count. You can also look at the ApproximateMessageCount property of a queue to see if your queue is backing up (and then scale out to additional instances to help consume queue items).
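The arithmetic behind those numbers, as a short Python calculation. The 2,000 TPS target and the 11,000 messages come from the text; the figure of two transactions per message (one read, one delete) is an assumption made for this estimate.

```python
QUEUE_TARGET_TPS = 2_000      # scalability target quoted in the text
WINDOW_SECONDS = 3 * 60
REQUIRED_MESSAGES = 11_000

capacity = QUEUE_TARGET_TPS * WINDOW_SECONDS           # transactions per window
required_rate = REQUIRED_MESSAGES / WINDOW_SECONDS     # messages per second

# Assumption: each message costs two transactions (one read, one delete).
transactions_needed = REQUIRED_MESSAGES * 2
utilisation = transactions_needed / capacity

print("capacity per 3 minutes:", capacity)
print("required messages per second:", round(required_rate, 1))
print("share of the target used:", f"{utilisation:.1%}")
```

Even with two transactions per message, the workload uses only a small fraction of the target, so processing time in the workers is the more likely bottleneck.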

Resources