Distributed Lock Manager with Azure SQL Database

We have a Web API backed by an Azure SQL database. The data model has Customers and Managers. Customers can add appointments, and we cannot allow overlapping appointments from two or more Customers for the same Manager. Because we are working in a distributed environment (multiple instances of the web server can insert records into the database at the same time), invalid appointments could get saved. For example, Customer 1 wants an appointment between 10:00 and 10:30, and Customer 2 wants an appointment between 10:15 and 10:45. If both requests arrive at the same time, the validation code in the Web API will not catch the error. That's why we need something like a distributed lock manager. We have read about Redlock from Redis and about ZooKeeper. My question is: is Redlock or ZooKeeper a good choice for our use case, or is there a better solution?
If we went with Redlock, we would use Azure Redis Cache because we already host our Web API on Azure Cloud. We plan to identify the shared resource (the resource we want to lock) by ManagerId + Date. That gives one lock per Manager per date, so other locks for the same Manager on other dates would still be possible. We plan to use one instance of Azure Redis Cache; is this safe enough?
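For what it's worth, the key scheme described in the question might look something like the sketch below. Note this is the single-instance SET NX PX pattern from the Redis docs rather than full multi-master Redlock; the node-redis client, the lock: key prefix, and the TTL handling are my assumptions for illustration.

import { createClient } from "redis";
import { randomUUID } from "crypto";

// Acquire a lock for one Manager on one date, e.g. "lock:42:2024-05-01".
// SET NX PX is atomic: only one caller can create the key before it expires.
async function acquireLock(managerId: number, date: string, ttlMs: number) {
  const client = createClient({ url: process.env.REDIS_URL });
  await client.connect();
  const token = randomUUID(); // lets us release only a lock we still own
  const ok = await client.set(`lock:${managerId}:${date}`, token, {
    NX: true,  // fail if the key already exists (someone holds the lock)
    PX: ttlMs, // auto-expire so a crashed holder can't block forever
  });
  return ok === "OK" ? token : null; // null = lock is held by someone else
}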

Q1: Is Redlock or ZooKeeper a good choice for our use case, or is there a better solution?
I consider Redlock not the best choice for your use case because:
a) Its guarantees hold only for a specific amount of time (the TTL) set before starting the DB operation. If for some reason the DB operation takes longer than the TTL (ask your DevOps people for improbable-but-real causes, and also read How to do distributed locking), you lose the guarantee of lock validity (see "lock validity time" in the official documentation). You could use a large TTL (minutes), or you could try to extend the lock's validity from another thread that monitors the duration of the DB operation, but this gets complicated quickly. With ZooKeeper (ZK), on the other hand, your lock is held until you remove it or the process dies. Your DB operation could hang, which would leave the lock hanging too, but that kind of problem is easily spotted by DevOps tools, which will kill the hanging process and thereby free the ZK lock (you could also run a monitoring process that does this faster and in a way specific to your business).
b) While trying to lock, the processes must "fight" to win the lock; "fighting" means waiting and then retrying to acquire it. This can exhaust the retry count, which results in a failure to acquire the lock. This seems to me the less important issue, but ZK's solution is far better: there is no "fight"; instead, all processes queue up and take the lock in turn (check the ZK lock recipe; a sketch follows after this list).
c) Redlock is based on time measurement, which is incredibly tricky; read at least the paragraph containing "feeling smug" in How to do distributed locking (and its Conclusion too), then think again about how large that TTL value would have to be for you to trust your Redlock (time-based) locking.
For these reasons I consider Redlock a risky solution and ZooKeeper a good one for your use case. I don't know of a better distributed locking solution for your case, but other distributed locking solutions do exist; see for example Apache ZooKeeper vs. etcd3.
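As a rough illustration of the ZK lock recipe mentioned in (b): each contender creates an ephemeral sequential znode under the lock path, the lowest sequence number holds the lock, and everyone else watches only the znode directly before theirs, forming a queue rather than a fight. A minimal sketch using the node-zookeeper-client package follows; the host, the lock path, and the bare-bones error handling are simplifications of mine, not a production implementation.

import zookeeper from "node-zookeeper-client";

const client = zookeeper.createClient("zk-host:2181"); // placeholder host

function lock(lockRoot: string, onAcquired: () => void): void {
  client.create(
    `${lockRoot}/lock-`,
    Buffer.alloc(0),
    zookeeper.CreateMode.EPHEMERAL_SEQUENTIAL, // znode vanishes if we die
    (err, myPath) => {
      if (err) throw err;
      const myNode = myPath.split("/").pop()!;
      const tryAcquire = () => {
        client.getChildren(lockRoot, (err2, children) => {
          if (err2) throw err2;
          const sorted = children.sort();
          if (sorted[0] === myNode) return onAcquired(); // lowest: lock is ours
          const prev = sorted[sorted.indexOf(myNode) - 1]; // our predecessor
          // Watch only the predecessor; when it disappears, re-check.
          client.exists(`${lockRoot}/${prev}`, tryAcquire, (e, stat) => {
            if (!stat) tryAcquire(); // predecessor already gone: re-check now
          });
        });
      };
      tryAcquire();
    }
  );
}
client.connect();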
Q2: We plan to use one instance of Azure Redis Cache, is this safe enough?
It could be safe for your use case because the TTL seems predictable (if we really trust the time measurement; see the warning below), but only if the slave taking over a failed master can be delayed (I'm not sure that's possible; you should check the Redis configuration capabilities). If you lose the master before a lock has been synchronized to the slave, another process could simply acquire the same lock. Redlock recommends delayed restarts (check "Performance, crash-recovery and fsync" in the official documentation) with a period of at least 1 TTL. If, for the reasons in Q1 a) and c), your TTL is very long, your system would be unable to lock for what may be an unacceptably long period (because the single Redis master you have must be replaced by the slave in a delayed fashion).
PS: I stress again: read Martin Kleppmann's opinion on Redlock, where you'll find surprising reasons why a DB operation can be delayed (search for "before reaching the storage service") and also surprising reasons not to rely on time measurement when locking (plus an interesting argument against using Redlock at all).
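One mitigation that same article proposes is a fencing token: the lock service hands out a monotonically increasing number, and the storage layer rejects any write carrying a stale token, so even a lock holder that wakes up after its TTL expired cannot corrupt data. A minimal sketch of the idea using the mssql package; the table, the LastToken column, and the parameter names are mine for illustration only.

import sql from "mssql";

// Assumed schema: ManagerSchedule rows record the highest fencing token
// that has written them (LastToken BIGINT, added for this purpose).
async function writeWithFence(pool: sql.ConnectionPool, managerId: number, token: number) {
  const result = await pool.request()
    .input("ManagerId", sql.Int, managerId)
    .input("Token", sql.BigInt, token)
    .query(`UPDATE ManagerSchedule
            SET LastToken = @Token
            WHERE ManagerId = @ManagerId AND LastToken < @Token`);
  // Zero rows affected means our token was stale: the lock expired and a
  // newer holder already wrote. Abort rather than overwrite its work.
  if (result.rowsAffected[0] === 0) throw new Error("stale fencing token");
}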

Related

Scheduling function calls in a stateless Node.js application

I'm trying to figure out a design pattern for scheduling events in a stateless Node back-end with multiple instances running simultaneously.
Use case example:
Create a message object with a publish date/time and save it to a database
Optionally update the publishing time or delete the object
When the publish time is reached, the message content is sent to a 3rd party API endpoint
Right now my best idea is to use bee-queue or bull to queue delayed jobs. It should be able to store the state and ensure that the job is executed only once. However, I feel like it might introduce a single point of failure, especially when maintaining state in Redis for months and then hoping that a future version of the queue library still works.
Another option is a service worker that polls the database for upcoming events every n minutes, but this seems like a potential scaling issue down the line for multi-tenant SaaS.
Are there more robust design patterns for solving this?
Don't worry about Redis breaking. It's pretty stable, and you can eventually decide to freeze the version.
If there are jobs to be executed in the future, I would suggest a database with a disk store, like Mongo or Redis, so you will survive a reboot, you don't have to reinvent the wheel, and you already have a nice set of tools for scalability.
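For the bull option mentioned in the question, a delayed job is nearly a one-liner. A minimal sketch, assuming a local Redis and a hypothetical publishToApi helper standing in for the 3rd-party call:

import Queue from "bull";

// One queue backed by Redis; Redis's own persistence (AOF/RDB) is what
// lets delayed jobs survive a restart.
const publishQueue = new Queue("publish", "redis://127.0.0.1:6379");

// Producer: schedule the message for its publish time.
async function schedule(messageId: string, publishAt: Date) {
  await publishQueue.add(
    { messageId },                               // a reference, not the payload
    { delay: publishAt.getTime() - Date.now(),   // run when the time arrives
      jobId: messageId }                         // de-duplicates re-schedules
  );
}

// Consumer: runs on whichever instance picks the job up - once per jobId
// as long as the handler completes successfully.
publishQueue.process(async (job) => {
  await publishToApi(job.data.messageId);
});

// Hypothetical downstream call - replace with your 3rd-party API client.
async function publishToApi(messageId: string): Promise<void> { /* ... */ }

Using the messageId as the jobId also covers the update/delete requirement: you can fetch the existing job with publishQueue.getJob(messageId) and remove it before re-adding with a new delay.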

Task scheduling behind multiple instances

Currently I am solving an engineering problem, and want to open the conversation to the SO community.
I want to implement a task scheduler. I have two separate instances of a Node.js application sitting behind an elastic load balancer (ELB). The problem is that when both instances come up, they both try to execute the same task logic, causing the tasks to run more than once.
My current solution is to use node-schedule to schedule tasks, then have each task check the database to verify it hasn't already been run within its specified interval.
The logic here is a little messy, and I am wondering if there is a more elegant way I could go about doing this.
Perhaps it is possible to set a particular env variable on a specific instance - so that only that instance will run the tasks.
What do you all think?
What you are describing appears to be a perfect example of a use case for AWS Simple Queue Service.
https://aws.amazon.com/sqs/details/
Key points to look out for in your solution:
Make sure that you pick a visibility timeout that reflects your workload (so messages don't re-enter the queue while still being processed by another worker)
Don't store your workload in the message; reference it! A message can only be up to 256KB in size, and message size affects both performance and cost.
Make sure you understand the billing! Billing is charged in 64KB chunks, so one 220KB message is charged as 4x 64KB chunks / requests.
If you keep your messages small, you can save more money with batch requests, as your bang for buck will be far greater!
Use long polling to retrieve messages, to get the most value out of each message request (see the sketch after this list).
Grant your application permission to SQS through an EC2 IAM Role, as this is the best security practice and the approach recommended by AWS.
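A rough sketch of the long-polling consumer those points describe, using the AWS SDK v3 for JavaScript; the region, queue URL, timeout values, and the processTask helper are placeholders of mine:

import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "eu-west-1" }); // placeholder region
const QueueUrl = process.env.QUEUE_URL!;            // placeholder queue URL

async function poll(): Promise<void> {
  const { Messages } = await sqs.send(new ReceiveMessageCommand({
    QueueUrl,
    MaxNumberOfMessages: 10, // batch to get more value per request
    WaitTimeSeconds: 20,     // long polling: wait up to 20s for messages
    VisibilityTimeout: 60,   // must exceed your worst-case processing time
  }));
  for (const msg of Messages ?? []) {
    await processTask(msg.Body!);
    // Delete only after successful processing, so a crash re-queues it.
    await sqs.send(new DeleteMessageCommand({
      QueueUrl,
      ReceiptHandle: msg.ReceiptHandle!,
    }));
  }
}

// Hypothetical handler - the body holds a reference, not the workload.
async function processTask(body: string): Promise<void> { /* ... */ }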
It's an excellent service, and should resolve your current need nicely.
Thanks!
Xavier.

Scaling and selecting unique records per worker

I'm part of a project where we have to deal with a lot of data in a stream. It's going to be passed to Mongo, and from there it needs to be processed by workers to decide whether it should be persisted or discarded, amongst other things.
We want to scale this horizontally. My question is: what methods are there for ensuring that each worker selects a unique record that isn't already being processed by another worker?
Is a central main worker required to hand out jobs to the sub-workers? If that is the case, the bottleneck and point of failure is that central worker, right?
Any ideas or suggestions welcome.
Thanks!
Josh
You can use findAndModify to both select and flag a document atomically, making sure that only one worker gets to process it. My experience is that this can be slow due to excessive database locking, but that experience is based on MongoDB 2.x, so it may no longer be an issue on 3.x.
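In the current Node.js driver, findAndModify is exposed as findOneAndUpdate. A minimal sketch of the claim-a-job pattern; the database, collection, and field names are mine for illustration:

import { MongoClient } from "mongodb";

// Atomically claim one pending record: the filter and update execute as a
// single operation, so two workers can never claim the same document.
async function claimJob(client: MongoClient, workerId: string) {
  return client.db("stream").collection("records").findOneAndUpdate(
    { status: "pending" },                                           // only unclaimed docs
    { $set: { status: "processing", workerId, claimedAt: new Date() } },
    { returnDocument: "after" }                                      // return the claimed doc
  );
}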
Also, with MongoDB it's difficult to "wait" for new jobs/documents (you can tail the oplog, but you'd have to do this from every worker and each one will wake up and perform the findAndModify() query, resulting in the aforementioned locking).
I think that ultimately you should consider using a proper messaging solution (write the data to MongoDB, write the _id to the broker, have the workers subscribe to the message queue, and if you configure things properly, only one worker will get each job). Well-known brokers are RabbitMQ and nsq.io, and with a bit of extra work you can even use Redis.
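With RabbitMQ, for example, the "only one worker gets the job" behaviour comes from a shared durable queue plus a prefetch of 1. A minimal sketch with amqplib; the broker URL and queue name are placeholders:

import amqp from "amqplib";

// Each worker consumes from the same queue; prefetch(1) means a worker
// holds at most one unacknowledged job, so work spreads across workers and
// each _id is delivered to exactly one consumer at a time.
async function startWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost"); // placeholder broker
  const ch = await conn.createChannel();
  await ch.assertQueue("records", { durable: true });
  await ch.prefetch(1);
  await ch.consume("records", async (msg) => {
    if (!msg) return;
    const id = msg.content.toString(); // the Mongo _id, not the payload
    // ...load the document by _id, then persist or discard it...
    ch.ack(msg); // unacked messages are redelivered if this worker dies
  });
}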

Azure Redis Cache data loss?

I have a Node.js application that receives data via a Websocket connection and pushes each message to an Azure Redis cache. It stores a persistent array of messages in a variable for downstream use, and at regular intervals syncs that array from the cache. It's a bit convoluted, but at a later point I want to separate the half of the application that writes to the cache from the half that reads from it.
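The write/sync pattern described here might look roughly like the following with the node-redis client; the key name and sync interval are placeholders of mine:

import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });
// call await redis.connect() once at startup

// Write side: push each incoming websocket message onto a Redis list.
async function onMessage(msg: string): Promise<void> {
  await redis.lPush("messages", msg);
}

// Read side: periodically replace the in-memory array with the full list.
let messages: string[] = [];
async function sync(): Promise<void> {
  messages = await redis.lRange("messages", 0, -1); // 0..-1 = whole list
}
setInterval(sync, 60_000);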
At around 02:00 GMT, based on the Azure portal stats, I appear to have started getting "cache misses" on that sync, which last for a couple of hours before I started getting "cache hits" again sometime around 05:00.
The cache misses correspond to a sudden increase in CPU usage, which peaks at around 05:00. And when I say peaks, I mean it hits 81%, vs a previous max of about 6%.
So sometime around 05:00, the CPU peaks, then drops back to normal, the "cache misses" go away, but looking at the cache memory usage, I drop from about 37.4mb used to about 3.85mb used (which I suspect is the "empty" state), and the list that's being used by this application was emptied.
The only functions the application runs against the cache are LPUSH and LRANGE; nothing in it has any capability to remove data. And in case anybody was wondering, memory usage did not rise when the CPU ramped up, so there's nothing to suggest rogue additions of data.
It's only on the Basic plan, so I'm not expecting it to be invulnerable or anything, but even without the replication features of the Standard plan I had expected that it wouldn't be in a position to wipe itself completely; I was under the impression that Redis periodically writes its data to disk and restores from that when it recovers from an error.
All of which is my way of asking:
Does anybody have any idea what might have happened here?
If this is something that others have been able to accidentally trigger themselves, are there any gotchas I should be looking out for that I might have in other applications using the same cache that could have caused it to fail so catastrophically?
I would welcome a chorus of people telling me that the Standard plan won't suffer from this sort of issue, because I've already forked out for it and it would be nice to feel like that was the right call.
Many thanks in advance.
Here are my thoughts:
Azure Redis Cache stores information in memory. By default it does not save a "backup" to disk, so you had information in memory, the server got restarted for some reason, and you lost your data.
PS: See this feedback item; there is no option yet to persist information to disk with Azure Redis Cache: http://feedback.azure.com/forums/169382-cache/suggestions/6022838-redis-cache-should-also-support-persistence
Make sure you don't use the Basic plan. The Basic plan doesn't provide an SLA, and in my experience it loses data quite often.
The Standard plan provides an SLA and utilizes two instances of Redis Cache. It's quite stable and it hasn't lost our data, although such a case is still possible.
Now, if you're going to use Azure Redis as a database rather than as a cache, you need to utilize the data persistence feature, which is already available in the Azure Redis Cache Premium tier: https://azure.microsoft.com/en-us/documentation/articles/cache-premium-tier-intro (see "Redis data persistence")
James, using the Standard instance should give you much improved availability.
With the Basic tier, any Azure Fabric update to the master node (or a hardware failure) will cause you to lose all data.
Azure Redis Cache does not support persistence (writing to disk/blob) yet, even in the Standard tier. But the Standard tier does give you a replicated slave node that can take over if your master goes down.

How to implement critical section in Azure

How do I implement critical section across multiple instances in Azure?
We are implementing a payment system on Azure.
Whenever an account balance is updated in SQL Azure, we need to make sure that the value is 100% correct.
But we have multiple web roles running, so they can concurrently service two requests from different customers that both update the current balance of a single product. Both instances may read the old amount from the database at the same time, each add the purchase to that old value, and both store the new amount in the database. Whoever saves first will have its change overwritten. :-(
Thus we need to implement a critical section around all updates to the account balance in the database. But how do we do that in Azure? Guides suggest using Azure storage queues for inter-process communication. :-)
They ensure that the message does not get deleted from the queue until it has been processed.
Even if a process crashes, we are sure that the message will be processed by the next process (as Azure guarantees to launch a new process if something hangs).
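Those delete-after-processing semantics look roughly like the following with the @azure/storage-queue package; the connection string, queue name, and timeout value are placeholders of mine:

import { QueueServiceClient } from "@azure/storage-queue";

const queue = QueueServiceClient
  .fromConnectionString(process.env.STORAGE_CONNECTION_STRING!)
  .getQueueClient("balance-updates"); // placeholder queue name

async function handleOne(): Promise<void> {
  const { receivedMessageItems } = await queue.receiveMessages({
    numberOfMessages: 1,
    visibilityTimeout: 30, // hidden from other readers while we work
  });
  for (const msg of receivedMessageItems) {
    // ...apply the balance update transactionally...
    // Deleting only after success is what gives the at-least-once guarantee:
    // if we crash first, the message reappears for the next process.
    await queue.deleteMessage(msg.messageId, msg.popReceipt);
  }
}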
I thought about running a singleton worker role to service requests on the queue, but Azure does not guarantee good uptime unless you run a minimum of two instances in parallel. Also, when I deploy a new version to Azure, I would have to stop the running instance before I can start the new one. Our application cannot accept the "critical section worker role" failing to process messages on the queue within 2 seconds.
Thus we would need multiple worker roles to guarantee a sufficiently small downtime, in which case we are back to the same problem of implementing critical sections across multiple instances in Azure.
Note: If the update transaction has not completed within 2 seconds, we should roll it back and start over.
Any idea how to implement critical section across instances in Azure would be deeply appreciated.
Doing synchronisation across instances is a complicated task, and it's best to try and think around the problem so you don't have to do it.
In this specific case, if it is as critical as it sounds, I would just leave this up to SQL Server (it's pretty good at dealing with data contention). Rather than have the instances say "the new total value is X", call a stored procedure in SQL where you simply pass in the value of this transaction and the account you want to update. Something basic like this:
UPDATE Account
SET AccountBalance = AccountBalance + @TransactionValue
WHERE AccountId = @AccountId
If you need to update more than one table, do it all in the same stored procedure and wrap it in a SQL transaction. I know it doesn't use any sexy technologies or frameworks, but it's much less complicated than any alternative I can think of.
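Calling that relative update from a Node.js web role could look like the sketch below, using the mssql package; the connection string, and the table and column names carried over from the snippet above, are assumptions:

import sql from "mssql";

// The UPDATE reads and writes the balance in one atomic statement, so SQL
// Server serializes concurrent purchases without any app-level locking.
async function applyPurchase(accountId: number, amount: number): Promise<void> {
  const pool = await sql.connect(process.env.SQL_CONNECTION_STRING!);
  const result = await pool.request()
    .input("AccountId", sql.Int, accountId)
    .input("TransactionValue", sql.Money, amount)
    .query(`UPDATE Account
            SET AccountBalance = AccountBalance + @TransactionValue
            WHERE AccountId = @AccountId`);
  if (result.rowsAffected[0] === 0) throw new Error("unknown account");
}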
