How to handle Web application logic and database concurrency?

How to handle Web application logic and database concurrency? - web

Let's say I have a table called items. User of my webapp can delete row of the items table, but I don't want to let the table empty.
So currently I have code like this in my application:
if (itemsCount() <= 1) {
don't delete;
}
else {
delete;
}
But I realize this code is vulnerable to concurrency problem. For example if currently the size of items is 2, and there are two thread executing this code at almost the exact same time, the table might become empty.
I think this problem is pretty common for people writing webapps. People should've already solved it. What are the available solutions for this?

The most common solution is to use a Transaction Manager. In your case, the Transaction Manager would coordinate the thread execution to make sure that only one thread at a time access and updates the table.
You didn't mention which language and which kind of environment you are using, but assuming Java and JEE, transaction management is quite easy. Start here.

Related

Azure durable entity or static variables?

Question: Is it thread-safe to use static variables (as a shared storage between orchestrations) or better to save/retrieve data to durable-entity?
There are couple of azure functions in the same namespace: hub-trigger, durable-entity, 2 orchestrations (main process and the one that monitors the whole process) and activity.
They all need some shared variables. In my case I need to know the number of main orchestration instances (start new or hold on). It's done in another orchestration (monitor)
I've tried both options and ask because I see different results.
Static variables: in my case there is a generic List, where SomeMyType holds the Id of the task, state, number of attempts, records it processed and other info.
When I need to start new orchestration and List.Add(), when I need to retrieve and modify it I use simple List.First(id_of_the_task). First() - I know for sure needed task is there.
With static variables I sometimes see that tasks become duplicated for some reason - I retrieve the task with List.First(id_of_the_task) - change something on result variable and that is it. Not a lot of code.
Durable-entity: the major difference is that I add List on a durable entity and each time I need to retrieve it I call for .CallEntityAsync("getTask") and .CallEntityAsync("saveTask") that might slow done the app.
With this approach more code and calls is required however it looks more stable, I don't see any duplicates.
Please, advice

Can't answer why you would see duplicates with the static variables approach without the code, may be because list is not thread safe and it may need ConcurrentBag but not sure. One issue with static variable is if the function app is not always on or if it can have multiple instances. Because when function unloads (or crashes) the state would be lost. Static variables are not shared across instances either so during high loads it wont work (if there can be many instances).
Durable entities seem better here. Yes they can be shared across many concurrent function instances and each entity can only execute one operation at a time so they are for sure a better option. The performance cost is a bit higher but they should not be slower than orchestrators since they perform a lot of common operations, writing to Table Storage, checking for events etc.
Can't say if its right for you but instead of List.First(id_of_the_task) you should just be able to access the orchestrators properties through the client which can hold custom data. Another idea depending on the usage is that you may be able to query the Table Storages directly with CloudTable class for the information about the running orchestrators.
Although not entirely related you can look at some settings for parallelism for durable functions Azure (Durable) Functions - Managing parallelism
Please ask any questions if I should clarify anything or if I misunderstood your question.

Do I need to use transaction when reading data after write?

I have a Node.js web app with a route that marks some entity as deleted - flipping boolean field in a database. This route returns that entity. Right now I have code that looks like this:
UPDATE entity SET is_deleted=true WHERE entity.id = ?
SELECT * FROM entity WHERE entity.id = ?
For the moment I can't use RETURNING statement for other reasons.
So I got in the argument with colleague, I think that putting both UPDATE and SELECT inside transaction is unnecessary, because we are not doing anything significant with data, just returning it. As a user of the app I would expect that data that is returned is as fresh as possible, meaning that I would get same results on page refresh.
My question is, what is the best practice regarding reading data after write? Do you always wrap reading with writing inside transaction? Or it depends?

Well, for performance reasons you want to keep your transactions as small and quick as possible. This will minimize the chance to have potential locks and deadlocks that could bring your application to its knees. As such, unless there is a very good reason to do so, keep your select statements outside of the transaction. This is specially important if your need to execute a long running select statement. By putting the select inside the transaction, you keep the update locks much longer than needed.

How to avoid concurrency issues when scaling writes horizontally?

Assume there is a worker service that receives messages from a queue, reads the product with the specified Id from a document database, applies some manipulation logic based on the message, and finally writes the updated product back to the database (a).
This work can be safely done in parallel when dealing with different products, so we can scale horizontally (b). However, if more than one service instance works on the same product, we might end up with concurrency issues, or concurrency exceptions from the database, in which case we should apply some retry logic (and still the retry might fail again and so on).
Question: How do we avoid this? Is there a way I can ensure two instances are not working on the same product?
Example/Use case: An online store has a great sale on productA, productB and productC that ends in an hour and hundreds of customers are buying. For each purchase, a message is enqueued (productId, numberOfItems, price). Goal: How can we run three instances of our worker service and make sure that all messages for productA will end up in instanceA, productB to instanceB and productC to instanceC (resulting in no concurrency issues)?
Notes: My service is written in C#, hosted on Azure as a Worker Role, I use Azure Queues for messaging, and I'm thinking to use Mongo for storage. Also, the Entity IDs are GUID.
It's more about the technique/design, so if you use different tools to solve the problem I'm still interested.

Any solution attempting to divide the load upon different items in the same collection (like orders) are doomed to fail. The reason is that if you got a high rate of transactions flowing you'll have to start doing one of the following things:
let nodes to talk each other (hey guys, are anyone working with this?)
Divide the ID generation into segments (node a creates ID 1-1000, node B 1001-1999) etc and then just let them deal with their own segment
dynamically divide a collection into segments (and let each node handle a segment.
so what's wrong with those approaches?
The first approach is simply replicating transactions in a database. Unless you can spend a large amount of time optimizing the strategy it's better to rely on transactions.
The second two options will decrease performance as you have to dynamically route messages upon ids and also change the strategy at run-time to also include newly inserted messages. It will fail eventually.
Solutions
Here are two solutions that you can also combine.
Retry automatically
Instead you have an entry point somewhere that reads from the message queue.
In it you have something like this:
while (true)
{
var message = queue.Read();
Process(message);
}
What you could do instead to get very simple fault tolerance is to retry upon failure:
while (true)
{
for (i = 0; i < 3; i++)
{
try
{
var message = queue.Read();
Process(message);
break; //exit for loop
}
catch (Exception ex)
{
//log
//no throw = for loop runs the next attempt
}
}
}
You could of course just catch db exceptions (or rather transaction failures) to just replay those messages.
Micro services
I know, Micro service is a buzz word. But in this case it's a great solution. Instead of having a monolithic core which processes all messages, divide the application in smaller parts. Or in your case just deactivate the processing of certain types of messages.
If you have five nodes running your application you can make sure that Node A receives messages related to orders, node B receives messages related to shipping etc.
By doing so you can still horizontally scale your application, you get no conflicts and it requires little effort (a few more message queues and reconfigure each node).

For this kind of a thing I use blob leases. Basically, I create a blob with the ID of an entity in some known storage account. When worker 1 picks up the entity, it tries to acquire a lease on the blob (and create the blob itself, if it doesn't exist). If it is successful in doing both, then I allow the processing of the message to occur. Always release the lease afterwards.
If I am not successfull, I dump the message back onto the queue
I follow the apporach originally described by Steve Marx here http://blog.smarx.com/posts/managing-concurrency-in-windows-azure-with-leases although tweaked to use new Storage Libraries
Edit after comments:
If you have a potentially high rate of messages all talking to the same entity (as your commend implies), I would redesign your approach somewhere.. either entity structure, or messaging structure.
For example: consider CQRS design pattern and store changes from processing of every message independently. Whereby, product entity is now an aggregate of all changes done to the entity by various workers, sequentially re-applied and rehydrated into a single object

If you want to always have the database up to date and always consistent with the already processed units then you have several updates on the same mutable entity.
In order to comply with this you need to serialize the updates for the same entity. Either you do this by partitioning your data at producers, either you accumulate the events for the entity on the same queue, either you lock the entity in the worker using an distributed lock or a lock at the database level.
You could use an actor model (in java/scala world using akka) that is creating a message queue for each entity or group of entities that process them serially.
UPDATED
You can try an akka port to .net and here.
Here you can find a nice tutorial with samples about using akka in scala.
But for general principles you should search more about [actor model]. It has drawbacks nevertheless.
In the end pertains to partition your data and ability to create a unique specialized worker(that could be reused and/or restarted in case of failure) for a specific entity.

I assume you have a means to safely access the product queue across all worker services. Given that, one simple way to avoid conflict could be using global queues per product next to the main queue
// Queue[X] is the queue for product X
// QueueMain is the main queue
DoWork(ProductType X)
{
if (Queue[X].empty())
{
product = QueueMain().pop()
if (product.type != X)
{
Queue[product.type].push(product)
return;
}
}else
{
product = Queue[X].pop()
}
//process product...
}
The access to queues need to be atomic

You should use session enabled service bus queue for ordering and concurrency.

1) Every high scale data solution that I can think of has something built in to handle precisely this sort of conflict. The details will depend on your final choice for data storage. In the case of a traditional relational database, this comes baked in without any add'l work on your part. Refer to your chosen technology's documentation for appropriate detail.
2) Understand your data model and usage patterns. Design your datastore appropriately. Don't design for scale that you won't have. Optimize for your most common usage patterns.
3) Challenge your assumptions. Do you actually have to mutate the same entity very frequently from multiple roles? Sometimes the answer is yes, but often you can simply create a new entity that's similar to reflect the update. IE, take a journaling/logging approach instead of a single-entity approach. Ultimately high volumes of updates on a single entity will never scale.

Strategy to handle race conditions with regrads to web applicaiton backend?

I have been asked questions regarding race conditions in web application like movie ticket or travel website often in interviews.
Question is something like this.
Say for a bus or plane ticket website, there is only seat left. Two(or many in extreme scenario) users on different computer log into the website at the same time and see that one seat is left. They both go ahead, select that seat and place the order.
Now there are two requests we have to handle. For the first request, we will book the ticket and but for the second request, we have to sort-of throw an error and show the error message to the end user saying the seat is not available.
Say the database schema is some-thing like this:
bus_id, seat_id,is_taken
so for the first request, we make the is_taken for corresponding bus_id, seat_id 1. Then for the second request, there won't be any seat_id with is_taken =0 so we won't book the ticket.
But here, in my opinion, we have put a restriction that at one time, only one request can be handled; Second request can be handled, only after first request has been completed.
However that is not practical, since we might have a huge website with loads of traffic and application running on several servers in parallel. We have to process requests in parallel.
Since I don't have much experience with handling race conditions in these sorts of multi-threaded web applications, I can't quite figure, what is the right way about solving this.
What is the right(even if basic) approach/ design patterns to tackle these scenarios?

Web applictions are necessarily multithreaded. There are two ways of solving this.
Application level (Not preferred)
I am not sure which programming language are you using for building the application. But all the programming language used for building websites will have something like "synchornize" which allows you to prevent two threads accessing same block of code simultaneously.
This is not preferred as this solution is not horizontally scalable. When you decide to do the increase the capacity by running one more instance of your web application, this solution fails terribly.
Database level
This is the preferred solution. You obtain the lock on the record in the database before you update.
SQL provides an option for selecting the record for update.
SELECT * FROM BUS_SEATS WHERE BUS_ID = 1 FOR UPDATE;
Above sql is one way to obtain lock. All the database provide this kind of feature. With this feature you can lock the required row and do the update and ensure consistency in the database.

At some point, there has to be some sort of synchronization.
Since you're using a database, which is usually the bottleneck anyway, you might as well let it handle the race condition.
All you have to do is update the row atomically. The requests can still be handled in parallel by the application.
Sql-pseudocode:
DECLARE #success = false;
UPDATE bus_seats
SET is_taken = 1, success = true
WHERE seat_id = #seat_id AND is_taken=0
return #success;

Creating Dependencies Within An NSOperation

I have a fairly involved download process I want to perform in a background thread. There are some natural dependencies between steps in this process. For example, I need to complete the downloads of both Table A and Table B before setting the relationships between them (I'm using Core Data).
I thought first of putting each dependent step in its own NSOperation, then creating a dependency between the two operations (i.e. download the two tables in one operation, then set the relationship between them in the next, dependent operation). However, each NSOperation requires it's own NSManagedContext, so this is no good. I don't want to save the background context until both tables have been downloaded and their relationships set.
I've therefore concluded this should all occur inside one NSOperation, and that I should use notifications or some other mechanism to call the dependent method when all the conditions for running it have been met.
I'm an iOS beginner, however, so before I venture down this path, I wouldn't mind advice on whether I've reached the right conclusion.

Given your validation requirements, I think it will be easiest inside of one operation, although this could turn into a bit of a hairball as far as code structure goes.
You'll essentially want to make two wire fetches to get the entire dataset you require, then combine the data and parse it at one time into Core Data.
If you're going to use the asynchronous API's this essentially means structuring a class that waits for both operations to complete and then launches another NSOperation or block which does the parse and relationship construction.
Imagine this order of events:
User performs some action (button tap, etc.)
Selector for that action fires two network requests
When both requests have finished (they both notify a common delegate) launch the parse operation
Might look something like this in code:
- (IBAction)someAction:(id)sender {
//fire both network requests
request1.delegate = aDelegate;
request2.delegate = aDelegate;
}
//later, inside the implementation of aDelegate
- (void)requestDidComplete... {
if (request1Finished && request2Finished) {
NSOperation *parse = //init with fetched data
//launch on queue etc.
}
}
There's two major pitfalls that this solution is prone to:
It keeps the entire data set around in memory until both requests are finished
You will have to constantly switch on the specific request that's calling your delegate (for error handling, success, etc.)
Basically, you're implementing operation dependencies on your own, although there might not be a good way around that because of the structure of NSURLConnection.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string