I've run into a tricky synchronization issue in node.js, for which I have not been able to find an elegant solution:
I set up an express/node.js web app for retrieving statistics data from a one-row database table.
If the table is empty, populate it by a long calculation task
If the record in table is older than 15 minutes from now, update it by a long calculation task
Otherwise, respond with a web page showing the record in DB.
The problem is that when multiple users issue requests simultaneously and the record is old, the long calculation task is executed once per request instead of just once.
Is there any elegant way that only one request triggers the calculation task, and all others wait for the updated DB record?
Yes, this is what locks are for.
Add a column to your table, say "lock", of timestamp type. Once a process starts working with that record, put a now+timeout time into it (as a rule of thumb I choose the timeout to be 2x the average processing time). When the process finishes, set that column back to NULL.
At the beginning of processing, check that column. If the value > now condition is satisfied, return a status code such as 409 Conflict to the client (don't force the client to wait: it's a bad user experience, because the user doesn't know what's going on unless the processing time is really short). Otherwise start processing (ideally in a separate thread/process so that the user doesn't have to wait, responding with an appropriate status code such as 202 Accepted).
This now+timeout value is needed in case your processing process crashes (so we avoid deadlocks). Also remember that you have to "check and set" this lock column in transaction because of race conditions (might be quite difficult if you are working with MongoDB-like databases).
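As a rough illustration of the lock-with-timeout idea above, here is a minimal sketch in which a plain object stands in for the one-row table. The names `tryAcquireLock`, `releaseLock` and `handleRefreshRequest` are made up for this example, and in a real database the check and the set would have to be a single conditional UPDATE or run inside a transaction.

```javascript
// In-memory stand-in for the one-row table; `lock` is a timestamp in ms, or null.
const row = { stats: null, lock: null };

// Check-and-set in one step. With a real database this would be one
// conditional UPDATE (or a transaction), for example:
//   UPDATE stats SET lock = :now + :timeout WHERE lock IS NULL OR lock <= :now
// followed by a look at the affected-row count.
function tryAcquireLock(row, now, timeoutMs) {
  if (row.lock !== null && row.lock > now) {
    return false; // another process holds a lock that has not expired yet
  }
  row.lock = now + timeoutMs;
  return true;
}

// Called when processing finishes normally.
function releaseLock(row) {
  row.lock = null;
}

// Hypothetical request handler returning the status codes the answer suggests.
function handleRefreshRequest(row, now, timeoutMs) {
  return tryAcquireLock(row, now, timeoutMs) ? 202 : 409;
}

const first = handleRefreshRequest(row, 1000, 60000);       // lock is free
const second = handleRefreshRequest(row, 1500, 60000);      // lock is held
const afterCrash = handleRefreshRequest(row, 70000, 60000); // lock has expired
console.log(first, second, afterCrash);
```

The third call shows why the column stores an expiry time rather than a flag: if the first process had crashed without releasing, the lock would still become available once the timeout passed.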
I have a background with Java and I am relatively new to node. I am trying to understand node in relation to the fact that it is single threaded, but can still handle multiple requests at the same time.
I have read about the single thread and the event loop, as well as the related stackoverflow questions, but I am still not sure I have understood it correctly, hence this question.
I have a simple http service that takes an id as an input. There can be multiple requests at almost the same time with the same id, and of course also other requests at almost the same time with other ids.
When the service is called, the following happens:
Lookup id in DB (in a blocking manner, i.e. await)
If the DB lookup did not find a result, insert id in DB
Let's say there are two requests at almost the same time, with the same id.
My question is whether the following is possible:
Request 1 makes the lookup in the DB -> no result
Request 2 makes the lookup in the DB -> no result
Request 1 inserts a new row
Request 2 inserts a new row
The blocking manner of the lookup makes me guess the answer is "no, that is not possible", but then I read that the blocking does not block the single thread. What makes me want to answer "yes, it is possible" is that I do not understand how several requests could be handled at the same time if the above were not possible.
Thanks,
-Louise
As far as I can determine the answer is "yes, that is possible". The "await" on the call to the DB ensures that the query has finished before we continue to the next line of code, but it does not block the thread.
The thread continues with other tasks while awaiting the DB operation to finish, and those other tasks might be handling another request. This means that a race condition can happen between multiple requests.
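The interleaving described above can be reproduced with a small sketch. An array stands in for the database table, and `findById`, `insert` and `handleRequest` are hypothetical helpers that yield to the event loop the way a real database call would.

```javascript
// In-memory stand-in for the table.
const rows = [];

// Yields to the event loop, like waiting for a real I/O round trip.
const tick = () => new Promise((resolve) => setTimeout(resolve, 0));

async function findById(id) {
  await tick(); // simulated lookup round trip
  return rows.find((r) => r.id === id);
}

async function insert(id) {
  await tick(); // simulated insert round trip
  rows.push({ id });
}

// The service from the question: look up the id, insert it if missing.
async function handleRequest(id) {
  const existing = await findById(id);
  if (!existing) {
    await insert(id);
  }
}

// Two requests with the same id arrive at almost the same time.
const raceDemo = Promise.all([handleRequest(42), handleRequest(42)])
  .then(() => rows.length);
raceDemo.then((count) => console.log('rows inserted:', count));
```

Both lookups complete before either insert does, so both requests insert a row. With a real database, a unique constraint on the id column is one common way to make the second insert fail instead of silently duplicating the row.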
Over 2 years ago, Remy Lebeau gave me invaluable tips on threads in Delphi. His answers were very useful to me and I feel like I made great progress thanks to him. This post can be found here.
Today, I face a "conceptual problem" about threads. This is not really about code; it is about the approach one should choose for a certain problem. I know we are not supposed to ask for personal opinions. I am merely asking whether, from a technical point of view, one of these approaches must be avoided or whether they are both viable.
My application has a list of unique product numbers (named SKU) in a database. Querying an API with these SKUs, I get back a JSON file containing details about these products. This JSON file is processed, and the results are displayed on screen and saved in the database. So, at one step, a download process is involved, and it is executed in a worker thread.
I see two different approaches possible for this whole procedure :
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. A Tstringlist is then built and, for each element of the list, a thread is launched, downloads the JSON, sends back the result to the main thread and terminates.
When the user clicks on the start button, a query is fired, building a list of SKUs based on the user criteria. Instead of sending SKU numbers one after another to the worker thread, the whole list is sent, and the worker thread iterates through the list, sending back results for displaying and saving to the main thread (via a synchronize event). So we only have one worker thread working the whole list before terminating.
I have coded these two different approaches and they both work, each with its own downsides that I have experienced.
I am not a professional developer; this is a hobby. Before working my way further down one path or the other for "polishing", I would like to know whether, from a technical point of view and according to your knowledge and experience, one of the approaches I described should be avoided, and why.
Thanks for your time
Mathias
Another thing to consider in this case is latency to your API that is producing the JSON. For example, if it takes 30 msec to go back and forth to the server, and 0.01 msec to create the JSON on the server, then querying a single JSON record per request, even if each request is in a different thread, does not make much sense. In that case, it would make sense to do fewer requests to the server, returning more data on each request, and partition the results up among different threads.
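To put rough numbers on the latency argument above, here is a small sketch using the 30 msec round trip and 0.01 msec server time from the example. The 1000 SKUs, the batch size of 100, and the assumption that the API accepts several SKUs per request are all made up for illustration, and the cost model is a simple sequential one.

```javascript
// Split a list into batches of at most `size` items.
function chunk(list, size) {
  const batches = [];
  for (let i = 0; i < list.length; i += size) {
    batches.push(list.slice(i, i + size));
  }
  return batches;
}

// Rough sequential cost model: one round trip per request,
// plus the server-side time for every SKU.
function estimateMs(skuCount, batchSize, roundTripMs, perSkuMs) {
  const requests = Math.ceil(skuCount / batchSize);
  return requests * roundTripMs + skuCount * perSkuMs;
}

// Made-up SKU list for the example.
const skus = Array.from({ length: 1000 }, (_, i) => `SKU-${i}`);

const onePerRequest = estimateMs(skus.length, 1, 30, 0.01);
const hundredPerRequest = estimateMs(skus.length, 100, 30, 0.01);
console.log(chunk(skus, 100).length, onePerRequest, hundredPerRequest);
```

Under these assumptions the one-SKU-per-request version spends about 30 seconds almost entirely on round trips, while ten batched requests take roughly a third of a second, which is the point the paragraph makes about doing fewer, larger requests.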
The other thing is that threads are not a solution to every problem. I would question why you need to break each SKU into a single thread. How long is each individual thread running, and how much processing is each thread doing? In general, creating lots of threads that each work for a fraction of a msec does not make sense. You want the threads to be alive for as long as possible, processing as much data as they can for the job. You don't want the computer to spend as much time creating/destroying threads as actually doing useful work.
The title isn't accurate because based on what I have found in my research there doesn't seem to be a way to make a function atomic in nodejs, but I will lay out my problem to see if you people can come up with something that I have not been able to think about.
I am trying to set up a scheduler where I can define my appointment time slots, say 1 hr long each, and when someone makes an appointment I want to make sure that the time slot is not taken before scheduling it.
So for example I decide that I will be working from 9 am to 2 pm with a time slot of one hour. Then my schedule would be 9-10, 10-11, 11-12, 12-1, 1-2.
An appointment will come in with a start time of 11 and end time of 12. I need to make sure that slot isn't already taken.
I am using mongodb with nodejs and restify.
I understand that in my appointments collection I can set an index on a combination of values like start time and end time, as discussed in "Creating Multifield Indexes in Mongoose / MongoDB".
But if I decide to change my time slot from 1 hour to, say, 1.5 hours, then I will have scheduling conflicts, as the start time and end time of entries in the database will not match up with the new interval.
Currently I have a function which checks to make sure that the new appointment will not conflict but I am not sure if it will work out well when I have multiple requests coming in. This is a nodejs and restify app so basically an api with a mongodb that it talks to, to handle appointments.
I am running it with multiple workers, so I am worried that at a certain point two requests will come in at the same time, handled by two different workers for the same time slot. When my conflict checking function executes it will return saying that the slot is open for both of them since no appointment has been made yet and then there will be a scheduling conflict.
Any ideas on how to combat this, or is there something in the way JavaScript executes that means I shouldn't have to worry about this? All input will be appreciated.
Thanks!
I ended up using https://github.com/Automattic/kue, to queue my requests and added another endpoint where you can check the status of your request. So when you want to make an appointment your request ends up in the job queue, and you can then periodically check the status of your request. This way only one appointment request gets processed at a time so no concurrency issues.
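The effect of processing one appointment request at a time can be sketched with a plain promise chain. This is an in-process stand-in and not kue itself (kue is backed by Redis, which is what makes it usable across several workers); `book`, `enqueue` and `conflicts` are hypothetical names, and the overlap test also covers slots of different lengths.

```javascript
// Existing appointments as half-open intervals [start, end), in hours.
const appointments = [];

// Two intervals overlap when each one starts before the other ends.
function conflicts(a, b) {
  return a.start < b.end && b.start < a.end;
}

// Stand-in for an asynchronous database call.
const pause = () => new Promise((resolve) => setTimeout(resolve, 0));

async function book(request) {
  await pause(); // read the existing appointments
  if (appointments.some((existing) => conflicts(existing, request))) {
    return 'rejected';
  }
  await pause(); // write the new appointment
  appointments.push(request);
  return 'booked';
}

// Serial queue: each job starts only after the previous one has settled.
let tail = Promise.resolve();
function enqueue(job) {
  const result = tail.then(job);
  tail = result.catch(() => {}); // keep the chain alive if a job fails
  return result;
}

const outcomes = Promise.all([
  enqueue(() => book({ start: 11, end: 12 })),
  enqueue(() => book({ start: 11, end: 12.5 })),
  enqueue(() => book({ start: 12.5, end: 14 })),
]);
outcomes.then((statuses) => console.log(statuses));
```

Because the second request only starts its check after the first has been written, it sees the conflict. Calling `book` directly for all three requests would let the first two pass the check together, which is the scenario described in the question.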
We are trying to create an algorithm/heuristic that will schedule a delivery at a certain time period, but there is definitely a race condition here, whereby two conflicting scheduled items could be written to the DB, because the write is not really atomic.
As far as I know, the only way to truly prevent race conditions is to create some atomic insert operation.
The server receives a request to schedule something for a certain time period, and the server has to check if that time period is still available before it writes the data to the DB. But in that time the server could get a similar request and end up writing conflicting data.
How to circumvent this? Is there some way to create some script in the DB itself that hooks into the write operation to make the whole thing atomic? By putting a locking mechanism on that script? What makes the whole thing non-atomic is the read and the wire time between the server and the DB.
Whenever I run into a race condition, one solution immediately comes to mind: a queue.
Step 1) Instead of adding data to the database directly, add it to a queue without checking anything.
Step 2) A separate reader reads from the queue, checks the DB for any conflict, and takes the necessary action.
This is one way to solve this. If you implement a better solution, please do share it.
Hope that helps
Summary:
I am interested in knowing the best practice for high-throughput applications in which bulk messages try to update the same row and hit Oracle deadlock errors. I know you cannot avoid those errors, but how do you recover from them gracefully without getting bogged down by deadlocks happening over and over again?
Details:
We are building a high throughput JMS messaging application. Production environment will be two weblogic 11g nodes (running 6 MDB listener instances each). We were getting Oracle deadlock errors (ORA-00060) when we get around 1000 messages all trying to update the same row in oracle database. Java synchronization across nodes is not possible in standard java threading API (unless there's no other solution we don't want to use any 3rd party solutions like terracotta etc).
We were hoping the Oracle "SELECT FOR UPDATE WAIT n" statement would help, because it essentially makes the threads competing for the same row wait a few seconds until the first thread (the one that got the lock on the row first) is done with it.
The first issue with "SELECT FOR UPDATE WAIT n" is that it doesn't allow milliseconds for wait times. This starts to hurt our application's throughput, because a 1 sec WAIT (the smallest wait time) delays the messages.
The second thing we are fiddling with is the WebLogic queue re-delivery delay parameter (30 secs in our case). Whenever a thread bounces back because of the deadlock error, the message waits 30 seconds before being retried.
In our experience, 1000 competing messages in a lot of situations take forever to get processed, because the deadlock keeps happening over and over.
I understand that with the current architecture we are bound to get deadlock errors regardless (in the case of 1000 competing messages), but the application should be resilient enough to recover from these errors after retrying the looping messages.
Any idea what we are missing here? Has anybody dealt with similar issues before?
I am looking for some design ideas that can make this work resiliently so that it recovers from this deadlock situation and eventually processes all messages in reasonable amount of time without using much additional hardware.
COMPUTATION DETAILS:
These 1000 messages will EACH create 4 objects of 4 different position types, each having a quantity associated with it. These quantities will have to be merged into those 4 different slots (depending on the position type). The deadlock happens when those 4 individual slots are being updated by each individual thread. We have already put those individual updates in a specific order before applying them to the database rows, to avoid any possible race conditions.
A deadlock implies that each thread is trying to update multiple rows in a single transaction and that those updates are being done in a different order across threads. The simplest possible answer, therefore, would be to modify the code so that messages within the same transaction are applied in some defined order (i.e. in order of the primary key). That would ensure that you would never get a deadlock though you'd still get blocking locks while one thread waits for another thread to commit its transaction.
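The "defined order" idea above can be sketched as sorting the rows a transaction is about to update by primary key before touching any of them. The `id` and `qty` fields and the `inLockOrder` helper are made up for this example; the point is only that every transaction ends up requesting its row locks in the same sequence.

```javascript
// Return the updates sorted by primary key, without mutating the input,
// so that every transaction acquires its row locks in the same order.
function inLockOrder(updates) {
  return [...updates].sort((a, b) => a.id - b.id);
}

// Two hypothetical transactions that touch the same rows
// but received their updates in a different order.
const txA = [{ id: 3, qty: 5 }, { id: 1, qty: 2 }];
const txB = [{ id: 1, qty: 7 }, { id: 3, qty: 1 }];

const orderA = inLockOrder(txA).map((u) => u.id);
const orderB = inLockOrder(txB).map((u) => u.id);
console.log(orderA, orderB);
```

Applied in arrival order, txA would lock row 3 and then wait for row 1 while txB holds row 1 and waits for row 3. After sorting, both start with row 1, so one of them simply blocks until the other commits.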
Taking a step back, though, it seems unlikely that you would really want many threads updating the same row in a table when you can't predict the order of the updates. It seems highly likely that this would lead to lots of lost updates and some rather unpredictable behavior. What, exactly, is your application doing that would make this sort of thing sensible? Are you doing something like updating aggregate tables after inserting rows into a detail table (i.e. updating the count of the number of views a post has in addition to logging information about a particular view)? If so, do those operations really need to be synchronous? Or could you update the view count periodically in another thread by aggregating the views over the past N seconds?
As for the MDB
Let it consume the messages, and update instance variables which contain the delta of the quantities of the processed messages (an MDB can carry state in its instance variables across multiple messages).
A @Schedule method in the same MDB persists the quantities in a single database transaction using a single SQL statement every second (for example):
update x set q1 = q1 + delta1, q2 = q2 + delta2, ...
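The accumulate-then-flush pattern described above can be sketched independently of the MDB machinery. In this minimal example `DeltaAccumulator` is a hypothetical class, the four column names are assumed from the q1, q2, ... in the statement, and `flush` only builds the statement and its parameters instead of talking to a database.

```javascript
// Assumed column names; the original statement only shows q1, q2, ...
const COLUMNS = ['q1', 'q2', 'q3', 'q4'];

class DeltaAccumulator {
  constructor() {
    this.reset();
  }

  // Start again with zero deltas and no pending messages.
  reset() {
    this.deltas = Object.fromEntries(COLUMNS.map((c) => [c, 0]));
    this.pending = 0;
  }

  // Called once per consumed message; no database access happens here.
  add(message) {
    for (const column of COLUMNS) {
      this.deltas[column] += message[column] ?? 0;
    }
    this.pending += 1;
  }

  // Called by the scheduler: one statement for all pending messages,
  // or null when nothing has arrived since the last flush.
  flush() {
    if (this.pending === 0) return null;
    const assignments = COLUMNS.map((c) => `${c} = ${c} + ?`).join(', ');
    const statement = { sql: `update x set ${assignments}`, params: COLUMNS.map((c) => this.deltas[c]) };
    this.reset();
    return statement;
  }
}

const acc = new DeltaAccumulator();
acc.add({ q1: 1, q2: 2, q3: 3, q4: 4 });
acc.add({ q1: 10, q2: 20, q3: 30, q4: 40 });
const statement = acc.flush();
const nothingPending = acc.flush();
console.log(statement, nothingPending);
```

However many messages arrive between two flushes, the database sees a single update, which is why contention on the row disappears. The trade-off is that deltas held in memory are lost if the process dies before the next flush.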
I have done some tests:
It takes 6s to create 1000 messages (JBoss 7 using HornetQ)
During that time, 840 messages were already persisted.
It takes another 2s to persist the remaining ones (the scheduled method ran every second)
This required seven SQL update commands in seven DB transactions
The load is caused entirely by creating the messages; there is no real load on the DB
Notes
You need another @PreDestroy method to persist the pending deltas to make sure that nothing gets lost
If you must guarantee transactional correctness, this approach is not suitable. In that case I suggest using a normal queue receiver (= no MDB), transacted session and receive(timeout) to collect 100 - 10000 messages (or until a timeout), do one DB transaction, and right after that the commit on the queue session. This is better, but it's still not XA transactional. If you need this, both commits need to be coordinated by a single XA transaction.