I have a table, say CREDIT_POINTS, with the columns below:
Company   Credit points   Amount
A         100             50
B         200             94
C         250             80
There are multiple threads that update this table. There is a method which reads the credit points, does some calculations, and updates both the amount and the credit points. These calculations take quite some time.
Suppose thread A reads the data and is doing its calculations. Before A writes back, thread B reads the data from the table, does its calculations, and updates the data. Here I am losing the update that thread A made. In many cases the credit points and amount will not be in sync because multiple threads are reading and updating the table.
One thing we could do here is use a synchronized method.
I am thinking of using Spring transactions. Are Spring transactions thread safe? What else would be a good option for this?
Any help greatly appreciated.
Note: I am using iBatis (ORM) and MySQL.
You definitely need transactions to make sure that you do your updates based on the data you previously read. The transaction must include both the read and the write operations.
To make sure that multiple threads cooperate you do not need synchronized; instead you have two options:
pessimistic locking: you use SELECT ... FOR UPDATE. This sets a lock which is released at the end of the transaction.
optimistic locking: during your update you detect whether the data has been changed in the meantime; if so, you have to repeat the read and the change. You can achieve this in your update statement by searching not only for the company (the primary key, I hope) but also for the amount and credit points previously read.
Both methods have their merits; a sketch of each follows below. I recommend making yourself familiar with these concepts before finishing this application. As soon as there is heavy load, if you got anything wrong, your amounts and credit points may end up calculated incorrectly.
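As a rough illustration, here is what both variants can look like with plain JDBC against MySQL. You would express the same SQL through your iBatis mappings; the column names company, credit_points and amount are assumptions based on the question.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CreditPointsDao {

    // Pessimistic variant: FOR UPDATE locks the row until commit/rollback,
    // so no other transaction can read-for-update or write it meanwhile.
    public void recalcPessimistic(Connection con, String company) throws Exception {
        con.setAutoCommit(false);
        try {
            int points;
            int amount;
            try (PreparedStatement read = con.prepareStatement(
                    "SELECT credit_points, amount FROM CREDIT_POINTS WHERE company = ? FOR UPDATE")) {
                read.setString(1, company);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) { con.rollback(); return; }
                    points = rs.getInt(1);
                    amount = rs.getInt(2);
                }
            }
            int newPoints = calculatePoints(points);  // the long-running calculation
            int newAmount = calculateAmount(amount);
            try (PreparedStatement write = con.prepareStatement(
                    "UPDATE CREDIT_POINTS SET credit_points = ?, amount = ? WHERE company = ?")) {
                write.setInt(1, newPoints);
                write.setInt(2, newAmount);
                write.setString(3, company);
                write.executeUpdate();
            }
            con.commit();
        } catch (Exception e) {
            con.rollback();
            throw e;
        }
    }

    // Optimistic variant: no lock is held during the calculation. The UPDATE
    // matches only if the row still holds the values we read; zero affected
    // rows means another thread got there first, so we re-read and retry.
    public void recalcOptimistic(Connection con, String company) throws Exception {
        while (true) {
            int points;
            int amount;
            try (PreparedStatement read = con.prepareStatement(
                    "SELECT credit_points, amount FROM CREDIT_POINTS WHERE company = ?")) {
                read.setString(1, company);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) return;
                    points = rs.getInt(1);
                    amount = rs.getInt(2);
                }
            }
            int newPoints = calculatePoints(points);
            int newAmount = calculateAmount(amount);
            try (PreparedStatement write = con.prepareStatement(
                    "UPDATE CREDIT_POINTS SET credit_points = ?, amount = ? "
                    + "WHERE company = ? AND credit_points = ? AND amount = ?")) {
                write.setInt(1, newPoints);
                write.setInt(2, newAmount);
                write.setString(3, company);
                write.setInt(4, points);
                write.setInt(5, amount);
                if (write.executeUpdate() == 1) return;  // success
            }
        }
    }

    private int calculatePoints(int points) { return points; } // placeholder
    private int calculateAmount(int amount) { return amount; } // placeholder
}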
I am having some difficulty with the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are capped for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter ones. How do I ensure correctness if I cannot batch the operation I am trying to perform and the counter update together? Is the counter type really needed if there is basically no race condition on the column, given that it is associated with each individual user?
My second idea would be to use a standard int column that can be used inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at a time, then you could rely on plain ints to perform the job.
The problem, however, is that you will need to perform a read-before-write anti-pattern. You could solve this as well, e.g. by skipping the read part: cache your ints and perform in-memory updates followed by writes only. This is viable if you couple your system with a caching server (e.g. Redis).
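For instance, a minimal sketch of that caching idea in Java with the Jedis client; the key scheme and the free-tier cap are invented here, not part of the question:

import redis.clients.jedis.Jedis;

public class MessageQuota {

    private static final long FREE_TIER_LIMIT = 100;  // assumed cap
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Returns true if the user may send another message. INCR is atomic,
    // so even concurrent sends by the same user cannot race each other.
    public boolean tryConsume(String userId) {
        long sent = jedis.incr("messages:sent:" + userId);
        if (sent > FREE_TIER_LIMIT) {
            jedis.decr("messages:sent:" + userId);  // hand the slot back
            return false;
        }
        // Persist the new count to Cassandra asynchronously, write-only.
        return true;
    }
}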
And thinking about it, you will still need to read these counters at some point, because if the number of messages a free user can send is bounded by some value, then you need to perform a check when they log in / try to send a new message / look at the dashboard / etc. and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT... type of query, even if this can become pretty inefficient very quickly in the Cassandra world.
It seems to me that using IF would make the statement possibly fail if retried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
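A sketch of that compare-and-set loop with the DataStax Java driver; the table and column names are invented, and wasApplied() reads the [applied] column Cassandra returns for IF statements:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.math.BigDecimal;

public class BalanceDao {

    // Read the row, compute off-line, and update only if the version we
    // read is still current; otherwise loop and try again.
    public void withdraw(Session session, int accountId, BigDecimal amount) {
        while (true) {
            // (For strict correctness this read should use SERIAL consistency.)
            Row row = session.execute(
                    "SELECT balance, version FROM accounts WHERE account_id = ?",
                    accountId).one();
            BigDecimal balance = row.getDecimal("balance");
            int version = row.getInt("version");

            ResultSet rs = session.execute(
                    "UPDATE accounts SET balance = ?, version = ? "
                  + "WHERE account_id = ? IF version = ?",
                    balance.subtract(amount), version + 1, accountId, version);
            if (rs.wasApplied()) {
                return;  // no concurrent writer interfered
            }
            // Another client changed the row between our read and write: retry.
        }
    }
}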
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
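A minimal sketch of that ledger approach, again with invented names. Re-inserting the same entry on a retry writes identical values, which is what makes it idempotent:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.math.BigDecimal;
import java.util.UUID;

// Assumed schema:
//   CREATE TABLE ledger (
//     account_id text,
//     entry_id   uuid,
//     delta      decimal,
//     PRIMARY KEY (account_id, entry_id));
public class Ledger {

    private final Session session =
            Cluster.builder().addContactPoint("127.0.0.1").build().connect("bank");

    // Idempotent: a retry with the same entry_id overwrites the same row
    // with identical values, so replays cannot double-count.
    public void record(String accountId, UUID entryId, BigDecimal delta) {
        session.execute(
                "INSERT INTO ledger (account_id, entry_id, delta) VALUES (?, ?, ?)",
                accountId, entryId, delta);
    }

    // The balance is derived by folding over all entries; nothing is ever
    // mutated in place, so no mutual exclusion is needed.
    public BigDecimal balance(String accountId) {
        BigDecimal total = BigDecimal.ZERO;
        for (Row row : session.execute(
                "SELECT delta FROM ledger WHERE account_id = ?", accountId)) {
            total = total.add(row.getDecimal("delta"));
        }
        return total;
    }
}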
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
This question is more about linearizability (ordering) than idempotency, I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical, the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive), unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.
I am implementing a session table with Node.js which will grow to a huge number of items. Each hash key is a UUID representing a user.
In order to delete the expired sessions, I must scan the table for the expired attribute and delete old sessions. I am planning to do this scan once every few days; other than that, I don't really need high read capacity.
I came up with two solutions, and I would like to hear some feedback about them.
1) UpdateTable to higher capacities only for that scheduled routine, and after the scan is done, simply reduce the table capacities to their original values.
2) Perform the scan, and when retrieving the 'LastEvaluatedKey' after every x MB read, introduce a delay (so as not to consume all the read units per second), then continue the scan with 'ExclusiveStartKey'.
If you're doing a scan, option 1 is your best bet. This is the only real way to guarantee that you won't affect your application's performance while the scan is ongoing.
The only thing you need to be sure of is that you only run this operation once a day; I believe you can only DOWNGRADE throughput units on a DynamoDB table twice per day (at most).
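For reference, a sketch of option 1 with the AWS SDK for Java; the table name and capacity figures are invented, and real code would wait for the table to return to ACTIVE after each change before proceeding:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

public class SessionSweeper {

    private final AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();

    private void setReadCapacity(long readUnits) {
        ddb.updateTable(new UpdateTableRequest()
                .withTableName("sessions")
                .withProvisionedThroughput(new ProvisionedThroughput(readUnits, 5L)));
        // The update is asynchronous: poll DescribeTable until the table
        // status is ACTIVE again before issuing the scan.
    }

    public void sweepExpiredSessions() {
        setReadCapacity(500L);      // raise capacity just for the sweep
        try {
            // ... scan for items whose expired attribute is in the past
            //     and delete them ...
        } finally {
            setReadCapacity(10L);   // drop back; counts toward the daily decrease limit
        }
    }
}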
This is an old question, but I saw it through a related question.
There is now a much better native solution: DynamoDB Time to Live
It allows you to specify one attribute per table that serves as the time-to-live value for each item. You can then set that attribute per item to a Unix timestamp that specifies when the item should be deleted.
Within about 24 hours of that timestamp, the item will be deleted at no additional charge.
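Enabling it is a one-time call. A sketch with the AWS SDK for Java, assuming a table named sessions and an attribute named expiresAt:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.TimeToLiveSpecification;
import com.amazonaws.services.dynamodbv2.model.UpdateTimeToLiveRequest;

public class EnableTtl {

    public static void main(String[] args) {
        AmazonDynamoDB ddb = AmazonDynamoDBClientBuilder.defaultClient();
        ddb.updateTimeToLive(new UpdateTimeToLiveRequest()
                .withTableName("sessions")
                .withTimeToLiveSpecification(new TimeToLiveSpecification()
                        .withAttributeName("expiresAt")
                        .withEnabled(true)));
        // Each item then stores expiresAt as a Number attribute holding an
        // epoch-seconds value, e.g. now + the session length in seconds.
    }
}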
I want to track events. I can safely assume that the total number of unique events will grow, but not at a fast rate, while the incoming stream of events will be lightning fast. So, yes, I have a write-intensive job at hand. Would it be bad practice to create an altogether separate table whenever I receive a brand new unique event, and to log who did that event and when in that table?
Having more than a few hundred tables is not recommended. Schema updates are slow when you have a lot of tables; update time basically becomes linear in the number of tables. Managing a large number of tables also requires good operations/automation/change-tracking practices.
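A common alternative, sketched below with the DataStax Java driver and invented names, is a single table with the event name in the partition key, so a brand new event type needs no schema change:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class EventStore {

    private final Session session =
            Cluster.builder().addContactPoint("127.0.0.1").build().connect("tracking");

    // One partition per event name; rows cluster by time within it.
    // (In practice you would add a time bucket to the partition key so
    // that hot events don't grow a single partition without bound.)
    public void createSchema() {
        session.execute(
                "CREATE TABLE IF NOT EXISTS events_by_name ("
              + " event_name text, occurred_at timeuuid, user_id text,"
              + " PRIMARY KEY ((event_name), occurred_at))"
              + " WITH CLUSTERING ORDER BY (occurred_at DESC)");
    }

    // Recording who did the event and when is a single fast write.
    public void record(String eventName, String userId) {
        session.execute(
                "INSERT INTO events_by_name (event_name, occurred_at, user_id)"
              + " VALUES (?, ?, ?)",
                eventName, UUIDs.timeBased(), userId);
    }
}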
I execute a batch update which modifies a few rows within a few column families. In case of a TimedOutException, some data could be modified, but possibly not the whole set...
In order to implement a compensating transaction, I would need to know what data (rows) was modified. Is there a way to find this out? Does the exception contain this information?
Thanks,
Maciej
Creating a system that can scale out means making some trade-offs; one of these is facilitating "idempotent" operations in your application.
This means that you would either:
- assume that the data was written somewhere and that the node will eventually become consistent, or
- fire the entire contents of the write again, perhaps after sleeping a given amount of time, or at a less restrictive consistency level (see the sketch below)
A good description of this approach can be found in section 6 of Pat Helland's "Building on Quicksand" paper: http://arxiv.org/pdf/0909.1788
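As a sketch of the second option with the newer DataStax Java driver (the question predates it, but the idea is the same for Thrift-based clients): because each statement in the batch sets explicit values rather than deltas, replaying the whole batch after a timeout is harmless:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class IdempotentRetry {

    public void writeWithRetry(Session session, BatchStatement batch)
            throws InterruptedException {
        batch.setConsistencyLevel(ConsistencyLevel.QUORUM);
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                session.execute(batch);
                return;
            } catch (WriteTimeoutException e) {
                // Some rows may already have been written; replaying simply
                // rewrites them with identical values, so no compensating
                // transaction is needed.
                Thread.sleep(1000L * (attempt + 1));
            }
        }
        // Last resort: accept a less restrictive consistency level.
        batch.setConsistencyLevel(ConsistencyLevel.ONE);
        session.execute(batch);
    }
}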