Atomic UPDATE to increment integer in PostgreSQL - multithreading

I'm trying to figure out if the query below is safe to use for the following scenario:
I need to generate sequential numbers, without gaps. As I need to track many of them, I have a table holding sequence records, with a sequence integer column.
To get the next sequence, I'm firing off the SQL statement below.
WITH updated AS (
UPDATE sequences SET sequence = sequence + ?
WHERE sequence_id = ? RETURNING sequence
)
SELECT * FROM updated;
My question is: is the query above safe when multiple users fire it at the database at the same time without explicitly starting a transaction?
Note: the first parameter is usually 1, but could also be 10, for example, to get a block of 10 new sequence numbers.

Yes, that is safe.
While one such statement is running, all other such statements are blocked on a lock. The lock will be released when the transaction completes, so keep your transactions short. On the other hand, you need to keep your transaction open until all your work is done, otherwise you might end up with gaps in your sequence.
That is why it is usually considered a bad idea to ask for gapless sequences.
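For the block case mentioned in the question (an increment of 10), the value returned by the UPDATE is the last number of the block, so the block runs from returned - n + 1 to returned. Below is a minimal sketch of that arithmetic. It assumes the sequences table from the question and uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a server; the helper name next_block is hypothetical. In PostgreSQL the single UPDATE ... RETURNING statement from the question would replace the UPDATE plus SELECT pair used here.

```python
import sqlite3


def next_block(conn, sequence_id, n=1):
    """Reserve n consecutive numbers and return them as a list.

    Stand-in for the question's UPDATE ... RETURNING: the UPDATE and the
    SELECT run inside one transaction on the same connection.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE sequences SET sequence = sequence + ? WHERE sequence_id = ?",
            (n, sequence_id),
        )
        if cur.rowcount != 1:
            raise KeyError(sequence_id)
        (last,) = conn.execute(
            "SELECT sequence FROM sequences WHERE sequence_id = ?",
            (sequence_id,),
        ).fetchone()
    # The stored value is the last number of the block just reserved.
    return list(range(last - n + 1, last + 1))


# Assumed table layout, based on the column names in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (sequence_id INTEGER PRIMARY KEY, sequence INTEGER NOT NULL)"
)
conn.execute("INSERT INTO sequences VALUES (1, 0)")
conn.commit()

first = next_block(conn, 1)        # one number
block = next_block(conn, 1, 10)    # a block of ten
```

As the answer above notes, the numbers stay gapless only if the surrounding work commits together with the reservation; a rollback after the fact leaves no gap only because the increment is rolled back too.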

Unless I misunderstand the question, that's what the SERIAL data type is for:
https://www.postgresql.org/docs/8.1/static/datatype.html#DATATYPE-SERIAL

Related

Optimistic concurrency control clarification

I am new to ES7 and trying to understand optimistic concurrency control.
I think I understand that when I get-request a document and send its _seq_no and _primary_term values in a later write-request to the same document, if the values differ, the write will be completely ignored.
But what happens to the document in the default case where I don't send the _seq_no and _primary_term values? Will the write go through even if it has older _seq_no and _primary_term values (therefore making the index inconsistent), or only be processed if the values are newer?
If the former, will the document eventually be consistent?
I'm trying to figure out if I need to send these values to get eventual consistency or if I get it for free without sending those values.
It's a great distributed systems question. Let me break the problem down into sub-parts for readability, but first explain what _seq_no and _primary_term are, as there isn't much explanation of them on the ES site.
_seq_no is an incrementing counter that ES assigns to each write operation (index, update, delete). For example, if indexing a doc for the first time gives it the value 1, the next update gets 2, a following delete gets 3, and so on. Read operations don't change it.
_primary_term is also an incrementing counter, but it changes only when a replica shard is promoted to primary because of a network or other failure. If everything is healthy in your cluster it stays the same; when a failure causes a replica to be promoted to primary, it is increased.
Coming to the first question,
Q:- What happens to the document in the default case where I don't send the _seq_no and _primary_term values?
Ans:- You can run into the lost update problem. Suppose you have a counter that you are updating, and two requests simultaneously read the counter value as 1 and each try to increment it by 1. When you don't specify _seq_no and _primary_term explicitly, ES assigns them itself.
Both requests now reach ES at the same time. ES (the primary shard) processes them one by one, increasing the sequence number for each, so in the end your counter has the value 2 instead of 3. To make sure this doesn't happen, pass the values you read explicitly; when ES applies the second update it sees a different sequence number and rejects your request.
To prevent such lost updates, it is always recommended to send these values explicitly.
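The lost update described above can be reproduced with a toy model. The sketch below is not the Elasticsearch client: TinyShard, VersionConflict, increment and run are hypothetical names, the store is an in-memory dict standing in for a single primary shard, and _primary_term is left out for brevity. It only illustrates how checking the sequence number that was read turns a silent overwrite into a rejected write that can be retried.

```python
class VersionConflict(Exception):
    """Raised when the sequence number supplied by the caller is stale."""


class TinyShard:
    """Hypothetical in-memory stand-in for one primary shard."""

    def __init__(self):
        self._docs = {}        # doc_id -> (source, seq_no)
        self._next_seq_no = 0

    def get(self, doc_id):
        source, seq_no = self._docs[doc_id]
        return dict(source), seq_no

    def index(self, doc_id, source, if_seq_no=None):
        current = self._docs.get(doc_id)
        # Only check when the caller asked for it, as in the default case.
        if if_seq_no is not None and (current is None or current[1] != if_seq_no):
            raise VersionConflict(doc_id)
        seq_no = self._next_seq_no
        self._next_seq_no += 1
        self._docs[doc_id] = (dict(source), seq_no)
        return seq_no


def increment(shard, doc_id, snapshot, check):
    """Write counter + 1 based on a previously read snapshot."""
    source, seq_no = snapshot
    new_source = {"counter": source["counter"] + 1}
    shard.index(doc_id, new_source, if_seq_no=seq_no if check else None)


def run(check):
    shard = TinyShard()
    shard.index("c", {"counter": 1})
    a = shard.get("c")  # both clients read counter == 1
    b = shard.get("c")
    increment(shard, "c", a, check)
    try:
        increment(shard, "c", b, check)
    except VersionConflict:
        # Client B re-reads the document and retries with fresh values.
        increment(shard, "c", shard.get("c"), check)
    return shard.get("c")[0]["counter"]


lost = run(check=False)  # second write silently overwrites the first
safe = run(check=True)   # second write is rejected, then retried
```

Without the check both increments are accepted and one is lost; with it, the second client is forced to re-read before writing.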
Q:- I'm trying to figure out if I need to send these values to get eventual consistency or if I get it for free without sending those values..
Answer:- These values are related to concurrency control and have nothing to do with eventual consistency. In ES, writes always go to the primary shard, but reads can be served by any replica (which may contain stale data), and that is what makes ES eventually consistent.
Important read
https://www.elastic.co/blog/elasticsearch-sequence-ids-6-0

Lock CustomRecord Serials Table

We have a Fulfillment script in 1.0 that pulls a Serial number from the custom record based on SKU and other parameters. There is a search that is created based on SKU, and the first available record is used. One of the search criteria is that there is no end user associated with the key.
We are working on converting the script to 2.0. What I am unable to figure out is this: if the script (say the above functionality is put into the Map function of an MR script) runs on multiple queues/instances, does that mean there is a chance that 2 instances might hit the same entry of the Custom record? What is a workaround to ensure that X instances of the Map function don't end up using the same SN/Key? The way this could happen in 2.0 is that 2 instances of Map make a search request on the Custom record at the same time and get the same results, since the first Map has not completed processing and marked the key as used (by updating the end user information on the key).
Is there a better way to accomplish this in 2.0, or do I need to create another custom record that the script would have to read in order to pull a key? Also, is there a wait I can implement if the table is locked?
Thx
Probably the best thing to do here would be to break your assignment process into two parts, or restructure it so you end up with a Scheduled script that you give an explicit queue. That way your access to Serial Numbers will be serialized and no extra work would need to be done by you. If you need a hint on processing large batches with SS2, see https://github.com/BKnights/KotN-Netsuite-2 for a utility script that you can require for large batch processing.
If that's not possible then what I have done is the following:
Create another custom record called "Lock Table". It must have at least an id and a text field. Create one record and note its internal id. If you leave it with a name column then give it a name that reflects its purpose.
When you want to pull a serial number you:
Read from Lock Table with a lookup field function. If it's not 0 then do a wait*.
If it's 0 then generate a random integer from 0 to MAX_SAFE_INTEGER.
Try to write that to the "Lock Table" with a submit field function. Then read that back right away. If it contains your random number then you have the lock. If it doesn't then wait*.
If you have the lock then go ahead and assign the serial number. Release the lock by writing back a 0.
*wait:
This is tough in NS. Since I am not expecting the s/n assignment to take much time, I've sometimes implemented a wait by simply looping through what I hope is a CPU-intensive task that has no governance cost until some time has elapsed.
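The write-then-read-back steps above can be walked through with a small simulation. This is not SuiteScript: LockTable, lookup, submit, try_acquire and release are hypothetical stand-ins for the one-row custom record and the lookup field / submit field calls, and the token range starts at 1 (an assumption) so that a random token can never equal the "unlocked" value 0. The sketch is single-threaded and only mirrors the sequence of steps; it says nothing about how the real record behaves under truly simultaneous writes.

```python
import random

MAX_SAFE_INTEGER = 2**53 - 1  # same bound as JavaScript's Number.MAX_SAFE_INTEGER


class LockTable:
    """Hypothetical stand-in for the single 'Lock Table' record."""

    def __init__(self):
        self.value = 0  # 0 means unlocked

    def lookup(self):
        return self.value

    def submit(self, value):
        self.value = value


def try_acquire(table):
    """Return a token if the lock was taken, or None if the caller must wait."""
    if table.lookup() != 0:
        return None  # someone else holds the lock
    token = random.randint(1, MAX_SAFE_INTEGER)
    table.submit(token)
    # Read back right away: the lock is ours only if our token is stored.
    return token if table.lookup() == token else None


def release(table):
    """Release the lock by writing back a 0."""
    table.submit(0)


table = LockTable()
mine = try_acquire(table)    # succeeds, the table was unlocked
other = try_acquire(table)   # fails while the lock is held
release(table)
again = try_acquire(table)   # succeeds after the release
```

A caller that gets None back would run its wait and then call try_acquire again.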

Cassandra counter usage

I am finding some difficulties in the data modeling of an application which may involve the use of counters.
The app is basically a messaging app. Messages are bounded for free users, hence the initial plan of using a counter column to keep track of the total count.
I've discovered that batches (logged or not) cannot contain operations on both standard tables and counter tables. How do I ensure correctness if I cannot batch the operation I am trying to perform together with the counter update? Is the counter type really needed if there's basically no race condition on the column, given that it is tied to a single user?
My second idea would be to use a standard int column to use only inside batches. Is this a viable option?
Thank you
If you can absolutely guarantee that each user will produce only one update at time then you could rely on plain ints to perform the job.
The problem however is that you will need to perform a read-before-write anti-pattern. You could solve this as well, eg skipping the read part by caching your ints and performing in-memory updates followed by writes only. This is viable by coupling your system with a caching server (e.g. Redis).
And thinking about it, you would still need to read these counters at some point, because if the number of messages a free user can send is bounded by some value then you need to perform a check when they log in, try to send a new message, look at the dashboard, etc., and block their action.
Another option (if you store the messages sent by each user somewhere and don't want to add complexity to your system) could be to count them directly with a SELECT COUNT... type query, even though this could become pretty inefficient very soon in the Cassandra world.

Does CQL3 "IF" make my update not idempotent?

It seems to me that using IF would make the statement possibly fail if re-tried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
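A minimal sketch of that ledger idea, using a plain dict as a stand-in for the transaction table: each row is keyed by (account, transaction id), so re-sending the same transaction after a timeout rewrites the same row and the computed balance does not change. The names record and balance and the key layout are assumptions for illustration, not a Cassandra schema.

```python
from decimal import Decimal


def record(ledger, account, txn_id, amount):
    """Upsert one transaction row; the same txn_id always lands on the same row."""
    ledger[(account, txn_id)] = Decimal(amount)


def balance(ledger, account):
    """Derive the balance by summing every transaction row of the account."""
    return sum(
        (amount for (acct, _txn), amount in ledger.items() if acct == account),
        Decimal("0"),
    )


ledger = {}
record(ledger, "alice", "t1", "100.00")   # deposit
record(ledger, "alice", "t2", "-30.00")   # withdrawal
record(ledger, "alice", "t2", "-30.00")   # retry of the same withdrawal
alice_balance = balance(ledger, "alice")
```

The retry is harmless because the write carries its own identity; nothing has to be read before writing.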
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
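The applied=True / applied=False behaviour can be modelled with a toy compare-and-set on a dict. This is only an illustration of the semantics described above; update_if_version is a hypothetical helper, and no Paxos or Cassandra driver is involved.

```python
def update_if_version(row, expected_version, new_name, new_version):
    """Apply the update only if the row is still at expected_version."""
    if row["version"] != expected_version:
        # Mirror the result described above: not applied, plus the current value.
        return {"applied": False, "version": row["version"]}
    row["name"] = new_name
    row["version"] = new_version
    return {"applied": True}


row = {"userid": 1, "name": "old", "version": 3}
first = update_if_version(row, 3, "foo", 4)   # succeeds
retry = update_if_version(row, 3, "foo", 4)   # same statement sent again
```

In this toy model the retry leaves the row untouched, so repeating the statement is safe; what the caller cannot tell from applied=False alone is whether version 4 came from its own first attempt or from another writer.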
This question is more about linearizability (ordering) than idempotency, I think. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical then the query can be retried many times without a change in the results. This provides a weak form of ordering (and is expensive), unlike most Cassandra writes. Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.

How to safely INSERT / UPDATE a value in Postgres in a multithreaded environment

I have a table in PostgreSQL that looks like this
create table item_counts (
item text,
view_count int);
I would like to use the table to keep track of occurrences of item, incrementing the counts as necessary. Initially the table is unpopulated, so a new value is inserted iff it is observed for the first time, otherwise the view_count is increased. Speed and multitasking are both concerns.
I know I can do
rows_affected = execute("update item_counts set view_count = view_count + 1 where item = ?")
if rows_affected == 0:
    execute("insert into item_counts ...")
However, this is unsafe in a multithreaded environment, so I would have to wrap it into a transaction. This would in turn decrease the speed, since a commit would occur after each insert/update.
Any suggestions how to do it in a clean and efficient way?
If you are on 9.1, you might consider writeable CTEs:
http://vibhorkumar.wordpress.com/2011/10/26/upsertmerge-using-writable-cte-in-postgresql-9-1/
http://xzilla.net/blog/2011/Mar/Upserting-via-Writeable-CTE.html
Alternatively, you can set a savepoint, insert, and update on a unique-violation exception (rolling back to the savepoint). Whether it's better is doubtful, especially if you expect the workload to be mostly updates.
Also the transaction in case of concurrency may still fail at commit.
Also, you can do the insert select, inserting what's NOT in the table (using self-left-join or where not exists clause, whatever pleases you) and then update if it yields 0 affected rows.
And, perhaps, it's best if you do that in a function on the server side.
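One way to combine the update-first and insert-on-miss ideas above is a small retry loop: try the UPDATE, INSERT if no row was touched, and go back to the UPDATE if the INSERT hits a unique violation because another writer got there first. The sketch below assumes a unique constraint on item (the question's table does not declare one) and uses Python's built-in sqlite3 module as a stand-in so it runs without a server; increment_view_count is a hypothetical helper name.

```python
import sqlite3


def increment_view_count(conn, item, max_attempts=3):
    """Increment the counter for item, creating the row on first sight."""
    for _ in range(max_attempts):
        try:
            with conn:  # commits on success, rolls back on an exception
                cur = conn.execute(
                    "UPDATE item_counts SET view_count = view_count + 1 WHERE item = ?",
                    (item,),
                )
                if cur.rowcount == 0:
                    # No existing row: first observation of this item.
                    conn.execute(
                        "INSERT INTO item_counts (item, view_count) VALUES (?, 1)",
                        (item,),
                    )
            return
        except sqlite3.IntegrityError:
            # Another writer inserted the row in between; retry the UPDATE.
            continue
    raise RuntimeError("could not upsert %r after %d attempts" % (item, max_attempts))


conn = sqlite3.connect(":memory:")
# Assumption: item is unique, which is what makes the INSERT fail on a race.
conn.execute(
    "CREATE TABLE item_counts (item TEXT PRIMARY KEY, view_count INTEGER NOT NULL)"
)
for name in ["a", "b", "a", "a"]:
    increment_view_count(conn, name)
counts = dict(conn.execute("SELECT item, view_count FROM item_counts"))
```

On PostgreSQL 9.5 or later, INSERT ... ON CONFLICT DO UPDATE covers the same case in a single statement, which removes the need for the loop.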
