Alternative for Default value in Cassandra

I'm looking for a way to set a default value (as in relational DBs) on a column in a Cassandra table. After some research I found out that this is not possible. However, I want to be able to update some info in my table later, because I don't know when it will arrive.
e.g.
id | value
----+--------------------------------
2 | placeholder
I was thinking of using a placeholder value until the needed info arrives and then updating it; however, I'm not sure this is best practice.
id | value
----+--------------------------------
2 | updated value
This must be a common case for tables. Do you know of any other approach to this use case?

As Aaron mentioned in a comment, you should be fine leaving the value null until the new value arrives. I think this is the best practice: the appropriate value of the column arrives later, so it is better to write it at that time and keep the column null until then. If the client application cannot perform a null check, you can store a placeholder value as in your example.
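A minimal Python sketch of the null-check approach, using a plain dict as a stand-in for the Cassandra table. The names `rows`, `display_value`, and the fallback text are hypothetical and only illustrate the idea of handling the missing value on the client side.

```python
# In-memory stand-in for the table: id -> value, where None means "not arrived yet".
rows = {2: None}

def display_value(row_id, fallback="(pending)"):
    """Return the stored value, or a client-side fallback while it is still null."""
    value = rows.get(row_id)
    return fallback if value is None else value

before = display_value(2)      # the value has not arrived yet
rows[2] = "updated value"      # later: the real value arrives and is written
after = display_value(2)
```

Nothing fake is stored in the table this way; the placeholder exists only in the client.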

Related

PostgreSQL: Is it possible to limit inserts per user based on time difference between timestamp column and current time?

I have an issue where two almost concurrent requests (about 10 ms apart) from the same user (unintentionally duplicated by the client side) both successfully execute the whole use-case logic, so it runs twice. I can't really solve this in the code of my API, so I've been thinking about how to limit one user_id to inserting a row into the table order at most once every second, for example.
I want to achieve this: if the table order contains a row with user_id X that was created (inserted) less than 1 second ago, an insert with user_id X fails.
This could be an effective way of avoiding requests unintentionally duplicated by the client side, because I can't imagine a situation in which a user would intentionally send two complex requests less than 1 second apart. I'm also interested in any other ideas, for example the proper way to deal with similar situations in APIs.
There is one problem with your idea. If the server becomes really slow for just a second, the orders will arrive more than one second apart in the database and will be inserted.
I'd recommend generating a unique ID, like a UUID, in the front-end, and sending that with the request. You could, for example, generate a new one every page load. Then, if the server sees that the received UUID already exists in the database, the order is skipped.
This avoids any potential timing issues, but also retains the possibility of someone re-ordering the exact same products.
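A rough sketch of the UUID idea in Python. It uses an in-memory SQLite database purely so the example is self-contained (the question is about PostgreSQL), and the table layout and the `place_order` helper are assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (request_id TEXT PRIMARY KEY, user_id INTEGER, item TEXT)"
)

def place_order(request_id, user_id, item):
    """Insert the order once; return False if this request_id was already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (request_id, user_id, item) VALUES (?, ?, ?)",
                (request_id, user_id, item),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a second row with the same client token.
        return False

token = str(uuid.uuid4())                             # generated once by the client
first = place_order(token, 1, "book")                 # original request
second = place_order(token, 1, "book")                # accidental duplicate, same token
reorder = place_order(str(uuid.uuid4()), 1, "book")   # deliberate re-order, new token
```

The duplicate is rejected by the database itself, regardless of how far apart the two requests arrive, while a deliberate re-order with a fresh token still goes through.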
You can do it with an EXCLUDE constraint. You need to create your own immutable helper function, and use an extension.
create extension btree_gist;
create function addsec(timestamptz) returns tstzrange immutable language sql as $$
select tstzrange($1,$1+interval '1 second')
$$;
create table orders (
userid int,
t timestamptz,
exclude using gist (userid with =, addsec(t) with &&)
);
But you should probably change the front end anyway to include a validation token, as currently it may be subject to CSRF attacks.
Note that EXCLUDE constraints may be much less efficient than UNIQUE constraints. Also, I'm not 100% sure that addsec really is immutable. There might be weird things with leap seconds or something similar that mess it up.
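To make the rule enforced by the constraint concrete, here is a rough pure-Python model of it. It is not how PostgreSQL implements the check. The `try_insert` helper and the in-memory `orders` list are hypothetical, and the model assumes the default half-open range bounds produced by tstzrange.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=1)
orders = []  # (userid, t) rows that have been accepted so far

def try_insert(userid, t):
    """Accept the row unless [t, t+1s) overlaps an existing range for the same user."""
    for other_user, other_t in orders:
        # Two half-open one-second ranges overlap exactly when their starts are < 1s apart.
        if other_user == userid and abs(t - other_t) < WINDOW:
            return False
    orders.append((userid, t))
    return True

t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
a = try_insert(1, t0)                                  # first order
b = try_insert(1, t0 + timedelta(milliseconds=10))     # duplicate inside the window
c = try_insert(2, t0 + timedelta(milliseconds=10))     # different user, unaffected
d = try_insert(1, t0 + timedelta(seconds=1))           # exactly one second later
```

In this model a row exactly one second later is accepted, because adjacent half-open ranges do not overlap.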

Does CQL3 "IF" make my update not idempotent?

It seems to me that using IF would make the statement possibly fail if re-tried. Therefore, the statement is not idempotent. For instance, given the CQL below, if it fails because of a timeout or system problem and I retry it, then it may not work because another person may have updated the version between retries.
UPDATE users
SET name = 'foo', version = 4
WHERE userid = 1
IF version = 3
Best practices for updates in Cassandra are to make updates idempotent, yet the IF operator is in direct opposition to this. Am I missing something?
If your application is idempotent, then generally you wouldn't need to use the expensive IF clause, since all your clients would be trying to set the same value.
For example, suppose your clients were aggregating some values and writing the result to a roll up table. Each client would calculate the same total and write the same value, so it wouldn't matter if multiple clients wrote to it, or what order they wrote to it, since it would be the same value.
If what you are actually looking for is mutual exclusion, such as keeping a bank balance, then the IF clause could be used. You might read a row to get the current balance, then subtract some money and update the balance only if the balance hadn't changed since you read it. If another client was trying to add a deposit at the same time, then it would fail and would have to try again.
But another way to do that without mutual exclusion is to write each withdrawal and deposit as a separate clustered transaction row, and then calculate the balance as an idempotent result of applying all the transaction rows.
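A small Python sketch of the transaction-row idea, with a dict standing in for the clustered rows. The `record` and `balance` helpers and the transaction ids are hypothetical.

```python
# Each transaction row is keyed by a unique id, so replaying a write changes nothing.
ledger = {}  # (account, txn_id) -> signed amount

def record(account, txn_id, amount):
    """Upsert a transaction row; writing the same row twice is a no-op."""
    ledger[(account, txn_id)] = amount

def balance(account):
    """Derive the balance by summing every transaction row for the account."""
    return sum(amt for (acct, _), amt in ledger.items() if acct == account)

record("alice", "t1", 100)   # deposit
record("alice", "t2", -30)   # withdrawal
record("alice", "t2", -30)   # retry after a timeout: same row, same value
total = balance("alice")
```

Because the retried withdrawal overwrites the same row with the same value, the derived balance is unaffected by the retry.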
You can use the IF clause for idempotent writes, but it seems pointless. The first client to do the write would succeed and Cassandra would return the value "applied=True". And the next client to try the same write would get back "applied=False, version=4", indicating that the row had already been updated to version 4 so nothing was changed.
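The applied / not-applied behaviour can be modelled in a few lines of Python. This is only an in-memory sketch of the compare-and-set semantics; it does not model Paxos or concurrency, and `update_if_version` and the starting row are made up for the example.

```python
users = {1: {"name": "bar", "version": 3}}

def update_if_version(userid, expected, new_name, new_version):
    """Apply the update only if the stored version matches the expected one."""
    row = users[userid]
    if row["version"] != expected:
        # Mirror the shape of the rejected result: not applied, plus the current value.
        return {"applied": False, "version": row["version"]}
    row.update(name=new_name, version=new_version)
    return {"applied": True}

first = update_if_version(1, 3, "foo", 4)   # succeeds
retry = update_if_version(1, 3, "foo", 4)   # same statement again
```

The retry reports that nothing was applied, yet the row ends up in the same state as after the first call.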
I think this question is more about linearizability (ordering) than idempotency. This query uses Paxos to try to determine the state of the system before applying a change. If the state of the system is identical, then the query can be retried many times without a change in the results. Unlike most Cassandra writes, this provides a weak form of ordering (and is expensive). Generally you should only use CAS operations if you are attempting to record the state of a system (rather than a history or log).
Do not use many of these queries if you can help it; the guidelines suggest having only a small percentage of your queries rely on this behavior.

How are secondary indexes implemented in PlayORM and are concurrent updates supported / handled?

There are several roll-your-own strategies for secondary indexes that handle concurrent updates, for example this one:
http://www.slideshare.net/edanuff/indexing-in-cassandra
which uses 3 ColumnFamilies.
My question is: how is the PlayORM @NoSqlIndexed annotation implemented, in terms of what extra ColumnFamilies are needed / created?
Additionally, are concurrent updates supported? That is, is it guaranteed that two competing updates cannot leave the index updated from one and the table from the other?
You can do concurrent updates with no locking.
Slide 46's question, "Can't I get a false positive?", applies to PlayOrm in the same way.
The one caveat is that you may need to resolve on read. For example, say you have Fred with an address of 123 in the database.
Now, two servers make an update to Fred
server 1 : Fred's new address is 456 (results in deleting index 123.fred and adding 456.fred)
server 2 : Fred's new address is 789 (results in deleting index 123.fred and adding 789.fred)
This means your index may contain both 456.fred and 789.fred. You can then resolve this on read, as the query WILL return Fred when you ask for people with address 456. There is another ticket out for us to resolve this on read for you ;) and eliminate the entry.
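The Fred example can be replayed in a small Python model. This is not PlayOrm code: the `index` set, the `table` dict, and the way `find_by_address` filters results are assumptions meant to show one possible resolve-on-read step.

```python
table = {"fred": {"address": "123"}}
index = {("123", "fred")}   # (address, row key) index entries

def index_update(key, old_address, new_address):
    """The index mutation a writer issues: drop the old entry, add the new one."""
    index.discard((old_address, key))
    index.add((new_address, key))

# Both servers read address 123, then write without any locking.
index_update("fred", "123", "456")    # server 1
index_update("fred", "123", "789")    # server 2
table["fred"]["address"] = "789"      # assume server 2's row write wins

def find_by_address(address):
    """Query the index, then drop entries that the row itself no longer agrees with."""
    candidates = [key for (addr, key) in index if addr == address]
    return [key for key in candidates if table[key]["address"] == address]

entries = sorted(index)
hits_456 = find_by_address("456")
hits_789 = find_by_address("789")
```

The index keeps both entries, but checking each candidate against the row filters out the stale one.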
We did ask about getting a change in Cassandra where we could possibly do (add column 456.fred IF column 123.fred exists, or fail), but we are not sure they will ever implement something like that. It would propagate a failure back to the loser (i.e. the last writer gets an exception), which would be nice.
BIG NOTE: Unlike CQL, the query is NOT sent to all nodes. It only puts load on the nodes that contain the index instead of all 100 computers, i.e. it can scale better this way.
MORE DETAIL: Slide 27 of the presentation you linked shows ALMOST the layout we use for our indexes. The format does not contain the 1, 2, 3 though. The index format is
Indexes=
{"User_Keys_By_Last_Name":{
{"adams","e5d…"}: null,
{"alden","e80…"}: null,
{"anderson","e5f…"}: null,
{"anderson","e71…"}: null,
{"doe","e78…"}: null,
{"franks","e66…"}: null,
…:…,
}
}
This way, we avoid the read that would otherwise be needed to find out whether to use 1, 2, 3, 4, or 5 for the second half of the name. Instead we use the FK, which we know is unique, and just have to do a write. Cassandra is all about resolving conflicts on read anyway, which is why the repair process exists. The design is based on the fact that conflicts happen a very low percentage of the time, so you only take the hit in those rare cases.
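A tiny Python model of why pairing the indexed value with the row key allows write-only index maintenance. The set stands in for the index row, and the row keys `key-1` to `key-3` are invented for the example.

```python
index = set()  # each element models one composite column name: (last_name, row_key)

def add_entry(last_name, row_key):
    """Blind write: the row key already makes the entry unique, so no read is needed."""
    index.add((last_name, row_key))

def keys_for(last_name):
    """Return the row keys of all entries whose first component matches."""
    return sorted(key for name, key in index if name == last_name)

add_entry("anderson", "key-1")
add_entry("anderson", "key-2")   # same last name, different row: no collision
add_entry("doe", "key-3")
add_entry("anderson", "key-1")   # replayed write: no duplicate entry appears
andersons = keys_for("anderson")
```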
LASTLY, you can just use the command line tool to view the index! It streams results back in batches of about 200 columns, so you could have 1 million entries and the command line tool will happily keep printing them until you press Ctrl-C.
later,
Dean
As of now, only 3 tables are created for all indexes in PlayOrm, i.e. all the indexes are stored in the StringIndice, IntegerIndice and DecimalIndice column families.
Apart from that, there is a pattern under development which will create a new table for the column if required. See the pattern details at https://github.com/deanhiller/playorm/issues/44.

How to get_range for available rows in Cassandra?

In my application, I want to get all the rows in a column family, but to ignore the rows that are temporarily unavailable (e.g. some nodes are down).
I have multiple nodes. If one of the nodes is down, then get_range will throw UnavailableException, and I get nothing.
What I want is to get all the rows that are currently available because, to the user, that's better than nothing. How can I do this?
I'm using pycassa.
The row keys in my column family are effectively random strings, so I cannot use get to fetch all the rows one by one.
If get_range by token support is added to pycassa, you could fetch each token range (as reported by describe_ring) separately, discarding those that resulted in an UnavailableException. Barring that, using consistency level ONE is your best option, as Dean mentions.
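The per-token-range idea could be structured roughly as follows. This is a sketch of the control flow only and deliberately avoids real pycassa calls: `fetch_range` is a hypothetical callable that returns the rows for one token range, and `UnavailableException` is defined locally as a stand-in for the client's exception.

```python
class UnavailableException(Exception):
    """Local stand-in for the exception raised when too few replicas are up."""

def fetch_available(token_ranges, fetch_range):
    """Fetch every token range separately, skipping the ones that are unavailable."""
    rows, skipped = {}, []
    for token_range in token_ranges:
        try:
            rows.update(fetch_range(token_range))
        except UnavailableException:
            skipped.append(token_range)   # remember the gap instead of failing
    return rows, skipped

# Fake cluster for demonstration: the middle range has no live replica.
fake_data = {
    ("0", "50"): {"k1": {"col": "v1"}},
    ("50", "100"): None,
    ("100", "0"): {"k2": {"col": "v2"}},
}

def fake_fetch(token_range):
    data = fake_data[token_range]
    if data is None:
        raise UnavailableException(token_range)
    return data

rows, skipped = fetch_available(list(fake_data), fake_fetch)
```

Returning the skipped ranges alongside the rows lets the application tell the user that the result is partial.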
There should be a call to get that takes a list of row keys, so you don't need to get them one by one. Also, if you have an index, that can help. For instance, playORM has an index for each partition of a table (and you can have as many partitions as you want). With that, you can iterate over each index and call get, passing it a LIST of keys.
Also, make sure your read consistency level is set to ONE as well ;).
later,
Dean

Hector Cassandra Data Retrieval

Is there any way to get all the data from a column family or from a key space?
I can't think of a way of doing this without knowing every single key for every single entry made to the database.
My problem is that I'm trying to create a Twitter clone where each message has its own id, and store those in the same keyspace in the same column family.
But then how do I get them back? I'll have to keep a track of every single id, and that can't possibly work.
Any help/ideas would be appreciated.
You can retrieve all data from a column family using get_range_slices, setting the range start and end to the same value to indicate that you want all data.
See the Cassandra FAQ
See http://aquiles.codeplex.com/discussions/278245 for a Thrift example.
Haven't yet found a handy Hector example but I think it uses RangeSlicesQuery...
However, it's not clear why you want to do this. For this sort of application you would normally look up the messages by ID, and use an index to determine which IDs you need, for example by storing a row for each user that lists all their messages. In the messages column family you might have something like:
MsgID0001 -> time     text
             1234567  Hello world
MsgID0300 -> time     text
             3456789  LOL ROTFL
And then in a "user2msg" column family, store the message IDs, perhaps using timestamps as column names so that the messages are stored sorted in time order:
UserID001 -> 1234567    3456789
             MsgID0001  MsgID0300
This can then be used to look up a particular user's messages, possibly filtered by time.
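The two column families can be mimicked with nested dicts to show the lookup path. This is only a Python model of the data layout above, not Hector code, and `messages_for` is a hypothetical helper.

```python
# "messages" column family: one row per message.
messages = {
    "MsgID0001": {"time": 1234567, "text": "Hello world"},
    "MsgID0300": {"time": 3456789, "text": "LOL ROTFL"},
}

# "user2msg" column family: one row per user, with timestamps as column names.
user2msg = {
    "UserID001": {1234567: "MsgID0001", 3456789: "MsgID0300"},
}

def messages_for(user_id, since=0):
    """Return a user's message texts in time order, optionally filtered by time."""
    timeline = user2msg.get(user_id, {})
    return [messages[timeline[ts]]["text"] for ts in sorted(timeline) if ts >= since]

all_texts = messages_for("UserID001")
recent = messages_for("UserID001", since=2000000)
```

Only the user's own row and the referenced message rows are touched, so no scan over all messages is needed.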
You'd then also need further column families to store user profiles etc.
Perhaps you need to add more detail to your question?
Update in response to comment: Yes, if you have one message per row, you have to retrieve each message individually. But what is your alternative? Retrieving all messages is only useful for doing batch processing of messages, not for (for example) showing a user their recent messages. Bear in mind that retrieving all messages could take a very long time - you have not explained why you want to retrieve all messages and what you are going to do with them all. How many messages are you expecting to have?
One possibility is to denormalise, i.e. in a row for each user, store the entire messages, so you don't have to do a separate lookup step for each message. This doubles the amount of storage required, however.
The answer I was looking for is CQL, Cassandra's query language. It works similarly to SQL, which is what I need for the function I'm after.
This link has some excellent tutorials.
