I have the following table:
CREATE TABLE example
(
id text,
users map<text,text>,
lastvisit int,
...
PRIMARY KEY (id)
);
Sometimes I update a column or a map entry like:
1) update example set users = users - {'JOE'} where id = 'id';
2) update example set users = users + {'JOE':'meta'} where id = 'id';
3) update example set lastvisit = 100 where id = 'id';
I need to know how each query handles the old data in terms of tombstones and compaction.
The following is what I have researched or been advised, but I lack information specifically on maps:
1) Deletes the map entry at key = 'JOE' by generating a tombstone only for that entry in the map. On compaction the value is dropped.
2) Inserts the key/value pair into the map. The old entry is dropped at compaction since there is a newer entry.
3) The column entry is updated and, as in 2), the old value is dropped at compaction.
The question in each case is: will the whole row be written again, or only the updated value with a newer timestamp?
A tombstone for the map item where key = 'JOE' will be inserted.
The row doesn't get overwritten; it just adds a new map item.
Strictly speaking, it's not an UPDATE -- a new column will be inserted. All mutations in C* are inserts under the hood, even for deletes.
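You can see this for yourself with WRITETIME(), which works on regular (non-collection) columns: after running statement (3), only the mutated cell carries a newer timestamp, while the timestamps of untouched cells are unchanged. A minimal check, assuming the corrected schema above:
update example set lastvisit = 100 where id = 'id';
select lastvisit, writetime(lastvisit) from example where id = 'id';
-- writetime(lastvisit) now reflects the update; nothing else in the row was rewritten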
Here are some additional points:
You had a typo in your schema. It should be -- users map<text,text>.
For (1) you need to enclose the item in curly brackets otherwise the CQL statement is invalid -- {'JOE'}.
For (2) you need a colon (:) to delimit the key and value -- {'JOE':'meta'}.
For (3) there's no evidence that lastvisit was previously set to a value, so a new cell lastvisit = 100 will be inserted and there's no old value to be deleted. Cheers!
Related
I want to create a Cassandra table with a list<int> field and insert an empty list:
CREATE TABLE test (
name text PRIMARY KEY,
scores list<int>
);
INSERT INTO test (name, scores) VALUES ('John', []);
However, this returns null:
SELECT * FROM test;
 name | scores
------+--------
 John |   null
Does Cassandra not differentiate between null and empty list?
As always, the recommendation with Cassandra is: don't insert NULL or EMPTY values. It's just saving yourself from tombstones, storage, and I/O bandwidth.
The reason Cassandra doesn't differentiate NULL vs. empty is the way deletes are handled. There is no read before deleting any record in Cassandra, so it just writes a tombstone and moves on.
So you actually get penalized for initializing the list as empty (it essentially creates a tombstone).
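In practice the fix is simply to omit the column when there is nothing to store; reading it back gives the same null, but no tombstone is written. A minimal sketch using the table above:
INSERT INTO test (name) VALUES ('John');
SELECT * FROM test WHERE name = 'John';

 name | scores
------+--------
 John |   null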
select count (*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried changing the values of request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample (both are in C:\Programs\DataStax-DDC\apache-cassandra\conf), but without success.
How can I increase the timeout?
select count (*) is not doing what you think. It is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you will need to increment for every insert you do into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
mykey text,
count counter,
PRIMARY KEY (mykey)
);
Then for every insert into your table, do counter update:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counters are not idempotent, so in the rare event of a failure your data might be under- or over-counted. Also, the code above assumes that you only insert with a new key.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, you may need to consider using a lightweight transaction with insert (IF NOT EXISTS) and updating the counter only if the transaction was applied, as sketched below.
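A minimal sketch of that pattern (conditional statements return an [applied] column to the client, so the branching has to happen in your application code):
INSERT INTO my_table (mykey, mydata) VALUES (?, ?) IF NOT EXISTS;
-- the response contains an [applied] flag; only when it is true, run:
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;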
In Cassandra, I'm using this CQL:
select msg from log where id in ('A', 'B') and filter1 = 'filter'
(where id is the partition key, filter1 has a secondary index, and filter1 cannot be used as a clustering column)
This gives the response:
Select on indexed columns and with IN clause for the PRIMARY KEY are not supported
How can I change CQL to prevent this?
You would need to split that up into separate queries of:
select msg from log where id = 'A' and filter1 = 'filter';
and
select msg from log where id = 'B' and filter1 = 'filter';
Due to the way data is partitioned in Cassandra, CQL has a lot of seemingly arbitrary restrictions (to discourage inefficient queries and also because they are complex to implement).
Over time I think these restrictions will slowly be removed, but for now we have to work around them. For more details on the restrictions, see A deep look at the CQL where clause.
Another option is to build a table specifically for this query (a query table), with filter1 as the partition key and id as a clustering key. That way your query works and you avoid having a secondary index altogether.
aploetz@cqlsh:stackoverflow> CREATE TABLE log
(filter1 text,
id text,
msg text,
PRIMARY KEY (filter1, id));
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','A','message A');
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','B','message B');
aploetz@cqlsh:stackoverflow> INSERT INTO log (filter1, id, msg)
VALUES ('filter','C','message C');
aploetz@cqlsh:stackoverflow> SELECT msg FROM log
WHERE filter1='filter' AND id IN ('A','B');
msg
-----------
message A
message B
(2 rows)
You would still be using an "IN", which isn't known to perform well either. But you would also be specifying a partition key, so it might perform better than expected.
For Cassandra, do UPDATEs become an implied INSERT if the selected row does not exist? That is, if I say
UPDATE users SET name = 'Raedwald' WHERE id = 545127
and id is the PRIMARY KEY of the users table, and the table has no row with a key of 545127, will that be equivalent to
INSERT INTO users (id, name) VALUES (545127, 'Raedwald')
I know that the opposite is true: an INSERT for an id that already exists becomes an UPDATE of the row with that id. Older Cassandra documentation talked about inserts actually being "upserts" for that reason.
I'm interested in the case for CQL3, Cassandra version 1.2+.
Yes, for Cassandra UPDATE is synonymous with INSERT, as explained in the CQL documentation where it says the following about UPDATE:
Note that unlike in SQL, UPDATE does not check the prior existence of the row: the row is created if none existed before, and updated otherwise. Furthermore, there is no means to know which of creation or update happened. In fact, the semantics of INSERT and UPDATE are identical.
For the semantics to be different, Cassandra would need to do a read to know if the row already exists. Cassandra is write optimized, so you can always assume it doesn't do a read before write on any write operation. The only exception is counters (unless replicate_on_write = false), in which case replication on increment involves a read.
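A quick cqlsh illustration (assuming a users table as in the question, with no existing row for id 545127):
cqlsh> UPDATE users SET name = 'Raedwald' WHERE id = 545127;
cqlsh> SELECT * FROM users WHERE id = 545127;

 id     | name
--------+----------
 545127 | Raedwald

(1 rows)
The UPDATE created the row even though it didn't exist before.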
Unfortunately the accepted answer is not 100% accurate. Inserts are different from updates:
cqlsh> create table ks.t (pk int, ck int, v int, primary key (pk, ck));
cqlsh> update ks.t set v = null where pk = 0 and ck = 0;
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+---
(0 rows)
cqlsh> insert into ks.t (pk,ck,v) values (0,0,null);
cqlsh> select * from ks.t where pk = 0 and ck = 0;
pk | ck | v
----+----+------
0 | 0 | null
(1 rows)
Scylla does the same thing.
In Scylla and Cassandra rows are sequences of cells. Each column gets a corresponding cell (or a set of cells in the case of non-frozen collections or UDTs). But there is one additional, invisible cell - the row marker (in Scylla at least; I suspect Cassandra has something similar).
The row marker makes a difference for rows in which all other cells are dead: a row shows up in a query if and only if there's at least one alive cell. Thus, if the row marker is alive, the row will show up, even if all other columns were previously set to null using e.g. updates.
Inserts create a live row marker, while updates don't touch the row marker, so clearly they are different. The example above illustrates that.
One could argue that row markers are "internal" to Cassandra/Scylla, but as you can see, their effects are visible. Row markers affect your life whether you like it or not, so it may be useful to remember about them.
It's sad that no documentation mentions row markers (well, I found this: https://docs.scylladb.com/architecture/sstable/sstable2/sstable-data-file/#cql-row-marker but it's in the context of explaining SSTable internals, which is probably aimed at Scylla developers more than at users).
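You can observe the row marker in isolation: inserting only the primary key columns creates a row that shows up even though every regular cell is null (a quick experiment, reusing the ks.t table from above):
cqlsh> insert into ks.t (pk, ck) values (2, 2);
cqlsh> select * from ks.t where pk = 2 and ck = 2;

 pk | ck | v
----+----+------
  2 |  2 | null

(1 rows)
The row is alive solely because the insert wrote a live row marker.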
Bonus: a cell delete:
delete v from ks.t where pk = 0 and ck = 0
is the same as a null update:
update ks.t set v = null where pk = 0 and ck = 0
Indeed, a cell delete also doesn't touch the row marker. It only sets the specified cell to null.
This is different from a row delete:
delete from ks.t where pk = 0 and ck = 0
because row deletes insert a row tombstone, which kills all cells in the row (including the row marker). You could say that row deletes are the opposite of an insert. Updates and cell deletes are somewhere in between.
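A small cqlsh experiment makes the difference visible (again reusing ks.t; the comments mark what each step does):
cqlsh> insert into ks.t (pk, ck, v) values (1, 1, 1);
cqlsh> delete v from ks.t where pk = 1 and ck = 1;  -- cell delete: the row marker survives
cqlsh> select * from ks.t where pk = 1 and ck = 1;  -- 1 row, with v = null
cqlsh> delete from ks.t where pk = 1 and ck = 1;    -- row delete: the tombstone kills the marker too
cqlsh> select * from ks.t where pk = 1 and ck = 1;  -- (0 rows)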
What one can do, however, is this:
UPDATE table_name SET field = false WHERE key = 55 IF EXISTS;
This will ensure that your update is a true update and not an upsert.
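Conditional updates also report whether they were applied, so you can tell the two cases apart; assuming no row with key = 55 exists, nothing is written and cqlsh shows:
cqlsh> UPDATE table_name SET field = false WHERE key = 55 IF EXISTS;

 [applied]
-----------
     False

Keep in mind that lightweight transactions like IF EXISTS are considerably more expensive than plain writes.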
After reading this blog post at planetcassandra, I'm wondering how a CQL3 composite key with 3 fields maps into the Thrift column family world. For example:
CREATE TABLE comments (
article_id uuid,
posted_at timestamp,
author text,
karma int,
content text,
PRIMARY KEY (article_id, posted_at)
);
Here the column article_id will be mapped to the internal row key and posted_at will be mapped to (the first part of) the cell name.
What if the table design will be
CREATE TABLE comments (
author_id varchar,
posted_at timestamp,
article_id uuid,
author text,
karma int,
content text,
PRIMARY KEY (author_id, posted_at, article_id)
);
Will the internal row key be mapped to the first 2 fields of the composite key, with article_id mapped to the cell name, essentially slicing for as many articles as fit (up to 2 billion entries), and will any query on an author_id and posted_at combination be a single seek on disk?
Is the behavior same for any number of fields in a composite key?
Your answers are much appreciated.
The observation in the answer below is incorrect; the correct one is here.
I've personally verified:
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id = partition key, posted_at:article_id = cluster key
The first part of the composite key (author_id) is called the "partition key"; the rest (posted_at, article_id) are the clustering keys.
Cassandra stores columns differently when composite keys are used. The partition key becomes the internal row key. The clustering keys are concatenated with each column name (":" as separator) to form the internal column names. Column values remain unchanged.
Clustering keys are ordered, and it's not allowed to filter on an arbitrary clustering column: you have to start with the first one, then you can add the second one, and so on. This is evident from the "Bad Request" error, as the examples below show.
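For instance, with the second comments table above (the values are hypothetical):
SELECT * FROM comments WHERE author_id = 'a1';                               -- OK: partition key only
SELECT * FROM comments WHERE author_id = 'a1' AND posted_at > '2013-01-01'; -- OK: first clustering key
SELECT * FROM comments WHERE author_id = 'a1' AND article_id = 62c36092-82a1-3a00-93d1-46196ee77204; -- Bad Request: posted_at was skipped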
There's an excellent explanation by Aaron Morton at his site thelastpickle.
In the first case:
article_id = partition key, posted_at = cluster key
In the second case:
author_id + posted_at = partition key, article_id = cluster key
Hence be mindful of the disk seeks if you go with the second method, and check that the row is not getting too wide and that it gives a real benefit compared to the first case.
If you aren't crossing the 2 billion limit and are well within it, don't overdo it by adopting the second method, as the records are dispersed across the combined key.