I want to create a Cassandra table with a list<int> column and insert an empty list:
CREATE TABLE test (
name text PRIMARY KEY,
scores list<int>
);
INSERT INTO test (name, scores) VALUES ('John', []);
However, the column comes back as null:
SELECT * FROM test;
 name | scores
------+--------
 John |   null
Does Cassandra not differentiate between null and empty list?
As always with Cassandra, the recommendation is not to insert NULLs or empty collection values. That saves you from tombstones, storage, and I/O bandwidth.
The reason Cassandra doesn't differentiate NULL from an empty collection is the way deletes are handled. There is no read before a delete in Cassandra; it simply writes a tombstone and moves on.
So you actually get penalized for initializing the list as empty, because doing so essentially creates a tombstone.
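If you simply want the row to exist and fill the list later, a minimal sketch (reusing the test table above) is to leave the column out entirely and append to it when data arrives:
-- Omit the collection column instead of inserting an empty list;
-- no cell is written for it, so no tombstone is created.
INSERT INTO test (name) VALUES ('John');
-- Append later; adding to an absent list behaves the same as adding to an empty one.
UPDATE test SET scores = scores + [90] WHERE name = 'John';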
Related
I have the following table:
CREATE TABLE example
(
id text,
users map<text,text>,
lastvisit int,
...
PRIMARY KEY (id)
);
Sometimes I update a column or a map entry like:
1) update example set users = users - {'JOE'} where id = 'id';
2) update example set users = users + {'JOE':'meta'} where id = 'id';
3) update example set lastvisit = 100 where id = 'id';
I need to know how each query handles the old data in terms of tombstones and compaction.
Below is what I have researched or been advised so far, but I'm specifically missing information on maps.
1) Deletes the map entry at key = 'JOE' by writing a tombstone only for that entry in the map. On compaction the old value is dropped.
2) Inserts the key-value pair into the map. The old entry is dropped at compaction since there is a newer entry.
3) The column value is updated and, as in 2), the old value is dropped at compaction.
The question in each case is: will the whole row be written again, or only the updated value with a newer timestamp?
1) A tombstone for the map item where key = 'JOE' will be inserted. On compaction the value is dropped.
2) The row doesn't get overwritten; a new map item is simply added.
3) Strictly speaking, it's not an UPDATE -- a new column will be inserted. All mutations in C* are inserts under the hood, even for deletes (see the sketch just below).
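As a small illustration of that upsert behaviour, here is a minimal sketch against the example table above ('new-id' is just a made-up key):
-- An UPDATE on a partition that was never inserted still writes the cell,
-- because internally the update is just another insert.
UPDATE example SET lastvisit = 100 WHERE id = 'new-id';
-- The "updated" row now exists:
SELECT id, lastvisit FROM example WHERE id = 'new-id';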
Here are some additional points:
You had a typo in your schema. It should be -- users map<text,text>.
For (1) you need to enclose the item in curly brackets, otherwise the CQL statement is invalid -- {'JOE'}.
For (2) you need a colon (:) to delimit the key and value -- {'JOE':'meta'}.
For (3) there's no evidence that lastvisit was previously set for that row, so a new cell lastvisit = 100 will simply be inserted and there's no old value to be deleted. Cheers!
I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records that were inserted within a given time range? I know that internally Cassandra must be using some timestamp to track the insertion time of each record, which is used by features like TTL.
Since there is no explicit column for the insertion timestamp in the given schema, is there a way to use the implicit timestamp, or is there a better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called write time that can be retrieved with CQL's writetime(column_name) function (warning: it doesn't work on collection columns and returns null for UDTs!). But because CQL has no nested queries, you will need to write a program that fetches the data, filters entries by write time, and deletes those whose write time is older than your threshold. (Note that the writetime value is in microseconds, not milliseconds as in CQL's timestamp type.)
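For example, against the TestTable above you could inspect the write time of a regular column like this (clientId is used because writetime() cannot be applied to primary key or collection columns; 'doc-1' is just a placeholder key):
-- Returns the insertion time in microseconds since the epoch
SELECT documentId, sequenceNo, writetime(clientId)
FROM TestTable
WHERE documentId = 'doc-1';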
The easiest way is to use the Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._

// writetime values are in microseconds, so convert the cutoff date accordingly
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
// read the primary key columns plus the write time of one regular column
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
  .select("prk1", "prk2", "reg_col".writeTime as "writetime")
  .filter(row => row.getLong("writetime") < timestamp)
// delete the matching rows by their primary key
oldData.deleteFromCassandra(srcKeyspace, srcTable,
  keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all components of the primary key (documentId and sequenceNo in your case), and reg_col is any "regular" column of the table that isn't a collection or a UDT (for example, clientId). It's important that the list of primary key columns in select and in deleteFromCassandra is the same.
Can anybody please help me understand why Cassandra shows null values for columns that were skipped? Isn't it supposed to skip the column, i.e. not insert any value (not even null) if I leave the column out of the insert entirely? I am a bit confused, because according to the following tutorial, data is stored by row key with only the columns that were written (see the diagram in the column family section); if that is true, I should not get null for the column.
Or is the whole concept I learned about the Cassandra column family wrong?
http://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
Here is the CQL script
create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
create table users (firstname text, lastname text, age int, gender ascii, primary key(firstname));
insert into users(firstname,age,gender,lastname) values('Michael',30,'male','smith');
Here, I am skipping a column, but when I run a select query, it shows null for that column. Why is Cassandra filling in null for that column?
insert into users(firstname,age,gender) values('Jane',23,'female');
select * from users;
Why don't you go to the most comprehensive source of documentation and learning for Cassandra: http://academy.datastax.com? And it's free. The content on tutorialspoint.com is very old and hasn't been updated in ages (SuperColumns have been deprecated since 2011 - 2012 ...).
Here, I am skipping a column, but when I run a select query, it shows null for that column. Why is Cassandra filling in null for that column?
In CQL, null means the value is not present or the value has been deleted.
Since you did not insert any value for the column lastname, Cassandra returns null (== not present, in this case).
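A quick way to confirm that nothing was stored for the skipped column is to check its write time (a small sketch against the users table above):
-- writetime() returns null for a column that was never written,
-- confirming that no cell (not even a null) exists for Jane's lastname.
select firstname, lastname, writetime(lastname) from users where firstname = 'Jane';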
I have a strange problem with a Cassandra (version 2.2.3) database and static columns, while writing a proof of concept for a simple application with send-money functionality.
My table is:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
PRIMARY KEY (profile, timestamp)) WITH CLUSTERING ORDER BY (timestamp ASC);
As a first step I add a new record:
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD');
Then I want to 'lock' the current user's transactions to do some action with their balance. I try to execute this request:
UPDATE transactions SET lock = 1 WHERE profile = 'test_profile' IF lock = null;
But as a result in cqlsh I see:
[applied]
-----------
False
I don't understand why it is 'False', because the current data for the profile is:
 profile      | timestamp                | lock | amount | balance
--------------+--------------------------+------+--------+---------
 test_profile | 2015-11-05 15:20:01+0000 | null |  10USD |    null
Any idea what I am doing wrong?
UPDATE
After reading Nenad Bozic's answer I modified my example to clarify why I need the condition in the update. Full code sample:
CREATE TABLE transactions (
profile text,
timestamp timestamp,
amount text,
balance text,
lock int static,
balances map<text,text> static,
PRIMARY KEY (profile, timestamp)
) WITH CLUSTERING ORDER BY (timestamp ASC);
INSERT INTO transactions (profile, timestamp, amount) VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '1USD');
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
BEGIN BATCH
UPDATE transactions SET balances={'USD':'1USD'} WHERE profile='test_profile';
UPDATE transactions SET balance='1USD' WHERE profile='test_profile' AND timestamp='2015-11-05 15:20:01+0000';
DELETE lock FROM transactions WHERE profile='test_profile';
APPLY BATCH;
And if I try to get the lock again I get:
INSERT INTO transactions (profile, lock) VALUES ('test_profile', 1) IF NOT EXISTS;
[applied] | profile | timestamp | balances | lock | amount | balance
-----------+--------------+-----------+-----------------+------+--------+---------
False | test_profile | null | {'USD': '1USD'} | null | null | null
When you INSERT you do not insert the lock field, which means this field does not exist. The null representation in cqlsh or DevCenter is only syntactic sugar to make the results look like tabular data, but in reality the row holds dynamic key-value pairs and lock is not present in that map of key-values. It is useful to look at the Thrift representation of the data, even though it is not used anymore, to get a sense of how it is stored on disk.
So when the UPDATE is fired it expects the column to be present in order to update it. In your case the lock column is not even present, so it cannot update it. This thread on the difference between INSERT and UPDATE is also a good read.
You have two solutions to make this work:
Insert null explicitly
You can add lock to your insert statement and set it to null explicitly (which in Cassandra is different from excluding it from the insert: this way it gets a null value, whereas when you exclude it the column does not exist in the row at all):
INSERT INTO transactions (profile, timestamp, amount, lock)
VALUES ( 'test_profile', '2015-11-05 15:20:01+0000', '10USD', null);
Use an INSERT for the second statement
Since your second statement sets lock for the first time (rather than updating an existing value), and since it is a static column for that partition, you can use INSERT ... IF NOT EXISTS instead of UPDATE ... IF as the LWT way of doing it (lock does not exist yet, so this will succeed the first time and fail all subsequent times once lock has a value):
INSERT INTO transactions (profile, lock)
VALUES ('test_profile', 1) IF NOT EXISTS;
I am trying to INSERT (also UPDATE and DELETE) data in Cassandra using a timestamp, but no change occurs in the table. Any help please?
BEGIN BATCH
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('1',null,null,null) USING TIMESTAMP 0;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('2',null,null,null) USING TIMESTAMP 1;
INSERT INTO transaction_test.users(email,age,firstname,lastname) VALUES ('3',null,null,null) USING TIMESTAMP 2;
APPLY BATCH;
I think you're running into Cassandra's "control of timestamps". Operations in C* are (in effect [1]) executed only if the timestamp of the new operation is higher than the previous one.
Let's see an example. Given the following insert
INSERT INTO test (key, value ) VALUES ( 'mykey', 'somevalue') USING TIMESTAMP 1000;
You expect this as output:
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
And it should look like this, unless someone before you performed an operation on this data with a higher timestamp. For instance, if you now write:
INSERT INTO test (key, value ) VALUES ( 'mykey', '999value') USING TIMESTAMP 999;
Here's the output
select key,value,writetime(value) from test where key='mykey';
key | value | writetime(value)
-------+-----------+------------------
mykey | somevalue | 1000
As you can see, neither the value nor the timestamp has been updated.
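To make a later write actually take effect it has to carry a higher timestamp than the stored one, for example (a small sketch continuing the same table); this is also why the batch in the question, written with TIMESTAMP 0, 1 and 2, appears to do nothing if those rows already exist with newer timestamps:
-- 2000 is higher than the stored writetime of 1000, so this write wins
INSERT INTO test (key, value) VALUES ('mykey', 'newervalue') USING TIMESTAMP 2000;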
[1] That's a slight simplification. Unless you are doing a specialised 'compare-and-set' write, Cassandra doesn't read anything from the table before it writes and it doesn't know if there is existing data or what its timestamp is. So you end up with two versions of the row, with different timestamps. But when you read the row back you always get the one with the latest timestamp. Normally Cassandra will compact such duplicate rows after a while, which is when the older timestamp row gets discarded.