Cassandra Internal Storage - cassandra

If I create a table like this in Cassandra
CREATE TABLE example (
key1 text PRIMARY KEY,
map1 map<text,text>,
list1 list<text>,
set1 set<text>
);
and insert some data like this
INSERT INTO example (
key1,
map1,
list1,
set1
) VALUES (
'john',
{'patricia':'555-4326','doug':'555-1579'},
['doug','scott'],
{'patricia','scott'}
);
and look at the storage using CLI, I will see this
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=map1:doug, value='555-1579', timestamp=1374683971220000)
=> (column=map1:patricia, value='555-4326', timestamp=1374683971220000)
=> (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000)
=> (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000)
=> (column=set1:'patricia', value=, timestamp=1374683971220000)
=> (column=set1:'scott', value=, timestamp=1374683971220000)
Now my question is this: what is the first entry in the CLI output? What does it mean? Why does it have neither a column name nor a value, but still has a timestamp?

The "row marker" was introduced [1] so that a row does not disappear when you remove its last column (that is, set it to null). This matches how traditional SQL implementations behave.
You have also found out how Cassandra represents collections under the hood.
Remember that
Map keys should be unique (solved by making the key part of the column name)
List can contain duplicates (solved by appending a UUID to the column name)
Set should not contain duplicates (solved by making the element part of the column name and leaving the value empty)
[1] https://issues.apache.org/jira/browse/CASSANDRA-4361
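The mapping described above can be sketched in a few lines of Python. This is only an illustrative model of the cassandra-cli output shown in the question, not Cassandra's actual code: the function name cells_for_row is hypothetical, and uuid.uuid1() stands in for the time-based UUID that Cassandra attaches to list elements.

```python
import uuid


def cells_for_row(row_key, columns):
    """Flatten one CQL row into (column_name, value) pairs, Thrift-style."""
    cells = [("", "")]  # row marker: empty name, empty value
    for name, value in columns.items():
        if isinstance(value, dict):
            # map: the map key becomes part of the column name
            for k in sorted(value):
                cells.append((f"{name}:{k}", value[k]))
        elif isinstance(value, list):
            # list: a time-based UUID in the name keeps order and allows duplicates
            for item in value:
                cells.append((f"{name}:{uuid.uuid1().hex}", item))
        elif isinstance(value, (set, frozenset)):
            # set: the element itself is the column name, the value stays empty
            for item in sorted(value):
                cells.append((f"{name}:{item}", ""))
        else:
            cells.append((name, value))
    return row_key, cells


key, cells = cells_for_row("john", {
    "map1": {"patricia": "555-4326", "doug": "555-1579"},
    "list1": ["doug", "doug"],          # duplicates are fine in a list
    "set1": {"patricia", "scott"},
})
for name, value in cells:
    print(f"=> (column={name}, value={value})")
```

Because a set element lives in the column name, writing the same element twice just overwrites the same cell, which is how uniqueness falls out of the storage model.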

This is because cassandra-cli shows you the Thrift representation of the rows.
The first entry belongs to the primary key. Since yours is just a partition key, it is the same as the row key.
So under the row key (john in your example) you get an entry with an empty column name and value, followed by its timestamp.
You will get a more readable result set if you use cqlsh instead.
You can find more detail here:
https://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
I hope this helps

Related

Cassandra list type conflicts

If I have a List field in Cassandra and two people write to it at the same time, is it a simple last write wins or will it merge the writes?
For example: [a, b, c, d]
User1 -> [b, a, c, d] (move b to index 0)
User2 -> [a, b, d, c] (move c to index 3)
Will Cassandra merge the results and end up with [b, a, d, c] or will it use last write wins to the microsecond?
You will get the merged result.
Every time data is written to Cassandra, a timestamp is stored with each column. When you execute a read query, timestamps are used to pick a "winning" update within a single column or collection element.
What if I have a truly concurrent write with the same timestamp? In the unlikely case that two timestamps match to the microsecond, you might end up with a bad version, but Cassandra ensures that ties are broken consistently by comparing the byte values.
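As a rough illustration of that rule, here is a minimal sketch of picking a winner between two versions of the same cell. The reconcile function and the (timestamp, value) tuple layout are hypothetical, and the direction of the tie-break (the greater byte value wins) is an assumption; the real logic also has to deal with tombstones and TTLs.

```python
def reconcile(a, b):
    """Pick the winning cell between two versions of the same column.

    Each cell is a (timestamp_micros, value_bytes) tuple.
    """
    if a[0] != b[0]:
        return a if a[0] > b[0] else b   # the newer timestamp wins
    # Same timestamp: compare the raw bytes so every replica picks the same cell.
    # Assumption: the greater value wins.
    return a if a[1] >= b[1] else b


older = (1374687206993000, b"doug")
newer = (1374687324950000, b"matt")
print(reconcile(older, newer))           # the cell with the later timestamp

tie_a = (1374687324950000, b"eric")
tie_b = (1374687324950000, b"matt")
print(reconcile(tie_a, tie_b) == reconcile(tie_b, tie_a))  # order of arrival does not matter
```

The point of the byte comparison is only determinism: whichever replica sees the two writes, and in whichever order, the same cell is chosen.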
Cassandra stores a list (collection) differently from a normal column.
Example:
CREATE TABLE friendlists (
user text PRIMARY KEY,
friends list <text>
);
If we insert some dummy data :
 user     | friends
----------+-------------------------
 john     | [doug, patricia, scott]
 patricia | [john, lucifer]
The internal representation:
RowKey: john
=> (column=, value=, timestamp=1374687324950000)
=> (column=friends:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374687206993000)
=> (column=friends:26017c11f48711e2801fdf9895e5d0f8, value='patricia', timestamp=1374687206993000)
=> (column=friends:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374687206993000)
=> (column=friends:6c504b60f48711e2801fdf9895e5d0f8, value='matt', timestamp=1374687324950000)
=> (column=friends:6c504b61f48711e2801fdf9895e5d0f8, value='eric', timestamp=1374687324950000)
-------------------
RowKey: patricia
=> (column=, value=, timestamp=1374687352290000)
=> (column=friends:3b817b80f48711e2801fdf9895e5d0f8, value='john', timestamp=1374687243064000)
Here the internal column name is more complicated because a UUID is appended to the name of the CQL field "friends". This is used to keep track of the order of items in the list.
Every time you write the whole list with one of the queries below:
INSERT INTO friendlists (user , friends ) VALUES ( 'patricia', ['john', 'lucifer']);
// or
UPDATE friendlists SET friends = ['john', 'lucifer'] where user = 'patricia';
Cassandra creates a tombstone with a timestamp slightly lower than that of the write, which marks the previous list contents as deleted. So if two concurrent inserts happen with exactly the same timestamp, both sets of data are newer than the tombstone and both will survive.
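The tombstone behaviour described above can be modelled with a small sketch. Everything here is a simplification for illustration: the storage dict, the function names, and the integer ids (standing in for the time-based UUIDs in the column names) are all hypothetical, and the tombstone is assumed to sit exactly one microsecond below the write timestamp.

```python
import itertools


def make_id_source():
    """Hypothetical stand-in for TimeUUID generation: increasing integers."""
    counter = itertools.count()
    return lambda: next(counter)


def overwrite_list(storage, items, ts, next_id):
    """Model `SET friends = [...]`: a tombstone at ts - 1, then new cells at ts."""
    storage["tombstone_ts"] = max(storage.get("tombstone_ts", -1), ts - 1)
    for item in items:
        storage.setdefault("cells", []).append((next_id(), item, ts))


def read_list(storage):
    """Return the values of cells newer than the tombstone, ordered by cell id."""
    cutoff = storage.get("tombstone_ts", -1)
    live = [cell for cell in storage.get("cells", []) if cell[2] > cutoff]
    return [value for _, value, _ in sorted(live)]


next_id = make_id_source()

# Two overwrites at different times: the later tombstone shadows the earlier cells.
sequential = {}
overwrite_list(sequential, ["doug", "scott"], 100, next_id)
overwrite_list(sequential, ["john"], 200, next_id)
print(read_list(sequential))

# Two overwrites with the same timestamp: both sets of cells outlive the tombstone.
concurrent = {}
overwrite_list(concurrent, ["john"], 300, next_id)
overwrite_list(concurrent, ["lucifer"], 300, next_id)
print(read_list(concurrent))
```

In this model the two lists are only merged when the timestamps are identical; with different timestamps the later write's tombstone removes the earlier write's cells.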
Source :
http://mighty-titan.blogspot.com/2012/06/understanding-cassandras-consistency.html
http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure-sets-lists-and-maps/

How to get the raw row content in Cassandra 3.3

I am using Cassandra 3.3 and CQL to create the following table
CREATE TABLE collected_data (
collection_hour int,
source_id int,
entity_id int,
measurement text,
value text,
primary key((collection_hour),source_id,entity_id,measurement)
);
After inserting a bunch of values into this table I wish to see how each row is really stored in Cassandra. I have seen that people used cassandra-cli (the list command) for that, but it is no longer available in 3.3 (post 3.0).
Is there a way to query Cassandra to see how each row is really stored? I am looking for a tool or any way to do this from CQL...
Thank you
PS: in cassandra-cli one would use the "list" command and get output similar to the following (for a different table, of course):
RowKey: 1
=> (column=, value=, timestamp=1374546754299000)
=> (column=field2, value=00000002, timestamp=1374546754299000)
=> (column=field3, value=00000003, timestamp=1374546754299000)
RowKey: 4
=> (column=, value=, timestamp=1374546757815000)
=> (column=field2, value=00000005, timestamp=1374546757815000)
=> (column=field3, value=00000006, timestamp=1374546757815000)
The storage engine was rewritten in Cassandra 3.0, so the on-disk layout has changed completely.
There is no official documentation on this subject, but several places in the source code give the big picture of how data is laid out on disk:
UnfilteredSerializer: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L29-L71
Cell storage: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L145-L163
ClusteringPrefix: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ClusteringPrefix.java#L33-L45

Cassandra IN query not working if table has SET type column

I am new to Cassandra. I have an issue when using IN in cassandra query.
If table has no column of SET type it works.
CREATE TABLE test (
test_date bigint,
test_id bigint,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test where test_date = 2022015 and test_id IN (1,2);
But if I add a column of SET type, for example tags set<text>, to the above table and rerun the select query, it gives an error.
CREATE TABLE test1 (
test_date bigint,
test_id bigint,
tags set<text>,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test1 where test_date = 2022015 and test_id IN (1,2);
code=2200 [Invalid query] message="Cannot restrict column "test_id" by
IN relation as a collection is selected by the query"
I'm not sure why this restriction should apply particularly to collections. But in your case you can get around this issue by making test_id part of your partition key:
PRIMARY KEY((test_date,test_id))
This will allow you to do IN queries as long as you specify the first part of the composite key (test_date).
I think you are seeing this error due to Cassandra's underlying storage model. When I query your test1 table within CQLSH (with my own test data), this is what I see:
aploetz@cqlsh:stackoverflow> SELECT * FROM test1;
 test_date | test_id | caption   | tags
-----------+---------+-----------+-------------------------
   2022015 |       1 | blah blah | {'one', 'three', 'two'}
   2022015 |       2 | blah blah | {'one', 'three', 'two'}
(2 rows)
This view gives a misleading interpretation of how the data is actually stored. This is what it looks like when I query the same table from within cassandra-cli:
[default@stackoverflow] list test1;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 2022015
=> (name=1:, value=, timestamp=1422895168730184)
=> (name=1:caption, value=626c616820626c6168, timestamp=1422895168730184)
=> (name=1:tags:6f6e65, value=, timestamp=1422895168730184)
=> (name=1:tags:7468726565, value=, timestamp=1422895168730184)
=> (name=1:tags:74776f, value=, timestamp=1422895168730184)
=> (name=2:, value=, timestamp=1422895161891116)
=> (name=2:caption, value=626c616820626c6168, timestamp=1422895161891116)
=> (name=2:tags:6f6e65, value=, timestamp=1422895161891116)
=> (name=2:tags:7468726565, value=, timestamp=1422895161891116)
=> (name=2:tags:74776f, value=, timestamp=1422895161891116)
1 Row Returned.
This suggests that collection (set) values are stored as additional column keys. A restriction on the IN relation is that it must operate on the last key (partitioning or clustering) of a primary key. So I would guess that this is a limitation based on how Cassandra stores the collection data "under the hood."
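To make the cassandra-cli output above easier to read, the hex strings can be decoded back to text. This is a small sketch for illustration only; decode_cell_name is a hypothetical helper that assumes the "clustering:column:element" layout seen in the output and UTF-8 text values.

```python
def decode_cell_name(name):
    """Split a cassandra-cli composite name and hex-decode the set element, if any."""
    parts = name.split(":")
    if len(parts) == 3 and parts[2]:
        parts[2] = bytes.fromhex(parts[2]).decode("utf-8")
    return parts


for name in ["1:", "1:caption", "1:tags:6f6e65", "1:tags:7468726565", "1:tags:74776f"]:
    print(decode_cell_name(name))

# The caption value is hex-encoded in the same way.
print(bytes.fromhex("626c616820626c6168").decode("utf-8"))
```

Decoded this way, the three tags cells are the set elements 'one', 'three' and 'two' from the cqlsh view, each stored in a column name with an empty value.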
And just a warning, but using IN for production-level queries is not recommended. Some have even gone as far as to put it on the list of Cassandra anti-patterns. My answer to this question (Is the IN relation in Cassandra bad for queries?) explains why IN queries are not optimal.
EDIT
Just to see, I tried your schema with a list instead of a set to see if that made any difference. It still didn't work, but from within the cassandra-cli it appeared to add an additional UUID identifier to the key, and stored the actual value as the column value. Which is different from how a set was treated...this must be how sets are restricted to unique values.
You can use a Materialized View with test_id as a part of partitioning expression to satisfy your requirement if changing the PK on your base table is not an option:
CREATE MATERIALIZED VIEW test1_mv AS
SELECT * FROM test1
WHERE test_date IS NOT NULL AND test_id IS NOT NULL
PRIMARY KEY((test_date,test_id));
Then use the Materialized View instead of the base table in your query:
select * from test1_mv where test_date = 2022015 and test_id IN (1,2);

CQL generates two columns per value?

I am wondering why Cassandra creates two columns when I add a cell with CQL.
This is my schema:
DROP KEYSPACE IF EXISTS tsdb;
CREATE KEYSPACE tsdb WITH replication =
{
'class': 'SimpleStrategy',
'replication_factor' : 3
};
USE tsdb;
CREATE TABLE datapoints (
tsid int,
key text,
value blob,
PRIMARY KEY (tsid, key)
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
1,
'foo',
0x012345
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
2,
'foo',
0x500000
);
Querying it in CQLSH looks good:
cqlsh:tsdb> SELECT * FROM datapoints;
 tsid | key | value
------+-----+----------
    1 | foo | 0x012345
    2 | foo | 0x500000
(2 rows)
but when I list the rows via cassandra-cli I get two columns per row:
[default@tsdb] list datapoints;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 1
=> (name=foo:, value=, timestamp=1405353603216000)
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:, value=, timestamp=1405353603220000)
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Elapsed time: 6.9 msec(s).
I was expecting to get something like:
-------------------
RowKey: 1
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Why does CQL create columns with the name "foo:" and an empty value? What are these good for?
Thank you!
Best,
Malte
See "Understanding How CQL3 Maps to Cassandra's Internal Data Structure".
Thanks to John Berryman for the in-depth explanation of how CQL maps to storage under the hood.
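For the datapoints table in the question, the mapping can be sketched roughly as follows. This is a simplified model that reproduces the cassandra-cli output above, not Cassandra's implementation; to_thrift_rows is a hypothetical name and the sketch assumes a single partition key column and a single clustering column.

```python
def to_thrift_rows(cql_rows, partition_key, clustering_key):
    """Group CQL rows by partition key and emit (cell_name, value) pairs."""
    storage = {}
    for row in cql_rows:
        cells = storage.setdefault(row[partition_key], [])
        prefix = str(row[clustering_key])
        cells.append((prefix + ":", ""))  # marker cell: this CQL row exists
        for column, value in row.items():
            if column not in (partition_key, clustering_key):
                # clustering value is prefixed to the CQL column name
                cells.append((f"{prefix}:{column}", value))
    return storage


rows = [
    {"tsid": 1, "key": "foo", "value": "012345"},
    {"tsid": 2, "key": "foo", "value": "500000"},
]
for row_key, cells in to_thrift_rows(rows, "tsid", "key").items():
    print("RowKey:", row_key)
    for name, value in cells:
        print(f"=> (name={name}, value={value})")
```

The "foo:" cell is the marker for the CQL row itself, so the row still exists if value is later set to null; "foo:value" carries the actual data.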

Extra column created by CQL inserts (comparing to cli)

I see an extra column being created in my column family when I use CQL compared to the CLI.
Create table using CQL and insert row:
cqlsh:cassandraSample> CREATE TABLE bedbugs(
... id varchar,
... name varchar,
... description varchar,
... primary key(id, name)
... ) ;
cqlsh:cassandraSample> insert into bedbugs (id, name, description)
values ('Cimex','Cimex lectularius','http://en.wikipedia.org/wiki/Bed_bug');
Now insert column using cli:
[default@cassandraSample] set bedbugs['BatBedBug']['C. pipistrelli:description']='google.com';
Value inserted.
Elapsed time: 1.82 msec(s).
[default@cassandraSample] list bedbugs
... ;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: Cimex
=> (column=Cimex lectularius:, value=, timestamp=1369682957658000)
=> (column=Cimex lectularius:description, value=http://en.wikipedia.org/wiki/Bed_bug, timestamp=1369682957658000)
-------------------
RowKey: BatBedBug
=> (column=C. pipistrelli:description, value=google.com, timestamp=1369688651442000)
2 Rows Returned.
cqlsh:cassandraSample> select * from bedbugs;
 id        | name              | description
-----------+-------------------+--------------------------------------
 Cimex     | Cimex lectularius | http://en.wikipedia.org/wiki/Bed_bug
 BatBedBug | C. pipistrelli    | google.com
So CQL creates one extra column for each row, with the non-primary-key part of the name left empty. Isn't that a waste of space?
When you create the table from cqlsh with primary key(id, name), id becomes the partition key (the RowKey in cassandra-cli) and name becomes a clustering column whose value is prefixed to every internal column name for that CQL row. CQL also writes one marker column with an empty value ("Cimex lectularius:" above) so that the CQL row exists even when all its non-key columns are null. When you write through cassandra-cli you set exactly one column, so no marker is created.
For compatibility with cassandra-cli and to prevent this extra column from being created, change your create table statement to include "WITH COMPACT STORAGE".
described here
So
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
);
becomes
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
) WITH COMPACT STORAGE;
WITH COMPACT STORAGE is also how you would go about supporting wide rows in cql.

Resources