How to get the raw row content in Cassandra 3.3

I am using Cassandra 3.3 and CQL to create the following table
CREATE TABLE collected_data (
collection_hour int,
source_id int,
entity_id int,
measurement text,
value text,
primary key((collection_hour),source_id,entity_id,measurement)
);
After inserting a bunch of values into this table, I wish to see how each row is really stored in Cassandra. For that I have seen folks using cassandra-cli (the list command), but it is no longer available in 3.3 (it was removed post-3.0).
Is there a way to query Cassandra to see how each row is really stored? I am looking for a tool, or any way to do this from CQL...
Thank you
PS: in the Cassandra CLI one would use the "list" command and get output similar to the following (for a different table, of course):
RowKey: 1
=> (column=, value=, timestamp=1374546754299000)
=> (column=field2, value=00000002, timestamp=1374546754299000)
=> (column=field3, value=00000003, timestamp=1374546754299000)
RowKey: 4
=> (column=, value=, timestamp=1374546757815000)
=> (column=field2, value=00000005, timestamp=1374546757815000)
=> (column=field3, value=00000006, timestamp=1374546757815000)

The storage engine has been rewritten since Cassandra 3.0 so the on-disk layout has changed completely.
There is no official documentation on this subject, but you can look at several places in the source code to get a big picture of how data is laid out on disk (see also the inspection sketch after these links):
UnfilteredSerializer: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L29-L71
Cell storage: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/Cell.java#L145-L163
ClusteringPrefix: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ClusteringPrefix.java#L33-L45
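If you just want to see the raw rows rather than read the serialization code, the sstabledump tool that ships with later 3.x releases prints SSTable contents as JSON. Note it was introduced in 3.0.4/3.4 as the replacement for sstable2json, so check whether your exact version includes it. A minimal sketch, where the keyspace name and the SSTable file name are illustrative:
nodetool flush my_keyspace collected_data
sstabledump /var/lib/cassandra/data/my_keyspace/collected_data-*/ma-1-big-Data.db
The first command flushes memtables so the data is actually on disk; the second dumps every partition of that SSTable, including clustering values, cells, and timestamps.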

Cassandra Internal Storage

If I create a table like this in Cassandra
CREATE TABLE example (
key1 text PRIMARY KEY,
map1 map<text,text>,
list1 list<text>,
set1 set<text>
);
and insert some data like this
INSERT INTO example (
key1,
map1,
list1,
set1
) VALUES (
'john',
{'patricia':'555-4326','doug':'555-1579'},
['doug','scott'],
{'patricia','scott'}
);
and look at the storage using CLI, I will see this
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=map1:doug, value='555-1579', timestamp=1374683971220000)
=> (column=map1:patricia, value='555-4326', timestamp=1374683971220000)
=> (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000)
=> (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000)
=> (column=set1:'patricia', value=, timestamp=1374683971220000)
=> (column=set1:'scott', value=, timestamp=1374683971220000)
Now my question is this: what is the first entry in the CLI output? What does it mean? Why does it have neither a column name nor a value, yet still has a timestamp?
The "row marker" was introduced [1] so the row doesn't disappear when you remove (set a column to null) the last column. Aligned with how traditional SQL implementations behaves)
You have also found out how Cassandra represents collections under the hood.
Remember that:
Map keys must be unique (solved by storing the key in the column name)
Lists can contain duplicates (solved by appending a uuid to the column name)
Sets must not contain duplicates (solved by storing the element itself in the column name, with an empty value)
[1] https://issues.apache.org/jira/browse/CASSANDRA-4361
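To make the row marker concrete, here is a minimal cqlsh sketch against the example table above (the key values 'jane' and 'bob' are illustrative). INSERT statements write a row marker while UPDATE statements do not, which is why a row created only via UPDATE vanishes once its last column is deleted:
INSERT INTO example (key1) VALUES ('jane');              -- writes a row marker
SELECT key1 FROM example WHERE key1 = 'jane';            -- returned, even with no other columns set
UPDATE example SET map1 = {'a':'1'} WHERE key1 = 'bob';  -- no row marker written
DELETE map1 FROM example WHERE key1 = 'bob';             -- 'bob' now disappears entirely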
This is because cassandra-cli gives you the Thrift representation of the rows.
The first piece of information is the primary key; since yours is just a partition key, it is the same as the row key.
So you see your value as the RowKey ('john' in your example), followed by the timestamp.
You will get a more readable result set if you use cqlsh instead.
You can find more detail here:
https://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
I hope this helps

How to populate related table in Cassandra using CQL?

I am trying to practice Cassandra using this example (under the Composite Columns paragraph):
So, I have created table tweets and it looks like following:
cqlsh:twitter> SELECT * from tweets;
tweet_id | author | body
--------------------------------------+-------------+--------------
73954b90-baf7-11e4-a7d0-27983e9e7f51 | gwashington | I chopped...
(1 rows)
Now I am trying to populate timeline, which is a related table, using CQL, and I am not sure how to do it. I have tried the SQL approach, but it did not work:
cqlsh:twitter> INSERT INTO timeline (user_id, tweet_id, author, body) SELECT 'gmason', 73954b90-baf7-11e4-a7d0-27983e9e7f51, author, body FROM tweets WHERE tweet_id = 73954b90-baf7-11e4-a7d0-27983e9e7f51;
Bad Request: line 1:55 mismatched input 'select' expecting K_VALUES
So I have two questions:
How do I populate the timeline table with CQL, so that it relates to tweets?
How do I make sure that the timeline physical layout is created as shown in that example?
Thanks.
EDIT:
This is the explanation for my question #2 above (the picture is taken from here):
tldr;
Use cqlsh COPY to export tweets, modify the file, use COPY to import timeline.
Use cassandra-cli to verify the physical structure.
Long version...
I'll go a different way on this one, and suggest that it will probably be easier using the native COPY command in cqlsh.
I followed similar examples found here. After creating the tweets and timeline tables in cqlsh, I inserted rows into tweets as indicated. My tweets table then looked like this:
aploetz#cqlsh:stackoverflow> SELECT * FROM tweets;
tweet_id | author | body
--------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------
05a5f177-f070-486d-b64d-4e2bb28eaecc | gmason | Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
b67fe644-4dbe-489b-bc71-90f809f88636 | jmadison | All men having power ought to be distrusted to a certain degree.
819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1 | gwashington | To be prepared for war is one of the most effectual means of preserving peace.
I then exported them like this:
aploetz#cqlsh:stackoverflow> COPY tweets TO '/home/aploetz/tweets_20150223.txt'
WITH DELIMITER='|' AND HEADER=true;
3 rows exported in 0.052 seconds.
I then edited the tweets_20150223.txt file, adding a user_id column on the front and copying a couple of rows, like this:
userid|tweet_id|author|body
gmason|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
jmadison|b67fe644-4dbe-489b-bc71-90f809f88636|jmadison|All men having power ought to be distrusted to a certain degree.
gwashington|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
jmadison|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
ahamilton|819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1|gwashington|To be prepared for war is one of the most effectual means of preserving peace.
ahamilton|05a5f177-f070-486d-b64d-4e2bb28eaecc|gmason|Those gentlemen, who will be elected senators, will fix themselves in the federal town, and become citizens of that town more than of your state.
I saved that file as timeline_20150223.txt, and imported it into the timeline table, like this:
aploetz#cqlsh:stackoverflow> COPY timeline FROM '/home/aploetz/timeline_20150223.txt'
WITH DELIMITER='|' AND HEADER=true;
6 rows imported in 0.016 seconds.
Yes, timeline will be a wide-row table, partitioning on user_id and then clustering on tweet_id. I verified the "under the hood" structure by running the cassandra-cli tool, and listing the timeline column family (table). Here you can see how the rows are partitioned by user_id, and each column has the tweet_id uuid as a part of its name:
[default#stackoverflow] list timeline;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: ahamilton
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585904)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585904)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585904)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585715)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585715)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585715)
-------------------
RowKey: gmason
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:, value=, timestamp=1424707827585150)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:author, value=676d61736f6e, timestamp=1424707827585150)
=> (name=05a5f177-f070-486d-b64d-4e2bb28eaecc:body, value=54686f73652067656e746c656d656e2c2077686f2077696c6c20626520656c65637465642073656e61746f72732c2077696c6c20666978207468656d73656c76657320696e20746865206665646572616c20746f776e2c20616e64206265636f6d6520636974697a656e73206f66207468617420746f776e206d6f7265207468616e206f6620796f75722073746174652e, timestamp=1424707827585150)
-------------------
RowKey: gwashington
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585475)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585475)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585475)
-------------------
RowKey: jmadison
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:, value=, timestamp=1424707827585597)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:author, value=6777617368696e67746f6e, timestamp=1424707827585597)
=> (name=819d95e9-356c-4bd5-9ad0-8cd36a7aa5e1:body, value=546f20626520707265706172656420666f7220776172206973206f6e65206f6620746865206d6f73742065666665637475616c206d65616e73206f662070726573657276696e672070656163652e, timestamp=1424707827585597)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:, value=, timestamp=1424707827585348)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:author, value=6a6d616469736f6e, timestamp=1424707827585348)
=> (name=b67fe644-4dbe-489b-bc71-90f809f88636:body, value=416c6c206d656e20686176696e6720706f776572206f7567687420746f206265206469737472757374656420746f2061206365727461696e206465677265652e, timestamp=1424707827585348)
4 Rows Returned.
Elapsed time: 35 msec(s).
In order to accomplish this with a single statement you would need an ETL tool such as Hadoop or Spark. There is no INSERT ... SELECT in CQL, and this is for a reason: in the real world you would execute two inserts from your application, one into each table.
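A minimal sketch of the two-insert approach, reusing one of the example tweets above (wrapping both statements in a logged batch is optional, but gives you atomicity across the two tables):
BEGIN BATCH
INSERT INTO tweets (tweet_id, author, body)
VALUES (b67fe644-4dbe-489b-bc71-90f809f88636, 'jmadison', 'All men having power ought to be distrusted to a certain degree.');
INSERT INTO timeline (user_id, tweet_id, author, body)
VALUES ('jmadison', b67fe644-4dbe-489b-bc71-90f809f88636, 'jmadison', 'All men having power ought to be distrusted to a certain degree.');
APPLY BATCH;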
As for your second question: you will just have to trust that when your primary key has both a partition key and a clustering key, the data is stored in the wide-row format, as the cassandra-cli output above demonstrates.

Cassandra IN query not working if table has SET type column

I am new to Cassandra. I have an issue when using IN in a Cassandra query.
If the table has no column of SET type, the query works:
CREATE TABLE test (
test_date bigint,
test_id bigint,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test where test_date = 2022015 and test_id IN (1,2);
But if I add a column of SET type, e.g. tags set&lt;text&gt;, to the above table and rerun the select query, it gives an error:
CREATE TABLE test1 (
test_date bigint,
test_id bigint,
tags set<text>,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test1 where test_date = 2022015 and test_id IN (1,2);
code=2200 [Invalid query] message="Cannot restrict column "test_id" by
IN relation as a collection is selected by the query"
I'm not sure why this restriction should apply particularly to collections. But in your case you can get around the issue by making test_id part of your partition key:
PRIMARY KEY((test_date,test_id))
This will allow you to do IN queries as long as you specify the first part of the composite key (test_date).
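A minimal sketch of the reworked schema under that approach (the table name test2 is illustrative):
CREATE TABLE test2 (
test_date bigint,
test_id bigint,
tags set&lt;text&gt;,
caption text,
PRIMARY KEY((test_date, test_id))
);
select * from test2 where test_date = 2022015 and test_id IN (1,2);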
I think you are seeing this error due to Cassandra's underlying storage model. When I query your test1 table within CQLSH (with my own test data), this is what I see:
aploetz#cqlsh:stackoverflow> SELECT * FROM test1;
test_date | test_id | caption | tags
-----------+---------+-----------+-------------------------
2022015 | 1 | blah blah | {'one', 'three', 'two'}
2022015 | 2 | blah blah | {'one', 'three', 'two'}
(2 rows)
This view gives a misleading interpretation of how the data is actually stored. This is what it looks like when I query the same table from within cassandra-cli:
[default#stackoverflow] list test1;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 2022015
=> (name=1:, value=, timestamp=1422895168730184)
=> (name=1:caption, value=626c616820626c6168, timestamp=1422895168730184)
=> (name=1:tags:6f6e65, value=, timestamp=1422895168730184)
=> (name=1:tags:7468726565, value=, timestamp=1422895168730184)
=> (name=1:tags:74776f, value=, timestamp=1422895168730184)
=> (name=2:, value=, timestamp=1422895161891116)
=> (name=2:caption, value=626c616820626c6168, timestamp=1422895161891116)
=> (name=2:tags:6f6e65, value=, timestamp=1422895161891116)
=> (name=2:tags:7468726565, value=, timestamp=1422895161891116)
=> (name=2:tags:74776f, value=, timestamp=1422895161891116)
1 Row Returned.
This suggests that collection (set) values are stored as additional column keys. A restriction on using the IN relation is that it must operate on the last key (partition or clustering) of a primary key. So I would guess that this is a limitation based on how Cassandra stores the collection data "under the hood."
And just a warning: using IN for production-level queries is not recommended. Some have even gone as far as to put it on the list of Cassandra anti-patterns. My answer to this question (Is the IN relation in Cassandra bad for queries?) explains why IN queries are not optimal.
EDIT
Just to see, I tried your schema with a list instead of a set, to see if that made any difference. It still didn't work, but from within cassandra-cli it appeared to add an additional UUID identifier to the key, and stored the actual value as the column value. That is different from how a set is treated... this must be how sets are restricted to unique values.
You can use a Materialized View with test_id as part of the partitioning expression to satisfy your requirement, if changing the PK on your base table is not an option:
CREATE MATERIALIZED VIEW test1_mv AS
SELECT * FROM test1
WHERE test_date IS NOT NULL AND test_id IS NOT NULL
PRIMARY KEY((test_date,test_id));
Then use the Materialized View instead of the base table in your query:
select * from test1_mv where test_date = 2022015 and test_id IN (1,2);

Cassandra long row with different data types

I have read the following article about Cassandra CQL3 and Thrift API
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
In the article, they give an example on creating a scheme for gathering data from sensors.
They show a “wide row” solution by making the timestamp part of the column name. Cassandra's strength, as I see it, is its support for up to 2 billion columns per row and a fast way to extract data by column.
In the article, with CQL3 they build a table
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
PRIMARY KEY (sensor_id, collected_at)
) WITH COMPACT STORAGE;
which translates to:
sensor_id | collected_at | volts
1 | 2013-06-05 15:11:00-0500 | 3.1
1 | 2013-06-05 15:11:10-0500 | 4.3
1 | 2013-06-05 15:11:20-0500 | 5.7
2 | 2013-06-05 15:11:00-0500 | 3.2
3 | 2013-06-05 15:11:00-0500 | 3.3
3 | 2013-06-05 15:11:10-0500 | 4.3
In Thrift it translates to:
list data;
RowKey: 1
=> (cell=2013-06-05 15:11:00-0500, value=3.1, timestamp=1370463146717000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463282090000)
=> (cell=2013-06-05 15:11:20-0500, value=5.7, timestamp=1370463282093000)
RowKey: 2
=> (cell=2013-06-05 15:11:00-0500, value=3.2, timestamp=1370463332361000)
RowKey: 3
=> (cell=2013-06-05 15:11:00-0500, value=3.3, timestamp=1370463332365000)
=> (cell=2013-06-05 15:11:10-0500, value=4.3, timestamp=1370463332368000)
I'm trying to think of a Cassandra schema for the following sensor data gathering problem.
Let's say I add a new set of sensors which have a bigint (long) value (instead of float).
Any ideas on how to design a table schema that includes both sensor types with their different data types, yet keeps the columns keyed by timestamp?
Thanks,
Guy
If you don't need to use COMPACT STORAGE (and backwards compatibility with Thrift), just create your table as
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
other_field bigint,
volts float,
PRIMARY KEY (sensor_id, collected_at)
)
Cassandra supports sparse columns with essentially no overhead, so if you programmatically decide to populate only one of the two fields for any CQL row, you will achieve your goal.
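For example, a sketch of one insert per sensor type, populating only the relevant value column (the sensor ids and readings are illustrative); the unset column simply stores nothing on disk:
INSERT INTO data (sensor_id, collected_at, volts)
VALUES (1, '2013-06-05 15:11:00-0500', 3.1);           -- float sensor
INSERT INTO data (sensor_id, collected_at, other_field)
VALUES (4, '2013-06-05 15:11:00-0500', 12345678901);   -- bigint sensor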
Alternatively, you can continue to use COMPACT STORAGE and just switch to the blob type. The blob type does absolutely no interpretation or transformation of the bytes you insert into it, so accuracy can be guaranteed. I would not recommend using a text type for this.
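A sketch of the blob variant using CQL's built-in type-conversion functions (the table name data_blob is illustrative):
CREATE TABLE data_blob (
sensor_id int,
collected_at timestamp,
value blob,
PRIMARY KEY (sensor_id, collected_at)
) WITH COMPACT STORAGE;
INSERT INTO data_blob (sensor_id, collected_at, value)
VALUES (1, '2013-06-05 15:11:00-0500', floatAsBlob(3.1));
INSERT INTO data_blob (sensor_id, collected_at, value)
VALUES (4, '2013-06-05 15:11:00-0500', bigintAsBlob(12345678901));
Read the values back with blobAsFloat(value) or blobAsBigint(value), depending on which sensor type wrote the row.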

Extra column created by CQL inserts (comparing to cli)

I see an extra column being created in my column family when I use CQL, compared to the cli.
Create table using CQL and insert row:
cqlsh:cassandraSample> CREATE TABLE bedbugs(
... id varchar,
... name varchar,
... description varchar,
... primary key(id, name)
... ) ;
cqlsh:cassandraSample> insert into bedbugs (id, name, description)
values ('Cimex','Cimex lectularius','http://en.wikipedia.org/wiki/Bed_bug');
Now insert column using cli:
[default#cassandraSample] set bedbugs['BatBedBug']['C. pipistrelli:description']='google.com';
Value inserted.
Elapsed time: 1.82 msec(s).
[default#cassandraSample] list bedbugs
... ;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: Cimex
=> (column=Cimex lectularius:, value=, timestamp=1369682957658000)
=> (column=Cimex lectularius:description, value=http://en.wikipedia.org/wiki/Bed_bug, timestamp=1369682957658000)
-------------------
RowKey: BatBedBug
=> (column=C. pipistrelli:description, value=google.com, timestamp=1369688651442000)
2 Rows Returned.
cqlsh:cassandraSample> select * from bedbugs;
id | name | description
-----------+-------------------+--------------------------------------
Cimex | Cimex lectularius | http://en.wikipedia.org/wiki/Bed_bug
BatBedBug | C. pipistrelli | google.com
So CQL creates one extra column for each row, with an empty column name and an empty value. Isn't that a waste of space?
That extra entry with the empty column name and value is the CQL row marker (CASSANDRA-4361, discussed above): CQL writes it for every inserted row so that the row still exists even if all of its non-key columns are later deleted. cassandra-cli writes raw Thrift columns directly, which is why no such marker appears for the row you inserted through it.
For compatibility with cassandra-cli, and to prevent this extra column from being created, change your CREATE TABLE statement to include WITH COMPACT STORAGE, as described here.
So
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
);
becomes
CREATE TABLE bedbugs(
id varchar,
name varchar,
description varchar,
primary key(id, name)
) WITH COMPACT STORAGE;
WITH COMPACT STORAGE is also how you would go about supporting wide rows in CQL.
