CQL generates two columns per value?

I am wondering why Cassandra creates two columns when I add a cell with CQL.
This is my schema:
DROP KEYSPACE IF EXISTS tsdb;
CREATE KEYSPACE tsdb WITH replication =
{
'class': 'SimpleStrategy',
'replication_factor' : 3
};
USE tsdb;
CREATE TABLE datapoints (
tsid int,
key text,
value blob,
PRIMARY KEY (tsid, key)
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
1,
'foo',
0x012345
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
2,
'foo',
0x500000
);
Querying it in CQLSH looks good:
cqlsh:tsdb> SELECT * FROM datapoints;
tsid | key | value
------+-----+----------
1 | foo | 0x012345
2 | foo | 0x500000
(2 rows)
but when I list the rows via cassandra-cli I get two columns per row:
[default@tsdb] list datapoints;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 1
=> (name=foo:, value=, timestamp=1405353603216000)
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:, value=, timestamp=1405353603220000)
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Elapsed time: 6.9 msec(s).
I was expecting to get something like:
-------------------
RowKey: 1
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Why does CQL create columns with the name "foo:" and an empty value? What are these good for?
Thank you!
Best,
Malte

Understanding How CQL3 Maps to Cassandra’s Internal Data Structure
Thanks to John Berryman for the in-depth explanation of how CQL maps onto the storage engine under the hood. In short, the empty "foo:" cell is CQL3's row marker: it keeps the row alive even if every non-key cell is later deleted (see CASSANDRA-4361, discussed in the answers below).
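A quick way to see what that marker buys you, as a sketch against the schema above (assuming nothing else has been deleted from the row): remove the only non-key cell and the row still exists.
DELETE value FROM datapoints WHERE tsid = 1 AND key = 'foo';
-- the row still comes back, with value = null, because the "foo:" marker cell remains
SELECT * FROM datapoints WHERE tsid = 1;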

Related

Cassandra create duplicate table with different primary key

I'm new to Apache Cassandra and have the following issue:
I have a table with PRIMARY KEY (userid, countrycode, carid). As described in many tutorials, this table can be queried using the following filter criteria:
userid = x
userid = x and countrycode = y
userid = x and countrycode = y and carid = z
This is fine for most cases, but now I need to query the table by filtering only on
userid = x and carid = z
Here, the documentation says that the best solution is to create another table with a modified primary key, in this case PRIMARY KEY (userid, carid, countrycode).
The question here is: how do I copy the data from the "original" table to the new one with the different primary key?
On small tables
On huge tables
And another important question concerning the duplication of a huge table: What about the storage needed to save both tables instead of only one?
You can use the COPY command to export from one table and import into the other.
From your example I created two tables, user_country and user_car, with the respective primary keys.
CREATE KEYSPACE user WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 } ;
CREATE TABLE user.user_country ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, country_code, car_id));
CREATE TABLE user.user_car ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, car_id, country_code));
Let's insert some dummy data into one table.
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('1', 'IN', 'CAR1');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('2', 'IN', 'CAR2');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('3', 'IN', 'CAR3');
cqlsh> select * from user.user_country ;
user_id | country_code | car_id
---------+--------------+--------
3 | IN | CAR3
2 | IN | CAR2
1 | IN | CAR1
(3 rows)
Now we will export the data into a CSV. Note the order of the columns.
cqlsh> COPY user.user_country (user_id,car_id, country_code) TO 'export.csv';
Using 1 child processes
Starting copy of user.user_country with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 4 rows/s; Avg. rate: 4 rows/s
3 rows exported to 1 files in 0.824 seconds.
export.csv can now be imported directly into the other table.
cqlsh> COPY user.user_car(user_id,car_id, country_code) FROM 'export.csv';
Using 1 child processes
Starting copy of user.user_car with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 6 rows/s; Avg. rate: 8 rows/s
3 rows imported from 1 files in 0.359 seconds (0 skipped).
cqlsh> select * from user.user_car ;
user_id | car_id | country_code
---------+--------+--------------
3 | CAR3 | IN
2 | CAR2 | IN
1 | CAR1 | IN
(3 rows)
cqlsh>
About your other question: yes, the data (and the storage for it) will be duplicated, but denormalizing into one table per query pattern is how Cassandra is meant to be used.
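For completeness, the duplicated table now serves the query that motivated it; a quick check against the dummy rows copied above:
-- should return the row with user_id = '1', car_id = 'CAR1', country_code = 'IN'
SELECT * FROM user.user_car WHERE user_id = '1' AND car_id = 'CAR1';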

Cassandra Internal Storage

If I create a table like this in Cassandra
CREATE TABLE example (
key1 text PRIMARY KEY,
map1 map<text,text>,
list1 list<text>,
set1 set<text>
);
and insert some data like this
INSERT INTO example (
key1,
map1,
list1,
set1
) VALUES (
'john',
{'patricia':'555-4326','doug':'555-1579'},
['doug','scott'],
{'patricia','scott'}
);
and look at the storage using CLI, I will see this
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=map1:doug, value='555-1579', timestamp=1374683971220000)
=> (column=map1:patricia, value='555-4326', timestamp=1374683971220000)
=> (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000)
=> (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000)
=> (column=set1:'patricia', value=, timestamp=1374683971220000)
=> (column=set1:'scott', value=, timestamp=1374683971220000)
Now my question is this: what is the first row in the CLI output? What does it mean? Why does it have no column name or value, but does have a timestamp?
The "row marker" was introduced [1] so the row doesn't disappear when you remove (set a column to null) the last column. Aligned with how traditional SQL implementations behaves)
You have also discovered how Cassandra represents collections under the hood.
Remember that:
Map keys must be unique (solved by putting the key in the cell name)
Lists can contain duplicates (solved by appending a UUID to the cell name)
Sets must not contain duplicates (solved by putting the element itself in the cell name, with an empty value)
[1] https://issues.apache.org/jira/browse/CASSANDRA-4361
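Because each element is its own storage cell, collection entries can be written and removed individually; a small sketch against the example table above (the phone number is made up for illustration):
UPDATE example SET map1['doug'] = '555-0000' WHERE key1 = 'john';  -- overwrites the map1:doug cell
UPDATE example SET list1 = list1 + ['doug'] WHERE key1 = 'john';   -- appends a new list1:<timeuuid> cell, so duplicates are allowed
UPDATE example SET set1 = set1 - {'scott'} WHERE key1 = 'john';    -- deletes the set1:'scott' cell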
This is because cassandra-cli shows you the Thrift representation of the rows.
The first piece of information is the primary key; since yours is just a partition key, it is the same as the row key. So you get its value as the RowKey ('john' in your example), followed by the timestamp.
You will get a more readable result set if you use cqlsh instead.
You can find more detail here:
https://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html
I hope this helps.

Cassandra IN query not working if table has SET type column

I am new to Cassandra and have an issue when using IN in a query.
If the table has no column of SET type, it works:
CREATE TABLE test (
test_date bigint,
test_id bigint,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test where test_date = 2022015 and test_id IN (1,2);
But if I add a column of SET type, e.g. tags set<text>, to the above table and rerun the select query, it gives an error:
CREATE TABLE test1 (
test_date bigint,
test_id bigint,
tags set<text>,
caption text,
PRIMARY KEY(test_date,test_id)
);
select * from test1 where test_date = 2022015 and test_id IN (1,2);
code=2200 [Invalid query] message="Cannot restrict column "test_id" by
IN relation as a collection is selected by the query"
I'm not sure why this restriction should apply particularly to collections. But in your case you can get around the issue by making test_id part of your partition key:
PRIMARY KEY((test_date,test_id))
This will allow you to do IN queries as long as you specify the first part of the composite key (test_date).
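A sketch of that reworked schema (the table name test2 is just illustrative; same columns as test1, but with a composite partition key):
CREATE TABLE test2 (
test_date bigint,
test_id bigint,
tags set<text>,
caption text,
PRIMARY KEY((test_date, test_id))
);
-- per the suggestion above, IN on the last partition-key column works once test_date is fixed with =
SELECT * FROM test2 WHERE test_date = 2022015 AND test_id IN (1, 2);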
I think you are seeing this error due to Cassandra's underlying storage model. When I query your test1 table within CQLSH (with my own test data), this is what I see:
aploetz#cqlsh:stackoverflow> SELECT * FROM test1;
test_date | test_id | caption | tags
-----------+---------+-----------+-------------------------
2022015 | 1 | blah blah | {'one', 'three', 'two'}
2022015 | 2 | blah blah | {'one', 'three', 'two'}
(2 rows)
This view gives a misleading interpretation of how the data is actually stored. This is what it looks like when I query the same table from within cassandra-cli:
[default@stackoverflow] list test1;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 2022015
=> (name=1:, value=, timestamp=1422895168730184)
=> (name=1:caption, value=626c616820626c6168, timestamp=1422895168730184)
=> (name=1:tags:6f6e65, value=, timestamp=1422895168730184)
=> (name=1:tags:7468726565, value=, timestamp=1422895168730184)
=> (name=1:tags:74776f, value=, timestamp=1422895168730184)
=> (name=2:, value=, timestamp=1422895161891116)
=> (name=2:caption, value=626c616820626c6168, timestamp=1422895161891116)
=> (name=2:tags:6f6e65, value=, timestamp=1422895161891116)
=> (name=2:tags:7468726565, value=, timestamp=1422895161891116)
=> (name=2:tags:74776f, value=, timestamp=1422895161891116)
1 Row Returned.
This suggests that collection (set) values are stored as additional column keys. A restriction on using the IN relation is that it must operate on the last key (partitioning or clustering) of a primary key. So I would guess that this is a limitation based on how Cassandra stores the collection data "under the hood."
And just a warning: using IN for production-level queries is not recommended. Some have even gone as far as to put it on the list of Cassandra anti-patterns. My answer to this question (Is the IN relation in Cassandra bad for queries?) explains why IN queries are not optimal.
EDIT
Just to see, I tried your schema with a list instead of a set to see if that made any difference. It still didn't work, but from within cassandra-cli the list appeared to add an additional UUID identifier to the column key and to store the actual value as the column value, which is different from how the set was treated. That must be how sets are restricted to unique values.
You can use a materialized view with test_id as part of the partition key to satisfy your requirement, if changing the primary key on your base table is not an option:
CREATE MATERIALIZED VIEW test1_mv AS
SELECT * FROM test1
WHERE test_date IS NOT NULL AND test_id IS NOT NULL
PRIMARY KEY((test_date,test_id));
Then use the Materialized View instead of the base table in your query:
select * from test1_mv where test_date = 2022015 and test_id IN (1,2);

How Can I Search for Records That Have A Null/Empty Field Using CQL?

How can I write a query to find all records in a table that have a null/empty field? I tried the query below, but it doesn't return anything.
SELECT * FROM book WHERE author = 'null';
null fields don't exist in Cassandra unless you add them yourself.
You might be thinking of the CQL data model, which hides certain implementation details in order to have a more understandable data model. Cassandra is sparse, which means that only data that is used is actually stored. You can visualize this by adding in some test data to Cassandra through CQL.
cqlsh> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 } ;
cqlsh> use test ;
cqlsh:test> CREATE TABLE foo (name text, age int, pet text, primary key (name)) ;
cqlsh:test> insert into foo (name, age, pet) values ('yves', 81, 'german shepherd') ;
cqlsh:test> insert into foo (name, pet) values ('coco', 'ferret') ;
cqlsh:test> SELECT * FROM foo ;
name | age | pet
-----+-----+------------------
coco | null | ferret
yves | 81 | german shepherd
So even though it appears that there is a null value, the actual value is nonexistent; CQL shows you a null because that makes more sense intuitively.
If you take a look at the table from the Thrift side, you can see that the table contains no such value for coco's age.
$ bin/cassandra-cli
[default@unknown] use test;
[default@test] list foo;
RowKey: coco
=> (name=, value=, timestamp=1389137986090000)
=> (name=pet, value=666572726574, timestamp=1389137986090000)
-------------------
RowKey: yves
=> (name=, value=, timestamp=1389137973402000)
=> (name=age, value=00000051, timestamp=1389137973402000)
=> (name=pet, value=6765726d616e207368657068657264, timestamp=1389137973402000)
Here, you can clearly see that yves has two columns (age and pet), while coco only has one: pet.
As far as I know you cannot do this with NULL.
As an alternative, you could use a different empty value, for example the empty string: ''
In that case you could select all books with an empty author like this (assuming the author column is appropriately indexed):
SELECT * FROM book WHERE author = '';
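A minimal sketch of that approach; the book schema and index name here are assumptions made for illustration:
CREATE TABLE book (id text PRIMARY KEY, author text, title text);  -- assumed schema
CREATE INDEX book_author_idx ON book (author);  -- secondary index so author can appear in WHERE
INSERT INTO book (id, author, title) VALUES ('b1', '', 'Untitled');  -- store '' instead of leaving author unset
SELECT * FROM book WHERE author = '';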
If your_column_name in your_table is of text data type, then the following should work:
SELECT * FROM your_table WHERE your_column_name >= '' ALLOW FILTERING;
You can also try workarounds at the data level, depending on your use case. For example, if you have a column column_a which holds only positive integer values, then to filter results by null values of this column you can apply the condition:
where column_a < 0
This will work if you are using Solr over Cassandra, but I am not sure about a direct Cassandra query:
SELECT * FROM BOOK WHERE solr_query = ' -author : [* TO *] '

what is the reason for composite column, there must be at least one column which is not part of the primary key

From the online documentation:
A CQL 3 table’s primary key can have any number (1 or more) of component columns, but there must be at least one column which is not part of the primary key.
What is the reason for that?
In CQL, I tried to insert a row with only the columns in the composite key. I can't see it when I do a SELECT:
cqlsh:demo> CREATE TABLE DEMO (
user_id bigint,
dep_id bigint,
created timestamp,
lastupdated timestamp,
PRIMARY KEY (user_id, dep_id)
);
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id)
... VALUES (100, 1);
cqlsh:demo> select * from demo;
cqlsh:demo>
But when I use the CLI, something shows up:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
1 Row Returned.
Elapsed time: 27 msec(s).
But I can't see the values of the columns.
After I add a column which is not in the primary key, the value shows up in CQL:
cqlsh:demo> INSERT INTO DEMO (user_id, dep_id, created)
... VALUES (100, 1, '7943-07-23');
cqlsh:demo> select * from demo;
user_id | dep_id | created | lastupdated
---------+--------+--------------------------+-------------
100 | 1 | 7943-07-23 00:00:00+0000 | null
Result from CLI:
[default@demo] list demo;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 100
invalid UTF8 bytes 0000ab7240ab7580
[default@demo]
Any idea?
Update: I found the reason why the CLI returns invalid UTF8 bytes 0000ab7240ab7580: the CLI is not compatible with tables created from CQL3. If I use the COMPACT STORAGE option, the value shows up correctly in the CLI.
What's really happening under the covers is that the non-key values are being saved using the primary key values which make up the row key and column names. If you don't insert any non-key values then you're not really creating any new column family columns. The row key comes from the first primary key, so that's why Cassandra was able to create a new row for you, even though no columns were created with it.
This limitation is fixed in Cassandra 1.2, which is in beta now.
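As a sketch of what that fix means in practice (Cassandra 1.2+ with CQL3, reusing the DEMO table above), the same key-only insert becomes visible to SELECT, because CQL3 writes the row-marker cell discussed in the earlier answers:
INSERT INTO demo (user_id, dep_id) VALUES (100, 1);
-- visible on 1.2+ even though only primary key columns were set
SELECT * FROM demo WHERE user_id = 100 AND dep_id = 1;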
