Row inserts with the same primary key are replacing previous writes in Cassandra - cassandra

I created a table in Cassandra where the primary key is based on two columns (groupname, type). When I try to insert more than one row with the same groupname and type, Cassandra does not store more than one row: each subsequent write with the same groupname and type replaces the previous ones, and only the latest write survives. Why is Cassandra replacing rows in this manner instead of storing every row I insert?
Write 1
cqlsh:resto> insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','SportsDen','VK Group','Majestic','Bangalore','India');
Write 2
insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','Sports Spot','VK Group','Bandra','Mumbai','India');
Write 3
cqlsh:resto> insert into restmaster (rest_id,type,rname,groupname,address,city,country)values(blobAsUuid(timeuuidAsBlob(now())),'SportsBar','Cricket Heaven ','VK Group','Connaught Place','New Delhi','India');
The result I'm expecting (check rows 4, 5 and 6):
groupname | type | rname
----------------+------------+-----------------
none | Udipi | Gayatri Bhavan
none | dinein | Blue Diamond
VK Group | FoodCourt | FoodLion
VK Group | SportsBar | Sports Den
VK Group | SportsBar | Sports Spot
VK Group | SportsBar | Cricket Heaven
Viceroy Group | Vegetarian | Palace Heights
Mainland Group | Chinese | MainLand China
JSP Group | FoodCourt | Nautanki
Ohris | FoodCourt | Ohris
But this is the actual result (write 3 has replaced the previous two inserts, rows 4 and 5):
cqlsh:resto> select groupname,type,rname From restmaster;
groupname | type | rname
----------------+------------+-----------------
none | Udipi | Gayatri Bhavan
none | dinein | Blue Diamond
VK Group | FoodCourt | FoodLion
VK Group | SportsBar | Cricket Heaven
Viceroy Group | Vegetarian | Palace Heights
Mainland Group | Chinese | MainLand China
JSP Group | FoodCourt | Nautanki
Ohris | FoodCourt | Ohris
cqlsh:resto> describe table restmaster;
CREATE TABLE restmaster (
groupname text,
type text,
address text,
city text,
country text,
rest_id uuid,
rname text,
PRIMARY KEY ((groupname), type)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.100000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.000000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

All inserts to the Cassandra database are actually insert/update ("upsert") operations, and there can be only one set of non-key values per unique primary key. This means you can never have more than one set of values for a given primary key, and you will only ever see the last write.
More info:
http://www.datastax.com/documentation/cql/3.1/cql/cql_intro_c.html
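To illustrate, two inserts against the same (groupname, type) pair behave as one write followed by an overwrite (values taken from the question; only a subset of columns is shown for brevity):

```sql
-- Both statements target the same primary key
-- (groupname = 'VK Group', type = 'SportsBar'),
-- so the second silently overwrites the first.
INSERT INTO restmaster (groupname, type, rname)
VALUES ('VK Group', 'SportsBar', 'Sports Den');

INSERT INTO restmaster (groupname, type, rname)
VALUES ('VK Group', 'SportsBar', 'Sports Spot');

-- Returns a single row, with rname = 'Sports Spot'.
SELECT groupname, type, rname FROM restmaster WHERE groupname = 'VK Group';
```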
Update: A data model
If you used a key like
PRIMARY KEY ((groupname), type, rname)
As long as your restaurant names are unique, you will be able to get the results you are expecting. But what you really should be asking is: "What queries would I like to perform on this data?" Every Cassandra table should be built around satisfying a class of queries. The key above basically says: "This table is constructed to quickly look up all the restaurants in a particular group, and the only conditionals I will use are on type and on restaurant name."
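A sketch of the full revised table, reusing the columns and types from the DESCRIBE output above:

```sql
CREATE TABLE restmaster (
    groupname text,
    type text,
    rname text,
    rest_id uuid,
    address text,
    city text,
    country text,
    PRIMARY KEY ((groupname), type, rname)
);
```

With rname as the last clustering column, each restaurant within a (groupname, type) pair gets its own row instead of overwriting the previous one.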
Example queries you could perform with that schema:
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' ;
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' and type = 'Formal' ;
SELECT * FROM restmaster WHERE groupname = 'Lettuce Entertain You' and type = 'Formal'
and rname > 'C' and rname < 'Y' ;
If that isn't the kind of queries you want to be performing in your application or you want other queries in addition to those, you most likely will need additional tables.

Related

How do I model a CQL table such that it can be queried by zip_code, or by zip_code and hash?

Hi all, I have a Cassandra table with Hash as the primary key and another column containing a List. I want to add another column named Zipcode, such that I can query Cassandra by either zipcode alone, or by zipcode and hash:
Hash | List | zipcode
select * from table where zip_code = '12345';
select * from table where zip_code = '12345' && hash='abcd';
Is there any way that I could do this?
The recommendation in Cassandra is to design your tables based on your access patterns. For example, in your case you would like to get results by zipcode, and by zipcode and hash, so ideally you would have two tables like this:
CREATE TABLE keyspace.table1 (
zipcode text,
field1 text,
field2 text,
PRIMARY KEY (zipcode));
and
CREATE TABLE keyspace.table2 (
hashcode text,
zipcode text,
field1 text,
field2 text,
PRIMARY KEY ((hashcode,zipcode)));
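Each table then serves exactly one of the two access patterns (table and column names as sketched above):

```sql
-- lookup by zipcode alone
SELECT * FROM keyspace.table1 WHERE zipcode = '12345';

-- lookup by hash and zipcode together: both parts of the
-- composite partition key must be supplied
SELECT * FROM keyspace.table2
 WHERE hashcode = 'abcd' AND zipcode = '12345';
```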
You may then need to redesign your tables based on your data. I recommend you understand data model design in Cassandra before proceeding further.
The ALLOW FILTERING construct can also be used, but whether it is appropriate depends on how big your data is. If you have very large data, avoid it: it requires a complete scan of the table, which is expensive in terms of resources and time.
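For completeness, the filtered form would look like this (again using the sketched column names); it works, but forces a scan across all partitions:

```sql
-- anti-pattern on large tables: filters on a non-key column,
-- so every partition must be read
SELECT * FROM keyspace.table1 WHERE field1 = 'abcd' ALLOW FILTERING;
```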
It is possible to design a single table that will satisfy both app queries.
In this example schema, the table is partitioned by zip code with hash as the clustering key:
CREATE TABLE table_by_zipcode (
zipcode int,
hash text,
...
PRIMARY KEY(zipcode, hash)
)
With this design, each zip code can have one or more rows of hash. Here's the table with some test data in it:
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
456 | tuv | 5 | banana
456 | xyz | 4 | apple
The table contains two partitions zipcode = 123 and zipcode = 456. The first zip code has three rows (abc, def, ghi) and the second has two rows (tuv, xyz).
You can query the table using just the partition key (zipcode), for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123;
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
123 | def | 2 | bob
123 | ghi | 3 | charli
It is also possible to query the table with the partition key zipcode and clustering key hash, for example:
cqlsh> SELECT * FROM table_by_zipcode WHERE zipcode = 123 AND hash = 'abc';
zipcode | hash | intcol | textcol
---------+------+--------+---------
123 | abc | 1 | alice
Cheers!

Last record of each group in Cassandra

I have a table with this schema:
create table last_message_by_group
(
date date,
created_at timestamp,
message text,
message_id bigint,
group_id bigint,
primary key (date, created_at, message_id)
)
with clustering order by (created_at desc)
and data should be:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 6:21:43 | ccc | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
| 2021-05-11 | 4:38:23 | aaa | 2 |
It shows messages ordered by created_at desc, partitioned by date.
But the problem is that it cannot get the last message of each group, like:
| date | created_at | message | group_id |
| 2021-05-11 | 7:23:54 | ddd | 1 |
| 2021-05-11 | 5:35:16 | bbb | 2 |
created_at is a clustering key, so it can't be updated; instead I delete and insert a new row for every new message per group_id, and this approach performs poorly.
Is there any way to do that?
I was able to get this to work by making one change to your primary key definition. I added group_id as the first clustering key:
PRIMARY KEY (date, group_id, created_at, message_id)
After inserting the same data, this works:
> SELECT date, group_id, max(created_at), message
FROM last_message_by_group
WHERE date='2021-05-11'
GROUP BY date,group_id;
date | group_id | system.max(created_at) | message
------------+----------+---------------------------------+---------
2021-05-11 | 1 | 2021-05-11 12:23:54.000000+0000 | ddd
2021-05-11 | 2 | 2021-05-11 10:35:16.000000+0000 | bbb
(2 rows)
There's more detail on using CQL's GROUP BY clause in the official docs.
There is one problem: because you changed the clustering key, messages will now be ordered by group_id first. Any idea how to still order by created_at and keep one message per group?
From the document linked above:
the GROUP BY option only accept as arguments primary key column names in the primary key order.
Unfortunately, if we were to adjust the primary key definition to put created_at before group_id, we would also have to group by created_at. That would create a "group" for each unique created_at, which negates the idea behind group_id.
In this case, you may have to decide between having the grouped results in a particular order vs. having them grouped at all. It might also be possible to group the results, but then re-order them appropriately on the application side.

Cassandra - CQL - Order by desc on partition key

I created a table in Cassandra for monitoring inserts from an application.
My partition key is an int composed of year+month+day, my clustering key is a timestamp, followed by my username and some other fields.
I would like to display the last 5 inserts, but it seems the partition key takes precedence over the "order by desc".
How can I get the correct result? Normally the clustering key induces the order, so why do I get this result? (Thanks in advance.)
Informations :
Query : select tsp_insert, txt_name from ks_myKeyspace.myTable limit 5;
Result :
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
Wanted :
idt_day | tsp_insert | txt_name
----------+--------------------------+----------
20161031 | 2016-10-31 09:24:32+0000 | Jacquie
20161031 | 2016-10-31 09:23:32+0000 | Gabriel
20161028 | 2016-10-28 15:21:09+0000 | Jean
20161028 | 2016-10-28 15:21:01+0000 | Michel
20161028 | 2016-10-28 15:20:44+0000 | Quentin
My table :
CREATE TABLE ks_myKeyspace.myTable(
idt_day int,
tsp_insert timestamp,
txt_name text, ...
PRIMARY KEY (idt_day, tsp_insert)) WITH CLUSTERING ORDER BY (tsp_insert DESC);
Ultimately, you are seeing the current order because you are not using a WHERE clause. You can see what's going on if you use the token function on your partition key:
aploetz#cqlsh:stackoverflow> SELECT idt_day,tsp_insert,token(idt_day),txt_name FROM mytable ;
idt_day | tsp_insert | system.token(idt_day) | txt_name
----------+---------------------------------+-----------------------+----------
20161028 | 2016-10-28 15:21:09.000000+0000 | 810871225231161248 | Jean
20161028 | 2016-10-28 15:21:01.000000+0000 | 810871225231161248 | Michel
20161028 | 2016-10-28 15:20:44.000000+0000 | 810871225231161248 | Quentin
20161031 | 2016-10-31 09:24:32.000000+0000 | 5928478420752051351 | Jacquie
20161031 | 2016-10-31 09:23:32.000000+0000 | 5928478420752051351 | Gabriel
(5 rows)
Results in Cassandra CQL will always come back in order of the hashed token value of the partition key (which you can see by using token). Within the partition keys, your CLUSTERING ORDER will be enforced.
That's key to understand... Result set ordering in Cassandra can only be enforced within a partition key. You have no control over the order that the partition keys come back in.
In short, use a WHERE clause on your idt_day and you'll see the order you expect.
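For instance, restricting the query to a single day (one partition) lets the clustering order take effect (day value taken from the sample data above):

```sql
-- rows within the 20161031 partition come back
-- ordered by tsp_insert DESC, per the CLUSTERING ORDER
SELECT idt_day, tsp_insert, txt_name
  FROM ks_myKeyspace.myTable
 WHERE idt_day = 20161031
 LIMIT 5;
```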
It seems to me that you are getting the whole thing wrong. Partition keys are not used for ordering data; they are used only to determine the location of your data in the cluster, specifically the node. Moreover, order really only matters within a partition...
Your query results are effectively unpredictable. Depending on which node is faster to answer (assuming a cluster and not a single node), you can get a different result every time. You should avoid selecting without partition restrictions; such queries don't scale.
You can, however, change your queries and perform one SELECT per day; then you would be querying ordered data (your clustering key) in an ordered manner (you manually choose the order of the days in your queries). As a side note, this would also be faster, because you could query multiple partitions in parallel.

Cassandra storage internal

I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
CREATE TABLE log_date (
userid bigint,
time timeuuid,
category text,
subcategory text,
itemid text,
count int,
price int,
PRIMARY KEY ((userid), time) - #1
PRIMARY KEY ((userid), time, category, subcategory, itemid, count, price) - #2
);
Suppose that I have a table like above.
In case of #1, a CQL row will generate 6(or 5?) columns in storage.
In case of #2, a CQL row will generate a very composite column in storage.
I'm wondering which is the more effective way to store logs in Cassandra.
Please focus on the two given situations.
I don't need any real-time reads; just writes.
If you want to suggest other options please refer to the following.
The reasons I chose Cassandra for storing logs are
Linear scalability and good for heavy writing.
It has schema in CQL. I really prefer having a schema.
Seems to support Spark well enough. Datastax's cassandra-spark connector seems to have data locality awareness.
I'm trying to understand what exactly happens internally in storage engine level when a row(columns) is inserted in a CQL style table.
Let's say that I build tables with both of your PRIMARY KEYs, and INSERT some data:
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date1;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
aploetz#cqlsh:stackoverflow2> SELECT userid, time, dateof(time), category, subcategory, itemid, count, price FROM log_date2;
userid | time | dateof(time) | category | subcategory | itemid | count | price
--------+--------------------------------------+--------------------------+----------+----------------+-------------------+-------+-------
1002 | e2f67ec0-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:48:20-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1002 | 15d0fd20-f589-11e4-ade7-21b264d4c94d | 2015-05-08 08:49:45-0500 | Audio | Headphones | 228-5-44343-344-5 | 1 | 4799
1001 | 32671010-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:43:23-0500 | Books | Computer Books | 978-1-78398-912-6 | 1 | 2200
1001 | 74ad4f70-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:45:14-0500 | Books | Novels | 678-2-44398-312-9 | 1 | 798
1001 | a3e1f750-f588-11e4-ade7-21b264d4c94d | 2015-05-08 08:46:34-0500 | Books | Computer Books | 977-8-78998-466-4 | 1 | 599
(5 rows)
Looks pretty much the same via cqlsh. So let's have a look from the cassandra-cli, and query all rows for userid 1002:
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:category, value=426f6f6b73, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:itemid, value=3637382d322d34343339382d3331322d39, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:price, value=0000031e, timestamp=1431092900008568)
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:subcategory, value=4e6f76656c73, timestamp=1431092900008568)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:, value=, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:category, value=417564696f, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:count, value=00000001, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:itemid, value=3232382d352d34343334332d3334342d35, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:price, value=000012bf, timestamp=1431092985326774)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:subcategory, value=4865616470686f6e6573, timestamp=1431092985326774)
Simple enough, right? We see userid 1002 as the RowKey, and our clustering column time as a column key. Following that are all of our columns for each column key (time). And I believe your first instance generates 6 columns, as I'm pretty sure that includes the placeholder for the column key, because your PRIMARY KEY could point to an empty value (as your 2nd example key does).
But what about the 2nd version for userid 1002?
RowKey: 1002
=> (name=e2f67ec0-f588-11e4-ade7-21b264d4c94d:Books:Novels:678-2-44398-312-9:1:798:, value=, timestamp=1431093011349994)
=> (name=15d0fd20-f589-11e4-ade7-21b264d4c94d:Audio:Headphones:228-5-44343-344-5:1:4799:, value=, timestamp=1431093011360402)
Two columns are returned for RowKey 1002, one for each unique combination of our column (clustering) keys, with an empty value (as mentioned above).
So what does this all mean for you? Well, a few things:
This should tell you that PRIMARY KEYs in Cassandra ensure uniqueness. So if you decide that you need to update key values like category or subcategory (2nd example), you really can't, unless you DELETE and recreate the row. Although from a logging perspective, that's probably ok.
Cassandra stores all data for a particular partition/row key (userid) together, sorted by the column (clustering) keys. If you were concerned about querying and sorting your data, it would be important to understand that you would have to query for each specific userid for sort order to make any difference.
The biggest issue I see, is that right now you are setting yourself up for unbounded column growth. Partition/row keys can support a maximum of 2 billion columns, so your 2nd example will help you out the most there. If you think some of your userids might exceed that, you could implement a "date bucket" as an additional partition key (say, if you knew that a userid would never exceed more than 2 billion in a year, or whatever).
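A hypothetical "date bucket" variant of your 2nd key might look like this; the year column is an assumption, added purely to bound partition growth, and queries would then always supply both userid and year:

```sql
CREATE TABLE log_date (
    userid bigint,
    year int,          -- hypothetical bucket column: caps partition size
    time timeuuid,
    category text,
    subcategory text,
    itemid text,
    count int,
    price int,
    PRIMARY KEY ((userid, year), time, category, subcategory, itemid, count, price)
);
```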
It looks to me like your 2nd option might be the better choice. But honestly for what you're doing, either of them will probably work ok.

Range query - Data modeling for time series in CQL Cassandra

I have a table like this:
CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, rowkey));
with some data:
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000
I want to be able to find:
1/ data from a particular partition key
2/ data from a particular partition key & rowkey
3/ Range query on date given a partitionkey
1/ and 2/ are trivial:
select * from test where partitionkey='p1';
partitionkey | rowkey | policyid | policyname | range
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
but what about 3/?
Even with an index it doesn't work:
create index i1 on test (date);
select * from test where partitionkey='p1' and date =
'2007-01-02';
partitionkey | rowkey | policyid | policyname | date
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
but
select * from test where partitionkey='p1' and
date > '2007-01-02';
Bad Request: No indexed columns present in
by-columns clause with Equal operator
Any idea?
thanks,
Matt
CREATE TABLE test ( partitionkey text, rowkey text, date timestamp,
policyid text, policyname text, primary key (partitionkey, rowkey));
First of all, you really should use more descriptive column names instead of partitionkey and rowkey (and even date, for that matter). By looking at those column names, I really can't tell what kind of data this table is supposed to be indexed by.
select * from test where partitionkey='p1' and date > '2007-01-02';
Bad Request: No indexed columns present in by-columns clause with Equal operator
As for this issue, try making your "date" column a part of your primary key.
primary key (partitionkey, rowkey, date)
Once you do that, I think your date range queries will function appropriately.
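Under that suggestion, the revised table and a working range query might look like this; note that because clustering columns must be restricted in order, the range on date also needs an equality on rowkey:

```sql
CREATE TABLE test (
    partitionkey text,
    rowkey text,
    date timestamp,
    policyid text,
    policyname text,
    PRIMARY KEY (partitionkey, rowkey, date)
);

-- range query on date, with the preceding clustering column pinned
SELECT * FROM test
 WHERE partitionkey = 'p1'
   AND rowkey = 'r1'
   AND date > '2007-01-02';
```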
For more information on this, check out DataStax Academy's (free) course called Java Development With Apache Cassandra. Session 5, Module 104 discusses how to model time series data and that should help you out.
