Cassandra: counter for "SOME" keys not incremented?

I have a very simple Cassandra table called campaignvariantaction with a counter column called val. I ran into a very strange problem today: for some keys, Cassandra increments the counter just fine; for others, it simply doesn't. I'm very confused as to how this could happen.
Sample output from the Cassandra shell (cqlsh) is below. Note how incrementing the counter works fine for one key (the first two update/select pairs) and doesn't work for another (the last one).
Cassandra version 2.2.7 on Ubuntu.
cqlsh:z> UPDATE campaignvariantaction SET val = val + 1 WHERE campaignvariantaction = 'campaign_variant#54408#sent' AND date = 20161118;
cqlsh:z> select * from campaignvariantaction where campaignvariantaction = 'campaign_variant#54408#sent';
 campaignvariantaction       | date     | val
-----------------------------+----------+-----
 campaign_variant#54408#sent | 20161118 |   1
(1 rows)
cqlsh:z> UPDATE campaignvariantaction SET val = val + 1 WHERE campaignvariantaction = 'campaign_variant#54408#sent' AND date = 20161118;
cqlsh:z> select * from campaignvariantaction where campaignvariantaction = 'campaign_variant#54408#sent';
 campaignvariantaction       | date     | val
-----------------------------+----------+-----
 campaign_variant#54408#sent | 20161118 |   2
(1 rows)
cqlsh:z> UPDATE campaignvariantaction SET val = val + 1 WHERE campaignvariantaction = 'campaign_variant#979165#sent' AND date = 20161118;
cqlsh:z> select * from campaignvariantaction where campaignvariantaction = 'campaign_variant#979165#sent';
campaignvariantaction | date | val
-----------------------+------+-----
(0 rows)
Describe output:
cqlsh:z> describe table campaignvariantaction ;
CREATE TABLE z.campaignvariantaction (
    campaignvariantaction text,
    date int,
    val counter,
    PRIMARY KEY (campaignvariantaction, date)
) WITH CLUSTERING ORDER BY (date ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
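One way to narrow this down (an assumption, not a confirmed diagnosis: this symptom can appear when replicas disagree and the read misses the replica that took the write) is to retry the failing key at a stronger consistency level with tracing enabled, so cqlsh shows which replicas served the request:
cqlsh:z> CONSISTENCY ALL;
cqlsh:z> TRACING ON;
cqlsh:z> UPDATE campaignvariantaction SET val = val + 1 WHERE campaignvariantaction = 'campaign_variant#979165#sent' AND date = 20161118;
cqlsh:z> select * from campaignvariantaction where campaignvariantaction = 'campaign_variant#979165#sent';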

Related

PySpark dataframe inner join doesn't work a second time

I am trying to join a table with itself several times to get the cnt for all its connections:
original table:
cust_id   node_1   node_2   cnt
----------------------------------
      1        5        6    12
      5       10        9     3
      6        7       10     4
The table I wanted:
cust_id   cnt   cnt_node_1   cnt_node_2
-----------------------------------------
      1    12            3            4
(notice that cnt_node_1 is the cnt value for cust_id 5, which is 3; same idea for cnt_node_2)
I am able to produce the result for the first node:
cust_id   cnt   cnt_node_1
----------------------------
      1    12            3
using this code:
df1 = table.alias('df1')
df2 = table.select("cust_id", "cnt").withColumnRenamed("cust_id", "cust_id_1").withColumnRenamed("cnt", "cnt_1").alias('df2')
df1 = df1.join(df2, on = df1.node_1 == df2.cust_id_1, how = "inner")
Then I try to do the same thing for node_2, using the exact same code:
df2 = table.select("cust_id", "cnt").withColumnRenamed("cust_id", "cust_id_2").withColumnRenamed("cnt", "cnt_2").alias('df2')
df1 = df1.join(df2, on = df1.node_2 == df2.cust_id_2, how = "inner")
but I got NULLs everywhere:
cust_id   cnt   cnt_node_1   cnt_node_2
-----------------------------------------
      1    12            3         NULL
Can someone give me a hint why this inner join does not work the second time? Thanks in advance!
Try this simple approach:
cols = ['node_1', 'node_2']
new_cols = [i + '_cnt' for i in cols]
# add the new columns up front, filled with 0
df = df.reindex(columns=[*df.columns.tolist(), *new_cols], fill_value=0)
for i in cols:
    df[i + '_cnt'] = df['cnt'] - df[i]
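For reference, a PySpark-only sketch of the lookup the question describes (names like result and lookup are illustrative). Joining on a plain string key instead of a Column expression sidesteps the ambiguous column resolution that repeated self-joins are prone to:
from pyspark.sql import functions as F

result = table
for node in ['node_1', 'node_2']:
    # Rename the lookup columns so each join can use a plain string key;
    # this avoids ambiguous Column references when self-joining repeatedly.
    lookup = table.select(
        F.col('cust_id').alias(node),
        F.col('cnt').alias('cnt_' + node),
    )
    result = result.join(lookup, on=node, how='inner')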

What does "PER PARTITION LIMIT" means in cql query in cassandra?

I have a Scylla table as shown below:
cqlsh:sampleks> describe table test;
CREATE TABLE test (
    client_id int,
    when timestamp,
    process_ids list<int>,
    md text,
    PRIMARY KEY (client_id, when)
) WITH CLUSTERING ORDER BY (when DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 172800
    AND max_index_interval = 1024
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
And this is how we are querying it. It's been a long time since I worked on Cassandra, so PER PARTITION LIMIT is new to me (it looks like it was added recently). Can someone explain, in layman's terms and with an example, what it does? I couldn't find any documentation that explains it clearly.
SELECT * FROM test WHERE client_id IN ? PER PARTITION LIMIT 1;
The PER PARTITION LIMIT clause can be helpful in a "wide partition scenario." Take this query:
aploetz@cqlsh:stackoverflow> SELECT client_id, when, md
                             FROM test PER PARTITION LIMIT 2;
Considering the PRIMARY KEY definition of (client_id, when), that query will iterate over each client_id and return only the first two rows (clustered by when) from each partition, regardless of how many occurrences of when may be present.
In this case, I inserted 7 rows into your test table, using two different client_ids (2 partitions total). With a PER PARTITION LIMIT of 2, I get 4 rows returned (2 client_ids x 2 rows per partition = 4 rows):
 client_id | when                            | md
-----------+---------------------------------+-----
         1 | 2020-05-06 12:00:00.000000+0000 | md1
         1 | 2020-05-05 22:00:00.000000+0000 | md1
         2 | 2020-05-06 19:00:00.000000+0000 | md2
         2 | 2020-05-06 01:00:00.000000+0000 | md2
(4 rows)
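A natural follow-on use, given this table's CLUSTERING ORDER BY (when DESC), is fetching only the newest row per client; a sketch against the same test table:
SELECT client_id, when, md FROM test PER PARTITION LIMIT 1;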

Cassandra Partition key duplicates?

I am new to Cassandra so I had a few quick questions, suppose I do this:
CREATE TABLE my_keyspace.my_table (
    id bigint,
    year int,
    datetime timestamp,
    field1 int,
    field2 int,
    PRIMARY KEY ((id, year), datetime)
)
I imagine Cassandra as something like Map<PartitionKey, SortedMap<ColKey, ColVal>>.
My question is: when querying Cassandra with a WHERE clause, like
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4,
this could return 2 or more records; how does that fit in with Cassandra's data model?
If it really is a big HashMap, how come duplicate records for a partition key are allowed?
Thanks!
Each row is stored as a batch of entries in the SortedMap<ColKey, ColVal>, using its sorted nature.
To build on your mental model: while there is only one partition key for id = 1 AND year = 4, there are multiple cells:
(id, year) | ColKey             | ColVal
------------------------------------------
1, 4       | datetime(1):field1 | 1 \ Row1
1, 4       | datetime(1):field2 | 2 /
1, 4       | datetime(5):field1 | 1 \
1, 4       | datetime(5):field2 | 2 / Row2
...
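To make the "no duplicates, many rows per partition" behavior concrete, here is a sketch with hypothetical values: two INSERTs sharing the partition key but differing in the clustering column produce two rows, while repeating the full primary key merely overwrites (upserts) the existing row:
INSERT INTO my_keyspace.my_table (id, year, datetime, field1, field2)
VALUES (1, 4, '2021-01-01 00:00:00', 10, 20);
INSERT INTO my_keyspace.my_table (id, year, datetime, field1, field2)
VALUES (1, 4, '2021-01-02 00:00:00', 30, 40);  -- same partition, new clustering key: 2 rows
INSERT INTO my_keyspace.my_table (id, year, datetime, field1, field2)
VALUES (1, 4, '2021-01-02 00:00:00', 99, 99);  -- same full primary key: upsert, still 2 rows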

Getting an error while using the IN operator in Cassandra 3.4

Cassandra version: [cqlsh 5.0.1 | Cassandra 3.0.9 | CQL spec 3.4.0 | Native protocol v4]
My table structure:
CREATE TABLE test (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    client_starttime timestamp,
    avail_endtime timestamp,
    starttime timestamp,
    client_endtime timestamp,
    code int,
    status text,
    total_time double,
    PRIMARY KEY (id1, id2, id3, id4, client_starttime)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC, client_starttime ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
The following query works for me:
SELECT * FROM test WHERE client_starttime<1522832400000 AND client_starttime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4 IN ('data5','data6','data7') ALLOW FILTERING;
But when I query using starttime and avail_endtime instead of client_starttime, like
SELECT * FROM test WHERE starttime<1522832400000 AND avail_endtime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4 IN ('data5','data6','data7') ALLOW FILTERING;
I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot restrict clustering columns by IN relations when a collection is selected by the query"
If I don't use the IN operator and use the = operator instead, it works fine:
SELECT * FROM test WHERE starttime<1522832400000 AND avail_endtime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4 = 'data5' ALLOW FILTERING;
but I want to retrieve the data for data5, data6, and data7 at the same time.
This is a known Cassandra bug: CASSANDRA-12654. It has already been fixed, but the fix will only ship in version 4.0, for which no release date has been defined yet.
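Until an upgrade is possible, one workaround (my sketch, not part of the linked ticket) is to run one query per id4 value, since the = form works, and merge the result sets client-side:
SELECT * FROM test WHERE starttime<1522832400000 AND avail_endtime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4='data5' ALLOW FILTERING;
SELECT * FROM test WHERE starttime<1522832400000 AND avail_endtime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4='data6' ALLOW FILTERING;
SELECT * FROM test WHERE starttime<1522832400000 AND avail_endtime>1522831800000 AND id1='data1' AND id2='data2' AND id3='data3' AND id4='data7' ALLOW FILTERING;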

Cassandra: how to initialize the counter column with a value?

I have to benchmark Cassandra with Facebook's LinkBench. There are two phases in the benchmark: the load phase and the request phase.
In the load phase, LinkBench fills the Cassandra tables nodes, links, and counts (for link counting) with default values (graph data).
The count table looks like this:
keyspace.counttable (
    link_id bigint,
    link_type bigint,
    time bigint,
    version bigint,
    count counter,
    PRIMARY KEY (link_id, link_type, time, version)
)
My question is: how do I insert the default counter values (before the counter is incremented and decremented in the LinkBench request phase)?
If that isn't possible with Cassandra, how should I increment/decrement a bigint column (instead of a counter column)?
Any suggestions or comments? Thanks a lot.
The default value is zero. Given
create table counttable (
    link_id bigint,
    link_type bigint,
    time bigint,
    version bigint,
    count counter,
    PRIMARY KEY (link_id, link_type, time, version)
);
and
update counttable set count = count + 1 where link_id = 1 and link_type = 1 and time = 1 and version = 1;
We see that the value of count is now 1.
select * from counttable ;
 link_id | link_type | time | version | count
---------+-----------+------+---------+-------
       1 |         1 |    1 |       1 |     1
(1 rows)
So, if we want to set it to some other value, we can:
update counttable set count = count + 500 where link_id = 1 and link_type = 1 and time = 1 and version = 2;
select * from counttable ;
 link_id | link_type | time | version | count
---------+-----------+------+---------+-------
       1 |         1 |    1 |       1 |     1
       1 |         1 |    1 |       2 |   500
(2 rows)
There is no elegant way to initialize a counter column with a non-zero value. The only operations you can do on a counter column are increment and decrement. I recommend keeping the offset (i.e., your intended initial value) separately (note that a counter table cannot contain regular, non-counter data columns, so it would have to live in a separate table), and simply adding the two values in your client application.
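A sketch of that idea, with illustrative names: the offset lives in a parallel non-counter table keyed the same way, and the application sums the two values on read:
CREATE TABLE counttable_offsets (
    link_id bigint,
    link_type bigint,
    time bigint,
    version bigint,
    initial_value bigint,
    PRIMARY KEY (link_id, link_type, time, version)
);
-- effective count = initial_value (written once at load time)
--                 + count (incremented/decremented afterwards)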
Thank you for the answers. I implemented the following solution to initialize the counter field.
Since the initial (default) value of the counter field is 0, I incremented it by my default value. It is similar to Don Branson's solution, but with only one column:
create table counttable (
    link_id bigint,
    link_type bigint,
    count counter,
    PRIMARY KEY (link_id, link_type)
);
I set the value with this statement (during the load phase):
update counttable set count = count + myValue where link_id = 1 and link_type = 1;
select * from counttable ;
 link_id | link_type | count
---------+-----------+----------------------
       1 |         1 | myValue (added to 0)
(1 row)
