Handling null values in Map keys - cassandra

I am using Cassandra 3.10 and, in order to emulate GROUP BY on non-primary-key columns, I am following http://www.batey.info/cassandra-aggregates-min-max-avg-group.html, which uses map keys for this purpose. When I execute select group_and_total(name,count) from school; I get the error ServerError: java.lang.NullPointerException: Map keys cannot be null.
The problem is that the name column contains some null values. Is there any way to get the desired result by modifying the function, instead of removing the rows that contain nulls?
The schema of the table is:
CREATE TABLE school (
    name text,
    count int,
    roll_no text,
    ...
    PRIMARY KEY (roll_no)
);
The functions that I am using for Group by are:
CREATE FUNCTION state_group_and_total( state map<text, int>, type text, amount int )
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java AS '
    Integer count = (Integer) state.get(type);
    if (count == null) count = amount;
    else count = count + amount;
    state.put(type, count);
    return state;
';
CREATE OR REPLACE AGGREGATE group_and_total(text, int)
SFUNC state_group_and_total
STYPE map<text, int>
INITCOND {};

Here is the schema you mentioned:
CREATE TABLE temp.school (
    roll_no text PRIMARY KEY,
    count int,
    name text
);
Sample data in the table:
 roll_no | count | name
---------+-------+------
       6 |     1 |    b
       7 |     1 | null
       4 |     1 |    b
       3 |     1 |    a
       5 |     1 |    b
       2 |     1 |    a
       1 |     1 |    a

(7 rows)
Note: There is one null value in the name column.
Modified function definition:
CREATE FUNCTION temp.state_group_and_total(state map<text, int>, type text, amount int)
RETURNS NULL ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java
AS $$
    Integer count = (Integer) state.get(type);
    if (count == null) count = amount;
    else count = count + amount;
    state.put(type, count);
    return state;
$$;
Note: CALLED ON NULL INPUT was replaced with RETURNS NULL ON NULL INPUT, so the state function is simply not invoked for rows where name is null; those rows are skipped instead of throwing a NullPointerException.
Aggregate definition:
CREATE AGGREGATE temp.group_and_total(text, int)
SFUNC state_group_and_total
STYPE map<text, int>
INITCOND {};
Query output:
cassandra@cqlsh:temp> select group_and_total(name, count) from school;

 temp.group_and_total(name, count)
------------------------------------
                   {'a': 3, 'b': 3}

(1 rows)
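If you would rather count the null names than skip them, a variant that keeps CALLED ON NULL INPUT and buckets nulls under a sentinel key should also work (an untested sketch; the '<null>' label and the _nullsafe names are made up here):

CREATE OR REPLACE FUNCTION temp.state_group_and_total_nullsafe(state map<text, int>, type text, amount int)
CALLED ON NULL INPUT
RETURNS map<text, int>
LANGUAGE java
AS $$
    // assumption: bucket null names under a sentinel key instead of dropping the row
    String key = (type == null) ? "<null>" : type;
    if (amount != null) {
        Integer count = (Integer) state.get(key);
        state.put(key, count == null ? amount : count + amount);
    }
    return state;
$$;

CREATE OR REPLACE AGGREGATE temp.group_and_total_nullsafe(text, int)
SFUNC state_group_and_total_nullsafe
STYPE map<text, int>
INITCOND {};

With the sample data above this should return {'<null>': 1, 'a': 3, 'b': 3}.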

Related

Cassandra: update or insert map value together with other values

I have a table that looks like this:
(id, a, b, mapValue)
I want to update the row if it exists, or insert VALUES (id, a, b, mapValue) if it does not, where the new mapValue is the old mapValue merged with the new one, replacing the values of keys that were already present.
For example, if the old mapValue was {1:c, 2:d} and the new one is {2:e, 3:f}, the result would be {1:c, 2:e, 3:f}.
I want to do this in a single query that also updates/inserts id, a, and b.
How can I achieve this?
I've found this guide about updating maps but it doesn't say anything about updating them while dealing with other values in the table. I need to do this at the same time.
In Cassandra, there is no difference between INSERT & UPDATE - everything is an upsert. When you do an UPDATE and the data doesn't exist, it's inserted; if you do an INSERT and the data already exists, it's updated.
Regarding map update, you can use + and - operations on the corresponding column when doing UPDATE. For example, I have a table:
CREATE TABLE test.m1 (
    id int PRIMARY KEY,
    i int,
    m map<int, text>
);
and I can do following to update existing row:
cqlsh:test> insert into test.m1 (id, i, m) values (1, 1, {1:'t1'});
cqlsh:test> select * from test.m1;

 id | i | m
----+---+-----------
  1 | 1 | {1: 't1'}

(1 rows)

cqlsh:test> update test.m1 set m = m + {2:'t2'}, i = 4 where id = 1;
cqlsh:test> select * from test.m1;

 id | i | m
----+---+--------------------
  1 | 4 | {1: 't1', 2: 't2'}

(1 rows)
and I can use a similar UPDATE command to insert completely new data:
cqlsh:test> update test.m1 set m = m + {6:'t6'}, i = 6 where id = 6;
cqlsh:test> select * from test.m1;

 id | i | m
----+---+--------------------
  1 | 4 | {1: 't1', 2: 't2'}
  6 | 6 | {6: 't6'}

(2 rows)
Usually, if you know that no data existed before for a given primary key, UPDATE with + is the better way to insert data into a set or map, because it doesn't generate the tombstone that an INSERT, or an UPDATE without +, writes on the collection column.
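To illustrate the difference against the same test table (a sketch):

-- full replacement: Cassandra first writes a tombstone over the old map
UPDATE test.m1 SET m = {9:'t9'} WHERE id = 1;
-- append: merges into the existing map, no tombstone
UPDATE test.m1 SET m = m + {9:'t9'} WHERE id = 1;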
P.S. You can find more information on using collections in the Cassandra documentation.
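Applied to the table from the question (column types assumed for illustration), a single statement covers both cases - it inserts the row if it is missing and merges the map if it exists:

-- assumed schema: CREATE TABLE t (id int PRIMARY KEY, a text, b text, mapValue map<int, text>);
UPDATE t SET a = 'a-val', b = 'b-val', mapValue = mapValue + {2: 'e', 3: 'f'} WHERE id = 1;

Starting from {1: 'c', 2: 'd'}, this leaves mapValue as {1: 'c', 2: 'e', 3: 'f'}.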

Cassandra where clause as a tuple

Table12

CustomerId | CampaignID
-----------+-----------
         1 |          1
         1 |          2
         2 |          3
         1 |          3
         4 |          2
         4 |          4
         5 |          5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
???
So the input is a tuple, but the columns are not a tuple; they are individual columns.
Sure, it's possible. But only on the clustering keys. That means you need to use something else as a partition key or "bucket." For this example, I'll assume that marketing campaigns are time-sensitive and that we'll get good distribution and ease of querying by using "month" as the bucket (partition).
CREATE TABLE stackoverflow.customertocampaign (
    campaign_month int,
    customer_id int,
    campaign_id int,
    customer_name text,
    PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
Now, I can INSERT the data described in your CustomerToCampaign variable; for example (202004 assumed as the month bucket):
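INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 1, 1);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 1, 2);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 2, 3);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 1, 3);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 4, 2);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 4, 4);
INSERT INTO customertocampaign (campaign_month, customer_id, campaign_id) VALUES (202004, 5, 5);
Then, this query works: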
aploetz@cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);

 campaign_month | customer_id | campaign_id
----------------+-------------+-------------
         202004 |           1 |           2

(1 rows)
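The tuple-style IN from your question should also work on the clustering columns (untested sketch):

SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign
WHERE campaign_month = 202004
AND (customer_id, campaign_id) IN ((1, 2), (4, 4));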

CAS with CQL in Cassandra

I'm trying to model some time-series data in Cassandra, which I had been able to do with the older Thrift client, but CQL seems to be throwing me off.
I want to add a NEW column to my row IF a specific column value matches.
My table definition is:
CREATE TABLE TestTable (
    key int,
    base uuid,
    ts int,    // Timestamp (column name)
    val text,  // Timestamp value (column value)
    PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
What I'm guessing it'd look like is:
Row | UUID | TS | TS | TS
----+------+----+----+----
  1 | id1  |  1 |  2 |  3
  2 | id2  |  1 |  5 |  6
So essentially, I can have a bunch of Timestamps for a given row and a SINGLE UUID for a row.
The UUID needs to be updated for each new insert of a TS column.
So inserts in a row work just fine:
insert into TestTable(key, base, ts, val) values (1, dfb63886-91a4-11e6-ae22-56b6b6499611, 50, 'one');
But I'm failing to figure out a way, using CQL, to INSERT a new column in a row using Cassandra transactions (CAS).
This one fails:
insert into TestTable(key, base, ts, val) values (1, dfb63886-91a4-11e6-ae22-56b6b6499611, 70, 'four') if base = dfb63886-91a4-11e6-ae22-56b6b6499611;
with the error:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query] message="line 1:106 mismatched input 'base' expecting K_NOT (..., 70, 'four') if [base] =...)">
And the query:
update TestTable set val = 'four', ts=70 where key = 1 if base = dfb63886-91a4-11e6-ae22-56b6b6499611;
fails with the error:
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part ts found in SET part"
I'm trying to figure out how to model the data properly so that I only have one UUID per row and can have multiple columns without having to explicitly define them during table creation, since it can vary quite a bit.
IIRC, it was easy doing this with the thrift client but using that isn't an option =/
There is a nice tutorial regarding time-series data here
In a nutshell, your composite key will be your unique identifier (like the UUID you were proposing) plus a timestamp, so you will be able to add as many events/values as you need for a given UUID:
CREATE TABLE IF NOT EXISTS TestTable (
    base uuid,
    ts timestamp,  // Timestamp (column name)
    value text,    // Timestamp value (column value)
    PRIMARY KEY (base, ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Inserts use the same UUID with different times:
INSERT INTO TestTable (base, ts, value)
VALUES (467286c5-7d13-40c2-92d0-73434ee8970c, dateof(now()), 'abc');
INSERT INTO TestTable (base, ts, value)
VALUES (467286c5-7d13-40c2-92d0-73434ee8970c, dateof(now()), 'def');
cqlsh:test> SELECT * FROM TestTable WHERE base = 467286c5-7d13-40c2-92d0-73434ee8970c;

 base                                 | ts                              | value
--------------------------------------+---------------------------------+-------
 467286c5-7d13-40c2-92d0-73434ee8970c | 2016-10-14 04:13:42.779000+0000 |   def
 467286c5-7d13-40c2-92d0-73434ee8970c | 2016-10-14 04:12:50.551000+0000 |   abc

(2 rows)
Updating can be done on any of the columns except the ones used as keys. The errors in your statements were caused by the IF clause in the INSERT (INSERT only supports IF NOT EXISTS) and by trying to SET ts, which is part of the composite key.
INSERT INTO TestTable (base, ts, value)
VALUES (ffb0bb8e-3d67-4203-8c53-046a21992e52, dateof(now()), 'bananas');
SELECT * FROM TestTable WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts < dateof(now());

 base                                 | ts                              | value
--------------------------------------+---------------------------------+---------
 ffb0bb8e-3d67-4203-8c53-046a21992e52 | 2016-10-14 04:17:26.421000+0000 | bananas

(1 rows)

UPDATE TestTable SET value = 'apples'
WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts = '2016-10-14 04:17:26.421+0000';
SELECT * FROM TestTable WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts < dateof(now());

 base                                 | ts                              | value
--------------------------------------+---------------------------------+---------
 ffb0bb8e-3d67-4203-8c53-046a21992e52 | 2016-10-14 04:17:26.421000+0000 |  apples

(1 rows)
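As for the CAS part of the question: once the full primary key is in the WHERE clause, a lightweight transaction on a non-key column does work. A sketch against the row above:

UPDATE TestTable SET value = 'cherries'
WHERE base = ffb0bb8e-3d67-4203-8c53-046a21992e52 AND ts = '2016-10-14 04:17:26.421+0000'
IF value = 'apples';

cqlsh returns an [applied] column telling you whether the condition matched.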

Timestamp with auto increment in Cassandra

I want Cassandra to write System.currentTimeMillis() into the table automatically for each write. For example:
writeToCassandra(name, email)
in the Cassandra table:

name | email | currentMiliseconds
-----+-------+--------------------
Can Cassandra populate the currentMiliseconds column automatically, like an auto increment?
BR!
Cassandra has a bit of a columnar-database flavor inside. If you read the docs on how columns are stored inside an SSTable, you'll notice that each column has its own write timestamp appended (used for conflict resolution, i.e. the last-write-wins strategy). You can query that timestamp using the writetime() function:
cqlsh:so> create table ticks ( id text primary key, value int);
cqlsh:so> insert into ticks (id, value) values ('foo', 1);
cqlsh:so> insert into ticks (id, value) values ('bar', 2);
cqlsh:so> insert into ticks (id, value) values ('baz', 3);
cqlsh:so> select id, value from ticks;

 id  | value
-----+-------
 bar |     2
 foo |     1
 baz |     3

(3 rows)

cqlsh:so> select id, writetime(value) from ticks;

 id  | writetime(value)
-----+------------------
 bar | 1448282940862913
 foo | 1448282937031542
 baz | 1448282945591607

(3 rows)
As you requested, I did not explicitly insert a write timestamp, but I can still query it. Note that you cannot use the writetime() function on primary key columns.
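If you instead want to control the stored timestamp from the client, USING TIMESTAMP takes microseconds since the epoch (a sketch; the value would be System.currentTimeMillis() * 1000 on the application side):

cqlsh:so> insert into ticks (id, value) values ('qux', 4) USING TIMESTAMP 1448282950000000;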
You can try with dateof(now()) (on Cassandra 2.2+, toTimestamp(now()) is the non-deprecated equivalent), e.g.:
INSERT INTO YOUR_TABLE (NAME, EMAIL, DATE)
VALUES ('NAME', 'EMAIL', dateof(now()));

Mixing column types in Cassandra / wide rows

I am trying to learn how to implement a feed in cassandra (think twitter). I want to use wide rows to store all the posts made by a user.
I am thinking about adding user information or statistical information in the same row (num of posts, last post date, user name, etc.).
My question is: are the field names (name, age, etc.) stored in each column? Or do wide rows store only the column names and values actually specified? Am I wasting disk space? Am I compromising performance somehow?
Thanks!
-- TABLE CREATION
CREATE TABLE user_feed (
    owner_id int,
    name text,
    age int,
    posted_at timestamp,
    post_text text,
    PRIMARY KEY (owner_id, posted_at)
);
-- INSERTING THE USER
insert into user_feed (owner_id, name, age, posted_at) values (1, 'marc', 36, 0);
-- INSERTING USER POSTS
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'first post!');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'hello there');
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'i am kind of happy');
-- GETTING THE FEED
select * from user_feed where owner_id=1 and posted_at>0;
-- RESULT
 owner_id | posted_at                | age  | name | post_text
----------+--------------------------+------+------+--------------------
        1 | 2014-07-04 12:01:23+0000 | null | null |        first post!
        1 | 2014-07-04 12:01:23+0000 | null | null |        hello there
        1 | 2014-07-04 12:01:23+0000 | null | null | i am kind of happy
-- GETTING USER INFO - ONLY USER INFO IS POSTED_AT=0
select * from user_feed where owner_id=1 and posted_at=0;
-- RESULT
 owner_id | posted_at                | age | name | post_text
----------+--------------------------+-----+------+-----------
        1 | 1970-01-01 00:00:00+0000 |  36 | marc |      null
What about making them static?
A static column is shared by all rows of a partition, and since your partition key is the id of the owner, you avoid wasting space and can retrieve the user information in any query.
CREATE TABLE user_feed (
    owner_id int,
    name text static,
    age int static,
    posted_at timestamp,
    post_text text,
    PRIMARY KEY (owner_id, posted_at)
);
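With static columns, the user info is stored once per partition and returned with every row; a sketch reusing the inserts from the question:

insert into user_feed (owner_id, name, age) values (1, 'marc', 36);
insert into user_feed (owner_id, posted_at, post_text) values (1, dateof(now()), 'first post!');
select name, age, post_text from user_feed where owner_id=1;

Here name and age come back on every post row without being stored per post, and the posted_at=0 sentinel row is no longer needed.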
Cheers,
Carlo
