How to model a Cassandra column family

Are the SELECT queries below possible with the column family I have defined? I am getting a bad request error. How should I model my column family to get the correct results?
CREATE TABLE recordhistory (
userid bigint,
objectid bigint,
operation text,
record_link_id bigint,
time timestamp,
username text,
value map<bigint, text>,
PRIMARY KEY ((userid, objectid), operation, record_link_id, time)
) WITH CLUSTERING ORDER BY (operation ASC, record_link_id ASC, time DESC);
Select Query:
SELECT * FROM recordhistory WHERE userid=439035 AND objectid=20011009 AND operation='update' AND time>=1389205800000 AND time<=1402338600000 ALLOW FILTERING;
Bad Request: PRIMARY KEY column "time" cannot be restricted (preceding column "record_link_id" is either not restricted or by a non-EQ relation)
SELECT * FROM recordhistory WHERE userid=439035 AND objectid=20011009 AND record_link_id=20011063 ALLOW FILTERING;
Bad Request: PRIMARY KEY column "record_link_id" cannot be restricted (preceding column "operation" is either not restricted or by a non-EQ relation)

create table recordhistory (
userid bigint,
objectid bigint,
operation text,
record_link_id bigint,
time timestamp,
username text,
value map<bigint, text>,
PRIMARY KEY ((userid, objectid), time, operation, record_link_id)) WITH CLUSTERING ORDER BY (time DESC, operation ASC, record_link_id ASC);
select * from recordhistory where userid=12346 AND objectid=45646 and time >=1389205800000 and time <1402338700000 ALLOW FILTERING;
userid | objectid | time | operation | record_link_id | username | value
--------+----------+--------------------------+-----------+----------------+----------+-------
12346 | 45646 | 2014-06-09 11:30:00-0700 | myop4 | 78946 | name3 | null
12346 | 45646 | 2014-01-08 10:30:00-0800 | myop99999 | 999999 | name3 | null
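The re-modeled table above serves the time-range query. The second failing query (looking up by record_link_id) would still need its own table, following the usual one-table-per-query pattern. A sketch, where recordhistory_by_link is a hypothetical table name reusing the same columns:

```sql
CREATE TABLE recordhistory_by_link (
    userid bigint,
    objectid bigint,
    record_link_id bigint,
    time timestamp,
    operation text,
    username text,
    value map<bigint, text>,
    PRIMARY KEY ((userid, objectid), record_link_id, time)
) WITH CLUSTERING ORDER BY (record_link_id ASC, time DESC);

-- The lookup then needs no ALLOW FILTERING:
SELECT * FROM recordhistory_by_link
WHERE userid=439035 AND objectid=20011009 AND record_link_id=20011063;
```

The application would insert into both tables on every write.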

Related

Group by on Primary Partition

I am not able to perform GROUP BY on a primary partition. I am using Cassandra 3.10. When I group by, I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Group by currently only support groups of columns following their declared order in the Primary Key." My column is a primary key, yet I am still facing the problem.
My schema is
Table trends{
name text,
price int,
quantity int,
code text,
code_name text,
cluster_id text
uitime timeuuid,
primary key((name,price),code,uitime))
with clustering order by (code DESC, uitime DESC)
And the command that I run is: select sum(quantity) from trends group by code;
For starters, your schema is invalid. You cannot set a clustering order on code because it is the partition key. Its order is going to be determined by its hash (unless you use the byte-ordered partitioner, but don't do that).
The query and feature you're talking about do work, though. For example, you can run:
> SELECT keyspace_name, sum(partitions_count) AS approx_partitions FROM system.size_estimates GROUP BY keyspace_name;
keyspace_name | approx_partitions
--------------------+-------------------
system_auth | 128
basic | 4936508
keyspace1 | 870
system_distributed | 0
system_traces | 0
where the schema is:
CREATE TABLE system.size_estimates (
keyspace_name text,
table_name text,
range_start text,
range_end text,
mean_partition_size bigint,
partitions_count bigint,
PRIMARY KEY ((keyspace_name), table_name, range_start, range_end)
) WITH CLUSTERING ORDER BY (table_name ASC, range_start ASC, range_end ASC);
Perhaps the pseudo-schema you provided differs from the actual one. Can you add the output of DESCRIBE TABLE xxxxx to your question?
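For what it's worth, GROUP BY in Cassandra 3.10 requires the grouped columns to follow the primary key's declared order, starting with the full partition key. Assuming the pseudo-schema above is roughly what was intended (partition key (name, price), clustering column code), a query of this shape should be accepted; a sketch, not tested against the asker's cluster:

```sql
-- The partition key columns (name, price) must come
-- before the clustering column code in the GROUP BY.
SELECT name, price, code, sum(quantity)
FROM trends
GROUP BY name, price, code;
```

GROUP BY code alone fails precisely because it skips the partition key columns.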

Cassandra select order by

I create table as this
CREATE TABLE sm.data (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, tid, ts)
) WITH CLUSTERING ORDER BY (tid ASC, ts DESC);
Previously I did all my select queries with ts DESC, so it was fine. Now I also need select queries with ts ASC in some cases. How do I accomplish that? Thank you.
You can simply use ORDER BY ts ASC
Example :
SELECT * FROM data WHERE did = ? and tid = ? ORDER BY ts ASC
If you do this select:
select * from data where did=1 and tid=2 order by ts asc;
you will end up with an error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
I have tested it against my local Cassandra db.
I would suggest altering the order of the primary key columns. The reason is that:
"Querying compound primary keys and sorting results: ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY."
CREATE TABLE data2 (
did int,
tid int,
ts timestamp,
aval text,
dval decimal,
PRIMARY KEY (did, ts, tid)
) WITH CLUSTERING ORDER BY (ts DESC, tid ASC);
Now we are free to choose the ordering direction for ts:
cassandra#cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts DESC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
cassandra#cqlsh:airline> SELECT * FROM data2 WHERE did = 1 and ts=2 order by ts ASC;
did | ts | tid | aval | dval
-----+----+-----+------+------
(0 rows)
Another way would be to create either a new table or a materialized view; the latter leads to data duplication behind the scenes anyway.
Hope that's clear enough.
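As a sketch of the materialized-view route mentioned above (assuming Cassandra 3.0+; data_by_ts_asc is a hypothetical name), the view keeps the same primary key columns but reverses the ts direction:

```sql
CREATE MATERIALIZED VIEW sm.data_by_ts_asc AS
    SELECT * FROM sm.data
    WHERE did IS NOT NULL AND tid IS NOT NULL AND ts IS NOT NULL
    PRIMARY KEY (did, tid, ts)
    WITH CLUSTERING ORDER BY (tid ASC, ts ASC);

-- Reads come back oldest-first without needing an ORDER BY:
SELECT * FROM sm.data_by_ts_asc WHERE did = 1 AND tid = 2;
```

The view is maintained by Cassandra on every write to the base table, at the cost of duplicated storage.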

Cassandra data modeling for range queries using timestamp

I need to create a table with 4 columns:
timestamp BIGINT
name VARCHAR
value VARCHAR
value2 VARCHAR
I have 3 required queries:
SELECT *
FROM table
WHERE timestamp > xxx
AND timestamp < xxx;
SELECT *
FROM table
WHERE name = 'xxx';
SELECT *
FROM table
WHERE name = 'xxx'
AND timestamp > xxx
AND timestamp < xxx;
The result needs to be sorted by timestamp.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (timestamp)
);
the result is never sorted.
When I use:
CREATE TABLE table (
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY (name, timestamp)
);
the result is sorted by name, then timestamp, which is wrong:
name | timestamp
------------------------
a | 20170804142825729
a | 20170804142655569
a | 20170804142650546
a | 20170804142645516
a | 20170804142640515
a | 20170804142620454
b | 20170804143446311
b | 20170804143431287
b | 20170804143421277
b | 20170804142920802
b | 20170804142910787
How do I do this using Cassandra?
Cassandra orders data by clustering key within each partition.
In your case, the first table has only the partition key timestamp and no clustering key, so the data will not be sorted.
For the second table, the partition key is name and the clustering key is timestamp, so your data will be sorted by timestamp within each name: rows are first grouped by name, then each group is sorted separately by timestamp.
Edited
So you need to add a partition key like below :
CREATE TABLE table (
year BIGINT,
month BIGINT,
timestamp BIGINT,
name VARCHAR,
value VARCHAR,
value2 VARCHAR,
PRIMARY KEY ((year, month), timestamp)
);
Here (year, month) is the composite partition key. You have to derive the year and month from the timestamp when inserting, so your data will be sorted by timestamp within each (year, month) partition.
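To make that concrete, a sketch of an insert and the matching range query, assuming the timestamps are epoch milliseconds and keeping the question's table and column names (note that table and timestamp are keywords in CQL and would need different names or quoting in practice); the 2017-08 values are made up for illustration:

```sql
-- year and month are derived client-side from the timestamp value
INSERT INTO table (year, month, timestamp, name, value, value2)
VALUES (2017, 8, 1501857600000, 'a', 'some value', 'another value');

-- Range query within one (year, month) partition,
-- returned sorted by timestamp (ASC by default):
SELECT * FROM table
WHERE year = 2017 AND month = 8
  AND timestamp > 1501800000000 AND timestamp < 1501900000000;
```

The trade-off is that a range spanning several months requires one query per (year, month) partition, merged client-side.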

Cassandra - alternate way for clustering key with ORDER BY and UPDATE

My schema is :
CREATE TABLE friends (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY (userId,friendId)
);
CREATE TABLE friends_by_status (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY ((userId,status), ts)
)with clustering order by (ts desc);
Here, whenever a friend request is made, I'll insert a record in both tables.
When I want to check the one-to-one status of users, I'll use this query:
SELECT status FROM friends WHERE userId=xxx AND friendId=xxx;
When I need to query all the records with pending status, I'll use:
SELECT * FROM friends_by_status WHERE userId=xxx AND status='pending';
But when there is a status change, I can update status and ts in the friends table, but not in the friends_by_status table, as both are part of the PRIMARY KEY.
You can see that even if I denormalise, I definitely need to update status and ts in the friends_by_status table to maintain consistency.
The only way I can maintain consistency is to delete the record and insert it again.
But frequent deletes are also not recommended in a Cassandra data model, as mentioned in the Spotify talk at the Cassandra Summit.
I find this to be the biggest limitation in Cassandra.
Is there any other way to solve this issue?
Any solution is appreciated.
I don't know how soon you need to deploy this, but in Cassandra 3.0 you could handle this with a materialized view. Your friends table would be the base table, and friends_by_status would be a view of it. Cassandra takes care of updating the view when you change the base table.
For example:
CREATE TABLE friends ( userid int, friendid int, status varchar, ts timeuuid, PRIMARY KEY (userId,friendId) );
CREATE MATERIALIZED VIEW friends_by_status AS
SELECT userId from friends WHERE userID IS NOT NULL AND friendId IS NOT NULL AND status IS NOT NULL AND ts IS NOT NULL
PRIMARY KEY ((userId,status), friendID);
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 500, 'pending', now());
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 501, 'accepted', now());
INSERT INTO friends (userid, friendid, status, ts) VALUES (1, 502, 'pending', now());
SELECT * FROM friends;
userid | friendid | status | ts
--------+----------+----------+--------------------------------------
1 | 500 | pending | a02f7fe0-49f9-11e5-9e3c-ab179e6a6326
1 | 501 | accepted | a6c80980-49f9-11e5-9e3c-ab179e6a6326
1 | 502 | pending | add10830-49f9-11e5-9e3c-ab179e6a6326
So now in the view you can select rows by the status:
SELECT * FROM friends_by_status WHERE userid=1 AND status='pending';
userid | status | friendid
--------+---------+----------
1 | pending | 500
1 | pending | 502
(2 rows)
And then when you update the status in the base table, it automatically updates in the view:
UPDATE friends SET status='pending' WHERE userid=1 AND friendid=501;
SELECT * FROM friends_by_status WHERE userid=1 AND status='pending';
userid | status | friendid
--------+---------+----------
1 | pending | 500
1 | pending | 501
1 | pending | 502
(3 rows)
But note that in the view you couldn't have ts as part of the key, since you can only include one non-key field from the base table in the view's key; in your case that is status.
I think the first beta release for 3.0 is coming out tomorrow if you want to try this out.
Why do you need status to be in the primary key for your second table? If this was your schema:
CREATE TABLE friends_by_status (
userId timeuuid,
friendId timeuuid,
status varchar,
ts timeuuid,
PRIMARY KEY ((userId), status, ts)
) WITH CLUSTERING ORDER BY (status ASC, ts DESC);
you can update the status as needed and still filter by it. You will be storing more data under one partition but it seems like you are storing one row for each friend a user has. This will be the same as in the first table, so I don't see partition size being a problem.

Select a specific record in Cassandra using cql

This is the schema I use:
CREATE TABLE playerInfo (
key text,
column1 bigint,
column2 bigint,
column3 bigint,
column4 bigint,
column5 text,
value bigint,
PRIMARY KEY (key, column1, column2, column3, column4, column5)
)
WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Note I use a composite key. And there is a record like this:
key | column1 | column2 | column3 | column4 | column5 | value
----------+------------+---------+----------+---------+--------------------------------------------------+-------
Kitty | 1411 | 3 | 713 | 4 | American | 1
In cqlsh, how to select it? I try to use:
cqlsh:game> SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column5 = 'American';
but the output is:
Bad Request: PRIMARY KEY part column5 cannot be restricted (preceding part column4 is either not restricted or by a non-EQ relation)
Then how could I select such cell?
You have chosen the primary key as PRIMARY KEY (key, column1, column2, column3, column4, column5), so if you are going to put a where clause on column5 you also need where clauses on key, column1, column2, column3, and column4. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3 AND column3 = 713 AND column4 = 4 AND column5 = 'American';
If you are going to put a where clause on column2, you also need where clauses on key and column1. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3;
In short, to restrict a particular column of the primary key, every preceding column must also be restricted. So you need to model your Cassandra data carefully to get good read and write performance and still satisfy your business needs. A model that fits your business logic perfectly may not give the best Cassandra performance, and vice versa; that is the trade-off with Cassandra, and it certainly still has room to improve.
There is a way to select rows based on columns that are not part of the primary key: creating a secondary index.
Let me explain this with an example.
In this schema:
CREATE TABLE playerInfo (
player_id int,
name varchar,
country varchar,
age int,
performance int,
PRIMARY KEY ((player_id, name), country)
);
the first part of the primary key, i.e. player_id and name, is the partition key. Its hash value determines which node in the Cassandra cluster this row will be written to.
Hence we need to specify both of these values in the where clause to fetch a record. For example:
SELECT * FROM playerinfo WHERE player_id = 1000 and name = 'Mark B';
player_id | name | country | age | performance
-----------+--------+---------+-----+-------------
1000 | Mark B | USA | 26 | 8
If the clustering part of your primary key contains more than two columns, you have to specify values for all the columns to the left of that column in the key, including that column.
In this example
PRIMARY KEY ((key, column1), column2, column3, column4, column5)
For filtering based on column3 you would have to specify values for key, column1, column2, and column3.
For filtering based on column5 you need to specify values for key, column1, column2, column3, column4, and column5.
But if your application demands filtering on particular columns that are not part of the partition key, you can create secondary indices on those columns.
To create an index on a column, use the following command:
CREATE INDEX player_age on playerinfo (age) ;
Now you can filter columns based on age.
SELECT * FROM playerinfo where age = 26;
player_id | name | country | age | performance
-----------+---------+---------+-----+-------------
2000 | Sarah L | UK | 26 | 24
1000 | Mark B | USA | 26 | 8
Be very careful about using indexes in Cassandra. Use them only if a table has few records, or more precisely, few distinct values in the indexed columns.
You can also drop an index using:
DROP INDEX player_age;
Refer to http://wiki.apache.org/cassandra/SecondaryIndexes and http://www.datastax.com/docs/1.1/ddl/indexes for more details.