Group by on Primary Partition - cassandra

I am not able to perform a GROUP BY on a primary key column. I am using Cassandra 3.10. When I group by, I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Group by currently only support groups of columns following their declared order in the Primary Key". My column is part of the primary key, yet I still face the problem.
My schema is
Table trends{
name text,
price int,
quantity int,
code text,
code_name text,
cluster_id text,
uitime timeuuid,
primary key((name,price),code,uitime))
with clustering order by (code DESC, uitime DESC)
And the command that I run is: select sum(quantity) from trends group by code;

For starters, your schema is invalid. You cannot set a clustering order on code if it is the partition key: the order of partitions is determined by the hash of the key (unless you use the byte-ordered partitioner, but don't do that).
The kind of query you're describing does work, though. For example, you can run:
> SELECT keyspace_name, sum(partitions_count) AS approx_partitions FROM system.size_estimates GROUP BY keyspace_name;
keyspace_name | approx_partitions
--------------------+-------------------
system_auth | 128
basic | 4936508
keyspace1 | 870
system_distributed | 0
system_traces | 0
where the schema is:
CREATE TABLE system.size_estimates (
keyspace_name text,
table_name text,
range_start text,
range_end text,
mean_partition_size bigint,
partitions_count bigint,
PRIMARY KEY ((keyspace_name), table_name, range_start, range_end)
) WITH CLUSTERING ORDER BY (table_name ASC, range_start ASC, range_end ASC)
Perhaps the pseudo-schema you provided differs from the actual one. Can you add the output of describe table xxxxx to your question?
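For the schema as posted in the question, the partition key is (name, price) and code is the first clustering column. As far as I can tell, Cassandra walks the primary key in its declared order when it validates GROUP BY, and a key column may be skipped only when the WHERE clause restricts it with an equality. The Python sketch below models that rule under those assumptions. It is not Cassandra's code, and the function name group_by_allowed is made up for illustration.

```python
def group_by_allowed(primary_key, group_by, eq_restricted=()):
    """Return True if group_by follows the declared primary key order.

    primary_key:   column names in declared order (partition key first).
    group_by:      column names listed in GROUP BY.
    eq_restricted: columns restricted with '=' in the WHERE clause.
    """
    pk_iter = iter(primary_key)
    for wanted in group_by:
        for column in pk_iter:
            if column == wanted:
                break
            # A skipped key column is acceptable only if WHERE pins it with '='.
            if column not in eq_restricted:
                return False
        else:
            # Ran out of primary key columns before finding `wanted`.
            return False
    return True


TRENDS_PK = ["name", "price", "code", "uitime"]

print(group_by_allowed(TRENDS_PK, ["code"]))                     # False
print(group_by_allowed(TRENDS_PK, ["name", "price", "code"]))    # True
print(group_by_allowed(TRENDS_PK, ["code"], {"name", "price"}))  # True
```

If the model is right, the query from the question would need either GROUP BY name, price, code or a WHERE clause that fixes name and price with = before grouping by code.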

Related

Materialised view error in Cassandra

I am new to Cassandra. I am trying to create a table and a materialized view, but it is not working.
My queries are:
-- all_orders
create table all_orders (
id uuid,
order_number bigint,
country text,
store_number bigint,
supplier_number bigint,
flow_type int,
planned_delivery_date timestamp,
locked boolean,
primary key ( order_number,store_number,supplier_number,planned_delivery_date ));
-- orders_by_date
CREATE MATERIALIZED VIEW orders_by_date AS
SELECT
id,
order_number,
country,
store_number,
supplier_number,
flow_type,
planned_delivery_date,
locked,
FROM all_orders
WHERE planned_delivery_date IS NOT NULL AND order_number IS NOT NULL
PRIMARY KEY ( planned_delivery_date )
WITH CLUSTERING ORDER BY (store_number,supplier_number);
I am getting an exception like this:
SyntaxException: <ErrorMessage code=2000 [Syntax error in CQL query]
message="line 1:7 no viable alternative at input 'MATERIALIZED' ([CREATE] MATERIALIZED...)">
Materialized views in Cassandra solve the use case of not having to maintain additional tables for querying by different partition keys, but they come with the following restrictions:
Use all base table primary keys in the materialized view as primary keys.
Optionally, add one non-PRIMARY KEY column from the base table to the materialized view's PRIMARY KEY.
Static columns are not supported as a PRIMARY KEY.
See the materialized view documentation for more details.
So the correct syntax for adding the materialized view in your case would be:
CREATE MATERIALIZED VIEW orders_by_date AS
SELECT id,
order_number,
country,
store_number,
supplier_number,
flow_type,
planned_delivery_date,
locked
FROM all_orders
WHERE planned_delivery_date IS NOT NULL AND order_number IS NOT NULL AND store_number IS NOT NULL AND supplier_number IS NOT NULL
PRIMARY KEY ( planned_delivery_date, store_number, supplier_number, order_number );
Here planned_delivery_date is the partition key, and the rows are ordered by store_number, supplier_number, order_number (the clustering columns). So a "CLUSTERING ORDER BY" clause is not required here.
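The first two restrictions can be checked mechanically before running the CREATE statement. The Python sketch below is only an illustration of those rules as listed in this answer: the function name view_primary_key_problems is hypothetical, and the static column rule is not modeled.

```python
def view_primary_key_problems(base_pk, base_columns, view_pk):
    """Return a list of rule violations for a materialized view PRIMARY KEY.

    base_pk:      primary key columns of the base table.
    base_columns: every column of the base table.
    view_pk:      primary key columns declared on the view.
    """
    problems = []
    missing = [c for c in base_pk if c not in view_pk]
    if missing:
        problems.append(
            "base primary key columns missing from view key: " + ", ".join(missing))
    unknown = [c for c in view_pk if c not in base_columns]
    if unknown:
        problems.append("columns not in base table: " + ", ".join(unknown))
    extra = [c for c in view_pk if c in base_columns and c not in base_pk]
    if len(extra) > 1:
        problems.append(
            "more than one non-primary-key column in view key: " + ", ".join(extra))
    return problems


BASE_PK = ["order_number", "store_number", "supplier_number", "planned_delivery_date"]
BASE_COLUMNS = BASE_PK + ["id", "country", "flow_type", "locked"]

# View key from the question: three base key columns are missing.
print(view_primary_key_problems(BASE_PK, BASE_COLUMNS, ["planned_delivery_date"]))
# View key from the answer: no problems reported.
print(view_primary_key_problems(
    BASE_PK, BASE_COLUMNS,
    ["planned_delivery_date", "store_number", "supplier_number", "order_number"]))
```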

Cassandra Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY

This is the query I used to create the table:
CREATE TABLE test.comments (msguuid timeuuid, page text, userid text, username text, msg text, timestamp int, PRIMARY KEY (timestamp, msguuid));
then I create a materialized view:
CREATE MATERIALIZED VIEW test.comments_by_page AS
SELECT *
FROM test.comments
WHERE page IS NOT NULL AND msguuid IS NOT NULL
PRIMARY KEY (page, timestamp, msguuid)
WITH CLUSTERING ORDER BY (msguuid DESC);
I want to get the last 50 rows sorted by timestamp in ascending order.
This is the query I'm trying:
SELECT * FROM test.comments_by_page WHERE page = 'test' AND timestamp < 1496707057 ORDER BY timestamp ASC LIMIT 50;
which then gives this error: InvalidRequest: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"
How can I accomplish this?
The rules for materialized views are basically the same as those for standard tables. If you want a specific order, you must specify it in the clustering key.
So you have to put your timestamp into the clustering order.
The clustering order statement should be modified as below:
// List every clustering column, in the order declared in the PRIMARY KEY
WITH CLUSTERING ORDER BY (timestamp ASC, msguuid DESC)
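The question asks for the last 50 rows, shown in ascending order. One common approach, assuming the view is clustered by timestamp ascending, is to read the partition newest-first with a LIMIT (for example ORDER BY timestamp DESC LIMIT 50, which reverses the declared clustering order) and then flip the page on the client. The Python sketch below simulates that with a plain sorted list instead of a real driver call; the function name last_n_ascending is made up for illustration.

```python
def last_n_ascending(partition_rows, before_ts, n):
    """Return the n newest rows older than before_ts, oldest first.

    partition_rows: list of (timestamp, msguuid) tuples sorted by ascending
    timestamp, standing in for one partition of the view.
    """
    # Stand-in for: WHERE timestamp < before_ts ORDER BY timestamp DESC LIMIT n
    newest_first = [row for row in reversed(partition_rows) if row[0] < before_ts][:n]
    # Flip the page so the caller sees ascending timestamps.
    return list(reversed(newest_first))


rows = [(ts, "msg-%d" % ts) for ts in range(1, 11)]
print(last_n_ascending(rows, before_ts=9, n=3))
# [(6, 'msg-6'), (7, 'msg-7'), (8, 'msg-8')]
```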

and clause in cql cassandra

I have created a table with this schema:
CREATE TABLE iplocation (
"idIPLocation" uuid,
"fromIP" bigint,
"toIP" bigint,
"idCity" uuid,
"idCountry" uuid,
"idProvince" uuid,
"isActive" boolean,
PRIMARY KEY ("idIPLocation", "fromIP", "toIP")
)
and inserted some records in it.
Now I want to fetch a record like this:
select * from iplocation where "toIP" <= 3065377522 and "fromIP" >= 3065377522 ALLOW FILTERING;
but it is giving me this error:
A column of a clustering key can be restricted only if the preceding one is restricted by an Equal relation.
You need to restrict fromIP before restrict toIP.
But if I want to do just
select * from iplocation where "toIP" <= 3065377522 ALLOW FILTERING;
it still says
A column of a clustering key can be restricted only if the preceding one is restricted by an Equal relation.
You need to restrict fromIP before restrict toIP.
I can't figure out what the problem is.
You are misusing the partition key concept. In your case the partition key is idIPLocation. Cassandra uses this key to know which partition data will be written to or read from, so in your select statement you have to provide the partition key. Then you can filter data within that partition by providing fromIP and toIP.
You have four solutions:
1) Choose a better partition key: for example, you could use the following primary key clause: PRIMARY KEY ("toIP"). But in your case I guess this solution won't work because you want to query data by idIPLocation too.
2) Denormalize: add a new table with the same data structure but a different partition key, like so:
CREATE TABLE backup_advertyze.iplocation (
"idIPLocation" uuid,
"fromIP" bigint,
"toIP" bigint,
"idCity" uuid,
"idCountry" uuid,
"idProvince" uuid,
"isActive" boolean,
PRIMARY KEY ("idIPLocation", "fromIP", "toIP")
);
CREATE TABLE backup_advertyze.iplocationbytoip (
"idIPLocation" uuid,
"fromIP" bigint,
"toIP" bigint,
"idCity" uuid,
"idCountry" uuid,
"idProvince" uuid,
"isActive" boolean,
PRIMARY KEY ("toIP", "fromIP")
);
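With this structure you can look up rows by "toIP", for example: select * from iplocationbytoip where "toIP" = 3065377522 and "fromIP" <= 3065377522. Note that the partition key ("toIP") can only be restricted with = or IN, so a range condition on it is still rejected.
But with this solution you have to maintain duplicate data in two tables.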
3) Use a materialized view:
This is the same concept as 2), but you only have to maintain the data in one table instead of two:
CREATE TABLE backup_advertyze.iplocation (
"idIPLocation" uuid,
"fromIP" bigint,
"toIP" bigint,
"idCity" uuid,
"idCountry" uuid,
"idProvince" uuid,
"isActive" boolean,
PRIMARY KEY ("idIPLocation", "fromIP", "toIP")
);
CREATE MATERIALIZED VIEW backup_advertyze.iplocationbytoip
AS
SELECT *
FROM backup_advertyze.iplocation
WHERE "idIPLocation" IS NOT NULL
AND "fromIP" IS NOT NULL
AND "toIP" IS NOT NULL
PRIMARY KEY ("toIP", "fromIP", "idIPLocation");
4) The simplest solution, which I don't recommend because of query performance issues, is to use secondary indexes:
CREATE INDEX iplocationfromindex ON backup_advertyze.iplocation("fromIP");
Then you can run your query select * from iplocation where "toIP" <= 3065377522 and "fromIP" >= 3065377522 ALLOW FILTERING;.
Hope this helps.
First of all, use of the ALLOW FILTERING directive is horribly inefficient, and its use is considered to be an anti-pattern. If you find yourself having to use it to satisfy a query requirement, you should instead build a new table that better suits your query, perhaps one that makes better use of your partition keys for data retrieval.
select * from iplocation
where "toIP" <= 3065377522 and "fromIP" >= 3065377522 ALLOW FILTERING;
This fails because Cassandra only allows non-equality conditions (>, >=, <, <=) on a single clustering column, and it has to be the last one restricted.
select * from iplocation
where "toIP" <= 3065377522 ALLOW FILTERING;
This fails with the same error message because the query works against what Cassandra does best: reading a single row or a contiguous range of ordered rows off disk. Essentially, you are asking it to perform random reads, because it will have to check every node in your cluster to satisfy this query. As Cassandra is designed to support large scale, that could introduce a lot of network time into your query, which is something it is trying to save you from.
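The rule behind both error messages can be written down as a small check. The Python sketch below is a simplified model, not Cassandra's implementation: it allows one operator per column, ignores the partition key and ALLOW FILTERING, and the function name check_clustering_restrictions is made up for illustration.

```python
def check_clustering_restrictions(clustering_columns, restrictions):
    """Return None if the restrictions are acceptable, else an error string.

    clustering_columns: clustering column names in declared order.
    restrictions:       dict of column name -> operator ('=', '<', '<=', '>', '>=').
    """
    previous = None
    previous_op = "="
    for column in clustering_columns:
        op = restrictions.get(column)
        # A column may be restricted only if the one before it is pinned by '='.
        if op is not None and previous is not None and previous_op != "=":
            return "%s cannot be restricted: preceding column %s is not restricted by '='" % (
                column, previous)
        previous, previous_op = column, op
    return None


CLUSTERING = ["fromIP", "toIP"]

print(check_clustering_restrictions(CLUSTERING, {"toIP": "<=", "fromIP": ">="}))
print(check_clustering_restrictions(CLUSTERING, {"toIP": "<="}))
print(check_clustering_restrictions(CLUSTERING, {"fromIP": "=", "toIP": "<="}))
```

The first two calls correspond to the two failing queries and return an error string; the third, with fromIP pinned by equality, returns None.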
To solve this issue, I would rework the table with an appropriate partition key (as mentioned above), a single IP address column, and a from/to column, all as part of the key. It would look something like this:
CREATE TABLE iplocation (
idIPLocation uuid,
IP bigint,
fromTo text,
idCity uuid,
idCountry uuid,
idProvince uuid,
isActive boolean,
PRIMARY KEY (idIPLocation, IP, fromTo)
);
Now you essentially store each range as two rows, one for the starting IP and one for the ending IP. The rows are differentiated by an F or a T as a clustering key to tell you which is the "From IP" and which is the "To IP."
aploetz@cqlsh:stackoverflow> SELECT * FROM iplocation
WHERE idiplocation=76080f76-92f7-4d25-a531-a44c38ff38a7
AND IP>=10000 AND IP<=3065377522;
idiplocation | ip | fromto | idcity | idcountry | idprovince | isactive
--------------------------------------+----------+--------+--------------------------------------+--------------------------------------+--------------------------------------+----------
76080f76-92f7-4d25-a531-a44c38ff38a7 | 10001 | F | 6921a08b-c156-428e-8d4f-b371ff13f073 | f33bd5ed-b9b3-419b-99ab-ac2a7c87ba55 | 5a13cfcc-382e-418a-aeae-309f43671336 | True
76080f76-92f7-4d25-a531-a44c38ff38a7 | 10480101 | T | 6921a08b-c156-428e-8d4f-b371ff13f073 | f33bd5ed-b9b3-419b-99ab-ac2a7c87ba55 | 5a13cfcc-382e-418a-aeae-309f43671336 | True
(2 rows)
This is similar to how I model problems where data points have a range of both a starting and ending time. While your end solution will probably be different, the modeling mechanism here is something that may work for you.
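To see why a single IP column makes the range query cheap, picture one partition as a list kept sorted by IP: a range condition then selects one contiguous slice. The Python sketch below imitates that with bisect. It is only an analogy for clustering order inside a partition, and the function name rows_in_ip_range is made up for illustration.

```python
from bisect import bisect_left, bisect_right


def rows_in_ip_range(partition, low, high):
    """Return rows with low <= ip <= high from one partition.

    partition: list of (ip, from_to) tuples already sorted by ip, mimicking
    rows kept in clustering order inside a single Cassandra partition.
    """
    ips = [ip for ip, _ in partition]
    # Two binary searches bound one contiguous slice; nothing outside it is read.
    return partition[bisect_left(ips, low):bisect_right(ips, high)]


# IP values taken from the sample output above; the last row is made up.
partition = [(10001, "F"), (10480101, "T"), (4000000000, "F")]
print(rows_in_ip_range(partition, 10000, 3065377522))
# [(10001, 'F'), (10480101, 'T')]
```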

Non-EQ relation error Cassandra - how fix primary key?

I created a table called posts. When I run this SELECT request:
return $this->db->query('SELECT * FROM "posts" WHERE "id" IN(:id) LIMIT '.$this->limit_per_page, ['id' => $id]);
I get this error:
PRIMARY KEY column "id" cannot be restricted (preceding column
"post_at" is either not restricted or by a non-EQ relation)
My table dump is:
CREATE TABLE posts (
id uuid,
post_at timestamp,
user_id bigint,
name text,
category set<text>,
link varchar,
image set<varchar>,
video set<varchar>,
content map<text, text>,
private boolean,
PRIMARY KEY (user_id,post_at,id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
I read some articles about PRIMARY and CLUSTERING keys and understood that when the primary key has several columns, I need to restrict the preceding ones with the = operator before using IN. In my case, I cannot use a single-column PRIMARY KEY. What would you advise me to change in the table structure so that the error disappears?
My dummy table structure:
CREATE TABLE posts (
id timeuuid,
post_at timestamp,
user_id bigint,
PRIMARY KEY (id,post_at,user_id)
)
WITH CLUSTERING ORDER BY (post_at DESC);
After inserting some dummy data, I ran the query select * from posts where id in (timeuuid1,timeuuid2,timeuuid3); which works because id is now the partition key.
I was using Cassandra 2.0 with CQL 3.0.

ORDER BY with 2ndary indexes is not supported

I am using Cassandra 2.1 with the latest CQL.
Here is my table & indexes:
CREATE TABLE mydata.chats_new (
id bigint,
adid bigint,
fromdemail text,
fromemail text,
fromjid text,
messagebody text,
messagedatetime text,
messageid text,
messagetype text,
todemail text,
toemail text,
tojid text,
PRIMARY KEY(messageid,messagedatetime)
);
CREATE INDEX user_fromJid ON mydata.chats_new (fromjid);
CREATE INDEX user_toJid ON mydata.chats_new (tojid);
CREATE INDEX user_adid ON mydata.chats_new (adid);
When I execute this query:
select * from chats_new WHERE fromjid='test' AND toJid='test1' ORDER BY messagedatetime DESC;
I get this error:
code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."
So how should I fetch this data?
select * from chats_new
WHERE fromjid='test' AND toJid='test1'
ORDER BY messagedatetime DESC;
code=2200 [Invalid query] message="ORDER BY with 2ndary indexes is not supported."
To get the WHERE clause of this query to work, I would build a specific query table, like this:
CREATE TABLE mydata.chats_new_by_fromjid_and_tojid (
id bigint,
adid bigint,
fromdemail text,
fromemail text,
fromjid text,
messagebody text,
messagedatetime text,
messageid text,
messagetype text,
todemail text,
toemail text,
tojid text,
PRIMARY KEY((fromjid, tojid), messagedatetime, messageid)
);
Note the primary key definition. This creates a partitioning key out of fromjid and tojid. While this will allow you to query on both fields, it will also require both fields to be specified in all queries on this table. But that's why they call it a "query table", as it is generally designed to serve one particular query.
As for the remaining fields in the primary key, I kept messagedatetime as the first clustering column, to assure on-disk sort order. Default ordering in Cassandra is ascending, so if you want to change that at query time, that's where your ORDER BY messagedatetime DESC comes into play. And lastly, I made sure that the messageid was the second clustering column, to help ensure primary key uniqueness (assuming that messageid is unique).
Now, this query will work:
select * from chats_new_by_fromjid_and_tojid
WHERE fromjid='test' AND toJid='test1'
ORDER BY messagedatetime DESC;
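As a rough mental model of this query table, each (fromjid, tojid) pair owns one partition, and rows inside it are keyed and sorted by (messagedatetime, messageid). The Python sketch below imitates that with dictionaries. It is an in-memory analogy only, not a driver example, and the class name ChatsByJids and its methods are made up for illustration.

```python
from collections import defaultdict


class ChatsByJids:
    """Toy model of a table keyed by ((fromjid, tojid), messagedatetime, messageid)."""

    def __init__(self):
        # Partition key -> {clustering key -> message body}
        self._partitions = defaultdict(dict)

    def insert(self, fromjid, tojid, messagedatetime, messageid, body):
        # Writing the same full primary key again overwrites the row (an upsert).
        self._partitions[(fromjid, tojid)][(messagedatetime, messageid)] = body

    def select(self, fromjid, tojid, descending=False):
        rows = self._partitions.get((fromjid, tojid), {})
        # messagedatetime is text, so this is a string sort; ISO-style dates
        # happen to sort chronologically.
        keys = sorted(rows, reverse=descending)
        return [(dt, mid, rows[(dt, mid)]) for dt, mid in keys]


chats = ChatsByJids()
chats.insert("test", "test1", "2015-03-01 10:00", "m1", "hello")
chats.insert("test", "test1", "2015-03-01 10:05", "m2", "how are you?")
chats.insert("test", "other", "2015-03-01 10:06", "m3", "unrelated")

for row in chats.select("test", "test1", descending=True):
    print(row)
```

Both partition key values are required by select, which mirrors why every query on this table must specify fromjid and tojid.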
If you need to query this data by additional criteria, I highly recommend that you create additional query table(s). Remember, Cassandra works best with tables that are specifically designed for each query they serve. It's ok to replicate your data a few times, because disk space is cheap...operation time is not.
Also, DataStax has a great article on when not to use a secondary index. It's definitely worth a read.

Resources