Cassandra - How to group by latest timestamp


I saw some related topics here, but it still isn't clear to me how to group by the latest row values with Cassandra 4.0.1.
Let's say my table looks like this:
CREATE TABLE simple_search (
engine text,
term text,
time bigint,
rank bigint,
url text,
domain text,
pagenum bigint,
descr text,
display_url text,
title text,
type text,
PRIMARY KEY ((domain), term , time , engine, url , pagenum)
) WITH CLUSTERING ORDER BY (term DESC, time DESC, engine DESC , url DESC);
My data looks like:
SELECT time, rank, term from search_by_domain_termsV2 where domain ='zerotoappstore.com'
time , rank, term
1633297772, 105, avfoundation swift
1633315263, 112, best ide
1633332881, 119, best ide
1633365856, 50, developing an app cost
1633375273, 36, developing an app cost
After the GROUP BY, I want to get:
time , rank, term
1633297772, 105, avfoundation swift
1633332881, 119, best ide
1633375273, 36, developing an app cost
If I do
SELECT max(time) , rank, term from search_by_domain_termsV2 where domain ='zerotoappstore.com' GROUP BY term;
it gives me the correct max time value but not the matching rank for each term:
1633297772 105 avfoundation swift
1633332881 112 best ide
1633375273 50 developing an app cost
Is it possible to group by term and take the row with the max value of time?

@VitalyT,
First, if pagenum is not specified in the CLUSTERING ORDER BY clause of the CREATE TABLE statement, you get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering key columns must exactly match columns in CLUSTERING ORDER BY directive"
So it has to be as follows:
CREATE TABLE IF NOT EXISTS simple_search(
...
PRIMARY KEY (domain, term, time, engine, url, pagenum)
) WITH CLUSTERING ORDER BY (term DESC, time DESC, engine DESC, url [ASC|DESC], pagenum [ASC|DESC]);
Next, take the given data sample of 5 rows. Note that I assumed some values for the engine, url, and pagenum columns, as those values weren't provided in the original question:
SELECT * FROM simple_search ;
domain | term | time | engine | url | pagenum | descr | display_url | rank | title | type
--------------------+------------------------+------------+---------+------+---------+-------+-------------+------+-------+------
zerotoappstore.com | developing an app cost | 1633375273 | engine5 | url5 | 5 | null | null | 36 | null | null
zerotoappstore.com | developing an app cost | 1633365856 | engine4 | url4 | 4 | null | null | 50 | null | null
zerotoappstore.com | best ide | 1633332881 | engine3 | url3 | 3 | null | null | 119 | null | null
zerotoappstore.com | best ide | 1633315263 | engine2 | url2 | 2 | null | null | 112 | null | null
zerotoappstore.com | avfoundation swift | 1633297772 | engine1 | url1 | 1 | null | null | 105 | null | null
(5 rows)
We would get the following result if we only retrieve the MAX(time) column (without any GROUP BY):
SELECT MAX(time),rank,term FROM simple_search WHERE domain = 'zerotoappstore.com';
system.max(time) | rank | term
------------------+------+------------------------
1633375273 | 36 | developing an app cost
(1 rows)
Now, let's see what happens if we add the GROUP BY term clause to the exact same SELECT statement:
SELECT MAX(time), rank, term FROM simple_search WHERE domain = 'zerotoappstore.com' GROUP BY term;
system.max(time) | rank | term
------------------+------+------------------------
1633375273 | 36 | developing an app cost
1633332881 | 119 | best ide
1633297772 | 105 | avfoundation swift
(3 rows)
What if we remove the MAX aggregate function on the time column, since the data is already stored in descending order of time? We get the following:
SELECT time,rank,term FROM simple_search WHERE domain = 'zerotoappstore.com' GROUP BY term;
time | rank | term
------------+------+------------------------
1633375273 | 36 | developing an app cost
1633332881 | 119 | best ide
1633297772 | 105 | avfoundation swift
(3 rows)
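Is this what you want as your result? It works because, when a column is selected without an aggregate function in a GROUP BY query, Cassandra returns the first value it encounters in each group, and the rows here are clustered by time in descending order. Please also see the corresponding documentation for the conditions under which GROUP BY can be used.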
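To illustrate why the last query returns the latest rank for each term, here is a small Python sketch that mimics the behaviour on the five sample rows: the rows are already in clustering order (term DESC, time DESC), and only the first row of each term is kept. This is plain in-memory code rather than a driver call, and the function name first_row_per_term is made up for the illustration.

```python
from itertools import groupby

# Sample rows as (term, time, rank), already in the table's clustering
# order: term DESC, then time DESC (same order as the SELECT * output above).
rows = [
    ("developing an app cost", 1633375273, 36),
    ("developing an app cost", 1633365856, 50),
    ("best ide", 1633332881, 119),
    ("best ide", 1633315263, 112),
    ("avfoundation swift", 1633297772, 105),
]

def first_row_per_term(ordered_rows):
    """Keep only the first row of each consecutive run of equal terms."""
    return [next(group) for _, group in groupby(ordered_rows, key=lambda r: r[0])]

latest = first_row_per_term(rows)
for term, time, rank in latest:
    print(time, rank, term)
```

If time were clustered in ascending order instead, the same logic would pick the oldest row of each term, which matches the unexpected ranks shown in the question.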

Related

Does Cassandra's CQL support the modulo operator for filtering queries?

I need to make a selection by the value of the remainder of the division:
cqlsh> SELECT * FROM table WHERE key%10=1;
Invalid syntax at line 1, char 39
SELECT * FROM table WHERE key%10=1;
^
Does CQL allow such queries?

CQL does not support modulo operations on the partition key.
You can only use the exact (literal) value of the partition key to filter in CQL queries. Cheers!

So I went to try this out with a simple table:
CREATE TABLE stackoverflow.keys (
month int,
id uuid,
key int,
PRIMARY KEY (month, id));
I was able to get this to work:
> SELECT month,month%10,id,key,key%10 AS "key mod 10"
FROM keys2 WHERE month=202208;
month | month % 10 | id | key | key mod 10
--------+------------+--------------------------------------+------+------------
202208 | 8 | 2fe7e98f-d1e2-45df-91f6-fa1430995fdc | 12 | 2
202208 | 8 | 59d04401-d11f-472d-a606-a33d380dc017 | 800 | 0
202208 | 8 | 92d3fa01-3b1e-4649-9280-786d75e2b9dc | 1157 | 7
202208 | 8 | 02612042-a7de-49ce-b958-ee60853ba51c | 2660 | 0
However, I was not able to get the modulus operator to work in the WHERE clause.
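One possible workaround, assuming the partition is small enough to read in full, is to apply the modulo filter in the client after the query returns. The sketch below uses the four rows from the output above as plain Python dicts; in a real application they would come from a driver query, and the helper name filter_by_remainder is hypothetical.

```python
# Rows as they might come back from the SELECT above for month = 202208.
rows = [
    {"month": 202208, "id": "2fe7e98f-d1e2-45df-91f6-fa1430995fdc", "key": 12},
    {"month": 202208, "id": "59d04401-d11f-472d-a606-a33d380dc017", "key": 800},
    {"month": 202208, "id": "92d3fa01-3b1e-4649-9280-786d75e2b9dc", "key": 1157},
    {"month": 202208, "id": "02612042-a7de-49ce-b958-ee60853ba51c", "key": 2660},
]

def filter_by_remainder(rows, column, divisor, remainder):
    """Return the rows where row[column] % divisor equals remainder."""
    return [row for row in rows if row[column] % divisor == remainder]

# Client-side equivalent of the unsupported "WHERE key % 10 = 0".
matching = filter_by_remainder(rows, "key", 10, 0)
for row in matching:
    print(row["id"], row["key"])
```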

Order by in materialized view doesn't sort the results

I have a table with a structure like this:
CREATE TABLE kaefko.se_vi_f55dfeebae00d2b3 (
value text PRIMARY KEY,
id text,
popularity bigint);
With data that looks like this:
value | id | popularity
--------+------------------+------------
rally | 4eff16cb91f96cd6 | 2
reddit | 11aa39686ed66ba5 | 3
red | 552d7e95af481415 | 1
really | 756bfa499965863c | 1
right | c5850c6b08f7966b | 1
redis | 7f1d251f399442d7 | 1
And I've created a materialized view that should sort these values by the popularity from the biggest to the smallest ones:
CREATE MATERIALIZED VIEW kaefko.se_vi_f55dfeebae00d2b3_by_popularity AS
SELECT *
FROM kaefko.se_vi_f55dfeebae00d2b3
WHERE popularity IS NOT null
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
But the data in the materialized view looks like this:
value | popularity | id
--------+------------+------------------
rally | 2 | 4eff16cb91f96cd6
reddit | 3 | 11aa39686ed66ba5
really | 1 | 756bfa499965863c
right | 1 | c5850c6b08f7966b
redis | 1 | 7f1d251f399442d7
As you can see there are two main issues:
Data is not sorted as defined in the materialized view
There is just a part of all data in the materialized view
I'm not very experienced in Cassandra and I've already spent hours trying to find the reason why this happens, to no avail. Could somebody please help me? Thank you <3
__
I'm using ScyllaDB 4.1.9-0 and cqlsh shows this:
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]

Alex's comment is 100% correct, the order is within the partition.
PRIMARY KEY (value, popularity)
WITH CLUSTERING ORDER BY (popularity DESC);
This means that popularity is ordered descending only among rows where the 'value' field is the same. If I were to alter the data you used to show what this would look like, you would get the following:
value | popularity | id
--------+------------+------------------
rally | 3 | 4eff16cb91f96cd6
rally | 2 | 11aa39686ed66ba5
really | 3 | 756bfa499965863c
really | 2 | c5850c6b08f7966b
really | 1 | 7f1d251f399442d7
The order is on a per partition key basis, not globally ordered.
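If a globally sorted list is needed and the data set is small enough to read completely, one option is to sort in the client. The following Python sketch does that for the six rows from the question; it is only an in-memory illustration, not a replacement for a table keyed for this query.

```python
# (value, id, popularity) rows copied from the base table in the question.
rows = [
    ("rally", "4eff16cb91f96cd6", 2),
    ("reddit", "11aa39686ed66ba5", 3),
    ("red", "552d7e95af481415", 1),
    ("really", "756bfa499965863c", 1),
    ("right", "c5850c6b08f7966b", 1),
    ("redis", "7f1d251f399442d7", 1),
]

# Sort by popularity, highest first. sorted() is stable, so rows with the
# same popularity keep their original relative order.
by_popularity = sorted(rows, key=lambda row: row[2], reverse=True)

for value, _id, popularity in by_popularity:
    print(value, popularity)
```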

Understanding Cassandra static field

I'm learning Cassandra through its documentation, and I'm currently on batches and static fields.
In their example at the end of the page, they somehow managed to make balance have two different values (-200, -208) even though it's a static field.
Could someone explain to me how this is possible? I've read the whole page but I did not catch on.

In Cassandra, a static field is static within a partition key.
Example: let's define a table
CREATE TABLE static_test (
pk int,
ck int,
d int,
s int static,
PRIMARY KEY (pk, ck)
);
Here pk is the partition key and ck is the clustering key.
Let's insert some data :
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 10, 100, 1000);
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 2, 20, 200, 2000);
If we select the data
pk | ck | s | d
----+----+------+-----
1 | 10 | 1000 | 100
2 | 20 | 2000 | 200
Here, for partition key pk = 1 the value of the static field s is 1000, and for partition key pk = 2 it is 2000.
If we insert a new row that sets the static field s for partition key pk = 1
INSERT INTO static_test (pk , ck , d , s ) VALUES ( 1, 11, 101, 1001);
then the value of the static field s changes for all the rows of partition key pk = 1
pk | ck | s | d
----+----+------+-----
1 | 10 | 1001 | 100
1 | 11 | 1001 | 101
2 | 20 | 2000 | 200
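The behaviour above can be pictured with a toy Python model in which each partition keeps its static values once, separately from its rows. This is only a sketch for intuition, not Cassandra's actual storage code; the Partition class and the insert helper are invented for the example.

```python
class Partition:
    """Toy model of one partition: static values are kept once, not per row."""

    def __init__(self):
        self.static = {}  # static column name -> value, shared by all rows
        self.rows = {}    # clustering key -> {regular column name -> value}

    def insert(self, ck, regular, static=None):
        self.rows.setdefault(ck, {}).update(regular)
        if static:
            self.static.update(static)  # overwrites the value for every row

    def select(self):
        # Every row is returned with the partition's current static values.
        return [{"ck": ck, **self.static, **cols}
                for ck, cols in sorted(self.rows.items())]

table = {}  # partition key -> Partition

def insert(pk, ck, d, s):
    table.setdefault(pk, Partition()).insert(ck, {"d": d}, {"s": s})

insert(1, 10, 100, 1000)
insert(2, 20, 200, 2000)
insert(1, 11, 101, 1001)  # also changes s for the existing row ck = 10

partition_1 = table[1].select()
partition_2 = table[2].select()
print(partition_1)
print(partition_2)
```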

In a table that uses clustering columns, non-clustering columns can be declared static in the table definition. Static columns are only static within a given partition.
Example:
CREATE TABLE test (
partition_column text,
static_column text STATIC,
clustering_column int,
PRIMARY KEY (partition_column , clustering_column)
);
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'A', 0);
INSERT INTO test (partition_column, clustering_column) VALUES ('key1', 1);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
----------------+-------------------+--------------
key1 | 0 | A
key1 | 1 | A
Observation:
Once a column is declared static, every row in the partition shares the value set for that partition key, even rows inserted without it.
Now, let's insert another record
INSERT INTO test (partition_column, static_column, clustering_column) VALUES ('key1', 'C', 2);
SELECT * FROM test;
Results:
partition_column | clustering_column | static_column
----------------+-------------------+--------------
key1 | 0 | C
key1 | 1 | C
key1 | 2 | C
Observation:
If you update the static column, or insert another record with an updated static column value, the value is reflected across all the rows of that partition ==> static column values are static (constant) within a given partition
Restriction (from the DataStax reference documentation below):
A table that does not define any clustering columns cannot have a static column. The table having no clustering columns has a one-row partition in which every column is inherently static.
A table defined with the COMPACT STORAGE directive cannot have a static column.
A column designated to be the partition key cannot be static.
Reference : DataStax Reference

In the example on the page you've linked they don't have different values at the same point in time.
They first have the static balance field set to -208 for the whole user1 partition:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -208 | 8 | burrito | False
user1 | 2 | -208 | 200 | hotel room | False
Then they apply a batch update statement that sets the balance value to -200:
BEGIN BATCH
UPDATE purchases SET balance=-200 WHERE user='user1' IF balance=-208;
UPDATE purchases SET paid=true WHERE user='user1' AND expense_id=1 IF paid=false;
APPLY BATCH;
This updates the balance field for the whole user1 partition to -200:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | -200 | 8 | burrito | True
user1 | 2 | -200 | 200 | hotel room | False
The point of a static field is that you can update its value for the whole partition at once. So if I executed the following statement:
UPDATE purchases SET balance=42 WHERE user='user1'
I would get the following result:
user | expense_id | balance | amount | description | paid
-------+------------+---------+--------+-------------+-------
user1 | 1 | 42 | 8 | burrito | True
user1 | 2 | 42 | 200 | hotel room | False

retrieving data from cassandra database

I'm working on smart parking data stored in a Cassandra database, and I'm trying to get the last status of each device.
I'm working with a self-made dataset.
Here's the description of the table.
table description
select * from parking.meters
Need help, please!

"trying to get the last status of each device"
In Cassandra, you need to design your tables according to your query patterns. Building a table, filling it with data, and then trying to fulfill a query requirement is a very backward approach. The point is that if you really need to satisfy that query, then your table should have been designed to serve that query from the beginning.
That being said, there may still be a way to make this work. You haven't mentioned which version of Cassandra you are using, but if you are on 3.6+, you can use the PER PARTITION LIMIT clause on your SELECT.
If I build your table structure and INSERT some of your rows:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters ;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 20 | 2017-01-10T09:11:51Z | True
1 | 20 | 2017-01-01T13:51:50Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 7 | 2016-12-02T16:50:04Z | True
1 | 7 | 2016-11-24T23:38:31Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
1 | 19 | 2016-11-22T15:15:23Z | False
(8 rows)
And I consider your PRIMARY KEY and CLUSTERING ORDER definitions:
PRIMARY KEY ((parking_id, device_id), date, status)
) WITH CLUSTERING ORDER BY (date DESC, status ASC);
You are at least clustering by date (which should be an actual date type, not a text), so that will order your rows in a way that helps you here:
aploetz@cqlsh:stackoverflow> SELECT * FROM meters PER PARTITION LIMIT 1;
parking_id | device_id | date | status
------------+-----------+----------------------+--------
1 | 20 | 2017-01-12T12:14:58Z | False
1 | 7 | 2017-01-13T01:20:02Z | False
1 | 19 | 2016-12-14T11:36:26Z | True
(3 rows)
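To make the effect of PER PARTITION LIMIT 1 concrete, here is a small Python sketch that applies the same idea to the eight rows above: rows arrive in clustering order (date DESC) within each (parking_id, device_id) partition, and only the first row per partition is kept. It runs on in-memory tuples rather than a live cluster, and the function name per_partition_limit is made up for the illustration.

```python
# (parking_id, device_id, date, status) in the order returned by SELECT *.
rows = [
    (1, 20, "2017-01-12T12:14:58Z", False),
    (1, 20, "2017-01-10T09:11:51Z", True),
    (1, 20, "2017-01-01T13:51:50Z", False),
    (1, 7, "2017-01-13T01:20:02Z", False),
    (1, 7, "2016-12-02T16:50:04Z", True),
    (1, 7, "2016-11-24T23:38:31Z", False),
    (1, 19, "2016-12-14T11:36:26Z", True),
    (1, 19, "2016-11-22T15:15:23Z", False),
]

def per_partition_limit(rows, limit):
    """Keep at most `limit` rows for each (parking_id, device_id) partition."""
    seen = {}
    result = []
    for row in rows:
        key = row[:2]  # the partition key columns
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= limit:
            result.append(row)
    return result

latest_status = per_partition_limit(rows, 1)
for row in latest_status:
    print(row)
```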

Cassandra: Searching for NULL values

I have a table MACRecord in Cassandra as follows :
CREATE TABLE has.macrecord (
macadd text PRIMARY KEY,
position int,
record int,
rssi1 float,
rssi2 float,
rssi3 float,
rssi4 float,
rssi5 float,
timestamp timestamp
)
I have 5 different nodes, each updating a row based on its number, i.e. node 1 just updates rssi1, node 2 just updates rssi2, etc. This evidently creates null values for the other columns.
I cannot seem to find a query which will give me only those rows which are not null. Specifically, I have referred to this post.
I want to be able to query, for example, like SELECT * FROM MACRecord WHERE RSSI1 != NULL, as in MySQL. However, it seems that neither null values nor comparison operators such as != are supported in CQL.
Is there an alternative to putting NULL values or a special flag? I am inserting floats, so unlike with strings I cannot insert something like ''. What is a possible workaround for this problem?
Edit :
My data model in MYSQL was like this :
+-----------+--------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+-------------------+-----------------------------+
| MACAdd | varchar(17) | YES | UNI | NULL | |
| Timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| Record | smallint(6) | YES | | NULL | |
| RSSI1 | decimal(5,2) | YES | | NULL | |
| RSSI2 | decimal(5,2) | YES | | NULL | |
| RSSI3 | decimal(5,2) | YES | | NULL | |
| RSSI4 | decimal(5,2) | YES | | NULL | |
| RSSI5 | decimal(5,2) | YES | | NULL | |
| Position | smallint(6) | YES | | NULL | |
+-----------+--------------+------+-----+-------------------+-----------------------------+
Each node (1-5) was querying MySQL based on its number, for example node 1: "SELECT * FROM MACRecord WHERE RSSI1 IS NOT NULL"
I updated my data model in Cassandra as follows, so that rssi1-rssi5 are now text types.
CREATE TABLE has.macrecord (
macadd text PRIMARY KEY,
position int,
record int,
rssi1 text,
rssi2 text,
rssi3 text,
rssi4 text,
rssi5 text,
timestamp timestamp
)
I was thinking that each node would initially insert the string 'NULL' for a record, and when actual rssi data comes in it would just replace the 'NULL' string. That would avoid tombstones, and it would more or less signal to the user that the values are not valid pieces of data, since they are flagged 'NULL'.
However, I am still puzzled as to how I will retrieve results like I did in MySQL. There is no != operator in Cassandra. How can I write a query which will give me a result set like, for example, "SELECT * FROM HAS.MACRecord WHERE RSSI1 != 'NULL'"?

You can only select rows in CQL based on the PRIMARY KEY fields, which by definition cannot be null. This also applies to secondary indexes. So I don't think Cassandra will be able to do the filtering you want on the data fields. You could select on some other criteria and then write your client to ignore rows that had null values.
Or you could create a different table for each rssiX value, so that none of them would be null.
If you are only interested in some kind of aggregation, then null values effectively count as zero in a sum. So you could do something like this:
SELECT sum(rssi1) FROM has.macrecord WHERE macadd='someadd';
The sum() function is available in Cassandra 2.2 and later.
You might also be able to do some kind of trick with a user defined function/aggregate, but I think it would be simpler to have multiple tables.
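The client-side approach mentioned above (select on other criteria, then ignore rows with null values) could look roughly like the following Python sketch. The sample rows and MAC addresses are hypothetical, and None stands in for a CQL null as most drivers would return it; the helper name rows_with_value is also invented for the example.

```python
# Hypothetical rows as a client might receive them; None represents null.
rows = [
    {"macadd": "aa:bb:cc:dd:ee:01", "rssi1": -42.5, "rssi2": None},
    {"macadd": "aa:bb:cc:dd:ee:02", "rssi1": None, "rssi2": -60.0},
    {"macadd": "aa:bb:cc:dd:ee:03", "rssi1": -71.0, "rssi2": -55.5},
]

def rows_with_value(rows, column):
    """Return only the rows where `column` is present and not null."""
    return [row for row in rows if row.get(column) is not None]

# Client-side equivalent of "WHERE rssi1 IS NOT NULL" for node 1.
node1_rows = rows_with_value(rows, "rssi1")
for row in node1_rows:
    print(row["macadd"], row["rssi1"])
```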
