ScyllaDB window function-like query - cql

I am using ScyllaDB open-source version 4.4.
I am trying to figure out how to write a query that I would have written with a window function or a UNION set operator if this were a traditional relational database with full SQL.
A simplified table schema:
CREATE TABLE mykeyspace.mytable (
    name text,
    timestamp_utc_nanoseconds bigint,
    value bigint,
    PRIMARY KEY ((name), timestamp_utc_nanoseconds)
) WITH CLUSTERING ORDER BY (timestamp_utc_nanoseconds DESC);
My query needs to calculate and return 6 values, each of which is the average of "value" over one of the 6 previous minutes.
In pseudo-code:
SELECT AVG(value) AS one_min_avg_6_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*6 minutes ago*] AND timestamp_utc_nanoseconds < [*5 minutes ago*];
SELECT AVG(value) AS one_min_avg_5_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*5 minutes ago*] AND timestamp_utc_nanoseconds < [*4 minutes ago*];
SELECT AVG(value) AS one_min_avg_4_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*4 minutes ago*] AND timestamp_utc_nanoseconds < [*3 minutes ago*];
SELECT AVG(value) AS one_min_avg_3_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*3 minutes ago*] AND timestamp_utc_nanoseconds < [*2 minutes ago*];
SELECT AVG(value) AS one_min_avg_2_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*2 minutes ago*] AND timestamp_utc_nanoseconds < [*1 minute ago*];
SELECT AVG(value) AS one_min_avg_1_min_ago FROM mykeyspace.mytable WHERE name = 'some_name' AND timestamp_utc_nanoseconds >= [*1 minute ago*];
My client-side is C# .NET 5. I can easily do pretty much anything on the client side. But the latency in this case is going to be a big problem.
My question is:
How can I combine these 6 queries into one result set on the server side (not on the client app side)?

Ideally you would use a UDA (user-defined aggregate function); however, support for these is not yet complete. In the meantime, you can execute these 6 queries in parallel, which might even be preferable, as sketched below.
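As a rough sketch of that parallel approach, using the Java driver 3.x API (the same pattern applies to the C# driver's ExecuteAsync); the session setup, the current-time calculation, and the 'some_name' value are assumptions for illustration:

import com.datastax.driver.core.*;
import java.util.ArrayList;
import java.util.List;

// Assumes an existing Session connected to mykeyspace.
PreparedStatement avgStmt = session.prepare(
        "SELECT AVG(value) AS avg_value FROM mytable " +
        "WHERE name = ? AND timestamp_utc_nanoseconds >= ? AND timestamp_utc_nanoseconds < ?");

long nowNs = System.currentTimeMillis() * 1_000_000L;  // "now" as UTC nanoseconds
long minuteNs = 60L * 1_000_000_000L;

// Fire all six one-minute AVG queries concurrently.
List<ResultSetFuture> futures = new ArrayList<>();
for (int minutesAgo = 6; minutesAgo >= 1; minutesAgo--) {
    long from = nowNs - minutesAgo * minuteNs;
    futures.add(session.executeAsync(avgStmt.bind("some_name", from, from + minuteNs)));
}

// Collect the six averages; index 0 is "6 minutes ago", index 5 is the last minute.
long[] oneMinuteAverages = new long[futures.size()];
for (int i = 0; i < futures.size(); i++) {
    oneMinuteAverages[i] = futures.get(i).getUninterruptibly().one().getLong("avg_value");
}

Each query reads a single partition over a narrow clustering range, so running them concurrently costs roughly the latency of one query rather than six.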

Related

How to make a sequence of select, update and insert atomic in one single Cassandra statement?

I'm dealing with 1 million tweets (arriving at a rate of about 5K per second) and I would like to do something similar to this code in Cassandra. Let's say that I'm using a Lambda Architecture.
I know the following code does not work; I just want to illustrate my logic with it.
DROP TABLE IF EXISTS hashtag_trend_by_week;
CREATE TABLE hashtag_trend_by_week(
shard_week timestamp,
hashtag text ,
counter counter,
PRIMARY KEY ( ( shard_week ), hashtag )
) ;
DROP TABLE IF EXISTS topten_hashtag_by_week;
CREATE TABLE topten_hashtag_by_week(
shard_week timestamp,
counter bigInt,
hashtag text ,
PRIMARY KEY ( ( shard_week ), counter, hashtag )
) WITH CLUSTERING ORDER BY ( counter DESC );
BEGIN BATCH
UPDATE hashtag_trend_by_week SET counter = counter + 22 WHERE shard_week='2021-06-15 12:00:00' and hashtag ='Gino';
INSERT INTO topten_hashtag_by_week (shard_week, hashtag, counter) VALUES ('2021-06-15 12:00:00', 'Gino',
SELECT counter FROM hashtag_trend_by_week WHERE shard_week='2021-06-15 12:00:00' AND hashtag='Gino'
) USING TTL 7200;
APPLY BATCH;
Then the final query to satisfy my UI should be something like
SELECT hashtag, counter FROM topten_hashtag_by_week WHERE shard_week='2021-06-15 12:00:00' limit 10;
Any suggestions?
You can only have CQL counter columns in a counter table so you need to rethink the schema for the hashtag_trend_by_week table.
Batch statements are used for making writes atomic in Cassandra so including a SELECT statement does not make sense.
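If you need the read-then-write effect of the pseudo-batch, a rough sketch is to run the three statements from the application instead (Java driver shown for illustration; it assumes a session already connected to the keyspace that holds these tables):

import com.datastax.driver.core.*;

// 1. Bump the counter in the counter table.
session.execute("UPDATE hashtag_trend_by_week SET counter = counter + 22 " +
                "WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino'");

// 2. Read the current counter value back into the application.
Row row = session.execute("SELECT counter FROM hashtag_trend_by_week " +
                          "WHERE shard_week = '2021-06-15 12:00:00' AND hashtag = 'Gino'").one();
long count = row.getLong("counter");

// 3. Write the snapshot into the top-ten table with a TTL.
PreparedStatement insertTopTen = session.prepare(
        "INSERT INTO topten_hashtag_by_week (shard_week, hashtag, counter) " +
        "VALUES ('2021-06-15 12:00:00', 'Gino', ?) USING TTL 7200");
session.execute(insertTopTen.bind(count));

Note that this is not atomic: another writer can bump the counter between steps 2 and 3, so the top-ten table is only an approximation that gets refreshed on every write.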
The final query for topten_hashtag_by_week looks fine to me. Cheers!

Data model in Cassandra and proper deletion Strategy

I have following table in cassandra:
CREATE TABLE article (
id text,
price int,
validFrom timestamp,
PRIMARY KEY (id, validFrom)
) WITH CLUSTERING ORDER BY (validFrom DESC);
With articles and historical price information (validFrom is the timestamp from which a new price is valid). Article prices change often. I want to query for:
Historic price for a particular article.
Last price for an article.
From my understanding, I can solve both problems with following query:
select id, price from article where id = X and validFrom < Y limit 1;
This query restricts on the article id, i.e. it uses the partition key. Since the clustering order is based on the validFrom timestamp in reversed order, Cassandra can perform this query efficiently.
Am I getting this right?
What is the best approach to delete old data (housekeeping)? Let's assume I want to delete all articles with validFrom > 20150101 and validFrom < 20151231. Since I'm not restricting on the primary key, this would be inefficient, even if I use an index on validFrom, right? How can I achieve this?
You can use external tools for that:
Spark with the Spark Cassandra Connector (even in local mode). The code could look like the following (note that I'm using validfrom as the column name, not validFrom, as it's not quoted in your schema):
import com.datastax.spark.connector._
val data = sc.cassandraTable("test", "article")
.where("validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z'")
.select("id", "validfrom")
data.deleteFromCassandra("test", "article", keyColumns=SomeColumns("id", "validfrom"))
Use DSBulk to find the matching entries and unload them into a file (output.csv in my case), and then perform their deletion:
bin/dsbulk unload -url output.csv \
-query "SELECT id, validfrom FROM test.article WHERE token(id) > :start AND token(id) <= :end AND validFrom >= '2020-07-28T11:50:00Z' AND validFrom < '2020-07-28T12:50:00Z' ALLOW FILTERING"
bin/dsbulk load -query "DELETE from test.article WHERE id = :id and validfrom = :validfrom" \
-url output.csv
To add to Alex Ott's answer, this comment of yours is incorrect:
This query uses article id as restriction, query uses the partition key. Since the clustering order is based on price, cassandra can efficient perform this query.
The rows are not ordered by price. They are ordered by validFrom in reverse-chronological order. Cheers!

Cassandra Modelling for Date Range

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling the Travellers Flight Checkin Data.
My main query criterion is a search for travellers within a date range (a 7-day window at most).
Here is what I've come up with with my limited exposure to Cassandra.
create table IF NOT EXISTS travellers_checkin (
    checkinDay text,
    checkinTimestamp bigint,
    travellerName text,
    travellerPassportNo text,
    flightNumber text,
    "from" text,
    "to" text,
    bookingClass text,
    PRIMARY KEY (checkinDay, checkinTimestamp)
) WITH CLUSTERING ORDER BY (checkinTimestamp DESC);
Per day I'm expecting up to a million records, so each partition will hold about a million rows.
Now my users want a search in which the date window is mandatory (a one-week window at most). In this case should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think of re-modelling the data? Alternatively, I'm also wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but if you could add another field to the partition key it would scale better. And you should use separate queries with executeAsync.
If you use an IN clause, you are waiting on a single coordinator node to give you the response; it keeps all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing.
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, issue a separate query for each day and execute it with executeAsync.
Java Example :
PreparedStatement statement = session.prepare(
        "SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
// one async query per day of the search window; days, fromTs and toTs are placeholders
for (String day : days) {
    futures.add(session.executeAsync(statement.bind(day, fromTs, toTs)));
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // you get the result set of each query, merge them here
}

Having trouble querying by dates using the Java Cassandra Spark SQL Connector

I'm trying to use Spark SQL to query a table by a date range. For example, I'm trying to run an SQL statement like: SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate <= '2015-12-31' AND deployment_id = 1 AND device_id = 1. When I run the query, no error is thrown, but I don't receive any results back when I would expect some. When running the query without the date range I do get results back.
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("SparkTest")
.set("spark.executor.memory", "1g")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.connection.native.port", "9042")
.set("spark.cassandra.connection.rpc.port", "9160");
JavaSparkContext context = new JavaSparkContext(sparkConf);
JavaCassandraSQLContext sqlContext = new JavaCassandraSQLContext(context);
sqlContext.sqlContext().setKeyspace("mykeyspace");
String sql = "SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate < '2015-12-31' AND deployment_id = 1 AND device_id = 1";
JavaSchemaRDD rdd = sqlContext.sql(sql);
List<Row> rows = rdd.collect(); // rows.size() is zero when I would expect it to contain numerous rows.
Schema:
CREATE TABLE trip (
device_id bigint,
deployment_id bigint,
utc_startdate timestamp,
other columns....
PRIMARY KEY ((device_id, deployment_id), utc_startdate)
) WITH CLUSTERING ORDER BY (utc_startdate ASC);
Any help would be greatly appreciated.
What does your table schema (in particular, your PRIMARY KEY definition) look like? Even without seeing it, I am fairly certain that you are seeing this behavior because you are not qualifying your query with a partition key. Using the ALLOW FILTERING directive will filter the rows by date (assuming that is your clustering key), but that is not a good solution for a big cluster or large dataset.
Let's say that you are querying users in a certain geographic region. If you used region as a partition key, you could run this query, and it would work:
SELECT * FROM users
WHERE region='California'
AND date >= '2015-01-01' AND date <= '2015-12-31';
Give Patrick McFadin's article on Getting Started with Timeseries Data a read. That has some good examples that should help you.

postgresql insert rules for parallel transactions

We have a PostgreSQL connection pool used by a multithreaded application that permanently inserts records into a big table. So, let's say we have 10 database connections executing the same function, which inserts a record.
The trouble is that we end up with 10 records inserted, whereas it should be only 2-3 records, if only the transactions could see each other's records (our function decides not to insert a record based on the date of the last record found).
We cannot afford to lock the table for the duration of the function.
We have tried different techniques to make the database apply our rules to new records immediately, despite the fact that they are created in parallel transactions, but we haven't succeeded yet.
So, I would be very grateful for any help or ideas!
To be more specific, here is the code:
CREATE TABLE schm.events (evtime TIMESTAMP, ref_id INTEGER, param INTEGER, type INTEGER);
record filter rule:
BEGIN
  select count(*) into nCnt
  from schm.events e
  where e.ref_id = ref_id and e.param = param and e.type = type
    and e.evtime between (evtime - interval '10 seconds') and (evtime + interval '10 seconds');
  if nCnt = 0 then
    insert into schm.events values (evtime, ref_id, param, type);
  end if;
END;
UPDATE (comment length is not enough unfortunately)
I've applied the unique index solution to production. The results are pretty acceptable, but the initial target has not been achieved.
The issue is that with the unique hash I cannot control the interval between two records with consecutive hash codes.
Here is the code:
CREATE TABLE schm.events_hash (
hash_code bigint NOT NULL
);
CREATE UNIQUE INDEX ui_events_hash_hash_code ON schm.events_hash
USING btree (hash_code);
--generate the hash code data by partitioning (splitting) evtime into 10-second intervals:
INSERT into schm.events_hash
select distinct ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) )
from schm.events;
--and then in a concurrently executed function I insert sequentially:
begin
INSERT into schm.events_hash values ( cast( trunc( extract(epoch from evtime) / 10 ) || cast( ref_id as TEXT) || cast( type as TEXT ) || cast( param as TEXT ) as bigint) );
insert into schm.events values (evtime, ref_id, param, type);
end;
In that case, if evtime lies within the hash-determined interval, only one record is inserted.
The remaining problem is records that fall into different determined intervals but are close to each other in time (less than a 60-second gap): those are not filtered out.
insert into schm.events values ( '2013-07-22 19:32:37', '123', '10', '20' ); --inserted, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:37' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:39', '123', '10', '20' ); --filtered out, test ok, (trunc( extract(epoch from cast('2013-07-22 19:32:39' as timestamp)) / 10 ) = 137450715 )
insert into schm.events values ( '2013-07-22 19:32:41', '123', '10', '20' ); --inserted, test fail, (trunc( extract(epoch from cast('2013-07-22 19:32:41' as timestamp)) / 10 ) = 137450716 )
I think there must be a way to modify the hash function to achieve the initial target, but I haven't found it yet. Maybe there are some table constraint expressions that are evaluated by PostgreSQL itself, outside of the transaction?
About your only options are:
Using a unique index with a hack to collapse 20-second ranges to a single value;
Using advisory locking to control communication; or
SERIALIZABLE isolation and intentionally creating a mutual dependency between sessions. Not 100% sure this will be practical in your case.
What you really want is a dirty read, but PostgreSQL does not support dirty reads, so you're kind of stuck there.
You might land up needing a co-ordinator outside the database to manage your requirements.
Unique index
You can truncate your timestamps for the purpose of uniqueness checking, rounding them to regular boundaries so they jump in 20-second chunks. Then add them to a unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type).
Only one insert will succeed and the rest will fail with an error. You can trap the error in a BEGIN ... EXCEPTION block in PL/PgSQL, or preferably just handle it in the application.
I think a reasonable definition of chunk_time_seconds might be:
CREATE OR REPLACE FUNCTION chunk_time_seconds(t timestamptz, round_seconds integer)
RETURNS bigint
AS $$
SELECT (floor(extract(epoch from t) / round_seconds) * round_seconds)::bigint;
$$ LANGUAGE sql IMMUTABLE;
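Handling it in the application could look roughly like this over JDBC (a sketch; it assumes autocommit, the schm.events table from the question, and that the unique index on (chunk_time_seconds(evtime, 20), ref_id, param, type) already exists):

import java.sql.*;

// Returns true if the row was inserted, false if another transaction already
// inserted a row for the same 20-second chunk and key.
static boolean insertEvent(Connection conn, Timestamp evtime,
                           int refId, int param, int type) throws SQLException {
    String sql = "INSERT INTO schm.events (evtime, ref_id, param, type) VALUES (?, ?, ?, ?)";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setTimestamp(1, evtime);
        ps.setInt(2, refId);
        ps.setInt(3, param);
        ps.setInt(4, type);
        ps.executeUpdate();
        return true;
    } catch (SQLException e) {
        if ("23505".equals(e.getSQLState())) {  // unique_violation: a concurrent insert won the race
            return false;
        }
        throw e;
    }
}

If this runs inside a larger transaction, the failed INSERT will abort it, so either keep it in its own transaction or guard it with a savepoint.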
A starting point for advisory locking:
Advisory locks can be taken on a single bigint or a pair of 32-bit integers. Your key is bigger than that, it's three integers, so you can't directly use the simplest approach of:
IF pg_try_advisory_lock(ref_id, param) THEN
... do insert ...
END IF;
then after 10 seconds, on the same connection (but not necessarily in the same transaction) issue pg_advisory_unlock(ref_id, param).
It won't work because you must also filter on type and there's no three-integer-argument form of pg_advisory_lock. If you can turn param and type into smallints you could:
IF pg_try_advisory_lock(ref_id, (param << 16) + type) THEN
but otherwise you're in a bit of a pickle. You could hash the values, of course, but then you run the (small) risk of incorrectly skipping an insert that should not be skipped in the case of a hash collision. There's no way to trigger a recheck because the conflicting rows aren't visible, so you can't use the usual solution of just comparing rows.
So ... if you can fit the key into 64 bits and your application can deal with the need to hold the lock for 10-20s before releasing it in the same connection, advisory locks will work for you and will be very low overhead.
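For completeness, a sketch of the advisory-lock variant over JDBC, assuming param and type each fit into 16 bits so the second lock key can be (param << 16) + type as above; releasing the lock roughly 10 seconds later, on the same connection, is left to the caller:

import java.sql.*;

// Returns true if the lock was acquired and the row inserted; false if another
// session currently holds the lock for this key, i.e. the insert is skipped.
static boolean tryInsertWithAdvisoryLock(Connection conn, Timestamp evtime,
                                         int refId, int param, int type) throws SQLException {
    int key2 = (param << 16) + type;
    try (PreparedStatement lock = conn.prepareStatement("SELECT pg_try_advisory_lock(?, ?)")) {
        lock.setInt(1, refId);
        lock.setInt(2, key2);
        try (ResultSet rs = lock.executeQuery()) {
            rs.next();
            if (!rs.getBoolean(1)) {
                return false;  // another session holds the lock: skip this insert
            }
        }
    }
    try (PreparedStatement ins = conn.prepareStatement(
            "INSERT INTO schm.events (evtime, ref_id, param, type) VALUES (?, ?, ?, ?)")) {
        ins.setTimestamp(1, evtime);
        ins.setInt(2, refId);
        ins.setInt(3, param);
        ins.setInt(4, type);
        ins.executeUpdate();
    }
    // Roughly 10 seconds later, on this same connection, run:
    //   SELECT pg_advisory_unlock(refId, key2);
    return true;
}

Because advisory locks are tied to the session, the unlock must happen on the same connection that took the lock, as noted above.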
