Using Presto HLL to count rolling WAU and MAU

I am using Presto SQL with HyperLogLog to calculate DAU, WAU, and MAU, but I am getting the exact same number for all of them. Can anyone suggest what's wrong with my query?
WITH dau_hll AS (
    SELECT
        dt,
        platform,
        service,
        account,
        country,
        CAST(APPROX_SET(userid) AS VARBINARY) AS job_hll_sketch
    FROM xx
    GROUP BY 1, 2, 3, 4, 5
)
SELECT
    dt,
    platform,
    service,
    country,
    CARDINALITY(CAST(job_hll_sketch AS HYPERLOGLOG)) AS dau,
    CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (PARTITION BY dt, platform, service, account, country ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS wau,
    CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (PARTITION BY dt, platform, service, account, country ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)) AS mau
FROM dau_hll
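A likely cause (an editorial note, not from the original thread): dt appears in the PARTITION BY of both window functions, so each window partition contains only a single day, the 6-/29-row frames never see more than the current row, and merging a single sketch just returns the daily sketch, hence dau = wau = mau. A minimal sketch of a fix, keeping the dau_hll CTE above and partitioning by the dimensions only:

SELECT
    dt,
    platform,
    service,
    country,
    CARDINALITY(CAST(job_hll_sketch AS HYPERLOGLOG)) AS dau,
    -- Rolling 7-day merge: dt removed from PARTITION BY so the frame can span days
    CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (PARTITION BY platform, service, account, country ORDER BY dt ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS wau,
    -- Rolling 30-day merge
    CARDINALITY(MERGE(CAST(job_hll_sketch AS HYPERLOGLOG)) OVER (PARTITION BY platform, service, account, country ORDER BY dt ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)) AS mau
FROM dau_hll

Note that ROWS BETWEEN counts rows, not calendar days, so this sketch assumes one row per dt per (platform, service, account, country) combination with no gaps in dt.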


Cassandra data model for intersection of ranges

Assume data with pk (text), start (int), end (int), extra_data (text).
The query is: given a pk (e.g. 'pk1') and a range (e.g. [1000, 2000]), find all rows for 'pk1' which intersect that range. This logically translates to the SQL condition WHERE pk = 'pk1' AND end >= 1000 AND start <= 2000 (the intersection condition).
Notice this is NOT the same as the more conventional query of:
all rows for pk1 where start>1000 and start<2000
If I define a table with end as part of the clustering key:
CREATE TABLE test1 (
    pk text,
    start bigint,
    end bigint,
    extra_data text,
    PRIMARY KEY ((pk), start, end)
)...
Then Cassandra does not allow the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Clustering column "end" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
Why does Cassandra not allow further filtering to limit the ranged rows (forcing this filtering to be done application-side on the results)?
A second try would be to remove 'end' from clustering columns:
CREATE TABLE test1 (
    pk text,
    start bigint,
    end bigint,
    extra_data text,
    PRIMARY KEY ((pk), start)
)...
Then Cassandra warns about the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Here I would like to understand if I can safely add ALLOW FILTERING and be assured Cassandra will scan only the 'pk1' partition.
Using cqlsh 5.0.1 | Cassandra 3.11.3
Actually, I think you made the fatal mistake of designing your table first and then trying to adapt the application query to fit the table design.
In Cassandra data modelling, the primary principle is to always start by listing all your application queries THEN design a table for each of those application queries -- not the other way around.
Let's say I have an IoT use case where I have sensors collecting temperature readings once a day. If my application needs to retrieve the readings from the last 7 days from a sensor, the app query is:
Get the temperature for the last 7 days for sensor X
Assuming today is October 25, a more SQL-like representation of this app query is:
SELECT temperature FROM table
WHERE sensor = X
AND reading_date >= 2022-10-18
AND reading_date < 2022-10-25
This means that we need to design the table such that:
it is partitioned by sensor, and
the data is clustered by date.
The table schema would look like:
CREATE TABLE readings_by_sensor (
    sensor text,
    reading_date date,
    temperature float,
    PRIMARY KEY (sensor, reading_date)
)
We can then perform a range query on the date:
SELECT temperature FROM readings_by_sensor
WHERE sensor = ?
  AND reading_date >= '2022-10-18'
  AND reading_date < '2022-10-25'
You don't need two separate columns to represent the start and end of the range, because the range is expressed directly in the query. Cheers!
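Mapping this back to the original start/end problem, here is an illustrative sketch (an editorial addition, using the second schema with PRIMARY KEY ((pk), start)): push the start bound down to Cassandra and filter on end with ALLOW FILTERING. My understanding is that because the partition key is restricted to a single value, the filtering stays confined to the 'pk1' partition:

select * from test1
where pk = 'pk1'
  and start <= 2000   -- served by the clustering order on start
  and end >= 1000     -- filtered within the single 'pk1' partition
allow filtering;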

How to delete duplicate rows in Azure Synapse

How can I delete duplicate rows from Azure Synapse Analytics? I'd like to delete one of the rows where audit_date = '2022-08-10' and city = 'LA', keeping only 1 row. I've tried the CTE method (ROW_NUMBER()...), but unfortunately SQL pool doesn't support DELETE statements with a CTE.
audit_date   city   number_of_toys   number_of_balloons   number_of_drinks
2022-08-10   LA     35               100                  40
2022-08-10   NY     20               70                   30
2022-08-10   LA     35               102                  40
You can do this using DELETE and ROW_NUMBER(). I have created a similar table with the sample data that you have given.
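For reference, a minimal setup matching the sample data (the table name demo and the column types are assumptions, since the original DDL isn't shown; Synapse dedicated SQL pool doesn't support multi-row VALUES, hence the separate inserts):

CREATE TABLE demo (
    audit_date date,
    city varchar(10),
    number_of_toys int,
    number_of_balloons int,
    number_of_drinks int
);

INSERT INTO demo VALUES ('2022-08-10', 'LA', 35, 100, 40);
INSERT INTO demo VALUES ('2022-08-10', 'NY', 20, 70, 30);
INSERT INTO demo VALUES ('2022-08-10', 'LA', 35, 102, 40);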
Now use the ROW_NUMBER() function to partition by audit_date and city based on your condition.
SELECT *, ROW_NUMBER() OVER (PARTITION BY audit_date, city ORDER BY audit_date, city) AS row_num
FROM demo
WHERE audit_date = '2022-08-10' AND city = 'LA';
You can then use the following query to perform the delete operation only on the rows where row_num > 1.
DELETE my_table
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY audit_date, city ORDER BY audit_date, city) AS row_num
    FROM demo
    WHERE audit_date = '2022-08-10' AND city = 'LA'
) my_table
WHERE row_num > 1;
This way you can delete duplicate records while retaining one row, using DELETE and ROW_NUMBER() as demonstrated above.
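As a quick verification step (an editorial addition; the demo table name follows the setup sketch above), a grouped count should now show a single row per (audit_date, city) pair:

SELECT audit_date, city, COUNT(*) AS row_count
FROM demo
GROUP BY audit_date, city;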

Cassandra cql: select N “most recent” rows in ascending order

I understand that the best way to fetch the most recent rows in Cassandra is to create my table as follows:
CREATE TABLE IF NOT EXISTS data1 (
    asset_id int,
    date timestamp,
    value decimal,
    PRIMARY KEY ((asset_id), date)
) WITH CLUSTERING ORDER BY (date DESC);
Then select the 1000 most recent data items via:
select * from data1 where asset_id = 8 limit 1000;
The client requires the data in ascending order.
The server side is Python.
Is there a way to reverse the results in CQL and not in code (i.e. Python)?
Have you tried using the ORDER BY clause?
select * from data1 where asset_id = 8 ORDER BY date asc limit 1000;
More information available here:
https://docs.datastax.com/en/cql/3.1/cql/cql_using/useColumnsSort.html
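One caveat worth noting (an editorial addition, not part of the original answer): ORDER BY also changes which rows LIMIT keeps. ORDER BY date ASC LIMIT 1000 returns the 1000 oldest rows, not the 1000 most recent in ascending order. If the goal is the newest 1000 rows sorted ascending, a common pattern is to fetch them in the table's native descending clustering order and reverse client-side:

-- Returns the 1000 most recent rows, newest first (native DESC clustering order);
-- reverse the result set in the application (e.g. in Python) for ascending order.
select * from data1 where asset_id = 8 limit 1000;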

Why Cassandra query not fetching data

I have been modelling a column family. As of now, the primary key is PRIMARY KEY((side, rundate), fund), so I am performing the following query:
select count(*) from cf_Summary
where side = 'Long' and rundate in ('2015-01-12 05:30:00','2015-01-13 05:30:00');
The above query returns 1200. If I run the query below, it returns 0 records:
select count(*) from cf_Summary
where token(side,rundate)>= token('Long','2015-01-12 05:30:00') and token(side,rundate) <= token('Long','2015-01-13 05:30:00');
The token operator can be applied to the partition key columns, as mentioned on the DataStax website:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
Why am I getting the 0 count in the second query? Side has two values, Long and Short, so I would like to perform the query sometimes for Long, sometimes for Short, and sometimes for both values in a date range. In the first query I can restrict the rundate but can query both Long and Short. I know the two row keys are stored in different rows.
Is it possible to perform:
select count(*) from cf_valuationsummary_1
where token(side,rundate)>= token('Long','2015-01-02 05:30:00') and token(side,rundate) <= token('Short','2015-01-03 05:30:00');
The coordinator node can query two nodes to get the data, but I want to perform a single query from the client application despite the performance issues.
Any leads?
It looks like you are using different date strings in the two queries.
In the first query with the IN clause, you used dates 01-12 and 01-13.
But in the token queries you are using dates 01-02 and 01-03.
So perhaps you don't have any rows for 01-02 and 01-03.
Other than that, the token queries look valid.
Update:
I did some more experiments. I think this type of query may only work with the ByteOrderedPartitioner. When using the Murmur3Partitioner, the token values for a given partition key aren't ordered by the date string. For example, with this data:
cqlsh:test> select * from cf_Summary;
side | rundate | fund
-------+---------------------+------
Short | 2015-01-13 05:30:00 | HIJ
Long | 2015-01-12 05:30:00 | ABC
Long | 2015-01-13 05:30:00 | DEF
Short | 2015-01-12 05:30:00 | HIJ
We get token values like this:
cqlsh:test> select token(side,rundate) from cf_Summary;
token(side, rundate)
----------------------
-2522183250639624078
-2064350486281951596
1812183325578390943
7832903641319907586
So you can see that the Short row with a higher date has a negative token value, while the earlier Short date has a positive token value. So if I do a range query on those token values it finds zero rows. For the Long rows it happened that the token value for the earlier date was negative and the later date was positive, so that finds the two Long rows.
cqlsh:test> select count(*) from cf_Summary where token(side,rundate)>= token('Short','2015-01-12 05:30:00') and token(side,rundate) <= token('Short','2015-01-13 05:30:00');
count
0
cqlsh:test> select count(*) from cf_Summary where token(side,rundate)>= token('Long','2015-01-12 05:30:00') and token(side,rundate) <= token('Long','2015-01-13 05:30:00');
count
2
So in general I don't think what you are trying to do will work since the token values don't directly correlate to your increasing dates.
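If one query per side is acceptable (an editorial sketch using the asker's schema, not from the original answer), enumerating the exact partition keys avoids relying on token ordering entirely:

-- One query per 'side', enumerating the dates in the desired range,
-- so every partition key is hit exactly rather than via a token range.
select count(*) from cf_Summary
where side = 'Long' and rundate in ('2015-01-12 05:30:00','2015-01-13 05:30:00');

select count(*) from cf_Summary
where side = 'Short' and rundate in ('2015-01-12 05:30:00','2015-01-13 05:30:00');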

Presto Cassandra Connector Clustering Index

CQL execution [returns instantly, presumably using the clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto execution [takes around 8 seconds]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why should Presto process 147K rows when Cassandra itself responds with just 5447 rows for the same query (I tried select * too)?
Why is Presto not able to use the clustering key optimization?
I tried all possible values, like timestamp, date, and different formats of dates, and was not able to see any effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
    month text,
    day timestamp,
    test_data text,
    some_random_column text,
    event_time timestamp,
    PRIMARY KEY (month, day, event_time)
) WITH comment = 'Test Data'
  AND read_repair_chance = 1.0;
Update: I added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is: why does the Cassandra connector not take advantage of this? To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.
