Gaps and Islands analysis in Cognos

I am using Cognos Analytics 11.1.7 IF9, but I am asking about Cognos Analytics/Business Intelligence report development in general.
This is a continuation of my SQL question regarding gaps and islands.
For performance reasons, I have created SQL code using CTEs and have added this to SQL objects that feed queries. Each of these statements uses a different data source. I want to avoid asking the database administrators to create something on a database server for me. I want to do the remaining query work in Cognos.
The results of these SQL statements are UNIONed together into a query named "combined".
In Cognos:
Is there a way to use minimum or maximum with additional parameters?
Can I get LEAD or LAG functionality other than by joining a query to itself on row-number = row-number - 1?
Below is SQL code that works. Can this be done in Cognos? (A sketch of possible Cognos expressions follows the SQL.)
with
ranges as (
    -- for each row: the smallest ARMBegin from the current row forward, and the
    -- running maximum ARMEnd up to and including the current row, per SRID
    select SRID
    , MIN(ARMBegin) OVER (PARTITION BY SRID ORDER BY ARMBegin, ARMEnd ROWS BETWEEN 0 FOLLOWING AND UNBOUNDED FOLLOWING) as WindowStart
    , MAX(ARMEnd) OVER (PARTITION BY SRID ORDER BY ARMBegin, ARMEnd ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) as WindowEnd
    from #combined f
),
groups as (
    -- pair each row's window with the previous row's window
    SELECT SRID
    , WindowStart
    , WindowEnd
    , LAG(WindowStart,1) OVER (ORDER BY SRID, WindowStart, WindowEnd) AS PreviousStart
    , LAG(WindowEnd,1) OVER (ORDER BY SRID, WindowStart, WindowEnd) AS PreviousEnd
    FROM ranges
),
islands as (
    -- a new island starts whenever the previous window does not reach the current one;
    -- the running total of those flags numbers the islands
    SELECT SRID
    , WindowStart
    , WindowEnd
    , CASE WHEN PreviousEnd >= WindowStart THEN 0 ELSE 1 END AS IslandStartInd
    , SUM(CASE WHEN PreviousEnd >= WindowStart THEN 0 ELSE 1 END) OVER (ORDER BY SRID, WindowStart, WindowEnd) AS IslandId
    FROM groups
)
-- collapse each island to its overall start and end
select SRID
, min(WindowStart) as IslandStart
, max(WindowEnd) as IslandEnd
from islands
group by SRID
, IslandId
order by 1, 2, 3
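For the two Cognos questions above, a rough sketch of report expressions (an assumption about how far the Cognos 11.1.x relational expression language stretches, not a tested report): summary functions accept a "for" clause, and the running-* variants accumulate in the query's sort order, which is the closest analogue to the windowed MIN/MAX used in the SQL. I am not aware of a direct LAG/LEAD in relational report expressions, so the rank self-join mentioned in the question remains the usual workaround. The data item names below are taken from the SQL and are assumptions.

// "minimum or maximum with additional parameters": a summary scoped by a for clause
maximum ([ARMEnd] for [SRID])
// running (cumulative) variants, evaluated in the query's sort order,
// assuming they accept the ARMBegin/ARMEnd data type
running-minimum ([ARMBegin] for [SRID])
running-maximum ([ARMEnd] for [SRID])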

Related

Cassandra data model for intersection of ranges

Assume data with columns pk (text), start (int), end (int), extra_data (text).
The query is: given a pk (e.g. 'pk1') and a range (e.g. [1000, 2000]), find all rows for 'pk1' which intersect that range. In SQL this logically translates to WHERE pk=pk1 AND end>=1000 AND start<=2000 (the intersection condition).
Notice this is NOT the same as the more conventional query of:
all rows for pk1 where start>1000 and start<2000
If I define a table with end as part of the clustering key:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start, end)
)...
Then Cassandra does not allow the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Clustering column "end" cannot be restricted (preceding column "start" is restricted by a non-EQ relation)"
Why does Cassandra not allow further filtering to limit ranged rows (forces to do this filter with results application-side).
A second try would be to remove 'end' from clustering columns:
CREATE TABLE test1 (
pk text,
start bigint,
end bigint,
extra_data text,
PRIMARY KEY ((pk), start)
)...
Then Cassandra warns about the query:
select * from test1 where pk='pk1' and start < 2000 and end > 1000;
with "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
Here I would like to understand whether I can safely add ALLOW FILTERING and be assured Cassandra will perform the scan only within the 'pk1' partition.
Using cqlsh 5.0.1 | Cassandra 3.11.3
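For concreteness, this is the intersection query spelled out with ALLOW FILTERING against the second table definition (a sketch, not an endorsement; whether the filtering stays inside the 'pk1' partition is exactly what is being asked). My understanding is that with the partition key pinned by an equality, Cassandra reads only that partition and ALLOW FILTERING filters its rows on the regular column end, but treat that as an assumption to verify.

-- second table: PRIMARY KEY ((pk), start); end is a regular column here
select * from test1
where pk = 'pk1'
  and start < 2000
  and end > 1000
allow filtering;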
Actually, I think you made the fatal mistake of designing your table first and then trying to adapt the application query to fit the table design.
In Cassandra data modelling, the primary principle is to always start by listing all your application queries THEN design a table for each of those application queries -- not the other way around.
Let's say I have an IoT use case where I have sensors collecting temperature readings once a day. If my application needs to retrieve the readings from the last 7 days from a sensor, the app query is:
Get the temperature for the last 7 days for sensor X
Assuming today is October 25, a more SQL-like representation of this app query is:
SELECT temperature FROM table
WHERE sensor = X
AND reading_date >= 2022-10-18
AND reading_date < 2022-10-25
This means that we need to design the table such that:
it is partitioned by sensor, and
the data is clustered by date.
The table schema would look like:
CREATE TABLE readings_by_sensor (
sensor text,
reading_date date,
temp float,
PRIMARY KEY (sensor, reading_date)
)
We can then perform a range query on the date:
SELECT temp FROM readings_by_sensor
WHERE sensor = ?
  AND reading_date >= '2022-10-18'
  AND reading_date < '2022-10-25'
You don't need two separate columns to represent the start and end of the range, because the whole range restriction is expressed against the single clustering column (reading_date). Cheers!

Correct way to get the last value for a field in Apache Spark or Databricks Using SQL (Correct behavior of last and last_value)?

What is the correct behavior of the last and last_value functions in Apache Spark/Databricks SQL? The way I'm reading the documentation (here: https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html), it sounds like they should return the last value of whatever is in the expression.
So if I have a select statement that does something like
select
person,
last(team)
from
(select * from person_team order by date_joined)
group by person
I should get the last team a person joined, yes/no?
The actual query I'm running is shown below. It is returning a different number each time I execute the query.
select count(distinct patient_id) from (
select
patient_id,
org_patient_id,
last_value(data_lot) data_lot
from
(select * from my_table order by data_lot)
where 1=1
and org = 'my_org'
group by 1,2
order by 1,2
)
where data_lot in ('2021-01','2021-02')
;
What is the correct way to get the last value for a given field (for either the team example or my specific example)?
--- EDIT -------------------
I'm thinking collect_set might be useful here, but I get the error shown when I try to run this:
select
patient_id,
last_value(collect_set(data_lot)) data_lot
from
covid.demo
group by patient_id
;
Error in SQL statement: AnalysisException: It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
Aggregate [patient_id#89338], [patient_id#89338, last_value(collect_set(data_lot#89342, 0, 0), false) AS data_lot#91848]
+- SubqueryAlias spark_catalog.covid.demo
The posts linked below discuss how to get max values, which is not the same as the last value in a list ordered by a different field: I want the last team a player joined. If the player joined the Reds, the A's, the Zebras, and the Yankees, in that order timewise, I'm looking for the Yankees. Those posts also reach the solution procedurally using Python/R; I'd like to do this in SQL.
Getting last value of group in Spark
Find maximum row per group in Spark DataFrame
--- SECOND EDIT -------------------
I ended up using something like this based upon the accepted answer.
select
row_number() over (order by provided_date, data_lot) as row_num,
demo.*
from demo
You can assign row numbers based on an ordering on data_lot if you want to get its last value:
select count(distinct patient_id) from (
select * from (
select *,
row_number() over (partition by patient_id, org_patient_id, org order by data_lot desc) as rn
from my_table
where org = 'my_org'
)
where rn = 1
)
where data_lot in ('2021-01','2021-02');
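As a complement to the row_number approach, a sketch for the team example (assuming the person_team table from the question): last as a plain grouping aggregate carries no ordering guarantee, because the subquery's ORDER BY is not preserved through the shuffle, which is why the counts differ between runs; used as a window function over an explicitly ordered partition with a full frame, it becomes deterministic.

select distinct
  person,
  -- full frame, so "last" means last by date_joined within each person
  last(team) over (
    partition by person
    order by date_joined
    rows between unbounded preceding and unbounded following
  ) as last_team
from person_team;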

COUNT(*) vs. COUNT(1) vs COUNT(column) with WHERE condition performance in Cassandra

I have a query in Cassandra
select count(pk1) from tableA where pk1=xyz
The table is:
create table tableA
(
pk1 uuid,
pk2 int,
pk3 text,
pk4 int,
fc1 int,
fc2 bigint,
....
fcn blob,
primary key (pk1, pk2, pk3, pk4)
);
The query is executed often and takes up to 2 seconds.
I am wondering if there will be any performance gain from refactoring it to:
select count(1) from tableA where pk1 = xyz
Based on the documentation here, there is no difference between count(1) and count(*).
Generally speaking COUNT(1) and COUNT(*) will both return the number of rows that match the condition specified in your query
This is in line with how traditional SQL databases are implemented.
COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )
COUNT(1) counts a constant expression that is never null, so it counts every row.
Also, COUNT(column_name) counts only the non-null values of that column.
Since in your case the WHERE condition is on the partition key, which can never be null, the non-null distinction is a non-factor, and I don't think there will be any performance difference between one and the other. This answer tried to confirm the same using some performance tests.
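For reference, the three forms side by side against the table above (a sketch, keeping the question's xyz placeholder for the uuid value; all three should return the same number for this partition-restricted query):

select count(*)   from tableA where pk1 = xyz;  -- counts matching rows
select count(1)   from tableA where pk1 = xyz;  -- same as count(*)
select count(pk1) from tableA where pk1 = xyz;  -- counts non-null pk1 values; a partition key column is never null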
In general, COUNT is not recommended in Cassandra at all, since it may scan across multiple nodes to get the answer back, and I'm not sure the count you get is really consistent.

equivalent percentile_cont function in apache spark sql

I am new to the Spark environment. I have a dataset with the following column names:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If it were to be sql, I would have used the following query:
%sql
SELECT user_id, PERCENTILE_CONT ( 0.9 ) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark doesn't have built-in support for the percentile_cont function.
Any suggestions on how I can implement this in Spark on the above dataset?
Please let me know if more information is needed.
I have a solution for PERCENTILE_DISC (0.9) which will return the discrete order_quantity closest to percentile 0.9 (without interpolation).
The idea is to calculate PERCENT_RANK, subtract 0.9, take the absolute value, and then take the minimal value:
%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER
      (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table  -- source table name assumed; the question does not name it
)
SELECT DISTINCT
  user_id,
  FIRST_VALUE(order_quantity) OVER
    (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM temp1;
I was dealing with a similar issue too. I worked in SAP HANA and then I moved to Spark SQL on Databricks. I have migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case it should be as follows:
SELECT DISTINCT user_id, PERCENTILE(order_quantity,0.9) OVER (PARTITION BY user_id)
I hope this helps.
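If an approximate answer is acceptable (an assumption; the question asks for percentile_cont semantics), Spark SQL also has the approx_percentile aggregate, which avoids the window entirely. Table name assumed as above.

select
  user_id,
  -- approximate 90th percentile of order_quantity per user
  approx_percentile(order_quantity, 0.9) as order_quantity_p90
from my_table
group by user_id;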

Presto Cassandra Connector Clustering Index

CQL Execution [returns instantly, assuming uses clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto Execution [takes around 8secs]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto have to process 147K rows when Cassandra itself responds with just 5447 rows for the same query (I tried select * too)?
Why is Presto not able to use the clustering key optimization?
I tried all possible value forms: timestamp, date, different date formats. I was not able to see any effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
month text,
day timestamp,
test_data text,
some_random_column text,
event_time timestamp,
PRIMARY KEY (month, day, event_time)
) WITH comment='Test Data'
AND read_repair_chance = 1.0;
I added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is: why does the Cassandra connector not take advantage of this? To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.
