Presto / AWS Athena query, historicized table (last value in aggregation)

I've got a table split into a static part and a history part. I have to write a query that groups by a series of dimensions, including year and month, and performs some aggregations. One of the values I need to project is the value of the last tuple of the history table matching the given year / month pair.
The history table has validity_date_start and validity_date_end columns; the latter is NULL if the row is still current.
This is the query I've written so far (using inline VALUES tables for ease of reproduction):
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
ELSE CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
ELSE CASE WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.
Could you please help me?
Thank you very much!

I eventually found out what I was doing wrong here.
The documentation says:
The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions
This led me to think that since I already had a GROUP BY clause, PARTITION BY was redundant. It is not: in general, if you want the datum for a given group, you have to specify the grouping dimensions in the PARTITION BY clause too (or better, the dimensions you're projecting in the SELECT list).
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(PARTITION BY (time.year, time.month, t1.name) ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN time.date >= timestamp '2020-07-01 00:00:00' AND t1.id = 1 THEN 27
ELSE CASE WHEN time.date >= timestamp '2020-05-01 00:00:00' AND t1.id = 2 THEN 42
ELSE CASE WHEN time.date >= timestamp '2020-03-01 00:00:00' AND t1.id = 1 THEN 1 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-03-01 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
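As a minimal illustration of the difference (a standalone sketch, not the original tables): without PARTITION BY the window spans the whole result, so FIRST_VALUE returns the same row everywhere; with it, you get the latest row per group:
SELECT g,
FIRST_VALUE(v) OVER (PARTITION BY g ORDER BY ts DESC) AS last_per_group, -- 2 for 'a', 3 for 'b'
FIRST_VALUE(v) OVER (ORDER BY ts DESC) AS last_overall -- 2 for every row
FROM (VALUES
('a', 1, timestamp '2020-01-01 00:00:00'),
('a', 2, timestamp '2020-02-01 00:00:00'),
('b', 3, timestamp '2020-01-15 00:00:00')
) AS t(g, v, ts)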

Related

Calculating %diff between 2 values in Presto

I have a field of type double called values.
I want to calculate the % difference between the last value and the one before it; for example:
value
10
2
4
2
the output should be: -50%
How can I do this in Presto?
If you have a field to order by (otherwise the result is not guaranteed), you can use the lag window function:
-- sample data
WITH dataset ("values", date) AS (
VALUES (10, now()),
(4, now() + interval '1' hour),
(2, now() + interval '2' hour)
)
--query
select ("values" - l) * 100.0 / l as value
from (
select "values",
lag("values") over(order by date) as l,
date
from dataset
)
order by date desc
limit 1
Output:
value
-50.0
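If the previous value can be zero, the division raises an error; a variant (a sketch using nullif as a guard) returns NULL in that case instead:
-- sample data where the value before last is zero
WITH dataset ("values", date) AS (
VALUES (0, now()),
(2, now() + interval '1' hour)
)
--query
select ("values" - l) * 100.0 / nullif(l, 0) as value
from (
select "values",
lag("values") over(order by date) as l,
date
from dataset
)
order by date desc
limit 1
-- returns NULL instead of failing with a division-by-zero error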

Python sqlite added rows to SELECT result which were not changed by UPDATE commit [duplicate]

Can someone please explain this to me:
import sqlite3
db = sqlite3.connect(':memory:')
db.execute('create table t1 (id integer primary key, val text)')
db.execute('create table t2 (id integer primary key, val text)')
c = db.cursor()
c.execute('insert into t1 values (?, ?)', (1, 'a'))
c.execute('insert into t2 values (?, ?)', (1, 'b'))
c.execute('insert into t1 values (?, ?)', (2, 'c'))
c.execute('insert into t2 values (?, ?)', (2, 'd'))
c.execute('''select t1.id, t1.val, t2.val
from t1
left join t2 using (id)
where t1.id is not null
union all
select t2.id, t1.val, t2.val
from t2
left join t1 using (id)
where t2.id is not null
and t1.id is null
''')
for row in c:
    print(row[0])
    if row[0] == 1:
        c2 = db.cursor()
        c2.execute('delete from t1 where id = ?', (row[0],))
If I comment out the last three lines, the output is:
1
2
But if I uncomment the last three lines, the output is:
1
2
1
i.e. the first cursor has been updated with the results of DML executed in the second cursor.
Is this expected behaviour? Is there some way to prevent it?
I'm running Python 3.6.3 (as per Ubuntu 17.10), in case that makes a difference.
SQLite computes result rows on demand, if possible. But this is not always possible, so there is no guarantee.
You should never modify any table that you are currently reading in another query. (The database might scan the table in unobvious ways, so even changes to other rows might change the enumeration.)
If you intend to do such modifications, you have to read all rows before making the changes, e.g., for row in c.fetchall(). Alternatively, read the table in single steps that re-search for the place where the last query left off, i.e.:
SELECT ... FROM MyTable WHERE ID > :LastID ORDER BY ID LIMIT 1;

Oracle Query - Join with comma separated data

Table Name : crm_mrdetails
id | mr_name | me_email | mr_mobile | mr_doctor|
-----------------------------------------------------
1 | John | abc@gmail.com | 1234555555 | ,1,2,3 |
Table Name : crm_mr_doctor
id | dr_name | specialization|
----------------------------------
1 | Abhishek | cordiologist |
2 | Krishnan | Physician |
3 | Krishnan | Nurse |
The comma-separated values in crm_mrdetails.mr_doctor are foreign keys referencing crm_mr_doctor.id. I need to join on them to produce output like this:
id | mr_name | me_email |Doctor_Specialization|
-------------------------------------------------
1 | John | abc@gmail.com | cordiologist,Physician,Nurse |
I'm new to Oracle; I'm using Oracle 12c. Any help is much appreciated.
First of all we must acknowledge that this is a bad data model. The column mr_doctor violates First Normal Form. This is not some abstruse theoretical point. Not being in 1NF means we must write more code to look up the meaning of the keys instead of using standard SQL join syntax. It also means we cannot depend on the column containing valid IDs: mr_doctor can contain any old nonsense and we must write a query which can handle that. See "Is storing a delimited list in a database column really that bad?" for more on this.
Anyway. Here is a solution which uses regular expressions to split the mr_doctor column into IDs and then joins them to the mr_doctor table. The specialization column is concatenated to produce the required output.
select mrdet.id,
mrdet.mr_name,
mrdet.me_email,
listagg(mrdoc.specialization, ',')
within group (order by mrdoc.specialization) as doctor_specialization
from mr_details mrdet
join (
select distinct id,
regexp_substr(mr_doctor, '(,?)([0-9]+)(,?)', 1, level, null, 2) as dr_id
from mr_details
connect by level <= regexp_count(mr_doctor, '(,?)([0-9]+)')
) as mrids
on mrids.id = mrdet.id
left outer join mr_doctor mrdoc
on mrids.dr_id = mrdoc.id
group by mrdet.id,
mrdet.mr_name,
mrdet.me_email
/
This solution is reasonably resilient despite the data model being brittle. It will return results if the string has too many commas, or spaces. It will ignore values which are letters or otherwise aren't numbers. It won't hurl if the extracted number doesn't match an ID in the mr_doctor table. Obviously the results are untrustworthy for those reasons, but that's part of the price of a shonky data model.
Can you please explain the following: (,?)([0-9]+)(,?)
The pattern matches zero or one comma, followed by one or more digits, followed by zero or one comma. Perhaps the (,?) in the matched patterns aren't strictly necessary. However, without them, the string 2 3 4 would match the same three IDs as the string 2,3,4. Maybe that's correct, maybe it isn't. When the foreign keys are stored in a CSV column instead of being enforced through a proper constraint, what does 'correct' even mean?
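To see what the pattern extracts, here is a standalone demo (hypothetical messy input; the regexp_substr / regexp_count calls mirror the query above):
-- stray spaces, empty elements and the letter x are skipped;
-- only the whole numbers 1, 2 and 34 come back
select regexp_substr(',1, 2,,x,34', '(,?)([0-9]+)(,?)', 1, level, null, 2) as dr_id
from dual
connect by level <= regexp_count(',1, 2,,x,34', '(,?)([0-9]+)')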
You have to split the data in the mr_doctor column into rows, join table crm_mr_doctor, and then use listagg().
How to split the data? See Splitting string into multiple rows in Oracle.
select t.id, max(mr_name) mr_name,
listagg(specialization, ', ') within group (order by rn) specs
from (
select id, mr_name, levels.column_value rn,
trim(regexp_substr(mr_doctor, '[^,]+', 1, levels.column_value)) as did
from crm_mrdetails t,
table(cast(multiset(select level
from dual
connect by level <=
length(regexp_replace(t.mr_doctor, '[^,]+')) + 1)
as sys.odcinumberlist)) levels) t
left join crm_mr_doctor d on t.did = d.id
group by t.id
Demo and result:
with crm_mrdetails (id, mr_name, mr_doctor) as (
select 1, 'John', ',1,2,3' from dual union all
select 2, 'Anne', ',4,2,6,5' from dual union all
select 3, 'Dave', ',4' from dual),
crm_mr_doctor (id, dr_name, specialization) as (
select 1, 'Abhishek', 'cordiologist' from dual union all
select 2, 'Krishnan', 'Physician' from dual union all
select 3, 'Krishnan', 'Nurse' from dual union all
select 4, 'Krishnan', 'Onkologist' from dual union all
select 5, 'Krishnan', 'Surgeon' from dual union all
select 6, 'Krishnan', 'Nurse' from dual
)
select t.id, max(mr_name) mr_name,
listagg(specialization, ', ') within group (order by rn) specs
from (
select id, mr_name, levels.column_value rn,
trim(regexp_substr(mr_doctor, '[^,]+', 1, levels.column_value)) as did
from crm_mrdetails t,
table(cast(multiset(select level
from dual
connect by level <=
length(regexp_replace(t.mr_doctor, '[^,]+')) + 1)
as sys.odcinumberlist)) levels) t
left join crm_mr_doctor d on t.did = d.id
group by t.id
Output:
ID MR_NAME SPECS
------ ------- -------------------------------------
1 John cordiologist, Physician, Nurse
2 Anne Onkologist, Physician, Nurse, Surgeon
3 Dave Onkologist
You can use a recursive sub-query and simple string functions (which may be faster than using regular expressions and a correlated hierarchical query):
Oracle Setup:
CREATE TABLE crm_mrdetails (id, mr_name, mr_doctor) as
select 1, 'John', ',1,2,3' from dual union all
select 2, 'Anne', ',4,2,6,5' from dual union all
select 3, 'Dave', ',4' from dual;
CREATE TABLE crm_mr_doctor (id, dr_name, specialization) as
select 1, 'Abhishek', 'cordiologist' from dual union all
select 2, 'Krishnan', 'Physician' from dual union all
select 3, 'Krishnan', 'Nurse' from dual union all
select 4, 'Krishnan', 'Onkologist' from dual union all
select 5, 'Krishnan', 'Surgeon' from dual union all
select 6, 'Krishnan', 'Nurse' from dual;
Query:
WITH crm_mrdetails_bounds ( id, mr_name, mr_doctor, start_pos, end_pos ) AS (
SELECT id,
mr_name,
mr_doctor,
2,
INSTR( mr_doctor, ',', 2 )
FROM crm_mrdetails
UNION ALL
SELECT id,
mr_name,
mr_doctor,
end_pos + 1,
INSTR( mr_doctor, ',', end_pos + 1 )
FROM crm_mrdetails_bounds
WHERE end_pos > 0
),
crm_mrdetails_specs ( id, mr_name, start_pos, specialization_id ) AS (
SELECT id,
mr_name,
start_pos,
TO_NUMBER(
CASE end_pos
WHEN 0
THEN SUBSTR( mr_doctor, start_pos )
ELSE SUBSTR( mr_doctor, start_pos, end_pos - start_pos )
END
)
FROM crm_mrdetails_bounds
)
SELECT s.id,
MAX( s.mr_name ) AS mr_name,
LISTAGG( d.specialization, ',' )
WITHIN GROUP ( ORDER BY s.start_pos )
AS doctor_specialization
FROM crm_mrdetails_specs s
INNER JOIN crm_mr_doctor d
ON ( s.specialization_id = d.id )
GROUP BY s.id
Output:
ID | MR_NAME | DOCTOR_SPECIALIZATION
-: | :------ | :---------------------------------
1 | John | cordiologist,Physician,Nurse
2 | Anne | Onkologist,Physician,Nurse,Surgeon
3 | Dave | Onkologist
db<>fiddle here
Please change the column names according to your requirement.
CREATE OR REPLACE Function ReplaceSpec
(String_Inside IN Varchar2)
Return Varchar2 Is
outputString Varchar2(5000);
tempOutputString crm_doc.specialization%TYPE;
Begin
FOR i in 1..(LENGTH(String_Inside)-LENGTH(REPLACE(String_Inside,',',''))+1)
LOOP
Select specialization into tempOutputString From crm_doc
Where id = PARSING_STRING(String_Inside,i);
If i != 1 Then
outputString := outputString || ',';
end if;
outputString := outputString || tempOutputString;
END LOOP;
Return outputString;
End;
/
The Parsing_String function to help split the comma separated values.
CREATE OR REPLACE Function PARSING_STRING
(String_Inside IN Varchar2, Position_No IN Number)
Return Varchar2 Is
OurEnd Number; Beginn Number;
Begin
If Position_No < 1 Then
Return Null;
End If;
OurEnd := Instr(String_Inside, ',', 1, Position_No);
If OurEnd = 0 Then
OurEnd := Length(String_Inside) + 1;
End If;
If Position_No = 1 Then
Beginn := 1;
Else
Beginn := Instr(String_Inside, ',', 1, Position_No-1) + 1;
End If;
Return Substr(String_Inside, Beginn, OurEnd-Beginn);
End;
/
Please note that I have given only a basic function to get your output. You might need to add some exception handling, e.g. deciding what to do when the doc_id [mr_doctor] is empty.
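For instance, the SELECT INTO raises NO_DATA_FOUND when an id is empty or unknown; one basic guard (a sketch to adapt) is an inner block around it:
Begin
Select specialization into tempOutputString From crm_doc
Where id = PARSING_STRING(String_Inside,i);
Exception
When NO_DATA_FOUND Then
tempOutputString := Null;
End;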
Usage
select t1.*,ReplaceSpec(doc_id) from crm_details t1
if your mr_doctor data always starts with a comma use:
Select t1.*,ReplaceSpec(Substr(doc_id,2)) from crm_details t1
Please go through https://oracle-base.com/articles/misc/string-aggregation-techniques
String Aggregation Techniques
or
SELECT deptno,
LTRIM(MAX(SYS_CONNECT_BY_PATH(ename,','))
KEEP (DENSE_RANK LAST ORDER BY curr),',') AS employees
FROM (SELECT deptno,
ename,
ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY ename) AS curr,
ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY ename) -1 AS prev
FROM emp)
GROUP BY deptno
CONNECT BY prev = PRIOR curr AND deptno = PRIOR deptno
START WITH curr = 1
or
listagg and wm_concat can also be used, as other people have shown
How about this one? I have not tested it, so there could be syntax errors.
select id,mr_name,me_email,listagg(specialization,',') within group (order by specialization) as Doctor_Specialization
from
(select dtls.id,dtls.mr_name,dtls.me_email,dr.specialization
from crm_mrdetails dtls,
crm_mr_doctor dr
where INSTR(','||dtls.mr_doctor||',' , ','||dr.id||',') > 0
) group by id,mr_name,me_email;
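The INSTR trick works by wrapping both the list and the single id in commas, so an id can only match as a whole token: doctor id 1 will not match a list that merely contains 12. A quick check with hypothetical values:
-- ',12,3,' does not contain ',1,' (returns 0) but does contain ',12,' (returns 1)
select instr(',' || '12,3' || ',', ',' || '1' || ',') as id_1_pos,
instr(',' || '12,3' || ',', ',' || '12' || ',') as id_12_pos
from dual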

Filling in NULLS with previous records - Netezza SQL

I am using Netezza SQL on Aginity Workbench and have the following data:
id DATE1 DATE2
1 2013-07-27 NULL
2 NULL NULL
3 NULL 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 NULL
6 NULL 2013-12-19
I need to fill in all the NULL values in DATE1 with preceding values in the DATE1 field that are filled in. With DATE2, I need to do the same, but in reverse order. So my desired output would be the following:
id DATE1 DATE2
1 2013-07-27 2013-08-02
2 2013-07-27 2013-08-02
3 2013-07-27 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 2013-12-19
6 2013-12-11 2013-12-19
I only have read access to the data, so creating tables or views is out of the question.
How about this?
select
id
,last_value(date1 ignore nulls) over (
order by id
rows between unbounded preceding and current row
) date1
,first_value(date2 ignore nulls) over (
order by id
rows between current row and unbounded following
) date2
from Table1 -- table name assumed; the question doesn't name the source table
You can manually calculate this as well, rather than relying on the windowing functions.
with chain as (
select
this.*,
prev.date1 prev_date1,
case when prev.date1 is not null then abs(this.id - prev.id) else null end prev_distance,
next.date2 next_date2,
case when next.date2 is not null then abs(this.id - next.id) else null end next_distance
from
Table1 this
left outer join Table1 prev on this.id >= prev.id
left outer join Table1 next on this.id <= next.id
), min_distance as (
select
id,
min(prev_distance) min_prev_distance,
min(next_distance) min_next_distance
from
chain
group by
id
)
select
chain.id,
chain.prev_date1,
chain.next_date2
from
chain
join min_distance on
min_distance.id = chain.id
and chain.prev_distance = min_distance.min_prev_distance
and chain.next_distance = min_distance.min_next_distance
order by chain.id
If you're unable to calculate the distance between IDs by subtraction, just replace the ordering scheme by a row_number() call.
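For example (a sketch; the numbered alias is hypothetical), derive a gapless ordinal first and use rid wherever the chain CTE uses id:
-- normalize ids to consecutive ordinals so that subtraction measures row distance
with numbered as (
select row_number() over (order by id) as rid, id, date1, date2
from Table1
)
select * from numbered
Then build chain from numbered, joining on rid instead of id.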
I think Netezza supports the order by clause for max() and min(). So, you can do:
select max(date1) over (order by id) as date1,
min(date2) over (order by id desc) as date2
. . .
EDIT:
In Netezza, you may be able to do this with last_value() and first_value():
select last_value(date1 ignore nulls) over (order by id rows between unbounded preceding and 1 preceding) as date1,
first_value(date2 ignore nulls) over (order by id rows between 1 following and unbounded following) as date2
Netezza doesn't seem to support IGNORE NULLS on LAG(), but it does on these functions. Note that the frames above exclude the current row, so wrap the calls in coalesce(date1, ...) and coalesce(date2, ...) if rows should keep their own non-NULL values.
I've only tested this in Oracle so hopefully it works in Netezza:
Fiddle:
http://www.sqlfiddle.com/#!4/7533f/1/0
select id,
coalesce(date1, t1_date1, t2_date1) as date1,
coalesce(date2, t3_date2, t4_date2) as date2
from (select t.*,
t1.date1 as t1_date1,
t2.date1 as t2_date1,
t3.date2 as t3_date2,
t4.date2 as t4_date2,
row_number() over(partition by t.id order by t.id) as rn
from tbl t
left join tbl t1
on t1.id < t.id
and t1.date1 is not null
left join tbl t2
on t2.id > t.id
and t2.date1 is not null
left join tbl t3
on t3.id < t.id
and t3.date2 is not null
left join tbl t4
on t4.id > t.id
and t4.date2 is not null
order by t.id) x
where rn = 1
Here's a way to fill in NULL dates with the most recent min/max non-null dates using self-joins. This query should work on most databases:
select t1.id, max(t2.date1), min(t3.date2)
from tbl t1
join tbl t2 on t1.id >= t2.id
join tbl t3 on t1.id <= t3.id
group by t1.id
http://www.sqlfiddle.com/#!4/acc997/2

Cassandra - dates before 1970

Is there a way to support dates older than 1970 in Cassandra while still supporting date operations on them? I can only see timestamps. If we need older dates, should I simulate my own dates as longs or perhaps as strings?
CQL doesn't return anything when I issue the query:
SELECT col1 FROM table1 WHERE ts >= '1900-01-01 00:00:00+0000'
This seems to work fine in Cassandra 2.0.9:
CREATE TABLE table1 (id int, col1 int, ts timestamp, PRIMARY KEY (id, ts));
INSERT INTO table1 (id, col1, ts) values (1, 10, '2000-02-03');
INSERT INTO table1 (id, col1, ts) values (1, 20, '1960-02-03');
INSERT INTO table1 (id, col1, ts) values (1, 30, '1890-02-03');
SELECT col1 FROM table1 WHERE id = 1 and ts >= '1900-01-01 00:00:00+0000' limit 10;
Output:
col1
------
20
10
Problems in earlier versions might be related to CASSANDRA-6395 (fixed in 2.0.4), or JAVA-264 (that was later reverted).
