Filling in NULLS with previous records - Netezza SQL - apache-spark

I am using Netezza SQL on Aginity Workbench and have the following data:
id DATE1 DATE2
1 2013-07-27 NULL
2 NULL NULL
3 NULL 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 NULL
6 NULL 2013-12-19
I need to fill in all the NULL values in DATE1 with preceding values in the DATE1 field that are filled in. With DATE2, I need to do the same, but in reverse order. So my desired output would be the following:
id DATE1 DATE2
1 2013-07-27 2013-08-02
2 2013-07-27 2013-08-02
3 2013-07-27 2013-08-02
4 2013-09-10 2013-09-23
5 2013-12-11 2013-12-19
6 2013-12-11 2013-12-19
I only have read access to the data, so creating tables or views is out of the question.

How about this?
select
    id,
    last_value(date1 ignore nulls) over (
        order by id
        rows between unbounded preceding and current row
    ) as date1,
    first_value(date2 ignore nulls) over (
        order by id
        rows between current row and unbounded following
    ) as date2
from Table1
You can manually calculate this as well, rather than relying on the windowing functions.
with chain as (
    select
        this.*,
        prev.date1 prev_date1,
        case when prev.date1 is not null then abs(this.id - prev.id) else null end prev_distance,
        next.date2 next_date2,
        case when next.date2 is not null then abs(this.id - next.id) else null end next_distance
    from
        Table1 this
        left outer join Table1 prev on this.id >= prev.id
        left outer join Table1 next on this.id <= next.id
), min_distance as (
    select
        id,
        min(prev_distance) min_prev_distance,
        min(next_distance) min_next_distance
    from
        chain
    group by
        id
)
select
    chain.id,
    chain.prev_date1,
    chain.next_date2
from
    chain
    join min_distance on
        min_distance.id = chain.id
        and chain.prev_distance = min_distance.min_prev_distance
        and chain.next_distance = min_distance.min_next_distance
order by chain.id
If you're unable to calculate the distance between IDs by subtraction, just replace the ordering scheme by a row_number() call.
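A minimal sketch of that substitution, assuming the same Table1 as above: number the rows first, then do the distance arithmetic on the row numbers instead of the ids.
with numbered as (
    -- assign a gap-free ordering, independent of holes in id
    select t.*, row_number() over (order by id) rn
    from Table1 t
), chain as (
    select
        this.id,
        prev.date1 prev_date1,
        case when prev.date1 is not null then abs(this.rn - prev.rn) else null end prev_distance,
        next.date2 next_date2,
        case when next.date2 is not null then abs(this.rn - next.rn) else null end next_distance
    from
        numbered this
        left outer join numbered prev on this.rn >= prev.rn
        left outer join numbered next on this.rn <= next.rn
), min_distance as (
    select
        id,
        min(prev_distance) min_prev_distance,
        min(next_distance) min_next_distance
    from
        chain
    group by
        id
)
select
    chain.id,
    chain.prev_date1,
    chain.next_date2
from
    chain
    join min_distance on
        min_distance.id = chain.id
        and chain.prev_distance = min_distance.min_prev_distance
        and chain.next_distance = min_distance.min_next_distance
order by chain.id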

I think Netezza supports the order by clause for max() and min(). So, you can do:
select max(date1) over (order by id) as date1,
min(date2) over (order by id desc) as date2
. . .
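For completeness, a sketch of how the full statement could look, assuming the Table1 name from above; note that this only works because the dates increase along with id in the sample data:
select
    id,
    -- running max over this and all earlier rows fills DATE1 forward
    max(date1) over (order by id rows between unbounded preceding and current row) as date1,
    -- running min over this and all later rows fills DATE2 backward
    min(date2) over (order by id rows between current row and unbounded following) as date2
from Table1
order by id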
EDIT:
In Netezza, you may be able to do this with last_value() and first_value():
select last_value(date1 ignore nulls) over (order by id rows between unbounded preceding and 1 preceding) as date1,
first_value(date2 ignore nulls) over (order by id rows between 1 following and unbounded following) as date2
Netezza doesn't seem to support IGNORE NULLS on LAG(), but it does on these functions.
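Since those frames exclude the current row, a row that already has its own value would come back NULL; one way to keep it is to wrap the window results in coalesce (again assuming the Table1 name from above):
select
    id,
    -- keep the row's own DATE1 when present, otherwise take the closest earlier one
    coalesce(date1, last_value(date1 ignore nulls) over (
        order by id rows between unbounded preceding and 1 preceding)) as date1,
    -- keep the row's own DATE2 when present, otherwise take the closest later one
    coalesce(date2, first_value(date2 ignore nulls) over (
        order by id rows between 1 following and unbounded following)) as date2
from Table1
order by id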

I've only tested this in Oracle so hopefully it works in Netezza:
Fiddle:
http://www.sqlfiddle.com/#!4/7533f/1/0
select id,
coalesce(date1, t1_date1, t2_date1) as date1,
coalesce(date2, t3_date2, t4_date2) as date2
from (select t.*,
t1.date1 as t1_date1,
t2.date1 as t2_date1,
t3.date2 as t3_date2,
t4.date2 as t4_date2,
row_number() over(partition by t.id order by t.id) as rn
from tbl t
left join tbl t1
on t1.id < t.id
and t1.date1 is not null
left join tbl t2
on t2.id > t.id
and t2.date1 is not null
left join tbl t3
on t3.id < t.id
and t3.date2 is not null
left join tbl t4
on t4.id > t.id
and t4.date2 is not null
order by t.id) x
where rn = 1

Here's a way to fill in the NULL dates using self-joins: take the max of DATE1 over all rows at or before each id, and the min of DATE2 over all rows at or after it. This query should work on most databases.
select t1.id, max(t2.date1), min(t3.date2)
from tbl t1
join tbl t2 on t1.id >= t2.id
join tbl t3 on t1.id <= t3.id
group by t1.id
http://www.sqlfiddle.com/#!4/acc997/2

Related

Compare blank string and null spark sql

I am writing an SQL query that joins two tables. The problem I am facing is that the column on which I am joining is blank ("" or " ") in one table and NULL in the other.
Table A
id  col
1   (blank)
2   (blank)
3   SG
Table B
id  col
a   null
b   null
c   SG
source_alleg = spark.sql("""
SELECT A.*,B.COL as COLB FROM TABLEA A LEFT JOIN TABLEB B
ON A.COL = B.COL
""")
For my use case, blank values and NULL are the same. I want to do something like trim(a.col) that converts blank values to NULL, so the join finds all the matches.
Output:
id  col                   colb
1   either null or blank  either null or blank
2   either null or blank  either null or blank
3   SG                    SG
In SQL, NULLs are ignored during a join unless you use an outer join or full join.
More information: https://www.geeksforgeeks.org/difference-between-left-right-and-full-outer-join/
If you want to convert the NULLs to a string, you can just use an if:
select
if(isnull(trim(col1)),"yourstring", col1),
if(isnull(trim(col2)),"yourstring", col2)
from T;
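If the goal is for blank and NULL keys to actually match each other in the join, another option (a sketch, assuming the TABLEA/TABLEB names from the question) is to normalize blanks to NULL with nullif(trim(...)) and join with Spark SQL's null-safe equality operator <=>:
source_alleg = spark.sql("""
    SELECT A.*, B.COL AS COLB
    FROM TABLEA A
    LEFT JOIN TABLEB B
      -- trim() turns " " into "", nullif() turns "" into NULL,
      -- and <=> treats NULL = NULL as a match
      ON nullif(trim(A.COL), '') <=> nullif(trim(B.COL), '')
""")
Bear in mind that if several rows are NULL/blank on both sides, each will match every other, so the join can fan out.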

Presto / AWS Athena query, historicized table (last value in aggregation)

I've got a table split into a static part and a history part. I have to create a query which groups by a series of dimensions, including year and month, and does some aggregations. One of the values that I need to project is the value from the last tuple of the history table matching the given year/month pair.
The history table has validity_date_start and validity_date_end columns; the latter is NULL if the row is still current.
This is the query I've done so far (using temporary tables for ease of reproduction):
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
ELSE CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
ELSE CASE WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.
Could you please help me?
Thank you very much!
I eventually found out what I was doing wrong here.
The documentation says:
The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions
This led me to think that, since I already had a GROUP BY clause, a PARTITION BY was useless. It is not: if you want the value for a given group, you generally have to specify that group in the PARTITION BY clause too (or, better, the dimensions you're projecting in the SELECT).
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(PARTITION BY (time.year, time.month, t1.name) ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN time.date >= timestamp '2020-07-01 00:00:00' AND t1.id = 1 THEN 27
ELSE CASE WHEN time.date >= timestamp '2020-05-01 00:00:00' AND t1.id = 2 THEN 42
ELSE CASE WHEN time.date >= timestamp '2020-03-01 00:00:00' AND t1.id = 1 THEN 1 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-03-01 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
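To see the effect in isolation, here is a tiny self-contained illustration with made-up data: without PARTITION BY the window spans every row, so all groups get the same value; with PARTITION BY each group gets its own last value.
SELECT grp,
    -- one window over all rows: every row sees the globally latest val
    FIRST_VALUE(val) OVER (ORDER BY ts DESC) AS last_overall,
    -- one window per grp: every row sees the latest val of its own group
    FIRST_VALUE(val) OVER (PARTITION BY grp ORDER BY ts DESC) AS last_per_group
FROM (VALUES
    ('a', 1, timestamp '2020-01-01 00:00:00'),
    ('a', 2, timestamp '2020-02-01 00:00:00'),
    ('b', 3, timestamp '2020-01-15 00:00:00')
) AS t (grp, val, ts)
ORDER BY grp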

Hive: How to use conditional statements to execute a different query based on a result

I have the query select col1, col2 from view1 and I want to execute it only when (select columnvalue from table1) > 0; otherwise, do nothing.
if (select columnvalue from table1) > 0
    select col1, col2 from view1
else
    do nothing
How can I achieve this in a single Hive query?
If the check query returns a scalar value (a single row), then you can cross join with the check result and filter using a > 0 condition:
with check_query as (
select count (*) cnt
from table1
)
select *
from view1 t
cross join check_query c
where c.cnt>0
;

Unpivot and Pivot does not return data

I'm trying to return data as columns.
I've written this unpivot and pivot query:
select StockItemCode, barcode, barcode2 from
(
    select StockItemCode, col+cast(seq as varchar(20)) col, value
    from
    (
        select
        (select min(StockItemCode) from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
        cast(BarCode as varchar(20)) barcode,
        row_number() over(partition by StockItemCode order by StockItemCode) seq
        from RTLBarCode t
    ) d
    unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode, barcode2)) piv;
But the problem is that only the Barcode2 field returns a value (the barcode field returns NULL when in fact there is a value).
SAMPLE DATA
I have a table called RTLBarCode.
It has a field called Barcode and a field called StockItemCode.
For StockItemCode = 10 I have two rows, with Barcode values of 5014721112824 and 0000000019149.
Can anyone see where I am going wrong?
Many thanks
You are indexing your barcode in unpiv.
This results in col values barcode1 and barcode2.
But then you are pivoting on barcode instead of barcode1, so no value is found and the aggregate returns NULL.
The correct statement would be:
select StockItemCode, barcode1, barcode2 from
(
select StockItemCode, col+cast(seq as varchar(20)) col, value
from
(
select
(select min(StockItemCode)from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t
) d
unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode1, barcode2)) piv

SQL query for SQL Server Compact Edition 3.5 - GROUP BY issue

SELECT BabyInformation.* , t1.*
FROM BabyInformation
LEFT JOIN
(SELECT * FROM BabyData
GROUP BY BabyID
ORDER By Date DESC ) AS t1 ON BabyInformation.BabyID=t1.BabyID
This is my query. I want to get the single most recent BabyData row per baby, based on Date.
BabyInformation should left join with BabyData, but with one row per baby...
I tried TOP(1), but that worked only for the first baby.
Here is one way to do it; there are other ways which can be faster, but I believe this one is the clearest for a beginner.
SELECT BabyInformation.*, BabyData.*
FROM BabyInformation
JOIN
(SELECT BabyID, Max(Date) as maxDate FROM BabyData
GROUP BY BabyID
) AS t1
ON BabyInformation.BabyID=t1.BabyID
Join BabyData ON BabyData.BabyID = t1.BabyID and BabyData.Date = t1.maxDate
This should do it:
SELECT bi.* , bd.*
FROM BabyInformation [bi]
LEFT JOIN BabyData [bd]
on bd.BabyDataId = (select top 1 sub.BabyDataId from BabyData [sub] where sub.BabyId = bi.BabyId order by sub.Date desc)
I've assumed that there is a column called 'BabyDataId' in the BabyData table.
