I am writing an SQL query that joins two tables. The problem I am facing is that the join column is blank ("" or " ") in one table and NULL in the other.
Table A
id | col
1  | (blank)
2  | (blank)
3  | SG
Table B
id | col
a  | null
b  | null
c  | SG
source_alleg = spark.sql("""
SELECT A.*,B.COL as COLB FROM TABLEA A LEFT JOIN TABLEB B
ON A.COL = B.COL
""")
For my use case, blank values and NULL are the same. I want to do something like TRIM(A.COL) that converts blank values to NULL, so the join finds all the matches.
Output:
id | col                  | colb
1  | either null or blank | either null or blank
2  | either null or blank | either null or blank
3  | SG                   | SG
In SQL, NULL never compares equal to anything (not even another NULL), so rows with NULL join keys find no match; an outer join or full join merely keeps the unmatched rows.
More information: https://www.geeksforgeeks.org/difference-between-left-right-and-full-outer-join/
If you want to convert the NULLs (and blanks) to a string so they can join, you can just use an if:
select
    -- nullif(trim(col), '') turns both '' and ' ' into NULL, so isnull also catches blanks
    if(isnull(nullif(trim(col1), '')), "yourstring", col1),
    if(isnull(nullif(trim(col2), '')), "yourstring", col2)
from T;
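Applied to the original join, here is a minimal sketch that normalizes both sides instead of substituting a placeholder string (assuming Spark SQL, where <=> is the built-in null-safe equality operator and nullif/trim are available):
source_alleg = spark.sql("""
SELECT A.*, B.COL AS COLB
FROM TABLEA A
LEFT JOIN TABLEB B
  -- map '' and ' ' to NULL on both sides, then compare null-safely
  ON nullif(trim(A.COL), '') <=> nullif(trim(B.COL), '')
""")
Be aware that null-safe equality makes every blank/NULL row on one side match every blank/NULL row on the other, so the result multiplies when both tables contain several such rows.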
I've got a table split into a static part and a history part. I have to create a query which groups by a series of dimensions, including year and month, and does some aggregations. One of the values that I need to project is the value of the last tuple of the history table matching the given year/month pair.
The history table has validity_date_start and validity_date_end columns, and the latter is NULL if the row is still current.
This is the query I've written so far (using inline VALUES tables for ease of reproduction):
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE
    WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
    WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
    WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.
Could you please help me?
Thank you very much!
I eventually found out what I was doing wrong here.
The documentation says:
The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions.
This led me to think that, since I already had a GROUP BY clause, the partition specification was redundant. It is not: window functions are evaluated after GROUP BY, over the grouped result set as a whole, so if you want the datum for a given group you have to list the grouping dimensions in the PARTITION BY clause too (or better, the dimensions that you project in the SELECT list).
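As a minimal illustration (hypothetical data, Presto/Athena syntax): without the PARTITION BY, the window below would span the whole result and every row would see the same first value; with it, each grp gets its own.
SELECT id, grp, val,
       FIRST_VALUE(val) OVER (PARTITION BY grp ORDER BY id DESC) AS last_val_in_grp
FROM (VALUES (1, 'a', 10), (2, 'a', 20), (3, 'b', 30)) AS t (id, grp, val);
Here is the corrected query: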
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(PARTITION BY time.year, time.month, t1.name ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE
    WHEN time.date >= timestamp '2020-07-01 00:00:00' AND t1.id = 1 THEN 27
    WHEN time.date >= timestamp '2020-05-01 00:00:00' AND t1.id = 2 THEN 42
    WHEN time.date >= timestamp '2020-03-01 00:00:00' AND t1.id = 1 THEN 1
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-03-01 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
I have the query select col1, col2 from view1, and I want to execute it only when (select columnvalue from table1) > 0, and otherwise do nothing:
if (select columnvalue from table1) > 0
    select col1, col2 from view1
else
    do nothing
How can I achieve this in a single Hive query?
If the check query returns a scalar value (a single row), then you can cross join with the check result and filter using the > 0 condition:
with check_query as (
select count (*) cnt
from table1
)
select *
from view1 t
cross join check_query c
where c.cnt>0
;
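The same pattern works with the question's columnvalue instead of count(*); a sketch, assuming table1 holds a single row:
with check_query as (
    select max(columnvalue) as columnvalue -- max() still yields one row if table1 is empty
    from table1
)
select t.col1, t.col2
from view1 t
cross join check_query c
where c.columnvalue > 0;
If table1 is empty, max(columnvalue) is NULL, the > 0 predicate evaluates to unknown, and no rows are returned, which matches the "do nothing" branch.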
I'm trying to return data as columns.
I've written this unpivot and pivot query:
select StockItemCode, barcode, barcode2 from (
    select StockItemCode, col+cast(seq as varchar(20)) col, value
    from (
        select
            (select min(StockItemCode)
             from RTLBarCode t2
             where t.StockItemCode = t2.StockItemCode) StockItemCode,
            cast(BarCode as varchar(20)) barcode,
            row_number() over(partition by StockItemCode order by StockItemCode) seq
        from RTLBarCode t
    ) d
    unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode, barcode2)) piv;
But the problem is that only the "barcode2" field returns a value (the "barcode" field returns NULL when in fact there is a value).
SAMPLE DATA
I have a table called RTLBarCode.
It has a field called Barcode and a field called StockItemCode.
For StockItemCode = 10 I have 2 rows with Barcode values of 5014721112824 and 0000000019149.
Can anyone see where I am going wrong?
Many thanks
You are appending an index to your barcode values in unpiv.
This results in col values of barcode1 and barcode2.
But then you are pivoting on barcode instead of barcode1, so no value is found and the aggregate returns NULL.
The correct statement would be:
select StockItemCode, barcode1, barcode2 from
(
select StockItemCode, col+cast(seq as varchar(20)) col, value
from
(
select
(select min(StockItemCode)from RTLBarCode t2 where t.StockItemCode = t2.StockItemCode) StockItemCode,
cast(BarCode as varchar(20)) barcode,
row_number() over(partition by StockItemCode order by StockItemCode) seq
from RTLBarCode t
) d
unpivot(value for col in (barcode)) unpiv
) src
pivot (max(value) for col in (barcode1, barcode2)) piv
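As an aside, the same output can be produced without PIVOT/UNPIVOT at all, using plain conditional aggregation; a sketch against the same RTLBarCode table (the ORDER BY BarCode tie-breaker is an assumption, since ordering a partition by its own partition key leaves the row order unspecified):
select StockItemCode,
       max(case when seq = 1 then barcode end) as barcode1,
       max(case when seq = 2 then barcode end) as barcode2
from (
    select StockItemCode,
           cast(BarCode as varchar(20)) barcode,
           -- number each barcode within its StockItemCode
           row_number() over (partition by StockItemCode order by BarCode) seq
    from RTLBarCode
) d
group by StockItemCode;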
SELECT BabyInformation.* , t1.*
FROM BabyInformation
LEFT JOIN
(SELECT * FROM BabyData
GROUP BY BabyID
ORDER By Date DESC ) AS t1 ON BabyInformation.BabyID=t1.BabyID
This is my query. I want to get the single most recent BabyData tuple per baby, based on Date.
BabyInformation should left join with BabyData, but with one row per baby.
I tried TOP(1), but that worked only for the first baby.
Here is one way to do it. There are other ways which can be faster, but I believe this one is the clearest for a beginner.
SELECT BabyInformation.*, BabyData.*
FROM BabyInformation
JOIN
(SELECT BabyID, Max(Date) as maxDate FROM BabyData
GROUP BY BabyID
) AS t1
ON BabyInformation.BabyID=t1.BabyID
JOIN BabyData ON BabyData.BabyID = t1.BabyID AND BabyData.Date = t1.maxDate
This should do it:
SELECT bi.* , bd.*
FROM BabyInformation [bi]
LEFT JOIN BabyData [bd]
on bd.BabyDataId = (select top 1 sub.BabyDataId from BabyData [sub] where sub.BabyId = bi.BabyId order by sub.Date desc)
I've assumed that there is a column called 'BabyDataId' in the BabyData table.
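On SQL Server the same "top 1 per group" idea is often written with OUTER APPLY, which keeps every baby even when no BabyData row exists; a sketch under the same assumed schema:
SELECT bi.*, bd.*
FROM BabyInformation bi
OUTER APPLY (
    SELECT TOP 1 *
    FROM BabyData b
    WHERE b.BabyID = bi.BabyID
    ORDER BY b.Date DESC -- most recent row per baby
) bd;
Unlike the correlated-subquery version, this does not need a BabyDataId key, only the BabyID and Date columns the question already assumes.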