hive How to use conditional statements to execute different query based on result - apache-spark

I have query select col1, col2 from view1 and I wanted execute only when (select columnvalue from table1) > 0 else do nothing.
if (select columnvalue from table1)>0
select col1, col2 from view1"
else
do thing
How can I achieve this in single hive query?

If check query returns scalar value (single row) then you can cross join with check result and filter using > 0 condition:
with check_query as (
select count (*) cnt
from table1
)
select *
from view1 t
cross join check_query c
where c.cnt>0
;

Related

How to get top 3 columns and their values across multiple columns (dynamically) in BigQuery

I have a table that looks like this
select 'Alice' AS ID, 1 as col1, 3 as col2, -2 as col3, 9 as col4
union all
select 'Bob' AS ID, -9 as col1, 2 as col2, 5 as col3, -6 as col4
I would like to get the top 3 absolute values for each record across the four columns and then format the output as a dictionary or STRUCT like below
select
'Alice' AS ID, [STRUCT('col4' AS column, 9 AS value), STRUCT('col2',3), STRUCT('col3',-2)] output
union all
select
'Bob' AS ID, [STRUCT('col1' AS column, -9 AS value), STRUCT('col4',-6), STRUCT('col3',5)]
output
output
I would like it to be dynamic, so avoid writing out columns individually. It could go up to 100 columns that change
For more context, I am trying to get the top three features from the batch local explanations output in Vertex AI
https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/get-batch-predictions
I have looked up some examples, would like something similar to the second answer here How to get max value of column values in a record ? (BigQuery)
EDIT: the data is actually structured like this. If this can be worked with more easily, this would be a better option to work from
select 'Alice' AS ID, STRUCT(1 as col1, 3 as col2, -2 as col3, 9 as col4) AS featureAttributions
union all
SELECT 'Bob' AS ID, STRUCT(-9 as col1, 2 as col2, 5 as col3, -6 as col4) AS featureAttributions
Consider below query.
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
Query results
Dynamic Query
I would like it to be dynamic, so avoid writing out columns individually
You need to consider a dynamic SQL for this. By refering to the answer from #Mikhail you linked in the post, you can write a dynamic query like below.
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM sample_table UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT (ID) FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);
For updated sample table
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (col1, col2, col3, col4))
)
GROUP BY ID;
EXECUTE IMMEDIATE FORMAT("""
SELECT ID, ARRAY_AGG(STRUCT(column, value) ORDER BY ABS(value) DESC LIMIT 3) output
FROM (
SELECT * FROM (SELECT ID, featureAttributions.* FROM sample_table)
UNPIVOT (value FOR column IN (%s))
)
GROUP BY ID
""", ARRAY_TO_STRING(
REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT featureAttributions FROM sample_table LIMIT 1)), r'"([^,{]+)":'), ',')
);

How to only select the very next string after target string but the next string after the target string, regardless of punctation?

I have a df that looks like this:
id query
1 select * from table1 where col1 = 1
2 select a.columns FROM table2 a
I want to only select the string (table if you know sql) after the string FROM into a new column. FROM can be spelled with different capitalizations (ie From, from,FROM,etc).
How do I select the string directly after the From but not the very next string after the FROM string
I tried:
df['tableName'] = df['query'].str.extract('[^from]*$')
but this is not working. I am not sure if I should make the entire df lowercase right off the bat.
New df should look like this:
id query tableName
1 select * from table1 where col1 = 1 table1
2 select a.columns FROM table2 a table2
Thank you in advance.
You can try
df['tableName'] = df['query'].str.extract('(?i)from ([^ ]*)')
(?i) means ignore case.
print(df)
id query tableName
0 1 select * from table1 where col1 = 1 table1
1 2 select a.columns FROM table2 a table2
This will get you your answer without a regex and should account for all capitalizations types of "table"
df['Table_Name'] = df['query'].apply(lambda x : x.lower().split('from')[1]).apply(lambda x : x.split()[0])

Error Snowflake - Unsupported subquery type cannot be evaluated

I am facing an error in snowflake saying "Unsupported subquery type cannot be evaluated" after for example executing the below statement. How should write this statement to avoid this error?
select A
from (
select b
, c
FROM test_table
) ;
The outer query column list needs to be within the column list of the subquery. example: select b from (select b,c from test_table);
ignoring "columns" the query you have shown will never trigger this error.
You would get it from this form though:
select A.*
from tableA as A
where a.x = (select b.y FROM test_table as b where b.z = a.z)
this form assuming there is only 1 b.y per b.z can be turned into a inner join like
select A.*
from tableA as A
join test_table as b
on b.z = a.z and a.x = b.y
other forms of this pattern do the likes of max(b.y) and those can be made into a sub-select like:
select A.*
from tableA as A
join (
select c.z, max(c.y) from test_table as c group by 1
) as b
on b.z = a.z and a.x = b.y
but the general pattern is, in other databases there is no "cost" to do row-by-row queries, where-as Snowflake is more optimal with pre-building tables of similar data, and then equi-joining those results together. So both the "how-to-write" example pivot from a for-each-row thinking to a build the set of all possible answers, and then join that. This allows for the most parallel processing of the data possible. And while it means you the develop need to understand your data to get he best performance out of it, in general if you are doing large scale data processing, you should be understanding your data. So this costs, is rather acceptable imho.
If you are trying to Match Two Attributes on the Subquery.
Use like below:
If both need to matched:
select * from Table WHERE a IN ( select b FROM test_table ) AND a IN ( select c FROM test_table )
If any one need to matched:
select * from Table WHERE a IN ( select b FROM test_table ) OR a IN ( select c FROM test_table )

how to delete the data from the Delta Table?

I was actually trying to delete the data from the Delta table.
When i run the below query, I'm getting data around 500 or 1000 records.
SELECT * FROM table1 inv
join (SELECT col1, col2, col2, min(Date) minDate, max(Date) maxDate FROM table2 a GROUP BY col1, col2, col3) aux
on aux.col1 = inv.col1 and aux.col2 = inv.col2 and aux.col3 = inv.col3
WHERE Date between aux.minDate and aux.maxDate
But when i try to delete that 500 records with the below query I'm getting error with syntax.
DELETE FROM table1 inv
join (SELECT col1, col2, col2, min(Date) minDate, max(Date) maxDate FROM table2 a GROUP BY col1, col2, col3) aux
on aux.col1 = inv.col1 and aux.col2 = inv.col2 and aux.col3 = inv.col3
WHERE Date between aux.minDate and aux.maxDate
Please someone help me here.
Thanks in advance :).
Here is the sql reference:
DELETE FROM table_identifier [AS alias] [WHERE predicate]
You can't use JOIN here, so expand your where clause according to your needs.
Here are some examples:
DELETE FROM table1
WHERE EXISTS (SELECT ... FROM table2 ...)
DELETE FROM table1
WHERE table1.col1 IN (SELECT ... FROM table2 WHERE ...)
DELETE FROM table1
WHERE table1.col1 NOT IN (SELECT ... FROM table2 WHERE ...)

{MS Access] How can I joint two tables without duplicate data?

I got two tables below:
And I want to create a query to combine them like below:
But unsuccessfully I got something like this:
Some data for "value1" duplicated
How can I solve this?
Is there any function that can have "value1" for the first "no." only?
Thank you.
You can do this by using a subquery to prepare the data.
It seems you want only the rows with the lowest Sub no to join, so we'll first select that:
SELECT [No], Value2
FROM Table2 m
WHERE
EXISTS(
SELECT 1
FROM Table2 s
WHERE s.[No] = m.[No]
HAVING MIN(s.sub_no) = m.sub_no
)
Then, integrate this into your main query:
SELECT *
FROM Table1
INNER JOIN (
SELECT [No], Value2
FROM Table2 m
WHERE
EXISTS(
SELECT 1
FROM Table2 s
WHERE s.[No] = m.[No]
HAVING MIN(s.sub_no) = m.sub_no
)
) AS T2 ON T1.[No] = T2.[No]

Resources