lookup in presto using single column against a range in lookup table - presto

I would like to perform a lookup in Presto where TableA contains my lookup column "lookup_code" and TableB has a range (start_, end_) with a lookup_range_description that I want to return.
TableA
# lookup_code
12
2333
50000
TableB
# start_,end_,lookup_range_description
2300,4000, AverageCost1
23300,239900, AverageCost2
193000,193999, AverageCost3
expected result
# lookup_code,start_,end_,lookup_range_description
12,''
2333,2300,4000, AverageCost1
50000,23300,239900, AverageCost2

You may want to use LEFT OUTER JOIN with BETWEEN like this.
select
a.lookup_code
,b.start_
,b.end_
,b.lookup_range_description
from TableA a
left outer join TableB b
on a.lookup_code between b.start_ and b.end_
lookup_code | start_ | end_ | lookup_range_description
-------------+--------+--------+--------------------------
12 | NULL | NULL | NULL
2333 | 2300 | 4000 | AverageCost1
50000 | 23300 | 239900 | AverageCost2

Related

Using group_by where with Spark SQL

Hello, I have created a Spark DataFrame by reading from a Parquet file, and it looks like this:
+-------+-----+-----+-----+-----+
| col-a|col-b|col-c|col-d|col-e|
+-------+-----+-----+-----+-----+
| .....|.....|.....|.....|.....|
+-------+-----+-----+-----+-----+
I want to do a group by using col-a and col-b and then find out how many groups have more than 1 unique row. For example, if we have rows
a | b | x | y | z |
a | b | x | y | z |
c | b | m | n | p |
c | b | m | o | r |
I want to find out where count > 1 from
col-a | col-b | count |
a | b | 1 |
c | b | 2 |
I wrote this query
"SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df GROUP BY `col-a`, `col-b` WHERE count > 1"
This generates the error mismatched input 'WHERE' expecting {<EOF>, ';'}. Can anyone tell me what I am doing wrong here?
I am currently working around it like this:
df_2 = spark.sql("SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count FROM df GROUP BY `col-a`, `col-b`")
df_2.createOrReplaceTempView("df_2")
spark.sql("SELECT * FROM df_2 WHERE `count` > 1").show()
Filtering on an aggregate after a GROUP BY requires a HAVING clause; WHERE has to come before GROUP BY.
It might work like this. You may also be able to refer to the aggregated column by index (not sure if that works in Spark), which would keep it cleaner.
SELECT
  `col-a`,
  `col-b`,
  count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count
FROM df
GROUP BY `col-a`, `col-b`
HAVING count(DISTINCT (`col-c`, `col-d`, `col-e`)) > 1
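If you would rather stay in the DataFrame API than register a second temp view, the same aggregate-then-filter pattern can be written as one chain. A minimal sketch, assuming df here is the DataFrame read from the Parquet file (not the temp view):
from pyspark.sql import functions as F

result = (
    df.groupBy("col-a", "col-b")
      .agg(F.countDistinct("col-c", "col-d", "col-e").alias("count"))
      .filter(F.col("count") > 1)   # filter after aggregation, i.e. the HAVING step
)
result.show()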

How to group by rollup on only some columns in Apache Spark SQL?

I'm using the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:
select
coalesce(a, "All A") as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by rollup (
colA,
colB
)
order by
colA asc,
colB asc
I'd then expect an output like:
+-------+-------+------+
| colA | colB | sumC |
+-------+-------+------+
| All A | All B | 300 |
| a1 | All B | 100 |
| a1 | b1 | 30 |
| a1 | b2 | 70 |
| a2 | All B | 200 |
| a2 | b1 | 50 |
| a2 | b2 | 150 |
+-------+-------+------+
However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:
select
a as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by
a,
rollup (b)
order by
colA asc,
colB asc
And I'd expect an output like:
+-------+-------+------+
| colA | colB | sumC |
+-------+-------+------+
| a1 | All B | 100 |
| a1 | b1 | 30 |
| a1 | b2 | 70 |
| a2 | All B | 200 |
| a2 | b1 | 50 |
| a2 | b2 | 150 |
+-------+-------+------+
I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.
I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.
Thanks in advance, and please let me know if I can clarify anything.
Try this GROUPING SETS option:
%sql
SELECT
COALESCE( a, 'all a' ) a,
COALESCE( b, 'all b' ) b,
SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b
My results (with updated numbers): [result screenshot omitted]
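One caveat with the COALESCE relabelling: rows where b is genuinely NULL in the data would also come out as 'all b'. Spark SQL's grouping() function distinguishes the rolled-up rows from real NULLs. A sketch assuming myTable is visible to spark.sql (e.g. as a table or temp view):
spark.sql("""
    SELECT
        a,
        CASE WHEN grouping(b) = 1 THEN 'all b' ELSE b END AS b,
        SUM(c) AS c
    FROM myTable
    GROUP BY a, b GROUPING SETS ( ( a, b ), a )
    ORDER BY a, b
""").show()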

Spark SQL - best way to programmatically loop over a table

Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are roots of the trees containing the nodes in the list. The output looks like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
CREATE TABLE #Tmp (
    Node_id int,
    Parent_id int
)
INSERT INTO #Tmp SELECT Node_id, Parent_id FROM Child_Nodes  -- the starting list of nodes

DECLARE @num int
SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
WHILE @num > 0
BEGIN
    INSERT INTO #Tmp
    SELECT
        p.Node_id,
        p.Parent_id
    FROM
        #Tmp t
        LEFT JOIN Nodes p ON t.Parent_id = p.Node_id

    SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
END

SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL
Just wanted to know if there's a more spark-centric way of doing this using pyspark, beyond the obvious method of simply looping over the dataframe using python.
parent_nodes = spark.sql("select Parent_id from table_name where Node_id in (8, 7)").distinct()
You can join the above dataframe with the table to get the Parent_id of those nodes as well.
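A single join like this only climbs one level, so for trees of arbitrary depth you can repeat it until every remaining row has a NULL Parent_id. A minimal PySpark sketch, where nodes (the full DataFrame) and start_ids (the list of nodes of interest, e.g. [8, 7]) are assumed names:
from pyspark.sql import functions as F

# Start from the rows for the nodes of interest.
current = nodes.filter(F.col("Node_id").isin(start_ids))

# Climb one level per iteration: replace every non-root row with its parent,
# keep any roots already found, and stop when no non-root rows remain.
while current.filter(F.col("Parent_id").isNotNull()).count() > 0:
    non_roots = current.filter(F.col("Parent_id").isNotNull()).alias("c")
    parents = (
        non_roots.join(nodes.alias("p"), F.col("c.Parent_id") == F.col("p.Node_id"))
                 .select("p.Node_id", "p.Parent_id")
    )
    roots = current.filter(F.col("Parent_id").isNull())
    current = roots.unionByName(parents).distinct()

current.show()  # for [8, 7] this returns the rows (1, NULL) and (6, NULL)
Each pass runs a separate Spark job (the count() is an action), so this is reasonable for shallow hierarchies; very deep trees would call for a different representation.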

How to combine two columns into one in Sqlite and also get the underlying value of the Foreign Key?

I want to be able to combine two columns from a table into one column and then be able to get the actual value of the foreign keys. I can do these things individually but not together.
Following the answer linked below, I was able to combine the two columns into one using the first SQL statement below.
How to combine 2 columns into a new one in sqlite
The combining process is shown below:
+---+---+
|HT | AT|
+---+---+
|1 | 2 |
|5 | 7 |
|9 | 5 |
+---+---+
into one column as shown:
+---+
|HT |
+---+
| 1 |
| 5 |
| 9 |
| 2 |
| 7 |
| 5 |
+---+
The second SQL statement shows the actual value of each foreign key corresponding to each foreign key id. The foreign key table:
+-----+------------------------+
|T_id | TN |
+-----+------------------------+
| 1 | 'Dallas Cowboys' |
| 2 | 'Chicago Bears' |
| 5 | 'New England Patriots' |
| 7 | 'New York Giants' |
| 9 | 'New York Jets' |
+-----+------------------------+
sql = "SELECT * FROM (SELECT M.HT FROM M UNION SELECT M.AT FROM Match)t"
The second sql statement lets me get the foreign key values for each value in M.HT.
sql = "SELECT M.HT, T.TN FROM M INNER JOIN T ON M.HT = T.Tid WHERE strftime('%Y-%m-%d', M.ST) BETWEEN \'2015-08-01\' AND \'2016-06-30\' AND M.Comp = 6 ORDER BY M.ST"
Result of second SQL statement:
+-----+------------------------+
| HT | TN |
+-----+------------------------+
| 1 | 'Dallas Cowboys' |
| 5 | 'New England Patriots' |
| 9 | 'New York Jets' |
+-----+------------------------+
But try as I might I have not been able to combine these queries!
I believe the following will work (assuming that the tables are Match and T and barring the WHERE and ORDER BY clauses for brevity/ease) :-
SELECT DISTINCT(m.ht), t.tn
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses reference Match, as m only has the ht column */
WHERE strftime('%Y-%m-%d', Match.ST)
BETWEEN '2015-08-01' AND '2016-06-30' AND Match.Comp = 6
ORDER BY Match.ST
;
Note: only tested without the WHERE and ORDER BY clauses.
That is using :-
DROP TABLE IF EXISTS Match;
DROP TABLE IF EXISTS T;
CREATE TABLE IF NOT EXISTS Match (ht INTEGER, at INTEGER, st TEXT DEFAULT (datetime('now')));
CREATE TABLE IF NOT EXISTS t (tid INTEGER PRIMARY KEY, tn TEXT);
INSERT INTO T (tn) VALUES('Cows'),('Bears'),('a'),('b'),('Pats'),('c'),('Giants'),('d'),('Jets');
INSERT INTO Match (ht,at) VALUES (1,2),(5,7),(9,5);
/* Directly without the Common Table Expression */
SELECT
DISTINCT(m.ht), t.tn,
Match.st /*<<<<< Added to show results of obtaining other values from Matches >>>>> */
FROM
(SELECT Match.HT FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
/* WHERE and ORDER BY clauses here using Match */
;
Noting that limited data (just the one extra column) was used for brevity.
Results in :- [result screenshot omitted]
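Since the question builds its SQL as Python strings, the combined query can also be run from Python's sqlite3 module with bound parameters rather than escaped quote characters. A minimal sketch, assuming the real Match table has the ST and Comp columns from the question and that the database file name (teams.db) is a placeholder:
import sqlite3

sql = """
SELECT DISTINCT m.ht, t.tn
FROM (SELECT Match.HT AS ht FROM Match UNION SELECT Match.AT FROM Match) AS m
JOIN T ON t.tid = m.ht
JOIN Match ON (m.ht = Match.ht OR m.ht = Match.at)
WHERE strftime('%Y-%m-%d', Match.ST) BETWEEN ? AND ?
  AND Match.Comp = 6
ORDER BY t.tn  -- order by a selected column so DISTINCT and ORDER BY coexist
"""

conn = sqlite3.connect("teams.db")  # placeholder database file
rows = conn.execute(sql, ("2015-08-01", "2016-06-30")).fetchall()
for ht, tn in rows:
    print(ht, tn)
conn.close()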

Power query merge tables any-to-any

I'm new to MS Excel Power Query and I can't find a solution to this problem on Google.
Join the tables any-to-any (every row of Table1 with every row of Table2).
Table1 Table2
+-----+ +-----+
| A | | 1 |
| B | | 2 |
+-----+ +-----+
Merge Table1 and Table2 to Table3
Table3
+-----+-----+
| A | 1 |
| A | 2 |
| B | 1 |
| B | 2 |
+-----+-----+
The link Hakan provided is great, so I'll just summarize it here.
Starting with your Table1, go to Add Column > Custom Column and simply input Table2 as the formula.
Once that column is created, click the expand button and choose which columns from Table2 to expand.
This should result in the desired table.
