Hello, I have created a Spark DataFrame by reading from a Parquet file, and it looks like this:
+-------+-----+-----+-----+-----+
| col-a|col-b|col-c|col-d|col-e|
+-------+-----+-----+-----+-----+
| .....|.....|.....|.....|.....|
+-------+-----+-----+-----+-----+
I want to group by col-a and col-b and then find out how many groups have more than one unique row. For example, if we have the rows
a | b | x | y | z |
a | b | x | y | z |
c | b | m | n | p |
c | b | m | o | r |
I want to find the groups where count > 1 from
col-a | col-b | count |
a | b | 1 |
c | b | 2 |
I wrote this query
"SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count \
FROM df GROUP BY `col-a`, `col-b` WHERE count > 1"
This generates the error mismatched input 'WHERE' expecting {<EOF>, ';'}. Can anyone help me understand what I am doing wrong here?
I am currently working around it like this:
df_2 = spark.sql("SELECT `col-a`, `col-b`, count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count FROM df GROUP BY `col-a`, `col-b`")
df_2.createOrReplaceTempView("df_2")
spark.sql("SELECT * FROM df_2 WHERE `count` > 1").show()
Filtering on an aggregate after a GROUP BY requires a HAVING clause; WHERE has to come before GROUP BY.
It might work like this. You may also be able to refer to the complex column by index (not sure if that works in Spark), which would keep it cleaner.
SELECT
`col-a`,
`col-b`,
count(DISTINCT (`col-c`, `col-d`, `col-e`)) AS count
FROM df
GROUP BY `col-a`, `col-b`
HAVING count(DISTINCT (`col-c`, `col-d`, `col-e`)) > 1
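For reference, the same logic is a short sketch with the DataFrame API (assuming df here is the DataFrame behind the temp view; countDistinct comes from pyspark.sql.functions):
from pyspark.sql import functions as F

# Count distinct (col-c, col-d, col-e) combinations per (col-a, col-b) group,
# then keep only the groups with more than one distinct combination.
result = (
    df.groupBy("col-a", "col-b")
      .agg(F.countDistinct("col-c", "col-d", "col-e").alias("count"))
      .filter(F.col("count") > 1)
)
result.show()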
Below is my sample dataframe for household things.
Here W represents Wooden, G represents Glass, and P represents Plastic, and different items are classified under those categories.
I want to identify which category (W, G, or P) each item falls in. As an initial step, I tried classifying it for Chair:
M = sqlContext.createDataFrame([('W-Chair-Shelf;G-Vase;P-Cup',''),
('W-Chair',''),
('W-Shelf;G-Cup;P-Chair',''),
('G-Cup;P-ShowerCap;W-Board','')],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| |
| W-Chair| |
| W-Shelf;G-Cup;P-Chair| |
| G-Cup;P-ShowerCap;W-Board| |
+-----------------------------+-----+
I tried to do it for one condition where I can mark it as W, but I am not getting the expected results; maybe my condition is wrong.
df = sqlContext.sql("select * from M where Household_chores_arrangements like '%W%Chair%'")
display(df)
Is there a better way to do this in PySpark?
Expected output
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
| W-Chair-Shelf;G-Vase;P-Cup| W|
| W-Chair| W|
| W-Shelf;G-Cup;P-Chair| P|
| G-Cup;P-ShowerCap;W-Board| NULL|
+-----------------------------+-----+
Thanks @mck for the solution.
Update
In addition, I was trying to analyse the regexp_extract option further, so I altered the sample set:
M = sqlContext.createDataFrame([('Wooden|Chair',''),
('Wooden|Cup;Glass|Chair',''),
('Wooden|Cup;Glass|Showercap;Plastic|Chair','') ],
['Household_chores_arrangements','Chair'])
M.createOrReplaceTempView('M')
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '(Wooden|Glass|Plastic)(|Chair)', 1), '') as Chair
from M
""")
display(df)
Result:
+----------------------------------------+------+
|Household_chores_arrangements           |Chair |
+----------------------------------------+------+
|Wooden|Chair                            |Wooden|
|Wooden|Cup;Glass|Chair                  |Wooden|
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Wooden|
+----------------------------------------+------+
I changed the delimiter to | instead of - and made changes in the query as well. I was expecting the results below, but got a wrong result:
+----------------------------------------+-------+
|Household_chores_arrangements           |Chair  |
+----------------------------------------+-------+
|Wooden|Chair                            |Wooden |
|Wooden|Cup;Glass|Chair                  |Glass  |
|Wooden|Cup;Glass|Showercap;Plastic|Chair|Plastic|
+----------------------------------------+-------+
If only the delimiter is changed, do we need to change any other values?
Update 2
I have got the solution for the above-mentioned update.
For a pipe delimiter we have to escape it using four backslashes (\\\\).
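To make that concrete, a sketch of the query with the escaped pipe, assuming the pattern goes through a normal (non-raw) Python string and Spark's default string-literal escaping, so the four backslashes end up as \| in the regex (the unescaped pipes inside the parentheses remain regex alternation):
df = spark.sql("""
    select
        Household_chores_arrangements,
        nullif(regexp_extract(Household_chores_arrangements,
                              '(Wooden|Glass|Plastic)\\\\|Chair', 1), '') as Chair
    from M
""")
display(df)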
You can use regexp_extract to extract the categories and, if no match is found, replace the empty string with null using nullif.
df = spark.sql("""
select
Household_chores_arrangements,
nullif(regexp_extract(Household_chores_arrangements, '([A-Z])-Chair', 1), '') as Chair
from M
""")
df.show(truncate=False)
+-----------------------------+-----+
|Household_chores_arrangements|Chair|
+-----------------------------+-----+
|W-Chair-Shelf;G-Vase;P-Cup |W |
|W-Chair |W |
|W-Shelf;G-Cup;P-Chair |P |
|G-Cup;P-ShowerCap;W-Board |null |
+-----------------------------+-----+
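The same thing is possible with the DataFrame API instead of SQL; a small sketch (regexp_extract and when come from pyspark.sql.functions):
from pyspark.sql import functions as F

# Extract the single letter before "-Chair"; regexp_extract returns '' when the
# pattern does not match, so the empty string is then replaced with null.
extracted = F.regexp_extract("Household_chores_arrangements", r"([A-Z])-Chair", 1)
result = M.withColumn("Chair", F.when(extracted == "", None).otherwise(extracted))
result.show(truncate=False)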
I have the following 2 tables for which I have to check the existence of values between them using a correlated sub-query.
The requirement is: for each record in the orders table, check whether the corresponding custid is present in the customer table, and output a field (named FLAG) with the value Y if the custid exists and N if it doesn't.
orders:
orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW
customer:
id | custid
1 | XYZ
2 | UVW
Expected Output:
orderid | custid | FLAG
12345 | XYZ | Y
34566 | XYZ | Y
68790 | MNP | N
59876 | QRS | N
15620 | UVW | Y
I tried something like the following but couldn't get it to work -
select
o.orderid,
o.custid,
case when o.custid EXISTS (select 1 from customer c on c.custid = o.custid)
then 'Y'
else 'N'
end as flag
from orders o
Can this be solved with a correlated scalar sub-query? If not, what is the best way to implement this requirement?
Please advise.
Note: using Spark SQL query v2.4.0
Thanks.
IN/EXISTS predicate sub-queries can only be used in a filter in Spark.
The following works in a locally recreated copy of your data:
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)
Here's how it works with recreated data:
def textToView(csv: String, viewName: String) = {
  import spark.implicits._  // needed for .toDS on the parallelized lines
  spark.read
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("delimiter", "|")
    .option("header", "true")
    .csv(spark.sparkContext.parallelize(csv.split("\n")).toDS)
    .createOrReplaceTempView(viewName)
}
textToView("""id | custid
1 | XYZ
2 | UVW""", "customer")
textToView("""orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW""", "orders")
spark.sql("""
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)""").show
Which returns:
+-------+------+-----------------+
|orderid|custid|existing_customer|
+-------+------+-----------------+
| 59876| QRS| N|
| 12345| XYZ| Y|
| 34566| XYZ| Y|
| 68790| MNP| N|
| 15620| UVW| Y|
+-------+------+-----------------+
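If you are working from PySpark rather than Scala, the same left-join idea can be sketched like this (orders_df and customer_df are assumed names for DataFrames holding the two tables):
from pyspark.sql import functions as F

# Left join orders to customer; a null custid on the customer side means no match.
flagged = (
    orders_df.alias("o")
    .join(customer_df.select("custid").distinct().alias("c"),
          F.col("o.custid") == F.col("c.custid"), "left")
    .select("o.orderid", "o.custid",
            F.when(F.col("c.custid").isNull(), "N").otherwise("Y").alias("FLAG"))
)
flagged.show()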
Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are roots of the trees containing the nodes in the list. The output looks like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
CREATE TABLE #Tmp (
    Node_id int,
    Parent_id int
)
DECLARE @num int

INSERT INTO #Tmp SELECT * FROM Child_Nodes
SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
WHILE @num > 0
BEGIN
    INSERT INTO #Tmp
    SELECT
        p.Node_id,
        p.Parent_id
    FROM #Tmp t
    LEFT JOIN Nodes p
        ON t.Parent_id = p.Node_id

    SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
END
SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL
Just wanted to know if there's a more spark-centric way of doing this using pyspark, beyond the obvious method of simply looping over the dataframe using python.
parent_nodes = spark.sql("select Parent_id from table_name where Node_id in (2, 7)").distinct()
You can then join the above dataframe back to the table to get the Parent_id of those parents as well, and repeat until you reach rows whose Parent_id is null (the roots).
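To make that iteration concrete, here is a rough PySpark sketch (nodes_df is an assumed name for the dataframe above, and [8, 7] is the node list from the question); the loop control runs in Python, but each step is a Spark join:
from pyspark.sql import functions as F

# Start from the target nodes and repeatedly replace non-root rows with their
# parent rows; one Spark join per tree level, until only roots remain.
current = nodes_df.filter(F.col("Node_id").isin(8, 7))
while current.filter(F.col("Parent_id").isNotNull()).count() > 0:
    roots = current.filter(F.col("Parent_id").isNull())
    parents = (
        current.filter(F.col("Parent_id").isNotNull())
        .select(F.col("Parent_id").alias("Node_id"))
        .join(nodes_df, on="Node_id", how="inner")
    )
    current = roots.unionByName(parents)
current.distinct().show()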
I have three tables: a, b and c.
Table a is related to table b through the column key.
Table b is related to table c through the columns word, sense and speech. In addition, table c holds a column id.
Now, some rows in a.word have no matching value in b.word. Based on that,
I want to inner join the tables on the condition: if a.word = b.word then join on that, otherwise compare only a.end_key = b.key.
As a result I want a table in the form of a, with extra columns start_id and end_id from c matching key_start and key_end.
I tried the following SQL command with Python:
CREATE TABLE relations
AS
SELECT * FROM
c
INNER JOIN
a
INNER JOIN
b
ON
a.end_key = b.key
AND
a.start_key = b.key
AND
b.word = c.word
AND
b.speech = c.speech
AND
b.sense = c.sense
OR
a.word = b.word
a:
+-----------+---------+------+-----------+
| key_start | key_end | word | relation |
+-----------+---------+------+-----------+
| k5 | k1 | tree | h |
| k7 | k2 | car | m |
| k200 | k3 | bad | ho |
+-----------+---------+------+-----------+
b:
+-----+------+--------+-------+
| key | word | speech | sense |
+-----+------+--------+-------+
| k5 | sky | a | 1 |
| k2 | car | a | 1 |
| k3 | bad | n | 2 |
+-----+------+--------+-------+
c:
+----+---------+--------+-------+
| id | word | speech | sense |
+----+---------+--------+-------+
| 0 | light | a | 1 |
| 0 | dark | b | 3 |
| 1 | neutral | a | 2 |
+----+---------+--------+-------+
Edit for clarification:
Tables a, b and c hold hundreds of thousands of lines, so there are matching values in the tables. Table a is related to table b through the end_key ~ key and start_key ~ key relations. Table b is related to c through word, sense and speech; there are values which match in each of these columns.
The desired table is of the form
start_id | key_start | key_end | end_id | relation
where start_id matches key_start and key_end matches end_id.
EDIT new answer
The problem with the proposed query lies in the use of ANDs and ORs (and likely missing parentheses). The requirement
if a.word = b.word then join, otherwise compare only a.end_key = b.key
would translate to:
AND (a.word = b.word OR a.end_key = b.key)
Maybe try it like this:
ON
b.word = c.word
AND
b.speech = c.speech
AND
b.sense = c.sense
AND
(a.word = b.word OR a.end_key = b.key)
It would be a good idea to test in a SQLite manager (e.g. the command-line sqlite3, or DB Browser for SQLite) before you try it in Python; troubleshooting is much easier. And of course test the SELECT before you implement it in a CREATE TABLE.
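For example, a self-contained sqlite3 test of the corrected join might look like this (sample rows from the question; the question mixes end_key and key_end, so key_end from table a's sample is used here, and with this tiny sample nothing in b matches c, so the result is empty):
import sqlite3

# Build the three sample tables in an in-memory database and run the corrected join.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE a (key_start TEXT, key_end TEXT, word TEXT, relation TEXT)")
cur.execute("CREATE TABLE b (key TEXT, word TEXT, speech TEXT, sense INTEGER)")
cur.execute("CREATE TABLE c (id INTEGER, word TEXT, speech TEXT, sense INTEGER)")
cur.executemany("INSERT INTO a VALUES (?, ?, ?, ?)",
                [("k5", "k1", "tree", "h"), ("k7", "k2", "car", "m"), ("k200", "k3", "bad", "ho")])
cur.executemany("INSERT INTO b VALUES (?, ?, ?, ?)",
                [("k5", "sky", "a", 1), ("k2", "car", "a", 1), ("k3", "bad", "n", 2)])
cur.executemany("INSERT INTO c VALUES (?, ?, ?, ?)",
                [(0, "light", "a", 1), (0, "dark", "b", 3), (1, "neutral", "a", 2)])
rows = cur.execute("""
    SELECT *
    FROM a
    INNER JOIN b
        ON (a.word = b.word OR a.key_end = b.key)
    INNER JOIN c
        ON b.word = c.word
       AND b.speech = c.speech
       AND b.sense = c.sense
""").fetchall()
print(rows)  # empty for this sample data, as noted below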
You could clarify your question by showing the desired columns and the result in the relations table that this sample data would create (there is nothing between b and c that would match on word, speech, sense). Also, the description of the relationship between a and b is confusing: the first paragraph says table a is related to table b through the column key. Should key be word?
I have a Spark DataFrame with data like this
| id | value1 |value2 |
------------------------
| 1 | null | 1 |
| 1 | 2 | null |
And I want to transform it into
| id | value1 |value2 |
-----------------------
| 1 | 2 | 1 |
That is, I need to get the rows with the same id and merge their values into a single row.
Could you explain what the most scalable way to do this is?
df.groupBy("id").agg(collect_set("value1").alias("value1"), collect_set("value2").alias("value2"))
// a more elegant way of doing it for dynamic columns
df.groupBy("id").agg(df.columns.tail.map((_ -> "collect_set")).toMap).show
// for Spark 1.5
val df1 = df.rdd.map(i => (i(0).toString, i(1).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
val df2 = df.rdd.map(i => (i(0).toString, i(2).toString)).groupByKey.mapValues(_.toSet.toList.filter(_ != "null")).toDF()
df1.join(df2, df1("_1") === df2("_1"), "inner").drop(df2("_1")).show
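In PySpark, if each id has at most one non-null value per column (as in the sample), an alternative sketch uses first with ignorenulls instead of collect_set:
from pyspark.sql import functions as F

# Group by id and keep the first non-null value1 / value2 in each group.
merged = df.groupBy("id").agg(
    F.first("value1", ignorenulls=True).alias("value1"),
    F.first("value2", ignorenulls=True).alias("value2"),
)
merged.show()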