Correlated Subquery in Spark SQL - apache-spark

I have the following two tables and need to check for the existence of values between them using a correlated sub-query.
The requirement: for each record in the orders table, check whether its custid is present in the customer table, and output a field (named FLAG) with the value Y if the custid exists, otherwise N.
orders:
orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW
customer:
id | custid
1 | XYZ
2 | UVW
Expected Output:
orderid | custid | FLAG
12345 | XYZ | Y
34566 | XYZ | Y
68790 | MNP | N
59876 | QRS | N
15620 | UVW | Y
I tried something like the following but couldn't get it to work -
select
o.orderid,
o.custid,
case when o.custid EXISTS (select 1 from customer c on c.custid = o.custid)
then 'Y'
else 'N'
end as flag
from orders o
Can this be solved with a correlated scalar sub-query? If not, what is the best way to implement this requirement?
Please advise.
Note: using Spark SQL v2.4.0.
Thanks.

IN/EXISTS predicate sub-queries can only be used in a filter (WHERE clause) in Spark, not inside an expression such as the CASE in your query.
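For reference, the same EXISTS sub-query is accepted when it appears as a WHERE-clause filter; a minimal sketch (using the orders and customer views recreated further down):
// Works in Spark 2.4: EXISTS is used as a filter predicate, not inside CASE
spark.sql("""
  select o.orderid, o.custid
  from orders o
  where exists (select 1 from customer c where c.custid = o.custid)
""").show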
The following works in a locally recreated copy of your data:
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)
Here's how it works with recreated data:
import spark.implicits._  // needed for .toDS outside of spark-shell

def textToView(csv: String, viewName: String) = {
  spark.read
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("delimiter", "|")
    .option("header", "true")
    .csv(spark.sparkContext.parallelize(csv.split("\n")).toDS)
    .createOrReplaceTempView(viewName)
}
textToView("""id | custid
1 | XYZ
2 | UVW""", "customer")
textToView("""orderid | custid
12345 | XYZ
34566 | XYZ
68790 | MNP
59876 | QRS
15620 | UVW""", "orders")
spark.sql("""
select orderid, custid, case when existing_customer is null then 'N' else 'Y' end existing_customer
from (select o.orderid, o.custid, c.custid existing_customer
from orders o
left join customer c
on c.custid = o.custid)""").show
Which returns:
+-------+------+-----------------+
|orderid|custid|existing_customer|
+-------+------+-----------------+
| 59876| QRS| N|
| 12345| XYZ| Y|
| 34566| XYZ| Y|
| 68790| MNP| N|
| 15620| UVW| Y|
+-------+------+-----------------+
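The same logic can also be expressed with the DataFrame API instead of SQL; a minimal sketch (assuming the same orders and customer views), using when/otherwise in place of the CASE expression:
import org.apache.spark.sql.functions.{col, when}

// Left join orders to customer, then flag rows that found no matching custid
val flagged = spark.table("orders").as("o")
  .join(spark.table("customer").as("c"), col("o.custid") === col("c.custid"), "left")
  .select(
    col("o.orderid"),
    col("o.custid"),
    when(col("c.custid").isNull, "N").otherwise("Y").as("FLAG")
  )

flagged.show()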

Related

Collapse DataFrame using Window functions

I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1  | 09:00     | ABC       | apple     |
| 1  | 08:00     | NULL      | NULL      |
| 1  | 18:00     | XYZ       | apple     |
| 2  | 07:00     | NULL      | banana    |
| 5  | 23:00     | ABC       | cherry    |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1  | 18:00     | XYZ       | apple     | 3     |
| 2  | 07:00     | NULL      | banana    | 1     |
| 5  | 23:00     | ABC       | cherry    | 1     |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import asc, desc, first, col, count, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']

df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
Without using pyspark.sql.Window.partitionBy, this does not give the desired output.
I only saw after posting that you wanted to avoid partitioning by ID; this partition-based approach is the only one I could come up with.
Your dataframe:
df = sqlContext.createDataFrame(
    [
        ('1', '09:00', 'ABC', 'apple'),
        ('1', '08:00', '', ''),
        ('1', '18:00', 'XYZ', 'apple'),
        ('2', '07:00', '', 'banana'),
        ('5', '23:00', 'ABC', 'cherry')
    ],
    ['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank, partitioning by ID and ordering by timestamp descending:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

w1 = Window.partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window.partitionBy(df['ID'])

df\
    .withColumn("rank", F.rank().over(w1))\
    .withColumn("count", F.count('ID').over(w2))\
    .filter(F.col('rank') == 1)\
    .select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
    .show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+

Insert delta records in master and update existing column value if match found

I have a master table
#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#| Tom       | M        | 2021-02-15  |
#| Bob       | M        | 2021-03-02  |
#| Kelly     | F        | 2021-06-01  |
#+-----------+----------+-------------+
And a daily table. A daily table can contain data with the following conditions:
1) totally new records, 2) the date column updated for existing records.
#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#| Tom       | M        | 2021-03-20  |  date updated
#| suzen     | F        | 2021-06-10  |  new record
#+-----------+----------+-------------+
Expected output: the master table should have all the new records coming in daily,
plus, if any name matches the master table, its date should be updated with the new date from daily.
#+-----------+----------+-------------+
#| Name      | Gender   | date        |
#+-----------+----------+-------------+
#| Tom       | M        | 2021-03-20  |  date updated from daily
#| Bob       | M        | 2021-03-02  |
#| Kelly     | F        | 2021-06-01  |
#| suzen     | F        | 2021-06-10  |  new record
#+-----------+----------+-------------+
For ease, let's take Name as the unique identifier in both tables.
One way is to full outer join both of these tables and get the result:
select (case when d.name is null or d.name = '' then m.name
             when m.name is null or m.name = '' then d.name
             else m.name end) as name,
       (case when d.gender is null or d.gender = '' then m.gender
             when m.gender is null or m.gender = '' then d.gender
             else m.gender end) as gender,
       (case when d.date is null or d.date = '' then m.date
             when m.date is null or m.date = '' then d.date
             else d.date end) as date
from master m
full outer join daily d
on m.name = d.name
What is a better and more performant way to achieve the expected output?
To produce a new table (dataframe) based on your criteria, your solution is fine.
But if you want to update the master table in place using the daily table, Delta Lake supports upserting into a table with merge:
from delta.tables import DeltaTable

master_table = DeltaTable.forPath(spark, "/path/to/master")
# merge() takes a DataFrame as its source, so read the daily Delta table as one
daily_df = DeltaTable.forPath(spark, "/path/to/daily").toDF()

master_table.alias("master").merge(
    daily_df.alias("daily"),
    "master.Name = daily.Name") \
  .whenMatchedUpdate(set = {"Gender": "daily.Gender", "date": "daily.date"}) \
  .whenNotMatchedInsert(values = {
      "Name": "daily.Name",
      "Gender": "daily.Gender",
      "date": "daily.date"
  }) \
  .execute()
or using SQL
MERGE INTO master
USING daily
ON master.Name = daily.Name
WHEN MATCHED THEN
  UPDATE SET master.Gender = daily.Gender, master.date = daily.date
WHEN NOT MATCHED THEN
  INSERT (Name, Gender, date) VALUES (daily.Name, daily.Gender, daily.date)
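If the daily data is only available as a DataFrame, one way to run the SQL version is to register it as a temporary view first. A minimal sketch in Scala, assuming the master table is saved as a Delta table registered in the metastore and the Delta Lake SQL extensions are enabled (the daily path is a hypothetical placeholder):
// Hypothetical path; replace with wherever the daily data lives
val dailyDF = spark.read.format("delta").load("/path/to/daily")
dailyDF.createOrReplaceTempView("daily")

spark.sql("""
  MERGE INTO master
  USING daily
  ON master.Name = daily.Name
  WHEN MATCHED THEN
    UPDATE SET master.Gender = daily.Gender, master.date = daily.date
  WHEN NOT MATCHED THEN
    INSERT (Name, Gender, date) VALUES (daily.Name, daily.Gender, daily.date)
""")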

Spark Java - Replace specific String with another String in a dataset

I'm writing a Spark application where I have a dataset with 100 fields. I want to replace "account" with "acct" in all 100 fields.
dataset.show();
+-------+-------+---------+-----------+----------+
|id     |loc    |price    |description|postdate  |
+-------+-------+---------+-----------+----------+
|001    |account|315000.25|account    |2020-06-01|
|account|account|account  |sampledes  |2020-06-05|
|003    |kochin |315000   |           |account   |
|004    |madurai|null     |abc        |          |
|005    |account|15000.20 |n.a        |2021/12/01|
+-------+-------+---------+-----------+----------+
Result: replace account with acct in all the fields.
+-------+-------+---------+-----------+----------+
|id     |loc    |price    |description|postdate  |
+-------+-------+---------+-----------+----------+
|001    |acct   |315000.25|acct       |2020-06-01|
|acct   |acct   |acct     |sampledes  |2020-06-05|
|003    |kochin |315000   |           |acct      |
|004    |madurai|null     |abc        |          |
|005    |acct   |15000.20 |n.a        |2021/12/01|
+-------+-------+---------+-----------+----------+
I see the regexp_replace function, but we would have to write it for each column, so I am looking for an alternative.
Thanks in advance.
You can try this code, which iterates over all columns and updates each one with the replaced string:
import org.apache.spark.sql.functions.{col, regexp_replace}

oldDF.show()

val newDF = oldDF.columns.foldLeft(oldDF) { (replaceDF, colName) =>
  replaceDF.withColumn(
    colName,
    regexp_replace(col(colName), "account", "acct")
  )
}
newDF.show()
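If some of the 100 fields are not string-typed, a hedged refinement is to apply the replacement only to the string columns taken from the schema (a sketch reusing oldDF from the snippet above):
import org.apache.spark.sql.functions.{col, regexp_replace}
import org.apache.spark.sql.types.StringType

// Collect only the string-typed column names, then rewrite just those columns
val stringCols = oldDF.schema.fields.collect {
  case f if f.dataType == StringType => f.name
}

val replacedDF = stringCols.foldLeft(oldDF) { (df, colName) =>
  df.withColumn(colName, regexp_replace(col(colName), "account", "acct"))
}
replacedDF.show()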

How to concatenate spark dataframe columns using Spark sql in databricks

I have two columns called "FirstName" and "LastName" in my dataframe; how can I concatenate these two columns into one?
|Id |FirstName|LastName|
| 1 | A | B |
| | | |
| | | |
I want to make it like this
|Id |FullName |
| 1 | AB |
| | |
| | |
My query looks like this, but it raises an error:
val kgt=spark.sql("""
Select Id,FirstName+' '+ContactLastName AS FullName from tblAA """)
kgt.createOrReplaceTempView("NameTable")
Here we go with the Spark SQL solution:
spark.sql("select Id, CONCAT(FirstName,' ',LastName) as FullName from NameTable").show(false)
OR
spark.sql( " select Id, FirstName || ' ' ||LastName as FullName from NameTable ").show(false)
Or with the PySpark DataFrame API:
from pyspark.sql import functions as F
df = df.withColumn('FullName', F.concat(F.col('FirstName'), F.col('LastName')))
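One caveat with concat: it returns null if any input column is null. A hedged alternative in the Scala DataFrame API is concat_ws, which skips nulls (a sketch; it assumes a NameTable view with FirstName/LastName columns, as in the queries above):
import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws ignores null inputs, so a missing LastName does not null out FullName
spark.table("NameTable")
  .select(col("Id"), concat_ws(" ", col("FirstName"), col("LastName")).as("FullName"))
  .show(false)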

How to compare two tables and replace nulls with values from other table

I am working on an assignment where we have two tables with the same columns. If a record in table A has null column values, they should be updated with the values from table B, and vice versa.
table A
id | code | type
1 | null | A
2 | null | null
3 | 123 | C
table B
id | code | type
1 | 456 | A
2 | 789 | A1
3 | null | C
What I have tried so far:
Dataset<Row> df1 = spark.read().format("csv").option("header", "true")
        .load("C:\\Users\\System2\\Videos\\1199_data\\d1_1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true")
        .load("C:\\Users\\System2\\Videos\\1199_data\\d2_1.csv");

df1.as("a").join(df2.as("b"))
   .where("a.id == b.id")
   .withColumn("a.code",
        functions.when(
            df1.col("code").isNull(),
            df2.col("code")))
   .show();
Required Output
table C
id | code | type
1 | 456 | A
2 | 789 | A1
3 | 123 | C
You can use the coalesce function:
import org.apache.spark.sql.functions.coalesce

df1.join(df2, "id")
   .select(df1("id"),
           coalesce(df1("code"), df2("code")).as("code"),
           coalesce(df1("type"), df2("type")).as("type"))
And output:
+---+----+----+
| id|code|type|
+---+----+----+
| 1| 456| A|
| 2| 789| A1|
| 3| 123| C|
+---+----+----+
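If the two tables share many columns, a hedged generalization of this answer is to build the coalesce list programmatically from the column names instead of spelling each one out (a sketch; it assumes both dataframes have identical schemas and join on id):
import org.apache.spark.sql.functions.coalesce

// Coalesce every non-key column, preferring df1's value and falling back to df2's
val valueCols = df1.columns.filter(_ != "id")
df1.join(df2, "id")
   .select((df1("id") +: valueCols.map(c => coalesce(df1(c), df2(c)).as(c))): _*)
   .show()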
