A crossJoin can be done as follows:
import pandas as pd
from datetime import datetime, timedelta

date_today = datetime.now()  # assumed reference date (not defined in the original snippet)

df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
import numpy as np

df1 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2],
                    'subgroup': ['A', 'B', 'C', 'D', 'A', 'B', 'D', 'E']})
df2 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2],
                    'dates': np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),
                                        np.array(pd.date_range(date_today + timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user cross join should be a dataframe with 32 rows. Is this possible in PySpark, and how can it be done?
A cross join is a join that multiplies rows, because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and a regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
.join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark 2+ we get an error, because Spark detects that we are trying to do a cross join (cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
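If a full cartesian product is really what you want, you can request it explicitly instead of relying on a trivial key; a minimal sketch (the configuration name is the one quoted in the error above):

# explicit cross join, no dummy key needed
df = df1.crossJoin(df2)

# or allow implicit cartesian products (the Spark 2.x setting from the error message)
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])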
If your joining key (user here) does not uniquely identify rows, you will also get a multiplication of lines, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
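Applied to the dataframes from the question (assuming df1 and df2 are the pandas frames defined there, each with 4 rows per user), the same plain join on user gives the requested 32 rows:

sdf1 = spark.createDataFrame(df1)   # user, subgroup: 4 rows per user
sdf2 = spark.createDataFrame(df2)   # user, dates: 4 rows per user
sdf1.join(sdf2, on='user').count()  # 4 * 4 + 4 * 4 = 32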
Note: Using a self join followed by a filter usually means you should be using window functions instead.
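For example, a minimal sketch (assuming the hypothetical task of keeping, per user, the rows with the largest value2): instead of joining df2 to a per-user aggregate and filtering, compute the aggregate over a window:

from pyspark.sql import Window

w = Window.partitionBy('user')
df2.withColumn('max_value2', psf.max('value2').over(w)) \
    .where(psf.col('value2') == psf.col('max_value2')) \
    .drop('max_value2') \
    .show()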
I have the two DataFrames below:
MasterDF
NumberDF (created using a Hive load)
Desired output:
Logic to populate:
For Field1, pick sch_id where CAT='PAY' and SUB_CAT='client'.
For Field2, pick sch_id where CAT='PAY' and SUB_CAT='phr'.
For Field3, pick pay_id where CAT='credit' and SUB_CAT='spGrp'.
Currently, before joining, I filter NumberDF and then pick the value.
Example:
masterDF.as("master")
  .join(NumberDF.filter(col("CAT") === "PAY" && col("SUB_CAT") === "phr").as("number"),
        $"master.id" === $"number.id", "leftouter")
  .select($"master.*", $"number.sch_id".as("field1"))
The above approach would need multiple joins. I looked into the pivot function, but it did not solve my problem.
Note: Please ignore the syntax error in code
A better solution is to pivot the DataFrame (numberDF) on the column (subject) before joining it with studentDF.
The PySpark code looks like this:
numberDF = spark.createDataFrame([(1, "Math", 80), (1, "English", 60), (1, "Science", 80)], ["id", "subject", "marks"])
studentDF = spark.createDataFrame([(1, "Vikas")],["id","name"])
>>> numberDF.show()
+---+-------+-----+
| id|subject|marks|
+---+-------+-----+
| 1| Math| 80|
| 1|English| 60|
| 1|Science| 80|
+---+-------+-----+
>>> studentDF.show()
+---+-----+
| id| name|
+---+-----+
| 1|Vikas|
+---+-----+
pivotNumberDF = numberDF.groupBy("id").pivot("subject").sum("marks")
>>> pivotNumberDF.show()
+---+-------+----+-------+
| id|English|Math|Science|
+---+-------+----+-------+
| 1| 60| 80| 80|
+---+-------+----+-------+
>>> studentDF.join(pivotNumberDF, "id").show()
+---+-----+-------+----+-------+
| id| name|English|Math|Science|
+---+-----+-------+----+-------+
| 1|Vikas| 60| 80| 80|
+---+-----+-------+----+-------+
ref: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html
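One optional refinement (not required for the example above): pivot also accepts an explicit list of values, which both fixes the output columns and saves Spark an extra pass to compute the distinct subjects, e.g.:

pivotNumberDF = numberDF.groupBy("id").pivot("subject", ["English", "Math", "Science"]).sum("marks")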
Finally, I implemented it using pivot:
flights.groupBy("ID", "CAT")
  .pivot("SUB_CAT", Seq("client", "phr", "spGrp"))
  .agg(avg("SCH_ID").as("SCH_ID"), avg("pay_id").as("pay_id"))
  .groupBy("ID")
  .pivot("CAT", Seq("credit", "pay"))
  .agg(
    avg("client_SCH_ID").as("client_sch_id"), avg("client_pay_id").as("client_pay_id"),
    avg("phr_SCH_ID").as("phr_SCH_ID"), avg("phr_pay_id").as("phr_pay_id"),
    avg("spGrp_SCH_ID").as("spGrp_SCH_ID"), avg("spGrp_pay_id").as("spGrp_pay_id")
  )
The first pivot returns a table like:
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| ID| CAT|client_SCH_ID|client_pay_id |phr_SCH_ID |phr_pay_id |spnGrp_SCH_ID|spnGrp_pay_id |
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| 1|credit| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0|
| 1| pay | 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
After the second pivot it looks like:
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| ID|credit_client_sch_id|credit_client_pay_id | credit_phr_SCH_ID| credit_phr_pay_id |credit_spnGrp_SCH_ID|credit_spnGrp_pay_id |pay_client_sch_id|pay_client_pay_id | pay_phr_SCH_ID| pay_phr_pay_id |pay_spnGrp_SCH_ID|pay_spnGrp_pay_id |
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| 1| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0| 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
Though I am not sure about performance.
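If performance does become an issue, a possible alternative (a sketch, not part of the original approach) is a single conditional aggregation per ID, since only three specific CAT/SUB_CAT combinations are needed; in PySpark it could look like this, assuming the original NumberDF and masterDF with the column names from the question:

import pyspark.sql.functions as F

# one pass over NumberDF: pick each wanted value with a conditional aggregate
fields = NumberDF.groupBy("ID").agg(
    F.first(F.when((F.col("CAT") == "PAY") & (F.col("SUB_CAT") == "client"), F.col("SCH_ID")), ignorenulls=True).alias("Field1"),
    F.first(F.when((F.col("CAT") == "PAY") & (F.col("SUB_CAT") == "phr"), F.col("SCH_ID")), ignorenulls=True).alias("Field2"),
    F.first(F.when((F.col("CAT") == "credit") & (F.col("SUB_CAT") == "spGrp"), F.col("pay_id")), ignorenulls=True).alias("Field3"))

result = masterDF.join(fields, "ID", "left")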
NumberDF.createOrReplaceTempView("NumberDF")
masterDF.createOrReplaceTempView("MasterDf")
val sqlDF = spark.sql("""
  select m.id, t1.fld1, t2.fld2, t3.fld3, m.otherfields
  from
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end) fld1
     from NumberDF n
     where case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end is not null) t1,
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end) fld2
     from NumberDF n
     where case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end is not null) t2,
    (select id, (case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end) fld3
     from NumberDF n
     where case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end is not null) t3,
    MasterDf m
  where t1.id = m.id and t2.id = m.id and t3.id = m.id
""")
sqlDF.show()
I would like to compare two data frames, and I want to pull out records based on the three conditions below.
If the record matches, 'SAME' should appear in a new column FLAG.
If the record does not match and it comes from df1 (for example No. 66), 'DF1' should appear in the FLAG column.
If the record does not match and it comes from df2 (for example No. 77), 'DF2' should appear in the FLAG column.
The whole record needs to be considered and verified, i.e. a record-wise comparison.
I also need to run this check for millions of records, using PySpark code.
df1:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
df2:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3000,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Xom,5000,mex,IT,2/11/2019
77,XYZ,5000,mex,IT,2/11/2019
Expected Output:
No,Name,Sal,Address,Dept,Join_Date,FLAG
11,Sam,1000,ind,IT,2/11/2019,SAME
22,Tom,2000,usa,HR,2/11/2019,SAME
33,Kom,3500,uk,IT,2/11/2019,DF1
33,Kom,3000,uk,IT,2/11/2019,DF2
44,Nom,4000,can,HR,2/11/2019,SAME
55,Vom,5000,mex,IT,2/11/2019,DF1
55,Xom,5000,mex,IT,2/11/2019,DF2
66,XYZ,5000,mex,IT,2/11/2019,DF1
77,XYZ,5000,mex,IT,2/11/2019,DF2
I loaded the input data as shown below, but I have no idea how to proceed.
df1 = pd.read_csv("D:\\inputs\\file1.csv")
df2 = pd.read_csv("D:\\inputs\\file2.csv")
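Since the goal is PySpark for millions of records, the same files could also be read with Spark directly; a small sketch assuming the same paths and a header row:

df1 = spark.read.csv("D:/inputs/file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("D:/inputs/file2.csv", header=True, inferSchema=True)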
Any help is appreciated. Thanks.
# Requisite packages to import
import sys
from pyspark.sql.functions import lit, count, col, when
from pyspark.sql.window import Window
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
Use a window function to check whether the count of identical rows is greater than 1; if it is, mark the FLAG column as SAME, otherwise keep it as it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
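A small note on the window definition above: since no ordering is specified, the same full-partition frame can also be written with the documented unbounded markers instead of sys.maxsize, e.g.:

my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)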
I think you can solve your problem by creating temporary columns to indicate the source, plus a join. Then you only need to check the conditions, i.e. whether both sources are present, or only one of them, and which one.
Consider the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3500,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Vom',5000,'mex','IT','2/11/2019'),\
(66,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
df2= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3000,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Xom',5000,'mex','IT','2/11/2019'),\
(77,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
# creation of your example dataframes

# temporary columns to refer to the origin later
df1 = df1.withColumn("Source1", lit("DF1"))
df2 = df2.withColumn("Source2", lit("DF2"))

# full join on all columns; a source column is only set if the record
# appears in the corresponding original dataframe
df1.join(df2, ["No", "Name", "Sal", "Address", "Dept", "Join_Date"], "full") \
    .withColumn("FLAG",
                # condition if the record appears in both dataframes
                when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")
                # condition if the record appears in only one dataframe
                .otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2"))) \
    .drop("Source1", "Source2").show()  # remove the temporary columns and show the result
Output:
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+