A crossJoin can be done as follows:
import pandas as pd
from datetime import datetime, timedelta

date_today = datetime.now()  # assumed reference date (not defined in the original snippet)

df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
import numpy as np

df1 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2],
                    'subgroup': ['A', 'B', 'C', 'D', 'A', 'B', 'D', 'E']})
df2 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2],
                    'dates': np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),
                                        np.array(pd.date_range(date_today + timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user cross join should be a dataframe with 32 rows. Is this possible in PySpark, and how can it be done?
A cross join is a join that multiplies rows, because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and a regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
.join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark 2+ we get an error, because Spark detects that we are trying to do a cross join (cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
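If a full cartesian product is really what you want, you can request it explicitly instead of relying on a trivial key; a minimal sketch (the configuration name is the one quoted in the error above):

# explicit cross join, no dummy key needed
df = df1.crossJoin(df2)

# or allow implicit cartesian products (the Spark 2.x setting from the error message)
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])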
If your joining key (user here) does not uniquely identify rows, you will also get a multiplication of lines, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
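Applied to the dataframes from the question (assuming df1 and df2 are the pandas frames defined there, each with 4 rows per user), the same plain join on user gives the requested 32 rows:

sdf1 = spark.createDataFrame(df1)   # user, subgroup: 4 rows per user
sdf2 = spark.createDataFrame(df2)   # user, dates: 4 rows per user
sdf1.join(sdf2, on='user').count()  # 4 * 4 + 4 * 4 = 32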
Note: Using a self join followed by a filter usually means you should be using window functions instead.
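For example, a minimal sketch (assuming the hypothetical task of keeping, per user, the rows with the largest value2): instead of joining df2 to a per-user aggregate and filtering, compute the aggregate over a window:

from pyspark.sql import Window

w = Window.partitionBy('user')
df2.withColumn('max_value2', psf.max('value2').over(w)) \
    .where(psf.col('value2') == psf.col('max_value2')) \
    .drop('max_value2') \
    .show()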
I have the two DataFrames below:
MasterDF
NumberDF (created using a Hive load)
Desired output:
Logic to populate:
For Field1, pick sch_id where CAT='PAY' and SUB_CAT='client'.
For Field2, pick sch_id where CAT='PAY' and SUB_CAT='phr'.
For Field3, pick pay_id where CAT='credit' and SUB_CAT='spGrp'.
Currently, before joining, I filter NumberDF and then pick the value.
Example:
masterDF.as("master")
  .join(NumberDF.filter(col("CAT") === "PAY" && col("SUB_CAT") === "phr").as("number"),
        $"master.id" === $"number.id", "leftouter")
  .select($"master.*", $"number.sch_id".as("field1"))
The above approach would need multiple joins. I looked into the pivot function, but it did not solve my problem.
Note: Please ignore the syntax error in code
A better solution is to pivot the DataFrame (numberDF) on the column (subject) before joining it with studentDF.
The PySpark code looks like this:
numberDF = spark.createDataFrame([(1, "Math", 80), (1, "English", 60), (1, "Science", 80)], ["id", "subject", "marks"])
studentDF = spark.createDataFrame([(1, "Vikas")],["id","name"])
>>> numberDF.show()
+---+-------+-----+
| id|subject|marks|
+---+-------+-----+
| 1| Math| 80|
| 1|English| 60|
| 1|Science| 80|
+---+-------+-----+
>>> studentDF.show()
+---+-----+
| id| name|
+---+-----+
| 1|Vikas|
+---+-----+
pivotNumberDF = numberDF.groupBy("id").pivot("subject").sum("marks")
>>> pivotNumberDF.show()
+---+-------+----+-------+
| id|English|Math|Science|
+---+-------+----+-------+
| 1| 60| 80| 80|
+---+-------+----+-------+
>>> studentDF.join(pivotNumberDF, "id").show()
+---+-----+-------+----+-------+
| id| name|English|Math|Science|
+---+-----+-------+----+-------+
| 1|Vikas| 60| 80| 80|
+---+-----+-------+----+-------+
ref: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html
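One optional refinement (not required for the example above): pivot also accepts an explicit list of values, which both fixes the output columns and saves Spark an extra pass to compute the distinct subjects, e.g.:

pivotNumberDF = numberDF.groupBy("id").pivot("subject", ["English", "Math", "Science"]).sum("marks")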
Finally, I implemented it using pivot:
flights.groupBy("ID", "CAT")
  .pivot("SUB_CAT", Seq("client", "phr", "spGrp"))
  .agg(avg("SCH_ID").as("SCH_ID"), avg("pay_id").as("pay_id"))
  .groupBy("ID")
  .pivot("CAT", Seq("credit", "pay"))
  .agg(
    avg("client_SCH_ID").as("client_sch_id"), avg("client_pay_id").as("client_pay_id"),
    avg("phr_SCH_ID").as("phr_SCH_ID"), avg("phr_pay_id").as("phr_pay_id"),
    avg("spGrp_SCH_ID").as("spGrp_SCH_ID"), avg("spGrp_pay_id").as("spGrp_pay_id")
  )
The first pivot returns a table like:
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| ID| CAT|client_SCH_ID|client_pay_id |phr_SCH_ID |phr_pay_id |spnGrp_SCH_ID|spnGrp_pay_id |
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
| 1|credit| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0|
| 1| pay | 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+------+-------------+--------------+-----------+------------+-------------+--------------+
After the second pivot it looks like:
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| ID|credit_client_sch_id|credit_client_pay_id | credit_phr_SCH_ID| credit_phr_pay_id |credit_spnGrp_SCH_ID|credit_spnGrp_pay_id |pay_client_sch_id|pay_client_pay_id | pay_phr_SCH_ID| pay_phr_pay_id |pay_spnGrp_SCH_ID|pay_spnGrp_pay_id |
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
| 1| 5.0| 105.0| 4.0| 104.0| 6.0| 106.0| 2.0| 102.0| 1.0| 101.0| 3.0| 103.0|
+---+--------------------+---------------------+------------------+-------------------+--------------------+---------------------+-----------------+------------------+-----------------+------------------+-----------------+------------------+
Though I am not sure about performance.
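If performance does become an issue, a possible alternative (a sketch, not part of the original approach) is a single conditional aggregation per ID, since only three specific CAT/SUB_CAT combinations are needed; in PySpark it could look like this, assuming the original NumberDF and masterDF with the column names from the question:

import pyspark.sql.functions as F

# one pass over NumberDF: pick each wanted value with a conditional aggregate
fields = NumberDF.groupBy("ID").agg(
    F.first(F.when((F.col("CAT") == "PAY") & (F.col("SUB_CAT") == "client"), F.col("SCH_ID")), ignorenulls=True).alias("Field1"),
    F.first(F.when((F.col("CAT") == "PAY") & (F.col("SUB_CAT") == "phr"), F.col("SCH_ID")), ignorenulls=True).alias("Field2"),
    F.first(F.when((F.col("CAT") == "credit") & (F.col("SUB_CAT") == "spGrp"), F.col("pay_id")), ignorenulls=True).alias("Field3"))

result = masterDF.join(fields, "ID", "left")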
NumberDF.createOrReplaceTempView("NumberDF")
masterDF.createOrReplaceTempView("MasterDf")
val sqlDF = spark.sql("""
  select m.id, t1.fld1, t2.fld2, t3.fld3, m.otherfields
  from
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end) fld1
     from NumberDF n
     where case when n.cat = 'pay' and n.sub_cat = 'client' then n.sch_id end is not null) t1,
    (select id, (case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end) fld2
     from NumberDF n
     where case when n.cat = 'pay' and n.sub_cat = 'phr' then n.sch_id end is not null) t2,
    (select id, (case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end) fld3
     from NumberDF n
     where case when n.cat = 'credit' and n.sub_cat = 'spGrp' then n.pay_id end is not null) t3,
    MasterDf m
  where t1.id = m.id and t2.id = m.id and t3.id = m.id
""")
sqlDF.show()
I would like to compare two data frames, and I want to pull out records based on the three conditions below.
If the record matches, 'SAME' should appear in a new column FLAG.
If the record does not match and it comes from df1 (for example No. 66), 'DF1' should appear in the FLAG column.
If the record does not match and it comes from df2 (for example No. 77), 'DF2' should appear in the FLAG column.
The whole record needs to be considered and verified, i.e. a record-wise comparison.
I also need to run this check for millions of records, using PySpark code.
df1:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3500,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Vom,5000,mex,IT,2/11/2019
66,XYZ,5000,mex,IT,2/11/2019
df2:
No,Name,Sal,Address,Dept,Join_Date
11,Sam,1000,ind,IT,2/11/2019
22,Tom,2000,usa,HR,2/11/2019
33,Kom,3000,uk,IT,2/11/2019
44,Nom,4000,can,HR,2/11/2019
55,Xom,5000,mex,IT,2/11/2019
77,XYZ,5000,mex,IT,2/11/2019
Expected Output:
No,Name,Sal,Address,Dept,Join_Date,FLAG
11,Sam,1000,ind,IT,2/11/2019,SAME
22,Tom,2000,usa,HR,2/11/2019,SAME
33,Kom,3500,uk,IT,2/11/2019,DF1
33,Kom,3000,uk,IT,2/11/2019,DF2
44,Nom,4000,can,HR,2/11/2019,SAME
55,Vom,5000,mex,IT,2/11/2019,DF1
55,Xom,5000,mex,IT,2/11/2019,DF2
66,XYZ,5000,mex,IT,2/11/2019,DF1
77,XYZ,5000,mex,IT,2/11/2019,DF2
I loaded the input data as shown below, but I have no idea how to proceed.
df1 = pd.read_csv("D:\\inputs\\file1.csv")
df2 = pd.read_csv("D:\\inputs\\file2.csv")
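Since the goal is PySpark for millions of records, the same files could also be read with Spark directly; a small sketch assuming the same paths and a header row:

df1 = spark.read.csv("D:/inputs/file1.csv", header=True, inferSchema=True)
df2 = spark.read.csv("D:/inputs/file2.csv", header=True, inferSchema=True)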
Any help is appreciated. Thanks.
# Requisite packages to import
import sys
from pyspark.sql.functions import lit, count, col, when
from pyspark.sql.window import Window
# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
(33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
(55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
Use a window function to check whether the count of identical rows is greater than 1; if it is, mark the FLAG column as SAME, otherwise keep it as it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+
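A small note on the window definition above: since no ordering is specified, the same full-partition frame can also be written with the documented unbounded markers instead of sys.maxsize, e.g.:

my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)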
I think you can solve your problem by creating temporary columns to indicate the source, plus a join. Then you only need to check the conditions, i.e. whether both sources are present, or only one of them, and which one.
Consider the following code:
from pyspark.sql.functions import *
df1= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3500,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Vom',5000,'mex','IT','2/11/2019'),\
(66,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
df2= sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),\
(22,'Tom',2000,'usa','HR','2/11/2019'),(33,'Kom',3000,'uk','IT','2/11/2019'),\
(44,'Nom',4000,'can','HR','2/11/2019'),(55,'Xom',5000,'mex','IT','2/11/2019'),\
(77,'XYZ',5000,'mex','IT','2/11/2019')], \
["No","Name","Sal","Address","Dept","Join_Date"])
# creation of your example dataframes

# temporary columns to refer to the origin later
df1 = df1.withColumn("Source1", lit("DF1"))
df2 = df2.withColumn("Source2", lit("DF2"))

# full join on all columns; a source column is only set if the record
# appears in the corresponding original dataframe
df1.join(df2, ["No", "Name", "Sal", "Address", "Dept", "Join_Date"], "full") \
    .withColumn("FLAG",
                # condition if the record appears in both dataframes
                when(col("Source1").isNotNull() & col("Source2").isNotNull(), "SAME")
                # condition if the record appears in only one dataframe
                .otherwise(when(col("Source1").isNotNull(), "DF1").otherwise("DF2"))) \
    .drop("Source1", "Source2").show()  # remove the temporary columns and show the result
Output:
+---+----+----+-------+----+---------+----+
| No|Name| Sal|Address|Dept|Join_Date|FLAG|
+---+----+----+-------+----+---------+----+
| 33| Kom|3000| uk| IT|2/11/2019| DF2|
| 44| Nom|4000| can| HR|2/11/2019|SAME|
| 22| Tom|2000| usa| HR|2/11/2019|SAME|
| 77| XYZ|5000| mex| IT|2/11/2019| DF2|
| 55| Xom|5000| mex| IT|2/11/2019| DF2|
| 11| Sam|1000| ind| IT|2/11/2019|SAME|
| 66| XYZ|5000| mex| IT|2/11/2019| DF1|
| 55| Vom|5000| mex| IT|2/11/2019| DF1|
| 33| Kom|3500| uk| IT|2/11/2019| DF1|
+---+----+----+-------+----+---------+----+