PySpark: fetch common data from a dataframe when comparing values of given columns

I have two pyspark dataframes like this.
data_frame A
+-----+---+
|name1|id1|
+-----+---+
|    a|  3|
|    b|  5|
|    c|  7|
+-----+---+
data_frame B
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
|    d|  6|
|    e|  0|
|    f|  3|
+-----+---+
I want to fetch the contents of dataframe B where the values of name1 (from df A) and name2 (from df B) match, as shown below.
o/p dataframe
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
I want to avoid computationally expensive methods such as collect(). How can this be done in Apache Spark?

from pyspark.sql.functions import *
df1.join(df2, df1.name1 == df2.name2).select(df2["*"])
Or, using SQL:
df1.registerTempTable("tableA")
df2.registerTempTable("tableB")
result = sqlContext.sql("select b.name2, b.id2 from tableA a join tableB b on a.name1 = b.name2")
result.show()
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
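If you only need the columns of df2, a left-semi join gives the same result without carrying df1's columns through the join at all. A minimal sketch, assuming df1 and df2 are the dataframes above:
# Left-semi join: keeps only the rows of df2 that have a match in df1,
# and returns only df2's columns, so no extra select is needed.
result = df2.join(df1, df2.name2 == df1.name1, "leftsemi")
result.show()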

Related

How can I achieve the following Spark behaviour using the replaceWhere clause?

I want to write data to Delta tables incrementally while replacing (overwriting) partitions already present in the sink. Example:
Consider this data inside my Delta table, already partitioned by the id column:
+---+---+
| id| x|
+---+---+
| 1| A|
| 2| B|
| 3| C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id| x|
+---+---------+
| 2| NEW|
| 2| NEW|
| 4| D|
| 5| E|
+---+---------+
The desired output is this:
+---+---------+
| id| x|
+---+---------+
| 1| A|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
| 5| E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids=[x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
    df.filter(df.id==Id).write.format("delta").option("mergeSchema", "true").partitionBy("id").option("replaceWhere", "id == '$i'".format(i=Id)).mode("append").save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id| x|
+---+---------+
| 2| B|
| 1| A|
| 5| E|
| 2| NEW|
| 2| NEW|
| 3| C|
| 4| D|
+---+---------+
As you can see, it appended all the values without replacing the partition id=2, which was already present in the table.
I think it is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=Id))
with
.option("replaceWhere", "id == '{i}'".format(i=Id))
and it worked.
Thanks to @ggordon, who pointed out the error to me on another question.
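For reference, the corrected loop would look roughly like this (a sketch based on the paths and options in the question, using mode("overwrite"), which replaceWhere requires):
# Corrected write loop: the replaceWhere filter is now built with str.format,
# so each iteration overwrites only its own partition.
for Id in Ids:
    df.filter(df.id == Id) \
      .write.format("delta") \
      .option("mergeSchema", "true") \
      .partitionBy("id") \
      .option("replaceWhere", "id == '{i}'".format(i=Id)) \
      .mode("overwrite") \
      .save("/mnt/blob/datafinance/bronze/simba/test/res/")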

How to concatenate data frame column values in PySpark?

I have created data frame using below code:
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80"),("A", "120"),("c", "20"),("Null", "20")],["Let", "Num"])
df.show()
+----+---+
| Let|Num|
+----+---+
| A| 20|
| B| 30|
| D| 80|
| A|120|
| c| 20|
|Null| 20|
+----+---+
I want to create a data frame like below:
+----+-------+
| Let|Num |
+----+-------+
| A| 20,120|
| B| 30 |
| D| 80 |
| c| 20 |
|Null| 20 |
+----+-------+
How can I achieve this?
You can groupBy Let and collect Num as a list with collect_list:
from pyspark.sql import functions as F
df.groupBy("Let").agg(F.collect_list("Num")).show()
Output as a list:
+----+-----------------+
| Let|collect_list(Num)|
+----+-----------------+
| B| [30]|
| D| [80]|
| A| [20, 120]|
| c| [20]|
|Null| [20]|
+----+-----------------+
To get a comma-separated string instead, wrap the list in concat_ws:
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num"))).show()
Output as a string:
+----+-------------------------------+
| Let|concat_ws(,, collect_list(Num))|
+----+-------------------------------+
| B| 30|
| D| 80|
| A| 20,120|
| c| 20|
|Null| 20|
+----+-------------------------------+
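To match the column layout asked for in the question exactly, the aggregated column can be renamed with alias (a small variation on the code above):
# Same aggregation, with the result column renamed back to "Num".
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num")).alias("Num")).show()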

How to Merge DataFrames in Apache Spark/Hive and then increment version

We receive daily files from an external system and store them in Hive.
We want to enable versioning on the data. (col1, col2) is a composite key, so if we receive the same combination of data from a file it should be stored in Hive with a new version. The latest data that comes from the file should get the biggest version number. How can we do this in Spark?
file df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
|   A|   B|  777|2019-01-01 00:00:00|      1|
|   K|   D|  228|2019-01-01 00:00:00|      1|
|   G|   G|  241|2019-01-01 00:00:00|      1|
+----+----+-----+-------------------+-------+
We don't receive a version from the external system, but if it is needed for comparison it will always be 1.
hive df
+----+----+-----+-------------------+-------+
|col1|col2|value|                 ts|version|
+----+----+-----+-------------------+-------+
|   A|   B|  999|2018-01-01 00:00:00|      1|
|   A|   B|  888|2018-01-02 00:00:00|      2|
|   B|   C|  133|2018-01-03 00:00:00|      1|
|   G|   G|  231|2018-01-01 00:00:00|      1|
+----+----+-----+-------------------+-------+
After merge
+----+----+-----+-------------------+-----------+
|col1|col2|value|                 ts|new_version|
+----+----+-----+-------------------+-----------+
|   B|   C|  133|2018-01-03 00:00:00|          1|
|   K|   D|  228|2019-01-01 00:00:00|          1|
|   A|   B|  999|2018-01-01 00:00:00|          1|
|   A|   B|  888|2018-01-02 00:00:00|          2|
|   A|   B|  777|2019-01-01 00:00:00|          3|
|   G|   G|  231|2018-01-01 00:00:00|          1|
|   G|   G|  241|2019-01-01 00:00:00|          2|
+----+----+-----+-------------------+-----------+
Existing main Hive table:
INSERT INTO TABLE test_dev_db.test_1 VALUES
('A','B',124,1),
('A','B',123,2),
('B','C',133,1),
('G','G',231,1);
Suppose you have loaded the below data from the file:
INSERT INTO TABLE test_dev_db.test_2 VALUES
('A','B',222,1),
('K','D',228,1),
('G','G',241,1);
Here is the query:
WITH CTE AS (
SELECT col1,col2,value,version FROM test_dev_db.test_1
UNION
SELECT col1,col2,value,version FROM test_dev_db.test_2
)
insert overwrite table test_dev_db.test_1
SELECT a.col1, a.col2, a.value, row_number() over (partition by a.col1, a.col2 order by a.value) as new_version
FROM CTE a;
hive> select * from test_dev_db.test_1;
OK
A B 123 1
A B 124 2
A B 222 3
B C 133 1
G G 231 1
G G 241 2
K D 228 1
For Spark:
Create your dataframes by reading from the file and the Hive table, and union them:
uniondf=df1.unionAll(df2)
from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window
w = Window().partitionBy('col1','col2').orderBy(lit('A'))
newdf= uniondf.withColumn("new_version", row_number().over(w)).drop('version')
>>> newdf.show()
+----+----+-----+-----------+
|col1|col2|value|new_version|
+----+----+-----+-----------+
| B| C| 133| 1|
| K| D| 228| 1|
| A| B| 124| 1|
| A| B| 123| 2|
| A| B| 222| 3|
| G| G| 231| 1|
| G| G| 241| 2|
+----+----+-----+-----------+
Saving it to Hive:
newdf.write.format("orc").option("header", "true").mode("overwrite").saveAsTable('test_dev_db.new_test_1')
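If the requirement that the latest data gets the biggest version number must hold strictly, the window should be ordered by the timestamp rather than by an arbitrary literal. A sketch, assuming both dataframes carry the ts column:
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Order the window by ts so the most recent row in each (col1, col2) group
# receives the highest new_version.
w = Window.partitionBy('col1', 'col2').orderBy(col('ts'))
newdf = uniondf.withColumn("new_version", row_number().over(w)).drop('version')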

Spark pairwise differences within groups

I have a Spark dataframe; for the sake of argument let's take it to be:
val df = sc.parallelize(
Seq(("a",1,2),("a",1,4),("b",5,6),("b",10,2),("c",1,1))
).toDF("id","x","y")
+---+---+---+
| id| x| y|
+---+---+---+
| a| 1| 2|
| a| 1| 4|
| b| 5| 6|
| b| 10| 2|
| c| 1| 1|
+---+---+---+
I would like to compute all pairwise differences between entries in the dataframe with the same id and output the result to another dataframe. For a small dataframe I can accomplish this by:
df.crossJoin(
  df.select(df.columns.map(x => col(x).as("_" + x)): _*)
).where(
  col("id") === col("_id")
).select(
  col("id"),
  (col("x") - col("_x")).as("dx"),
  (col("y") - col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| c| 0| 0|
| b| 0| 0|
| b| -5| 4|
| b| 5| -4|
| b| 0| 0|
| a| 0| 0|
| a| 0| -2|
| a| 0| 2|
| a| 0| 0|
+---+---+---+
However, for large dataframes this isn't a reasonable approach as the crossJoin will mostly produce data that will be discarded by the subsequent where clause.
I'm still pretty new to Spark, and groupBy seemed like a natural place to start looking, but I can't figure out how to accomplish this using groupBy. Any help would be welcome.
I would eventually like to remove redundancy, for instance in:
val df1 = df.withColumn("idx", monotonically_increasing_id())
df1.crossJoin(
  df1.select(df1.columns.map(x => col(x).as("_" + x)): _*)
).where(
  col("id") === col("_id") && col("idx") < col("_idx")
).select(
  col("id"),
  (col("x") - col("_x")).as("dx"),
  (col("y") - col("_y")).as("dy")
)
+---+---+---+
| id| dx| dy|
+---+---+---+
| b| -5| 4|
| a| 0| -2|
+---+---+---+
But if it's easier to accomplish this with redundancy, then I can live with that.
This is not an uncommon transformation to perform in ML so I thought something out of MLlib might be appropriate, but again I haven't found anything there either.
This can be achieved via an inner join; the result is the same as expected:
df.alias("left").join(df.alias("right"), "id")
  .select($"id",
    ($"left.x" - $"right.x").alias("dx"),
    ($"left.y" - $"right.y").alias("dy"))

pyspark two dataframes subtractbykey issue

I am trying to output a dataframe containing only the columns with differing values after comparing two dataframes, and I am having difficulty identifying an approach.
Code:
df_a = sql_context.createDataFrame([("a", 3, "apple", "bear", "carrot"), ("b", 5, "orange", "lion", "cabbage"), ("c", 7, "pears", "tiger", "onion"), ("c", 8, "jackfruit", "elephant", "raddish"), ("c", 9, "watermelon", "giraffe", "tomato")], ["name", "id", "fruit", "animal", "veggie"])
df_b = sql_context.createDataFrame([("a", 3, "apple", "bear", "carrot"), ("b", 5, "orange", "lion", "cabbage"), ("c", 7, "banana", "tiger", "onion"), ("c", 8, "jackfruit", "camel", "raddish")], ["name", "id", "fruit", "animal", "veggie"])
df_a = df_a.alias('df_a')
df_b = df_b.alias('df_b')
df = df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name), 'leftanti').select('df_a.*')
df.show()
I am trying to match based on the keys (id, name) between dataframe 1 and dataframe 2.
Dataframe 1:
+----+---+----------+--------+-------+
|name| id| fruit| animal| veggie|
+----+---+----------+--------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| pears| tiger| onion|
| c| 8| jackfruit|elephant|raddish|
| c| 9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+
Dataframe 2:
+----+---+---------+------+-------+
|name| id| fruit|animal| veggie|
+----+---+---------+------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| banana| tiger| onion|
| c| 8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
Expected dataframe
+----+---+----------+--------+
|name| id| fruit| animal|
+----+---+----------+--------+
| c| 7| pears| tiger|
| c| 8| jackfruit|elephant|
| c| 9|watermelon| giraffe|
+----+---+----------+--------+
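No answer follows, but one way to get the rows of df_a whose values differ from (or are missing in) df_b is an exact-row comparison with subtract. A sketch; note it keeps all columns, and selecting only the columns that actually differ would need an extra per-column comparison:
# Rows present in df_a but not in df_b when compared on all columns,
# i.e. keys that are missing from df_b or have different values there.
diff_rows = df_a.subtract(df_b)
diff_rows.show()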
