Join two dataframes in pyspark by one column - apache-spark

I have two dataframes that I need to join on one column, keeping only the rows from the first dataframe whose id also appears in the same column of the second dataframe:
df1:
id a b
2 1 1
3 0.5 1
4 1 2
5 2 1
df2:
id c d
2 fs a
5 fa f
Desired output:
df:
id a b
2 1 1
5 2 1
I have tried df1.join(df2("id"),"left"), but it gives me the error: 'DataFrame' object is not callable.

df2("id") is not valid Python syntax for selecting columns; you need either df2[["id"]] or df2.select("id"). For your example, you can do:
df1.join(df2.select("id"), "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+
or:
df1.join(df2[["id"]], "id").show()
+---+---+---+
| id| a| b|
+---+---+---+
| 5|2.0| 1|
| 2|1.0| 1|
+---+---+---+

If you only need to check whether id exists in df2 and do not need any df2 columns in the output, isin() can be a more efficient solution (this is similar to EXISTS and IN in SQL), as long as the list of ids is small enough to collect to the driver.
df1 = spark.createDataFrame([(2,1,1) ,(3,5,1,),(4,1,2),(5,2,1)], "id: Int, a : Int , b : Int")
df2 = spark.createDataFrame([(2,'fs','a') ,(5,'fa','f')], ['id','c','d'])
Collect df2.id into a list and pass it to isin() on df1:
from pyspark.sql.functions import col
df2_list = df2.select('id').rdd.map(lambda row : row[0]).collect()
df1.where(col('id').isin(df2_list)).show()
#+---+---+---+
#| id| a| b|
#+---+---+---+
#| 2| 1| 1|
#| 5| 2| 1|
#+---+---+---+
It is recommended to use isin() if:
You don't need to return data from the reference dataframe/table
You have duplicates in the reference dataframe/table (a JOIN can produce duplicate rows if values are repeated)
You just want to check for the existence of a particular value
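The point about duplicates can be illustrated without Spark. The following is a minimal plain-Python sketch, not PySpark code: the helper names inner_join_ids and filter_by_ids are made up for illustration, and the duplicated id 2 in the reference list is an assumption added to show the effect.

```python
# Plain-Python model of the two approaches; rows are tuples (id, a, b).
df1_rows = [(2, 1, 1), (3, 5, 1), (4, 1, 2), (5, 2, 1)]
# Reference ids with a duplicate: id 2 appears twice (assumed for illustration).
df2_ids = [2, 2, 5]


def inner_join_ids(rows, ids):
    """Emit one output row per matching (row, id) pair, like an inner join."""
    return [row for row in rows for ref_id in ids if row[0] == ref_id]


def filter_by_ids(rows, ids):
    """Keep each row at most once if its id occurs in ids, like isin()."""
    wanted = set(ids)
    return [row for row in rows if row[0] in wanted]


joined = inner_join_ids(df1_rows, df2_ids)
filtered = filter_by_ids(df1_rows, df2_ids)
print(joined)    # the row with id 2 is repeated
print(filtered)  # each matching row appears once
```

In PySpark itself, a left semi join, df1.join(df2, "id", "left_semi"), has the same keep-each-row-once behaviour and avoids collecting the ids to the driver, which may matter when df2 is large.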

Related

Pyspark : concat two spark df side ways without join efficiently

Hi, I have a sparse dataframe that was loaded with the mergeSchema option:
DF
name A1 A2 B1 B2 ..... partitioned_name
A 1 1 null null partition_a
B 2 2 null null partition_a
A null null 3 4 partition_b
B null null 3 4 partition_b
to
DF
name A1 A2 B1 B2 .....
A 1 1 3 4
B 2 2 3 4
Any ideas for doing this efficiently without a join (and without RDDs, because the data is huge)? I was thinking about something like pandas concat(axis=1), since all the tables are sorted.
If that pattern repeats and you don't mind hardcoding the column names:
import pyspark.sql.functions as F
df = spark.createDataFrame(
    [
        ('A','1','1','null','null','partition_a'),
        ('B','2','2','null','null','partition_a'),
        ('A','null','null','3','4','partition_b'),
        ('B','null','null','3','4','partition_b')
    ],
    ['name','A1','A2','B1','B2','partitioned_name']
)\
    .withColumn('A1', F.col('A1').cast('integer'))\
    .withColumn('A2', F.col('A2').cast('integer'))\
    .withColumn('B1', F.col('B1').cast('integer'))\
    .withColumn('B2', F.col('B2').cast('integer'))
df.show()
# +----+----+----+----+----+----------------+
# |name| A1| A2| B1| B2|partitioned_name|
# +----+----+----+----+----+----------------+
# | A| 1| 1|null|null| partition_a|
# | B| 2| 2|null|null| partition_a|
# | A|null|null| 3| 4| partition_b|
# | B|null|null| 3| 4| partition_b|
# +----+----+----+----+----+----------------+
# every column except the key and the partition column is aggregated
cols_to_agg = [c for c in df.columns if c not in ["name", "partitioned_name"]]
# sum() ignores nulls, so each name collapses to its non-null values
df\
    .groupby('name')\
    .agg(*[F.sum(c).alias(c) for c in cols_to_agg])\
    .show()
# +----+---+---+---+---+
# |name| A1| A2| B1| B2|
# +----+---+---+---+---+
# | A| 1| 1| 3| 4|
# | B| 2| 2| 3| 4|
# +----+---+---+---+---+
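The aggregation works because each name has at most one non-null value per column, so summing (which skips nulls) simply picks that value. A minimal plain-Python sketch of the same collapse, with None standing in for a Spark null and a hypothetical helper named collapse, might look like this:

```python
# Each dict is one sparse row; None stands in for a Spark null.
rows = [
    {"name": "A", "A1": 1, "A2": 1, "B1": None, "B2": None},
    {"name": "B", "A1": 2, "A2": 2, "B1": None, "B2": None},
    {"name": "A", "A1": None, "A2": None, "B1": 3, "B2": 4},
    {"name": "B", "A1": None, "A2": None, "B1": 3, "B2": 4},
]


def collapse(rows, key):
    """Merge rows sharing the same key, keeping the non-null value per column."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[key], {key: row[key]})
        for column, value in row.items():
            if column != key and value is not None:
                target[column] = value
    return list(merged.values())


collapsed = collapse(rows, "name")
print(collapsed)
```

If the columns are not numeric, F.first(column, ignorenulls=True) or F.max(column) are possible substitutes for F.sum in the Spark version, under the same assumption of one non-null value per name and column.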

Pyspark how to group row based value from a data frame

I need to group row values for each index in the data frame below:
+-----+------+-----+------+------+------+------+
|index|amount|dept |date  |amount|dept  |date  |
+-----+------+-----+------+------+------+------+
|    1|1000  |acnt |2-4-21|2000  |acnt2 |2-4-21|
|    2|1500  |sales|2-3-21|1600  |sales2|2-3-21|
+-----+------+-----+------+------+------+------+
Since index is unique to each row and the dates are the same, I need to group the row values as below:
+-----+---------+------------+------+
|index|amount   |dept        |date  |
+-----+---------+------------+------+
|    1|1000,2000|acnt,acnt2  |2-4-21|
|    2|1500,1600|sales,sales2|2-3-21|
+-----+---------+------------+------+
I see many options for grouping columns, but nothing specifically for row-based values in PySpark.
Is there a solution that produces the result above?
Ideally this should be fixed upstream (check whether there are joins in your upstream code and select only the appropriate aliases so that the column names stay unique).
That being said, you can build a list of de-duplicated column names with a helper dictionary and then apply a helper Spark function:
from pyspark.sql import functions as F
from itertools import groupby
Create a fresh list with a counter:
l = []
s = {}
for i in df.columns:
    l.append(f"{i}_{s.get(i)}" if i in s else i)
    s[i] = s.get(i,0)+1
# l is now ['index', 'amount', 'dept', 'date', 'amount_1', 'dept_1', 'date_1']
Then with this new list create a dataframe with the existing dataframe and use a helper function to concat based on duplicate checks:
def mysparkfunc(cols):
    cols = [list(v) for k,v in groupby(sorted(cols),lambda x: x.split("_")[0])]
    return [F.concat_ws(",",*col).alias(col[0])
            if len(col)>1 and col[0]!= 'date'
            else F.col(col[0]) for col in cols]
df.toDF(*l).select(*mysparkfunc(l)).show()
+---------+------+------------+-----+
| amount| date| dept|index|
+---------+------+------------+-----+
|1000,2000|2-4-21| acnt,acnt2| 1|
|1500,1600|2-3-21|sales,sales2| 2|
+---------+------+------------+-----+
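To see which columns the helper will concatenate, the renaming and grouping steps can be run on the column names alone, without Spark. This is a minimal sketch; the function names rename_duplicates and group_by_base are made up for illustration and are not part of the answer above.

```python
from itertools import groupby


def rename_duplicates(names):
    """Append _1, _2, ... to the second and later occurrences of a name."""
    seen = {}
    renamed = []
    for name in names:
        count = seen.get(name, 0)
        renamed.append(f"{name}_{count}" if count else name)
        seen[name] = count + 1
    return renamed


def group_by_base(names):
    """Group renamed columns by the part before the first underscore."""
    base = lambda name: name.split("_")[0]
    # Sorting by the same key keeps each group contiguous for groupby.
    return {key: list(group)
            for key, group in groupby(sorted(names, key=base), key=base)}


columns = ["index", "amount", "dept", "date", "amount", "dept", "date"]
renamed = rename_duplicates(columns)
groups = group_by_base(renamed)
print(renamed)
print(groups)
```

One caveat of splitting on the underscore: an original column name that already contains an underscore (for example unit_price) would be grouped under its first segment, so a different separator may be needed in that case.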
Full Code:
from pyspark.sql import functions as F
from itertools import groupby
l = []
s = {}
for i in df.columns:
    l.append(f"{i}_{s.get(i)}" if i in s else i)
    s[i] = s.get(i,0)+1
def mysparkfunc(cols):
    cols = [list(v) for k,v in groupby(sorted(cols),lambda x: x.split("_")[0])]
    return [F.concat_ws(",",*col).alias(col[0])
            if len(col)>1 and col[0]!= 'date'
            else F.col(col[0]) for col in cols]
df.toDF(*l).select(*mysparkfunc(l)).show()
Let's say you have an initial data frame as shown below:
INPUT:
+------+------+------+------+
| dept| dept|amount|amount|
+------+------+------+------+
|sales1|sales2| 1| 1|
|sales1|sales2| 2| 2|
|sales1|sales2| 3| 3|
|sales1|sales2| 4| 4|
|sales1|sales2| 5| 5|
+------+------+------+------+
Rename the columns:
newColumns = ["dept1","dept2","amount1","amount2"]
new_clms_df = df.toDF(*newColumns)
new_clms_df.show()
+------+------+-------+-------+
| dept1| dept2|amount1|amount2|
+------+------+-------+-------+
|sales1|sales2| 1| 1|
|sales1|sales2| 2| 2|
|sales1|sales2| 3| 3|
|sales1|sales2| 4| 4|
|sales1|sales2| 5| 5|
+------+------+-------+-------+
Derive the final output columns:
from pyspark.sql.functions import concat_ws
final_df = new_clms_df.\
    withColumn('dept', concat_ws(',',new_clms_df['dept1'],new_clms_df['dept2'])).\
    withColumn('amount', concat_ws(',',new_clms_df['amount1'],new_clms_df['amount2']))
final_df.show()
+------+------+-------+-------+-------------+------+
| dept1| dept2|amount1|amount2| dept|amount|
+------+------+-------+-------+-------------+------+
|sales1|sales2| 1| 1|sales1,sales2| 1,1|
|sales1|sales2| 2| 2|sales1,sales2| 2,2|
|sales1|sales2| 3| 3|sales1,sales2| 3,3|
|sales1|sales2| 4| 4|sales1,sales2| 4,4|
|sales1|sales2| 5| 5|sales1,sales2| 5,5|
+------+------+-------+-------+-------------+------+
There are two ways, depending on what you want:
from pyspark.sql.functions import struct, array, col
df = df.withColumn('amount', struct(col('amount1'), col('amount2')))  # struct column
df = df.withColumn('amount', array(col('amount1'), col('amount2')))  # array column
If there are two columns with the same name (as in your example), first recreate your df with unique names, one name per existing column
(if the duplicates come from a join there is no need; just use aliases):
cols = ['index','amount1','dept1','date1','amount2','dept2','date2']
df = df.toDF(*cols)

Mapping key and list of values to key value using pyspark

I have a dataset that consists of two columns, C1 and C2. The columns are in a many-to-many relation.
For each C2 value, I would like to find the associated C1 value that has the most associations with C2 values overall.
For example:
C1 | C2
1 | 2
1 | 5
1 | 9
2 | 9
2 | 8
We can see here that 1 is matched to 3 values of C2 while 2 is matched to 2, so I would like this output:
Out1 |Out2| matches
2 | 1 | 3
5 | 1 | 3
9 | 1 | 3 (1 wins because 3>2)
8 | 2 | 2
What I have done so far is:
dataset = sc.textFile("...").\
map(lambda line: (line.split(",")[0],list(line.split(",")[1]) ) ).\
reduceByKey(lambda x , y : x+y )
For each C1 value, this gathers all the C2 matches; the length of that list is the desired matches column. What I would like now is to use each value in this list as a new key and get a mapping like:
(Key ,Value_list[value1,value2,...]) -->(value1 , key ),(value2 , key)...
How could this be done using spark? Any advice would be really helpful.
Thanks in advance!
The dataframe API is perhaps easier for this kind of task. You can group by C1, get the count, then group by C2, and get the value of C1 that corresponds to the highest number of matches.
import pyspark.sql.functions as F
df = spark.read.csv('file.csv', header=True, inferSchema=True)
df2 = (df.groupBy('C1')
    .count()
    .join(df, 'C1')
    .groupBy(F.col('C2').alias('Out1'))
    .agg(
        F.max(
            F.struct(F.col('count').alias('matches'), F.col('C1').alias('Out2'))
        ).alias('c')
    )
    .select('Out1', 'c.Out2', 'c.matches')
    .orderBy('Out1')
)
df2.show()
+----+----+-------+
|Out1|Out2|matches|
+----+----+-------+
| 2| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 9| 1| 3|
+----+----+-------+
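Taking the max of a struct works because structs are compared field by field, so putting the count first selects the C1 with the most matches and keeps that C1 attached to its count. The same logic can be sketched in plain Python with tuples, which compare the same way. The function name best_c1_per_c2 is hypothetical, and the data is the example from the question.

```python
from collections import Counter

pairs = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]  # (C1, C2)


def best_c1_per_c2(pairs):
    """For each C2, pick the C1 with the highest overall match count."""
    matches = Counter(c1 for c1, _ in pairs)  # C1 -> number of C2 values
    best = {}
    for c1, c2 in pairs:
        candidate = (matches[c1], c1)  # count first, like the struct
        if c2 not in best or candidate > best[c2]:
            best[c2] = candidate
    # Return rows as (Out1, Out2, matches), ordered by Out1.
    return sorted((c2, c1, count) for c2, (count, c1) in best.items())


result = best_c1_per_c2(pairs)
print(result)
```

When two C1 values tie on count, this sketch (like the struct comparison) falls back to the larger C1; a different tie-break would need a different second field.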
We can also get the desired result with a window function in the dataframe API.
from pyspark.sql import SparkSession
import pyspark.sql.functions as fun
from pyspark.sql.window import Window
spark = SparkSession.builder.master("local[*]").getOrCreate()
# preparing sample dataframe
data = [(1, 2), (1, 5), (1, 9), (2, 9), (2, 8)]
schema = ["c1", "c2"]
df = spark.createDataFrame(data, schema)
# first("c1") would pick an arbitrary c1 per group, so take the max of a
# (matches, c1) struct to keep each c1 together with its own count
output = df.withColumn("matches", fun.count("c1").over(Window.partitionBy("c1"))) \
    .groupby(fun.col("c2").alias("out1")) \
    .agg(fun.max(fun.struct("matches", fun.col("c1").alias("out2"))).alias("best")) \
    .select("out1", "best.out2", "best.matches")
output.show()
# output
+----+----+-------+
|out1|out2|matches|
+----+----+-------+
| 9| 1| 3|
| 5| 1| 3|
| 8| 2| 2|
| 2| 1| 3|
+----+----+-------+

Finding the max value from a column and populating another column based on the max value

I have an incremental load of CSV files. I read each CSV into a dataframe. The dataframe has one column containing strings. I have to find the distinct strings in this column and, after joining with another dataframe, assign an integer ID to each value, starting from 0.
In the next run, I have to find the max value in the ID column and continue incrementing from it for the new strings. Wherever there is a null in the ID column, I have to assign the next value (+1) after the maximum from the previous run.
FIRST RUN
string ID
zero 0
first 1
second 2
third 3
fourth 4
SECOND RUN
MAX(ID) = 4
string ID
zero 0
first 1
second 2
third 3
fourth 4
fifth 5
sixth 6
seventh 7
eighth 8
I have tried this but couldn't get it to work:
max = df.agg({"ID": "max"}).collect()[0][0]
df_incremented = df.withcolumn("ID", when(col("ID").isNull(),expr("max += 1")))
Let me know if there is an easy way to achieve this.
As you keep only distinct values, you can use row_number function over window :
from pyspark.sql import Window
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("a",), ("a",), ("b",), ("c",), ("d",), ("e",), ("e",)],
("string",)
)
w = Window.orderBy("string")
df1 = df.distinct().withColumn("ID", F.row_number().over(w) - 1)
df1.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#+------+---+
Now let's add some rows to this dataframe and use row_number along with coalesce to assign an ID only to rows where it is null (no need to get the max):
df2 = df1.union(spark.sql("select * from values ('f', null), ('h', null), ('i', null)"))
df3 = df2.withColumn("ID", F.coalesce("ID", F.row_number().over(w) - 1))
df3.show()
#+------+---+
#|string| ID|
#+------+---+
#| a| 0|
#| b| 1|
#| c| 2|
#| d| 3|
#| e| 4|
#| f| 5|
#| h| 6|
#| i| 7|
#+------+---+
If you wanted to keep duplicated values too and assign them the same ID, then use dense_rank instead of row_number.
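The question's original idea (take the current maximum ID and keep counting from there) can be sketched in plain Python. This is only a model of the logic, not Spark code, and the function name assign_ids is made up. As far as I can tell, the row_number approach above gives new strings the next free IDs only when they sort after every existing string; a new value such as 'aa' would get the same number as 'b'. Counting from the maximum avoids that case.

```python
def assign_ids(existing, incoming):
    """Keep existing string -> ID pairs and number new strings from max + 1."""
    ids = dict(existing)
    next_id = max(ids.values(), default=-1) + 1
    for value in incoming:
        if value not in ids:  # duplicates and known strings keep their ID
            ids[value] = next_id
            next_id += 1
    return ids


first_run = assign_ids({}, ["a", "a", "b", "c", "d", "e", "e"])
second_run = assign_ids(first_run, ["f", "aa", "h", "a"])
print(first_run)
print(second_run)
```

Here new strings are numbered in order of first appearance; sorting the incoming values first would make the assignment independent of input order.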

How to aggregate on one column and take maximum of others in pyspark?

I have columns X (string), Y (string), and Z (float).
And I want to
aggregate on X
take the maximum of column Z
report ALL the values for columns X, Y, and Z
If there are multiple values for column Y that correspond to the maximum for column Z, then take the maximum of those values in column Y.
For example, my table is like: table1:
col X col Y col Z
A 1 5
A 2 10
A 3 10
B 5 15
resulting in:
A 3 10
B 5 15
If I were using SQL, I would do it like this:
select X, Y, Z
from table1
join (select max(Z) as max_Z from table1 group by X) table2
on table1.Z = table2.max_Z
However how do I do this when 1) column Z is a float? and 2) I'm using pyspark sql?
The two following solutions are in Scala, but honestly could not resist posting them to promote my beloved window aggregate functions. Sorry.
The only question is which structured query is more performant/effective?
Window Aggregate Function: rank
val df = Seq(
("A",1,5),
("A",2,10),
("A",3,10),
("B",5,15)
).toDF("x", "y", "z")
scala> df.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
// describe window specification
// (a second orderBy call would replace the first, so pass both columns at once)
import org.apache.spark.sql.expressions.Window
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)
// use rank to calculate the best X
scala> df.withColumn("rank", rank over byX)
.select("x", "y", "z")
.where($"rank" === 1) // <-- take the first row
.orderBy("x")
.show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and dropDuplicates
I've always been thinking about the alternatives to rank function and first usually sprung to mind.
// use first and dropDuplicates
scala> df.
withColumn("y", first("y") over byX).
withColumn("z", first("z") over byX).
dropDuplicates.
orderBy("x").
show
+---+---+---+
| x| y| z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
You can consider using a window function. My approach is to create a window that partitions the dataframe by X and then orders the rows in each partition by the values of Z and Y in descending order.
We can then simply select rank == 1 to get the rows we're interested in.
Or we can use first and drop_duplicates to achieve the same task.
PS. Thanks Jacek Laskowski for the comments and Scala solution that leads to this solution.
Create toy example dataset
from pyspark.sql.window import Window
import pyspark.sql.functions as func
data=[('A',1,5),
('A',2,10),
('A',3,10),
('B',5,15)]
df = spark.createDataFrame(data,schema=['X','Y','Z'])
Window Aggregate Function: rank
Apply a window function with the rank function
# order by Z first (the maximum we want), then by Y to break ties
w = Window.partitionBy(df['X']).orderBy(func.col('Z').desc(), func.col('Y').desc())
df_max = df.select('X', 'Y', 'Z', func.rank().over(w).alias("rank"))
df_final = df_max.where(func.col('rank') == 1).select('X', 'Y', 'Z').orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
Window Aggregate Function: first and drop_duplicates
This task can also be achieved by using first and drop_duplicates as follows
df_final = df.select('X', func.first('Y').over(w).alias('Y'), func.first('Z').over(w).alias('Z'))\
.drop_duplicates()\
.orderBy('X')
df_final.show()
Output
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 3| 10|
| B| 5| 15|
+---+---+---+
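The order of the sort keys matters for this problem: Z has to come first, with Y used only to break ties, otherwise a row with a large Y and a small Z could win. A minimal plain-Python sketch of "per X, the row with the largest Z, then the largest Y" follows; the function name top_row_per_group and the second data set are assumptions added for illustration.

```python
def top_row_per_group(rows):
    """For each X, keep the row with the largest (Z, Y) pair."""
    best = {}
    for x, y, z in rows:
        current = best.get(x)
        # Compare Z first and Y second, mirroring orderBy(Z desc, Y desc).
        if current is None or (z, y) > (current[2], current[1]):
            best[x] = (x, y, z)
    return [best[x] for x in sorted(best)]


rows = [("A", 1, 5.0), ("A", 2, 10.0), ("A", 3, 10.0), ("B", 5, 15.0)]
top_rows = top_row_per_group(rows)
print(top_rows)

# A case where ordering by Y first would pick the wrong row.
tricky = [("A", 9, 1.0), ("A", 2, 10.0)]
tricky_top = top_row_per_group(tricky)
print(tricky_top)
```

Z being a float does not change the approach, since the comparison only needs the values to be orderable.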
Let's create a dataframe from your sample data:
data=[('A',1,5),
('A',2,10),
('A',3,10),
('B',5,15)]
df = spark.createDataFrame(data,schema=['X','Y','Z'])
df.show()
output:
+---+---+---+
| X| Y| Z|
+---+---+---+
| A| 1| 5|
| A| 2| 10|
| A| 3| 10|
| B| 5| 15|
+---+---+---+
from pyspark.sql.functions import col
# create an intermediate dataframe that finds the max of Z per X
df1 = df.groupby('X').max('Z').toDF('X2','max_Z')
# create a second intermediate dataframe that finds the max of Y where Z = max of Z
df2 = df.join(df1,df.X==df1.X2)\
    .where(col('Z')==col('max_Z'))\
    .groupBy('X')\
    .max('Y').toDF('X','max_Y')
# join the two to form the final result
result = df1.join(df2,df1.X2==df2.X)\
    .select('X','max_Y','max_Z')\
    .orderBy('X')
result.show()
output:
+---+-----+-----+
| X|max_Y|max_Z|
+---+-----+-----+
| A| 3| 10|
| B| 5| 15|
+---+-----+-----+
