Self join on different columns in pyspark?

I have a PySpark DataFrame like this:
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    Row(a=1, b=3),
    Row(a=3, b=2),
])
+---+---+
| a| b|
+---+---+
| 1| 3|
| 3| 2|
+---+---+
I tried a self join on it like this:
df1 = df.alias("df1")
df2 = df.alias("df2")
cond = [df1.a == df2.b]
df1.join(df2, cond).show()
But it gives me an error.
Ideally, I want to find all pairs that share a common neighbor (here 3 is common to both 1 and 2):
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+

You can rename the columns before the self join so that the two sides no longer share column names.
from pyspark.sql.functions import col
df_as1 = df.alias("df_as1").selectExpr("a as c1", "b")
df_as2 = df.alias("df_as2").selectExpr("a", "b as c2")
joined_df = df_as1.join(df_as2, col("df_as1.b") == col("df_as2.a"), 'inner').select("c1", "c2")
joined_df.show()
Output will be:
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+
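To make the join condition concrete, here is a minimal plain-Python sketch of what the self join computes. It is only a model of the logic on the two sample rows, not Spark code: each row (c1, b) on the left is paired with each row (a, c2) on the right whenever b equals a.

```python
# Plain-Python model of the self join (illustrative only, no Spark involved).
rows = [(1, 3), (3, 2)]  # (a, b) pairs from the question

pairs = [
    (left_a, right_b)                 # (c1, c2)
    for (left_a, left_b) in rows      # left side: a renamed to c1
    for (right_a, right_b) in rows    # right side: b renamed to c2
    if left_b == right_a              # join condition: left.b == right.a
]
print(pairs)  # [(1, 2)]
```

The single surviving pair matches the c1/c2 table above: 1 and 2 are linked through the shared value 3.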

Related

How to set the value of a Pyspark column based on two conditions of the value of another column

Say I have a dataframe:
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
I want to set the 'foo' column to a value depending on the value of bar.
If bar is 2: set the value of foo for that row to 'X',
else if bar is 1: set the value of foo for that row to 'Y'
And if neither condition is met, leave the foo value as it is.
pyspark.sql.functions.when seems like the closest method, but it doesn't seem to work based on another column's value.

when can work with other columns. You can use F.col to get the value of the other column and provide an appropriate condition:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'foo',
    F.when(F.col('bar') == 2, 'X')
    .when(F.col('bar') == 1, 'Y')
    .otherwise(F.col('foo'))
)
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+

We can solve this in Spark with either when or a UDF, setting the column based on a condition.
Create Sample DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('AddConditionalColumn').getOrCreate()
data = [(1,"baz",0),(2,"baz",0),(3,"333",2),(4,"444",1)]
columns = ["id","foo","bar"]
df = spark.createDataFrame(data = data, schema = columns)
df.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3|333| 2|
| 4|444| 1|
+---+---+---+
Using when:
from pyspark.sql.functions import when
df2 = df.withColumn("foo", when(df.bar == 2, "X")
    .when(df.bar == 1, "Y")
    .otherwise(df.foo))
df2.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
Using a UDF (both columns are passed in so that unmatched rows keep their foo value):
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def executeRule(foo, bar):
    # Map bar to the replacement value; otherwise keep the existing foo
    if bar == 2:
        return 'X'
    elif bar == 1:
        return 'Y'
    else:
        return foo

# Converting function to UDF
ruleUDF = F.udf(executeRule, StringType())
df3 = df.withColumn("foo", ruleUDF("foo", "bar"))
df3.show()
+---+---+---+
| id|foo|bar|
+---+---+---+
| 1|baz| 0|
| 2|baz| 0|
| 3| X| 2|
| 4| Y| 1|
+---+---+---+
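Because the rule is an ordinary Python function, it can be sanity-checked without a Spark session before it is wrapped in F.udf. The sketch below assumes a hypothetical two-argument helper, apply_rule(foo, bar), that falls back to the current foo value when bar is neither 1 nor 2, which is what the question asks for.

```python
def apply_rule(foo, bar):
    """Return the new foo value for one row (hypothetical helper for illustration)."""
    if bar == 2:
        return "X"
    if bar == 1:
        return "Y"
    return foo  # neither condition met: keep the existing value


# Same sample rows as the DataFrame above: (id, foo, bar)
rows = [(1, "baz", 0), (2, "baz", 0), (3, "333", 2), (4, "444", 1)]
result = [(row_id, apply_rule(foo, bar), bar) for row_id, foo, bar in rows]
print(result)
```

A function that returned its bar argument in the fallback branch would overwrite foo for rows 1 and 2 instead of preserving "baz", so checking the fallback case explicitly is worthwhile.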

How to compute the numerical difference between columns of different dataframes?

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
+----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How can I obtain B-A, i.e.
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice, the real dataframes have a large number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?

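I was able to solve this with the approach below. The code works with any number of columns; you just have to change the input DataFrames accordingly. Note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in each partition.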
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
// Pair up rows by position, then subtract column by column (df1 - df0)
val rdd = df0.rdd.zip(df1.rdd).map { x =>
  val arr = columns.map(column =>
    x._2.getAs[Int](column) - x._1.getAs[Int](column))
  Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+

If df A has the same shape as df B, you can try the approach below. I don't know whether it will work correctly for large datasets; it would be better to already have an id column to join on rather than creating one with monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+
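The caution about monotonically_increasing_id() follows from how the IDs are built: the Spark documentation describes them as unique and increasing but not consecutive, with the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python model below (simulated_ids is a hypothetical helper, not a Spark API) sketches how two DataFrames with the same row count but different partitioning would get IDs that do not line up for a join.

```python
def simulated_ids(partition_sizes):
    """Model of the documented ID layout: partition index shifted into the
    upper bits, row position within the partition in the lower 33 bits."""
    return [
        (partition << 33) + position
        for partition, size in enumerate(partition_sizes)
        for position in range(size)
    ]


ids_a = simulated_ids([2])     # two rows in a single partition
ids_b = simulated_ids([1, 1])  # two rows split over two partitions
print(ids_a)  # [0, 1]
print(ids_b)  # [0, 8589934592]
```

In this model only the first row would find a join partner, which is why a pre-existing key column is the safer choice when the partitioning of the two inputs is not known to match.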

pyspark two dataframes subtractbykey issue

I am trying to output a dataframe containing only the rows and columns whose values differ after comparing two dataframes. I am having difficulty identifying an approach.
Code:
df_a = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"pears","tiger","onion"),("c", 8,"jackfruit","elephant","raddish"),("c", 9,"watermelon","giraffe","tomato")], ["name", "id","fruit","animal","veggie"])
df_b = sql_context.createDataFrame([("a", 3,"apple","bear","carrot"), ("b", 5,"orange","lion","cabbage"), ("c", 7,"banana","tiger","onion"),("c", 8,"jackfruit","camel","raddish")], ["name", "id","fruit","animal","veggie"])
df_a = df_a.alias('df_a')
df_b = df_b.alias('df_b')
df_a.join(df_b, (df_a.id == df_b.id) & (df_a.name == df_b.name), 'leftanti').select('df_a.*').show()
I am trying to match on the key columns (id, name) between Dataframe 1 and Dataframe 2.
Dataframe 1:
+----+---+----------+--------+-------+
|name| id| fruit| animal| veggie|
+----+---+----------+--------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| pears| tiger| onion|
| c| 8| jackfruit|elephant|raddish|
| c| 9|watermelon| giraffe| tomato|
+----+---+----------+--------+-------+
Dataframe 2:
+----+---+---------+------+-------+
|name| id| fruit|animal| veggie|
+----+---+---------+------+-------+
| a| 3| apple| bear| carrot|
| b| 5| orange| lion|cabbage|
| c| 7| banana| tiger| onion|
| c| 8|jackfruit| camel|raddish|
+----+---+---------+------+-------+
Expected dataframe
+----+---+----------+--------+
|name| id| fruit| animal|
+----+---+----------+--------+
| c| 7| pears| tiger|
| c| 8| jackfruit|elephant|
| c| 9|watermelon| giraffe|
+----+---+----------+--------+

PySpark : change column names of a df based on relations defined in another df

I have two Spark DataFrames loaded from CSV, of the form:
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into:
A B C D
1 2 3
12 21 4
As dd has no mapping in the mapping table, the D column should contain only null values.
How can I do this without converting mapping_fields into a dictionary and checking each mapped name individually? (That would mean collecting mapping_fields to the driver and checking there, which rather contradicts my use case of handling all the datasets in a distributed way.)
Thanks!

With melt borrowed from here you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
[("A", "aa"), ("B", "bb"), ("C", "cc")],
("new_name", "old_name"))
df = spark.createDataFrame(
[(1, 2, 3, 43), (12, 21, 4, 37)],
("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"], "left_outer")
.withColumn("value", f.when(f.col("new_name").isNotNull(), f.col("value")))
.withColumn("new_name", f.coalesce("new_name", f.upper(f.col("old_name"))))
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but nothing in your description justifies this. Because the number of columns is fairly limited, I'd rather:
mapping = dict(
mapping_fields
.filter(f.col("old_name").isin(df.columns))
.select("old_name", "new_name").collect())
df.select([
    (f.lit(None).cast(t) if c not in mapping else f.col(c)).alias(mapping.get(c, c.upper()))
    for (c, t) in df.dtypes])
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"])
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
or
df.select([
    f.col(c).alias(mapping.get(c))
    for (c, t) in df.dtypes if c in mapping])
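The dictionary-based variant boils down to a small driver-side lookup over the column names. As a minimal sketch, assuming the mapping rows have already been collected as (old_name, new_name) pairs, the plan that the select expression encodes can be worked out in plain Python (the plan variable is an illustrative name, not part of the answer):

```python
# (old_name, new_name) pairs, as the collect() on mapping_fields would yield them
mapping_rows = [("aa", "A"), ("bb", "B"), ("cc", "C")]
columns = ["aa", "bb", "cc", "dd"]

mapping = dict(mapping_rows)

# For each source column: its output name, and whether its values are kept
# (True) or replaced by nulls because no mapping exists (False).
plan = [(c, mapping.get(c, c.upper()), c in mapping) for c in columns]
print(plan)
```

Only the handful of column names travels to the driver here; the row data itself is still processed by Spark, which is why collecting the mapping does not undermine distributed processing.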

I tried with a simple for loop; hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
>>> df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
When you need the unmapped column filled with null values:
>>> cols = df2.columns
>>> for i in cols:
...     val = df1.where(df1['old_name'] == i).first()
...     if val is not None:
...         df2 = df2.withColumnRenamed(i, val['new_name'])
...     else:
...         df2 = df2.withColumn(i, F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
When only the mapped columns are needed, change the else branch:
...     else:
...         df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
Note that this rebinds df2 to the transformed DataFrame, so the original is no longer available under that name.

How to join a DataFrame with the same aggregated DataFramefor e

Given a DataFrame
+---+---+----+
| id| v|date|
+---+---+----+
| 1| a| 1|
| 2| a| 2|
| 3| b| 3|
| 4| b| 4|
+---+---+----+
And we want to add a column with the mean value of date by v
+---+---+----+---------+
| v| id|date|avg(date)|
+---+---+----+---------+
| a| 1| 1| 1.5|
| a| 2| 2| 1.5|
| b| 3| 3| 3.5|
| b| 4| 4| 3.5|
+---+---+----+---------+
Is there a better way (e.g. in terms of performance) than the following?
val df = sc.parallelize(List((1,"a",1), (2, "a", 2), (3, "b", 3), (4, "b", 4))).toDF("id", "v", "date")
val aggregated = df.groupBy("v").agg(avg("date"))
df.join(aggregated, usingColumn = "v")
More precisely, I think this join will trigger a shuffle.
[Update] Adding some details because I don't think this is a duplicate: the join has a key in this case.
I see several options to avoid the shuffle:
Automatically: Spark has an automatic broadcast join, but it requires that Hive metadata has been computed. Right?
By using a known partitioner? If so, how is that done with a DataFrame?
By forcing a broadcast (leftDF.join(broadcast(rightDF), usingColumn = "v"))?
