How to compute the numerical difference between columns of different dataframes? - apache-spark

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How to obtain B-A, i.e
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a consequent number of rows and 50+ columns for which the difference need to be computed. What is the Spark/Scala way of doing it?

I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map {
x =>
val arr = columns.map(column =>
x._2.getAs[Int](column) - x._1.getAs[Int](column))
Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+

If your df A is the same as df B you can try below approach. I don't know if this will work correct for large datasets, it will be better to have id for joining already instead of creating it using monotonically_increasing_id().
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+

Related

Spark-Scala : Create split rows based on the value of other column

I have an Input as below
id
size
1
4
2
2
output - If input is 4 (size column) split 4 times(1-4) and if input size column value is 2 split it
1-2 times.
id
size
1
1
1
2
1
3
1
4
2
1
2
2
You can create an array of sequence from 1 to size using sequence function and then to explode it:
import org.apache.spark.sql.functions._
val df = Seq((1,4), (2,2)).toDF("id", "size")
df
.withColumn("size", explode(sequence(lit(1), col("size"))))
.show(false)
The output would be:
+---+----+
|id |size|
+---+----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|2 |1 |
|2 |2 |
+---+----+
You can use first use sequence function to create sequence from 1 to size and then explode it.
val df = input.withColumn("seq", sequence(lit(1), $"size"))
df.show()
+---+----+------------+
| id|size| seq|
+---+----+------------+
| 1| 4|[1, 2, 3, 4]|
| 2| 2| [1, 2]|
+---+----+------------+
df.withColumn("size", explode($"seq")).drop("seq").show()
+---+----+
| id|size|
+---+----+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+----+
You could turn your size column into an incrementing sequence using Seq.range and then explode the arrays. Something like this:
import spark.implicits._
import org.apache.spark.sql.functions.{explode, col}
// Original dataframe
val df = Seq((1,4), (2,2)).toDF("id", "size")
// Mapping over this dataframe: turning each row into (idx, array)
val df_with_array = df
.map(row => {
(row.getInt(0), Seq.range(1, row.getInt(1) + 1))
})
.toDF("id", "array")
.select(col("id"), explode(col("array")))
output.show()
+---+---+
| id|col|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 2|
+---+---+

Self join on different columns in pyspark?

I have pyspark dataframe like this
df = sqlContext.createDataFrame([
Row(a=1, b=3),
Row(a=3, b=2),
])
+---+---+
| a| b|
+---+---+
| 1| 3|
| 3| 2|
+---+---+
I tried self-join on it like this
df1 = df.alias("df1")
df2 = df.alias("df2")
cond = [df1.a == df2.b]
df1.join(df2, cond).show()
But it gives me error.
Ideally i want to find all pair where one neighbor is common. (3 is common to both 1,2)
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+
You can rename column names accordingly before self join.
from pyspark.sql.functions import *
df_as1 = df.alias("df_as1").selectExpr("a as c1", "b")
df_as2 = df.alias("df_as2").selectExpr("a", "b as c2")
joined_df = df_as1.join(df_as2, col("df_as1.b") == col("df_as2.a"), 'inner').select("c1", "c2")
joined_df.show()
Output will be:
+---+---+
| c1| c2|
+---+---+
| 1| 2|
+---+---+

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to search col "tt" in dataframe alter col("name"), if found it join them, if not found it, then search col "tt" in col("alter"). The priority of col ("name") is high than col("alter"). That means if row of col("tt") is matched to col("name"), I do not want to match it to other row which only matches col("alter"). How can I achieve this task?
I tried to write a join, but it does not work.
dataDF = dataDF.select("*")
.join(broadcast(alterDF),
col("tt") === col("Name") || col("tt") === col("alter"),
"left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice. First time with the name column, filter out the tt values for which data is not matched and join it with the alter column. Union both the results. Please find the code below for the same. I hope it is helpful.
//Creating Test Data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
.toDF("name", "alter", "profit")
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter( col("name").isNull).select("tt")
.join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter( col("name").isNotNull).union(join2)
joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+

How can I use the literal value of a spark dataframe column?

I have this simple dataframe that looks like this,
+---+---+---+---+
|nm | ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
What I want to do is,
+---+---+---+---+---+
|nm |ca |cb |cc |p |
+---+---+---+---+---+
|a |123|0 |0 |1 |
|b |1 |2 |3 |1 |
|c |0 |1 |0 |0 |
+---+---+---+---+---+
bascially added a new column p, such that, if value of column nm is 'a', check column ca is >0, if yes put '1' for column p1 else 0.
My code,
def purchaseCol: UserDefinedFunction =
udf((brand: String) => s"c$brand")
val a = ss.createDataset(List(
("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0)))
.toDF("nm", "ca", "cb", "cc")
a.show()
a.withColumn("p", when(lit(DataFrameUtils.purchaseCol($"nm")) > 0, 1).otherwise(0))
.show(false)
It doesnt seem to be working and is returning 0 for all rows in col 'p'.
PS: Columns number is over 100 and they are dynamically generated.
Map over rdd, calculate and add p to each row:
val a = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val b = a.rdd.map(r => {
val s = r.getAs[String]("nm")
val v = r.getAs[Int](s"c$s")
val p = if(v > 0) 1 else 0
Row.fromSeq(r.toSeq :+ p)
})
val new_schema = StructType(a.schema :+ StructField("p", IntegerType, true))
val df_new = spark.createDataFrame(b, new_schema)
df_new.show
+---+---+---+---+---+
| nm| ca| cb| cc| p|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
If "c*" columns number is limited, UDF with all values can be used:
val nameMatcherFunct = (nm: String, ca: Int, cb: Int, cc: Int) => {
val value = nm match {
case "a" => ca
case "b" => cb
case "c" => cc
}
if (value > 0) 1 else 0
}
def purchaseValueUDF = udf(nameMatcherFunct)
val result = a.withColumn("p", purchaseValueUDF(col("nm"), col("ca"), col("cb"), col("cc")))
If you have many "c*" columns, function with Row as parameter can be used:
How to pass whole Row to UDF - Spark DataFrame filter
looking at your logic
if value of column nm is 'a', check column ca is >0, if yes put '1' for column p1 else 0.
you can simply do
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) && (col("ca") > 0), lit(1)).otherwise(lit(0)))
but looking at your output dataframe, you would require an || instead of &&
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) || (col("ca") > 0), lit(1)).otherwise(lit(0)))
val a1 = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
a1.show()
+---+---+---+---+
| nm| ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
val newDf = a1.withColumn("P", when($"ca" > 0, 1).otherwise(0))
newDf.show()
+---+---+---+---+---+
| nm| ca| cb| cc| P|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+

PySpark : change column names of a df based on relations defined in another df

I have two Spark data-frames loaded from csv of the form :
mapping_fields (the df with mapped names):
new_name old_name
A aa
B bb
C cc
and
aa bb cc dd
1 2 3 43
12 21 4 37
to be transformed into :
A B C D
1 2 3
12 21 4
as dd didn't have any mapping in the original table, D column should have all null values.
How can I do this without converting the mapping_df into a dictionary and checking individually for mapped names? (this would mean I have to collect the mapping_fields and check, which kind of contradicts my use-case of distributedly handling all the datasets)
Thanks!
With melt borrowed from here you could:
from pyspark.sql import functions as f
mapping_fields = spark.createDataFrame(
[("A", "aa"), ("B", "bb"), ("C", "cc")],
("new_name", "old_name"))
df = spark.createDataFrame(
[(1, 2, 3, 43), (12, 21, 4, 37)],
("aa", "bb", "cc", "dd"))
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"], "left_outer")
.withColumn("value", f.when(f.col("new_name").isNotNull(), col("value")))
.withColumn("new_name", f.coalesce("new_name", f.upper(col("old_name"))))
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
but in your description nothing justifies this. Because number of columns is fairly limited, I'd rather:
mapping = dict(
mapping_fields
.filter(f.col("old_name").isin(df.columns))
.select("old_name", "new_name").collect())
df.select([
(f.lit(None).cast(t) if c not in mapping else col(c)).alias(mapping.get(c, c.upper()))
for (c, t) in df.dtypes])
+---+---+---+----+
| A| B| C| DD|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
At the end of the day you should use distributed processing when it provides performance or scalability improvements. Here it would do the opposite and make your code overly complicated.
To ignore no-matches:
(melt(df.withColumn("id", f.monotonically_increasing_id()),
id_vars=["id"], value_vars=df.columns, var_name="old_name")
.join(mapping_fields, ["old_name"])
.groupBy("id")
.pivot("new_name")
.agg(f.first("value"))
.drop("id")
.show())
or
df.select([
col(c).alias(mapping.get(c))
for (c, t) in df.dtypes if c in mapping])
I tried with a simple for loop,hope this helps too.
from pyspark.sql import functions as F
l1 = [('A','aa'),('B','bb'),('C','cc')]
l2 = [(1,2,3,43),(12,21,4,37)]
df1 = spark.createDataFrame(l1,['new_name','old_name'])
df2 = spark.createDataFrame(l2,['aa','bb','cc','dd'])
print df1.show()
+--------+--------+
|new_name|old_name|
+--------+--------+
| A| aa|
| B| bb|
| C| cc|
+--------+--------+
>>> df2.show()
+---+---+---+---+
| aa| bb| cc| dd|
+---+---+---+---+
| 1| 2| 3| 43|
| 12| 21| 4| 37|
+---+---+---+---+
when you need the missing column with null values,
>>>cols = df2.columns
>>> for i in cols:
val = df1.where(df1['old_name'] == i).first()
if val is not None:
df2 = df2.withColumnRenamed(i,val['new_name'])
else:
df2 = df2.withColumn(i,F.lit(None))
>>> df2.show()
+---+---+---+----+
| A| B| C| dd|
+---+---+---+----+
| 1| 2| 3|null|
| 12| 21| 4|null|
+---+---+---+----+
when we need only the mapping columns,changing the else part,
else:
df2 = df2.drop(i)
>>> df2.show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 12| 21| 4|
+---+---+---+
This will transform the original df2 dataframe though.

Resources