How can I use the literal value of a spark dataframe column? - apache-spark

I have a simple dataframe that looks like this:
+---+---+---+---+
|nm | ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
What I want to do is this:
+---+---+---+---+---+
|nm |ca |cb |cc |p |
+---+---+---+---+---+
|a |123|0 |0 |1 |
|b |1 |2 |3 |1 |
|c |0 |1 |0 |0 |
+---+---+---+---+---+
Basically, add a new column p such that, if the value of column nm is 'a', check whether column ca is > 0; if yes, put 1 in column p, else 0 (and likewise 'b' checks cb, 'c' checks cc).
My code:
def purchaseCol: UserDefinedFunction =
  udf((brand: String) => s"c$brand")

val a = ss.createDataset(List(
    ("a", 123, 0, 0),
    ("b", 1, 2, 3),
    ("c", 0, 1, 0)))
  .toDF("nm", "ca", "cb", "cc")

a.show()

a.withColumn("p", when(lit(DataFrameUtils.purchaseCol($"nm")) > 0, 1).otherwise(0))
  .show(false)
It doesn't seem to be working and returns 0 for all rows in column 'p'.
PS: There are over 100 columns and they are generated dynamically.

Map over the RDD, compute p, and add it to each row:
val a = sc.parallelize(List(
    ("a", 123, 0, 0),
    ("b", 1, 2, 3),
    ("c", 0, 1, 0))
  ).toDF("nm", "ca", "cb", "cc")

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val b = a.rdd.map { r =>
  val s = r.getAs[String]("nm")
  val v = r.getAs[Int](s"c$s")
  val p = if (v > 0) 1 else 0
  Row.fromSeq(r.toSeq :+ p)
}

val new_schema = StructType(a.schema :+ StructField("p", IntegerType, true))
val df_new = spark.createDataFrame(b, new_schema)
df_new.show
+---+---+---+---+---+
| nm| ca| cb| cc| p|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
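
As a side note, the original attempt returns 0 for every row because the UDF yields the column name (for example "ca") as a plain string, and comparing that literal string with 0 is never true; a UDF cannot look up another column's value by name at run time. If you would rather stay in the DataFrame API than drop to the RDD, here is a minimal sketch (my own, assuming every target column is literally named "c" plus the value of nm) that folds a when(...) chain over the dynamically generated "c*" columns:

import org.apache.spark.sql.functions.{col, lit, when}

// Sketch only: build a "value of column c<nm>" expression column by column,
// so no UDF or RDD round-trip is needed even with 100+ generated columns.
val cCols = a.columns.filter(_.startsWith("c"))
val picked = cCols.foldLeft(lit(0)) { (acc, c) =>
  when(col("nm") === c.stripPrefix("c"), col(c)).otherwise(acc)
}
a.withColumn("p", when(picked > 0, 1).otherwise(0)).show(false)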

If "c*" columns number is limited, UDF with all values can be used:
val nameMatcherFunct = (nm: String, ca: Int, cb: Int, cc: Int) => {
  val value = nm match {
    case "a" => ca
    case "b" => cb
    case "c" => cc
  }
  if (value > 0) 1 else 0
}

def purchaseValueUDF = udf(nameMatcherFunct)

val result = a.withColumn("p", purchaseValueUDF(col("nm"), col("ca"), col("cb"), col("cc")))
If you have many "c*" columns, a function that takes the whole Row as a parameter can be used (a minimal sketch follows the linked question below):
How to pass whole Row to UDF - Spark DataFrame filter
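
A sketch of that Row-based approach (not taken verbatim from the linked answer; it assumes the relevant column is always named "c" plus the value of nm):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Sketch: pass every column to the UDF as one struct so the value can be
// looked up by name inside the UDF, no matter how many "c*" columns exist.
val rowBased = udf((r: Row) => {
  val v = r.getAs[Int]("c" + r.getAs[String]("nm"))
  if (v > 0) 1 else 0
})

a.withColumn("p", rowBased(struct(a.columns.map(col): _*))).show(false)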

Looking at your logic:
if value of column nm is 'a', check column ca is > 0; if yes put 1 in column p, else 0
you can simply do:
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) && (col("ca") > 0), lit(1)).otherwise(lit(0)))
but looking at your output dataframe, you would require an || instead of &&
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) || (col("ca") > 0), lit(1)).otherwise(lit(0)))

val a1 = sc.parallelize(List(
    ("a", 123, 0, 0),
    ("b", 1, 2, 3),
    ("c", 0, 1, 0))
  ).toDF("nm", "ca", "cb", "cc")
a1.show()
+---+---+---+---+
| nm| ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
val newDf = a1.withColumn("P", when($"ca" > 0, 1).otherwise(0))
newDf.show()
+---+---+---+---+---+
| nm| ca| cb| cc| P|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+

Related

Check multiple columns for any column greater than zero using a regex

I need to apply a when function on multiple columns. I want to check if at least one of the columns has a value greater than 0.
This is my solution:
df.withColumn("any value", F.when(
(col("col1") > 0) |
(col("col2") > 0) |
(col("col3") > 0) |
...
(col("colX") > 0)
, "any greater than 0").otherwise(None))
Is it possible to do the same task with a regex, so I don't have to write all the column names?
So let's create sample data:
df = spark.createDataFrame(
[(0, 0, 0, 0), (0, 0, 2, 0), (0, 0, 0, 0), (1, 0, 0, 0)],
['a', 'b', 'c', 'd']
)
Then, you can build your condition from a list of columns (say all the columns of the dataframe) using map and reduce like this:
from functools import reduce
from pyspark.sql import functions as F

cols = df.columns
condition = reduce(lambda a, b: a | b, map(lambda c: F.col(c) > 0, cols))
df.withColumn("any value", F.when(condition, "any greater than 0")).show()
which yields:
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
Another way you could do this is to create an array column, use forall (available since Spark 3.1) to check it, and conditionally assign the value. Code below:
from pyspark.sql.functions import array, forall, when

df = (df.withColumn('any value', array(df.columns))
        .withColumn('any value',
                    when(forall('any value', lambda x: x == 0), None)
                    .otherwise("any greater than 0")))
df.show()
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+

Spark aggregation / group by so as to determine a new column's value based on col value in a set

I have some data that will be grouped by id.
id, field
0 A
0 B
0 C
1 B
1 B
1 C
2 E
I want to group by id and calculate a simple new value, is_special, which is True if any field in the group is in a special set {A, E} (just a random set of letters, no pattern).
id, is_special
0 True
1 False
2 True
Something like this question but in pyspark.
I want to understand how to do this group by without actually grouping, and just create a new column:
id, field, is_special
0 A, True
0 B, True
0 C, True
1 B, False
1 B, False
1 C, False
2 E, True
I think it can be done using something like the following, but I don't know how to combine the window with the when.
from pyspark.sql import Window
from pyspark.sql.functions import when, col, lit, coalesce

special = ['A', 'E']
window = Window.partitionBy('id')
df.withColumn("is_special",
              when(col("field").isin(special), lit(True))
)
Test set creation:
a = [
(0, "A"),
(0, "B"),
(0, "C"),
(1, "B"),
(1, "B"),
(1, "C"),
(2, "E"),
]
b = ["id", "field"]
df = spark.createDataFrame(a, b)
set_ = ("A", "E")
Several ways of doing that.
With a join
from pyspark.sql import functions as F
agg_df = (
df.withColumn(
"is_special", F.when(F.expr(f"field in {set_}"), True).otherwise(False)
)
.groupBy("id")
.agg(F.max("is_special").alias("is_special"))
)
df.join(agg_df, on="id", how="left").show()
+---+-----+----------+
| id|field|is_special|
+---+-----+----------+
| 0| A| true|
| 0| B| true|
| 0| C| true|
| 1| B| false|
| 1| B| false|
| 1| C| false|
| 2| E| true|
+---+-----+----------+
With a window
from pyspark.sql import Window
df.withColumn(
"is_special", F.when(F.expr(f"field in {set_}"), True).otherwise(False)
).withColumn("is_special", F.max("is_special").over(Window.partitionBy("id"))).show()
# OR "one-liner"
df.withColumn(
"is_special",
F.max(F.when(F.expr(f"field in {set_}"), True).otherwise(False)).over(
Window.partitionBy("id")
),
).show()
+---+-----+----------+
| id|field|is_special|
+---+-----+----------+
| 0| A| true|
| 0| B| true|
| 0| C| true|
| 1| B| false|
| 1| B| false|
| 1| C| false|
| 2| E| true|
+---+-----+----------+
As a small intellectual exercise, the following works as well:
from pyspark.sql.functions import (
array_intersect,
size,
array_except,
collect_set,
lit,
array,
explode,
)
df = sc.parallelize(
[
(0, "A"),
(0, "B"),
(0, "C"),
(1, "B"),
(1, "B"),
(1, "C"),
(2, "A"),
(2, "E"),
(2, "A"),
(2, "A"),
(2, "G"),
(2, "J"),
(3, "A"),
(4, "E"),
(5, "A"),
(5, "E"),
(6, "Z"),
]
).toDF(["id", "field"])
df2 = df.groupby("id").agg(collect_set("field").alias("X"))
df3a = df2.filter(size(array_intersect(df2["X"], lit(array(lit("E"), lit("A"))))) >= 1)
df3b = df2.filter(size(array_intersect(df2["X"], lit(array(lit("E"), lit("A"))))) == 0)
df4 = (
df3a.select(df3a.id, explode(df3a.X).alias("field"))
.withColumn("is_special", lit(True))
.union(
df3b.select(df3b.id, explode(df3b.X).alias("field")).withColumn(
"is_special", lit(False)
)
)
)
df4.show()
returns:
+---+-----+----------+
| id|field|is_special|
+---+-----+----------+
| 0| C| true|
| 0| B| true|
| 0| A| true|
| 5| E| true|
| 5| A| true|
| 3| A| true|
| 2| J| true|
| 2| E| true|
| 2| G| true|
| 2| A| true|
| 4| E| true|
| 6| Z| false|
| 1| C| false|
| 1| B| false|
+---+-----+----------+

How to compute the numerical difference between columns of different dataframes?

Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between them and store it in another dataframe (or, optionally, another data structure).
For instance let us have the following datasets
DataFrame A:
+----+---+
| A | B |
+----+---+
| 1| 0|
| 1| 0|
+----+---+
DataFrame B:
+----+---+
| A | B |
+----+---+
| 1| 0 |
| 0| 0 |
+----+---+
How to obtain B-A, i.e
+----+---+
| c1 | c2|
+----+---+
| 0| 0 |
| -1| 0 |
+----+---+
In practice the real dataframes have a substantial number of rows and 50+ columns for which the difference needs to be computed. What is the Spark/Scala way of doing it?
I was able to solve this by using the approach below. This code can work with any number of columns; you just have to change the input DFs accordingly. Note that rdd.zip assumes both dataframes have the same number of partitions and the same number of elements in each partition.
import org.apache.spark.sql.Row
val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")
val columns = df0.columns
val rdd = df0.rdd.zip(df1.rdd).map { x =>
  val arr = columns.map(column =>
    x._2.getAs[Int](column) - x._1.getAs[Int](column))
  Row(arr: _*)
}
spark.createDataFrame(rdd, df0.schema).show(false)
Output generated:
df0=>
+---+---+
|a |b |
+---+---+
|1 |5 |
|1 |4 |
+---+---+
df1=>
+---+---+
|a |b |
+---+---+
|1 |0 |
|3 |2 |
+---+---+
Output=>
+---+---+
|a |b |
+---+---+
|0 |-5 |
|2 |-2 |
+---+---+
If your df A has the same schema and row order as df B, you can try the approach below. I don't know if this will work correctly for large datasets; it would be better to already have an id for joining instead of creating one with monotonically_increasing_id() (a zipWithIndex sketch for that follows the output below).
import spark.implicits._
import org.apache.spark.sql.functions._
val df0 = Seq((1, 0), (1, 0)).toDF("a", "b")
val df1 = Seq((1, 0), (0, 0)).toDF("a", "b")
// new cols names
val colNamesA = df0.columns.map("A_" + _)
val colNamesB = df0.columns.map("B_" + _)
// rename cols and add id
val dfA = df0.toDF(colNamesA: _*)
.withColumn("id", monotonically_increasing_id())
val dfB = df1.toDF(colNamesB: _*)
.withColumn("id", monotonically_increasing_id())
dfA.show()
dfB.show()
// get columns without id
val dfACols = dfA.columns.dropRight(1).map(dfA(_))
val dfBCols = dfB.columns.dropRight(1).map(dfB(_))
// diff between cols
val calcCols = (dfACols zip dfBCols).map(s=>s._2-s._1)
// join dfs
val joined = dfA.join(dfB, "id")
joined.show()
calcCols.foreach(_.explain(true))
joined.select(calcCols:_*).show()
+---+---+---+
|A_a|A_b| id|
+---+---+---+
| 1| 0| 0|
| 1| 0| 1|
+---+---+---+
+---+---+---+
|B_a|B_b| id|
+---+---+---+
| 1| 0| 0|
| 0| 0| 1|
+---+---+---+
+---+---+---+---+---+
| id|A_a|A_b|B_a|B_b|
+---+---+---+---+---+
| 0| 1| 0| 1| 0|
| 1| 1| 0| 0| 0|
+---+---+---+---+---+
(B_a#26 - A_a#18)
(B_b#27 - A_b#19)
+-----------+-----------+
|(B_a - A_a)|(B_b - A_b)|
+-----------+-----------+
| 0| 0|
| -1| 0|
+-----------+-----------+
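
A sketch of that alternative, deriving the id with zipWithIndex instead of monotonically_increasing_id() (my own sketch, assuming the two dataframes really do have the same row order):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a 0-based positional id; unlike monotonically_increasing_id(),
// the value only depends on row position, so two dataframes built the
// same way end up with matching ids.
def withRowId(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema :+ StructField("id", LongType, nullable = false))
  val rdd = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  spark.createDataFrame(rdd, schema)
}

// Drop-in replacement for the two withColumn("id", monotonically_increasing_id()) calls above:
val dfA2 = withRowId(df0.toDF(colNamesA: _*))
val dfB2 = withRowId(df1.toDF(colNamesB: _*))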

How to define spark dataframe join match priority

I have two dataframes.
dataDF
+---+
| tt|
+---+
| a|
| b|
| c|
| ab|
+---+
alter
+----+-----+------+
|name|alter|profit|
+----+-----+------+
| a| aa| 1|
| b| a| 5|
| c| ab| 8|
+----+-----+------+
The task is to search for col("tt") in dataframe alter's col("name"); if found, join them; if not found, then search for col("tt") in col("alter"). The priority of col("name") is higher than col("alter"): if a row's col("tt") is matched on col("name"), I do not want to also match it to another row that only matches on col("alter"). How can I achieve this?
I tried to write a join, but it does not work.
dataDF = dataDF.select("*")
  .join(broadcast(alterDF),
    col("tt") === col("name") || col("tt") === col("alter"),
    "left")
The result is:
+---+----+-----+------+
| tt|name|alter|profit|
+---+----+-----+------+
| a| a| aa| 1|
| a| b| a| 5| // this row is not expected.
| b| b| a| 5|
| c| c| ab| 8|
| ab| c| ab| 8|
+---+----+-----+------+
You can try joining twice: first on the name column, then take the tt values that did not match and join them on the alter column, and union both results. Please find the code below; I hope it is helpful.
//Creating Test Data
val dataDF = Seq("a", "b", "c", "ab").toDF("tt")
val alter = Seq(("a", "aa", 1), ("b", "a", 5), ("c", "ab", 8))
.toDF("name", "alter", "profit")
val join1 = dataDF.join(alter, col("tt") === col("name"), "left")
val join2 = join1.filter( col("name").isNull).select("tt")
.join(alter, col("tt") === col("alter"), "left")
val joinDF = join1.filter( col("name").isNotNull).union(join2)
joinDF.show(false)
+---+----+-----+------+
|tt |name|alter|profit|
+---+----+-----+------+
|a |a |aa |1 |
|b |b |a |5 |
|c |c |ab |8 |
|ab |c |ab |8 |
+---+----+-----+------+

Spark: Match columns from two dataframes

I have a dataframe in the format below:
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Another dataframe contains coefficients for each column of the first dataframe, for example:
+------+------+---------+------+
| CE_sp|CE_sp2|CE_colour|CE_sp3|
+------+------+---------+------+
| 0.94| 0.31| 0.11| 0.72|
+------+------+---------+------+
Now I want to add a column to the first dataframe which is calculated by adding up the scores from the second dataframe, for example:
+---+---+------+---+-----+
| sp|sp2|colour|sp3|Score|
+---+---+------+---+-----+
| 0| 1| 1| 0| 0.42|
| 1| 0| 0| 1| 1.66|
| 0| 0| 1| 0| 0.11|
+---+---+------+---+-----+
i.e., with r being a row of the first dataframe:
score = r(0)*CE_sp + r(1)*CE_sp2 + r(2)*CE_colour + r(3)*CE_sp3
There can be any number of columns, and the order of the columns can be different.
Thanks in Advance!!!
Quick and simple:
import org.apache.spark.sql.functions.col
val df = Seq(
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
).toDF("sp","sp2", "colour", "sp3")
val coefs = Map("sp" -> 0.94, "sp2" -> 0.31, "colour" -> 0.11, "sp3" -> 0.72)
val score = df.columns.map(
c => col(c) * coefs.getOrElse(c, 0.0)).reduce(_ + _)
df.withColumn("score", score)
And the same thing in PySpark:
from pyspark.sql.functions import col
df = sc.parallelize([
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
]).toDF(["sp","sp2", "colour", "sp3"])
coefs = {"sp": 0.94, "sp2": 0.32, "colour": 0.11, "sp3": 0.72}
df.withColumn("score", sum(col(c) * coefs.get(c, 0) for c in df.columns))
I believe there are many ways to accomplish what you are trying to do. In all cases you don't need that second DataFrame, as I said in the comments.
Here is one way:
import org.apache.spark.ml.feature.{ElementwiseProduct, VectorAssembler}
import org.apache.spark.mllib.linalg.{Vectors, Vector => MLVector} // on Spark 2.x and later, use org.apache.spark.ml.linalg instead
val df = Seq((0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)).toDF("sp", "sp2", "colour", "sp3")
// Your coefficient represents a dense Vector
val coeffSp = 0.94
val coeffSp2 = 0.31
val coeffColour = 0.11
val coeffSp3 = 0.72
val weightVectors = Vectors.dense(Array(coeffSp, coeffSp2, coeffColour, coeffSp3))
// You can assemble the features with VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(df.columns) // since you need to compute on all your columns
.setOutputCol("features")
// Once these features assembled we can perform an element wise product with the weight vector
val output = assembler.transform(df)
val transformer = new ElementwiseProduct()
.setScalingVec(weightVectors)
.setInputCol("features")
.setOutputCol("weightedFeatures")
// Create an UDF to sum the weighted vectors values
import org.apache.spark.sql.functions.udf
def score = udf((score: MLVector) => { score.toDense.toArray.sum })
// Apply the UDF on the weightedFeatures
val scores = transformer.transform(output).withColumn("score",score('weightedFeatures))
scores.show
// +---+---+------+---+-----------------+-------------------+-----+
// | sp|sp2|colour|sp3| features| weightedFeatures|score|
// +---+---+------+---+-----------------+-------------------+-----+
// | 0| 1| 1| 0|[0.0,1.0,1.0,0.0]|[0.0,0.31,0.11,0.0]| 0.42|
// | 1| 0| 0| 1|[1.0,0.0,0.0,1.0]|[0.94,0.0,0.0,0.72]| 1.66|
// | 0| 0| 1| 0| (4,[2],[1.0])| (4,[2],[0.11])| 0.11|
// +---+---+------+---+-----------------+-------------------+-----+
I hope this helps. Don't hesitate if you have more questions.
Here is a simple solution:
scala> df_wght.show
+-----+------+---------+------+
|ce_sp|ce_sp2|ce_colour|ce_sp3|
+-----+------+---------+------+
| 1| 2| 3| 4|
+-----+------+---------+------+
scala> df.show
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Then we can just do a simple cross join and compute the weighted sum as an expression:
val scored = df.join(df_wght).selectExpr("(sp*ce_sp + sp2*ce_sp2 + colour*ce_colour + sp3*ce_sp3) as final_score")
The output:
scala> scored.show
+-----------+
|final_score|
+-----------+
| 5|
| 5|
| 3|
+-----------+
