I have a dataframe in the format below:
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Another dataframe contains coefficients for each column of the first dataframe, for example:
+------+------+---------+------+
| CE_sp|CE_sp2|CE_colour|CE_sp3|
+------+------+---------+------+
| 0.94| 0.31| 0.11| 0.72|
+------+------+---------+------+
Now I want to add a column to the first dataframe, calculated by multiplying each column by its coefficient from the second dataframe and summing the results. For example:
+---+---+------+---+-----+
| sp|sp2|colour|sp3|Score|
+---+---+------+---+-----+
| 0| 1| 1| 0| 0.42|
| 1| 0| 0| 1| 1.66|
| 0| 0| 1| 0| 0.11|
+---+---+------+---+-----+
i.e., where r is a row of the first dataframe:
score = r(0)*CE_sp + r(1)*CE_sp2 + r(2)*CE_colour + r(3)*CE_sp3
There can be any number of columns, and the column order can differ.
Thanks in Advance!!!
Quick and simple:
import org.apache.spark.sql.functions.col
val df = Seq(
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
).toDF("sp","sp2", "colour", "sp3")
val coefs = Map("sp" -> 0.94, "sp2" -> 0.31, "colour" -> 0.11, "sp3" -> 0.72)
// score = sum over all columns of (column value * coefficient), using 0.0 if a column has no coefficient
val score = df.columns.map(c => col(c) * coefs.getOrElse(c, 0.0)).reduce(_ + _)
df.withColumn("score", score)
And the same thing in PySpark:
from pyspark.sql.functions import col
df = sc.parallelize([
(0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)
]).toDF(["sp","sp2", "colour", "sp3"])
coefs = {"sp": 0.94, "sp2": 0.32, "colour": 0.11, "sp3": 0.72}
df.withColumn("score", sum(col(c) * coefs.get(c, 0) for c in df.columns))
I believe there are many ways to accomplish what you are trying to do. In all cases you don't need that second DataFrame, as I said in the comments.
Here is one way:
import org.apache.spark.ml.feature.{ElementwiseProduct, VectorAssembler}
import org.apache.spark.mllib.linalg.{Vectors,Vector => MLVector}
val df = Seq((0, 1, 1, 0), (1, 0, 0, 1), (0, 0, 1, 0)).toDF("sp", "sp2", "colour", "sp3")
// Your coefficients form a dense vector
val coeffSp = 0.94
val coeffSp2 = 0.31
val coeffColour = 0.11
val coeffSp3 = 0.72
val weightVectors = Vectors.dense(Array(coeffSp, coeffSp2, coeffColour, coeffSp3))
// You can assemble the features with VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(df.columns) // since you need to compute on all your columns
.setOutputCol("features")
// Once the features are assembled, we can perform an element-wise product with the weight vector
val output = assembler.transform(df)
val transformer = new ElementwiseProduct()
.setScalingVec(weightVectors)
.setInputCol("features")
.setOutputCol("weightedFeatures")
// Create a UDF to sum the weighted vector's values
import org.apache.spark.sql.functions.udf
def score = udf((score: MLVector) => { score.toDense.toArray.sum })
// Apply the UDF on the weightedFeatures
val scores = transformer.transform(output).withColumn("score",score('weightedFeatures))
scores.show
// +---+---+------+---+-----------------+-------------------+-----+
// | sp|sp2|colour|sp3| features| weightedFeatures|score|
// +---+---+------+---+-----------------+-------------------+-----+
// | 0| 1| 1| 0|[0.0,1.0,1.0,0.0]|[0.0,0.31,0.11,0.0]| 0.42|
// | 1| 0| 0| 1|[1.0,0.0,0.0,1.0]|[0.94,0.0,0.0,0.72]| 1.66|
// | 0| 0| 1| 0| (4,[2],[1.0])| (4,[2],[0.11])| 0.11|
// +---+---+------+---+-----------------+-------------------+-----+
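For PySpark users, a rough equivalent sketch of the same pipeline, assuming Spark 2.x or later (where the vector classes live in pyspark.ml.linalg) and a PySpark df with the same four columns:
from pyspark.ml.feature import ElementwiseProduct, VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
weights = Vectors.dense([0.94, 0.31, 0.11, 0.72])
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
transformer = ElementwiseProduct(scalingVec=weights, inputCol="features", outputCol="weightedFeatures")
# sum the weighted vector's values into the final score
sum_vector = udf(lambda v: float(v.toArray().sum()), DoubleType())
scores = transformer.transform(assembler.transform(df)).withColumn("score", sum_vector("weightedFeatures"))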
I hope this helps. Don't hesitate if you have more questions.
Here is a simple solution:
scala> df_wght.show
+-----+------+---------+------+
|ce_sp|ce_sp2|ce_colour|ce_sp3|
+-----+------+---------+------+
| 1| 2| 3| 4|
+-----+------+---------+------+
scala> df.show
+---+---+------+---+
| sp|sp2|colour|sp3|
+---+---+------+---+
| 0| 1| 1| 0|
| 1| 0| 0| 1|
| 0| 0| 1| 0|
+---+---+------+---+
Then we can just do a simple cross join and compute the weighted sum:
val scored = df.join(df_wght).selectExpr("(sp*ce_sp + sp2*ce_sp2 + colour*ce_colour + sp3*ce_sp3) as final_score")
The output:
scala> scored.show
+-----------+
|final_score|
+-----------+
| 5|
| 5|
| 3|
+-----------+
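Since the question notes that the number and order of columns can vary, the same cross-join idea can also be written without hard-coding the expression. A PySpark sketch of that, assuming the coefficient columns follow the ce_<name> naming used above:
from functools import reduce
from pyspark.sql.functions import col
# cross join the one-row coefficient frame onto every row, then build the weighted sum over all columns
joined = df.crossJoin(df_wght)
score = reduce(lambda a, b: a + b, [col(c) * col("ce_" + c) for c in df.columns])
joined.withColumn("final_score", score).select(df.columns + ["final_score"]).show()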
I need to apply a when function on multiple columns. I want to check if at least one of the columns has a value greater than 0.
This is my solution:
df.withColumn("any value", F.when(
(col("col1") > 0) |
(col("col2") > 0) |
(col("col3") > 0) |
...
(col("colX") > 0)
, "any greater than 0").otherwise(None))
Is it possible to do the same task with a regex, so I don't have to write all the column names?
So let's create sample data:
df = spark.createDataFrame(
[(0, 0, 0, 0), (0, 0, 2, 0), (0, 0, 0, 0), (1, 0, 0, 0)],
['a', 'b', 'c', 'd']
)
Then, you can build your condition from a list of columns (say all the columns of the dataframe) using map and reduce like this:
from functools import reduce
from pyspark.sql import functions as F
cols = df.columns
condition = reduce(lambda a, b: a | b, map(lambda c: F.col(c) > 0, cols))
df.withColumn("any value", F.when(condition, "any greater than 0")).show()
which yields:
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
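To address the regex part of the question, you can filter df.columns before building the condition. A minimal sketch, assuming the columns follow the col1 ... colX naming from the question (the pattern is hypothetical, adapt it to your real column names):
import re
from functools import reduce
from pyspark.sql import functions as F
pattern = re.compile(r"^col\d+$")  # hypothetical pattern; match it to your column names
cols = [c for c in df.columns if pattern.match(c)]
condition = reduce(lambda a, b: a | b, [F.col(c) > 0 for c in cols])
df.withColumn("any value", F.when(condition, "any greater than 0")).show()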
Another way to do this is to build an array of the column values, check it with forall, and assign the value conditionally (forall is available as a Python function from Spark 3.1 onwards). Code below:
from pyspark.sql.functions import array, forall, when
df = df.withColumn('any value', array(df.columns))\
       .withColumn('any value', when(forall('any value', lambda x: x == 0), None).otherwise("any greater than 0"))
df.show()
+---+---+---+---+------------------+
| a| b| c| d| any value|
+---+---+---+---+------------------+
| 0| 0| 0| 0| null|
| 0| 0| 2| 0|any greater than 0|
| 0| 0| 0| 0| null|
| 1| 0| 0| 0|any greater than 0|
+---+---+---+---+------------------+
I have a dataframe in pyspark
id | value
1 0
1 1
1 0
2 1
2 0
3 0
3 0
3 1
I want to extract, within each id group, all the rows from the first occurrence of 1 in the value column onwards. I have created a Window partitioned by id, but I do not know how to select the rows from that first 1 onwards.
I'm expecting the result to be:
id | value
1 1
1 0
2 1
2 0
3 1
The solution below may be relevant here (it works perfectly on small data but may cause problems on big data if the rows for an id sit on multiple partitions):
df = sqlContext.createDataFrame([
[1, 0],
[1, 1],
[1, 0],
[2, 1],
[2, 0],
[3, 0],
[3, 0],
[3, 1]
],
['id', 'Value']
)
df.show()
+---+-----+
| id|Value|
+---+-----+
| 1| 0|
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 0|
| 3| 0|
| 3| 1|
+---+-----+
#importing Libraries
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys
#This way we can generate a cumulative sum for values
df.withColumn(
"sum",
F.sum(
"value"
).over(W.partitionBy(["id"]).rowsBetween(-sys.maxsize, 0))
).show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 0| 0|
| 1| 1| 1|
| 1| 0| 1|
| 3| 0| 0|
| 3| 0| 0|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
#Filter all those which are having sum > 0
df.withColumn(
"sum",
F.sum(
"value"
).over(W.partitionBy(["id"]).rowsBetween(-sys.maxsize, 0))
).where("sum > 0").show()
+---+-----+-----+
| id|Value|sum |
+---+-----+-----+
| 1| 1| 1|
| 1| 0| 1|
| 3| 1| 1|
| 2| 1| 1|
| 2| 0| 1|
+---+-----+-----+
Before running this, you must be sure that the data for each id sits in a single partition; no id should be spread across two partitions.
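As a minor modernization, on current Spark versions the frame bounds can be written with W.unboundedPreceding and W.currentRow instead of -sys.maxsize; a minimal sketch, assuming the same imports as above:
# same running sum with the idiomatic frame bounds
w = W.partitionBy("id").rowsBetween(W.unboundedPreceding, W.currentRow)
df.withColumn("sum", F.sum("value").over(w)).where("sum > 0").show()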
Ideally, you would need to:
Create a window partitioned by id and ordered the same way the dataframe already is
Keep only the rows for which there is a "one" at or before them in the window
AFAIK, there is no lookup function within windows in Spark. Yet, you could follow this idea and work something out. Let's first create the data and import functions and windows.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
l = [(1, 0), (1, 1), (1, 0), (2, 1), (2, 0), (3, 0), (3, 0), (3, 1)]
df = spark.createDataFrame(l, ['id', 'value'])
Then, let's add an index on the dataframe (it's free) to be able to order the windows.
indexedDf = df.withColumn("index", F.monotonically_increasing_id())
Then we create a window that only looks at the values before the current row, ordered by that index and partitioned by id.
w = Window.partitionBy("id").orderBy("index").rowsBetween(Window.unboundedPreceding, 0)
Finally, we use that window to collect the set of preceding values of each row, and filter out the ones that do not contain 1. Optionally, we order back by index because the windowing does not preserve the order by id column.
indexedDf\
.withColumn('set', F.collect_set(F.col('value')).over(w))\
.where(F.array_contains(F.col('set'), 1))\
.orderBy("index")\
.select("id", "value").show()
+---+-----+
| id|value|
+---+-----+
| 1| 1|
| 1| 0|
| 2| 1|
| 2| 0|
| 3| 1|
+---+-----+
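A possible simplification of the same idea: since value only takes 0 and 1, a running max over the same window is 1 from the first 1 onwards, which avoids collecting a set per row. A minimal sketch, reusing indexedDf and w from above:
indexedDf\
    .withColumn('seen_one', F.max('value').over(w))\
    .where(F.col('seen_one') == 1)\
    .orderBy('index')\
    .select('id', 'value').show()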
I need to write some custom code using multiple columns within a group of my data.
My custom code is to set a flag if a value is over a threshold, but suppress the flag if it is within a certain time of a previous flag.
Here is some sample code:
df = spark.createDataFrame(
[
("a", 1, 0),
("a", 2, 1),
("a", 3, 1),
("a", 4, 1),
("a", 5, 1),
("a", 6, 0),
("a", 7, 1),
("a", 8, 1),
("b", 1, 0),
("b", 2, 1)
],
["group_col","order_col", "flag_col"]
)
df.show()
+---------+---------+--------+
|group_col|order_col|flag_col|
+---------+---------+--------+
| a| 1| 0|
| a| 2| 1|
| a| 3| 1|
| a| 4| 1|
| a| 5| 1|
| a| 6| 0|
| a| 7| 1|
| a| 8| 1|
| b| 1| 0|
| b| 2| 1|
+---------+---------+--------+
from pyspark.sql.functions import udf, col, asc
from pyspark.sql.window import Window
def _suppress(dates=None, alert_flags=None, window=2):
sup_alert_flag = alert_flag
last_alert_date = None
for i, alert_flag in enumerate(alert_flag):
current_date = dates[i]
if alert_flag == 1:
if not last_alert_date:
sup_alert_flag[i] = 1
last_alert_date = current_date
elif (current_date - last_alert_date) > window:
sup_alert_flag[i] = 1
last_alert_date = current_date
else:
sup_alert_flag[i] = 0
else:
alert_flag = 0
return sup_alert_flag
suppress_udf = udf(_suppress, DoubleType())
df_out = df.withColumn("supressed_flag_col", suppress_udf(dates=col("order_col"), alert_flags=col("flag_col"), window=4).Window.partitionBy(col("group_col")).orderBy(asc("order_col")))
df_out.show()
The above fails, but my expected output is the following:
+---------+---------+--------+------------------+
|group_col|order_col|flag_col|supressed_flag_col|
+---------+---------+--------+------------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| a| 4| 1| 0|
| a| 5| 1| 0|
| a| 6| 0| 0|
| a| 7| 1| 1|
| a| 8| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+------------------+
Editing answer after more thought.
The general problem seems to be that the result of the current row depends upon the result of the previous row. In effect, there is a recurrence relation. I haven't found a good way to implement a recursive UDF in Spark. There are several challenges resulting from the distributed nature of the data in Spark which make this difficult to achieve, at least in my mind. The following solution should work but may not scale for large data sets.
from pyspark.sql import Row
import pyspark.sql.functions as F
import pyspark.sql.types as T
suppress_flag_row = Row("order_col", "flag_col", "res_flag")
def suppress_flag( date_alert_flags, window_size ):
sorted_alerts = sorted( date_alert_flags, key=lambda x: x["order_col"])
res_flags = []
last_alert_date = None
for row in sorted_alerts:
current_date = row["order_col"]
aflag = row["flag_col"]
if aflag == 1 and (not last_alert_date or (current_date - last_alert_date) > window_size):
res = suppress_flag_row(current_date, aflag, True)
last_alert_date = current_date
else:
res = suppress_flag_row(current_date, aflag, False)
res_flags.append(res)
return res_flags
in_fields = [T.StructField("order_col", T.IntegerType(), nullable=True )]
in_fields.append( T.StructField("flag_col", T.IntegerType(), nullable=True) )
out_fields = list(in_fields)  # copy, so appending res_flag below does not also modify in_fields
out_fields.append(T.StructField("res_flag", T.BooleanType(), nullable=True) )
out_schema = T.StructType(out_fields)
suppress_udf = F.udf(suppress_flag, T.ArrayType(out_schema) )
window_size = 4
tmp = df.groupBy("group_col").agg( F.collect_list( F.struct( F.col("order_col"), F.col("flag_col") ) ).alias("date_alert_flags"))
tmp2 = tmp.select(F.col("group_col"), suppress_udf(F.col("date_alert_flags"), F.lit(window_size)).alias("suppress_res"))
expand_fields = [F.col("group_col")] + [F.col("res_expand")[f.name].alias(f.name) for f in out_fields]
final_df = tmp2.select(F.col("group_col"), F.explode(F.col("suppress_res")).alias("res_expand")).select( expand_fields )
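To inspect the result, one could flatten back to the input order and cast the flag to 0/1 to match the expected column; a sketch, assuming the variables defined above:
final_df\
    .withColumn("supressed_flag_col", F.col("res_flag").cast(T.IntegerType()))\
    .orderBy("group_col", "order_col")\
    .select("group_col", "order_col", "flag_col", "supressed_flag_col")\
    .show()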
I think you don't need a custom function for this. You can use the rowsBetween option along with a window to get the previous 5-row range. Please check and let me know if I missed something.
>>> from pyspark.sql import functions as F
>>> from pyspark.sql import Window
>>> w = Window.partitionBy('group_col').orderBy('order_col').rowsBetween(-5,-1)
>>> df = df.withColumn('supr_flag_col',F.when(F.sum('flag_col').over(w) == 0,1).otherwise(0))
>>> df.orderBy('group_col','order_col').show()
+---------+---------+--------+-------------+
|group_col|order_col|flag_col|supr_flag_col|
+---------+---------+--------+-------------+
| a| 1| 0| 0|
| a| 2| 1| 1|
| a| 3| 1| 0|
| b| 1| 0| 0|
| b| 2| 1| 1|
+---------+---------+--------+-------------+
I have this simple dataframe that looks like this,
+---+---+---+---+
|nm | ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
What I want to do is,
+---+---+---+---+---+
|nm |ca |cb |cc |p |
+---+---+---+---+---+
|a |123|0 |0 |1 |
|b |1 |2 |3 |1 |
|c |0 |1 |0 |0 |
+---+---+---+---+---+
Basically, add a new column p such that, if the value of column nm is 'a', check whether column ca is > 0; if yes, put 1 in column p, else 0.
My code:
def purchaseCol: UserDefinedFunction =
udf((brand: String) => s"c$brand")
val a = ss.createDataset(List(
("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0)))
.toDF("nm", "ca", "cb", "cc")
a.show()
a.withColumn("p", when(lit(DataFrameUtils.purchaseCol($"nm")) > 0, 1).otherwise(0))
.show(false)
It doesn't seem to be working and returns 0 for all rows in col 'p'.
PS: There are over 100 columns and they are dynamically generated.
Map over the RDD, calculate p and add it to each row:
val a = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val b = a.rdd.map(r => {
val s = r.getAs[String]("nm")
val v = r.getAs[Int](s"c$s")
val p = if(v > 0) 1 else 0
Row.fromSeq(r.toSeq :+ p)
})
val new_schema = StructType(a.schema :+ StructField("p", IntegerType, true))
val df_new = spark.createDataFrame(b, new_schema)
df_new.show
+---+---+---+---+---+
| nm| ca| cb| cc| p|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
If "c*" columns number is limited, UDF with all values can be used:
val nameMatcherFunct = (nm: String, ca: Int, cb: Int, cc: Int) => {
val value = nm match {
case "a" => ca
case "b" => cb
case "c" => cc
}
if (value > 0) 1 else 0
}
def purchaseValueUDF = udf(nameMatcherFunct)
val result = a.withColumn("p", purchaseValueUDF(col("nm"), col("ca"), col("cb"), col("cc")))
If you have many "c*" columns, function with Row as parameter can be used:
How to pass whole Row to UDF - Spark DataFrame filter
Looking at your logic:
if the value of column nm is 'a', check whether column ca is > 0; if yes, put 1 in column p, else 0
you can simply do:
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) && (col("ca") > 0), lit(1)).otherwise(lit(0)))
but looking at your output dataframe, you would require an || instead of &&
import org.apache.spark.sql.functions._
a.withColumn("p", when((col("nm") === lit("a")) || (col("ca") > 0), lit(1)).otherwise(lit(0)))
val a1 = sc.parallelize(
List(("a", 123, 0, 0),
("b", 1, 2, 3),
("c", 0, 1, 0))
).toDF("nm", "ca", "cb", "cc")
a1.show()
+---+---+---+---+
| nm| ca| cb| cc|
+---+---+---+---+
| a|123| 0| 0|
| b| 1| 2| 3|
| c| 0| 1| 0|
+---+---+---+---+
val newDf = a1.withColumn("P", when($"ca" > 0, 1).otherwise(0))
newDf.show()
+---+---+---+---+---+
| nm| ca| cb| cc| P|
+---+---+---+---+---+
| a|123| 0| 0| 1|
| b| 1| 2| 3| 1|
| c| 0| 1| 0| 0|
+---+---+---+---+---+
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to take, for each row, the maximum value across those m columns.
For example:
Input: PySpark DataFrame containing :
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
You can reduce using SQL expressions over a list of columns:
from pyspark.sql.functions import max as max_, col, when
from functools import reduce
def row_max(*cols):
return reduce(
lambda x, y: when(x > y, x).otherwise(y),
[col(c) if isinstance(c, str) else c for c in cols]
)
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
.toDF(["a", "b", "c"]))
df.select(row_max("a", "b", "c").alias("max")))
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
If you want to keep the name of the max column, you can use structs:
from pyspark.sql.functions import struct, lit
def row_max_with_name(*cols):
cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
And finally, you can use the above to select the overall "top" column:
from pyspark.sql.functions import max
((_, c), ) = (maxs
.groupBy(col("maxs")["col"].alias("col"))
.count()
.agg(max(struct(col("count"), col("col"))))
.first())
df.select(c)
We can use greatest
Creating DataFrame
df = spark.createDataFrame(
[[1,2,3], [2,1,2], [3,4,5]],
['col_1','col_2','col_3']
)
df.show()
+-----+-----+-----+
|col_1|col_2|col_3|
+-----+-----+-----+
| 1| 2| 3|
| 2| 1| 2|
| 3| 4| 5|
+-----+-----+-----+
Solution
from pyspark.sql.functions import greatest
df2 = df.withColumn('max_by_rows', greatest('col_1', 'col_2', 'col_3'))
#Only if you need col
#from pyspark.sql.functions import col
#df2 = df.withColumn('max', greatest(col('col_1'), col('col_2'), col('col_3')))
df2.show()
+-----+-----+-----+-----------+
|col_1|col_2|col_3|max_by_rows|
+-----+-----+-----+-----------+
| 1| 2| 3| 3|
| 2| 1| 2| 2|
| 3| 4| 5| 5|
+-----+-----+-----+-----------+
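If the m columns are held in a Python list rather than typed out, greatest can be unpacked over it. A minimal sketch, where cols_to_compare is a hypothetical list name:
from pyspark.sql.functions import greatest, col
cols_to_compare = ['col_1', 'col_2', 'col_3']  # hypothetical list of the m columns
df3 = df.withColumn('max_by_rows', greatest(*[col(c) for c in cols_to_compare]))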
You can also use the pyspark built-in least:
from pyspark.sql.functions import least, col
df = df.withColumn('min', least(col('c1'), col('c2'), col('c3')))
Another simple way of doing it. Let us say that the below df is your dataframe
df = sc.parallelize([(10, 10, 1 ), (200, 2, 20), (3, 30, 300), (400, 40, 4)]).toDF(["c1", "c2", "c3"])
df.show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| 10| 10| 1|
|200| 2| 20|
| 3| 30|300|
|400| 40| 4|
+---+---+---+
You can process the above df as below to get the desired results:
from pyspark.sql.functions import lit, min
df.select( lit('c1').alias('cn1'), min(df.c1).alias('c1'),
lit('c2').alias('cn2'), min(df.c2).alias('c2'),
lit('c3').alias('cn3'), min(df.c3).alias('c3')
)\
.rdd.flatMap(lambda r: [ (r.cn1, r.c1), (r.cn2, r.c2), (r.cn3, r.c3)])\
.toDF(['Columnn', 'Min']).show()
+-------+---+
|Columnn|Min|
+-------+---+
| c1| 3|
| c2| 2|
| c3| 1|
+-------+---+
Scala solution:
val df = sc.parallelize(Seq((10, 10, 1), (200, 2, 20), (3, 30, 300), (400, 40, 4))).toDF("c1", "c2", "c3")
// note: the values are compared as Strings below, so the min is lexicographic
df.rdd.map(row => List[String](row(0).toString, row(1).toString, row(2).toString)).map(x => (x(0), x(1), x(2), x.min)).toDF("c1", "c2", "c3", "min").show
+---+---+---+---+
| c1| c2| c3|min|
+---+---+---+---+
| 10| 10| 1| 1|
|200| 2| 20| 2|
| 3| 30|300| 3|
|400| 40| 4| 4|
+---+---+---+---+