Manipulating a dataframe within a Spark UDF - apache-spark

I have a UDF that filters and selects values from a dataframe, but it runs into an "object not serializable" error. Details below.
Suppose I have a dataframe df1 that has columns with names ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want to sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:
val y_list = Seq("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID: String, value: Int): Double = {
df1.filter($"ID" === ID).select(y_list:_*).first.toSeq.toList.take(value).foldLeft(0.0)(_+_)
}
sqlContext.udf.register("udf_test", udf_test _)
val df_result = df2.withColumn("Result", callUDF("udf_test", $"ID", $"Value"))
This gives me errors of the form:
java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: Y1)
I looked this up and realized that Spark Column is not serializable. I am wondering:
1) Is there any way to manipulate a dataframe within a UDF?
2) If not, what's the best way to achieve the type of operation above? My real case is more complicated than this. It requires me to select values from multiple small dataframes based on some columns in a big dataframe, and compute back a value to the big dataframe.
I am using Spark 1.6.3. Thanks!

You can't use Dataset operations inside UDFs. A UDF can only operate on existing columns and produce one result column. It can't filter a Dataset or perform aggregations, but it can be used inside a filter. A UDAF can also aggregate values.
Instead, you can use .as[SomeCaseClass] to make a Dataset from a DataFrame and use normal, strongly typed functions inside filter, map, reduce.
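As a rough illustration of the typed approach, here is a minimal sketch. It assumes Spark 2.x or later (SparkSession), which is newer than the 1.6.3 mentioned in the question, and the case class Reading and its fields are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; the field names must match the DataFrame column names.
case class Reading(id: String, y1: Double, y2: Double)

val spark = SparkSession.builder().master("local[*]").appName("typed-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0, 2.0), ("b", 3.0, 4.0)).toDF("id", "y1", "y2")

// Convert the untyped DataFrame to a typed Dataset, then use plain Scala lambdas.
val ds = df.as[Reading]
val totals = ds.filter(r => r.id != "b").map(r => r.y1 + r.y2).collect()
```

The lambdas here run on rows of a single Dataset; they still cannot reach into another DataFrame, so a lookup against a second table has to be expressed as a join.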
Edit: If you want to join your bigDF with every small DF in smallDFs List, you can do:
import org.apache.spark.sql.functions._
val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
broadcast is a function that adds a broadcast hint to a small DF, so that Spark uses the more efficient broadcast join for it instead of a sort-merge join.

1) No, you can only use plain scala code within UDFs
2) If I interpreted your code correctly, you can achieve your goal with:
df2
.join(
df1.select($"ID",y_list.foldLeft(lit(0))(_ + _).as("Result")),Seq("ID")
)
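The question's UDF summed only the first "Value" columns, while the join above sums all of them. One possible way to keep that behaviour without a UDF is sketched below. It assumes Spark 2.x or later, uses only three Y columns, and the sample data is invented for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("partial-sum-sketch").getOrCreate()
import spark.implicits._

// Invented sample data with three Y columns instead of ten.
val yCols = (1 to 3).map(i => s"Y$i")
val df1 = Seq(("a", 1.0, 2.0, 3.0), ("b", 10.0, 20.0, 30.0)).toDF("ID", "Y1", "Y2", "Y3")
val df2 = Seq(("a", 2), ("b", 3)).toDF("ID", "Value")

// The column at 0-based position idx contributes only when Value > idx,
// i.e. only the first `Value` columns are added up.
val partialSum = yCols.zipWithIndex
  .map { case (name, idx) => when(col("Value") > idx, col(name)).otherwise(lit(0.0)) }
  .reduce(_ + _)

val result = df2.join(df1, Seq("ID"))
  .select(col("ID"), col("Value"), partialSum.as("Result"))

val resultMap = result.collect().map(r => r.getString(0) -> r.getDouble(2)).toMap
```

With this data, "a" has Value 2 and gets Y1 + Y2, and "b" has Value 3 and gets all three columns.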

import org.apache.spark.sql.functions._
val events = Seq (
(1,1,2,3,4),
(2,1,2,3,4),
(3,1,2,3,4),
(4,1,2,3,4),
(5,1,2,3,4)).toDF("ID","amt1","amt2","amt3","amt4")
var prev_amt5=0
var i=1
def getamt5value(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) : Int = {
if(i==1){
i=i+1
prev_amt5=0
}else{
i=i+1
}
if (ID == 0)
{
if(amt1==0)
{
val cur_amt5= 1
prev_amt5=cur_amt5
cur_amt5
}else{
val cur_amt5=1*(amt2+amt3)
prev_amt5=cur_amt5
cur_amt5
}
}else if (amt4==0 || (prev_amt5==0 & amt1==0)){
val cur_amt5=0
prev_amt5=cur_amt5
cur_amt5
}else{
val cur_amt5=prev_amt5 + amt2 + amt3 + amt4
prev_amt5=cur_amt5
cur_amt5
}
}
val getamt5 = udf {(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) =>
getamt5value(ID,amt1,amt2,amt3,amt4)
}
events.withColumn("amnt5", getamt5(events.col("ID"),events.col("amt1"),events.col("amt2"),events.col("amt3"),events.col("amt4"))).show()

Related

Spark Streaming reach dataframe columns and add new column looking up to Redis

In my previous question (Spark Structured Streaming dynamic lookup with Redis), I managed to reach Redis with mapPartitions thanks to https://stackoverflow.com/users/689676/fe2s
I tried to use mapPartitions, but there is one point I could not solve: how can I access each row's columns while iterating in the code below?
I want to enrich each row with the lookup fields kept in Redis.
I found something like this, but how can I read the dataframe columns and add a new column by looking it up in Redis?
I would really appreciate any help. Thanks.
import org.apache.spark.sql.types._
def transformRow(row: Row): Row = {
Row.fromSeq(row.toSeq ++ Array[Any]("val1", "val2"))
}
def transformRows(iter: Iterator[Row]): Iterator[Row] =
{
val redisConn =new RedisClient("xxx.xxx.xx.xxx",6379,1,Option("Secret123"))
println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
//want to reach DataFrame column here
redisConn.close()
iter.map(transformRow)
}
val newSchema = StructType(raw_customer_df.schema.fields ++
Array(
StructField("ModelValidityPeriod", StringType, false),
StructField("ModelValidityPeriod2", StringType, false)
)
)
spark.sqlContext.createDataFrame(raw_customer_df.rdd.mapPartitions(transformRows), newSchema).show
Iterator iter represents an iterator over the dataframe rows. So if I understood your question correctly, you can access column values by iterating over iter and calling
row.getAs[Column_Type](column_name)
Something like this
def transformRows(iter: Iterator[Row]): Iterator[Row] = {
val redisConn = new RedisClient("xxx.xxx.xx.xxx",6379,1,Option("Secret123"))
println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
//want to reach DataFrame column here
val res = iter.map { row =>
val columnValue = row.getAs[String]("column_name")
// lookup in redis
val valueFromRedis = redisConn.get(...)
Row.fromSeq(row.toSeq ++ Array[Any](valueFromRedis))
}.toList
redisConn.close()
res.iterator
}

Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala?

I recently read an article that described how to custom partition a dataframe
[ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ] in which the author illustrated the technique in Python. I use Scala, and the technique looked like a good way to address issues of skew, so I tried something similar, and what I found was that when one does the following:
- create 2 data frames, D1, D2
- convert D1, D2 to 2 Pair RDDs R1,R2
(where the key is the key you want to join on)
- repartition R1,R2 with a custom partitioner 'C'
where 'C' has 2 partitions (p-0,p-1) and
stuffs everything in P-1, except keys == 'a'
- join R1,R2 as R3
- OBSERVE that:
- partitioner for R3 is 'C' (same for R1,R2)
- when printing the contents of each partition of R3 all entries
except the one keyed by 'a' is in p-1
- set D1' <- R1.toDF
- set D2' <- R2.toDF
We note the following results:
0) The join of D1' and D2' produces expected results (good)
1) The partitioners for D1' and D2' are None -- not Some(C),
as was the case with RDD's R1/R2 (bad)
2) The contents of the glom'd underlying RDDs of D1' and D2' did
not have everything (except key 'a') piled up
in partition 1 as expected.(bad)
So, I came away with the following conclusion... which will work for me practically... But it really irks me that I could not get the behavior in the article which used Python:
When one needs to use custom partitioning with Dataframes in Scala one must
drop into RDD's do the join or whatever operation on the RDD, then convert back
to dataframe. You can't apply the custom partitioner, then convert back to
dataframe, do your operations, and expect the custom partitioning to work.
Now...I am hoping I am wrong ! Perhaps someone with more expertise in Spark internals can guide me here. I have written a little program (below) to illustrate the results. Thanks in advance if you can set me straight.
UPDATE
In addition to the Spark code which illustrates the problem I also tried a simplified version of what the original article presented in Python. The conversions below create a dataframe, extract its underlying RDD and repartition it, then recover the dataframe and verify that the partitioner is lost.
Python snippet illustrating problem
from pyspark.sql.types import IntegerType
mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())
def travelGroupPartitioner(key):
    return 0
dfRDD = df.rdd.map(lambda x: (x[0],x))
dfRDD2 = dfRDD.partitionBy(8, travelGroupPartitioner)
# this line uses the approach of the original article and maps to only the value,
# but map doesn't guarantee preserving the partitioner, so I tried without the
# map below...
df2 = spark.createDataFrame(dfRDD2.map(lambda x: x[1]))
print ( df2.rdd.partitioner ) # prints None
# create dataframe from partitioned RDD _without_ the map,
# and we _still_ lose partitioner
df3 = spark.createDataFrame(dfRDD2)
print ( df3.rdd.partitioner ) # prints None
Scala snippet illustrating problem
import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object Question extends App {
val conf =
new SparkConf().setAppName("blah").
setMaster("local").set("spark.sql.shuffle.partitions", "2")
val sparkSession = SparkSession.builder .config(conf) .getOrCreate()
val spark = sparkSession
import spark.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")
class CustomPartitioner(num: Int) extends Partitioner {
def numPartitions: Int = num
def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1
}
case class Emp(name: String, deptId: String)
case class Dept(deptId: String, name: String)
val value: RDD[Emp] = spark.sparkContext.parallelize(
Seq(
Emp("anne", "a"),
Emp("dave", "d"),
Emp("claire", "c"),
Emp("roy", "r"),
Emp("bob", "b"),
Emp("zelda", "z"),
Emp("moe", "m")
)
)
val employee: Dataset[Emp] = value.toDS()
val department: Dataset[Dept] = spark.sparkContext.parallelize(
Seq(
Dept("a", "ant dept"),
Dept("d", "duck dept"),
Dept("c", "cat dept"),
Dept("r", "rabbit dept"),
Dept("b", "badger dept"),
Dept("z", "zebra dept"),
Dept("m", "mouse dept")
)
).toDS()
val dumbPartitioner: Partitioner = new CustomPartitioner(2)
// Convert to-be-joined dataframes to custom repartition RDDs [ custom partitioner: cp ]
//
val deptPairRdd: RDD[(String, Dept)] = department.rdd.map { dept => (dept.deptId, dept) }
val empPairRdd: RDD[(String, Emp)] = employee.rdd.map { emp: Emp => (emp.deptId, emp) }
val cpEmpRdd: RDD[(String, Emp)] = empPairRdd.partitionBy(dumbPartitioner)
val cpDeptRdd: RDD[(String, Dept)] = deptPairRdd.partitionBy(dumbPartitioner)
assert(cpEmpRdd.partitioner.get == dumbPartitioner)
assert(cpDeptRdd.partitioner.get == dumbPartitioner)
// Here we join using RDDs and ensure that the resultant rdd is partitioned so most things end up in partition 1
val joined: RDD[(String, (Emp, Dept))] = cpEmpRdd.join(cpDeptRdd)
val reso: Array[(Array[(String, (Emp, Dept))], Int)] = joined.glom().collect().zipWithIndex
reso.foreach((item: Tuple2[Array[(String, (Emp, Dept))], Int]) => println(s"array size: ${item._2}. contents: ${item._1.toList}"))
System.out.println("partitioner of RDD created by joining 2 RDD's w/ custom partitioner: " + joined.partitioner)
assert(joined.partitioner.contains(dumbPartitioner))
val recoveredDeptDF: DataFrame = cpDeptRdd.toDF
val recoveredEmpDF: DataFrame = cpEmpRdd.toDF
System.out.println(
"partitioner for DF recovered from custom partitioned RDD (not as expected!):" +
recoveredDeptDF.rdd.partitioner)
val joinedDf = recoveredEmpDF.join(recoveredDeptDF, "_1")
println("printing results of joining the 2 dataframes we 'recovered' from the custom partitioned RDDS (looks good)")
joinedDf.show()
println("PRINTING partitions of joined DF does not match the glom'd results we got from underlying RDDs")
joinedDf.rdd.glom().collect().
zipWithIndex.foreach {
item: Tuple2[Any, Int] =>
val asList = item._1.asInstanceOf[Array[org.apache.spark.sql.Row]].toList
println(s"array size: ${item._2}. contents: $asList")
}
assert(joinedDf.rdd.partitioner.contains(dumbPartitioner)) // this will fail ;^(
}
Check out my new library which adds partitionBy method to the Dataset/Dataframe API level.
Taking your Emp and Dept objects as example:
class DeptByIdPartitioner extends TypedPartitioner[Dept] {
override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
Note that we are extending TypedPartitioner.
It is compile-time safe: you won't be able to repartition a dataset of persons with the emp partitioner.
val spark = SparkBuilder.getSpark()
import org.apache.spark.sql.exchange.implicits._ //<-- additional import
import spark.implicits._
val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned = employee.repartitionBy(new EmpByDepIdPartitioner)
Let's check how our data is partitioned:
Dep dataset:
Partition N 0
: List([a,ant dept])
Partition N 1
: List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])
If we join datasets repartitioned by the same key, Catalyst will properly recognize this:
val joined = deptPartitioned.join(empPartitioned, "deptId")
println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
println(s"Partition N ${elem._1}")
println(s"\t: ${elem._2.toList}")
}
Partition N 0
: List([a,ant dept,anne])
Partition N 1
: List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])
What version of Spark are you using? If it's 2.x and above, it's recommended to use Dataframe/Dataset API instead, not RDDs
It's much easier to work with the mentioned API than with RDDs, and it performs much better on later versions of Spark
You may find the link below useful for how to join DFs:
How to join two dataframes in Scala and select on few columns from the dataframes by their index?
Once you get your joined DataFrame, you can use the link below for partitioning by column values, which I assume you're trying to achieve:
Partition a spark dataframe based on column value?
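For reference, partitioning a DataFrame by column value at the DataFrame level could look roughly like the sketch below. Note that repartition uses Spark's built-in hash partitioning on the column, not a custom Partitioner, so you do not control which partition a given key lands in. The sample data is invented and Spark 2.x or later is assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, max, spark_partition_id}

val spark = SparkSession.builder().master("local[*]").appName("repartition-sketch").getOrCreate()
import spark.implicits._

// Invented sample data.
val emp = Seq(("anne", "a"), ("dave", "d"), ("claire", "c"), ("amy", "a")).toDF("name", "deptId")

// Hash-partition into 2 partitions by deptId and record each row's partition id.
val byDept = emp.repartition(2, col("deptId")).withColumn("pid", spark_partition_id())

// Rows sharing a deptId should all sit in one partition, so each key sees one pid.
val maxPartitionsPerKey = byDept
  .groupBy("deptId")
  .agg(countDistinct("pid").as("n"))
  .agg(max("n"))
  .first()
  .getLong(0)
```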

spark override the dataframe variable without using var

I have an API that performs a delete operation on a dataframe, like below:
def deleteColmns(df: DataFrame, clmList: List[org.apache.spark.sql.Column]): DataFrame = {
  var ddf: DataFrame = df
  for (clm <- clmList) {
    ddf = ddf.drop(clm)
  }
  return ddf
}
Since it is not good practice to use var in functional programming, how can I avoid it here?
With Spark 2.0 or later, you can drop multiple columns by passing a sequence of column names:
val clmList: Seq[Column] = ??? // your columns
val strList: Seq[String] = clmList.map(c => s"$c")
df.drop(strList: _*)
Otherwise, you can always use foldLeft to fold over the column list, starting from the DataFrame, and drop your columns one by one:
clmList.foldLeft(df)((acc, c) => acc.drop(c))
I hope this helps.

Create RDD from RDD entry inside foreach loop

I have some custom logic that looks at elements in an RDD and would like to conditionally write to a TempView via the UNION approach using foreach, as per below:
rddX.foreach{ x => {
// Do something, some custom logic
...
val y = create new RDD from this RDD element x
...
or something else
// UNION to TempView
...
}}
Something really basic that I do not get:
How can I convert the nth entry (x) of the RDD to an RDD itself of length 1?
Or, convert the nth entry (x) directly to a DF?
I get all the set based cases, but here I want to append when I meet a condition immediately for the sake of simplicity. I.e. at the level of the item entry in the RDD.
Now, before getting a -1 as SO 41356419, I am only suggesting this as I have a specific use case and to mutate a TempView in SPARK SQL, I do need such an approach - at least that is my thinking. Not a typical SPARK USE CASE, but that is what we are / I am facing.
Thanks in advance
First of all: you can't create an RDD or DF inside foreach() or any other function of another RDD or DF/DS. But you can get the nth element from an RDD and create a new RDD with that single element.
EDIT:
The solution, however, is much simpler:
import org.apache.spark.{SparkConf, SparkContext}
object Main {
val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(conf)
def main(args: Array[String]): Unit = {
val n = 534 // This is the input value (index of the element we're interested in)
sc.setLogLevel("ERROR")
// Creating dummy rdd
val rdd = sc.parallelize(0 to 999).cache()
// zipWithIndex yields (element, index) pairs; keep the element whose index is n
val singletonRdd = rdd.zipWithIndex().filter(pair => pair._2 == n).map(_._1)
}
}
Hope that helps!

Taking value from one dataframe and passing that value into loop of SqlContext

I'm trying to do something like this:
I have a dataframe with a single column of IDs, called ID_LIST. I would like to loop through ID_LIST using foreach, pass each ID into a Spark SQL call, and return the result to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i))
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i)
  items.foreach(println)
})
First.. not sure how to pull the individual value out, getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later ?
Thanks, any help is appreciated!
Answer To First Question
When you perform the "foreach" Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints the Row, the first row being "[123]". It is boxing [] the elements in the row. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc... try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously... Read the documentation I have linked about Row in Spark. This will transform your code to:
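// Collect the ids to the driver first: sqlContext cannot be used inside an executor-side foreach.
id_list.collect().foreach(i => {
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  items.foreach(i => println(i.getString(0)))
})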
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe and, in a loop over the distinct items from the first dataframe, union each query result into it.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
.add("col1", StringType, true)
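// ss is assumed to be your SparkSession
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)
// Collect the ids to the driver, then loop over, select, and union to the empty df.
// The loop must run on the driver so that sqlContext and the var df are usable.
id_list.collect().foreach { i =>
  val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
  df = df.union(items)
}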
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()
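To make the join idea concrete, here is a small self-contained sketch. The data and the names idList and items are invented stand-ins for ID_LIST and another_items_orc, and Spark 2.x or later is assumed. A left semi join keeps the rows of items whose id appears in idList, which is what the loop of per-id queries was assembling:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

// Invented stand-ins for the distinct id list and the items table.
val idList = Seq("123", "234").toDF("id")
val items = Seq(("123", "pen"), ("234", "ink"), ("999", "cup")).toDF("id", "name")

// Keep only the items whose id occurs in idList; the result has just the items columns.
val matched = items.join(idList, Seq("id"), "leftsemi")
val matchedCount = matched.count()
```

This runs as a single distributed job instead of one query per id, and matched is an ordinary dataframe you can use later.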

Resources