How to create a DataFrame in a UDF - apache-spark

I have a problem. I want to create a DataFrame inside a UDF and use my model to transform it into another one, but I get the exception below. Is there something wrong with my Spark conf? I don't know. Can anyone help me solve this problem?
Code:
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String):Double = {
val modelLoaded = modelBroad.value
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val dataDF = sparkss.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
println(f"The prediction of $id and $text is $result")
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text")).show()
Exception:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics$lzycompute(LocalTableScanExec.scala:37)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:85)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3366)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3365)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
at com.zamplus.mine.SparkSubmit$.com$zamplus$mine$SparkSubmit$$model_predict$1(SparkSubmit.scala:21)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
... 22 more

There is an issue with your UDF. A UDF runs on multiple executor instances and captures every variable referenced inside it, so you should pass all required global variables, such as modelBroad, as parameters; otherwise you will get a NullPointerException.
There are a few more good practices that you are not following in the UDF. Some of them are:
You do not need to create a SparkSession inside the UDF; otherwise it will create multiple sessions, which will cause issues. Instead, pass the global SparkSession into the UDF as a variable if required.
Remove the unnecessary println in the UDF, which also affects your return value.
I have changed your code just for reference. It is just a prototype of an ideal UDF; please adjust it accordingly.
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = sparkss.sparkContext.broadcast(model)
def model_predict(id: Long, text: String, spark: SparkSession, modelBroad: Broadcast[PipelineModel]): Double = {
val modelLoaded = modelBroad.value
val dataDF = spark.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text",lit(sparkss),lit(modelBroad))).show()
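As an aside (not part of the original answer): since a PipelineModel can transform a whole DataFrame at once, an alternative sketch is to skip the per-row DataFrame and the UDF entirely and run the transform on the driver. This assumes test already has the "id" and "text" columns the pipeline expects and that the pipeline's final stage emits a "prediction" column.
// Sketch only: transform the full DataFrame instead of building one per row inside a UDF
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val predicted = model.transform(test)                  // appends the "prediction" column
predicted.select("id", "text", "prediction").show()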

Related

Kotlin with Spark: create a DataFrame from a POJO which has POJO classes within

I have the Kotlin data classes shown below
data class Persona_Items(
val key1:Int = 0,
val key2:String = "Hello")
data class Persona(
val persona_type: String,
val created_using_algo: String,
val version_algo: String,
val createdAt:Long,
val listPersonaItems:List<Persona_Items>)
data class PersonaMetaData
(val user_id: Int,
val persona_created: Boolean,
val persona_createdAt: Long,
val listPersona:List<Persona>)
fun main() {
val personalItemList1 = listOf(Persona_Items(1), Persona_Items(key2="abc"), Persona_Items(10,"rrr"))
val personalItemList2 = listOf(Persona_Items(10), Persona_Items(key2="abcffffff"),Persona_Items(20,"rrr"))
val persona1 = Persona("HelloWorld","tttAlgo","1.0",10L,personalItemList1)
val persona2 = Persona("HelloWorld","qqqqAlgo","1.0",10L,personalItemList2)
val personMetaData = PersonaMetaData(884,true,1L, listOf(persona1,persona2))
val spark = SparkSession
.builder()
.master("local[2]")
.config("spark.driver.host","127.0.0.1")
.appName("Simple Application").orCreate
val rdd1: RDD<PersonaMetaData> = spark.toDS(listOf(personMetaData)).rdd()
val df = spark.createDataFrame(rdd1, PersonaMetaData::class.java)
df.show(false)
}
When I try to create a dataframe I get the below error.
Exception in thread main java.lang.UnsupportedOperationException: Schema for type src.Persona is not supported.
Does this mean that creating a DataFrame from a list of data classes is not supported? Please help me understand what is missing in the above code.
It could be much easier for you to use the Kotlin API for Apache Spark (Full disclosure: I'm the author of the API). With it your code could look like this:
withSpark {
val ds = dsOf(Persona_Items(1), Persona_Items(key2="abc"), Persona_Items(10,"rrr"))
// rest of logics here
}
The thing is, Spark does not support data classes out of the box, and there is nothing like import spark.implicits._ in Kotlin, so we had to take an extra step to make it work automatically.
In Scala, import spark.implicits._ is required to serialize and deserialize your entities automatically; in the Kotlin API we do this almost at compile time.
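For comparison, here is a minimal Scala sketch of the implicits-based encoding mentioned above (the case class and values are illustrative, not taken from the question):
import org.apache.spark.sql.SparkSession

// illustrative case class standing in for a Kotlin data class
case class Item(key1: Int, key2: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._   // brings Encoder derivation for case classes into scope
val ds = Seq(Item(1, "abc"), Item(10, "rrr")).toDS()
ds.show()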
The error means that Spark doesn't know how to serialize the Persona class.
Well, it works for me out of the box. I've created a simple app to demonstrate it; check it out here: https://github.com/szymonprz/kotlin-spark-simple-app/blob/master/src/main/kotlin/CreateDataframeFromRDD.kt
You can just run this main and you will see that the correct content is displayed.
Maybe you need to fix your build tool configuration if you see something Scala-specific in a Kotlin project; you can check my build.gradle inside this project, or read more about it here: https://github.com/JetBrains/kotlin-spark-api/blob/main/docs/quick-start-guide.md

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models, and streamingData is a streaming DataFrame: org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enrich each row with the prediction, and write to Kafka.
An obvious way would be .withColumn, using a UDF to predict, and writing with the Kafka sink.
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
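(As an aside, the legal version of the pattern in that message is to run the inner action on the driver first; rdd1 and rdd2 here are the hypothetical RDDs from the error text.)
// Sketch: compute the inner result on the driver, then use the plain value inside the transformation
val c = rdd2.values.count()
val scaled = rdd1.map(x => c * x)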
I get the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic:String, servers:String) extends ForeachWriter[(org.apache.spark.sql.Row)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: org.apache.spark.sql.Row): Unit = {
var prediction = Double.NaN
try {
val id1 = value(0)
val id2 = value(3)
val id3 = value(5)
val time_0 = value(6).asInstanceOf[Double]
val key = f"$id1/$id2/$id3"
val model = modelsMap(key)
println("Looking up key: ", key)
// estimate() runs an RDD aggregate under the hood, which is what triggers the error here
prediction = model.estimate(Array[Double](time_0))(0)
println(prediction)
} catch {
case e: NoSuchElementException =>
// no model found for this key: leave prediction as NaN
println(prediction)
}
producer.send(new ProducerRecord(topic, value.mkString(",")+","+prediction.toString))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
val writer = new customSink("<topic>", "<broker>")
val query = streamingData
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(Trigger.ProcessingTime(10.seconds))
.start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around that.
What changes do I make? (I could collect each batch and execute a for loop on the driver, but then I'm not using Spark the way it's intended.)
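One sketch of that "collect each batch on the driver" route is foreachBatch (Spark 2.4+): the batch function runs on the driver, so model.estimate's internal RDD actions are invoked from the driver. This assumes the micro-batches are small enough to collect and that modelsMap lives on the driver; imports for Trigger and duration are as in the code above.
// Sketch only: per-batch, driver-side evaluation instead of a UDF or ForeachWriter
val processBatch: (org.apache.spark.sql.DataFrame, Long) => Unit = { (batchDF, batchId) =>
  val enriched = batchDF.collect().map { row =>
    val key = f"${row(0)}/${row(3)}/${row(5)}"
    val time_0 = row(6).asInstanceOf[Double]
    // estimate's RDD aggregate runs from the driver here, which is allowed
    val prediction = modelsMap.get(key).map(_.estimate(Array(time_0))(0)).getOrElse(Double.NaN)
    row.mkString(",") + "," + prediction
  }
  // write `enriched` to Kafka from the driver, e.g. with a plain KafkaProducer
}
val query = streamingData
  .writeStream
  .foreachBatch(processBatch)
  .outputMode("update")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()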
References:
https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose slide#11 mentions limitations
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
https://issues.apache.org/jira/browse/SPARK-16454
https://issues.apache.org/jira/browse/SPARK-16407

Spark: Persisting a dataframe within a function

I'm trying to reuse the dataframe by persisting it within a function.
def run(): Unit = {
val df1 = getdf1(spark)
val df2 = getdf2(spark, df1)
}
def getdf1(spark: SparkSession): DataFrame = {
val sqltxt = "select * from emp"
val df1 = spark.sql(sqltxt)
df1.persist
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
df1.write.mode(SaveMode.Overwrite).parquet("/user/user1/emp")
df1
}
def getdf2(spark: SparkSession, df1: DataFrame): DataFrame = {
// perform some operations
}
But when getdf2 executes, it performs all of the operations again. I'm not sure if I'm doing anything wrong here. Please help me understand the above scenario. Thanks.
I remember that in Scala, when you pass a function as a parameter (in this scenario, it is df1), it is executed every time you use it inside getdf2. And whenever getdf1 is called, all the statements in getdf1 are executed again. That's the reason why you see the same operations again.
Read Chapter 5, First-Class Functions, of <>.
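As a standalone illustration of the re-evaluation behaviour the answer describes (one way this happens in Scala is a by-name parameter; the names here are illustrative, not the question's code):
// A by-value parameter is evaluated once at the call site;
// a by-name parameter (": => A") is re-evaluated every time it is used.
def byValue(x: Int): Int = x + x
def byName(x: => Int): Int = x + x

def expensive(): Int = { println("computing"); 21 }

byValue(expensive())   // prints "computing" once, returns 42
byName(expensive())    // prints "computing" twice, returns 42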

Spark: Perform SQL function on RDD subsets

I need to perform a SQL function on RDD subsets of increasing size. For this I have to take subsets from the input RDD with the take function:
def main(args: Array[String]) {
// set up environment
val conf = new SparkConf()
.setMaster("local[5]")
.setAppName("Test")
.set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)
val cntPairsRdd = cntsRdd.map(n => {
val sample = data0.take(n)
val dataRDD = sc.parallelize(sample)
val df = dataRDD.toDF()
val result = df.select( ...)
val xCnt = result.count
(n, xCnt)
})
}
cntsRdd is a set of increasing integers. The take function returns a list, not an RDD, so to make my SQL work I first need to convert the list to an RDD and then to a DataFrame. Unfortunately, inside the map function Spark does not allow creating another RDD; in other words, in Spark one cannot create an RDD inside another RDD. For the same reason Spark does not support SparkContext serialization, and I get a serialization exception when trying sc.parallelize(sample).
Please advise some workaround to perform a SQL function on RDD subsets, as defined in this scenario.
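One possible workaround sketch (assuming cntsRdd is small enough to bring to the driver, and that a SparkSession named spark is available for the toDF implicits): collect the counts first and build each subset DataFrame in a driver-side loop, where creating RDDs and DataFrames is allowed.
// Sketch only: loop on the driver instead of inside an RDD transformation
import spark.implicits._

val counts = cntsRdd.collect()
val cntPairs = counts.map { n =>
  val sample = data0.take(n)            // driver-side subset, as before
  val df = sc.parallelize(sample).toDF()
  val result = df.select("*")           // substitute the original select(...) here
  val xCnt = result.count
  (n, xCnt)
}
// cntPairs is now a local array of (n, count) pairs; parallelize it afterwards if an RDD is needed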

Spark: use global config variables in executors

I have a global config object in my Spark app.
object Config {
var lambda = 0.01
}
and I will set the value of lambda according to user's input.
object MyApp {
def main(args: Array[String]) {
Config.lambda = args(0).toDouble
...
rdd.map(_ * Config.lambda)
}
}
and I found that the modification does not take effect in the executors. The value of lambda is always 0.01. I guess a modification in the driver's JVM will not affect the executors'.
Do you have another solution?
I found a similar question on Stack Overflow:
how to set and get static variables from spark?
In @DanielL.'s answer, he gives three solutions:
1. Put the value inside a closure to be serialized to the executors to perform a task.
But I wonder how to write the closure and how to serialize it to the executors. Could anyone give me a code example?
2. If the values are fixed or the configuration is available on the executor nodes (lives inside the jar, etc.), then you can have a lazy val, guaranteeing initialization only once.
What if I declare lambda as a lazy val? Will a modification in the driver take effect in the executors? Could you give me a code example?
3. Create a broadcast variable with the data. I know this way, but it also needs a local Broadcast[] variable which wraps the Config object, right? For example:
val config = sc.broadcast(Config)
and then use config.value.lambda in the executors, right?
Put the value inside a closure
object Config {var lambda = 0.01}
object SOTest {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
Config.lambda = 0.02
mul(r).collect.foreach(println)
sc.stop()
}
def mul(rdd: RDD[Int]) = {
val l = Config.lambda
rdd.map(_ * l)
}
}
lazy val for one-time initialisation
object SOTest {
def main(args: Array[String]) {
lazy val lambda = args(0).toDouble
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
r.map(_ * lambda).collect.foreach(println)
sc.stop()
}
}
Create a broadcast variable with the data
object Config {var lambda = 0.01}
object SOTest {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
Config.lambda = 0.04
val bc = sc.broadcast(Config.lambda)
r.map(_ * bc.value).collect.foreach(println)
sc.stop()
}
}
Note: You shouldn't pass the Config object into sc.broadcast() directly, because Spark would serialise your Config before transferring it to the executors, and your Config is not serialisable. Another thing to mention here: a broadcast variable does not fit your situation well, because you are only sharing a single value.
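If more than one setting ever had to be shared, a serialisable, immutable snapshot could be broadcast instead of the mutable Config object (a sketch reusing sc and r from the example above; the case class and its fields are illustrative):
// Sketch: case classes are Serializable, so a snapshot of the settings can be broadcast safely
case class ConfigSnapshot(lambda: Double, threshold: Double)

val snapshot = ConfigSnapshot(lambda = 0.04, threshold = 0.5)
val bcConf = sc.broadcast(snapshot)
r.map(_ * bcConf.value.lambda).collect.foreach(println)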
