spark : use the global config variables in executors - apache-spark

I have a global config object in my Spark app:
object Config {
var lambda = 0.01
}
and I set the value of lambda according to the user's input:
object MyApp {
def main(args: Array[String]) {
Config.lambda = args(0).toDouble
...
rdd.map(_ * Config.lambda)
}
}
I found that the modification does not take effect in the executors: the value of lambda there is always 0.01. I guess a modification made in the driver's JVM does not affect the executors' JVMs.
Is there another solution?
I found a similar question on Stack Overflow:
how to set and get static variables from spark?
In his answer, @DanielL. gives three solutions:
1. Put the value inside a closure to be serialized to the executors to perform a task.
But I wonder how to write the closure and how to serialize it to the executors. Could anyone give me a code example?
2. If the values are fixed or the configuration is available on the executor nodes (lives inside the jar, etc.), then you can have a lazy val, guaranteeing initialization only once.
What if I declare lambda as a lazy val? Will the modification in the driver take effect in the executors? Could you give me a code example?
3. Create a broadcast variable with the data. I know this way, but it also needs a local Broadcast[] variable that wraps the Config object, right? For example:
val config = sc.broadcast(Config)
and then use config.value.lambda in the executors, right?

Put the value inside a closure
object Config {var lambda = 0.01}
object SOTest {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
Config.lambda = 0.02
mul(r).collect.foreach(println)
sc.stop()
}
def mul(rdd: RDD[Int]) = {
val l = Config.lambda
rdd.map(_ * l)
}
}
Use a lazy val for one-time initialisation
object SOTest {
def main(args: Array[String]) {
lazy val lambda = args(0).toDouble
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
r.map(_ * lambda).collect.foreach(println)
sc.stop()
}
}
Create a broadcast variable with the data
object Config {var lambda = 0.01}
object SOTest {
def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("StaticVar"))
val r = sc.parallelize(1 to 10, 3)
Config.lambda = 0.04
val bc = sc.broadcast(Config.lambda)
r.map(_ * bc.value).collect.foreach(println)
sc.stop()
}
}
Note: You shouldn't pass the Config object into sc.broadcast() directly. Spark would have to serialise Config before transferring it to the executors, but your Config is not serialisable. Another thing to mention: a broadcast variable is not a good fit for your situation, because you are only sharing a single value.
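If several settings need to travel together, one option is to keep them in an immutable case class (case classes are serialisable) and broadcast or capture an instance built on the driver. This is only a sketch under assumptions: AppConfig, BroadcastConfigSketch and scaled are hypothetical names that do not appear in the question, and the master is hard-coded to local[2] so the example can run on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical immutable settings holder; a case class is Serializable.
final case class AppConfig(lambda: Double)

object BroadcastConfigSketch {
  // Multiplies 1..4 by the broadcast lambda on the executors.
  def scaled(sc: SparkContext, config: AppConfig): Array[Double] = {
    val bc = sc.broadcast(config) // the instance is built on the driver, after parsing input
    sc.parallelize(1 to 4, 2).map(_ * bc.value.lambda).collect()
  }

  def main(args: Array[String]): Unit = {
    // Fall back to the default from the question when no argument is given.
    val lambda = args.headOption.map(_.toDouble).getOrElse(0.01)
    val conf = new SparkConf().setAppName("BroadcastConfigSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try scaled(sc, AppConfig(lambda)).foreach(println)
    finally sc.stop()
  }
}
```

The important point is that the value read on the executors comes from an object that was serialised with the task, not from a mutable field of a singleton that each executor JVM initialises independently.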

Related

how to create dataframe in UDF

I have a problem. I want to create a DataFrame inside a UDF and use my model to transform it into another one, but I get the exception below. Is there something wrong with my Spark configuration? Can anyone help me solve this problem?
Code:
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String):Double = {
val modelLoaded = modelBroad.value
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val dataDF = sparkss.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
println(f"The prediction of $id and $text is $result")
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text")).show()
Exception:
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics$lzycompute(LocalTableScanExec.scala:37)
at org.apache.spark.sql.execution.LocalTableScanExec.metrics(LocalTableScanExec.scala:36)
at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:85)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3366)
at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3365)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
at com.zamplus.mine.SparkSubmit$.com$zamplus$mine$SparkSubmit$$model_predict$1(SparkSubmit.scala:21)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
... 22 more
There is an issue with your UDF. A UDF runs on multiple instances and uses every variable referenced inside it, so you should pass all required global variables, such as modelBroad, as parameters; otherwise you will get a NullPointerException.
There are a few more good practices that you are not following in the UDF. Some of them are:
You do not need to create a Spark session in the UDF. Doing so creates multiple Spark sessions, which will cause issues. Instead, pass the global Spark session as a variable to the UDF if required.
Remove the unnecessary println in the UDF, which also affects your return value.
I have changed your code just for reference. It is only a prototype of an ideal UDF; please change it accordingly.
val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)
def model_predict(id:Long, text:String,spark:SparkSession,modelBroad:<datatype>):Double = {
val modelLoaded = modelBroad.value
val dataDF = spark.createDataFrame(Seq((id,text))).toDF("id","text")
val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
result
}
val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text",lit(sparkss),lit(modelBroad))).show()
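As an alternative to building a DataFrame per row, a PipelineModel can usually be applied to the whole DataFrame at once with transform, which keeps all DataFrame operations on the driver side. The sketch below is an illustration under assumptions, not the asker's setup: the tiny tokenizer/hashing/logistic-regression pipeline is a stand-in for the model loaded from disk in the question, and the names TransformWholeDataFrameSketch, fitToyModel and predict are made up.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object TransformWholeDataFrameSketch {
  // Fits a toy pipeline; the question would call PipelineModel.load(...) instead.
  def fitToyModel(spark: SparkSession): PipelineModel = {
    import spark.implicits._
    val training = Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    ).toDF("id", "text", "label")
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  // Scores every row in one pass; no UDF and no per-row DataFrame.
  def predict(model: PipelineModel, test: DataFrame): DataFrame =
    model.transform(test).select("id", "text", "prediction")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("TransformWholeDataFrameSketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val test = Seq((4L, "spark i j k"), (5L, "l m n")).toDF("id", "text")
      predict(fitToyModel(spark), test).show()
    } finally spark.stop()
  }
}
```

With this shape the "prediction" column is produced by the pipeline itself, so nothing inside a task needs a SparkSession or SparkContext.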

wondering why empty inner iterator causes not serializable exception with mapPartitionsWithIndex

I've been experimenting with Spark's mapPartitionsWithIndex and I ran into problems when
trying to return an Iterator of a tuple that itself contained an empty iterator.
I tried several different ways of constructing the inner iterator [ via Iterator(), and List(...).iterator ], and
all roads led to this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 0.0 (TID 2) had a not serializable result: scala.collection.LinearSeqLike$$anon$1
Serialization stack:
- object not serializable (class: scala.collection.LinearSeqLike$$anon$1, value: empty iterator)
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (1,empty iterator))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
My code example is given below. Note that as given it runs OK (an empty iterator is returned as the
mapPartitionsWithIndex value.) But when you run with the now commented-out version of
the mapPartitionsWithIndex invocations you will get the error above.
If anyone has a suggestion on how this can be made to work, I'd be much obliged.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object ANonWorkingExample extends App {
val sparkConf = new SparkConf().setAppName("continuous").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val parallel: RDD[Int] = sc.parallelize(1 to 9)
val parts: Array[Partition] = parallel.partitions
val partRDD: RDD[(Int, Iterator[Int])] =
parallel.coalesce(3).
mapPartitionsWithIndex {
(partitionIndex: Int, inputiterator: Iterator[Int]) =>
val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
// Iterator((partitionIndex, mappedInput)) // FAILS
Iterator() // no exception.. but not really what i want.
}
val data = partRDD.collect
println("data:" + data.toList);
}
I am not sure what you are trying to achieve, and I am something of a novice compared to some of the experts here.
Here is something that may give you an idea of how to do this correctly, I think, along with some comments:
You seem to get the partitions explicitly and then call mapPartitions, which is a first for me.
Using an RDD and the various Spark Scala constructs inside mapPartitions will not fly; it is about iterables, and I think you need to drop down to plain Scala.
The serialization error comes from doing List[Int].
Here is an example showing index partition along with those corresponding index values.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
// from your stuff, left in
val parallel: RDD[Int] = sc.parallelize(1 to 9, 4)
val mapped = parallel.mapPartitionsWithIndex{
(index, iterator) => {
println("Called in Partition -> " + index)
val myList = iterator.toList
myList.map(x => (index, x)).groupBy( _._1 ).mapValues( _.map( _._2 ) ).toList.iterator
}
}
mapped.collect()
This returns the following, which resembles what I think you wanted:
res38: Array[(Int, List[Int])] = Array((0,List(1, 2)), (1,List(3, 4)), (2,List(5, 6)), (3,List(7, 8, 9)))
Final note: the documentation is not so easy to follow; you don't get it all from the word count example!
So, hope this helps.
I think it might get you on the right path to where you want to go. I could not quite see it, but maybe you can now see the forest for the trees.
So, the dumb thing I was doing was trying to return an unserializable data structure: an Iterator, as clearly indicated by the stack trace I got.
And the solution is to not use an iterator. Rather, use a collection like a Seq, or List. The sample program below illustrates the correct way to do what I was trying to do.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object AWorkingExample extends App {
val sparkConf = new SparkConf().setAppName("batman").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val parallel: RDD[Int] = sc.parallelize(1 to 9)
val parts: Array[Partition] = parallel.partitions
val partRDD: RDD[(Int, List[Int])] =
parallel.coalesce(3).
mapPartitionsWithIndex {
(partitionIndex: Int, inputiterator: Iterator[Int]) =>
val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
Iterator((partitionIndex, mappedInput.toList)) // Note the .toList() call -- that makes it work
}
val data = partRDD.collect
println("data:" + data.toList);
}
By the way, what I was trying to do originally was to see concretely which chunks of data from my parallelized-to-RDD structure were assigned to which partition. Here is the output you get if you run the program:
data:List((0,List(2, 3)), (1,List(4, 5, 6)), (2,List(7, 8, 9, 10)))
Interesting that the data distribution could have been more optimally balanced, but wasn't. That's not the point of the question, but I thought it was interesting.
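The serialisation difference between an iterator and a list can be checked without Spark, using plain Java serialisation (which is what the stack trace in the question is complaining about). A minimal sketch, where IteratorSerializationSketch and javaSerializable are hypothetical helper names:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object IteratorSerializationSketch {
  // Returns true if Java serialization accepts the value, false if it
  // rejects it with NotSerializableException.
  def javaSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the failing result: a tuple whose second element is an iterator.
    println(javaSerializable((1, List(2, 3).iterator)))
    // Same shape as the working result: the iterator materialised with toList.
    println(javaSerializable((1, List(2, 3))))
  }
}
```

The tuple itself is serialisable in both cases; it is only the iterator field inside it that gets rejected, which matches the "field (class: scala.Tuple2, name: _2 ...)" line in the serialization stack.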

Cannot evaluate ML model on Structured Streaming, because RDD transformations and actions are invoked inside other transformations

This is a well-known limitation[1] of Structured Streaming that I'm trying to get around using a custom sink.
In what follows, modelsMap is a map of string keys to org.apache.spark.mllib.stat.KernelDensity models
and
streamingData is a streaming dataframe org.apache.spark.sql.DataFrame = [id1: string, id2: string ... 6 more fields]
I'm trying to evaluate each row of streamingData against its corresponding model from modelsMap, enrich each row with the prediction, and write the result to Kafka.
An obvious way would be to use .withColumn with a UDF to predict, and write using the Kafka sink.
But this is illegal because:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases: (1) RDD transformations and
actions are NOT invoked by the driver, but inside of other
transformations; for example, rdd1.map(x => rdd2.values.count() * x) is
invalid because the values transformation and count action cannot be
performed inside of the rdd1.map transformation. For more information,
see SPARK-5063.
I get the same error with a custom sink that implements ForeachWriter, which was a bit unexpected:
import org.apache.spark.sql.ForeachWriter
import java.util.Properties
import kafkashaded.org.apache.kafka.clients.producer._
class customSink(topic:String, servers:String) extends ForeachWriter[(org.apache.spark.sql.Row)] {
val kafkaProperties = new Properties()
kafkaProperties.put("bootstrap.servers", servers)
kafkaProperties.put("key.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
kafkaProperties.put("value.serializer", "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
val results = new scala.collection.mutable.HashMap[String, String]
var producer: KafkaProducer[String, String] = _
def open(partitionId: Long,version: Long): Boolean = {
producer = new KafkaProducer(kafkaProperties)
true
}
def process(value: (org.apache.spark.sql.Row)): Unit = {
// Single prediction variable; it stays NaN when no model exists for the key
var prediction = Double.NaN
try {
val id1 = value(0)
val id2 = value(3)
val id3 = value(5)
val time_0 = value(6).asInstanceOf[Double]
val key = f"$id1/$id2/$id3"
val model = modelsMap(key)
println(s"Looking up key: $key")
// Assign to the outer variable instead of shadowing it, so the value reaches producer.send
prediction = model.estimate(Array[Double](time_0))(0)
println(prediction)
} catch {
case e: NoSuchElementException =>
println(prediction)
}
producer.send(new ProducerRecord(topic, value.mkString(",")+","+prediction.toString))
}
def close(errorOrNull: Throwable): Unit = {
producer.close()
}
}
val writer = new customSink("<topic>", "<broker>")
val query = streamingData
.writeStream
.foreach(writer)
.outputMode("update")
.trigger(Trigger.ProcessingTime(10.seconds))
.start()
model.estimate is implemented under the hood using aggregate in mllib.stat, and there's no way to get around it.
What changes do I make? (I could collect each batch and execute a for loop using driver, but then I'm not using spark the way it's intended)
References:
https://www.slideshare.net/databricks/realtime-machine-learning-analytics-using-structured-streaming-and-kinesis-firehose slide#11 mentions limitations
https://www.oreilly.com/learning/extend-structured-streaming-for-spark-ml
https://github.com/holdenk/spark-structured-streaming-ml (proposed solution)
https://issues.apache.org/jira/browse/SPARK-16454
https://issues.apache.org/jira/browse/SPARK-16407
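One direction that might sidestep the restriction is to avoid the RDD-backed model inside the writer altogether and compute the density locally from the raw samples, which are plain arrays. The sketch below rests on assumptions that are not confirmed by the question: that the models use a Gaussian kernel with the bandwidth as its standard deviation, and that the samples for each key are small enough to be collected once and shipped with the writer. LocalKernelDensitySketch, estimate and samplesByKey are hypothetical names, and the output has not been checked against MLlib's.

```scala
object LocalKernelDensitySketch {
  // Gaussian kernel density estimate at point x:
  //   (1 / (n * h * sqrt(2 * pi))) * sum_i exp(-(x - x_i)^2 / (2 * h^2))
  // where h is the bandwidth (standard deviation of the kernel).
  def estimate(samples: Array[Double], bandwidth: Double, x: Double): Double = {
    require(samples.nonEmpty, "need at least one sample")
    require(bandwidth > 0.0, "bandwidth must be positive")
    val norm = 1.0 / (samples.length * bandwidth * math.sqrt(2.0 * math.Pi))
    val sum = samples.iterator.map { s =>
      val z = (x - s) / bandwidth
      math.exp(-0.5 * z * z)
    }.sum
    norm * sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical per-key samples; a plain Map of arrays is serializable.
    val samplesByKey: Map[String, Array[Double]] = Map("a/b/c" -> Array(1.0, 2.0, 2.5))
    // Mirrors the lookup in process(): NaN when the key has no samples.
    val prediction = samplesByKey.get("a/b/c").map(estimate(_, 0.5, 2.0)).getOrElse(Double.NaN)
    println(prediction)
  }
}
```

Because estimate only touches local arrays, it can be called from process() or from a UDF without needing a SparkContext on the executor.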

How to build a lookup map in Spark Streaming?

What is the best way to maintain application state in a spark streaming application?
I know of two ways:
Use the union operation to append to the lookup RDD and persist it after each union.
Save the state in a file or database and load it at the start of each batch.
My question is: from a performance perspective, which one is better? Also, is there a better way to do this?
You should really be using mapWithState(spec: StateSpec[K, V, StateType, MappedType]) as follows:
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// checkpointing is mandatory
ssc.checkpoint("_checkpoints")
val rdd = sc.parallelize(0 to 9).map(n => (n, n % 2 toString))
import org.apache.spark.streaming.dstream.ConstantInputDStream
val sessions = new ConstantInputDStream(ssc, rdd)
import org.apache.spark.streaming.{State, StateSpec, Time}
val updateState = (batchTime: Time, key: Int, value: Option[String], state: State[Int]) => {
println(s">>> batchTime = $batchTime")
println(s">>> key = $key")
println(s">>> value = $value")
println(s">>> state = $state")
val sum = value.getOrElse("").size + state.getOption.getOrElse(0)
state.update(sum)
Some((key, value, sum)) // mapped value
}
val spec = StateSpec.function(updateState)
val mappedStatefulStream = sessions.mapWithState(spec)
mappedStatefulStream.print()
// nothing is computed until the streaming context is started
ssc.start()

Apache Spark: NotSerializableException: org.apache.hadoop.io.Text

Here is my code:
val bg = imageBundleRDD.first() //bg:[Text, BundleWritable]
val res= imageBundleRDD.map(data => {
val desBundle = colorToGray(bg._2) //lineA:NotSerializableException: org.apache.hadoop.io.Text
//val desBundle = colorToGray(data._2) //lineB:everything is ok
(data._1, desBundle)
})
println(res.count)
lineB works, but lineA fails with: org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.hadoop.io.Text
I tried to use Kryo to solve my problem, but nothing seems to have changed:
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def registerClasses(kryo: Kryo) {
kryo.register(classOf[Text])
kryo.register(classOf[BundleWritable])
}
}
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
Thanks!!!
I had a similar problem when my Java code was reading sequence files containing Text keys.
I found this post helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-java-io-NotSerializableException-org-apache-hadoop-io-Text-td2650.html
In my case, I converted Text to a String using map:
JavaPairRDD<String, VideoRecording> mapped = videos.map(new PairFunction<Tuple2<Text,VideoRecording>,String,VideoRecording>() {
@Override
public Tuple2<String, VideoRecording> call(
Tuple2<Text, VideoRecording> kv) throws Exception {
// Necessary to copy value as Hadoop chooses to reuse objects
VideoRecording vr = new VideoRecording(kv._2);
return new Tuple2(kv._1.toString(), vr);
}
});
Be aware of this note in the API for sequenceFile method in JavaSparkContext:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD will create many references to the same object. If you plan to directly cache Hadoop writable objects, you should first copy them using a map function.
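The same copy-on-read idea can be sketched in Scala: convert each Writable to a plain value in the first map after reading, so nothing downstream holds a Text or a reused Writable instance. This is a sketch under assumptions: the file written here is a throwaway with Text keys and LongWritable values (not the asker's BundleWritable data), and CopyWritablesSketch and readAsPlainValues are hypothetical names.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object CopyWritablesSketch {
  // Reads Text/LongWritable pairs and copies them into String/Long right away,
  // before any caching, collecting or closure capture can see the Writables.
  def readAsPlainValues(sc: SparkContext, path: String): Array[(String, Long)] =
    sc.sequenceFile(path, classOf[Text], classOf[LongWritable])
      .map { case (k, v) => (k.toString, v.get) }
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CopyWritablesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Write a small sequence file to a fresh temporary location to read back.
      val path = Files.createTempDirectory("seq").resolve("xyz").toString
      sc.parallelize(Seq(("a", 1L), ("b", 2L)), 1).saveAsSequenceFile(path)
      readAsPlainValues(sc, path).foreach(println)
    } finally sc.stop()
  }
}
```

For the code in the question, the analogous step would be to turn bg into serialisable values on the driver before referencing it inside the map closure.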
When dealing with sequence files in Apache Spark, follow these techniques:
-- Use the Java-equivalent data types in place of the Hadoop data types.
-- Spark automatically converts the Writables into their Java-equivalent types.
Example: say we have a sequence file "xyz" whose key type is Text and whose value
type is LongWritable. When we use this file to create an RDD, we need to use the
Java-equivalent data types, i.e. String and Long respectively.
val mydata = sc.sequenceFile[String, Long]("path/to/xyz")
mydata.collect
The reason your code has the serialization problem is that your Kryo setup, while close, isn't quite right:
change:
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(...
to:
val sparkConf = new SparkConf()
// ... set master, appname, etc, then:
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrator", "hequn.spark.reconstruction.MyRegistrator")
val sc = new SparkContext(sparkConf)

Resources