Spark Streaming Exception: java.util.NoSuchElementException: None.get

I am writing SparkStreaming data to HDFS by converting it to a dataframe:
Code
object KafkaSparkHdfs {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
sparkConf.set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(sparkConf)
def main(args: Array[String]): Unit = {
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val ssc = new StreamingContext(sparkConf, Seconds(20))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "stream3",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("fridaydata")
val stream = KafkaUtils.createDirectStream[String, String](
ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams)
)
val lines = stream.map(consumerRecord => consumerRecord.value)
val words = lines.flatMap(_.split(" "))
val wordMap = words.map(word => (word, 1))
val wordCount = wordMap.reduceByKey(_ + _)
wordCount.foreachRDD(rdd => {
val dataframe = rdd.toDF();
dataframe.write
.mode(SaveMode.Append)
.save("hdfs://localhost:9000/newfile24")
})
ssc.start()
ssc.awaitTermination()
}
}
The folder is created but the file is not written.
The program is getting terminated with the following error:
18/06/22 16:14:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
at java.lang.Thread.run(Thread.java:748)
18/06/22 16:14:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
In my pom I am using respective dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11

The error is caused by running multiple SparkContexts at the same time. Setting allowMultipleContexts to true is mostly meant for testing, and its use is discouraged. The solution is therefore to make sure that the same SparkContext is used everywhere. In the code, the SparkContext (sc) is used to create the SQLContext, which is fine. However, the StreamingContext is not created from sc; it is created from the SparkConf instead.
By looking at the documentation we see:
Create a StreamingContext by providing the configuration necessary for a new SparkContext
In other words, by using SparkConf as parameter a new SparkContext will be created. Now there are two separate contexts.
The easiest solution here would be to continue using the same context as before. Change the line creating the StreamingContext to:
val ssc = new StreamingContext(sc, Seconds(20))
Note: In newer versions of Spark (2.0+), use SparkSession instead. A new streaming context can then be created using StreamingContext(spark.sparkContext, ...). It can look as follows:
val spark = SparkSession.builder
  .master("local[*]")
  .appName("SparkKafka")
  .getOrCreate()
import spark.implicits._
val ssc = new StreamingContext(spark.sparkContext, Seconds(20))
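A quick way to convince yourself that only one context is involved is to compare references. This is a minimal sketch, written as a snippet for spark-shell or a notebook; the app name and the variable names (session, sharedSsc, sharesContext) are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical local session; the master and app name are placeholders.
val session = SparkSession.builder
  .master("local[2]")
  .appName("SharedContextCheck")
  .getOrCreate()

// Build the StreamingContext from the existing SparkContext, not from a SparkConf.
val sharedSsc = new StreamingContext(session.sparkContext, Seconds(20))

// Reference equality: true when the streaming context reuses the same SparkContext.
val sharesContext = sharedSsc.sparkContext eq session.sparkContext
println(s"StreamingContext reuses the session's SparkContext: $sharesContext")
```

If the StreamingContext were built from a SparkConf instead, it would try to create a second SparkContext, which is the situation described above.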

There is an obvious problem here - coalesce(1).
dataframe.coalesce(1)
While reducing the number of files might be tempting in many scenarios, it should be done if and only if the amount of data is low enough for the nodes to handle (clearly it isn't here).
Furthermore, let me quote the documentation:
However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
The conclusion is that you should adjust the parameter according to the expected amount of data and the desired parallelism. coalesce(1) as such is rarely useful in practice, especially in a context like streaming, where data properties can differ over time.
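To make the difference concrete, here is a small sketch comparing partition counts, written as a snippet for spark-shell or a notebook. The dataset, its 8 partitions, and the target of 4 partitions are arbitrary assumptions chosen for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for the comparison.
val spark = SparkSession.builder
  .master("local[4]")
  .appName("CoalesceVsRepartition")
  .getOrCreate()

// An illustrative dataset spread over 8 partitions.
val df = spark.range(0L, 100000L, 1L, 8).toDF("id")

// coalesce(1): no shuffle, so the upstream computation collapses into a single task.
val narrowed = df.coalesce(1)

// repartition(4): adds a shuffle, but the upstream stage still runs with 8 tasks.
val reshuffled = df.repartition(4)

val originalPartitions = df.rdd.getNumPartitions
val coalescedPartitions = narrowed.rdd.getNumPartitions
val repartitionedPartitions = reshuffled.rdd.getNumPartitions
println(s"original=$originalPartitions coalesced=$coalescedPartitions repartitioned=$repartitionedPartitions")
```

Which number is appropriate depends on the data volume per batch; the values above are not a recommendation.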

Related

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD to process my CSV data. Here is my code:
case class Person(name: String, age: Long)
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream=ssc.textFileStream("file:///home/sa/testFiles")
smDstream.foreachRDD((rdd,time) => {
val peopleDF = rdd.map(_.split(",")).map(attributes =>
Person(attributes(0), attributes(1).trim.toInt)).toDF()
peopleDF.createOrReplaceTempView("people")
val teenagersDF = spark.sql("insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
//teenagersDF.show
})
ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
please help
The question is less relevant nowadays, as DStreams are being deprecated and abandoned.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to ponder this as well, as I am not a serialization expert.
You can find a few posts from people trying to write to a Hive table directly as opposed to a path. In my answer I use a different approach, but you can keep your approach of writing via Spark SQL from a temp view; both are possible.
I simulated the input with a QueueStream, so no split needs to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet file that gets created if needed. You can create your temp view and then use spark.sql as per your initial approach.
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to
each RDD generated from the stream. This function should push the data
in each RDD to an external system, such as saving the RDD to files, or
writing it over the network to a database. Note that the function func
is executed in the driver process running the streaming application,
and will usually have RDD actions in it that will force the
computation of the streaming RDDs.
The documentation mentions saving to files, but foreachRDD can also do what you want, although I assumed the idea was to write to external systems. In my view, saving to files is quicker than going through the steps to write to a table directly. With streaming you want to offload data as soon as possible, as volumes are typically high.
Two steps:
First, in a class separate from the streaming class (run under Spark 2.4):
case class Person(name: String, age: Int)
Then the streaming logic you need to apply. You may need some additional imports, as I ran this in a Databricks notebook, which already provides some of them:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode
val spark = SparkSession
.builder
.master("local[4]")
.config("spark.driver.cores", 2)
.appName("forEachRDD")
.getOrCreate()
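val sc = spark.sparkContext
import spark.implicits._ // needed for toDF() when not running in a notebook that imports it for you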
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
if(!q.isEmpty) {
val q_flatMap = q.flatMap{x=>x}
val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
val df = q_withPerson.toDF()
df.write
.format("parquet")
.mode(SaveMode.Append)
.saveAsTable("SO_Quest_BigD")
}
}
)
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
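If you prefer to keep the temp view and spark.sql approach from the question, a pattern shown in the Spark Streaming programming guide is to obtain the SparkSession inside foreachRDD from the RDD's own SparkContext, rather than referring to objects from the enclosing scope. The following is only a sketch under assumptions: the input is simulated with a queue stream, the sample rows and the names (session, streaming, csvQueue, teenagers) are made up, the result is collected instead of inserted into a Hive table, and I have not verified that it removes the checkpointing exception in every environment:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

val session = SparkSession.builder.master("local[2]").appName("ForeachRddSketch").getOrCreate()
val streaming = new StreamingContext(session.sparkContext, Seconds(1))

// Simulated CSV input ("name,age"); a real job would use textFileStream or Kafka.
val csvQueue = new mutable.Queue[RDD[String]]()
csvQueue += session.sparkContext.parallelize(Seq("Fred,53", "John,22", "Margaret,15"))

// Driver-side buffer: the function passed to foreachRDD runs on the driver.
val teenagers = mutable.ArrayBuffer.empty[String]

streaming.queueStream(csvQueue).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Look the session up from the RDD instead of capturing an outer variable.
    val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    rdd.map(_.split(","))
      .map(fields => (fields(0), fields(1).trim.toInt))
      .toDF("name", "age")
      .createOrReplaceTempView("people")
    val rows = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 29").collect()
    teenagers ++= rows.map(_.getString(0))
  }
}

streaming.start()
streaming.awaitTerminationOrTimeout(10000) // let a few 1-second batches run
streaming.stop(stopSparkContext = false, stopGracefully = false)
println(teenagers.mkString(", "))
```

Tuples with named columns are used here instead of a case class only to keep the snippet self-contained in a shell or notebook.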

DSE Spark Streaming: Long active batches queue

I have the following code:
val conf = new SparkConf()
.setAppName("KafkaReceiver")
.set("spark.cassandra.connection.host", "192.168.0.78")
.set("spark.cassandra.connection.keep_alive_ms", "20000")
.set("spark.executor.memory", "2g")
.set("spark.driver.memory", "4g")
.set("spark.submit.deployMode", "cluster")
.set("spark.executor.instances", "3")
.set("spark.executor.cores", "3")
.set("spark.shuffle.service.enabled", "false")
.set("spark.dynamicAllocation.enabled", "false")
.set("spark.io.compression.codec", "snappy")
.set("spark.rdd.compress", "true")
.set("spark.streaming.backpressure.enabled", "true")
.set("spark.streaming.backpressure.initialRate", "200")
.set("spark.streaming.receiver.maxRate", "500")
val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val kafkaParams = Map[String, String](
"bootstrap.servers" -> "192.168.0.113:9092",
"group.id" -> "test-group-aditya",
"auto.offset.reset" -> "largest")
val topics = Set("random")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I'm running the code through spark-submit with the following command:
dse> bin/dse spark-submit --class test.kafkatesting /home/aditya/test.jar
I have a three-node Cassandra DSE cluster installed on different machines. Whenever I run the application, it takes so much data and starts creating a queue of active batches, which in turn creates a backlog and a long scheduling delay. How can I increase the performance and control the queue such that it receives a new batch only after it finishes executing the current batch?
I found the solution by optimising the code. Instead of saving an RDD, create a DataFrame: saving a DataFrame to Cassandra is much faster than saving an RDD. Also, increase the number of cores and the executor memory to achieve good results.
Thanks,
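One more thing that may be worth checking in the configuration from the question: spark.streaming.receiver.maxRate applies to receiver-based streams, while a stream created with KafkaUtils.createDirectStream is bounded by spark.streaming.kafka.maxRatePerPartition. The sketch below only builds a configuration and does the arithmetic; the rate of 500 and the partition count of 3 are assumptions for illustration, not values taken from the cluster above:

```scala
import org.apache.spark.SparkConf

// Illustrative values; tune them to what one batch can actually process.
val ratePerPartition = 500L // records per second per Kafka partition (assumed)
val kafkaPartitions = 3L    // hypothetical partition count of the topic
val batchSeconds = 10L      // matches Seconds(10) in the question

// SparkConf(false): do not pick up spark.* system properties, so the sketch is deterministic.
val rateLimitedConf = new SparkConf(false)
  .setAppName("KafkaReceiver")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", ratePerPartition.toString)

// Upper bound on the number of records a single batch can contain under this limit.
val maxRecordsPerBatch = ratePerPartition * kafkaPartitions * batchSeconds
println(s"At most $maxRecordsPerBatch records per batch")
```

If a batch of that size still takes longer than the batch interval to process, the queue of active batches will keep growing, so the limit has to be chosen together with the processing speed.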

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

I have a Scala Spark Streaming application that receives data from the same topic from 3 different Kafka producers.
The Spark Streaming application is on a machine with host 0.0.0.179, the Kafka server is on a machine with host 0.0.0.178, and the Kafka producers are on machines 0.0.0.180, 0.0.0.181 and 0.0.0.182.
When I try to run the Spark Streaming application, I get the error below:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 1 times, most recent failure: Lost task 0.0 in stage 19.0 (TID 19, localhost): java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1625)
at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1198)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1204)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
I have read many different posts, but nobody seems to have found a solution to this issue.
How can I handle this in my application? Do I have to modify some parameters on Kafka (at the moment the num.partitions parameter is set to 1)?
Following is the code of my application :
// Create the context with a 3 second batch interval
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))
case class Thema(name: String, metadata: String)
case class Tempo(unit: String, count: Int, metadata: String)
case class Spatio(unit: String, metadata: String)
case class Stt(spatial: Spatio, temporal: Tempo, thematic: Thema)
case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location : Location, timestamp : String, windspeed : Double, direction: String, strenght : String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1, stt: Stt)
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "0.0.0.178:9092",
"key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"group.id" -> "test_luca",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics1 = Array("topics1")
val s1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(record => {
implicit val formats = DefaultFormats
parse(record.value).extract[Sensors1]
}
)
s1.print()
s1.saveAsTextFiles("results/", "")
ssc.start()
ssc.awaitTermination()
Thank you
Your problem is here:
s1.print()
s1.saveAsTextFiles("results/", "")
Spark creates a graph of flows, and you define two flows here:
Read from Kafka -> Print to console
Read from Kafka -> Save to text file
Spark will attempt to run both of these graphs concurrently, since they are independent of each other (your configuration sets spark.streaming.concurrentJobs to 3, which permits this). Because the Kafka integration uses a cached consumer, both executions end up trying to use the same consumer at the same time.
What you can do is cache the DStream before running the two output operations:
val dataFromKafka = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(/* stuff */)
val cachedStream = dataFromKafka.cache()
cachedStream.print()
cachedStream.saveAsTextFiles("results/", "")
Using cache worked for me. In my case, a print, a transformation and then another print on a JavaPairDStream was giving me that error.
I added cache just before the first print, and it worked.
s1.print()
s1.saveAsTextFiles("results/", "")
The code below will work; I used similar code:
s1.cache();
s1.print();
s1.saveAsTextFiles("results/", "");

How to keep a SQLContext instance alive in a spark streaming application's life cycle?

I used SQLContext in a Spark Streaming application as below:
case class topic_name (f1: Int, f2: Int)
val sqlContext = new SQLContext(sc)
@transient val ssc = new StreamingContext(sc, new Duration(5 * 1000))
ssc.checkpoint(".")
val theDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("topic_name"))
theDStream.map(x => x._2).foreach { rdd =>
sqlContext.jsonRDD(newsIdRDD).registerTempTable("topic_name")
sqlContext.sql("select count(*) from topic_name").foreach { x =>
WriteToFile("file_path", x(0).toString)
}
}
ssc.start()
ssc.awaitTermination()
I found that I only get the count of messages for each 5-second batch. Since "The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame", I guess that a new sqlContext is created every 5 seconds and the temporary table only lives for those 5 seconds. I want the sqlContext and the temporary table to stay alive for the whole life cycle of the streaming application. How can I do that?
Thanks~
You are right. A SQLContext only remembers the registered tables for the lifetime of that object. So, instead of using registerTempTable, you should probably use persistent storage such as Hive, via the saveAsTable command.
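As a rough illustration of that suggestion, the sketch below appends each batch to a persistent table and then counts across all batches. It is written against the newer SparkSession API (Spark 2.0+) rather than the SQLContext from the question, it simulates the micro-batches with plain collections, and the table name topic_name_history, the temporary warehouse directory, and the sample records are all assumptions:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// A throwaway warehouse directory so the sketch does not touch an existing one.
val warehouse = Files.createTempDirectory("spark-warehouse-sketch").toString
val spark = SparkSession.builder
  .master("local[2]")
  .appName("PersistentTableSketch")
  .config("spark.sql.warehouse.dir", warehouse)
  .getOrCreate()
import spark.implicits._

// Start from a clean slate in case the hypothetical table already exists.
spark.sql("DROP TABLE IF EXISTS topic_name_history")

// Three simulated micro-batches of (f1, f2) records.
val batches = Seq(Seq((1, 10), (2, 20)), Seq((3, 30)), Seq((4, 40), (5, 50), (6, 60)))

// In a streaming job this body would sit inside foreachRDD, once per batch.
batches.foreach { batch =>
  batch.toDF("f1", "f2").write.mode(SaveMode.Append).saveAsTable("topic_name_history")
}

// The table accumulates every batch, so the count covers the whole run.
val totalCount = spark.sql("SELECT count(*) FROM topic_name_history").first().getLong(0)
println(s"Records seen so far: $totalCount")
```

With a temp table registered per batch, the same query would only ever see the most recent batch.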

Read from Spark RDD a Kryo File

I'm a Spark and Scala newbie.
I need to read and analyze in Spark a file that my Scala code has written with Kryo serialization:
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output
val kryo:Kryo = new Kryo()
val output:Output = new Output(new FileOutputStream("filename.ext",true))
//kryo.writeObject(output, feed) (tested both line)
kryo.writeClassAndObject(output, myScalaObject)
This is pseudo-code for creating a file containing my serialized object (myScalaObject), which is a complex object.
The file seems to be written correctly, but I have a problem when I read it into a Spark RDD.
Pseudo-code in Spark:
val conf = new SparkConf()
.setMaster("local")
.setAppName("My application")
.set("spark.executor.memory", "1g")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "myScalaObject")
val sc = new SparkContext(conf)
val file=sc.objectFile[myScalaObject]("filename.ext")
val counts = file.count()
When I try to execute it, I receive this error:
org.apache.spark.SparkException:
Job aborted: Task 0.0:0 failed 1 times (most recent failure:
Exception failure: java.io.IOException: file: filename.ext not a SequenceFile)
Is it possible to read this type of file in Spark?
If not, what is a good way to create a complex file structure that can be read in Spark?
thank you
If you want to read with objectFile, write out the data with saveAsObjectFile.
val myObjects: Seq[MyObject] = ...
val rddToSave = sc.parallelize(myObjects) // Or better yet: construct as RDD from the start.
rddToSave.saveAsObjectFile("/tmp/x")
val rddLoaded = sc.objectFile[MyObject]("/tmp/x")
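Alternatively, as zsxwing says, you can create an RDD of the filenames, and use map to read the contents of each. If you want each file to be read into a separate partition, parallelize the filenames into separate partitions:
import java.io.FileInputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Input
import org.apache.spark.rdd.RDD

def loadFiles(filenames: Seq[String]): RDD[Object] = {
  def load(filename: String): Object = {
    // Kryo instances are not serializable, so create one where the file is read.
    val kryo = new Kryo()
    val input = new Input(new FileInputStream(filename))
    try kryo.readClassAndObject(input) finally input.close()
  }
  // One partition per file, so that each file is read in its own task.
  sc.parallelize(filenames, filenames.length).map(load)
}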
