How to stop a StreamingContext in Apache Spark on Zeppelin - apache-spark

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import sqlContext.implicits._
val ehParams = Map[String, String](
"eventhubs.policyname" -> "Full",
...
)
val ssc = new StreamingContext(sc, Seconds(2))
val stream = EventHubsUtils.createUnionStream(ssc, ehParams)
val cr = stream.window(Seconds(6))
case class Message(msg: String)
stream.map(msg => Message(new String(msg))).foreachRDD(rdd => rdd.toDF().registerTempTable("temp"))
stream.print()
ssc.start()
The above starts and runs fine, but I cannot seem to stop it. Any call to %sql show tables just freezes.
How do I stop the StreamingContext above?

ssc.stop also kills the Spark Context, requiring an interpreter restart.
Use ssc.stop(stopSparkContext=false, stopGracefully=true) instead.
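For example, in its own Zeppelin paragraph (a minimal sketch, assuming ssc is the StreamingContext created above):
// Stops only the streaming side; the shared SparkContext stays usable for later %sql or Scala paragraphs.
ssc.stop(stopSparkContext = false, stopGracefully = true)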

ssc.stop in a new paragraph should stop it
There is also an ongoing discussion on the dev@ mailing list about how to improve integration with streaming platforms.

Related

How to use foreachRDD in legacy Spark Streaming

I am getting an exception while using foreachRDD for my CSV data processing. Here is my code:
case class Person(name: String, age: Long)
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("CassandraExample").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(10))
val smDstream=ssc.textFileStream("file:///home/sa/testFiles")
smDstream.foreachRDD((rdd, time) => {
  val peopleDF = rdd.map(_.split(",")).map(attributes =>
    Person(attributes(0), attributes(1).trim.toInt)).toDF()
  peopleDF.createOrReplaceTempView("people")
  val teenagersDF = spark.sql(
    "insert into table devDB.stam SELECT name, age FROM people WHERE age BETWEEN 13 AND 29")
  //teenagersDF.show
})
ssc.checkpoint("hdfs://go/hive/warehouse/devDB.db")
ssc.start()
I am getting the following error:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#1263422a)
- field (class: $iw, name: ssc, type: class org.apache.spark.streaming.StreamingContext)
Please help.
The question does not really make sense any more, in that DStreams are being deprecated / abandoned in favour of Structured Streaming.
There are a few things to consider in the code, so the exact question is hard to glean. That said, I had to ponder it as well, as I am not a serialization expert.
You can find a few posts of people trying to write to a Hive table directly as opposed to a path; in my answer I use one such approach, but you can equally use your approach of Spark SQL against a TempView - that is all possible.
I simulated input from a QueueStream, so no splitting needs to be applied. You can adapt this to your own situation if you follow the same "global" approach. I elected to write to a parquet-backed table that gets created if needed. You can create your TempView and then use spark.sql as per your initial approach.
The Output Operations on DStreams are:
print()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
foreachRDD
The most generic output operator that applies a function, func, to
each RDD generated from the stream. This function should push the data
in each RDD to an external system, such as saving the RDD to files, or
writing it over the network to a database. Note that the function func
is executed in the driver process running the streaming application,
and will usually have RDD actions in it that will force the
computation of the streaming RDDs.
It mentions saving to files, but foreachRDD can do what you want, although I assume the idea is usually external systems. Saving to files is quicker, in my view, than going through the steps to write a table directly. With streaming you want to offload data as soon as possible, as volumes are typically high.
Two steps:
In a separate class from the streaming class - run under Spark 2.4:
case class Person(name: String, age: Int)
Then the streaming logic you need to apply - you may need some imports that I already have in my notebook, as I ran this under Databricks:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
import org.apache.spark.sql.SaveMode
val spark = SparkSession
  .builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._ // needed for toDF() below when not running in a notebook that pre-imports it
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  if (!q.isEmpty) {
    val q_flatMap = q.flatMap{ x => x }                                 // flatten the List[(String, Int)] elements
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.write
      .format("parquet")
      .mode(SaveMode.Append)
      .saveAsTable("SO_Quest_BigD")                                     // parquet-backed table, created if needed
  }
})
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
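If you want to check what has been written while this runs, a quick read-back of the saved table in the same session looks like this (just a sketch):
// Inspect the parquet-backed table populated by the foreachRDD above.
spark.table("SO_Quest_BigD").show(false)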

Case Class within foreachRDD causes Serialization Error

I can create a DataFrame inside foreachRDD if I do not try to use a case class and simply let default column names be made with toDF(), or if I assign them via toDF("c1", "c2").
As soon as I try to use a case class, having looked at the examples, I get:
Task not serializable
If I shift the Case Class statement around I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious as to the nth Serialization error that Spark can produce and if it carries over into Structured Streaming.
I have an RDD that need not be split; maybe that is the issue? No. Running in Databricks?
The code is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required
val spark = SparkSession.builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  if (!q.isEmpty) {
    import spark.implicits._
    val q_flatMap = q.flatMap{ x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.show(false)
  }
})
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Not having grown up with Java, but having looked around, I found out what to do, though I am not expert enough to explain why.
I was running in a Databricks notebook, where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same Databricks notebook. One needs to define the case class external to the current notebook - in a separate notebook - and thus separate from the class running the streaming.
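A rough sketch of the layout that avoided the error, in a Databricks-style setup (the notebook name here is made up):
// Notebook "PersonTypes" - contains only the type definition:
case class Person(name: String, age: Int)

// First cell of the streaming notebook - pull the definition in before any streaming code runs:
// %run ./PersonTypes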

I'm trying to read the data from the files in the directory as soon as a new file is created. Real time "File Streaming"

I'm currently learning Spark Streaming. I'm trying to read the data from files in a directory as soon as a new file is created - real-time "file streaming". I'm getting the error below. Can anyone suggest a solution?
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FileStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("C:\\Users\\PRAGI V\\Desktop\\data-master\\data-master\\cards")
    lines.flatMap(x => x.split(" ")).map(x => (x, 1)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Error:
Exception in thread "main" org.apache.spark.SparkException: An application name must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:170)
  at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:555)
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:75)
  at FileStreaming$.main(FileStreaming.scala:15)
  at FileStreaming.main(FileStreaming.scala)
The error message is very clear: you need to set the app name on the SparkConf object.
Replace
val conf = new SparkConf().setMaster("local[2]")
with
val conf = new SparkConf().setMaster("local[2]").setAppName("MyApp")
I would suggest reading the official Spark Programming Guide:
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the
cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special
“local” string to run in local mode.
The online documentation has lots of examples to get you started.
Cheers!
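For completeness, a sketch of the question's program with only the missing app name added (the name string itself is arbitrary):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreaming {
  def main(args: Array[String]): Unit = {
    // setAppName is mandatory; without it SparkContext creation fails with the exception in the question.
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileStreaming")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("C:\\Users\\PRAGI V\\Desktop\\data-master\\data-master\\cards")
    lines.flatMap(x => x.split(" ")).map(x => (x, 1)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}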

Error when creating a StreamingContext

I open the spark shell
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0
Then I want to create a streaming context
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(1))
I run into an exception:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
When you open the spark-shell, there is already a SparkContext created. It is called sc, meaning you do not need to create a conf object. Simply use the existing sc object.
val ssc = new StreamingContext(sc,Seconds(1))
Note that we use val instead of var.

How to keep a SQLContext instance alive in a spark streaming application's life cycle?

I used SQLContext in a Spark Streaming application as below:
case class topic_name (f1: Int, f2: Int)
val sqlContext = new SQLContext(sc)
@transient val ssc = new StreamingContext(sc, new Duration(5 * 1000))
ssc.checkpoint(".")
val theDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("topic_name"))
theDStream.map(x => x._2).foreachRDD { rdd =>
  sqlContext.jsonRDD(rdd).registerTempTable("topic_name")
  sqlContext.sql("select count(*) from topic_name").foreach { x =>
    WriteToFile("file_path", x(0).toString)
  }
}
ssc.start()
ssc.awaitTermination()
I found I could only get the count of messages for each 5-second batch, because "The lifetime of this temporary table is tied to the SQLContext that was used to create this DataFrame". I guess that every 5 seconds a new SQLContext is created and the temporary table is only alive for those 5 seconds. I want the SQLContext and the temporary table to stay alive for the whole life cycle of the streaming application. How do I do that?
Thanks~
You are right: a SQLContext only remembers the tables registered for the lifetime of that object. So, instead of using registerTempTable, you should probably use persistent storage such as Hive, via the saveAsTable command.
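A minimal sketch of that direction using the Spark 2.x DataFrame API, assuming a Hive-enabled SparkSession named spark and the theDStream from the question; the output table name is made up:
import org.apache.spark.sql.SaveMode

theDStream.map(_._2).foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val df = spark.read.json(rdd)             // replaces the old sqlContext.jsonRDD
    df.selectExpr("count(*) as message_count")
      .write
      .mode(SaveMode.Append)
      .saveAsTable("devdb.topic_name_counts") // hypothetical Hive table; persists across batches
  }
}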
