spark streaming checkpoint recovery is very very slow - apache-spark

Goal: Read from Kinesis and store data in to S3 in Parquet format via spark streaming.
Situation:
Application runs fine initially, running batches of 1hour and the processing time is less than 30 minutes on average. For some reason lets say the application crashes, and we try to restart from checkpoint. The processing now takes forever and does not move forward.
We tried to test out the same thing at batch interval of 1 minute, the processing runs fine and takes 1.2 minutes for batch to finish. When we recover from checkpoint it takes about 15 minutes for each batch.
Notes:
we are using s3 for checkpoints
using 1 executor, with 19g mem & 3 cores per executor
Attaching the screenshots:
First Run - Before checkpoint Recovery
Trying to Recover from checkpoint:
Config.scala
object Config {
val sparkConf = new SparkConf
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
val eventsS3Path = sc.hadoopConfiguration.get("eventsS3Path")
val useIAMInstanceRole = sc.hadoopConfiguration.getBoolean("useIAMInstanceRole",true)
val checkpointDirectory = sc.hadoopConfiguration.get("checkpointDirectory")
// sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class","org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
DateTimeZone.setDefault(DateTimeZone.forID("America/Los_Angeles"))
val numStreams = 2
def getSparkContext(): SparkContext = {
this.sc
}
def getSqlContext(): HiveContext = {
this.sqlContext
}
}
S3Basin.scala
object S3Basin {
def main(args: Array[String]): Unit = {
Kinesis.startStreaming(s3basinFunction _)
}
def s3basinFunction(streams : DStream[Array[Byte]]): Unit ={
streams.foreachRDD(jsonRDDRaw =>{
println(s"Old partitions ${jsonRDDRaw.partitions.length}")
val jsonRDD = jsonRDDRaw.coalesce(10,true)
println(s"New partitions ${jsonRDD.partitions.length}")
if(!jsonRDD.isEmpty()){
val sqlContext = SQLContext.getOrCreate(jsonRDD.context)
sqlContext.read.json(jsonRDD.map(f=>{
val str = new String(f)
if(str.startsWith("{\"message\"")){
str.substring(11,str.indexOf("#version")-2)
}
else{
str
}
})).registerTempTable("events")
sqlContext.sql(
"""
|select
|to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_date,
|hour(from_utc_timestamp(from_unixtime(at), 'US/Pacific')) as event_hour,
|*
|from events
""".stripMargin).coalesce(1).write.mode(SaveMode.Append).partitionBy("event_date", "event_hour","verb").parquet(Config.eventsS3Path)
sqlContext.dropTempTable("events")
}
})
}
}
Kinesis.scala
object Kinesis{
def functionToCreateContext(streamFunc: (DStream[Array[Byte]]) => Unit): StreamingContext = {
val streamingContext = new StreamingContext(Config.sc, Minutes(Config.sc.hadoopConfiguration.getInt("kinesis.StreamingBatchDuration",1))) // new context
streamingContext.checkpoint(Config.checkpointDirectory) // set checkpoint directory
val sc = Config.getSparkContext
var awsCredentails : BasicAWSCredentials = null
val kinesisClient = if(Config.useIAMInstanceRole){
new AmazonKinesisClient()
}
else{
awsCredentails = new BasicAWSCredentials(sc.hadoopConfiguration.get("kinesis.awsAccessKeyId"),sc.hadoopConfiguration.get("kinesis.awsSecretAccessKey"))
new AmazonKinesisClient(awsCredentails)
}
val endpointUrl = sc.hadoopConfiguration.get("kinesis.endpointUrl")
val appName = sc.hadoopConfiguration.get("kinesis.appName")
val streamName = sc.hadoopConfiguration.get("kinesis.streamName")
kinesisClient.setEndpoint(endpointUrl)
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
val batchInterval = Minutes(sc.hadoopConfiguration.getInt("kinesis.StreamingBatchDuration",1))
// Kinesis checkpoint interval is the interval at which the DynamoDB is updated with information
// on sequence number of records that have been received. Same as batchInterval for this
// example.
val kinesisCheckpointInterval = batchInterval
// Get the region name from the endpoint URL to save Kinesis Client Library metadata in
// DynamoDB of the same region as the Kinesis stream
val regionName = sc.hadoopConfiguration.get("kinesis.regionName")
val kinesisStreams = (0 until Config.numStreams).map { i =>
println(s"creating stream for $i")
if(Config.useIAMInstanceRole){
KinesisUtils.createStream(streamingContext, appName, streamName, endpointUrl, regionName,
InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
}else{
KinesisUtils.createStream(streamingContext, appName, streamName, endpointUrl, regionName,
InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2,awsCredentails.getAWSAccessKeyId,awsCredentails.getAWSSecretKey)
}
}
val unionStreams = streamingContext.union(kinesisStreams)
streamFunc(unionStreams)
streamingContext
}
def startStreaming(streamFunc: (DStream[Array[Byte]]) => Unit) = {
val sc = Config.getSparkContext
if(sc.defaultParallelism < Config.numStreams+1){
throw new Exception(s"Number of shards = ${Config.numStreams} , number of processor = ${sc.defaultParallelism}")
}
val streamingContext = StreamingContext.getOrCreate(Config.checkpointDirectory, () => functionToCreateContext(streamFunc))
// sys.ShutdownHookThread {
// println("Gracefully stopping Spark Streaming Application")
// streamingContext.stop(true, true)
// println("Application stopped greacefully")
// }
//
streamingContext.start()
streamingContext.awaitTermination()
}
}
DAG

raised a Jira issue : https://issues.apache.org/jira/browse/SPARK-19304
The issue is because we read more data per iteration than what is required and then discard the data. This can be avoided by adding a limit to getResults aws call.
Fix: https://github.com/apache/spark/pull/16842

When a failed driver is restart, the following occurs:
Recover computation – The checkpointed information is used to
restart the driver, reconstruct the contexts and restart all the
receivers.
Recover block metadata – The metadata of all the blocks that will be
necessary to continue the processing will be recovered.
Re-generate incomplete jobs – For the batches with processing that
has not completed due to the failure, the RDDs and corresponding
jobs are regenerated using the recovered block metadata.
Read the block saved in the logs – When those jobs are executed, the
block data is read directly from the write ahead logs. This recovers
all the necessary data that were reliably saved to the logs.
Resend unacknowledged data – The buffered data that was not saved to
the log at the time of failure will be sent again by the source. as
it had not been acknowledged by the receiver.
Since all these steps are performed at driver your batch of 0 events take so much time. This should happen with the first batch only then things will be normal.
Reference here.

I had similar issues before, my application getting slower and slower.
try to release memory after using rdd, call rdd.unpersist() https://spark.apache.org/docs/latest/api/java/org/apache/spark/rdd/RDD.html#unpersist(boolean)
or spark.streaming.backpressure.enabled to true
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-interval
http://spark.apache.org/docs/latest/streaming-programming-guide.html#requirements
also, check your locality setting, maybe too much data move around.

Related

Writing to DB2 database gets stuck inside foreachpartition() from my Java code on Spark Cluster with 20 worker nodes configured

I am trying to read data from a file uploaded on a server path, do some manipulation and then save this data to DB2 database. We have around 300K records which may increase further in future so we are trying to do all the manipulation as well write to DB2 inside foreachpartition. Below are the steps followed to do so.
Create spark context as global and static.
static SparkContext sparkContext = new SparkContext();
static JavaSparkContext jc = sparkContext.getJavaSparkContext();
static SparkSession sc = sparkContext.getSparkSession();
Create a dataset of file present on server
Dataset<Row> dataframe = sparkContext.getSparkSession().read().option("delimiter","|").option("header","false").option("inferSchema","false").schema(SchemaClass.generateSchema()).csv(filePath).withColumn("ID",monotonically_increasing_id())).withColumn("Partition_Id",spark_partition_id());
Calling foreachpartition
dataframe.foreachpartition(new ForeachPartitionFunction<Row>()){
#Override
public void call(Iterator<Row> _row) throws Exception{
List<SparkPositionDto> li = new ArrayList<>();
while(_row.hasNext()){
PositionDto positionDto = AnotherClass.method(row,1);
SparkPositionDto spd = copyToSparkDto(positionDto);
if(spd != null){
li.add(spd);
}
}
System.out.println("Writing via Spark : List size : "+li.size());
JavaRDD<SparkPositionDto> finalRdd = jc.parallelize(li);
Dataset<Row> dfToWrite = sc.createDataFrame(finalRDD, SparkPositionDto.class);
System.out.println("Writing Data");
if(dfToWrite != null){
dfToWrite.write()
.format("jdbc")
.option("url","jdbc:msdb2://"+"Database_Name"+";useKerberos=true")
.option("driver","DRIVER_NAME")
.option("dbtable","TABLE_NAME")
.mode(SaveMode.Append)
.save();
}
}
}
The weird observation is that when i run this code outside foreachpartition for a small set of data, it works fine and in my spark cluster just 1 driver and 1 application runs, but when the same code is running inside foreachpartition, I could see 1 driver and 2 applications running with 1 app in running state and other in waiting. If I add numberOfPartitions as 5 in my schema then 5 applications can be seen running. It is running continuously, nothing in logs, seems it got stuck somewhere.

spark 2.2 struct streaming foreach writer jdbc sink lag

i'm in a project using spark 2.2 struct streaming to read kafka msg into oracle database. the message flow into kafka is about 4000-6000 messages per second .
when using hdfs file system as sink destination ,it just works fine. when using foreach jdbc writer,it will have a huge delay over time . I think the lag is caused by foreach loop .
the jdbc sink class(stand alone class file):
class JDBCSink(url: String, user: String, pwd: String) extends org.apache.spark.sql.ForeachWriter[org.apache.spark.sql.Row] {
val driver = "oracle.jdbc.driver.OracleDriver"
var connection: java.sql.Connection = _
var statement: java.sql.PreparedStatement = _
val v_sql = "insert INTO sparkdb.t_cf(EntityId,clientmac,stime,flag,id) values(?,?,to_date(?,'YYYY-MM-DD HH24:MI:SS'),?,stream_seq.nextval)"
def open(partitionId: Long, version: Long): Boolean = {
Class.forName(driver)
connection = java.sql.DriverManager.getConnection(url, user, pwd)
connection.setAutoCommit(false)
statement = connection.prepareStatement(v_sql)
true
}
def process(value: org.apache.spark.sql.Row): Unit = {
statement.setString(1, value(0).toString)
statement.setString(2, value(1).toString)
statement.setString(3, value(2).toString)
statement.setString(4, value(3).toString)
statement.executeUpdate()
}
def close(errorOrNull: Throwable): Unit = {
connection.commit()
connection.close
}
}
the sink part :
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "namenode:9092").option("fetch.message.max.bytes", "50000000").option("kafka.max.partition.fetch.bytes", "50000000")
.option("subscribe", "rawdb.raw_data")
.option("startingOffsets", "latest")
.load()
.select($"value".as[Array[Byte]])
.map(avroDeserialize(_))
.filter(some logic).select(some logic)
.writeStream.format("csv").option("checkpointLocation", "/user/root/chk").option("path", "/user/root/testdir").start()
if I change the last line
.writeStream.format("csv")...
into jdbc foreach sink as following:
val url = "jdbc:oracle:thin:#(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=x.x.x.x)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=fastdb)))"
val user = "user";
val pwd = "password";
val writer = new JDBCSink(url, user, pwd)
.writeStream.foreach(writer).outputMode("append").start()
the lag show up.
I guess the problem most likely caused by foreach loop mechanics-it's not in batch mode deal with like several thousands row in a batch ,as an oracle DBA either, I have fine tuned oracle database side ,mostly the database is waiting for idle events . excessive commit is trying to be avoided by setting connection.setAutoCommit(false) already,any suggestion will be much appreciate.
Although I don't have an actual profile of whats taking the longest time in your application, I would assume it is due to the fact that using ForeachWriter will effectively close and re-open your JDBC connection on each run, because that's how ForeachWriter works.
I would advise that instead of using it, write a custom Sink for JDBC where you control how the connection is opened or closed.
There is an open pull request to add a JDBC driver to Spark which you can take a peek at to see a possible approach to the implementation.
problem solved by injecting the result into another Kafka topic , then wrote another program read from the new topic write them into database on batches .
I think in next spark release,they might provide the jdbc sink and have some parameter setting batch size .
the main code is as following :
write to another topic:
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "x.x.x.x:9092")
.option("topic", "fastdbtest")
.option("checkpointLocation", "/user/root/chk")
.start()
read the topic and write to databases,i'm using c3p0 connection pool
lines.foreachRDD(rdd => {
if (!rdd.isEmpty) {
rdd.foreachPartition(partitionRecords => {
//get a connection from connection pool
val conn = ConnManager.getManager.getConnection
val ps = conn.prepareStatement("insert into sparkdb.t_cf(ENTITYID,CLIENTMAC,STIME,FLAG) values(?,?,?,?)")
try {
conn.setAutoCommit(false)
partitionRecords.foreach(record => {
insertIntoDB(ps, record)
}
)
ps.executeBatch()
conn.commit()
} catch {
case e: Exception =>{}
// do some log
} finally {
ps.close()
conn.close()
}
})
}
})
Have you tried using a trigger?
I notice when I didn't use a trigger my Foreach Sink open and close several times the connection to the database.
writeStream.foreach(writer).start()
But when I used a trigger, the Foreach only opened and closed the connection one time, processing for example 200 queries and when the micro-batch was ended it closed the connection until a new micro batch was received.
writeStream.trigger(Trigger.ProcessingTime("3 seconds")).foreach(writer).start()
My use case is reading from a Kafka topic with only one partition, so Spark I think is using one partition. I dont know if this solution works the same with multiple Spark partitions but my conclusion here is the Foreach process all the micro-batch at a time (row by row) in the process method and doesn't call open() and close() for every row like a lot of people think.

What is the best way to restart spark streaming application?

I basically want to write an event callback in my driver program which will restart the spark streaming application on arrival of that event.
My driver program is setting up the streams and the execution logic by reading configurations from a file.
Whenever the file is changed (new configs added) the driver program has to do the following steps in a sequence,
Restart,
Read the config file (as part of the main method) and
Set up the streams
What is the best way to achieve this?
In some cases you may want to reload streaming context dynamically (for example to reloading of streaming operations).
In that cases you may (Scala example):
val sparkContext = new SparkContext()
val stopEvent = false
var streamingContext = Option.empty[StreamingContext]
val shouldReload = false
val processThread = new Thread {
override def run(): Unit = {
while (!stopEvent){
if (streamingContext.isEmpty) {
// new context
streamingContext = Option(new StreamingContext(sparkContext, Seconds(1)))
// create DStreams
val lines = streamingContext.socketTextStream(...)
// your transformations and actions
// and decision to reload streaming context
// ...
streamingContext.get.start()
} else {
if (shouldReload) {
streamingContext.get.stop(stopSparkContext = false, stopGracefully = true)
streamingContext.get.awaitTermination()
streamingContext = Option.empty[StreamingContext]
} else {
Thread.sleep(1000)
}
}
}
streamingContext.get.stop(stopSparkContext =true, stopGracefully = true)
streamingContext.get.awaitTermination()
}
}
// and start it in separate thread
processThread.start()
processThread.join()
or in python:
spark_context = SparkContext()
stop_event = Event()
spark_streaming_context = None
should_reload = False
def process(self):
while not stop_event.is_set():
if spark_streaming_context is None:
# new context
spark_streaming_context = StreamingContext(spark_context, 0.5)
# create DStreams
lines = spark_streaming_context.socketTextStream(...)
# your transformations and actions
# and decision to reload streaming context
# ...
self.spark_streaming_context.start()
else:
# TODO move to config
if should_reload:
spark_streaming_context.stop(stopSparkContext=False, stopGraceFully=True)
spark_streaming_context.awaitTermination()
spark_streaming_context = None
else:
time.sleep(1)
else:
self.spark_streaming_context.stop(stopGraceFully=True)
self.spark_streaming_context.awaitTermination()
# and start it in separate thread
process_thread = threading.Thread(target=process)
process_thread.start()
process_thread.join()
If you want to prevent you code from crashes and restart streaming context from the last place use checkpointing mechanism.
It allow you to restore your job state after failure.
The best way to Restart the Spark is actually according to your environment.But it is always suggestible to use spark-submit console.
You can background the spark-submit process like any other linux process, by putting it into the background in the shell. In your case, the spark-submit job actually then runs the driver on YARN, so, it's baby-sitting a process that's already running asynchronously on another machine via YARN.
Cloudera blog
One way that we explored recently (in a spark meetup here) was to achieve this by using Zookeeper in Tandem with Spark. This in a nutshell uses Apache Curator to watch for changes on Zookeeper (changes in config of ZK this can be triggered by your external event) that then causes a listener to restart.
The referenced code base is here , you will find that a change in config causes the Watcher (a spark streaming app) to reboot after a graceful shutdown and reload changes. Hope this is a pointer!
I am currently solving this issue as follows,
Listen to external events by subscribing to a MQTT topic
In the MQTT callback, stop the streaming context ssc.stop(true,true) which will gracefully shutdown the streams and underlying
spark config
Start the spark application again by creating a spark conf and
setting up the streams by reading the config file
// Contents of startSparkApplication() method
sparkConf = new SparkConf().setAppName("SparkAppName")
ssc = new StreamingContext(sparkConf, Seconds(1))
val myStream = MQTTUtils.createStream(ssc,...) //provide other options
myStream.print()
ssc.start()
The application is built as Spring boot application
In Scala, stopping sparkStreamingContext may involve stopping SparkContext. I have found that when a receiver hangs, it is best to restart the SparkCintext and the SparkStreamingContext.
I am sure the code below can be written much more elegantly, but it allows for the restarting of SparkContext and SparkStreamingContext programatically. Once this is done, you can restart your receivers programatically as well.
package coname.utilobjects
import com.typesafe.config.ConfigFactory
import grizzled.slf4j.Logging
import coname.conameMLException
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
object SparkConfProviderWithStreaming extends Logging
{
val sparkVariables: mutable.HashMap[String, Any] = new mutable.HashMap
}
trait SparkConfProviderWithStreaming extends Logging{
private val keySSC = "SSC"
private val keyConf = "conf"
private val keySparkSession = "spark"
lazy val packagesversion=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.packagesversion")
lazy val sparkcassandraconnectionhost=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkcassandraconnectionhost")
lazy val sparkdrivermaxResultSize=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparkdrivermaxResultSize")
lazy val sparknetworktimeout=ConfigFactory.load("streaming").getString("streaming.cassandraconfig.sparknetworktimeout")
#throws(classOf[conameMLException])
def intitializeSpark(): Unit =
{
getSparkConf()
getSparkStreamingContext()
getSparkSession()
}
#throws(classOf[conameMLException])
def getSparkConf(): SparkConf = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keyConf).isDefined) {
logger.info("\n\nLoading new conf\n\n")
val conf = new SparkConf().setMaster("local[4]").setAppName("MLPCURLModelGenerationDataStream")
conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
conf.set("spark.cassandra.connection.host", sparkcassandraconnectionhost)
conf.set("spark.driver.maxResultSize", sparkdrivermaxResultSize)
conf.set("spark.network.timeout", sparknetworktimeout)
SparkConfProviderWithStreaming.sparkVariables.put(keyConf, conf)
logger.info("Loaded new conf")
getSparkConf()
}
else {
logger.info("Returning initialized conf")
SparkConfProviderWithStreaming.sparkVariables.get(keyConf).get.asInstanceOf[SparkConf]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
#throws(classOf[conameMLException])
def killSparkStreamingContext
{
try
{
if(SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined)
{
SparkConfProviderWithStreaming.sparkVariables -= keySSC
SparkConfProviderWithStreaming.sparkVariables -= keyConf
}
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
#throws(classOf[conameMLException])
def getSparkStreamingContext(): StreamingContext = {
try {
if (!SparkConfProviderWithStreaming.sparkVariables.get(keySSC).isDefined) {
logger.info("\n\nLoading new streaming\n\n")
SparkConfProviderWithStreaming.sparkVariables.put(keySSC, new StreamingContext(getSparkConf(), Seconds(6)))
logger.info("Loaded streaming")
getSparkStreamingContext()
}
else {
SparkConfProviderWithStreaming.sparkVariables.get(keySSC).get.asInstanceOf[StreamingContext]
}
}
catch {
case e: Exception =>
logger.error(e.getMessage, e)
throw new conameMLException(e.getMessage)
}
}
def getSparkSession():SparkSession=
{
if(!SparkSession.getActiveSession.isDefined)
{
SparkSession.builder.config(getSparkConf()).getOrCreate()
}
else
{
SparkSession.getActiveSession.get
}
}
}

Mid-Stream Changing Configuration with Check-Pointed Spark Stream

I have a Spark streaming / DStream app like this:
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...
// Start the context
context.start()
context.awaitTermination()
Where my context uses a configuration file where I can pull items with methods like appConf.getString. So I actually use:
val context = StreamingContext.getOrCreate(
appConf.getString("spark.checkpointDirectory"),
() => createStreamContext(sparkConf, appConf))
where val sparkConf = new SparkConf()....
If I stop my app and change configuration in the app file, these changes are not picked up unless I delete the checkpoint directory contents. For example, I would like to change spark.streaming.kafka.maxRatePerPartition or spark.windowDurationSecs dynamically. (EDIT: I kill the app, change the configuration file and then restart the app.) How can I do dynamically change these settings or enforce a (EDITED WORD) configuration change without trashing my checkpoint directory (which is about to include checkpoints for state info)?
How can I do dynamically change these settings or enforce a configuration change without trashing my checkpoint directory?
If dive into the code for StreamingContext.getOrCreate:
def getOrCreate(
checkpointPath: String,
creatingFunc: () => StreamingContext,
hadoopConf: Configuration = SparkHadoopUtil.get.conf,
createOnError: Boolean = false
): StreamingContext = {
val checkpointOption = CheckpointReader.read(
checkpointPath, new SparkConf(), hadoopConf, createOnError)
checkpointOption.map(new StreamingContext(null, _, null)).getOrElse(creatingFunc())
}
You can see that if CheckpointReader has checkpointed data in the class path, it uses new SparkConf() as a parameter, as the overload doesn't allow for passing of a custom created SparkConf. By default, SparkConf will load any settings declared either as an environment variable or passed to the classpath:
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
import SparkConf._
/** Create a SparkConf that loads defaults from system properties and the classpath */
def this() = this(true)
So one way of achieving what you want is instead of creating a SparkConf object in the code, you can pass the parameters via spark.driver.extraClassPath and spark.executor.extraClassPath to spark-submit.
Do you create your Streaming Context the way the docs suggest, by using StreamingContext.getOrCreate, which takes a previous checkpointDirectory as an argument?
// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(...) // new context
val lines = ssc.socketTextStream(...) // create DStreams
...
ssc.checkpoint(checkpointDirectory) // set checkpoint directory
ssc
}
// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...
// Start the context
context.start()
context.awaitTermination()
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
It can not be done adding/updating spark configurations when you are restoring from checkpoint directory. You can find spark checkpointing behaviour in documentation:
When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().
When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory
So if you use checkpoint directory then on restart of job it will re-create a StreamingContext from checkpoint data which will have old sparkConf.

Flume+Spark - Storing DStream in HDFS

I have flume stream which I want to store it in HDFS via spark . Below is spark code that I am running
object FlumePull {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println(
"Usage: FlumePollingEventCount <host> <port>")
System.exit(1)
}
val batchInterval = Milliseconds(60000)
val sparkConf = new SparkConf().setAppName("FlumePollingEventCount")
val ssc = new StreamingContext(sparkConf, batchInterval)
val stream = FlumeUtils.createPollingStream(ssc, "localhost", 9999)
stream.map(x => x + "!!!!")
.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")
ssc.start()
ssc.awaitTermination()
}
}
When I start my spsark streaming job , it does stores output in HDFS but output is something like this:
[root#sandbox ~]# hadoop fs -cat /user/root/spark/flume_Map_-1459450380000._Mapout/part-00000
org.apache.spark.streaming.flume.SparkFlumeEvent#1b9bd2c9!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#33fd3a48!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#35fd67a2!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#f9ed85f!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#58f4cfc!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#307373e!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#4ebbc8ff!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#a8905bb!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#29d73d64!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#71ff85b1!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#3ea261ef!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#16cbb209!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#17157890!!!!
org.apache.spark.streaming.flume.SparkFlumeEvent#29e41c7!!!!
It is storing flume event instead of data coming from Flume. How do it get data out of it?
Thanks
You need to extract the underlying buffer from the SparkFlumeEvent and save that. For example, if your event body is a String:
stream.map(x => new String(x.event.getBody.array) + "!!!!")
.saveAsTextFiles("/user/root/spark/flume_Map_", "_Mapout")

Resources