We are putting a data file into an HDFS path that is monitored by a Spark Streaming application, and the application sends the data to a Kafka topic. We stop the streaming application in between and start it again, expecting it to resume from where it stopped, but it processes the whole input data file again. So I guess checkpointing is not being used properly. We are using Spark 1.4.1.
How can we make the streaming application start from the point where it failed/stopped?
Thanks in advance.
While creating the context, use getOrCreate(checkpointDir, ...) to load previously checkpointed data, if any.
e.g. JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, ...)
Check out a working sample program: https://github.com/atulsm/Test_Projects/blob/master/src/spark/StreamingKafkaRecoverableDirectEvent.java
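For completeness, here is a minimal sketch of that pattern against the 1.x Java API (the checkpoint and input paths, batch interval, and app name are placeholders, and the Kafka-producing foreachRDD is left as a comment). The factory is only invoked when no checkpoint exists; on restart the context and its DStream lineage are rebuilt from the checkpoint directory.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class CheckpointedFileStream {
    public static void main(String[] args) throws Exception {
        final String checkpointDir = "hdfs:///user/spark/checkpoint"; // placeholder path
        final String inputDir = "hdfs:///user/spark/incoming";        // placeholder path

        JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
            @Override
            public JavaStreamingContext create() {
                SparkConf conf = new SparkConf().setAppName("file-to-kafka");
                JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
                jssc.checkpoint(checkpointDir); // enable metadata checkpointing

                JavaDStream<String> lines = jssc.textFileStream(inputDir);
                // lines.foreachRDD(rdd -> { ... send records to the Kafka topic ... });
                lines.print();
                return jssc;
            }
        };

        // Loads the previous checkpoint if present, otherwise calls factory.create().
        JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, factory);
        ssc.start();
        ssc.awaitTermination();
    }
}
```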
Related
I have a Spark application that runs multiple structured streams (Spark 2.3.2).
The problem is that they all log to the same file.
Is there a way to set log file per spark stream?
I use Spark 2.3 (HDP 2.3.0.2.6.5.108-1) and Spark Streaming (JavaInputDStream).
I am writing a test for a component that uses Spark Streaming. What I am trying to do is:
1. start the component in a separate thread, which starts Spark Streaming
2. wait until it has started
3. send a notification to Kafka (read by Spark)
4. wait until it has been processed
5. validate the outputs
However, I am stuck on step (2): I don't know how I can even check that the streaming job has started. Is there any API I can use?
Notes:
I only have access to the SparkContext, not the streaming one... so it would be perfect if I could access such an API from the SparkContext.
step (3) comes after (2) because setting Spark's auto.offset.reset to earliest seems useless :\
You should use the SparkListener interface and listen to the events it emits, e.g. onApplicationStart.
For Spark Streaming-specific events, use the StreamingListener interface.
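For example, a test can register a listener on the SparkContext and block on a latch until the first job is submitted. This is only a sketch under assumptions: FirstJobLatchListener is a made-up helper name, and the first onJobStart event is used as a proxy for "the first micro-batch has been scheduled".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;

// Hypothetical test helper: blocks until the first Spark job is submitted.
public class FirstJobLatchListener extends SparkListener {
    private final CountDownLatch started = new CountDownLatch(1);

    @Override
    public void onJobStart(SparkListenerJobStart jobStart) {
        started.countDown(); // fires when the first (micro-batch) job starts
    }

    public boolean awaitStart(long timeout, TimeUnit unit) throws InterruptedException {
        return started.await(timeout, unit);
    }

    public static FirstJobLatchListener register(JavaSparkContext jsc) {
        FirstJobLatchListener listener = new FirstJobLatchListener();
        jsc.sc().addSparkListener(listener); // only the SparkContext is needed
        return listener;
    }
}
```

A StreamingListener (onBatchStarted/onBatchCompleted) gives finer-grained signals, but it has to be registered on the StreamingContext, which the test here does not have access to.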
I have developed a streaming application using Spark Streaming. When I start my application, it creates a DStream for each topic in Kafka, and then I start the streaming context. Now I want to create a DStream for a new topic that gets added in Kafka without stopping my streaming context.
I have a Spark 2.0.2 Structured Streaming job connecting to an Apache Kafka data stream as the source. The job takes in Twitter data (JSON) from Kafka and uses CoreNLP to annotate the data with things like sentiment, part-of-speech tagging etc. It works well with a local[*] master. However, when I set up a standalone Spark cluster, only one worker gets used to process the data. I have two workers with the same capability.
Is there something I need to set when submitting my job that I'm missing? I've tried setting --num-executors in my spark-submit command, but I have had no luck.
Thanks in advance for the pointer in the right direction.
I ended up creating the Kafka source stream with more partitions. This seems to have sped up the processing by about ninefold. Spark and Kafka have a lot of knobs, lots to sift through... See Kafka topic partitions to Spark streaming.
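To illustrate why the topic's partition count matters: the Kafka source gives the streaming query one Spark partition per Kafka topic partition, so a topic with a single partition keeps only one core busy no matter how many workers the cluster has. A minimal sketch of the read side (broker address, topic name, and app name are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;

public class KafkaParallelismSketch {
    public static void main(String[] args) throws StreamingQueryException {
        SparkSession spark = SparkSession.builder()
            .appName("twitter-sentiment") // placeholder app name
            .getOrCreate();

        // One Spark partition is created per Kafka topic partition,
        // so the topic's partition count caps the read parallelism.
        Dataset<Row> tweets = spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
            .option("subscribe", "tweets")                     // placeholder topic
            .load();

        tweets.selectExpr("CAST(value AS STRING)")
            .writeStream()
            .format("console")
            .start()
            .awaitTermination();
    }
}
```

If the expensive part is the CoreNLP annotation rather than the read itself, a repartition(...) after load() is another way to spread that work across more executors.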
I am new to Spark. I am consuming messages from Kafka in XML format with Spark Streaming. Can you tell me how to process this XML in Spark Streaming?
Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for versions supported. Stick to the stable releases first since you're getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark.
https://github.com/databricks/spark-xml
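If you follow the DStream route from the 0-8 guide above, here is a minimal sketch that parses each record with the JDK's DocumentBuilder rather than spark-xml (which is primarily a batch DataFrame source). The broker address, topic name, and the id attribute are placeholders, and each Kafka record value is assumed to hold one complete XML document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import org.w3c.dom.Document;

import kafka.serializer.StringDecoder;

public class XmlFromKafkaSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("xml-from-kafka");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092");  // placeholder broker
        Set<String> topics = Collections.singleton("xml-events"); // placeholder topic

        // Direct stream from the 0-8 integration linked above.
        JavaPairInputDStream<String, String> records = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
            kafkaParams, topics);

        // Parse each XML payload with the JDK parser and pull out one field.
        JavaDStream<String> ids = records.map(record -> {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(record._2().getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("id"); // assumed attribute
        });

        ids.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```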