I use Spark 2.3 (HDP 2.3.0.2.6.5.108-1) and Spark Streaming (JavaInputDStream).
I am writing a test of a component that uses Spark Streaming. What I am trying to do is:
1. start the component in a separate thread, which starts Spark Streaming
2. wait until it has started
3. send a notification to Kafka (read by Spark)
4. wait until it has been processed
5. validate the outputs
However, I am stuck on step (2) and I don't know how I can at least check that the streaming job has started. Is there any API that I can use?
Notes:
I only have access to the Spark context, not the streaming one... so it would be perfect if I could access such an API from the Spark context.
step (3) comes after (2) because setting Spark's `auto.offset.reset` to `earliest` seems useless :\
You should use the SparkListener interface and listen to the events it emits, e.g. `onApplicationStart`.
For Spark Streaming-specific events, use the StreamingListener interface.
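A minimal sketch of that idea in Scala, assuming the test can reach the SparkContext and, for the streaming events, the StreamingContext of the component under test (the latch and helper method names are just illustrative):

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchStarted}

// Latch the test thread can block on until streaming has actually started.
val started = new CountDownLatch(1)

// Core Spark events: only needs the SparkContext.
def registerCoreListener(sc: SparkContext): Unit =
  sc.addSparkListener(new SparkListener {
    override def onApplicationStart(event: SparkListenerApplicationStart): Unit =
      println(s"Application started: ${event.appName}")
  })

// Streaming-specific events: needs the StreamingContext.
def registerStreamingListener(ssc: StreamingContext): Unit =
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchStarted(event: StreamingListenerBatchStarted): Unit =
      started.countDown() // first micro-batch has been scheduled
  })

// In the test, after launching the component in its own thread:
// started.await(30, TimeUnit.SECONDS)
```

With only the SparkContext available, the SparkListener part works as-is; the batch-level signal, however, needs a handle on the StreamingContext.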
Related
I will have to create a Kafka producer application that exports a fairly high amount of data from a database with minimal transformation. Each export job would be triggered by a message from another Kafka topic. I was thinking of using the transactional API provided by Kafka to achieve exactly-once semantics, even if the application fails and an export job has to be restarted, and of doing the export itself with Spark in batch mode. However, I cannot find any clue in the docs about whether I can have that kind of control over the Spark-Kafka interaction.
Can I specify the commit id? Can I control when to initialize and when to finish the commit?
(I want to use Spark's batch processing capabilities.)
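For reference, the transactional API on a plain Kafka producer looks roughly like the sketch below (broker address, topic, and `transactional.id` are placeholders); whether Spark's own Kafka writer exposes this level of control is exactly the open question here:

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// The transactional.id is what lets a restarted job fence off its previous incarnation.
props.put("transactional.id", "export-job-42")   // illustrative, e.g. derived from the export job

val producer = new KafkaProducer[String, String](props)
producer.initTransactions()

try {
  producer.beginTransaction()
  // send the exported rows ...
  producer.send(new ProducerRecord("export-topic", "key", "value"))
  producer.commitTransaction()
} catch {
  case e: Exception =>
    producer.abortTransaction()
    throw e
} finally {
  producer.close()
}
```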
I have read a bit about Spark Streaming and I would like to know whether it's possible to stream data from a custom source, with RabbitMQ as a broker, feed that data through a Spark stream where Spark's machine learning and graph processing algorithms are applied to it, and then send it on to other filesystems/databases/dashboards or custom receivers.
P.S. I code in Python and I do not have any experience with Spark. Can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows the concept below:
Source (read from RabbitMQ) -> Transformation (apply ML algorithms) -> Sink (write to a database)
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can start this as a Spark job; you will have to create a wrapper service layer that triggers it as a Spark job (Spark job launcher), or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
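As a rough illustration of that Source -> Transformation -> Sink shape, here is a minimal Structured Streaming skeleton in Scala. It uses the built-in Kafka source as a stand-in because there is no built-in RabbitMQ source, and the broker, topic, and checkpoint path are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("rabbitmq-like-pipeline")
  .getOrCreate()

import spark.implicits._

// Source: the built-in Kafka source is shown here; a custom RabbitMQ source would plug in at this step.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS message")

// Transformation: replace this with the ML / graph logic.
val transformed = input.withColumn("length", length($"message"))

// Sink: console for demo purposes; a foreachBatch/JDBC sink would write to a database.
val query = transformed.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/demo") // placeholder path
  .outputMode("append")
  .start()

query.awaitTermination()
```

The same read/transform/write shape is available in PySpark with the equivalent `spark.readStream` / `writeStream` API.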
Is it possible to achieve exactly-once semantics by handling a Kafka topic in a Spark Streaming application?
To achieve exactly-once you need the following things:
1. Exactly-once from the Kafka producer to the Kafka broker. This is achieved by Kafka 0.11's idempotent producer. But is the Kafka 0.11 to Spark Streaming integration production ready? I found this JIRA ticket with lots of bugs.
2. Exactly-once from the Kafka broker to the Spark Streaming app. Could it be achieved? Because of Spark Streaming app failures, the application can read some data twice, right? As a solution, can I persist the computation results and the last handled event uuid to Redis transactionally? (See the sketch after this list.)
3. Exactly-once when transforming data in the Spark Streaming app. This is an out-of-the-box property of RDDs.
4. Exactly-once when persisting the results. This is solved by the 2nd point, by transactionally persisting the last event uuid to Redis.
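Not a definitive answer, but for points 2 and 4 the usual pattern from the Spark-Kafka integration guide is sketched below: take the offset ranges of each batch, persist them together with the results, and only commit the offsets back to Kafka after the output succeeds (broker, topic, and group id are placeholders; the transactional-save helper is hypothetical):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("exactly-once-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "exactly-once-app",                  // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean) // commit manually, after output
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // Point 2: write the results plus the offsets (or your event uuid) to the external
  // store in one transaction, so a replayed batch can be detected and skipped.
  // saveResultsAndOffsetsTransactionally(rdd, offsetRanges)  // hypothetical helper

  // Only after the output has succeeded, commit the offsets back to Kafka.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
```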
I have a Spark 2.0.2 Structured Streaming job connecting to Apache Kafka as the data source. The job takes in Twitter data (JSON) from Kafka and uses CoreNLP to annotate the data with things like sentiment, parts-of-speech tagging, etc. It works well with a local[*] master. However, when I set up a standalone Spark cluster, only one worker gets used to process the data. I have two workers with the same capability.
Is there something I'm missing that I need to set when submitting my job? I've tried setting --num-executors in my spark-submit command, but I have had no luck.
Thanks in advance for the pointer in the right direction.
I ended up creating the Kafka source stream with more partitions. This seems to have sped up the processing part about 9-fold. Spark and Kafka have a lot of knobs; lots to sift through... See Kafka topic partitions to Spark streaming.
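For context, the Kafka source's parallelism is bounded by the number of topic partitions (roughly one Spark task per Kafka partition). A hedged sketch of an alternative, when the topic itself cannot be repartitioned, is to repartition the stream after reading; the broker, topic, and the number 16 below are arbitrary placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-parallelism").getOrCreate()

// A single-partition topic keeps one worker busy no matter how many executors exist.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "tweets")                        // placeholder topic
  .load()

// Spread the expensive CoreNLP work across more tasks than input partitions.
val fannedOut = raw.repartition(16) // arbitrary; roughly the total number of executor cores
```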
I am trying to learn Apache Spark and I can't understand from the documentation how window operations work.
I have two worker nodes and I use the Kafka Spark utilities to create a DStream from a topic.
On this DStream I apply a map function and a reduceByWindow.
I can't understand whether reduceByWindow is executed on each worker or in the driver.
I have searched on Google without any result.
Can someone explain this to me?
Both receiving and processing data happen on the worker nodes. The driver creates receivers (on worker nodes), which are responsible for data collection, and periodically starts jobs to process the collected data. Everything else is pretty much standard RDDs and normal Spark jobs.
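A minimal reduceByWindow sketch to make that concrete (the source, batch interval, and window/slide durations are arbitrary placeholders): the reduce function runs in tasks on the worker nodes, while the driver only schedules a job for each slide interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-demo")
val ssc = new StreamingContext(conf, Seconds(5))   // 5 s batch interval (arbitrary)

// Placeholder source; in the question this DStream would come from the Kafka utilities.
val lines = ssc.socketTextStream("localhost", 9999)

// map + reduceByWindow: both run as tasks on the worker nodes.
val charsPerWindow = lines
  .map(_.length.toLong)
  .reduceByWindow(_ + _, Seconds(30), Seconds(10)) // 30 s window, sliding every 10 s

charsPerWindow.print() // print() only brings a handful of results back to the driver

ssc.start()
ssc.awaitTermination()
```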