Spark Structured Streaming with JMS source - apache-spark

How can we process data from JMS sources (like Solace) in spark structured streaming? Could see implementations with Dstreams but nothing using new structured streaming. Any pointers or tips spark structured streaming and JMS would be great?

Related

Is there a monitoring endpoint for Spark Structured streaming?

In Spark's official docs, we see that there are monitoring endpoints for DStream like
/streaming/statistics
However, there does not seem to be ones for structured streaming mentioned here. I'm looking to monitor streaming statistics for a structured streaming job.
https://spark.apache.org/docs/latest/monitoring.html

GCP: Spark Structured Streaming + Custom Pub/Sub Source

There is currently no support for a PubSub source in Spark Structured Streaming.
Has anyone wrote a custom source in Spark Structured Streaming to read from PubSub?
Is this a possible approach?

Do Spark Streaming and Spark Structured Streaming use same micro-batch engine?

Do Spark Streaming and Spark Structured Streaming use the same micro-batch scheduler engine? Does Spark Structured Streaming have lower latency than Spark Streaming?
Do Spark Streaming and Spark Structured Streaming use same micro-batch scheduler engine
Certainly not. They're different internally, but share the same high-level concepts of a stream and a record.
While in Spark Structured Streaming you can get as close to how it was in Spark Streaming using DataStreamWriter.foreach or DataStreamWriter.foreachBatch methods.
The main difference is how to describe a streaming pipeline. In Spark Structured Streaming you use Spark SQL's Dataset API while Spark Streaming bet on Spark Core's RDD API. Both end up as a RDD-based computation, but Spark SQL uses higher-level abstractions (e.g. Dataset API).
Do they both use a "micro-batch scheduler engine"? Yes, but Spark Structured Streaming is trying to leverage some data sources that can be queried continuously (and no micro-batching).
does Spark Structured Streaming have lower latency than Spark Streaming?
That'd be hard to answer. The creators of Spark Streaming decided to develop Spark Structured Streaming and hope to get better at query performance and expressiveness. Spark Streaming is no longer recommended.
Structered Streaming is mostly a higher-level abstraction that allows you to define your streaming logic then it uses Spark SQL engine for execution on the same micro-batch engine.
By default Structured Streaming uses micro-batch engine, however if you are using Spark 2.3+, then you can have the continuous mode where you can get down to 1 millisecond latency

Spark structured streaming integration with RabbitMQ

I want to use Spark structured streaming to aggregate data which is consumed from RabbitMQ.
I know there is official spark structured streaming integration with apache kafka, and I was wondering if there exists some integration with RabbitMQ as well?
Since I'm not able to switch the existing messaging system (RabbitMQ), I thought of using kafka-connect to move the data between the messaging systems (Rabbit to kafka) and then use Spark structured streaming.
Does anyone knows a better solution?
This custom RabbitMQ receiver seems to available if you're open to exploring Spark Streaming rather than Structured Streaming.

How to process XMLmessage in spark streaming and kafka?

I am new to spark. Consuming message from kafka as xml format in spark streaming. Can you tell me how to process this xml is spark streaming?
Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for versions supported. Stick to the stable releases first since you're getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark.
https://github.com/databricks/spark-xml

Resources