GCP: Spark Structured Streaming + Custom Pub/Sub Source

There is currently no support for a Pub/Sub source in Spark Structured Streaming.
Has anyone written a custom source in Spark Structured Streaming to read from Pub/Sub?
Is this a possible approach?
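In Spark 2.x, a custom source means implementing the internal org.apache.spark.sql.execution.streaming.Source trait (plus a StreamSourceProvider so it can be registered under a format name). Purely as a skeleton sketch of that shape, assuming a hypothetical PubSubSource class with a made-up subscription parameter, it would look roughly like this; it is not a tested connector:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Skeleton only: the Pub/Sub client wiring is left as comments and the
    // batch construction is unimplemented.
    class PubSubSource(sqlContext: SQLContext, subscription: String) extends Source {

      // One string column holding the message payload.
      override def schema: StructType =
        StructType(StructField("value", StringType) :: Nil)

      // Highest offset pulled so far; -1 means nothing received yet.
      private var currentOffset = LongOffset(-1L)

      override def getOffset: Option[Offset] = synchronized {
        // Here you would pull messages from a Pub/Sub subscriber client,
        // buffer them, and advance currentOffset accordingly.
        if (currentOffset.offset < 0) None else Some(currentOffset)
      }

      override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
        // Turn the buffered messages in (start, end] into a DataFrame. Note
        // that Spark requires the returned DataFrame to have isStreaming =
        // true, which in practice means using the private[sql] method
        // sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true),
        // so real implementations live under the org.apache.spark.sql package.
        ???
      }

      override def commit(end: Offset): Unit = synchronized {
        // Ack the Pub/Sub messages up to `end` and drop them from the buffer.
      }

      override def stop(): Unit = {
        // Shut down the Pub/Sub subscriber client.
      }
    }

So yes, it is a possible approach, but it leans on internal APIs that have changed between Spark releases.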

Related

Building a service with Spark and Spark Streaming

I have read a bit about Spark Streaming, and I would like to know if it's possible to stream data from a custom source with RabbitMQ as a broker, feed this data through a Spark stream where Spark's machine learning and graph processing algorithms are applied, and then send it on to other filesystems/databases/dashboards or custom receivers.
P.S. I code in Python and do not have any experience using Spark. Can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming (DStreams). Spark Structured Streaming follows the concept below:
Source (read from RabbitMQ) -- Transformation (apply ML algorithms) -- Sink (write to database)
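To make that shape concrete, here is a minimal runnable sketch of such a pipeline. Kafka stands in for the source, since there is no built-in RabbitMQ connector; the broker address and topic name are made up, and it needs the spark-sql-kafka-0-10 package on the classpath:

    import org.apache.spark.sql.SparkSession

    object StreamingPipeline {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("streaming-pipeline")
          .getOrCreate()
        import spark.implicits._

        // Source: subscribe to a topic (stand-in for RabbitMQ).
        val input = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value")

        // Transformation: any DataFrame logic; an ML model's transform()
        // could be applied here instead of this running count.
        val counts = input.groupBy($"value").count()

        // Sink: console output; a database writer would replace this.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }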
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can run this as a Spark job; you would have to create a wrapper service layer that triggers it as a Spark job (via a Spark job launcher, sketched below) or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
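For the wrapper service layer, Spark ships a programmatic launcher in the spark-launcher module. A rough sketch, with a placeholder jar path and class name:

    import org.apache.spark.launcher.SparkLauncher

    object JobTrigger {
      def main(args: Array[String]): Unit = {
        // startApplication() returns a SparkAppHandle for monitoring.
        val handle = new SparkLauncher()
          .setAppResource("/path/to/streaming-job.jar") // placeholder jar
          .setMainClass("com.example.StreamingPipeline") // placeholder class
          .setMaster("yarn")
          .startApplication()

        // Poll until the job reaches a final state; a real service would
        // expose this status through its own API instead.
        while (!handle.getState.isFinal) Thread.sleep(1000)
      }
    }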

Spark structured streaming with Apache Hudi

I have a requirement where I need to write a stream to a Hudi dataset using Structured Streaming. I found there is a provision to do this in the Apache Hudi Jira issues, but I wanted to know if anyone has successfully implemented this and has an example. I am trying to stream data from AWS Kinesis Firehose to Apache Hudi using Spark Structured Streaming.
Quick help is appreciated.
I know of at least one user using the Structured Streaming sink in Hudi. https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/test/scala/DataSourceTest.scala#L190 could help?
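For what a streaming write could look like, here is a rough sketch against the Hudi Spark bundle. The table name, key fields, and paths are placeholders; the short format name "hudi" may need to be spelled "org.apache.hudi" on older releases, and the built-in rate source stands in for Kinesis Firehose:

    import org.apache.spark.sql.SparkSession

    object HudiSinkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hudi-sink").getOrCreate()

        // Any streaming DataFrame works; rate is just a data generator.
        val df = spark.readStream.format("rate").load()
          .selectExpr("value AS id", "timestamp AS ts")

        val query = df.writeStream
          .format("hudi")
          .option("hoodie.table.name", "events")
          .option("hoodie.datasource.write.recordkey.field", "id")
          .option("hoodie.datasource.write.precombine.field", "ts")
          .option("checkpointLocation", "/tmp/checkpoints/events")
          .outputMode("append")
          .start("/tmp/hudi/events")

        query.awaitTermination()
      }
    }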

Spark Structured Streaming with JMS source

How can we process data from JMS sources (like Solace) in Spark Structured Streaming? I could see implementations with DStreams, but nothing using the new Structured Streaming. Any pointers or tips on Spark Structured Streaming and JMS would be great.

How to write spark structured streaming data to REST API?

I would like to push my Spark Structured Streaming processed data to a REST API. Can someone share examples of this? I have found a few, but all are related to Spark Streaming, not Structured Streaming.
I have not heard of a REST API sink for Spark Structured Streaming, but you could write one yourself. Start from org.apache.spark.sql.execution.streaming.Sink.
The easiest option, however, would be to use DataStreamWriter.foreach or foreachBatch (available since 2.4).
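A sketch of the foreachBatch route, posting each micro-batch as JSON lines to a placeholder endpoint with plain java.net.HttpURLConnection (no extra dependencies assumed):

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import org.apache.spark.sql.{DataFrame, SparkSession}

    object RestSinkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rest-sink").getOrCreate()

        // Stand-in source; replace with your real input stream.
        val stream = spark.readStream.format("rate").load()

        val query = stream.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            // Collecting to the driver is fine for small batches; for large
            // ones, post per partition inside batch.foreachPartition instead.
            val payload = batch.toJSON.collect().mkString("\n")
            val conn = new URL("http://localhost:8080/ingest") // placeholder
              .openConnection().asInstanceOf[HttpURLConnection]
            conn.setRequestMethod("POST")
            conn.setDoOutput(true)
            conn.setRequestProperty("Content-Type", "application/json")
            conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
            conn.getResponseCode // force the request; check/retry in real code
            conn.disconnect()
          }
          .option("checkpointLocation", "/tmp/checkpoints/rest")
          .start()

        query.awaitTermination()
      }
    }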

Spark structured streaming integration with RabbitMQ

I want to use Spark Structured Streaming to aggregate data consumed from RabbitMQ.
I know there is an official Spark Structured Streaming integration with Apache Kafka, and I was wondering if there is some integration with RabbitMQ as well.
Since I'm not able to switch the existing messaging system (RabbitMQ), I thought of using Kafka Connect to move the data between the messaging systems (RabbitMQ to Kafka) and then use Spark Structured Streaming.
Does anyone know a better solution?
This custom RabbitMQ receiver seems to be available if you're open to exploring Spark Streaming rather than Structured Streaming.
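If you do go the Kafka Connect route, the Spark side is just the built-in Kafka source reading the bridged topic. A small sketch, pasteable into spark-shell started with the spark-sql-kafka-0-10 package; the broker address and topic name are placeholders:

    // Read the topic the Kafka Connect bridge writes RabbitMQ messages to.
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "rabbit-bridge")
      .load()
      .selectExpr("CAST(value AS STRING) AS message")

    // From here, aggregate as usual and attach a sink with writeStream.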
