How to manage Spark Structured Streaming queries

Extending my POC question asked before:
sql in spark structure streaming
I understand that Spark Structured Streaming provides an API to manage streaming queries. However, I need some help understanding how to use this API.
For example, once I submit my Spark application, if some query needs to be managed (stopped, re-run), is it possible to do that?
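For reference, the management hooks the question is asking about live on spark.streams (the StreamingQueryManager). A minimal Scala sketch, with the rate/console source and sink and the query name "myQuery" as placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("query-management").getOrCreate()

    // Start a named query so it can be found and managed later.
    val query = spark.readStream
      .format("rate")        // built-in test source; stands in for a real source
      .load()
      .writeStream
      .format("console")
      .queryName("myQuery")  // the handle used to look the query up
      .start()

    // List all active queries on this SparkSession.
    spark.streams.active.foreach(q => println(s"${q.name}: ${q.status}"))

    // Find a query by name and stop it; re-running is just calling start() again.
    spark.streams.active.find(_.name == "myQuery").foreach(_.stop())

Each StreamingQuery also exposes id, recentProgress, and exception, which is enough to build monitoring around running queries.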

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know whether it is officially supported in Apache Spark.
If there is any sample code, that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but that is database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka or something similar, from which it can be consumed by Spark.
I am currently on a project architecting this with SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e., that is the way to do it if you are not using Debezium. You can use change logs for base tables or materialized views to feed the CDC.
So direct JDBC is not possible.
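To illustrate the consuming end of that pipeline, here is a minimal sketch of reading the change events once the CDC tool (Debezium, SharePlex, etc.) has landed them in Kafka; the broker address and topic name are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

    // Read the change events that the CDC tool pushed into Kafka.
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // assumed broker
      .option("subscribe", "cdc.customers")              // assumed CDC topic
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") // JSON payloads

    changes.writeStream
      .format("console") // a downstream MERGE into Delta would replace this sink
      .start()
      .awaitTermination()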

Building a service with spark and spark streaming

I have read a bit about Spark Streaming, and I would like to know whether it is possible to stream data from a custom source with RabbitMQ as a broker, feed this data through the Spark stream, where Spark's machine learning and graph processing algorithms would be applied to it, and send the results to other filesystems, databases, dashboards, or custom receivers.
P.S. I code in Python and do not have any experience with Spark. Can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows the concept below (sketched in code right after):

    Source (read from RabbitMQ) -- Transformation (apply ML algo) -- Sink (write to database)
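A minimal Scala sketch of that shape. Since there is no built-in RabbitMQ source (as noted below), the built-in rate source stands in for it, the renamed column stands in for the ML step, and the JDBC connection details are assumptions:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("pipeline-shape").getOrCreate()

    // Source: stand-in for RabbitMQ, which has no built-in connector.
    val source = spark.readStream.format("rate").load()

    // Transformation: stand-in for the ML step.
    val transformed = source.withColumnRenamed("value", "score")

    // Sink: write each micro-batch to a database via JDBC.
    transformed.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://host/db") // assumed connection
          .option("dbtable", "scores")                // assumed table
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()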
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing these without any Spark knowledge might be tricky).
You can start this as a Spark job, and you have to create a wrapper service layer that triggers it as a Spark job (via the Spark job launcher, roughly sketched after the link below) or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
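As a rough sketch of the launcher option, SparkLauncher can start the job from a wrapper service; the jar path, main class, and master URL here are all assumptions:

    import org.apache.spark.launcher.SparkLauncher

    // Launch the streaming job from a wrapper service.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/streaming-job.jar") // assumed jar location
      .setMainClass("com.example.StreamingJob")     // assumed main class
      .setMaster("yarn")                            // assumed cluster manager
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()                           // returns a SparkAppHandle

    // The handle can be polled or given listeners for state changes.
    println(handle.getState)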

Spark Streaming vs Structured Streaming

For the last few months I've been using Structured Streaming quite a lot to implement streaming jobs (after using Kafka a lot). After reading the book Stream Processing with Apache Spark, I had this question: is there any point or use case where I would use Spark Streaming instead of Structured Streaming? Should I invest some time getting into it, or, since I'm already using Spark Structured Streaming, should I stick with it, there being no benefit in the previous API?
I would appreciate any opinion/insight.
Hi, sharing my personal experience.
Structured Streaming is the future for Spark-based streaming implementations. It provides a higher level of abstraction and other great features. However, there are a few restrictions.
I have had to switch to Spark Streaming on a few occasions due to the flexibility it offers. One recent example: we had to perform joins with static reference data, but outer joins are not supported in Structured Streaming; this can be accomplished with Spark Streaming.
With the newer Spark version 2.4, Structured Streaming is much improved, with support for the foreachBatch sink, which gives flexibility similar to what Spark Streaming offers (see the sketch below).
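For instance, a brief sketch of foreachBatch, which hands each micro-batch over as a plain DataFrame so batch-only operations (arbitrary joins, batch writers) become available; the reference data path and output path are assumptions:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder.appName("foreach-batch").getOrCreate()

    val streamingDF = spark.readStream.format("rate").load()
    val staticReferenceDF = spark.read.parquet("/ref/data") // assumed reference data

    streamingDF.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Inside foreachBatch, `batch` is an ordinary DataFrame.
        batch.join(staticReferenceDF, batch("value") === staticReferenceDF("id"))
          .write.mode("append").parquet("/out/path") // assumed output path
      }
      .start()
      .awaitTermination()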
My personal thought is that having knowledge of Spark Streaming is helpful, and you might have to use it depending on your use case.

How to write spark structured streaming data to REST API?

I would like to push my Spark Structured Streaming processed data to a REST API. Can someone share examples of the same? I have found a few, but they all relate to Spark Streaming, not Structured Streaming.
I have not heard of a REST API sink for Spark Structured Streaming, but you could write one yourself. Start from org.apache.spark.sql.execution.streaming.Source.
The easiest, however, would be to use DataStreamWriter.foreach or foreachBatch (since 2.4), as in the sketch below.
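A hedged sketch of the foreach route: a ForeachWriter that POSTs each record to an endpoint. The URL is an assumption, and a real implementation would add batching, retries, and error handling:

    import java.net.{HttpURLConnection, URL}
    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    val spark = SparkSession.builder.appName("rest-sink").getOrCreate()
    val stream = spark.readStream.format("rate").load()

    stream.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, epochId: Long): Boolean = true

        def process(record: Row): Unit = {
          // POST one record as JSON; the endpoint is an assumption.
          val conn = new URL("https://api.example.com/ingest")
            .openConnection().asInstanceOf[HttpURLConnection]
          conn.setRequestMethod("POST")
          conn.setDoOutput(true)
          conn.setRequestProperty("Content-Type", "application/json")
          val out = conn.getOutputStream
          out.write(s"""{"value": ${record.getAs[Long]("value")}}""".getBytes("UTF-8"))
          out.close()
          conn.getResponseCode // force the request; response ignored in this sketch
          conn.disconnect()
        }

        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()
      .awaitTermination()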

Spark structured streaming integration with RabbitMQ

I want to use Spark structured streaming to aggregate data which is consumed from RabbitMQ.
I know there is an official Spark Structured Streaming integration with Apache Kafka, and I was wondering whether some integration exists for RabbitMQ as well.
Since I'm not able to switch the existing messaging system (RabbitMQ), I thought of using Kafka Connect to move the data between the messaging systems (RabbitMQ to Kafka) and then use Spark Structured Streaming.
Does anyone know a better solution?
This custom RabbitMQ receiver seems to be available if you're open to exploring Spark Streaming rather than Structured Streaming; a rough skeleton of that receiver pattern is sketched below.
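For orientation, a skeleton of what such a Spark Streaming (DStream) receiver looks like; the actual RabbitMQ consumption is elided, and everything inside receive() is a placeholder:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class RabbitMQReceiver(queueName: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Start a thread that pulls messages from RabbitMQ.
        new Thread("RabbitMQ Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // Close the RabbitMQ channel/connection here.
      }

      private def receive(): Unit = {
        while (!isStopped()) {
          // Placeholder: a real receiver would consume from `queueName`
          // and call store() for each delivered message.
          store("message-placeholder")
          Thread.sleep(100)
        }
      }
    }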
