Spark structured streaming from JDBC source - apache-spark

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it is officially supported on Apache Spark.
If there is any sample code that would be helpful.
Thanks

No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka or something similar, from which it can be consumed by Spark.
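A minimal sketch (Scala) of the consuming side, assuming Debezium is already publishing change events to a Kafka topic; the broker address and topic name below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cdc-from-kafka").getOrCreate()

    // Read the Debezium change-event topic as a stream of JSON strings.
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")    // placeholder broker
      .option("subscribe", "dbserver1.inventory.customers")   // placeholder Debezium topic
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // For the sketch, just print the change events; a real job would parse and apply them.
    changes.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()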

I am on a project right now architecting this using CDC with SharePlex from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e. that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed the CDC; a rough sketch of the Kafka-to-Delta step is below.
So direct JDBC is not possible.
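A sketch (Scala) of that step using foreachBatch and a Delta MERGE; the topic, schema, key column and paths are assumptions, not the project's actual code:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("cdc-merge").getOrCreate()
    import spark.implicits._

    // Simplified CDC payload: one key column and one value column.
    val schema = new StructType().add("id", LongType).add("value", StringType)

    val updates = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
      .option("subscribe", "cdc.events")                   // placeholder topic
      .load()
      .select(from_json($"value".cast("string"), schema).as("r"))
      .select("r.*")

    // MERGE each micro-batch into the Delta table on HDFS.
    val upsert: (DataFrame, Long) => Unit = (batch, _) =>
      DeltaTable.forPath(spark, "hdfs:///data/target_delta")   // placeholder target path
        .as("t")
        .merge(batch.as("s"), "t.id = s.id")
        .whenMatched().updateAll()
        .whenNotMatched().insertAll()
        .execute()

    updates.writeStream
      .foreachBatch(upsert)
      .option("checkpointLocation", "hdfs:///checkpoints/cdc-merge")
      .start()
      .awaitTermination()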

Related

Understanding kappa architecture with Apache Superset

There is a lot of information about kappa architecture on the internet, and after going through some of the conceptual aspects I am trying to drill down to something more concrete. As my main source I used this website.
Let's imagine you want to implement a kappa architecture involving the following tech stack:
Apache Kafka
Apache Spark
Apache Superset
Now imagine the application you want to do data analytics against has a PostgreSQL database. Of course you can easily connect Apache Superset directly with the PostgreSQL database and create charts.
But now you want to see how you would do this with a kappa architecture, so you add Kafka and Spark.
You can emit events to Kafka and you can read such events in Apache Spark. Kafka will retain messages for topics for a certain period, as pointed out in the answers to this question. When I read about connecting Superset with Spark in the docs, it says Hive should be used as a connector (also the project website states the tool is unsupported, and if you look at this issue on PyHive you find that Impyla could be an alternative). But Apache Hive is a completely different project for a storage system. So how would this connection work?
Assume you have Kafka nodes running (with ZooKeeper, obviously) and also have Spark running, and then you connect Apache Superset through this Hive connector with Spark.
How can you write queries against the data that is in Kafka (which is in fact the live data)?
On the Spark side itself you can easily write a Scala program that reads data from Kafka and does something with it, but how can you achieve this from Apache Superset?
Or is this not the intended way of connecting these things?
If I understood your question, you'd need to use Spark Structured Streaming to register a streaming SQL table in the Hive metastore, which could then be queried from Superset via the Spark Thrift Server.
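A minimal sketch (Scala) of that idea, assuming a Hive-enabled SparkSession and a Kafka topic named events: the stream is persisted to Parquet and registered as an external table in the metastore, so the Spark Thrift Server (and thus Superset) can query it. All names and paths are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("streaming-to-metastore")
      .enableHiveSupport()
      .getOrCreate()

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
      .option("subscribe", "events")                           // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

    // Continuously append the stream to a Parquet directory on HDFS.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/events")              // placeholder location
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()

    // Register that directory as an external table in the Hive metastore (one-time step);
    // Superset can then query it through the Spark Thrift Server.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS events
        |USING parquet
        |LOCATION 'hdfs:///warehouse/events'""".stripMargin)

    spark.streams.awaitAnyTermination()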
Hive itself doesn't store any of the data. Hive also has a built-in Kafka query handler, so Spark isn't completely necessary.
But, Hive/Spark isn't the only option. You could use Spark to write to HDFS/S3 and have Presto query that from Superset.
Or you can remove Spark and use Kafka Connect to write to anything else that a dashboarding tool (Tableau is another popular one) can support - a JDBC database (e.g. Postgres), Mongo, Cassandra, etc. Then you'd just refresh the panels to run a new query.

Is it possible to send data from NiFi to Spark Structured Streaming/Storm directly without losing data?

In my current scenario, NiFi collects data and then sends it to Kafka. Then a streaming engine consumes the data from Kafka and analyzes it. In this situation I don't want to use Kafka between NiFi and the streaming engine; I want to send data from NiFi to the streaming engine directly. But I don't know some of the details here.
For example, Spark Structured Streaming: assume that I send data from NiFi to Spark Structured Streaming directly, Spark has received this data, but then Spark's node goes down. What happens to the data on the Spark node? (Does Spark Structured Streaming have any NiFi receiver?) Also, in this case, what is the data guarantee in Spark Structured Streaming?
For example, Storm: Storm has a NiFi bolt. But assume that Storm has received data from NiFi, but then the node goes down. What happens to the data? Also, in this case, what is the data guarantee in Storm?
In short, I want to send data from NiFi to Spark Structured Streaming/Storm (I'm more likely to use Spark) directly. But if any node goes down in the streaming engine cluster, I don't want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which was originally made for two NiFi instances to transfer data.
As far as I know there are currently integrations with Storm, Spark Streaming, and Flink. I'm not familiar with Spark Structured Streaming, but I would imagine you could build this integration similarly to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NiFiReceiver for Spark.
Comparing the implementation with the apache-spark documentation, this receiver is fault tolerant, as it should replay data that was not passed on.
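A sketch of how that receiver is typically wired up with the DStream API (not Structured Streaming), assuming a NiFi instance at http://localhost:8080/nifi with an output port named "Data for Spark"; both values are placeholders:

    import org.apache.nifi.remote.client.SiteToSiteClient
    import org.apache.nifi.spark.NiFiReceiver
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("nifi-to-spark")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Site-to-site connection details for the NiFi output port.
    val siteToSiteConfig = new SiteToSiteClient.Builder()
      .url("http://localhost:8080/nifi")   // placeholder NiFi URL
      .portName("Data for Spark")          // placeholder output port name
      .buildConfig()

    // Each element is a NiFiDataPacket holding the FlowFile content and attributes.
    val packets = ssc.receiverStream(new NiFiReceiver(siteToSiteConfig, StorageLevel.MEMORY_ONLY))
    val lines = packets.map(packet => new String(packet.getContent))
    lines.print()

    ssc.start()
    ssc.awaitTermination()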

Spark structured streaming integration with RabbitMQ

I want to use Spark structured streaming to aggregate data which is consumed from RabbitMQ.
I know there is official spark structured streaming integration with apache kafka, and I was wondering if there exists some integration with RabbitMQ as well?
Since I'm not able to switch the existing messaging system (RabbitMQ), I thought of using Kafka Connect to move the data between the messaging systems (RabbitMQ to Kafka) and then use Spark Structured Streaming.
Does anyone know a better solution?
This custom RabbitMQ receiver seems to be available if you're open to exploring Spark Streaming rather than Structured Streaming.

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking to see if there is a way to load the streaming data from Kafka directly into HDFS using Spark Streaming and without using Flume.
I have tried it using Flume(Kafka source and HDFS sink) already.
Thanks in Advance!
There is an HDFS connector for Kafka Connect. Confluent's documentation has more information.
This is a pretty basic function of Spark Streaming. Depending on which versions of Spark and Kafka you are using, you can look at the Spark Streaming Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions
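For illustration, a minimal sketch (Scala) with the spark-streaming-kafka-0-10 integration; the broker, topic, group id and output path are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",   // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka-to-hdfs",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

    // Each batch is written as a new set of text files under the given prefix.
    stream.map(_.value).saveAsTextFiles("hdfs:///data/kafka-output/batch", "txt")

    ssc.start()
    ssc.awaitTermination()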

Akka stream vs Hive Stream

I am working on a requirement where we need to read messages from Kafka and save (sink) them to Hive. I can think of multiple implementations using different technologies:
Akka Streams - where the source will be a Kafka source and the sink is Hive
Hive Streaming - using Hive streaming
Spark streaming
nifi - https://nifi.apache.org/
What would be the best way to handle a large set of Kafka messages to stream into Hive?
Thanks
Arun
Best is of course a very vague concept, but I personally like NiFi as a data movement solution.
If you are looking for fast development, and clear monitoring then the intuitive GUI should prove very valuable.
If you find that you cannot get enough performance, or good enough latency, you might be able to improve with Spark Streaming, but often that should not be needed.
Full disclosure: I have not worked with Akka Streams, and I work for Cloudera, a driving force behind NiFi, Spark and Hive.
