How can I connect to Apache Spark Streaming in Mule? - apache-spark

I need to connect to Apache Spark Streaming, where the input will come from Kafka and the processed data will then go to Cassandra. I tried to find a Spark connector but didn't get any results.
Is there any custom connector available?
How can I use Apache Spark Streaming in Mule?

I need to connect to Apache Spark Streaming, where the input will come from
Kafka and the processed data will then go to Cassandra.
So you don't need a Spark connector, but a Kafka connector: https://docs.mulesoft.com/mule-user-guide/v/3.8/kafka-connector

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source? E.g., a SQL DB or any RDBMS.
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know whether it's officially supported in Apache Spark.
If there is any sample code, that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which they can be consumed by Spark.
I am currently on a project architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e., that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed the CDC.
So direct JDBC streaming is not possible.
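For illustration, here is a minimal Scala sketch of that Kafka-plus-MERGE pattern with Structured Streaming. The topic name customers-cdc, the change-event schema, and the HDFS paths are all hypothetical, and the Delta table is assumed to already exist:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object CdcKafkaToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cdc-kafka-to-delta").getOrCreate()
    import spark.implicits._

    // Hypothetical schema of the change events published by the CDC tool
    val changeSchema = new StructType()
      .add("id", LongType)
      .add("name", StringType)
      .add("op", StringType) // e.g. "c" = insert, "u" = update, "d" = delete

    // Read the CDC topic from Kafka as a stream and parse the JSON payloads
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "customers-cdc")
      .option("startingOffsets", "earliest")
      .load()
      .select(from_json($"value".cast("string"), changeSchema).as("c"))
      .select("c.*")

    // Upsert each micro-batch into a Delta table with MERGE
    val query = changes.writeStream
      .option("checkpointLocation", "hdfs:///checkpoints/customers-cdc")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        // Assumes the Delta table was already created at this path
        val target = DeltaTable.forPath(spark, "hdfs:///delta/customers")
        target.as("t")
          .merge(batch.as("s"), "t.id = s.id")
          .whenMatched("s.op = 'd'").delete()
          .whenMatched().updateAll()
          .whenNotMatched().insertAll()
          .execute()
      }
      .start()

    query.awaitTermination()
  }
}
```

The foreachBatch sink is what allows each micro-batch to run a MERGE against the Delta table; a plain file sink can only append.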

Save each Kafka message in HDFS using Spark Streaming

I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka message in HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it will save the whole RDD. Each element of the RDD is a Kafka message (an XML file). How can I save each RDD element (file) in HDFS using Spark?
I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
Doing it this way, you decouple your processing from writing your data to HDFS, which makes it easier to manage, troubleshoot, and scale.
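If it helps, here is a rough Structured Streaming sketch of the "transform and write back to Kafka" half. The topic names raw-xml and transformed-xml and the placeholder transformation are assumptions; the Kafka Connect HDFS sink would then be pointed at the output topic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TransformedBackToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xml-back-to-kafka").getOrCreate()
    import spark.implicits._

    // Read the raw XML messages from the source topic
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "raw-xml")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    // Apply whatever per-message analysis/transformation is needed (placeholder here)
    val transformed = raw.withColumn("value", upper($"value"))

    // Write each transformed message back to a second topic;
    // the Kafka Connect HDFS sink then lands that topic in HDFS.
    val query = transformed
      .selectExpr("key", "value") // the Kafka sink expects key/value columns
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "transformed-xml")
      .option("checkpointLocation", "hdfs:///checkpoints/transformed-xml")
      .start()

    query.awaitTermination()
  }
}
```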

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking to see if there is a way to load the streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have already tried it using Flume (Kafka source and HDFS sink).
Thanks in Advance!
There is an HDFS connector for Kafka Connect. Confluent's documentation has more information.
This is a pretty basic function for Spark Streaming. Depending on what versions of Spark and Kafka you are using, you can look at the Spark Streaming + Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions
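As a rough illustration of that approach with the 0-10 direct stream (the broker address, the topic name events, and the output path are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "kafka-to-hdfs",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each micro-batch of message values to a new HDFS directory
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/events/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```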

How to send data from Kafka to Spark

I want to send my data from Kafka to Spark.
I have installed Spark on my system and Kafka is also working properly on my system.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark; in fact, Spark pulls the data from Kafka.
Here is the link to the documentation: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
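To make the "Spark pulls from Kafka" point concrete, here is a minimal Structured Streaming sketch (the broker address and the topic name my-topic are placeholders; the linked guide covers the equivalent DStream API):

```scala
import org.apache.spark.sql.SparkSession

object ReadFromKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("read-from-kafka").getOrCreate()

    // Spark subscribes to the topic and pulls records from the brokers
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Print the pulled messages to the console to verify the connection
    val query = messages.writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```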

How to integrate Kafka and Spark Streaming in DataStax Enterprise Edition?

I've integrated Kafka and Spark Streaming after downloading them from the Apache website. However, I want to use DataStax for my big data solution, and I saw you can easily integrate Cassandra and Spark.
But I can't see any Kafka modules in the latest version of DataStax Enterprise. How do I integrate Kafka with Spark Streaming here?
What I want to do is basically:
Start the necessary brokers and servers
Start a Kafka producer
Start a Kafka consumer
Connect Spark Streaming to the Kafka broker and receive the messages from there
However, after a quick Google search, I can't see anywhere that Kafka has been incorporated into DataStax Enterprise.
How can I achieve this? I'm really new to DataStax and Kafka, so I need some advice. Language preference: Python.
Thanks!
Good question. DSE does not incorporate Kafka out of the box; you must set up Kafka yourself and then set up your Spark Streaming job to read from Kafka. Since DSE does bundle Spark, use DSE Spark to run your Spark Streaming job.
You can use either the direct Kafka API or Kafka receivers; more details on the tradeoffs here. TL;DR: the direct API does not require a WAL or ZooKeeper for HA.
Here is an example of how you can configure Kafka to work with DSE by Cary Bourgeois:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master
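As a rough Scala sketch of the direct-API route under DSE (the broker address, the topic events, and the demo_ks.events table are assumptions; submit it with dse spark-submit so the bundled Cassandra connector picks up the cluster settings):

```scala
import com.datastax.spark.connector._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaToDse {
  def main(args: Array[String]): Unit = {
    // When launched with `dse spark-submit`, the Cassandra connection
    // settings for the DSE cluster are provided automatically.
    val conf = new SparkConf().setAppName("kafka-to-dse")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "dse-streaming-demo"
    )

    // Direct API: no receivers and no WAL; offsets are tracked by Spark itself
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each (key, value) pair into a Cassandra table
    // (assumes keyed messages and a table demo_ks.events(key text PRIMARY KEY, value text))
    stream.map(r => (r.key, r.value)).foreachRDD { rdd =>
      rdd.saveToCassandra("demo_ks", "events", SomeColumns("key", "value"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```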

Resources