How to integrate kafka and spark streaming in Datastax Enterprise Edition? - apache-spark

I've integrated Kafka and Spark Streaming after downloading them from the Apache website. However, I want to use DataStax for my Big Data solution, and I saw you can easily integrate Cassandra and Spark.
But I can't see any Kafka modules in the latest version of DataStax Enterprise. How do I integrate Kafka with Spark Streaming here?
What I want to do is basically:
Start necessary brokers and servers
Start kafka producer
Start kafka consumer
Connect spark streaming to kafka broker and receive the messages from there
However, after a quick Google search, I can't see anywhere that Kafka has been incorporated into DataStax Enterprise.
How can I achieve this? I'm really new to DataStax and Kafka, so I need some advice. Language preference: Python.
Thanks!

Good question. DSE does not incorporate Kafka out of the box; you must set up Kafka yourself and then point your Spark Streaming job at it. Since DSE does bundle Spark, use DSE Spark to run your Spark Streaming job.
You can consume from Kafka with either the direct API or receivers; the Spark Kafka integration guide covers the trade-offs in detail. TL;DR: the direct API requires neither a write-ahead log nor ZooKeeper for high availability.
Here is an example of how you can configure Kafka to work with DSE by Cary Bourgeois:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master
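As a minimal Python sketch of that setup (the broker address localhost:9092 and topic name "events" are placeholders, not from the question), a direct-API streaming job might look like this:

```python
import os

def direct_kafka_params(brokers):
    """Kafka params for the direct API: just the broker list, no ZooKeeper."""
    return {"metadata.broker.list": brokers}

def main():
    # pyspark imports are kept inside main() so the helper above
    # can be exercised without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="dse-kafka-direct")
    ssc = StreamingContext(sc, 5)  # 5-second micro-batches

    # Direct API: no receivers, no write-ahead log; Spark tracks offsets itself.
    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], direct_kafka_params("localhost:9092"))

    # Records arrive as (key, value) pairs; print each batch's values.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()

# Only start the stream when a broker address is actually configured.
if __name__ == "__main__" and os.environ.get("KAFKA_BROKERS"):
    main()
```

Submit it with dse spark-submit your_job.py so it runs on the Spark bundled with DSE.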

Related

Spark Streaming with Spark 2 and Kafka 2.1

I'm upgrading a Java project from Cloudera 5.10 to Cloudera 6.2. We have Spark Streaming reading data from Kafka to process it and write the results elsewhere. During the upgrade, Spark is going from v1.6 to v2.1, and Kafka from v0.8 to v2.1.
To perform the streaming processing, we were connecting to Kafka using KafkaUtils.createStream(...), but KafkaUtils is no longer available in Kafka 2.11. However, I can't find any Spark Streaming + Kafka example or documentation that doesn't use this method in Java.
Is there something I'm missing? What is the best way to connect both worlds in these versions?
The module was renamed to spark-streaming-kafka-0-10
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
However, you should consider using Structured Streaming instead.
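As a hedged Python sketch of the Structured Streaming route (the broker address localhost:9092 and topic "events" are placeholders), the createStream call is replaced by Spark's built-in kafka source:

```python
import os

def kafka_options(brokers, topic):
    """Options for Spark's built-in kafka source (spark-sql-kafka-0-10)."""
    return {"kafka.bootstrap.servers": brokers, "subscribe": topic}

def main():
    # Spark imports live here so the helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-structured").getOrCreate()

    df = (spark.readStream.format("kafka")
          .options(**kafka_options("localhost:9092", "events"))
          .load())

    # Kafka delivers keys/values as bytes; cast to STRING before processing.
    values = df.selectExpr("CAST(value AS STRING) AS value")

    query = values.writeStream.format("console").start()
    query.awaitTermination()

# Only start the query when a broker address is actually configured.
if __name__ == "__main__" and os.environ.get("KAFKA_BOOTSTRAP"):
    main()
```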

Version of Kafka Connector For Use in Spark Streaming

The latest version of Kafka available for download is Kafka 2.1.0. But in order to use Kafka in Spark Streaming, or Spark Structured Streaming, we use respectively the following connectors:
spark-streaming-kafka-0-10_2.11
spark-sql-kafka-0-10_2.11
My question is that the connectors seem to be for Kafka version 0.10.0.0, since their names include 0-10. Is there something I don't understand here, or are we really using connectors built for much older versions of Kafka?
For Spark Structured Streaming 2.4, Kafka client 2.0 is used.
The 0-10 means it is compatible with Kafka brokers at version 0.10 or above.
You can check it in pom.xml in spark project: https://github.com/apache/spark/blob/branch-2.4/external/kafka-0-10-sql/pom.xml#L33
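For illustration, a PySpark job built against Spark 2.4 / Scala 2.11 would pull that connector at submit time (the version numbers and script name here are examples, not fixed requirements):

```shell
# "0-10" in the artifact name marks broker compatibility (0.10+),
# not the client version actually bundled inside the connector.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  my_structured_job.py
```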

Spark structured streaming integration with RabbitMQ

I want to use Spark structured streaming to aggregate data which is consumed from RabbitMQ.
I know there is official spark structured streaming integration with apache kafka, and I was wondering if there exists some integration with RabbitMQ as well?
Since I'm not able to switch the existing messaging system (RabbitMQ), I thought of using kafka-connect to move the data between the messaging systems (Rabbit to kafka) and then use Spark structured streaming.
Does anyone know a better solution?
This custom RabbitMQ receiver seems to be available if you're open to exploring Spark Streaming rather than Structured Streaming.

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking for a way to load streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have tried it using Flume(Kafka source and HDFS sink) already.
Thanks in Advance!
There is an HDFS connector for Kafka Connect; Confluent's documentation has more information.
This is also a pretty basic task for Spark Streaming. Depending on which versions of Spark and Kafka you are using, look at the Spark Streaming Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions
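Sketched in Python (the broker address, topic name, and HDFS path are placeholders), the Flume-free pipeline is just a direct Kafka stream whose batches are written straight to HDFS:

```python
import os

def hdfs_prefix(base, job):
    """saveAsTextFiles writes one directory per micro-batch under this prefix."""
    return "{}/{}".format(base.rstrip("/"), job)

def main():
    # pyspark imports are kept here so the helper above needs no Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-to-hdfs")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "localhost:9092"})

    # Each batch of values lands as a directory of text files in HDFS -- no Flume.
    stream.map(lambda kv: kv[1]).saveAsTextFiles(
        hdfs_prefix("hdfs:///data/events", "batch"))

    ssc.start()
    ssc.awaitTermination()

# Only start the stream when a broker address is actually configured.
if __name__ == "__main__" and os.environ.get("KAFKA_BROKERS"):
    main()
```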

How to send data from kafka to spark

I want to send my data from kafka to Spark.
I have installed Spark on my system, and Kafka is also working properly.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark; rather, Spark pulls the data from Kafka.
Here is the link to the documentation: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
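To make the pull model concrete, here is a hedged sketch of a one-shot batch read in Python, where Spark fetches whatever the topic currently holds and then stops; the broker address and topic name are placeholders:

```python
import os

def pull_options(brokers, topic):
    """Spark pulls from Kafka: a bounded read over the topic's current offsets."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "startingOffsets": "earliest",
        "endingOffsets": "latest",
    }

def main():
    # Spark import kept here so the helper above runs without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-pull").getOrCreate()

    # Batch (non-streaming) read: Spark fetches the offset range and stops.
    df = (spark.read.format("kafka")
          .options(**pull_options("localhost:9092", "events"))
          .load())

    df.selectExpr("CAST(value AS STRING) AS value").show()

# Only run the read when a broker address is actually configured.
if __name__ == "__main__" and os.environ.get("KAFKA_BOOTSTRAP"):
    main()
```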
