Can use spark streaming 1.5.1 with kafka 0.10.0? - apache-spark

Can I use spark streaming 1.5.1 with kafka 0.10.0?
The site spark.apache.org recommended that spark streaming should work with kafka 0.8.2;
I just wanna know what if I use kafka 0.10.0 with spark 1.5.1 ?
Can someone please help.

I think it's possible but there are some limitations. For instance, Apache spark 1.5.x does not support features like streaming in a kerberized environment.

This API spark-streaming-kafka-0-10 is still in experimental condition.
You can use this. But you might get some surprises as it is yet to stable.
Please follow the below link:
https://github.com/jerryshao/spark-streaming-kafka-0-10-connector
This is a Kafka 0.10 connector for Spark 1.x Streaming. Hope it will work.

Related

Spark Streaming with Spark 2 and Kafka 2.1

I'm upgrading a Java project from Cloudera 5.10 to Cloudera 6.2. We have Spark Streaming reading data from Kafka to process it and write the results elsewhere. During the upgrade, Spark is going from v1.6 to v2.1, and Kafka from v0.8 to v2.1.
To perform the streaming processing, we were connecting to Kafka using KafkaUtils.createStream(...), but KafkaUtils are not available in Kafka 2.11 anymore. However, I can't seem to find any Spark Streaming + Kafka example or documentation which doesn't use this method in Java.
Is there something I'm missing? What is the best way to connect both worlds in these versions?
The module was renamed to spark-streaming-kafka-0-10
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
However, you should consider using Structured Streaming, instead.

Fetch kafka headers in spark 2.4.X

How to get Kafka header fields (which were introduced in Kafka 0.11+) in Spark Structured Streaming?
I see the headers implementation is added in Spark 3.0 but not in 2.4.5.
And I see by default spark-sql-kafka-0-10 is using kafka-client 2.0.
If it is not possible to read Kafka headers using Spark then can you suggest any alternative?
I don't found the way to do it in spark 2.X. can use Kafka connect SMT if the use case is simple

Version of Kafka Connector For Use in Spark Streaming

The latest version of Kafka available for download is Kafka 2.1.0. But in order to use Kafka in Spark Streaming, or Spark Structured Streaming, we use respectively the following connectors:
spark-streaming-kafka-0-10_2.11
spark-sql-kafka-0-10_2.11
My question is that it seems that the connectors are for Kafka version 0.10.0.0 since the name of the connectors include 0-10. Is there something that I don't understand here, or we are really using connectors which are for much older versions of Kafka?
For Spark Structure Streaming 2.4, Kafka Client 2.0 is used.
0-10 means it is compatible with Kafka Brokers in version 0.10 or above.
You can check it in pom.xml in spark project: https://github.com/apache/spark/blob/branch-2.4/external/kafka-0-10-sql/pom.xml#L33

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking if there is a way to load the streaming data from Kafka directly into HDFS using spark streaming and without using Flume.
I have tried it using Flume(Kafka source and HDFS sink) already.
Thanks in Advance!
There is HDFS connector for Kafka Connect. Confluent's documentation have more information.
This is a pretty basic function for Spark Streaming. Depending on what version of spark and Kafka you are using, you can look at the spark streaming kafka integration documentation for the versions you are using. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions

How to integrate kafka and spark streaming in Datastax Enterprise Edition?

I've integrated kafka and spark streaming after downloading from the apache website. However, I wanted to use Datastax for my Big Data solution and I saw you can easily integrate Cassandra and Spark.
But I can't see any kafka modules in the latest version of Datastax enterprise. How to integrate kafka with spark streaming here?
What I want to do is basically:
Start necessary brokers and servers
Start kafka producer
Start kafka consumer
Connect spark streaming to kafka broker and receive the messages from there
However after a quick google search, I can't see anywhere that kafka has been incorporated with datastax enterprise.
How can I achieve this? I'm really new to datastax and kafka and all so I need some advice. Language preference- Python.
Thanks!
Good question. DSE does not incorporate Kafka out of the box, you must set up kafka yourself and then set up your spark streaming job to read from kafka. Since DSE does bundle spark, use DSE Spark to run your spark streaming job.
You can use either the direct kafka API or kafka receivers, more details here on the tradeoffs. TL;DR direct api does not require WAL or zookeeper for HA.
Here is an example of how you can configure Kafka to work with DSE by Cary Bourgeois:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master

Resources