Spark Streaming + Kafka compatibility issue

Will Spark Streaming be compatible with Kafka versions above 0.8.2.1?
Is writing a custom receiver the only option to make Spark Streaming work with Kafka versions above 0.9?

I just added inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 0.8.2 or 0.9.0.0) to the server.properties file. That lets the old 0.8.2.1 consumer receive data from newer versions of the Kafka brokers.
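For reference, the relevant fragment of server.properties might look like the following sketch; log.message.format.version is a companion broker setting (available on 0.10+ brokers, not mentioned above) that is often pinned alongside it, and the version strings here are only examples:

    # Protocol version the brokers use when talking to each other during the upgrade.
    inter.broker.protocol.version=0.8.2
    # Companion setting on 0.10+ brokers: keeps the on-disk message format
    # readable by older clients such as the 0.8.2.1 consumer.
    log.message.format.version=0.8.2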

Related

Spark Streaming with Spark 2 and Kafka 2.1

I'm upgrading a Java project from Cloudera 5.10 to Cloudera 6.2. We have Spark Streaming reading data from Kafka to process it and write the results elsewhere. During the upgrade, Spark is going from v1.6 to v2.1, and Kafka from v0.8 to v2.1.
To perform the stream processing, we were connecting to Kafka using KafkaUtils.createStream(...), but KafkaUtils is not available in Kafka 2.11 anymore. However, I can't seem to find any Spark Streaming + Kafka example or documentation in Java that doesn't use this method.
Is there something I'm missing? What is the best way to connect both worlds in these versions?
The module was renamed to spark-streaming-kafka-0-10:
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
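A Maven dependency on it would look roughly like this (illustrative; the artifact version should track your Spark release, e.g. a 2.1.x version for Spark 2.1):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>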
However, you should consider using Structured Streaming instead.

Version of Kafka Connector For Use in Spark Streaming

The latest version of Kafka available for download is Kafka 2.1.0. But to use Kafka in Spark Streaming or Spark Structured Streaming, we use the following connectors, respectively:
spark-streaming-kafka-0-10_2.11
spark-sql-kafka-0-10_2.11
My question is that the connectors seem to be for Kafka version 0.10.0.0, since their names include 0-10. Is there something I don't understand here, or are we really using connectors that are meant for much older versions of Kafka?
For Spark Structured Streaming 2.4, the Kafka 2.0 client is used.
0-10 means it is compatible with Kafka brokers at version 0.10 or above.
You can check it in the pom.xml of the Spark project: https://github.com/apache/spark/blob/branch-2.4/external/kafka-0-10-sql/pom.xml#L33
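Put differently, the 0-10 in the artifact name encodes the oldest broker protocol the connector supports, while the artifact version tracks Spark itself. A dependency for Spark 2.4 would therefore look roughly like this (illustrative; match the version to your Spark release):

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>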

Can I use Spark 2.3.0 and pyspark to do stream processing from Kafka?

I am going to do stream processing with pyspark and use Kafka as a data source.
I see that the Kafka 0.10 connector is not supported under the Spark Python API.
Can I use the Kafka 0.8 connector with Spark 2.3.0 even though it is deprecated?
It's deprecated, but not deleted. You can use it.
However, you may be interested in Structured Streaming, which has Kafka 0.10 support in Python - link here. This is the new streaming API in Spark that will replace DStreams.
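A minimal pyspark sketch of reading from Kafka with Structured Streaming might look like this (spark-sql-kafka-0-10 has to be on the classpath, e.g. via --packages; the broker address and topic name are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

    # Subscribe to a Kafka topic; bootstrap servers and topic name are placeholders.
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-topic")
          .load())

    # Kafka delivers key/value as binary columns, so cast them to strings.
    messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    # Write the stream to the console for a quick sanity check.
    query = messages.writeStream.format("console").start()
    query.awaitTermination()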

How to process XML messages in Spark Streaming and Kafka?

I am new to Spark. I am consuming messages from Kafka in XML format in Spark Streaming. Can you tell me how to process this XML in Spark Streaming?
Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for the supported versions. Stick to the stable releases first, since you're just getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark:
https://github.com/databricks/spark-xml
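As a starting point, a small pyspark sketch with that library might look like this (launched with the spark-xml package, e.g. --packages com.databricks:spark-xml_2.11:0.5.0; the row tag and input path are placeholders, and this shows the batch API rather than a full streaming job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("xml-example").getOrCreate()

    # Parse XML documents into a DataFrame; "record" is a placeholder row tag.
    df = (spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "record")
          .load("/path/to/input.xml"))

    df.printSchema()
    df.show()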

How to integrate Kafka and Spark Streaming in DataStax Enterprise Edition?

I've integrated Kafka and Spark Streaming after downloading them from the Apache website. However, I want to use DataStax for my big data solution, and I saw that you can easily integrate Cassandra and Spark.
But I can't see any Kafka modules in the latest version of DataStax Enterprise. How do I integrate Kafka with Spark Streaming here?
What I want to do is basically:
Start the necessary brokers and servers
Start a Kafka producer
Start a Kafka consumer
Connect Spark Streaming to the Kafka broker and receive the messages from there
However, after a quick Google search, I can't find anywhere that Kafka has been incorporated into DataStax Enterprise.
How can I achieve this? I'm really new to DataStax and Kafka, so I need some advice. Language preference: Python.
Thanks!
Good question. DSE does not incorporate Kafka out of the box; you must set up Kafka yourself and then configure your Spark Streaming job to read from it. Since DSE does bundle Spark, use DSE Spark to run your Spark Streaming job.
You can use either the direct Kafka API or Kafka receivers; more details on the tradeoffs here. TL;DR: the direct API does not require a WAL or ZooKeeper for HA.
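For the direct approach from Python, a minimal sketch might look like this (it uses the spark-streaming-kafka-0-8 connector, which ships a Python API; the broker address and topic name are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="dse-kafka-direct")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Direct (receiver-less) stream; broker and topic are placeholders.
    stream = KafkaUtils.createDirectStream(
        ssc, ["my-topic"], {"metadata.broker.list": "localhost:9092"})

    # Each record arrives as a (key, value) pair; print the values.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()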
Here is an example by Cary Bourgeois of how you can configure Kafka to work with DSE:
https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master
