How can I read Kafka header fields (introduced in Kafka 0.11) in Spark Structured Streaming?
I see the headers implementation is added in Spark 3.0 but not in 2.4.5.
I also see that spark-sql-kafka-0-10 uses kafka-clients 2.0 by default.
If it is not possible to read Kafka headers using Spark then can you suggest any alternative?
I haven't found a way to do it in Spark 2.x. If the use case is simple, you can use a Kafka Connect SMT (Single Message Transform).
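For readers who can move to Spark 3.0 or later, where the Kafka source accepts the includeHeaders option, a minimal PySpark sketch might look like the following. The broker address and topic name are placeholders passed in by the caller, and headers_to_dict is a hypothetical helper for illustration, not part of Spark.

```python
def kafka_header_options(bootstrap_servers, topic):
    """Options for the Kafka source with header reading switched on (Spark 3.0+)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "includeHeaders": "true",
    }


def read_with_headers(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with key, value and headers columns.

    `spark` is an existing SparkSession; the spark-sql-kafka-0-10 package
    is assumed to be on the classpath.
    """
    reader = spark.readStream.format("kafka")
    for name, value in kafka_header_options(bootstrap_servers, topic).items():
        reader = reader.option(name, value)
    # The headers column arrives as array<struct<key: string, value: binary>>.
    return reader.load().selectExpr(
        "CAST(key AS STRING)", "CAST(value AS STRING)", "headers"
    )


def headers_to_dict(headers):
    """Hypothetical helper: decode collected (key, bytes) header pairs as UTF-8."""
    return {
        k: (v.decode("utf-8") if v is not None else None)
        for k, v in (headers or [])
    }
```

Note that a dict keeps only the last value when a header key repeats, so keep the original array if duplicate keys matter.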
Related
I'm upgrading a Java project from Cloudera 5.10 to Cloudera 6.2. We have Spark Streaming reading data from Kafka to process it and write the results elsewhere. During the upgrade, Spark is going from v1.6 to v2.1, and Kafka from v0.8 to v2.1.
To perform the streaming processing, we were connecting to Kafka using KafkaUtils.createStream(...), but KafkaUtils is not available in Kafka 2.11 anymore. However, I can't seem to find any Spark Streaming + Kafka example or documentation in Java that doesn't use this method.
Is there something I'm missing? What is the best way to connect both worlds in these versions?
The module was renamed to spark-streaming-kafka-0-10
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
However, you should consider using Structured Streaming instead.
I want to create multiple Kafka topics at runtime in my Spark Structured Streaming application. I found that there are various methods available in the Java API, but I couldn't find any in Spark Structured Streaming.
Please let me know if there is a way to do this, or whether I need to use a Java library.
My Apache Spark version is 2.4.4 and the Kafka library dependency is spark-sql-kafka-0-10_2.12
AFAIK, Spark doesn't create topics.
You can use the same Java APIs you found; call them before initializing your SparkSession.
spark-sql-kafka includes kafka-clients, so the AdminClient class is already available.
How to create a Topic in Kafka through Java
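In a Scala or Java job, the AdminClient from kafka-clients mentioned above is the natural choice. As a rough illustration of the same idea (create the topics first, then start the session), here is a sketch for a Python driver. It assumes the separate kafka-python package is installed, which is not something the thread mentions; topic_specs and create_topics are hypothetical helper names.

```python
def topic_specs(names, num_partitions=1, replication_factor=1):
    """Describe each topic to create as a plain dict (hypothetical helper)."""
    return [
        {
            "name": name,
            "num_partitions": num_partitions,
            "replication_factor": replication_factor,
        }
        for name in names
    ]


def create_topics(bootstrap_servers, names, num_partitions=1, replication_factor=1):
    """Create the topics before the SparkSession and streaming query are started."""
    # Imported lazily: kafka-python is an assumption, not bundled with Spark.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        specs = topic_specs(names, num_partitions, replication_factor)
        admin.create_topics([NewTopic(**spec) for spec in specs])
    finally:
        admin.close()
```

Calling create_topics("localhost:9092", ["orders", "payments"]) before building the SparkSession would mirror the order of operations suggested in the answer; the broker address and topic names are placeholders.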
The latest version of Kafka available for download is Kafka 2.1.0. But in order to use Kafka in Spark Streaming, or Spark Structured Streaming, we use respectively the following connectors:
spark-streaming-kafka-0-10_2.11
spark-sql-kafka-0-10_2.11
My question: the connectors seem to be for Kafka version 0.10.0.0, since their names include 0-10. Is there something I don't understand here, or are we really using connectors built for much older versions of Kafka?
For Spark Structured Streaming 2.4, Kafka Client 2.0 is used.
0-10 means it is compatible with Kafka Brokers in version 0.10 or above.
You can check this in the pom.xml of the Spark project: https://github.com/apache/spark/blob/branch-2.4/external/kafka-0-10-sql/pom.xml#L33
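To make the naming concrete, the artifact name splits into three independent parts: the module (where 0-10 is the minimum broker version), the Scala version suffix, and the Spark release. A small sketch that assembles the Maven coordinate, with connector_coordinate being a hypothetical helper and 2.4.0 used only as an example Spark version:

```python
def connector_coordinate(module, scala_version, spark_version):
    """Build a Maven coordinate such as the one passed to spark-submit --packages.

    module        -- e.g. "spark-sql-kafka-0-10" (0-10 = minimum broker version)
    scala_version -- Scala binary version the artifact was built for, e.g. "2.11"
    spark_version -- Spark release, which decides the bundled kafka-clients version
    """
    return "org.apache.spark:{}_{}:{}".format(module, scala_version, spark_version)


print(connector_coordinate("spark-sql-kafka-0-10", "2.11", "2.4.0"))
# prints org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
```

None of the three parts is the Kafka client version itself; that comes from the dependency declared in the pom.xml linked above.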
I am going to do stream processing with pyspark and use Kafka as a data source.
I see that the Kafka 0.10 connector is not supported in the Spark Python API.
Can I use the Kafka 0.8 connector in Spark 2.3.0 even though it is deprecated?
It's deprecated, but not deleted. You can use it.
However, you may be interested in Structured Streaming, which has Kafka 0.10 support in Python. This is the new streaming API in Spark that will replace DStreams.
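A minimal sketch of what that could look like in PySpark, assuming the spark-sql-kafka-0-10 package is on the classpath. The function names, broker address and topic are placeholders for illustration; the console sink is used only to make the output visible while experimenting.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Options for the Structured Streaming Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def start_console_query(spark, bootstrap_servers, topic):
    """Read a topic as a stream and print decoded records to the console.

    `spark` is an existing SparkSession.
    """
    records = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # key and value arrive as binary, so cast them before use
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    )
    return records.writeStream.format("console").outputMode("append").start()
```

The returned query object can be kept running with awaitTermination().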
I am new to Spark. I am consuming messages in XML format from Kafka in Spark Streaming. Can you tell me how to process this XML in Spark Streaming?
Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for versions supported. Stick to the stable releases first since you're getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark.
https://github.com/databricks/spark-xml
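If each Kafka message is a small, self-contained XML document, an alternative to spark-xml is to parse the payload per record with the Python standard library. This is a different technique from the library linked above, sketched under the assumption that the fields of interest are direct children of the root element; parse_xml_record is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_xml_record(payload):
    """Flatten the direct children of the root element into a tag -> text dict.

    Returns None for malformed XML so that bad records can be filtered out
    instead of failing the whole batch.
    """
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return None
    return {child.tag: (child.text or "").strip() for child in root}


print(parse_xml_record("<order><id>42</id><item> book </item></order>"))
# prints {'id': '42', 'item': 'book'}
```

With a DStream whose records arrive as (key, value) pairs, this could be applied as stream.map(lambda kv: parse_xml_record(kv[1])).filter(lambda rec: rec is not None).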