How to process XML messages in Spark Streaming and Kafka? - apache-spark

I am new to Spark. I am consuming messages from Kafka in XML format in Spark Streaming. Can you tell me how to process this XML in Spark Streaming?

Spark Streaming and Kafka documentation is available upstream with examples:
http://spark.apache.org/docs/latest/streaming-kafka-0-8-integration.html
Here's the compatibility matrix for versions supported. Stick to the stable releases first since you're getting started with streaming:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
You could use this library to process XML records from Spark.
https://github.com/databricks/spark-xml
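As a rough illustration, here is a minimal DStream-based sketch in Scala that pulls XML strings from Kafka with the 0.8 connector linked above and hands each micro-batch to spark-xml. The broker address, topic name ("xml-events") and row tag ("record") are placeholders, and the exact XmlReader builder methods can vary between spark-xml releases, so treat this as a starting point rather than a drop-in solution:

    import kafka.serializer.StringDecoder
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.databricks.spark.xml.XmlReader

    object XmlFromKafka {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("xml-from-kafka").getOrCreate()
        val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

        // Direct stream against the 0.8 connector; broker list and topic name
        // ("xml-events") are placeholders for illustration.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("xml-events"))

        stream.map(_._2).foreachRDD { xmlRdd =>
          if (!xmlRdd.isEmpty()) {
            // spark-xml can build a DataFrame from an RDD of XML strings; the row tag
            // ("record") is an assumption about how the messages are laid out.
            val df = new XmlReader()
              .withRowTag("record")
              .xmlRdd(spark.sqlContext, xmlRdd)
            df.show(false) // replace with whatever processing or sink you need
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }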

Related

Spark Streaming with Spark 2 and Kafka 2.1

I'm upgrading a Java project from Cloudera 5.10 to Cloudera 6.2. We have Spark Streaming reading data from Kafka to process it and write the results elsewhere. During the upgrade, Spark is going from v1.6 to v2.1, and Kafka from v0.8 to v2.1.
To perform the streaming processing, we were connecting to Kafka using KafkaUtils.createStream(...), but KafkaUtils is not available in Kafka 2.11 anymore. However, I can't seem to find any Spark Streaming + Kafka example or documentation which doesn't use this method in Java.
Is there something I'm missing? What is the best way to connect both worlds in these versions?
The module was renamed to spark-streaming-kafka-0-10
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10
However, you should consider using Structured Streaming, instead.
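For reference, the renamed module keeps the KafkaUtils entry point but only offers createDirectStream; the receiver-based createStream is gone. A minimal Scala sketch against spark-streaming-kafka-0-10 follows (the Java API is analogous via JavaStreamingContext); the bootstrap servers, group id and topic name are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object Kafka010DStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-0-10-dstream")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Consumer settings; broker list, group id and topic are placeholders.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "broker1:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "my-group",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // createDirectStream replaces the receiver-based createStream from the 0.8 module.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

        stream.map((record: ConsumerRecord[String, String]) => record.value)
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }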

Fetch Kafka headers in Spark 2.4.x

How to get Kafka header fields (which were introduced in Kafka 0.11+) in Spark Structured Streaming?
I see the headers implementation is added in Spark 3.0 but not in 2.4.5.
And I see by default spark-sql-kafka-0-10 is using kafka-client 2.0.
If it is not possible to read Kafka headers using Spark then can you suggest any alternative?
I didn't find a way to do it in Spark 2.x. You can use a Kafka Connect SMT if the use case is simple.
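If moving to Spark 3.0 is an option at some point, the Kafka source there exposes headers behind the includeHeaders option. A minimal Structured Streaming sketch, with placeholder broker and topic names:

    import org.apache.spark.sql.SparkSession

    object KafkaHeaders {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("kafka-headers").getOrCreate()

        // Requires Spark 3.0+; in 2.4.x the "headers" column is simply not available.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "my-topic")
          .option("includeHeaders", "true")
          .load()

        // The headers column is an array of (key, value) structs alongside key/value.
        val withHeaders = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")

        val query = withHeaders.writeStream
          .format("console")
          .option("truncate", "false")
          .start()

        query.awaitTermination()
      }
    }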

Can I use Spark 2.3.0 and PySpark to do stream processing from Kafka?

I am going to do stream processing with pyspark and use Kafka as a data source.
I see that the Kafka 0.10 connector is not supported under the Spark Python API.
Can I use the Kafka 0.8 connector in Spark 2.3.0 even though it is deprecated?
It's deprecated, but not deleted. You can use it.
However, you may be interested in Structured Streaming, which has Kafka 0.10 support in Python - link here. This is the newer streaming API in Spark, which will replace DStreams.

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking to see whether there is a way to load streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have already tried it using Flume (Kafka source and HDFS sink).
Thanks in Advance!
There is an HDFS connector for Kafka Connect. Confluent's documentation has more information.
This is a pretty basic function for Spark Streaming. Depending on which versions of Spark and Kafka you are using, look at the corresponding Spark Streaming + Kafka integration documentation. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename").
Spark/Kafka integration guide for latest versions
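As a sketch of the saveAsTextFile approach, assuming a DStream[String] of Kafka message values has already been created as described in the integration guide (the HDFS path below is just an example):

    import org.apache.spark.streaming.dstream.DStream

    // `values` is assumed to be a DStream[String] of Kafka message values,
    // obtained e.g. via KafkaUtils.createDirectStream(...).map(_.value)
    def writeToHdfs(values: DStream[String]): Unit = {
      values.foreachRDD { (rdd, batchTime) =>
        if (!rdd.isEmpty()) {
          // One output directory per micro-batch; the path is only an example.
          rdd.saveAsTextFile(s"hdfs:///data/kafka-ingest/batch-${batchTime.milliseconds}")
        }
      }
    }

Alternatively, values.saveAsTextFiles("hdfs:///data/kafka-ingest/batch") writes one directory per batch with a timestamp suffix generated by Spark.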

Spark Streaming + Kafka compatibility issue

Will Spark Streaming be compatible with Kafka versions above 0.8.2.1?
Is writing a custom receiver the only option to make Spark Streaming use a Kafka version above 0.9?
I just added "inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 0.8.2 or 0.9.0.0)" to the server.properties file. That lets the old 0.8.2.1 consumer receive data from newer versions of the Kafka brokers.
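In other words, this is a broker-side setting placed in each broker's server.properties, for example (the version value shown is only an illustration):

    # Pin the inter-broker protocol version so older 0.8.2.1 clients can keep
    # talking to the upgraded brokers, as suggested above.
    inter.broker.protocol.version=0.8.2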
