spark streaming + kafka always rebalance - apache-spark

Spark Streaming is consuming from Kafka, but when new topics are added to Kafka, the Spark applications always fail (the Kafka consumer group rebalances), and the applications cannot recover automatically.
Versions:
kafka: kafka_2.11-1.0.0
spark: spark-streaming_2.11-2.4.0

Related

Spark NiFi site to site connection

I am new to NiFi. I am trying to send data from NiFi to Spark, i.e. to establish a stream from a NiFi output port to Spark, according to this tutorial.
NiFi is running on Kubernetes, and I am using the Spark operator on the same cluster to submit my applications.
It seems that Spark is able to reach the NiFi web API, and it starts a streaming receiver. However, data is not coming through the output port to the Spark app, and I get empty RDDs. I have not seen any warnings or errors in the Spark logs.
Any idea or information that could help me solve this issue is appreciated.
My code:
val conf = new SiteToSiteClient.Builder()
  .keystoreFilename("..")
  .keystorePass("...")
  .keystoreType(...)
  .truststoreFilename("..")
  .truststorePass("..")
  .truststoreType(...)
  .url("https://...../nifi")
  .portName("spark")
  .buildConfig()
val lines = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
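One way to check whether any packets arrive at all is to decode and print the received contents; a minimal sketch, assuming the NiFiDataPacket API (getContent returning the flow file bytes) from the nifi-spark-receiver module:
import java.nio.charset.StandardCharsets

// Each element of `lines` is a NiFiDataPacket; decode its content and print a
// few records per batch to verify that data is actually flowing from NiFi.
lines
  .map(packet => new String(packet.getContent, StandardCharsets.UTF_8))
  .print(10)

ssc.start()
ssc.awaitTermination()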

How to subscribe to a new topic with subscribePattern?

I am using Spark Structured Streaming with Kafka, and the topics are subscribed to as a pattern:
option("subscribePattern", "topic.*")
// Subscribe to a pattern
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.load()
Once the job is started and a new topic matching the pattern is created, say topic.new_topic, the job does not automatically start listening to it and requires a restart.
Is there a way to automatically subscribe to a new pattern without restarting the job?
Spark: 3.0.0
The default behavior of a KafkaConsumer is to check every 5 minutes whether there are new partitions to consume. This is controlled by the consumer configuration
metadata.max.age.ms: The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
According to the Kafka-specific configuration section of the Spark + Kafka Integration Guide, you can set this configuration by using the kafka. prefix, as shown below:
.option("kafka.metadata.max.age.ms", "1000")
With this setting, a newly created topic will be picked up about one second after its creation.
(Tested with Spark 3.0.0 and Kafka Broker 2.5.0)
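Putting both options together, a minimal sketch of the stream definition (the same readStream call as in the question, with the metadata refresh interval lowered):
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  // refresh topic metadata every second so newly created topics
  // matching the pattern are picked up without a restart
  .option("kafka.metadata.max.age.ms", "1000")
  .load()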

StreamingQueryException: 'Error while Describe Streams\n=== Streaming Query

I am getting the error below while running a Glue Streaming job that fails to connect to the Kinesis data source.
Error:
WARNING:root:StreamingQueryException caught. Retry number 10
ERROR:root:Exceeded maximuim number of retries in streaming interval,
Parsing the YARN logs gives this error message:
StreamingQueryException: 'Error while Describe Streams\n=== Streaming Query ===\nIdentifier: [id = 60exxxxxxxxxxxxx
Following are the sets of jars I tried:
spark-tags_2.11-2.4.0.jar,
spark-streaming-kinesis-asl_2.11-2.4.0.jar,
spark-streaming_2.11-2.4.0.jar,
aws-java-sdk-sts-1.11.271.jar,
amazon-kinesis-client-1.8.10.jar,
spark-sql_2.11-2.4.0.jar
#####################################################################
spark-tags_2.11-2.4.3.jar,
spark-streaming-kinesis-asl_2.11-2.4.3.jar,
aws-java-sdk-sts-1.11.271.jar,
jackson-dataformat-cbor-2.6.7.jar,
unused-1.0.0.jar,
spark-sql_2.11-2.4.3.jar
##########################################
spark-sql-kinesis_2.11-1.1.3-spark_2.4.jar,
spark-tags_2.11-2.4.0.jar,
unused-1.0.0.jar,
scala-library-2.11.12.jar,
spark-sql_2.11-2.4.0.jar
Please suggest a solution, since there is very little, and often vague, information on Glue Streaming and Kinesis integration.
Glue Streaming with Kinesis as a source uses a version of qubole/kinesis-sql.
The samples in that GitHub repo should be a good starting point, as should this blog post by Qubole.
Kinesis ASL (spark-streaming-kinesis-asl) uses the older Spark Streaming APIs (InputDStreams etc.). Glue Streaming has built-in support for the Spark Structured Streaming APIs, so if you use Glue Streaming with Kinesis you do not need to import all of those dependencies. If the application has to stay Kinesis ASL based, you could try building a jar with all of these dependencies and running it as a normal Glue ETL job instead of a Glue Streaming job. Porting the application to the Structured Streaming APIs would be much easier, though.
There is also an example of using the Structured Streaming APIs with Glue Streaming in the docs: https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-example.html
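For reference, a minimal sketch of reading Kinesis through the Structured Streaming API; the "kinesis" format and the streamName, endpointUrl and startingposition options are assumed from the qubole/kinesis-sql connector, and the stream name is hypothetical:
// Read from a Kinesis stream using the Structured Streaming source
// (option names assumed from the qubole/kinesis-sql connector).
val kinesisDF = spark
  .readStream
  .format("kinesis")
  .option("streamName", "my-stream")
  .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
  .option("startingposition", "TRIM_HORIZON")
  .load()

// The record payload arrives in the binary data column; cast it to a string to inspect it.
val query = kinesisDF
  .selectExpr("CAST(data AS STRING) AS payload")
  .writeStream
  .format("console")
  .start()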

Spark Streaming with Kafka - Not all Kafka messages are received

I'm working with Spark Streaming and Kafka for the first time and have been facing the following issue:
I'm using the receiver-based approach for Kafka integration with Spark Streaming as:
val kafkaConf = Map(
  "metadata.broker.list" -> "ip1:9042,ip2:9042",
  "group.id" -> "raw-email-event-streaming-consumer",
  "zookeeper.connect" -> "ip1:2181,ip2:2181")
val kafkaStream = KafkaUtils.createStream[Array[Byte], String, DefaultDecoder, StringDecoder](
  ssc, kafkaConf, Map(RN_MAIL_TOPIC -> RN_MAIL_TOPIC_PARTITIONS), StorageLevel.MEMORY_ONLY_SER)
I have found that a lot of messages from the Kafka topic are not received by the Spark Streaming job (roughly 5k of the 7k messages produced are missing). Please provide insights as to why this might be happening. I currently submit my streaming job to an Azure HDInsight cluster in standalone mode, as follows:
spark-submit --class <class_name> --master local[*] --deploy-mode client <executable>.jar
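One way to narrow this down is to count what each micro-batch actually receives before any further processing; a minimal sketch against the same receiver-based stream defined above:
// Count the records received in each batch to quantify the loss
// independently of any downstream processing.
kafkaStream.count().print()

// Optionally inspect a few decoded values per batch (the StringDecoder
// above already yields the message value as a String).
kafkaStream.map { case (_, value) => value }.print(10)

ssc.start()
ssc.awaitTermination()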

How to process DynamoDB Stream in a Spark streaming application

I would like to consume a DynamoDB Stream from a Spark Streaming application.
Spark Streaming uses the KCL to read from Kinesis. There is a library that makes the KCL able to read from a DynamoDB Stream: dynamodb-streams-kinesis-adapter.
But is it possible to plug this library into Spark? Has anyone done this?
I'm using Spark 2.1.0.
My backup plan is to have another app reading from DynamoDB stream into a Kinesis stream.
Thanks
The way to do this is to adapt KinesisInputDStream to use the worker provided by dynamodb-streams-kinesis-adapter.
The official guidelines suggest something like this:
final Worker worker = StreamsWorkerFactory
    .createDynamoDbStreamsWorker(
        recordProcessorFactory,
        workerConfig,
        adapterClient,
        amazonDynamoDB,
        amazonCloudWatchClient);
From Spark's perspective, it is implemented under the kinesis-asl module in KinesisInputDStream.scala.
I have tried this for Spark 2.4.0; here is my repo. It needs a little refining but gets the work done:
https://github.com/ravi72munde/spark-dynamo-stream-asl
After modifying the KinesisInputDStream, we can use it as shown below.
val stream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("sample-tablename-2")
  .regionName("us-east-1")
  .initialPosition(new Latest())
  .checkpointAppName("sample-app")
  .checkpointInterval(Milliseconds(100))
  .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
  .build()
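A minimal sketch of consuming the resulting DStream; the builder yields records as byte arrays, and decoding them as UTF-8 is an assumption about the adapter's payload:
import java.nio.charset.StandardCharsets

// The stream is a DStream[Array[Byte]]; decode and print a few records per
// batch to verify that DynamoDB Stream events are flowing (UTF-8 is assumed).
stream
  .map(bytes => new String(bytes, StandardCharsets.UTF_8))
  .print(10)

ssc.start()
ssc.awaitTermination()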
