How to update Kafka consumer max.request.size config while using Spark structured stream - apache-spark

Spark readStream for Kafka fails with the following errors:
org.apache.kafka.common.errors.RecordTooLargeException (The message
is 1166569 bytes when serialized which is larger than the maximum
request size you have configured with the max.request.size
configuration.)
How do we bump up the max.request.size?
Code:
val ctxdb = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "ip:port")
.option("subscribe","topic")
.option("startingOffsets", "earliest")
.option(" failOnDataLoss", "false")
.option("max.request.size", "15728640")
We have tried to update option("max.partition.fetch.bytes", "15728640") with no luck.

You need to add the kafka prefix to the writer stream setting:
.option("kafka.max.request.size", "15728640")

Related

How to stream data from Delta Table to Kafka Topic

Internet is filled with examples of streaming data from Kafka topic to delta tables. But my requirement is to stream data from Delta Table to Kafka topic. Is that possible? If yes, can you please share code example?
Here is the code I tried.
val schemaRegistryAddr = "https://..."
val avroSchema = buildSchema(topic) //defined this method
val Df = spark.readStream.format("delta").load("path..")
.withColumn("key", col("lskey").cast(StringType))
.withColumn("topLevelRecord",struct(col("col1"),col("col2")...)
.select(
to_avro($"key", lit("topic-key"), schemaRegistryAddr).as("key"),
to_avro($"topLevelRecord", lit("topic-value"), schemaRegistryAddr, avroSchema).as("value"))
Df.writeStream
.format("kafka")
.option("checkpointLocation",checkpointPath)
.option("kafka.bootstrap.servers", bootstrapServers)
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.keystore.location", kafkaKeystoreLocation)
.option("kafka.ssl.keystore.password", keystorePassword)
.option("kafka.ssl.truststore.location", kafkaTruststoreLocation)
.option("topic",topic)
.option("batch.size",262144)
.option("linger.ms",5000)
.trigger(ProcessingTime("25 seconds"))
.start()
But it fails with: org.spark_project.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema not found; error code: 40403
But when I try to write to the same topic using a Batch Producer it goes through successfully. Can anyone please let me know what am I missing in the streaming write to Kafka topic?
Later I found this old blog which says that current Structured Streaming API does not support 'kakfa' format.
https://www.databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html?_ga=2.177174565.1658715673.1672876248-681971438.1669255333

How to guarantee sequence of execution of multiple sinks in spark structured streaming

In my scenario, I have a structured streaming application which reads from kafka and writes to hdfs and kafka using 3 different sinks. Primary sink is the hdfs one and others are secondary. I want the primary sink to run first and then secondary sinks. All have a triggertime of 60seconds. Is there a way to achieve that in spark structured streaming. Adding the code snippet:
val spark = SparkSession
.builder
.master(StreamerConfig.sparkMaster)
.appName(StreamerConfig.sparkAppName)
.getOrCreate()
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown","true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles","true")
spark.conf.set("spark.dynamicAllocation.enabled","true")
spark.conf.set("spark.shuffle.service.enabled","true")
val readData = spark
.readStream
.format("kafka") .option("kafka.bootstrap.servers",StreamerConfig.kafkaBootstrapServer)
.option("subscribe",StreamerConfig.topicName)
.option("failOnDataLoss", false)
.option("startingOffsets",StreamerConfig.kafkaStartingOffset) .option("maxOffsetsPerTrigger",StreamerConfig.maxOffsetsPerTrigger)
.load()
val deserializedRecords = StreamerUtils.deserializeAndMapData(readData,spark)
val streamingQuery = deserializedRecords.writeStream
.queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
.outputMode("append")
.format("orc")
.option("path",StreamerConfig.hdfsLandingPath)
.option("checkpointLocation",StreamerConfig.checkpointLocation)
.partitionBy("date","hour")
.option("truncate","false")
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase1Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
deserializedRecords.select(to_json(struct("*")).alias("value"))
.writeStream
.format("kafka") // Local Testing - "console"
.option("topic", StreamerConfig.watermarkKafkaTopic)
.option("kafka.bootstrap.servers", StreamerConfig.kafkaBroker)
.option("checkpointLocation", StreamerConfig.phase2Checkpoints)
.trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
.start()
PS: I am using spark 2.3.2

Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic
code:
1- Reading two topics
val PERSONINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xx:9092")
.option("subscribe", "PERSONINFORMATION")
.option("group.id", "info")
.option("maxOffsetsPerTrigger", 1000)
.option("startingOffsets", "earliest")
.load()
val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "xxx:9092")
.option("subscribe", "CANDIDATEINFORMATION")
.option("group.id", "candent")
.option("startingOffsets", "earliest")
.option("maxOffsetsPerTrigger", 1000)
.option("failOnDataLoss", "false")
.load()
2- Parse data to join them:
val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s")).select("s.*")
val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
.select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s")).select("s.*")
val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
3- Join two frames
val joined_df : DataFrame = df_candidate.join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"),"inner")
val string2json: DataFrame = joined_df.select($"dfcandidate.ID".as("key"),to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))
4- Write them to a topic
string2json.writeStream.format("kafka")
.option("kafka.bootstrap.servers", xxxx:9092")
.option("topic", "toDelete")
.option("checkpointLocation", "checkpoints")
.option("failOnDataLoss", "false")
.start()
.awaitTermination()
Error message:
21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
Your code looks fine to me, it is rather the checkpointing that is causing the issue.
Based on the error message you are getting you probably ran this job with only one stream source. Then, you added the code for the stream join and tried to re-start the application without remiving existing checkpoint files. Now, the application tries to recover from the checkpoint files but realises that you initially had only one source and now you have two sources.
The section Recovery Semantics after Changes in a Streaming Query explains which changes are allowed and not allowed when using checkpointing. Changing the number of input sources is not allowed:
"Changes in the number or type (i.e. different source) of input sources: This is not allowed."
To solve your problem: Delete the current checkpoint files and re-start the job.

Trying to consuming the kafka streams using spark structured streaming

I'm new to Kafka streaming. I setup a twitter listener using python and it is running in the localhost:9092 kafka server. I could consume the stream produced by the listener using a kafka client tool (conduktor) and also using the command "bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic twitter --from-beginning"
BUt when i try to consume the same stream using Spark Structured streaming, it is not capturing and throws the error - Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
Find the screenshot below
Command output - Consumes Data
Jupyter output for spark consumer - Doesn't consume data
My Producer or listener code:
auth = tweepy.OAuthHandler("**********", "*************")
auth.set_access_token("*************", "***********************")
# session.set('request_token', auth.request_token)
api = tweepy.API(auth)
class KafkaPushListener(StreamListener):
def __init__(self):
#localhost:9092 = Default Zookeeper Producer Host and Port Adresses
self.client = pykafka.KafkaClient("0.0.0.0:9092")
#Get Producer that has topic name is Twitter
self.producer = self.client.topics[bytes("twitter", "ascii")].get_producer()
def on_data(self, data):
#Producer produces data for consumer
#Data comes from Twitter
self.producer.produce(bytes(data, "ascii"))
return True
def on_error(self, status):
print(status)
return True
twitter_stream = Stream(auth, KafkaPushListener())
twitter_stream.filter(track=['#fashion'])
Consumer access from Spark Structured streaming
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "twitter") \
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
Found what was missing, when I submitted the spark-job, I had to include the right dependency package version.
I have spark 3.0.0
Therefore, I included - org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 package
Add sink It will start consum data from kafka.
Check below code.
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "twitter") \
.load()
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
.writeStream \
.outputMode("append") \
.format("console") \ # here I am using console format .. you may change as per your requirement.
.start()
query.awaitTermination()

Mixing Spark Structured Streaming API and DStream to write to Kafka

I've recently noticed I have a confusion regarding Spark Streaming (I'm currently learning Spark).
I am reading data from a Kafka topic like this:
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
Which creates a DStream.
In order to work with event-time (and not processing-time) I did this:
outputStream
.foreachRDD(rdd => {
rdd.toDF().withWatermark("timestamp", "60 seconds")
.groupBy(
window($"timestamp", "60 seconds", "10 seconds")
)
.sum("meterIncrement")
.toJSON
.toDF("value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "taxi-dollar-accurate")
.start()
)
})
And I get the error
'writeStream' can be called only on streaming Dataset/DataFrame
Which surprised me, because the source of the DF is a DStream. Anyway, I managed to solve this by changing .writeStream to .write and .start() to .save().
But I got the feeling that I lost the streaming power on that foreach somehow. Clearly that's why I am writing this question. Is this a correct approach? I've seen other scripts that use
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
But I don't know how different is this from just calling foreach on the DStream and then transforming each RDD to DF.
But I don't know how different is this from just calling foreach on the DStream and then transforming each RDD to DF.
When you are calling:
outputStream
.foreachRDD(rdd => {
rdd.toDF()
.[...]
.toJSON
.toDF("value")
.writeStream
.format("kafka")
your variable rdd (or the Dataframe) became a single RDD which is not a stream anymore. Hence, the rdd.toDF.[...].writeStream will not work anymore.
Continue with RDD
If you choose to use the DSream approach, you can send those single RDDs calling the KafkaProducer API.
An example:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val producer = new KafkaProducer[String, String](kafkaParameters)
partitionOfRecords.foreach { message =>
producer.send(message)
}
producer.close()
}
}
However, this is not the recommended approach as you are creating and closing a KafkaProducer in each batch interval on each executor. But this should give you a basic understanding on how to write data to Kafka using the DirectStream API.
To further optimize sending your data to Kafka you can follow the guidance given here.
Continue with Dataframe
However, you could also transform your RDD into a Dataframe, but then making sure to call the batch-oriented API to write data into Kafka:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
For all the details on how to write a batch Dataframe into Kafka is geven in the Spark Structured Streaming + Kafka Integration Guide
Note
Still, and most importantly, I highly recommend to not mix up RDD and Structured API for such a case and rather stick to the one or the other.

Resources