Kafka Spark streaming increase message size - apache-spark

I have a scenario where I am running a Spark streaming job. This is receiving data from Kafka. All I am trying to do is fetch the records from the stream and place them in local. I also implemented offset handling for it. The size of the message can be upto 5 MB. When I tried with 0.4MB - 0.6MB files, the job was running fine but when I tried running with a 1.3MB file, which is greater than the default 1MB, I am facing the following issue.
java.lang.AssertionError: assertion failed: Ran out of messages before reaching ending offset 9 for topic lms_uuid_test partition 0 start 5. This should not happen, and indicates that messages may have been lost
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:211)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
at org.apache.spark.util.NextIterator.isEmpty(NextIterator.scala:21)
at com.scb.BulkUpload.PortalConsumeOffset$$anonfun$createStreamingContext$1$$anonfun$apply$1.apply(PortalConsumeOffset.scala:94)
at com.scb.BulkUpload.PortalConsumeOffset$$anonfun$createStreamingContext$1$$anonfun$apply$1.apply(PortalConsumeOffset.scala:93)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$35.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$35.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I tried adding the following as Kafka consumer properties hoping that the larger messages would be handled but no luck.
"send.buffer.bytes"->"5000000", "max.partition.fetch.bytes" -> "5000000", "consumer.fetchsizebytes" -> "5000000"
I hope someone might be able to help me out. Thanks in advance.

fetch.message.max.bytes - this will determine the largest size of a message that can be fetched by the consumer.
Property Name: fetch.message.max.bytes
The number of bytes of messages to attempt to fetch for each topic-partition in each fetch request. The fetch request size must be at least as large as the maximum message size the server allows or else it is possible for the producer to send messages larger than the consumer can fetch.
Example:
Kafka Producer sends 5 MB --> Kafka Broker Allows/Stores 5 MB --> Kafka Consumer receives 5 MB
If so, please set the value to fetch.message.max.bytes=5242880 and then try it should work.

Related

Spark can not recover from checkpoint ProvisionedThroughputExceededException: Rate exceeded for shard shardId

I have a spark application which consume from Kinesis with 6 shards. Data was produced to Kinesis at at most 2000 records/second. At non peak time data only comes in at 200 records/second. Each record is 0.5K Bytes. So 6 shards is enough to handle that.
I am using EMR 5.23.0, Spark 2.4.0, spark-streaming-kinesis-asl 2.4.0
I have 6 r5.4xLarge in my cluster, plenty of memory
Recently I am trying to checkpoint the application to S3. I am testing this at nonpeak time so the data incoming rate is very low like 200 records/sec. I run the Spark application by creating new context, checkpoint is created at s3, but when I kill the app and restarts, it failed to recover from checkpoint, and the error message is the following and my SparkUI shows all the batches are stucked:
19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): org.apache.spark.SparkException: Gave up after 3 retries while getting shard iterator from sequence number 49601654074184110438492229476281538439036626028298502210, last exception:
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.amazonaws.services.kinesis.model.ProvisionedThroughputExceededException: Rate exceeded for shard shardId-000000000004 in stream my-stream-name under account my-account-number. (Service: AmazonKinesis; Status Code: 400; Error Code: ProvisionedThroughputExceededException; Request ID: e368b876-c315-d0f0-b513-e2af2bd14525)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2782)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2749)
at com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2738)
at com.amazonaws.services.kinesis.AmazonKinesisClient.executeGetShardIterator(AmazonKinesisClient.java:1383)
at com.amazonaws.services.kinesis.AmazonKinesisClient.getShardIterator(AmazonKinesisClient.java:1355)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$3.apply(KinesisBackedBlockRDD.scala:247)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$3.apply(KinesisBackedBlockRDD.scala:247)
at org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:269)
... 20 more
My batch interval is 2.5 minutes, and I create the stream the regular way:
val kinesisStreams = (0 until 6).map { i =>
KinesisUtils.createStream(ssc, "MyKinesisStreamName", settings.kinesis_stream, settings.kinesis_endpoint, settings.kinesis_region, settings.kinesis_initial_position, Milliseconds(settings.spark_batchinterval_seconds*1000), StorageLevel.MEMORY_AND_DISK)
}
val unionStreams = ssc.union(kinesisStreams)
.map(linebytes => new String(linebytes, "UTF-8").replace("\"", ""))
Someone reported the same problem and does not seem to have an answer:
http://mail-archives.apache.org/mod_mbox/spark-issues/201807.mbox/%3CJIRA.13175528.1532948992000.116869.1532949000171#Atlassian.JIRA%3E
Checkpointing records with Amazon KCL throws ProvisionedThroughputExceededException
Since Spark consuming from Kinesis with checkpoint are such a common thing in many Spark application, I am wondering whether I did anything wrong? Is there any other person experiencing the same and how to resolve this?
I am wondering if this is because my batch interval 2.5 minutes (150 seconds) is too long? because 150 seconds*200 record/second = 30000 records per batch and when checkpoint recovery is trying to load 30000 records from kinesis it will throw the error? Should I increase my shard count to 30 from 6?
Please help, I have to find an answer so this is frustrating.
Appreciate your help.

S3 Slow Down exception for Spark program [duplicate]

This question already has answers here:
S3 SlowDown error in Spark on EMR
(2 answers)
Closed 4 years ago.
I have simple spark program running in EMR cluster trying to convert 60 GB of CSV file into parquet. When i submit the job i get below exception.
391, ip-172-31-36-116.us-west-2.compute.internal, executor 96): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Slow Down (Service: Amazon S3; Status Code: 503; Error Code: 503 Slow Down; Request ID: D13A3F4D7DD970FA; S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=), S3 Extended Request ID: gj3cPalkkOwtaf9XN/P+sb3jX0CNHu/QF9WTabkgP2ISuXcXdbvYO1Irg0O54OCvKlLz8WoR8E4=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
503 Slow Down is a generic response from AWS services when you're doing too many requests per second.
Possible solutions:
Copy your file to HDFS first.
Do you have one 60 Gb file or a lot of files that sums up to 60 Gb? If you have a lot of small files, try to combine them first.
Try to decrease the number of partitions in your Parquet output, if you can.
df.repartition(100)
Try using less Spark workers.
val spark = SparkSession.builder.appName("Simple Application").master("local[1]").getOrCreate()
I'm surprised that things failed; the Apache s3a client backs off when it sees a problem like this: your work is done, just more slowly.
All of Sergey's advice is good. I'd start by coalescing small files and reducing workers: a smaller cluster can deliver more performance, and save money.
One more: if you are using SSE-KMS to encrypt the data, accessing that key can trigger throttle events too; throttling shared across all applications trying to use the KMS store.

Too many open files Kafka Exception on running for long

I have a Kafka producer code in Java that watches a directory for new files using java nio WatchService api and takes any new file and pushes to a kafka topic. Spark streaming consumer reads from the kafka topic. I am getting the following error after the Kafka producer job keeps running for a day. The producer pushes about 500 files every 2 mins. My Kafka topic has 1 partition and 2 replication factor. Can someone please help?
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:342)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:166)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.createProducer(Sender.java:60)
at com.hp.hawkeye.HawkeyeKafkaProducer.Sender.<init>(Sender.java:38)
at com.hp.hawkeye.HawkeyeKafkaProducer.HawkeyeKafkaProducer.<init>(HawkeyeKafkaProducer.java:54)
at com.hp.hawkeye.HawkeyeKafkaProducer.myKafkaTestJob.main(myKafkaTestJob.java:81)
Caused by: org.apache.kafka.common.KafkaException: java.io.IOException: Too many open files
at org.apache.kafka.common.network.Selector.<init>(Selector.java:125)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:147)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:306)
... 7 more
Caused by: java.io.IOException: Too many open files
at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:130)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:69)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at java.nio.channels.Selector.open(Selector.java:227)
at org.apache.kafka.common.network.Selector.<init>(Selector.java:123)
... 9 more
Check ulimit -aH
check with your admin and increase the open files size, for eg:
open files (-n) 655536
else I suspect there might be leaks in your code, refer:
http://mail-archives.apache.org/mod_mbox/spark-user/201504.mbox/%3CCAKWX9VVJZObU9omOVCfPaJ_bPAJWiHcxeE7RyeqxUHPWvfj7WA#mail.gmail.com%3E

spark streaming assertion failed: Failed to get records for spark-executor-a-group a-topic 7 244723248 after polling for 4096

Spark Streaming Problem with Kafka DirectStream:
spark streaming assertion failed: Failed to get records for
spark-executor-a-group a-topic 7 244723248 after polling for 4096
Tried:
1) Adjust increasing spark.streaming.kafka.consumer.poll.ms
-- from 512 to 4096, less failed, but even 10s the failed still exists
2) Adjust executor memory from 1G to 2G
-- partly work, much less failed
3) https://issues.apache.org/jira/browse/SPARK-19275
-- still got failed when streaming durations all less than 8s ("session.timeout.ms" -> "30000")
4) Try Spark 2.1
-- problem still there
with Scala 2.11.8, Kafka version : 0.10.0.0, Spark version : 2.0.2
Spark configs
.config("spark.cores.max", "4")
.config("spark.default.parallelism", "2")
.config("spark.streaming.backpressure.enabled", "true")
.config("spark.streaming.receiver.maxRate", "1024")
.config("spark.streaming.kafka.maxRatePerPartition", "256")
.config("spark.streaming.kafka.consumer.poll.ms", "4096")
.config("spark.streaming.concurrentJobs", "2")
using spark-streaming-kafka-0-10-assembly_2.11-2.1.0.jar
Error stacks:
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.foreach(KafkaRDD.scala:194)
...
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:108)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:142)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:108)
...
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Losing 1%+ blocks datum from Kafka with this failure :( pls help!
Current solution:
Increase num.network.threads in kafka/config/server.properties, default is 3
Increase spark.streaming.kafka.consumer.poll.ms value ~! a large one ...
without config spark.streaming.kafka.consumer.poll.ms, it's using the spark.network.timeout, which is 120s -- causing some problem
Optional step: Decrease the "max.poll.records", default is 500
Optional step: use Future{} to run time-cost-task in parallel
A solution that worked for me:
Decrease the Kafka Consumer max.partition.fetch.bytes.
Reasoning: When the consumer polls for records from Kafka, it prefetches data up to this amount. If the network is slow, it may not be able to prefetch the full amount and the poll in the CachedKafakConsumer will timeout.

Why does Spark Streaming fail at String decoding due to java.lang.OutOfMemoryError?

I run a Spark Streaming (createStream API) application on a YARN cluster of 3 nodes with 128G RAM each (!) The app reads records from a Kafka topic and writes to HDFS.
Most of the time the application fails/is killed (mostly receiver fails) due to Java heap error no matter how much memory I configure to executor/driver.
16/11/23 13:00:20 WARN ReceiverTracker: Error reported by receiver for stream 0: Error handling message; exiting - java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.String.<init>(String.java:426)
at java.lang.String.<init>(String.java:491)
at kafka.serializer.StringDecoder.fromBytes(Decoder.scala:50)
at kafka.serializer.StringDecoder.fromBytes(Decoder.scala:42)
at kafka.message.MessageAndMetadata.message(MessageAndMetadata.scala:32)
at org.apache.spark.streaming.kafka.KafkaReceiver$MessageHandler.run(KafkaInputDStream.scala:137)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If you are using KafkaUtil.createStream(....) single Receiver will be run in an spark executor and if the topic is partioned, multiple receiver threads run for each partition. So if your stream has large string objects and the frequency is high and all threads share single executor memory you may get OOM issue.
The below are the possible solutions.
As the job fails out of memory in receiver, First check the batch and block interval properties. If batch interval is grater(like 5 min) try with lesser value like(100ms).
Limit the rate of the records received per second as "spark.streaming.receiver.maxRate", also make ensure that
"spark.streaming.unpersist" value is "true".
You may use KafkaUtil.KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](streamingContext, kafkaParams,
topics). In this case instead of single receiver spark executors
directly connect to the kafka partition leads and receive the data
parallel(each kfka partition is one KafkaRDD partition). Unlike
multiple threads in single receiver executor here multiple
executors will run parallel and load will be distributed.

Resources