Is there any parameter partitioning when Spark reads RDBMS through JDBC? - apache-spark

When I run my Spark application for table synchronization, I get the following error:
19/10/16 01:37:40 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 51)
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:590)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:57)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:1606)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:633)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:347)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:219)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think this is caused by the large amount of data in the table. I previously used the MongoDB connector's partitioning parameters, such as spark.mongodb.input.partitioner and spark.mongodb.input.partitionerOptions.partitionSizeMB.
Does Spark have similar partitioning parameters for reading an RDBMS via JDBC?

The following parameters, with their descriptions, can be used when reading an RDBMS table through Spark JDBC.
partitionColumn, lowerBound, upperBound - These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are only used to decide the partition stride, not to filter the rows in the table, so all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions - The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, Spark decreases it to this limit by calling coalesce(numPartitions) before writing.
fetchsize - The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers that default to a low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.
Note that partitionColumn, lowerBound, upperBound, and numPartitions must be used together; fetchsize is optional and independent of them. Below is an example:
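// driver, url, query, user and password are assumed to be defined elsewhere
val df = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("partitionColumn", "id")  // placeholder: use a numeric, date, or timestamp column
  .option("lowerBound", 10)
  .option("upperBound", 10000)
  .option("numPartitions", 10)
  .option("fetchsize", 1000)
  .option("dbtable", query)         // a table name, or a parenthesized subquery with an alias
  .option("user", user)
  .option("password", password)
  .load()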
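The note that lowerBound and upperBound only decide the stride can be made concrete with a small model. The sketch below is a simplified approximation of how the per-partition WHERE clauses are derived, not Spark's actual code; the object and method names are hypothetical, and it only covers the case of two or more partitions on an integer column.

```scala
object JdbcStrideSketch {
  // Approximates the predicates Spark generates, one per partition.
  // The first and last partitions are open-ended, which is why rows
  // outside [lowerBound, upperBound] are still read.
  def partitionPredicates(column: String,
                          lowerBound: Long,
                          upperBound: Long,
                          numPartitions: Int): Seq[String] = {
    require(lowerBound < upperBound, "lowerBound must be smaller than upperBound")
    require(numPartitions >= 2, "this sketch only covers the parallel case")
    val stride = upperBound / numPartitions - lowerBound / numPartitions
    (0 until numPartitions).map { i =>
      val lo = lowerBound + i * stride
      val hi = lo + stride
      if (i == 0) s"$column < $hi OR $column IS NULL"
      else if (i == numPartitions - 1) s"$column >= $lo"
      else s"$column >= $lo AND $column < $hi"
    }
  }
}
```

With the values from the example above (10, 10000, 10 partitions) the stride is 999, so a badly chosen range does not drop rows but does produce skewed partitions, because everything below the first boundary and above the last one lands in the two edge partitions.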

Related

Recovering from "Offsets out of range with no configured reset policy for partitions"

I've got a Spark Structured Streaming application (Spark 2.4.5) which is consuming from Kafka. The application was down for a bit, and when I restarted it I got the error below.
I fully understand why I'm getting the error, and I'm OK with that, but I cannot seem to get around it. In the logs I see "Recovering from the earliest offset: 1234332978", but this does not seem to be happening. I've tried deleting the 'source' folder in my checkpoint location, which also didn't help.
My code uses a mapGroupsWithState function, so I have state data which I don't want to lose. As a result, deleting the entire checkpoint directory isn't my preferred approach.
I have set:
.option("failOnDataLoss", false)
.option("startingOffsets", "latest")
But it seems this only applies to new partitions.
Is there a way to tell Spark to just accept that there are missing offsets and continue? Or some approach to delete the offset data manually without impacting the application 'state'?
20/07/29 01:02:40 WARN InternalKafkaConsumer: Cannot fetch offset 1215191190 (GroupId: spark-kafka-source-f9562fca-ab0c-4f7a-93c3-20506cbcdeb7--1440771761-executor, TopicPartition: cmusstats-0).
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "true".
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cmusstats-0=1215191190}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:970)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:490)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1259)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1187)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:64)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:500)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.next(KafkaMicroBatchReader.scala:357)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/07/29 01:02:40 WARN InternalKafkaConsumer: Some data may be lost. Recovering from the earliest offset: 1234332978
20/07/29 01:02:40 WARN InternalKafkaConsumer:
The current available offset range is AvailableOffsetRange(1234332978,1328165875).
Offset 1215191190 is out of range, and records in [1215191190, 1215691190) will be
skipped (GroupId: spark-kafka-source-f9562fca-ab0c-4f7a-93c3-20506cbcdeb7--1440771761-executor, TopicPartition: cmusstats-0).
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "true".
It turns out that the Structured Streaming application recovered eventually. For a period of time, many 'Cannot fetch offset' errors were logged, but after a while processing continued from the earliest offset.
I cannot explain why I got so many of these errors before processing resumed, but it did continue in the end.
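The ranges printed in the warnings above can be read as a simple interval intersection. The sketch below is only a model of that arithmetic, based on the log messages and not on Spark's source; OffsetRecoverySketch, OffsetRange and readable are hypothetical names.

```scala
object OffsetRecoverySketch {
  // Half-open offset range [from, until), as printed in the Kafka source log messages.
  final case class OffsetRange(from: Long, until: Long)

  // Returns the part of `requested` that Kafka can still serve, or None when
  // the whole requested range has already been aged out.
  def readable(requested: OffsetRange, available: OffsetRange): Option[OffsetRange] = {
    val start = math.max(requested.from, available.from)
    val end = math.min(requested.until, available.until)
    if (start < end) Some(OffsetRange(start, end)) else None
  }
}
```

For the logged range [1215191190, 1215691190) against AvailableOffsetRange(1234332978,1328165875) the result is None, which matches the "will be skipped" message: nothing in that range can be read any more, so it is dropped as a whole.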

Spark Structured Streaming does NOT process data after Kafka offsets expire

We have a Spark Structured Streaming app that pushes data from Kafka to S3.
The Spark job runs fine for a few days, then it starts accumulating lag. Our Kafka topic has a retention of 6 hours. If the lag increases and some of the offsets expire, Spark cannot find them and starts logging the warning below. On the surface the Spark job seems to be running, but it is NOT processing any data. When I try to restart the system manually, I get GC issues. I have set "failOnDataLoss" to "false". We want the system NOT to stop processing when offsets are not found. I do not see any error in the logs except the warning mentioned below.
The only warning we see is this:
The current available offset range is AvailableOffsetRange(34066048,34444327).
Offset 34005119 is out of range, and records in [34005119, 34006993) will be
skipped (GroupId: spark-kafka-source-6b17001a-01ff-4c10-8877-7677cdbbecfc--1295174908-executor, TopicPartition: DataPipelineCopy-46).
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "true".
20/05/17 17:16:30 INFO Fetcher: [Consumer clientId=consumer-7, groupId=spark-kafka-source-6b17001a-01ff-4c10-8877-7677cdbbecfc--1295174908-executor] Resetting offset for partition DataPipelineCopy-1 to offset 34444906.
20/05/17 17:16:30 WARN InternalKafkaConsumer: Some data may be lost. Recovering from the earliest offset: 34068782
20/05/17 17:16:30 WARN InternalKafkaConsumer:
The current available offset range is AvailableOffsetRange(34068782,34444906).
Offset 34005698 is out of range, and records in [34005698, 34007572) will be
skipped (GroupId: spark-kafka-source-6b17001a-01ff-4c10-8877-7677cdbbecfc--1295174908-executor, TopicPartition: DataPipelineCopy-1).
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "true".
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "true".
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {DataPipelineCopy-1=34005698}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:970)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:490)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1259)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1187)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.fetchData(KafkaDataConsumer.scala:470)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchRecord(KafkaDataConsumer.scala:361)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:251)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:234)
at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:209)
at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:234)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:64)
at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:500)
at org.apache.spark.sql.kafka010.KafkaMicroBatchInputPartitionReader.next(KafkaMicroBatchReader.scala:357)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/17 17:16:30 WARN ConsumerConfig: The configuration 'consumer.commit.groupid' was supplied but isn't a known config.
20/05/17 17:16:30 INFO AppInfoParser: Kafka version : 2.0.0
Before the above failure, the system seemed to be working just fine, BUT IT WAS NOT PROCESSING ANY NEW DATA FROM KAFKA.
"The current available offset range is AvailableOffsetRange(34066048,34444327). Offset 34005119 is out of range"
It seems your program is processing records much more slowly than you expect, so Kafka is probably deleting the log segments under its retention policy before they are read. Can you check what retention you have set? The offset your job is looking for is certainly lower than the available offset range.
It seems the records became unavailable before your app (the Kafka consumer) processed them, as discussed in "What determines Kafka consumer offset?".
My solutions:
1. Create a new consumer group and restart your app (with the Kafka consumer offset policy set to earliest at first).
2. If step 1 doesn't work, increase the Kafka log retention window (Kafka broker parameters log.retention.hours, log.retention.ms, or log.cleaner.delete.retention.ms, depending on your production environment).
Step 2 worked for me.

Encounter SparkException "Cannot broadcast the table that is larger than 8GB"

I am using Spark 2.2.0 to do data processing. I am using Dataframe.join to join 2 dataframes together, however I encountered this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched the Internet for this error, but didn't find any hint or solution for it.
Does Spark automatically broadcast a DataFrame as part of the join? I am very surprised by this 8GB limit, because I would have thought DataFrames support "big data" and 8GB is not very big at all.
Thank you very much in advance for your advice on this.
Linh
After some reading, I've tried to disable the auto-broadcast and it seemed to work. Change Spark config with:
'spark.sql.autoBroadcastJoinThreshold': '-1'
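In Scala, the same setting could be applied as in the sketch below. The application name is a placeholder, and whether to set the value when building the session or at runtime depends on your setup; a value of -1 disables automatic broadcast joins.

```scala
import org.apache.spark.sql.SparkSession

// Disable automatic broadcast joins when the session is created.
// "no-auto-broadcast" is a placeholder application name.
val spark = SparkSession.builder()
  .appName("no-auto-broadcast")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

// Alternatively, change it on an existing session before running the join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

With the threshold disabled, Spark no longer chooses a broadcast join based on size estimates; an explicit broadcast hint in the query would still request one.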
Currently it is a hard limit in Spark that a broadcast variable must be smaller than 8GB. See here.
The 8GB limit is generally big enough. Consider a job running with 100 executors: the Spark driver needs to send the 8GB of data to 100 nodes, resulting in 800GB of network traffic. The cost will be much lower if you don't broadcast and use a simple join.

SparkStreaming+Kafka: Failed to get records after polling for 60000

I'm doing Spark Streaming over Kafka. The streaming job starts fine and runs for a few hours before it runs into the following issue:
17/05/18 03:44:47 ERROR Executor: Exception in task 8.0 in stage 1864.0 (TID 27968)
java.lang.AssertionError: assertion failed: Failed to get records for spark-executor-c10f4ea9-a1c6-4a9f-b87f-8d6ff66e10a5 madlytics-rt_1 3 1150964759 after polling for 60000
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Also, I increased the values of heartbeat.interval.ms, session.timeout.ms and request.timeout.ms as suggested here: https://issues.apache.org/jira/browse/SPARK-19275
Given below are some relevant configs:
batch.interval = 60s
spark.streaming.kafka.consumer.poll.ms = 60000
session.timeout.ms = 60000 (default: 30000)
heartbeat.interval.ms = 6000 (default: 3000)
request.timeout.ms = 90000 (default: 40000)
Also, the Kafka cluster has 5 nodes, and the topic that I'm reading has 15 partitions. Some other Kafka configs are listed below:
num.network.threads=8
num.io.threads=8
Any help will be much appreciated. Thanks.
I solved the issue with a simple configuration change. It was quite apparent in hindsight, but it took me some time to realize that such a default (mis)configuration could be left untreated.
The primary issue is that the Spark config spark.streaming.kafka.consumer.poll.ms (default 512ms in KafkaRDD), or spark.network.timeout (default 120sec, used if spark.streaming.kafka.consumer.poll.ms is not set), is always less than the Kafka consumer's request.timeout.ms (default 305000ms in the Kafka new consumer API). Hence Spark's polling always times out before the Kafka consumer's own request/poll timeout occurs (when there are no records available in the Kafka topic).
Simply increasing spark.streaming.kafka.consumer.poll.ms to a value greater than Kafka's request.timeout.ms should do the trick. Also adjust the Kafka consumer's max.poll.interval.ms so that it is always less than request.timeout.ms.
Q.E.D and Good luck.
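The rule in this answer can be written down as a small check. The sketch below only encodes the relationship as the answer states it; PollTimeoutCheck and its methods are hypothetical names, not part of Spark or Kafka.

```scala
object PollTimeoutCheck {
  // Effective Spark poll timeout as described above:
  // spark.streaming.kafka.consumer.poll.ms if set, otherwise spark.network.timeout.
  def effectivePollMs(pollMs: Option[Long], networkTimeoutMs: Long): Long =
    pollMs.getOrElse(networkTimeoutMs)

  // True when Spark stops waiting no later than the Kafka consumer's own
  // request timeout, which is the situation the answer identifies as the problem.
  def sparkGivesUpFirst(pollMs: Option[Long],
                        networkTimeoutMs: Long,
                        requestTimeoutMs: Long): Boolean =
    effectivePollMs(pollMs, networkTimeoutMs) <= requestTimeoutMs
}
```

With the settings from the question (poll.ms = 60000, request.timeout.ms = 90000) the check returns true, which is consistent with the "after polling for 60000" assertion failure.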
In my experience this particular failure is a symptom of an overloaded Kafka cluster. The usual suspects are stop-the-world GC pauses and thread starvation.
On top of this, everything might be fine with Kafka on the surface, but perhaps not.
Is it spending a lot of time rebalancing after you added a partition? Or is it maintaining a humongous offsets topic because of all the load tests you performed?
What happened to me once is that the cluster was fine on the surface, but this timeout came out here and there. On a brand new, and even smaller, cluster, this issue disappeared.

Efficient grouping by key using mapPartitions or partitioner in Spark

I have data like the following:
[ (1, data1), (1, data2), (2, data3), (1, data4), (2, data5) ]
which I want to convert to the following, for further processing.
[ (1, [data1, data2, data4]), (2, [data3, data5]) ]
I used groupByKey and reduceByKey, but they fail because of the really large amount of data. The data is not tall but wide. In other words, the keys range from 1 up to 10000, but each value list has from 100k to 900k elements.
I am struggling with this issue and plan to apply mapPartitions or a (Hash)partitioner.
So, if one of these may work, I'd like to know:
Using mapPartitions, could you please give a code snippet?
Using a (Hash)partitioner, could you please give an example of how to control partitions by some element such as the key? For example, is there a way to create each partition based on the key (i.e. 1, 2, ... above) with no need to shuffle?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9 (flatMap at TSUMLR.scala:209) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
None of the proposed methods would work. A partitioner by definition has to shuffle the data and will suffer from the same limitations as groupByKey. mapPartitions cannot move data to another partition, so it is completely useless here. Since your description of the problem is rather vague, it is hard to give specific advice, but in general I would try the following steps:
1. Try to rethink the problem. Do you really need all the values at once? How do you plan to use them? Can you obtain the same results without collecting everything for a key into a single partition?
2. Is it possible to reduce the traffic? How many unique values do you expect? Is it possible to compress the data before the shuffle (for example, count values or use RLE)?
3. Consider using larger executors. Spark has to keep in memory only the values for a single key and can spill processed keys to disk.
4. Split your data by key:
val keys = rdd.keys.distinct.collect
val rdds = keys.map(k => rdd.filter(_._1 == k))
and process each RDD separately.
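To see what the split-by-key step produces, here is the same idea on a local Scala collection, using the sample data from the question. This is only an illustration with plain Seq values and a hypothetical object name; with an RDD, each filter is a separate pass over the data, so the approach trades one large shuffle for one scan per key.

```scala
object SplitByKeySketch {
  // Sample data from the question, with strings standing in for the payloads.
  val data: Seq[(Int, String)] =
    Seq((1, "data1"), (1, "data2"), (2, "data3"), (1, "data4"), (2, "data5"))

  // Collect the distinct keys, then build one filtered collection per key.
  val keys: Seq[Int] = data.map(_._1).distinct
  val perKey: Seq[(Int, Seq[String])] =
    keys.map(k => k -> data.filter(_._1 == k).map(_._2))
}
```

Caching the source RDD before filtering would avoid recomputing it for every key, at the cost of the memory or disk needed to hold it.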
