Hudi Job failing with ExecutorDeadException in spark - apache-spark

I have recently started using Hudi, so I am not very experienced with Hudi internals.
I am trying to run a Hudi job in upsert mode on Spark, with around 50 GB of base data.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("hudi_test") \
    .enableHiveSupport() \
    .getOrCreate()
tableName = "hudi_test12"
basePath = "/tmp/rahul/hudi_test12" 
df = spark.read.parquet("/user/data/input")
df = df.repartition(500)
#df.show() 
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'row_id',
    'hoodie.datasource.write.partitionpath.field': 'rpt_partition_id',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'created',
    'hoodie.upsert.shuffle.parallelism': 500,
    'hoodie.insert.shuffle.parallelism': 500
} 
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
It creates around 3000 tasks at the SparkUpsertCommitActionExecutor stage, and after around 1000 tasks complete, all remaining tasks start failing and the job eventually fails.
On the executors, I see the following error logged multiple times before the executor gets killed:
23/01/24 12:37:55 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from mnplld-shddn02.india.airtel.itm/10.240.8.108:42916 is closed
23/01/24 12:37:55 INFO shuffle.RetryingBlockTransferor: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
23/01/24 12:38:01 INFO client.TransportClientFactory: Found inactive connection to mnplld-shddn02.india.airtel.itm/10.240.8.108:42916, creating a new one.
23/01/24 12:38:01 ERROR shuffle.RetryingBlockTransferor: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 19), which maintains the block data to fetch is dead.
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
23/01/24 12:38:01 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from mnplld-shddn02.india.airtel.itm:42916
org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 19), which maintains the block data to fetch is dead.
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:136)
at org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:154)
at org.apache.spark.network.shuffle.RetryingBlockTransferor.lambda$initiateRetry$0(RetryingBlockTransferor.java:184)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)
The spark-submit configuration is as follows:
spark3-submit --master yarn --jars avro-1.10.0.jar,hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar --deploy-mode cluster --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'   --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'   --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' --driver-memory 5g  --executor-memory 22g --num-executors 20 --executor-cores 12 --queue ocp test.py
I think the number of executors and cores is sufficient for the data. What is the reason for these failures, and how can I fix this?
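For context (this is my observation, not part of the original question): an ExecutorDeadException during shuffle fetch means the remote executor holding the shuffle blocks has died, frequently because YARN killed its container for exceeding the memory limit, so the YARN container logs for the dead executor are the first place to look. A hedged sketch of a resubmission with extra off-heap headroom, a longer network timeout, and fewer cores per executor; the values are illustrative guesses to tune, not a confirmed fix:

```shell
# Sketch only: same job with off-heap headroom and a longer shuffle timeout.
# The conf values below are guesses to experiment with, not verified settings.
spark3-submit --master yarn --deploy-mode cluster \
  --jars avro-1.10.0.jar,hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.executor.memoryOverhead=4g' \
  --conf 'spark.network.timeout=600s' \
  --driver-memory 5g --executor-memory 22g \
  --num-executors 20 --executor-cores 6 \
  --queue ocp test.py
```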

Related

EMR PySpark write to Redshift: java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only

I got an error when trying to write data to Redshift using PySpark on an EMR cluster.
df.write.format("jdbc") \
.option("url", "jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db") \
.option("driver", "com.amazon.redshift.jdbc42.Driver") \
.option("dbtable", "public.table") \
.option("user", user_redshift) \
.option("password", password_redshift) \
.mode("overwrite") \
.save()
The error I have got is:
py4j.protocol.Py4JJavaError: An error occurred while calling o143.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, , executor 1):
java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only;
at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.getParameterDescription(Unknown Source)
at com.amazon.redshift.client.PGClient.prepareStatement(Unknown Source)
at com.amazon.redshift.dataengine.PGQueryExecutor.<init>(Unknown Source)
at com.amazon.redshift.dataengine.PGDataEngine.prepare(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.<init>(Unknown Source)
...
I appreciate any help. Thanks!
We also faced the same issue on our EMR PySpark cluster (EMR "ReleaseLabel": "emr-5.33.0", Spark version 2.4.7).
We resolved it with the following changes:
Used the Redshift JDBC jar redshift-jdbc42-2.0.0.7.jar from https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-previous-driver-version-20.html
Changed the JDBC URL to the following:
jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db?user=username&password=password;ReadOnly=false
You can then run your spark-submit as follows:
spark-submit --jars s3://jars/redshift-jdbc42-2.0.0.7.jar s3://scripts/scriptname.py
where scriptname.py has
df.write\
.format('jdbc')\
.option("driver", "com.amazon.redshift.jdbc42.Driver")\
.option("url", jdbcUrl)\
.option("dbtable", "schema.table")\
.option("aws_iam_role", "XXXX") \
.option("tempdir", f"s3://XXXXXX") \
.mode('append')\
.save()
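The key change is appending ReadOnly=false to the JDBC URL. A minimal sketch of assembling that URL in Python; the host and credentials are the placeholder values from the post, not real ones:

```python
user = "username"        # placeholder
password = "password"    # placeholder
host = "clustername.yyyyy.us-east-1.redshift.amazonaws.com"

# ReadOnly=false is the part that avoids the "session is read-only" error.
jdbcUrl = (
    f"jdbc:redshift://{host}:5439/db"
    f"?user={user}&password={password};ReadOnly=false"
)
```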

Apache Ignite Spark Integration persists data to Ignite error

I want to persist DataFrame data to an Ignite table. It works fine when debugging locally, but when running on Spark on YARN it always reports an error.
df.write
.format(IgniteDataFrameSettings.FORMAT_IGNITE)
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, "META-INF/default-config.xml")
.option(IgniteDataFrameSettings.OPTION_STREAMER_ALLOW_OVERWRITE, true)
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "primary_key")
.option(IgniteDataFrameSettings.OPTION_STREAMER_FLUSH_FREQUENCY, 10000)
.option(IgniteDataFrameSettings.OPTION_TABLE, "tb_name")
.option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS, "template=partitioned,backups=0,affinityKey=groupcode")
.mode(SaveMode.Overwrite)
.save()
This is the spark config:
/data/spark-2.3.0-bin-hadoop2.7/bin/spark-submit\
--queue default\
--class cn.com.DataProcess\
--name DataProcess\
--master yarn\
--deploy-mode client \
--conf spark.default.parallelism=20\
--conf spark.executor.extraClassPath=/data/ignite/*\
--conf spark.driver.extraClassPath=/data/ignite/*\
--conf spark.driver.extraJavaOptions=-Dfile.encoding=utf8\
--conf spark.executor.extraJavaOptions=-Dfile.encoding=utf8\
--num-executors 6\
--executor-memory 1G\
--total-executor-cores 6\
--jars /data/ignite/service-2.0-SNAPSHOT-jar-with-dependencies.jar /data/ignite/service-2.0-SNAPSHOT.jar
Error stack:
Job aborted due to stage failure: Task 6 in stage 3.0 failed 4 times, most recent failure: Lost task 6.3 in stage 3.0 (TID 451, hadoop-slave2, executor 4): java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.ignite.spark.impl.QueryHelper$.org$apache$ignite$spark$impl$QueryHelper$$savePartition(QueryHelper.scala:155)
at org.apache.ignite.spark.impl.QueryHelper$$anonfun$saveTable$1.apply(QueryHelper.scala:117)
at org.apache.ignite.spark.impl.QueryHelper$$anonfun$saveTable$1.apply(QueryHelper.scala:116)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Has anyone encountered this problem? Please help me!
I found the source code based on the error message, and I believe the error comes from the line highlighted here:
private def savePartition(iterator: Iterator[Row],
    insertQry: String,
    tblName: String,
    ctx: IgniteContext,
    streamerAllowOverwrite: Option[Boolean],
    streamerFlushFrequency: Option[Long],
    streamerPerNodeBufferSize: Option[Int],
    streamerPerNodeParallelOperations: Option[Int]
): Unit = {
    val tblInfo = sqlTableInfo[Any, Any](ctx.ignite(), tblName).get  // <-- throws None.get here
    val streamer = ctx.ignite().dataStreamer(tblInfo._1.getName)
}
But I don't understand why this happens. Both the cache name and the table name exist. Am I using the API incorrectly?

Cassandra Connector fails when run under Spark 2.3 on Kubernetes

I'm trying to use the connector, which I've used many times in the past very successfully, with the new Spark 2.3 native Kubernetes support, and I am running into a lot of trouble.
I have a super simple job that looks like this:
package io.rhom
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf
import com.datastax.spark.connector.rdd.ReadConf
/** Computes an approximation to pi */
object BackupLocations {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("BackupLocations")
.getOrCreate()
spark.sparkContext.hadoopConfiguration.set(
"fs.defaultFS",
"wasb://<snip>"
)
spark.sparkContext.hadoopConfiguration.set(
"fs.azure.account.key.rhomlocations.blob.core.windows.net",
"<snip>"
)
val df = spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "locations", "keyspace" -> "test"))
.load()
df.write
.mode("overwrite")
.format("com.databricks.spark.avro")
.save("wasb://<snip>")
spark.stop()
}
}
which I'm building under SBT with Scala 2.11 and packaging with a Dockerfile that looks like this:
FROM timfpark/spark:20180305
COPY core-site.xml /opt/spark/conf
RUN mkdir -p /opt/spark/jars
COPY target/scala-2.11/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar /opt/spark/jars
and then executing with:
bin/spark-submit --master k8s://blue-rhom-io.eastus2.cloudapp.azure.com:443 \
--deploy-mode cluster \
--name backupLocations \
--class io.rhom.BackupLocations \
--conf spark.executor.instances=2 \
--conf spark.cassandra.connection.host=10.1.0.10 \
--conf spark.kubernetes.container.image=timfpark/rhom-backup-locations:20180306v12 \
--jars https://dl.bintray.com/spark-packages/maven/datastax/spark-cassandra-connector/2.0.3-s_2.11/spark-cassandra-connector-2.0.3-s_2.11.jar,http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.2/hadoop-azure-2.7.2.jar,http://central.maven.org/maven2/com/microsoft/azure/azure-storage/3.1.0/azure-storage-3.1.0.jar,http://central.maven.org/maven2/com/databricks/spark-avro_2.11/4.0.0/spark-avro_2.11-4.0.0.jar \
local:///opt/spark/jars/rhom-backup-locations_2.11-0.1.0-SNAPSHOT.jar
all of this works except for the Cassandra connection piece, which eventually fails with:
2018-03-07 01:19:38 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 10.4.0.46, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Exception during preparation of SELECT "user_id", "timestamp", "accuracy", "altitude", "altitude_accuracy", "course", "features", "latitude", "longitude", "source", "speed" FROM "rhom"."locations" WHERE token("user_id") > ? AND token("user_id") <= ? ALLOW FILTERING: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:323)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.com$datastax$spark$connector$rdd$CassandraTableScanRDD$$fetchTokenRange(CassandraTableScanRDD.scala:339)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$17.apply(CassandraTableScanRDD.scala:367)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at com.datastax.spark.connector.util.CountingIterator.hasNext(CountingIterator.scala:12)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:380)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/package$ScalaReflectionLock$
at org.apache.spark.sql.catalyst.ReflectionLock$.<init>(ReflectionLock.scala:5)
at org.apache.spark.sql.catalyst.ReflectionLock$.<clinit>(ReflectionLock.scala)
at com.datastax.spark.connector.types.TypeConverter$.<init>(TypeConverter.scala:73)
at com.datastax.spark.connector.types.TypeConverter$.<clinit>(TypeConverter.scala)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
at com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
at com.datastax.spark.connector.types.ColumnType$.converterToCassandra(ColumnType.scala:231)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$11.apply(CassandraTableScanRDD.scala:312)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.createStatement(CassandraTableScanRDD.scala:312)
... 23 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.package$ScalaReflectionLock$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 41 more
2018-03-07 01:19:38 INFO TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 3, 10.4.0.46, executor 1, partition 0, ANY, 9486 bytes)
I've tried everything I can think of to resolve this. Does anyone have any ideas? Could this be caused by another, unrelated issue?
It turns out that version 2.0.7 of the DataStax Cassandra Connector does not currently support Spark 2.3. I opened a JIRA ticket on DataStax's site for this, and hopefully it will be addressed soon.

Spark streaming batches stopping and waiting for GC

I have a simple Spark Streaming job. It reads events from a Kafka topic, does a simple event transformation (e.g. replacing some characters with others), and sends the transformed events to a second Kafka topic. Everything works fine for some time (1 to 1.5 hours), and after that we see that batches are scheduled (see screen below) and waiting to run. The pause takes about 5 to 6 minutes, during which the GC is working and cleaning memory. After that everything works fine again, but sometimes processing stops and we see errors in the logs like the stack trace below. Please advise which Spark / Java parameters should be set to avoid this GC overhead.
Spark jobs are scheduled every 10 seconds; one batch execution takes about 5 seconds.
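As a hedged starting point (my note, not from the original post): GC overhead in long-running streaming jobs with large heaps is often attacked by switching executors to G1GC and enabling GC logging to see what is actually filling the heap. The flags below are illustrative values to experiment with against the existing spark-submit, not a verified fix:

```shell
# Sketch: add GC options to the existing spark-submit; tune values for your heap.
spark-2.1.1-bin-hadoop2.6/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --executor-memory 20g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  ...
```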
Stack trace
2017-09-21 11:26:15 WARN TaskSetManager:66 - Lost task 33.0 in stage 115.0 (TID 4699, work8, executor 6): java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.kafka.clients.consumer.internals.Fetcher.createFetchRequests(Fetcher.java:724)
at org.apache.kafka.clients.consumer.internals.Fetcher.sendFetches(Fetcher.java:176)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1042)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-09-21 11:26:15 INFO TaskSetManager:54 - Lost task 37.0 in stage 115.0 (TID 4702) on work8, executor 6: java.lang.OutOfMemoryError (GC overhead limit exceeded) [duplicate 1]
2017-09-21 11:26:15 INFO TaskSetManager:54 - Lost task 26.0 in stage 115.0 (TID 4695) on work8, executor 6: java.lang.OutOfMemoryError (GC overhead limit exceeded) [duplicate 2]
Parameters of spark-submit
spark-2.1.1-bin-hadoop2.6/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-cores 8 \
--executor-memory 20g \
--driver-memory 20g \
--num-executors 4 \
--conf "spark.driver.maxResultSize=8g" \
--conf "spark.streaming.receiver.maxRate=1125" \
--conf "spark.streaming.kafka.maxRatePerPartition=1125" \
//Job
val sendToKafka = KafkaSender.sendToKafka(spark, kafkaServers, outputTopic, kafkaEnabled) _
val stream = KafkaUtils
.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](inputTopics, kafkaParams))
stream.foreachRDD { statementsStreamBatch =>
val offsetRanges = statementsStreamBatch.asInstanceOf[HasOffsetRanges].offsetRanges
if (!statementsStreamBatch.isEmpty) {
val inputCsvRDD = statementsStreamBatch.map(_.value)
var outputCsvRDD : RDD[String] = null
if (enricherEnabled) {
outputCsvRDD = Enricher.processStream(inputCsvRDD, enricherNumberOfFields)
} else {
outputCsvRDD = inputCsvRDD
}
sendToKafka(outputCsvRDD)
}
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
ssc.start()
ssc.awaitTermination()
//Enricher
object Enricher {
def processStream(eventStream: RDD[String], numberOfFields : Integer): RDD[String] = {
eventStream.map(
csv => if (csv.count(_ == ',') <= numberOfFields) {
csv
} else {
csv.replaceAll(",(?=[^']*',)", "#")
}
)
}
}
//KafkaSender
object KafkaSender {
def sendToKafka(spark: SparkSession, servers: String, topic: String, enabled: Boolean)(message: RDD[String]): Unit = {
val kafkaSink = spark.sparkContext.broadcast(KafkaSink(getKafkaProperties(servers)))
val kafkaTopic = spark.sparkContext.broadcast(topic)
message.foreach(kafkaSink.value.send(kafkaTopic.value, _))
}
}

Spark 2.0 + Kryo Serializer + Avro -> NullPointerException?

I have a simple pyspark program:
from pyspark import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
if __name__ == "__main__":
spark_settings = {
"spark.serializer": 'org.apache.spark.serializer.KryoSerializer'
}
conf = SparkConf()
conf.setAll(spark_settings.items())
spark_context = SparkContext(appName="test app", conf=conf)
spark_sql_context = SQLContext(spark_context)
source_path = "s3n://my_bucket/data.avro"
data_frame = spark_sql_context.read.load(source_path, format="com.databricks.spark.avro")
# The schema comes back correctly.
data_frame.printSchema()
# This count() call fails. A call to head() triggers the same error.
data_frame.count()
I run with
$SPARK_HOME/bin/spark-submit --master yarn \
--packages com.databricks:spark-avro_2.11:3.0.0 \
bug_isolation.py
It fails with the following exception and stack trace.
If I switch to --master local, it works. If I disable the KryoSerializer option, it works. If I use a Parquet source rather than an Avro source, it works.
Only the combination of --master yarn, the KryoSerializer, and an Avro source triggers the exception and stack trace listed below.
I suspect I may need to manually register some Avro plugin classes with the KryoSerializer for this to work. Which classes would I need to register?
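For reference, Kryo class registration is done through Spark conf settings. A minimal sketch, assuming it is an Avro container class such as GenericData$Array that needs registering; the class list here is my guess, not a confirmed fix for this error:

```python
# Sketch only: spark.kryo.classesToRegister takes a comma-separated class list.
# The Avro class named below is an assumption, not verified against this error.
spark_settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.classesToRegister": "org.apache.avro.generic.GenericData$Array",
}
```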
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 9, ip-172-31-97-24.us-west-2.compute.internal): java.lang.NullPointerException
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:151)
at com.databricks.spark.avro.DefaultSource$$anonfun$buildReader$1.apply(DefaultSource.scala:143)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
