Executor heartbeat timed out Spark on DataProc - apache-spark

I am trying to fit a ml model in Spark (2.0.0) on a Google DataProc Cluster. When fitting the model I receive an Executor heartbeat timed out error. How can I resolve this?
Other solutions indicate this is probably due to Out of Memory of (one of) the executors. I read as solutions: Set the right setting, repartition, cache, and get a bigger cluster. What can I do, preferably without setting up a larger cluster? (Make more/less partitions? Cache less? Adjust settings?)
My setting:
Spark 2.0.0 on a Google DataProc Cluster:
1 Master and 2 workers all with the same specs: n1-highmem-8 -> 8 vCPUs, 52.0 GB memory - 500GB disk
Settings:
spark\:spark.executor.cores=1
distcp\:mapreduce.map.java.opts=-Xmx2457m
spark\:spark.driver.maxResultSize=1920m
mapred\:mapreduce.map.java.opts=-Xmx2457m
yarn\:yarn.nodemanager.resource.memory-mb=6144
mapred\:mapreduce.reduce.memory.mb=6144
spark\:spark.yarn.executor.memoryOverhead=384
mapred\:mapreduce.map.cpu.vcores=1
distcp\:mapreduce.reduce.memory.mb=6144
mapred\:yarn.app.mapreduce.am.resource.mb=6144
mapred\:mapreduce.reduce.java.opts=-Xmx4915m
yarn\:yarn.scheduler.maximum-allocation-mb=6144
dataproc\:dataproc.scheduler.max-concurrent-jobs=11
dataproc\:dataproc.heartbeat.master.frequency.sec=30
mapred\:mapreduce.reduce.cpu.vcores=2
distcp\:mapreduce.reduce.java.opts=-Xmx4915m
distcp\:mapreduce.map.memory.mb=3072
spark\:spark.driver.memory=3840m
mapred\:mapreduce.map.memory.mb=3072
yarn\:yarn.scheduler.minimum-allocation-mb=512
mapred\:yarn.app.mapreduce.am.resource.cpu-vcores=2
spark\:spark.yarn.am.memoryOverhead=384
spark\:spark.executor.memory=2688m
spark\:spark.yarn.am.memory=2688m
mapred\:yarn.app.mapreduce.am.command-opts=-Xmx4915m
Full Error:
Py4JJavaError: An error occurred while calling o4973.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 151 in stage 16964.0 failed 4 times, most recent failure: Lost task 151.3 in stage 16964.0 (TID 779444, reco-test-w-0.c.datasetredouteasvendor.internal): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175122 ms
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:371)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1155)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:91)
at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)

As this question doesn't have an answer, to summarize the issue appears to have been related to spark.executor.memory being set too low, causing occasional out-of-memory errors on an executor.
The suggested fix was to first try starting with the default Dataproc config, which tries to fully use all cores and memory available on the instance. If issues continue, then adjust spark.executor.memory and spark.executor.cores to increase the amount of memory available per task (essentially spark.executor.memory / spark.executor.cores).
Dennis also gives more details about the Spark memory config on Dataproc in the following answer:
Google Cloud Dataproc configuration issues

Related

Spark EMR Job Failing: Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0

I'm running a Spark Scala EMR Job with 12 Executors, m5.4xlarge instance type, 51GB of Heap, using Spark 2.4.6 and EMR 5.31.0. In my EMR cluster I also have the following configuration:
[{"classification":"spark", "properties":{"maximizeResourceAllocation":"true"}, "configurations":[]}]
I dont understand where my OOM error is coming from and what I can do to fix it. Before I write I do a join, groupBy,groupBy, and another join. How can I figure out what is causing the OOM error?
DF.withColumn("hour", hour(col("start")))
.withColumn("day", dayofmonth(col("start")))
.withColumn("month", month(col("start")))
.withColumn("year", year(col("start")))
.write
.partitionBy("year", "month", "day", "hour")
.mode(SaveMode.Overwrite)
.parquet(outputPath)
Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:161)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:128)
at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.add(ExternalAppendOnlyUnsafeRowArray.scala:115)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.fetchNextPartition(WindowExec.scala:343)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:369)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:303)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
DAG Visulization of Failed Stage
Aggregated Metrics By Executor of Failed Stage
Executors Tab

Error GC life time is shorter than transaction duration while writing to TiDB using Spark

I am using Apache Spark to write data in batches. The batches are of 1 day. While running the spark job I get this error. I am using MySQL java connector to connect to TiDB cluster. Spark creates 144 parallel tasks for writing.
java.sql.SQLException: GC life time is shorter than transaction duration
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3536)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3468)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1957)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2107)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2642)
at com.mysql.jdbc.ConnectionImpl.commit(ConnectionImpl.java:1610)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.mysql.jdbc.LoadBalancingConnectionProxy.invoke(LoadBalancingConnectionProxy.java:359)
at com.sun.proxy.$Proxy13.commit(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:665)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:821)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This error means, the Spark task Transaction Time is exceed than the TiDB's GC Life Time that means:
The current read data has been deleted by TiDB, since the data's life time exceed than the configed GC Life Time
so solution maybe try to increase the tikv_gc_life_time by:
update mysql.tidb set variable_value='30m' where variable_name='tikv_gc_life_time';
see more:
https://github.com/pingcap/docs/blob/master/op-guide/gc.md#configuration-and-monitor

Spark - how to write ~20TB of data from a DataFrame to a hive table or hdfs?

I am using Spark to process 20TB+ amount of data.
I'm trying to write the data into a Hive table, using the following:
df.registerTempTable('temporary_table')
sqlContext.sql("INSERT OVERWRITE TABLE my_table SELECT * FROM temporary_table")
where df is the Spark DataFrame. Unfortunately it doesn't have any dates I can partition over. When I ran the above code, I encountered the error message:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 95561 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1433)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1421)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1420)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1420)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:801)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:801)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1642)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1601)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1590)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:622)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1844)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1857)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply$mcI$sp(python.scala:126)
at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at org.apache.spark.sql.execution.EvaluatePython$$anonfun$takeAndServe$1.apply(python.scala:124)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
at org.apache.spark.sql.execution.EvaluatePython$.takeAndServe(python.scala:124)
at org.apache.spark.sql.execution.EvaluatePython.takeAndServe(python.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
The error message seems to also depends on the amount of data. With slightly smaller data, I encountered the following error message
Map output statuses were 395624469 bytes which exceeds spark.akka.frameSize (134217728 bytes).
What's a more practical way to achieve this (if the task is feasible)? I'm using Spark 1.6.
Below are the config variables when submitting the spark job:
spark-submit --deploy-mode cluster --master yarn
--executor-memory 20G
--num-executors 500
--driver-memory 64g
--driver-cores 8
--files 'my_script.py'
BTW, naively I would imagine that when the write operations happen, Spark will write data from the executors to hdfs. But the error message seems to imply that there are some data transfers between the executors and the driver?
I only have shallow knowledge in Spark so please pardon me for the dumb questions!
Check following configuration and modify as per your need ,default values are 1 g
set by SparkConf: conf.set("spark.driver.maxResultSize", "10g")
set by spark-defaults.conf: spark.driver.maxResultSize 10g
set when calling spark-submit: --conf spark.driver.maxResultSize=10g
https://spark.apache.org/docs/latest/configuration.html

AWS Glue is throwing error while processing data in TBs

I am converting CSV data on s3 in parquet format using AWS glue ETL job. Snappy compressed parquet data is stored back to s3.
Complete Architecture:
As data is uploaded to s3, a lambda function triggers glue ETL job if it's not already running. A job continuously uploads glue input data on s3. Glue successfully processes 100GB data but as input data piles up to 0.5 to 1TB, Glue job throws an error after running for a long time, say 10 hours.
Traceback (most recent call last):
File "script_2018-01-08-23-01-55.py", line 60, in <module>
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")
File "/mnt/yarn/usercache/root/appcache/application_1515414270379_0004/container_1515414270379_0004_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save
File "/mnt/yarn/usercache/root/appcache/application_1515414270379_0004/container_1515414270379_0004_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/mnt/yarn/usercache/root/appcache/application_1515414270379_0004/container_1515414270379_0004_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/mnt/yarn/usercache/root/appcache/application_1515414270379_0004/container_1515414270379_0004_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o193.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3228 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 30 more
End of LogType:stdout
I worked a lot to resolve this error but got no clue. Though I tried some suggested approach like -
setting SparkConf: conf.set("spark.driver.maxResultSize", "3g")
The above setting didn't work. I would appreciate it if you could provide any guidance to resolve this issue.
Glue default DPU count is 10 DPU, where a single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. Try increasing DPU count for single job run.
For your use case, I would suggest increasing the DPU count to 64. As for single run you are receiving nearly 1 TB of file. Currently by default you can use 100 DPU for individual ETL job run. reference Although you could always reach to AWS support for any of the limit increase.
from pyspark import SparkConf
sc_conf.set("spark.driver.maxResultSize", 0)
sc_conf.set("spark.executor.memory", '4g')
sc = SparkContext(conf=sc_conf)
that works for me, please give it a try

FileNotFoundException during compaction

All of my nodes are throwing a FileNotFoundException during compaction. As such, not a single compaction (auto, manual) can finish and my SSTable count is now in the thousands for a single CF (CQL3).
nodetool compactionstats shows hundreds of pending tasks in each node but nothing is being processed.
Below is an example log of the exception:
Error occurred during compaction
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.cassandra.db.compaction.CompactionManager.performMaximal(CompactionManager.java:281)
at org.apache.cassandra.db.ColumnFamilyStore.forceMajorCompaction(ColumnFamilyStore.java:1935)
at org.apache.cassandra.service.StorageService.forceKeyspaceCompaction(StorageService.java:2210)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
at sun.rmi.transport.Transport$1.run(Transport.java:177)
at sun.rmi.transport.Transport$1.run(Transport.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:556)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:811)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:670)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52)
at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1355)
at org.apache.cassandra.io.sstable.SSTableScanner.<init>(SSTableScanner.java:67)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1161)
at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1173)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:252)
at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:258)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:126)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
... 3 more
Caused by: java.io.FileNotFoundException: /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-31111-Data.db (No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:241)
at org.apache.cassandra.io.util.RandomAccessReader.<init>(RandomAccessReader.java:58)
at org.apache.cassandra.io.compress.CompressedRandomAccessReader.<init>(CompressedRandomAccessReader.java:76)
at org.apache.cassandra.io.compress.CompressedThrottledReader.<init>(CompressedThrottledReader.java:34)
at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48)
... 18 more
I'm currently in the middle of migrating 4.8 billion rows from MySQL which I do via sstableloader in batches of 1 to 4 million rows. Does the exception mean that I've already lost data and must repeat the migration from scratch? So far I don't see any stream error in my logs.
My environment is as follows:
DSE 4.0.1 (Cassandra 2.0.5)
CentOS 6.x x86_64
Java 1.7.0_5x
EDIT:
Some additional info:
During the bulk loading process, I devised a mechanism to kill the sstableloader when the total progress reaches 100%. I also issue a "nodetool stop INDEX_BUILD" to all nodes. The reason for this is because sstableloader waits for the secondary index build to finish and this takes hours to complete (whereas the actual import time is just a fraction of the index build time). I figured out that the imported data remains intact after killing the sstableloader process and cancelling the secondary index build so I wrote a script to automate the mechanism. So far, I have completed more than 200 bulk loads with this trick.
I have paused the migration and restarted the nodes several times in the past week because the OS load reaches high levels (yellow or red in OpsCenter) after finishing several cycles of note #1. It's possible that a compaction is in progress when I restarted the nodes via dse cassandra-stop (yes, we are running DSE as a standalone process)
Could any of these be the cause? How do I get out of this situation? Manual compaction/repair doesn't work because they always throw exceptions. For repair, the exception is different but the meaning is the same - some sstable files are missing:
ERROR [MiscStage:2] 2014-05-03 00:42:10,386 CassandraDaemon.java (line 196) Exception in thread Thread[MiscStage:2,5,main]
java.lang.RuntimeException: Tried to hard link to file that does not exist /home/cassandra/data/mtg_keywords_v5/keyword_organic_results/mtg_keywords_v5-keyword_organic_results-jb-23797-Summary.db
at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:76)
at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215)
at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1816)
at org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1849)
at org.apache.cassandra.service.SnapshotVerbHandler.doVerb(SnapshotVerbHandler.java:40)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Have you dropped and recreated the keyspace? If so, it's probably this:
https://issues.apache.org/jira/browse/CASSANDRA-4857
Restart your nodes to clear the bad filename out of memory.

Resources