org.apache.spark.shuffle.FetchFailedException - apache-spark

I am running this query on a data size of 4 billion rows and getting
org.apache.spark.shuffle.FetchFailedException error.
select adid, position, userid, price
from (
    select adid, position, userid, price,
           dense_rank() OVER (PARTITION BY adlocationid ORDER BY price DESC) as rank
    FROM trainInfo) as tmp
WHERE rank <= 2
I have attached the error logs from the spark-sql terminal. Please suggest what causes this kind of error and how I can resolve it.
error logs

The problem is that you lost an executor:
15/08/25 10:08:13 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 165758 ms exceeds timeout 120000 ms
15/08/25 10:08:13 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.1.223: Executor heartbeat timed out after 165758 ms
The exception occurs when trying to read shuffle data from that node. The node may be stuck in a very long GC (try a smaller heap size for the executors), suffering a network failure, or it may simply have crashed. Normally Spark recovers from a lost node like this one, and indeed it starts resubmitting the first stage to another node. Depending on how big your cluster is, that retry may or may not succeed.
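If the cause is long GC pauses or slow heartbeats rather than an outright crash, a common first step is to give the executors a more suitable heap and more generous timeouts. A minimal sketch, assuming you build a SparkSession in your job script; the exact values need tuning for your cluster and can equally be passed as --conf options to spark-submit or the spark-sql shell:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "8g")               # smaller or larger heap depending on GC behaviour
         .config("spark.executor.heartbeatInterval", "30s")   # default 10s
         .config("spark.network.timeout", "300s")             # must stay well above the heartbeat interval
         .getOrCreate())

None of these values is a magic number; they only make executors less likely to be declared lost while they are still alive.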

Related

AWS Glue ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated

I have a simple Glue job where I am using PySpark to read 14 million rows from RDS over JDBC and then trying to save them into S3. I can see from the output logs in Glue that reading and creating the dataframe is quick, but the write operation fails with the error:
error occurred while calling o89.save. Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, 10.150.85.95, executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I have tried the following solutions:
Adding --conf with spark.executor.memory=10g and also with 30g after seeing some solutions on SO.
Tried converting the Spark df to a DynamicFrame and then calling the save operation.
Tried increasing the workers to 500!
And still no luck getting it to pass.
One weird thing I observed is that after I create the dataframe by reading from JDBC, it keeps the entire df in one partition until I repartition it. But the reading step itself completes without any error.
I used the same code to run for 6M rows and the job completes in 5 mins.
But it fails for 14M rows with the ExecutorLostFailure Error.
I also sometimes see this error if I dig deeper into the logs:
2023-01-22 10:36:52,972 WARN [allocator] glue.ExecutorTaskManagement (Logging.scala:logWarning(66)): executor task creation failed for executor 203, restarting within 15 secs. restart reason: Executor task resource limit has been temporarily hit..
Code:
def read_from_db():
    logger.info(f'Starts Reading Data from {DB_TABLE} table')
    start = time.perf_counter()
    filter_query = f'SELECT * FROM {DB_TABLE}'
    sql_query = '({}) as query'.format(filter_query)
    spark_df = (glueContext.read.format('jdbc')
                .option('driver', 'org.postgresql.Driver')
                .option('url', JDBC_URL)
                .option('dbtable', sql_query)
                .option('user', DB_USERS)
                .option('password', DB_PASSWORD)
                .load()
                )
    end = time.perf_counter()
    logger.info(f'Count of records in DB is {spark_df.count()}')
    logger.info(f'Elapsed time for reading records from {DB_TABLE} table = {end - start:0.4f} seconds')
    logger.info(f'Finished Reading Data from {DB_TABLE} table')
    logger.info(f"Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # def write_to_s3(spark_df_rep):
    #     S3_PATH = (
    #         f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    #     )
    #     spark_df_rep.write.format("csv").option("header", "true").save(S3_PATH)

    spark_df = spark_df.repartition(20)
    logger.info(f"Completed Repartitioning. Total no. of partitions - {spark_df.rdd.getNumPartitions()}")

    # spark_df.foreachPartition(write_to_s3)
    # spark_dynamic_frame = DynamicFrame.fromDF(spark_df, glueContext, "spark_dynamic_frame")
    # logger.info("Conversion to DynamicFrame complete")
    # glueContext.write_dynamic_frame.from_options(
    #     frame=spark_dynamic_frame,
    #     connection_type="s3",
    #     connection_options={"path": S3_PATH},
    #     format="csv"
    # )

    S3_PATH = (
        f"{S3_BUCKET}/all-entities-update/{date}/{cur_time}"
    )
    spark_df.write.format("csv").option("header", "true").save(S3_PATH)
    return
In many cases this rather cryptic error message signals an OOM. Setting spark.task.cpus to a value greater than the default of 1 (up to 8, which is the number of cores on a G2.X worker for Glue version 3 or higher) helped me. This effectively increases the amount of memory a single Spark task gets (at the cost of a few cores sitting idle).
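As a hedged sketch of what that looks like when the session is configured in the job script (the value 4 is only an example; in Glue the same key is typically supplied through the job's --conf parameter instead):

from pyspark.sql import SparkSession

# Reserve 4 cores per task so fewer tasks run concurrently on each executor,
# giving every task a larger share of executor memory. Default is 1.
spark = (SparkSession.builder
         .config("spark.task.cpus", "4")
         .getOrCreate())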
I understood this was because no memory was left on one executor. Increasing workers doesn't help, because 1 Worker → 1 Executor → 2 DPUs, and even the maximum configuration with G2.X doesn't give a single executor more memory.
The issue cropped up because the data was skewed: all rows in my database were similar except for 2 of the 13 columns, so PySpark wasn't able to spread them across partitions and tried to load all of my rows into a single partition.
So increasing workers/executors was of no help. I solved this by loading the data into different partitions manually; Spark really was trying to keep everything in one partition (I verified that it was in one partition), and even adding a repartition did not help.
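A quick, hedged way to confirm this kind of skew from inside the job is to count rows per partition before writing (spark_partition_id is a standard PySpark function; spark_df is the dataframe from the JDBC read above):

from pyspark.sql.functions import spark_partition_id

# One output row per Spark partition; a single huge count confirms that
# everything landed in one partition after the JDBC read.
spark_df.groupBy(spark_partition_id().alias("partition_id")).count().show()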
I was getting the error while writing, not while reading - that was the source of the confusion. The actual issue was with the read: the JDBC read is lazy and is only triggered when the write (an action) is called, so the error surfaced at the write step.
From other SO answers:
Spark reads the data only once an action is applied; since you are just reading and writing to S3, the data is read when the write is triggered.
Spark is not optimized to read bulk data from an RDBMS, as it establishes only a single connection to the database.
Write the data out in parquet format in parallel (see the sketch below).
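A minimal sketch of that last suggestion, reusing spark_df and S3_PATH from the code above (the question's final solution keeps CSV, so treat this purely as the parquet variant):

# Repartition so several tasks write in parallel, then write columnar parquet instead of CSV.
# 20 partitions is an arbitrary example value.
(spark_df
 .repartition(20)
 .write
 .mode("overwrite")
 .parquet(S3_PATH))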
Also see:
Databricks Spark Pyspark RDD Repartition - "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues."
Manually partition for skewed data
I added a temporary column called RNO (row number) and used it as the partitionColumn to split the data into partitions; the partition column has to be an int or datetime. Once the job is done, I drop the RNO column either in the job itself or manually.
I had to read 14 million records from the RDBMS and then write them to S3 so that each file holds around 200k records.
This is where upperBound, lowerBound and numPartitions come in, along with your partitionColumn.
I ran with upperBound = 14,000,000, lowerBound = 1 and numPartitions = 70 to check whether each file gets roughly 200k records ((upperBound - lowerBound) / numPartitions). It created 65 files and the job ran successfully within 10 minutes.
filter_query = f'select ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS RNO, * from {DB_TABLE}'
sql_query = '({}) as query'.format(filter_query)
spark_df = (spark.read.format('jdbc')
            .option('driver', 'org.postgresql.Driver')
            .option('url', JDBC_URL)
            .option('dbtable', sql_query)
            .option('user', DB_USERS)
            .option('password', DB_PASSWORD)
            .option('partitionColumn', 'RNO')
            .option('numPartitions', 70)
            .option('lowerBound', 1)
            .option('upperBound', 14000000)
            .load()
            )
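As a small hedged follow-up to the note above about dropping the helper column: once the partitioned read has happened, RNO is no longer needed, so it can be removed before the data is written out.

# RNO exists only to drive the JDBC partitioning; drop it before writing to S3.
spark_df = spark_df.drop('RNO')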
Additional references:
https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/

Max Executor failures in Spark dynamic allocation

I am using Spark's dynamic allocation feature to run my Spark job. It allocates around 50-100 executors. For some reason a few executors are lost, which results in the job shutting down. The log shows that this happened because the maximum number of executor failures was reached. It is set to 3 by default, so when 3 executors are lost the job gets killed even though another 40-50 executors are still running.
I know that I can change the max executor failure limit, but that seems like a workaround. Is there something else I can try? All suggestions are welcome.

Negative Active Tasks in Spark UI under load (Max number of executor failed)

I am running a Spark Streaming application on Spark 1.5.0 in CDH 5.5.0. In the logs I see "max number of executor failed" and I am unable to find the root cause.
We are hitting this issue intermittently, every other day. Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures reached)
It's a bug; you can track the changes in the following tickets:
https://issues.apache.org/jira/browse/SPARK-5098,
https://issues.apache.org/jira/browse/SPARK-10141,
https://issues.apache.org/jira/browse/SPARK-2319
Edit: about the "Max number of executors failed" message - Spark has the parameter spark.yarn.max.executor.failures. By default it is 2x the number of executors, with a minimum of 3. If there are more failures than this parameter allows, the application will be killed.
You can change the value of this parameter. However, I would be worried about why you have so many executor failures - maybe you have too little memory, or a bug in the code? Without code and/or context information we cannot help investigate a potential bug.
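For reference, a hedged sketch of how these knobs are commonly raised when running on YARN with dynamic allocation (the values are examples only; the YARN setting normally has to be supplied at submit time via --conf, and raising the limit does not address why executors keep dying):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "100")
         .config("spark.yarn.max.executor.failures", "20")   # default: max(2 * numExecutors, 3)
         .getOrCreate())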

cassandra spark connector read performance

I have some Spark experience but am just starting out with Cassandra. I am trying to do a very simple read and getting really bad performance -- I can't tell why. Here is the code I am using:
sc.cassandraTable("nt_live_october","nt")
.where("group_id='254358'")
.where("epoch >=1443916800 and epoch<=1444348800")
.first
all 3 params are part of the key on the table:
PRIMARY KEY (group_id, epoch, group_name, auto_generated_uuid_field)
) WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
And the output I see from my driver is like this:
15/10/07 15:05:02 INFO CassandraConnector: Connected to Cassandra cluster: shakassandra
15/10/07 15:07:02 ERROR Session: Error creating pool to attila./198.xxx:9042
com.datastax.driver.core.ConnectionException: [attila./198.xxx:9042] Unexpected error during transport initialization (com.datastax.driver.core.OperationTimedOutException: [attila/198.xxx:9042] Operation timed out)
15/10/07 15:07:02 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
15/10/07 15:07:03 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on osd09:39903 (size: 4.8 KB, free: 265.4 MB)
15/10/07 15:08:23 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 8) in 80153 ms on osd09 (1/1)
15/10/07 15:08:23 INFO DAGScheduler: ResultStage 6 (take at CassandraRDD.scala:121) finished in 80.958 s
15/10/07 15:08:23 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool
15/10/07 15:08:23 INFO DAGScheduler: Job 5 finished: take at CassandraRDD.scala:121, took 81.043413 s
I expect this query to be really fast, yet it's taking over a minute. A few things jump out at me:
It takes almost two minutes to get the session error -- I pass the IPs of 3 nodes to the Spark Cassandra connector -- is there a way to tell it to skip failed connections faster?
The task gets sent to a Spark worker which is not a Cassandra node -- this seems pretty strange to me -- is there a way to find out why the scheduler chose to send the task to a remote node?
Even if the task was sent to a remote node, the Input Size (Max) on that worker shows up as 334.0 B / 1, yet the executor time is 1.3 min (see picture). This seems really slow -- I would expect the time to be spent on deserialization, not compute...
Any tips on how to debug this, or where to look for potential problems, are much appreciated. I am using Spark 1.4.1 with connector 1.4.0-M3, Cassandra ReleaseVersion 2.1.9, and all defaults on the tunable connector params.
I think the problem lies in the distribution of data between partitions. Your table has a single partition key - group_id; epoch is only a clustering column. Data is distributed across the cluster nodes only by group_id, so you have one huge partition with group_id='254358' sitting on one node of the cluster.
When you run your query, Cassandra reaches the partition with group_id='254358' very quickly and then filters through all of its rows to find the records with epoch between 1443916800 and 1444348800. If there are a lot of rows, the query will be really slow. In effect the query is not distributed at all; it will always run on one node.
A better practice is to extract the date (or even the hour) and add it to the partition key; in your case something like:
PRIMARY KEY ((group_id, date), epoch, group_name, auto_generated_uuid_field)
WITH CLUSTERING ORDER BY (epoch ASC, group_name ASC, auto_generated_uuid_field ASC)
To verify my hypothesis, you can run your current query in cqlsh with tracing turned on (read here how to do it). So the problem has nothing to do with Spark.
About the error and the time it takes to appear: everything is fine there, because you only receive the error after the timeout has elapsed.
I also recall the spark-cassandra-connector recommendation to place Spark workers on the Cassandra nodes themselves, exactly so that queries can be distributed by partition key.

Severe straggler tasks due to Locality Level being "Any" and a Network Fetch on cached RDD

A cached dataset that has been completely read through - successfully - is being reprocessed. A small number (typically 2/204 tasks - 1%) of the tasks may fail on a subsequent pass over the same (still cached) dataset. We are on spark 1.3.1.
The following screenshot shows that - of 204 tasks - the last two seem to have been 'forgotten' by the scheduler.
Is there any way to get more information about these tasks that are in limbo?
All of the other tasks completed within a similar time: in particular, the 75th percentile is still within 50% of the median. It is just these last two stragglers that are killing the overall job completion time. Notice also that they are not due to record-count skew.
Update: the two stragglers did finally finish - at over 7 minutes (more than 3x longer than any of the other 202 tasks)!
15/08/15 20:04:54 INFO TaskSetManager: Finished task 201.0 in stage 2.0 (TID 601) in 133583 ms on x125 (202/204)
15/08/15 20:09:53 INFO TaskSetManager: Finished task 189.0 in stage 2.0 (TID 610) in 423230 ms on i386 (203/204)
15/08/15 20:10:05 INFO TaskSetManager: Finished task 190.0 in stage 2.0 (TID 611) in 435459 ms on i386 (204/204)
15/08/15 20:10:05 INFO DAGScheduler: Stage 2 (countByKey at MikeFilters386.scala:76) finished in 599.028 s
Suggestions on what to look for /review appreciated.
Another update: the TYPE turned out to be Network for those two tasks. What does that mean?
I had a similar issue to yours. Try increasing spark.locality.wait.
If that works, the following might apply to you:
https://issues.apache.org/jira/browse/SPARK-13718#
** ADDED **
Some extra information that I found helpful.
Spark will always initially assign a task to the executor that contains the respective cached RDD partition.
If the task is not accepted within the locality timeouts defined in the Spark config, it will then try NODE_LOCAL, RACK_LOCAL and ANY, in that sequence.
Regardless of whether the cached data are also available locally (HDFS replicas), Spark will always fetch the cached partition from the node that holds it. It only re-computes the partition if that executor crashed and the RDD is no longer cached. In many cases this also causes a network bottleneck on the original straggler node.
Have you tried using Spark speculation (spark.speculation true)? Spark will identify these stragglers and relaunch them on another node.
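A hedged sketch of both suggestions together, in PySpark form (the exact values are workload-dependent assumptions, and the same keys can be passed via --conf or spark-defaults.conf instead):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.locality.wait", "10s")          # default 3s: wait longer for a data-local slot
         .config("spark.speculation", "true")           # relaunch suspiciously slow tasks on another node
         .config("spark.speculation.multiplier", "3")   # consider a task slow at 3x the median runtime
         .getOrCreate())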

Resources