I am using Spark 2.4.5 with Java 1.8 in my Spark job.
It is failing with the error below:
Application application_1650878536374_11143 failed 1 times in previous
3600000 milliseconds (global limit =2; local limit is =1) due to
ApplicationMaster for attempt appattempt_1650878536374_11143_000001
timed out. Failing the application
What does (global limit =2; local limit is =1) mean, and how do I fix this?
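For context, as far as I understand, the two numbers map to the cluster-wide AM attempt limit and the per-application limit that Spark requests, and the 3600000 ms window is the attempt-failures validity interval. The settings below are where I would expect those limits to come from (my assumption; the values are only illustrative):

yarn.resourcemanager.am.max-attempts = 2              # yarn-site.xml, the "global" limit
spark.yarn.maxAppAttempts = 1                         # spark-submit --conf, the "local" limit
spark.yarn.am.attemptFailuresValidityInterval = 1h    # the 3600000 ms window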
Related
I'm trying to create a session in Apache Spark using the Livy REST API. It fails with the following error: User capacity has reached its maximum limit.
The user is running another Spark job. I don't understand which capacity has reached its maximum or how to fix it by adjusting Spark conf parameters. Here is the log info that I think is relevant; I have reformatted it to make it clearer:
22/05/30 19:18:51 INFO Client: Submitting application application_1653913029140_0247 to ResourceManager
22/05/30 19:18:51 INFO YarnClientImpl: Submitted application application_1653913029140_0247
22/05/30 19:18:51 INFO Client: Application report for application_1653913029140_0247 (state: ACCEPTED)
22/05/30 19:18:51 INFO Client:
client token: N/A
diagnostics: [Mon May 30 19:18:51 -0300 2022]
Application is Activated, waiting for resources to be assigned for AM. User capacity has reached its maximum limit.
Details : AM Partition = <DEFAULT_PARTITION> ;
Partition Resource = <memory:2662400, vCores:234> ;
Queue's Absolute capacity = 32.0 % ;
Queue's Absolute used capacity = 40.76923 % ;
Queue's Absolute max capacity = 100.0 % ;
Queue's capacity (absolute resource) = <memory:851967, vCores:74> ;
Queue's used capacity (absolute resource) = <memory:1085440, vCores:106> ;
Queue's max capacity (absolute resource) = <memory:2662400, vCores:234> ; "
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1653949131433
final status: UNDEFINED
tracking URL: http://vrt1557.bndes.net:8088/proxy/application_1653913029140_0247/
user: s-dtl-p01
22/05/30 19:18:51 INFO ShutdownHookManager: Shutdown hook called
The other running job has configured some spark parameters for high performance:
conf = {'spark.yarn.appMasterEnv.PYSPARK_PYTHON': 'python3',
'spark.cores.max': 50,
'spark.executor.memory': '10g',
'spark.executor.instances': 100,
'spark.driver.memory' : '10g'
}
The job that failed to start didn't configure any Spark parameters and uses the cluster default values.
Sure, I can tweak the Spark parameters of the running job so it doesn't prevent the allocation of resources for the new job, but I'd like to understand what is happening. The queue configuration also has a lot of parameters that interact with the application.
Which resource is exhausted? How do I discover it from the log above?
This diagnostic is produced by the YARN capacity scheduler when it determines that allocating the resources requested by the application would violate preset per-user limits. Here is the relevant piece from LeafQueue.java:
:
if (!userAssignable) {
application.updateAMContainerDiagnostics(AMState.ACTIVATED,
"User capacity has reached its maximum limit.");
ActivitiesLogger.APP.recordRejectedAppActivityFromLeafQueue(
activitiesManager, node, application, application.getPriority(),
ActivityDiagnosticConstant.QUEUE_HIT_USER_MAX_CAPACITY_LIMIT);
continue;
}
:
Hence, the queue-level metrics you cited are probably insufficient to identify which per-user capacity limit is being breached. You could enable DEBUG logging for the scheduler and then look for one of the messages generated by the LeafQueue.canAssignToUser() method.
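For example, if the ResourceManager uses log4j, a line like the following in its log4j.properties should surface those per-user limit calculations (a sketch; the logger name assumes the stock CapacityScheduler package):

log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity=DEBUG

Once you see which limit is being hit, the usual knobs are the queue's minimum-user-limit-percent and user-limit-factor settings in capacity-scheduler.xml.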
I found in the logs that a timeout set to 120s is killing cron workers.
The first issue I noticed is that a plugin which makes database backups gets stuck in a loop and produces zip after zip, so the disk is full within 1-2 hours.
The second issue is the scheduled action called Mass Mailing: Process queue in Odoo.
It should run every 60 minutes, but it is getting killed by the timeout and runs again immediately after being killed.
Where should I look for this timeout? I have already raised all the timeouts in odoo.conf to 500 seconds.
Odoo v12 Community, Ubuntu 18, nginx
2019-12-02 06:43:04,711 4493 ERROR ? odoo.service.server: WorkerCron (4518) timeout after 120s
2019-12-02 06:43:04,720 4493 ERROR ? odoo.service.server: WorkerCron (4518) timeout after 120s
The following timeouts, which you can find in odoo.conf, are usually the ones responsible for the behaviour you are experiencing (in particular the second one):
limit_time_cpu = 60
limit_time_real = 120
More explanation is available in the Odoo documentation: https://www.odoo.com/documentation/12.0/reference/cmdline.html#multiprocessing
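For example, to give cron workers more headroom you could raise both limits in odoo.conf and restart the Odoo service (the values are illustrative, not a recommendation):

limit_time_cpu = 600
limit_time_real = 1200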
I wrote an application based on Spark Streaming (DStream) to pull messages coming from PubSub, and I am facing errors during the execution of this job. I am using a cluster of 4 nodes to run the Spark job.
After the job has run for 10 minutes without any specific error, I permanently get the following error:
ERROR org.apache.spark.streaming.CheckpointWriter:
Could not submit checkpoint task to the thread pool executor java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler@68395dc9 rejected
from java.util.concurrent.ThreadPoolExecutor@1a1acc25
[Running, pool size = 1, active threads = 1, queued tasks = 1000, completed tasks = 412]
I am trying to join a DataFrame using the joinWithCassandraTable function.
With a small dataset in non-prod everything went fine, but when we went to prod, due to the huge data volume and other connections to Cassandra, it threw the exception below.
ERROR [org.apache.spark.executor.Executor] [Executor task launch worker for task 498] - Exception in task 4.0 in stage 8.0 (TID 498)
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /<host1>:9042
(com.datastax.driver.core.exceptions.BusyPoolException: [/<host3>] Pool is busy (no available connection and the queue has reached its max size 256)), Pool is busy (no available connection and the queue has reached its max size 256)),
The same code worked absolutely fine with Cassandra connector 1.6. But when we upgraded Spark to 2.1.1 and the Spark Cassandra connector to 2.0.1, it started giving these issues.
Please let me know if you have faced a similar issue and what the resolution could be.
Code we used:
import com.datastax.spark.connector._  // provides joinWithCassandraTable, AllColumns, SomeColumns

ourDF.select("joincolumn")
  .rdd
  .map(row => Tuple1(row.getString(0)))
  .joinWithCassandraTable("key_space", "table", AllColumns, SomeColumns("<join_column_from_cassandra>"))
Spark Version: 2.1.1
Cassandra connector version: 2.0.1
Tune this parameter in your Spark conf:
spark.cassandra.input.reads_per_sec
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#read-tuning-parameters
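For instance, a throttled read could look like the sketch below; the host and the value are placeholders, so tune them for your cluster:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: limit the read requests the connector issues per core per second.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "<cassandra-host>")  // placeholder contact point
  .set("spark.cassandra.input.reads_per_sec", "100")           // example value, not a recommendation
val sc = new SparkContext(conf)

The same property can also be passed on spark-submit with --conf.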
I am getting the following error in OpsCenter; after some time the issue resolved itself.
Error initializing cluster data: The request to
/APP_Live/keyspaces?ksfields=column_families%2Creplica_placement_strategy%2Cstrategy_options%2Cis_system%2Cdurable_writes%2Cskip_repair%2Cuser_types%2Cuser_functions%2Cuser_aggregates&cffields=solr_core%2Ccreate_query%2Cis_in_memory%2Ctiers timed out after 10 seconds..
If you continue to see this error message, you can work around this timeout by setting [ui].default_api_timeout to a value larger than 10 in opscenterd.conf and restarting opscenterd. Note that this is a workaround and you should also contact DataStax Support to follow up.
The workaround for this timeout is to set [ui].default_api_timeout to a value larger than 10 in opscenterd.conf and restart opscenterd.
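For example (the timeout value is illustrative):

[ui]
default_api_timeout = 60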