Bull.js jobs stalling despite timeout being set

I have a Bull queue running lengthy video upload jobs which could take any amount of time from < 1 min up to many minutes.
The jobs stall after the default 30 seconds, so I increased the timeout to several minutes, but this is not respected. If I set the timeout to 10ms it immediately stalls, so it is taking timeout into account.
Job {
  opts: {
    attempts: 1,
    timeout: 600000,
    delay: 0,
    timestamp: 1634753060062,
    backoff: undefined
  },
  ...
}
Despite the timeout, I am receiving a stalled event, and the job starts to process again.
EDIT: I thought "stalling" was the same as timing out, but apparently there is a separate timeout for how often Bull checks for stalled jobs. In other words the real problem is why jobs are considered "stalled" even though they are busy performing an upload.

The problem seems to be that your job stalls because the operation you are running blocks the event loop. You could convert your code to a non-blocking version and solve the problem that way.
That being said, the stalled-job check interval can be set in the queue settings when creating the queue (more of a quick fix):
const Bull = require('bull');

const queue = new Bull('queue', {
  // Redis connection options belong under the `redis` key
  redis: {
    port: 6379,
    host: 'localhost',
    db: 0,
  },
  settings: {
    // change the default from 30 sec to 1 hour; set 0 to disable the stalled check
    stalledInterval: 60 * 60 * 1000,
  },
})
Based on Bull's docs:
timeout: The number of milliseconds after which the job should be fail with a timeout error
stalledInterval: How often check for stalled jobs (use 0 for never checking)
Increasing stalledInterval (or disabling the check by setting it to 0) removes the check that makes sure the event loop is still running, which forces the system to ignore the stalled state.
Again from the docs:
When a worker is processing a job it will keep the job "locked" so other workers can't process it.
It's important to understand how locking works to prevent your jobs from losing their lock - becoming _stalled_ -
and being restarted as a result. Locking is implemented internally by creating a lock for `lockDuration` on interval
`lockRenewTime` (which is usually half `lockDuration`). If `lockDuration` elapses before the lock can be renewed,
the job will be considered stalled and is automatically restarted; it will be __double processed__. This can happen when:
1. The Node process running your job processor unexpectedly terminates.
2. Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see [#488](https://github.com/OptimalBits/bull/issues/488) for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the `lockDuration` setting (with the tradeoff being that it will take longer to recognize a real stalled job).
As such, you should always listen for the `stalled` event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.
As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor always crashes its Node process), jobs will be recovered from a stalled state a maximum of `maxStalledCount` times (default: `1`).
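The docs quoted above suggest breaking a CPU-heavy processor into smaller parts so that no single part blocks the event loop. A minimal sketch of that idea follows, assuming the work can be expressed as a list of items; `processInChunks`, `yieldToEventLoop` and `chunkSize` are hypothetical names and not part of Bull. Between chunks the function yields with `setImmediate`, which gives pending timers (such as a lock-renewal timer) a chance to fire.

```javascript
// Hypothetical helper, not part of Bull: resolves on the next turn of the
// event loop so that pending timers and I/O callbacks can run.
function yieldToEventLoop() {
  return new Promise((resolve) => setImmediate(resolve));
}

// Hypothetical helper: applies a synchronous handleItem to every item,
// but only chunkSize items at a time before yielding to the event loop.
async function processInChunks(items, handleItem, chunkSize = 100) {
  const results = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    for (const item of items.slice(i, i + chunkSize)) {
      results.push(handleItem(item));
    }
    await yieldToEventLoop();
  }
  return results;
}
```

Inside a job processor this would be awaited like any other promise. A smaller `chunkSize` yields more often at the cost of some throughput.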

Related

How to make a reliable Celery task with a blocking call inside?

I have a Celery (4.4.7) task which makes a blocking call and may block for a long time. I don't want to keep the worker busy for a long time, so I set up soft_time_limit for the task. My hope was to fail the task (if it can't complete quickly) and retry it later.
My issue is that the SoftTimeLimitExceeded exception is not being raised (I suppose because the call blocks at the OS level). As a result the task is killed by the hard time_limit and I don't get a chance to retry it.
from celery import shared_task
from celery.exceptions import SoftTimeLimitExceeded

@shared_task(
    acks_late=True,
    ignore_result=True,
    soft_time_limit=5,
    time_limit=15,
    default_retry_delay=1,
    retry_kwargs={"max_retries": 10},
    retry_backoff=True,
    retry_backoff_max=1200,  # 20 min
    retry_jitter=True,
    autoretry_for=(SoftTimeLimitExceeded,),
)
def my_task():
    blocking_call_taking_long_time()
What I tried:
The hard time limit is impossible to intercept.
I expected acks_late to push my timed-out task back to the queue, but that doesn't happen.
I tried adding reject_on_worker_lost; neither value changes anything for me.
The SoftTimeLimitExceeded exception is 100% not there: neither autoretry_for nor a regular try ... except catches it.
For now I ended up setting an explicit timeout for the blocking operation. This requires adding a parameter everywhere along the call chain.
Is there some other path I'm missing?

Measuring Semaphore wait times with Micrometer

We have a throttling implementation that essentially boils down to:
Semaphore s = new Semaphore(1);
...
void callMethod() throws Exception {
    s.acquire();
    try {
        // expensiveMethod() is a placeholder for the expensive call
        timer.recordCallable(() -> expensiveMethod());
    } finally {
        s.release();
    }
}
I would like to gather metrics about the impact the semaphore has on the overall response time of the method. For example, I would like to know the number of threads that were waiting to acquire it, the time spent waiting, etc. What I am looking for, I guess, is a gauge that also captures timing information?
How do I measure the Semaphore stats?

There are multiple things you can do depending on your needs and situation.
LongTaskTimer is a timer that measures tasks that are currently in progress. The in-progress part is key here, since after the task has finished, you will not see its effect on the timer. That's why it is meant for long-running tasks; I'm not sure it fits your use case.
The other thing that you can do is having a Timer and a Gauge where the timer measures the time it took to acquire the Semaphore while with the gauge, you can increment/decrement the number of threads that are currently waiting on it.

Precise Throughput Timer stuck with simple setup

I have a similar issue to "Synchronizing timer hangs with simple setup", but with the Precise Throughput Timer, which is supposed to replace the Synchronizing Timer:
Certain cases might be solved via Synchronizing Timer, however Precise Throughput Timer has native way to issue requests in packs. This behavior is disabled by default, and it is controlled with "Batched departures" settings
Number of threads in the batch (threads). Specifies the number of samples in a batch. Note the overall number of samples will still be in line with Target Throughput
Delay between threads in the batch (ms). For instance, if set to 42, and the batch size is 3, then threads will depart at x, x+42ms, x+84ms
I set the number of threads to 10, ramp-up to 1 and loop count to 1.
I added only 1 HTTP Request (with a response time of less than 1 second) and, before it, a Test Action with a Precise Throughput Timer as a child with the following setup:
The threads get stuck after 5 threads have succeeded:
EDIT 1
Following @Dimitri T's solution:
I changed Duration to 100 and added the line to the logging configuration, and got 5 errors:
2018-03-12 15:43:42,330 INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -99886 ms and the throughput timer generates a delay of 20004077. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.
EDIT 2
Following @Dimitri T's solution, I set "Loop Count" to -1 and all 10 threads executed, but if I change "Number of threads in the batch" from 2 to 5, only 3 threads execute and then it stops:
INFO o.a.j.t.JMeterThread: Stopping Thread: org.apache.jorphan.util.JMeterStopThreadException: The thread is scheduled to stop in -89233 ms and the throughput timer generates a delay of 19999450. JMeter (as of 4.0) does not support interrupting of sleeping threads, thus terminating the thread manually.

Set "Duration (seconds)" in your Thread Group to something non-zero (i.e. to 100)
Depending on what you're trying to achieve you might also want to set "Loop Count" to -1
You can also add the following line to log4j2.xml file:
<Logger name="org.apache.jmeter.timers" level="debug" />
This way you will be able to see what's going on with your timer(s) in jmeter.log file

How to deal with tasks running too long (comparing to others in job) in yarn-client?

We use a Spark cluster in yarn-client mode to run several business calculations, but sometimes a task runs for too long:
We don't set a timeout, but I think the default timeout for a Spark task should not be as long as it is here (1.7h).
Can anyone give me an idea for working around this issue?

There is no way for Spark to kill its tasks if they are taking too long.
But I figured out a way to handle this using speculation. This means that if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation true
spark.speculation.multiplier 2
spark.speculation.quantile 0
Note: setting spark.speculation.quantile to 0 means speculation will kick in from your first task, so use it with caution. I am using it because some jobs slow down over time due to GC. You should know when to use this; it's not a silver bullet.
Some relevant links: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Spark-always-wait-for-stragglers-to-finish-running-td14298.html and http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAPmMX=rOVQf7JtDu0uwnp1xNYNyz4xPgXYayKex42AZ_9Pvjug@mail.gmail.com%3E
Update
I found a fix for my issue (might not work for everyone). I had a bunch of simulations running per task, so I added timeout around the run. If a simulation is taking longer (due to a data skew for that specific run), it will timeout.
ExecutorService executor = Executors.newCachedThreadPool();
Callable<SimResult> task = () -> simulator.run();
Future<SimResult> future = executor.submit(task);
try {
    result = future.get(1, TimeUnit.MINUTES);
} catch (TimeoutException ex) {
    future.cancel(true);
    SPARKLOG.info("Task timed out");
}
Make sure you handle an interrupt inside the simulator's main loop like:
if (Thread.currentThread().isInterrupted()) {
    throw new InterruptedException();
}

The trick here is to log in directly to the worker node and kill the process. Usually you can find the offending process with a combination of top, ps, and grep. Then just run kill with its pid.

Guidance OnMessageOptions.AutoRenewTimeout

Can someone offer some more guidance on the use of the Azure Service Bus OnMessageOptions.AutoRenewTimeout
http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.onmessageoptions.autorenewtimeout.aspx
as I haven't found much documentation on this option, and would like to know if this is the correct way to renew a message lock
My use case:
1) Message Processing Queue has a Lock Duration of 5 minutes (The maximum allowed)
2) Message Processor using the OnMessageAsync message pump to read from the queue (with a ReceiveMode.PeekLock) The long running processing may take up to 10 minutes to process the message before manually calling msg.CompleteAsync
3) I want the message processor to automatically renew its lock up until the time it's expected to complete processing (~10 minutes). If it hasn't completed after that period, the lock should be automatically released.
Thanks
-- UPDATE
I never did end up getting any more guidance on AutoRenewTimeout. I ended up using a custom MessageLock class that auto renews the Message Lock based on a timer.
See the gist -
https://gist.github.com/Soopster/dd0fbd754a65fc5edfa9

To handle long message processing you should set AutoRenewTimeout to 10 minutes (in your case). That means the lock will be renewed during those 10 minutes each time LockDuration expires.
So if, for example, your LockDuration is 3 minutes and AutoRenewTimeout is 10 minutes, the lock will be automatically renewed every 3 minutes (after 3, 6 and 9 minutes) and automatically released 12 minutes after the message was consumed.
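The arithmetic in this example can be checked with a small model. This is only a sketch of the schedule as this answer describes it, not Service Bus code. `renewalSchedule` is a hypothetical function, and it assumes that a renewal is triggered only while the expiry moment falls strictly inside the auto-renew window.

```javascript
// Hypothetical model of the schedule described in the answer (times in minutes).
function renewalSchedule(lockDurationMin, autoRenewTimeoutMin) {
  const renewals = [];
  // Assumption: the lock is renewed each time it expires, as long as that
  // moment is still strictly inside the auto-renew window.
  for (let t = lockDurationMin; t < autoRenewTimeoutMin; t += lockDurationMin) {
    renewals.push(t);
  }
  // The last lock taken (or the initial one) runs for a full lock duration.
  const lastLockStart = renewals.length > 0 ? renewals[renewals.length - 1] : 0;
  return { renewals, releasedAt: lastLockStart + lockDurationMin };
}

// LockDuration = 3, AutoRenewTimeout = 10:
// renewals at 3, 6 and 9 minutes, lock released at 12 minutes.
console.log(renewalSchedule(3, 10));
```

Under this model the lock can outlive AutoRenewTimeout by up to one LockDuration, which is where the 12 minutes in the example come from.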

For my taste, OnMessageOptions.AutoRenewTimeout is a bit too rough as a lease renewal option. If one sets it to 10 minutes and for whatever reason the Message is .Complete()d only after 10 minutes and 5 seconds, the Message will show up again in the Message Queue, will be consumed by the next stand-by Worker, and the entire processing will execute again. That is wasteful and also keeps the Workers from executing other unprocessed Requests.
To work around this:
Change your Worker process to verify whether the item it just received from the Message Queue has already been processed. Look for a Success/Failure result that is stored somewhere. If it has already been processed, call BrokeredMessage.Complete() and move on to wait for the next item to pop up.
Call BrokeredMessage.RenewLock() periodically, BEFORE the lock expires (for example every 10 seconds), and set OnMessageOptions.AutoRenewTimeout to TimeSpan.Zero. That way, if the Worker that processes an item crashes, the Message will return to the Message Queue sooner and will be picked up by the next stand-by Worker.

I have the very same problem with my workers. Even though the message is being processed successfully, the long processing time causes Service Bus to remove the lock applied to it, and the message becomes available for receiving again. Another available worker takes this message and starts processing it again. Please correct me if I'm wrong, but in your case OnMessageAsync will be called many times with the same message and you will end up with several tasks processing it simultaneously. At the end of the process a MessageLockLost exception will be thrown because the message no longer has a lock applied.
I solved this with the following code.
_requestQueueClient.OnMessage(
    requestMessage =>
    {
        RenewMessageLock(requestMessage);
        // System.Timers.Timer takes its interval in milliseconds
        var messageLockTimer = new System.Timers.Timer(TimeSpan.FromSeconds(290).TotalMilliseconds);
        messageLockTimer.Elapsed += (source, e) =>
        {
            RenewMessageLock(requestMessage);
        };
        messageLockTimer.AutoReset = false; // default is true
        messageLockTimer.Start();
        /* ----- handle requestMessage ----- */
        requestMessage.Complete();
        messageLockTimer.Stop();
    });

private void RenewMessageLock(BrokeredMessage requestMessage)
{
    try
    {
        requestMessage.RenewLock();
    }
    catch (Exception exception)
    {
        // renewal failures are deliberately ignored here
    }
}
It has been a few months since your post and maybe you have solved this already; if so, could you share your solution?
