Thread "WebContainer : 0" (00000029) has been active for 647279 milliseconds and may be hung - websphere-7

I am getting the following exception in WebSphere while trying to generate an Excel report using Jasper.
ThreadMonitor W WSVR0605W: Thread "WebContainer : 0" (00000029) has been active for 647279 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
I tried increasing the default thread pool size from 20 to 50 and also increased the WebContainer pool size from 50 to 150. Neither worked.
Is there any solution for this?

Your error message says that the thread named "WebContainer : 0" has been doing something for 647 seconds (roughly 10.8 minutes), and that one thread in the JVM may be hung (that is, active for longer than the threshold time).
A hung thread is a thread in a Java EE application that is blocked by a call, or is waiting on a monitor for a locked object to be released so that the thread can use it. A hung thread can result from a simple software defect (such as an infinite loop) or a more complex cause (for example, a resource deadlock). CPU utilization of that server's Java process may become high, and the server may become unresponsive, if the number of such threads grows.
There are ways to increase the hang-detection threshold (default 10 minutes), but that is not recommended. You have to optimize or limit your Excel report generation process. Most likely the report you are extracting is very large or hits the database heavily during generation; essentially it comes down to a code or design defect. Increasing the thread pool size has nothing to do with this issue.
In SystemOut.log, the lines below the posted warning message should give you some idea of which Java EE class is causing the issue. The FFDC logs also help to figure this out.
The hang detection option is turned on by default. You can configure it to accommodate the environment, so that potential hangs can be reported. When a hung thread is detected, WAS notifies you so that you can troubleshoot the problem.
Using the hang detection policy, you can specify a time that is too long for a unit of work to complete. The thread monitor checks all managed threads including Web container threads.
From the administrative console, click Servers > Application Servers > server_name
Under Server Infrastructure, click Administration > Custom Properties, and then click New.
Add the following properties:
Name: com.ibm.websphere.threadmonitor.interval
Value: The frequency (in seconds) at which managed threads in the selected application server will be interrogated.
Default: 180 seconds (three minutes).
Name: com.ibm.websphere.threadmonitor.dump.java
Value: Set to true to execute the dumpThreads function when a hung thread is detected and a WSVR0605W message is printed. The threads section of the javacore dump can be analyzed to determine what the reported thread is doing and why it is taking so long.
Default: false (0).
Name: com.ibm.websphere.threadmonitor.threshold
Value: The length of time (in seconds) in which a thread can be active before it is considered hung. Any thread that is detected as active for longer than this length of time is reported as hung.
Default: The default value is 600 seconds (ten minutes).
Name: com.ibm.websphere.threadmonitor.false.alarm.threshold
Value: The number of times (T) that false alarms can occur before the threshold is automatically increased. It is possible that a thread reported as hung eventually completes its work, resulting in a false alarm. A large number of these events indicates that the threshold value is too small. The hang detection facility can automatically respond to this situation: for every T false alarms, the threshold is increased by a factor of 1.5. Set the value to zero (or less) to disable the automatic adjustment. (See "Detecting hung threads in J2EE applications" for information on false alarms.)
Default: 100
To disable the hang detection option, set the com.ibm.websphere.threadmonitor.interval property to less than or equal to zero.
Save the changes and restart the WAS instance.

This is not an exception. As you can see, it has W level, which is a warning (ThreadMonitor W WSVR0605W). It just informs you that your report generation is taking extremely long, longer than 647279 milliseconds. When you look further down the log, you will most likely notice a similar message telling you that the thread has finished its work and is no longer considered hung.
Increasing the WebContainer pool has nothing to do with that. You have to look at the class creating your report and see if you can make it quicker.
Another solution would be to handle the report creation as an asynchronous task using either Async Beans or JMS (you are unfortunately on v7, so EJB 3.1 and async servlets are not available), store the result in a database or on the file system, and retrieve it later with a separate request once the report is done.

In my team we found that WebSphere 8.5.5.13 never shuts down hung threads, so we came up with a solution for a transactional system that processes a high number of concurrent threads: instead of hanging forever, the call times out after a configured interval, using Java 8.
The code below shows a demo of how to write a Function that executes arbitrary critical code with a timeout.
import java.util.Date;
import java.util.Optional;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.LongStream;

public class MyFactory
{
    TimeUnit timeUnit = TimeUnit.MILLISECONDS;
    int REQUEST_TIMEOUT = 50;   // per-call timeout
    int MAX_THREADS = 1;
    ExecutorService executor = Executors.newFixedThreadPool(MAX_THREADS);

    // Runs an arbitrary Supplier on its own single-thread executor and gives up
    // (propagating a TimeoutException wrapped in ExecutionException/RuntimeException)
    // if it does not finish within REQUEST_TIMEOUT.
    public Function<Supplier<?>, ?> executeWithTimeout = typeSupplier -> {
        final ExecutorService singleCallableExec = Executors.newSingleThreadExecutor();
        try
        {
            return executor.submit(() -> {
                try
                {
                    return singleCallableExec.submit(() -> {
                        return typeSupplier.get();
                    }).get(REQUEST_TIMEOUT, timeUnit);
                }
                catch (Exception t2)
                {
                    singleCallableExec.shutdownNow(); // interrupt the still-running work
                    throw t2;
                }
                finally
                {
                    Optional.of(singleCallableExec)
                            .filter(e -> !e.isShutdown())
                            .ifPresent(ExecutorService::shutdown);
                }
            }).get();
        }
        catch (InterruptedException | ExecutionException e)
        {
            throw new RuntimeException(e);
        }
    };
This can be put into a factory class and called by every thread that needs to execute critical code that might otherwise hang forever. It is useful for raising a timeout exception that triggers an alert, instead of waiting for the system to run out of resources:
    public static void main(String args[])
    {
        MyFactory mf = new MyFactory();
        try
        {
            long result = (long) mf.executeWithTimeout.apply(() -> mf.longCalculation());
            System.out.println("Result:" + result);
        }
        catch (Exception te)
        {
            // The timeout arrives wrapped: RuntimeException -> ExecutionException -> TimeoutException
            if (te instanceof RuntimeException
                    && te.getCause() != null && te.getCause() instanceof ExecutionException
                    && te.getCause().getCause() != null
                    && te.getCause().getCause() instanceof TimeoutException)
            {
                System.out.println(new Date() + " Raising alarm, TimeoutException");
            }
        }
        finally
        {
            mf.executor.shutdownNow();
        }
    }

    public long longCalculation()
    {
        // deliberately slow: scans 2^32 sequential longs
        long result = LongStream.range(0, (long) Math.pow(2, 32)).max().getAsLong();
        System.out.println(new Date() + " Calculation result " + result);
        return result;
    }
}

Related

Appropriate solution for long running computations in Azure App Service and .NET Core 3.1?

What is an appropriate solution for long-running computations in Azure App Service and .NET Core 3.1, in an application that has no need for a database and no I/O to anything outside of the application? It is a computation task.
Specifically, the following is unreliable and needs a solution.
[Route("service")]
[HttpPost]
public Outbound Post(Inbound inbound)
{
Debug.Assert(inbound.Message.Equals("Hello server."));
Outbound outbound = new Outbound();
long Billion = 1000000000;
for (long i = 0; i < 33 * Billion; i++) // 230 seconds
;
outbound.Message = String.Format("The server processed inbound object.");
return outbound;
}
This sometimes returns a null object to the HttpClient (not shown). A smaller workload will always succeed; for example, 3 billion iterations always succeeds. A bigger number would be nice; specifically, 240 billion is a requirement.
I think in the year 2020 a reasonable goal in Azure App Service with .NET Core might be to have a parent thread count to 240 billion with the help of 8 child threads, so each child counts to 30 billion, and the parent divides an 8 MB inbound object into smaller objects inbound to each child. Each child receives a 1 MB inbound object and returns a 1 MB outbound object to the parent. The parent reassembles the result into an 8 MB outbound object.
Obviously the elapsed time would be about 12.5% (one-eighth) of what a single-threaded implementation would need. The time to cut up and reassemble objects is small compared to the computation time. I am assuming the time to transmit the objects is also very small compared to the computation time, so the 12.5% expectation is roughly accurate.
If I can get 4 or 8 cores, that would be good. If I can get threads that give me, say, 50% of the cycles of a core, then I would need maybe 8 or 16 threads. If each thread gives me 33% of the cycles of a core, then I would need 12 or 24 threads.
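Roughly, the fan-out I have in mind would look something like the sketch below (the per-iteration body and the worker count are placeholders):
using System;
using System.Threading.Tasks;
public static class FanOut
{
    public static void Count240Billion()
    {
        long total = 240L * 1_000_000_000L;
        int workers = 8;                    // child threads
        long chunk = total / workers;       // 30 billion each
        Parallel.For(0, workers, w =>
        {
            long start = w * chunk;
            long end = (w == workers - 1) ? total : start + chunk;
            for (long i = start; i < end; i++)
            {
                // real per-iteration computation goes here
            }
        });
    }
}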
I am considering the BackgroundService class but I am looking for confirmation that this is the correct approach. Microsoft says...
BackgroundService is a base class for implementing a long running IHostedService.
Obviously if something is long running it would be better to make it finish sooner by using multiple cores via System.Threading, but this documentation seems to mention System.Threading only in the context of starting tasks via System.Threading.Timer. My example code shows there is no timer needed in my application; an HTTP POST will serve as the occasion to do work. Typically I would use System.Threading.Thread to instantiate multiple objects to use multiple cores. I find the absence of any mention of multiple cores to be a glaring omission in the context of a solution for work that takes a long time, but maybe there is some reason Azure App Service doesn't deal with this matter. Perhaps I am just not able to find it in tutorials and documentation.
The initiation of the task is the illustrated HTTP POST controller. Suppose the longest job takes 10 minutes. The HTTP client (not shown) sets the timeout limit to 1000 seconds, which is much more than 10 minutes (600 seconds), in order to have a margin of safety. HttpClient.Timeout is the relevant property. For the moment I am presuming the HTTP timeout is a real limit, rather than some sort of non-binding (fake) limit such that some other constraint results in the user waiting 9 minutes and receiving an error message. A real, binding limit is a limit for which I can say "but for this timeout it would have succeeded". If the HTTP timeout is not the real binding limit and there is something else constraining the system, I can adjust my HTTP controller to instead have three (3) POST methods: POST1 would mean start a task with the inbound object, POST2 means tell me if it is finished, and POST3 means give me the outbound object.
Prologue
A few years ago I ran into a pretty similar problem. We needed a service that could process large amounts of data. Sometimes the processing would take 10 seconds, other times it could take an hour.
At first we did it the way your question illustrates: send a request to the service, the service processes the data from the request, and it returns the response when finished.
Issues At Hand
This was fine when the job took around a minute or less, but for anything above that, the server would shut down the session and the caller would get an error.
Servers have a default of around 2 minutes to produce a response before giving up on the request. The server doesn't quit processing the request... but it does quit the HTTP session. It doesn't matter what parameters you set on your HttpClient; the server is the one that dictates how long is too long.
Reasons For Issues
All this is for good reasons. Server sockets are extremely expensive. You have a finite amount to go around. The server is trying to protect your service by severing requests that are taking longer than a specified time in order to avoid socket starvation issues.
Typically you want your HTTP requests to take only a few milliseconds. If they are taking longer than this, you will eventually run in to socket issues if your service has to fulfil other requests at a high rate.
Solution
We decided to go the route of IHostedService, specifically the BackgroundService. We use this service in conjunction with a queue. This way you can set up a queue of jobs and the BackgroundService will process them one at a time (in some instances we have the service processing multiple queue items at once; in others we scaled horizontally, producing two or more queues).
Why an ASP.NET Core service running a BackgroundService? I wanted to handle this without tightly coupling to any Azure-specific constructs, in case we needed to move out of Azure to some other cloud service (back in the day we were contemplating this for other reasons we had at the time).
This has worked out quite well for us and we haven't seen any issues since.
The work flow goes like this:
Caller sends a request to the service with some parameters
Service generates a "job" object and returns an ID immediately via a 202 (Accepted) response
Service places this job in to a queue that is being maintained by a BackgroundService
Caller can query the job status and get information about how much has been done and how much is left to go using this job ID
Service finishes the job, puts the job in to a "completed" state and goes back to waiting on the queue to produce more jobs
Keep in mind your service has the capability to scale horizontally where there would be more than one instance running. In this case I am using Redis Cache to store the state of the jobs so that all instances share the same state.
I also added in a "Memory Cache" option to test things locally if you don't have a Redis Cache available. You could run the "Memory Cache" service on a server, just know that if it scales then your data will be inconsistent.
Example
Since I'm married with kids, I really don't do much on Friday nights after everyone goes to bed, so I spent some time putting together an example that you can try out. The full solution is also available for you to try out.
QueuedBackgroundService.cs
This class implementation serves two specific purposes: One is to read from the queue (the BackgroundService implementation), the other is to write to the queue (the IQueuedBackgroundService implementation).
public interface IQueuedBackgroundService
{
Task<JobCreatedModel> PostWorkItemAsync(JobParametersModel jobParameters);
}
public sealed class QueuedBackgroundService : BackgroundService, IQueuedBackgroundService
{
private sealed class JobQueueItem
{
public string JobId { get; set; }
public JobParametersModel JobParameters { get; set; }
}
private readonly IComputationWorkService _workService;
private readonly IComputationJobStatusService _jobStatusService;
// Shared between BackgroundService and IQueuedBackgroundService.
// The queueing mechanism could be moved out to a singleton service. I am doing
// it this way for simplicity's sake.
private static readonly ConcurrentQueue<JobQueueItem> _queue =
new ConcurrentQueue<JobQueueItem>();
private static readonly SemaphoreSlim _signal = new SemaphoreSlim(0);
public QueuedBackgroundService(IComputationWorkService workService,
IComputationJobStatusService jobStatusService)
{
_workService = workService;
_jobStatusService = jobStatusService;
}
/// <summary>
/// Transient method via IQueuedBackgroundService
/// </summary>
public async Task<JobCreatedModel> PostWorkItemAsync(JobParametersModel jobParameters)
{
var jobId = await _jobStatusService.CreateJobAsync(jobParameters).ConfigureAwait(false);
_queue.Enqueue(new JobQueueItem { JobId = jobId, JobParameters = jobParameters });
_signal.Release(); // signal for background service to start working on the job
return new JobCreatedModel { JobId = jobId, QueuePosition = _queue.Count };
}
/// <summary>
/// Long running task via BackgroundService
/// </summary>
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
while(!stoppingToken.IsCancellationRequested)
{
JobQueueItem jobQueueItem = null;
try
{
// wait for the queue to signal there is something that needs to be done
await _signal.WaitAsync(stoppingToken).ConfigureAwait(false);
// dequeue the item
jobQueueItem = _queue.TryDequeue(out var workItem) ? workItem : null;
if(jobQueueItem != null)
{
// put the job in to a "processing" state
await _jobStatusService.UpdateJobStatusAsync(
jobQueueItem.JobId, JobStatus.Processing).ConfigureAwait(false);
// the heavy lifting is done here...
var result = await _workService.DoWorkAsync(
jobQueueItem.JobId, jobQueueItem.JobParameters,
stoppingToken).ConfigureAwait(false);
// store the result of the work and set the status to "finished"
await _jobStatusService.StoreJobResultAsync(
jobQueueItem.JobId, result, JobStatus.Success).ConfigureAwait(false);
}
}
catch(TaskCanceledException)
{
break;
}
catch(Exception ex)
{
try
{
// something went wrong. Put the job in to an errored state and continue on
await _jobStatusService.StoreJobResultAsync(jobQueueItem.JobId, new JobResultModel
{
Exception = new JobExceptionModel(ex)
}, JobStatus.Errored).ConfigureAwait(false);
}
catch(Exception)
{
// TODO: log this
}
}
}
}
}
It is registered like so:
services.AddHostedService<QueuedBackgroundService>();
services.AddTransient<IQueuedBackgroundService, QueuedBackgroundService>();
ComputationController.cs
The controller used to read/write jobs looks like this:
[ApiController, Route("api/[controller]")]
public class ComputationController : ControllerBase
{
private readonly IQueuedBackgroundService _queuedBackgroundService;
private readonly IComputationJobStatusService _computationJobStatusService;
public ComputationController(
IQueuedBackgroundService queuedBackgroundService,
IComputationJobStatusService computationJobStatusService)
{
_queuedBackgroundService = queuedBackgroundService;
_computationJobStatusService = computationJobStatusService;
}
[HttpPost, Route("beginComputation")]
[ProducesResponseType(StatusCodes.Status202Accepted, Type = typeof(JobCreatedModel))]
public async Task<IActionResult> BeginComputation([FromBody] JobParametersModel obj)
{
return Accepted(
await _queuedBackgroundService.PostWorkItemAsync(obj).ConfigureAwait(false));
}
[HttpGet, Route("computationStatus/{jobId}")]
[ProducesResponseType(StatusCodes.Status200OK, Type = typeof(JobModel))]
[ProducesResponseType(StatusCodes.Status404NotFound, Type = typeof(string))]
public async Task<IActionResult> GetComputationResultAsync(string jobId)
{
var job = await _computationJobStatusService.GetJobAsync(jobId).ConfigureAwait(false);
if(job != null)
{
return Ok(job);
}
return NotFound($"Job with ID `{jobId}` not found");
}
[HttpGet, Route("getAllJobs")]
[ProducesResponseType(StatusCodes.Status200OK,
Type = typeof(IReadOnlyDictionary<string, JobModel>))]
public async Task<IActionResult> GetAllJobsAsync()
{
return Ok(await _computationJobStatusService.GetAllJobsAsync().ConfigureAwait(false));
}
[HttpDelete, Route("clearAllJobs")]
[ProducesResponseType(StatusCodes.Status200OK)]
[ProducesResponseType(StatusCodes.Status401Unauthorized)]
public async Task<IActionResult> ClearAllJobsAsync([FromQuery] string permission)
{
if(permission == "this is flakey security so this can be run as a public demo")
{
await _computationJobStatusService.ClearAllJobsAsync().ConfigureAwait(false);
return Ok();
}
return Unauthorized();
}
}
Working Example
For as long as this question is active, I will maintain a working example you can try out. For this specific example, you can specify how many iterations you would like to run. To simulate long-running work, each iteration is 1 second. So, if you set the iteration value to 60, it will run that job for 60 seconds.
While it's running, run the computationStatus/{jobId} or getAllJobs endpoint. You can watch all the jobs update in real time.
This example is far from a fully-functioning-covering-all-edge-cases-full-blown-ready-for-production example, but it's a good start.
Conclusion
After a few years of working on the back end, I have seen a lot of issues arise from not knowing all the "rules" of the back end. Hopefully this answer sheds some light on issues I had in the past, and hopefully it saves you from having to deal with those problems.
One option could be to try out Azure Durable Functions, which are more oriented to long-running jobs that warrant checkpoints and state as against attempting to finish within the context of the triggering request. It also has the concept of fan-out/fan-in, in case what you're describing could be divided into smaller jobs with an aggregated result.
If just raw compute is the goal, Azure Batch might be a better option since it facilitates that scaling.
I assume the actual work that needs to be done is something other than iterating over a loop doing nothing, so in terms of possible parallelization I can't offer much help right now. Is the work CPU-intensive or IO-related?
When it comes to long-running work in an Azure App Service, one of the options is to use a WebJob. A possible solution would be to post the request for computation to a queue (a Storage Queue or an Azure Service Bus Queue). The WebJob then processes those messages and possibly puts a new message on another queue that the requester can use to handle the results.
If the time needed for processing is guaranteed to be less than 10 minutes, you could replace the WebJob with a queue-triggered Azure Function. It is a serverless offering on Azure with great scaling possibilities.
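For illustration, a minimal sketch of such a queue-triggered function using the in-process Microsoft.Azure.WebJobs attributes; the queue names, the output binding, and the work inside the method are placeholders, not part of the original question:
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
public static class ComputationFunction
{
    [FunctionName("ProcessComputation")]
    public static void Run(
        [QueueTrigger("computation-requests")] string requestMessage,
        ILogger log,
        [Queue("computation-results")] out string resultMessage)
    {
        log.LogInformation($"Processing: {requestMessage}");
        // ... the long-running computation goes here (it must finish within the function timeout) ...
        resultMessage = "done: " + requestMessage; // written to the results queue when the method returns
    }
}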
Another option is indeed using a worker service or an instance of an IHostedService and doing some queue processing there.
Since you're saying that your computation succeeds at fewer iterations, a simple solution is to save your results periodically and resume the computation.
For example, say you need to perform 240 billion iterations, and you know that the highest number of iterations you can perform reliably is 3 billion. I would set up the following:
A slave, which actually performs the task (240 billion iterations)
A master, which periodically receives input from the slave about progress
The slave can periodically send a message to the master (say, once every 2 billion iterations?). This message could contain whatever is relevant to resume the computation should the computation be interrupted.
The master should keep track of the slave. If the master determines that the slave has died / crashed / whatever, the master should simply create a new slave, which should resume computation from the last reported position.
How exactly you implement the master and slave is a matter of your personal preference.
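As a rough illustration of the checkpoint/resume idea (the file path, interval, and per-iteration work are placeholders):
using System;
using System.IO;
public static class CheckpointedLoop
{
    const long Total = 240_000_000_000;          // the full job
    const long CheckpointEvery = 2_000_000_000;  // report progress every 2 billion iterations
    const string CheckpointFile = "progress.txt";
    public static void Run()
    {
        // resume from the last reported position, or start from zero
        long start = File.Exists(CheckpointFile) ? long.Parse(File.ReadAllText(CheckpointFile)) : 0;
        for (long i = start; i < Total; i++)
        {
            // ... one unit of real work goes here ...
            if (i % CheckpointEvery == 0)
                File.WriteAllText(CheckpointFile, i.ToString()); // the "master" records this position
        }
    }
}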
Rather than have a single loop perform 240 billion iterations, if you can split your computation across nodes, I would try to simultaneously compute the solution in parallel across as many nodes as possible.
I personally use node.js for multicore projects. Although you are using asp.net, I include this example of node.js to illustrate the architecture that works for me.
Node.js on multi-core machines
https://dzone.com/articles/multicore-programming-in-nodejs
As Noah Stahl has mentioned in his answer, Azure Durable Functions and Azure Batch seem like options to help you achieve your goal on your platform. Please see his answer for more details.
The standard answer is to use asynchronous messaging. I have a blog series on the topic. This is particularly the case since you're already in Azure.
You already have an Azure web app service, but now you want to run code outside of a request - "request-extrinsic code". The proper way to run that code is in a separate process - Azure Functions or Azure WebJobs are a good match for Azure webapps.
First, you want a durable queue. Azure Storage Queues are a good fit since you're in Azure anyway. Then your webapi can just write a message into the queue and return. The important part here is that this is a durable queue, not an in-memory queue.
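A minimal sketch of that enqueue-and-return step, assuming the Azure.Storage.Queues package (the queue name, connection-string setting, and JobRequest type are illustrative):
using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Microsoft.AspNetCore.Mvc;
[ApiController, Route("api/[controller]")]
public class WorkController : ControllerBase
{
    private readonly QueueClient _queue =
        new QueueClient(Environment.GetEnvironmentVariable("StorageConnectionString"), "work-requests");
    [HttpPost("begin")]
    public async Task<IActionResult> Begin([FromBody] JobRequest request)
    {
        var jobId = Guid.NewGuid().ToString("N");
        await _queue.CreateIfNotExistsAsync();
        await _queue.SendMessageAsync(JsonSerializer.Serialize(new { jobId, request }));
        return Accepted(new { jobId }); // the caller polls a status endpoint for the result later
    }
}
public class JobRequest { public string Payload { get; set; } }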
Meanwhile, the Azure Function / WebJob is processing that queue. It will pick up the work from the queue and execute it.
The final piece of the puzzle is the completion notification. This is a pretty common approach:
I can adjust my HTTP controller to instead have three (3) POST methods. Thus POST1 would mean start a task with the inbound object. POST2 means tell me if it is finished. POST3 means give me the outbound object.
To do this, your background processor should save the "in-progress" / "complete/result" state somewhere the webapi process can access it. If you already have a shared database (and it makes sense to keep results there), then this may be the easiest choice. I would also consider using Azure Cosmos DB, which has a nice time-to-live setting, so the background service can insert results that are "good for 24 hours" or whatever, after which they're automatically cleaned up.
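If Cosmos DB is used for that state, here is a hedged sketch of the time-to-live setup with the Microsoft.Azure.Cosmos SDK (the database/container names and the 24-hour window are illustrative):
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
public static class JobResultStore
{
    public static async Task<Container> CreateResultsContainerAsync(string connectionString)
    {
        var client = new CosmosClient(connectionString);
        Database db = await client.CreateDatabaseIfNotExistsAsync("jobs");
        var props = new ContainerProperties("results", "/jobId")
        {
            // items are automatically cleaned up 24 hours after they are written
            DefaultTimeToLive = (int)TimeSpan.FromHours(24).TotalSeconds
        };
        return await db.CreateContainerIfNotExistsAsync(props);
    }
}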

TaskCompletionSource intermittently does not complete with NServiceBus and WCF

I have an unusual issue with TaskCompletionSource that has me baffled. I have a TaskCompletionSource waiting for the task to complete once I call TrySetResult. I call TrySetResult in three places in the code: from a WCF thread, immediately, to return a value to an APM WCF BeginXXX/EndXXX pair; from another WCF thread to return immediately to the APM caller; and lastly from an NServiceBus handler thread.
I started with the ubiquitous ToAPM provided by MS-PL. http://blogs.msdn.com/b/pfxteam/archive/2011/06/27/using-tasks-to-implement-the-apm-pattern.aspx
I noticed that the two WCF-based threads worked 100% of the time. In 100 hours of hard testing, plus extensive unit tests, I have never experienced a single failure to return a completed task to the AsyncCallback.
From the MS-provided ToAPM code, the code uses a ContinueWith on the completed task to call the AsyncCallback in a scheduled task.
The problem I have not solved is the NServiceBus threads calling TrySetResult on the TaskCompletionSource object. I find periods of outage where, for an undefined amount of time, the call simply fails. I set breakpoints in the code both at the call and inside the ContinueWith code. I always hit the breakpoint on TrySetResult, but only sometimes the one inside the ContinueWith code.
The following information hopefully will shed some light on the matter.
I use a CancellationTokenSource with a timeout, and its callback sets a result by calling TrySetResult on the TaskCompletionSource object. When the call above does not move the task to completed, this timeout code fires. The timeout code has never failed; it succeeds 100% of the time.
What is interesting is this: in the same code that calls TrySetResult from the NServiceBus thread, when it works, calling the cancellation object's Cancel works just as well as calling TrySetResult on the TaskCompletionSource object.
When one fails, they both fail.
Then, after an indiscriminate period of time, it works again.
This is a WCF server in production and QA environments, and each displays identical results.
What is most weird is the following: for one WCF connection the NServiceBus thread succeeds while another fails at the same time. Then at times both work, and then both fail. Again, all at the same time.
I have tried a number of things to work around the issue to no avail:
I wrapped the call to TrySetResult in a TaskCompletionSource + ContinueWith -- fail
I wrapped the call in a Task.Factory.StartNew -- fail
I call it directly -- fail
I really do not know what else to try.
I put in checks to ensure that the TaskCompletionSource obj is not completed, and during the outage it is not.
I put in checks to ensure the CancellationTokenSource object is not cancelled or has a cancellation pending during the outage, it does not.
I examined the objects in the debugger and they seem good.
They just do not work sometimes.
Could there be an inconsistency in the NServiceBus threads that sometimes prevents the calls from working?
Is there some thread marshaling I can try?
I searched everywhere and I have not see one mention of this problem. Is it unique?
I am totally baffled and need some ideas.
Remove the call from the NServiceBus thread's execution. Isolate the call to TrySetResult onto another thread, for example via QueueUserWorkItem or by spinning up your own thread. Since execution of the continuations resumes on the completing thread, you may need some additional threads to handle the throughput. Either spin up multiple dedicated threads or use the thread pool. I tested calling TrySetResult on dedicated threads and it works.
Here is code to demonstrate a single dedicated thread:
// Note: this snippet assumes surrounding members from the answer's context: ClientThread is a
// Thread field, HasSomething is a wait handle (e.g. an AutoResetEvent) signalled when work is
// enqueued, Trigger is a Queue<WaitingAsyncData> guarded by the qlocker object, and
// WaitingAsyncData wraps the TaskCompletionSource being completed.
public static void Spin()
{
ClientThread = new Thread(new ThreadStart(() =>
{
while (true)
{
try
{
if (!HasSomething.WaitOne(1000, false))
continue;
while (true)
{
WaitingAsyncData entry = null;
lock (qlocker)
{
if (!Trigger.Any())
break;
entry = Trigger.Dequeue();
}
if (entry == null)
break;
entry.TrySetResult("string");
}
}
catch
{
}
}
}));
ClientThread.IsBackground = true;
ClientThread.Start();
}
Here is the ThreadPool example code:
ThreadPool.QueueUserWorkItem(delegate
{
entry.TrySetResult("string");
});
Using the ThreadPool rather than a static thread provides greater flexibility and scalability.

Guidance OnMessageOptions.AutoRenewTimeout

Can someone offer some more guidance on the use of the Azure Service Bus OnMessageOptions.AutoRenewTimeout
http://msdn.microsoft.com/en-us/library/microsoft.servicebus.messaging.onmessageoptions.autorenewtimeout.aspx
as I haven't found much documentation on this option, and I would like to know if this is the correct way to renew a message lock.
My use case:
1) The message processing queue has a Lock Duration of 5 minutes (the maximum allowed).
2) The message processor uses the OnMessageAsync message pump to read from the queue (with ReceiveMode.PeekLock). The long-running processing may take up to 10 minutes to process the message before manually calling msg.CompleteAsync.
3) I want the message processor to automatically renew its lock up until the time it is expected to complete processing (~10 minutes). If after that period it hasn't completed, the lock should be automatically released.
Thanks
-- UPDATE
I never did end up getting any more guidance on AutoRenewTimeout. I ended up using a custom MessageLock class that auto renews the Message Lock based on a timer.
See the gist -
https://gist.github.com/Soopster/dd0fbd754a65fc5edfa9
To handle long message processing you should set AutoRenewTimeout to 10 minutes (in your case). That means the lock will be renewed during those 10 minutes, each time the LockDuration expires.
So if, for example, your LockDuration is 3 minutes and AutoRenewTimeout is 10 minutes, then the lock will be renewed automatically every 3 minutes (after 3, 6 and 9 minutes), and the lock will be released automatically 12 minutes after the message was consumed.
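For reference, a minimal sketch of wiring this up with the older Microsoft.ServiceBus.Messaging client the question refers to (the queueClient variable and the processing method are placeholders):
var options = new OnMessageOptions
{
    AutoComplete = false,                        // we call CompleteAsync ourselves
    AutoRenewTimeout = TimeSpan.FromMinutes(10), // keep renewing the PeekLock for up to ~10 minutes
    MaxConcurrentCalls = 1
};
queueClient.OnMessageAsync(async message =>
{
    await ProcessLongRunningWorkAsync(message);  // may exceed the 5-minute LockDuration
    await message.CompleteAsync();               // completes while the (renewed) lock is still held
}, options);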
For my personal taste, OnMessageOptions.AutoRenewTimeout is a bit too rough a lease-renewal option. If one sets it to 10 minutes and, for whatever reason, the Message is completed only after 10 minutes and 5 seconds, the Message will show up again in the Message Queue, will be consumed by the next stand-by Worker, and the entire processing will execute again. That is wasteful and also keeps the Workers from executing other unprocessed Requests.
To work around this:
Change your Worker process to verify whether the item it just received from the Message Queue has already been processed. Look for a Success/Failure result that is stored somewhere. If it has already been processed, call BrokeredMessage.Complete() and move on to wait for the next item to pop up.
Call BrokeredMessage.RenewLock() periodically - BEFORE the lock expires, for example every 10 seconds - and set OnMessageOptions.AutoRenewTimeout to TimeSpan.Zero. Thus if the Worker that processes an item crashes, the Message will return to the MessageQueue sooner and will be picked up by the next stand-by Worker.
I have the very same problem with my workers. Even though the message was being processed successfully, because of the long processing time Service Bus removed the lock applied to it and the message became available for receiving again. Another available worker then takes this message and starts processing it again. Please correct me if I'm wrong, but in your case OnMessageAsync will be called many times with the same message and you will end up with several tasks processing it simultaneously. At the end of the process a MessageLockLostException will be thrown because the message no longer has a lock applied.
I solved this with the following code.
_requestQueueClient.OnMessage(
requestMessage =>
{
RenewMessageLock(requestMessage);
var messageLockTimer = new System.Timers.Timer(TimeSpan.FromSeconds(290).TotalMilliseconds);
messageLockTimer.Elapsed += (source, e) =>
{
RenewMessageLock(requestMessage);
};
messageLockTimer.AutoReset = false; // by default it is true
messageLockTimer.Start();
/* ----- handle requestMessage ----- */
requestMessage.Complete();
messageLockTimer.Stop();
});
private void RenewMessageLock(BrokeredMessage requestMessage)
{
try
{
requestMessage.RenewLock();
}
catch (Exception exception)
{
    // ignore renewal failures (for example, the lock has already been lost)
}
}
It has been a few months since your post and maybe you have already solved this; if so, could you share your solution?

StorageClientException: The specified message does not exist?

I have a simple video-encoding worker role that pulls messages from a queue, encodes a video, then uploads the video to storage. Everything seems to be working, but occasionally, when deleting the message after I am done encoding and uploading, I get a "StorageClientException: The specified message does not exist." Although the video is processed, I believe the message is reappearing in the queue because it is not being deleted correctly. I have the message visibility set to 5 minutes, and none of the videos have taken more than 2 minutes to process.
Is it possible that another instance of the Worker role is processing and deleting the message?
Doesn't GetMessage() prevent other worker roles from picking up the same message?
Am I doing something wrong in the setup of my queue?
What could be causing this message to not be found on delete?
some code...
//onStart() queue setup
var queueStorage = _storageAccount.CreateCloudQueueClient();
_queue = queueStorage.GetQueueReference(QueueReference);
queueStorage.RetryPolicy = RetryPolicies.Retry(5, new TimeSpan(0, 5, 0));
_queue.CreateIfNotExist();
public override void Run()
{
while (true)
{
try
{
var msg = _queue.GetMessage(new TimeSpan(0, 5, 0));
if (msg != null)
{
EncodeIt(msg);
PostIt(msg);
_queue.DeleteMessage(msg);
}
else
{
Thread.Sleep(WaitTime);
}
}
catch (StorageClientException exception)
{
BlobTrace.Write(exception.ToString());
Thread.Sleep(WaitTime);
}
}
}
If the encoding process takes more time than the message invisibility timeout (5 minutes in your case), the message will show up in the queue again. This will cause a second worker to start processing it. However, chances are that by the time the second worker finishes processing, the first worker is already done with the work and has deleted the message properly. This causes the second worker to fail at the deletion phase, since the message no longer exists for it.
This happens because of the lightweight transactional model of Windows Azure Queues. It guarantees that a message will be processed at least once (even if a worker fails silently), but it does not guarantee "only once" processing.
Since your encoding process seems to be idempotent and lightweight (the error shows up infrequently), I'd advise just increasing the invisibility timeout and explicitly catching this exception (by status code) around DeleteMessage (optionally logging the process duration so you can tweak the invisibility timeout further).
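A hedged sketch of what that could look like in the question's worker loop, assuming the v1.x StorageClient library where StorageClientException exposes a StatusCode:
try
{
    _queue.DeleteMessage(msg);
}
catch (StorageClientException ex)
{
    if (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
    {
        // another worker already deleted the message after the invisibility timeout
        // expired; the encoding itself was still done, so just log and move on
        BlobTrace.Write("Message already deleted: " + msg.Id);
    }
    else
    {
        throw;
    }
}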
Is it possible it's taking longer than the five minutes you've set as a timeout?
I had my development, production, and staging environments all pulling from the same queue, which was causing some strange behavior. I believe this to be the culprit.

.Net Critical regions in threading not working as desired

I am trying to run some sample code (a very basic example) involving threading and critical regions.
Here's my code:
public static void DoCriticalWork(object o)
{
SomeClass instance = o as SomeClass;
Thread.BeginCriticalRegion();
instance.IsValid = true;
Thread.Sleep(2);
instance.IsComplete = true;
Thread.EndCriticalRegion();
instance.Print();
}
And I am calling it as follows:
private static void CriticalHandled()
{
SomeClass instance = new SomeClass();
ParameterizedThreadStart operation = new ParameterizedThreadStart(CriticalRegion.DoCriticalWork);
Thread t = new Thread(operation);
Console.WriteLine("Start thread");
t.Start(instance);
Thread.Sleep(1);
Console.WriteLine("Abort thread");
t.Abort();
Console.WriteLine("In main");
instance.Print();
}
However, the output I get is:
Start thread
Abort thread
In main
IsValid: True
IsComplete: False
Since the critical region is defined, IsComplete should be true and not false.
Can someone please explain why it is not working?
Here is SomeClass for reference:
public class SomeClass
{
private bool _isValid;
public bool IsValid
{
get { return _isValid; }
set { _isValid = value; }
}
private bool _isComplete;
public bool IsComplete
{
get { return _isComplete; }
set { _isComplete = value; }
}
public void Print()
{
Console.WriteLine("IsValid: {0}", IsValid);
Console.WriteLine("IsComplete: {0}", IsComplete);
Console.WriteLine();
}
}
Edit
Explanation from MCTS notes:
The idea behind a critical region is to provide a region of code that must be executed as if it were a single line. Any attempt to abort a thread while it is within a critical region will have to wait until after the critical region is complete. At that point, the thread will be aborted, throwing the ThreadAbortException. The difference between a thread with and without a critical region is illustrated in the following figure:
Critical Region http://www.freeimagehosting.net/uploads/9dd3bb5445.gif
Thread.BeginCriticalRegion does not prevent a Thread from being aborted. I believe it is used to notify the runtime that if the Thread is aborted in the critical section, it is not necessarily safe to continue running the application/AppDomain.
The MSDN docs have a more complete explanation: http://msdn.microsoft.com/en-us/library/system.threading.thread.begincriticalregion.aspx
It's a sleep timing problem. Just widen the gap between the two sleeps and you will see what is happening.
There are two threads: the main thread and the thread that does the critical work. When Abort is called, the thread t will abort almost instantly, even if it has not completed the critical region.
Since the worker thread sleeps for 2 ms inside the critical region and the main thread sleeps for only 1 ms before aborting, sometimes t completes the critical section and sometimes it does not. That is why the value of IsComplete is sometimes false and sometimes true.
Now just send the main thread to sleep for 100ms and you will find the IsComplete is always true.
Vice versa send the thread "t" to sleep for 100ms and you will find the IsComplete is always false.
EDIT
FROM MSDN
Notifies a host that execution is about to enter a region of code in which the effects of a thread abort or unhandled exception might jeopardize other tasks in the application domain.
For example, consider a task that attempts to allocate memory while holding a lock. If the memory allocation fails, aborting the current task is not sufficient to ensure stability of the AppDomain, because there can be other tasks in the domain waiting for the same lock. If the current task is terminated, other tasks could be deadlocked.
When a failure occurs in a critical region, the host might decide to unload the entire AppDomain rather than take the risk of continuing execution in a potentially unstable state. To inform the host that your code is entering a critical region, call BeginCriticalRegion. Call EndCriticalRegion when execution returns to a non-critical region of code.
From CLR Inside Out: Writing Reliable Code
State Corruption
There are three buckets that state corruption may fall into. The first is local state, which includes local variables and heap objects that are only used by a particular thread. The second is shared state, which includes anything shared across threads within the AppDomain, such as objects stored in static variables. Caches often fall into this category. The third is process-wide, machine-wide, and cross-machine shared state—files, sockets, shared memory, and distributed lock managers fall into this camp.
The amount of state that can be corrupted by an async exception is the maximum amount of state a thread is currently modifying. If a thread allocates a few temporary objects and doesn't expose them to other threads, only those temporary objects can be corrupted. But if a thread is writing to shared state, that shared resource may be corrupted, and other threads may potentially encounter this corrupted state. You must not let that happen. In this case, you abort all the other threads in the AppDomain and then unload the AppDomain. In this way, an asynchronous exception escalates to an AppDomain, causing it to unload and ensuring that any potentially corrupted state is thrown away. Given a transacted store like a database, this AppDomain recycling provides resiliency to corruption of local and shared state.
Critical regions let you flag situations where a failure in a piece of code might corrupt state shared across the application domain, causing damage that the host can only contain by unloading the entire AppDomain.
A good solution is to encapsulate the code from DoCriticalWork() with
try { ... } catch (ThreadAbortException) { ... }
where you can act as you wish (maybe set IsComplete = true?).
You can read more about ThreadAbortException, and be sure to check Thread.ResetAbort and the finally block usage for this case.
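A rough sketch of that wrapping (how you react inside the catch is up to you; note that the ThreadAbortException is re-raised automatically at the end of the catch block unless Thread.ResetAbort() is called):
public static void DoCriticalWork(object o)
{
    SomeClass instance = o as SomeClass;
    try
    {
        Thread.BeginCriticalRegion();
        instance.IsValid = true;
        Thread.Sleep(2);
        instance.IsComplete = true;
        Thread.EndCriticalRegion();
    }
    catch (ThreadAbortException)
    {
        // the abort arrived mid-region: decide how to leave the object before the thread dies
        // Thread.ResetAbort(); // only if the thread really should keep running
    }
    finally
    {
        instance.Print();
    }
}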
As Andy mentioned and quoting from Thread.BeginCriticalRegion:
Notifies a host that execution is about to enter a region of code in which the effects of a thread abort or unhandled exception might jeopardize other tasks in the application domain.
