When running a distributor that is relatively busy with 5+ workers, the service stops processing messages from time to time.
The NServiceBus version is 3.3.8 with a few backported fixes from 4.x (regarding MessagePropertyFilters, locking around ClearAvailabilityForWorkers, and a few minor issues). The source code is available here:
https://github.com/PeterLehmann/NServiceBus/tree/undoWfpNameChange
Analyzing a hanging service with WinDbg shows that most threads are waiting on memory allocation or freeing, like this:
System.Runtime.InteropServices.GCHandle.InternalFree(IntPtr)
System.Runtime.InteropServices.GCHandle.Free()
System.Messaging.Interop.MessagePropertyVariants.Unlock()
System.Messaging.MessageQueue.ReceiveCurrent(System.TimeSpan, Int32, System.Messaging.Interop.CursorHandle, System.Messaging.MessagePropertyFilter, System.Messaging.MessageQueueTransaction, System.Messaging.MessageQueueTransactionType)
System.Messaging.MessageQueue.Peek(System.TimeSpan)
NServiceBus.Unicast.Queuing.Msmq.MsmqMessageReceiver.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process()
NServiceBus.Utils.WorkerThread.Loop()
System.Threading.ExecutionContext.runTryCode(System.Object)
and
System.Runtime.InteropServices.GCHandle.InternalAlloc(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Runtime.InteropServices.GCHandle..ctor(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Messaging.Interop.MessagePropertyVariants.Lock()
System.Messaging.MessageQueue.ReceiveCurrent(System.TimeSpan, Int32, System.Messaging.Interop.CursorHandle, System.Messaging.MessagePropertyFilter, System.Messaging.MessageQueueTransaction, System.Messaging.MessageQueueTransactionType)
System.Messaging.MessageQueue.Peek(System.TimeSpan)
NServiceBus.Unicast.Queuing.Msmq.MsmqMessageReceiver.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.HasMessage()
NServiceBus.Unicast.Transport.Transactional.TransactionalTransport.Process()
NServiceBus.Utils.WorkerThread.Loop()
System.Threading.ExecutionContext.runTryCode(System.Object)
We have multiple threads with the above two stacks; all of them appear to be hanging, and messages aren't being processed. We also have a single thread waiting on InternalAlloc in the middle of Raven communication:
System.Runtime.InteropServices.GCHandle.InternalAlloc(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Runtime.InteropServices.GCHandle..ctor(System.Object, System.Runtime.InteropServices.GCHandleType)
System.Net.SafeDeleteContext.InitializeSecurityContext(System.Net.SecurDll, System.Net.SafeFreeCredentials ByRef, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.SSPIAuthType.InitializeSecurityContext(System.Net.SafeFreeCredentials, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.SSPIWrapper.InitializeSecurityContext(System.Net.SSPIInterface, System.Net.SafeFreeCredentials, System.Net.SafeDeleteContext ByRef, System.String, System.Net.ContextFlags, System.Net.Endianness, System.Net.SecurityBuffer[], System.Net.SecurityBuffer, System.Net.ContextFlags ByRef)
System.Net.NTAuthentication.GetOutgoingBlob(Byte[], Boolean, System.Net.SecurityStatus ByRef)
System.Net.NTAuthentication.GetOutgoingBlob(System.String)
System.Net.NegotiateClient.DoAuthenticate(System.String, System.Net.WebRequest, System.Net.ICredentials, Boolean)
System.Net.NegotiateClient.Authenticate(System.String, System.Net.WebRequest, System.Net.ICredentials)
System.Net.AuthenticationManager.Authenticate(System.String, System.Net.WebRequest, System.Net.ICredentials)
System.Net.AuthenticationState.AttemptAuthenticate(System.Net.HttpWebRequest, System.Net.ICredentials)
System.Net.HttpWebRequest.CheckResubmitForAuth()
System.Net.HttpWebRequest.CheckResubmit(System.Exception ByRef)
System.Net.HttpWebRequest.DoSubmitRequestProcessing(System.Exception ByRef)
System.Net.HttpWebRequest.ProcessResponse()
System.Net.HttpWebRequest.SetResponse(System.Net.CoreResponseData)
System.Net.ConnectStream.ProcessWriteCallDone(System.Net.ConnectionReturnResult)
System.Net.ConnectStream.CallDone(System.Net.ConnectionReturnResult)
System.Net.ConnectStream.WriteHeaders(Boolean)
System.Net.HttpWebRequest.EndSubmitRequest()
System.Net.Connection.SubmitRequest(System.Net.HttpWebRequest, Boolean)
System.Net.ServicePoint.SubmitRequest(System.Net.HttpWebRequest, System.String)
System.Net.HttpWebRequest.SubmitRequest(System.Net.ServicePoint)
System.Net.HttpWebRequest.GetResponse()
Raven.Client.Connection.HttpJsonRequest.ReadStringInternal(System.Func`1<System.Net.WebResponse>)
Raven.Client.Connection.HttpJsonRequest.ReadResponseString()
Raven.Client.Connection.HttpJsonRequest.ReadResponseJson()
Raven.Client.Connection.ServerClient.DirectCommit(System.Guid, System.String)
Raven.Client.Connection.ServerClient+<>c__DisplayClass5b.<Commit>b__5a(System.String)
Raven.Client.Connection.ServerClient.TryOperation[[System.__Canon, mscorlib]](System.Func`2<System.String,System.__Canon>, System.String, Boolean, System.__Canon ByRef)
Raven.Client.Connection.ServerClient.ExecuteWithReplication[[System.__Canon, mscorlib]](System.String, System.Func`2<System.String,System.__Canon>)
Raven.Client.Connection.ServerClient.Commit(System.Guid)
Raven.Client.Document.DocumentSession.Commit(System.Guid)
Raven.Client.Document.RavenClientEnlistment.Commit(System.Transactions.Enlistment)
System.Transactions.Oletx.OletxEnlistment.CommitRequest()
...
We have tried variations of concurrent and server-mode garbage collection, and it looks like the issue happens while garbage collection is taking place, but only sometimes; other times we see the GC running perfectly and normal throughput isn't affected.
I checked whether something was blocking the finalizer, but that doesn't seem to be the case. I also did some digging to see if something in the memory allocation/deallocation could deadlock, but couldn't find anything (I could dig deeper here, but would like some input first).
Currently, operations restarts the services when this happens; however, even this sometimes fails, and the services get stuck in the "Stopping" state and have to be killed hard to get them processing again.
So, has anyone else experienced hanging services with NServiceBus 3.3.8?
To answer my own question: we were running .NET 4.0.30319 on the system; after upgrading to .NET 4.5.1 the above problem seems to have vanished.
I suspect that there have been some updates to memory handling and garbage collection that solved the issue.
I have a range of Azure Function apps all outputting messages to Azure Service Bus using the IAsyncCollector<Message>:
public async Task Run([ServiceBus(...)] IAsyncCollector<Message> messages)
{
    ...
    await messages.AddAsync(msg);
}
I'm getting errors logged from time to time looking like this:
Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: Function
---> Microsoft.Azure.ServiceBus.ServiceBusTimeoutException: The operation did not complete within the allocated time 00:00:59.9999536 for object message.Reference: ..., 7/13/2020 2:46:24 PM
---> System.TimeoutException: The operation did not complete within the allocated time 00:00:59.9999536 for object message.
at Microsoft.Azure.Amqp.AsyncResult.End[TAsyncResult](IAsyncResult result)
at Microsoft.Azure.Amqp.SendingAmqpLink.EndSendMessage(IAsyncResult result)
at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList)
--- End of inner exception stack trace ---
at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
at Microsoft.Azure.ServiceBus.Core.MessageSender.SendAsync(IList`1 messageList)
at Microsoft.Azure.WebJobs.ServiceBus.Bindings.MessageSenderExtensions.SendAndCreateEntityIfNotExists(MessageSender sender, Message message, Guid functionInstanceId, EntityType entityType, CancellationToken cancellationToken)
at My.Function.Run(String mySbMsg, IAsyncCollector`1 messages)
I'm having a bit of a hard time figuring out when in the pipeline this happens. But I have recently learned about the FlushAsync method:
await messages.AddAsync(msg);
await messages.FlushAsync();
My question is the following. Why would I ever NOT include a call to FlushAsync in my function? Getting the timeout exception in my own code will make it possible to retry, do better exception logging, and more. Any downsides of flushing manually like this within the function code?
Why would I ever NOT include a call to FlushAsync in my function? Getting the timeout exception in my own code will make it possible to retry, do better exception logging, and more.
I'm going to go a bit further here and say that after gaining some experience with Azure Functions, I now avoid IAsyncCollector<T> completely. Some implementations publish on AddAsync; others may publish on both AddAsync and FlushAsync. I suspect the Service Bus implementation is actually publishing on AddAsync, in which case FlushAsync may be a no-op.
The nice part about IAsyncCollector<T> is that it gives you a "write these things" abstraction; all you have to do is provide a connection string, and the rest is magic. The problem with IAsyncCollector<T> is that it gives you an abstraction, and thus you have much less control.
Under the hood, how many retries are being done? Are they using constant delays or exponentially increasing? What's the behavior if it never succeeds? Usually none of this critical information is documented.
What's especially annoying is when the AF team changes the semantics of the abstraction. E.g., for some of the output bindings (either CosmosDB or storage, I don't remember), the retry behavior changed from one release of the functions SDK to the next.
So, I've gravitated towards avoiding output bindings, especially IAsyncCollector<T>. I usually want to do tight but exponentially increasing de-correlated jittered retries with a cap of a minute or so, but aborted when there's only a minute left of Functions runtime, and then the recovery behavior changes to writing a message to an error queue (with retries). This is way more complex than an IAsyncCollector<T> could ever provide, but it isn't that hard to do with Polly calling the SDKs directly.
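As a rough illustration of that approach (not production code): a sketch using Polly and calling the Microsoft.Azure.ServiceBus SDK directly, with made-up class and method names, a simplified backoff formula rather than true de-correlated jitter, and without the runtime-deadline check or the error-queue fallback described above:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.ServiceBus.Core;
using Polly;

public class DirectSender
{
    private readonly MessageSender _sender;   // assumed to be configured/injected elsewhere
    private readonly Random _jitter = new Random();

    public DirectSender(MessageSender sender) => _sender = sender;

    public Task SendWithRetriesAsync(Message msg)
    {
        // Exponential backoff capped at roughly a minute, with random jitter added.
        var retryPolicy = Policy
            .Handle<ServiceBusTimeoutException>()
            .Or<ServiceBusCommunicationException>()
            .WaitAndRetryAsync(
                retryCount: 6,
                sleepDurationProvider: attempt =>
                    TimeSpan.FromSeconds(Math.Min(60, Math.Pow(2, attempt)) * (0.5 + _jitter.NextDouble() / 2)));

        // Send through the SDK directly instead of an output binding, so failures surface here.
        return retryPolicy.ExecuteAsync(() => _sender.SendAsync(msg));
    }
}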
Any downsides of flushing manually like this within the function code?
No. By default, IAsyncCollector<T>.FlushAsync is called by the functions host after your function executes. So if you call it yourself, you're just calling it early. It should be safe to call multiple times.
There's nothing wrong with calling FlushAsync in your code. The current implementation of IAsyncCollector for the Service Bus output binding batches your messages until FlushAsync is called or the function returns. When trying to send a high volume of messages I stumbled on the same timeout exception as you. The solution I found is to call FlushAsync every N messages; in my case N=100 was the best trade-off. Calling FlushAsync too often triggers obvious performance penalties.
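For illustration, a rough sketch of that pattern; the trigger shape, the line splitting, and the queue names are placeholders for this sketch, and only the flush-every-N part is the actual point:

using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.WebJobs;

public static class ForwardMessages
{
    [FunctionName("ForwardMessages")]
    public static async Task Run(
        [ServiceBusTrigger("input-queue")] string mySbMsg,
        [ServiceBus("output-queue")] IAsyncCollector<Message> messages)
    {
        const int flushEvery = 100;   // N = 100 was the best trade-off in my case
        var pending = 0;

        foreach (var line in mySbMsg.Split('\n'))
        {
            await messages.AddAsync(new Message(Encoding.UTF8.GetBytes(line)));

            if (++pending >= flushEvery)
            {
                await messages.FlushAsync();   // send this batch now instead of at function exit
                pending = 0;
            }
        }

        await messages.FlushAsync();           // send whatever is left over
    }
}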
I am using redis-py 2.10.6 and redis 4.0.11.
My application uses redis for both the db and the pubsub. When I shut down I often get either hanging or a crash. The latter usually complains about a bad file descriptor or an I/O error on a file (I don't use any) which happens while handling a pubsub callback, so I'm guessing the underlying issue is the same: somehow I don't get disconnected properly and the pool used by my redis.Redis object is alive and kicking.
An example of the output of the latter kind of error (during _read_from_socket):
redis.exceptions.ConnectionError: Error while reading from socket: (9, 'Bad file descriptor')
Other times the stacktrace clearly shows redis/connection.py -> redis/client.py -> threading.py, which proves that redis isn't killing the threads it uses.
When I start the application I run:
self.redis = redis.Redis(host=XXXX, port=XXXX)
self.pubsub = self.redis.pubsub()
subscriptions = {'chan1': self.cb1, 'chan2': self.cb2} # cb1 and cb2 are functions
self.pubsub.subscribe(**subscriptions)
self.pubsub_thread = self.pubsub.run_in_thread(sleep_time=1)
When I want to exit the application, the last instruction I execute in main is a call to a function in my Redis-using class, whose implementation is:
self.pubsub.close()
self.pubsub_thread.stop()
self.redis.connection_pool.disconnect()
My understanding is that in theory I do not even need to do any of these 'closing' calls, and yet, with or without them, I still can't guarantee a clean shutdown.
My question is, how am I supposed to guarantee a clean shutdown?
I ran into this same issue and it's largely caused by improper handling of the shutdown by the redis library. During the cleanup, the thread continues to process new messages and doesn't account for situations where the socket is no longer available. After scouring the code a bit, I couldn't find a way to prevent additional processing without just waiting.
Since this is run during a shutdown phase and it's a remedy for a 3rd party library, I'm not overly concerned about the sleep, but ideally the library should be updated to prevent further action while shutting down.
self.pubsub_thread.stop()  # ask the worker thread to stop
time.sleep(0.5)            # give any in-flight message handling a moment to finish
self.pubsub.reset()        # then tear down the pubsub connection
This might be worth an issue log or PR on the redis-py library.
The PubSubWorkerThread class checks self._running.is_set() inside its loop.
To do a "clean shutdown" you should call self.pubsub_thread._running.clear() to clear the thread's event, and it will stop.
Check how it works here:
https://redis.readthedocs.io/en/latest/_modules/redis/client.html?highlight=PubSubWorkerThread#
In Cloud Haskell, the expect function has a timeout cousin called expectTimeout. Is there also a timeout function for receiveChan (for the type-safe channel)?
Currently, my application will wait indefinitely for a reply, so if a process dies my application ends up deadlocked. It would therefore be nice if I could set a timeout so that it ignores processes that have died.
There is a receiveChanTimeout in version 0.4.2
http://hackage.haskell.org/packages/archive/distributed-process/0.4.2/doc/html/Control-Distributed-Process.html
I'm building a server with NIO, I have two questions.
Do I have to use a worker thread or a thread pool to process the received messages, or can I let the main thread do all this work? (I have performance requirements.)
I have two kinds of sending: a sendNow method which ends with selector.selectNow(), and a simple send method which ends with selector.wakeup(). Can I lose data with either of those methods?
Thanks
If possible try to do it all in one thread. It gets very complicated very quickly otherwise.
I don't know why you think a sendNow() method needs to end with either selectNow() or wakeup(), but neither of them is intrinsically going to cause a data loss.
We have an application that is undergoing performance testing. Today, I decided to take a dump of w3wp & load it in windbg to see what is going on underneath the covers. Imagine my surprise when I ran !threads and saw that there are 640 background threads, almost all of which seem to say the following:
OS Thread Id: 0x1c38 (651)
Child-SP RetAddr Call Site
0000000023a9d290 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()
0000000023a9d2d0 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()
0000000023a9d330 000007fef727c978 Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()
0000000023a9d380 000007fef9001552 System.Threading.ExecutionContext.runTryCode(System.Object)
0000000023a9dc30 000007fef72f95fd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0000000023a9dc80 000007fef9001552 System.Threading.ThreadHelper.ThreadStart()
If I had to guess, I'm thinking that one of these threads is getting spawned for each run of our app - we have 2 app servers, 20 concurrent users, and ran the test approximately 30 times... it's in the neighborhood.
Is this 'expected behavior', or have we perhaps implemented something improperly? The test ran hours ago, so I would have expected any timeouts to have occurred already.
Edit: Thank you all for your replies. It has been requested that more detail be shown about the call stack - here is the output of !mk from sosex.dll.
ESP RetAddr
00:U 0000000023a9cb38 00000000775f72ca ntdll!ZwWaitForMultipleObjects+0xa
01:U 0000000023a9cb40 00000000773cbc03 kernel32!WaitForMultipleObjectsEx+0x10b
02:U 0000000023a9cc50 000007fef8f5f595 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
03:U 0000000023a9ccf0 000007fef8f59f49 mscorwks!Thread::DoAppropriateAptStateWait+0x41
04:U 0000000023a9cd50 000007fef8e55b99 mscorwks!Thread::DoAppropriateWaitWorker+0x191
05:U 0000000023a9ce50 000007fef8e2efe8 mscorwks!Thread::DoAppropriateWait+0x5c
06:U 0000000023a9cec0 000007fef8f0dc7a mscorwks!CLREvent::WaitEx+0xbe
07:U 0000000023a9cf70 000007fef8fba72e mscorwks!Thread::Block+0x1e
08:U 0000000023a9cfa0 000007fef8e1996d mscorwks!SyncBlock::Wait+0x195
09:U 0000000023a9d0c0 000007fef9463d3f mscorwks!ObjectNative::WaitTimeout+0x12f
0a:M 0000000023a9d290 000007ff002321b3 *** ERROR: Module load completed but symbols could not be loaded for Microsoft.Practices.EnterpriseLibrary.Caching.DLL
Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.WaitUntilInterrupted()(+0x0 IL)(+0x11 Native)
0b:M 0000000023a9d2d0 000007ff002320e2 Microsoft.Practices.EnterpriseLibrary.Caching.ProducerConsumerQueue.Dequeue()(+0xf IL)(+0x18 Native)
0c:M 0000000023a9d330 000007ff00231f7e Microsoft.Practices.EnterpriseLibrary.Caching.BackgroundScheduler.QueueReader()(+0x9 IL)(+0x12 Native)
0d:M 0000000023a9d380 000007fef727c978 System.Threading.ExecutionContext.runTryCode(System.Object)(+0x18 IL)(+0x106 Native)
0e:U 0000000023a9d440 000007fef9001552 mscorwks!CallDescrWorker+0x82
0f:U 0000000023a9d490 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
10:U 0000000023a9d530 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
11:U 0000000023a9d790 000007fef8f0cbd2 mscorwks!ExecuteCodeWithGuaranteedCleanupHelper+0x12a
12:U 0000000023a9da20 000007fef945e572 mscorwks!ReflectionInvocation::ExecuteCodeWithGuaranteedCleanup+0x172
13:M 0000000023a9dc30 000007fef7261722 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)(+0x60 IL)(+0x51 Native)
14:M 0000000023a9dc80 000007fef72f95fd System.Threading.ThreadHelper.ThreadStart()(+0x8 IL)(+0x2a Native)
15:U 0000000023a9dcd0 000007fef9001552 mscorwks!CallDescrWorker+0x82
16:U 0000000023a9dd20 000007fef8e9e5e3 mscorwks!CallDescrWorkerWithHandler+0xd3
17:U 0000000023a9ddc0 000007fef8eac83f mscorwks!MethodDesc::CallDescr+0x24f
18:U 0000000023a9e010 000007fef8f9ae8d mscorwks!ThreadNative::KickOffThread_Worker+0x191
19:U 0000000023a9e330 000007fef8f59374 mscorwks!TypeHandle::GetParent+0x5c
1a:U 0000000023a9e380 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
1b:U 0000000023a9e450 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
1c:U 0000000023a9e490 000007fef8e1c985 mscorwks!ILCodeStream::GetToken+0x25
1d:U 0000000023a9e4c0 000007fef8f594e1 mscorwks!Thread::DoADCallBack+0x145
1e:U 0000000023a9e630 000007fef8f59399 mscorwks!TypeHandle::GetParent+0x81
1f:U 0000000023a9e680 000007fef8e52045 mscorwks!SVR::gc_heap::make_heap_segment+0x155
20:U 0000000023a9e750 000007fef8f66139 mscorwks!ZapStubPrecode::GetType+0x39
21:U 0000000023a9e790 000007fef8e20e15 mscorwks!ThreadNative::KickOffThread+0x401
22:U 0000000023a9e7f0 000007fef8e20ae7 mscorwks!ThreadNative::KickOffThread+0xd3
23:U 0000000023a9e8d0 000007fef8f814fc mscorwks!Thread::intermediateThreadProc+0x78
24:U 0000000023a9f7a0 00000000773cbe3d kernel32!BaseThreadInitThunk+0xd
25:U 0000000023a9f7d0 00000000775d6a51 ntdll!RtlUserThreadStart+0x1d
Yes, the caching block has some - issues - with regard to the scavenger threads in older versions of Entlib, particularly if things are coming in faster than the scavenging settings let them come out.
This was completely rewritten in Entlib 5, so that now you'll never have more than two threads sitting in the caching block, regardless of the load, and usually it'll only be one.
Unfortunately there's no easy tweak to change the behavior in earlier versions. The best you can do is change the cache settings so that each scavenge will clean out more items at a time so not as many scavenge requests need to get scheduled.
640 threads is very bad for performance. If they are all waiting for something, then I'd say it's a fair bet that you have a deadlock and they will never exit. If they are all running (not waiting)... well, with 600+ threads on a 2 or 4 core processor none of them will get enough time slices to run very far! ;>
If your app is set up with a main thread that waits on the thread handles to find out when the threads exit, and the background threads get caught up in a loop or in a wait state and never exit the thread proc, then the process and all of its threads will never exit.
Check your thread code to make sure that every threadproc has a clear path to exit the threadproc. It's bad form to write an infinite loop in a background thread on the assumption that the thread will be forcibly terminated when the process shuts down.
If the background thread code spins in a loop waiting for an event handle to signal, make sure that you have some way to signal that event so that the thread can perform a normal orderly exit. Otherwise, you need to write the background thread to wait on multiple events and unblock when any one of the events signals. One of those events can be the activity that the background thread is primarily interested in and the other can be a shutdown event.
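As a minimal sketch of that pattern (the names and types are illustrative, not taken from the code in question): a thread proc that waits on both a work event and a shutdown event, so it always has an orderly way out of its loop:

using System.Threading;

class Worker
{
    private readonly AutoResetEvent _workAvailable = new AutoResetEvent(false);
    private readonly ManualResetEvent _shutdown = new ManualResetEvent(false);

    public void ThreadProc()
    {
        var handles = new WaitHandle[] { _shutdown, _workAvailable };
        while (true)
        {
            // Index 0 is the shutdown event, so shutdown wins when both are signaled.
            if (WaitHandle.WaitAny(handles) == 0)
                return;   // orderly exit from the thread proc

            // ... dequeue and process the work item here ...
        }
    }

    public void SignalWork() { _workAvailable.Set(); }
    public void RequestShutdown() { _shutdown.Set(); }
}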
From the names of things in the stack dump you posted, it would appear that the thread is waiting for something to appear in the ProducerConsumerQueue. Investigate how that queue object is supposed to be shut down, probably on the producer side, and whether shutting down the queue will automatically release all consumers that are waiting on that queue.
My guess is that either the queue is not being shut down correctly or shutting it down does not implicitly release the consumers that are waiting on it. If the latter case, you may need to pump a terminate message through the queue to wake up all the consumers waiting on that queue and tell them to break out of their wait loop and exit.
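A hypothetical illustration of that terminate-message idea, using a BlockingCollection rather than the actual EntLib ProducerConsumerQueue (whose internals I don't know): a sentinel item posted by the producer side tells each waiting consumer to break out of its loop:

using System.Collections.Concurrent;

class QueueReader
{
    private static readonly object Terminate = new object();
    private readonly BlockingCollection<object> _queue = new BlockingCollection<object>();

    public void ConsumerLoop()
    {
        while (true)
        {
            object item = _queue.Take();          // blocks until something is enqueued
            if (ReferenceEquals(item, Terminate))
                break;                            // poison pill: leave the wait loop so the thread can end

            // ... process item ...
        }
    }

    public void Enqueue(object item) { _queue.Add(item); }

    // Post one of these per waiting consumer when shutting the producer side down.
    public void PostTerminate() { _queue.Add(Terminate); }
}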
You have a major issue. Every thread occupies 1 MB of stack, and there is a significant cost paid for context-switching every thread in and out. It gets even worse with managed code, because every time the GC has to run it must walk each thread's stack to look for roots, and when those stacks have been paged out to disk the cost of reading them back in is expensive, which adds to the performance problem.
Creating threads by hand is bad unless you know what you are doing; Jeffrey Richter has written about this in detail.
To troubleshoot the above issue, I would look at what these threads are blocked on and also put a breakpoint on thread creation (for example, sxe ct within WinDbg).
And later re-architect to avoid creating threads yourself and use the thread pool instead.
It would have been nice to see some call stacks of these threads.
In Microsoft Enterprise Library 4.1, the BackgroundScheduler class creates a new thread each time an object is instantiated. This will be fixed in version 5.0. I do not know enough about this library to advise you on how to avoid that behavior, but you may try the beta version: http://entlib.codeplex.com/wikipage?title=EntLib5%20Beta2