How to resolve Heartbeat took longer than "00:00:01" failure? - multithreading

I have a .NET Core C# project which performs an HTTP POST. The project is deployed to Kubernetes, and I've noticed the logs below:
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:45 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:46 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:47 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:48 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:49 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:50 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:51 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:52 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:53 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:54 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:55 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:43:56 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:44:33 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:44:34 +00:00".
warn: Microsoft.AspNetCore.Server.Kestrel[22]
      Heartbeat took longer than "00:00:01" at "02/22/2020 15:44:35 +00:00".
After some initial research, it seems this is a common symptom of thread pool starvation. Accordingly, in November last year, I made the POST asynchronous and also logged the max and available thread counts, as follows, for monitoring purposes:
// Max and currently available counts for worker and I/O completion port threads
ThreadPool.GetMaxThreads(out int workerThreads, out int completionPortThreads);
ThreadPool.GetAvailableThreads(out int workerThreadAvailable, out int completionPortThreadsAvailable);
_log.Info(new { message = $"Max threads = {workerThreads} and Available threads = {workerThreadAvailable}" });
Consistently over the past few months, the logging has shown: Max threads = 32767 and Available threads = 32766. That seems fine; however, I'm still noticing the same Heartbeat warning, so I'm wondering whether this really is a thread pool starvation issue. Might someone know what else is going on, and whether this error is actually the result of something else? Any investigation/resolution tips would be much appreciated!

This is a resource issue, as #andy pointed out in his response.
According to the OP, the solution to this problem is either to increase the server's CPU capacity (scaling vertically) or the number of instances of your app (scaling horizontally).
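As a minimal sketch, both scaling knobs live in the Kubernetes Deployment spec (the deployment name and all values here are placeholders; tune them to your workload):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api               # placeholder name
spec:
  replicas: 3                # horizontal: run more instances of the app
  template:
    spec:
      containers:
        - name: my-api
          resources:
            requests:
              cpu: "500m"
            limits:
              cpu: "1"       # vertical: give each pod more CPU headroom
```

With too low a CPU limit, the container is throttled and Kestrel's once-per-second heartbeat timer can miss its deadline even when the thread pool itself looks healthy.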

Related

Connection to Azure Service Bus using Java Spring Application - Timeout

I have written a client which tries to connect to Azure Service Bus. As soon as the server starts up, I get the errors below and I receive none of the messages present on the queue. I tried replacing the sb protocol with amqpwss, but it didn't help.
2020-05-25 21:23:11 [ReactorThreadeebf108d-444b-4acd-935f-c2c2c135451d] INFO c.m.a.s.p.RequestResponseLink - Internal send link 'RequestResponseLink-Sender_0480eb_c31e1cc239bf471e811e53a30adc6488_G51' of requestresponselink to '$cbs' encountered error.
com.microsoft.azure.servicebus.primitives.ServiceBusException: com.microsoft.azure.servicebus.amqp.AmqpException: The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'. TrackingId:c31e1cc239bf471e811e53a30adc6488_G51, SystemTracker:gateway7, Timestamp:2020-05-25T21:23:10
at com.microsoft.azure.servicebus.primitives.ExceptionUtil.toException(ExceptionUtil.java:55)
at com.microsoft.azure.servicebus.primitives.RequestResponseLink$InternalSender.onClose(RequestResponseLink.java:759)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:66)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:42)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.microsoft.azure.servicebus.primitives.MessagingFactory$RunReactor.run(MessagingFactory.java:491)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.microsoft.azure.servicebus.amqp.AmqpException: The connection was inactive for more than the allowed 60000 milliseconds and is closed by container 'LinkTracker'. TrackingId:c31e1cc239bf471e811e53a30adc6488_G51, SystemTracker:gateway7, Timestamp:2020-05-25T21:23:10
... 10 common frames omitted
There is a similar issue open on GitHub:
What you posted here is the trace, not the error. Yes, the service
closes idle connections after 10 minutes. The client traces it and
reopens the connection. It is seamless and doesn't throw any exceptions
to the application, so that can't be your problem. If your sends are
failing, there may be another problem, but not this one.
As I read the second line, it is about the 60-second idle timeout; can you check the troubleshooting page to see if it helps? Also this:
we recommend adding "sync-publish=true" to the connection URL

Odoo timeout killing cron

I found in the logs that a timeout set to 120s is killing cron workers.
The first issue I noticed is that a plugin which makes backups of the db got stuck in a loop, producing zip after zip, so within 1-2 hours the disk is full.
The second is the scheduled action called "Mass Mailing: Process queue" in Odoo.
It should run every 60 minutes, but it is getting killed by the timeout and then runs again immediately after the kill.
Where should I look for this timeout? I have already raised all the timeouts in odoo.conf to 500 seconds.
Odoo v12 Community, Ubuntu 18, nginx.
2019-12-02 06:43:04,711 4493 ERROR ? odoo.service.server: WorkerCron (4518) timeout after 120s
2019-12-02 06:43:04,720 4493 ERROR ? odoo.service.server: WorkerCron (4518) timeout after 120s
The following timeouts, which you can find in odoo.conf, are usually responsible for the behaviour you are experiencing (in particular the second one):
limit_time_cpu = 60
limit_time_real = 120
More explanation is available in the Odoo documentation: https://www.odoo.com/documentation/12.0/reference/cmdline.html#multiprocessing
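For example, a minimal odoo.conf sketch raising both limits (the values are illustrative; pick limits that comfortably cover your longest cron job, since limit_time_real is the wall-clock limit behind the "timeout after 120s" message):

```
[options]
limit_time_cpu = 600
limit_time_real = 1200
```

Restart the Odoo service after changing these for them to take effect.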

Consumer disappears from queue after 30-40 mins

My app just disappears from the list of consumers in the RabbitMQ admin UI after working fine for 30-40 minutes. AMQP lib used: node-amqp. Here's the connection:
const con = amqp.createConnection(options,{defaultExchangeName: 'amq.topic', reconnect: true})
The following event handlers are configured too: connect, ready, close, tag.change, error
The worst part is that I don't get any error or close events; the app just disconnects and logs nothing...
It seems that the connection is terminated for being 'idle' for a while...
Has anyone had something similar? How did you deal with it?
Perhaps this helps someone. To resolve the issue, we had to add a heartbeat field to the options and specify the interval in seconds at which the connection should be checked and refreshed.
The heartbeat doesn't have a default value, so if it is not explicitly added, node-amqp won't use one.
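A sketch of what that looks like, assuming node-amqp's createConnection API (the base options are placeholders for your real broker settings):

```javascript
// Placeholder connection options for the broker.
const options = { host: 'localhost', port: 5672 };

// Add a heartbeat so client and broker exchange frames every 30 seconds,
// preventing the connection from being silently dropped as idle.
const withHeartbeat = Object.assign({}, options, { heartbeat: 30 });

// Then connect as in the question, with the heartbeat included:
// const con = amqp.createConnection(withHeartbeat,
//   { defaultExchangeName: 'amq.topic', reconnect: true });
```

Keep the interval comfortably below whatever idle timeout sits between the app and the broker (load balancers and firewalls often cut idle TCP connections too).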

Hazelcast warnings in server logs everytime client disconnects

I am seeing the log statements below every time a client disconnects. Is there anything we can do to avoid them, as they don't really seem to warrant a warning?
I believe these should be at DEBUG level.
01/11/2017 14:14:38,465 - INFO [hz._hzInstance_1_FB_API.IO.thread-in-0][][com.hazelcast.nio.tcp.TcpIpConnection] [127.0.0.1]:31201 [FB_API] [3.6.2] Connection [Address[127.0.0.1]:59972] lost. Reason: java.io.EOFException[Remote socket closed!]
01/11/2017 14:14:38,466 - WARN [hz._hzInstance_1_FB_API.IO.thread-in-0][][com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader] [127.0.0.1]:31201 [FB_API] [3.6.2] hz._hzInstance_1_FB_API.IO.thread-in-0 Closing socket to endpoint Address[127.0.0.1]:59972, Cause:java.io.EOFException: Remote socket closed!
01/11/2017 14:14:38,467 - INFO [hz._hzInstance_1_FB_API.event-3][][com.hazelcast.client.ClientEndpointManager] [127.0.0.1]:31201 [FB_API] [3.6.2] Destroying ClientEndpoint{conn=Connection [0.0.0.0/0.0.0.0:31201 -> /127.0.0.1:59972], endpoint=Address[127.0.0.1]:59972, alive=false, type=JAVA_CLIENT, principal='ClientPrincipal{uuid='7181902f-8fe9-4065-b610-5a0e9c4ad212', ownerUuid='7ff02f81-c0e3-4094-a968-507c7df4df9f'}', firstConnection=true, authenticated=true}
01/11/2017 14:14:38,467 - INFO [hz._hzInstance_1_FB_API.event-3][][com.hazelcast.transaction.TransactionManagerService] [127.0.0.1]:31201 [FB_API] [3.6.2] Committing/rolling-back alive transactions of client, UUID: 7181902f-8fe9-4065-b610-5a0e9c4ad212
These exceptions/errors are cluttering our log analysis.
It's a code issue. You have three choices:
Ignore it
Filter it out in your logging configuration, something like this for Log4j2:
<Logger name="com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader" level="error" additivity="false"/>
(This is a bad idea, as you will miss genuine warnings)
Upgrade to 3.7, or 3.8 if this is GA when you do the upgrade
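For context on option 2, a sketch of where such a filter sits in a log4j2.xml (the appender name and patterns are placeholders):

```xml
<Configuration>
  <Appenders>
    <Console name="Console">
      <PatternLayout pattern="%d %p %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Raise the noisy Hazelcast socket reader to error-only -->
    <Logger name="com.hazelcast.nio.tcp.nonblocking.NonBlockingSocketReader"
            level="error" additivity="false">
      <AppenderRef ref="Console"/>
    </Logger>
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```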

Biztalk WCF-Custom Adapter gives time out error

I am running a stored procedure on an Azure SQL database using the BizTalk WCF-Custom adapter. I have around 20k records to be processed as a composite operation. We are using BizTalk 2013 R2 on Windows Server 2012.
I am getting the following error after 40-50 minutes:
A message sent to adapter "WCF-Custom" on send port "Swire.BizTalk.M3.Send.PushAssets.Local" with URI "mssql://xngoo0zsw2.database.windows.net//overvuuat_20150720?" is suspended. Error details:
Microsoft.ServiceModel.Channels.Common.InvalidUriException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached. ---> System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Microsoft.Adapters.Sql.SqlAdapterConnection.OpenConnection()
   --- End of inner exception stack trace ---
Server stack trace:
   at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at System.ServiceModel.Channels.ServiceChannel.SendAsyncResult.End(SendAsyncResult result)
   at System.ServiceModel.Channels.ServiceChannel.EndCall(String action, Object[] outs, IAsyncResult result)
   at System.ServiceModel.Channels.ServiceChannel.EndRequest(IAsyncResult result)
Exception rethrown at [0]:
   at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
   at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
   at System.ServiceModel.Channels.IRequestChannel.EndRequest(IAsyncResult result)
   at Microsoft.BizTalk.Adapter.Wcf.Runtime.WcfClient`2.RequestCallback(IAsyncResult result)
MessageId: {9034380A-E116-4694-BD70-F9933BF37BD3}
InstanceID: {91B6E1AA-32AD-48D9-A18A-F9BF20529764}
Following are the configurations for that send port:
I increased the timeout from 00:50:00 to 05:00:00, but there was no change in the result; I am still getting the same error.
I ran sp_who2 to get the list of connections that were running/queued. There were more than 100 connections when the error was raised. One of the queries, a "SELECT INTO", was in suspended mode and was displayed as if it were running from SSMS, but we have no such query and never run queries directly from SSMS.
Please suggest a way to resolve this.
It looks like you are hitting the max connection pool size, which defaults to 100, so increasing the timeout won't help in this case. You could use non-pooled connections, but before doing that, find out why there are 100+ connections to the database. This could be due to performance issues on the database (it may need to scale) or application issues (locks being held on a table and blocking other queries, or connections not being closed and therefore never returned to the pool).
For performance issues, query sys.resource_stats to see whether your database resource utilization is peaking. You can also use sys.dm_exec_requests and sys.dm_tran_locks to see what the other connections are doing. I assume you are closing the connections in your app when you are done with them (otherwise they won't return to the pool).
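A sketch of those diagnostic queries (column names per the SQL Server DMVs; the database name is taken from the error's URI and may differ in your environment):

```sql
-- Resource utilization history for the Azure SQL database (run in master):
SELECT TOP (10) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.resource_stats
WHERE database_name = 'overvuuat_20150720'
ORDER BY end_time DESC;

-- Currently executing requests and who is blocking whom:
SELECT session_id, status, command, wait_type, blocking_session_id
FROM sys.dm_exec_requests;

-- Locks currently held or waited on:
SELECT request_session_id, resource_type, request_mode, request_status
FROM sys.dm_tran_locks;
```

A non-zero blocking_session_id in sys.dm_exec_requests points at the session holding the lock that keeps the others (and their pooled connections) tied up.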
