Hung Threads in WebSphere server due to Log4j Disruptor - log4j

After we upgraded to log4j 2.17.1, we sometimes get hung threads in the WebSphere server.
Please find the stack trace below:
[7/6/22 9:29:38:508 CEST] 00000054 ThreadMonitor W WSVR0605W: Thread "s85311" (00001833) has been active for 774504 milliseconds and may be hung. There is/are 5 thread(s) in total in the server that may be hung.
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:847)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:881)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1210)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:220)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:296)
at com.lmax.disruptor.TimeoutBlockingWaitStrategy.signalAllWhenBlocking(TimeoutBlockingWaitStrategy.java:62)
at com.lmax.disruptor.MultiProducerSequencer.publish(MultiProducerSequencer.java:218)
at com.lmax.disruptor.RingBuffer.translateAndPublish(RingBuffer.java:990)
at com.lmax.disruptor.RingBuffer.tryPublishEvent(RingBuffer.java:538)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.tryEnqueue(AsyncLoggerConfigDisruptor.java:392)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncDelegate(AsyncLoggerConfig.java:135)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:116)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:460)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:82)
at org.apache.logging.log4j.core.Logger.log(Logger.java:162)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2190)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2144)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2127)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2003)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1975)
at org.apache.logging.log4j.spi.AbstractLogger.warn(AbstractLogger.java:2651)
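There is no accepted fix in this thread. For reference, the contention in the trace above is on the ReentrantLock inside the Disruptor's TimeoutBlockingWaitStrategy, the default wait strategy used by AsyncLoggerConfig. The sketch below only illustrates the knob that selects a different wait strategy; whether that avoids the hang in this WebSphere setup is an assumption that would need to be verified (in an application server the property would normally be passed as a -D JVM argument rather than set in code).

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Hedged sketch (assumption, not a confirmed fix): pick a lock-free wait
// strategy for AsyncLoggerConfig before Log4j initializes. With "Sleep" (or
// "Yield") the Disruptor's signalAllWhenBlocking() is a no-op, so publishing
// threads do not contend on the ReentrantLock shown in the dump above.
public final class Log4jWaitStrategyConfig {
    public static void main(String[] args) {
        // Equivalent JVM argument: -Dlog4j2.asyncLoggerConfigWaitStrategy=Sleep
        System.setProperty("log4j2.asyncLoggerConfigWaitStrategy", "Sleep");
        Logger log = LogManager.getLogger(Log4jWaitStrategyConfig.class);
        log.info("Log4j initialized with the Sleep wait strategy");
    }
}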

Related

Play Framework 2.7 with Scala going down

I have a Play Scala application running on Play 2.7. It is used as middleware for our frontend and exposes REST endpoints.
I am running two instances in the cloud behind nginx, which load-balances them with round robin.
The problem is that the servers go down quite often, about 3 times a day, and, interestingly, both servers go down at the same time. The logs show an out-of-memory error on both servers. I tried to capture a Java heap dump on out of memory but got no dump. I am still analysing the thread dump to figure out the actual cause, but what puzzles me is why both servers go down at the same time.
In the thread dump I see 7707 threads in the sleeping state. Here is one:
"Connection evictor" #146 daemon prio=5 os_prio=0 cpu=2.33ms elapsed=1822.02s tid=0x00007f8a840c4800 nid=0x194 waiting on condition [0x00007f8a58a5e000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base#11/Native Method)
at org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
at java.lang.Thread.run(java.base#11/Thread.java:834)
This is what I see when the server goes down:
[35966.967s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
Uncaught error from thread [application-akka.actor.default-dispatcher-1398Uncaught error from thread [application-akka.actor.default-dispatcher-1395]: ]: unable to create native thread: possibly out of memory or process/resource limits reachedunable to create native thread: possibly out of memory or process/resource limits reached, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[ ActorSystem[applicationapplication]
]
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
at org.apache.http.impl.client.IdleConnectionEvictor.start(IdleConnectionEvictor.java:96)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:1219)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:287)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:298)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:236)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:223)
at org.apache.solr.client.solrj.impl.HttpSolrClient.<init>(HttpSolrClient.java:198)
at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:934)
at com.github.takezoe.solr.scala.SolrClient$.$anonfun$$lessinit$greater$default$2$1(SolrClient.scala:11)
at com.github.takezoe.solr.scala.SolrClient.<init>(SolrClient.scala:14)
at service.tvt.solr.SolrPolygonService.getSuburbBoundary(SolrPolygonService.scala:212)
at service.tvt.search.OrbigoSearchService.mapfeeder(OrbigoSearchService.scala:54)
at service.bto.business_categories.MeedssCountService.$anonfun$suburbMeedssCount$2(MeedssCountService.scala:81)
at scala.collection.immutable.List.map(List.scala:287)
at service.bto.business_categories.MeedssCountService.suburbMeedssCount(MeedssCountService.scala:80)
at controllers.bto.industry_categories.meedss.MeedssController.$anonfun$suburbMeedssCount$1(MeedssController.scala:38)
at play.api.mvc.ActionBuilder.$anonfun$apply$11(Action.scala:368)
at scala.Function1.$anonfun$andThen$1(Function1.scala:52)
at play.api.mvc.ActionBuilderImpl.invokeBlock(Action.scala:489)
at play.api.mvc.ActionBuilderImpl.invokeBlock(Action.scala:487)
at play.api.mvc.ActionBuilder$$anon$9.invokeBlock(Action.scala:336)
at play.api.mvc.ActionBuilder$$anon$9.invokeBlock(Action.scala:331)
at play.api.mvc.ActionBuilder$$anon$10.apply(Action.scala:426)
at play.api.mvc.Action.$anonfun$apply$2(Action.scala:98)
at play.api.libs.streams.StrictAccumulator.$anonfun$mapFuture$4(Accumulator.scala:184)
at scala.util.Try$.apply(Try.scala:209)
at play.api.libs.streams.StrictAccumulator.$anonfun$mapFuture$3(Accumulator.scala:184)
at akka.stream.impl.Transform.apply(TraversalBuilder.scala:159)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:515)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:450)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:443)
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:629)
at play.api.libs.streams.Accumulator$.$anonfun$futureToSink$2(Accumulator.scala:262)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:303)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:72)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
at scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallback(Promise.scala:312)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:303)
at scala.concurrent.impl.Promise.transformWith(Promise.scala:36)
at scala.concurrent.impl.Promise.transformWith$(Promise.scala:34)
at scala.concurrent.impl.Promise$DefaultPromise.transformWith(Promise.scala:183)
at scala.concurrent.Future.flatMap(Future.scala:302)
at scala.concurrent.Future.flatMap$(Future.scala:302)
at scala.concurrent.impl.Promise$DefaultPromise.flatMap(Promise.scala:183)
at play.api.libs.streams.Accumulator$.$anonfun$futureToSink$1(Accumulator.scala:261)
at akka.stream.impl.Transform.apply(TraversalBuilder.scala:159)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:515)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:450)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:443)
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:629)
at play.api.libs.streams.SinkAccumulator.run(Accumulator.scala:144)
at play.api.libs.streams.SinkAccumulator.run(Accumulator.scala:148)
at play.core.server.AkkaHttpServer.$anonfun$runAction$4(AkkaHttpServer.scala:441)
at akka.http.scaladsl.util.FastFuture$.strictTransform$1(FastFuture.scala:41)
at akka.http.scaladsl.util.FastFuture$.$anonfun$transformWith$3(FastFuture.scala:51)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Any quick pointers would be really helpful.
Levi Ramsey was right: it was because of the takezoe library we were using. We were creating a client for every new request and never closing it. We finally created a connection pool with a limited number of active connections and it worked.
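A minimal sketch of the fix described above, written against SolrJ's HttpSolrClient directly (the actual fix used the takezoe Scala wrapper; the URL and class names here are illustrative assumptions). Each client built via HttpClientBuilder starts its own IdleConnectionEvictor thread, which is exactly the thread type flooding the dump, so the client should be shared and eventually closed.

import java.io.IOException;
import java.util.Map;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.MapSolrParams;

// Hedged sketch: one shared client for the whole application instead of one
// per request, and an explicit close() on shutdown so the evictor thread and
// pooled connections are released.
public final class SolrClientHolder {
    private static final HttpSolrClient CLIENT =
            new HttpSolrClient.Builder("http://localhost:8983/solr/suburbs").build();

    public static QueryResponse query(String q) throws SolrServerException, IOException {
        return CLIENT.query(new MapSolrParams(Map.of("q", q)));
    }

    public static void shutdown() throws IOException {
        CLIENT.close(); // stops the IdleConnectionEvictor thread
    }
}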

Hazelcast Jet stuck on starting Job

I've encountered strange behaviour in Hazelcast Jet. I'm starting many jobs at once (~30, some are triggered slightly before others). However, when my Hazelcast Jet job count hits 26 (magic number?), all processing gets stuck.
In the thread dumps I see the following:
"hz._hzInstance_1_jet.cached.thread-1" #37 prio=5 os_prio=0 cpu=1093.29ms elapsed=393.62s tid=0x00007f95dc007000 nid=0x6bfc in Object.wait() [0x00007f95e6af4000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base#11.0.2/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x00000007864b7040> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:1016)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:109)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:698)
at com.hazelcast.jet.impl.JobRepository.clearSnapshotData(JobRepository.java:464)
at com.hazelcast.jet.impl.MasterJobContext.tryStartJob(MasterJobContext.java:233)
at com.hazelcast.jet.impl.JobCoordinationService.tryStartJob(JobCoordinationService.java:776)
at com.hazelcast.jet.impl.JobCoordinationService.lambda$submitJob$0(JobCoordinationService.java:200)
at com.hazelcast.jet.impl.JobCoordinationService$$Lambda$634/0x00000008009ce840.run(Unknown Source)
and also:
"hz._hzInstance_1_jet.async.thread-2" #81 prio=5 os_prio=0 cpu=0.00ms elapsed=661.98s tid=0x0000025bb23ef000 nid=0x43bc in Object.wait() [0x0000005d492fe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base#11/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x0000000725600100> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.removeAllInternal(MapProxySupport.java:619)
at com.hazelcast.map.impl.proxy.MapProxyImpl.removeAll(MapProxyImpl.java:285)
at com.hazelcast.jet.impl.JobRepository.deleteJob(JobRepository.java:332)
at com.hazelcast.jet.impl.JobRepository.completeJob(JobRepository.java:316)
at com.hazelcast.jet.impl.JobCoordinationService.completeJob(JobCoordinationService.java:576)
at com.hazelcast.jet.impl.MasterJobContext.lambda$finalizeJob$13(MasterJobContext.java:620)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda$783/0x0000000800b26840.run(Unknown Source)
at com.hazelcast.jet.impl.MasterJobContext.finalizeJob(MasterJobContext.java:632)
at com.hazelcast.jet.impl.MasterJobContext.onCompleteExecutionCompleted(MasterJobContext.java:564)
at com.hazelcast.jet.impl.MasterJobContext.lambda$invokeCompleteExecution$6(MasterJobContext.java:544)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda$779/0x0000000800b27840.accept(Unknown Source)
at com.hazelcast.jet.impl.MasterContext.lambda$invokeOnParticipants$0(MasterContext.java:242)
at com.hazelcast.jet.impl.MasterContext$$Lambda$726/0x0000000800a1c040.accept(Unknown Source)
at com.hazelcast.jet.impl.util.Util$2.onResponse(Util.java:172)
at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base#11/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base#11/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base#11/Thread.java:834)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
I don't have any other info on how to reproduce this issue; however, I hope someone will know how to fix it, or that my question will help someone else :)
My setup:
- Java 11
- Hazelcast 3.12 Snapshot
- Hazelcast Jet 3.0 Snapshot (I can't revert to previous version, it will break my logic; I need n:m joins which will be added in 3.1)
- CPU cores: 4
- RAM: 7 GB
- Jet mode: server, connects to other cluster as a client to insert final data.
Did anyone encounter a similar issue? The problem is that it cannot easily be reproduced, so it's hard to create an issue for the Hazelcast team. Only thread dumps and general behaviour can give a hint of what is going on.
This was an issue in 3.0-SNAPSHOT during development and was fixed in the 3.0 release.

When using an ExecutorService, is it possible for a Thread to become uninterruptible when in the WAITING state?

I'm running into an issue where it seems as if a Thread has become uninterruptible while in a WAITING state. The task thread itself (as you'll see in the stack of the thread below) is waiting for a call to FuturePromise.get() (a Jetty class).
The task thread is being executed in the context of an ExecutorService. Below is what the ExecutorService invocation looks like (simplified).
ExecutorService es = Executors.newFixedThreadPool(8, new CustomizableThreadFactory("Test-Thread-"));
es.submit(taskToRun);
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
es.shutdownNow();
String result = taskToRun.get();
What I'm seeing is the main thread gets stuck at taskToRun.get() waiting for the task to complete/be interrupted while the thread running the task sits in this state:
"Test-Thread-1"
#135245 prio=5 os_prio=0 tid=0x00007f972c0c2000 nid=0x6838 waiting on condition [0x00007f96ca165000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005d00d1118> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at org.eclipse.jetty.util.FuturePromise.get(FuturePromise.java:118)
... + app code
What I'm expecting to happen is that Test-Thread-1 will be interrupted, which will throw an InterruptedException, and the taskToRun.get() call will also throw an InterruptedException.
Unfortunately I've been unable to reproduce this problem with a unit test, but will update as I get more info.
After upgrading to Jetty 9.4.6, this no longer seems to be a problem. I did come across some Jetty code that clears the interrupted state of threads, but it's not clear whether or not that was the actual cause.
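Not the actual Jetty code, but a minimal, self-contained sketch (assumptions: the task is a FutureTask and the latch is never released) of how swallowing or clearing the interrupt reproduces the behaviour described above: shutdownNow() interrupts the worker, the task retries its wait, and get() never returns.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;

public class SwallowedInterruptDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1); // never counted down
        FutureTask<String> taskToRun = new FutureTask<>(() -> {
            while (true) {
                try {
                    latch.await();            // parks, like FuturePromise.get()
                    return "done";
                } catch (InterruptedException e) {
                    // The interrupt is swallowed and the wait is retried, so
                    // the interrupt sent by shutdownNow() has no lasting effect.
                }
            }
        });

        ExecutorService es = Executors.newFixedThreadPool(1);
        es.submit(taskToRun);
        es.shutdown();
        es.awaitTermination(2, TimeUnit.SECONDS);
        es.shutdownNow();                     // interrupts the worker thread...
        System.out.println("taskToRun.get() would now block forever");
        // taskToRun.get();                   // ...but this call would still hang
        // Note: the JVM itself will not exit, because the non-daemon worker
        // thread never terminates. That is the hang being demonstrated.
    }
}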

Hazelcast stuck in TIMED_WAITING when using 2nd-level cache

I am using Hazelcast 3.2.6 as the second-level cache for Hibernate. The cluster has 4 servers, with multiple read/update/delete operations being performed on the DB. It was running fine for quite some time; suddenly all the threads trying to perform DB operations are stuck. Following is an extract from the thread dump; there are no exceptions being printed.
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.pollResponse(BasicInvocation.java:767)
- locked <0x0000000665956110> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:719)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.getSafely(BasicInvocation.java:689)
at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:80)
at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:74)
at com.hazelcast.concurrent.lock.LockProxy.lock(LockProxy.java:70)
at com.xxx.database.ccsecure.persistance.impl.DataStore.get(DataStore.java:120)
Apparently the invocation doesn't get a result, which means the invocation future is not going to complete. The big question is: why does the operation not get a response to its request?
Do you know which operation it is?
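No root cause is recorded in this thread. Purely as an illustration (an assumption, not a fix for the missing response), application code like the DataStore.get shown in the trace can bound how long it blocks by using tryLock with a timeout instead of lock(); the lock name below is hypothetical.

import java.util.concurrent.TimeUnit;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;

// Hedged sketch against the Hazelcast 3.x API: a bounded lock acquisition so
// caller threads do not park indefinitely when the cluster never answers.
// This only limits the damage; it does not explain why the response is missing.
public class BoundedLockSketch {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        ILock lock = hz.getLock("source-check"); // hypothetical lock name
        if (lock.tryLock(10, TimeUnit.SECONDS)) {
            try {
                // ... read or update the cached entry ...
            } finally {
                lock.unlock();
            }
        } else {
            System.err.println("could not acquire lock within 10 seconds");
        }
        hz.shutdown();
    }
}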

Thread waits in log4j

We are seeing blocked threads (please find the thread dump below). Can you suggest why we are getting them?
Note that we use Java 1.5, WebLogic 9.1, and log4j 1.2.8.
[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=2 tid=0x01d332b0 nid=0x23 waiting for monitor entry [0x5fffd000..0x5ffffb10]
at org.apache.log4j.Category.callAppenders(Category.java:185)
- waiting to lock <0x7c669620> (a org.apache.log4j.spi.RootCategory)
at org.apache.log4j.Category.forcedLog(Category.java:372)
at org.apache.log4j.Category.log(Category.java:864)
at org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:110)
at org.hibernate.loader.Loader.doQuery(Loader.java:687)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:224)
at org.hibernate.loader.Loader.doList(Loader.java:2150)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2029)
at org.hibernate.loader.Loader.list(Loader.java:2024)
at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:369)
at org.hibernate.hql.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:300)
at org.hibernate.engine.query.HQLQueryPlan.performList(HQLQueryPlan.java:146)
at org.hibernate.impl.SessionImpl.list(SessionImpl.java:1093)
at org.hibernate.impl.QueryImpl.list(QueryImpl.java:79)
at com.lks.myapp.data.dao.SourceCheckImpl.getSources(SourceCheckImpl.java:87)
Switch to Logback for high-performance logging; log4j 1.x has performance issues here (the dump shows the thread waiting to lock the RootCategory monitor in Category.callAppenders). We have done the same in one of our products.
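As a minimal sketch of what the suggested switch looks like at the call site, assuming the code moves to the SLF4J API with Logback on the classpath (the class and method names are reused from the dump purely for illustration):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hedged sketch: log through SLF4J so Logback handles output. Unlike log4j
// 1.2, which serializes callers on the category monitor in
// Category.callAppenders, Logback appenders manage their own locking.
public class SourceCheckImpl {
    private static final Logger log = LoggerFactory.getLogger(SourceCheckImpl.class);

    public java.util.List<String> getSources() {
        log.debug("loading sources"); // replaces the commons-logging / log4j 1.2 call
        return java.util.Collections.emptyList();
    }
}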
