Spark reading HDFS slowly after running for several hours - apache-spark

We have a Spark Streaming application; in each interval it reads a Hive table and performs some aggregation. Usually it is very fast and finishes in 20 seconds. However, after running for several hours, one executor becomes very slow while the other executors remain as fast as normal. I checked the slow server: the disk and network show no problems, and the executors of other applications on the same server are all fine.
The CPU usage of the slow executor is 100%. I took a thread dump of the slow executor; it is reading HDFS (stack traces below). The data being read is 128 MB, which should be done in a few seconds, but on this slow executor it takes 90+ seconds.
Any ideas? Thanks in advance.
2017-06-15 15:04:04
"Executor task launch worker-172" daemon prio=10 tid=0x00007fb308001000 nid=0x3966 runnable [0x00007fb338a4c000]
java.lang.Thread.State: RUNNABLE
at sun.nio.cs.UTF_8$Decoder.decodeArrayLoop(UTF_8.java:211)
at sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:354)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:561)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:783)
at org.apache.hadoop.io.Text.decode(Text.java:405)
at org.apache.hadoop.io.Text.decode(Text.java:382)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at org.apache.spark.sql.hive.HiveInspectors$class.unwrap(HiveInspectors.scala:322)
at org.apache.spark.sql.hive.HadoopTableReader$.unwrap(TableReader.scala:315)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$14.apply(TableReader.scala:401)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$14.apply(TableReader.scala:401)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:416)
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
2017-06-15 15:04:07
"Executor task launch worker-172" daemon prio=10 tid=0x00007fb308001000 nid=0x3966 runnable [0x00007fb338a4c000]
java.lang.Thread.State: RUNNABLE
at sun.misc.Unsafe.getByte(Native Method)
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:249)
at java.nio.ByteBuffer.get(ByteBuffer.java:673)
at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
at java.nio.ByteBuffer.get(ByteBuffer.java:694)
at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:304)
at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:203)
at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:138)
- locked <0x00000007bed42868> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:683)
at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:739)
- locked <0x00000007bed42808> (a org.apache.hadoop.hdfs.DFSInputStream)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:796)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
- locked <0x00000007bed42808> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
- locked <0x00000007bed41c58> (a org.apache.hadoop.mapred.LineRecordReader)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:249)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
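Both dumps show RUNNABLE frames doing per-record work (UTF-8 decoding of Text fields, then HDFS checksum verification in DataChecksum.verifyChunkedSums), which fits the 100% CPU observation: the executor is busy computing, not blocked on I/O. While the root cause is being diagnosed, a common mitigation for a single straggler executor is Spark's speculative execution, which re-launches suspiciously slow tasks on other executors. A minimal sketch (these are standard Spark configuration keys; the threshold values are illustrative, not from the original post):

import org.apache.spark.SparkConf;

// With speculation enabled, Spark re-launches tasks that run far slower
// than their peers, so one bad executor cannot stall the whole interval.
SparkConf conf = new SparkConf()
        .set("spark.speculation", "true")            // enable speculative re-execution
        .set("spark.speculation.multiplier", "2")    // "slow" = 2x the median task time
        .set("spark.speculation.quantile", "0.75");  // check once 75% of tasks have finished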

Related

Play Framework 2.7 with Scala going down

I have a Play Scala application running on Play 2.7. It is used as middleware for our frontend and exposes REST endpoints.
I am running two instances in the cloud behind nginx, which load-balances between them round robin.
The problem is that the servers go down quite often, about 3 times a day, and interestingly both servers go down at the same time. The logs show an out of memory exception on both servers. I tried to produce a Java heap dump for the OOM but got no dump. I am still analysing the thread dump to figure out the actual cause of the servers going down, but what puzzles me is why both servers go down at the same time.
In the thread dump I see 7707 threads in the sleeping state; here is one:
"Connection evictor" #146 daemon prio=5 os_prio=0 cpu=2.33ms elapsed=1822.02s tid=0x00007f8a840c4800 nid=0x194 waiting on condition [0x00007f8a58a5e000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(java.base@11/Native Method)
at org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
at java.lang.Thread.run(java.base@11/Thread.java:834)
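One way to see at a glance where 7707 sleeping threads come from (a hypothetical diagnostic added here for illustration, not part of the original post) is to tally live threads by name prefix:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

// Count live threads by name prefix; a huge count for one prefix, e.g.
// "Connection evictor", points straight at the leaking pool.
Map<String, Integer> counts = new TreeMap<>();
for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
    String prefix = info.getThreadName().replaceAll("[-#0-9]+$", "");
    counts.merge(prefix, 1, Integer::sum);
}
counts.forEach((name, n) -> System.out.println(n + "\t" + name));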
This is what I see when the server goes down:
[35966.967s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
Uncaught error from thread [application-akka.actor.default-dispatcher-1398]: unable to create native thread: possibly out of memory or process/resource limits reached, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[application]
Uncaught error from thread [application-akka.actor.default-dispatcher-1395]: unable to create native thread: possibly out of memory or process/resource limits reached, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[application]
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:803)
at org.apache.http.impl.client.IdleConnectionEvictor.start(IdleConnectionEvictor.java:96)
at org.apache.http.impl.client.HttpClientBuilder.build(HttpClientBuilder.java:1219)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:287)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:298)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:236)
at org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:223)
at org.apache.solr.client.solrj.impl.HttpSolrClient.<init>(HttpSolrClient.java:198)
at org.apache.solr.client.solrj.impl.HttpSolrClient$Builder.build(HttpSolrClient.java:934)
at com.github.takezoe.solr.scala.SolrClient$.$anonfun$$lessinit$greater$default$2$1(SolrClient.scala:11)
at com.github.takezoe.solr.scala.SolrClient.<init>(SolrClient.scala:14)
at service.tvt.solr.SolrPolygonService.getSuburbBoundary(SolrPolygonService.scala:212)
at service.tvt.search.OrbigoSearchService.mapfeeder(OrbigoSearchService.scala:54)
at service.bto.business_categories.MeedssCountService.$anonfun$suburbMeedssCount$2(MeedssCountService.scala:81)
at scala.collection.immutable.List.map(List.scala:287)
at service.bto.business_categories.MeedssCountService.suburbMeedssCount(MeedssCountService.scala:80)
at controllers.bto.industry_categories.meedss.MeedssController.$anonfun$suburbMeedssCount$1(MeedssController.scala:38)
at play.api.mvc.ActionBuilder.$anonfun$apply$11(Action.scala:368)
at scala.Function1.$anonfun$andThen$1(Function1.scala:52)
at play.api.mvc.ActionBuilderImpl.invokeBlock(Action.scala:489)
at play.api.mvc.ActionBuilderImpl.invokeBlock(Action.scala:487)
at play.api.mvc.ActionBuilder$$anon$9.invokeBlock(Action.scala:336)
at play.api.mvc.ActionBuilder$$anon$9.invokeBlock(Action.scala:331)
at play.api.mvc.ActionBuilder$$anon$10.apply(Action.scala:426)
at play.api.mvc.Action.$anonfun$apply$2(Action.scala:98)
at play.api.libs.streams.StrictAccumulator.$anonfun$mapFuture$4(Accumulator.scala:184)
at scala.util.Try$.apply(Try.scala:209)
at play.api.libs.streams.StrictAccumulator.$anonfun$mapFuture$3(Accumulator.scala:184)
at akka.stream.impl.Transform.apply(TraversalBuilder.scala:159)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:515)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:450)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:443)
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:629)
at play.api.libs.streams.Accumulator$.$anonfun$futureToSink$2(Accumulator.scala:262)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:303)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:72)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
at scala.concurrent.impl.Promise$DefaultPromise.dispatchOrAddCallback(Promise.scala:312)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:303)
at scala.concurrent.impl.Promise.transformWith(Promise.scala:36)
at scala.concurrent.impl.Promise.transformWith$(Promise.scala:34)
at scala.concurrent.impl.Promise$DefaultPromise.transformWith(Promise.scala:183)
at scala.concurrent.Future.flatMap(Future.scala:302)
at scala.concurrent.Future.flatMap$(Future.scala:302)
at scala.concurrent.impl.Promise$DefaultPromise.flatMap(Promise.scala:183)
at play.api.libs.streams.Accumulator$.$anonfun$futureToSink$1(Accumulator.scala:261)
at akka.stream.impl.Transform.apply(TraversalBuilder.scala:159)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:515)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:450)
at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:443)
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:629)
at play.api.libs.streams.SinkAccumulator.run(Accumulator.scala:144)
at play.api.libs.streams.SinkAccumulator.run(Accumulator.scala:148)
at play.core.server.AkkaHttpServer.$anonfun$runAction$4(AkkaHttpServer.scala:441)
at akka.http.scaladsl.util.FastFuture$.strictTransform$1(FastFuture.scala:41)
at akka.http.scaladsl.util.FastFuture$.$anonfun$transformWith$3(FastFuture.scala:51)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Any quick pointers would be really helpful.
Levi Ramsey was right: it was because of the TakeZoe lib we were using. We were creating a client for every new request and never closing it. We fixed it by creating a connection pool with a limited number of active connections, and that worked.
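A minimal sketch of that fix, assuming SolrJ's HttpSolrClient as seen in the stack trace (the holder class name and URL below are illustrative): build one client at startup, share it across requests, and close it on shutdown, instead of calling the builder per request.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// One shared client per application, not one per request: every build()
// otherwise starts its own connection manager plus a "Connection evictor" thread.
public final class SolrClientHolder {
    private static final SolrClient CLIENT =
            new HttpSolrClient.Builder("http://solr-host:8983/solr").build();

    public static SolrClient get() {
        return CLIENT;
    }

    public static void shutdown() throws java.io.IOException {
        CLIENT.close(); // releases the pool and stops the evictor thread
    }
}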

Hazelcast Jet stuck on starting Job

I've encountered strange behaviour in Hazelcast Jet. I'm starting many jobs at once (~30, some triggered slightly before others). However, when my Hazelcast Jet job count hits 26 (magic number?), all processing gets stuck.
In the thread dumps I see the following info:
"hz._hzInstance_1_jet.cached.thread-1" #37 prio=5 os_prio=0 cpu=1093.29ms elapsed=393.62s tid=0x00007f95dc007000 nid=0x6bfc in Object.wait() [0x00007f95e6af4000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.2/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x00000007864b7040> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:1016)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:109)
at com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:698)
at com.hazelcast.jet.impl.JobRepository.clearSnapshotData(JobRepository.java:464)
at com.hazelcast.jet.impl.MasterJobContext.tryStartJob(MasterJobContext.java:233)
at com.hazelcast.jet.impl.JobCoordinationService.tryStartJob(JobCoordinationService.java:776)
at com.hazelcast.jet.impl.JobCoordinationService.lambda$submitJob$0(JobCoordinationService.java:200)
at com.hazelcast.jet.impl.JobCoordinationService$$Lambda$634/0x00000008009ce840.run(Unknown Source)
and also:
"hz._hzInstance_1_jet.async.thread-2" #81 prio=5 os_prio=0 cpu=0.00ms elapsed=661.98s tid=0x0000025bb23ef000 nid=0x43bc in Object.wait() [0x0000005d492fe000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(java.base@11/Native Method)
- waiting on <no object reference available>
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:229)
- waiting to re-lock in wait() <0x0000000725600100> (a com.hazelcast.internal.util.SimpleCompletableFuture)
at com.hazelcast.spi.impl.AbstractCompletableFuture.get(AbstractCompletableFuture.java:191)
at com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:88)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:385)
at com.hazelcast.map.impl.proxy.MapProxySupport.removeAllInternal(MapProxySupport.java:619)
at com.hazelcast.map.impl.proxy.MapProxyImpl.removeAll(MapProxyImpl.java:285)
at com.hazelcast.jet.impl.JobRepository.deleteJob(JobRepository.java:332)
at com.hazelcast.jet.impl.JobRepository.completeJob(JobRepository.java:316)
at com.hazelcast.jet.impl.JobCoordinationService.completeJob(JobCoordinationService.java:576)
at com.hazelcast.jet.impl.MasterJobContext.lambda$finalizeJob$13(MasterJobContext.java:620)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda$783/0x0000000800b26840.run(Unknown Source)
at com.hazelcast.jet.impl.MasterJobContext.finalizeJob(MasterJobContext.java:632)
at com.hazelcast.jet.impl.MasterJobContext.onCompleteExecutionCompleted(MasterJobContext.java:564)
at com.hazelcast.jet.impl.MasterJobContext.lambda$invokeCompleteExecution$6(MasterJobContext.java:544)
at com.hazelcast.jet.impl.MasterJobContext$$Lambda$779/0x0000000800b27840.accept(Unknown Source)
at com.hazelcast.jet.impl.MasterContext.lambda$invokeOnParticipants$0(MasterContext.java:242)
at com.hazelcast.jet.impl.MasterContext$$Lambda$726/0x0000000800a1c040.accept(Unknown Source)
at com.hazelcast.jet.impl.util.Util$2.onResponse(Util.java:172)
at com.hazelcast.spi.impl.AbstractInvocationFuture$1.run(AbstractInvocationFuture.java:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11/ThreadPoolExecutor.java:628)
at java.lang.Thread.run(java.base@11/Thread.java:834)
at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
I don't have any other info on how to reproduce this issue, but I hope someone will know how to fix it, or that my question will help someone else :)
My setup:
- Java 11
- Hazelcast 3.12 Snapshot
- Hazelcast Jet 3.0 Snapshot (I can't revert to the previous version; it would break my logic, since I need n:m joins, which will be added in 3.1)
- CPU cores: 4
- RAM: 7 GB
- Jet mode: server; it connects to another cluster as a client to insert the final data.
Did anyone encounter a similar issue? The problem is that it cannot be reproduced simply, so it's hard to create an issue for the Hazelcast team; only thread dumps and general behaviour give a hint of what is going on.
This was an issue in 3.0-SNAPSHOT during development and was fixed in the 3.0 release.
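For anyone who cannot upgrade immediately, a hypothetical way to avoid ~30 simultaneous submissions is to cap the number of in-flight jobs. This sketch is mine, not from the thread; it assumes an existing jet (a JetInstance) and pipelines (a collection of Pipeline), and uses only the standard newJob/join calls:

import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;
import com.hazelcast.jet.pipeline.Pipeline;
import java.util.concurrent.Semaphore;

// Throttle: at most 8 jobs in flight at a time (the cap is illustrative).
Semaphore slots = new Semaphore(8);
for (Pipeline pipeline : pipelines) {
    slots.acquire(); // blocks until a slot frees up; may throw InterruptedException
    Job job = jet.newJob(pipeline);
    new Thread(() -> {
        try {
            job.join(); // wait for completion (or failure)
        } finally {
            slots.release(); // free the slot for the next submission
        }
    }).start();
}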

When using an ExecutorService, is it possible for a Thread to become uninterruptible while in the WAITING state?

I'm running into an issue where a Thread seems to have become uninterruptible while in the WAITING state. The task thread itself (as you'll see in its stack below) is waiting on a call to FuturePromise.get() (a Jetty class).
The task thread is executed in the context of an ExecutorService. Below is what the ExecutorService invocation looks like (simplified).
ExecutorService es = Executors.newFixedThreadPool(8, new CustomizableThreadFactory("Test-Thread-"));
es.submit(taskToRun);
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
es.shutdownNow();
String result = taskToRun.get();
What I'm seeing is that the main thread gets stuck at taskToRun.get(), waiting for the task to complete or be interrupted, while the thread running the task sits in this state:
"Test-Thread-1"
#135245 prio=5 os_prio=0 tid=0x00007f972c0c2000 nid=0x6838 waiting on condition [0x00007f96ca165000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005d00d1118> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at org.eclipse.jetty.util.FuturePromise.get(FuturePromise.java:118)
... + app code
What I'm expecting to happen is that Test-Thread-1 will be interrupted, throwing an InterruptedException, and that the taskToRun.get() call will also throw an InterruptedException.
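For reference, the normally interruptible behaviour of a thread parked in CountDownLatch.await() can be demonstrated with a minimal self-contained sketch (added here for illustration; this is not the poster's code):

import java.util.concurrent.CountDownLatch;

// Interrupting a thread parked in await() should unblock it immediately
// with an InterruptedException, which is exactly what did not happen above.
public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch latch = new CountDownLatch(1); // never counted down
        Thread waiter = new Thread(() -> {
            try {
                latch.await(); // parks, like FuturePromise.get() in the dump
            } catch (InterruptedException e) {
                System.out.println("interrupted, as expected");
            }
        }, "Test-Thread-1");
        waiter.start();
        Thread.sleep(100);  // give the waiter time to park
        waiter.interrupt(); // should print "interrupted, as expected"
        waiter.join();
    }
}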
Unfortunately I've been unable to reproduce this problem with a unit test, but I will update as I get more info.
After upgrading to Jetty 9.4.6, this no longer seems to be a problem. I did come across some Jetty code that clears the interrupted state of threads, but it's not clear whether that was the actual cause.

Elasticsearch transport client spawning a large number of threads

From an Apache Storm bolt, I am using Elasticsearch's transport client to connect remotely to an ES cluster. When I take a jstack output of the Storm process, I notice that there are nearly 1000 threads with an ES stack trace like:
"elasticsearch[Flying Tiger][transport_client_worker][T#22]{New I/O worker #269}" daemon prio=10 tid=0x00007f80ac3cb000 nid=0x356b runnable [0x00007f7e96b2a000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000d148d138> (a sun.nio.ch.Util$2)
- locked <0x00000000d148d128> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000d148c9b8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I am using a single instance of the ES transport client across my Storm topology, which has about 18 output streams invoking the ES client to write data to the ES cluster.
Why does the ES transport client spawn so many threads? Is there any way I can tune this? I looked in the ES documentation, but it provides no internal details on the transport client's threading model, nor any option to tune the client's thread count.
I had a very similar experience. As you've mentioned, one transport client creates tens of threads, including timers and so on.
What you have to check is whether there is exactly one transport client per worker process. When I mistakenly used 32 transport clients, there were more than a thousand threads; after I correctly made it a singleton instance, the thread count dropped to fewer than two hundred (including all other threads created in my topology).
You can define the system property "es.processors.override" or the setting "processors", based on the source code of org.elasticsearch.common.util.concurrent.EsExecutors. I tried this method and successfully limited the number of worker threads.
/**
* Settings key to manually set the number of available processors.
* This is used to adjust thread pools sizes etc. per node.
*/
public static final String PROCESSORS = "processors";
/** Useful for testing */
public static final String DEFAULT_SYSPROP = "es.processors.override";
Also, from "limit number of thread in ThreadPool while creating TransportClient in elasticsearch":
Settings settings = ImmutableSettings.settingsBuilder()
        .put("transport.netty.workerCount", NUM_THREADS)
        .build();
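Putting the two suggestions together, a rough sketch (ES 1.x-era classes, matching the org.elasticsearch.common.netty packages in the trace; the host name and counts are illustrative):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// One client per worker process, with internal pool sizing capped explicitly
// instead of defaulting to the number of available cores.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("processors", 4)                  // caps sizing of internal thread pools
        .put("transport.netty.workerCount", 4) // caps netty I/O worker threads
        .build();
TransportClient client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));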

Hazelcast stuck in TIMED_WAITING when using 2nd-level cache

I am using Hazelcast 3.2.6 as a second-level cache for Hibernate. The cluster has 4 servers, with multiple read/update/delete operations being performed on the DB. It ran fine for quite some time; suddenly all the threads trying to perform DB operations are stuck. Following is an extract from the thread dump; no exceptions are being printed.
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.pollResponse(BasicInvocation.java:767)
- locked <0x0000000665956110> (a com.hazelcast.spi.impl.BasicInvocation$InvocationFuture)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.waitForResponse(BasicInvocation.java:719)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:697)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.get(BasicInvocation.java:676)
at com.hazelcast.spi.impl.BasicInvocation$InvocationFuture.getSafely(BasicInvocation.java:689)
at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:80)
at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:74)
at com.hazelcast.concurrent.lock.LockProxy.lock(LockProxy.java:70)
at com.xxx.database.ccsecure.persistance.impl.DataStore.get(DataStore.java:120)
Apparently the invocation doesn't get a result, which means the invocation future is not going to complete. The big question is: why does the operation not get a response to its request?
Do you know which operation it is?
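If it turns out to be the distributed lock acquired in DataStore.get, one defensive option (a hedged sketch, not from the thread; the key name is illustrative) is a timed tryLock, so a lost response surfaces as a timeout rather than a thread parked forever in pollResponse():

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ILock;
import java.util.concurrent.TimeUnit;

// hazelcast is an existing HazelcastInstance; acquire with a deadline.
// Note: tryLock may throw InterruptedException.
ILock lock = hazelcast.getLock("datastore-key");
if (lock.tryLock(30, TimeUnit.SECONDS)) {
    try {
        // ... perform the guarded DB read ...
    } finally {
        lock.unlock();
    }
} else {
    // timed out: log and fail fast instead of blocking the pool
}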
