AbstractNioSelector Unexpected exception in the selector loop - memory-leaks

I'm running a WebSocket server with Scala Play Framework 2.4.6 and Netty 3.10.6.Final in production, on Oracle JDK 8.
Usually the server works fine, but under high load it fires exceptions and starts leaking Linux sockets.
2017-05-16 15:18:28,149 WARN o.j.n.c.s.n.AbstractNioSelector Unexpected exception in the selector loop.
java.lang.NullPointerException: null
at sun.nio.ch.EPollArrayWrapper.isEventsHighKilled(EPollArrayWrapper.java:174) ~[na:1.8.0_131]
at sun.nio.ch.EPollArrayWrapper.setUpdateEvents(EPollArrayWrapper.java:190) ~[na:1.8.0_131]
at sun.nio.ch.EPollArrayWrapper.add(EPollArrayWrapper.java:239) ~[na:1.8.0_131]
at sun.nio.ch.EPollSelectorImpl.implRegister(EPollSelectorImpl.java:178) ~[na:1.8.0_131]
at sun.nio.ch.SelectorImpl.register(SelectorImpl.java:132) ~[na:1.8.0_131]
at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:212) ~[na:1.8.0_131]
at org.jboss.netty.channel.socket.nio.NioWorker$RegisterTask.run(NioWorker.java:151) ~[io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391) ~[io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315) ~[io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [io.netty.netty-3.10.4.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [io.netty.netty-3.10.4.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
The problem happens when 1800 users connect to the server; after that, memory usage and the number of open Linux sockets keep growing. It's really hard to debug, and I couldn't find any socket-leak issue mentioned in the Play Framework release notes.
I saw a similar question on SO, but I don't manage server sockets manually.

This looks like a Java bug. Please open an issue on the OpenJDK issue tracker and include the stack trace.

This may be JDK-8168500, fixed in JDK 9. Can you say what the kernel version is here? When this issue originally came up, it could only be reproduced by dynamically changing the file descriptor limit.

I filed an issue, and Oracle responded that this exception is likely already fixed in JDK 9 (as JDK-8168500).
We were able to work around it by reducing the number of threads for Akka.
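For reference, a minimal sketch of what such a reduction might look like in a Play application.conf, assuming the default Akka dispatcher is what gets tuned; the keys are standard Akka fork-join-executor settings, and the values are purely illustrative rather than the ones used above:
akka {
  actor {
    default-dispatcher {
      fork-join-executor {
        # Cap the pool so fewer worker threads are created under load
        parallelism-min = 2
        parallelism-factor = 1.0
        parallelism-max = 8   # illustrative cap only
      }
    }
  }
}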

Related

Apache Pulsar gets timeouts in an unpredictable way

I installed Apache Pulsar standalone. Pulsar times out sometimes. It's not related to high throughput or to a particular topic (see the log below). pulsar-admin brokers healthcheck also returns either OK or a timeout. How can I investigate this?
10:46:46.365 [pulsar-ordered-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.BrokerService - Got exception when reading persistence policy for persistent://nnx/agent_ns/action_up-53da8177-b4b9-4b92-8f75-efe94dc2309d: null
java.util.concurrent.TimeoutException: null
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_232]
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
at org.apache.pulsar.broker.service.BrokerService.lambda$getManagedLedgerConfig$32(BrokerService.java:922) ~[org.apache.pulsar-pulsar-broker-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.mledger.util.SafeRun$2.safeRun(SafeRun.java:49) [org.apache.pulsar-managed-ledger-2.5.0.jar:2.5.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
I am glad you were able to resolve the issue by adding more cores. The issue was a connection timeout while trying to access topic metadata stored inside ZooKeeper, as indicated by the following line in the stack trace:
at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97) ~[org.apache.pulsar-pulsar-zookeeper-utils-2.5.0.jar:2.5.0]
Increasing the cores must have freed up enough threads to allow the ZooKeeper node to respond to this request.
It looks like a connection issue, so check the connection to the server; if you are using a TLS certificate file path, check that you have the right certificate.
The problem is that there aren't many solutions on the internet for Apache Pulsar, but following the Apache Pulsar documentation might help, and there are also the Apache Pulsar GitHub repository and sample projects.
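For reference, the health check mentioned in the question is the standard admin CLI call; assuming the usual standalone layout, it is run as:
bin/pulsar-admin brokers healthcheck
If that call also times out intermittently, it likely points at the same ZooKeeper/broker thread starvation described above rather than at any particular topic.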

How to connect Tableau Desktop to Spark SQL 2.0 through Spark Thrift Server?

I am trying to connect to Spark SQL (Spark 2.0.0) from Tableau Desktop 10.1.1 on OS X. I already have the Simba Spark ODBC driver installed, and the Spark Thrift Server is up and running. I am able to use beeline to connect and verify the Thrift Server.
However, when I configure Tableau using the Spark SQL connector, it does not connect. After some time, the query times out. When I checked the Thrift Server logs, I saw the following message.
16/11/17 17:01:26 ERROR TThreadPoolServer: Error occurred during processing of message.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException: Invalid status -128
at org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more
I tried with Spark 1.6.1 and the outcome is the same. Does anyone have Tableau working with a similar setup? If so, what am I missing here?
When connecting to Spark SQL, choose "Username" authentication instead of "No Authentication". You can leave the username blank.

Spark scheduler thresholds

I'm running an analysis tool on top of Spark that creates plenty of overhead, so computations take a lot more time. When I run it, I get this error:
16/08/30 23:36:37 WARN TransportChannelHandler: Exception in connection from /132.68.60.126:36922
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/08/30 23:36:37 ERROR TaskSchedulerImpl: Lost executor 0 on 132.68.60.126: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
I guess this happens because the scheduler thinks the executor failed, so it starts another one.
The workload is a simple string search (grep), and both master and slave are local, so there aren't supposed to be any failures. When running without the overhead, things are fine.
The question is - can I configure those timeout thresholds somewhere?
Thanks!
Solved it by setting spark.network.timeout 10000000 in spark-defaults.conf.
I was getting the same error even though I tried many things. My job used to get stuck throwing this error after running for a very long time. I tried a few workarounds that helped me resolve it. Although I still get the same error, at least my job runs fine.
One reason could be that the executors kill themselves, thinking that they lost the connection to the master. I added the configurations below to the spark-defaults.conf file:
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I increased the network timeout and the heartbeat interval.
For the particular step that used to get stuck, I just cached the dataframe used for processing in that step.
Note: these are workarounds; I still see the same error in the logs, but my job does not get terminated.
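A minimal Scala sketch of the same workaround, assuming a Spark 2.x job built around a SparkSession; the property names are the standard ones mentioned above, but the timeout values and the input path are illustrative (Spark expects the heartbeat interval to stay below the network timeout):
import org.apache.spark.sql.SparkSession

// Illustrative timeouts only; pick values that fit your job.
val spark = SparkSession.builder()
  .appName("grep-like-search")
  .config("spark.network.timeout", "800s")            // raise the RPC/shuffle timeout
  .config("spark.executor.heartbeatInterval", "60s")  // must stay below spark.network.timeout
  .getOrCreate()

// Cache the dataframe used by the step that tends to get stuck,
// so it is not recomputed while executors are under pressure.
val df = spark.read.textFile("/path/to/input")        // hypothetical input path
df.cache()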

Networking problems with submitting Spark application

I want to submit Spark application from my laptop to a Spark cluster in standalone mode. They are in the same network and have routable IP addresses.
I can get it to work if I disable the firewall on my laptop. But if I have the firewall enabled, it fails while trying to connect to the master node. The following error message seems to indicate that the driver can't get any resources.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Next, I try setting the spark.driver.port in my driver program. I then open up that particular port on the driver machine. This gets past the first error, but gives me a new error.
WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 172.31.51.158): java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:356)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems like the master and driver are able to use spark.driver.port to get something started, but it fails later? It still seems like some kind of networking problem.
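A minimal sketch of the approach described above (pinning the driver-side ports so they can be opened in the laptop's firewall), assuming a Scala driver built on SparkConf; the master URL, IP address and port numbers are placeholders, and spark.driver.host, spark.driver.port and spark.blockManager.port are standard Spark properties:
import org.apache.spark.{SparkConf, SparkContext}

// Pin the ports the cluster must reach on the driver machine,
// then open exactly these ports in the laptop's firewall.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")    // hypothetical master URL
  .setAppName("firewall-friendly-driver")
  .set("spark.driver.host", "192.0.2.10")   // placeholder: the laptop's routable IP
  .set("spark.driver.port", "40000")        // placeholder port to open in the firewall
  .set("spark.blockManager.port", "40001")  // placeholder port to open in the firewall

val sc = new SparkContext(conf)
Depending on the Spark version there can be further driver-side ports beyond these (on older versions, for example, the HTTP file server); the timeout in the trace above happens while an executor fetches a file back from the driver over HTTP, which suggests one of those ports is still blocked.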

Connection reset message from JMeter

When I ran my test, I got the error below in the response section of the View Results Tree.
I was testing an IIS server.
Was this connection reset caused by JMeter, or was it done by the IIS server?
How do you interpret this error message?
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:61)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.executeRequest(HTTPHC4Impl.java:475)
at org.apache.jmeter.protocol.http.sampler.HTTPHC4Impl.sample(HTTPHC4Impl.java:295)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerProxy.sample(HTTPSamplerProxy.java:74)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1105)
at org.apache.jmeter.protocol.http.sampler.HTTPSamplerBase.sample(HTTPSamplerBase.java:1094)
at org.apache.jmeter.threads.JMeterThread.process_sampler(JMeterThread.java:429)
at org.apache.jmeter.threads.JMeterThread.run(JMeterThread.java:257)
at java.lang.Thread.run(Unknown Source)
There are at least 3 possible reasons:
Most probable cause: your server (meaning the web servers handling requests and any components behind them) is not handling the load correctly and is slowing down; monitor the system and check it.
You have exhausted your injector's ephemeral ports; you need to adjust your OS TCP settings to increase the port range.
You're running the load test in GUI mode with a View Results Tree in the test plan; this is bad practice, as GC will happen frequently, possibly triggering stop-the-world pauses that lead to this error. As per best practices, use non-GUI mode (see the example after these links):
https://jmeter.apache.org/usermanual/best-practices.html
https://www.ubik-ingenierie.com/blog/jmeter_performance_tuning_tips/
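For the non-GUI point, the standard command-line invocation looks like this (the test plan and results file names are placeholders):
jmeter -n -t testplan.jmx -l results.jtl
Listeners such as View Results Tree can then be applied afterwards to the saved results file rather than kept in the plan during the load test.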
