Channel is closed while reading from Alluxio using Presto

I encountered this stack trace while running a Presto query on top of Alluxio. Sometimes my query is able to succeed, but sometimes it fails with this error. What does it mean, and how can I fix it?
com.facebook.presto.spi.PrestoException: Error opening Hive split alluxio://xxxxx:19998/s3/data/m-00025 (offset=100663296, length=53990296) using org.apache.hadoop.mapred.TextInputFormat: Channel [id: 0xfa748b02, L:/xxxxx:34874 ! R:xxxxx/xxxxx:29999] is closed.
at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:219)
at com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:71)
at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:80)
at com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:70)
at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:183)
at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:216)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1053)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:456)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Channel [id: 0xfa748b02, L:/xxxxx:34874 ! R:xxxxx/xxxxx:29999] is closed.
at alluxio.client.block.stream.NettyPacketReader$PacketReadHandler.channelUnregistered(NettyPacketReader.java:314)
at alluxio.core.client.runtime.io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:176)

This means the connection between the Alluxio client (in Presto) and the Alluxio worker was closed unexpectedly.
Usually this is caused by a long GC pause on the client side: the Alluxio client periodically sends a keep-alive on the connection, but a full GC can delay it long enough that the worker closes the connection.
You can check for GC pressure by adding the Java options -XX:+PrintGCDetails and -Xloggc:<file name here> to the Presto daemons.
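For example, a minimal sketch of the GC-logging setup (the log path is a placeholder; on Presto these flags normally go into etc/jvm.config on every coordinator and worker):

-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/var/log/presto/gc.log

These are standard HotSpot flags for Java 8; on Java 9+ the unified equivalent is -Xlog:gc*:file=/var/log/presto/gc.log. If the log shows long full-GC pauses around the failure times, tune the heap or GC settings accordingly.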

Related

JanusGraph error: "Could not find type for id" during a concurrent load operation

While performing a concurrent bulk load operation, I received this error. Subsequently, all my queries failed, and I kept getting the same error.
The exception I got is as follows:
java.lang.NullPointerException: Could not find type for id: 52237
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:250)
at org.janusgraph.graphdb.types.vertices.JanusGraphSchemaVertex.name(JanusGraphSchemaVertex.java:57)
at org.janusgraph.graphdb.vertices.AbstractVertex.label(AbstractVertex.java:121)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceElement.<init>(ReferenceElement.java:57)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceVertex.<init>(ReferenceVertex.java:46)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:48)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:69)
at org.apache.tinkerpop.gremlin.structure.util.reference.ReferenceFactory.detach(ReferenceFactory.java:80)
at org.apache.tinkerpop.gremlin.process.traversal.strategy.decoration.HaltedTraverserStrategy.halt(HaltedTraverserStrategy.java:60)
at org.apache.tinkerpop.gremlin.server.util.TraverserIterator.next(TraverserIterator.java:64)
at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.handleIterator(TraversalOpProcessor.java:529)
at org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor.lambda$iterateBytecodeTraversal$4(TraversalOpProcessor.java:382)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Some additional context:
storage.batch-loading was NOT enabled
The bulk write operation I was running was highly concurrent and under heavy load
I used about 100 instances of Gremlin Server connecting to a Cassandra/ES backend
I did not explicitly define a schema
It would be great if someone could give me an idea of what could have caused this.
Thanks!
This happens if multiple instances of gremlin-server are registered against the graph, which can occur when:
a Gremlin Server was not shut down or killed properly, or
the VM on which Gremlin Server runs was restarted.
The solution is to log in to the Gremlin Console and run the commands for your backend; in my case that is Cassandra and Elasticsearch, so I will run:
Method 1:
:remote connect tinkerpop.server conf/remote.yaml session
:remote console
or
graph=JanusGraphFactory.open('conf/janusgraph-cql-es.properties');
g=graph.traversal()
If you are running in containers, the command will look similar to this:
graph=JanusGraphFactory.open('/etc/opt/janusgraph/janusgraph.properties');
g=graph.traversal()
Now, after running those, you can run:
mgmt = graph.openManagement()
mgmt.getOpenInstances()
This will display all the registered instances, e.g.:
ac12000231-a9ffbcbb0e921
ac12000230-a9ffbcbb0e921(current)
Close every instance except the current one:
mgmt.forceCloseInstance('ac12000231-a9ffbcbb0e921')
After closing all the other instances, commit the changes:
mgmt.commit()
Now restart your Gremlin Server and rerun your query; it should work.
Method 2:
If the problem persists, just kill your gremlin-server and start it again, a few times if necessary; the load command should then work.
Another reason this happens is when the data was not restored properly.
If you are using a cluster, take the backup on all the nodes, then restore it on your destination node or nodes.
I used nodetool for the backup and sstableloader for restoring the data.
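For reference, a rough sketch of that backup/restore flow (the keyspace, table, host, and paths are placeholders, and the exact snapshot directory layout depends on your Cassandra version). Run on every node of the source cluster:

nodetool snapshot -t mybackup mykeyspace

Then stream the snapshotted SSTables into the destination cluster:

sstableloader -d destination-host /path/to/data/mykeyspace/mytable/

sstableloader expects a directory whose last two path components are the keyspace and table names.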

Apache Spark on k8s: securing RPC communication between driver and executors is not working

I have been trying out a Spark 2.4 deployment on k8s and want to establish a secured RPC communication channel between the driver and executors. I was using the following configuration parameters as part of spark-submit:
spark.authenticate true
spark.authenticate.secret good
spark.network.crypto.enabled true
spark.network.crypto.keyFactoryAlgorithm PBKDF2WithHmacSHA1
spark.network.crypto.saslFallback false
The driver and executors were not able to communicate over a secured channel and threw the following errors:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
... 4 more
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown challenge message.
at org.apache.spark.network.crypto.AuthRpcHandler.receive(AuthRpcHandler.java:109)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
Can someone guide me on this?
Disclaimer: I do not have a very deep understanding of Spark's implementation, so be careful when using the workaround described below.
AFAIK, Spark has no support for RPC auth/encryption on k8s in version 2.4.0.
There is a ticket that has already been fixed and will likely be released in a subsequent Spark version: https://issues.apache.org/jira/browse/SPARK-26239
The problem is that Spark executors open a connection to the driver, and the configuration is sent only over that connection; the executor, however, creates that connection with the default config plus any system properties starting with "spark.".
For reference, here is the place where executor opens the connection: https://github.com/apache/spark/blob/5fa4384/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L201
Theoretically, setting spark.executor.extraJavaOptions=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true ... should help, but the driver checks that no Spark parameters are set in extraJavaOptions and rejects them.
There is, however, a workaround (a little bit hacky): you can set spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true .... Spark does not check this parameter, but the JVM reads the environment variable and adds the options to its system properties.
Also, instead of using JAVA_TOOL_OPTIONS to pass the secret, I would recommend using spark.executorEnv._SPARK_AUTH_SECRET=<secret>.
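Putting it together, a minimal sketch of the workaround as spark-submit flags (the master URL and secret value are placeholders; remaining flags elided):

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.authenticate=true \
  --conf spark.network.crypto.enabled=true \
  --conf "spark.executorEnv.JAVA_TOOL_OPTIONS=-Dspark.authenticate=true -Dspark.network.crypto.enabled=true" \
  --conf spark.executorEnv._SPARK_AUTH_SECRET=<secret> \
  ...

When the variable is honored, the JVM prints a "Picked up JAVA_TOOL_OPTIONS: ..." notice in the executor logs, which is a quick way to confirm the options actually reached the executors.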

When would ShuffleBlockFetcherIterator throw "Failed to get block(s)" exceptions?

In my Spark application, which runs in cluster mode, I get the exception below. I know this could somehow be due to a memory issue, but as the error says, it cannot connect to a node. I am sure, however, that the node is available and can be connected to. Does anyone know the main cause of this error and how to resolve it?
17/10/31 17:10:54 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from AUPER01-02-10-12-0.prod.vroc.com.au:36787
java.io.IOException: Failed to connect to AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:171)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
It appears that one of the executors died while the other executors tried to pull blocks from earlier shuffle stages to complete the Spark job.
Right after you spark-submit a Spark application to a cluster, the application gets a set of machines for executors. They are responsible for executing tasks and caching their results (in memory and/or on disk).
Every executor has its own BlockManager that is responsible for managing datasets (as blocks).
All of the BlockManagers in a Spark application have to be available, or the application will re-trigger task execution.
ShuffleBlockFetcherIterator is a Scala Iterator that fetches multiple shuffle blocks (a.k.a. shuffle map outputs) from local and remote BlockManagers.
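Since a dead executor is the usual root cause, a reasonable first step is to check the logs of the executor on the named host for OOM kills or crashes. While fixing the underlying issue, you can also make the fetch path more tolerant of transient failures; a hedged sketch for spark-defaults.conf (the values are illustrative, not tuned):

spark.shuffle.io.maxRetries 10
spark.shuffle.io.retryWait 30s
spark.network.timeout 300s

These are standard Spark network/shuffle settings; raising them only buys time for an executor to recover and does not fix the reason it died.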

How to fix Connection reset by peer message from apache-spark?

I keep getting the following exception very frequently and I wonder why this is happening. After researching, I found I could do .set("spark.submit.deployMode", "nio"), but that did not work either, and I am using Spark 2.0.0.
WARN TransportChannelHandler: Exception in connection from /172.31.3.245:46014
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:898)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
I was getting the same error even after trying many things. My job used to get stuck, throwing this error after running for a very long time. I tried a few workarounds that helped me resolve it; I still get the same error, but at least my job runs fine.
One reason could be that the executors kill themselves thinking they have lost the connection to the master. I added the configurations below to the spark-defaults.conf file:
spark.network.timeout 10000000
spark.executor.heartbeatInterval 10000000
Basically, I have increased the network timeout and the heartbeat interval.
For the particular step that used to get stuck, I simply cached the dataframe used for processing (see the sketch after the note below).
Note: these are workarounds; I still see the same error in the logs, but my job does not get terminated.
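For reference, a minimal sketch of that caching workaround in Scala (the input path and transformation are placeholders, not from the original post):

// hypothetical shuffle-heavy stage that used to hang
val df = spark.read.parquet("/data/input").groupBy("key").count()
df.cache()   // keep the intermediate result in executor memory/disk
df.count()   // force materialization so later actions reuse the cached blocks

Caching trades memory for stability: downstream stages reread the cached blocks instead of refetching shuffle output from possibly flaky peers.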

All TCP connections between the DataStax driver and Cassandra remain in the active-close state, i.e. TIME_WAIT

The setup:
Web server
- Apache Tomcat
- RESTful web services
- DataStax Java driver 2.0
Database
- 2-node Cassandra 2.0.7.31 cluster
- replicas = 1
Problem
After sending a set of 1500 requests more than three times, I got this error in the Tomcat log:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:64)
at com.datastax.driver.core.ResultSetFuture.extractCauseFromExecutionException(ResultSetFuture.java:214)
at com.datastax.driver.core.ResultSetFuture.getUninterruptibly(ResultSetFuture.java:169)
at com.jpmc.es.rtm.storage.impl.EventExtract.main(EventExtract.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.181.13.239 ([/10.181.13.239] Unexpected exception triggered))
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:98)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:165)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
Observation
After Tomcat reaches this state, all further requests meet the same fate: the driver is unable to send my insert requests to Cassandra.
After running netstat, I found that all the TCP connections between the web server and Cassandra are in the TIME_WAIT state.
What could be the reason? Why is the DataStax driver not able to return connections to the pool, and why is Cassandra tying up all the connections from its client?
Thanks in advance.
The connection count was increasing because a new Session was being created for each request. Creating the Cluster and Session once and reusing them fixed it:
// Build the Cluster and Session once (e.g. at application startup) and reuse them for every request.
builder = new Cluster.Builder()
        .addContactPoints("192.168.114.42");

PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions.setCoreConnectionsPerHost(HostDistance.LOCAL,
        poolingOptions.getMaxConnectionsPerHost(HostDistance.LOCAL));

cluster = builder
        .withPoolingOptions(poolingOptions)
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
        .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
        .build();
session = cluster.connect("demodb");
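In this driver, Cluster and Session are heavyweight, thread-safe objects meant to be created once per application and shared across requests; opening a new Session per request leaks connections, which is presumably what was piling up in TIME_WAIT here.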
Now the driver maintains 17-26 connections, irrespective of the number of transactions.
