spark-hbase connector expired Kerberos ticket - apache-spark

I have a cluster with CDH 5.8.4. I'm running a Spark Streaming application which reads and writes data from/to HBase using the Cloudera spark-hbase connector, namely the HBaseContext.
When I start the application I supply the principal and run kinit for the spark-submit script.
I'm seeing that after 7 days the application crashes with an error about the expiration of the Kerberos ticket related to the HBase context. This is the error from the executor logs:
ERROR executor.Executor: Exception in task 0.0 in stage 544265.0 (TID 1149098)
org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:326)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:157)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:61)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:867)
at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.restart(TableRecordReaderImpl.java:91)
at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.initialize(TableRecordReaderImpl.java:169)
at org.apache.hadoop.hbase.mapreduce.TableRecordReader.initialize(TableRecordReader.java:134)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.initialize(TableInputFormatBase.java:211)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:164)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:129)
at org.apache.hadoop.hbase.spark.NewHBaseRDD.compute(NewHBaseRDD.scala:34)
at org.apache.hadoop.hbase.spark.NewHBaseRDD.compute(NewHBaseRDD.scala:25)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Token has expired
at sun.reflect.GeneratedConstructorAccessor58.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:327)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1593)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1398)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1199)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:315)
... 30 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Token has expired
at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.readStatus(HBaseSaslRpcClient.java:155)
at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:222)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1783)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1242)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.get(ClientProtos.java:34070)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRowOrBefore(ProtobufUtil.java:1589)
Does anyone know how to solve this issue?
Thanks in advance,
Beniamino

We (Splice Machine) had the same issue with a customer. Ours was caused by https://issues.apache.org/jira/browse/SPARK-12646. We wrote some code to fix the _HOST issue, and we also upgraded to Spark 2.2 to get around it.
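For reference, on YARN the keytab can also be handed to spark-submit directly so that Spark itself re-obtains delegation tokens for long-running jobs (subject to the SPARK-12646 caveat above); the principal, keytab path, class and jar below are placeholders:
spark-submit --master yarn --deploy-mode cluster \
  --principal appuser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/appuser.keytab \
  --class com.example.MyStreamingApp my-streaming-app.jar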

You should not rely on an external ticket cache for distributed jobs. The best solution is to ship a keytab with your application, or to rely on a keytab being deployed on every node where your Spark task may be executed.
UserGroupInformation.loginUserFromKeytab("name@xyz.com", keyTab);
connection = ConnectionFactory.createConnection(conf);
With your approach above, you would need to do something like the following after obtaining the UserGroupInformation instance:
ugi.doAs(new PrivilegedAction<Void>() {
    public Void run() {
        connection = ConnectionFactory.createConnection(conf);
        ...
        return null;
    }
});
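For a long-running streaming job, here is a minimal Scala sketch of the same idea (keytab login, periodic re-login, and the HBase connection created inside doAs); the principal, keytab path and re-login point are placeholders, not taken from the answer above:
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
import org.apache.hadoop.security.UserGroupInformation

// placeholders: use your own principal and keytab location
val principal = "appuser@EXAMPLE.COM"
val keytabPath = "/etc/security/keytabs/appuser.keytab"

val hbaseConf = HBaseConfiguration.create()
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytabPath)

// safe to call repeatedly, e.g. at the start of every batch:
// re-logs in from the keytab if the TGT is close to expiring
ugi.checkTGTAndReloginFromKeytab()

// run the HBase calls as the keytab-authenticated user
val connection: Connection = ugi.doAs(new PrivilegedExceptionAction[Connection] {
  override def run(): Connection = ConnectionFactory.createConnection(hbaseConf)
})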

Related

AWS Graviton based EC2 Instance Upgrade in EMR causing task failures

I have a Spark Scala job running in EMR that I am trying to improve. As of right now it runs on m5.8xlarge with no issues. I recently tried upgrading to the Graviton-based EC2 instances (m6g.8xlarge) and, while the job does succeed, I am seeing some weird issues: tasks failing due to a timeout, stages running in a strange order, and what looks like strained memory.
The stage that runs out of order is the one with the failed tasks: stage 6 runs and fails, then stages 4 & 5 complete, and then the stage 6 retry succeeds. In the m5.8xlarge run that currently works, stages 4 & 5 get skipped.
I'm not sure why this is happening, since the only change I made was going from an m5 instance type to an m6g, so I wanted to see if anyone has experienced something similar or has solutions. I will also post some of the errors from the failed tasks, but I think they are related to the OOM.
Here is the main error I am seeing:
ERROR TransportClientFactory:261 - Exception while bootstrapping client after 60041 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:263)
at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:70)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:116)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:89)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:257)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
at org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:109)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:264)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:614)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:609)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:442)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:160)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:66)
at org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:173)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:259)
... 39 more
I don't think this is an out-of-memory problem.
Both m6g.8xlarge and m5.8xlarge have 128 GiB of memory according to their specs:
https://aws.amazon.com/ec2/instance-types/m6g and https://aws.amazon.com/ec2/instance-types/m5
I see in the backtrace that the timeout occurs during the authentication process:
First it fails to authenticate with Spark's auth protocol in doBootstrap(AuthClientBootstrap.java:89):
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthClientBootstrap.java#L99
Bootstraps a {@link TransportClient} by performing authentication using Spark's auth protocol.
This bootstrap falls back to using the SASL bootstrap if the server throws an error during authentication, and the configuration allows it. This is used for backwards compatibility with external shuffle services that do not support the new protocol.
and then it also fails to authenticate with SASL in doBootstrap(SaslClientBootstrap.java:70):
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/sasl/SaslClientBootstrap.java#L54
Bootstraps a {@link TransportClient} by performing SASL authentication on the connection. The server should be set up with a {@link SaslRpcHandler} with matching keys for the given appId.
Performs SASL authentication by sending a token, and then proceeding with the SASL challenge-response tokens until we either successfully authenticate or throw an exception due to mismatch.

When would ShuffleBlockFetcherIterator throw "Failed to get block(s)" exceptions?

In my Spark application, which runs in cluster mode, I get the exception below. I know this could somehow be due to a memory issue. But as the error says, it cannot connect to a node, and I am sure the node is available and can be connected to. Does anyone know what the main cause of this error is and how to resolve it?
17/10/31 17:10:54 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) from AUPER01-02-10-12-0.prod.vroc.com.au:36787
java.io.IOException: Failed to connect to AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:171)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: AUPER01-02-10-12-0.prod.vroc.com.au/192.168.11.22:36787
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
... 2 more
It appears that one of the executors died while the other executors tried to pull blocks from earlier shuffle stages to complete a Spark job.
Right after you've spark-submitted a Spark application to a cluster, the application gets a set of machines for executors. They are responsible for executing tasks and caching their results (in memory and/or on disk).
Every executor has its own BlockManager that is responsible for managing datasets (as blocks).
The BlockManagers in a Spark application all have to be available, or the Spark application will re-trigger task execution.
ShuffleBlockFetcherIterator is a Scala Iterator that fetches multiple shuffle blocks (aka shuffle map outputs) from local and remote BlockManagers.
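Not part of the explanation above, but if the failures are transient, these commonly tuned settings (the values here are only examples) make shuffle fetches more tolerant of a briefly unreachable executor:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "10")  // default 3: retries per failed block fetch
  .set("spark.shuffle.io.retryWait", "30s")  // default 5s: wait between fetch retries
  .set("spark.network.timeout", "300s")      // default 120s: general network timeout
If an executor actually died (for example from memory pressure), the real fix is to address that cause rather than to retry harder.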

After creating a new jhipster project, unable to launch the application

After creating a JHipster project, I tried the following command:
mvnw
I am getting the following error. I am facing the same issue with an existing project as well.
Error:
com.netflix.discovery.shared.transport.TransportException: Cannot execute request on any known server
at com.netflix.discovery.shared.transport.decorator.RetryableEurekaHttpClient.execute(RetryableEurekaHttpClient.java:111)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.register(EurekaHttpClientDecorator.java:56)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator$1.execute(EurekaHttpClientDecorator.java:59)
at com.netflix.discovery.shared.transport.decorator.SessionedEurekaHttpClient.execute(SessionedEurekaHttpClient.java:77)
at com.netflix.discovery.shared.transport.decorator.EurekaHttpClientDecorator.register(EurekaHttpClientDecorator.java:56)
at com.netflix.discovery.DiscoveryClient.register(DiscoveryClient.java:815)
at com.netflix.discovery.InstanceInfoReplicator.run(InstanceInfoReplicator.java:104)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Your question is not detailed enough, but it is obvious that you are using a microservice architecture and did not start the registry. Check the documentation.
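If the project was generated as a JHipster microservice, the registry generated alongside it can typically be started with Docker before launching the application (the compose file name and path may differ between JHipster versions):
docker-compose -f src/main/docker/jhipster-registry.yml up -d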

Spark on Amazon EMR: "Timeout waiting for connection from pool"

I'm running a Spark job on a small three-server Amazon EMR 5 (Spark 2.0) cluster. My job runs for an hour or so and then fails with the error below. I can manually restart it and it works, processes more data, and eventually fails again.
My Spark code is fairly simple and does not use any Amazon or S3 APIs directly. It passes S3 text string paths to Spark, and Spark uses S3 internally.
My Spark program just does the following in a loop: load data from S3 -> process -> write data to a different location on S3.
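Schematically, each iteration looks roughly like this (the bucket, paths and the filter are placeholders, for illustration only):
// spark is the SparkSession; load from S3, process, write back to a different S3 location
val df = spark.read.json("s3://my-bucket/input/batch-01/")
val processed = df.filter("value IS NOT NULL")
processed.write.parquet("s3://my-bucket/output/batch-01/")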
My first suspicion is that some internal Amazon or Spark code is not properly disposing of connections and the connection pool becomes exhausted.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:618)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy44.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy45.getConnection(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
... 41 more
I encountered this issue with a very trivial program on EMR (read data from S3, filter, write to S3).
I could solve it by using the S3A file system implementation and setting fs.s3a.connection.maximum to 100 to have a bigger connection pool.
(default is 15; see Hadoop-AWS module: Integration with Amazon Web Services for more config properties)
This is how I set the configuration:
// in Scala
val hc = sc.hadoopConfiguration
// in Python (not tested)
hc = sc._jsc.hadoopConfiguration()
// setting the config is the same for both languages
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)
To make it work, the S3 URIs passed to Spark have to start with s3a://...
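For example (hypothetical bucket and paths), reads and writes then go through the s3a scheme:
// spark is the SparkSession; bucket and paths are placeholders
val df = spark.read.text("s3a://my-bucket/input/")
df.write.text("s3a://my-bucket/output/")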
This issue may also be resolved while remaining on EMRFS by setting fs.s3.maxConnections to something larger than the default of 500 in the emrfs-site configuration:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/
If you are using the Java SDK to spin up an EMR cluster, you can set this using the withConfigurations method (which is much easier than doing it manually by modifying files). See also https://stackoverflow.com/a/52595058/1586965
You can check that this has been set correctly by using the Configurations tab in the EMR console.
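For reference, the emrfs-site classification (whether supplied through withConfigurations, the CLI, or the console) looks roughly like this; the value 1000 is only an illustration:
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "1000"
    }
  }
]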

Grails 2.0.1 Hibernate exception in foreign worker thread

I have a Grails 2.0.1 app and I spawn worker threads. My worker threads use GORM to read/write domain objects from MySQL. Given that the code accessing the DB is not inside an HTTP request, I create a Hibernate session for the save() like this:
MyClass.withTransaction { status ->
    myClass.save()
}
I do that to resolve the "No Hibernate Session bound to thread" problem.
This worked fine on Grails 1.3.7. I am currently attempting to upgrade to Grails 2.0.1, and am seeing an exception I do not understand.
The following stack trace shows the exception.
Although I recognize the exception, and normally resolve it via the "withTransaction" technique shown above, the thing that baffles me about this one is that it is happening on a worker thread that I did not create.
Exception in thread "pool-12-thread-2" org.hibernate.HibernateException: No Hibernate Session bound to thread, and configuration does not allow creation of non-transactional one here
at org.springframework.orm.hibernate3.SpringSessionContext.currentSession(SpringSessionContext.java:63)
at org.hibernate.impl.SessionFactoryImpl.getCurrentSession(SessionFactoryImpl.java:687)
at org.codehaus.groovy.grails.orm.hibernate.SessionFactoryProxy.getCurrentSession(SessionFactoryProxy.java:145)
at org.codehaus.groovy.grails.orm.hibernate.validation.HibernateDomainClassValidator.validate(HibernateDomainClassValidator.java:51)
at org.codehaus.groovy.grails.validation.GrailsDomainClassValidator.validate(GrailsDomainClassValidator.java:121)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.ValidatePersistentMethod.doInvokeInternal(ValidatePersistentMethod.java:119)
at org.codehaus.groovy.grails.orm.hibernate.metaclass.AbstractDynamicPersistentMethod.invoke(AbstractDynamicPersistentMethod.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.springsource.loaded.ri.ReflectiveInterceptor.jlrMethodInvoke(ReflectiveInterceptor.java:1231)
at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite$PojoCachedMethodSite.invoke(PojoMetaMethodSite.java:189)
at org.codehaus.groovy.runtime.callsite.PojoMetaMethodSite.call(PojoMetaMethodSite.java:53)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:42)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:124)
at org.codehaus.groovy.grails.orm.hibernate.HibernateGormValidationApi.validate(HibernateGormEnhancer.groovy:702)
at com......MyClass.validate(MyClass.groovy)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.springsource.loaded.ri.ReflectiveInterceptor.jlrMethodInvoke(ReflectiveInterceptor.java:1231)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:90)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventListener$7.call(ClosureEventListener.java:282)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventListener$7.call(ClosureEventListener.java:267)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventListener.doWithManualSession(ClosureEventListener.java:302)
at org.codehaus.groovy.grails.orm.hibernate.support.ClosureEventListener.onPreUpdate(ClosureEventListener.java:267)
at org.codehaus.groovy.grails.orm.hibernate.EventTriggeringInterceptor.onPreUpdate(EventTriggeringInterceptor.java:164)
at org.codehaus.groovy.grails.orm.hibernate.EventTriggeringInterceptor.onPersistenceEvent(EventTriggeringInterceptor.java:89)
at org.grails.datastore.mapping.engine.event.AbstractPersistenceEventListener.onApplicationEvent(AbstractPersistenceEventListener.java:46)
at org.springframework.context.event.SimpleApplicationEventMulticaster$1.run(SimpleApplicationEventMulticaster.java:92)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
I am hoping someone can enlighten me regarding:
1. Where does this worker thread pool originate from?
2. Why is it doing a preUpdate event (when I believe the save in my thread already did a successful validate and save)?
3. Can I somehow configure my app so that these foreign worker threads are not there?
4. If #3 can't be pulled off, how can I do the equivalent of a "withTransaction" over in the worker thread code that I do not "own"?
