Could not get a Transport from the Transport Pool for host - apache-spark

I'm trying to write to an IBM Compose Elasticsearch sink from Spark Structured Streaming on IBM Analytics Engine. My Spark code:
dataDf
  .writeStream
  .outputMode(OutputMode.Append)
  .format("org.elasticsearch.spark.sql")
  .queryName("ElasticSink")
  .option("checkpointLocation", s"${s3Url}/checkpoint_elasticsearch")
  .option("es.nodes", "xxx1.composedb.com,xxx2.composedb.com")
  .option("es.port", "xxxx")
  .option("es.net.http.auth.user", "admin")
  .option("es.net.http.auth.pass", "xxxx")
  .option("es.net.ssl", true)
  .option("es.nodes.wan.only", true)
  .option("es.net.ssl.truststore.location", SparkFiles.getRootDirectory() + "/my.jks")
  .option("es.net.ssl.truststore.pass", "xxxx")
  .start("test/broadcast")
However, I'm receiving the following exception:
org.elasticsearch.hadoop.EsHadoopException: Could not get a Transport from the Transport Pool for host [xxx2.composedb.com:xxxx]
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.borrowFrom(PooledHttpTransportFactory.java:106)
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.create(PooledHttpTransportFactory.java:55)
at org.elasticsearch.hadoop.rest.NetworkClient.selectNextNode(NetworkClient.java:99)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:82)
at org.elasticsearch.hadoop.rest.NetworkClient.<init>(NetworkClient.java:59)
at org.elasticsearch.hadoop.rest.RestClient.<init>(RestClient.java:94)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:317)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:576)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.run(EsStreamQueryWriter.scala:41)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:52)
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink$$anonfun$addBatch$2$$anonfun$2.apply(EsSparkSqlStreamingSink.scala:51)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any ideas?

I modified the Elasticsearch Hadoop library to log the underlying exception, and the real problem was that the truststore could not be found:
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cannot initialize SSL - Expected to find keystore file at [/tmp/spark-e2203f9c-4f0f-4929-870f-d491fce0ad06/userFiles-62df70b0-7b76-403d-80a1-8845fd67e6a0/my.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
at org.elasticsearch.hadoop.rest.pooling.PooledHttpTransportFactory.borrowFrom(PooledHttpTransportFactory.java:106)
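The fix, then, is to make the truststore resolvable on every executor rather than only on the driver: SparkFiles.getRootDirectory() is evaluated on the driver when the query is defined, and that path does not exist on the executors. A minimal sketch of one common approach, assuming a YARN deployment where files passed via --files are localized into each container's working directory (which is on the executor classpath):

// Ship the truststore with the job:
//   spark-submit --files my.jks ...
// then reference the executor-local copy by name instead of a driver path:
dataDf
  .writeStream
  .outputMode(OutputMode.Append)
  .format("org.elasticsearch.spark.sql")
  // ... auth, SSL, and checkpoint options as above ...
  .option("es.net.ssl.truststore.location", "my.jks") // resolved via the executor classpath
  .start("test/broadcast")

The alternative is to pre-place the file at the same absolute path on every node and point the option at a file:// URI.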

Related

Getting an error using the Pub/Sub Lite library in Spark

I am getting an error while publishing messages to GCP Pub/Sub Lite using Spark Structured Streaming.
I cannot use writeStream directly because I want to publish inside a foreachBatch sink, so I am using foreachPartition and foreach, publishing a message inside foreach for each DataFrame row.
Below is the error I get; some messages are published, but for others I see the following exception:
2022-06-07 10:08:17 WARN PartitionCountWatcherImpl:101 - Failed to refresh partition count
com.google.api.gax.rpc.ApiException:
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:51)
at com.google.cloud.pubsublite.internal.CheckedApiException.<init>(CheckedApiException.java:55)
at com.google.cloud.pubsublite.internal.ExtractStatus.toCanonical(ExtractStatus.java:49)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:92)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.onAlarm(PartitionCountWatcherImpl.java:71)
at com.google.cloud.pubsublite.internal.AlarmFactory.lambda$null$0(AlarmFactory.java:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:456)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:100)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:73)
at com.google.cloud.pubsublite.internal.wire.PartitionCountWatcherImpl.pollTopicConfig(PartitionCountWatcherImpl.java:81)
... 9 more
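For reference, the publishing pattern described above might look like the sketch below, assuming the com.google.cloud.pubsublite.cloudpubsub.Publisher client (the topic path and payload are placeholders, and df stands for the streaming DataFrame). Creating one publisher per partition and shutting it down cleanly, rather than letting it be interrupted, is one plausible way to avoid the InterruptedException seen above:

import com.google.cloud.pubsublite.TopicPath
import com.google.cloud.pubsublite.cloudpubsub.{Publisher, PublisherSettings}
import com.google.protobuf.ByteString
import com.google.pubsub.v1.PubsubMessage
import org.apache.spark.sql.{DataFrame, Row}

// A typed val avoids the foreachBatch overload ambiguity in Scala 2.12.
val publishBatch: (DataFrame, Long) => Unit = (batch, _) => {
  batch.foreachPartition { rows: Iterator[Row] =>
    // One publisher per partition, not per row (topic path is a placeholder).
    val publisher = Publisher.create(
      PublisherSettings.newBuilder()
        .setTopicPath(TopicPath.parse("projects/<project>/locations/<location>/topics/<topic>"))
        .build())
    publisher.startAsync().awaitRunning()
    try {
      rows.foreach { row =>
        val message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8(row.mkString(",")))
          .build()
        publisher.publish(message) // returns an ApiFuture; await it if delivery must be confirmed
      }
    } finally {
      // Shut down cleanly so the client's background threads are not interrupted mid-poll.
      publisher.stopAsync().awaitTerminated()
    }
  }
}

df.writeStream.foreachBatch(publishBatch).start()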

Is there a way to read a file from a Kerberized HDFS into a non-Kerberized Spark cluster, given the keytab file, principal, and other details?

I need to read data from a Kerberized HDFS cluster using WebHDFS from a non-Kerberized Spark cluster. I have access to the keytab file and username/principal, and can access any other details needed to log in. I need to programmatically log in and allow my Spark cluster to read the file(s) from the Kerberized HDFS.
It looks like I can log in using this piece of code:
System.setProperty("java.security.krb5.kdc", "<kdc>")
System.setProperty("java.security.krb5.realm", "<realm>")
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("<my-principal>", "<keytab-file-path>/keytabFile.keytab")
From the official WebHDFS documentation here, I see that I can configure the keytab file path and principal in the Hadoop configuration like this:
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.principal", "<my-principal>")
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.keytab", "<keytab-file-path>/keytabFile.keytab")
After this, I should be able to read the files from the Kerberized HDFS cluster using:
val irisDFWebHDFS = sparkSession.read
  .option("header", "true")
  .csv(s"webhdfs://<namenode-host>:<namenode-port>/user/hadoop/iris.csv")
But it still refuses to read and throws the following exception:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:490)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$300(WebHdfsFileSystem.java:135)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:721)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:796)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1741)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:365)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:585)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:608)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:898)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:794)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:1086)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:1097)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1707)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
... 44 elided
Any pointers on what I might be missing?
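One thing that may be missing (a sketch, under the assumption that the driver performs the read): capture the UGI returned by the keytab login and wrap the read in its doAs, so the WebHDFS client requests the delegation token as the logged-in user. Note that tasks running on executors would still need credentials of their own; this only covers the driver-side call:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.DataFrame

// Keep a handle on the logged-in user instead of relying on the static login.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
  "<my-principal>", "<keytab-file-path>/keytabFile.keytab")

// Run the read as that user so WebHDFS can fetch a delegation token.
val irisDFWebHDFS = ugi.doAs(new PrivilegedExceptionAction[DataFrame] {
  override def run(): DataFrame =
    sparkSession.read
      .option("header", "true")
      .csv("webhdfs://<namenode-host>:<namenode-port>/user/hadoop/iris.csv")
})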

Kafka Sink Connector for Cassandra failed

I'm creating a Kafka Sink Connector for Cassandra via Lenses. My configuration is:
connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
connect.cassandra.key.space=space1
connect.cassandra.contact.points=cassandra1
tasks.max=1
topics=demo-1206-enriched-clicks-v0.1
connect.cassandra.port=9042
connect.cassandra.kcql=INSERT INTO space1.CLicks_Test SELECT ClicksId from demo-1206-enriched-clicks-v0.1
name=test_cassandra
but, I'm getting this error:
org.apache.kafka.common.config.ConfigException: Mandatory `topics` configuration contains topics not set in connect.cassandra.kcql: Set(demo-1206-enriched-clicks-v0.1)
at com.datamountaineer.streamreactor.connect.config.Helpers$.checkInputTopics(Helpers.scala:107)
at com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector.start(CassandraSinkConnector.scala:65)
at org.apache.kafka.connect.runtime.WorkerConnector.doStart(WorkerConnector.java:100)
at org.apache.kafka.connect.runtime.WorkerConnector.start(WorkerConnector.java:125)
at org.apache.kafka.connect.runtime.WorkerConnector.transitionTo(WorkerConnector.java:182)
at org.apache.kafka.connect.runtime.Worker.startConnector(Worker.java:210)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.startConnector(DistributedHerder.java:872)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.processConnectorConfigUpdates(DistributedHerder.java:324)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:296)
at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:199)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any ideas why?
Jelena, as already discussed, the problem is that the Kafka Connect framework cannot parse the .1 in topics=demo-1206-enriched-clicks-v0.1. A bug should be reported to the Kafka project on GitHub.
@Alex, what you see there is not KSQL. The configuration SQL we use for all our connectors is KCQL (Kafka Connect Query Language).
Furthermore, we have our equivalent SQL for Apache Kafka, called LSQL: http://www.landoop.com/docs/lenses/
This was discussed at https://launchpass.com/datamountaineers, and the connector was verified to properly support the . character in KCQL configuration. The root cause was identified as an upstream issue in the Kafka Connect framework.
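Until that upstream issue is fixed, one hypothetical workaround, assuming the topic can be renamed, is to avoid the . in the topic name so that topics and connect.cassandra.kcql agree:

# hypothetical workaround: topic name without '.'
topics=demo-1206-enriched-clicks-v0-1
connect.cassandra.kcql=INSERT INTO space1.CLicks_Test SELECT ClicksId FROM demo-1206-enriched-clicks-v0-1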

Spark streaming kafka Couldn't find leader offsets for Set

I used Spark Streaming 'org.apache.spark:spark-streaming_2.10:1.6.1' and 'org.apache.spark:spark-streaming-kafka_2.10:1.6.1' to connect to a Kafka broker version 0.10.0.1. When I try this code:
def messages = KafkaUtils.createDirectStream(jssc,
    String.class,
    String.class,
    StringDecoder.class,
    StringDecoder.class,
    kafkaParams,
    topicsSet)
I've received this exception:
INFO consumer.SimpleConsumer: Reconnect due to socket error: java.nio.channels.ClosedChannelException
Exception in thread "main" org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for Set([stream,0])
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
at org.apache.spark.streaming.kafka.KafkaUtils$createDirectStream.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at com.privowny.classification.jobs.StreamingClassification.main(StreamingClassification.groovy:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I've tried to search for answers on this site, but the question seems to have been left unanswered. Could you give me some suggestions on what to do? The topic stream is not empty.
I also came across this issue. You have to change some configuration on your Kafka broker.
Go to your Kafka configuration and configure listeners in the Socket Server Settings section, in the format:
listeners=PLAINTEXT://[hostname or IP]:[port]
For example:
listeners=PLAINTEXT://192.168.1.24:9092
I know from experience that one thing that can cause this error message is the Spark driver being unable to reach the Kafka brokers via the brokers' advertised hostname (advertised.host.name in server.properties). This is the case even if the Spark config identifies the Kafka brokers by different addresses that work. All the brokers' advertised hostnames have to be reachable from the Spark driver.
This happened to me because the cluster ran in a separate AWS account and the brokers identified themselves using internal DNS records, which had to be copied to the other AWS account. Before doing that, I got this error message because the Spark driver couldn't reach the brokers to ask for their latest offsets, even though we used the brokers' private IP addresses in the Spark config.
Hope that helps someone.
I was running Kafka from HDP, so the default port was 6667 instead of 9092. When I switched the bootstrap.servers port to <hostname>:6667, the issue was resolved.
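In code, this means the broker list passed to createDirectStream must use the advertised host and port, e.g. (a sketch; the hostname is a placeholder):

// Must match the brokers' advertised listeners (host and port).
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "<broker-host>:6667"
)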

Spark on Amazon EMR: "Timeout waiting for connection from pool"

I'm running a Spark job on a small three-server Amazon EMR 5 (Spark 2.0) cluster. My job runs for an hour or so, then fails with the error below. I can manually restart it, and it works, processes more data, and eventually fails again.
My Spark code is fairly simple and does not use any Amazon or S3 APIs directly; it just passes S3 path strings to Spark, and Spark uses S3 internally.
My Spark program just does the following in a loop: load data from S3 -> process -> write data to a different location on S3.
My first suspicion is that some internal Amazon or Spark code is not properly disposing of connections and the connection pool becomes exhausted.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:618)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy44.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy45.getConnection(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
... 41 more
I encountered this issue with a very trivial program on EMR (read data from S3, filter, write to S3).
I could solve it by using the S3A file system implementation and setting fs.s3a.connection.maximum to 100 to have a bigger connection pool.
(the default is 15; see "Hadoop-AWS module: Integration with Amazon Web Services" for more configuration properties)
This is how I set the configuration:
// in Scala
val hc = sc.hadoopConfiguration
// in Python (not tested)
hc = sc._jsc.hadoopConfiguration()
// setting the config is the same for both languages
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)
To make it work, the S3 URIs passed to Spark have to start with s3a://...
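For illustration, a minimal read/filter/write round trip using the s3a scheme (bucket, paths, and column name are placeholders):

import spark.implicits._

val df = spark.read.parquet("s3a://my-bucket/input/")
df.filter($"value" > 0).write.parquet("s3a://my-bucket/output/")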
This issue may also be resolved, while remaining on EMRFS, by setting fs.s3.maxConnections to something larger than the default of 500 in the emrfs-site configuration:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/
If you are using the Java SDK to spin up an EMR cluster, you can set this using the withConfigurations method (which is much easier than modifying files manually). See also https://stackoverflow.com/a/52595058/1586965
You can check that this has been set correctly in the Configurations tab of the EMR console.
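For reference, setting it via a cluster configuration classification would look like this (the value 1000 is illustrative):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.maxConnections": "1000"
    }
  }
]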
