Write file to GCS using PySpark - apache-spark

I'm trying to write to Google Cloud Storage from a local PySpark session and getting this error:
FileSystem: Failed to initialize fileystem gs://vta-delta-lake/test_table: java.io.IOException: toDerInputStream rejects tag type 123
Full stack trace:
Py4JJavaError: An error occurred while calling o51.save.
: java.io.IOException: toDerInputStream rejects tag type 123
at sun.security.util.DerValue.toDerInputStream(DerValue.java:874)
at sun.security.pkcs12.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:1942)
at java.security.KeyStore.load(KeyStore.java:1445)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.util.SecurityUtils.loadKeyStore(SecurityUtils.java:80)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.util.SecurityUtils.loadPrivateKeyFromKeyStore(SecurityUtils.java:113)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:715)
at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.auth.oauth2.GoogleCredential$Builder.setServiceAccountPrivateKeyFromP12File(GoogleCredential.java:699)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromPrivateKeyServiceAccount(CredentialFactory.java:276)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialFactory.getCredential(CredentialFactory.java:401)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getCredential(GoogleHadoopFileSystemBase.java:1341)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1497)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1479)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:467)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:461)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:556)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:748)
I've configured the Google Cloud Storage Hadoop connector, and I can read from a GCS bucket without any issue in the same notebook. If I impersonate the service account, I can upload to the bucket without any issue. I just can't write to the bucket using Spark. Any ideas?

Found it - I had been messing with the Spark configuration. Originally I had set this property:
spark.hadoop.google.cloud.auth.service.account.keyfile
to point to my service account JSON file. This hadn't worked when trying to read from a bucket, so I set GOOGLE_APPLICATION_CREDENTIALS to the JSON file path instead. That got reads working. It seems, though, that when writing, the code looks for the config setting above first and errors out because it expects a P12 file. I needed to use this property instead:
spark.hadoop.google.cloud.auth.service.account.json.keyfile
Having set that and restarted PySpark, I can now write to GCS buckets.
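For reference, a minimal sketch of a local PySpark session wired up with the JSON keyfile property (this assumes the GCS connector jar is already on the classpath; the keyfile path is a placeholder, and Parquet stands in for whatever format the job actually writes):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gcs-write-test")
         .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/path/to/service-account.json")  # placeholder path to the JSON key
         .getOrCreate())

# Small test write to the bucket/path from the question.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").parquet("gs://vta-delta-lake/test_table")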

Related

Connection with BigQuery on Google Cloud Platform using Spark with Scala

I am not able to connect to a BigQuery table from Spark on GCP.
https://cloud.google.com/hadoop/bigquery-connector
I already tried the steps in the link above, providing the project ID, dataset name and table name, but still with no success. When I try to print the data, I get the error below:
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/cloud/hadoop/io/bigquery/BigQueryConfiguration
at Main.main(Main.scala:27)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 14 more
You are probably missing the spark-bigquery-connector from your classpath; you can add it with the following parameter:
gcloud dataproc jobs submit spark --cluster "$MY_CLUSTER" --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar ...
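With that jar on the classpath, a read looks roughly like this (a PySpark sketch of the connector's format("bigquery") API; the project, dataset and table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-read-test").getOrCreate()

# Fully qualified table reference: project.dataset.table (all placeholders).
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())
df.show()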

Accessing S3 bucket from local pyspark using assume role

Background: In order to allow developers to build and unit-test code in an easy-to-use environment, we built a local Spark environment with other tools integrated into it. However, we also want to access S3 and Kinesis from the local environment. When we access S3 from a PySpark application locally using assume-role (as per our security standards), it throws a Forbidden error.
FYI - we use the assume-role access pattern below for accessing resources on the AWS account.
assume-role access pattern
Code for testing access-s3-from-pyspark.py :
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("s3a_test").setMaster("local[1]")
sc = SparkContext(conf=conf)
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
    prop = iterator.next()
    hadoopConf[prop.getKey()] = prop.getValue()

for item in sorted(hadoopConf.items()):
    if "fs.s3" in item[0]:
        print(item)

path = "s3a://<your bucket>/test-file.txt"
## read the file for testing
lines = sc.textFile(path)
if not lines.isEmpty():
    lines.saveAsTextFile("test-file2.text")
Property file spark-s3.properties
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-central-1.amazonaws.com
spark.hadoop.fs.s3a.access.key <your access key >
spark.hadoop.fs.s3a.secret.key <your secret key>
spark.hadoop.fs.s3a.assumed.role.sts.endpoint sts.eu-central-1.amazonaws.com
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
spark.hadoop.fs.s3a.assumed.role.session.name testSession1
spark.hadoop.fs.s3a.assumed.role.session.duration 3600
spark.hadoop.fs.s3a.assumed.role.arn <role arn>
spark.hadoop.fs.s3.canned.acl BucketOwnerFullControl
How to run the code:
spark-submit --properties-file spark-s3.properties \
--jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar \
access-s3-from-pyspark.py
The above code returns the error below. Please note that I am able to access S3 via the CLI and boto3 using an assume-role profile or API.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 66FB4D6351898F33, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: J8lZ4qTZ25+a8/R3ZeBTrW5TDHzo98A9iUshbe0/7VcHmiaSXZ5u6fa0TvA3E7ZYvhqXj40tf74=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Questions:
Is this the right way of doing it?
Is there any other easy way to use AWS resources locally for dev and test? (I have also explored the localstack package, which works in most cases but is still not fully dependable.)
Am I using the right jars for this?
That config of spark.hadoop.fs.s3a.aws.credentials.provider is wrong.
There should only be one entry, and it should list all the AWS credential providers in one single entry.
The S3A assumed-role provider (which takes a full login and asks for an assumed role) is only in very recent Hadoop releases (3.1+), not 2.7.x, and probably doesn't do what you want. It's mostly used for dynamically creating logins with restricted rights and verifying that the S3A connector itself can cope with things.
It's good that your organisation is strict about security; it just slightly complicates life.
Assuming you can get the account ID, session token and session secret (somehow), then for Hadoop 2.8+ you can fill in the Spark defaults with this:
spark.hadoop.fs.s3a.access.key AAAIKIAAA
spark.hadoop.fs.s3a.secret.key ABCD
spark.hadoop.fs.s3a.session.token REALLYREALLYLONGVALUE
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
You'll need to create those assumed-role session secrets for as long as the sessions last, which used to be a PITA as their lifetime was <= 60 minutes; assumed roles can now last up to 12 hours - your IT team will need to increase the lifespan of any roles you want to use for that long.
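A rough PySpark sketch of that Hadoop 2.8+ path, assuming boto3 is available to call STS and that the driver does the S3 access in local mode (the role ARN and bucket path are placeholders):

import boto3
from pyspark.sql import SparkSession

# Assume the role and pull out the temporary session credentials.
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-role",  # placeholder ARN
    RoleSessionName="testSession1",
    DurationSeconds=3600,
)["Credentials"]

spark = (SparkSession.builder
         .appName("s3a_assume_role_test")
         .master("local[1]")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
         .config("spark.hadoop.fs.s3a.access.key", creds["AccessKeyId"])
         .config("spark.hadoop.fs.s3a.secret.key", creds["SecretAccessKey"])
         .config("spark.hadoop.fs.s3a.session.token", creds["SessionToken"])
         .getOrCreate())

spark.read.text("s3a://<your bucket>/test-file.txt").show()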
The Hadoop 2.7.x releases don't have that TemporaryAWSCredentialsProvider, so instead you have to:
Rely on the env var credential provider, which looks up the AWS_ environment variables. This is enabled by default, so you shouldn't need to play with spark.hadoop.fs.s3a.aws.credentials.provider at all.
Set all three of the env vars (AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SESSION_TOKEN (?)) to the values you get from an assumed-role API call.
Then run your jobs. You'll probably need to set those env vars up everywhere, I'm afraid.

Alfresco CMIS unauthorized

I'm doing an integration of Liferay and Alfresco; my goal is to use Alfresco 5.2 to store content created in Liferay DXP.
To do that, I have added this line to portal-ext.properties:
dl.store.impl=com.liferay.portal.store.cmis.CMISStore
and added the config file com.liferay.portal.store.cmis.configuration.CMISStoreConfiguration.cfg with this content:
repositoryUrl=http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/atom
credentialsUsername=admin
credentialsPassword=password
systemRootDir=Liferay
Everything worked fine, but suddenly I'm getting this error while rebooting Liferay:
21:19:37,536 ERROR [localhost-startStop-1][com_liferay_portal_store_cmis:97] [com.liferay.portal.store.cmis.CMISStore(1549)] The activate method has thrown an exception
org.apache.chemistry.opencmis.commons.exceptions.CmisUnauthorizedException: Unauthorized
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:477)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.read(AbstractAtomPubService.java:645)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.getRepositoriesInternal(AbstractAtomPubService.java:808)
at org.apache.chemistry.opencmis.client.bindings.spi.atompub.RepositoryServiceImpl.getRepositoryInfos(RepositoryServiceImpl.java:65)
at org.apache.chemistry.opencmis.client.bindings.impl.RepositoryServiceImpl.getRepositoryInfos(RepositoryServiceImpl.java:90)
at org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl.getRepositories(SessionFactoryImpl.java:135)
at org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl.getRepositories(SessionFactoryImpl.java:112)
at com.liferay.portal.store.cmis.CMISStore.createSession(CMISStore.java:591)
at com.liferay.portal.store.cmis.CMISStore.activate(CMISStore.java:475)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.felix.scr.impl.inject.BaseMethod.invokeMethod(BaseMethod.java:224)
at org.apache.felix.scr.impl.inject.BaseMethod.access$500(BaseMethod.java:39)
at org.apache.felix.scr.impl.inject.BaseMethod$Resolved.invoke(BaseMethod.java:617)
at org.apache.felix.scr.impl.inject.BaseMethod.invoke(BaseMethod.java:501)
at org.apache.felix.scr.impl.inject.ActivateMethod.invoke(ActivateMethod.java:302)
at org.apache.felix.scr.impl.inject.ActivateMethod.invoke(ActivateMethod.java:294)
at org.apache.felix.scr.impl.manager.SingleComponentManager.createImplementationObject(SingleComponentManager.java:297)
at org.apache.felix.scr.impl.manager.SingleComponentManager.createComponent(SingleComponentManager.java:108)
at org.apache.felix.scr.impl.manager.SingleComponentManager.getService(SingleComponentManager.java:906)
at org.apache.felix.scr.impl.manager.SingleComponentManager.getServiceInternal(SingleComponentManager.java:879)
at org.apache.felix.scr.impl.manager.SingleComponentManager.getService(SingleComponentManager.java:823)
at org.eclipse.osgi.internal.serviceregistry.ServiceFactoryUse$1.run(ServiceFactoryUse.java:212)
at java.security.AccessController.doPrivileged(Native Method)
What does unauthorized mean in this context?

Cross region S3 access from AWS EMR Spark

My EMR cluster is in us-west-1 but my S3 bucket is in us-east-1, and I'm getting the error below.
I have tried s3://{bucketname}.s3.amazon.com, but that tries to create a new bucket with s3.amazon.com in the name.
How can I access an S3 bucket cross-region?
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Moved Permanently (Service: Amazon S3; Status Code: 301; Error Code: 301 Moved Permanently; Request ID: FB1139D9BD8F409B), S3 Extended Request ID: pWK3X9BBRp8BLlXEHOx008RCdlZC64YFTounDYGtnwsAneR0IDP1Z/gmDudRoqWhDArfYLNRxk4=
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy38.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:520)
This solution is working for me on emr-5.0.0/emr-5.0.3:
Add the following property to the core-site configuration
"fs.s3n.endpoint":"s3.amazonaws.com"
I reached out to the AWS support team, and the TL;DR is that they are aware of this problem and are currently working on it, hoping to fix it in the next EMR release, but I have no ETA.
For "s3a", you can use custom s3 end points within spark at runtime but this does not work for "s3" or "s3n".
Also, you can configure EMR to point to another S3 region at creation time, but once you configure it that way, you are stuck with that region.
This EMRFS regional bind applies from EMR 4.7.2 onward, according to the support team.
As suggested by @codingtwinky in the comments, EMR 4.6.0 does not have this problem in the emr.hadoop.fs layer. My Hadoop jobs now work on EMR 4.6.0, but do not work on 5.0.0 or 4.7.0.

Spark on Amazon EMR: "Timeout waiting for connection from pool"

I'm running a Spark job on a small three-server Amazon EMR 5 (Spark 2.0) cluster. My job runs for an hour or so, then fails with the error below. I can manually restart it and it works, processes more data, and eventually fails again.
My Spark code is fairly simple and is not using any Amazon or S3 APIs directly. My Spark code passes S3 text string paths to Spark and Spark uses S3 internally.
My Spark program just does the following in a loop: Load data from S3 -> Process -> Write data to different location on S3.
My first suspicion is that some internal Amazon or Spark code is not properly disposing of connections and the connection pool becomes exhausted.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:618)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy44.retrieveMetadata(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:85)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at sun.reflect.GeneratedMethodAccessor85.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy45.getConnection(Unknown Source)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
... 41 more
I encountered this issue with a very trivial program on EMR (read data from S3, filter, write to S3).
I could solve it by using the S3A file system implementation and setting fs.s3a.connection.maximum to 100 to have a bigger connection pool.
(default is 15; see Hadoop-AWS module: Integration with Amazon Web Services for more config properties)
This is how I set the configuration:
// in Scala
val hc = sc.hadoopConfiguration
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)

# in Python (not tested); setting the config is the same for both languages
hc = sc._jsc.hadoopConfiguration()
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)
To make it work, the S3 URIs passed to Spark have to start with s3a://...
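Putting it together, a minimal version of the loop the question describes (load from S3, process, write to a different S3 location) looks like this (a sketch; the bucket and prefixes are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-pool-test").getOrCreate()
hc = spark.sparkContext._jsc.hadoopConfiguration()
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hc.setInt("fs.s3a.connection.maximum", 100)

# Load from S3 -> process -> write to a different S3 location, all over s3a.
df = spark.read.text("s3a://my-bucket/input/")
df.filter(df["value"].contains("ERROR")).write.mode("overwrite").text("s3a://my-bucket/output/")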
This issue may also be resolved while remaining on EMRFS by setting fs.s3.maxConnections to something larger than the default 500 in emrfs-site config
https://aws.amazon.com/premiumsupport/knowledge-center/emr-timeout-connection-wait/
If you are using the Java SDK to spin up an EMR cluster, you can set this using the withConfigurations method (which is much easier than modifying the files manually). See also https://stackoverflow.com/a/52595058/1586965
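The same thing can be done from Python with boto3 when creating the cluster; a hedged sketch where the cluster name, release label, instance types, roles and the connection count are all placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

emr.run_job_flow(
    Name="spark-cluster-with-bigger-emrfs-pool",     # placeholder name
    ReleaseLabel="emr-5.0.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",            # placeholder instance types
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {"fs.s3.maxConnections": "1000"},
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)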
You can check that this has been set correctly in the Configurations tab of the EMR console.
