Unable to read xlsx file to pyspark dataframe from azure blob storage container - apache-spark

I am trying to load data from the Azure storage container to the Pyspark data frame in Azure Databricks. When I read txt or CSV files it is working. But when I try to read .xlsx files I am getting the following issue.
Apache Spark 3.2.0, Scala 2.12
Below are the steps I am performing
spark.conf.set("fs.azure.account.key.teststorage.blob.core.windows.net",
"**********************")
It is working
df = spark.read.format("csv").option("header", "true") \
.option("inferSchema", "true") \
.load("wasbs://testcontainer#teststorage.blob.core.windows.net/data/samplefile.txt")
Not working
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true").option("inferSchema","true") \
.load("wasbs://testcontainer#teststorage.blob.core.windows.net/data/samplefile.xlsx")
Getting below error while loading xlsx files:
: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container producer in account teststorage.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1063)
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:512)
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1384)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:35)
at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$2(WorkbookReader.scala:41)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:355)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:322)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:322)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:235)
at sun.reflect.GeneratedMethodAccessor338.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Note: I am able to read from dbfs and mount point.

This problem arises due to the public access level of the blob storage container. When the container has Private or Blob public access level, the same error occurs for excel files. But when Container public access level blob storage container is used, you will be able to read the excel files without error. This is what I got while trying to reproduce the issue.
A simple solution would be either to change the public access level of the container to Container, or to mount the blob storage account to the Databricks file system (which is working for you). If you choose to change the public access level of container, go to the container in your blob storage, and you will find the option “Change access level” where you can select Container level.
Navigate to container of storage account and change access level.
Go back to Databricks, run the Dataframe read again which works without any error.
df2 = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true").option("inferSchema","true") \
.load("wasbs://<container>#<storage_acc>.blob.core.windows.net/data.xlsx")
Please refer to the following document to understand more about accessing blob storage account using Databricks.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage

Related

Is there a way to read a file from a Kerberized HDFS into a non kerberized spark cluster given the keytab file, principal and other details?

I need to read data from a Kerberized HDFS cluster using webHDFS in a non Kerberized Spark cluster. I have access to the Keytab file, username/principal, and can access any other details needed to log in. I need to programmatically log in and allow my Spark cluster to read the file(s) from the Kerberized HDFS.
It looks like I can log in using this piece of code:
System.setProperty("java.security.krb5.kdc", "<kdc>")
System.setProperty("java.security.krb5.realm", "<realm>")
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("<my-principal>", "<keytab-file-path>/keytabFile.keytab")
From the official documentation of webHDFS here, I see that I can configure the keytab file path and principal in the hadoop configuration like this:
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.principal", "<my-principal>")
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.keytab", "<keytab-file-path>/keytabFile.keytab")
After this, I should be able to read the files from the Kerberized HDFS cluster using:
val irisDFWebHDFS = sparkSession.read
.format("csv")
.option("header", "true")
.csv(s"webhdfs://<namenode-host>:<namenode-port>/user/hadoop/iris.csv")
But it still refuses to read and throws the following exception:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:490)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$300(WebHdfsFileSystem.java:135)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:721)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:796)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1741)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:365)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:585)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:608)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:898)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:794)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:1086)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:1097)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1707)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
... 44 elided
Any pointers on what I might be missing?

hdfs dfs -setfacl fails on Azure Data Lake Store

I am trying to change the Access Control List for a file located in an Azure Data Lake Store Gen 1 from an HDInsight 3.6 Cluster with the following command:
hdfs dfs -setfacl -m user:d7de0903-abcabc-44c1-8f1c-31311f5caa69:r-x adl://<myadls>.azuredatalakestore.net/user/pics/sammple.jpeg
It fails with:
-setfacl: Fatal internal error
java.lang.NullPointerException
at
org.apache.hadoop.fs.adl.HdiAdlFileSystem.shouldUseDaemonUserOrGroup(Unknown Source)
at
org.apache.hadoop.fs.adl.HdiAdlFileSystem.getEffectiveAclEntries(Unknown Source)
at org.apache.hadoop.fs.adl.HdiAdlFileSystem.modifyAclEntries(Unknown Source)
at
org.apache.hadoop.fs.shell.AclCommands$SetfaclCommand.processPath(AclCommands.java:240)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
at
org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
at
org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:297)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:356)
As
hdfs dfs -getfacl adl://<myadls>.azuredatalakestore.net/user/pics/sample.jpeg
works fine I am wondering if an ESP (Enterprise Security Package) Configuration is needed.
You can try the below steps to set acls on Azure Data Lake Gen1 as follows.
Before: Getting acls of ADLS folder example and user:bd04d6b3-xxx-xxxx-xxxx-3e30e6c99f3c:r--
hdfs dfs -getfacl adl://<adlsname>.azuredatalakestore.net/clusters/example
Set ACLs on ADLS: Use the following command to set acls for specific user as follows:
hdfs dfs -setfacl -m user:bd04d6b3-xxxx-xxxx-xxxx--3e30e6c99f3c:rwx adl://<adlsname>.azuredatalakestore.net/clusters/example
After: Now you can notice the acls are changed.
Hope this helps.

Azure Databricks - Unable to read simple blob storage file from notebook

I've set up a cluster with databricks runtime version 5.1 (includes Apache Spark 2.4.0, Scala 2.11) and Python 3. I also installed hadoop azure library (hadoop-azure-3.2.0) to the cluster.
I'm trying to read a blob stored in my blob storage account which is just a text file containing some numeric data delimited by spaces for example. I used the template generated by databricks for reading blob data
spark.conf.set(
"fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
storage_account_access_key)
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
where file_location is my blob file (https://xxxxxxxxxx.blob.core.windows.net).
I get the following error:
No filesystem named https
I tried using sc.textFile(file_location) to read in an rdd and get the same error.
Your file_location should be in the format:
"wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>"
See: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html
You need to mount the blob with external location to access it via Azure Databricks.
Reference: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#mount-azure-blob-storage-containers-with-dbfs
These three lines of code worked for me:
spark.conf.set("fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net","BIG_KEY")
df = spark.read.csv("wasbs://CONTAINER#STORAGE_ACCOUNT.blob.core.windows.net/")
df.select('*').show()
NOTE that line 2 ends with .net/ because I do not have a sub-folder.

Accessing S3 bucket from local pyspark using assume role

Background: In order to allow developers to build and unit test code on an easy-to-use environment, we built a local Spark environment with other tools integrated to it. However, we also want to access S3 and Kinesis from Local environment. When we access S3 from Pyspark application from local using assume-role(as per our security standards), it is throwing forbidden error.
FYI - We are below access pattern for accessing resources on AWS account.
assume-role access pattern
Code for testing access-s3-from-pyspark.py :
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("s3a_test").setMaster("local[1]")
sc = SparkContext(conf=conf)
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
hadoopConf = {}
iterator = sc._jsc.hadoopConfiguration().iterator()
while iterator.hasNext():
prop = iterator.next()
hadoopConf[prop.getKey()] = prop.getValue()
for item in sorted(hadoopConf.items()):
if "fs.s3" in item[0] :
print(item)
path="s3a://<your bucket>/test-file.txt"
## read the file for testing
lines = sc.textFile(path)
if lines.isEmpty() == False:
lines.saveAsTextFile("test-file2.text")
Property file spark-s3.properties
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-central-1.amazonaws.com
spark.hadoop.fs.s3a.access.key <your access key >
spark.hadoop.fs.s3a.secret.key <your secret key>
spark.hadoop.fs.s3a.assumed.role.sts.endpoint sts.eu-central-1.amazonaws.com
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
spark.hadoop.fs.s3a.assumed.role.session.name testSession1
spark.haeoop.fs.s3a.assumed.role.session.duration 3600
spark.hadoop.fs.s3a.assumed.role.arn <role arn>
spark.hadoop.fs.s3.canned.acl BucketOwnerFullControl
How to run the code:
spark-submit --properties-file spark-s3.properties \
--jars jars/hadoop-aws-2.7.3.jar,jars/aws-java-sdk-1.7.4.jar \
access-s3-from-pyspark.pyenter code here
The above code is returning below error, please note I am able to access S3 via CLI and boto3 using assume-role profile or api.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 66FB4D6351898F33, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: J8lZ4qTZ25+a8/R3ZeBTrW5TDHzo98A9iUshbe0/7VcHmiaSXZ5u6fa0TvA3E7ZYvhqXj40tf74=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Questions:
This is right way of doing?
Is there any other easy way to use AWS resources locally for dev and test (I have also explored localstack package which is working most of the cases but still not fully dependable)
Am I using right jars for this?
that config of spark.hadoop.fs.s3a.aws.credentials.provider is wrong.
There should only be one entry and it should list all the AWS credential providers in one single entry
the S3A assumed role provider (which takes a full login and asks for an assumed role) is only on very recent Hadoop releases (3.1+), not 2.7.x and probably doesn't do what you want. It's mostly used for dynamically creating logins with restricted rights and verifying that the S3A connector itself can cope with things.
It's good that your organisation is strict about security, it just slightly complicates life.
Assuming you can get the account ID, session token and session secret (somehow),
then for Hadoop 2.8+ you can fill in the spark defaults with this
spark.hadoop.fs.s3a.access.key AAAIKIAAA
spark.hadoop.fs.s3a.secret.key ABCD
spark.hadoop.fs.s3a.fs.s3a.session.token REALLYREALLYLONGVALUE
spark.hadoop.fs.s3a.aws.credentials.provider org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
You'll need to create those assumed role session secrets for as long as the session last, which used to be PITA as their life was <= 60 minutes; assumed roles can not last for up to 12 hours -your IT team will need to increase the lifespan of any roles you want to use for that long.
The hadoop 2.7.x releases don't have that TemporaryAWSCredentialsProvider, so instead you have to
Rely on the env var credential provider, which looks up the AWS_ environment variables. This is enabled by default so you shouldn't need to ply with spark.hadoop.fs.s3a.aws.credentials.provider at all
Set all three of the env vars (AWS_ACCESS_KEY, AWS_SECRET_KEY and AWS_SESSION_TOKEN (?)) to those of the values you get from an assumed role API call.
Then run your jobs. You'll probably need to set those env vars up everywhere I'm afraid.

Spark submit cluster mode from s3

I have Spark stand-alone set up on EC2 instances. I'm try to use cluster mode to submit a Spark apllication. The jar is in S3, and access to it is set up via IAM roles. I can run aws s3 cp s3://bucket/dir/foo.jar . to get the jar file - that works fine. However, when I run the following:
spark-submit --master spark://master-ip:7077 --class Foo
--deploy-mode cluster --verbose s3://bucket/dir/foo/jar
I get the error outlined below. Seeing that the boxes have IAM roles configured to allow access, what would be the correct way to submit the job? The job itself doesn't use S3 at all...the issue seems to be fetching the jar from S3.
Any help will be appreciated.
16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy5.initialize(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
I've found a workaround. I put the jar in a static http server, and use http://server/foo.jar in spark-submit. That seems to work.

Resources