(python) Spark .textFile(s3://...) access denied 403 with valid credentials - apache-spark

In order to access my S3 bucket I have exported my credentials:
export AWS_SECRET_ACCESS_KEY=
export AWS_ACCESS_KEY_ID=
I can verify that everything works by doing
aws s3 ls mybucket
I can also verify with boto3 that it works in Python:
import boto3

resource = boto3.resource("s3", region_name="us-east-1")
resource.Object("mybucket", "text/text.py") \
    .put(Body=open("text.py", "rb"), ContentType="text/x-py")
This works and I can see the file in the bucket.
However, when I do this with Spark:
from pyspark import SparkContext
from pyspark.sql import SQLContext

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark_context.textFile("s3://mybucket/my/path/*")
I get a nice
> Caused by: org.jets3t.service.S3ServiceException: Service Error
> Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error
> Message: <?xml version="1.0"
> encoding="UTF-8"?><Error><Code>InvalidAccessKeyId</Code><Message>The
> AWS Access Key Id you provided does not exist in our
> records.</Message><AWSAccessKeyId>[MY_ACCESS_KEY]</AWSAccessKeyId><RequestId>XXXXX</RequestId><HostId>xxxxxxx</HostId></Error>
This is how I submit the job locally:
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.11.98,org.apache.hadoop:hadoop-aws:2.7.3 test.py
Why does it work with the command line and boto3, but Spark is choking?
EDIT:
Same issue using s3a:// with
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.access.key", "xxxx")
hadoopConf.set("fs.s3a.secret.key", "xxxxxxx")
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
and the same issue using aws-java-sdk 1.7.4 and Hadoop 2.7.2.

Spark will automatically copy your AWS credentials into the s3n and s3a secrets. Apache Spark releases don't touch s3:// URLs, because in Apache Hadoop the s3:// scheme is bound to the original, now-deprecated S3 client, which is incompatible with everything else.
On Amazon EMR, s3:// is bound to Amazon's own EMR S3 connector, and the EC2 VMs provide the secrets to the executors automatically, so EMR doesn't bother with the env-var propagation mechanism. It may also be that, given how EMR sets up its authentication chain, you can't override the EC2/IAM credentials.
If you are trying to talk to S3 and you are not running in an EMR VM, then presumably you are using Apache Spark with the Apache Hadoop JARs, not the EMR versions. In that world, use s3a:// URLs to get the latest S3 client library.
If that doesn't work, look at the S3A troubleshooting section of the Apache Hadoop docs. There's a section on 403 responses there, including recommended troubleshooting steps. A 403 can be caused by classpath/JVM version problems as well as credential issues, and even clock skew between the client and AWS.
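As an illustration, here is a minimal PySpark sketch of the s3a:// route, assuming the hadoop-aws/aws-sdk packages from the question are on the classpath; the bucket and path are the ones from the question, the environment variable names are the standard AWS CLI ones, and passing the keys through spark.hadoop.* properties is just one of several ways to supply them:
import os
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("s3a-read-check")
        # hand the same credentials the AWS CLI uses to the S3A connector
        .set("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
        .set("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"]))
sc = SparkContext(conf=conf)
lines = sc.textFile("s3a://mybucket/my/path/*")
print(lines.count())
Submitted with the same spark-submit --packages line as above, this goes through the S3A connector rather than the deprecated s3 client.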

Related

403 Error while accessing s3a using Spark/hadoop

I have configured Hadoop and Spark in Docker through a k8s agent container that we use to run Jenkins jobs, and we are using AWS EKS. But while running the spark-submit job we get the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o40.exists.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxxxxx, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: xxxxxxxxxxxxxxx/xxxxxxxx
We have created a service account in k8s and added an annotation for the IAM role (an IAM role created in AWS to access S3).
We can see that it copies files from S3, but we still get this error in the job and cannot find the root cause.
Note: Spark version 2.2.1, Hadoop version 2.7.4.
Thanks
This is a five-year-old version of Spark built on an eight-year-old set of Hadoop binaries, including the S3A connector. As such, some of the binding logic to pick up IAM roles simply isn't there.
Upgrade to Spark 3.3.x with a full set of the hadoop-3.3.4 JARs and try again.
(Note that "use a recent release" is step one for any problem with an open-source application; it'd be the first action required if you ever filed a bug report.)
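For reference, a hedged sketch of what the job can look like once it is on a Spark 3.3.x / hadoop-aws 3.3.4 stack; the credentials provider named here is the AWS SDK v1 web-identity provider that IAM roles for service accounts rely on, and the bucket path is hypothetical:
from pyspark.sql import SparkSession

# sketch only: relies on the web-identity token file that EKS injects into the
# pod when the service account is annotated with an IAM role
spark = (SparkSession.builder
         .appName("eks-s3a-smoke-test")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
         .getOrCreate())
spark.read.text("s3a://my-eks-bucket/some/prefix/").show(5)  # hypothetical path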

Write to local Hive metastore instead of AWS Glue Data Catalog when developing a AWS Glue job locally

I'm trying to create a local development environment for writing Glue jobs and have followed https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html to use the amazon/aws-glue-libs:glue_libs_3.0.0_image_01 Docker image.
However, in my Glue code I also want to pull data from S3 and create a database in a metastore with Spark SQL, e.g.
spark.sql(f'CREATE DATABASE IF NOT EXISTS {database_name}')
I have managed to use a local version of AWS by using LocalStack and configuring Hadoop to use my local AWS endpoint:
spark-submit --conf spark.hadoop.fs.s3a.endpoint=localstack:4566 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key=bar \
  --conf spark.hadoop.fs.s3a.secret.key=foo \
  --conf spark.hadoop.fs.s3a.path.style.access=true
However, when calling the above Spark SQL command I get an error, because it's trying to use the real AWS Glue Data Catalog as the metastore:
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to verify existence of default database: com.amazonaws.services.glue.model.AWSGlueException: The security token included in the request is invalid. (Service: AWSGlue; Status Code: 400
I have tried to configure Spark to use a local metastore when initialising the Spark session, however it still tries to use Glue and I get the above error from AWS:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName(f"{task}")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.warehouse.dir", "/temp")
    .enableHiveSupport()
    .getOrCreate())
The main issue was that the aws-glue-libs image contained a hive-site.xml which pointed at Amazon's Glue-backed Hive metastore. To get this to work I removed that file in a Dockerfile step, and specified the full path to the local Hive store in the configuration when running spark-submit.
My Dockerfile
FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
RUN rm /home/glue_user/spark/conf/hive-site.xml
And the conf in spark-submit
--conf spark.sql.warehouse.dir=/home/glue_user/workspace/temp/db \
--conf 'spark.driver.extraJavaOptions=-Dderby.system.home=/home/glue_user/workspace/temp/hive' \
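As a quick sanity check, here is a sketch of the kind of session this should produce; the database name is made up, and the warehouse path is the one passed to spark-submit above. With the hive-site.xml removed, the CREATE DATABASE call should hit the local Derby-backed metastore instead of the Glue Data Catalog:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local-glue-dev-check")
         .config("spark.sql.warehouse.dir", "/home/glue_user/workspace/temp/db")
         .enableHiveSupport()
         .getOrCreate())
spark.sql("CREATE DATABASE IF NOT EXISTS my_local_db")  # hypothetical name
spark.sql("SHOW DATABASES").show()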

spark sqlContext read parquet S3 path not found

I am using Spark 2.3 with Scala 2.11.8 on AWS EMR and seeing "S3 path not found" even though the path exists; aws s3 ls clearly shows the directory and its contents are fine.
org.apache.spark.sql.AnalysisException: Path does not exist: s3://dev-us-east-1/data/v1/output/20190115/individual/part-00000-b8450da0-15e9-482e-b588-08d6baa0637a-c000.snappy.parquet;
val df = sqlContext.read.parquet("s3://dev-us-east-1/data/v1/output/" + dt + "/individual/part-*.snappy.parquet")
Other folders/files load just fine with the same code. I'm wondering if there are file size limits, or a memory issue masquerading as a path issue? I've also read about using s3a:// and s3n:// rather than s3://, but I am new to Spark and a quick try changing my path to s3a:// got me an ACCESS DENIED exception.

Running Spark2.3 on Kubernetes with remote dependency on S3

I am running spark-submit on Kubernetes (Spark 2.3). My problem is that the InitContainer does not download my jar file if it's specified as an s3a:// path, but it does work if I put my jar on an HTTP server and use http://. The Spark driver fails, of course, because it can't find my class (and the jar file in fact is not in the image).
I have tried two approaches:
specifying the s3a path to the jar as the argument to spark-submit, and
using --jars to specify the jar file's location on s3a, but both fail in the same way.
Edit: also, using local:///home/myuser/app.jar fails with the same symptoms.
On a failed run (dependency on s3a), I logged into the container and found the directory /var/spark-data/spark-jars/ to be empty. The init-container logs don't indicate any type of error.
Questions:
What is the correct way to specify remote dependencies on S3A?
Is S3A not supported yet? Only http(s)?
Any suggestions on how to further debug the InitContainer to determine why the download doesn't happen?

AWS EMR - Upload file into the application master

I'm using the AWS CLI and I launch a cluster with the following command:
aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --ec2-attributes KeyName=ChiaveEMR --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium
After that, I put a file onto the master node:
aws emr put --cluster-id j-NSGFSP57255P --key-pair-file "ChiaveEMR.pem" --src "./configS3.txt"
The file is located in /home/hadoop/configS3.txt.
Then I launch a step:
aws emr add-steps --cluster-id ID_CLUSTER --region us-west-2 --steps Type=Spark,Name=SparkSubmit,Args=[--deploy-mode,cluster,--master,yarn,--executor-memory,1G,--class,Traccia2014,s3://tracceale/params/traccia-22-ottobre_2.11-1.0Ale.jar,/home/hadoop/configS3.txt,30,300,2,"s3a://tracceale/Tempi1"],ActionOnFailure=CONTINUE
But I get this error:
17/02/23 14:49:51 ERROR ApplicationMaster: User class threw exception: java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
java.io.FileNotFoundException: /home/hadoop/configS3.txt (No such file or directory)
probably due to the fact that 'configS3.txt' is located on the master and not on the slaves.
How could I pass 'configS3.txt' to spark-submit script? I've tried from S3 too but it doesn't work. Any solutions? Thanks in advance
Since you are using "--deploy-mode cluster", the driver runs on a CORE/TASK instance rather than the MASTER instance, so yes, it's because you uploaded the file to the MASTER instance but then the code that's trying to access the file is not running on the MASTER instance.
Given that the error you are encountering is a FileNotFoundException, it sounds like your application code is trying to open it directly, meaning that of course you can't simply use the S3 path directly. (You can't do something like new File("s3://bucket/key") because Java has no idea how to handle this.) My assumption could be wrong though because you have not included your application code or explained what you are doing with this configS3.txt file.
Maurizio: you're still trying to fix your previous problem.
On a distributed system, you need files which are visible to all machines (which the s3:// filestore delivers), and an API which works with data in that distributed filesystem, which SparkContext.hadoopRDD() delivers. You aren't going to get anywhere by trying to work out how to get a file onto the local disk of every VM, because that's not the problem you need to fix: it's how to get your code to read data from the shared object store.
Sorry
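To make that concrete, a minimal sketch in PySpark (the bucket is the one from the question; the object key and the key=value format of configS3.txt are assumptions): read the config through Spark's filesystem layer, for example with textFile, instead of java.io.File, so it resolves no matter which instance the driver lands on:
from pyspark import SparkContext

sc = SparkContext(appName="read-config-from-s3")
# textFile goes through the Hadoop filesystem layer, so the driver can read the
# same S3 object from whichever instance it happens to run on
config_lines = sc.textFile("s3://tracceale/params/configS3.txt").collect()   # hypothetical key
config = dict(line.split("=", 1) for line in config_lines if "=" in line)    # assumed key=value format
print(config)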
