Write to S3 from Spark without access and secret keys - apache-spark

Our EC2 server is configured to allow access to my-bucket when using DefaultAWSCredentialsProviderChain, so the following code using plain AWS SDK works fine:
AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", new File("/path/to/my-file.txt")));
Spark's S3AOutputStream uses the same SDK internally; however, trying to upload a file without providing access and secret keys doesn't work:
sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");
gives:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
<truncated>
Is there a way to force Spark to use the default credential provider chain instead of relying on access and secret keys?

Technically, that's Hadoop's s3a output stream, not Spark's. Look at the stack trace to see who to file bug reports against :)
And s3a does support instance credentials from Hadoop 2.7 onwards; the S3A documentation covers it.
If you can't connect, you need the Hadoop 2.7 JARs on your classpath, along with the exact version of the AWS SDK those JARs were built against (1.7.4, I recall).
Spark has one little feature: if you submit work with the AWS_* environment variables set, it picks them up and copies them in as the fs.s3a keys, propagating them to your cluster.
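A minimal sketch of leaning on instance credentials instead of static keys, assuming the Hadoop 2.7+ s3a JARs and the matching AWS SDK are on the classpath. The question uses Java; the same Hadoop keys can be set there via sc.hadoopConfiguration().set(...). The explicit fs.s3a.aws.credentials.provider line is an assumption that your Hadoop version supports that property (newer releases do); on 2.7, simply leaving the access/secret keys unset lets s3a fall back to instance credentials.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-instance-credentials")
    # Leave fs.s3a.access.key / fs.s3a.secret.key unset; point s3a at the
    # EC2 instance profile instead (property support depends on Hadoop version).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

rdd = spark.sparkContext.parallelize(["hello", "stackoverflow"])
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt")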

Related

MinIO, when used with Spark on an EKS cluster, gives an access denied error

I am using MinIO and have launched a MinIO gateway with Helm on an Amazon EKS Kubernetes cluster. I have added the properties below, needed on the Spark side:
sparkConf.set("fs.s3a.endpoint", "minio-k8s-service':9000");
sparkConf.set("fs.s3a.connection.ssl.enabled", "false");
sparkConf.set("fs.s3a.signing-algorithm", "S3SignerType");
sparkConf.set("s.s3a.connection.timeout", "100000");
sparkConf.set("spark.master", "k8sSchedulerURL");
sparkConf.set("spark.deploy.mode", "cluster");
sparkConf.set("fs.s3a.committer.staging.conflict-mode", "replace");sparkConf.set("spark.hadoop.fs.s3a.access.key","myaccesskey")sparkConf.set("spark.hadoop.fs.s3a.secret.key","mysecretkey")
The line below works fine when I read a file from S3:
new JavaSparkContext(session.sparkContext()).textFile("s3a://mybucket/myfolder/sample.parquet", 1)
However, if I try to load a file like below, it fails with an access denied error:
sc.read().parquet("s3a://mybucket/myfolder/myfile.parquet")
It fails with the error: getFileStatus on s3a://mybucket/myfolder/testfile.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: XYZ123XYZ; S3 Extended Request ID: null), S3 Extended Request ID: null:403 Forbidden
I am using the hadoop-aws-3.2.0 jar with Spark 3.1.1. My AWS access key and secret key are correct, and I have tried all the options I could find. It seems odd that I still get this error even though I am passing the right credentials.
Any help appreciated.
You might have to set fs.s3a.path.style.access to true. If that doesn't work for you, the MinIO team is available on their public Slack channel or by email to answer questions 24/7/365.
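For reference, a minimal PySpark sketch of that configuration with path-style access enabled (the question uses Java; the property names are the standard s3a ones, while the endpoint and credentials are placeholders copied from the question):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-k8s-service:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # bucket goes in the path, not the hostname
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.access.key", "myaccesskey")
    .config("spark.hadoop.fs.s3a.secret.key", "mysecretkey")
    .getOrCreate()
)

df = spark.read.parquet("s3a://mybucket/myfolder/myfile.parquet")
Path-style access matters for MinIO because virtual-hosted-style requests assume DNS-resolvable bucket subdomains, which a cluster-internal service name usually doesn't provide.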

Read CSV file from AWS S3

I have an EC2 instance running pyspark and I'm able to connect to it (ssh) and run interactive code within a Jupyter Notebook.
I have an S3 bucket with a CSV file that I want to read. When I attempt to read it with:
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv('https://s3.us-east-2.amazonaws.com/bucketname/filename.csv')
it throws a long Python error message followed by something related to:
Py4JJavaError: An error occurred while calling o131.csv.
Specify the S3 path along with the access key and secret key as follows:
's3n://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@my.bucket/folder/input_data.csv'
Access key-related information can be introduced in the typical username + password manner for URLs. As a rule, the access protocol should be s3a, the successor to s3n (see Technically what is the difference between s3n, s3a and s3?). Putting this together, you get
spark.read.csv("s3a://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>#bucketname/filename.csv")
As an aside, some Spark execution environments, e.g., Databricks, allow S3 buckets to be mounted as part of the file system. You can do the same when you build a cluster using something like s3fs.
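If you'd rather not embed the keys in the URL (credentials in URLs tend to leak into logs and shell history), here is a hedged sketch that sets the same s3a keys through the Hadoop configuration instead; the placeholders are the same as above:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Basics")
    # Or omit both keys entirely and rely on the EC2 instance's IAM role.
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)

df = spark.read.csv("s3a://bucketname/filename.csv", header=True)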

boto3 s3 connection error: An error occurred (SignatureDoesNotMatch) when calling the ListBuckets operation

I'm using the boto3 package to connect to S3 from outside AWS (i.e. the script is currently not being run within the AWS 'cloud', but from my MacBook Pro connecting to the relevant bucket). My code:
s3 = boto3.resource(
    "s3",
    aws_access_key_id=self.settings['CREDENTIALS']['aws_access_key_id'],
    aws_secret_access_key=self.settings['CREDENTIALS']['aws_secret_access_key'],
)
bucket = s3.Bucket(self.settings['S3']['bucket_test'])
for bucket_in_all in boto3.resource('s3').buckets.all():
    if bucket_in_all.name == self.settings['S3']['bucket_test']:
        print("Bucket {} verified".format(self.settings['S3']['bucket_test']))
Now I'm receiving this error message:
botocore.exceptions.ClientError: An error occurred (SignatureDoesNotMatch) when calling the ListBuckets operation
I'm aware of the order in which AWS credentials are checked, have tried different permutations of my environment variables and ~/.aws/credentials, and know that the credentials passed in my .py script should take precedence; however, I'm still seeing this SignatureDoesNotMatch error. Any ideas where I may be going wrong? I've also tried:
# Create a session
session = boto3.session.Session(
    aws_access_key_id=self.settings['CREDENTIALS']['aws_access_key_id'],
    aws_secret_access_key=self.settings['CREDENTIALS']['aws_secret_access_key'],
    aws_session_token=self.settings['CREDENTIALS']['session_token'],
    region_name=self.settings['CREDENTIALS']['region_name']
)
s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)
...however I also see the same error traceback.
Actually, this was partly answered by @John Rotenstein and @bdcloud; nevertheless, I need to be more specific...
The following code was not necessary in my case and was causing the error message:
# Create a session
session = boto3.session.Session(
    aws_access_key_id=self.settings['CREDENTIALS']['aws_access_key_id'],
    aws_secret_access_key=self.settings['CREDENTIALS']['aws_secret_access_key'],
    aws_session_token=self.settings['CREDENTIALS']['session_token'],
    region_name=self.settings['CREDENTIALS']['region_name']
)
The credentials stored in self.settings mirror ~/.aws/credentials. Weirdly (and just like last week, when the reverse happened), I now have access. It could be that a simple reboot of my laptop meant that the new credentials in ~/.aws/credentials (I updated them again yesterday) were then 'accepted'.
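For anyone landing here later, a minimal sketch of the two patterns that work, assuming the credentials live in ~/.aws/credentials: rely on the default provider chain, or, if you do build a Session, create the resource from that session (a bare boto3.resource('s3') ignores it). The profile name below is an assumption.
import boto3

# Option 1: let the default provider chain resolve credentials
# (~/.aws/credentials, environment variables, or an instance role).
s3 = boto3.resource("s3")

# Option 2: explicit session; note the resource is created *from* the session.
session = boto3.session.Session(profile_name="default")  # assumed profile name
s3_from_session = session.resource("s3")

for bucket in s3.buckets.all():
    print(bucket.name)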

How to mount S3 bucket on Kubernetes container/pods?

I am trying to run my Spark job on an Amazon EKS cluster. The job requires some static data (reference data) on each worker/executor node, and this reference data is available in S3.
Can somebody help me find a clean and performant way to mount an S3 bucket on the pods?
The S3 API is an option, and I am already using it for my input records and output results. But the reference data is static, so I don't want to download it on every run of my Spark job. Ideally, the first run would download the data, and subsequent jobs would check whether it is already available locally so there is no need to download it again.
We recently open-sourced a project that aims to automate these steps for you: https://github.com/IBM/dataset-lifecycle-framework
Basically, you can create a dataset:
apiVersion: com.ie.ibm.hpsys/v1alpha1
kind: Dataset
metadata:
  name: example-dataset
spec:
  local:
    type: "COS"
    accessKeyID: "iQkv3FABR0eywcEeyJAQ"
    secretAccessKey: "MIK3FPER+YQgb2ug26osxP/c8htr/05TVNJYuwmy"
    endpoint: "http://192.168.39.245:31772"
    bucket: "my-bucket-d4078283-dc35-4f12-a1a3-6f32571b0d62"
    region: "" # it can be empty
And then you will get a PVC that you can mount in your pods.
In general, you just don't do that. You should instead interact directly with the S3 API to retrieve/store what you need (probably via tools like the AWS CLI).
As you run in AWS, you can have IAM configured so that your nodes can access particular data authorized at the "infrastructure" level, or you can provide S3 access tokens via secrets/configmaps/env vars, etc.
S3 is not a filesystem, so don't expect it to behave like one (even though there are FUSE clients that emulate a filesystem, that is rarely the right solution).
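As a sketch of the "download once, reuse on later runs" idea from the question, using boto3 and the node's IAM role; the bucket, key, and local path are placeholders, not taken from the original post:
import os
import boto3

REF_BUCKET = "my-reference-bucket"     # placeholder
REF_KEY = "reference/data.csv"         # placeholder
LOCAL_PATH = "/tmp/reference-data.csv"

def ensure_reference_data():
    """Download the reference file only if it is not already cached locally."""
    if not os.path.exists(LOCAL_PATH):
        # Credentials come from the default chain (e.g. the node's IAM role).
        boto3.client("s3").download_file(REF_BUCKET, REF_KEY, LOCAL_PATH)
    return LOCAL_PATH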

Boto/Boto3: bucket.get_key(): 403 Forbidden

I am trying to connect to AWS S3 without using credentials. I attached an S3 full-access role to my instance. I want to check whether a file exists in the S3 bucket; if it doesn't, upload it. If it does exist, I want to compare its md5sum with the local file and upload a new version if they differ.
I try to get the key of the file in S3 via boto, using bucket.get_key('mykey'), and get this error:
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 193, in get_key key, resp = self._get_key_internal(key_name, headers, query_args_l)
File "/usr/local/lib/python3.5/dist-packages/boto/s3/bucket.py", line 232, in _get_key_internal response.status, response.reason, '') boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden"
I searched and added "validate=False" when getting the bucket, but this didn't resolve my issue. I'm using Python 3.5, boto and boto3.
Here is my code:
import boto3
import boto
from boto import ec2
import os
import boto.s3.connection
from boto.s3.key import Key

bucket_name = "abc"
conn = boto.s3.connect_to_region('us-west-1', is_secure=True, calling_format=boto.s3.connection.OrdinaryCallingFormat())
bucket = conn.get_bucket(bucket_name, validate=False)
key = bucket.get_key('xxxx')
print(key)
I don't know why I get that error. Please help me clear up this problem. Thanks!
Updated
I've just found the root cause of this problem: "The difference between the request time and the current time is too large".
Because of that, it couldn't get the key of the file from the S3 bucket. I updated the NTP service to synchronize the local time with UTC, and it now runs successfully.
Synchronize the time with:
sudo service ntp stop
sudo ntpdate -s 0.ubuntu.pool.ntp.org
sudo service ntp start
Thanks!
The IAM role is last in the search order. I bet you have credentials stored earlier in the search order that don't have full S3 access. Check Configuration Settings and Precedence and make sure no credentials are present there, so that the IAM role is used to fetch the credentials. Though that documentation is for the CLI, it applies to scripts too.
The AWS CLI looks for credentials and configuration settings in the following order:
Command line options – region, output format and profile can be specified as command options to override default settings.
Environment variables – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
The AWS credentials file – located at ~/.aws/credentials on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\credentials on Windows. This file can contain multiple named profiles in addition to a default profile.
The CLI configuration file – typically located at ~/.aws/config on Linux, macOS, or Unix, or at C:\Users\USERNAME\.aws\config on Windows. This file can contain a default profile, named profiles, and CLI-specific configuration parameters for each.
Container credentials – provided by Amazon Elastic Container Service on container instances when you assign a role to your task.
Instance profile credentials – these credentials can be used on EC2 instances with an assigned instance role, and are delivered through the Amazon EC2 metadata service.
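To see which of those sources boto3 actually resolved, one option (a debugging sketch, not part of the original answer) is to ask the session for its credentials and inspect the method attribute botocore sets on them:
import boto3

creds = boto3.Session().get_credentials()
if creds is None:
    print("No credentials found")
else:
    # e.g. 'env', 'shared-credentials-file', or 'iam-role', depending on the source
    print("Credential source:", creds.method)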
