Spark - S3 - Access & secret key configured in code is overridden by IAM role - apache-spark

I am working on a Spark job in Java which explicitly specifies an IAM user's access and secret key at runtime. It can read and write to S3 with no issue on my local machine using those keys. However, when I promote the job to Cloudera Oozie, it keeps picking up the IAM role attached to the EC2 instance (which can only read certain S3 slices). The goal is to set up one IAM user per tenant, each able to access only its own slice of S3 under the same bucket (multi-tenancy). Can anyone advise how to prevent the IAM role from overriding the IAM user credentials in Spark? Thanks in advance.
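A minimal sketch of one way to do this, assuming the S3A connector on a reasonably recent Hadoop release; the bucket name and keys are placeholders, not values from the question. Pinning fs.s3a.aws.credentials.provider to SimpleAWSCredentialsProvider keeps S3A from falling back to the EC2 instance-profile (IAM role) credentials, and the per-bucket form scopes the tenant's keys to a single bucket:
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.SparkSession;

public class TenantS3Access {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tenant-s3-access")
                .getOrCreate();
        Configuration hc = spark.sparkContext().hadoopConfiguration();
        // Use only the explicitly supplied keys; never consult the instance profile.
        hc.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
        // Per-bucket options (placeholder bucket name) keep the tenant keys
        // scoped to that tenant's bucket only.
        hc.set("fs.s3a.bucket.tenant-bucket.access.key", "<tenant access key>");
        hc.set("fs.s3a.bucket.tenant-bucket.secret.key", "<tenant secret key>");
        spark.read().text("s3a://tenant-bucket/some/prefix/").show();
    }
}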

Related

Does Spark allow to use Amazon Assumed Role and STS temporary credentials for Glue cross account access on EMR

We are trying to connect to a cross-account AWS Glue catalog from an EMR Spark job.
From my research, AWS supports cross-account access to the Glue catalog in two ways:
IAM role-based (this is not working for me)
Resource-based policy (this worked for me)
The problem scenario: Account A creates an EMR cluster with its role role_Account_A, and role_Account_A wants to access the Glue catalog of Account B.
Account A creates the EMR cluster with role role_Account_A.
Account B has a role role_Account_B which has access to Glue and S3, with role_Account_A in its trusted entities.
role_Account_A has an sts:AssumeRole policy for resource role_Account_B.
Using the SDK, we are able to assume role role_Account_B from role_Account_A and obtain temporary credentials.
EMR has the configuration [{"classification":"spark-hive-site","properties":{"hive.metastore.glue.catalogid":"Account_B", "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]
import org.apache.spark.sql.SparkSession;

// assumedcreds holds the temporary credentials returned by the STS AssumeRole call described above.
SparkSession sparkSession = SparkSession.builder().appName("testing glue")
        .enableHiveSupport()
        .getOrCreate();
// Hand the temporary credentials to the S3A connector.
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", assumedcreds.getAccessKeyId());
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", assumedcreds.getSecretAccessKey());
sparkSession.sparkContext().hadoopConfiguration().set("fs.s3a.session.token", assumedcreds.getSessionToken());
sparkSession.sparkContext().conf().set("fs.s3a.access.key", assumedcreds.getAccessKeyId());
sparkSession.sparkContext().conf().set("fs.s3a.secret.key", assumedcreds.getSecretAccessKey());
sparkSession.sparkContext().conf().set("fs.s3a.session.token", assumedcreds.getSessionToken());
sparkSession.sql("show databases").show(10, false);
The error that we are getting is
Caused by: MetaException(message:User: arn:aws:sts::Account_A:assumed-role/role_Account_A/i-xxxxxxxxxxxx is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:XX-XXXX-X:Account_B:catalog
because no resource-based policy allows the glue:GetDatabase action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: X93Xbc64-0153-XXXX-XXX-XXXXXXX))
Questions:
Does Spark support Glue-specific authentication properties, for example aws.glue.access.key?
As per the error, Spark is not using the assumed role role_Account_B; it uses role_Account_A, with which the EMR cluster was created. Can we make it use the assumed role role_Account_B?
I will update the question details if I am missing something.
I believe you have an EMR instance profile role in Account A. If so, follow these steps and cross-account access should work.
In Account B:
Under Glue, go to Settings and add the EMR instance profile role from Account A as a principal, granting it access to Account B's Glue catalog and S3 (see the policy sketch after this list). It is recommended to grant access only to the buckets you actually need.
In the bucket policy of the bucket backing the Glue table, add the EMR instance profile role from Account A as a principal and grant read/write access.
Now if you run the EMR job in Account A, you'll see it running with cross-account access.
It works for our purposes. Try it out.
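To make the Glue step concrete, here is a rough sketch of what the resource policy under Account B's Glue Settings could look like; the region, account IDs, and action list are illustrative placeholders, so adjust them to the calls your job actually makes:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::Account_A:role/role_Account_A"
      },
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:<region>:Account_B:catalog",
        "arn:aws:glue:<region>:Account_B:database/*",
        "arn:aws:glue:<region>:Account_B:table/*/*"
      ]
    }
  ]
}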

AWS S3 Cross-account file transfer via Spark: Getting access denied on the transferred objects in the destination bucket

I have a use-case where I want to leverage Spark to transfer files between S3 Buckets in 2 different AWS Accounts.
I have Spark running in a different AWS Account (say Account A). I do not have access to this AWS Account.
I have AWS Account B which is holding the source S3 bucket (S3_SOURCE_BUCKET) and AWS Account C that is holding destination S3 bucket (S3_DESTINATION_BUCKET).
I have created an IAM role in Account C (say: CrossAccountRoleC) to read and write from the destination S3 bucket.
I have set up the primary IAM role in Account B (say: CrossAccountRoleB):
Added Account A's Spark IAM role as a trusted entity
Added read/write permissions for the S3 buckets in both Account B and Account C
Added an inline policy to assume CrossAccountRoleC
Added CrossAccountRoleB as a trusted entity in CrossAccountRoleC
Also added CrossAccountRoleB to the bucket policy on S3_DESTINATION_BUCKET.
I am using Hadoop's FileUtil.copy to transfer files between the source and destination S3 buckets. While the transfer happens successfully, I am getting 403 Access Denied on the copied objects.
When I specify hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl"), I get an error that says "The requester is not authorized to perform action [ s3:GetObject, s3:PutObject, or kms:Decrypt ] on resource [ s3 Source or Sink ]". From the logs, it seems that the operation is failing while writing to the destination bucket.
What am I missing?
You are better off using s3a per-bucket settings and just using a different set of credentials for the different buckets. Not as "pure" as IAM role games, but since nobody understands IAM roles or knows how to debug them, it's more likely to work.
(Do not take the fact that the IAM roles aren't working as a personal skill failing. Everyone fears support issues related to them.)
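A minimal sketch of the per-bucket approach, assuming the S3A connector on a Hadoop version with per-bucket configuration; the bucket names and keys below are placeholders:
import org.apache.hadoop.conf.Configuration;

public class PerBucketCredentials {
    // Each bucket gets its own credentials, so the source and destination
    // buckets can live in different AWS accounts.
    public static void configure(Configuration hadoopConf) {
        hadoopConf.set("fs.s3a.bucket.source-bucket.access.key", "<Account B access key>");
        hadoopConf.set("fs.s3a.bucket.source-bucket.secret.key", "<Account B secret key>");
        hadoopConf.set("fs.s3a.bucket.destination-bucket.access.key", "<Account C access key>");
        hadoopConf.set("fs.s3a.bucket.destination-bucket.secret.key", "<Account C secret key>");
        // Optionally hand ownership of uploaded objects to the destination
        // bucket owner (the S3A counterpart of the fs.s3.canned.acl setting above).
        hadoopConf.set("fs.s3a.acl.default", "BucketOwnerFullControl");
    }
}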

Uploading a file through boto3 upload_file api to AWS S3 bucket gives "Anonymous users cannot initiate multipart uploads. Please authenticate." error

I have a file of around 16mb in size and am using python boto3 upload_file api to upload this file into the S3 bucket. However, I believe the API is internally choosing multipart upload and gives me an "Anonymous users cannot initiate multipart upload" error.
In some of the runs of the application, the file generated may be smaller (few KBs) in size.
What's the best way to handle this scenario in general or fix the error I mentioned above?
I currently have a Django application that generates a file when run and uploads this file directly into an S3 bucket.
OK, so unless you've opened your S3 bucket up for the world to upload to (which is very much NOT recommended), it sounds like you need to set up the permissions for access to your S3 bucket correctly.
How to do that will vary a little depending on how you're running this application - so let's cover off a few options - in all cases you will need to do two things:
Associate your script with an IAM Principal (an IAM User or an IAM Role depending on where / how this script is being run).
Add permissions for that principal to access the bucket (this can be accomplished either through an IAM Policy, or via the S3 Bucket Policy)
Lambda Function - You'll need to create an IAM Role for your application and associate it with your Lambda function. Boto3 should be able to assume this role transparently for you once configured.
EC2 Instance or ECS Task - You'll need to create an IAM Role for your application and associate it with your EC2 instance/ECS Task. Boto3 will be able to access the credentials for the role via instance metadata and should automatically assume the role.
Local Workstation Script - If you're running this script from your local workstation, then boto3 should be able to find and use the credentials you've setup for the AWS CLI. If those aren't the credentials you want to use you'll need to generate an access key and secret access key (be careful how you secure these if you go this route, and definitely follow least privilege).
Now, once you've got your principal, you can either attach an IAM policy to the IAM User or Role that grants Allow permissions to upload to the bucket, or you can add a clause to the Bucket Policy that grants that IAM User or Role access. You only need to do one of these.
Multi-part uploads are performed under the same s3:PutObject permission as single-part uploads (though if your files are small I'd be surprised it was using multi-part for them). If you're using KMS, one small trick to be aware of is that you need both Encrypt and Decrypt permissions on the KMS key when encrypting a multi-part upload.
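As a rough illustration of the IAM-policy route (the bucket name and key ARN below are placeholders, and the KMS statement only applies if the bucket uses SSE-KMS; kms:GenerateDataKey is typically needed alongside the Encrypt/Decrypt permissions mentioned above):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUploads",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-upload-bucket/*"
    },
    {
      "Sid": "AllowKmsForEncryptedUploads",
      "Effect": "Allow",
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
    }
  ]
}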

manage dataproc cluster access using service account and IAM roles

I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project.
Let's say I have created a service account named 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com'
and then I create a Dataproc cluster and assign the service account to it.
Now I have created two GCS locations named
'gs://my-test-bucket/spark-input-files/'
'gs://my-test-bucket/spark-output-files/'
These hold input files that need to be accessed by Spark jobs running on my Dataproc cluster, and also act as locations where my Spark jobs can write output files.
I think I have to edit my bucket permissions as shown in the given link.
Edit Bucket Permission
I want my Spark jobs to read files only from the specific location 'gs://my-test-bucket/spark-input-files/',
and, if they write to a GCS bucket, to write only to 'gs://my-test-bucket/spark-output-files/'.
The question here is (most likely a question related to SRE resources):
What IAM permissions need to be added to my Dataproc service account
data-proc-service-account@my-cloud-project.iam.gserviceaccount.com on the IAM console page?
And what read/write permissions need to be added for the given buckets, which I believe has to be configured by adding a member and assigning the right role to it (as shown in the link mentioned above)?
Do I need to add my Dataproc service account as a member with the two roles below? Will this work?
Storage Object Creator for bucket 'gs://my-test-bucket/spark-output-files/'
Storage Object Viewer for bucket 'gs://my-test-bucket/spark-input-files/'
Also let me know in case I have missed anything or something better can be done.
According to the Dataproc IAM doc:
To create a cluster with a user-specified service account, the specified service
account must have all permissions granted by the Dataproc Worker role. Additional
roles may be required depending on configured features.
The dataproc.worker role has a list of GCS-related permissions, including things like storage.objects.get and storage.objects.create, and these apply to all buckets.
What you want to do is give your service account almost the same permissions as the dataproc.worker role, but limit all the storage.* permissions to the Dataproc staging bucket. Then, in addition, add write access to your output bucket and read access to your input bucket.
Or you can use a different service account than the Dataproc service account when you run your Spark job. This job-specific service account will only need read access to the input bucket and write access to the output bucket. Assuming you are using the GCS connector (which comes pre-installed on Dataproc clusters) to access GCS, you can follow the instructions found here. But in this case you will have to distribute the service account key across worker nodes or put it in GCS/HDFS.
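A minimal sketch of that second approach, pointing the GCS connector at a job-specific service account key via the Hadoop configuration; the keyfile path is a placeholder and the exact property names can vary by connector version:
import org.apache.hadoop.conf.Configuration;

public class JobServiceAccount {
    // Use a job-specific service account key instead of the cluster's default
    // service account. The keyfile must be readable on every worker node.
    public static void configure(Configuration hadoopConf) {
        hadoopConf.set("google.cloud.auth.service.account.enable", "true");
        hadoopConf.set("google.cloud.auth.service.account.json.keyfile",
                "/path/to/job-service-account-key.json");
    }
}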

Running EMR Spark With Multiple S3 Accounts

I have an EMR Spark Job that needs to read data from S3 on one account and write to another.
I split my job into two steps:
Read data from S3 (no credentials required because my EMR cluster is in the same account).
Read the data from the local HDFS written by step 1 and write it to an S3 bucket in another account.
I've attempted setting the hadoopConfiguration:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","<your secretkey>")
And exporting the keys on the cluster:
$ export AWS_SECRET_ACCESS_KEY=
$ export AWS_ACCESS_KEY_ID=
I've tried both cluster and client mode as well as spark-shell with no luck.
Each of them returns an error:
ERROR ApplicationMaster: User class threw exception: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
Access Denied
The solution is actually quite simple.
Firstly, EMR clusters have two roles:
A service role (EMR_DefaultRole) that grants permissions to the EMR service (eg for launching Amazon EC2 instances)
An EC2 role (EMR_EC2_DefaultRole) that is attached to EC2 instances launched in the cluster, giving them access to AWS credentials (see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances)
These roles are explained in: Default IAM Roles for Amazon EMR
Therefore, each EC2 instance launched in the cluster is assigned the EMR_EC2_DefaultRole role, which makes temporary credentials available via the Instance Metadata service. (For an explanation of how this works, see: IAM Roles for Amazon EC2.) Amazon EMR nodes use these credentials to access AWS services such as S3, SNS, SQS, CloudWatch and DynamoDB.
Secondly, you will need to add permissions to the Amazon S3 bucket in the other account to permit access via the EMR_EC2_DefaultRole role. This can be done by adding a bucket policy to the S3 bucket (here named other-account-bucket) like this:
{
  "Id": "Policy1",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1",
      "Action": "s3:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::other-account-bucket",
        "arn:aws:s3:::other-account-bucket/*"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::ACCOUNT-NUMBER:role/EMR_EC2_DefaultRole"
        ]
      }
    }
  ]
}
This policy grants all S3 permissions (s3:*) to the EMR_EC2_DefaultRole role that belongs to the account matching the ACCOUNT-NUMBER in the policy, which should be the account in which the EMR cluster was launched. Be careful when granting such permissions -- you might want to grant permissions only to GetObject rather than granting all S3 permissions.
That's all! The bucket in the other account will now accept requests from the EMR nodes because they are using the EMR_EC2_DefaultRole role.
Disclaimer: I tested the above by creating a bucket in Account-A and assigning permissions (as shown above) to a role in Account-B. An EC2 instance was launched in Account-B with that role. I was able to access the bucket from the EC2 instance via the AWS Command-Line Interface (CLI). I did not test it within EMR, however it should work the same way.
With Spark you can also use assume-role to access an S3 bucket in another account, using an IAM role defined in that other account. This makes it easier for the other account's owner to manage the permissions granted to the Spark job. Managing access via S3 bucket policies can be a pain, since the access rights end up distributed across multiple locations rather than all contained within a single IAM role.
Here is the hadoopConfiguration:
"fs.s3a.credentialsType" -> "AssumeRole",
"fs.s3a.stsAssumeRole.arn" -> "arn:aws:iam::<<AWSAccount>>:role/<<crossaccount-role>>",
"fs.s3a.impl" -> "com.databricks.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.server-side-encryption-algorithm" -> "aws:kms",
"spark.hadoop.fs.s3a.server-side-encryption-kms-master-key-id" -> "arn:aws:kms:ap-southeast-2:<<AWSAccount>>:key/<<KMS Key ID>>"
External IDs can also be used as a passphrase:
"spark.hadoop.fs.s3a.stsAssumeRole.externalId" -> "GUID created by other account owner"
We were using Databricks for the above; we have not tried EMR yet.
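Note that the fs.s3a.credentialsType and com.databricks.s3a.S3AFileSystem settings above are Databricks-specific. On stock Hadoop S3A (3.1+), a rough equivalent is the assumed-role credential provider; the role ARN below is a placeholder:
import org.apache.hadoop.conf.Configuration;

public class AssumedRoleS3A {
    // The base credentials used for the STS AssumeRole call come from the
    // normal S3A provider chain (e.g. the EMR instance profile).
    public static void configure(Configuration hadoopConf) {
        hadoopConf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider");
        hadoopConf.set("fs.s3a.assumed.role.arn",
                "arn:aws:iam::<AWSAccount>:role/<crossaccount-role>");
        hadoopConf.set("fs.s3a.assumed.role.session.name", "spark-cross-account");
    }
}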
I believe you need to assign an IAM role to your compute nodes (you probably already have done this), then grant cross-account access to that role via IAM on the "Remote" account. See http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html for the details.
For controlling access to resources, IAM roles are generally managed as a standard practice, and assumed roles are used when you want to access resources in a different account. If you or your organisation follow the same approach, then you should follow https://aws.amazon.com/blogs/big-data/securely-analyze-data-from-another-aws-account-with-emrfs/.
The basic idea is to use a custom credentials provider, through which EMRFS obtains the credentials it needs to access objects in the S3 buckets.
You can go one step further and parameterize the STS and bucket ARNs for the JAR created in that blog post.
