I have a Spark job that reads files from an S3 bucket, formats them, and places them in another S3 bucket. I'm using the SparkSession spark.read.csv and spark.write.csv functionality to accomplish this.
When I read the files, I need to use one IAM role (via assume role), and when I write the files, I need to drop the assumed role and revert to my default role.
Is this possible within the same spark session?
And if not, is there another way to do this?
Any and all help is appreciated!
The S3A connector in Hadoop 2.8+ supports per-bucket settings, so you can use different login options for different buckets.
At some point (maybe around then, very much by Hadoop 3) the AssumedRoleCredentialProvider was added: it takes a set of full credentials and calls AssumeRole for a given role ARN, so it interacts with S3 under that role instead.
It should be a matter of:
Make sure your Hadoop JARs are recent.
Set the base settings with your full login.
Add a per-bucket setting for the source bucket to use the assumed-role credential provider with the chosen ARN.
Make sure things work from the Hadoop command line before trying to get submitted jobs to work.
Then submit the job; a rough configuration sketch follows below.
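Here is a minimal sketch of that configuration from PySpark, assuming Hadoop 3.1+ S3A jars; the bucket names, keys and role ARN are placeholders, not values from the question.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cross-role-read-write")
    # Base settings: your full login (these could equally come from env vars or an instance profile).
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    # Per-bucket override: read the source bucket under the assumed role.
    .config("spark.hadoop.fs.s3a.bucket.source-bucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config("spark.hadoop.fs.s3a.bucket.source-bucket.assumed.role.arn",
            "arn:aws:iam::111122223333:role/source-read-role")   # hypothetical ARN
    .getOrCreate()
)

df = spark.read.csv("s3a://source-bucket/input/")   # read under the assumed role
df.write.csv("s3a://dest-bucket/output/")           # write under the base credentials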
I have a file of around 16 MB in size and am using the Python boto3 upload_file API to upload this file into the S3 bucket. However, I believe the API is internally choosing multipart upload, and it gives me an "Anonymous users cannot initiate multipart upload" error.
In some of the runs of the application, the file generated may be smaller (few KBs) in size.
What's the best way to handle this scenario in general or fix the error I mentioned above?
I currently have a Django application that generates a file when run and uploads this file directly into an S3 bucket.
Ok, so unless you've opened your S3 bucket up for the world to upload to (which is very much NOT recommended), it sounds like you need to set up the permissions for access to your S3 bucket correctly.
How to do that will vary a little depending on how you're running this application, so let's cover a few options. In all cases you will need to do two things:
Associate your script with an IAM Principal (an IAM User or an IAM Role depending on where / how this script is being run).
Add permissions for that principal to access the bucket (this can be accomplished either through an IAM Policy or via the S3 Bucket Policy).
Lambda Function - You'll need to create an IAM Role for your application and associate it with your Lambda function. Boto3 should be able to assume this role transparently for you once configured.
EC2 Instance or ECS Task - You'll need to create an IAM Role for your application and associate it with your EC2 instance/ECS Task. Boto3 will be able to access the credentials for the role via instance metadata and should automatically assume the role.
Local Workstation Script - If you're running this script from your local workstation, then boto3 should be able to find and use the credentials you've set up for the AWS CLI. If those aren't the credentials you want to use, you'll need to generate an access key and secret access key (be careful how you secure these if you go this route, and definitely follow least privilege).
Now, once you've got your principal, you can either attach an IAM policy to the IAM User or Role that grants it permission to upload to the bucket, or you can add a clause to the Bucket Policy that grants that IAM User or Role access. You only need to do one of these.
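If you go the IAM policy route, a minimal sketch of attaching an inline policy with boto3 might look like the following; the user name, policy name and bucket ARN are placeholders, and a bucket policy would work just as well.

import json
import boto3

iam = boto3.client("iam")

# Inline policy allowing uploads to a single bucket (placeholder names throughout).
upload_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::my-upload-bucket/*",
    }],
}

iam.put_user_policy(
    UserName="my-django-app-user",      # or attach an equivalent policy to the role instead
    PolicyName="allow-s3-upload",
    PolicyDocument=json.dumps(upload_policy),
)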
Multipart uploads are authorized by the same s3:PutObject permission as single-part uploads (though if your files are small I'd be surprised it was using multipart for them). If you're using KMS, one small trick to be aware of is that you need permission to use the KMS key for both Encrypt and Decrypt when encrypting a multipart upload.
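For the upload itself, here is a rough sketch that lets boto3 resolve credentials from the attached role or CLI profile (the "Anonymous users" error usually means no credentials were found at all) and sets an explicit multipart threshold; the bucket, key and file path are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")   # resolves credentials from the role, env vars or ~/.aws config

# Only switch to multipart for objects larger than 16 MB (adjust to taste).
config = TransferConfig(multipart_threshold=16 * 1024 * 1024)

s3.upload_file("report.csv", "my-upload-bucket", "exports/report.csv", Config=config)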
I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project.
Let's say I have created a service account named 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com', and then I create a Dataproc cluster and assign the service account to it.
Now I have created two GCS buckets named:
'gs://my-test-bucket/spark-input-files/'
'gs://my-test-bucket/spark-output-files/'
These buckets hold some input files which need to be accessed by Spark jobs running on my Dataproc cluster, and they also act as a location where my Spark jobs can write some output files.
I think I have to go and edit my bucket permissions as shown in the given link.
Edit Bucket Permission
I want my Spark jobs to be able to read files only from this specific bucket 'gs://my-test-bucket/spark-input-files/',
and if they are writing to a GCS bucket, they can only write to 'gs://my-test-bucket/spark-output-files/'.
The question here is (most likely a question related to SRE resources):
What IAM permissions need to be added to my Dataproc service account
data-proc-service-account@my-cloud-project.iam.gserviceaccount.com on the IAM console page,
and what read/write permissions need to be added for the given specific buckets? I believe this has to be configured by adding a member and assigning the right permissions to it (as shown in the link mentioned above).
Do I need to add my Dataproc service account as a member and then add the two roles below? Will this work?
Storage Object Creator for bucket 'gs://my-test-bucket/spark-output-files/'
Storage Object Viewer for bucket 'gs://my-test-bucket/spark-input-files/'
Also let me know in case I have missed anything or something better can be done.
According to the Dataproc IAM doc:
To create a cluster with a user-specified service account, the specified service
account must have all permissions granted by the Dataproc Worker role. Additional
roles may be required depending on configured features.
The dataproc.worker role has a list of GCS-related permissions, including things like storage.objects.get and storage.objects.create, and these apply to any bucket.
What you want to do, is to give your service account almost identical permissions to dataproc.worker role, but limit all the storage.xxx.xxx permissions to the Dataproc staging bucket. Then in addition, add write access to your output bucket and read access to your input bucket.
Or you can use a different service account than the Dataproc service account when you run your Spark job. This job specific service account will only need the read access to input bucket and write access to output bucket. Assuming you are using the GCS connector (which comes pre-installed on Dataproc clusters) to access GCS, you can follow the instructions found here. But in this case you will have to distribute the service account key across worker nodes or put it in GCS/HDFS.
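For the second approach, a minimal sketch of pointing the GCS connector at a job-specific service account key from PySpark might look like the following; it assumes the classic google.cloud.auth.* connector properties, and the key file path is a placeholder. As noted above, the key file has to be available on every worker node (or staged in GCS/HDFS).

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bucket-scoped-job")
    # Use an explicit service account key instead of the cluster's Dataproc service account.
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/job-service-account-key.json")   # hypothetical path
    .getOrCreate()
)

df = spark.read.csv("gs://my-test-bucket/spark-input-files/")
df.write.csv("gs://my-test-bucket/spark-output-files/")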
I am running an Oozie job that previously was running fine. And now I have a permission denied error when accessing S3 files. I am just trying to figure out which credentials it is using and where to fix them.
As far as I can tell, credentials seem to come from several locations and I am not sure of the order of precedence (e.g. ~/.aws/credentials, environment variables, Hadoop configuration, IAM role, etc.).
Is there a way to tell which credentials are actively being used? Is it possible to print the active AWS access key ID in the Spark logging?
AWS login details don't really get logged for security reasons.
Spark submit will pick up the AWS_ env vars from your desktop and set the fs.s3a values, overriding any in there.
In the s3a connector, the order is
secrets in the URI (bad, avoid, removed from recent releases)
fs.s3a properties
env vars
IAM credentials supplied to an EC2 VM
You can configure the list of authentication providers to change the order, remove entries, etc.
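As a sketch, pinning the provider chain from PySpark so that only the explicitly configured keys are used could look like the following; provider class names vary between Hadoop releases, so check the S3A documentation for your distribution, and the keys here are placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("which-credentials")
    # Only use the explicitly configured access/secret key, ignoring env vars and IAM roles.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder
    .getOrCreate()
)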
Because you are running a Cloudera cluster, you may have read this document: Make a modified copy of the configuration files.
It is better to add the following to the core-site.xml file within the <configuration> element:
<property>
  <name>fs.s3a.access.key</name>
  <value>Amazon S3 Access Key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>Amazon S3 Secret Key</value>
</property>
I need to copy files from S3 production (where I have only read access) to S3 development (where I have write access). The challenge I face is switching the roles:
while copying I need to use the prod role, and while writing I need to use the developer role.
I am trying with the code below:
import boto3

boto3.setup_default_session(profile_name='prod_role')
s3 = boto3.resource('s3')

copy_source = {
    'Bucket': 'prod_bucket',
    'Key': 'file.txt'
}

bucket = s3.Bucket('dev_bucket')
bucket.copy(copy_source, 'file.txt')
I need to know how to switch the role.
The most efficient way to move data between buckets in Amazon S3 is to use the resource.copy() or client.copy_object() command. This allows the two buckets to directly communicate (even between different regions), without the need to download/upload the objects themselves.
However, the credentials used to call the command require both read permission from the source and write permission to the destination. It is not possible to provide two different sets of credentials for this copy.
Therefore, you should pick ONE set of credentials and ensure it has the appropriate permissions. This means either:
Give the Prod credentials permission to write to the destination, or
Give the non-Prod credentials permission to read from the Prod bucket
This can be done either by creating a Bucket Policy, or by assigning permissions directly to the IAM Role/User being used.
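For example, if the non-Prod credentials have been granted read access to the Prod bucket (say, via a bucket policy), the copy can run under a single session; this is a sketch, with the profile name being a hypothetical choice and the bucket names taken from the question.

import boto3

# One set of credentials that can read from prod and write to dev.
session = boto3.session.Session(profile_name='dev_role')   # hypothetical profile name
s3 = session.resource('s3')

copy_source = {
    'Bucket': 'prod_bucket',
    'Key': 'file.txt'
}

s3.Bucket('dev_bucket').copy(copy_source, 'file.txt')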
If this is a regular task that needs to happen, you could consider automatically copying the files by using an Amazon S3 event on the source bucket to trigger a Lambda function that copies the object to the non-Prod destination immediately. This avoids the need to copy files in a batch at some later time.
I am working on creating a Spark job in Java which explicitly specifies an IAM user with access & secret keys at runtime. It can read from or write to S3 with no issue on my local machine using the keys. However, when I promote the job to Cloudera Oozie, it keeps picking up the IAM role attached to the EC2 instance (which can only read certain S3 slices). The goal is to set an IAM user per tenant who can only access their own slice in S3 under the same bucket (multi-tenancy). Can anyone advise me if you know how to prevent the IAM role from overriding the IAM user credentials in Spark? Thanks in advance.