Spark Delta with AWS SSO - apache-spark

What I'm trying to do:
Read from and write to S3 buckets across multiple AWS_PROFILEs.
Resources:
https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Configuring_different_S3_buckets_with_Per-Bucket_Configuration
shows how to use different credentials per bucket
shows how to use different credential providers
doesn't show how to use more than one AWS_PROFILE
https://spark.apache.org/docs/latest/cloud-integration.html#authenticating
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sso.html
No FileSystem for scheme: s3 with pyspark
What I have working so far:
AWS SSO works and I can access different resources in Python via boto3 by changing the environment variable AWS_PROFILE
Delta Spark can read from and write to S3 using Hadoop configurations
enable Delta tables for pyspark
builder.config("spark.sql.extensions",
               "io.delta.sql.DeltaSparkSessionExtension")
       .config("spark.sql.catalog.spark_catalog",
               "org.apache.spark.sql.delta.catalog.DeltaCatalog")
allow the s3 scheme for read/write
"spark.hadoop.fs.s3.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem"
use the instance profile AWS_PROFILE for one or more buckets (a consolidated sketch follows below)
"fs.s3a.bucket.{prod_bucket}.aws.credentials.provider",
"com.amazonaws.auth.InstanceProfileCredentialsProvider"
Any help, suggestions, or comments appreciated. Thanks!

As of October 2022, the S3A connector doesn't support AWS SSO / the identity server. Moving to the AWS SDK v2 is a prerequisite, and that work is still in progress.
See HADOOP-18352
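Until that lands, one workaround (a sketch, assuming short-lived jobs that can tolerate non-refreshing credentials; profile and bucket names are placeholders) is to let boto3 resolve each SSO profile and hand the resulting temporary credentials to S3A per bucket:

```python
import boto3
from pyspark.sql import SparkSession

def with_sso_bucket_creds(builder, profile, bucket):
    """Hypothetical helper: resolve short-lived credentials for an SSO profile
    with boto3 and wire them into S3A as per-bucket session credentials."""
    creds = boto3.Session(profile_name=profile).get_credentials().get_frozen_credentials()
    prefix = f"spark.hadoop.fs.s3a.bucket.{bucket}"
    return (builder
            .config(f"{prefix}.aws.credentials.provider",
                    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
            .config(f"{prefix}.access.key", creds.access_key)
            .config(f"{prefix}.secret.key", creds.secret_key)
            .config(f"{prefix}.session.token", creds.token))

builder = SparkSession.builder
builder = with_sso_bucket_creds(builder, "prod-profile", "my-prod-bucket")  # placeholders
builder = with_sso_bucket_creds(builder, "dev-profile", "my-dev-bucket")
spark = builder.getOrCreate()
```

The credentials expire with the SSO session and are not refreshed, so long-running jobs would need to re-resolve them and rebuild the session.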

Related

Extracting Spark logs (Spark UI contents) from Databricks

I am trying to save Apache Spark logs (the contents of Spark UI), not necessarily stderr, stdout and log4j files (although they might be useful too) to a file so that I can send it over to someone else to analyze.
I am following the manual described in the Apache Spark documentation here:
https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
The problem is that I am running the code on Azure Databricks. Databricks saves the logs elsewhere and you can display them from the web UI, but you cannot export them.
When I ran the Spark job with spark.eventLog.dir set to a location in DBFS, the file was created but it was empty.
Is there a way to export the full Databricks job log so that anyone can open it without giving them the access to the workspace?
The simplest way of doing it is as follows:
You create a separate storage account with a container in it (or a separate container in an existing storage account) and give developers access to it
You mount that container into the Databricks workspace
You configure clusters/jobs to write their logs to that mount location (you can enforce this for new clusters using cluster policies). This will create sub-directories named after the cluster, containing driver and executor logs plus the results of init script execution (see the sketch after this list)
(Optional) You can set up a retention policy on that container to automatically remove old logs.
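A rough sketch of the mount and the cluster log setting (all storage account, container, tenant, and secret names are placeholders; the log destination itself is configured through the cluster UI or the Clusters API):

```python
# Mount the log container over ABFSS using a service principal (placeholder names).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
dbutils.fs.mount(
    source="abfss://cluster-logs@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/cluster-logs",
    extra_configs=configs,
)

# In the cluster spec (Clusters API JSON), point the log destination at the mount:
cluster_log_conf = {"dbfs": {"destination": "dbfs:/mnt/cluster-logs"}}
```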

Access S3 files from Azure Synapse Notebook

Goal:
Move a lot of files from AWS S3 to ADLS Gen2 using Azure Synapse as fast as possible using parameterized regex expression for filename pattern using Synapse Notebook.
What I tried so far:
I know that to access ADLS Gen2 we can use
mssparkutils.fs.ls('abfss://container_name@storage_account_name.blob.core.windows.net/foldername'), which works, but what is the equivalent to access S3?
I used mssparkutils.credentials.getsecret('AKV name','secretname') and mssparkutils.credentials.getsecret('AKV name','secret key id') to fetch the secret details in the Synapse notebook, but I am unable to configure S3 access from Synapse.
Question: Do I have to use the existing linked service using the credentials.getFullConnectionString(LinkedService) API ?
In short, my question is, How do I configure connectivity to S3 from within Synapse Notebook?
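For context, this is roughly how I tried to wire the fetched secrets into the Spark Hadoop configuration (a sketch only; the Key Vault, secret, and bucket names are placeholders):

```python
# Fetch the S3 keys from Azure Key Vault and push them into the Hadoop
# configuration so spark.read can resolve s3a:// paths (placeholder names).
access_key = mssparkutils.credentials.getSecret("my-akv-name", "s3-access-key-id")
secret_key = mssparkutils.credentials.getSecret("my-akv-name", "s3-secret-access-key")

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", access_key)
hconf.set("fs.s3a.secret.key", secret_key)
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.text("s3a://my-source-bucket/some-prefix/")
```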
Answering my own question here: AzCopy worked. Below are the links which helped me finish the task. The steps are as follows.
Install AzCopy on your machine.
Go to your terminal, change to the directory where the executable is installed, and run "azcopy login"; authenticate with your Azure Active Directory credentials in the browser using the link from the terminal message, entering the code provided in the terminal.
Authorize with S3 by setting the environment variables below (see the sketch after this list):
set AWS_ACCESS_KEY_ID=
set AWS_SECRET_ACCESS_KEY=
For ADLS Gen2, you are already authorized from step 2.
Use the commands (whichever suit your need) from the links below.
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-v10
https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3
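Tying the steps together, a rough sketch of the copy itself (bucket, account, and container names are placeholders; note that azcopy's --include-pattern takes wildcards rather than full regular expressions):

```python
import os
import subprocess

# Step 3: authorize the S3 side via environment variables (values are placeholders).
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"

# Step 5: copy everything under the prefix from S3 to the ADLS Gen2 container.
# azcopy must already be logged in to Azure AD ("azcopy login") from step 2.
subprocess.run(
    [
        "azcopy", "copy",
        "https://s3.amazonaws.com/my-source-bucket/my-prefix/",
        "https://mystorageaccount.blob.core.windows.net/my-container/my-prefix/",
        "--recursive",
        "--include-pattern", "*.csv",  # filename pattern (wildcards)
    ],
    check=True,
)
```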

Spark Session With Multiple s3 Roles

I have a Spark job that reads files from an S3 bucket, formats them, and places them in another S3 bucket. I'm using the (SparkSession) spark.read.csv and spark.write.csv functionality to accomplish this.
When I read the files, I need to use one IAM role (assume role), and when I write the files, I need to drop the assumed role and revert to my default role.
Is this possible within the same spark session?
And if not, is there another way to do this?
Any and all help is appreciated!
The S3A connector in Hadoop 2.8+ supports per-bucket settings, so you can have different login options for different buckets.
At some point (maybe around then, certainly by Hadoop 3) the AssumedRoleCredentialProvider was added: it takes a set of full credentials and calls AssumeRole for a given role ARN, so it interacts with S3 under that role instead.
It should be a matter of:
Make sure your Hadoop JARs are recent
Set the base settings with your full login
Add a per-bucket setting for the source bucket to use the assumed-role credential provider with the chosen ARN (see the sketch after this list)
Make sure things work from the Hadoop command line before trying to get submitted jobs to work
Then submit the job.
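A rough sketch of what those settings might look like from PySpark on Hadoop 3 (the bucket names and role ARN are placeholders, and the base credential provider shown is just one reasonable choice):

```python
from pyspark.sql import SparkSession

SOURCE_BUCKET = "source-bucket"                           # read via the assumed role
ROLE_ARN = "arn:aws:iam::123456789012:role/reader-role"   # placeholder ARN

spark = (
    SparkSession.builder.appName("cross-role-copy")
    # Base settings: your full (default) login, used for the destination bucket.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    # Source bucket only: assume the given role on top of the base credentials.
    .config(f"spark.hadoop.fs.s3a.bucket.{SOURCE_BUCKET}.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
    .config(f"spark.hadoop.fs.s3a.bucket.{SOURCE_BUCKET}.assumed.role.arn", ROLE_ARN)
    .getOrCreate()
)

df = spark.read.csv(f"s3a://{SOURCE_BUCKET}/input/", header=True)
df.write.csv("s3a://destination-bucket/output/")
```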

manage dataproc cluster access using service account and IAM roles

I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project.
Let's say I have created a service account named 'data-proc-service-account@my-cloud-project.iam.gserviceaccount.com'
and then I create a Dataproc cluster and assign the service account to it.
Now I have created two GCS buckets:
'gs://my-test-bucket/spark-input-files/'
'gs://my-test-bucket/spark-output-files/'
These buckets hold some input files which need to be accessed by Spark jobs running on my Dataproc cluster and also act as a location where my Spark jobs can write some output files.
I think I have to go and edit my bucket permissions as shown in the given link.
Edit Bucket Permission
I want my Spark jobs to read files only from this specific bucket, 'gs://my-test-bucket/spark-input-files/',
and, if they are writing to a GCS bucket, to write only to 'gs://my-test-bucket/spark-output-files/'.
Question here is (most likely a question related to SRE resources):
What IAM permissions need to be added to my Dataproc service account
data-proc-service-account@my-cloud-project.iam.gserviceaccount.com on the IAM console page,
and what read/write permissions need to be added for the given specific buckets, which I believe have to be configured by adding a member and assigning the right permissions to it (as shown in the link mentioned above)?
Do I need to add my Dataproc service account as a member with the two roles below? Will this work?
Storage Object Creator for bucket 'gs://my-test-bucket/spark-output-files/'
Storage Object Viewer for bucket 'gs://my-test-bucket/spark-input-files/'
Also let me know in case I have missed anything or something better can be done.
According to the Dataproc IAM doc:
To create a cluster with a user-specified service account, the specified service
account must have all permissions granted by the Dataproc Worker role. Additional
roles may be required depending on configured features.
The dataproc.worker role has a list of GCS-related permissions, including things like storage.objects.get and storage.objects.create, and these apply to any bucket.
What you want to do, is to give your service account almost identical permissions to dataproc.worker role, but limit all the storage.xxx.xxx permissions to the Dataproc staging bucket. Then in addition, add write access to your output bucket and read access to your input bucket.
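A rough sketch of adding those bucket-level bindings with the google-cloud-storage Python client (this assumes the input and output locations are separate buckets, since bucket-level roles cover the whole bucket; the bucket names are placeholders, and the same thing can be done in the console or with gsutil):

```python
from google.cloud import storage

SERVICE_ACCOUNT = "data-proc-service-account@my-cloud-project.iam.gserviceaccount.com"
member = f"serviceAccount:{SERVICE_ACCOUNT}"

client = storage.Client()

def grant_bucket_role(bucket_name, role):
    """Add a bucket-level IAM binding for the service account."""
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({"role": role, "members": {member}})
    bucket.set_iam_policy(policy)

grant_bucket_role("my-input-bucket", "roles/storage.objectViewer")    # read input files
grant_bucket_role("my-output-bucket", "roles/storage.objectCreator")  # write output files
```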
Or you can use a different service account than the Dataproc service account when you run your Spark job. This job specific service account will only need the read access to input bucket and write access to output bucket. Assuming you are using the GCS connector (which comes pre-installed on Dataproc clusters) to access GCS, you can follow the instructions found here. But in this case you will have to distribute the service account key across worker nodes or put it in GCS/HDFS.

Spark - S3 - Access & Secret Key configured in code, is overridden with IAM Role

I am working on creating a Spark job in Java which explicitly specifies an IAM user with an access & secret key at runtime. It can read from or write to S3 with no issue on my local machine using the keys. However, when I promote the job to Cloudera Oozie, it keeps picking up the IAM role attached to the EC2 instance (which can only read certain S3 slices). The goal is to set up an IAM user per tenant who can only access their own slice in S3 under the same bucket (multi-tenancy). Can anyone advise me if you know how to prevent the IAM role from overriding the IAM user credentials in Spark? Thanks in advance.
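For context, the per-tenant setup described above looks roughly like this (sketched in PySpark rather than Java; the keys and paths are placeholders), with the credentials provider pinned to the explicit keys so the EC2 instance profile is not consulted:

```python
from pyspark.sql import SparkSession

# Per-tenant keys would come from a secret store in practice (placeholders here).
TENANT_ACCESS_KEY = "<tenant-access-key-id>"
TENANT_SECRET_KEY = "<tenant-secret-access-key>"

spark = (
    SparkSession.builder.appName("tenant-job")
    # Restrict S3A to the explicit keys only, so it does not fall back to the
    # IAM role / instance profile attached to the EC2 host.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.access.key", TENANT_ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", TENANT_SECRET_KEY)
    .getOrCreate()
)

df = spark.read.parquet("s3a://shared-bucket/tenant-a/")  # placeholder path
```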
