Authenticate Spark to GCS with HMAC key - apache-spark

We have a Spark application accessing GCS using the GCP connector. We would like to authenticate using a service account HMAC key. Is this possible?
We have tried a few of the authentication configurations here, but none seems to work.
Here's an example of what we are trying to do:
val spark = SparkSession.builder()
  .config("google.cloud.auth.client.id", "HMAC key id")
  .config("google.cloud.auth.client.secret", "HMAC key secret")
  .master("local[*]")
  .appName("Test App")
  .getOrCreate()

// df is an existing DataFrame
df.write.format("parquet")
  .save("gs://test-project/")
We have tried the keyfile JSON which works, but HMAC would be a bit more convenient for us.
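For reference, the keyfile-based setup that does work for us looks roughly like the sketch below, shown in PySpark for brevity and assuming the GCS connector's standard service-account properties; the keyfile path is a placeholder.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # service-account keyfile authentication for the GCS connector
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json")
    .master("local[*]")
    .appName("Test App")
    .getOrCreate()
)

# small example DataFrame to write out
df = spark.range(10)
df.write.format("parquet").save("gs://test-project/")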

Related

How to use Temporary credentials from AssumeRole in Spark configuration

I'm currently facing an issue where I'm unable to create a Spark session (through PySpark) that uses temporary credentials (from an assumed role in a different AWS account).
The idea is to assume a role in Account B, get temporary credentials, and create the Spark session in Account A, so that Account A is allowed to interact with Account B through the Spark session.
I've tried almost every possible configuration available in my Spark session. Does anyone have some reference material on creating a Spark session using temporary credentials?
import boto3

role_arn = "arn:aws:iam::account-b:role/example-role"
role_session_name = "spark-test-session"  # any descriptive session name
duration_seconds = 60 * 15  # duration of the session in seconds

# obtain the temporary credentials for the assumed role
credentials = boto3.client("sts").assume_role(
    RoleArn=role_arn,
    RoleSessionName=role_session_name,
    # DurationSeconds=duration_seconds
)['Credentials']
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .appName("test") \
    .config("spark.jars", "/usr/local/spark/jars/hadoop-aws-2.10.0.jar,/usr/local/spark/jars/aws-java-sdk-1.7.4.jar") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId']) \
    .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey']) \
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com") \
    .getOrCreate()
The above does not seem to work: it does not use the credentials I pass to the Spark session, but falls back to the actual underlying execution role of the process.
Looking at the documentation, there are also some notes about 'short-lived credentials' not being supported. So I wonder how others are able to create a Spark session with temporary credentials?
Update hadoop-aws and its compatible binaries (including the AWS SDK) to something written in the last eight years, which will then include the temporary-credential support.
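For example, a rough sketch of that setup with a recent hadoop-aws on the classpath, assuming a version that ships org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider and the fs.s3a.session.token property; credentials is the dict returned by assume_role above.

from pyspark.sql import SparkSession

# credentials comes from boto3.client("sts").assume_role(...)['Credentials']
spark = (
    SparkSession.builder
    .appName("test")
    # use the S3A provider that understands session tokens
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId'])
    .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey'])
    # the session token is the part the old SDK could not handle
    .config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken'])
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")
    .getOrCreate()
)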

Azure Databricks Secret Scope: Azure Key Vault-backed or Databricks-backed

Is there a way to determine if an already existing Azure Databricks Secret Scope is backed by Key Vault or Databricks via a python notebook? dbutils.secrets.listScopes() does not output this. Assume that I have Manage permissions on the scope.
(Unfortunately, Google didn't help)
You can do it via the Secrets REST API - if you use the List Secret Scopes API, the backend_type field shows the backend - Databricks or Key Vault. From a notebook you can do it with the following code:
import requests

# pull the workspace host name and an API token from the notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()

response = requests.get(
    f'https://{host_name}/api/2.0/secrets/scopes/list',
    headers={'Authorization': f'Bearer {host_token}'}
).json()

# map scope name -> backend type (the field may be absent for Databricks-backed scopes)
scopes = {s['name']: s.get('backend_type', 'DATABRICKS') for s in response['scopes']}
backend = scopes['scope-name']
Or you can do the same via databricks-cli, using the databricks secrets list-scopes command (see docs)

Connect to GCP SQL using the .json credentials file

I have a PostgreSQL DB at GCP. Right now I can log in using a username and password, e.g.:
import pandas as pd
import pyodbc

conn_str = (
    "DRIVER={PostgreSQL Unicode};"
    "DATABASE=test;"
    "UID=user;"
    "PWD=a_very_strong_password;"
    "SERVER=34.76.yy.xxxx;"
    "PORT=5432;"
)
with pyodbc.connect(conn_str) as con:
    print(pd.read_sql("SELECT * from entries", con=con))
Is there a way to use the .json credentials file which was downloaded when I created my IAM user, instead of hard-coding the credentials like above? I reckon I could use the file to connect to GCP storage, save the DB credentials there, and then write a script that loads the username, password, etc. from storage, but that feels like a rather clunky workaround.
From the guide here it seems like you can create IAM roles for this, but it only grants access for an hour at a time, and you need to create a token pair each time.
Short answer: Yes, you can connect to a Cloud SQL instance using a service account key (JSON file), but only with PostgreSQL, and you need to refresh the token every hour.
Long answer: The JSON key is intended more for operations on the instance at the resource level, or for use with the Cloud SQL proxy.
For example, when you use the Cloud SQL proxy with a service account you create a "magical bridge" to the instance, but in the end you still authenticate the way you're doing right now, just with SERVER=127.0.0.1. This is the recommended method in most cases.
As you've mentioned, IAM authentication can also work, but that approach only works for an hour at a time since you depend on refreshing the token. If you're okay with this, just keep in mind you need to keep refreshing the token.
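For illustration, a rough sketch of that token-based IAM approach, assuming IAM database authentication is enabled on the instance and that the sqlservice.login OAuth scope applies; the key path is a placeholder.

from google.oauth2 import service_account
import google.auth.transport.requests

# load the service account key and request a short-lived access token;
# the scope below is the one used for Cloud SQL IAM database login
credentials = service_account.Credentials.from_service_account_file(
    '/path/to/key.json',
    scopes=['https://www.googleapis.com/auth/sqlservice.login'],
)
credentials.refresh(google.auth.transport.requests.Request())

# the token is then used as the database password (PWD) in the connection
# string, with the service account as the database user; it expires after
# about an hour, so it has to be regenerated periodically
token_password = credentials.token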
Another approach I can think of for now is to use Secret Manager. The steps can be as follows:
Create a service account and a key for it.
Create a secret which contains your password.
Grant access to this particular secret to the SA created in step 1:
Go to Secret Manager.
Select the secret and click on Show info panel.
Click on Add member and type or paste the email of the SA.
Grant the Secret Manager Secret Accessor role.
Click on Save.
Now in your code you can get the secret content (which is the password) with something like this sample code; the project, secret and version IDs are placeholders:
import pandas as pd
import pyodbc
from google.cloud import secretmanager
from google.oauth2 import service_account

# authenticate with the service account key file
credentials = service_account.Credentials.from_service_account_file('/path/to/key.json')
client = secretmanager.SecretManagerServiceClient(credentials=credentials)

# fill in your own project, secret and version
project_id = "your-project-id"
secret_id = "your-secret-id"
version_id = "latest"

name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
response = client.access_secret_version(request={"name": name})
secret_password = response.payload.data.decode("UTF-8")

conn_str = (
    "DRIVER={PostgreSQL Unicode};"
    "DATABASE=test;"
    "UID=user;"
    "PWD=" + secret_password + ";"
    "SERVER=34.76.yy.xxxx;"
    "PORT=5432;"
)
with pyodbc.connect(conn_str) as con:
    print(pd.read_sql("SELECT * from entries", con=con))
BTW, you can install the library with pip install google-cloud-secret-manager.
Finally, you can also use this approach to keep the instance IP, user, DB name, etc., by creating more secrets if you prefer.
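If you store those extra values as secrets too, a small sketch of that idea could look like the following, reusing the client and project_id from the snippet above; the secret names (db-user, db-host, db-password) are hypothetical.

def get_secret(client, project_id, secret_id, version_id="latest"):
    # fetch and decode one secret version from Secret Manager
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"
    return client.access_secret_version(request={"name": name}).payload.data.decode("UTF-8")

db_user = get_secret(client, project_id, "db-user")
db_host = get_secret(client, project_id, "db-host")
db_password = get_secret(client, project_id, "db-password")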

How properly test credentials used by Azure SDK client are valid

I'm trying to create a function that, given Azure credentials, checks whether they are valid.
In this case, I'm taking ADLS account name and account key credentials.
The Azure Java SDK does not provide an API for this, so I'm doing the following manually:
import scala.util.Try
import com.microsoft.azure.storage.CloudStorageAccount

def testConnection(accountName: String, accountKey: String): Boolean = {
  val storageConnectionString = s"DefaultEndpointsProtocol=https;AccountName=${accountName};AccountKey=${accountKey}"
  val storageAccount = CloudStorageAccount.parse(storageConnectionString)
  val client = storageAccount.createCloudBlobClient()
  Try { client.downloadServiceProperties() }.isSuccess
}
The problem is that downloadServiceProperties() is relatively slow; it may take a minute or so. Are there other, faster options to check whether the user's ADLS credentials are valid?
Try the DefaultAzureCredential class and then call getToken to get an access token. If no token is retrieved, the user is not authenticated.
Take into account that the user may be logged in but still not have rights to perform the operation.
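As an illustration, a minimal sketch of that check using the Python azure-identity package (the Java SDK's DefaultAzureCredential works the same way); the scope URL is the standard one for Azure Storage.

from azure.identity import DefaultAzureCredential
from azure.core.exceptions import ClientAuthenticationError

def has_valid_credentials() -> bool:
    try:
        # request a token for the Azure Storage resource; this fails quickly
        # if no usable credential can be found
        token = DefaultAzureCredential().get_token("https://storage.azure.com/.default")
        return token.token is not None
    except ClientAuthenticationError:
        return False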

Connect to Azure SQL Database from DataBricks using Service Principal

I have a requirement to connect to Azure SQL Database from Azure Databricks via Service Principal. Tried searching forums but unable to find the right approach. Any help is greatly appreciated.
Tried a similar approach with a SQL user ID and password over a JDBC connection and it worked successfully. Now looking into the Service Principal approach.
P.S.: The SP ID and key should be placed in Azure Key Vault and need to be accessed from Databricks.
You can use the Apache Spark Connector for SQL Server and Azure SQL,
and an example of what you have to do in Databricks can be found in the following Python file.
As you can see, we are not connecting directly with the Service Principal; instead, we use the Service Principal to generate an access token that is then used when specifying the connection parameters:
jdbc_df = spark.read.format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", db_table) \
    .option("accessToken", access_token) \
    .option("encrypt", "true") \
    .option("databaseName", database_name) \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
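For completeness, a hedged sketch of how that access_token could be obtained from the Service Principal, assuming the azure-identity package is installed on the cluster; the secret scope and key names below are placeholders.

from azure.identity import ClientSecretCredential

# Service Principal details pulled from a Key Vault-backed secret scope;
# scope and key names are placeholders
tenant_id = dbutils.secrets.get(scope="my-scope", key="tenant-id")
client_id = dbutils.secrets.get(scope="my-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="my-scope", key="sp-client-secret")

credential = ClientSecretCredential(tenant_id, client_id, client_secret)
# token for the Azure SQL Database resource
access_token = credential.get_token("https://database.windows.net/.default").token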
But if you can't or don't want to use the previous library, you can do much the same with Spark's built-in JDBC data source and the Microsoft SQL Server JDBC driver:
jdbc_df = spark.read.format("jdbc") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", url) \
    .option("dbtable", db_table) \
    .option("accessToken", access_token) \
    .option("encrypt", "true") \
    .option("databaseName", database_name) \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
Azure Key Vault support with Azure Databricks
https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html#akv-ss
**Here's the working solution:**
sql_url = "jdbc:sqlserver://#SERVER_NAME#.database.windows.net:1433;database=#DATABASE_NAME#"
properties = {
    "user": "#APP_NAME#",
    "password": dbutils.secrets.get(scope="#SCOPE_NAME#", key="#KEYVAULT_SECRET_NAME#"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
**APP_NAME** ==> the application created under App registrations in Azure Active Directory.
**SCOPE_NAME** ==> the secret scope you created as described in the docs: https://docs.azuredatabricks.net/user-guide/secrets/secret-scopes.html
**KEYVAULT_SECRET_NAME** ==> the name of the secret stored in Azure Key Vault.
**NOTE: GRANT ACCESS TO YOUR APP ON THE DATABASE WITH THE STEPS BELOW**
CREATE USER #APP_NAME# FROM EXTERNAL PROVIDER
EXEC sp_addrolemember 'db_owner', '#APP_NAME#';
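To tie it together, a small sketch of how the sql_url and properties above might be used from the notebook; #TABLE_NAME# is a placeholder like the other #...# values.

# read a table from Azure SQL using the JDBC URL and properties defined above
df = spark.read.jdbc(url=sql_url, table="#TABLE_NAME#", properties=properties)
display(df)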
Maybe you can reference this tutorial: Configuring AAD Authentication to Azure SQL Databases.
Summary:
Azure SQL is a great service - you get your databases into the cloud without having to manage all that nasty server stuff. One of the problems with Azure SQL, however, is that by default you authenticate using SQL authentication - a username and password. You can also authenticate via Azure Active Directory (AAD) tokens. This is analogous to integrated login using Windows Authentication - but instead of Active Directory, you're using AAD.
There are a number of advantages to AAD Authentication:
You no longer have to share logins since users log in with their AAD credentials, so auditing is better
You can manage access to databases using AAD groups
You can enable "app" logins via Service Principals
In order to get this working, you need:
To enable AAD authentication on the Azure SQL Server
A Service Principal
Add logins to the database, granting whatever rights are required to the service principal
Add code to get an auth token for accessing the database
In this post, the author walks through creating a service principal, configuring the database for AAD auth, writing code to retrieve a token, and configuring an EF DbContext for AAD auth.
I still hope this tutorial helps.
