The application already has a service account set up in core-site.xml. I'm trying to override it at runtime by setting the Google service account credentials directly, but it fails with the error below.
Sample code:
spark.conf.set("fs.defaultFS", "gs://<bucket Name>")
spark.conf.set("fs.gs.auth.service.account.private.key.id", "<private key id>")
spark.conf.set("fs.gs.auth.service.account.email", "<service account email>")
spark.conf.set("fs.gs.auth.service.account.private.key", "<private key>")
val df = spark.read.csv("gs://test/test.csv")
Error:
java.lang.IllegalArgumentException: A JSON key file may not be specified at the same time as credentials via configuration.
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:106)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getCredential(GoogleHadoopFileSystemBase.java:1613)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.createGcsFs(GoogleHadoopFileSystemBase.java:1699)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1658)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:683)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:646)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2796)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:705)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
How can this be fixed?
This can be fixed by calling the unset method on hadoopConfiguration:
sc.hadoopConfiguration.unset("fs.gs.auth.service.account.json.keyfile")
A service account keyfile set via --conf on spark-submit, or in core-site.xml, can only be removed through hadoopConfiguration.unset.
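For PySpark, the same reset can be sketched through the JVM Hadoop configuration (a sketch only, assuming an existing SparkSession named `spark`; the credential values are placeholders):

```python
# Sketch, assuming an existing SparkSession `spark` (PySpark).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# Drop the JSON keyfile first so it cannot conflict with the
# per-property credentials set below.
hadoop_conf.unset("fs.gs.auth.service.account.json.keyfile")

hadoop_conf.set("fs.gs.auth.service.account.email", "<service account email>")
hadoop_conf.set("fs.gs.auth.service.account.private.key.id", "<private key id>")
hadoop_conf.set("fs.gs.auth.service.account.private.key", "<private key>")
```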
Related
We are having an issue trying to read Outlook emails.
Currently we are using the following Apache Camel endpoint to log in to Outlook 365 email:
imaps://Outlook.office365.com:993?password=XXXX&username=YYYY
We upgraded to Apache Camel 3.17 to get access to Azure Key Vault and began our testing with a tenantId and clientId.
We get the following error:
Caused by: java.lang.IllegalArgumentException: Azure Secret Client or client Id, client secret and tenant Id must be specified
at org.apache.camel.component.azure.key.vault.KeyVaultComponent.createEndpoint(KeyVaultComponent.java:66)
at org.apache.camel.support.DefaultComponent.createEndpoint(DefaultComponent.java:171)
at org.apache.camel.impl.engine.AbstractCamelContext.doGetEndpoint(AbstractCamelContext.java:951)
... 97 more
If anyone has set this up successfully, please share an example of the URI parameters.
Thank you.
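I haven't set this up end to end, but going by the error message the azure-key-vault endpoint requires either a pre-built Azure Secret Client or all three of clientId, clientSecret, and tenantId as URI parameters. A sketch of the URI shape under that assumption (all values are placeholders):

```
azure-key-vault://<vaultName>?clientId=<clientId>&clientSecret=<clientSecret>&tenantId=<tenantId>
```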
We have a Spark application accessing GCS using the GCS connector. We would like to authenticate using a service account HMAC key. Is this possible?
We have tried a few of the authentication configurations here, but none seems to work.
Here's an example of what we are trying to do:
val spark = SparkSession.builder()
.config("google.cloud.auth.client.id", "HMAC key id")
.config("google.cloud.auth.client.secret", "HMAC key secret")
.master("local[*]")
.appName("Test App")
.getOrCreate()
val df = spark.range(10).toDF() // any DataFrame; placeholder for this example
df.write.format("parquet")
.save("gs://test-project/")
We have tried the JSON keyfile, which works, but HMAC would be a bit more convenient for us.
I'm currently facing an issue where I'm unable to create a Spark session (through PySpark) that uses temporary credentials (from an assumed role in a different AWS account).
The idea is to assume a role in Account B, get temporary credentials, and create the Spark session in Account A, so that Account A can interact with Account B through the Spark session.
I've tried almost every possible configuration available in my Spark session. Does anyone have reference material on creating a Spark session using temporary credentials?
import boto3

role_arn = "arn:aws:iam::account-b:role/example-role"
role_session_name = "example-session"
duration_seconds = 60 * 15  # duration of the session in seconds

# obtain the temporary credentials
credentials = boto3.client("sts").assume_role(
    RoleArn=role_arn,
    RoleSessionName=role_session_name,
    # DurationSeconds=duration_seconds
)['Credentials']
spark = SparkSession \
.builder \
.enableHiveSupport() \
.appName("test") \
.config("spark.jars", "/usr/local/spark/jars/hadoop-aws-2.10.0.jar,/usr/local/spark/jars/aws-java-sdk-1.7.4.jar")\
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
.config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId']) \
.config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey']) \
.config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com") \
.getOrCreate()
The above does not work: Spark does not use the credentials I pass to the session, but falls back to the actual underlying execution role of the process.
The documentation also contains notes about 'short-lived credentials' not being supported, so I wonder how others are able to create a Spark session with temporary credentials.
Update hadoop-aws and its compatible binaries (including the AWS SDK) to versions written in the last eight years; those include temporary-credential support.
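With a recent hadoop-aws, temporary credentials go through org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which also reads the session token the original snippet never passed. A minimal, Spark-free sketch of mapping an STS assume_role response onto the s3a properties (the helper name and the fake values are ours):

```python
def s3a_conf_from_sts(credentials: dict) -> dict:
    """Map an STS assume_role 'Credentials' dict onto the s3a properties
    read by org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": credentials["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": credentials["SecretAccessKey"],
        # The session token is the part the original snippet was missing.
        "spark.hadoop.fs.s3a.session.token": credentials["SessionToken"],
    }

# Fake STS response for illustration; a real one comes from assume_role().
fake = {"AccessKeyId": "AKIA...", "SecretAccessKey": "secret",
        "SessionToken": "token"}
conf = s3a_conf_from_sts(fake)
```

Each entry would then be applied via SparkSession.builder.config(key, value), instead of pointing fs.s3a.aws.credentials.provider at DefaultAWSCredentialsProviderChain as the snippet above does.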
One of our Spark applications frequently ran into a Kerberos authentication error on a Hadoop cluster. Initially we believed it was caused by a misconfigured delegation token renewal policy, but later we found the following messages in the Spark driver log:
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Attempting to login to KDC using principal: XXX/ip-xx-xx-xx-xx.us-west-2.compute.internal#CDHDP.COM
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Successfully logged into KDC.
22/01/15 02:13:38 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-206445389_46, ugi=XXX/ip-xx-xx-xx-xx.us-west-2.compute.internal#CDHDP.COM (auth:KERBEROS)]] with renewer yarn/ip-172-31-34-136.us-west-2.compute.internal#CDHDP.COM
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Scheduling renewal in 3.7 min.
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Updating delegation tokens.
22/01/15 02:13:38 INFO SparkHadoopUtil: Updating delegation tokens for current user.
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Attempting to login to KDC using principal: XXX/ip-xx-xx-xx-xx.us-west-2.compute.internal#CDHDP.COM
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Successfully logged into KDC.
22/01/15 02:17:23 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1775743108_46, ugi=XXX/ip-xx-xx-xx-xx.us-west-2.compute.internal#CDHDP.COM (auth:KERBEROS)]] with renewer yarn/ip-172-31-34-136.us-west-2.compute.internal#CDHDP.COM
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Scheduling renewal in 3.7 min.
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Updating delegation tokens.
22/01/15 02:17:23 INFO SparkHadoopUtil: Updating delegation tokens for current user.
22/01/15 02:17:28 ERROR DriverLogger$DfsAsyncWriter: Failed writing driver logs to dfs
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for XXX: HDFS_DELEGATION_TOKEN owner=XXX/ip-xx-xx-xx-xx.us-west-2.compute.internal#CDHDP.COM, renewer=yarn, realUser=, issueDate=1642212592939, maxDate=1642212892939, sequenceNumber=25145, masterKeyId=237) is expired, current time: 2022-01-15 02:17:28,048+0000 expected renewal time: 2022-01-15 02:14:52,939+0000
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1454)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:497)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1085)
at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1865)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1668)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
22/01/15 02:17:33 ERROR DriverLogger$DfsAsyncWriter: Failed writing driver logs to dfs
So the DriverLogger error was triggered only 5 seconds after a successful renewal (at which point the latest token could not possibly have expired), for the same Hadoop user. The only remaining possibility seems to be that the DriverLogger tried to write to HDFS using a stale delegation token.
How can I confirm this hypothesis, and how can it be fixed?
UPDATE 1: The above log is from launching a Spark ThriftServer in a YARN environment (using ./sbin/start-thriftserver.sh). The strange thing is that if I submit a normal application (e.g. the SparkPi example), the problem does not appear even after prolonged execution.
So some part of the Spark ThriftServer may contain a bug that causes the token used by the DriverLogger to fall out of sync; I just don't know which part. The question then becomes: what are possible ways to ensure that the token being renewed and the token used for HDFS writes are the same token?
How can we enable authentication and encryption on a Spark standalone cluster with spark.master.rest.enabled=true?
We get the error below on the master when enabling authentication:
ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.lang.IllegalArgumentException: requirement failed: The RestSubmissionServer does not support authentication via spark.authenticate.secret. Either turn off the RestSubmissionServer with spark.master.rest.enabled=false, or do not use authentication.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.deploy.master.Master.<init>(Master.scala:130)
at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1076)
at org.apache.spark.deploy.master.Master$.main(Master.scala:1058)
at org.apache.spark.deploy.master.Master.main(Master.scala)
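Going by the error message itself, the two features are mutually exclusive: the standalone master's REST submission server does not support spark.authenticate.secret, so keeping authentication means disabling the REST server. A sketch of spark-defaults.conf under that assumption (the secret is a placeholder):

```
spark.authenticate              true
spark.authenticate.secret       <shared-secret>
spark.network.crypto.enabled    true
spark.master.rest.enabled       false
```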