Access OpenStack Swift from Spark - SwiftAuthenticationFailedException

I am attempting to access OpenStack Swift from Spark 2.4, but I get this error:
org.apache.hadoop.fs.swift.exceptions.SwiftAuthenticationFailedException: Authenticate as tenant '78axxxxxxxxxxxxxxxxxxxxxxxxxxxx' PasswordCredentials{username='xxxxxxxxxxxx'}
sc.hadoopConfiguration.set("fs.swift.service.ovh.auth.url", "https://auth.cloud.ovh.net/v3/")
sc.hadoopConfiguration.set("fs.swift.service.ovh.tenant", "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.swift.service.ovh.username", "xxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.swift.service.ovh.password", "xxxxxxxxxxxxxxxxxxxx")
sc.hadoopConfiguration.set("fs.swift.service.ovh.http.port", "8080")
sc.hadoopConfiguration.set("fs.swift.service.ovh.region", "BHS3")
sc.hadoopConfiguration.set("fs.swift.service.ovh.public", "false")
I believe these credentials are correct, as they came directly from the OpenStack RC file and they work fine with python-swiftclient. I have also tried the v2.0 endpoint without success.
Unfortunately I always get this very generic error message, and it doesn't tell me which part is failing. Is there any way to debug this better?
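One low-effort way to narrow this down is to raise the log level for the Hadoop Swift client before touching the filesystem. This is only a sketch (log4j 1.x, as bundled with Spark 2.4); the exact output depends on your Hadoop build:
// Turn on DEBUG logging for the hadoop-openstack Swift client in the spark-shell
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org.apache.hadoop.fs.swift").setLevel(Level.DEBUG)
// Equivalent conf/log4j.properties entry:
// log4j.logger.org.apache.hadoop.fs.swift=DEBUG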

I used the example below, which I received from the OVH Spark Submit team.
An important note is to use the tenant name instead of the tenant id from the openstack.rc file.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.swift.impl","org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem")
hadoopConf.set("fs.swift.service.auth.endpoint.prefix","/AUTH_")
hadoopConf.set("fs.swift.service.abc.http.port","443")
hadoopConf.set("fs.swift.service.abc.auth.url","https://auth.cloud.ovh.net/v2.0/tokens")
hadoopConf.set("fs.swift.service.abc.tenant","<TENANT NAME> or <PROJECT NAME>")
hadoopConf.set("fs.swift.service.abc.region","<REGION NAME>")
hadoopConf.set("fs.swift.service.abc.useApikey","false")
hadoopConf.set("fs.swift.service.abc.username","<USER NAME>")
hadoopConf.set("fs.swift.service.abc.password","<PASSWORD>")
https://github.com/mojtabaimani/spark-wordcount-swift-scala/blob/master/src/main/scala/com/ovh/example/SparkScalaApp.scala
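Once those properties are set, a container is addressed through a swift:// URI whose host part combines the container name and the service id from the configuration (abc above). A minimal read, with a hypothetical container and object path:
// "my-container" and the path are placeholders; ".abc" must match the service id used in the properties
val lines = spark.read.textFile("swift://my-container.abc/path/to/file.txt")
lines.show(5)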

Related

How can Spark read / write from Azurite

I am trying to read (and eventually write) from Azurite (version 3.18.0) using Spark (3.1.1).
I can't figure out which Spark configurations and file URI I need to set to make this work properly.
For example, these are the containers and files I have inside Azurite:
/devstoreaccount1/container1/file1.avro
/devstoreaccount1/container2/file2.avro
This is the code that I'm running - the uri val is one of the values below:
val uri = ...
val spark = SparkSession.builder()
.appName(appName)
.master("local")
.config("spark.driver.host", "127.0.0.1").getOrCreate()
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"spark.hadoop.fs.azure.account.auth.type.devstoreaccount1.blob.core.windows.net", "SharedKey")
spark.conf.set(s"spark.hadoop.fs.azure.account.key.devstoreaccount1.blob.core.windows.net", <azurite account key>)
spark.read.format("avro").load(uri)
Which uri value is the correct one?
http://127.0.0.1:10000/container1/file1.avro
I get an UnsupportedOperationException when I perform spark.read.format("avro").load(uri), because Spark uses the HttpFileSystem implementation, which doesn't support listStatus.
wasb://container1@devstoreaccount1.blob.core.windows.net/file1.avro
Spark tries to authenticate against the Azure servers (and fails, for obvious reasons).
I have tried to follow this Stack Overflow post without success.
I have also tried removing the blob.core.windows.net configuration postfix, but then I don't know how to give Spark the endpoint for the Azurite container.
So my question is: what are the correct configurations to give Spark so that it can read from Azurite, and what is the correct file path format to pass as the URI?
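For what it's worth, one avenue to explore (untested here, and assuming the hadoop-azure WASB driver's storage-emulator support also works against Azurite) is to declare devstoreaccount1 as an emulator account so requests are routed to the local endpoint instead of *.blob.core.windows.net. The property name and URI form should be verified against the hadoop-azure version on the classpath:
// Assumption: hadoop-azure treats the account below as a local storage emulator (Azurite)
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
hadoopConf.set("fs.azure.storage.emulator.account.name", "devstoreaccount1")
val df = spark.read.format("avro").load("wasb://container1@devstoreaccount1/file1.avro")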

Configuration Error in Cassandra Kerberos Authentication with Java

I am trying to connect to a Cassandra cluster through a Kerberos ticket cache with the following set of configs:
java -Dcassaandra.ip.address=<IPaddress> \
-Djava.security.auth.login.config=kerb-client.conf \
-Dsun.security.krb5.debug=true \
-Djavax.security.auth.useSubjectCredsOnly=false \
-Djava.security.krb5.conf=krb5.conf -jar test-kerberos.jar
kerb-client.conf:
CassandraJavaClient {
  com.sun.security.auth.module.Krb5LoginModule required useTicketCache=true ticketCache=cacheFile principal="abc@abc.net";
};
I am getting the following error:
Unexpected error during transport initialization (java.lang.SecurityException: java.io.IOException: Configuration Error:
Line 3: expected [option value])
at com.datastax.driver.core.Connection$2.apply(Connection.java:205)
at com.datastax.driver.core.Connection$2.apply(Connection.java:191)
It seems like some config is missing, but I am unable to identify the root cause.
Please advise on the corrective steps.
The Java driver needs to have a custom authenticator configured in order to authenticate using Kerberos.
Instaclustr has an open-source Kerberos authenticator for Cassandra that works with the Java driver. Details and code are available here:
Cassandra Kerberos authenticator
Java driver Kerberos plugin
Cheers!
Thanks for your reply. I am already using com.datastax.cassandra:cassandra-driver-core:3.5.1. Below is my sample code snippet:
Cluster cluster = Cluster.builder()
    .addContactPoints(ipAddress)
    .withAuthProvider(KerberosAuthProvider.builder().withSaslProperties(saslProperties).build())
    .withSSL()
    .build();
cluster.connect();
The issue got fixed after changing the ticketCache file name; it seems I was referring to an incorrect cache file. Now I seem to be getting something else - a "No valid credentials provided" exception.
But I have moved past the basic configuration error.
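For reference, a ticket-cache JAAS entry is usually written with quoted option values and the absolute path of the cache file produced by kinit; the cache path and principal below are placeholders, not the actual values from this setup:
CassandraJavaClient {
  com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=true
    ticketCache="/tmp/krb5cc_1000"
    principal="abc@abc.net";
};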

z.run("paragraphId") not working in Zeppelin

I am trying to run paragraphs using the Zeppelin spark object (z) methods:
z.run("noteId","paragraphId")
z.run("paragraphId")
I have tried both methods, but neither triggers the next paragraph, and I don't get any errors either. Please let me know if I am missing something here.
On checking the Zeppelin server logs, I see the error below:
ERROR [2020-08-26 15:46:37,742] ({Thread-35} RemoteInterpreterEventPoller.java[run]:250) - Can't handle event RemoteInterpreterEvent(type:RUN_INTERPRETER_CONTEXT_RUNNER, data:{"logger":{"traceCapable":true,"name":"org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$ParagraphRunner"},"noteId":"2FHDFMR16","paragraphId":"20200819-131115_701985359"})
org.apache.zeppelin.rest.exception.ForbiddenException: HTTP 403 Forbidden
at org.apache.zeppelin.socket.NotebookServer.onRemoteRunParagraph(NotebookServer.java:2153)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:141)
I have created my notebook using the admin user only. Do we need to provide any special access? I am just using the admin user.
Update 1:
I changed the notebook settings to remove the user and owner permissions. It triggers now, but as an anonymous user, which will be a problem if we are working in a multi-user environment.
I think instead of that, try:
z.z.run

Kerberos: Spark UGI credentials are not getting passed down to Hive

I'm using Spark 2.4 and I have a Kerberos-enabled cluster where I'm trying to run a query via the spark-sql shell.
The simplified setup basically looks like this: spark-sql shell running on one host in a YARN cluster -> external Hive metastore running on one host -> S3 to store table data.
When I launch the spark-sql shell with DEBUG logging enabled, this is what I see in the logs:
> bin/spark-sql --proxy-user proxy_user
...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user against hive/_HOST@REALM.COM at thrift://hive-metastore:9083
DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_host@REALM.COM (auth:KERBEROS) from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130)
This means that Spark made a call to fetch the delegation token from the Hive metastore and then added it to the list of credentials for the UGI. This is the piece of code in Spark which does that. I also verified in the metastore logs that the get_delegation_token() call was being made.
Now when I run a simple query like create table test_table (id int) location "s3://some/prefix"; I get hit with an AWS credentials error. I modified the Hive metastore code and added the following right before the filesystem in Hadoop is initialized (org/apache/hadoop/hive/metastore/Warehouse.java):
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
  ...
  try {
    // get the current user
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    LOG.info("UGI information: " + ugi);
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getCredentials().getAllTokens();
    // print all the tokens it has
    for (Token token : tokens) {
      LOG.info(token);
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  ...
}
In the metastore logs, this does print the correct UGI information:
UGI information: proxy_user (auth:PROXY) via hive/hive-metastore@REALM.COM (auth:KERBEROS)
but there are no tokens present in the UGI. It looks like the Spark code adds it with the alias hive.server2.delegation.token, but I don't see it in the UGI. This makes me suspect that the UGI scope is somehow isolated and not shared between spark-sql and the Hive metastore. How do I go about solving this?
Spark is not picking up your Kerberos identity - it asks each filesystem to issue a "delegation token" which lets the caller interact with that service and that service alone. This is more restricted and therefore more secure.
The problem here is that Spark collects delegation tokens from every filesystem which can issue them - and as your S3 connector isn't issuing any, nothing is coming down.
Now, Apache Hadoop 3.3.0's S3A connector can be set to issue your AWS credentials inside a delegation token, or, for bonus security, ask AWS for session credentials and send only those over. But (a) you need a Spark build with those dependencies, and (b) Hive needs to be using those credentials to talk to S3.
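As a rough illustration of that Hadoop 3.3.0 option (a sketch only, assuming a Spark build with the matching hadoop-aws module on the classpath), S3A delegation-token support is enabled by choosing a token binding; check the class names against the Hadoop documentation for the exact version in use:
// Marshal full AWS credentials inside the delegation tokens Spark collects ...
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.delegation.token.binding",
  "org.apache.hadoop.fs.s3a.auth.delegation.FullCredentialsTokenBinding")
// ... or request short-lived session credentials from AWS STS and pass only those:
// hadoopConf.set("fs.s3a.delegation.token.binding",
//   "org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding")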

How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

I am trying to read data from a secured Kafka cluster using Spark Structured Streaming.
I am using the library "spark-sql-kafka-0-10_2.12":"3.0.0-preview" to read the data, since it has the feature to specify a custom group id (instead of Spark setting its own group id).
Dependency used in code:
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12</artifactId>
<version>3.0.0-preview</version>
I am getting the below error, even after specifying the required JAAS configuration in the Spark options:
Caused by: java.lang.IllegalArgumentException: requirement failed: Delegation token must exist for this connector.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
at org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
The following document states that we can disable the feature of obtaining a delegation token: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html
I tried setting the property spark.security.credentials.kafka.enabled to false in the Spark config, but it still fails with the same error.
Apparently there was a bug in the preview release which has been fixed in the GA Spark 3.x release.
Reference :
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-30495
Now we can specify a custom consumer group name while fetching the data from Kafka (even though it's not recommended, and a warning message is shown when specifying it).
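On the GA release, a read along these lines should work; the broker address, security protocol, SASL mechanism, JAAS string, group name and topic below are all placeholders that must match your cluster's setup:
// Sketch with spark-sql-kafka-0-10_2.12:3.0.0 (GA); all connection values are hypothetical
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9093")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-512")
  .option("kafka.sasl.jaas.config",
    "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"user\" password=\"secret\";")
  .option("kafka.group.id", "my-custom-group") // custom consumer group; Spark logs a warning
  .option("subscribe", "my-topic")
  .load()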
