Is there a way to read a file from a Kerberized HDFS into a non-Kerberized Spark cluster given the keytab file, principal and other details? - apache-spark

I need to read data from a Kerberized HDFS cluster over WebHDFS from a non-Kerberized Spark cluster. I have the keytab file and the username/principal, and can obtain any other details needed to log in. I need to log in programmatically so that my Spark cluster can read the file(s) from the Kerberized HDFS.
It looks like I can log in using this piece of code:
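import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// Point the JVM's Kerberos client at the KDC and realm.
System.setProperty("java.security.krb5.kdc", "<kdc>")
System.setProperty("java.security.krb5.realm", "<realm>")
// Switch Hadoop's security layer to Kerberos, then log in from the keytab.
val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("<my-principal>", "<keytab-file-path>/keytabFile.keytab")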
From the official WebHDFS documentation, I see that I can configure the keytab file path and principal in the Hadoop configuration like this:
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.principal", "<my-principal>")
sparkSession.sparkContext.hadoopConfiguration.set("dfs.web.authentication.kerberos.keytab", "<keytab-file-path>/keytabFile.keytab")
After this, I should be able to read the files from the Kerberized HDFS cluster using:
val irisDFWebHDFS = sparkSession.read
  .format("csv")
  .option("header", "true")
  .csv(s"webhdfs://<namenode-host>:<namenode-port>/user/hadoop/iris.csv")
But it still refuses to read and throws the following exception:
org.apache.hadoop.security.AccessControlException: Authentication required
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.validateResponse(WebHdfsFileSystem.java:490)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.access$300(WebHdfsFileSystem.java:135)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.connect(WebHdfsFileSystem.java:721)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:796)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:1741)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getDelegationToken(WebHdfsFileSystem.java:365)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getAuthParameters(WebHdfsFileSystem.java:585)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.toUrl(WebHdfsFileSystem.java:608)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractFsPathRunner.getUrl(WebHdfsFileSystem.java:898)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.runWithRetry(WebHdfsFileSystem.java:794)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.access$100(WebHdfsFileSystem.java:619)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner$1.run(WebHdfsFileSystem.java:657)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem$AbstractRunner.run(WebHdfsFileSystem.java:653)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getHdfsFileStatus(WebHdfsFileSystem.java:1086)
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.getFileStatus(WebHdfsFileSystem.java:1097)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1707)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:47)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:376)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:796)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
... 44 elided
Any pointers on what I might be missing?
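One way to narrow this down is to check whether the NameNode's WebHDFS endpoint accepts the keytab's credentials outside Spark. The sketch below only builds the REST URL that corresponds to the failing getFileStatus call in the trace; the helper name webhdfs_status_url, the host and the port are placeholders and not part of the original setup. The resulting URL could then be requested with a SPNEGO-capable client (for example curl --negotiate -u : after a kinit with the keytab).

```python
from urllib.parse import quote, urlencode


def webhdfs_status_url(host: str, port: int, path: str, scheme: str = "http") -> str:
    """Build the WebHDFS REST URL for a GETFILESTATUS call on an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute, e.g. /user/hadoop/iris.csv")
    query = urlencode({"op": "GETFILESTATUS"})
    # WebHDFS REST paths have the form /webhdfs/v1/<hdfs-path>?op=<operation>
    return f"{scheme}://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder host and port: substitute the real NameNode HTTP address.
url = webhdfs_status_url("namenode.example.com", 9870, "/user/hadoop/iris.csv")
print(url)
```

If this request fails with the same "Authentication required" response, the problem lies in the Kerberos setup rather than in the Spark read itself.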

Related

Unable to read xlsx file to pyspark dataframe from azure blob storage container

I am trying to load data from an Azure storage container into a PySpark DataFrame in Azure Databricks. Reading txt or CSV files works, but when I try to read .xlsx files I get the error shown below.
Apache Spark 3.2.0, Scala 2.12
These are the steps I am performing:
spark.conf.set("fs.azure.account.key.teststorage.blob.core.windows.net",
"**********************")
This works:
df = spark.read.format("csv").option("header", "true") \
    .option("inferSchema", "true") \
    .load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.txt")
This does not work:
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true").option("inferSchema","true") \
    .load("wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.xlsx")
Getting below error while loading xlsx files:
: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: Container producer in account teststorage.blob.core.windows.net not found, and we can't create it using anoynomous credentials, and no credentials found for them in the configuration.
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1063)
at shaded.databricks.org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:512)
at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1384)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:537)
at com.crealytics.spark.excel.WorkbookReader$.readFromHadoop$1(WorkbookReader.scala:35)
at com.crealytics.spark.excel.WorkbookReader$.$anonfun$apply$2(WorkbookReader.scala:41)
at com.crealytics.spark.excel.DefaultWorkbookReader.$anonfun$openWorkbook$1(WorkbookReader.scala:49)
at scala.Option.fold(Option.scala:251)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:49)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.WorkbookReader.withWorkbook$(WorkbookReader.scala:13)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:45)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:31)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:102)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:101)
at com.crealytics.spark.excel.ExcelRelation.$anonfun$inferSchema$1(ExcelRelation.scala:163)
at scala.Option.getOrElse(Option.scala:189)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:162)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:13)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:355)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:322)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:322)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:235)
at sun.reflect.GeneratedMethodAccessor338.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)
Note: I am able to read from dbfs and mount point.
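Because the account key is registered under a setting name that embeds the storage account host, it can help to confirm that the URI and the setting refer to the same account. A minimal sketch, assuming the usual wasbs://<container>@<account>.blob.core.windows.net/<path> layout; describe_wasbs_uri is a hypothetical helper, not a Databricks or Hadoop API.

```python
from urllib.parse import urlsplit


def describe_wasbs_uri(uri: str) -> dict:
    """Split a wasb(s) URI into container, account host, path and the matching key setting."""
    parts = urlsplit(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a wasb(s) URI: {uri!r}")
    # The container sits in the user-info position, before the '@'.
    if parts.username is None or not parts.hostname:
        raise ValueError(
            "expected wasbs://<container>@<account>.blob.core.windows.net/<path>"
        )
    return {
        "container": parts.username,
        "account_host": parts.hostname,
        "path": parts.path,
        # Same naming pattern as the spark.conf.set call in the question.
        "account_key_setting": f"fs.azure.account.key.{parts.hostname}",
    }


info = describe_wasbs_uri(
    "wasbs://testcontainer@teststorage.blob.core.windows.net/data/samplefile.xlsx"
)
print(info["account_key_setting"])
```

A URI that lacks the container@account part is rejected with a ValueError, which makes a malformed path easy to spot before the read is attempted.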
This problem is related to the public access level of the blob storage container. When the container has the Private or Blob public access level, the same error occurs for Excel files. When the container uses the Container public access level, the Excel files can be read without error. This is what I observed while trying to reproduce the issue.
A simple solution is either to change the public access level of the container to Container, or to mount the blob storage account to the Databricks file system (which already works for you). If you choose to change the access level, go to the container in your storage account, where the "Change access level" option lets you select the Container level.
Then go back to Databricks and run the DataFrame read again; it should now work without any error:
df2 = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true").option("inferSchema","true") \
    .load("wasbs://<container>@<storage_acc>.blob.core.windows.net/data.xlsx")
Please refer to the following document to understand more about accessing blob storage account using Databricks.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/azure-storage

Spark application syncing with Hive metastore - "There is no primary group for UGI spark" error

I'm running a simple Spark job on a Kubernetes cluster that writes data to HDFS and catalogs it in Hive. For some reason my app fails to run Spark SQL commands with the following exception:
21/09/22 09:23:54 ERROR SplunkStreamListener: |exception=org.apache.spark.sql.AnalysisException
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: java.io.IOException There is no primary group for UGI spark (auth:SIMPLE));
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.createDatabase(HiveExternalCatalog.scala:183)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createDatabase(ExternalCatalogWithListener.scala:47)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:211)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
I'm connecting to the Hive metastore via a Thrift URL. The Docker container runs the application as a non-root user. Is there some group the user needs to be added to in order to sync with the metastore?
Try adding this before setting up the Spark context:
System.setProperty("HADOOP_USER_NAME", "root")

Spark Cluster mode issue to read Hive-Hbase table on Kerberized Environment

Error description
We are not able to execute our Spark job in yarn-cluster or yarn-client mode, though it works fine in local mode.
This issue occurs when we try to read the Hive-HBase tables in a Kerberized cluster.
What we have tried so far
Passing all the HBase jars in the --jars parameter of spark-submit:
--jars /usr/hdp/current/hive-client/lib/hive-hbase-handler-1.2.1000.2.5.3.16-1.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/hbase-client/lib/htrace-core-3.1.0-incubating.jar,/usr/hdp/current/hbase-client/lib/protobuf-java-2.5.0.jar,/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-server.jar
Passing hbase-site.xml and hive-site.xml in the --files parameter of spark-submit:
--files /usr/hdp/2.5.3.16-1/hbase/conf/hbase-site.xml,/usr/hdp/current/spark-client/conf/hive-site.xml,/home/pasusr/pasusr.keytab
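Doing Kerberos authentication inside the application. In the code, we explicitly pass the keytab:
import java.io.IOException
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}
import org.apache.hadoop.security.UserGroupInformation

UserGroupInformation.setConfiguration(configuration)
val ugi: UserGroupInformation =
  UserGroupInformation.loginUserFromKeytabAndReturnUGI(principle, keyTab)
UserGroupInformation.setLoginUser(ugi)
// This connection is created outside doAs and its result is discarded.
ConnectionFactory.createConnection(configuration)
return ugi.doAs(new PrivilegedExceptionAction[Connection] {
  @throws[IOException]
  def run: Connection = {
    ConnectionFactory.createConnection(configuration)
  }
})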
Passing the keytab information in spark-submit
Passing the HBase jars in spark.driver.extraClassPath and spark.executor.extraClassPath
Error Log
18/03/20 15:33:24 WARN TableInputFormatBase: You are using an HTable instance that relies on an HBase-managed Connection. This is usually due to directly creating an HTable, which is deprecated. Instead, you should create a Connection object and then request a Table instance from it. If you don't need the Table instance for your own use, you should instead use the TableInputFormatBase.initalizeTable method directly.
18/03/20 15:47:38 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 406, hadoopnode.server.name): java.lang.IllegalStateException: Error while configuring input job properties
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureTableJobProperties(HBaseStorageHandler.java:444)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.configureInputJobProperties(HBaseStorageHandler.java:342)
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=50, exceptions:
Caused by: java.lang.RuntimeException: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:679)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
I was able to resolve this by adding the following configuration to spark-env.sh:
export SPARK_CLASSPATH=/usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/guava-12.0.1.jar
I also removed spark.driver.extraClassPath and spark.executor.extraClassPath, through which I had been passing the above jars, from the spark-submit command.

Spark submit cluster mode from s3

I have Spark standalone set up on EC2 instances. I'm trying to use cluster mode to submit a Spark application. The jar is in S3, and access to it is set up via IAM roles. I can run aws s3 cp s3://bucket/dir/foo.jar . to get the jar file, and that works fine. However, when I run the following:
spark-submit --master spark://master-ip:7077 --class Foo \
  --deploy-mode cluster --verbose s3://bucket/dir/foo.jar
I get the error outlined below. Seeing that the boxes have IAM roles configured to allow access, what would be the correct way to submit the job? The job itself doesn't use S3 at all...the issue seems to be fetching the jar from S3.
Any help will be appreciated.
16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy5.initialize(Unknown Source)
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
I've found a workaround: I put the jar on a static HTTP server and pass http://server/foo.jar to spark-submit. That seems to work.
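For a quick test of this workaround, a static HTTP server can be started from the Python standard library. This is a minimal sketch, not a production setup; serve_directory is a hypothetical helper, and the bind address must be one the Spark workers can actually reach.

```python
import functools
import http.server
import threading


def serve_directory(directory: str, host: str = "127.0.0.1", port: int = 0):
    """Serve files from `directory` over HTTP in a background thread.

    port=0 lets the OS pick a free port; read it back from server.server_address.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Daemon thread: the server stops when the main program exits.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling serve_directory on the folder that holds foo.jar, with a host address visible to the workers, makes the jar available at http://<host>:<port>/foo.jar; call shutdown() on the returned server when it is no longer needed.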

Spark job on DC/OS cluster on AWS

I am trying to run a batch process in Spark on DC/OS on AWS. For each batch process, I have some specific parameters I send when I do spark submit (for example for which users to perform the batch process).
I have a Spark cluster on DC/OS, with one master and 3 private nodes.
I have created an application.conf file, uploaded it to S3, and enabled the permissions for accessing that file.
My spark submit command looks like this:
dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-class-path https://path_to_the_folder_root_where_is_the_file --conf spark.driver.extraJavaOptions=-Dconfig.file=application.conf --conf spark.executor.extraJavaOptions=-Dconfig.file=application.conf --class class_name jar_location_on_S3'
And I get the error that job.properties file is not found:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'wattio-batch'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getObject(SimpleConfig.java:218)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:224)
at com.typesafe.config.impl.SimpleConfig.getConfig(SimpleConfig.java:33)
at com.enerbyte.spark.jobs.wattiobatch.WattioBatchJob$.main(WattioBatchJob.scala:31)
at com.enerbyte.spark.jobs.wattiobatch.WattioBatchJob.main(WattioBatchJob.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:786)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
How do I set this up properly? Also, given that one of the private slaves executes the driver, does it have access to the Internet (that is, is it able to reach S3 and download the conf file)?
Thank you
I didn't manage to send the conf file through the spark-submit command, so instead I hard-coded the location of the application.conf file at the beginning of my program:
System.setProperty("config.url", "https://s3_location/application.conf")
ConfigFactory.invalidateCaches()
This way, the program is able to read the application.conf file every time it launches.
