Bluemix spark-submit -- How to secure credentials needed by my Scala jar - apache-spark

I have a Spark application that I am submitting to the Bluemix Spark cluster. It reads from a dashDB database and writes the results to Cloudant. The code accesses dashDB using both Spark and JDBC.
The user ID and password for the dashDB database are passed as arguments to the program. I can pass these parameters via spark-submit, but I don't think that would be secure. The code needs to know the dashDB credentials because it uses JDBC to connect to various tables.
I am trying to find the "best practices" way to pass credentials via spark-submit in a secure manner.
Thanks in advance - John
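
For context, the kind of JDBC read the question describes could look roughly like the sketch below (not taken from the question itself: the host, table name, and argument handling are placeholders, and passing the credentials through program arguments is exactly the part that feels insecure):

```scala
import org.apache.spark.sql.SparkSession

object DashDbReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dashdb-to-cloudant").getOrCreate()

    // Placeholder connection details -- in the question these arrive as
    // spark-submit arguments, which is the insecure part.
    val jdbcUrl  = "jdbc:db2://<dashdb-host>:50000/BLUDB"
    val user     = args(0)
    val password = args(1)

    val df = spark.read
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "MYSCHEMA.MYTABLE")   // hypothetical table
      .option("user", user)
      .option("password", password)
      .load()

    df.show(5)
  }
}
```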

I think the JDBC driver will always need a username and password to connect to the database, so that is out of the question since you are in a multi-tenant environment on Bluemix.
As for having spark-submit.sh read the arguments securely, that option is not available yet.
Thanks,
Charles.

Based on the answer here, my preference would be to pass a properties file that contains the credentials. Other tenants will not be able to read the properties file, but you will be able to read it from your Spark application, e.g. as a DataFrame from which you can access the parameters.
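
One way this could look (a sketch, not from the original answer: it uses java.util.Properties instead of a DataFrame, and the file name and the db.user / db.password keys are placeholders):

```scala
import java.io.FileInputStream
import java.util.Properties

// Hypothetical helper: load credentials from a properties file shipped with
//   spark-submit --files dashdb.properties ...
// In cluster mode the file lands in the container's working directory;
// SparkFiles.get("dashdb.properties") is another way to resolve its path.
def loadCredentials(fileName: String = "dashdb.properties"): (String, String) = {
  val props = new Properties()
  val in = new FileInputStream(fileName)
  try props.load(in) finally in.close()
  (props.getProperty("db.user"), props.getProperty("db.password"))
}

val (user, password) = loadCredentials()
```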

Related

Kerberos ticket cache in Spark

I'm running a PySpark (Spark 3.1.1) application in cluster mode on a YARN cluster, which is supposed to process input data and send the appropriate Kafka messages to a given topic.
The data manipulation part is already covered; however, I'm struggling to use the kafka-python library to send the notifications. The problem is that it can't find a valid Kerberos ticket to authenticate to the Kafka cluster.
When executing spark3-submit I add the --principal and --keytab options (equivalent to spark.kerberos.principal and spark.kerberos.keytab), and I am able to access HDFS and HBase resources.
Does Spark store the TGT in a ticket cache that I can reference by setting the KRB5CCNAME variable? I am not able to locate a valid Kerberos ticket while the app is running.
Is it common to issue kinit from a PySpark application to create a ticket for accessing resources outside HDFS, etc.? I tried using the krbticket module to issue the kinit command from the app (using the keytab I pass as a parameter to spark3-submit), but then the process hangs.

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. Our company intends to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for developing non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you would run spark-shell, execute some commands, and then call collect() to bring all the data back into the local environment, everything runs in a single JVM: the driver dispatches work to executors on a remote Spark cluster and then pulls the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for the large datasets Spark is meant for. Depending on the data you need (e.g. if you are relying heavily on Spark SQL), it may be better to query a database directly.
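
A minimal sketch of "the web service is the Spark driver", assuming Scala 2.12+, Java 11+, and a hypothetical cluster at spark://spark-master:7077: a JDK HttpServer that runs one Spark SQL query per request and returns the collected rows. It is illustrative only (no validation, error handling, or result limits), which is part of why this pattern is discouraged for large datasets.

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import org.apache.spark.sql.SparkSession

object QueryService {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rest-driver-sketch")
      .master("spark://spark-master:7077")   // hypothetical remote cluster
      .getOrCreate()

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/query", (exchange: HttpExchange) => {
      // Treat the raw request body as a SQL statement (no validation here).
      val sql  = new String(exchange.getRequestBody.readAllBytes(), "UTF-8")
      val rows = spark.sql(sql).collect()     // collect() pulls all results to the driver
      val body = rows.map(_.mkString(",")).mkString("\n").getBytes("UTF-8")
      exchange.sendResponseHeaders(200, body.length)
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
  }
}
```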

How to use authentication on Spark Thrift server on Spark standalone cluster

I have a standalone Spark cluster on Kubernetes, and I want to use it to load some temp views in memory and expose them via JDBC using the Spark Thrift Server.
I already got it working with no security by submitting a Spark job (PySpark in my case) and starting the Thrift server in that same job, so I can access the temp views.
Since I'll need to expose some sensitive data, I want to apply at least an authentication mechanism.
I've been reading a lot, and I see basically two methods to do so:
PAM - which is not advised for production, since some critical files need to be granted permissions to users other than root.
Kerberos - which appears to be the most appropriate one for this situation.
My question is:
- For a standalone Spark cluster (running on K8s), is Kerberos the best approach? If not, which one is?
- If Kerberos is the best one, it's really hard to find guidance or a step-by-step on how to set up Kerberos to work with the Spark Thrift Server, especially in my case where I'm not using any specific distribution (MapR, Hortonworks, etc.).
Appreciate your help.
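
For reference, the unauthenticated in-process setup described above might look roughly like this in Scala (a sketch; the data path, view name, and Thrift port are placeholders, and spark-hive-thriftserver must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Expose temp views over JDBC by starting the Thrift server inside the
// same Spark job (the unauthenticated setup the question describes).
val spark = SparkSession.builder()
  .appName("thrift-temp-views")
  .config("hive.server2.thrift.port", "10000")   // JDBC endpoint port (placeholder)
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical data and view name
spark.read.parquet("/data/sensitive").createOrReplaceTempView("sensitive_view")

// Serves the session's temp views at jdbc:hive2://<driver-host>:10000
HiveThriftServer2.startWithContext(spark.sqlContext)
```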

How to set user login credentials for the Spark web UI in an open source Apache Spark cluster

We are using an open source Apache Spark cluster in our project and need some help with the following:
How to enable login credentials for the Spark web UI?
How to disable the "kill button" option in the Spark web UI?
Can someone help me with solutions to question 1, question 2, or both?
Thanks in advance.
Sure. According to this, you need to set the spark.ui.filters property to refer to the filter class that implements the authentication method you want to deploy. Spark does not provide any built-in authentication filters.
You can see a filter example here.
To control who can modify a running Spark application (which covers the kill button), you need to configure ACLs via the parameters spark.acls.enable, spark.modify.acls and spark.modify.acls.groups; the related spark.ui.view.acls and spark.ui.view.acls.groups parameters control who can view the UI. You can read more about it here.
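
As a rough illustration of the filter approach, here is a hypothetical javax.servlet.Filter enforcing HTTP Basic auth on the UI (a sketch, not Spark's API: the class name and the hard-coded credentials are placeholders; in practice read them from filter init parameters or an external store, and register the class via spark.ui.filters with the jar on the driver classpath):

```scala
import java.util.Base64
import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

// Register with e.g.:  spark.ui.filters=com.example.BasicAuthFilter
class BasicAuthFilter extends Filter {
  // Expected "Authorization" header for the placeholder credentials admin:secret
  private val expected =
    "Basic " + Base64.getEncoder.encodeToString("admin:secret".getBytes("UTF-8"))

  override def init(config: FilterConfig): Unit = ()

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    val request  = req.asInstanceOf[HttpServletRequest]
    val response = res.asInstanceOf[HttpServletResponse]
    if (expected == request.getHeader("Authorization")) {
      chain.doFilter(req, res)                 // credentials match: let the request through
    } else {
      response.setHeader("WWW-Authenticate", "Basic realm=\"Spark UI\"")
      response.setStatus(HttpServletResponse.SC_UNAUTHORIZED)
    }
  }

  override def destroy(): Unit = ()
}
```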

Spark integration

I am a newbie to Apache Spark.
My requirement is: when a user clicks in the web UI, a query needs to be passed to the Spark cluster, and the data returned from the cluster should update the UI.
I want to know how to pass the Spark SQL query and get the result set back.
Spark has the Thrift server for this (running SQL queries through JDBC/ODBC). If Java is your middle layer, use JDBC to connect to the Spark Thrift Server as you would to a database, and run whatever SQL Spark supports.
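
A sketch of the JDBC call the answer above describes, using the Hive JDBC driver against the Spark Thrift Server (host, port, credentials, and the table/columns are placeholders):

```scala
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://thrift-host:10000/default", "user", "password")
try {
  // Run a Spark SQL query over JDBC and walk the result set
  val rs = conn.createStatement().executeQuery("SELECT id, name FROM my_table LIMIT 10")
  while (rs.next()) {
    println(s"${rs.getLong("id")}  ${rs.getString("name")}")
  }
} finally {
  conn.close()
}
```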
Usually you would write a web application, typically with a REST interface, and implement the Spark SQL inside the server-side REST handler.
You can use Apache Livy.
Details: https://livy.incubator.apache.org/
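
For example, submitting a job as a Livy batch is a plain REST call (a sketch assuming Java 11+; the Livy host, jar path, and class name are placeholders):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// POST /batches asks Livy to spark-submit the given jar on the cluster.
val payload =
  """{"file": "hdfs:///jobs/my-spark-job.jar", "className": "com.example.MyJob"}"""

val request = HttpRequest.newBuilder()
  .uri(URI.create("http://livy-host:8998/batches"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(payload))
  .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println(response.body())   // JSON describing the new batch, including its id and state
```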
