spark kafka security kerberos - security

I am trying to use Kafka (0.9.1) in secure mode. I want to read the data with Spark, so I must pass the JAAS config file to the JVM. I use this command to start my job:
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.conf=kafka_client_jaas.conf" \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--class kafka.ConsumerSasl ./kafka.jar --topics test
I keep getting the same error:
Caused by: java.lang.IllegalArgumentException: You must pass java.security.auth.login.config in secure mode.
at org.apache.kafka.common.security.kerberos.Login.login(Login.java:289)
at org.apache.kafka.common.security.kerberos.Login.<init>(Login.java:104)
at org.apache.kafka.common.security.kerberos.LoginManager.<init>(LoginManager.java:44)
at org.apache.kafka.common.security.kerberos.LoginManager.acquireLoginManager(LoginManager.java:85)
at org.apache.kafka.common.network.SaslChannelBuilder.configure(SaslChannelBuilder.java:55)
I think Spark does not inject the parameter -Djava.security.auth.login.conf into the JVM!

The main cause of this issue is that you have used the wrong property name: it should be java.security.auth.login.config, not java.security.auth.login.conf. Moreover, if you are using a keytab file, make sure it is available on all executors by passing it with the --files argument of spark-submit. If you are using a Kerberos ticket, make sure KRB5CCNAME is set on all executors via the SPARK_YARN_USER_ENV property.
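With that fix, a corrected version of the submit command from the question could look like the sketch below (same file names as above; the relative path works where --files drops the file into the executor working directory, e.g. on YARN. On a standalone cluster you may instead need an absolute path that exists on every worker, and the driver option here assumes client mode, where the driver can read the local copy):
/opt/spark/bin/spark-submit -v --master spark://master1:7077 \
--files "./conf/kafka_client_jaas.conf,./conf/kafka.client.1.keytab" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_client_jaas.conf" \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./conf/kafka_client_jaas.conf" \
--class kafka.ConsumerSasl ./kafka.jar --topics test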
If you are using an older version of Spark (1.6.x or earlier), there are known issues with Spark that prevent this integration from working, and you have to write a custom receiver.
For Spark 1.8 and later, you can see the configuration here.
In case you need to create a custom receiver, you can see this.

Related

How to connect documentdb to a spark application in an emr instance

I'm getting an error while trying to configure Spark with MongoDB on my EMR instance. Below is the command:
spark-shell --conf "spark.mongodb.output.uri=mongodb://admin123:Vibhuti21!#docdb-2021-09-18-15-29-54.cluster-c4paykiwnh4d.us-east-1.docdb.amazonaws.com:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" "spark.mongodb.output.collection="ecommerceCluster" --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3
I'm a beginner in Spark & AWS. Can anyone please help?
DocumentDB requires a CA bundle to be installed on each node where your Spark executors will launch. So you first need to install the CA certs on each instance; AWS has a guide for this under the Java section, with two bash scripts that make things easier.
Once these certs are installed, your Spark command needs to reference the truststore and its password using the configuration parameters you can pass to Spark. Here is an example that I ran, and it worked fine.
spark-submit \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>" \
pytest.py
You can provide those same configuration options to spark-shell as well.
One thing I did find tricky was that the Mongo Spark connector doesn't appear to recognize the ssl_ca_certs parameter in the connection string, so I removed it to avoid warnings from Spark, since the Spark executors reference the truststore from the configuration anyway.
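For reference, a cleaned-up spark-shell version of the command from the question might then look like the sketch below (each setting needs its own --conf flag; the user, password, and cluster endpoint are placeholders to fill in):
spark-shell \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.3 \
--conf "spark.mongodb.output.uri=mongodb://<user>:<password>@<your-docdb-endpoint>:27017/?replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false" \
--conf "spark.mongodb.output.collection=ecommerceCluster" \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/tmp/certs/rds-truststore.jks -Djavax.net.ssl.trustStorePassword=<yourpassword>"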

Set Cloudera application tags for Spark application

I have set spark.yarn.tags in my Spark application, and it is visible in my config when printed.
But Cloudera Manager is unable to detect it in the application_tags field of the YARN application.
Does application_tags map to spark.yarn.tags for Spark applications?
I think I found the solution.
When spark.yarn.tags is set while calling spark-submit, Cloudera Manager detects it. So I believe it is something that is required before the Spark context is created, hence it has to be passed as a conf while submitting.
This is how it can be passed to spark-submit:
--conf spark.yarn.tags=tag-name
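For example, a full submit command might look like this (the class, jar, and tag names are placeholders):
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.tags=my-app-tag \
--class com.example.MyApp \
myapp.jar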

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local"), what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.getBuilder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to YARN happens before SparkConf.setMaster is ever executed.
When you use --master yarn --deploy-mode cluster, spark-submit runs on your local machine and uploads the jar to run on YARN. YARN allocates a container as the application master to run the Spark driver, i.e. your code. SparkConf.setMaster("local") then runs inside that YARN container, so it creates a SparkContext in local mode and does not use the YARN cluster resources.
I recommend not setting the master in your code. Just use the command-line --master option or the MASTER environment variable to specify the Spark master.
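For example (the class and jar names here are placeholders), the same unchanged jar can then run locally or on the cluster purely by changing the spark-submit arguments:
# local run for development
spark-submit --master "local[*]" --class com.example.MyApp myapp.jar
# the same jar, unchanged, on the YARN cluster
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar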
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It even got improved in recent versions of Spark, where it adds a nice-looking application name (aka appName) so you don't have to set it (or even... please don't... hardcode it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of for having setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application that way.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.

Passing multiple jar files in DC/OS spark-submit; comma-separated jars not suitable

Suggestions needed: I need to pass lots of jar files to dcos spark submit, and listing the jars comma-separated is not suitable.
I tried the options below:
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars /spark_submit_jobs/new1/unzip_new/* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars local:* 30'
dcos spark run --submit-args='--class com.gre.music.inn.orrd.SpaneBasicApp --jars https://s3-us-west-2.amazonaws.com/gmu_jars/* 30'
The last one won't work because, I guess, wildcards are not allowed with HTTP.
Update from DC/OS:
--jars isn't supported via dcos spark run (Spark cluster mode). We'll have support for it around DC/OS 1.10, when we move Spark over to Marathon instead of the Spark dispatcher. In the meantime, if you want to use --jars, you'll have to submit your job in client mode via spark-submit through Metronome or Marathon.
As far as I know you can't use wildcards, and you need to put the JARs somewhere where Spark can access them in a distributed manner (S3, http, hdfs, etc.).
See
http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
You can't use wildcards with the --jars argument of spark-submit. Here's the feature request for that (it's still open).
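If the JARs sit in a directory on the submitting machine and you end up using plain spark-submit in client mode (as the DC/OS note above suggests), one workaround is to build the comma-separated list in the shell yourself. This is only a sketch using the directory and class name from the question; it assumes the jar file names contain no spaces, and the application jar is a placeholder:
JARS=$(echo /spark_submit_jobs/new1/unzip_new/*.jar | tr ' ' ',')
spark-submit --class com.gre.music.inn.orrd.SpaneBasicApp --jars "$JARS" <your-application>.jar 30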

submitting spark Jobs on standalone cluster

How do I externally add dependent jars when submitting a Spark job?
I would also like to know how to package dependent jars with the application jar.
This is a popular question. I looked for a good answer on Stack Overflow but didn't find anything that answers it exactly as asked, so I will try to answer it here:
The best way to submit a job is to use the spark-submit script. This assumes that you already have a running cluster (distributed or local, it doesn't matter).
You can find this script under $SPARK_HOME/bin/spark-submit.
Here is an example:
spark-submit --name "YourAppNameHere" --class com.path.to.main --master spark://localhost:7077 --driver-memory 1G --conf spark.executor.memory=4g --conf spark.cores.max=100 theUberJar.jar
You give the app a name, define where your main class is located, and point to the Spark master (where the cluster runs). You can optionally pass other parameters. The last argument is the name of the uber jar that contains your main class and all your dependencies.
theUberJar.jar relates to your second question, how to package your app. If you are using Scala, the best way is to use sbt and create an uber jar using sbt-assembly.
Here are the steps:
Create your uber jar using sbt assembly
Start the cluster ($SPARK_HOME/sbin/start-all.sh)
Submit the App to your running cluster using the uber jar from step 1
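As a rough sketch, those steps as shell commands (assuming an sbt project with the sbt-assembly plugin configured; the assembled jar path depends on your Scala version and build settings):
# step 1: build the uber jar
sbt assembly
# step 2: start the standalone cluster
$SPARK_HOME/sbin/start-all.sh
# step 3: submit using the uber jar from step 1
spark-submit --name "YourAppNameHere" --class com.path.to.main \
--master spark://localhost:7077 target/scala-2.11/theUberJar.jar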
