spark-submit works for yarn-cluster mode but SparkLauncher doesn't, with same params - apache-spark

I'm able to submit a spark job through spark-submit however when I try to do the same programatically using SparkLauncher, it gives me nothing ( I dont even see a Spark job on the UI)
Below is the scenario:
I've a server(say hostname: cr-hdbc101.dev.local:7123) which hosts the hdfs cluster. I push a fat jar to the server which I'm trying to exec.
The following spark-submit works as expected and a spark job is submitted in yarn-cluster mode
spark-submit \
--verbose \
--class com.digital.StartSparkJob \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 2g \
--executor-memory 3g \
--executor-cores 4 \
/usr/share/Deployments/Consolidateservice.jar "<arg_to_main>"
However the following piece of SparkLauncher code doesn't work
val sparkLauncher = new SparkLauncher()
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")
.setAppResource("/usr/share/Deployments/Consolidateservice.jar")
.setMaster("yarn-cluster")
.setVerbose(true)
.setMainClass("com.digital.StartSparkJob")
.setDeployMode("cluster")
.setConf("spark.driver.cores", "2")
.setConf("spark.driver.memory", "2g")
.setConf("spark.executor.cores", "4")
.setConf("spark.executor.memory", "3g")
.addAppArgs(<arg_to_main>)
.startApplication()
I thought maybe SparkLauncher is not getting correct env variables to work with, so I send the following to SparkLauncher, but to no avail(basically I pass everything in the spark-env.sh to SparkLauncher)
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("HADOOP_HOME", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("SPARK_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("SCALA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native")
env.put("SPARK_DIST_CLASSPATH", "/etc/spark/conf.cloudera.spark_on_yarn/classpath.txt")
val sparkLauncher = new SparkLauncher(env)
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")...
What adds to the frustration, is that when I use same SparkLauncher code for yarn-client mode, it works perfectly fine.
Can someone please point to me what am I missing, I just feel I'm staring at the issue without recognizing it.
NOTE: Both the main class(com.digital.StartSparkJob) and SparkLauncher code are part of the fat jar I'm pushing to the server. I just call the SparkLauncher code with an external API, which in turn should open a driver JVM on the cluster
SparkVersion: 1.6.0, scala ver: 2.10.5

I wasn't even getting logs on the Spark-UI...the sparkApp wasn't even running. Therefore I ran the sparkLauncher as a process(using .launch().waitFor() ) so that I can capture the error Logs.
I captured the logs using .getInputStream and .getErrorStream and found out that the user being passed to the cluster is wrong. My cluster will work only for user "abcd".
I did set System.setProperty("HADOOP_USER_NAME", "abcd"), as well as added "spark.yarn.appMasterEnv.HADOOP_USER_NAME=abcd" to spark-default.conf, before launching SparkLauncher. However looks like they don't get ported over to cluster.
I therefore passed the HADOOP_USER_NAME as an childArg to the SparkLauncher
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("HADOOP_USER_NAME", "abcd")
try {
val sparkLauncher = new SparkLauncher(env)...

Related

SparkConf settings not used when running Spark app in cluster mode on YARN

I wrote a Spark application, which sets sets some configuration stuff via SparkConf instance, like this:
SparkConf conf = new SparkConf().setAppName("Test App Name");
conf.set("spark.driver.cores", "1");
conf.set("spark.driver.memory", "1800m");
conf.set("spark.yarn.am.cores", "1");
conf.set("spark.yarn.am.memory", "1800m");
conf.set("spark.executor.instances", "30");
conf.set("spark.executor.cores", "3");
conf.set("spark.executor.memory", "2048m");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile(...);
...
When I run this application with the command (master=yarn & deploy-mode=client)
spark-submit --class spark.MyApp --master yarn --deploy-mode client /home/myuser/application.jar
everything seems to work fine, the Spark History UI shows correct executor information:
But when running it with (master=yarn & deploy-mode=cluster)
my Spark UI shows wrong executor information (~512 MB instead of ~1400 MB):
Also my App name equals Test App Name when running in client mode, but is spark.MyApp when running in cluster mode. It seems that however some default settings are taken when running in Cluster mode. What am I doing wrong here? How can I make these settings for the Cluster mode?
I'm using Spark 1.6.2 on a HDP 2.5 cluster, managed by YARN.
OK, I think I found out the problem! In short form: There's a difference between running Spark settings in Standalone and in YARN-managed mode!
So when you run Spark applications in the Standalone mode, you can focus on the Configuration documentation of Spark, see http://spark.apache.org/docs/1.6.2/configuration.html
You can use the following settings for Driver & Executor CPU/RAM (just as explained in the documentation):
spark.executor.cores
spark.executor.memory
spark.driver.cores
spark.driver.memory
BUT: When running Spark inside a YARN-managed Hadoop environment, you have to be careful with the following settings and consider the following points:
orientate on the "Spark on YARN" documentation rather then on the Configuration documentation linked above: http://spark.apache.org/docs/1.6.2/running-on-yarn.html (the properties explained here have a higher priority then the ones explained in the Configuration docu (this seems to describe only the Standalone cluster vs. client mode, not the YARN cluster vs. client mode!!))
you can't use SparkConf to set properties in yarn-cluster mode! Instead use the corresponding spark-submit parameters:
--executor-cores 5
--executor-memory 5g
--driver-cores 3
--driver-memory 3g
In yarn-client mode you can't use the spark.driver.cores and spark.driver.memory properties! You have to use the corresponding AM properties in a SparkConf instance:
spark.yarn.am.cores
spark.yarn.am.memory
You can't set these AM properties via spark-submit parameters!
To set executor resources in yarn-client mode you can use
spark.executor.cores and spark.executor.memory in SparkConf
--executor-cores and executor-memory parameters in spark-submit
if you set both, the SparkConf settings overwrite the spark-submit parameter values!
This is the textual form of my notes:
Hope I can help anybody else with this findings...
Just to add on to D. Müller's answer:
Same issue happened to me and I tried the settings with some different combination. I am running Pypark 2.0.0 on YARN cluster.
I found that driver-memory must be written during spark submit but executor-memory can be written in script (i.e. SparkConf) and the application will still work.
My application will die if driver-memory is less than 2g. The error is:
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM
ERROR yarn.ApplicationMaster: User application exited with status 143
CASE 1:
driver & executor both written in SparkConf
spark = (SparkSession
.builder
.appName("driver_executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.config("spark.driver.memory","2g")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster myscript.py
CASE 2:
- driver in spark submit
- executor in SparkConf in script
spark = (SparkSession
.builder
.appName("executor_inside")
.enableHiveSupport()
.config("spark.executor.memory","4g")
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
The job Finished with succeed status. Executor memory correct.
CASE 3:
- driver in spark submit
- executor not written
spark = (SparkSession
.builder
.appName("executor_not_written")
.enableHiveSupport()
.config("spark.executor.cores","2")
.config("spark.yarn.executor.memoryOverhead","1024")
.getOrCreate())
spark-submit --master yarn --deploy-mode cluster --conf spark.driver.memory=2g myscript.py
Apparently the executor memory is not set. Meaning CASE 2 actually captured executor memory settings despite writing it inside sparkConf.

With sbt accembly pkg, but still got Err "java.sql.SQLException: No suitable driver found for jdbc:mysql" [duplicate]

I am using
df.write.mode("append").jdbc("jdbc:mysql://ip:port/database", "table_name", properties)
to insert into a table in MySQL.
Also, I have added Class.forName("com.mysql.jdbc.Driver") in my code.
When I submit my Spark application:
spark-submit --class MY_MAIN_CLASS
--master yarn-client
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
This yarn-client mode works for me.
But when I use yarn-cluster mode:
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
It doens't work. I also tried setting "--conf":
spark-submit --class MY_MAIN_CLASS
--master yarn-cluster
--jars /path/to/mysql-connector-java-5.0.8-bin.jar
--driver-class-path /path/to/mysql-connector-java-5.0.8-bin.jar
--conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.0.8-bin.jar
MY_APPLICATION.jar
but still get the "No suitable driver found for jdbc" error.
I had to add the driver option when using the sparkSession's read function.
.option("driver", "org.postgresql.Driver")
var jdbcDF - sparkSession.read
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://<host>:<port>/<DBName>")
.option("dbtable", "<tableName>")
.option("user", "<user>")
.option("password", "<password>")
.load()
Depending on how your dependencies are setup, you'll notice that when you include something like compile group: 'org.postgresql', name: 'postgresql', version: '42.2.8' in Gradle, for example, this will include the Driver class at org/postgresql/Driver.class, and that's the one you want to instruct spark to load.
There is 3 possible solutions,
You might want to assembly you application with your build manager (Maven,SBT) thus you'll not need to add the dependecies in your spark-submit cli.
You can use the following option in your spark-submit cli :
--jars $(echo ./lib/*.jar | tr ' ' ',')
Explanation : Supposing that you have all your jars in a lib directory in your project root, this will read all the libraries and add them to the application submit.
You can also try to configure these 2 variables : spark.driver.extraClassPath and spark.executor.extraClassPath in SPARK_HOME/conf/spark-default.conf file and specify the value of these variables as the path of the jar file. Ensure that the same path exists on worker nodes.
I tried the suggestions shown here which didn't work for me (with mysql). While debugging through the DriverManager code, I realized that I needed to register my driver since this was not happening automatically with "spark-submit". I therefore added
Driver driver = new Driver();
The constructor registers the driver with the DriverManager, which solved the SQLException problem for me.

Spark YARN on EMR - JavaSparkContext - IllegalStateException: Library directory does not exist

I have Java Spark job that works on manually deployed Spark 1.6.0 in standalone mode on an EC2.
I am spark-submitting this job to a EMR 5.3.0 cluster on the master using YARN but it fails.
Spark-submit line is,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
The "yarn-client" is the first argument to the x.jar application and is fed to the SparkContext as setMaster,
conf.setMaster(args[0]);
When I submit it, it starts out running fine, until I initialize the JavaSparkContext from a SparkConf,
JavaSparkContext sc = new JavaSparkContext(conf);
... and then Spark crashes.
In the YARN log, I can see the following,
yarn logs -applicationId application_1487325147456_0051
...
17/02/17 16:27:13 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/17 16:27:13 INFO Client: Deleted staging directory hdfs://ip-172-31-8-237.eu-west-1.compute.internal:8020/user/ec2-user/.sparkStaging/application_1487325147456_0052
17/02/17 16:27:13 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Library directory '/mnt/yarn/usercache/ec2-user/appcache/application_1487325147456_0051/container_1487325147456_0051_01_000001/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
...
Noting the WARN of spark.yarn.jars flag missing, I found a spark yarn JAR file in
/usr/lib/spark/jars/
... and uploaded it to HDFS per Cloudera's guide on how to run YARN applications on Spark and tried to add that conf, so this became my spark-submit line,
spark-submit --class <startclass> --master yarn --queue default --deploy-mode cluster --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://`hostname -f`:8020/tmp/ourSparkLogs --conf spark.yarn.jars=hdfs://`hostname -f`:8020/sparkyarnlibs/spark-yarn_2.11-2.1.0.jar --driver-memory 4G --executor-memory 4G --executor-cores 2 hdfs://`hostname -f`:8020/data/x.jar yarn-client
But that did not work and gave this:
Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
I am really puzzled as to what that Library error is caused by and how to proceed onwards from here.
You have specified "--deploy-mode cluster" and yet are calling conf.setMaster("yarn-client") from the code. Using a master URL of "yarn-client" means "use YARN as the master, and use client mode (not cluster mode)", so I wouldn't be surprised if this is somehow confusing Spark because on one hand you're telling it to use cluster mode and on the other you're telling it to use client mode.
By the way, using a master URL like "yarn-client" or "yarn-cluster" is actually deprecated because the "-client" or "-cluster" part is not really part of the Master but rather is the deploy mode. That is, "--master yarn-client" is really more of a shortcut/alias for "--master yarn --deploy-mode client", and similarly "--master yarn-cluster" just means "--master yarn --deploy-mode cluster".
My recommendation would be to not call conf.setMaster() from your code, since the master is already set to "yarn" automatically in /etc/spark/conf/spark-defaults.conf. For this reason, you also don't need to pass "--master yarn" to spark-submit.
Lastly, it sounds like you need to decide whether you really want to use client deploy mode or cluster deploy mode. With client deploy mode, the driver runs on the master instance, and with cluster deploy mode, the driver runs in a YARN container on one of the core/task instances. See https://spark.apache.org/docs/latest/running-on-yarn.html for more information.
If you want to use client deploy mode, you don't need to pass anything extra because it's already the default. If you want to use cluster deploy mode, pass "--deploy-mode cluster".

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which port you are using: if on cluster: log in to master node and include:
--master spark://XXXX:7077
You can find it always in spark ui under port 8080
Also check your spark builder config if you have set master already as it takes priority when launching eg:
val spark = SparkSession
.builder
.appName("myapp")
.master("local[*]")

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run succesfully, however they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simply Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe #MrChristine is correct -- the option flags you specify are being passed to your python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors since by default it will run on a single core and use two executors.
Its not true that python script doesn't run in cluster mode. I am not sure about previous versions but this is executing in spark 2.2 version on Hortonworks cluster.
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setMaster("yarn")
.setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output : Its big so i am not pasting. But it runs perfect.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.

Resources