Add extra classpath to executors in Spark client mode - apache-spark

I'm using Spark 1.5.1 with the standalone cluster manager. Spark's default spark-assembly-1.5.1-hadoop2.6.0.jar includes Avro 1.7.7. I want to use my custom Avro library for all my Spark jobs; let's call it Avro 1.7.8. This works perfectly in dev mode (master=local[*]). However, when I submit my app to the cluster in client mode, the executors still use the Avro 1.7.7 library. To see which jar the class is loaded from, I resolve its resource URL:
URL url = getClass().getClassLoader().getResource(GenericData.class.getName().replace('.','/')+".class");
When I print this, my executor's log shows:
/opt/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar/org/apache/avro/generic/GenericData.class
Here is part of my spark-env.sh on the worker node:
export SPARK_WORKER_OPTS="-Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true"
Here is my worker process on the worker node (ps aux | grep worker) :
spark 955 1.8 1.9 4161448 243600 ? Sl 13:29 0:09 /usr/java/jdk1.7.0_79/jre/bin/java -cp /home/ansible/avro-1.7.8.jar:/etc/spark-worker/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-a-01:7077
I have, of course, copied /home/ansible/avro-1.7.8.jar to all my worker nodes.
Does anyone know how to force the executors to use my jar instead of the one in the Spark assembly?

Try using the --packages option to spark-submit:
spark-submit --packages org.apache.avro:avro:1.7.8 ....
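Something like that, with the coordinates adjusted to your build: --packages resolves Maven coordinates, so the custom jar has to be published to a repository that spark-submit can reach. If you're not using spark-submit, use it; this is exactly what it is for.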
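One more thing worth checking, offered as a sketch rather than a confirmed fix: the -D options in SPARK_WORKER_OPTS land on the Worker daemon's JVM (they are visible in the ps output in the question), whereas spark.executor.* properties are read from the application's own configuration. Passing the executor classpath at submit time would look roughly like this. The jar path and master URL come from the question; com.example.MyApp and my-app.jar are placeholder names I made up. The echo only prints the command for review.

```shell
# Path of the replacement Avro jar; the question says it exists on every worker node.
AVRO_JAR=/home/ansible/avro-1.7.8.jar

# Executor properties are application settings, so pass them with --conf at submit time.
# com.example.MyApp and my-app.jar are placeholder names (assumptions, not from the question).
CMD="spark-submit --master spark://spark-a-01:7077 --deploy-mode client \
--conf spark.executor.extraClassPath=${AVRO_JAR} \
--class com.example.MyApp my-app.jar"

# Dry run: print the command for review. To actually submit, use: eval "$CMD"
echo "$CMD"
```

spark.executor.userClassPathFirst is left out on purpose: it is a separate, experimental option that concerns jars shipped with the application (for example through --jars), not entries placed on the executor classpath with extraClassPath.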

Related

What's the command to get Spark driver memory in spark-shell

I know Spark-related configuration can be found in the spark-env.sh file, but what is the command to get it from spark-shell?
For example, to get spark.driver.memory, should I use
set spark.driver.memory
The above isn't working.
You can pass the memory as a configuration option when launching spark-shell:
spark-shell --conf spark.driver.memory=2g
This starts a Spark shell with 2g of driver memory. To read the value inside the shell, use the predefined SparkContext, sc:
val conf = sc.getConf
val driverMemory = conf.get("spark.driver.memory")
This evaluates to the string 2g. Note that conf.get throws a NoSuchElementException if the key was never set; conf.getOption returns an Option instead.

Adding dependency jars to Spark classpath in standalone mode

I'm trying to run my local Java program, which has dependencies, on Spark. I ran spark-submit as follows:
spark-submit --class com.cerner.doc.DocumentExtractor /Users/sp054800/Downloads/Docs_lib_jar/Docs_RestAPI.jar
after setting the following properties:
spark.driver.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.driver.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraClassPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
spark.executor.extraLibraryPath /Users/sp054800/Downloads/Docs_lib_jar/lib/*
in spark-defaults.conf, but it still does not work. Could anyone help me fix this and tell me how I need to include the jars in Spark? I'm using Spark 2.2.0.

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.Exception: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which master URL and port you are using. On a cluster, log in to the master node and pass:
--master spark://XXXX:7077
You can always find the master URL in the Spark master web UI, which listens on port 8080 by default.
Also check whether your SparkSession builder already sets a master, because a value set in code takes priority over the --master flag at launch. For example:
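import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]") // set in code, so it takes priority over --master
  .getOrCreate()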

SparkDeploySchedulerBackend Error: Application has been killed. All masters are unresponsive

When I start the Spark shell:
bin>./spark-shell
I get the following error:
Spark assembly has been built with Hive, including Data nucleus jars on classpath
Welcome to SPARK VERSION 1.3.0
Using Scala version 2.10.4 (Java HotSpot(TM) Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
15/05/10 12:12:21 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/10 12:12:21 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
I installed Spark by following this guide: http://www.philchen.com/2015/02/16/how-to-install-apache-spark-and-cassandra-stack-on-ubuntu
You should supply your Spark cluster's master URL when starting spark-shell.
At a minimum:
bin/spark-shell --master spark://master-ip:7077
The full list of options is long; you can look up the ones you need with:
bin/spark-shell --help
I am assuming that you are running this in standalone/local mode.
Run your Spark shell with the following line. local[*] runs Spark on the local machine with as many worker threads as there are available cores.
bin/spark-shell --master local[*]
http://spark.apache.org/docs/1.2.1/submitting-applications.html#master-urls
You also need to start the Spark master and a slave (worker) before running the spark-submit command:
start-master.sh
start-slave.sh spark://spark:7077
Then use:
spark-submit --master spark://spark:7077
Look in your log files for "permission denied" errors. It may be that your client service doesn't have the permissions it needs to access the master's folders.

Spark SQL thrift server can't run in cluster mode?

In Spark 1.2.0, when I attempt to start the Spark SQL thrift server in cluster mode, I get the following output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
========================================
Jar url 'spark-internal' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar)
Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options]
Usage: DriverClient kill <active-master> <driver-id>
Options:
-c CORES, --cores CORES Number of cores to request (default: 1)
-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512)
-s, --supervise Whether to restart the driver on failure
-v, --verbose Print more debugging output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
The "spark-internal" argument seems to be a special flag to tell spark-submit that the class to be run is part of Spark's libraries, so it doesn't need to distribute a jar. But for some reason, this doesn't seem to be working here.
I filed this as SPARK-5176, and it will be addressed with an error message explaining that the Thrift server cannot run in cluster mode.
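In the meantime, the practical route is to start the server without --deploy-mode cluster, so that the driver runs on the machine that launches it. A minimal sketch, assuming the command is run from the Spark home directory and reusing the master URL from the output above; the echo only prints the command for review.

```shell
# Master URL taken from the spark-submit output in the question.
MASTER_URL=spark://xd-spark.xdata.data-tactics-corp.com:7077

# No --deploy-mode flag: spark-submit then falls back to its default, client mode.
CMD="sbin/start-thriftserver.sh --master ${MASTER_URL}"

# Dry run: print the command. To actually start the server, use: eval "$CMD"
echo "$CMD"
```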
