Spark SQL thrift server can't run in cluster mode? - apache-spark

In Spark 1.2.0, when I attempt to start the Spark SQL thrift server in cluster mode, I get the following output:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
========================================
Jar url 'spark-internal' is not in valid format.
Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar)
Usage: DriverClient [options] launch <active-master> <jar-url> <main-class> [driver options]
Usage: DriverClient kill <active-master> <driver-id>
Options:
-c CORES, --cores CORES Number of cores to request (default: 1)
-m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512)
-s, --supervise Whether to restart the driver on failure
-v, --verbose Print more debugging output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
The "spark-internal" argument seems to be a special flag to tell spark-submit that the class to be run is part of Spark's libraries, so it doesn't need to distribute a jar. But for some reason, this doesn't seem to be working here.

I filed this as SPARK-5176, and it will be addressed with an error message that explains that the Thrift server cannot run in cluster mode.
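As a workaround, the Thrift server can still be launched with the default client deploy mode, so the driver runs on the machine where the script is invoked; a minimal sketch, reusing the master URL from the command above:
# client deploy mode (the default); no --deploy-mode cluster here
./sbin/start-thriftserver.sh --master spark://xd-spark.xdata.data-tactics-corp.com:7077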

Related

Running spark application in local mode

I'm trying to start my Spark application in local mode using spark-submit. I am using Spark 2.0.2, Hadoop 2.6 & Scala 2.11.8 on Windows. The application runs fine from within my IDE (IntelliJ), and I can also start it on a cluster with actual, physical executors.
The command I'm running is
spark-submit --class [MyClassName] --master local[*] target/[MyApp]-jar-with-dependencies.jar [Params]
Spark starts up as usual, but then terminates with
java.io.IOException: Failed to connect to /192.168.88.1:56370
What am I missing here?
Check which master URL and port you are using: if you are on a cluster, log in to the master node and include:
--master spark://XXXX:7077
You can always find this URL in the Spark UI on port 8080.
Also check your Spark builder config: if you have already set the master there, it takes priority over the one given at launch, e.g.:
val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()
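If you want the --master passed to spark-submit to take effect instead, one option is to leave the master out of the builder; a minimal sketch (the app name is illustrative):
import org.apache.spark.sql.SparkSession

// No .master() here, so the master comes from spark-submit's --master flag
val spark = SparkSession
  .builder
  .appName("myapp")
  .getOrCreate()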

Add extra classpath to executors in Spark client mode

I'm using Spark 1.5.1 with the standalone cluster manager. Spark's default spark-assembly-1.5.1-hadoop2.6.0.jar includes Avro 1.7.7. I want to use my custom Avro library for all my Spark jobs, let's call it Avro 1.7.8. This works perfectly in dev mode (master=local[*]). However, when I submit my app to the cluster in client mode, the executors still use Avro 1.7.7 library.
URL url = getClass().getClassLoader().getResource(GenericData.class.getName().replace('.','/')+".class");
When I print this, my executor's log shows:
/opt/spark/lib/spark-assembly-1.5.1-hadoop2.6.0.jar/org/apache/avro/generic/GenericData.class
Here is a part of my spark-env.sh on the worker node:
export SPARK_WORKER_OPTS="-Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true"
Here is my worker process on the worker node (ps aux | grep worker):
spark 955 1.8 1.9 4161448 243600 ? Sl 13:29 0:09 /usr/java/jdk1.7.0_79/jre/bin/java -cp /home/ansible/avro-1.7.8.jar:/etc/spark-worker/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Dspark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar -Dspark.executor.userClassPathFirst=true -Xms512m -Xmx512m -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-a-01:7077
Obviously, I put this jar : /home/ansible/avro-1.7.8.jar in all my worker nodes.
Does anyone know how to force the executors to use my jar instead of the one from the Spark assembly?
Try using the --packages option to spark-submit:
spark-submit --packages org.apache.avro:avro:1.7.8 ....
Something like that should work. If you're not using spark-submit, use it; this is exactly what it is for.
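Alternatively, since the jar is already present on every worker node, the same executor settings can be passed through spark-submit instead of SPARK_WORKER_OPTS; a sketch, where the main class and application jar are placeholders for your own job:
# main class and application jar below are placeholders
spark-submit \
  --class com.example.MyJob \
  --master spark://spark-a-01:7077 \
  --conf spark.executor.extraClassPath=/home/ansible/avro-1.7.8.jar \
  --conf spark.executor.userClassPathFirst=true \
  my-app.jar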

Spark ignores SPARK_WORKER_MEMORY?

I'm using standalone cluster mode, Spark 1.5.2.
Even though I'm setting SPARK_WORKER_MEMORY in spark-env.sh, it looks like this setting is ignored.
I can't find any indication in the scripts under bin/sbin that -Xms/-Xmx are set.
If I run ps on the worker pid, it looks like its memory is set to 1G:
[hadoop#sl-env1-hadoop1 spark-1.5.2-bin-hadoop2.6]$ ps -ef | grep 20232
hadoop 20232 1 0 02:01 ? 00:00:22 /usr/java/latest//bin/java
-cp /workspace/3rd-party/spark/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/workspace/
3rd-party/spark/spark-1.5.2-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/workspace/
3rd-party/hadoop/2.6.3//etc/hadoop/ -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker
--webui-port 8081 spark://10.52.39.92:7077
spark-defaults.conf:
spark.master spark://10.52.39.92:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.memory 2g
spark.executor.cores 1
spark-env.sh:
export SPARK_MASTER_IP=10.52.39.92
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=12g
Am I missing something?
Thanks.
When using spark-shell or spark-submit, use the --executor-memory option.
When configuring it for a standalone jar, set the system property programmatically before creating the spark context.
System.setProperty("spark.executor.memory", executorMemory)
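For example, a minimal sketch of the same idea using SparkConf, which is the equivalent and more idiomatic way to set it before the context exists (the app name and memory value are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

// Executor memory must be set before the SparkContext is created
val conf = new SparkConf()
  .setAppName("myapp")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)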
You are using the wrong setting for cluster mode.
SPARK_EXECUTOR_MEMORY is the right option to set executor memory in cluster mode.
SPARK_WORKER_MEMORY works only in standalone deploy mode.
Another way to set executor memory from the command line: -Dspark.executor.memory=2g
Have a look at another related SE question about these settings:
Spark configuration, what is the difference of SPARK_DRIVER_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_WORKER_MEMORY?
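For example, in spark-env.sh (the values are illustrative):
# Memory per executor, as recommended in the answer above
export SPARK_EXECUTOR_MEMORY=2g
# Total memory a standalone worker may hand out to its executors
export SPARK_WORKER_MEMORY=12g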
This is my configuration in cluster mode, in spark-defaults.conf:
spark.driver.memory 5g
spark.executor.memory 6g
spark.executor.cores 4
Do you have something like this?
If you don't add these settings (with your own values), the Spark executor will get 1 GB of RAM by default.
Alternatively, you can pass these options to spark-submit like this:
# Run on a YARN cluster ("cluster" can be "client" for client deploy mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000
When you run an application, check the master web UI at (ip/name of master):8080 to see whether the resources have been allocated correctly.
I've encountered the same problem as yours. The reason is that, in standalone mode, spark.executor.memory is actually ignored; what has an effect is spark.driver.memory, because the executor lives inside the driver.
So what you can do is set spark.driver.memory as high as you want.
This is where I found the explanation:
How to set Apache Spark Executor memory
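For example, in spark-defaults.conf (the value is purely illustrative):
spark.driver.memory 8g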

SparkDeploySchedulerBackend Error: Application has been killed. All masters are unresponsive

When I start the Spark shell:
bin>./spark-shell
I get the following error:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to SPARK VERSION 1.3.0
Using Scala version 2.10.4 (Java HotSpot(TM) Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.
15/05/10 12:12:21 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/05/10 12:12:21 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
I installed Spark by following this link: http://www.philchen.com/2015/02/16/how-to-install-apache-spark-and-cassandra-stack-on-ubuntu
You should supply your Spark cluster's master URL when starting spark-shell.
At a minimum:
bin/spark-shell --master spark://master-ip:7077
The full list of options is long; you can find the suitable ones yourself with:
bin/spark-shell --help
I am assuming that you are running this in standalone/local mode.
Run your Spark shell with the following line. local[*] indicates that you are using all the available cores of your master, which is the local machine.
bin/spark-shell --master local[*]
http://spark.apache.org/docs/1.2.1/submitting-applications.html#master-urls
You also need to start the Spark master and a slave before running the spark-submit command:
start-master.sh
start-slave.sh spark://spark:7077
then use
spark-submit --master spark://spark:7077
Look at your log files for "permission denied" errors... It may happen that your client service doesn't have the proper permissions to access your master's folders.

Spark SQL Thrift Server on CDH 5.3.0

I am trying to use CDH 5.3.0 to run Spark's Thrift Server. I'm trying to follow the Spark SQL instructions, but I can't even get the --help option to run successfully. In the output below, it dies because it can't find the HiveServer2 class.
$ /usr/lib/spark/sbin/start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
Thrift server options:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hive/service/server/HiveServer2
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.hive.service.server.HiveServer2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
As indicated by the error, the class is not in the classpath. Unfortunately, setting the CLASSPATH environment variable won't work. The only solution that I could find was to edit /usr/lib/spark/bin/compute-classpath.sh and add this line (it can go just about anywhere, but put it one line from the end to make it clear that it's an addition):
CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*"
Cloudera's release notes for 5.3.0 explicitly state "Spark SQL remains an experimental and unsupported feature in CDH", so it's not surprising that tweaks like this may be needed. Also, this response to a similar problem in CDH 5.2 suggests that the Hive jars are deliberately excluded by Cloudera for size reasons.
I faced the same problem, but I solved it in another way.
My Cloudera CDH version was not 5.3.0, it was some version prior to that, so you may find the paths slightly different.
The solution was simply to replace the spark-assembly-**.jar file shipped with Cloudera CDH with one from another version.
I downloaded Spark from its official download page. The version I downloaded was built for Hadoop 2.4 and later. Extract the downloaded archive and look for spark-assembly-**.jar.
In the Cloudera installation, I looked for the same file and found it under /usr/lib/spark/lib/spark-assembly-**.jar.
That path was actually a symlink to the real file. I uploaded the jar from the Spark download to the same path and made the symlink point to the new jar (ln -f -s target link).
Everything now works fine for me.
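A sketch of that symlink swap, with hypothetical file names (check your installation for the real assembly jar name and symlink):
# file names below are hypothetical; adjust to your CDH and Spark versions
cd /usr/lib/spark/lib
ln -f -s /opt/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar spark-assembly.jar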
/usr/lib/spark/bin/compute-classpath.sh sets CLASSPATH="$SPARK_CLASSPATH". On CDH using parcels you can add the hive jars to SPARK_CLASSPATH like this:
SPARK_CLASSPATH=$(ls -1 /opt/cloudera/parcels/CDH/lib/hive/lib/*.jar | sed -e :a -e 'N;s/\n/:/;ta') /opt/cloudera/parcels/CDH/lib/spark/sbin/start-thriftserver.sh --help
Instructions from the Cloudera Community forum (http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-5-does-not-have-Spark-Thrift-Server/m-p/41849#M1758):
git clone https://github.com/cloudera/spark.git
cd spark
./make-distribution.sh -DskipTests \
-Dhadoop.version=2.6.0-cdh5.7.0 \
-Phadoop-2.6 \
-Pyarn \
-Phive -Phive-thriftserver \
-Pflume-provided \
-Phadoop-provided \
-Phbase-provided \
-Phive-provided \
-Pparquet-provided
-Phive and -Phive-thriftserver are the key pieces there.
There is a request to add the Spark Thrift Server to CDH:
https://issues.cloudera.org/browse/DISTRO-817
Please vote it up if you want to see it included.
