I am trying to fix an issue with running out of memory, and I want to know whether I need to change these settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in the code.
I saw this question, PySpark: java.lang.OutofMemoryError: Java heap space, and it says that the answer depends on whether I'm running in client mode. I'm running Spark on a cluster and monitoring it using the standalone cluster manager.
But how do I figure out whether I'm running Spark in client mode?
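(For reference, a minimal sketch, not from the original post, of what setting such options in code looks like in PySpark. The executor memory can be set this way, but in client mode spark.driver.memory is ignored when set programmatically, because the driver JVM is already running by then; it has to go into spark-defaults.conf or be passed via --driver-memory.)
from pyspark import SparkConf, SparkContext

# Hypothetical app name and memory size, for illustration only.
conf = (SparkConf()
        .setAppName("memory-settings-example")
        .set("spark.executor.memory", "4g"))  # applied when executors are launched

# Setting spark.driver.memory here would have no effect in client mode, because the
# driver JVM has already started; use spark-defaults.conf or --driver-memory instead.
sc = SparkContext(conf=conf)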
If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode. You can easily verify that you cannot run pyspark or any other interactive shell in cluster mode:
$ pyspark --master yarn --deploy-mode cluster
Python 2.7.11 (default, Mar 22 2016, 01:42:54)
[GCC Intel(R) C++ gcc 4.8 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Error: Cluster deploy mode is not applicable to Spark shells.
$ spark-shell --master yarn --deploy-mode cluster
Error: Cluster deploy mode is not applicable to Spark shells.
Examining the contents of the bin/pyspark file may be instructive, too - here is its final line (which is what actually gets executed):
$ pwd
/home/ctsats/spark-1.6.1-bin-hadoop2.6
$ cat bin/pyspark
[...]
exec "${SPARK_HOME}"/bin/spark-submit pyspark-shell-main --name "PySparkShell" "$#"
i.e. pyspark is actually a script run by spark-submit and given the name PySparkShell (by which you can find it in the Spark History Server UI); and since it is run that way, it honors whatever arguments (or defaults) are included in its spark-submit command.
Since sc.deployMode is not available in PySpark, you could check the spark.submit.deployMode configuration property instead:
>>> sc.getConf().get("spark.submit.deployMode")
'client'
This is not available in PySpark.
In the Scala shell, use sc.deployMode:
scala> sc.deployMode
res0: String = client
scala> sc.version
res1: String = 2.1.0-SNAPSHOT
As of Spark 2+, the following works:
for item in spark.sparkContext.getConf().getAll():
    print(item)
(u'spark.submit.deployMode', u'client') # will be one of the items in the list.
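If you only need the deploy mode itself rather than the whole configuration list, a direct lookup also works (a small sketch, assuming an active SparkSession named spark):
print(spark.sparkContext.getConf().get("spark.submit.deployMode"))  # e.g. 'client'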
I am using Spark to do some computation over some data and then push it to Hive. The Cloud Dataproc version is 1.2, with Hive 2.1 included. The MERGE command in Hive is only supported from version 2.2 onwards, so I have to use the preview version for the Dataproc cluster. When I use version 1.2 for the Dataproc cluster, I can create the cluster without any issue; with the preview version I get the error "Failed to bring up Cloud SQL Metastore".
The initialisation script is here. Has anyone ever run into this problem before?
hive-metastore.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install is-enabled hive-metastore
mysql.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable mysql
insserv: warning: current start runlevel(s) (empty) of script `mysql` overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `mysql' overrides LSB defaults (0 1 6).
Created symlink /etc/systemd/system/multi-user.target.wants/cloud-sql-proxy.service → /usr/lib/systemd/system/cloud-sql-proxy.service.
Cloud SQL Proxy installation succeeded
hive-metastore.service is not a native service, redirecting to systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install is-enabled hive-metastore
[2018-06-06T12:43:55+0000]: Failed to bring up Cloud SQL Metastore
I believe the issue may be that your metastore was initialized from an older version of Dataproc and thus has an outdated schema.
If you still have the failed cluster (if not, please create a new one as before; you can use the --single-node option to reduce cost), then SSH to the master node and upgrade the schema:
$ gcloud compute ssh my-cluster-m
$ /usr/lib/hive/bin/schematool -dbType mysql -info
Hive distribution version: 2.3.0
Metastore schema version: 2.1.0 <-- you will need this
org.apache.hadoop.hive.metastore.HiveMetaException: Metastore schema version is
not compatible. Hive Version: 2.3.0, Database Schema Version: 2.1.0
*** schemaTool failed ***
$ /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 2.1.0
Unfortunately, this cluster cannot be returned to a running state, so please delete and recreate it.
I have created this PR to make the issue more discoverable:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/pull/278
When I run the spark-shell command, I see the following error:
# spark-shell
> SPARK_MAJOR_VERSION is set to 2, using Spark2
File "/usr/bin/hdp-select", line 249
print "Packages:"
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Packages:")?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
at org.apache.spark.launcher.Main.main(Main.java:118)
The problem is that the HDP script /usr/bin/hdp-select is apparently being run under Python 3, whereas it contains incompatible Python 2-specific code.
You may port /usr/bin/hdp-select to Python 3 (see the sketch after this list) by:
adding parentheses to the print statements
replacing the line "packages.sort()" with "list(packages).sort()"
replacing the line "os.mkdir(current, 0755)" with "os.mkdir(current, 0o755)"
You may also try to force HDP to run /usr/bin/hdp-select under Python 2:
PYSPARK_DRIVER_PYTHON=python2 PYSPARK_PYTHON=python2 spark-shell
I had the same problem: I set HDP_VERSION before running Spark.
export HDP_VERSION=<your HDP version>
spark-shell
I'm using Spark version 1.3. I have a job that's taking forever to finish.
To fix it, I made some optimizations to the code, and started the job again. Unfortunately, I launched the optimized code before stopping the earlier version, and now I cannot stop the earlier job.
Here are the things I've tried to kill this app:
Through the Web UI
result: The Spark UI has no "kill" option for apps (I'm assuming spark.ui.killEnabled has not been enabled; I'm not the owner of this cluster).
Through the command line: spark-class org.apache.spark.deploy.Client kill mymasterURL app-XXX
result: I get this message:
Driver app-XXX has already finished or does not exist
But I see in the web UI that it is still running, and the resources are still occupied.
Through the command line via spark-submit: spark-submit --master mymasterURL --deploy-mode cluster --kill app-XXX
result: I get this error:
Error: Killing submissions is only supported in standalone mode!
I tried to retrieve the Spark context of the initial app in order to stop it (via SparkContext.stop() or cancelAllJobs()), but have been unsuccessful, as .getOrCreate is not available in 1.3.
I'd appreciate any ideas!
Edit: I've also tried killing the app through yarn by executing: yarn application -kill app-XXX
result: I got this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Invalid ApplicationId prefix: app-XX. The valid ApplicationId should
start with prefix application
mymaster:
$ ./sbin/start-master.sh
myworker:
$ ./sbin/start-slave.sh spark://mymaster:7077
myclient:
$ ./bin/spark-shell --master spark://mymaster:7077
At this point, the log on myworker shows the following, indicating that it has accepted the job:
16/06/01 02:22:41 INFO Worker: Asked to launch executor app-20160601022241-0007/0 for Spark shell
myclient:
scala> sc.textFile("mylocalfile.txt").map(_.length).sum
res0: Double = 3264.0
It works if the file mylocalfile.txt is available on myclient. However, according to the docs, the file should be available on myworker, not on myclient:
If using a path on the local filesystem, the file must also be
accessible at the same path on worker nodes. Either copy the file to
all workers or use a network-mounted shared file system.
what am I missing here?
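(For what it's worth, a minimal PySpark-flavoured sketch of what the quoted requirement means in practice; the path below is hypothetical, and sc is the shell's SparkContext as above. The local file has to exist at the same absolute path on every worker, e.g. copied there or placed on a shared mount.)
# Assumes /data/shared/mylocalfile.txt exists at this same path on every worker node.
total = sc.textFile("file:///data/shared/mylocalfile.txt") \
          .map(lambda line: len(line)) \
          .sum()
print(total)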
I use Spark 1.6.1, Hadoop 2.6.4 and Mesos 0.28 on Debian 8.
While trying to submit a job via spark-submit to a Mesos cluster, a slave fails with the following in its stderr log:
I0427 22:35:39.626055 48258 fetcher.cpp:424] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/ad642fcf-9951-42ad-8f86-cc4f5a5cb408-S0\/hduser","items":[{"action":"BYP$
I0427 22:35:39.628031 48258 fetcher.cpp:379] Fetching URI 'hdfs://xxxxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
I0427 22:35:39.628057 48258 fetcher.cpp:250] Fetching directly into the sandbox directory
I0427 22:35:39.628078 48258 fetcher.cpp:187] Fetching URI 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar'
E0427 22:35:39.629243 48258 shell.hpp:93] Command 'hadoop version 2>&1' failed; this is the output:
sh: 1: hadoop: not found
Failed to fetch 'hdfs://xxxxxxx:54310/sources/spark/SimpleEventCounter.jar': Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was e$
Failed to synchronize with slave (it's probably exited)
My JAR file contains the Hadoop 2.6 binaries
The path to the Spark executor/binary is an hdfs:// link
My jobs don't appear in the Frameworks tab, but they do appear in the driver with the status 'queued', and they just sit there until I shut down the spark-mesos-dispatcher.sh service.
I was seeing a very similar error, and I figured out that my problem was that hadoop_home wasn't set in the Mesos agent.
On each mesos-slave I added the following line to /etc/default/mesos-slave (the path may be different on your install): MESOS_hadoop_home="/path/to/my/hadoop/install/folder/"
EDIT: Hadoop has to be installed on each slave; /path/to/my/hadoop/install/folder/ is a local path.