PySpark freezes in client mode with YARN cluster manager - apache-spark

Following these instructions: https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/ I set up a 3-node cluster and am able to run spark-shell. But when I try to run pyspark I get these messages:
hadoop@master:~$ pyspark
Python 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/15 21:51:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/15 21:51:06 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
19/02/15 21:51:12 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
and then the screen freezes (there are no further messages).
I have no idea how I could solve this issue.
PS: As explained in the link, I first deployed a 3-node Hadoop-YARN cluster and then installed Spark on the master node (after launching yarn-start.sh).
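For debugging, a minimal client-mode smoke test against YARN looks roughly like the sketch below (assumptions: HADOOP_CONF_DIR/YARN_CONF_DIR point at the cluster configuration, the same Python version is available on every node, and the app name is arbitrary). If even a job this small never returns, the YARN ResourceManager UI usually shows whether the application is stuck waiting for resources.
# Minimal client-mode smoke test against YARN (a sketch, not a fix).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')                   # client deploy mode is the default
         .appName('yarn-client-smoke-test')
         .getOrCreate())

print(spark.range(1000).count())           # tiny job; if this hangs, check the
                                           # application state in the YARN UI
spark.stop()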

Related

Databricks Connect: can't connect to remote cluster on Azure, command 'databricks-connect test' stops

I am trying to set up Databricks Connect so that I can work with a remote Databricks cluster already running in a workspace on Azure.
When I try to run the command 'databricks-connect test', it never finishes.
I followed the official documentation.
I've installed the most recent Anaconda (Python 3.7).
I've created a local environment:
conda create --name dbconnect python=3.5
I've installed databricks-connect version 5.1, which matches the configuration of my cluster on Azure Databricks.
pip install -U databricks-connect==5.1.*
I've already run 'databricks-connect configure' as follows:
(base) C:\>databricks-connect configure
The current configuration is:
* Databricks Host: ******.azuredatabricks.net
* Databricks Token: ************************************
* Cluster ID: ****-******-*******
* Org ID: ****************
* Port: 8787
After the above steps I try to run the 'test' command for Databricks Connect:
databricks-connect test
and as a result the procedure starts but stops after the warning about MetricsSystem, as shown below:
(dbconnect) C:\>databricks-connect test
* PySpark is installed at c:\users\miltad\appdata\local\continuum\anaconda3\envs\dbconnect\lib\site-packages\pyspark
* Checking java version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
* Testing scala command
19/05/31 08:14:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/31 08:14:34 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
I expect the process to move on to the next steps, as it does in the official documentation:
* Testing scala command
18/12/10 16:38:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/12/10 16:38:50 WARN MetricsSystem: Using default name SparkStatusTracker for source because neither spark.metrics.namespace nor spark.app.id is set.
18/12/10 16:39:53 WARN SparkServiceRPCClient: Now tracking server state for 5abb7c7e-df8e-4290-947c-c9a38601024e, invalidating prev state
18/12/10 16:39:59 WARN SparkServiceRPCClient: Syncing 129 files (176036 bytes) took 3003 ms
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
So my process stops after 'WARN MetricsSystem: Using default name SparkStatusTracker'.
What am I doing wrong? Should I configure something more?
Looks like this feature isn't officially supported on runtimes 5.3 or below. If there are limitations on updating the runtime, I would make sure the Spark conf is set as follows:
spark.databricks.service.server.enabled true
However, with the older runtimes things still might be wonky. I would recommend doing this with runtime 5.5 or 6.1 or above.
Lots of people seem to be seeing this issue with the test command on Windows, but if you actually use Databricks Connect it works fine, so it seems safe to ignore.
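One way to confirm that: instead of relying on the test command, run a small job through the connection itself. A minimal sketch, assuming databricks-connect has been installed and configured as shown above:
# Exercise the databricks-connect connection directly (sketch).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # picks up the databricks-connect settings
print(spark.range(100).count())              # prints 100 if the remote cluster is reachable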

How to forward Spark logs to a Jupyter notebook?

I know I can set the log level via spark.sparkContext.setLogLevel('INFO'). Logs such as the following appear in the terminal, but not in the Jupyter notebook.
2019-03-25 11:42:37 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-25 11:42:37 WARN SparkConf:66 - In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
2019-03-25 11:42:38 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
The Spark session is created in local mode in a Jupyter notebook cell.
spark = SparkSession \
    .builder \
    .master('local[7]') \
    .appName('Notebook') \
    .getOrCreate()
Is there any way to forward the logs to the Jupyter notebook?
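One possible workaround (a sketch, not the only approach): in local mode all Spark logging runs in the driver JVM, so you can attach a log4j FileAppender to it through py4j and read the file back in the notebook. Note that this only captures messages emitted after the appender is attached, not the startup warnings shown above, and the file path is just an example.
# Attach a file appender to the driver's root log4j logger (sketch).
log4j = spark._jvm.org.apache.log4j
appender = log4j.FileAppender(
    log4j.PatternLayout("%d %p %c{1}: %m%n"),   # timestamp, level, logger, message
    "/tmp/spark-driver.log")                    # example path; pick your own
log4j.LogManager.getRootLogger().addAppender(appender)

# Later, in another cell, display what has been captured so far:
with open("/tmp/spark-driver.log") as f:
    print(f.read())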

Why does pyspark fail with "Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'"?

For the life of me I cannot figure out what is wrong with my PySpark install. I have installed all dependencies, including Hadoop, but PySpark can't find it. Am I diagnosing this correctly?
See the full error message below; it ultimately fails in PySpark SQL with:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
nickeleres@Nicks-MBP:~$ pyspark
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/spark-2.2.0/jars/hadoop-auth-2.7.3.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
17/10/24 21:21:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
Traceback (most recent call last):
File "/opt/spark/python/pyspark/shell.py", line 45, in <module>
spark = SparkSession.builder\
File "/opt/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
>>>
tl;dr Close all the other Spark processes and start over.
The following WARN messages say that there is another process (or multiple processes) holding the ports.
I'm sure that the process(es) are Spark processes, e.g. pyspark sessions or Spark applications.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/10/24 21:21:59 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
That's why, after Spark/pyspark finally found a free port for the web UI (4043 here), it tried to instantiate HiveSessionStateBuilder and failed.
pyspark failed because you cannot have more than one Spark application up and running that uses the same local Hive metastore.
Why does this happen?
Because you try to create a new session more than once, e.g. in different browser tabs of a Jupyter notebook.
Solution:
Start the session in a single Jupyter notebook tab and avoid creating new sessions in other tabs.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('EXAMPLE').getOrCreate()
We received the same error while trying to create a Spark session from a Jupyter notebook.
We noticed that in our case the user did not have permission to the Spark scratch directory, i.e. the directory set by the Spark property spark.local.dir. We changed the permissions of the directory so that the user has full access to it, and the issue was resolved. Generally this directory resides somewhere like /tmp/user.
Please note that, as per the Spark documentation, the scratch directory is a "Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks".
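If fixing the directory permissions is not convenient, spark.local.dir can also be pointed at a directory the user can write to when the session is built. A minimal sketch for local mode (the path is only an example; on YARN this property is overridden by the cluster manager, as noted in the warning earlier on this page):
# Point spark.local.dir at a writable scratch directory (sketch).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('scratch-dir-example')
         .config('spark.local.dir', '/tmp/spark-scratch')   # must be writable by this user
         .getOrCreate())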
Another possible cause is that the Spark application failed to start because the minimum machine requirements were not met.
In the Application history tab:
Diagnostics:Uncaught exception: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=5, maxVirtualCores=4
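In that case the request has to fit within what YARN allows (yarn.scheduler.maximum-allocation-vcores, 4 in the error above), or the YARN limit has to be raised. A minimal sketch of capping the request on the Spark side; the values are taken from the error and should be adjusted to your cluster:
# Keep the requested cores within YARN's maximum (sketch).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('yarn')
         .appName('resource-fit-example')
         .config('spark.executor.cores', '4')   # must not exceed maxVirtualCores
         .getOrCreate())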

Getting an error when running spark-shell in CDH 5.7

I am new to Spark and am using CDH 5.7 to run it, but I get the error below when I run spark-shell in a terminal. I have started all Cloudera services, including Spark, via Launch Cloudera Express. Please help.
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated. Type :help for more information.
16/07/13 02:14:53 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.44.133 instead (on interface eth1)
16/07/13 02:14:53 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/07/13 02:19:28 ERROR spark.SparkContext: Error initializing SparkContext. org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/user/spark/applicationHistory":spark:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:281)
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:262)

Running Spark on Yarn

I am trying to run Spark on YARN in the Cloudera QuickStart VM. It already has Spark 1.3 and Hadoop 2.6.0-cdh5.4.0 installed. (I am not using spark-submit since I want to run a different version of Spark.)
I am able to run Spark 1.3 on YARN but get the error below with Spark 1.4.
The log shows it is running Spark 1.4, yet it fails on a method that exists in 1.4 but not in 1.3. Even the fat jar contains the 1.4 class files.
As far as running on YARN goes, the installed Spark version shouldn't matter, but it still seems to pick up the other version.
Hadoop Version:
Hadoop 2.6.0-cdh5.4.0
Subversion http://github.com/cloudera/hadoop -r c788a14a5de9ecd968d1e2666e8765c5f018c271
Compiled by jenkins on 2015-04-21T19:18Z
Compiled with protoc 2.5.0
From source with checksum cd78f139c66c13ab5cee96e15a629025
This command was run using /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.4.0.jar
Error:
LogType:stderr
Log Upload Time:Tue Oct 20 21:58:56 -0700 2015
LogLength:2334
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/filecache/10/simple-yarn-app-1.1.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/20 21:58:50 INFO spark.SparkContext: Running Spark version 1.4.0
15/10/20 21:58:53 INFO spark.SecurityManager: Changing view acls to: yarn
15/10/20 21:58:53 INFO spark.SecurityManager: Changing modify acls to: yarn
15/10/20 21:58:53 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn); users with modify permissions: Set(yarn)
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.network.util.JavaUtils.timeStringAsSec(Ljava/lang/String;)J
at org.apache.spark.util.Utils$.timeStringAsSeconds(Utils.scala:1027)
at org.apache.spark.SparkConf.getTimeAsSeconds(SparkConf.scala:194)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:68)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.rpc.akka.AkkaRpcEnvFactory.create(AkkaRpcEnv.scala:245)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:52)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:247)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:188)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:267)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:424)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at com.hortonworks.simpleyarnapp.HelloWorld.main(HelloWorld.java:50)
15/10/20 21:58:53 INFO util.Utils: Shutdown hook called
Please help
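For what it's worth, a quick diagnostic (a sketch, shown in PySpark for consistency with the rest of this page) is to ask the running JVM which jar the offending class was loaded from; if it points at the cluster's Spark 1.3 assembly rather than your fat jar, the container classpath is putting the installed Spark ahead of the bundled 1.4 classes. Run it wherever the error occurs, since the classpath may differ between nodes.
# Print the Spark version in use and the jar that supplied JavaUtils,
# the class behind the NoSuchMethodError above (diagnostic sketch).
from pyspark import SparkContext

sc = SparkContext(appName='classpath-check')
print(sc.version)
clazz = sc._jvm.java.lang.Class.forName('org.apache.spark.network.util.JavaUtils')
print(clazz.getProtectionDomain().getCodeSource().getLocation())
sc.stop()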
