module error in multi-node spark job on google cloud cluster

module error in multi-node spark job on google cloud cluster - python-3.x

This code runs perfect when I set master to localhost. The problem occurs when I submit on a cluster with two worker nodes.
All the machines have same version of python and packages. I have also set the path to point to the desired python version i.e. 3.5.1. when I submit my spark job on the master ssh session. I get the following error -
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/init.py", line 25, in
import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes. I do. I get the same message for the other worker node. not sure if I am missing some environment setting. Any help will be much appreciated.

Not sure if this qualifies as a solution. I submitted the same job using dataproc on google platform and it worked without any problem. I believe the best way to run jobs on google cluster is via the utilities offered on google platform. The dataproc utility seems to iron out any issues related to the environment.

Related

hbase-spark connector for spark2.1.0

I am using below stack
Hadoop-2.7.7
spark-2.4.5
Hbase-2.1.0
zk-3.5.9
I want to read and write data on hbase using spark with spark-submit command. But i was unable to do so.
I have successfully started all services and also searched connectors for same but i didn't get.
I have tried to create connectors using below link https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md
But connector build getting failed somehow i have made it possible to get connectors from internet and tried with it
when i try to launch spark submit with below command my application is failing
spark-submit --jars /home/bigdata/downloads/hbase-spark-1.0.0.jar --packages org.apache.hbase:hbase-shaded-mapreduce:2.1.0 /home/bigdata/hbasefload.py
Error:
Traceback (most recent call last):
File "/home/bigdata/hbasefload.py", line 35, in <module>
.option("hbase.zookeeper.quorum", "node2.ellicium.com:2181")\
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 73 7, in save
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328 , in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.util.NoSuchElementException: key not found: catalog
As i try to write on to hbase using spark-shell with above jars it successfully get executed but failing with spark-submit.

ImportError: No module named scipy.stats._continuous_distns

I have Spark job which at the end uses saveAsTable to write the dataframe into an internal table w/ a given name.
The dataframe is created using different steps which one of them is using "beta" method in scipy, where I imported it through => from scipy.stats import beta. It's running on google cloud w/ 20 worker nodes but I get the following error which is complaining about scipy package,
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 14 in stage 7.0 failed 4 times, most recent failure:
Lost task 14.3 in stage 7.0 (TID 518, name-w-3.c.somenames.internal,
executor 23): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in
_read_with_length
return self.loads(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named scipy.stats._continuous_distns
Any idea or solutions?
I tried to pass the library as well for the spark job:
"spark.driver.extraLibraryPath" : "/usr/lib/spark/python/lib/pyspark.zip",
"spark.driver.extraClassPath" :"/usr/lib/spark/python/lib/pyspark.zip"

Is the library installed on all the nodes in the cluster?
You can simply do a
pip install --user scipy
I do it in AWS EMR using the bootstrap action, There should be a similar way on Google cloud as well

Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application

I was building an application on Apache Spark 2.00 with Python 3.4 and trying to load some CSV files from HDFS (Hadoop 2.7) and process some KPI out of those CSV data.
I use to face "Failed to get broadcast_1_piece0 of broadcast_1" error randomly in my application and it stopped.
After searching a lot google and stakeoverflow, I found only how to get rid of it by deleting spark app created files manually from /tmp directory. It happens generally when an application is running for long and it's not responding properly but related files are in /tmp directory.
Though I don't declare any variable for broadcast but may be spark is doing at its own.
In my case, the error occurs when it is trying to load csv from hdfs.
I have taken low level logs for my application and attached herewith for support and suggestions/best practice so that I can resolve the problem.
Sample (details are Attached here):
Traceback (most recent call last): File
"/home/hadoop/development/kpiengine.py", line 258, in
df_ho_raw =
sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(HDFS_BASE_URL
+ HDFS_WORK_DIR + filename) File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 147, in load File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in call File
"/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
63, in deco File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o44.load. : org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.26.7.192):
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_1_piece0 of broadcast_1

You should to extends Serializable for your class
Your code Framework error, you can test it
$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/
If it's ok, you should check your code.

Spark on EMR Yarn - EOF Error

We are running some PySpark processes on Yarn, when the datasets increase in size we are getting this error in the yarn log:
Traceback (most recent call last):
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
raise EOFError
java.net.SocketException: Socket is closed
at java.net.Socket.shutdownOutput(Socket.java:1496)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply$mcV$sp(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.tryLog(Utils.scala:1785)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
We are running on a EMR Setup 3*m3.xlarge - each with 4vCPUs, 15GiB and 2x40 GB
The job is executed with the following sh script:
export SPARK_HOME=/home/hadoop/spark
JARS="/home/hadoop/avro-1.7.7.jar,/home/hadoop/spark-avro-master/target/scala-2.10/spark-avro_2.10-1.0.0.jar”
$SPARK_HOME/bin/spark-submit --master yarn-cluster --py-files deploy.zip --jars $JARS main.py
where deploy.zip contains some utility methods and lambda functions
No other configuration changes were made to the cluster.
By looking at the UI seems that the all the jobs are finishing with a SUCCESS status, nevertheless we would like to get rid of this issue, or at the very least to understand what's causing it.
Would you have any idea on what it might be the origin of the error?
Thanks!

"KeyError: 'SPARK_HOME' ", "can't load main class from JAR" in running PySpark as an Oozie workflow job

This issue is a continuation of my previous question here, which was seemingly resolved but leads to here as another issue.
I am using Spark 1.4.0 on Cloudera QuickstartVM CHD-5.4.0.
When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:
KeyError: 'SPARK_HOME'
Then I came across this solution and this which are actually for Spark 1.3.0, although I still did try. The documentations seem to say that this issue is already fixed for Spark version 1.3.2 and 1.4.0 (but here I am, encountering the same issue).
The suggested solution in the link was that I need to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to anything, even if it's just any path that does not point to actual SPARK_HOME (i.e., /bogus, although I did set these to actual SPARK_HOME).
Here's my workflow after:
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local[2]</master>
<mode>client</mode>
<name>${name}</name>
<jar>${workflowRootLocal}/lib/my_pyspark_job.py</jar>
<spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
</spark>
Which seems to solve the original problem above. However, it leads to another error when I try to inspect stderr of Oozie container log:
Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark
If I am using Python, it should not expect for a main class right? Please note in my previous related post that the Oozie job example shipped with Cloudera QuickstartVM CDH-5.4.0, which features a SparkAction written in Java was working in my tests. It seems that the issue is only in Python.
Appreciate greatly anyone that can help.

Rather than setting spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME variables, try and add the following lines of code to your python script before setting your SparkConf()
os.environ["SPARK_HOME"] = "/path/to/spark/installed/location"
Found the reference here
This helped me resolve the error you face, but I faced the following error afterwards
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
main()
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
sc = SparkContext(conf=conf)
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

module error in multi-node spark job on google cloud cluster - python-3.x

Related

hbase-spark connector for spark2.1.0

ImportError: No module named scipy.stats._continuous_distns

Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application

Spark on EMR Yarn - EOF Error

"KeyError: 'SPARK_HOME' ", "can't load main class from JAR" in running PySpark as an Oozie workflow job

Categories

Resources