I was building an application on Apache Spark 2.0.0 with Python 3.4, trying to load some CSV files from HDFS (Hadoop 2.7) and compute some KPIs from that CSV data.
I used to hit the error "Failed to get broadcast_1_piece0 of broadcast_1" randomly in my application, and it stopped.
After a lot of searching on Google and Stack Overflow, I only found how to get rid of it by manually deleting the files the Spark application created in the /tmp directory. It generally happens when an application has been running for a long time and is not responding properly, while its files remain in the /tmp directory.
I don't declare any broadcast variable myself, but maybe Spark is creating one on its own.
In my case, the error occurs when it is trying to load the CSV from HDFS.
I have collected low-level logs for my application and attached them here for support and suggestions/best practices, so that I can resolve the problem.
Sample (details are attached here):
Traceback (most recent call last):
  File "/home/hadoop/development/kpiengine.py", line 258, in <module>
    df_ho_raw = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(HDFS_BASE_URL + HDFS_WORK_DIR + filename)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 147, in load
  File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.26.7.192): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
You should extend Serializable in your class.
It could be a framework/environment error rather than a bug in your code. You can test this by running one of the bundled examples under
$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/
If the example runs fine, you should check your own code.
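For reference, Spark 2.0 also ships a built-in CSV reader, so the external com.databricks.spark.csv package is not strictly needed. A minimal sketch, assuming a hypothetical HDFS path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kpiengine").getOrCreate()

# Spark 2.0's native CSV source; the HDFS URL below is a placeholder,
# substitute your own HDFS_BASE_URL/HDFS_WORK_DIR/filename.
df_ho_raw = spark.read.csv("hdfs://namenode:9000/work/input.csv", header=True)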
I am using the stack below:
Hadoop-2.7.7
spark-2.4.5
Hbase-2.1.0
zk-3.5.9
I want to read and write data on HBase using Spark with the spark-submit command, but I was unable to do so.
I have successfully started all the services and also searched for connectors, but I didn't find a working one.
I tried to build a connector using the link below: https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md
However, the connector build failed. I somehow managed to get connector jars from the internet and tried with those.
When I launch spark-submit with the command below, my application fails:
spark-submit --jars /home/bigdata/downloads/hbase-spark-1.0.0.jar --packages org.apache.hbase:hbase-shaded-mapreduce:2.1.0 /home/bigdata/hbasefload.py
Error:
Traceback (most recent call last):
  File "/home/bigdata/hbasefload.py", line 35, in <module>
    .option("hbase.zookeeper.quorum", "node2.ellicium.com:2181")\
  File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in save
  File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.util.NoSuchElementException: key not found: catalog
When I write to HBase from spark-shell with the above jars, it executes successfully, but it fails with spark-submit.
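For reference, the "key not found: catalog" error indicates the data source expects a JSON catalog describing the HBase table mapping. A minimal sketch, assuming the SHC-style catalog API (the table name, column family, and mapping below are hypothetical):

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical catalog mapping DataFrame columns to HBase columns.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "my_table"},
    "rowkey": "key",
    "columns": {
        "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
        "name": {"cf": "cf1",    "col": "name", "type": "string"}
    }
})

df = spark.createDataFrame([("1", "alice"), ("2", "bob")], ["id", "name"])

# "newtable" is the SHC option asking the connector to create the
# table (with 5 regions here) if it does not already exist.
df.write \
    .options(catalog=catalog, newtable="5") \
    .option("hbase.zookeeper.quorum", "node2.ellicium.com:2181") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()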
I have a Spark job which at the end uses saveAsTable to write the DataFrame into an internal table with a given name.
The DataFrame is built in several steps, one of which uses the "beta" method from scipy, imported via from scipy.stats import beta. The job runs on Google Cloud with 20 worker nodes, but I get the following error complaining about the scipy package:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 7.0 failed 4 times, most recent failure: Lost task 14.3 in stage 7.0 (TID 518, name-w-3.c.somenames.internal, executor 23): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
    command = serializer._read_with_length(file)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
    return pickle.loads(obj)
ImportError: No module named scipy.stats._continuous_distns
Any ideas or solutions?
I also tried passing the library to the Spark job:
"spark.driver.extraLibraryPath" : "/usr/lib/spark/python/lib/pyspark.zip",
"spark.driver.extraClassPath" :"/usr/lib/spark/python/lib/pyspark.zip"
Is the library installed on all the nodes in the cluster?
You can simply do a
pip install --user scipy
I do it on AWS EMR using a bootstrap action; there should be a similar way on Google Cloud as well.
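If you want to confirm which nodes are missing the package, here is a small diagnostic sketch (the app name is arbitrary) that attempts the import inside the executors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scipy-check").getOrCreate()

def try_import(_):
    # Runs on an executor, so it reports that node's Python environment.
    try:
        import scipy.stats  # noqa: F401
        return ["scipy ok"]
    except ImportError as e:
        return [str(e)]

results = spark.sparkContext.parallelize(range(20), 20) \
    .mapPartitions(try_import).distinct().collect()
print(results)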
I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).
I have this code:
df = spark.read.load("wasb:///data/test/Sample Data.xml")
However, I keep getting the error java.io.IOException: Could not read footer for file:
An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I know it's reaching the file, since the reported length matches the XML file size, but I'm stuck after that.
Any ideas?
Thanks.
Please refer to the two blogs below; I think they can answer your question completely.
Azure Blob Storage with Pyspark
Reading JSON, CSV and XML files efficiently in Apache Spark
The code is as below. Note that spark.read.load defaults to the Parquet source when no format is specified, which is why Spark tried to read a Parquet footer from your XML file; you need to name the XML format explicitly.
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR SAS token for a container:
# session.conf.set(
#     "fs.azure.sas.<container-name>.blob.core.windows.net",
#     "<sas-token>"
# )
# your Sample Data.xml file in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
If you were using Azure Databricks, I think the code would work as expected; otherwise, you may need to install the com.databricks.spark.xml library in your Apache Spark cluster.
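For example, the package can be pulled in when the session is created; a minimal sketch (the 2.11 Scala suffix and 0.6.0 version are assumptions, match them to your Spark build):

from pyspark.sql import SparkSession

# spark.jars.packages downloads the Maven coordinates at startup;
# adjust the Scala suffix and version to your cluster.
session = SparkSession.builder \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.6.0") \
    .getOrCreate()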
Hope it helps.
This code runs perfectly when I set the master to localhost. The problem occurs when I submit it on a cluster with two worker nodes.
All the machines have the same version of Python and the same packages. I have also set the path to point to the desired Python version, i.e. 3.5.1. When I submit my Spark job in the master ssh session, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
    import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes; I do. I get the same message for the other worker node. I'm not sure if I am missing some environment setting. Any help will be much appreciated.
Not sure if this qualifies as a solution. I submitted the same job using Dataproc on Google Cloud Platform and it worked without any problem. I believe the best way to run jobs on a Google cluster is via the utilities offered on the platform; Dataproc seems to iron out any issues related to the environment.
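Alternatively, if you stay on your own cluster, the executors must launch a Python interpreter that actually has numpy installed. A minimal sketch of pinning the worker interpreter from the driver script (the path is hypothetical and must be valid on every node):

import os

# Must be set before the SparkContext is created; the workers launch
# this interpreter, so numpy must be installed for it on every node.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("numpy-job"))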
I am running a Spark (1.2.1) standalone cluster on my virtual machines (Ubuntu 12.04). I can run examples such as als.py and pi.py successfully, but I can't run the wordcount.py example because a connection error occurs.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
  File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
    .reduceByKey(add)
  File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
  File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
  File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
  File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I am not using YARN or ZooKeeper, and all the virtual machines can connect to each other via ssh without a password. I also set SPARK_LOCAL_IP for the master and the workers.
I think the wordcount.py example is accessing HDFS to read the lines of a file (and then count the words).
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is usually used for hdfs.
Please make sure that this file is accessible, or do not use HDFS for that example :).
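For example, a minimal sketch that sidesteps HDFS by reading from the local filesystem (the path is hypothetical and the file must exist on every worker):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-local")
# "file://" forces the local filesystem instead of the default HDFS URI.
lines = sc.textFile("file:///home/spark/words.txt")
print(lines.flatMap(lambda line: line.split(" ")).count())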
I hope it helps.