PySpark3 - Reading XML files - azure

I'm trying to read an XML file in my PySpark3 Jupyter notebook (running in Azure).
I have this code:
df = spark.read.load("wasb:///data/test/Sample Data.xml")
However I keep getting the error java.io.IOException: Could not read footer for file:
An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
I know it's reaching the file, because the length in the error matches the XML file size, but it gets stuck after that.
Any ideas?
Thanks.

Please refer to the two blog posts below; I think they answer your question completely.
Azure Blob Storage with Pyspark
Reading JSON, CSV and XML files efficiently in Apache Spark
The code looks like this:
session = SparkSession.builder.getOrCreate()
session.conf.set(
"fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
"<your-storage-account-access-key>"
)
# OR SAS token for a container:
# session.conf.set(
# "fs.azure.sas.<container-name>.blob.core.windows.net",
# "<sas-token>"
# )
# your Sample Data.xml file in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
If you are using Azure Databricks, I think the code will work as expected. Otherwise, you may need to install the com.databricks.spark.xml library on your Apache Spark cluster.
Hope it helps.
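The "Could not read footer" error is consistent with calling spark.read.load without a format: Spark then falls back to its default data source, which is Parquet unless reconfigured, and tries to read a Parquet footer from the XML file. The sketch below is a minimal illustration, not a definitive implementation. The helper names wasbs_url and read_xml, and the container and account names, are made up for this example; read_xml assumes an existing SparkSession and that the spark-xml library is installed on the cluster.

```python
def wasbs_url(container, account, path=""):
    """Build a wasbs:// URL; container and account are joined with '@'."""
    return "wasbs://{}@{}.blob.core.windows.net/{}".format(
        container, account, path.lstrip("/"))


def read_xml(spark, url, row_tag):
    """Read XML through the spark-xml data source.

    Assumes `spark` is an existing SparkSession and that the
    com.databricks.spark.xml package is available on the cluster.
    The format is named explicitly so Spark does not fall back to
    its default data source.
    """
    return (spark.read
            .format("com.databricks.spark.xml")
            .option("rowTag", row_tag)
            .load(url))


# Hypothetical container and account names, for illustration only.
print(wasbs_url("mycontainer", "myaccount", "/data/test/"))
```

With a live session, the call would look like read_xml(session, wasbs_url("<container-name>", "<storage-account-name>", "data/test/"), "book"), where rowTag must match the repeating element in your XML.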

Related

Block size invalid or too large - Failed to read Avro files

I'm using Spark and Scala, and trying to read Avro folders using
com.databricks spark-avro_2.11. All the folders were read successfully except for one, which failed with the exception below.
I checked the files manually, and all of them seem fine and not corrupted. What could be the problem?
Caused by: org.apache.avro.AvroRuntimeException: java.io.IOException:
Block size invalid or too large for this implementation: -26 at
org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at
org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)

How to set pythonpath at startup when running jupyter notebook server

I am running a Jupyter notebook public server as in this tutorial : http://jupyter-notebook.readthedocs.io/en/stable/public_server.html
I want to use pyspark-2.2.1 with this server. I pip-installed py4j and downloaded spark-2.2.1 from the repository.
Locally, I added the following lines to my .bashrc:
export SPARK_HOME='/home/ubuntu/spark-2.2.1-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
and everything works fine when I run python locally.
However, when using the notebook server, I cannot import pyspark, because the above commands are not executed when the Jupyter notebook server starts.
I partly (and inelegantly) solved the issue by typing
import sys
sys.path.append("/home/ubuntu/spark-2.2.1-bin-hadoop2.7/python")
in the first cell of my notebook. But
from pyspark import SparkContext
sc = SparkContext()
myrdd = sc.textFile('exemple.txt')
myrdd.collect() # Everything works fine up to here
words = myrdd.map(lambda x:x.split())
words.collect()
returns the error
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Cannot run program "python": error=2, No such file or directory
Any idea how I can set the correct paths (either manually or at startup)?
Thanks
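The 'Cannot run program "python"' error suggests that the worker processes cannot find an executable named python, which is the name PySpark uses when the PYSPARK_PYTHON environment variable is unset. One commonly used workaround, sketched here under that assumption, is to set the environment from the first notebook cell before creating the SparkContext. The helper name configure_pyspark_env is made up for this example, and the path is the one from the question.

```python
import os
import sys


def configure_pyspark_env(spark_home, python_exe=sys.executable):
    """Point this process, and the workers it launches, at Spark and Python."""
    os.environ["SPARK_HOME"] = spark_home
    # Workers are started with the interpreter named here instead of
    # whatever "python" resolves to (or fails to resolve to) on the PATH.
    os.environ["PYSPARK_PYTHON"] = python_exe
    pyspark_dir = os.path.join(spark_home, "python")
    if pyspark_dir not in sys.path:
        sys.path.insert(0, pyspark_dir)
    return pyspark_dir


# Must run before the SparkContext is created.
configure_pyspark_env("/home/ubuntu/spark-2.2.1-bin-hadoop2.7")
```

After this cell, from pyspark import SparkContext followed by SparkContext() can go in the next cell. Exporting the same variables in the environment that launches the notebook server would be the startup-time equivalent.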

Failed to get broadcast_1_piece0 of broadcast_1 in pyspark application

I am building an application on Apache Spark 2.0.0 with Python 3.4. It loads some CSV files from HDFS (Hadoop 2.7) and computes some KPIs from the CSV data.
I randomly get a "Failed to get broadcast_1_piece0 of broadcast_1" error in my application, and it stops.
After a lot of searching on Google and Stack Overflow, the only fix I found is to manually delete the files created by the Spark app from the /tmp directory. The error generally happens when an application has been running for a long time and is not responding properly, but its files are still in /tmp.
I don't declare any broadcast variables myself, but Spark may be doing so on its own.
In my case, the error occurs when it tries to load a CSV file from HDFS.
I have captured low-level logs for my application and attached them here, and would appreciate suggestions or best practices for resolving the problem.
Sample (details are Attached here):
Traceback (most recent call last): File
"/home/hadoop/development/kpiengine.py", line 258, in
df_ho_raw =
sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(HDFS_BASE_URL
+ HDFS_WORK_DIR + filename) File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 147, in load File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in call File
"/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
63, in deco File
"/usr/local/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o44.load. : org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times,
most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.26.7.192):
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_1_piece0 of broadcast_1
Your class should extend Serializable.
This may be an error in your code rather than in the framework. You can check by running the examples in
$SPARK_HOME/examples/src/main/scala/org/apache/spark/examples/
If those run fine, you should check your own code.

module error in multi-node spark job on google cloud cluster

This code runs perfectly when I set the master to localhost. The problem occurs when I submit it to a cluster with two worker nodes.
All the machines have the same version of Python and the same packages. I have also set the path to point to the desired Python version, i.e. 3.5.1. When I submit my Spark job from the master's SSH session, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/init.py", line 25, in
import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes. I do. I get the same message for the other worker node. I am not sure whether I am missing some environment setting. Any help will be much appreciated.
Not sure if this qualifies as a solution, but I submitted the same job using Dataproc on Google Cloud Platform and it worked without any problem. I believe the best way to run jobs on a Google cluster is via the utilities offered by the platform. The Dataproc utility seems to iron out any issues related to the environment.
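An ImportError raised inside worker.py means the interpreter that the executors actually run cannot see numpy, whatever is installed for the interactive shell on those machines. A small diagnostic, sketched below as an assumption-laden example rather than a fix, reports the hostname, interpreter path, and module availability for each executor. The helper names probe_module and probe_cluster are made up here; probe_cluster assumes an existing SparkContext.

```python
import importlib.util
import socket
import sys


def probe_module(name):
    """Return (hostname, interpreter path, whether `name` is importable here)."""
    try:
        found = importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        found = False
    return (socket.gethostname(), sys.executable, found)


def probe_cluster(sc, name, partitions=8):
    """Run probe_module on the executors; `sc` is an existing SparkContext."""
    return (sc.parallelize(range(partitions), partitions)
            .map(lambda _: probe_module(name))
            .distinct()
            .collect())


# Local check on the driver; on a cluster, call probe_cluster(sc, "numpy").
print(probe_module("json"))
```

If the interpreter path reported by the executors differs from the Python 3.5.1 installation you expect, the job is picking up a different Python on the workers than the one where the packages were installed.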

Spark job submitted from local machine to remote cluster can't see data on remote server

This post may seem a bit long, but I am providing all the details to show what I am trying to achieve and what I have already tried.
I am trying to submit a Spark job to a remote cluster from Eclipse running locally on a Windows 7 machine, but it fails to find the input path to the data on the cluster nodes. I followed a suggestion made in this forum and configured the SparkContext as follows, setting spark.driver.host to the IP address of the Windows machine.
SparkConf sparkConf = new SparkConf().setAppName("Count Lines")
.set("spark.driver.host", "9.1.194.199") //IP address of Windows 7
.set("spark.driver.port", "51910")
.set("spark.fileserver.port", "51811")
.set("spark.broadcast.port", "51812")
.set("spark.replClassServer.port", "51813")
.set("spark.blockManager.port", "51814")
.setMaster("spark://master.aa.bb.com:7077"); //master hostname
I also had to set HADOOP_HOME to c:\winutils in Eclipse to be able to run this code on Windows.
Then I set the path to the data, which exists on all the nodes of the Spark cluster, as follows:
String topDir = "/data07/html/test";
JavaRDD<String> lines = sc.textFile(topDir+"/*");
However, I get the following error.
5319 [main] INFO org.apache.spark.SparkContext - Created broadcast 0 from textFile at CountLines2.java:65
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/data07/html/test/* matches 0 files
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
Since running the code inside Eclipse needed a local Hadoop installation (i.e., setting HADOOP_HOME to c:\winutils), I modified the code to use a data path that exists locally on the Windows machine. With that modification, the program went a bit further and launched tasks on all the nodes of the cluster, but later failed on a path issue with a different error.
105926 [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 15.2 in stage 0.0 (TID 162) on executor master.aa.bb.com: java.lang.IllegalArgumentException (java.net.URISyntaxException: Relative path in absolute URI: C:%5Cdata%5CMedicalSieve%5Crepositories%5Craw%5CMedscape%5Cclinical/*) [duplicate 162]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 0.0 failed 4 times, most recent failure: Lost task 44.3 in stage 0.0 (TID 148, aalim03.almaden.ibm.com): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: C:%5Cdata%5Chtml%5Ctest/*
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
As a rule of thumb, every input you use should be accessible on every node (both workers and driver). It can be on the local file system of each node, on a distributed file system, or on an external resource.
The only situation when data is shipped directly from the driver is when you use ParallelCollectionRDD with parallelize / makeRDD.
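The two cases above can be sketched in PySpark as follows. This is an illustration under stated assumptions, not the asker's Java code: the helper names are made up, `sc` is assumed to be an existing SparkContext, and the driver-side route only makes sense for data small enough to fit in the driver's memory.

```python
import glob
import os


def read_local_lines(pattern):
    """Read every line of every file matching `pattern` on the driver machine."""
    lines = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue  # skip directories matched by the pattern
        with open(path, encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines


def lines_from_driver(sc, pattern):
    """Ship driver-local data to the cluster with parallelize (small inputs only)."""
    return sc.parallelize(read_local_lines(pattern))


def lines_from_shared_storage(sc, uri):
    """Read from a location that every node can resolve, e.g. an hdfs:// URI."""
    return sc.textFile(uri)
```

With textFile, the path is resolved by the driver when it lists the input files and again by each executor when it reads its split, which is why a path that exists only on one side fails in one of the two ways shown in the question.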
