Spark Read files with special characters in Filename - apache-spark

While trying to reproduce the XML parsing from this Databricks workshop uploaded on GitHub, I'm not able to read the files in the downloaded folder whose filenames contain accented (special) characters.
mkdir -p /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass
wget https://synthetichealth.github.io/synthea-sample-data/downloads/synthea_sample_data_ccda_sep2019.zip -O ./synthea_sample_data_ccda_sep2019.zip
unzip ./synthea_sample_data_ccda_sep2019.zip -d /dbfs/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/
This is probably because the path passed to the spark.read method is treated as a glob/regex pattern. How do I deal with this situation, where the files in the XML folder have special characters in their filenames?
spark.conf.set("spark.sql.caseSensitive", "true")
df = (
    spark.read.format('xml')
    .option("rowTag", "ClinicalDocument")
    .load('/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/')
)
The file Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml exists in the directory, yet the job fails with an error complaining that the file does not exist:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 251.0 failed 4 times, most recent failure: Lost task 43.3 in stage 251.0 (TID 494) (10.139.64.6 executor 0): java.io.FileNotFoundException: /4842022074360943/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/Amalia471_Magaña874_3912696a-0aef-492e-83ef-468262b82966.xml
I'm using Apache Spark 3.2.1 on Databricks with DBR 10.4 LTS.
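One workaround to try (a sketch, not a confirmed fix, and it assumes the directory itself lists fine) is to enumerate the files with dbutils.fs.ls, which is available on Databricks, and hand the exact paths to load() so the reader does not have to expand the directory pattern itself:
src_dir = '/user/hive/warehouse/hls_cms_source.db/raw_files/synthea_mass/ccda/'
# Hedged sketch: list the XML files explicitly and pass the exact paths to load();
# load() accepts a list of paths as well as a single string.
xml_paths = [f.path for f in dbutils.fs.ls(src_dir) if f.path.endswith('.xml')]
df = (
    spark.read.format('xml')
    .option("rowTag", "ClinicalDocument")
    .load(xml_paths)
)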

Related

Cannot save a pyspark word2vec model

I am trying to save a trained PySpark Word2Vec model locally, but this results in an error: Mkdirs failed to create
model.write().overwrite().save("word2vec.model")
22/08/31 12:56:24 WARN TaskSetManager: Lost task 0.0 in stage 421.0
(TID 11440) (100.66.40.74 executor 98): java.io.IOException: Mkdirs
failed to create
file:/home/jovyan/_git/notebooks/word2vec.model/metadata/_temporary/0/_temporary/attempt_202208311256242380698786646139780_0477_m_000000_0
(exists=false, cwd=file:/opt/spark/work-dir) at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:458)
The command does create a word2vec.model folder, but it eventually fails. What could the issue be here?
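No answer is recorded here, but one hedged guess: cwd=file:/opt/spark/work-dir in the stack trace suggests the tasks run on executors that cannot create the local file: path from the driver's filesystem. A sketch of the usual workaround is to save to a filesystem every executor can reach and copy the result locally afterwards; the hdfs:// path below is only an illustration and needs to match your cluster.
# Hedged sketch: write to a shared filesystem (HDFS here, purely as an example)
# instead of a local file: path that the executors may not be able to create.
model.write().overwrite().save("hdfs:///user/jovyan/models/word2vec.model")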

org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from Event Hub and process it in real time with PySpark Streaming. The code works fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
.format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to your Spark jars folder and check whether you have multiple JARs for azure-eventhubs-spark_XXX.XX. If you have downloaded different versions and placed them there, remove any JAR with that name from the folder. This error may also occur if your JAR version is incompatible with the other JARs. Instead, try adding the Spark JARs through the Spark config:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('my-spark') \
    .config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
    .getOrCreate()
This way Spark will download the JAR files through Maven.
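If you want to check for duplicate connector JARs programmatically, here is a small sketch; it assumes the JARs live under $SPARK_HOME/jars, which may differ on your setup:
import glob
import os

# Hypothetical helper: look for more than one azure-eventhubs-spark JAR in the
# Spark jars directory; adjust jars_dir if your distribution keeps them elsewhere.
jars_dir = os.path.join(os.environ.get("SPARK_HOME", "/opt/spark"), "jars")
candidates = glob.glob(os.path.join(jars_dir, "azure-eventhubs-spark_*.jar"))
if len(candidates) > 1:
    print("Multiple connector JARs found; keep only one:")
    for jar in candidates:
        print("  " + jar)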

Spark Parquet read error: java.io.EOFException: Reached the end of stream with XXXXX bytes left to read

If you face the below problem while reading Parquet files in Spark:
App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 44, 10.23.5.196, executor 2): java.io.EOFException: Reached the end of stream with 193212 bytes left to read
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
App > at org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
App > at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
App > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
App > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
App > at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:124)
App > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:215)
for the below Spark commands:
val df = spark.read.parquet("s3a://.../file.parquet")
df.show(5, false)
For me, the above didn't do the trick, but the following did:
--conf spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
Not sure why, but what gave me a hint was this issue and some details about the options here.
I think you can bypass this issue with
--conf spark.sql.parquet.enableVectorizedReader=false
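If you prefer to set these options in code rather than on the command line, the same settings can be applied on the SparkSession builder; this is just a sketch that mirrors the --conf flags above:
from pyspark.sql import SparkSession

# Same settings as the --conf flags above, applied from code.
spark = (
    SparkSession.builder
    .appName("parquet-eof-workaround")
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    .getOrCreate()
)
df = spark.read.parquet("s3a://.../file.parquet")  # path elided as in the question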

Why Spark jobs don't work on Zeppelin while they work when using the pyspark shell

I'm trying to execute the following code on Zeppelin:
df = spark.read.csv('/path/to/csv')
df.show(3)
but I get the following error:
Py4JJavaError: An error occurred while calling o786.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 39.0 failed 4 times, most recent failure: Lost task 5.3 in stage 39.0 (TID 326, 172.16.23.92, executor 0): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
I have Hadoop 2.7.3 running on a 2-node cluster, Spark 2.3.2 running in standalone mode, and Zeppelin 0.8.1. This problem only occurs when using Zeppelin, and I have SPARK_HOME set in the Zeppelin configuration.
I solved it. The problem was that Zeppelin was using commons-lang3-3.5.jar while Spark was using commons-lang-2.6.jar, so all I did was add the JAR path to the Zeppelin configuration in the interpreter menu:
1. Click the 'Interpreter' menu in the navigation bar.
2. Click the 'edit' button of the interpreter you want to load dependencies into.
3. Fill in the artifact and exclude fields to your needs; add the path to the respective JAR file.
4. Press 'Save' to restart the interpreter with the loaded libraries.
Zeppelin is using its commons-lang2 JAR to stream to the Spark executors, while local Spark is using commons-lang3. As Achref mentioned, just fill out the artifact location of commons-lang3 and restart the interpreter; then you should be good.
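For reference, the artifact field in step 3 takes either a local JAR path or a Maven groupId:artifactId:version coordinate; for the commons-lang3 3.5 build mentioned above, the coordinate would look like:
org.apache.commons:commons-lang3:3.5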

Hive on Spark StackOverflowError

I am running Hive on Spark on CDH 5.10 and I get the below error. I have checked all the logs of YARN, Hive, and Spark, but there is no useful information apart from the following:
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, xxx.local, executor 1): java.lang.StackOverflowError
Try setting the following parameters before executing your query:
set spark.executor.extraJavaOptions=-Xss16m;
set hive.execution.engine=spark;
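If the overflow turns out to happen on the driver side instead, the driver stack can be raised the same way; this is an additional, unverified suggestion, and the 16m value simply mirrors the executor setting above:
set spark.driver.extraJavaOptions=-Xss16m;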
