spark-xml library is parsing the XML file many times - apache-spark

I use the spark-xml library from Databricks to parse an XML file (550 MB).
Dataset<Row> books = spark.sqlContext().read()
.format("com.databricks.spark.xml")
.option("rootTag", "books")
.option("rowTag", "book")
.option("treatEmptyValuesAsNulls", "true")
.load("path");
Spark parses the file a first time, with many tasks/partitions.
Then, when I call this code:
books.select("code").count()
Spark starts parsing the file again.
Is there a way to avoid re-parsing the file on each call on the dataset?
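Spark evaluates datasets lazily, so every action (count, collect, ...) re-runs the whole read, including the XML parsing, unless the result is persisted. The usual remedy is to cache the dataset after loading it. A minimal PySpark sketch of the idea (the Java Dataset API has the same cache() and count() methods):
books = spark.read.format("com.databricks.spark.xml") \
    .option("rootTag", "books") \
    .option("rowTag", "book") \
    .option("treatEmptyValuesAsNulls", "true") \
    .load("path")  # "path" is the placeholder from the question

books.cache()                  # keep the parsed rows in memory, spilling to disk if needed
books.count()                  # first action: parses the XML once and fills the cache
books.select("code").count()   # later actions reuse the cached rows instead of re-parsing
Supplying an explicit schema with .schema(...) also avoids the extra pass spark-xml makes over the file to infer the schema.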

Related

Structured streaming: File source stream fails to start with "No usable value for path"

I have a file source stream that reads data from one s3 bucket and writes its results to another, like this:
data_sdf = spark.readStream \
.schema(input_data.schema) \
.parquet("s3://my_input_folder")
results_sdf = process(data_sdf)
results_query = results_sdf.writeStream \
.format("parquet") \
.option("path", "s3://my_results_folder") \
.option("checkpointLocation", "s3://my_checkpoint_folder") \
.queryName("results_query") \
.start()
where process() is an arbitrary function for transforming a Spark DataFrame.
However after 'starting' the results_query stream, these statements:
print(spark.streams.active)
print(results_query.status)
print(results_query.lastProgress)
result in:
[]
{'message': 'Terminated with exception: No usable value for path\nDid not find value which can be converted into java.lang.String', 'isDataAvailable': False, 'isTriggerActive': False}
None
What I haven't said is that I have another structured stream that writes files to the input folder, s3://my_input_folder. However, I still get the exception above even after stopping this other stream before starting the one above. But if I instead write a regular, non-streaming DataFrame to the same input folder, then the stream above works.
Does anyone know what I'm doing wrong?
Interestingly, the Spark tutorial on structured streaming here says this about file input sources:
NOTE 2: The source path should not be used from multiple sources or queries when enabling this option. Similarly, you must ensure the source path doesn't match to any files in output directory of file stream sink.
So maybe I should not be trying to read files from the same folder into which I write files from another stream; and the rather cryptic error No usable value for path is to be expected. I guess file stream sinks don't write files atomically in the target directory.
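As a debugging aid, when lastProgress only shows the one-line "Terminated with exception" message, the full stack trace is usually easier to get from the query object itself. A small PySpark sketch using the results_query from above:
# exception() returns a StreamingQueryException (with the full stack trace) once the
# query has failed, or None while it is still running normally.
err = results_query.exception()
if err is not None:
    print(err)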

How to read excel as a pyspark dataframe

I am able to read all files and formats like CSV, Parquet, and Delta from an ADLS Gen2 account with OAuth2 credentials.
However, when I try to read an Excel file like below,
df = spark.read.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.option("dataAddress", "'excel sheet name'!A1") \
.load(filepath)
I am getting the below error:
Failure to initialize configuration: Invalid configuration value detected for fs.azure.account.key
Note: I have installed the external library "com.crealytics:spark-excel_2.11:0.12.2" to read Excel as a dataframe.
Can anyone help me with the error here?
Try using this in your configs: "fs.azure.account.oauth2.client.secret": "<key-name>",
Different versions have different sets of parameters, so try using the latest release: https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
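For reference, a typical OAuth 2.0 service-principal configuration for ADLS Gen2 over ABFS looks roughly like the sketch below; the storage account, application id, client secret, and tenant id are placeholders to substitute with your own values:
# All names in angle brackets are placeholders.
spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", "<client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
The fs.azure.account.key message generally indicates that the reader found no OAuth settings for the account and fell back to storage-key authentication.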

Reading AVRO from Azure Datalake in Databricks

I am trying to read Event Hub data in AVRO format. I am having issues loading the data into a dataframe in Databricks.
Here's the code I am using. Please let me know if I am doing anything wrong
path='/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/*.avro'
df = spark.read.format("com.databricks.spark.avro") \
.load(path)
Error
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI:
I did try some code to get around the error, but I am getting syntax errors:
import org.apache.spark.sql.SparkSession
SparkSession spark = SparkSession
.builder()
.config("spark.sql.warehouse.dir","/mnt/datastore/origin/zone=raw/subject=customer_events/source=EventHub/ver=1.0/")
.getOrCreate()
SyntaxError: invalid syntax
File "<command-265213674761208>", line 2
SparkSession spark = SparkSession
Relative path in absolute URI
You need to specify the protocol rather than use /mnt
For example, wasb://some/path/ if reading from Azure blobstore
You can also exclude *.avro since the Avro reader should already pick up all Avro files in the path
https://docs.databricks.com/data/data-sources/read-avro.html#python-api
And if you want to read from EventHub, that exposes a Kafka API, not a filepath, AFAIK
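For illustration, a fully qualified path (with made-up container and storage-account names) could look like the sketch below, with the *.avro glob dropped:
# wasb://<container>@<storage-account>.blob.core.windows.net/<directory> -- names are hypothetical
path = "wasb://raw@mystorageaccount.blob.core.windows.net/subject=customer_events/source=EventHub/ver=1.0/"
df = spark.read.format("com.databricks.spark.avro").load(path)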

Databricks: convert data frame and export to xls / xlsx

Is it possible in Databricks to convert a data frame, export it to xls/xlsx, and save it to blob storage?
Using Python
Here's an example of writing a dataframe to excel:
Using pyspark:
df.write \
.format("com.crealytics.spark.excel") \
.option("dataAddress", "'My Sheet'!B3:C35") \
.option("useHeader", "true") \
.option("dateFormat", "yy-mmm-d") \
.option("timestampFormat", "mm-dd-yyyy hh:mm:ss") \
.mode("append") \
.save("Worktime2.xlsx")
Based upon this library: spark-excel by Crealytics.
The following way does not require as much maneuvering. First, convert your pyspark dataframe to a pandas data frame (toPandas()) and then use to_excel to write it out in Excel format.
import pandas
df.describe().toPandas().to_excel('fileOutput.xls', sheet_name = 'Sheet1', index = False)
Note: the above requires the xlwt package to be installed (pip install xlwt on the command line).
Does it have to be an Excel file? CSV files are so much easier to work with. You can certainly open a CSV in Excel and save it as an Excel file. As far as I know, you can write directly to Blob storage and completely bypass the step of storing the data locally.
df.write \
.format("com.databricks.spark.csv") \
.option("header", "true") \
.save("myfile.csv")
In this example, you can try changing the extension to xls before you run the job. I can't test this because I don't have Databricks set up on my personal laptop.

How to write dataset object to excel in spark java?

I am reading an Excel file using the com.crealytics.spark.excel package.
Below is the code to read an Excel file in Spark Java:
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("com.crealytics.spark.excel")
.option("location", "D:\\5Kto10K.xlsx")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false")
.load("com.databricks.spark.csv");
Then I tried the same (com.crealytics.spark.excel) package to write the dataset object to an Excel file in Spark Java:
SourcePropertSet.write()
.format("com.crealytics.spark.excel")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.option("addColorColumns", "false").save("D:\\resultset.xlsx");
But I am getting the below error:
java.lang.RuntimeException: com.crealytics.spark.excel.DefaultSource
does not allow create table as select.
I even tried with the org.zuinnote.spark.office.excel package as well.
Below is the code for that:
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("write.locale.bcp47", "de")
.save("D:\\result");
I have added the following dependencies in my pom.xml:
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>hadoopoffice-fileformat</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>com.github.zuinnote</groupId>
<artifactId>spark-hadoopoffice-ds_2.11</artifactId>
<version>1.0.3</version>
</dependency>
But I am getting the below error:
java.lang.IllegalAccessError: tried to access method org.zuinnote.hadoop.office.format.mapreduce.ExcelFileOutputFormat.getSuffix(Ljava/lang/String;)Ljava/lang/String;
from class org.zuinnote.spark.office.excel.ExcelOutputWriterFactory
Please help me write a dataset object to an Excel file in Spark Java.
It looks like the library you chose, com.crealytics.spark.excel, does not have any code related to writing Excel files. Underneath, it uses Apache POI for reading Excel files; there are also a few examples.
The good news is that CSV is a valid Excel file, and you may use spark-csv to write it. You need to change your code like this:
sourcePropertySet.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv");
Keep in mind that Spark makes 1 output file per partition, and you might want to do .repartition(1) to have exactly one result file.
The error you face when writing comes from an old version of the HadoopOffice library. Please make sure that you have only version 1.0.3 or, better, 1.0.4 as a dependency. Can you provide your build file? The following should work:
SourcePropertSet.write()
.format("org.zuinnote.spark.office.excel")
.option("spark.write.useHeader",true)
.option("write.locale.bcp47", "us")
.save("D:\\result");
Version 1.0.4 of the Spark2 data source for HadoopOffice also supports inferring the schema when reading:
Dataset<Row> SourcePropertSet = sqlContext.read()
.format("org.zuinnote.spark.office.excel")
.option("spark.read.useHeader", "true")
.option("spark.read.simpleMode", "true")
.load("D:\\5Kto10K.xlsx");
Please note that it is not recommended to mix different Excel data sources based on POI in one application.
More information here: https://github.com/ZuInnoTe/spark-hadoopoffice-ds
