dask read parquet file from spark - apache-spark

For a Parquet file written from Spark (without any partitioning), the directory looks like:
%ls foo.parquet
part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00019-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
_SUCCESS
When trying to read via pandas:
pd.read_parquet('foo.parquet')
everything works fine as expected.
However, when using dask it fails:
dd.read_parquet('foo.parquet')
[Errno 17] File exists: 'foo.parquet/_SUCCESS'
What do I need to change so that dask is able to read the data successfully?

It turns out that pandas uses pyarrow under the hood. When switching dask to the same engine:
dd.read_parquet('foo.parquet', engine='pyarrow')
it works as expected.
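For completeness, here is a minimal sketch of two workarounds; the second one (a glob over the part files, path pattern illustrative) should also work if you prefer to stay on the default engine:
import dask.dataframe as dd

# Option 1: the pyarrow engine ignores non-data files such as _SUCCESS.
ddf = dd.read_parquet('foo.parquet', engine='pyarrow')

# Option 2 (alternative): point dask only at the part files via a glob,
# so _SUCCESS is never touched.
ddf = dd.read_parquet('foo.parquet/part-*.parquet')

print(ddf.head())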

Related

Spark overwrite parquet files on AWS S3 raises URISyntaxException: Relative path in absolute URI

I am using Spark to write and read Parquet files on AWS S3. I have Parquet files which are stored in
's3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00'
partitioned by 'company_name' and 'record_day'.
I want to write a basic pipeline to update my Parquet files on a regular basis by 'record_day'. To do this, I am going to use overwrite mode:
df.write.mode('overwrite').parquet("s3a://mybucket/file_name.parquet/company_name='company_name'/record_day='2019-01-01 00:00:00'")
But I am getting an unexpected error: 'java.net.URISyntaxException: Relative path in absolute URI: key=2019-01-01 00:00:00'.
I spent several hours searching for the problem but found no solution. For some tests, I replaced the 'overwrite' parameter with 'append', and everything works fine. I also made a simple dataframe, and overwrite mode works fine on it too. I know that I can solve my problem in a different way, by deleting and then writing the particular part, but I would like to understand what the cause of the error is.
Spark 2.4.4 Hadoop 2.8.5
Appreciate any help.
I had the same error, and my solution was to remove the ':' characters from the date.
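A minimal sketch of that fix in PySpark, assuming the column and bucket names from the question: regexp_replace strips the colons from the partition value before writing with partitionBy.
from pyspark.sql import functions as F

# Replace ':' in the partition value so the generated path no longer contains
# 'record_day=2019-01-01 00:00:00' (the colons are what trips up the URI handling).
df_clean = df.withColumn('record_day', F.regexp_replace('record_day', ':', '-'))

(df_clean.write
    .mode('overwrite')
    .partitionBy('company_name', 'record_day')
    .parquet('s3a://mybucket/file_name.parquet'))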

databricks: writing spark dataframe directly to excel

Is there any method to write a Spark DataFrame directly to xls/xlsx format?
Most of the examples on the web show how to do this for pandas DataFrames,
but I would like to use a Spark DataFrame for working with my data. Any idea?
I'm assuming that because you have the "databricks" tag you want to create an .xlsx file within the Databricks file store (DBFS) and that you are running code within Databricks notebooks. I'm also going to assume that your notebooks are running Python.
There is no direct way to save an Excel document from a Spark dataframe. You can, however, convert a Spark dataframe to a pandas dataframe and then export from there. We'll need to start by installing the xlsxwriter package. You can do this for your notebook environment using a Databricks utilities command:
dbutils.library.installPyPI('xlsxwriter')
dbutils.library.restartPython()
I was having a few permission issues saving an Excel file directly to DBFS. A quick workaround was to save to the cluster's default directory and then sudo move the file into DBFS. Here's some example code:
# Creating dummy spark dataframe
spark_df = spark.sql('SELECT * FROM default.test_delta LIMIT 100')
# Converting spark dataframe to pandas dataframe
pandas_df = spark_df.toPandas()
# Exporting pandas dataframe to xlsx file
pandas_df.to_excel('excel_test.xlsx', engine='xlsxwriter')
Then, in a new command, specify that it should run in the shell with %sh:
%sh
sudo mv excel_test.xlsx /dbfs/mnt/data/
It is possible to generate an Excel file from PySpark.
df_spark.write.format("com.crealytics.spark.excel")\
.option("header", "true")\
.mode("overwrite")\
.save(path)
You need to install the com.crealytics:spark-excel_2.12:0.13.5 library (or a more recent version, of course), for example in Azure Databricks by adding it as a new Maven library in the libraries list of your cluster (one of the buttons on the left sidebar of the Databricks UI).
For more info see https://github.com/crealytics/spark-excel.
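For reference, reading the file back with the same library looks roughly like this; option names follow the spark-excel README, so treat it as a sketch rather than the definitive API:
df_back = (spark.read
    .format("com.crealytics.spark.excel")
    .option("header", "true")       # first row holds the column names
    .option("inferSchema", "true")  # let the reader guess column types
    .load(path))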
I believe you can do it like this.
sourcePropertySet.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("D:\\resultset.csv")
I'm not sure you can write directly to Excel, but Excel can definitely consume a CSV. This is almost certainly the easiest and cleanest way of doing this kind of thing. Excel files can carry all kinds of formatting (think of merged cells), which can cause errors when they are consumed by other systems.
You cannot save it directly, but you can write it to a temporary location and then move it to your directory. My code piece is:
import xlsxwriter
import pandas as pd1

# (These two lines are not strictly needed: the ExcelWriter below creates the
# workbook and sheet itself when writing to the same file name.)
workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')

# Take a small sample of the Spark dataframe and convert it to pandas.
output = dataset.limit(10)
output = output.toPandas()

# Write the sample to a sheet named 'top_rows', starting at the first row.
output.to_excel(writer, sheet_name='top_rows', startrow=0)
writer.save()
The code below does the work of moving the file:
%sh
sudo mv data_checks_output.xlsx /dbfs/mnt/fpmount/
Comment if anyone has an update or a better way to do it.
PySpark does not offer any method to save an Excel file directly, but you can save a CSV file, which can then be read in Excel.
Since pyspark.sql module version 2.3 you have write.csv:
df.write.csv('path/filename')
Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=save
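A small sketch of that approach: coalescing to one partition and writing a header row gives a single CSV file that opens cleanly in Excel (only sensible for result sets small enough to fit on one worker).
# Write a single CSV part file with a header row; Excel can open it directly.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("path/filename"))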

Is it possible to read an Excel file from Apache Zeppelin into PySpark or into a pandas DataFrame?

I have got a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (I do not care if it is a PySpark DataFrame or Pandas, but Pandas is preferred.)
I am using a Zeppelin notebook to write my code.
Is it possible to get data from this file?
I have already tried the following commands, but none of them worked:
df = pd.read_excel("/user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")
I don't think you can read files stored in HDFS directly with pandas.
You probably have to either:
load the file into Spark (the "excel" short format name comes from the spark-excel library) and then use toPandas():
df = spark.read.format("excel").load("hdfs:xxx").toPandas()
or use some alternative to enable pandas to read directly, as described here (a sketch of one such approach follows below).
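A sketch of that second option, assuming pyarrow with the Hadoop native client (libhdfs) and openpyxl are installed; the host value and path are taken from the question or are placeholders:
import pandas as pd
import pyarrow.fs as pafs

# "default" picks up fs.defaultFS from the Hadoop configuration on the node.
hdfs = pafs.HadoopFileSystem("default")

# Open the xlsx as a file-like object and hand it to pandas.
with hdfs.open_input_file("/user/username/Project/data/file.xlsx") as f:
    df = pd.read_excel(f)  # requires openpyxl for .xlsx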
It seems that export and import in the Python interpreter in Apache Zeppelin can only be done through pandas' read_csv and to_csv.

How to save files in the same directory using saveAsNewAPIHadoopFile in Spark Scala

I am using Spark Streaming and I want to save each batch locally in Avro format. I have used saveAsNewAPIHadoopFile to save the data in Avro format. This works well, but it overwrites the existing file: the next batch's data overwrites the old data. Is there any way to save the Avro files from all batches in a common directory? I tried adding some Hadoop job conf properties to add a prefix to the file name, but none of the properties worked.
dstream.foreachRDD { rdd =>
  rdd.saveAsNewAPIHadoopFile(
    path,
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
}
Try this:
You can split your process into two steps:
Step 1: Write the Avro file using saveAsNewAPIHadoopFile to <temp-path>
Step 2: Move the file from <temp-path> to <actual-target-path> (see the sketch below)
This will solve your problem for now. I will share my thoughts if I manage to handle this scenario in one step instead of two.
Hope this is helpful.
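A minimal sketch of the move step (step 2), assuming the hdfs command line client is available where the job runs; the paths and batch id are placeholders:
import subprocess

# Each batch is first written to its own temporary path ...
temp_path = "/tmp/avro-staging/batch-0001"
# ... and then moved into the shared target directory.
target_path = "/data/avro/common-dir/batch-0001"

subprocess.run(["hdfs", "dfs", "-mv", temp_path, target_path], check=True)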

Overwriting Parquet File in Bluemix Object Storage with Apache Spark Notebook

I'm running a Spark Notebook to save a DataFrame as a Parquet File in the Bluemix Object Storage.
I want to overwrite the Parquet file when rerunning the notebook, but actually it just appends the data.
Below a sample of the iPython Code:
df = sqlContext.sql("SELECT * FROM table")
df.write.parquet("swift://my-container.spark/simdata.parquet", mode="overwrite")
I'm not a Python guy, but SaveMode works for dataframes like this:
df.write.mode(SaveMode.Overwrite).parquet("swift://my-container.spark/simdata.parquet")
I think the block storage replaces only 'simdata.parquet' itself, while the 'part-0000*' files remain, because they were written as 'simdata.parquet' plus the UUID of the app id; when you read, the DataFrame picks up all files matching 'simdata.parquet*'.
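For reference, the PySpark equivalent of the Scala snippet above (the same effect as the mode= keyword the question already uses) would be:
# In Python the save mode is passed as a string rather than SaveMode.Overwrite.
df.write.mode("overwrite").parquet("swift://my-container.spark/simdata.parquet")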
