Unable to Save Apache Spark parquet file to csv with Databricks - apache-spark

I'm trying to save/convert a parquet file to csv on Apache Spark with Databricks, but I'm not having much luck.
The following code successfully writes to a folder called tempDelta:
df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta")
I then would like to convert the parquet file to csv as follows:
df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta").csv(saveloc+"/tempDelta")
AttributeError Traceback (most recent call last)
<command-2887017733757862> in <module>
----> 1 df.coalesce(1).write.format("parquet").mode("overwrite").option("header","true").save(saveloc+"/tempDelta").csv(saveloc+"/tempDelta")
AttributeError: 'NoneType' object has no attribute 'csv'
I have also tried the following after writing to the location:
df.write.option("header","true").csv(saveloc+"/tempDelta2")
But I get the error:
A transaction log for Databricks Delta was found at `/CURATED/F1Area/F1Domain/final/_delta_log`,
but you are trying to write to `/CURATED/F1Area/F1Domain/final/tempDelta2` using format("csv"). You must use
'format("delta")' when reading and writing to a delta table.
And when I try to save as a csv to folder that isn't a delta folder I get the following error:
df.write.option("header","true").csv("testfolder")
AnalysisException: CSV data source does not support struct data type.
Can someone let me know the best way of saving / converting from parquet to csv with Databricks?

You can use either of the two options below:
1. df.write.option("header", "true").csv(path)
2. df.write.format("csv").save(path)
Note: you can't specify format("parquet") and then call the .csv function in the same chain; .save() returns None, which is why the chained .csv() call above fails with AttributeError: 'NoneType' object has no attribute 'csv'.
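As an illustration, here is a minimal sketch of the two options applied to the question's DataFrame. It assumes df is still a Spark DataFrame and writes to a hypothetical folder name (tempCsv) that is not inside an existing Delta table, since the Delta format check quoted above rejects a format("csv") write under a folder that already has a _delta_log:
# Option 1: the DataFrameWriter.csv shortcut (tempCsv is a hypothetical output folder)
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(saveloc + "/tempCsv")
# Option 2: the generic format("csv") followed by save()
df.coalesce(1).write.format("csv").mode("overwrite").option("header", "true").save(saveloc + "/tempCsv")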

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks:
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
Such an error usually occurs when the folder already contains data in another format, for example if you previously wrote Parquet or CSV files into it. Remove the folder completely and try again.
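For illustration only, a sketch of that cleanup on Databricks using a recursive dbutils.fs.rm (double-check the path first, since this deletes the folder and everything under it), followed by the original Delta write:
# Recursively remove the existing folder, then rewrite it as Delta
dbutils.fs.rm('/mnt/lake/BASE/flights/Full/', True)
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')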
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
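If you prefer to stay in Python, the same conversion can be issued through spark.sql (a sketch of the equivalent call, not a different command):
# Python equivalent of the %sql cell above
spark.sql("CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`")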

Error:'str' object has no attribute 'write' when converting Parquet to CSV

I have the following parquet files listed in my lake and I would like to convert the parquet files to CSV.
I have attempted to carry out the conversion using the suggestions on SO, but I keep getting the AttributeError:
AttributeError: 'str' object has no attribute 'write'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<command-507817377983169> in <module>
----> 1 df.write.format("csv").save("/mnt/lake/RAW/export/")
AttributeError: 'str' object has no attribute 'write'
I have created a dataframe for the location where the parquet files reside, as 'df', which gives the following output:
Out[71]: '/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal'
When I attempt to write / convert the parquets to CSV using either of the following I get the above error:
df.write.format("csv").save("/mnt/lake/RAW/export/")
df.write.csv(path)
I'm entering the following to read: df = spark.read.parquet("/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/"), but I'm getting the following error message:
A transaction log for Databricks Delta was found at /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/_delta_log, but you are trying to read from /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/ using format("parquet"). You must use 'format("delta")' when reading and writing to a delta table. To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
The file you have stored is in Delta format, so read it with the following command:
df = spark.read.format("delta").load(path_to_data)
Once loaded, display it first using display(df) to make sure it loaded properly.
If the output is as expected, then you can write it as CSV to your desired location.
The type of the df variable is a string, and its value is /mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal.
You need to read the file first and make sure the df variable is a PySpark DataFrame before calling df.write.
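Putting both answers together, a minimal sketch of the full flow, reusing the paths from the question and assuming a header row and overwrite semantics are wanted for the CSV output:
# Read the Delta table into an actual PySpark DataFrame first
df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/F1Area/F1Domain/myfinal/")
# Sanity-check the load before converting
display(df)
# Now df.write exists, so the CSV export from the question works
df.write.format("csv").mode("overwrite").option("header", "true").save("/mnt/lake/RAW/export/")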

Pyspark: Delta table as stream source, How to do it?

I am facing an issue with readStream on a Delta table.
What is expected, per the following link:
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-stream-source
Ex:
spark.readStream.format("delta").table("events") -- As expected, should work fine
The issue: I have tried the same in the following way:
df.write.format("delta").saveAsTable("deltatable") -- Saved the Dataframe as a delta table
spark.readStream.format("delta").table("deltatable") -- Called readStream
error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'DataStreamReader' object has no attribute 'table'
Note:
I am running it on localhost, using the PyCharm IDE, with the latest version of PySpark installed: Spark version 2.4.5, Scala version 2.11.12.
The DataStreamReader.table and DataStreamWriter.table methods are not in Apache Spark yet. Currently you need to use a Databricks notebook in order to call them.
Try the Delta Lake 0.7.0 release, which provides support for registering your tables with the Hive metastore. As mentioned in a comment, most of the Delta Lake examples used a folder path because metastore support wasn't integrated before this release.
Also note that for the open-source version of Delta Lake, it's best to follow the docs at https://docs.delta.io/latest/index.html
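Until the .table() methods are available to you, a common workaround is to stream from a folder path instead of a metastore table name. A sketch under the assumption that the Delta table's files live at a known path (here a hypothetical /tmp/deltatable):
# Write the DataFrame to a folder path rather than only registering it as a table
df.write.format("delta").mode("overwrite").save("/tmp/deltatable")
# Stream from the folder path; DataStreamReader.load() is available in open-source Spark/Delta
stream_df = spark.readStream.format("delta").load("/tmp/deltatable")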

dask read parquet file from spark

For a parquet file written from Spark (without any partitioning), its directory looks like:
%ls foo.parquet
part-00017-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00018-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
part-00019-c17ab661-2564-428e-8233-e7a9951fb012-c000.gz.parquet
_SUCCESS
When trying to read via pandas:
pd.read_parquet('foo.parquet')
everything works fine as expected.
However, when using dask it fails:
dd.read_parquet('foo.parquet')
[Errno 17] File exists: 'foo.parquet/_SUCCESS'
What do I need to change so that dask is able to read the data successfully?
It turns out that pandas is using pyarrow. When switching to this backend for dask:
dd.read_parquet('foo.parquet', engine='pyarrow')
it works as expected.
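For completeness, a self-contained sketch of that workaround (it assumes pyarrow is installed in the same environment as dask):
import dask.dataframe as dd

# Use the pyarrow engine instead of the default, which failed on the _SUCCESS marker file
ddf = dd.read_parquet('foo.parquet', engine='pyarrow')
print(ddf.head())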

How to Ignore Empty Parquet Files in Databricks When Creating Dataframe

I am having issues loading multiple files into a dataframe in Databricks. When I load a parquet file from an individual folder, it is fine, but the following error is returned when I try to load multiple files into the dataframe:
DF = spark.read.parquet('S3 path/')
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually."
Per other Stack Overflow answers, I added spark.sql.files.ignoreCorruptFiles true to the cluster configuration, but it didn't seem to resolve the issue. Any other ideas?
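For reference, a sketch of the same setting applied at the session level from a notebook, reusing the question's placeholder path (this mirrors what was already tried in the cluster configuration, so it may not be the fix either):
# Session-level equivalent of the cluster config that was tried
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
DF = spark.read.parquet('S3 path/')  # placeholder path from the question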
