How to convert Delta file format to Parquet File only - apache-spark

Delta Lake is the default storage format. I understand how to convert a Parquet file to Delta.
My question is: is there any way to revert it back to Parquet? Any options?
What I need is a single Parquet file when writing. I do not need the extra log files!

If you run VACUUM on the table and then delete the _delta_log folder, you end up with regular Parquet files.
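A minimal sketch of that cleanup for the local-filesystem case (the `strip_delta_log` helper and the table layout are illustrative assumptions; on DBFS you would use `dbutils.fs.rm(..., recurse=True)` instead of `shutil`, and you should run `VACUUM table RETAIN 0 HOURS` first so stale data files are removed):

```python
import pathlib
import shutil

def strip_delta_log(table_path):
    """Remove the _delta_log folder so only plain Parquet files remain.

    Assumes VACUUM has already been run on the table, so every remaining
    part file belongs to the latest version.
    """
    table = pathlib.Path(table_path)
    log_dir = table / "_delta_log"
    if log_dir.exists():
        shutil.rmtree(log_dir)
    # Return what is left in the table directory
    return sorted(p.name for p in table.iterdir())
```

If you also want a single Parquet file, read the Delta table and rewrite it with `coalesce(1)` before (or instead of) stripping the log.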

Related

Unable to read Databricks Delta / Parquet File with Delta Format

I am trying to read a Delta/Parquet file in Databricks using the following code:
df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet')
However, I'm getting the following error:
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: curorigination.presentation.parquet
This seemed very straightforward, but I'm not sure why I'm getting the error.
Any thoughts?
The error shows that Delta Lake thinks you have an incorrectly named partition path.
If your Delta table has partition columns, for example year, month, and day, the path would look like /mnt/lake/CUR/CURATED/origination/company/opportunities_final/year=yyyy/month=mm/day=dd/curorigination.presentation.parquet, and you only need to do df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final").
If you want to read it as plain Parquet instead, you can just do df = spark.read.parquet("/mnt/lake/CUR/CURATED/origination/company/opportunities_final"), because you don't need the absolute path of an individual Parquet file.
The error mainly happens because of the incorrect path fragment curorigination.presentation.parquet. Check your Delta location and verify whether the Delta table was actually created:
%fs ls /mnt/lake/CUR/CURATED/origination/company/opportunities_final/
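The naming rule the error message enforces can be sketched as a small check (an illustrative helper, not part of Spark's API):

```python
import re

# Spark expects each partition directory to be named column=value,
# e.g. year=2022/month=01 -- the error quotes this as `part1=foo/part2=bar`
_FRAGMENT = re.compile(r"[^/=]+=[^/=]+")

def is_partition_fragment(fragment):
    """Return True if every path component looks like part1=foo/part2=bar."""
    return all(_FRAGMENT.fullmatch(part) for part in fragment.split("/"))
```

A file name such as curorigination.presentation.parquet fails this check, which is why Delta Lake rejects it as a partition path.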
I reproduced the same thing in my environment. First, I created a DataFrame from a Parquet file.
df1 = spark.read.format("parquet").load("/FileStore/tables/")
display(df1)
After that I converted the Parquet file into Delta format and saved it to this location: /mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1.
df1.coalesce(1).write.format('delta').mode("overwrite").save("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
#Reading delta file
df3 = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
display(df3)

file transfer from DBFS to Azure Blob Storage

I need to transfer the files in the below dbfs file system path:
%fs ls /FileStore/tables/26AS_report/customer_monthly_running_report/parts/
To the below Azure Blob
dbutils.fs.ls("wasbs://"+blob.storage_account_container+"@"
+ blob.storage_account_name+".blob.core.windows.net/")
What series of steps should I follow? Please suggest.
The simplest way would be to load the data into a dataframe and then to write that dataframe into the target.
df = spark.read.format(format).load("dbfs:/FileStore/tables/26AS_report/customer_monthly_running_report/parts/*")
df.write.format(format).save("wasbs://"+blob.storage_account_container+"@" + blob.storage_account_name+".blob.core.windows.net/")
You will have to replace "format" with the source file format and the format you want in the target folder.
Keep in mind that if you do not want to transform the data but just move it, it will most likely be more efficient to skip PySpark and use the azcopy command-line tool instead. You can also run that in Databricks with the %sh magic command if needed.
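For reference, the wasbs URI the snippets above build has the shape container@account; a small helper (illustrative, not a Databricks API) makes the '@' separator explicit:

```python
def wasbs_url(container, account, path=""):
    """Build a wasbs:// URI; note the '@' between container and storage account."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"
```

Building the URI in one place avoids the easy mistake of concatenating the container and account with the wrong separator.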

How to convert CSV to ORC format using Azure Datafactory

I am copying comma-separated, partitioned data files into ADLS using Azure Data Factory.
The requirement is to copy the comma separated files to ORC format with SNAPPY compression.
Is it possible to achieve this with ADF? If yes, could you please help me?
Unfortunately, Data Factory can read ZLIB- and SNAPPY-compressed ORC, but can only write ZLIB, which is the default for the ORC file format.
More info here: https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs#orc-format
Hope this helped!

How can we export a cassandra table into a csv format using its snapshots file

I have taken a snapshot of a Cassandra table. The following files were generated:
manifest.json mc-10-big-Filter.db mc-10-big-TOC.txt mc-11-big-Filter.db mc-11-big-TOC.txt mc-9-big-Filter.db mc-9-big-TOC.txt
mc-10-big-CompressionInfo.db mc-10-big-Index.db mc-11-big-CompressionInfo.db mc-11-big-Index.db mc-9-big-CompressionInfo.db mc-9-big-Index.db schema.cql
mc-10-big-Data.db mc-10-big-Statistics.db mc-11-big-Data.db mc-11-big-Statistics.db mc-9-big-Data.db mc-9-big-Statistics.db
mc-10-big-Digest.crc32 mc-10-big-Summary.db mc-11-big-Digest.crc32 mc-11-big-Summary.db mc-9-big-Digest.crc32 mc-9-big-Summary.db
Is there a way to use these files to extract the table's data into a CSV file?
Yes, you can do that with the sstable2json tool.
Run the tool against the *-Data.db files.
This outputs JSON; you then need to convert it to CSV.
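The JSON-to-CSV step can be sketched in plain Python (the flat row layout here is an assumption for illustration; sstable2json's actual output nests keys and cells, so you would adapt the extraction to your schema):

```python
import csv
import io
import json

def json_rows_to_csv(json_text, fieldnames):
    """Convert a JSON array of flat row objects into CSV text."""
    rows = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```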

Parquet file format on S3: which is the actual Parquet file?

Scala 2.12 and Spark 2.2.1 here. I used the following code to write the contents of a DataFrame to S3:
myDF.write.mode(SaveMode.Overwrite)
.parquet("s3n://com.example.mybucket/mydata.parquet")
When I go to com.example.mybucket on S3 I actually see a directory called "mydata.parquet", as well as file called "mydata.parquet_$folder$"!!! If I go into the mydata.parquet directory I see two files under it:
_SUCCESS; and
part-<big-UUID>.snappy.parquet
Whereas I was just expecting to see a single file called mydata.parquet living in the root of the bucket.
Is something wrong here (if so, what?!?) or is this expected with the Parquet file format? If its expected, which is the actual Parquet file that I should read from:
mydata.parquet directory?; or
mydata.parquet_$folder$ file?; or
mydata.parquet/part-<big-UUID>.snappy.parquet?
Thanks!
The mydata.parquet/part-<big-UUID>.snappy.parquet is the actual parquet data file. However, often tools like Spark break data sets into multiple part files, and expect to be pointed to a directory that contains multiple files. The _SUCCESS file is a simple flag indicating that the write operation has completed.
According to the API, the Parquet output is saved inside the folder you provide. _SUCCESS is an indication that the process completed successfully.
S3 creates those $folder$ markers if you commit directly to S3: the writer first writes to temporary folders and then copies the output to the final destination inside S3, because S3 has no concept of rename.
Look at s3-distcp and also the DirectCommitter if performance is a concern.
The $folder$ marker is used by s3n / Amazon's EMRFS to indicate an empty directory; ignore it.
The _SUCCESS file is, as the others note, a zero-byte file; ignore it too.
All other .parquet files in the directory are the output; how many you end up with depends on the number of tasks executed on the input.
When Spark uses a directory (tree) as a data source, all files beginning with _ or . are ignored; s3n strips out those $folder$ entries too. So if you use the directory path for a new query, it will only pick up the Parquet files.
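The filtering rule described above can be sketched as (an illustrative helper, not Spark's internal implementation):

```python
def spark_visible_files(names):
    """Mimic Spark's rule of skipping files whose names start with '_' or '.'."""
    return [n for n in names if not n.startswith(("_", "."))]
```

Applied to the listing from the question, only the part file survives, which is why pointing a query at the directory "just works".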
