So, when writing parquet files to S3, I'm able to change the directory name using the following code:
spark_NCDS_df.coalesce(1).write.parquet(s3locationC1+"parquet")
Now, when I output this, the contents within that directory are as follows:
I'd like to make two changes:
Can I update the file name for the part-0000....snappy.parquet file?
Can I output this file without the _SUCCESS, _committed and _started files?
The documentation I've found online hasn't been very helpful.
out_file_name = "snappy.parquet"
path = "mnt/s3locationC1/"
tmp_path = "mnt/s3locationC1/tmp_data"
df = spark_NCDS_df

def copy_file(path, tmp_path, df, out_file_name):
    # Write a single part file into a temporary directory
    df.coalesce(1).write.parquet(tmp_path)
    # The part-0000... file is the last entry returned by dbutils.fs.ls
    file = dbutils.fs.ls(tmp_path)[-1][0]
    # Copy it to the destination under the desired name, then drop the temp directory
    dbutils.fs.cp(file, path + out_file_name)
    dbutils.fs.rm(tmp_path, True)

copy_file(path, tmp_path, df, out_file_name)
This function copies your required output file to the destination and then deletes the temporary directory; the _SUCCESS, _committed and _started files are removed along with it.
If you need anything more, please let me know.
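As an aside, if you only want to suppress the _SUCCESS marker itself (the _committed and _started files come from the Databricks commit protocol and are not affected), a commonly suggested knob is the Hadoop committer setting below; treat it as an assumption to verify on your runtime rather than part of the copy approach above:

# Assumption: disabling the committer's success marker stops Spark from writing _SUCCESS.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "mapreduce.fileoutputcommitter.marksuccessfuljobs", "false"
)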
Related
Currently I load multiple parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder there is one folder per date, and one parquet file inside each.)
How can I add the creation date of each parquet file to my DataFrame?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime, timedelta
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23/"
fileFull = Path +'/'+'XXXXXX.parquet'
statinfo = os.stat(fileFull)
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column in the DataFrame.
The information returned by os.stat might not be accurate unless your requirement (i.e., adding the additional column with the creation time) is the first operation performed on these files.
Each time a file is modified, both st_mtime and st_ctime are updated to the modification time; when I modify the file, the change shows up in the information returned by os.stat.
So, if adding this column is the first operation that is going to be performed on these files, then you can use the following code to add this date as a column to your files.
import os
from datetime import datetime

import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
fileinfo = os.listdir(path)
for file in fileinfo:
    pdf = pd.read_csv(f"{path}/{file}")
    pdf.display()
    # Read this file's ctime before rewriting it, then store it as a new column
    statinfo = os.stat(f"{path}/{file}")
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    pdf['creation_date'] = [create_date.date()] * len(pdf)
    pdf.to_csv(f"{path}/{file}", index=False)
These files would have this new column as shown below after running the code:
It might be better to take the date directly from the folder name in this case, since that information is already available; all that needs to be done is extract it and add the column to the files in a similar manner to the code above, as sketched below.
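A minimal sketch of that folder-name approach, assuming the date folders follow the yyyy-MM-dd pattern from the question; the source_file and creation_date column names are illustrative:

from pyspark.sql.functions import input_file_name, regexp_extract, to_date

# Read every date folder under Voucher, keep each row's source path, and parse the
# yyyy-MM-dd folder name out of that path into a date column.
df = (spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
      .withColumn("source_file", input_file_name())
      .withColumn("creation_date",
                  to_date(regexp_extract("source_file", r"/Voucher/(\d{4}-\d{2}-\d{2})/", 1))))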
See if the steps below help.
Refer to the link to get the list of files in DBFS - SO - Loop through Files in DBFS
Once you have the files, loop through them and for each file use the code you have written in your question.
Please note that dbutils exposes the mtime of a file. The os module provides a way to get the ctime, i.e. the time of the most recent metadata change on Unix; ideally it would have been st_birthtime, but that did not seem to work in my trials. Hope it works for you; a sketch of reading the mtime from dbutils follows below.
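A minimal sketch of reading that modification time via dbutils, assuming a Databricks runtime where the FileInfo objects returned by dbutils.fs.ls expose modificationTime in epoch milliseconds (verify on your runtime):

from datetime import datetime

# List one date folder and print each file's path together with its modification time.
# Assumption: FileInfo.modificationTime is available and is expressed in milliseconds.
for f in dbutils.fs.ls("/mnt/dev/bronze/Voucher/2022-09-23/"):
    mtime = datetime.fromtimestamp(f.modificationTime / 1000)
    print(f.path, mtime)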
I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g., I have files in YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under month, date and hour, and multiple XML files at the lowest level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os

for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os

dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
print(r)
glob is a Python function and it won't recognize the blob folder path directly when the code runs in PySpark; we have to give the path from the root for this. Also, make sure to specify recursive=True.
For example, I checked the above code in Databricks, and the os version as well.
Both gave no result, because we need to give the absolute path, i.e. the path starting from the root folder.
glob code:
import glob, os

for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me, in Databricks, the root to access is /dbfs, and I used csv files for my repro.
Using os:
Using the equivalent os-based walk (sketched below), my blob files are listed from the folders and subfolders as well.
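A minimal sketch of that os-based approach, assuming the container is mounted and visible under /dbfs; the mount name and year folder below are placeholders:

import os

# Walk the mounted container from its /dbfs path and collect all XML files.
# "/dbfs/mnt/<mount_name>/2022" is a placeholder for your actual mount point.
xml_files = []
for root, dirs, files in os.walk("/dbfs/mnt/<mount_name>/2022"):
    for name in files:
        if name.endswith(".xml"):
            xml_files.append(os.path.join(root, name))

print(xml_files)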
I used Databricks for my repro after mounting the container. Wherever you try this code in PySpark, make sure you give the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It will assume that the directories are created by spark. You can not mix and match files.
I added the name of the input file to the dataframe as a column. Last but not least, I converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files. The biggest has a little more than 1M records.
If you need to cherry pick files for a mixed directory, this method will not work.
However, I think that is an organizational clean-up task rather than a reading one.
Last but not least, use the correct formatter to read XML.
spark.read.format("com.databricks.spark.xml")
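A minimal sketch of the pattern described above, using the plain text reader; the mount path is a placeholder, and for XML you would swap the format to com.databricks.spark.xml (with an appropriate rowTag), assuming that library is installed on the cluster:

from pyspark.sql.functions import input_file_name

# Read every file under the YYYY/MM/DD/HH tree as text, keep each row's source path,
# and expose the result as a temporary view for SQL queries.
df = (spark.read.format("text")
      .option("recursiveFileLookup", "true")
      .load("/mnt/<mount_name>/2022/")
      .withColumn("source_file", input_file_name()))

df.createOrReplaceTempView("raw_files")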
I am exporting a pandas data frame as an Excel file to FTP using the code below. The code creates a file on FTP. The issue is that if I make any change in the code and expect a different output file, it creates the same output file as before. However, if I change the file name in myFTP.storbinary('STOR %s.xlsx' % filename, bio), it works fine. Moreover, if I write the output locally keeping the same name, it also works fine. I don't want to change the file name every time I make a change in my code; it is not creating a different file with the same name. Below is the code:
myFTP = ftplib.FTP("ftp address","username","password")
myFTP.cwd("change directory/")
buffer=io.BytesIO()
df.to_excel(buffer,index=False)
text = buffer.getvalue()
bio = io.BytesIO(text)
file name = 'FileName_{0}{1}'.format(current_year,current_month)
myFTP.storbinary('STOR %s.xlsx'%file_name,bio)
myFTP.close()
Name of the output file must be: FileName_currentyearcurrentmonth
file name = 'FileName_{0}{1}'.format(current_year,current_month)
If this line of code is exactly as it appears in your code, it seems you have a syntax error (a variable name cannot contain a space). Also, in cases like this a context manager is actually pretty useful: if you get an error, you don't keep your connection open. Why don't you try something like this?
import io
import ftplib

with ftplib.FTP("ftp address", "username", "password") as myFTP:
    myFTP.cwd("change directory/")
    buffer = io.BytesIO()
    df.to_excel(buffer, index=False)
    text = buffer.getvalue()
    bio = io.BytesIO(text)
    file_name = 'FileName_{0}{1}'.format(current_year, current_month)
    myFTP.storbinary('STOR %s.xlsx' % file_name, bio)
I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe within the df_cleaned list as a separate csv file locally.
I have the following code which saves each file with its original title. But I see that it overwrites and only manages to save a copy of the last dataframe.
How can I fix it? With my very basic knowledge, perhaps I could use a break or continue statement in the loop? But I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
You can create a different name for each file; as an example, in the following I append the index to the name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name, i))
print('Saving of files as csv is complete.')
This will create a list of files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
A very easy way of solving this; I just figured out the answer myself. Posting to help someone else.
fileNames is a list I created at the start of the code to save the
names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
This saves a separate copy of each dataframe in the defined directory.
Using Spark, I am trying to push some data (in CSV or Parquet format) to an S3 bucket.
df.write.mode("OVERWRITE").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
In above code piece, destination_path variable holds the S3 bucket location where data needs to be exported.
Eg. destination_path = "s3://some-test-bucket/manish/"
In the manish folder of some-test-bucket I have several files and sub-folders. The above command deletes all of them and Spark writes new output files. But I want to overwrite just one file with this new file.
Even if I could overwrite only the contents of this folder while the sub-folders remain intact, that would solve the problem to a certain extent.
How can this be achieved?
I tried using append mode instead of overwrite.
In this case the sub-folder names remain intact, but again all the contents of the manish folder and its sub-folders are overwritten.
Short answer: Set the Spark configuration parameter spark.sql.sources.partitionOverwriteMode to dynamic instead of static. This will only overwrite the necessary partitions and not all of them.
PySpark example:
from pyspark import SparkConf, SparkContext
from pyspark import sql

conf = SparkConf().setAppName("test").set("spark.sql.sources.partitionOverwriteMode", "dynamic").setMaster("yarn")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
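As a sketch of the write side under dynamic partition overwrite (here "date" is a hypothetical partition column, not something from the question), only the partitions present in the dataframe are replaced:

# Only the date partitions that appear in df are rewritten under destination_path;
# everything else under that prefix, including other sub-folders, is left untouched.
(df.write
   .mode("overwrite")
   .partitionBy("date")
   .parquet(destination_path))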
The files can be deleted first, and then append mode can be used to insert the data instead of overwriting, in order to retain the sub-folders. Below is an example from PySpark.
import subprocess

subprocess.call(["hadoop", "fs", "-rm", "{}*.csv.deflate".format(destination_path)])

df.write.mode("append").format("com.databricks.spark.csv").options(
    nullValue=options['nullValue'], header=options['header'],
    delimiter=options['delimiter'], quote=options['quote'],
    escape=options['escape']).save(destination_path)