Read and Write xlsx file, from pandas dataframe to specific directory - python-3.x

I have a function in Python that merges three txt files into a single xlsx file, using the pandas package.
I run the function from a certain directory, passing it a specific path as input. The function lists the files in that directory and filters out the ones it needs; since I only want to read the txt files, I filter for those. However, when I try to convert these txt files into a pandas DataFrame, the DataFrame is None.
I also want to write the final xlsx to the directory where the initial files are.
Here is my function:
def concat_files(path):
    summary = ''
    files_separate = []
    arr2 = os.listdir(path)
    # keep only the files whose names are long enough to be the report txt files
    for i, items_list in enumerate(arr2):
        if len(items_list) > 50:
            files_separate.append(items_list)
    # group the files in threes, one group per output workbook
    chunks = [files_separate[x:x + 3] for x in range(0, len(files_separate), 3)]
    while chunks:
        focus = chunks.pop(0)
        for items_1 in focus:
            if items_1.endswith('.Cox1.fastq.fasta.usearch_cluster_fast.fasta.reps.fasta.blastn.report.txt.all_together.txt'):
                pandas_dataframe = pd.DataFrame(example)
                pandas_dataframe.to_excel('destiny_path/' + str(header_file) + '.final.xlsx')

You need to create the destination folder before exporting the xlsx files.
So, assuming the folder already exists, change this line
pandas_dataframe.to_excel('destiny_path/' + str(header_file)+'.final.xlsx')
to
pandas_dataframe.to_excel(os.path.join('destiny_path', str(header_file) + '.final.xlsx'))
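Putting it together, here is a minimal sketch of the whole flow, assuming the txt files are tab-separated and that header_file is the part of the file name before the first dot (both assumptions, not from the original post):

import os
import pandas as pd

def concat_files(path, destiny_path='destiny_path'):
    # create the destination folder if it does not exist yet
    os.makedirs(destiny_path, exist_ok=True)
    for txt_file in [f for f in os.listdir(path) if f.endswith('.txt')]:
        # read_csv returns a DataFrame; note that to_excel returns None,
        # so never reassign or chain on its result
        df = pd.read_csv(os.path.join(path, txt_file), sep='\t')
        header_file = txt_file.split('.')[0]  # assumed naming convention
        df.to_excel(os.path.join(destiny_path, str(header_file) + '.final.xlsx'))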

Related

Loop through multiple folders and subfolders using Pyspark in Azure Blob container (ADLS Gen2)

I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g. I have files in the format YYYY/MM/DD/HH/123.xml, with multiple subfolders under month, date and hour, and multiple XML files at the leaf level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches that did not give me the intended result. Can you help me with any ideas for implementing this?
import glob, os
for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os
dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code does not move past this loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
glob is a Python function and it won't recognize the blob folder paths directly when the code runs in PySpark; you have to give the path from the root. Also, make sure to specify recursive=True.
I checked both the glob and os code above in Databricks, and both returned no results, because the path needs to be given from the absolute root, i.e. the root folder.
glob code:
import glob, os
for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me in Databricks the root to access is /dbfs (I used csv files in my repro). The same applies when using os.walk: with the root given, the blob files are listed from all folders and subfolders.
I used Databricks for my repro after mounting the container. Wherever you run this code in PySpark, make sure you give the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The catch is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder with a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It assumes the directories were created by Spark; you cannot mix and match file formats.
I added the name of the input file as a column on the DataFrame, and finally converted the DataFrame to a temporary view.
A simple aggregate query over that view showed 10 unique files, the biggest with a little more than 1M records.
If you need to cherry-pick files from a mixed directory, this method will not work; however, I think that is an organizational cleanup task rather than a reading one.
Last but not least, use the correct formatter to read XML:
spark.read.format("com.databricks.spark.xml")
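A minimal sketch of that approach; the mount path, rowTag value, and column names below are assumptions, not from the original answer:

from pyspark.sql.functions import input_file_name

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")              # assumed row tag of the XML files
      .option("recursiveFileLookup", "true")   # walk every subfolder under the path
      .load("/mnt/container/2022/")            # assumed mount point
      .withColumn("source_file", input_file_name()))

# expose the data to SQL and count records per input file
df.createOrReplaceTempView("xml_files")
spark.sql("SELECT source_file, COUNT(*) AS n FROM xml_files GROUP BY source_file").show()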

copying files from one folder to another folder based on the file names in python 3

In Python 3.7, I want to write a script that
creates folders based on a list
iterates through a list (elements represent different "runs")
searches for .txt files in predefined directories derived from certain operations
copies certain .txt files to the previously created folders
I managed to do that via following script:
from shutil import copy
import os
import glob

# define folders and batches
folders = ['folder_ce', 'folder_se']
runs = ['A001', 'A002', 'A003']

# make folders
for f in folders:
    os.mkdir(f)

# iterate through batches,
# extract files for every operation,
# and copy them to target folder
for r in runs:
    # operation 1
    ce = glob.glob(f'{r}/{r}/several/more/folders/{r}*.txt')
    for c in ce:
        copy(c, 'folder_ce')
    # operation 2
    se = glob.glob(f'{r}/{r}/several/other/folders/{r}*.txt')
    for s in se:
        copy(s, 'folder_se')
In the predefined directories there are several .txt files:
one file with the format A001.txt (where the "A001" part is derived from the list "runs" specified above)
plus sometimes several files with the format A001.20200624.1354.56.txt
If a file with the format A001.txt is there, I only want to copy that one to the target directory.
If the format A001.txt is not available, I want to copy all files with the longer format (e.g. A001.20200624.1354.56.txt).
After the comment of @adamkwm, I tried
if f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt' in cepa:
    copy(f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt', 'c_py_1')
else:
    for c in cepa:
        copy(c, 'c_py_1')
but that still copies both files (A001.txt and A001.20200624.1354.56.txt), which I understand. I think the trick is to first check in ce (which is a list) whether a file of the {r}.txt format is present, and if it is, copy only that one; if not, copy all files. However, it seems I don't get the logic right, or I am using the wrong modules or methods.
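For illustration, that intended check could look like this inside the existing loop (a sketch reusing the ce list and the placeholder paths from the script above):

for r in runs:
    ce = glob.glob(f'{r}/{r}/several/more/folders/{r}*.txt')
    # prefer the short-form file if it is present in the matched list
    short_form = [c for c in ce if os.path.basename(c) == f'{r}.txt']
    if short_form:
        copy(short_form[0], 'folder_ce')   # copy only e.g. A001.txt
    else:
        for c in ce:                       # otherwise copy all long-form files
            copy(c, 'folder_ce')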
After searching for answers, I didn't find one resolving this specific case.
Can you help me with a solution for this "selective copying" of the files?
Thanks!

Is there a way to give multiple file locations for renaming files?

I want a program that renames subtitle files to the same name as the movie file, where the files are located in different folders and subfolders.
What I have done:
Imported the os, re, and shutil modules.
Made a for loop to iterate through a directory and return the files/folders inside a parent folder.
for foldername, subfoldername, filename in os.walk('E:\Movies'):
This loop iterates through the E:\Movies folder and yields the lists of subfolders and files.
To check whether a file is a subtitle file, inside the for loop:
if filename.endswith('(.srt|.idx|.sub)'):
How do I give multiple paths and new names in the single second argument?
os.rename(filename,'')
Why do you want to give multiple paths and new names in the second argument?
The code you shared does the following:
Loop through the directory tree.
For each filename in the directory:
    If the file is a subtitle file:
        Rename the file to the movie file present in some directory
In the last step you are not renaming all the files at once; you are doing them one at a time.
os.rename(src, dest)
accepts two arguments only: the src filename and the dest filename.
So for your case you will have to loop again through all the files in the directory, match the name of the subtitle file with the movie file, and then rename the subtitle file.
Try something like:
for foldername, subfolders, filenames in os.walk(r'E:\Movies'):
    for filename in filenames:
        if filename.endswith(('.srt', '.idx', '.sub')):
            for folder2, subfolders2, movienames in os.walk(r'E:\Movies'):
                for moviename in movienames:
                    # We don't want to match the file with itself
                    if moviename != filename:
                        # You would have to think of your matching logic here:
                        # how would you know that this subtitle belongs to that particular movie?
                        # e.g. if the subtitle is 'a_good_movie.srt' you can split on '_'
                        # and check whether all words are present in the movie name
                        pass
Edit
After the clarifications in the comments, it seems you want to implement the following:
Loop through all folders in the directory:
    For each folder, rename all subtitle files to the folder name
You can do this in Python 3 like this:
for folder in next(os.walk(directory))[1]:
    for filename in next(os.walk(os.path.join(directory, folder)))[2]:
        if filename.endswith(('.srt', '.idx', '.sub')):
            ext = os.path.splitext(filename)[1]  # keep the subtitle extension
            os.rename(os.path.join(directory, folder, filename),
                      os.path.join(directory, folder, folder + ext))
os.walk() returns a generator. You can access its first value in Python 3 like this:
next(os.walk(r'C:\startdir'))[0]  # returns 'C:\startdir'
next(os.walk(r'C:\startdir'))[1]  # returns the list of directories in 'C:\startdir'
next(os.walk(r'C:\startdir'))[2]  # returns the list of files in 'C:\startdir'
In Python 2 you can call os.walk(r'C:\startdir').next() and index it the same way.

Extracting a particular file from a zipfile using Python

I have 3 million html files in a zipfile. I would like to extract ~4000 html files from the entire list of files. Is there a way in Python to extract a specific file without unzipping the entire zipfile?
Any leads would be appreciated! Thanks in advance.
Edit: My bad, I should have elaborated on the question. I have a list of all the html filenames that need to be extracted, but they are spread out over 12 zipfiles. How do I iterate through each zipfile, extract the matching html files, and get the final list of extracted html files?
Since you have the list of file names to be extracted, you can iterate over each zipfile's entries and extract only the matches:
from zipfile import ZipFile

listOfZipFiles = ['sample1.zip', 'sample2.zip', 'sample3.zip', ... , 'sample12.zip']
fileNamesToBeExtracted = ['file1.html', 'file2.html', ... 'filen.html']

# Create a ZipFile object for each zip in the list
for zipFileName in listOfZipFiles:
    with ZipFile(zipFileName, 'r') as zipObj:
        # Get a list of all archived file names from the zip
        listOfFileNames = zipObj.namelist()
        # Iterate over the file names
        for fileName in listOfFileNames:
            # Check if this entry is one of the files to be extracted
            if fileName in fileNamesToBeExtracted:
                # Extract a single file from the zip
                zipObj.extract(fileName)
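One note on the design: with ~4000 names to match, converting fileNamesToBeExtracted to a set (fileNamesToBeExtracted = set([...])) makes the membership check in the inner loop constant-time instead of a linear scan over the list.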

Avoid overwriting of files with "for" loop

I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe in the df_cleaned list as a separate csv file locally.
The following code saves each file with its original title, but it overwrites on every iteration and ends up keeping only the last dataframe.
How can I fix it? With my very basic knowledge, perhaps I could use a break or continue statement in the loop? But I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
You can create a different name for each file; as an example, the following appends the index to name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name, i))
print('Saving of files as csv is complete.')
This will create files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
A very easy way of solving it; I just figured out the answer myself and am posting to help someone else.
fileNames is a list I created at the start of the code to save the names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
This saves a separate copy for each dataframe in the defined directory.
