Avoiding for loops when working with folders in Python - python-3.x

The code below is an attempt at a minimal reproducible example; it relies on two folders (folder_source and folder_target) and two files (file_id1.csv, file_id2.csv). The code loads a CSV from one directory, changes its name, and saves it to another directory.
The code works fine. I would like to know if there is a way of avoiding the nested for loop.
Thank you!
import pandas as pd

list_of_file_paths = ['C:\\Users\\user\\Desktop\\folder_source\\file_id1.csv', 'C:\\Users\\user\\Desktop\\folder_source\\file_id2.csv']
list_of_variables = ['heat', 'patience', 'charmander']
target_path = 'C:\\Users\\user\\Desktop\\folder_target\\'

for filepath_load in list_of_file_paths:
    for variable in list_of_variables:
        df_loaded = pd.read_csv(filepath_load)  # grab one of the csv files in the source folder
        id_number = filepath_load.split(".")[0].split("_")[-1]  # extract the id from the csv file name
        df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)  # rename the file and save it into the target folder

You're looking for the Cartesian product of the two lists, I guess?
from itertools import product

for filepath_load, variable in product(list_of_file_paths, list_of_variables):
    df_loaded = pd.read_csv(filepath_load)
    id_number = filepath_load.split(".")[0].split("_")[-1]
    df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)
But as Roland Smith says, you have some redundancy here. I'd prefer his code, which has two loops but the minimal amount of I/O and computation.

If you really want to save each file as three identical copies with different names, there is really no alternative.
I would, however, move the read out of the inner loop, removing redundant file reads.
for filepath_load in list_of_file_paths:
    df_loaded = pd.read_csv(filepath_load)
    id_number = filepath_load.split(".")[0].split("_")[-1]
    for variable in list_of_variables:
        df_loaded.to_csv(target_path + id_number + '_' + variable + '.csv', index=False)
Additionally, consider using shutil.copy, since the source file is not modified:
import shutil

for filepath_load in list_of_file_paths:
    id_number = filepath_load.split(".")[0].split("_")[-1]
    for variable in list_of_variables:
        shutil.copy(filepath_load, target_path + id_number + '_' + variable + '.csv')
That would employ the operating system's buffer cache, at least for the second and third write.
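For completeness, here is a minimal sketch of the same idea using pathlib, which is a bit more readable than splitting path strings by hand (the folders and variable names are just the ones from the question):

import shutil
from pathlib import Path

source_folder = Path(r'C:\Users\user\Desktop\folder_source')
target_folder = Path(r'C:\Users\user\Desktop\folder_target')
list_of_variables = ['heat', 'patience', 'charmander']

for filepath_load in source_folder.glob('file_id*.csv'):
    id_number = filepath_load.stem.split('_')[-1]  # 'file_id1' -> 'id1'
    for variable in list_of_variables:
        # the source is never modified, so a plain file copy is enough
        shutil.copy(filepath_load, target_folder / '{}_{}.csv'.format(id_number, variable))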

Related

Loop through multiple folders and subfolders using Pyspark in Azure Blob container (ADLS Gen2)

I am trying to loop through multiple folders and subfolders in an Azure Blob container and read multiple XML files.
E.g.: I have files in a YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under month, date, and hour, and multiple XML files at the last level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas on implementing this?
import glob, os

for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)

import os

dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
return r
glob is a Python function and it won't recognize the blob folder path directly when the code runs in PySpark; you have to give the path starting from the root. Also, make sure to specify recursive=True.
For example, I checked the above glob and os code in Databricks and got no results, because we need to give the absolute path from the root folder.
glob code:
import glob, os

for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me, in Databricks, the root to access is /dbfs, and I used CSV files for my repro.
Using os, the blob files are listed from folders and subfolders in the same way; a minimal sketch follows.
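This is only a sketch under the same assumption that the path is given from the root of the mount (the /dbfs/mnt/... path below is a placeholder):

import os

r = []
# walk from the absolute root of the mounted container
for root, dirs, files in os.walk('/dbfs/mnt/your_container/2022/08/19/08/'):
    for name in files:
        if name.endswith('.xml'):
            r.append(os.path.join(root, name))

print(r)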
I used Databricks for my repro after mounting the container. Wherever you try this code in PySpark, make sure you give the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It assumes the directories were created by Spark; you cannot mix and match files.
I added the name of the input file to the dataframe. Last but not least, I converted the dataframe to a temporary view.
Looking at a simple aggregate query, we have 10 unique files. The biggest has a little more than 1M records.
If you need to cherry-pick files from a mixed directory, this method will not work.
However, I think that is an organizational clean-up task, rather than a reading one.
Last but not least, use the correct formatter to read XML.
spark.read.format("com.databricks.spark.xml")
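As a sketch of what that read might look like (assuming the spark-xml library is installed; the row tag and paths below are placeholders, and spark is the Databricks-provided SparkSession):

from pyspark.sql.functions import input_file_name

# recursively pick up every XML file under the top-level folder
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")                 # placeholder row tag
      .option("recursiveFileLookup", "true")
      .load("/mnt/your_container/2022/"))         # placeholder root path

# record which file each row came from, then expose the data to SQL
df = df.withColumn("source_file", input_file_name())
df.createOrReplaceTempView("xml_data")

spark.sql("SELECT source_file, COUNT(*) AS n FROM xml_data GROUP BY source_file").show()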

How to read the most recent Excel export into a Pandas dataframe without specifying the file name?

I frequent a real estate website that shows recent transactions, from which I will download data to parse within a Pandas dataframe. Everything about this dataset remains identical every time I download it (regarding the column names, that is).
The name of the Excel output may change, though. For example, if I have already downloaded a few of these into my Downloads folder, the exported file may read "Generic_File_(3)" or "Generic_File_(21)" if there are already a few older "Generic_File" exports in that folder from previous exports.
Ideally, I'd like my workflow to look like this: export the Excel file of real estate sales, then run a Python script to read the most recent export into a Pandas dataframe. The catch is, I don't want to have to go in and change the filename in the script to match the appended number of the Excel export every time. I want the pd.read_excel call to simply read the "Generic_File" appended with the largest number (which will correspond to the most recent export).
I suppose I could always just delete old exports out of my Downloads folder so the newest, freshest export is always named the same ("Generic_File", in this case), but I'm looking for a way to ensure I don't have to do this. Are wildcards the best path forward, or is there some other method to always read in the most recently downloaded Excel file from my Downloads folder?
I would use the os package and write a small helper to read the file names in the Downloads folder. By parsing the filename strings you can then find the file that follows your specified format with the highest copy number. Something like the following might help you get started.
import os

downloads = os.listdir('C:/Users/[username here]/Downloads/')
is_file = [True if '.' in item else False for item in downloads]
files = [item for keep, item in zip(is_file, downloads) if keep]
# INSERT CODE HERE TO IDENTIFY THE FILE OF INTEREST
Regex might be the best way to find matches if you have a diverse listing of files in your downloads folder.
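As a sketch of that missing piece, assuming the exports are named Generic_File.xlsx, Generic_File_(1).xlsx, and so on (the exact pattern and the .xlsx extension are assumptions), a regex over the listing can pull out the copy number:

import os
import re
import pandas as pd

downloads = 'C:/Users/[username here]/Downloads/'
pattern = re.compile(r'Generic_File(?:_\((\d+)\))?\.xlsx$')  # assumed naming scheme

best_num, best_file = -1, None
for item in os.listdir(downloads):
    match = pattern.match(item)
    if match:
        # a bare "Generic_File.xlsx" counts as copy number 0
        num = int(match.group(1)) if match.group(1) else 0
        if num > best_num:
            best_num, best_file = num, item

if best_file is not None:
    df = pd.read_excel(os.path.join(downloads, best_file))

Sorting the candidates by os.path.getmtime instead of the number in the name is another option, and it sidesteps the naming scheme entirely.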

Python 3 - Copy files if they do not exist in destination folder

I am attempting to move a couple thousand pdfs from one file location to another. The source folder contains multiple subfolders and I am combining just the pdfs (technical drawings) into one folder to simplify searching for the rest of my team.
The main goal is to only copy over files that do not already exist in the destination folder. I have tried a couple of different options, most recently what is shown below, and in all cases every file is copied every time. Prior to today, any time I attempted a bulk file move I would receive errors if the file existed in the destination folder, but I no longer do.
I have verified that some of the files exist in both locations but are still being copied. Is there something I am missing or can modify to correct?
Thanks for the assistance.
import os.path
import shutil

source_folder = os.path.abspath(r'\\source\file\location')
dest_folder = os.path.abspath(r'\\dest\folder\location')

for folder, subfolders, files in os.walk(source_folder):
    for file in files:
        path_file = os.path.join(folder, file)
        if os.path.exists(file) in os.walk(dest_folder):
            print(file + " exists.")
        if not os.path.exists(file) in os.walk(dest_folder):
            print(file + ' does not exist.')
            shutil.copy2(path_file, dest_folder)
os.path.exists returns a Boolean value. os.walk creates a generator which produces triples of the form (dirpath, dirnames, filenames). So, that first conditional will never be true.
Also, even if that conditional were correct, your second conditional has a redundancy since it's merely the negation of the first. You could replace it with else.
What you want is something like
if file in os.listdir(dest_folder):
    ...
else:
    ...
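Put together with the original walk, a sketch of the corrected loop might look like this (it checks os.path.exists on the joined destination path instead of re-listing the folder on every iteration; the UNC paths are the ones from the question):

import os
import shutil

source_folder = os.path.abspath(r'\\source\file\location')
dest_folder = os.path.abspath(r'\\dest\folder\location')

for folder, subfolders, files in os.walk(source_folder):
    for file in files:
        path_file = os.path.join(folder, file)
        dest_file = os.path.join(dest_folder, file)
        if os.path.exists(dest_file):
            print(file + " exists.")
        else:
            print(file + " does not exist.")
            shutil.copy2(path_file, dest_folder)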

Is it possible, without knowing the structure, to make an hdf5 file that is an exact copy of another one in Python 3.6?

If I have one hdf5 file f1.h5 and I want to make a copy of it as another file (e.g. f2.h5), but I don't know the structure of f1.h5 and want to copy it automatically, can I do it with h5py?
You can use the .copy() method to recursively copy objects from f1.h5 to f2.h5. You don't need to know the schema: use .keys() to access the groups/datasets at the root level. If the source is a Group object, all objects within that group are copied recursively by default.
import h5py

h5r = h5py.File("f1.h5", 'r')
with h5py.File("f2.h5", 'w') as h5w:
    for obj in h5r.keys():
        h5r.copy(obj, h5w)
h5r.close()
I don't know about h5py, but it should be possible by:
f1 = open('f1.h5', 'rb')
f2 = open('f2.h5', 'wb')
f2.write(f1.read())
You read each byte of the first file and write it to the second file. Things such as structure don't matter.
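The same byte-for-byte copy can be done with shutil, which also takes care of opening and closing the files; just a sketch of the equivalent call:

import shutil

# copies the raw bytes of f1.h5 into f2.h5, structure and all
shutil.copyfile('f1.h5', 'f2.h5')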

Avoid overwriting of files with "for" loop

I have a list of dataframes (df_cleaned) created from multiple csv files chosen by the user.
My objective is to save each dataframe within the df_cleaned list as a separate csv file locally.
I have written the following code, which saves the file under its original title. But I see that it overwrites and ends up saving a copy of only the last dataframe.
How can I fix it? With my very basic knowledge, perhaps I could use a break or continue statement in the loop? But I do not know how to implement it correctly.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')
You can create a different name for each file; as an example, in the following I append the index to name:
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\Data Docs\TrainData\{0}_{1}.csv'.format(name, i))
print('Saving of files as csv is complete.')
This will create files named <name>_N.csv with N = 0, ..., len(df_cleaned)-1.
A very easy way of solving it. I just figured out the answer myself; posting to help someone else.
fileNames is a list I created at the start of the code to save the names of the files chosen by the user.
for i in range(len(df_cleaned)):
    outputFile = df_cleaned[i].to_csv(r'C:\...\TrainData\{}.csv'.format(fileNames[i]))
print('Saving of files as csv is complete.')
Saves a separate copy for each file in the defined directory.
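A slightly more idiomatic variant of the same idea, just as a sketch, iterates over the dataframes and their names together with zip (the elided output folder is the one from the question):

for name, df in zip(fileNames, df_cleaned):
    df.to_csv(r'C:\...\TrainData\{}.csv'.format(name))
print('Saving of files as csv is complete.')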
