Python 3 - Copy files if they do not exist in destination folder

I am attempting to move a couple thousand pdfs from one file location to another. The source folder contains multiple subfolders and I am combining just the pdfs (technical drawings) into one folder to simplify searching for the rest of my team.
The main goal is to only copy over files that do not already exist in the destination folder. I have tried a couple different options, most recently what is shown below, and in all cases, every file is copied every time. Prior to today, any time I attempted a bulk file move, I would receive errors if the file already existed in the destination folder, but I no longer do.
I have verified that some of the files exist in both locations but are still being copied. Is there something I am missing, or something I can modify to correct this?
Thanks for the assistance.
import os.path
import shutil

source_folder = os.path.abspath(r'\\source\file\location')
dest_folder = os.path.abspath(r'\\dest\folder\location')

for folder, subfolders, files in os.walk(source_folder):
    for file in files:
        path_file = os.path.join(folder, file)
        if os.path.exists(file) in os.walk(dest_folder):
            print(file + " exists.")
        if not os.path.exists(file) in os.walk(dest_folder):
            print(file + ' does not exist.')
            shutil.copy2(path_file, dest_folder)

os.path.exists returns a Boolean value. os.walk creates a generator which produces triples of the form (dirpath, dirnames, filenames). A Boolean will never equal one of those triples, so the first conditional is always False and the second always True, which is why every file gets copied.
Also, even if that conditional were correct, your second conditional has a redundancy since it's merely the negation of the first. You could replace it with else.
What you want is something like
if file in os.listdir(dest_folder):
    ...
else:
    ...
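Putting that together with the original walk, as a minimal sketch: building the destination listing once, as a set, avoids re-reading the directory for every file.
import os
import shutil

source_folder = os.path.abspath(r'\\source\file\location')
dest_folder = os.path.abspath(r'\\dest\folder\location')

# Snapshot the destination once; a set makes the membership test fast
existing = set(os.listdir(dest_folder))

for folder, subfolders, files in os.walk(source_folder):
    for file in files:
        if file in existing:
            print(file + " exists.")
        else:
            print(file + ' does not exist.')
            shutil.copy2(os.path.join(folder, file), dest_folder)
            existing.add(file)  # avoid copying a same-named file twice in one run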

Related

Loop through multiple folders and subfolders using Pyspark in Azure Blob container (ADLS Gen2)

I am trying to loop through multiple folders and subfolders in Azure Blob container and read multiple xml files.
E.g., I have files in YYYY/MM/DD/HH/123.xml format.
Similarly, I have multiple subfolders under month, date and hour, and multiple XML files at the lowest level.
My intention is to loop through all these folders and read the XML files. I have tried a few Pythonic approaches which did not give me the intended result. Can you please help me with any ideas for implementing this?
import glob, os

for filename in glob.iglob('2022/08/18/08/225.xml'):
    if os.path.isfile(filename):  # code does not enter the for loop
        print(filename)
import os

dir = '2022/08/19/08/'
r = []
for root, dirs, files in os.walk(dir):  # code not moving past this for loop, no exception
    for name in files:
        filepath = root + os.sep + name
        if filepath.endswith(".xml"):
            r.append(os.path.join(root, name))
return r
glob is a Python function, and it won't recognize the blob folder path directly when the code is run in PySpark; we have to give the path from the root for this. Also, make sure to specify recursive=True.
For example, I checked both the glob code and the os code above in Databricks, and both returned no result, because we need to give the absolute root, meaning the path from the root folder.
glob code:
import glob, os

for file in glob.iglob('/path_from_root_to_folder/**/*.xml', recursive=True):
    print(file)
For me, in Databricks, the root to access is /dbfs (I used CSV files in my repro).
Using os, the same applies: with a path anchored at the root, os.walk lists the blob files from the folders and subfolders, as in the sketch below.
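A minimal sketch of the os version under the same assumption; the /dbfs mount path and folder names here are placeholders:
import os

# Hypothetical path from the root; replace with your container's mount point
dir = '/dbfs/mnt/container/2022/08/19/08/'

r = []
for root, dirs, files in os.walk(dir):
    for name in files:
        if name.endswith(".xml"):
            r.append(os.path.join(root, name))
print(r)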
I used Databricks for my repro after mounting. Wherever you are trying this code in PySpark, make sure you are giving the root of the folder in the path, and when using glob, set recursive=True as well.
There is an easier way to solve this problem with PySpark!
The tough part is that all the files have to have the same format. In the Azure Databricks sample directory, there is a /cs100 folder that has a bunch of files that can be read in as text (line by line).
The trick is the option called "recursiveFileLookup". It will assume that the directories are created by Spark. You cannot mix and match files.
I added the name of the input file to the DataFrame. Last but not least, I converted the DataFrame to a temporary view.
Looking at a simple aggregate query, we have 10 unique files; the biggest has a little more than 1M records.
If you need to cherry-pick files from a mixed directory, this method will not work.
However, I think that is an organizational cleanup task rather than a reading one.
Finally, use the correct format to read XML:
spark.read.format("com.databricks.spark.xml")
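A minimal sketch of that approach; the rowTag value and the mount path are assumptions, and the spark-xml library must be installed on the cluster:
from pyspark.sql.functions import input_file_name

# `spark` is the SparkSession that Databricks provides.
# Read every XML file under the top folder, regardless of subfolder depth.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")               # assumed row tag
      .option("recursiveFileLookup", "true")    # descend into all subfolders
      .load("/mnt/container/2022/"))            # hypothetical mount path

# Record which file each row came from, then expose the result as a view
df = df.withColumn("source_file", input_file_name())
df.createOrReplaceTempView("xml_data")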

copying files from one folder to another folder based on the file names in python 3

In Python 3.7, I want to write a script that
creates folders based on a list
iterates through a list (elements represent different "runs")
searches for .txt files in predefined directories derived from certain operations
copies certain .txt files to the previously created folders
I managed to do that via following script:
from shutil import copy
import os
import glob

# define folders and batches
folders = ['folder_ce', 'folder_se']
runs = ['A001', 'A002', 'A003']

# make folders
for f in folders:
    os.mkdir(f)

# iterate through batches,
# extract files for every operation,
# and copy them to target folder
for r in runs:
    # operation 1
    ce = glob.glob(f'{r}/{r}/several/more/folders/{r}*.txt')
    for c in ce:
        copy(c, 'folder_ce')
    # operation 2
    se = glob.glob(f'{r}/{r}/several/other/folders/{r}*.txt')
    for s in se:
        copy(s, 'folder_se')
In the predefined directories there are several .txt files:
one file with the format A001.txt (where the "A001"-part is derived from the list "runs" specified above)
plus sometimes several files with the format A001.20200624.1354.56.txt
If a file with the format A001.txt is there, I only want to copy this one to the target directory.
If the format A001.txt is not available, I want to copy all files with the longer format (e.g. A001.20200624.1354.56.txt).
After the comment of @adamkwm, I tried
if f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt' in cepa:
    copy(f'{b}/{b}/pcs.target/data/xmanager/CEPA_Station/Verwaltung_CEPA_44S4/{b}.txt', 'c_py_1')
else:
    for c in cepa:
        copy(c, 'c_py_1')
but that still copies both files (A001.txt and A001.20200624.1354.56.txt), which I understand. I think the trick is to first check in ce, which is a list, whether the {r}.txt format is present and, if it is, only copy that one; if not, copy all files. However, I don't seem to get the logic right, or I am using the wrong modules or methods.
After searching for answers, I didn't find one resolving this specific case.
Can you help me with a solution for this "selective copying" of the files?
Thanks!
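One way to get that "copy the short one if present, otherwise copy all" logic, as a minimal sketch against the names from the script above (ce is the glob result and r the run name for one iteration):
import os

# If the short-format file {r}.txt is among the matches, copy only that one;
# otherwise copy everything that matched.
short = [c for c in ce if os.path.basename(c) == f'{r}.txt']
if short:
    copy(short[0], 'folder_ce')
else:
    for c in ce:
        copy(c, 'folder_ce')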

Logic error with moving all files in a folder

I am writing two simple scripts, one to move all files into a folder, and one to move all files back to said folder. I am not getting any errors, but the files aren't moving so I am likely missing something stupidly obvious somewhere.
I tried making sure the file paths were correct, looked up how the syntax of the commands worked, and checked for any basic errors.
import shutil
import os

source = r'C:\\Users\JonTh\Saved Games\DCS\Mods\aircraft'
destination = r'C:\\Users\JonTh\Saved Games\dcs planes'

files = os.listdir(source)

for index in files:
    shutil.move(source, destination)
You should modify your code to use the individual file names from the for loop:
for index in files:
    shutil.move(source + "\\" + index, destination)
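A slightly more portable variant of the same fix, using os.path.join instead of concatenating the separator by hand:
import os
import shutil

for index in files:
    # os.path.join builds the full source path with the right separator
    shutil.move(os.path.join(source, index), destination)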

Get all files in a matching subdirectory name

I have this piece of code from another project:
import pathlib

p = pathlib.Path(root)
for img_file in p.rglob("*.jpg"):
    # Do something for each image file
It finds all jpg files in the whole directory and its subfolders and acts upon them.
I have a directory that contains 100+ 'main' folders, with each folder having some combination of two subfolders; let's call them 'FolderA' and 'FolderB'. The main folders can have one, both or none of these subfolders.
I want to run a piece of code against all the pdf files contained within the 'FolderB' subdirectories, but ignore all files in the main folders and 'FolderA' folders.
Can someone help me manipulate the above code to allow me to continue?
Many thanks!
You can modify the pattern to just search for what you want:
from pathlib import Path

p = Path("root")
for file in p.rglob("*FolderB/*.pdf"):
    # Do something with file
    pass
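If FolderB always sits exactly one level below each main folder, a non-recursive pattern is stricter, since rglob's "*FolderB" would also match folder names that merely end in FolderB:
from pathlib import Path

p = Path("root")
# Matches <main folder>/FolderB/<file>.pdf, exactly two levels deep
for file in p.glob("*/FolderB/*.pdf"):
    pass  # Do something with file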

Copy files in Python

I want to copy files with a specific file extension from one directory and put them in another directory. I tried searching and found code the same as what I'm doing; however, it doesn't appear to do anything. Any help would be great.
import shutil
import os

source = "/tmp/folder1/"
destination = "/tmp/newfolder/"

for files in source:
    if files.endswith(".txt"):
        shutil.move(files, destination)
I think the problem is your for loop. You are actually looping over the string "/tmp/folder1/" instead of looping over the files in the folder. What your for loop does is go through the string letter by letter (/, t, m, p, etc.).
What you want is to loop over a list of files in the source folder. How that works is described here: How do I list all files of a directory?.
Going from there you can run through the filenames, testing for their extension and moving them just as you showed.
Your "for file in source" pick one character after another one from your string "source" (the for doesn't know that source is a path, for him it is just a basic str object).
You have to use os.listdir :
import shutil
import os

source = "source/"
destination = "dest/"

for files in os.listdir(source):  # list all files and directories
    if os.path.isfile(os.path.join(source, files)):  # is this a file?
        if files.endswith(".txt"):
            shutil.move(os.path.join(source, files), destination)  # move the file
os.path.join is used to join a directory and a filename (to have a complete path).
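An equivalent sketch with glob, which returns complete paths and spares the explicit join (using the /tmp paths from the question):
import glob
import os
import shutil

source = "/tmp/folder1/"
destination = "/tmp/newfolder/"

# glob returns full paths, so they can be passed straight to shutil.move
for path in glob.glob(os.path.join(source, "*.txt")):
    shutil.move(path, destination)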
