Python3: Recursively compare two directories based on file contents - python-3.x

I have two directories containing a bunch of files and subfolders.
I would like to check if the file contents are the same in both directories (ignoring the file name). The subfolder structure should be the same too.
I looked at filecmp.dircmp but this is not helping because it does not consider the file content; there is no shallow=False option with filecmp.dircmp(), see here.
The workaround in this SO answer does not work either, because it considers the file names.
What's the best way to do my comparison?

Got around to this. After minor testing this seems to work, though more testing is needed. Again, this can be extremely slow, depending on both the number of files and their size:
import filecmp
import os
from collections import defaultdict
from sys import argv

def compareDirs(d1, d2):
    files1 = defaultdict(set)
    files2 = defaultdict(set)
    subd1 = set()
    subd2 = set()
    for entry in os.scandir(d1):
        if entry.is_dir(): subd1.add(entry)
        else: files1[os.path.getsize(entry)].add(entry)
    # Collecting first to compare length since we are guessing no
    # match is more likely. Can compare files directly if this is
    # not true.
    for entry in os.scandir(d2):
        if entry.is_dir(): subd2.add(entry)
        else: files2[os.path.getsize(entry)].add(entry)

    # Structure not the same. Checking prior to content.
    if len(subd1) != len(subd2) or len(files1) != len(files2): return False

    for size in files2:
        for entry in files2[size]:
            for fname in files1[size]:  # If size does not exist will go to else
                if filecmp.cmp(fname, entry, shallow=False): break
            else: return False
            files1[size].remove(fname)
        if not files1[size]: del files1[size]

    # Missed a file
    if files1: return False

    # This is enough since we checked lengths - if all sd2 are matched, sd1
    # will be accounted for.
    for sd1 in subd1:
        for sd2 in subd2:
            if compareDirs(sd1, sd2): break
        else: return False  # Did not find a sub-directory
        subd2.remove(sd2)

    return True

print(compareDirs(argv[1], argv[2]))
Recursively enter both directories. Compare files on the first level - fail if no match. Then try and match any sub-dir in the first directory to any sub-dir in the next recursively, until all are matched.
This is the most naive solution. In the average case it would probably pay to traverse the tree first and match only sizes and structure. That function would look similar, except that it compares getsize results instead of calling filecmp, and it saves the matching tree structures so that the second, content-comparing run is faster.
Of course, if several sub-directories have exactly the same structure and sizes, we would still need to try every possible pairing among them.
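A minimal sketch of that size-and-structure pre-pass, not tested beyond toy directories. The names size_signature and may_match are made up for illustration. Equal signatures only mean the trees could match, so the content comparison is still needed afterwards; unequal signatures mean they cannot match.

```python
import os

def size_signature(path):
    """Summarize a directory tree while ignoring every name in it.

    The result is (sorted file sizes at this level, sorted signatures of
    the sub-directories), so it does not depend on file or folder names.
    """
    sizes = []
    children = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                children.append(size_signature(entry.path))
            else:
                sizes.append(entry.stat().st_size)
    return (tuple(sorted(sizes)), tuple(sorted(children)))

def may_match(d1, d2):
    """Cheap necessary condition: False means the trees definitely differ."""
    return size_signature(d1) == size_signature(d2)
```

If may_match returns False, the expensive filecmp pass can be skipped entirely.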

Related

Return a list of the paths of all the parts.txt files

Write a function list_files_walk that returns a list of the paths of all the parts.txt files, using the os module's walk generator. The function takes no input parameters.
def list_files_walk():
    for dirpath, dirnames, filenames in os.walk("CarItems"):
        if 'parts.txt' in dirpath:
            list_files.append(filenames)
    print(list_files)
    return list_files
Currently, list_files is still empty. The output is supposed to look similar to this:
CarItems/Chevrolet/Chevelle/2011/parts.txt
CarItems/Chevrolet/Chevelle/1982/parts.txt
How can I produce this output?
You pretty much have it here--the only adjustments I'd make are:
Make sure list_files is scoped locally to the function to avoid side effects.
Use parameters so that the function can work on any arbitrary path.
Return a generator with the yield keyword which allows for the next file to be fetched lazily.
'parts.txt' in dirpath tests the directory path, not the files in it, and a substring test could be error-prone if the filename happens to be a substring elsewhere in a path. I'd use endswith or check the third item in the tuple that os.walk yields, which is a list of the file names in the current directory, e.g. 'parts.txt' in filenames.
Along the same line of thought as above, you might want to make sure that your target is a file with os.path.isfile.
Here's an example:
import os

def find_files_rec(path, fname):
    # os.walk yields (directory path, sub-directory names, file names)
    for dirpath, dirnames, files in os.walk(path):
        if fname in files:
            yield os.path.join(dirpath, fname)

if __name__ == "__main__":
    print(list(find_files_rec(".", "parts.txt")))
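The os.path.isfile check suggested above could be folded in roughly like this. It is only a sketch: find_regular_files is a hypothetical name, and the extra check mainly matters when the tree contains things like broken symlinks that os.walk still lists among the file names.

```python
import os

def find_regular_files(path, fname):
    """Yield paths under `path` that are named `fname` and are regular files."""
    for dirpath, dirnames, filenames in os.walk(path):
        if fname in filenames:
            candidate = os.path.join(dirpath, fname)
            # Skip entries that are listed but are not (links to) regular files
            if os.path.isfile(candidate):
                yield candidate
```

Calling list(find_regular_files("CarItems", "parts.txt")) would then give the list of paths the question asks for.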

I'm trying to 'shuffle' a folder of music and there is an error where random.choice() keeps choosing things that it is supposed to have removed

I'm trying to make a python script that renames files randomly from a list and I used numbers.remove(place) on it but it keeps choosing values that are supposed to have been removed.
I used to just use random.randint but now I have moved to choosing from a list then removing the chosen value from the list but it seems to keep choosing chosen values.
```python
from os import chdir, listdir, rename
from random import choice

def main():
    chdir('C:\\Users\\user\\Desktop\\Folders\\Music')
    for f in listdir():
        if f.endswith('.mp4'):
            numbers = [str(x) for x in range(0, 100)]
            had = []
            print(f'numbers = {numbers}')
            place = choice(numbers)
            print(f'place = {place}')
            numbers.remove(place)
            print(f'numbers = {numbers}')
            while place in had:
                input('Place has been had.')
                place = choice(numbers)
            had.append(place)
            name = place + '.mp4'
            print(f'name = {name}')
            print(f'\n\nRenaming {f} to {name}.\n\n')
            try:
                rename(f, name)
            except FileExistsError:
                pass

if __name__ == '__main__':
    main()
```
It should randomly number the files without choosing the same value for a file twice but it does that and I have no idea why.
When you call listdir() at the top of the loop, that one list is what you iterate over the entire time. Yes, you're changing the contents of the directory, but Python doesn't really care about that, because you only asked for the contents of the directory at a specific point in time: before you began modifying it. Note also that numbers and had are re-created inside the loop for every file, so a value removed for one file is back in the list for the next one.
I would do this in two separate steps:
import os
import random

# get the current list of .mp4 files in the directory
dirlist = [f for f in os.listdir() if f.endswith('.mp4')]
# choose a new name for each file
# (random.sample raises ValueError if there are more than 100 files)
to_rename = zip(
    dirlist,
    [f'{num}.mp4' for num in random.sample(range(100), len(dirlist))]
)
# actually rename each file
for oldname, newname in to_rename:
    try:
        os.rename(oldname, newname)
    except FileExistsError:
        pass
This method is more concise than the one you're using. First, I use random.sample() on the iterable range(100) to generate non-overlapping numbers from that range (without having to do the extra step of using had like you're doing now). I generate exactly as many as I need, and then use the built-in zip() function to bind together the original filenames with these new numbers.
Then, I do the rename() operations all at once.
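If you want to check the numbering before anything on disk changes, the planning step can be pulled into a small function. This is just a sketch; plan_renames, pool_size, and seed are names I made up, and the function only builds the mapping without renaming anything.

```python
import random

def plan_renames(filenames, pool_size=100, seed=None):
    """Map each filename to a unique '<number>.mp4' name (nothing is renamed)."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no number can be chosen twice;
    # it raises ValueError if there are more files than numbers in the pool
    numbers = rng.sample(range(pool_size), len(filenames))
    return {old: f"{num}.mp4" for old, num in zip(filenames, numbers)}

plan = plan_renames(["a.mp4", "b.mp4", "c.mp4"])
for old, new in plan.items():
    print(f"{old} -> {new}")
```

Once the printed plan looks right, the same dictionary can be fed to the os.rename loop.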

How to remove/delete characters from end of string that match another end of string

I have thousands of strings (not in English) that are in this format:
['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
I want to return the following:
['MyWordMyWordSuffix', 'SameVocabularyItem']
Because strings are immutable and I want to start the matching from the end I keep confusing myself on how to approach it.
My best guess is some kind of loop that starts from the end of the strings and keeps checking for a match.
However, since I have so many of these to process it seems like there should be a built in way faster than looping through all the characters, but as I'm still learning Python I don't know of one (yet).
The nearest example I could find already on SO can be found here but it isn't really what I'm looking for.
Thank you for helping me!
You can use commonprefix from os.path to find the common suffix between them:
from os.path import commonprefix

def getCommonSuffix(words):
    # get common suffix by reversing both words and finding the common prefix
    prefix = commonprefix([word[::-1] for word in words])
    return prefix[::-1]
which you can then use to slice out the suffix from the second string of the list:
word_list = ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix']
suffix = getCommonSuffix(word_list)
if suffix:
    print("Found common suffix:", suffix)
    # filter out suffix from second word in the list
    word_list[1] = word_list[1][0:-len(suffix)]
    print("Filtered word list:", word_list)
else:
    print("No common suffix found")
Output:
Found common suffix: MyWordSuffix
Filtered word list: ['MyWordMyWordSuffix', 'SameVocabularyItem']
Demo: https://repl.it/#glhr/55705902-common-suffix
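Since there are thousands of these, the same idea can be wrapped in a helper and applied to each pair. A rough sketch, assuming every item is a two-element list like the one in the question; strip_pair and common_suffix are hypothetical names. Slicing with len(word) - len(suffix) also handles the no-suffix case, where word[0:-0] would return an empty string.

```python
from os.path import commonprefix

def common_suffix(words):
    """Longest suffix shared by all words (empty string if there is none)."""
    return commonprefix([w[::-1] for w in words])[::-1]

def strip_pair(pair):
    """Return [first, second with the suffix it shares with first removed]."""
    first, second = pair
    suffix = common_suffix([first, second])
    return [first, second[:len(second) - len(suffix)]]

pairs = [
    ['MyWordMyWordSuffix', 'SameVocabularyItemMyWordSuffix'],
    ['abc', 'xyz'],  # nothing in common: left unchanged
]
print([strip_pair(p) for p in pairs])
```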

Python 3 img2pdf wrong order of images in pdf

I am working on a small program that takes images from a website and puts them into a pdf for easy access and simpler viewing.
I have a small problem as the img2pdf module seems to put the images into the pdf in the wrong order and I don't really get why.
It seems to put the files in order of 1,10,11.
import urllib.request
import os
import img2pdf

n = 50
all = 0
for counter in range(1,n+1):
    all = all + 1
    urllib.request.urlretrieve("https://website/images/"+str(all)+".jpg", "img"+str(all)+".jpg")
    cwd = os.getcwd()
    if all == 50:
        with open("output2.pdf", "wb") as f:
            f.write(img2pdf.convert([i for i in os.listdir(cwd) if i.endswith(".jpg")]))
Without seeing the filenames you're trying to read in, a guess is that your filenames include numbers that are not zero-padded. Lexicographic ordering (sorting in alphabetical order) of a sequence of files called 0.jpg, 1.jpg, ... 11.jpg will lead to this ordering: 0.jpg, 1.jpg, 10.jpg, 11.jpg, 2.jpg, 3.jpg, 4.jpg, 5.jpg, 6.jpg, 7.jpg, 8.jpg, 9.jpg, because "1" < "2".
To combine your files such that 2 comes before 10, you can zero-pad the filenames (but also beware that some software will interpret leading zeros as indicators of an octal representation of a number, as opposed to just a leading zero.)
If you can't manipulate the filenames, then you could change your file-getting code as follows: use a regular expression to extract the numbers, as int type, from the filenames of your entire list of files, then sort the list of filenames by those extracted numbers (which will be sorted as int, for which 2 < 10).
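Both options can be sketched with the standard library alone. The file names below are assumptions based on the "img" + number + ".jpg" pattern in the question, and number_in is a hypothetical helper; the sorted list is what you would pass to img2pdf.convert instead of the raw os.listdir result.

```python
import re

def number_in(filename):
    """Return the first run of digits in `filename` as an int (-1 if none)."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1

names = ["img1.jpg", "img10.jpg", "img11.jpg", "img2.jpg", "img3.jpg"]

# Plain string sort: "img10.jpg" comes before "img2.jpg"
print(sorted(names))

# Sort by the extracted integer instead, so 2 < 10
ordered = sorted(names, key=number_in)
print(ordered)  # ['img1.jpg', 'img2.jpg', 'img3.jpg', 'img10.jpg', 'img11.jpg']

# Alternative: zero-pad when saving, so a plain sort already works
padded = [f"img{n:02d}.jpg" for n in (1, 2, 3, 10, 11)]
print(sorted(padded) == padded)  # True
```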

Can I force os.walk to visit directories in alphabetical order?

I would like to know if it's possible to force os.walk in python3 to visit directories in alphabetical order. For example, here is a directory and some code that will walk this directory:
ryan:~/bktest$ ls -1 sample
CD01
CD02
CD03
CD04
CD05
--------
def main_work_subdirs(gl):
    for root, dirs, files in os.walk(gl['pwd']):
        if root == gl['pwd']:
            for d2i in dirs:
                print(d2i)
When the python code hits the directory above, here is the output:
ryan:~/bktest$ ~/test.py sample
CD03
CD01
CD05
CD02
CD04
I would like to force walk to visit these dirs in alphabetical order, 01, 02 ... 05. In the python3 doc for os.walk, it says:
When topdown is True, the caller can modify the dirnames list in-place
(perhaps using del or slice assignment), and walk() will only recurse
into the subdirectories whose names remain in dirnames; this can be
used to prune the search, impose a specific order of visiting
Does that mean that I can impose an alphabetical visiting order on os.walk? If so, how?
Yes. You sort dirs in the loop.
def main_work_subdirs(gl):
    for root, dirs, files in os.walk(gl['pwd']):
        dirs.sort()  # in-place sort: os.walk recurses in this order
        if root == gl['pwd']:
            for d2i in dirs:
                print(d2i)
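The detail that matters is that dirs has to be changed in place. A small sketch on a throwaway directory tree, assuming you also want to prune hidden folders as the quoted documentation hints; walk_sorted and skip_hidden are made-up names.

```python
import os
import tempfile

def walk_sorted(top, skip_hidden=True):
    """Yield directory paths under `top`, descending in alphabetical order."""
    for root, dirs, files in os.walk(top):  # topdown=True is the default
        # Slice assignment mutates the list that os.walk keeps using;
        # `dirs = sorted(dirs)` would only rebind the local name and change nothing.
        dirs[:] = sorted(d for d in dirs
                         if not (skip_hidden and d.startswith(".")))
        yield root

with tempfile.TemporaryDirectory() as top:
    for name in ("CD03", "CD01", ".cache", "CD02"):
        os.makedirs(os.path.join(top, name, "disc"))
    visited = [os.path.relpath(p, top) for p in walk_sorted(top)]

print(visited)
```

Here the walk visits CD01, CD02, CD03 in that order, each followed by its disc sub-folder, and never enters .cache.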
I know this has already been answered but I wanted to add one little detail and adding more than a single line of code in the comments is wonky.
In addition to wanting the directories sorted I also wanted the files sorted so that my iteration through "gl" was consistent and predictable. To do this one more sort was required:
for root, dirs, files in os.walk(gl['pwd']):
    dirs.sort()
    for filename in sorted(files):
        print(os.path.join(root, filename))
And, with benefit of learning more about Python, a different (better) way:
from pathlib import Path

# Directories, per original question.
for p in sorted(Path(gl['pwd']).glob('**/*')):
    if p.is_dir():
        print(p)

# Files, like I usually need.
for p in sorted(Path(gl['pwd']).glob('**/*')):
    if p.is_file():
        print(p)
This answer is not specific to this question and the problem is a little different but the solution can be used in either case.
Consider having these files ("one1.txt", "one2.txt", "one10.txt") and the content of all of them is a String "default":
I want to loop through a directory that contains these files and find a specific String in every file and replace it with the name of the file.
If you use any of the other methods already mentioned here and in other questions (like dirs.sort(), sorted(files), and sorted(dirs)), the result will be something like this:
"one1.txt"--> "one10"
"one2.txt"--> "one1"
"one10.txt" --> "one2"
But we want it to be:
"one1.txt"--> "one1"
"one2.txt"--> "one2"
"one10.txt" --> "one10"
I found this method, which processes the files in natural (human) order:
import re, os, fnmatch

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    # raw string so that \d is not treated as a string escape
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]

def findReplace(directory, find, replace, filePattern):
    count = 0
    # sorted() consumes the whole walk first and orders the results by directory path
    for path, dirs, files in sorted(os.walk(os.path.abspath(directory))):
        dirs.sort()
        for filename in sorted(fnmatch.filter(files, filePattern), key=natural_keys):
            count = count +1
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            s = s.replace(find, replace+str(count)+".png")
            with open(filepath, "w") as f:
                f.write(s)
Then run this line:
findReplace(os.getcwd(), "default", "one", "*.xml")
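To see the difference between the two orderings without touching any files, here is a tiny stand-alone check. It uses a compact one-function variant of the key (natural_key is my own name, not part of the code above) on the three example file names.

```python
import re

def natural_key(text):
    """Split into digit and non-digit runs; digit runs compare as integers."""
    return [int(part) if part.isdecimal() else part
            for part in re.split(r"(\d+)", text)]

names = ["one10.txt", "one2.txt", "one1.txt"]

print(sorted(names))                   # ['one1.txt', 'one10.txt', 'one2.txt']
print(sorted(names, key=natural_key))  # ['one1.txt', 'one2.txt', 'one10.txt']
```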
