How to find files in multilevel subdirectories - python-3.x

Suppose I have a directory that contains multiple subdirectories:
one_meter = r"C:\Projects\NED_1m"
Within the directory one_meter I want to find all of the files that end with '.xml' and contain the string "_meta". My problem is that some of the subdirectories have that file one level down, while others have it two levels down.
EX:
one_meter > USGS_NED_one_meter_x19y329_LA_Jean_Lafitte_2013_IMG_2015 > USGS_NED_one_meter_x19y329_LA_Jean_Lafitte_2013_IMG_2015_meta.xml
one_meter > NY_Long_Island> USGS_NED_one_meter_x23y454_NY_LongIsland_Z18_2014_IMG_2015 > USGS_NED_one_meter_x23y454_NY_LongIsland_Z18_2014_IMG_2015_meta.xml
I want to look in my main directory (one_meter) and find all of the _meta.xml files (regardless of the subdirectory) and append them to a list (one_m_list = []).
I tried the following but it doesn't produce any results. What am I doing incorrectly?
import os

one_m_list = []
for filename in os.listdir(one_meter):
    if filename.endswith(".xml") and "_meta" in filename:
        print(filename)
        one_m_list.append(filename)

The answer from @JonathanDavidArndt is good but quite outdated. Since Python 3.5, you can use pathlib.Path.glob to search for a pattern in any subdirectory.
For instance:
import pathlib
destination_root = r"C:\Projects\NED_1m"
pattern = "**/*_meta*.xml"
master_list = list(pathlib.Path(destination_root).glob(pattern))
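Note that glob yields pathlib.Path objects, not strings. If downstream code expects plain paths (for example, the one_m_list from the question), a small follow-up converting each match with str():

one_m_list = [str(p) for p in pathlib.Path(destination_root).glob(pattern)]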

The function you are looking for is os.walk.
A simple and minimal working example is below. You should be able to modify this to suit your needs:
destination_root = "C:\Projects\NED_1m"
extension_to_find = ".xml"
master_list = []
extension_to_find_len = len(extension_to_find)
for path,dir,files in os.walk(destination_root):
for filename in files:
# and of course, you can add extra filter criteria
# such as "contains _meta" right in here
if filename[-extension_to_find_len:] == extension_to_find:
print(os.path.join(path, filename))
master_list.append(os.path.join(path, filename))

Related

How to copy merge files of two different directories with different extensions into one directory and remove the duplicated ones

I need a Python function that performs the following action:
I have two directories: one contains files in .xml format, and the other contains files in .pdf format. To simplify things, consider this example:
Directory 1: a.xml, b.xml, c.xml
Directory 2: a.pdf, c.pdf, d.pdf
Output:
Directory 3: a.xml, b.xml, c.xml, d.pdf
As you can see, the .xml files take priority when files with both extensions share the same base name.
I would be thankful for your help.
You need to use the shutil module and the os module to achieve this. This function works on the following assumptions:
A given directory has all files with the same extension.
The priority_directory is the directory whose file extension is to be prioritized.
The secondary_directory is the directory whose files are to be dropped in case of a name collision.
Try:
import os, shutil

def copy_files(priority_directory, secondary_directory, destination="new_directory"):
    # get the base file names to check for collisions
    file_names = [os.path.splitext(filename)[0] for filename in os.listdir(priority_directory)]
    os.mkdir(destination)  # make a new directory
    for file in os.listdir(priority_directory):  # this loop copies the first directory as it is
        file_path = os.path.join(priority_directory, file)
        dst_path = os.path.join(destination, file)
        shutil.copy(file_path, dst_path)
    for file in os.listdir(secondary_directory):  # this loop checks for collisions and drops files whose names collide
        if os.path.splitext(file)[0] not in file_names:
            file_path = os.path.join(secondary_directory, file)
            dst_path = os.path.join(destination, file)
            shutil.copy(file_path, dst_path)
    print(os.listdir(destination))
Let's run it with your directory names as arguments:
copy_files('directory_1','directory_2','directory_3')
You can check that a new directory named directory_3 has been created with the desired files in it.
This will work for all such similar cases no matter what the extension is.
Note: there shouldn't really be a need to do this, since a directory can hold two files with the same name as long as their extensions differ.
Rough working solution:
import os
from shutil import copy2

d1 = './d1/'
d2 = './d2/'
d3 = './d3/'
ext_1 = '.xml'
ext_2 = '.pdf'

def get_files(d: str, files: list):
    for file in os.listdir(d):
        dup = False
        filename = os.fsdecode(file)
        if filename[-4:] == ext_2:
            # skip a .pdf if a .xml with the same base name was already collected
            for (x, y) in files:
                if y == filename[:-4] + ext_1:
                    dup = True
                    break
        if dup:
            continue
        files.append((d, filename))

files = []
get_files(d1, files)
get_files(d2, files)
for d, file in files:
    copy2(d + file, d3)
I'll see if I can get it to look/perform better.

For Loop to Move and Rename .html Files - Python 3

I'm asking for help in creating a loop to make this script go through all files in a local directory. Currently I have it working with a single HTML file, but I would like it to pick the first file in the directory and loop until it gets to the last one.
It would also help to append a (1), (2), (3), etc. to the end of the new name when names are duplicated.
Can anyone help with renaming thousands of files using a string parsed with BeautifulSoup4? Each file contains a name and reference number at the same position/line. It could be the same name and reference number, or a different reference number with the same name.
import bs4, shutil, os

src_dir = os.getcwd()
print(src_dir)
os.mkdir('subfolder')  # os.mkdir returns None, so keep the path separately
dest_dir = os.path.join(src_dir, 'subfolder')
src_file = os.path.join(src_dir, 'example_filename_here.html')
shutil.copy(src_file, dest_dir)
exampleFile = open('example_filename_here.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('.bodycopy')  # elements with class "bodycopy"
dst_file = os.path.join(dest_dir, 'example_filename_here.html')
new_dst_file_name = os.path.join(dest_dir, elems[2].getText() + '.html')
os.rename(dst_file, new_dst_file_name)  # rename the copy using the parsed text
os.chdir(dest_dir)
print(elems[2].getText())
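Here is a minimal sketch of the requested loop, under the question's assumptions: every .html file in the directory has a .bodycopy element at index 2 (the selector and index are taken from the script above), and duplicate target names get a (1), (2), ... suffix. Everything else is illustrative, not a definitive implementation:

import os, shutil, bs4

src_dir = os.getcwd()
dest_dir = os.path.join(src_dir, 'subfolder')
os.makedirs(dest_dir, exist_ok=True)

for src_name in os.listdir(src_dir):
    if not src_name.endswith('.html'):
        continue
    with open(os.path.join(src_dir, src_name), encoding='utf-8') as f:
        soup = bs4.BeautifulSoup(f.read(), 'html.parser')
    new_base = soup.select('.bodycopy')[2].getText()
    # append (1), (2), ... while the target name is already taken
    candidate = new_base + '.html'
    n = 1
    while os.path.exists(os.path.join(dest_dir, candidate)):
        candidate = '{} ({}).html'.format(new_base, n)
        n += 1
    shutil.copy(os.path.join(src_dir, src_name), os.path.join(dest_dir, candidate))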

Wildcard in string as input

I have a variable that is an input for a process. It's essentially the full path name of a file, but it injects a value based on a list to get the correct name:
fipsList = ['06001','06037','06059']
for fip in fipsList:
    file = r"T:\CCSI\TECH\FEMA\Datasets\NFHL\NFHL_06122018\NFHL_{}_20180518.gdb".format(fip)
What I want to do now is make everything between "...NFHL_{}_" and "...gdb" a wildcard "*". Simply using file = r"T:\CCSI\TECH\FEMA\Datasets\NFHL\NFHL_06122018\NFHL_{}_*.gdb".format(fip)
doesn't seem to work. Essentially, this is what that produces:
>>> 'T:\\CCSI\\TECH\\FEMA\\Datasets\\NFHL\\NFHL_06122018\\NFHL_06_*.gdb'
Suggestions on how to get it to work?
Maybe some good old concatenation?
Like:
fipsList = ['06001','06037','06059']
for fip in fipsList:
    file = r"T:\CCSI\TECH\FEMA\Datasets\NFHL\NFHL_06122018\NFHL_" + fip + ".gdb"
Simply adding '*' into a string this way will not work. The setup of the question is poor (my own fault), but for clarification's sake, here's how I resolved the issue:
import os

fipsList = ['06001','06037','06059']
for fip in fipsList:
    path = r"T:\CCSI\TECH\FEMA\Datasets\NFHL\NFHL_06122018"
    for root, dirs, filename in os.walk(path):
        for dir in dirs:
            if 'NFHL_' + fip[:2] in dir and '.gdb' in dir:
                file = os.path.join(root, dir)
Essentially, I had to walk through the folder and use an if conditional to make sure that the conditions of having both the fip value and the .gdb extension were met.
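For completeness: a * wildcard is never expanded inside a plain Python string; it has to go through a module such as glob. A minimal sketch of that alternative, assuming the .gdb entries really follow the NFHL_<fip>_<date>.gdb naming from the question:

import glob

fipsList = ['06001', '06037', '06059']
for fip in fipsList:
    pattern = r"T:\CCSI\TECH\FEMA\Datasets\NFHL\NFHL_06122018\NFHL_{}_*.gdb".format(fip)
    for match in glob.glob(pattern):  # glob expands the * against the file system
        print(match)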

Count multiple files in a directory with the same name

I'm relatively new to Python and was working on a project where the user can navigate to a folder, after which the program counts all of the files in that folder with a specific name.
The problem is that I have a folder with over 5000 files, many of them sharing the same name but with different extensions. I wrote code that somewhat does what I want the final version to do, but it's very redundant and I can't see myself doing this for over 600 file names.
I wanted to ask whether it is possible to make this program automated, or at least less redundant, so that I don't have to manually type out the names of 600 files to return data for.
Sample code I currently have:
import os, sys

print(sys.version)
file_counting1 = 0
file_counting2 = 0
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
    for file in os.listdir(filepath):
        if file.startswith('expressmail'):
            file_counting1 += 1
    print('expressmail')
    print('Total files found:', file_counting1)
    for file in os.listdir(filepath):
        if file.startswith('prioritymail'):
            file_counting2 += 1
    print('prioritymail')
    print('Total files found:', file_counting2)
Sample Output:
expressmail
Total files found: 3
prioritymail
Total files found: 1
The following script will count occurrences of files with the same name. If the file does not have an extension, the whole filename is treated as the name. It also does not traverse subdirectories, since the original question just asks about files in the given folder.
import os

dir_name = "."
files = next(os.walk(dir_name))[2]  # get all the files directly in the directory
names = [f[:f.rindex(".")] for f in files if "." in f]  # drop the extensions
names += [f for f in files if "." not in f]  # add those without extensions
for name in set(names):  # for each unique name
    print("{}\nTotal files found: {}".format(name, names.count(name)))
If you want to support files in subdirectories, you could use something like
files = [os.path.join(r,file) for r,d,f in os.walk(dir_name) for file in f]
If you don't want to consider files without extensions, just remove the line:
names += [f for f in files if "." not in f]
There are a number of ways you can do what you're trying to do. Partly it depends on whether or not you need to recover the list of extensions for a given duplicated file.
Counter, from the collections module: use this for a simple count of files. Ignore the extensions when building the count.
Use the filename without its extension as a dictionary key, with a list of items as the value, where the list holds each occurrence of the file (a sketch of this second approach follows the Counter example below).
Here's an example using the Counter class:
import os, sys, collections

c = collections.Counter()
for root, dirs, files in os.walk('/home/myname/hg/2018/'):
    # discard any path data and just use the filename
    for names in files:
        name, ext = os.path.splitext(names)  # discard any extension
        c[name] += 1

# Counter.most_common() gives the values in the form of (entry, count)
# Counter.most_common(x) - pass a value to display only the top x counts
# e.g. Counter.most_common(2) = top 2
for x in c.most_common():
    print(x[0] + ': ' + str(x[1]))
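And a minimal sketch of the second approach, keeping every occurrence under its base name so the extensions can still be recovered (the walk root is the same assumption as above):

import os, collections

occurrences = collections.defaultdict(list)
for root, dirs, files in os.walk('/home/myname/hg/2018/'):
    for f in files:
        name, ext = os.path.splitext(f)
        occurrences[name].append(os.path.join(root, f))  # keep the full path of each occurrence

for name, paths in occurrences.items():
    print('{}: {}'.format(name, len(paths)))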
You can use regular expressions:
import os, sys, re

print(sys.version)
filepath = input("Enter file path here: ")
if os.path.exists(filepath):
    allfiles = "\n".join(os.listdir(filepath))
    file_counting1 = len(re.findall("^expressmail", allfiles, re.M))
    print('expressmail')
    print('Total files found:', file_counting1)
    file_counting2 = len(re.findall("^prioritymail", allfiles, re.M))
    print('prioritymail')
    print('Total files found:', file_counting2)

Can I force os.walk to visit directories in alphabetical order?

I would like to know if it's possible to force os.walk in python3 to visit directories in alphabetical order. For example, here is a directory and some code that will walk this directory:
ryan:~/bktest$ ls -1 sample
CD01
CD02
CD03
CD04
CD05
--------
def main_work_subdirs(gl):
    for root, dirs, files in os.walk(gl['pwd']):
        if root == gl['pwd']:
            for d2i in dirs:
                print(d2i)
When the python code hits the directory above, here is the output:
ryan:~/bktest$ ~/test.py sample
CD03
CD01
CD05
CD02
CD04
I would like to force walk to visit these dirs in alphabetical order, 01, 02 ... 05. In the python3 doc for os.walk, it says:
When topdown is True, the caller can modify the dirnames list in-place
(perhaps using del or slice assignment), and walk() will only recurse
into the subdirectories whose names remain in dirnames; this can be
used to prune the search, impose a specific order of visiting
Does that mean that I can impose an alphabetical visiting order on os.walk? If so, how?
Yes. You sort dirs in the loop.
def main_work_subdirs(gl):
    for root, dirs, files in os.walk(gl['pwd']):
        dirs.sort()
        if root == gl['pwd']:
            for d2i in dirs:
                print(d2i)
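One detail worth stressing: the docs quoted above say the caller can modify the dirnames list in-place, so the sort must mutate the very list os.walk yielded; rebinding the name has no effect on traversal order. A minimal illustration, using gl['pwd'] as above:

for root, dirs, files in os.walk(gl['pwd']):
    dirs.sort()            # in-place: walk() recurses in sorted order
    # dirs = sorted(dirs)  # rebinds the local name only; walk() is unaffected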
I know this has already been answered, but I wanted to add one little detail, and adding more than a single line of code in the comments is wonky.
In addition to wanting the directories sorted, I also wanted the files sorted so that my iteration through "gl" was consistent and predictable. To do this, one more sort was required:
for root, dirs, files in os.walk(gl['pwd']):
    dirs.sort()
    for filename in sorted(files):
        print(os.path.join(root, filename))
And, with benefit of learning more about Python, a different (better) way:
from pathlib import Path
# Directories, per original question.
[print(p) for p in sorted(Path(gl['pwd']).glob('**/*')) if p.is_dir()]
# Files, like I usually need.
[print(p) for p in sorted(Path(gl['pwd']).glob('**/*')) if p.is_file()]
This answer is not specific to this question, and the problem is a little different, but the solution can be used in either case.
Consider having these files ("one1.txt", "one2.txt", "one10.txt"), where the content of each of them is the string "default":
I want to loop through a directory that contains these files, find a specific string in every file, and replace it with the name of the file.
If you use any of the other methods already mentioned here and in other questions (like dirs.sort(), sorted(files) and sorted(dirs)), the result will be something like this:
"one1.txt" --> "one10"
"one2.txt" --> "one1"
"one10.txt" --> "one2"
But we want it to be:
"one1.txt" --> "one1"
"one2.txt" --> "one2"
"one10.txt" --> "one10"
I found this method, which changes the file contents in natural (human) order:
import re, os, fnmatch

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]

def findReplace(directory, find, replace, filePattern):
    count = 0
    for path, dirs, files in sorted(os.walk(os.path.abspath(directory))):
        dirs.sort()
        for filename in sorted(fnmatch.filter(files, filePattern), key=natural_keys):
            count = count + 1
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            s = s.replace(find, replace + str(count) + ".png")
            with open(filepath, "w") as f:
                f.write(s)
Then run this line:
findReplace(os.getcwd(), "default", "one", "*.xml")
