Stuck in a recursive directory search using os.scandir - python-3.x

I'm searching through a large directory to sort an old archive into a specific order. I have embedded a function which is called recursively and when it finds a directory whose file path matches the search criteria it adds it to the 'found' dictionary fdict.
The expected outcome is that when the function is called on a directory with no subdirectories it completes with no actions and moves back up a level.
When run, it gets stuck in the first directory it finds that contains no subdirectories: it appears to call the search on the current directory over and over, as if caught in a loop.
Below is a condensed version of the code. Any insight into why it is looping would be much appreciated.
def scan(queries, directory):
    fdict = {}
    def search(queries, directory, fdict):
        for entry in os.scandir(directory):
            if entry.is_dir():
                for x in queries:
                    if str(x) in entry.path:
                        fdict[str(x)] = entry.path
                        print("{} found and dicted".format(str(x)))
                    else:
                        search(queries, entry.path, fdict)
            else: pass
    search(queries, directory, fdict)
    return fdict

The whole thing can be written as
import os
# let qs be a list of queries [q]
# root be the start dir
for path, dirnames, filenames in os.walk(root):
    for dirname in dirnames:
        full_path = os.path.join(path, dirname)  # optional (depends)
        for q in qs:
            if q in full_path:
                pass  # do whatever
os.walk already descends into every subdirectory for you. You can also use a set operation to eliminate the for q in qs loop. Comment if it doesn't work for you.
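The set operation mentioned above could look roughly like the sketch below. It assumes the queries are strings that must equal a directory name exactly (rather than appear anywhere in the path), and find_dirs is a name made up for this illustration. Note that os.walk keeps descending into matched directories, and if the same name occurs twice the last one found wins.

```python
import os

def find_dirs(root, queries):
    """Map each queried directory name to the path where it was found."""
    wanted = set(queries)  # set membership avoids looping over every query
    found = {}
    for path, dirnames, _filenames in os.walk(root):
        # Only the names that are both wanted and present at this level
        for name in wanted.intersection(dirnames):
            found[name] = os.path.join(path, name)
    return found
```

With a 4000-item query list this does one set intersection per directory instead of 4000 string comparisons.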

OK so it turns out the problem was in the for x in queries: statement.
The apparent loop was caused by bad design: only the first value in the queries list was compared to entry.path before the else branch ran and the search function was called on the current entry.path.
Once a directory with no sub-directories was reached, it would then step back up one level and test the second entry in queries against entry.path.
Although the code would eventually produce the required result, this approach would take absolutely ages (in this instance queries is a list of 4000 values!) and gave the appearance of a loop on inspection.
Below is the corrected code for future reference if anyone stumbles across a similar problem.
import os
def scan(queries, directory):
    fdict = {}
    def search(queries, directory, fdict):
        for entry in os.scandir(directory):
            if entry.is_dir():
                if entry.name in queries:
                    # record the matching directory under its own name
                    fdict[entry.name] = entry.path
                else:
                    # not a match, so look inside it
                    search(queries, entry.path, fdict)
            else: pass
    search(queries, directory, fdict)
    return fdict

Related

FileNotFoundError But The File Is There: Cryptography Edition

I'm working on a script that takes a checksum and directory as inputs.
Without too much background, I'm looking for 'malware' (ie. a flag) in a directory of executables. I'm given the SHA512 sum of the 'malware'. I've gotten it to work (I found the flag), but I ran into an issue with the output after generalizing the function for different cryptographic protocols, encodings, and individual files instead of directories:
FileNotFoundError: [Errno 2] No such file or directory : 'lessecho'
There is indeed a file lessecho in the directory, and as it happens, is close to the file that returns the actual flag. Probably a coincidence. Probably.
Below is my Python script:
#!/usr/bin/python3
import hashlib, sys, os
"""
### TO DO ###
Add other encryption techniques
Include file read functionality
"""
def main(to_check = sys.argv[1:]):
    dir_to_check = to_check[0]
    hash_to_check = to_check[1]
    BUF_SIZE = 65536
    for f in os.listdir(dir_to_check):
        sha256 = hashlib.sha256()
        with open(f, 'br') as f:  # <--- line where the issue occurs
            while True:
                data = f.read(BUF_SIZE)
                if not data:
                    break
                sha256.update(data)
            f.close()
        if sha256.hexdigest() == hash_to_check:
            return f
if __name__ == '__main__':
    k = main()
    print(k)
Credit to Randall for his answer here
Here are some humble trinkets from my native land in exchange for your wisdom.
Your listdir call is giving you bare filenames (e.g. lessecho), but that is within the dir_to_check directory (which I'll call foo for convenience). To open the file, you need to join those two parts of the path back together, to get a proper path (e.g. foo/lessecho). The os.path.join function does exactly that:
for f in os.listdir(dir_to_check):
    sha256 = hashlib.sha256()
    with open(os.path.join(dir_to_check, f), 'br') as f:  # add os.path.join call here!
        ...
There are a few other issues in the code, unrelated to your current error. One is that you're using the same variable name f for both the file name (from the loop) and file object (in the with statement). Pick a different name for one of them, since you need both available (because I assume you intend return f to return the filename, not the recently closed file object).
And speaking of the closed file, you're actually closing the file object twice. The first one happens at the end of the with statement (that's why you use with). The second is your manual call to f.close(). You don't need the manual call at all.
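Putting those three points together (joining the path, using separate names for the file name and the file object, and letting with close the file), a reworked version might look like the sketch below. The function name find_file_by_hash and the algorithm parameter are assumptions added for illustration; they are not part of the original script.

```python
import hashlib
import os

BUF_SIZE = 65536  # read files in 64 KiB chunks

def find_file_by_hash(dir_to_check, hash_to_check, algorithm="sha256"):
    """Return the name of the first regular file whose digest matches, else None."""
    for name in sorted(os.listdir(dir_to_check)):
        path = os.path.join(dir_to_check, name)  # bare name -> usable path
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        digest = hashlib.new(algorithm)
        with open(path, "rb") as handle:  # closed automatically by with
            for chunk in iter(lambda: handle.read(BUF_SIZE), b""):
                digest.update(chunk)
        if digest.hexdigest() == hash_to_check.lower():
            return name
    return None
```

Passing algorithm="sha512" would match the SHA512 sum mentioned in the question, since hashlib.new accepts the algorithm name as a string.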

Python: command line, sys.argv, "if __name__ == '__main__' "

I have a moderate amount of experience using Python in Jupyter but am pretty clueless about how to use the command line. I have this prompt for a homework assignment -- I understand how the algorithms work, but I don't know how to format everything so it works from the command line in the way that is specified.
The prompt:
Question 1: 80 points
Input: a text file that specifies a travel problem (see travel-input.txt for the format) and a search algorithm (more details are below).
python map.py [file] [search] should read the travel problem from “file” and run the “search” algorithm to find a solution. It will print the solution and its cost.
search is one of [DFTS, DFGS, BFTS, BFGS, UCTS, UCGS, GBFTS, GBFGS, ASTS, ASGS]
Here is the template I was given:
from search import ... # TODO import the necessary classes and methods
import sys
if __name__ == '__main__':
    input_file = sys.argv[1]
    search_algo_str = sys.argv[2]
    # TODO implement
    goal_node = ... # TODO call the appropriate search function with appropriate parameters
    # Do not change the code below.
    if goal_node is not None:
        print("Solution path", goal_node.solution())
        print("Solution cost", goal_node.path_cost)
    else:
        print("No solution was found.")
So as far as python map.py [file] [search] goes, 'file' refers to travel-input.txt and 'search' refers to one of DFTS, DFGS, BFTS,... etc - a user-specified choice. My questions:
Where do I put my search functions? Should they all just be back-to-back in the same block of code?
How do I get the command line to recognize each function from its four or five-letter code? Is it just the name of the function? If I call it just using those letters, how can the functions receive input?
Do I need to reference the input file anywhere in my code?
Does it matter where I save my files in order for them to be accessible from the command line - .py files, travel-input.txt, etc? I've tried accessing them from the command line, with no success.
Thanks for the help!
The function definitions go before the if __name__ == "__main__" block. To select the correct function you can put them in a dict and use the four-letter abbreviations as keys, i.e.
def dfts_search(...):
    ...
def dfgs_search(...):
    ....
...
if __name__ == "__main__":
    input_file = sys.argv[1]
    search_algo_str = sys.argv[2]
    search_dict = {"DFTS": dfts_search, "DFGS": dfgs_search, ...}
    try:
        func = search_dict[search_algo_str]
        result = func(...)
    except KeyError:
        print(f'{search_algo_str} is an unknown search algorithm')
Not sure what you mean by reference, but input_file already refers to the input file. You will need to write a function to read the file and process the contents.
The location of the files shouldn't matter too much. Putting everything in the same directory is probably easiest. In the command window, just cd to the directory where the files are located and run the script as described in the assignment.
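As a rough, runnable illustration of the dictionary dispatch described above: the two search functions here are hypothetical placeholders that only return a string, and read_problem simply returns the raw file text because the real format of travel-input.txt is not shown. Only the way sys.argv is turned into a function call is the point.

```python
import sys

def dfts_search(problem):
    """Hypothetical placeholder for a depth-first tree search."""
    return f"DFTS on {problem}"

def bfts_search(problem):
    """Hypothetical placeholder for a breadth-first tree search."""
    return f"BFTS on {problem}"

# Map the command-line code to the function object (no parentheses: not called yet)
SEARCHES = {"DFTS": dfts_search, "BFTS": bfts_search}

def read_problem(path):
    """Return the raw file text; real parsing depends on the file format."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

def run(argv):
    if len(argv) != 3:
        return "usage: python map.py [file] [search]"
    _, input_file, search_algo_str = argv
    func = SEARCHES.get(search_algo_str)
    if func is None:
        return f"{search_algo_str} is an unknown search algorithm"
    return func(read_problem(input_file))

if __name__ == "__main__":
    print(run(sys.argv))
```

Using SEARCHES.get instead of try/except KeyError means that a KeyError raised inside a search function is not mistaken for an unknown algorithm name.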

Python "with os.chdir('/opt/intel/mkl/bin'): AttributeError: __enter__" Error

I am trying to change to the "/opt/intel/mkl/bin" directory, list the files, check for the presence of some files, and come back out. I am getting the error in the title. Any help is appreciated.
files = ['mklvars.csh','mklvars.sh']
with os.chdir('/opt/intel/mkl/bin'):
    print('testing')
    if all([os.path.isfile(f) for f in files]):
        print("Installation successful")
    else:
        print("not successful")
The with statement only works with a context manager, and os.chdir() isn't one: it returns None, which has no __enter__ method, hence the error.
You can simply do something like (if you don't need working dir to be restored afterwards)
import os
files = ['mklvars.csh','mklvars.sh']
os.chdir('/opt/intel/mkl/bin')
print('testing')
if all([os.path.isfile(f) for f in files]):
    print("Installation successful")
else:
    print("not successful")
But if you need the current working directory to be restored at the end, there are different options you can use.
One example (with https://github.com/jaraco/path):
import os
from path import Path
# Changing the working directory:
files = ['mklvars.csh','mklvars.sh']
with Path('/opt/intel/mkl/bin'):
    print('testing')
    if all([os.path.isfile(f) for f in files]):
        print("Installation successful")
    else:
        print("not successful")
Other solutions: checkout this Question
For further reading on context-managers:
https://docs.python.org/3/library/contextlib.html
https://docs.python.org/2.5/whatsnew/pep-343.html
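If you would rather not add a dependency, a small context manager built with contextlib.contextmanager does the same job. This is only a sketch: working_directory and all_present are names invented for this example. On Python 3.11 and later the standard library also ships contextlib.chdir, which can be used directly in a with statement.

```python
import os
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to path."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # restore even if the block raises

def all_present(directory, names):
    """Return True if every name is a regular file inside directory."""
    with working_directory(directory):
        return all(os.path.isfile(name) for name in names)
```

Calling all_present('/opt/intel/mkl/bin', ['mklvars.csh', 'mklvars.sh']) would then perform the check from the question and leave the working directory as it was.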

Write a recursive function to list all paths of parts.txt

Write a function list_files_recursive that returns a list of the paths of all the parts.txt files without using the os module's walk generator. Instead, the function should use recursion. The input will be a directory name.
Here is the code I have so far and I think it's basically right, but what's happening is that the output is not one whole list?
def list_files_recursive(top_dir):
    rec_list_files = []
    list_dir = os.listdir(top_dir)
    for item in list_dir:
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path)
        else:
            if os.path.basename(item_path) == 'parts.txt':
                rec_list_files.append(os.path.join(item_path))
    print(rec_list_files)
    return rec_list_files
This is part of the output I'm getting (from the print statement):
['CarItems/Honda/Accord/1996/parts.txt']
[]
['CarItems/Honda/Odyssey/2000/parts.txt']
['CarItems/Honda/Odyssey/2002/parts.txt']
[]
So the problem is that it's not one list and that there are empty lists in there. I don't quite know why this isn't working and have tried everything to work through it. Any help is much appreciated on this!
This is very close, but the issue is that list_files_recursive's child calls don't pass results back to the parent. One way to do this is to concatenate all of the lists together from each child call, or to pass a reference to a single list all the way through the call chain.
Note that in rec_list_files.append(os.path.join(item_path)), there's no point in os.path.join with only a single parameter. print(rec_list_files) should be omitted as a side effect that makes the output confusing to interpret--only print in the caller. Additionally,
else:
    if ... :
can be more clearly written here as elif: since they're logically equivalent. It's always a good idea to reduce nesting of conditionals whenever possible.
Here's the approach that works by extending the parent list:
import os
def list_files_recursive(top_dir):
    files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            files.extend(list_files_recursive(item_path))
            # ^^^^^^ add child results to parent
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files
if __name__ == "__main__":
    print(list_files_recursive("foo"))
Or by passing a result list through the call tree:
import os
def list_files_recursive(top_dir, files=None):
    if files is None:
        files = []  # fresh list per top-level call; a [] default would be shared between calls
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path, files)
            #                               ^^^^^ pass our result list recursively
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files
if __name__ == "__main__":
    print(list_files_recursive("foo"))
A major problem with these functions is that they only work for finding files named precisely parts.txt, since that string literal is hard-coded. That makes them pretty much useless for anything but the immediate purpose. We should add a parameter that lets the caller specify the target file name, making the function general-purpose.
Another problem is that the function doesn't do what its name claims: list_files_recursive should really be called find_file_recursive, or, due to the hardcoded string, find_parts_txt_recursive.
Beyond that, the function is a strong candidate for turning into a generator function, which is a common Python idiom for traversal, particularly for situations where the subdirectories may contain huge amounts of data that would be expensive to keep in memory all at once. Generators also allow the flexibility of using the function to cancel the search after the first match, further enhancing its (re)usability.
The yield keyword also makes the function code itself very clean--we can avoid the problem of keeping a result data structure entirely and just fire off result items on demand.
Here's how I'd write it:
import os
def find_file_recursive(top_dir, target):
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            yield from find_file_recursive(item_path, target)
        elif os.path.basename(item_path) == target:
            yield item_path
if __name__ == "__main__":
    print(list(find_file_recursive("foo", "parts.txt")))
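To illustrate the point about cancelling the search after the first match, here is a small sketch that wraps the generator with next. The helper first_match is a name made up for this example; the generator is repeated so the block runs on its own.

```python
import os

def find_file_recursive(top_dir, target):
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            yield from find_file_recursive(item_path, target)
        elif os.path.basename(item_path) == target:
            yield item_path

def first_match(top_dir, target):
    """Return the first matching path, or None if there is no match."""
    # next() pulls a single item; the rest of the tree is never visited
    return next(find_file_recursive(top_dir, target), None)
```

Because the generator is lazy, first_match stops walking the directory tree as soon as one path has been produced, whereas the list-returning versions always traverse everything.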

How to use generator in os find function like wrapper?

I have a function in python which works like find command. So basically it will go into depth till it hit m_depth (maxdepth) and will not go into the directory if it is specified in ignore_dirs. It will return a list of files which is found in a walk. The code is really simple and uses recursion.
But for a large number of files or a greater depth, the recursion takes a long time and the returned list keeps growing. Is there any way a generator can be used here, so that at least the memory consumption per iteration is lower?
I tried yielding the results, but then it exits whenever a directory in ignore_dirs is found.
This is the code I have:
def find(source_d, m_depth, ignore_dirs):
    '''
    This method does a recursive listing of files/directories from a given
    path up to the maximum recursion depth provided as m_depth.
    :param source_d: Given source path to start the recursion from
    :param m_depth: Maximum recursion depth [determines how deep the method will traverse through the file system]
    :param ignore_dirs: these paths will not be traversed. List of strings.
    '''
    def helper_find(path, ignore_dirs, m_depth, curr_depth=1):
        files = []
        if any(ignore_sub_dir == os.path.split(path)[-1] for ignore_sub_dir in ignore_dirs):
            return []
        if m_depth < curr_depth:
            return []
        else:
            things = os.listdir(path)
            for thing in things:
                if(os.path.isdir(os.path.join(path, thing))):
                    files.extend(helper_find(os.path.join(path, thing), ignore_dirs, m_depth, curr_depth+1))
                else:
                    files.append(os.path.join(path, thing))
        return files
    return helper_find(source_d, ignore_dirs, m_depth)
The answer is yes, you can make a recursive generator by using yield from (available in Python 3.3 and later):
def find(source_d, m_depth, ignore_dirs):
    '''
    This method does a recursive listing of files/directories from a given
    path up to the maximum recursion depth provided as m_depth.
    :param source_d: Given source path to start the recursion from
    :param m_depth: Maximum recursion depth [determines how deep the method will traverse through the file system]
    :param ignore_dirs: these paths will not be traversed. List of strings.
    '''
    def helper_find(path, ignore_dirs, m_depth, curr_depth=1):
        if not any(ignore_sub_dir == os.path.split(path)[-1] for ignore_sub_dir in ignore_dirs) and m_depth >= curr_depth:
            things = os.listdir(path)
            for thing in things:
                if(os.path.isdir(os.path.join(path, thing))):
                    yield from helper_find(os.path.join(path, thing), ignore_dirs, m_depth, curr_depth+1)
                else:
                    yield os.path.join(path, thing)
    return helper_find(source_d, ignore_dirs, m_depth)
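The reason the earlier attempt "exited" on an ignored directory is most likely that a return inside a generator ends only that one generator; with yield from, the caller's loop simply moves on to the next entry. A condensed sketch of the same idea follows, with the helper closing over m_depth and ignore_dirs instead of passing them along. That restructuring is a choice made for this illustration, not part of the answer above.

```python
import os

def find(source_d, m_depth, ignore_dirs):
    """Lazily yield file paths under source_d, at most m_depth levels deep."""
    def helper_find(path, curr_depth=1):
        if os.path.basename(path) in ignore_dirs or curr_depth > m_depth:
            return  # ends this branch only; the parent keeps iterating
        for thing in sorted(os.listdir(path)):
            full = os.path.join(path, thing)
            if os.path.isdir(full):
                yield from helper_find(full, curr_depth + 1)
            else:
                yield full
    return helper_find(source_d)
```

Since nothing is computed until the result is iterated, something like itertools.islice(find(root, 3, ["tmp"]), 10) would read only as much of the tree as is needed for the first ten paths (root and "tmp" being placeholder values).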
