I have a function in Python which works like the find command. Basically it descends until it hits m_depth (maxdepth) and will not go into a directory if it is listed in ignore_dirs. It returns a list of the files found during the walk. The code is really simple and uses recursion.
But for a large number of files or a greater depth, the recursion takes time and the returned list keeps growing. So I am asking whether a generator can be used, so that at least the memory consumption is lower for each iteration.
I tried yielding the results, but then it exits whenever one of the ignore_dirs is found.
This is the code I have:
import os

def find(source_d, m_depth, ignore_dirs):
    '''
    This method does a recursive listing of files/directories from a given
    path up to the maximum recursion depth provided as m_depth.

    :param source_d: Given source path to start the recursion from
    :param m_depth: Maximum recursion depth [determines how deep the method will traverse through the file system]
    :param ignore_dirs: These paths will not be traversed. List of strings.
    '''
    def helper_find(path, ignore_dirs, m_depth, curr_depth=1):
        files = []
        if any(ignore_sub_dir == os.path.split(path)[-1] for ignore_sub_dir in ignore_dirs):
            return []
        if m_depth < curr_depth:
            return []
        else:
            things = os.listdir(path)
            for thing in things:
                if os.path.isdir(os.path.join(path, thing)):
                    files.extend(helper_find(os.path.join(path, thing), ignore_dirs, m_depth, curr_depth + 1))
                else:
                    files.append(os.path.join(path, thing))
        return files

    return helper_find(source_d, ignore_dirs, m_depth)
The answer is yes: you can make a recursive generator by using yield from (available in Python 3.3 and later):
def find(source_d, m_depth, ignore_dirs):
    '''
    This method does a recursive listing of files/directories from a given
    path up to the maximum recursion depth provided as m_depth.

    :param source_d: Given source path to start the recursion from
    :param m_depth: Maximum recursion depth [determines how deep the method will traverse through the file system]
    :param ignore_dirs: These paths will not be traversed. List of strings.
    '''
    def helper_find(path, ignore_dirs, m_depth, curr_depth=1):
        if not any(ignore_sub_dir == os.path.split(path)[-1] for ignore_sub_dir in ignore_dirs) and m_depth >= curr_depth:
            things = os.listdir(path)
            for thing in things:
                if os.path.isdir(os.path.join(path, thing)):
                    yield from helper_find(os.path.join(path, thing), ignore_dirs, m_depth, curr_depth + 1)
                else:
                    yield os.path.join(path, thing)

    return helper_find(source_d, ignore_dirs, m_depth)
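For reference, a minimal usage sketch (the start path, depth, and ignore list here are made-up examples) showing that the generator yields paths lazily instead of building the whole list up front:

# Hypothetical example: walk /var/log at most 2 levels deep, skipping "archive" dirs
for file_path in find('/var/log', 2, ['archive']):
    print(file_path)  # each path is produced on demand, one at a time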
I am working on the GeeksForGeeks problem Delete node in Doubly Linked List:
Given a doubly linked list and a position, the task is to delete the node at the given position in the doubly linked list.
Your Task:
The task is to complete the function deleteNode(), which should delete the node at the given position and return the head of the linked list.
My code:
def deleteNode(self, head, x):
    # Code here
    temp = head
    count_of_nodes = 0
    prev_of_delete_node = None
    next_of_delete_node = None
    while temp != head:
        count_of_nodes += 1
        if count_of_nodes == x:
            prev_of_delete_node = temp.prev
            next_of_delete_node = temp.next
            #print(y.data, z.data)
            prev_of_delete_node.next = next_of_delete_node
            next_of_delete_node.prev = prev_of_delete_node
            break
        temp = temp.next
    if x == 1:
        head = next_of_delete_node
There is no effect on the doubly linked list after executing the above code. Why is this?
Some issues:
The while condition is wrong: temp != head is false immediately (temp was just set to head), so the loop body never executes.
The value of prev_of_delete_node could be None when you dereference it with prev_of_delete_node.next, so guard that operation. The same goes for next_of_delete_node.
The function doesn't return anything, but it should return the head of the list after the deletion.
Correction:
def deleteNode(self, head, x):
    temp = head
    count_of_nodes = 0
    prev_of_delete_node = None
    next_of_delete_node = None
    while temp:  # Corrected loop condition
        count_of_nodes += 1
        if count_of_nodes == x:
            prev_of_delete_node = temp.prev
            next_of_delete_node = temp.next
            if prev_of_delete_node:  # Guard
                prev_of_delete_node.next = next_of_delete_node
            if next_of_delete_node:  # Guard
                next_of_delete_node.prev = prev_of_delete_node
            break
        temp = temp.next
    # Should return:
    if x == 1:
        return next_of_delete_node
    return head
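A minimal sketch to exercise the corrected method; the Node class and the build helper are assumptions made for illustration, not part of the GeeksForGeeks template (and since self is unused, the function is called directly with None for it):

class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

def build(values):
    # Build a doubly linked list from a Python list and return its head
    head = tail = None
    for v in values:
        node = Node(v)
        if head is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head

head = build([1, 2, 3])
head = deleteNode(None, head, 2)  # delete the node at position 2
node = head
while node:
    print(node.data)  # prints 1, then 3
    node = node.next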
Write a function list_files_recursive that returns a list of the paths of all the parts.txt files without using the os module's walk generator. Instead, the function should use recursion. The input will be a directory name.
Here is the code I have so far. I think it's basically right, but the output is not coming out as one whole list.
import os

def list_files_recursive(top_dir):
    rec_list_files = []
    list_dir = os.listdir(top_dir)
    for item in list_dir:
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path)
        else:
            if os.path.basename(item_path) == 'parts.txt':
                rec_list_files.append(os.path.join(item_path))
    print(rec_list_files)
    return rec_list_files
This is part of the output I'm getting (from the print statement):
['CarItems/Honda/Accord/1996/parts.txt']
[]
['CarItems/Honda/Odyssey/2000/parts.txt']
['CarItems/Honda/Odyssey/2002/parts.txt']
[]
So the problem is that it's not one list, and there are empty lists in there. I don't quite know why this isn't working and have tried everything to work through it. Any help is much appreciated!
This is very close, but the issue is that list_files_recursive's child calls don't pass results back to the parent. One way to do this is to concatenate all of the lists together from each child call, or to pass a reference to a single list all the way through the call chain.
Note that in rec_list_files.append(os.path.join(item_path)), there's no point in calling os.path.join with only a single argument. print(rec_list_files) should be omitted; as a side effect it makes the output confusing to interpret--only print in the caller. Additionally,
else:
    if ... :
can be more clearly written here as elif: since they're logically equivalent. It's always a good idea to reduce nesting of conditionals whenever possible.
Here's the approach that works by extending the parent list:
import os

def list_files_recursive(top_dir):
    files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            files.extend(list_files_recursive(item_path))
            #     ^^^^^^ add child results to parent
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
Or by passing a result list through the call tree:
import os

def list_files_recursive(top_dir, files=None):
    if files is None:  # avoid a shared mutable default argument across calls
        files = []
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            list_files_recursive(item_path, files)
            #                               ^^^^^ pass our result list recursively
        elif os.path.basename(item_path) == "parts.txt":
            files.append(item_path)
    return files

if __name__ == "__main__":
    print(list_files_recursive("foo"))
A major problem with these functions is that they only work for finding files named precisely parts.txt, since that string literal was hardcoded. That makes them pretty much useless for anything but the immediate purpose. We should add a parameter allowing the caller to specify the target file they want to search for, making the function general-purpose.
Another problem is that the function doesn't do what its name claims: list_files_recursive should really be called find_file_recursive, or, due to the hardcoded string, find_parts_txt_recursive.
Beyond that, the function is a strong candidate for turning into a generator function, which is a common Python idiom for traversal, particularly for situations where the subdirectories may contain huge amounts of data that would be expensive to keep in memory all at once. Generators also allow the flexibility of using the function to cancel the search after the first match, further enhancing its (re)usability.
The yield keyword also makes the function code itself very clean--we can avoid the problem of keeping a result data structure entirely and just fire off result items on demand.
Here's how I'd write it:
import os

def find_file_recursive(top_dir, target):
    for item in os.listdir(top_dir):
        item_path = os.path.join(top_dir, item)
        if os.path.isdir(item_path):
            yield from find_file_recursive(item_path, target)
        elif os.path.basename(item_path) == target:
            yield item_path

if __name__ == "__main__":
    print(list(find_file_recursive("foo", "parts.txt")))
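Since the result is a generator, the caller can also stop at the first match; a small sketch (the directory and file names are illustrative):

first = next(find_file_recursive("foo", "parts.txt"), None)
if first is not None:
    print("first match:", first)  # the traversal is suspended after the first yield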
I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one with the parameters generated for said distribution.
Currently, I have code that does this in serial thanks to this answer https://stackoverflow.com/a/37616966. Of course, it is going to take entirely too long to process my full dataset. I have been playing around with multiprocessing, but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of the sum of squared errors (SSE). Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
def fit_dist(distribution, data=data, bins=200, ax=None):
    # Block of code that tests the distribution and generates params
    return (distribution.name, best_params, sse)

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
I need some help with how to actually make use of the return values on each of the iterations in the multiprocessing to compare those values. I'm really new to python especially multiprocessing so please be patient with me and explain as much as possible.
The problem I'm having is it's giving me an "UnboundLocalError" on the variables that I'm trying to return from my fit_dist function. The DISTRIBUTIONS list is 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted it to. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question, I did the following:
from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    # put this return under the right section of this method
    return [distribution.name, params, sse]

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()

    '''Filter out the None results. Due to the nature of the distribution fitting,
    some distributions are so far off that they result in None objects.'''
    res = list(filter(None, result))

    # Iterate over the nested list, storing the lowest sum of squared errors in best_sse
    best_sse = float('inf')
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]
        else:
            continue

    '''Iterate over the list, pulling out the sublist of the distribution with the best SSE.
    The sublists are made up of a string, a tuple with the parameters,
    and a float value for the SSE, so the SSE is always at index 2.'''
    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist
        else:
            continue
The rest of the code simply consists of me using that list to construct charts and plots with that best distribution on top of a histogram of my raw data.
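As a side note, the two selection loops above can be collapsed into a single call to min with a key function; a small sketch under the same assumptions about res (a list of [name, params, sse] sublists with the SSE at index 2):

# Keep only entries whose SSE is a positive number, then pick the smallest one
candidates = [dist for dist in res if dist[2] > 0]
best_dist_list = min(candidates, key=lambda dist: dist[2])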
I have to compute a function many, many times.
To compute this function, the elements of an array must be computed.
The array is quite large.
How can I avoid allocating the array in every function call?
The code I have tried goes something like this:
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it.
        Let's say that we do
        self.data = data
        """
    def function(self, point):
        return numpy.sum(numpy.array([somecomputations(item) for item in self.data]))
Well, maybe my concern is unfounded, so I have first this question.
Question: Is it true that the array [somecomputations(item) for item in data] is being allocated and deallocated for every call to function?
Thinking that that is the case I have tried
class FunctionCalculator(object):
    def __init__(self, data):
        """
        Get the data and do some small handling of it.
        Let's say that we do
        self.data = data
        """
        self.number_of_data = range(0, len(data))
        self.my_array = numpy.zeros(len(data))

    def function(self, point):
        for i in self.number_of_data:
            self.my_array[i] = somecomputations(self.data[i])
        return numpy.sum(self.my_array)
This is slower than the previous version. I assume that the list comprehension in the first version can be run entirely in C, while in the second version only smaller parts of the script can be translated into optimized C code.
I have very little idea of how Python works inside.
Question: Is there a good way to skip the array allocation in every function call and at the same time take advantage of a well optimized loop on the array?
I am using Python 3.5.
Looping over the array is unnecessary and crosses from Python to C many times, hence the slowdown. The beauty of numpy arrays is that vectorized functions operate on every element without an explicit Python loop. I think the fastest would be:
return numpy.sum(somecomputations(self.data))
somecomputations may need a bit of modification so that it accepts a whole array, but often it will work right off the bat. Also, you're not using point, among other things.
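A minimal sketch of what that could look like; somecomputations here is a stand-in built from numpy ufuncs (a squared sine), not the asker's actual computation:

import numpy

def somecomputations(values):
    # Stand-in computation built from ufuncs, so it operates on whole arrays at once
    return numpy.sin(values) ** 2

class FunctionCalculator(object):
    def __init__(self, data):
        self.data = numpy.asarray(data)  # convert once, reuse on every call

    def function(self, point):
        # No per-call Python loop and no per-call list allocation
        return numpy.sum(somecomputations(self.data))

calc = FunctionCalculator([0.1, 0.2, 0.3])
print(calc.function(point=None))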
I'm searching through a large directory to sort an old archive into a specific order. I have embedded a function which is called recursively and, when it finds a directory whose file path matches the search criteria, adds it to the 'found' dictionary fdict.
The expected outcome is that when the function is called on a directory with no subdirectories, it completes with no actions and moves back up a level.
When run, it gets stuck at the first directory it finds that contains no subdirectories and simply keeps calling the search on the current directory recursively, getting stuck in a loop.
Below is an abstract of the code; any insight into why it is looping would be much appreciated.
import os

def scan(queries, directory):
    fdict = {}

    def search(queries, directory, fdict):
        for entry in os.scandir(directory):
            if entry.is_dir():
                for x in queries:
                    if str(x) in entry.path:
                        fdict[str(x)] = entry.path
                        print("{} found and dicted".format(str(x)))
                    else:
                        search(queries, entry.path, fdict)
            else:
                pass

    search(queries, directory, fdict)
    return fdict
The whole thing can be written as:

import os

# let qs be a list of queries [q]
# root be the start dir
for path, dirnames, filenames in os.walk(root):
    for dirname in dirnames:
        full_path = os.path.join(path, dirname)  # optional (depends)
        for q in qs:
            if q in full_path:
                pass  # do whatever
os.walk is recursive, so it handles the traversal for you. You can use a set operation as well to eliminate the for q in qs loop, as sketched below. Comment if it doesn't work for you.
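A small sketch of that set idea, assuming each query is meant to match a whole directory name (qs and root are placeholders as above):

import os

queries = set(qs)  # membership tests against a set are O(1)
found = {}
for path, dirnames, filenames in os.walk(root):
    for dirname in dirnames:
        if dirname in queries:
            found[dirname] = os.path.join(path, dirname)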
OK so it turns out the problem was in the for x in queries: statement.
The apparent loop was caused by bad design, which meant that only the first value in the queries list was compared to entry.path before the else branch was taken and the search function was called on the current entry.path.
Once a directory with no subdirectories was reached, it would then step back up one level and test the second entry in queries against entry.path.
Although the code would eventually produce the required result, this approach would take absolutely ages (in this instance queries is a list of 4000 values!) and gave the appearance of a loop on inspection.
Below is the corrected code for future reference if anyone stumbles across a similar problem.
import os

def scan(queries, directory):
    fdict = {}

    def search(queries, directory, fdict):
        for entry in os.scandir(directory):
            if entry.is_dir():
                if entry.name in queries:
                    fdict[entry.name] = entry.path  # key on the entry name; x no longer exists here
                else:
                    search(queries, entry.path, fdict)
            else:
                pass

    search(queries, directory, fdict)
    return fdict
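For completeness, a tiny usage sketch (the query names and archive path are made up):

queries = ["INV-2014-001", "INV-2014-002"]
found = scan(queries, "/mnt/old_archive")
for name, path in found.items():
    print(name, "->", path)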