How to implement batching on a folder-by-folder basis - python-3.x

I have a script that uses the MTCNN face detection library and iterates through a fair number of directories, totaling thousands of images. The issue I've been running into with this script is excessive memory usage when processing all of these images, which eventually causes my MacBook (16 GB of RAM) to run out of memory. What I'm looking to do is implement batching on a folder-by-folder basis instead of with a fixed batch size, because no single folder contains enough images on its own to make the system run out of memory.
# open up the csv file
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Index', 'Threshold', 'Path'])

for path, subdirs, files in os.walk(path):
    for name in files:
        if name == '.DS_Store':
            print("Skipping .DS_Store")
            continue
        else:
            try:
                image = os.path.join(path, name)
                pixels = pyplot.imread(image)
                print("Processing " + image)
                print("Count: " + str(inc))
                # calculate the area of the image
                total_height = pixels.shape[0]
                total_width = pixels.shape[1]
                total_area = total_height * total_width
                # create the detector, using default weights
                detector = MTCNN()
                faces = detector.detect_faces(pixels)
                ax = pyplot.gca()
                face_total_area = 0
                if faces == []:
                    print("No faces detected.")
                    # pass in 0 for the threshold because there are no faces
                    # write_to_csv(inc, 0, image)
                    print()
                else:
                    for face in faces:
                        # get dimensions from the face
                        x, y, width, height = face['box']
                        # calculate the area of the face
                        face_area = width * height
                        face_total_area += face_area
                    threshold = face_total_area / total_area
                    # write to csv only if the threshold is less than the limit
                    # change back to this eventually ^^^^^^^^^
                    if threshold > threshhold_limit:
                        print("Facial area is over the threshold - writing file path to csv.")
                        write_to_csv(inc, threshold, image)
                    else:
                        print("Image threshold is under the limit - good")
                    print(threshold)
                    print()
                inc += 1
            except:
                print("Processing error - skipping image")
Is something like this possible to do? Or should it be done a different way? The idea is that batching like this will allow mtcnn to release the memory it's holding onto when it's done processing that folder.

Memory usage should not increase with this program, because it does not accumulate data from one image to the next. So, what you are asking for will have no effect. Have you tried running this same code outside of a Python notebook, as a standalone program? It may be that the notebook is keeping references to all the read images.
Either that, or find a call that would really reset pyplot's internal state inside the innermost loop (maybe pyplot.clf()).
"Batching" as you describe it is what already takes place inside the first for loop, which runs once for each folder in your tree. The only benefit you could possibly get would be to reset the internal state inside the first loop but outside the second one (for name in ...), and for that you'd have to find the exact call that resets the internal state.
(Also, on a side note: you create a csv writer in your with block that is invalidated at the end of the block. You should refactor this code so it does not keep reopening the CSV file for each new line, which is what happens in the not-shown write_to_csv function.)
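A minimal sketch of that refactor, keeping the directory walk inside the with block so the single writer is reused and clearing pyplot's state as it goes; csv_path and threshhold_limit are assumed to be defined as in the question, while root_path and the once-only detector creation are my own assumptions rather than something the answer prescribes:

import csv
import os

from matplotlib import pyplot
from mtcnn import MTCNN

detector = MTCNN()  # assumption: build the detector once instead of per image

with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Index', 'Threshold', 'Path'])

    inc = 0
    for folder, subdirs, files in os.walk(root_path):  # root_path: hypothetical name for the top folder
        for name in files:
            if name == '.DS_Store':
                continue
            image = os.path.join(folder, name)
            pixels = pyplot.imread(image)
            faces = detector.detect_faces(pixels)
            total_area = pixels.shape[0] * pixels.shape[1]
            face_total_area = sum(f['box'][2] * f['box'][3] for f in faces)
            threshold = face_total_area / total_area if total_area else 0
            if threshold > threshhold_limit:
                # reuse the already-open writer instead of reopening the CSV per line
                writer.writerow([inc, threshold, image])
            inc += 1
            pyplot.clf()  # reset pyplot's internal state for this image
        pyplot.close('all')  # per-folder cleanup point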

Related

How to lower RAM usage using xarray open_mfdataset and the quantile function

I am trying to load multiple years of daily data from nc files (one nc file per year). A single nc file has dimensions of 365 (days) * 720 (lat) * 1440 (lon). All the nc files are in the "data" folder.
import xarray as xr

ds = xr.open_mfdataset('data/*.nc',
                       chunks={'latitude': 10, 'longitude': 10})
# I need the following line (time: -1) in order to do quantile, or it throws a ValueError:
# ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized'
# consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single
# dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True``
# in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
ds = ds.chunk({'time': -1})
# Perform the quantile "computation" (looks more like a reference to the computation, as it's fast)
ds_qt = ds.quantile(0.975, dim="time")
# Verify the shape of the loaded ds
print(ds)
# This shows the expected "concatenation" of the nc files.
# Get a sample for a given location to test the algorithm
print(len(ds.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values))
print(ds_qt.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values)
The result is correct. My issue is memory usage. I thought that using open_mfdataset, which uses Dask under the hood, would solve this. However, loading "just" 2 years of nc files uses around 8 GB of virtual RAM, and using 10 years of data uses my entire virtual RAM (around 32 GB).
Am I missing something that would let me take a given percentile value across a Dask array (I would need 30 nc files)? I apparently have to apply chunk({'time': -1}) to the dataset to be able to use the quantile function; is this what makes the RAM savings fail?
This may help somebody in the future. Here is the solution I am implementing, even though it is not optimized: I basically break the nc files into slices based on geolocation and merge them back together to create the output file.
ds = xr.open_mfdataset('data/*.nc')
step = 10
min_lat = -90
max_lat = min_lat + step
output_ds = None

while max_lat <= 90:
    cropped_ds = ds.sel(lat=slice(min_lat, max_lat))
    cropped_ds = cropped_ds.chunk({'time': -1})
    cropped_ds_quantile = cropped_ds.quantile(0.975, dim="time")
    if not output_ds:
        output_ds = cropped_ds_quantile
    else:
        output_ds = xr.merge([output_ds, cropped_ds_quantile])
    min_lat += step
    max_lat += step

output_ds.to_netcdf('output.nc')
It's not great, but it limits RAM usage to manageable levels. I am still open to a cleaner/faster solution if it exists (likely).

python - multiprocessing stops running after some batch operation

I am trying to do image processing on all cores available on my machine (which has 4 cores and 8 logical processors). I chose multiprocessing because it's a CPU-bound workload. As for the data, I have a CSV file that records file paths (local paths) and an image category (describing what the image is). The CSV has exactly 9258 categories. My idea is to do batch processing: assign 10 categories to each process, loop through the images one by one, wait until all the processes complete their job, and then assign the next batch.
The categories are stored in this format: as_batches = [[C1, C2, ..., C10], [C11, C12, C13, ..., C20], ..., [Cn-10, Cn-9, ..., Cn]]
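As a minimal sketch of how that batch layout could be built (the category labels below are placeholders for the real ones in the CSV):

categories = ['C{}'.format(i) for i in range(1, 9259)]  # placeholder labels for the 9258 categories
batch_size = 10

# group the flat category list into consecutive batches of 10
as_batches = [categories[i:i + batch_size]
              for i in range(0, len(categories), batch_size)]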
Here is the function that starts the processes.
def get_n_process(as_batches, processes, df, q):
    p = []
    for i in range(processes):
        work = Process(target=submit_job, args=(df, as_batches[i], q, i))
        p.append(work)
        work.start()
    as_batches = as_batches[processes:]
    return p, as_batches
Here is the main loop,
while len(as_batches) > 0:
    t = []
    # dynamically check the lists
    if len(as_batches) > 8:
        n_process = 8
    else:
        n_process = len(as_batches)
    print("For this it requires {} processes".format(n_process))

    process_obj_inlist, as_batches = get_n_process(as_batches, n_process, df, q)
    for ind_process in process_obj_inlist:
        ind_process.join()

    with open("logs.txt", "a") as f:
        f.write("\n")
        f.write("Log Recording at: {timestamp}, Remaining N = {remaining} yet to be processed".format(
            timestamp=datetime.datetime.now(),
            remaining=len(as_batches)
        ))
For logging purposes, I am writing to a text file to see how many categories are still left to process. And here are the worker functions:
def do_something(fromprocess):
    time.sleep(1)
    print("Operation Ended for Process:{}, process Id:{}".format(
        current_process().name, os.getpid()
    ))
    return "msg"

def submit_job(df, list_items, q, fromprocess):
    a = []
    for items in list_items:
        oneitemdf = df[df['MinorCategory'] == items]['FilePath'].values.tolist()
        oneitemdf = [x for x in oneitemdf if x.endswith('.png')]
        result = do_something(fromprocess)
        a.append(result)
    q.put(a)
For now, I am just printing to the console, but in the real code I will be using the KAZE algorithm to extract features from the images, store them in a list, and append it to the queue (shared memory) from all the processes. The script runs for a few minutes, but after some time it halts and doesn't run any further. I tried to exit it but I couldn't. I think some deadlock might be happening, but I am not sure. I read online sources but couldn't figure out why it's happening or how to solve it.
For the full code, here is the gist link: Full Source Code Link. What am I doing wrong here? I am new to multiprocessing and multithreading and would like to understand the concepts in depth. Links/resources related to this topic are much appreciated.
UPDATE - The same code works perfectly on macOS.
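One possibility that fits these symptoms (an assumption on my part, not a confirmed diagnosis) is the documented multiprocessing caveat that a child process which has put items on a Queue will not exit until those items are consumed, so calling join() before draining the queue can hang once the payloads get large. A minimal sketch of draining before joining, reusing the names from the main loop above:

# Assumption: collect each worker's q.put(a) result before join(),
# because a child that has buffered data on a Queue blocks at exit until it is drained.
results = []
for _ in process_obj_inlist:
    results.append(q.get())

for ind_process in process_obj_inlist:
    ind_process.join()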

Python: Spacy and memory consumption

1 - THE PROBLEM
I'm using "spacy" on python for text documents lemmatization.
There are 500,000 documents having size up to 20 Mb of clean text.
The problem is the following: spacy memory consuming is growing in time till the whole memory is used.
2 - BACKGROUND
My hardware configuration:
CPU: Intel i7-8700K 3.7 GHz (12 cores)
Memory: 16 GB
SSD: 1 TB
GPU: onboard, but not used for this task
I'm using "multiprocessing" to split the task among several processes (workers).
Each worker receives a list of documents to process.
The main process monitors the child processes.
I initialize spaCy once in each child process and use that single spaCy instance to handle the worker's whole list of documents.
Memory tracing says the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
3 - EXPECTATIONS
I have seen a good recommendation to build a separate client-server solution here: Is possible to keep spacy in memory to reduce the load time?
Is it possible to keep memory consumption under control using the "multiprocessing" approach?
4 - THE CODE
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc  # gc added: it is used below in workerNormalize
from multiprocessing import Pipe, Process, Lock
from time import sleep

# START: memory trace
tracemalloc.start()

# Load spacy
spacyMorph = spacy.load("en_core_web_sm")

#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput

#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send to the main process the worker's current progress
        if lock is not None:
            lock.acquire()
            try:
                statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
                conn.send(statusMessage)
                documentCount += 1
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity's sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------
        wordCount = 1
        wordTotalCount = len(wordList)

        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save it to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()


if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity's sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate them with the list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker total progress
        processData['count'] = 0  # worker documents done count
        processData['currentDocID'] = 0  # current document ID the worker is working on
        processData['comment'] = ''  # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName,
                                         args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED  - worker has finished its job and closed the pipe connection
        # COMMENTS:
        # - for WORKING status:
        #   DOCID,COUNT,COMMENTS
        #   DOCID    - current document ID the worker is working on
        #   COUNT    - count of done documents
        #   COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]
                        # ----------------
                        # Some code is excluded for clarity's sake
                        # Update worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity's sake
                # Here I draw some progress lines per worker
                # ----------------
            else:
                # worker has finished its job. Close the connection.
                process['con_parent'].close()

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)
For people who land on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing

docs = ['Your documents']

def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()
    if not n_processes:
        n_processes = multiprocessing.cpu_count()
    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]
    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass

for x in range(10):
    # This will spin up a subprocess,
    # and every time it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess so all the memory gets released once the subprocess finishes. I know it's working because I can actually watch the memory get released back to the instance every time the subprocess finishes (also the instance no longer crashes xD).
Full disclosure: I have no idea why spaCy's memory usage seems to go up over time. I've read all over trying to find a simple answer, and all the GitHub issues I've seen claim they've fixed the issue, yet I still see this happening when I use spaCy on AWS SageMaker instances.
Hope this helps someone! I know I spent hours pulling my hair out over this.
Credit to another SO answer that explains a bit more about subprocesses in Python.
Memory leaks with spacy
Memory problems when processing large amounts of data seem to be a known issue, see some relevant github issues:
https://github.com/explosion/spaCy/issues/3623
https://github.com/explosion/spaCy/issues/3556
Unfortunately, it doesn't look like there's a good solution yet.
Lemmatization
Looking at your particular lemmatization task, I think your example code is a bit over-simplified, because you're running the full spacy pipeline on single words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want to lemmatize, so in general, you want to disable the parts of the pipeline that you're not using as much as possible (especially parsing if you're only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle really long documents if you're using the parser or entity recognition, so you'll need to break up your texts somehow (or for just lemmatization/tagging you can just increase nlp.max_length as much as you need).
Breaking documents into individual words as in your example kind of defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), plus it's going to be very slow to call spacy this way.
Lookup lemmatization
If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can see if the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
It is just a static lookup table, so it won't do well on unknown words or capitalization for words like "wugs" or "DUCKS", so you'll have to see if it works well enough for your texts, but it would be much much faster without memory leaks. (You could also just use the table yourself without spacy, it's here: https://github.com/michmech/lemmatization-lists.)
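For illustration, a minimal sketch of using such a table directly. This assumes each line of the downloaded file holds a lemma and an inflected form separated by a tab (check the actual column order of the file you use), and the lowercasing is my own addition rather than part of the original answer:

# Build a plain dict from e.g. lemmatization-en.txt (assumed layout: lemma<TAB>form per line)
lookup = {}
with open("lemmatization-en.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            lemma, form = parts
            lookup[form.lower()] = lemma

def lemmatize(word):
    # fall back to the word itself for forms not in the table
    return lookup.get(word.lower(), word)

print(lemmatize("ducks"), lemmatize("ducking"))  # expected: duck duck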
Better lemmatization
Otherwise, use something more like this to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.

Change gif characteristics

I have written Python code which creates a gif from a list of images. To do this, I used the Python library imageio. Here is my code:
def create_gif(files, gif_path):
    """Creates an animated gif from a list of figures

    Args:
        files (list of str) : list of the files that are to be used for the gif creation.
            All files should have the same extension which should be either png or jpg
        gif_path (str) : path where the created gif is to be saved

    Raises:
        ValueError: if the files given in argument don't have the proper
            file extension (".png" or ".jpeg" for the images in 'files',
            and ".gif" for 'gif_path')
    """
    images = []
    for image in files:
        # Make sure that the file is a ".png" or a ".jpeg" one
        if splitext(image)[-1] == ".png" or splitext(image)[-1] == ".jpeg":
            pass
        elif splitext(image)[-1] == "":
            image += ".png"
        else:
            raise ValueError("Wrong file extension ({})".format(image))

        # Read the image with imageio and put it into the images list
        images.append(imageio.imread(image))

    # Make sure that the output file is a ".gif" one
    if splitext(gif_path)[-1] == ".gif":
        pass
    elif splitext(gif_path)[-1] == "":
        gif_path += ".gif"
    else:
        raise ValueError("Wrong file extension ({})".format(gif_path))

    # imageio writes all the images into a .gif file at gif_path
    imageio.mimsave(gif_path, images)
When I try this code with a list of images, the gif is correctly created, but I have no idea how to change its parameters:
What I mean by that is that I would like to be able to control the delay between the gif's frames, and also how long the gif runs in total.
I have tried to open my gif with the Image module from PIL and change its info, but when I save it, my gif turns into just my first image.
Could you please help me understand what I am doing wrong?
Here is the code that I ran to try to change the gif parameters:
# Try to change gif parameters
my_gif = Image.open(my_gif.name)
my_gif_info = my_gif.info
print(my_gif_info)
my_gif_info['loop'] = 65535
my_gif_info['duration'] = 100
print(my_gif.info)
my_gif.save('./generated_gif/my_third_gif.gif')
You can just pass both parameters, loop and duration, to the mimsave/mimwrite method.
imageio.mimsave(gif_name, fileList, loop=4, duration=0.3)
Next time you want to check which parameters can be used for a format supported by imageio, you can just use imageio.help(format_name):
imageio.help("gif")
GIF-PIL - Static and animated gif (Pillow)
A format for reading and writing static and animated GIF, based
on Pillow.
Images read with this format are always RGBA. Currently,
the alpha channel is ignored when saving RGB images with this
format.
Parameters for reading
----------------------
None
Parameters for saving
---------------------
loop : int
The number of iterations. Default 0 (meaning loop indefinitely).
duration : {float, list}
The duration (in seconds) of each frame. Either specify one value
that is used for all frames, or one value for each frame.
Note that in the GIF format the duration/delay is expressed in
hundredths of a second, which limits the precision of the duration.
fps : float
The number of frames per second. If duration is not given, the
duration for each frame is set to 1/fps. Default 10.
palettesize : int
The number of colors to quantize the image to. Is rounded to
the nearest power of two. Default 256.
subrectangles : bool
If True, will try and optimize the GIF by storing only the
rectangular parts of each frame that change with respect to the
previous. Default False.
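As a usage note, here is a hedged sketch of how the create_gif function from the question could forward these two options to imageio; the keyword pass-through is my addition, and the extension checks from the original are omitted for brevity:

import imageio

def create_gif(files, gif_path, loop=0, duration=0.1):
    """Like the original helper, but exposes GIF timing options.

    loop=0 means loop indefinitely; duration is seconds per frame.
    """
    images = [imageio.imread(image) for image in files]
    imageio.mimsave(gif_path, images, loop=loop, duration=duration)

# e.g. show each frame for 0.3 s and repeat the animation 4 times
create_gif(["frame1.png", "frame2.png"], "my_gif.gif", loop=4, duration=0.3)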

Tensorflow image classification - eating up my memory

I am trying to write a small program which can classify certain pictures into categories. I create a list of pictures in the main code and pass them to the function in a loop. The code works perfectly fine except for the fact that it does not free memory, and with each iteration the program uses more until it completely crashes.
I already tried using gc.collect() in the function to force it to clear the memory, but that does not help. Shouldn't the memory be cleared automatically after checking one file, or did I miss something here?
def classify_pictures(self, files):
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

    # Read the image_data
    image_data = tf.gfile.FastGFile(files, 'rb').read()

    # Loads label file, strips off carriage return
    label_lines = [line.rstrip() for line
                   in tf.gfile.GFile("tf_files/retrained_labels.txt")]

    # Unpersists graph from file
    with tf.gfile.FastGFile("tf_files/retrained_graph.pb", 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        _ = tf.import_graph_def(graph_def, name='')

    with tf.Session() as sess:
        # Feed the image_data as input to the graph and get first prediction
        softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})

        # Sort to show labels of first prediction in order of confidence
        top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]

        for each_picture in range(0, 10):
            human_string = label_lines[top_k[0]]
            if human_string == "selfie":
                return ("selfie")
            if "passport" in human_string:
                return ("passport")
            if "statement" in human_string:
                return ("bill")
If you call this function in a loop, the computational graph is rebuilt on every iteration, and the previous graph is of course still there too. That is what uses all the memory.
To solve this, only do the session.run() call inside the loop.
In general, when writing TensorFlow code you should always try to keep code that builds the graph separate from code that executes the graph. In your case you are doing both inside a single function which is then called multiple times, so a new graph is built every time.
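A minimal sketch of that separation, reusing the TF1-style calls and file paths from the question; the class structure and the single long-lived session are my assumptions about how one might apply the advice, not the answer's literal code:

import os
import tensorflow as tf

class PictureClassifier:
    def __init__(self):
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
        # Graph construction happens exactly once, here
        self.label_lines = [line.rstrip() for line
                            in tf.gfile.GFile("tf_files/retrained_labels.txt")]
        with tf.gfile.FastGFile("tf_files/retrained_graph.pb", 'rb') as f:
            graph_def = tf.GraphDef()
            graph_def.ParseFromString(f.read())
            tf.import_graph_def(graph_def, name='')
        # One session is kept open and reused for every image
        self.sess = tf.Session()
        self.softmax_tensor = self.sess.graph.get_tensor_by_name('final_result:0')

    def classify_pictures(self, files):
        # Only graph execution happens per image
        image_data = tf.gfile.FastGFile(files, 'rb').read()
        predictions = self.sess.run(self.softmax_tensor,
                                    {'DecodeJpeg/contents:0': image_data})
        top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
        human_string = self.label_lines[top_k[0]]
        if human_string == "selfie":
            return "selfie"
        if "passport" in human_string:
            return "passport"
        if "statement" in human_string:
            return "bill"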
