python - multiprocessing stops running after some batch operation - python-3.x

I am trying to run image processing on all the cores available on my machine (which has 4 physical cores and 8 logical processors). I chose multiprocessing because the workload is CPU-bound. As for the data: I have a CSV file that records the local file path and the image category (a label describing what the image is) for each image. The CSV has exactly 9258 categories. My idea is to do batch processing: assign 10 categories to each process, loop through the images one by one, wait until all the processes complete their jobs, and then assign the next batch.
The categories are stored in this format: as_batches = [[C1, C2, ..., C10], [C11, C12, C13, ..., C20], [Cn-10, Cn-9,..., Cn]]
Here is the function that starts the processes.
def get_n_process(as_batches, processes, df, q):
    p = []
    for i in range(processes):
        work = Process(target=submit_job, args=(df, as_batches[i], q, i))
        p.append(work)
        work.start()
    as_batches = as_batches[processes:]
    return p, as_batches
Here is the main loop:
while(len(as_batches) > 0):
    t = []
    #dynamically check the lists
    if len(as_batches) > 8:
        n_process = 8
    else:
        n_process = len(as_batches)
    print("For this it Requires {} Process".format(n_process))
    process_obj_inlist, as_batches = get_n_process(as_batches, n_process, df, q)
    for ind_process in process_obj_inlist:
        ind_process.join()
    with open("logs.txt", "a") as f:
        f.write("\n")
        f.write("Log Recording at: {timestamp}, Remaining N = {remaining} yet to be processed".format(
            timestamp=datetime.datetime.now(),
            remaining = len(as_batches)
        ))
        f.close()
For logging purposes, I write to a text file to see how many batches are still left to process. And here is the main function:
def do_something(fromprocess):
    time.sleep(1)
    print("Operation Ended for Process:{}, process Id:{}".format(
        current_process().name, os.getpid()
    ))
    return "msg"
def submit_job(df, list_items, q, fromprocess):
    a = []
    for items in list_items:
        oneitemdf = df[df['MinorCategory']==items]['FilePath'].values.tolist()
        oneitemdf = [x for x in oneitemdf if x.endswith('.png')]
        result = do_something(fromprocess)
        a.append(result)
    q.put(a)
For now, I am just printing to the console, but in the real code I will use the KAZE algorithm to extract features from the images, store them in a list, and put that list on the Queue (shared between all the processes). The script runs for a few minutes, but after some time it halts and does not progress any further. I tried to exit it but I couldn't. I think a deadlock might be happening, but I am not sure. I read online sources but couldn't figure out why it's happening or how to solve it.
For the full code, here is the gist link: Full Source Code Link. What am I doing wrong here? I am new to multiprocessing and multithreading, and I would like to understand the concepts in depth. Links/resources related to this topic are much appreciated.
UPDATE - The same code works perfectly on macOS.
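One possible cause, offered as a guess since the full gist is not reproduced here: the Python multiprocessing documentation ("Joining processes that use queues") warns that a process which has put items on a Queue will not exit until that data has been flushed to the underlying pipe. If the parent calls join() before it calls get(), both sides can end up waiting on each other. The sketch below drains the queue before joining. The simplified submit_job signature, the run_batch helper, and the "msg-..." payload are stand-ins invented for illustration, not the code from the question.

```python
import multiprocessing as mp


def submit_job(items, q, worker_id):
    # Stand-in for the real per-category work (hypothetical payload).
    results = ["msg-{}".format(item) for item in items]
    q.put((worker_id, results))


def run_batch(batches):
    q = mp.Queue()
    workers = [
        mp.Process(target=submit_job, args=(batch, q, i))
        for i, batch in enumerate(batches)
    ]
    for w in workers:
        w.start()
    # Drain the queue BEFORE joining: exactly one result per worker is expected.
    collected = [q.get() for _ in workers]
    for w in workers:
        w.join()
    return dict(collected)


if __name__ == "__main__":
    out = run_batch([["C1", "C2"], ["C3", "C4"]])
    print(sorted(out))
```

If the order is reversed (join first, get later), the program may still appear to work while the payloads are small, which could make the problem show up only with larger results or on some platforms.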

Related

How to implement batching on a folder by folder basis

I have a script that uses the MTCNN face detection library and iterates through a fair number of directories, totaling thousands of images. The issue I keep running into is excessive memory usage while processing all of these images, which eventually causes my MacBook (16 GB of RAM) to run out of memory. What I'm looking to do is implement batching on a folder-by-folder basis instead of using a specific batch limit, because no single folder contains enough images to make the system run out of memory.
# open up the csv file
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Index', 'Threshhold', 'Path'])
for path, subdirs, files in os.walk(path):
    for name in files:
        if name == '.DS_Store':
            print("Skipping .DS_Store")
            continue
        else:
            try:
                image = os.path.join(path, name)
                pixels = pyplot.imread(image)
                print("Processing " + image)
                print("Count: " + str(inc))
                # calculate the area of the image
                total_height = pixels.shape[0]
                total_width = pixels.shape[1]
                total_area = total_height * total_width
                # create the detector, using default weights
                detector = MTCNN()
                faces = detector.detect_faces(pixels)
                ax = pyplot.gca()
                face_total_area = 0
                if faces == []:
                    print("No faces detected.")
                    # pass in 0 for the threshold because there's no faces
                    #write_to_csv(inc, 0, image)
                    print()
                else:
                    for face in faces:
                        # get dimensions from the face
                        x, y, width, height = face['box']
                        # calculate the area of the face
                        face_area = width * height
                        face_total_area += face_area
                    threshold = face_total_area / total_area
                    # write to csv only if the threshold is less than the limit
                    # change back to this eventually ^^^^^^^^^
                    if threshold > threshhold_limit:
                        print("Facial area is over the threshold - writing file path to csv.")
                        write_to_csv(inc, threshold, image)
                    else:
                        print("Image threshold is under the limit - good")
                        print(threshold)
                    print()
                inc += 1
            except:
                print("Processing error - skipping image")
Is something like this possible to do? Or should it be done a different way? The idea is that batching like this will allow mtcnn to release the memory it's holding onto when it's done processing that folder.
Memory usage should not increase with this program, because it does not accumulate data from one image to the next. So what you are asking for will have no effect. Have you tried running this same code outside of a Python notebook, as a standalone program? It may be that the notebook is keeping references to all the images it has read.
Either that, or find a call that would really reset pyplot's internal state inside the innermost loop (maybe pyplot.clf()).
The "batching" you describe is what already takes place inside the first for loop, which runs once for each folder in your tree. The only benefit you could possibly get would come from resetting the internal state inside the first loop but outside the second one (for name in ...), and you would have to find exactly the same call to reset the internal state.
(Also, on a side note, you create a CSV writer in your with block that is invalidated at the end of the block. You should refactor this code so that it does not keep reopening the CSV file for each new line, which happens in the not-shown write_to_csv function.)
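The side note about the CSV file can be sketched as follows: open the file once, keep the writer alive for the whole walk, and pass each image to a scoring function. This is only an illustration under assumptions. scan_tree and score_image are hypothetical names, and score_image stands in for the MTCNN-based face-area ratio from the question. Building the detector once before the walk, rather than once per image as the question does, may also be worth trying.

```python
import csv
import os
import tempfile


def scan_tree(root, csv_path, score_image, limit):
    """Walk root folder by folder and log images whose score exceeds limit.

    score_image is a hypothetical callable (path -> float) standing in for
    the face-area ratio computed in the question.
    """
    written = 0
    # Open the CSV once; the writer stays valid for the whole walk.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Index", "Threshold", "Path"])
        index = 0
        for folder, _subdirs, files in os.walk(root):
            for name in sorted(files):
                if name == ".DS_Store":
                    continue
                image_path = os.path.join(folder, name)
                try:
                    score = score_image(image_path)
                except Exception as exc:  # keep going, but report the reason
                    print("Skipping {}: {}".format(image_path, exc))
                    continue
                if score > limit:
                    writer.writerow([index, score, image_path])
                    written += 1
                index += 1
    return written


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for name in ("a.png", "b.png"):
        open(os.path.join(root, name), "w").close()
    out_csv = os.path.join(tempfile.mkdtemp(), "flagged.csv")
    # Fake scorer: pretend only a.png is dominated by a face.
    fake = lambda p: 0.9 if p.endswith("a.png") else 0.1
    print(scan_tree(root, out_csv, fake, 0.5))
```

Reporting the exception instead of using a bare except also makes it easier to tell a corrupt image from an out-of-memory error.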

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters; a starting value of the number of ambulances in my system starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, three calls for service….up to 100 calls for service. Second parameter is an arrival rate number which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv
def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals
#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )
#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()
#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is that when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
you're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note that your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101): # The end value is exclusive, so it's 101 not 100
    probability_arrivals.append([s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)])
    print(probability_arrivals[-1])
Now you don't need to worry about manually incrementing the counter.
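Once the function returns the whole list, the follow-up steps from the question (running totals, cutting off at 99%) become ordinary list processing. A minimal sketch, assuming the Poisson formula from the question; the parameter max_calls and the helpers cumulative and truncate_at are names made up here for illustration.

```python
import math
from itertools import accumulate


def probability_x(arrival_rate, max_calls=100):
    """Return [(n, P(exactly n calls))] for n = 0..max_calls."""
    return [
        (n, arrival_rate ** n * math.exp(-arrival_rate) / math.factorial(n))
        for n in range(max_calls + 1)
    ]


def cumulative(rows):
    """Return [(n, P(at most n calls))] as running totals of rows."""
    totals = accumulate(p for _, p in rows)
    return [(n, total) for (n, _), total in zip(rows, totals)]


def truncate_at(cum_rows, cutoff=0.99):
    """Keep rows up to and including the first one that reaches cutoff."""
    kept = []
    for n, total in cum_rows:
        kept.append((n, total))
        if total >= cutoff:
            break
    return kept


if __name__ == "__main__":
    rows = probability_x(arrival_rate=5)
    for n, total in truncate_at(cumulative(rows)):
        # Format each running total as a percentage.
        print("{:3d}  {:.2%}".format(n, total))
```

The same rows list can be passed to csv.writer(...).writerows(rows) or to a plotting call such as matplotlib's bar, since each entry is a plain (n, probability) pair.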

Python For Loop Slows Due To Large List

So currently I have a for loop, which causes the python program to die with the program saying 'Killed'. It slows down around 6000 items in, with the program slowly dying at around 6852 list items. How do I fix this?
I assume it's due to the list being too large.
I've tried splitting the list in two around 6000. Maybe it's due to memory management or something. Help would be appreciated.
for id in listofids:
    connection = psycopg2.connect(user = "username", password = "password", host = "localhost", port = "5432", database = "darkwebscraper")
    cursor = connection.cursor()
    cursor.execute("select darkweb.site_id, darkweb.site_title, darkweb.sitetext from darkweb where darkweb.online='true' AND darkweb.site_id = %s", ([id]))
    print(len(listoftexts))
    try:
        row = cursor.fetchone()
    except:
        print("failed to fetch one")
    try:
        listoftexts.append(row[2])
        cursor.close()
        connection.close()
    except:
        print("failed to print")
You're right, it's probably because the list becomes large: a Python list occupies a contiguous block of memory. When an append no longer fits in the space already reserved, Python relocates the whole array to a place where there is enough room. The bigger your array, the more Python has to relocate.
One way around this would be to create an array of the right size beforehand.
EDIT: Just to make sure it was clear, I made up an example to illustrate my point. I've made 2 functions. The first one appends the stringified index (to make it bigger) to a list at each iteration, and the other just fills a numpy array:
import numpy as np
import matplotlib.pyplot as plt
from time import time
def test_bigList(N):
    L = []
    times = np.zeros(N,dtype=np.float32)
    for i in range(N):
        t0 = time()
        L.append(str(i))
        times[i] = time()-t0
    return times
def test_bigList_numpy(N):
    L = np.empty(N,dtype="<U32")
    times = np.zeros(N,dtype=np.float32)
    for i in range(N):
        t0 = time()
        L[i] = str(i)
        times[i] = time()-t0
    return times
N = int(1e7)
res1 = test_bigList(N)
res2 = test_bigList_numpy(N)
plt.plot(res1,label="list")
plt.plot(res2,label="numpy array")
plt.xlabel("Iteration")
plt.ylabel("Running time")
plt.legend()
plt.title("Evolution of iteration time with the size of an array")
plt.show()
I get the following result:
You can see in the figure that in the list case there are regular peaks (probably due to relocation), and they seem to grow with the size of the list. This example uses short appended strings, but the bigger the strings, the more you will see this effect.
If it does not do the trick, then it might be linked to the database itself, but I can't help you without knowing the specifics of the database.
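On the database side, one thing visible in the question's loop is that it opens a new connection for every id, and both close() calls are skipped whenever row is None, because row[2] raises first. Whether that explains the 'Killed' message is a guess, but reusing one connection and always closing the cursor is cheap to try. The sketch below uses the standard-library sqlite3 module purely as a runnable stand-in for psycopg2 (which uses %s placeholders rather than ?); fetch_texts and the tiny in-memory table are invented for the example.

```python
import sqlite3


def fetch_texts(connection, ids):
    """Collect the text column for each id using one shared cursor."""
    texts = []
    cursor = connection.cursor()
    try:
        for site_id in ids:
            cursor.execute(
                "SELECT sitetext FROM darkweb "
                "WHERE online = 'true' AND site_id = ?",
                (site_id,),
            )
            row = cursor.fetchone()
            if row is None:  # no matching row: skip instead of failing
                continue
            texts.append(row[0])
    finally:
        cursor.close()  # runs even if a query raises
    return texts


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE darkweb (site_id INTEGER, online TEXT, sitetext TEXT)"
    )
    conn.executemany(
        "INSERT INTO darkweb VALUES (?, ?, ?)",
        [(1, "true", "alpha"), (2, "false", "beta"), (3, "true", "gamma")],
    )
    print(fetch_texts(conn, [1, 2, 3, 4]))
    conn.close()
```

If the texts themselves are large, holding all of them in one list may still exhaust memory; writing each one out as it is fetched would avoid that.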

Python: Spacy and memory consumption

1 - THE PROBLEM
I'm using "spacy" in Python to lemmatize text documents.
There are 500,000 documents, each up to 20 MB of clean text.
The problem is the following: spacy's memory consumption grows over time until all the memory is used.
2 - BACKGROUND
My hardware configuration:
CPU: Intel i7-8700K 3.7 GHz (6 cores, 12 threads)
Memory: 16 GB
SSD: 1 TB
GPU is onboard but is not used for this task
I'm using "multiprocessing" to split the task among several processes (workers).
Each worker receives a list of documents to process.
The main process performs monitoring of child processes.
I initiate "spacy" in each child process once and use this one spacy instance to handle the whole list of documents in the worker.
Memory tracing says the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
3 - EXPECTATIONS
I have seen a good recommendation to build a separate server-client solution here: "Is possible to keep spacy in memory to reduce the load time?"
Is it possible to keep memory consumption under control using the "multiprocessing" approach?
4 - THE CODE
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep
# START: memory trace
tracemalloc.start()
# Load spacy
spacyMorph = spacy.load("en_core_web_sm")
#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput
#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])
        # Send to the main process the worker's current progress
        if not lock is None:
            lock.acquire()
            try:
                statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
                conn.send(statusMessage)
                documentCount += 1
            finally:
                lock.release()
        else:
            print(statusMessage)
        # ----------------
        # Some code is excluded for clarity sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------
        wordCount = 1
        wordTotalCount = len(wordList)
        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1
        # ----------------
        # Then I collect all lemmas and save it to another text file
        # ----------------
        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()
if __name__ == '__main__':
    lock = Lock()
    processList = []
    # ----------------
    # Some code is excluded for clarity sake
    # Here I'm getting full list of files "fileTotalList" which I need to lemmatize
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]
        # ----------------
        # Create workers and populate it with list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList) # worker total progress
        processData['count'] = 0 # worker documents done count
        processData['currentDocID'] = 0 # current document ID the worker is working on
        processData['comment'] = '' # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName, args=(lock, processData['con_child'], [processName, fileList]))
        processList.append(processData)
        processData['handler'].start()
        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1
    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0
        #Worker communication format:
        #STATUS:COMMENTS
        #STATUS:
        #- WORKING - worker is working
        #- CLOSED - worker has finished his job and closed pipe-connection
        #COMMENTS:
        #- for WORKING status:
        #DOCID,COUNT,COMMENTS
        #DOCID - current document ID the worker is working on
        #COUNT - count of done documents
        #COMMENTS - additional comments (optional)
        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1
                # ----------------
                # .. and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]
                        # ----------------
                        # Some code is excluded for clarity sake
                        # Update worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")
                # ----------------
                # Some code is excluded for clarity sake
                # Here I draw some progress lines per workers
                # ----------------
            else:
                # worker has finished his job. Close the connection.
                process['con_parent'].close()
        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])
    print("================")
    print("**** DONE ! ****")
    print("================")
    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)
For people who land on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing
docs = ['Your documents']
def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()
    if not n_processes:
        n_processes = multiprocessing.cpu_count()
    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]
    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass
for x in range(10):
    # This will spin up a subprocess,
    # and everytime it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        # args must be a tuple: (docs,) passes the list as one argument
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess so all the memory gets released once the subprocess finishes. I know it's working because I can actually watch the memory get released back to the instance every time the subprocess finishes (also the instance no longer crashes xD).
Full disclosure: I have no idea why Spacy's memory usage seems to go up over time. I've read all over trying to find a simple answer, and all the GitHub issues I've seen claim the issue has been fixed, yet I still see this happening when I use Spacy on AWS SageMaker instances.
Hope this helps someone! I know I spent hours pulling my hair out over this.
Credit to another SO answer that explains a bit more about subprocesses in Python.
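The same "fresh process per batch" idea can also be expressed with a worker pool whose workers are retired after every task, using the maxtasksperchild argument of multiprocessing.Pool. This is a generic sketch, not spaCy-specific: chunked, process_batch and run_in_fresh_processes are hypothetical names, and process_batch only measures string lengths where the real code would load the model and lemmatize.

```python
import multiprocessing


def chunked(items, size):
    """Split items into consecutive lists of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_batch(batch):
    # Placeholder for the real work (hypothetical): load the model here,
    # inside the child, process the batch, and return only small results.
    return [len(text) for text in batch]


def run_in_fresh_processes(items, batch_size, workers=2):
    batches = chunked(items, batch_size)
    # maxtasksperchild=1 retires each worker after one batch, so whatever
    # memory it accumulated is handed back to the OS when it exits.
    with multiprocessing.Pool(processes=workers, maxtasksperchild=1) as pool:
        # chunksize=1 makes every batch its own task.
        results = pool.map(process_batch, batches, chunksize=1)
    return [value for batch in results for value in batch]


if __name__ == "__main__":
    texts = ["a", "bb", "ccc", "dddd", "eeeee"]
    print(run_in_fresh_processes(texts, batch_size=2))
```

The trade-off is that the model is reloaded for every batch, so batches should be large enough to amortize the load time.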
Memory leaks with spacy
Memory problems when processing large amounts of data seem to be a known issue, see some relevant github issues:
https://github.com/explosion/spaCy/issues/3623
https://github.com/explosion/spaCy/issues/3556
Unfortunately, it doesn't look like there's a good solution yet.
Lemmatization
Looking at your particular lemmatization task, I think your example code is a bit too over-simplified, because you're running the full spacy pipeline on single words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want to lemmatize, so in general, you want to disable the parts of the pipeline that you're not using as much as possible (especially parsing if you're only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle really long documents if you're using the parser or entity recognition, so you'll need to break up your texts somehow (or for just lemmatization/tagging you can just increase nlp.max_length as much as you need).
Breaking documents into individual words as in your example kind of defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), plus it's going to be very slow to call spacy this way.
Lookup lemmatization
If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can see if the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
It is just a static lookup table, so it won't do well on unknown words or capitalization for words like "wugs" or "DUCKS", so you'll have to see if it works well enough for your texts, but it would be much much faster without memory leaks. (You could also just use the table yourself without spacy, it's here: https://github.com/michmech/lemmatization-lists.)
Better lemmatization
Otherwise, use something more like this to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.

Reduce execution time when parsing files

I need to write a script in Python 2.7 which parses 4 files.
It needs to be as fast as possible.
For the moment I have a loop, and I parse the 4 files one after another.
I need to understand one thing: if I created 4 parsing scripts (one for each file) and launched the 4 scripts in 4 different terminals, would this reduce the execution time (or not)?
Thanks,
If you have a potato PC, yes, it will reduce the execution time.
I suggest you use multithreading in every script to increase the speed:
import threading
import time
def main():
    starttime = time.time()
    for x in range(1,10000): # print the numbers 1 to 9999 to test whether threading reduces the time
        print x
    endtime = time.time() # take the end time after the loop, not before it
    print "Time Spent : " + str(round((endtime-starttime), 2)) # show the elapsed time
threads = []
t = threading.Thread(target=main)
threads.append(t)
t.start()
Then you can compare the timing with and without threading.
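One caveat: in CPython, threads share the global interpreter lock, so they do not run CPU-bound parsing in parallel. Separate processes do, which is also what launching four scripts in four terminals amounts to. A minimal sketch of doing that from a single script with multiprocessing.Pool (available in Python 2.7 as well); parse_file is a made-up stand-in that counts non-empty lines, and the temporary files exist only to make the demo runnable.

```python
import multiprocessing
import os
import tempfile


def parse_file(path):
    """Hypothetical parser: count the non-empty lines in one file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def parse_all(paths, workers=4):
    # One worker process per file, so CPU-bound parsing can use several cores.
    pool = multiprocessing.Pool(processes=workers)
    try:
        return pool.map(parse_file, paths)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    paths = []
    for i in range(4):
        path = os.path.join(folder, "file{}.txt".format(i))
        with open(path, "w") as handle:
            handle.write("line\n" * (i + 1))
        paths.append(path)
    print(parse_all(paths))
```

Whether this is faster than the sequential loop depends on where the time goes: if the script mostly waits on the disk, extra processes may help little.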
