1 - THE PROBLEM
I'm using "spacy" on python for text documents lemmatization.
There are 500,000 documents having size up to 20 Mb of clean text.
The problem is the following: spacy memory consuming is growing in time till the whole memory is used.
2 - BACKGROUND
My hardware configuration:
CPU: Intel I7-8700K 3.7 GHz (12 cores)
Memory: 16 Gb
SSD: 1 Tb
GPU is onboard but is not used for this task
I'm using "multiprocessing" to split the task among several processes (workers).
Each worker receives a list of documents to process.
The main process performs monitoring of child processes.
I initiate "spacy" in each child process once and use this one spacy instance to handle the whole list of documents in the worker.
Memory tracing says the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
3 - EXPECTATIONS
I have seen a good recommendation to build a separated server-client solution [here]Is possible to keep spacy in memory to reduce the load time?
Is it possible to keep memory consumption under control using "multiprocessing" approach?
4 - THE CODE
Here is a simplified version of my code:
import os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep
# START: memory trace
tracemalloc.start()
# Load spacy
spacyMorph = spacy.load("en_core_web_sm")
#
# Get word's lemma
#
def getLemma(word):
global spacyMorph
lemmaOutput = spacyMorph(str(word))
return lemmaOutput
#
# Worker's logic
#
def workerNormalize(lock, conn, params):
documentCount = 1
for filenameRaw in params[1]:
documentTotal = len(params[1])
documentID = int(os.path.basename(filenameRaw).split('.')[0])
# Send to the main process the worker's current progress
if not lock is None:
lock.acquire()
try:
statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
conn.send(statusMessage)
documentCount += 1
finally:
lock.release()
else:
print(statusMessage)
# ----------------
# Some code is excluded for clarity sake
# I've got a "wordList" from file "filenameRaw"
# ----------------
wordCount = 1
wordTotalCount = len(wordList)
for word in wordList:
lemma = getLemma(word)
wordCount += 1
# ----------------
# Then I collect all lemmas and save it to another text file
# ----------------
# Here I'm trying to reduce memory usage
del wordList
del word
gc.collect()
if __name__ == '__main__':
lock = Lock()
processList = []
# ----------------
# Some code is excluded for clarity sake
# Here I'm getting full list of files "fileTotalList" which I need to lemmatize
# ----------------
while cursorEnd < (docTotalCount + stepSize):
fileList = fileTotalList[cursorStart:cursorEnd]
# ----------------
# Create workers and populate it with list of files to process
# ----------------
processData = {}
processData['total'] = len(fileList) # worker total progress
processData['count'] = 0 # worker documents done count
processData['currentDocID'] = 0 # current document ID the worker is working on
processData['comment'] = '' # additional comment (optional)
processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
processName = 'worker ' + str(count) + " at " + str(cursorStart)
processData['handler'] = Process(target=workerNormalize, name=processName, args=(lock, processData['con_child'], [processName, fileList]))
processList.append(processData)
processData['handler'].start()
cursorStart = cursorEnd
cursorEnd += stepSize
count += 1
# ----------------
# Run the monitor to look after the workers
# ----------------
while True:
runningCount = 0
#Worker communication format:
#STATUS:COMMENTS
#STATUS:
#- WORKING - worker is working
#- CLOSED - worker has finished his job and closed pipe-connection
#COMMENTS:
#- for WORKING status:
#DOCID,COUNT,COMMENTS
#DOCID - current document ID the worker is working on
#COUNT - count of done documents
#COMMENTS - additional comments (optional)
# ----------------
# Run through the list of workers ...
# ----------------
for i, process in enumerate(processList):
if process['handler'].is_alive():
runningCount += 1
# ----------------
# .. and check if there is somethng in the PIPE
# ----------------
if process['con_parent'].poll():
try:
message = process['con_parent'].recv()
status = message.split(':')[0]
comment = message.split(':')[1]
# ----------------
# Some code is excluded for clarity sake
# Update worker's information and progress in "processList"
# ----------------
except EOFError:
print("EOF----")
# ----------------
# Some code is excluded for clarity sake
# Here I draw some progress lines per workers
# ----------------
else:
# worker has finished his job. Close the connection.
process['con_parent'].close()
# Whait for some time and monitor again
sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])
print("================")
print("**** DONE ! ****")
print("================")
# ----------------
# Here I'm measuring memory usage to find the most "gluttonous" part of the code
# ----------------
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("[ Memory trace - Top 10 ]")
for stat in top_stats[:10]:
print(stat)
'''
For people who land on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing
docs = ['Your documents']
def process_docs(docs, n_processes=None):
# Load the model inside the subprocess,
# as that seems to be the main culprit of the memory issues
nlp = en_core_web_lg.load()
if not n_processes:
n_processes = multiprocessing.cpu_count()
processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]
# Then do what you wish beyond this point. I end up writing results out to s3.
pass
for x in range(10):
# This will spin up a subprocess,
# and everytime it finishes it will release all resources back to the machine.
with multiprocessing.Manager() as manager:
p = multiprocessing.Process(target=process_docs, args=(docs))
p.start()
p.join()
The idea here is to put everything Spacy-related into a subprocess so all the memory gets released once the subprocess finishes. I know it's working because I can actually watch the memory get released back to the instance every time the subprocess finishes (also the instance no longer crashes xD).
Full Disclosure: I have no idea why Spacy seems to go up in memory overtime, I've read all over trying to find a simple answer, and all the github issues I've seen claim they've fixed the issue yet I still see this happening when I use Spacy on AWS Sagemaker instances.
Hope this helps someone! I know I spent hours pulling my hair out over this.
Credit to another SO answer that explains a bit more about subprocesses in Python.
Memory leaks with spacy
Memory problems when processing large amounts of data seem to be a known issue, see some relevant github issues:
https://github.com/explosion/spaCy/issues/3623
https://github.com/explosion/spaCy/issues/3556
Unfortunately, it doesn't look like there's a good solution yet.
Lemmatization
Looking at your particular lemmatization task, I think your example code is a bit too over-simplified, because you're running the full spacy pipeline on single words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want to lemmatize, so in general, you want to disable the parts of the pipeline that you're not using as much as possible (especially parsing if you're only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle really long documents if you're using the parser or entity recognition, so you'll need to break up your texts somehow (or for just lemmatization/tagging you can just increase nlp.max_length as much as you need).
Breaking documents into individual words as in your example kind of the defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), plus it's going to be very slow to call spacy this way.
Lookup lemmatization
If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can see if the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
It is just a static lookup table, so it won't do well on unknown words or capitalization for words like "wugs" or "DUCKS", so you'll have to see if it works well enough for your texts, but it would be much much faster without memory leaks. (You could also just use the table yourself without spacy, it's here: https://github.com/michmech/lemmatization-lists.)
Better lemmatization
Otherwise, use something more like this to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
for token in doc:
print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
Related
I'm trying to train a forecasting model on several backtest dates and model parameters. I wrote a custom function that basically takes an average of ARIMA, ETS, and a few other univariate and multivariate forecasting models from a dataset that's about 10 years of quarterly data (40 data points). I want to run this model in parallel on thousands of different combinations.
The custom model I wrote looks like this
def train_test_func(model_params)
data = read_data_from_pickle()
data_train, data_test = train_test_split(data, backtestdate)
model1 = ARIMA.fit(data_train)
data_pred1 = model1.predict(len(data_test))
...
results = error_eval(data_pred1, ..., data_pred_i, data_test)
save_to_aws_s3(results)
logger.info("log steps here")
My multiprocessing script looks like this:
# Custom function I work that trains and tests
from my_custom_model import train_test_func
commands = []
if __name__ == '__main__':
for backtest_date in target_backtest_dates:
for param_a in target_drugs:
for param_b in param_b_options:
for param_c in param_c_options:
args = {
"backtest_date": backtest_date,
"param_a": param_a,
"param_b": param_b,
"param_c": param_c
}
commands.append(args)
count = multiprocessing.cpu_count()
with multiprocessing.get_context("spawn").Pool(processes=count) as pool:
pool.map(train_test_func, batched_args)
I can get relatively fast results for the first 200 or so iterations, roughly 50 iterations per min. Then, it drastically slows down to ~1 iteration per minute. For reference, running this on a single core gets me about 5 iterations per minute. Each process is independent and uses a relatively small dataset (40 data points). None of the processes need to depend on each other, either--they are completely standalone.
Can anyone help me understand where I'm going wrong with multiprocessing? Is there enough information here to identify the problem? At the moment, the multiprocessing versions are slower than single core versions.
Attaching performance output
I found the answer. Basically my model uses numpy, which, by default, is configured to use multicore. The clue was in my CPU usage from the top command.
This stackoverflow post led me to the correct answer. I added this code block to the top of my scripts that use numpy:
import os
ncore = "1"
os.environ["OMP_NUM_THREADS"] = ncore
os.environ["OPENBLAS_NUM_THREADS"] = ncore
os.environ["MKL_NUM_THREADS"] = ncore
os.environ["VECLIB_MAXIMUM_THREADS"] = ncore
os.environ["NUMEXPR_NUM_THREADS"] = ncore
import numpy
...
The key being that you have to add these configurations before you import numpy.
Performance increased from 50 cycles / min to 150 cycles / min and didn't experience any throttling after a few minutes. CPU usage was also improved, with no processes exceeding 100%.
I am trying to load multiple years of daily data in nc files (one nc file per year). A single nc file has a dimension of 365 (days) * 720 (lat) * 1440 (lon). All the nc files are in the "data" folder.
import xarray as xr
ds = xr.open_mfdataset('data/*.nc',
chunks={'latitude': 10, 'longitude': 10})
# I need the following line (time: -1) in order to do quantile, or it throws a ValueError:
# ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized'
# consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single
# dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True``
# in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
ds = ds.chunk({'time': -1})
# Perform the quantile "computation" (looks more like a reference to the computation, as it's fast
ds_qt = ds.quantile(0.975, dim="time")
# Verify the shape of the loaded ds
print(ds)
# This shows the expected "concatenation" of the nc files.
# Get a sample for a given location to test the algorithm
print(len(ds.sel(lon = 35.86,lat = 14.375, method='nearest')['variable'].values))
print(ds_qt.sel(lon = 35.86,lat = 14.375, method='nearest')['variable'].values)
The result is correct. My issue comes from memory usage. I thought that by doing the open_mfdataset method, which uses Dask under the hood, this would be solved. However, loading "just" 2 years of nc files uses around 8GB of virtual RAM, and using 10 years of data uses my entire virtual RAM (around 32GB).
Am I missing something to be able to take a given percentile value across a dask array (I would need 30 nc files)? I apparently have to apply the chunk({'time': -1}) to the dataset to be able to use the quantile function, is this what makes the RAM savings fail?
This may help somebody in the future, here is the solution I am implementing, even though it is not optimized. I basically break the nc files into slices based on geolocation, and paste it back together to create the output file.
ds = xr.open_mfdataset('data/*.nc')
step = 10
min_lat = -90
max_lat = min_lat + step
output_ds = None
while max_lat <= 90:
cropped_ds = ds.sel(lat=slice(min_lat, max_lat))
cropped_ds = cropped_ds.chunk({'time': -1})
cropped_ds_quantile = cropped_ds.quantile(0.975, dim="time")
if not output_ds:
output_ds = cropped_ds_quantile
else:
output_ds = xr.merge([output_ds, cropped_ds_quantile])
min_lat += step
max_lat += step
output_ds.to_netcdf('output.nc')
It's not great, but it limits RAM usage to manageable levels. I am still open to a cleaner/faster solution if it exists (likely).
I am trying to develop a simple code in NEST: a network constituted by 10 identical neurons connected between them in order to form a loop. I would like to use arrays to develop this code, but I obtained error messages related to mismatch of argument type. Below I copy my code:
# SIMPLE NET -- FIRST TRIAL
# First example of a net made by 10 exicitatory neurons
# modeled by IAF models, which are connected in a loop
# and each neuron receives a synaptic current of double
# exponential type with rise and decay times respectively
# of 0.5 s. and 3 s.
###############################################################################
# Import the necessary modules
import pylab
import nest
import nest.raster_plot
import nest.voltage_trace
import numpy as np
###############################################################################
# Create the NODES
nest.SetDefaults("iaf_psc_alpha",
{"C_m": 10.0,
"tau_m": 15.58,
"t_ref": 2.0,
"E_L": -65.0,
"V_th": -40.0,
"V_reset": -65.0})
neurons = nest.Create("iaf_psc_alpha",10)
###############################################################################
# SYNAPTIC CURRENTS NODES
exc = 0.5
ini = 3.0
Istim = 0.0
Istim1 = 20.0
nest.SetStatus(neurons[:1], {"tau_syn_ex": exc, "tau_syn_in": ini, "I_e": Istim1})
nest.SetStatus(neurons[1:], {"tau_syn_ex": exc, "tau_syn_in": ini, "I_e": Istim})
###############################################################################
# OUTPUT DEVICE
voltmeter1 = nest.Create("voltmeter")
nest.Connect(voltmeter1, neurons[1])
voltmeter2 = nest.Create("voltmeter")
nest.Connect(voltmeter2, neurons[2])
voltmeter3 = nest.Create("voltmeter")
nest.Connect(voltmeter3, neurons[3])
spikes = nest.Create('spike_detector')
###############################################################################
# EXCITATORY CONNECTION BETWEEN NODES in a LOOP
weight = 200.0
delay = 1.0
#'excitatory',
for i in range(1,9,1):
nest.Connect(neurons[i], neurons[i+1], syn_spec={'weight': weight, 'delay': delay})
nest.Connect(neurons[10], neurons[1], syn_spec={'weight': weight, 'delay': delay})
###############################################################################
# SPIKE DETECTOR
nest.Connect(neurons, spikes)
###############################################################################
#SIMULATIONS AND OUTPUTS
nest.Simulate(400.0)
nest.voltage_trace.from_device(voltmeter1)
nest.voltage_trace.from_device(voltmeter2)
nest.voltage_trace.from_device(voltmeter3)
nest.voltage_trace.show()
nest.raster_plot.from_device(spikes, hist=True)
nest.raster_plot.show()
In Python arrays are indexed from zero, so the ten neurons created are neurons[0] to neurons[9]. To close the ring you need to connect the last neuron neurons[9] to the first neurons[0]. The neurons[10] does not exist.
Note also, that you close the ring 9 times, since the second Connect() call is also indented and thereby is inside the for loop.
If you use another Python feature, that negative indices are counted from the back, then your loop becomes very simple. Try something like this:
for i in range(10):
print(f"connect {(i-1)%10} -> {i}")
nest.Connect([neurons[i-1]], [neurons[i]], syn_spec={'weight': weight, 'delay': delay})
(You can of course omit the print() call. I just added it to show what happens.)
As you see this closes your ring automatically and you do not need the extra Connect() call. In Python lists and numpy arrays the convention is that neurons[-1] is the last element of the array, neurons[-2] is the one before the last, etc. Nice effect is that you see the 10 everywhere and it's easy to replace with a variable for the number of neurons. Also there is need anymore for making the range() so complex.
Additional note: For long rings it may be much faster to do array slicing:
nest.Connect(neurons[:-1], neurons[1:], syn_spec={'weight': weight, 'delay': delay})
nest.Connect([neurons[-1]], [neurons[0]], syn_spec={'weight': weight, 'delay': delay})
In this variant you need only two connect calls (without the Python for loop!), so most of the connections can be generated by the NEST kernel without returning control to Python after each synapse. Generally the fewer calls the better the kernel can make use of parallelization.
I have a script that uses the MTCNN face detection library that iterates through a fair amount of directories, totaling thousands of images. An issue that I've been running into with this script is the excessive memory usage when processing all of these images, which will eventually cause my MacBook (16gb of RAM) to run out of memory. What I'm looking to do is to implement batching on a folder by folder basis, instead of a specific batch limit because none of the folders contain enough images individually that would make the system run out of memory.
# open up the csv file
with open(csv_path, 'w', newline='') as file:
writer = csv.writer(file)
writer.writerow(['Index', 'Threshhold', 'Path'])
for path, subdirs, files in os.walk(path):
for name in files:
if name == '.DS_Store':
print("Skipping .DS_Store")
continue
else:
try:
image = os.path.join(path, name)
pixels = pyplot.imread(image)
print("Processing " + image)
print("Count: " + str(inc))
# calculate the area of the image
total_height = pixels.shape[0]
total_width = pixels.shape[1]
total_area = total_height * total_width
# create the detector, using default weights
detector = MTCNN()
faces = detector.detect_faces(pixels)
ax = pyplot.gca()
face_total_area = 0
if faces == []:
print("No faces detected.")
# pass in 0 for the threshold becuase there's no faces
#write_to_csv(inc, 0, image)
print()
else:
for face in faces:
# get dimensions from the face
x, y, width, height = face['box']
# calculate the area of the face
face_area = width * height
face_total_area += face_area
threshold = face_total_area / total_area
# write to csv only if the threshold is less than the limit
# change back to this eventually ^^^^^^^^^
if threshold > threshhold_limit:
print("Facial area is over the threshold - writing file path to csv.")
write_to_csv(inc, threshold, image)
else:
print("Image threshold is under the limit - good")
print(threshold)
print()
inc += 1
except:
print("Processing error - skipping image")
Is something like this possible to do? Or should it be done a different way? The idea is that batching like this will allow mtcnn to release the memory it's holding onto when it's done processing that folder.
Memory usage should not increase with this program, because it does not accumulate data from one image to the next one. So, what you are asking for will have no effect. Have you tried runnng tis same code outside of a Python notebook? As a standalone program? It may be that the notebook is keeping references to all read images.
Either that, or find a call that would really reset pyplot's internal state inside the innermost loop. (maybe pyplot.clf()).
"Batching" as you say is what takes place inside the first for loop, which will run once for each folder in your tree. The only bennefit you could possibly have would be to reset the internal state inside the first loop, but outside the second for (for name in ...), you'd have to find the exactly same call to reset the internal state.
(also, on a side note, you create a csv writer in your with block that is invalidated at the end of the block - you should refactor this code not to keep reopening the CSV file for each new line - (which happens in the not-shown write_to_csv function) )
I am trying to do image processing in all cores available in my machine(which has 4 cores and 8 processors). I chose to do Multiprocessing because it's a kind of CPU bound workload. Now, explaining the data I have a CSV file that has file paths recorded(local path), Image Category(explain what image is). The CSV has exactly 9258 categories. My Idea is to do batch processing. Assign 10 categories to each processor and loop through the images one by one, wait till all the processors complete its job, and assign the next batch.
The categories are stored in this format as_batches = [[C1, C2, ..., C10], [C11, C12, C13, ..., C20], [Cn-10, Cn-9,..., Cn]]
Here is the function that starts the process.
def get_n_process(as_batches, processes, df, q):
p = []
for i in range(processes):
work = Process(target=submit_job, args=(df, as_batches[i], q, i))
p.append(work)
work.start()
as_batches = as_batches[processes:]
return p, as_batches
Here is the main loop,
while(len(as_batches) > 0):
t = []
#dynamically check the lists
if len(as_batches) > 8:
n_process = 8
else:
n_process = len(as_batches)
print("For this it Requries {} Process".format(n_process))
process_obj_inlist, as_batches = get_n_process(as_batches, n_process, df, q)
for ind_process in process_obj_inlist:
ind_process.join()
with open("logs.txt", "a") as f:
f.write("\n")
f.write("Log Recording at: {timestamp}, Remaining N = {remaining} yet to be processed".format(
timestamp=datetime.datetime.now(),
remaining = len(as_batches)
))
f.close()
For log purposes, I am writing into a text file to see how many categories are there to process yet. And here is the main function
def do_something(fromprocess):
time.sleep(1)
print("Operation Ended for Process:{}, process Id:{}".format(
current_process().name, os.getpid()
))
return "msg"
def submit_job(df, list_items, q, fromprocess):
a = []
for items in list_items:
oneitemdf = df[df['MinorCategory']==items]['FilePath'].values.tolist()
oneitemdf = [x for x in oneitemdf if x.endswith('.png')]
result = do_something(fromprocess)
a.append(result)
q.put(a)
For now, I am just printing in the console, but in real code, I will be using KAZE algorithm to extract features from the images, store it in a list and append it to the Queue(Shared Memory) from all the processors. Now the script is running for few minutes but after some time the script is halted. It didn't run further. I tried to exit it but I couldn't. I think some deadlock might happen but I am not sure. I read online sources but couldn't figure out the solution and reason why it's happening?
For the full code, here is the gist link Full Source Code Link. What I am doing wrong here? I am new to Multiprocessing and MultiThreading. I would like to understand the concept in-depth. Links/Resources related to this topic are much appreciated.
UPDATE - The same code working perfectly on Mac OS.