Torch distributed broadcast and reduce between CPU/GPU devices - pytorch

Using the torch.distributed package, I am trying to move tensors from the CPU to GPU0 and GPU1 in two separate processes and then update the master version (on CPU).
Assume I have two GPUs connected: one as device 0, the other as device 1.
Store a very large array on CPU (something that can't fit onto a single device/gpu)
X = [1,2,3,4,5,6] for example.
Broadcast parts of the array to GPU device 0 and GPU device 1, so that devices 0 and 1 hold different chunks of that array.
GPU0 Inds = [0,1] GPU0 data = [1,2]
GPU1 Inds = [2,3] GPU1 data = [3,4]
Run a process on GPU0 and GPU1 independently. For this purpose a simple Add() function will do.
Update the CPU version with GPU data where necessary (for the inds that GPU grabbed). This is where I would probably use reduce to get both tensors from the devices. I would probably store it in a key-value dict where key is the device ID (0 for GPU 0, 1 for GPU 1) and have the inds and data stored in a tuple. Then I need to update the CPU version and run the whole process again.
What I am trying to do can be seen in this diagram.
I plan on using an NCCL backend which apparently supports broadcast and reduce.
My code should look something like this:
main() spawns the two processes and holds the CPU tensor
foo() runs in each spawned process and performs the broadcast and update steps between the devices (what I want to do in the diagram)
def main():
    args.world_size = 2
    args.backend = 'nccl'
    os.environ['MASTER_ADDR'] = '127.0.0.2'
    os.environ['MASTER_PORT'] = '29500'
    args.data = t.Tensor([1, 2, 3, 4, 5, 6])
    mp.spawn(foo, nprocs=2, args=(args,), join=True)

def foo(rank, args):
    data = args.data  # cpu data
    dist.init_process_group(backend=args.backend, init_method='env://',
                            world_size=args.world_size, rank=rank)
    if rank == 0:
        inds = [0, 1]
    elif rank == 1:
        inds = [2, 3]
    gpu_data = data[inds].cuda(rank)  # sends the slice to GPU0 or GPU1
                                      # probably need torch.distributed.broadcast here, but I don't know how
    gpu_data += 1
    data[inds].data = gpu_data.data   # this would be the update step
    dist.destroy_process_group()
    print('data: ', data)  # >> is not [2,3,4,5,5,6], which is what it should be

From the looks of it, you want three replicas: one on CPU and two on GPUs. If that's the case, you probably need a different backend than NCCL, since NCCL only works with CUDA tensors; Gloo, for example, supports CPU tensors. Please take a look at the backend capability table at https://pytorch.org/docs/stable/distributed.html
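Here is a minimal sketch of that suggestion, not the poster's exact setup: the function name worker, the gather-based update, and the use of dist.gather are my assumptions. Each rank moves its slice to its own GPU, runs the add, and the Gloo backend (which operates on CPU tensors) gathers the updated slices back into the master copy held by rank 0.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(backend='gloo', init_method='env://',
                            world_size=world_size, rank=rank)

    # Every rank builds the same CPU tensor; rank 0 keeps the authoritative copy.
    data = torch.tensor([1., 2., 3., 4., 5., 6.])
    inds = [0, 1] if rank == 0 else [2, 3]

    chunk = data[inds].cuda(rank)   # move this rank's slice to its own GPU
    chunk += 1                      # the per-GPU work (a simple add)

    # Gloo collectives work on CPU tensors, so bring the slice back before gathering.
    gathered = [torch.zeros(2) for _ in range(world_size)] if rank == 0 else None
    dist.gather(chunk.cpu(), gather_list=gathered, dst=0)

    if rank == 0:
        data[0:2] = gathered[0]
        data[2:4] = gathered[1]
        print('updated master copy:', data)   # tensor([2., 3., 4., 5., 5., 6.])

    dist.destroy_process_group()

if __name__ == '__main__':
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    mp.spawn(worker, args=(2,), nprocs=2, join=True)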

Related

multiprocessing starts off fast and drastically slows down

I'm trying to train a forecasting model on several backtest dates and model parameters. I wrote a custom function that basically takes an average of ARIMA, ETS, and a few other univariate and multivariate forecasting models from a dataset that's about 10 years of quarterly data (40 data points). I want to run this model in parallel on thousands of different combinations.
The custom model I wrote looks like this
def train_test_func(model_params):
    data = read_data_from_pickle()
    data_train, data_test = train_test_split(data, backtest_date)
    model1 = ARIMA.fit(data_train)
    data_pred1 = model1.predict(len(data_test))
    ...
    results = error_eval(data_pred1, ..., data_pred_i, data_test)
    save_to_aws_s3(results)
    logger.info("log steps here")
My multiprocessing script looks like this:
# Custom function I wrote that trains and tests
from my_custom_model import train_test_func
import multiprocessing

commands = []

if __name__ == '__main__':
    for backtest_date in target_backtest_dates:
        for param_a in target_drugs:
            for param_b in param_b_options:
                for param_c in param_c_options:
                    args = {
                        "backtest_date": backtest_date,
                        "param_a": param_a,
                        "param_b": param_b,
                        "param_c": param_c
                    }
                    commands.append(args)

    count = multiprocessing.cpu_count()
    with multiprocessing.get_context("spawn").Pool(processes=count) as pool:
        pool.map(train_test_func, commands)
I get relatively fast results for the first 200 or so iterations, roughly 50 iterations per minute. Then it drastically slows down to about 1 iteration per minute. For reference, running this on a single core gets me about 5 iterations per minute. Each task is independent and uses a relatively small dataset (40 data points); none of the tasks depend on each other, they are completely standalone.
Can anyone help me understand where I'm going wrong with multiprocessing? Is there enough information here to identify the problem? At the moment, the multiprocessing version is slower than the single-core version.
Attaching performance output
I found the answer. Basically, my model uses numpy, which by default is configured to use multiple cores for its linear algebra routines. The clue was in the per-process CPU usage shown by the top command.
This Stack Overflow post led me to the correct answer. I added this code block to the top of my scripts that use numpy:
import os
ncore = "1"
os.environ["OMP_NUM_THREADS"] = ncore
os.environ["OPENBLAS_NUM_THREADS"] = ncore
os.environ["MKL_NUM_THREADS"] = ncore
os.environ["VECLIB_MAXIMUM_THREADS"] = ncore
os.environ["NUMEXPR_NUM_THREADS"] = ncore
import numpy
...
The key is that you have to set these configurations before you import numpy.
Performance increased from 50 cycles/min to 150 cycles/min, with no throttling after the first few minutes. CPU usage also improved, with no process exceeding 100%.
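An alternative that is not part of this answer, but achieves the same effect at runtime, is the threadpoolctl package: it can cap the BLAS/OpenMP thread pools from inside the worker function even after numpy has been imported. A hedged sketch, reusing the hypothetical helpers from the question:

from threadpoolctl import threadpool_limits

def train_test_func(model_params):
    # keep each pool worker single-threaded so the workers don't fight over cores
    with threadpool_limits(limits=1):
        data = read_data_from_pickle()
        data_train, data_test = train_test_split(data, model_params["backtest_date"])
        # ... fit ARIMA/ETS etc. and evaluate as before ...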

Proper way to log things when using Pytorch Lightning DDP

I was wondering what is the proper way of logging metrics when using DDP. I noticed that if I want to print something inside validation_epoch_end it will be printed twice when using 2 GPUs. I was expecting validation_epoch_end to be called only on rank 0 and to receive the outputs from all GPUs, but I am not sure this is correct anymore. Therefore I have several questions:
validation_epoch_end(self, outputs) - When using DDP, does every subprocess receive the data processed from the current GPU or the data processed from all GPUs, i.e. does the input parameter outputs contain the outputs of the entire validation set, from all GPUs?
If outputs is GPU/process specific what is the proper way to calculate any metric on the entire validation set in validation_epoch_end when using DDP?
I understand that I can solve the printing by checking self.global_rank == 0 and printing/logging only in that case, however I am trying to get a deeper understanding of what I am printing/logging in this case.
Here is a code snippet from my use case. I would like to be able to report f1, precision and recall on the entire validation dataset and I am wondering what is the correct way of doing it when using DDP.
def _process_epoch_outputs(self,
                           outputs: List[Dict[str, Any]]
                           ) -> Tuple[torch.Tensor, torch.Tensor]:
    """Creates and returns tensors containing all labels and predictions

    Goes over the outputs accumulated from every batch, detaches the
    necessary tensors and stacks them together.

    Args:
        outputs (List[Dict])
    """
    all_labels = []
    all_predictions = []

    for output in outputs:
        for labels in output['labels'].detach():
            all_labels.append(labels)
        for predictions in output['predictions'].detach():
            all_predictions.append(predictions)

    all_labels = torch.stack(all_labels).long().cpu()
    all_predictions = torch.stack(all_predictions).cpu()

    return all_predictions, all_labels

def validation_epoch_end(self, outputs: List[Dict[str, Any]]) -> None:
    """Logs f1, precision and recall on the validation set."""

    if self.global_rank == 0:
        print(f'Validation Epoch: {self.current_epoch}')

    predictions, labels = self._process_epoch_outputs(outputs)
    for i, name in enumerate(self.label_columns):
        f1, prec, recall, t = metrics.get_f1_prec_recall(predictions[:, i],
                                                         labels[:, i],
                                                         threshold=None)
        self.logger.experiment.add_scalar(f'{name}_f1/Val',
                                          f1,
                                          self.current_epoch)
        self.logger.experiment.add_scalar(f'{name}_Precision/Val',
                                          prec,
                                          self.current_epoch)
        self.logger.experiment.add_scalar(f'{name}_Recall/Val',
                                          recall,
                                          self.current_epoch)

        if self.global_rank == 0:
            print((f'F1: {f1}, Precision: {prec}, '
                   f'Recall: {recall}, Threshold {t}'))
Questions
validation_epoch_end(self, outputs) - When using DDP, does every subprocess receive the data processed from the current GPU or the data processed from all GPUs, i.e. does the input parameter outputs contain the outputs of the entire validation set, from all GPUs?
Each process receives only the data processed on its own GPU; outputs are not synchronized. There is only backward synchronization (gradients are synchronized during training and distributed to the replicas of the model residing on each GPU).
Imagine all of the outputs being passed from 1000 GPUs to this poor master; it could very easily run out of memory.
If outputs is GPU/process specific what is the proper way to calculate
any metric on the entire validation set in validation_epoch_end when
using DDP?
According to documentation (emphasis mine):
When validating using an accelerator that splits data from each batch across GPUs, sometimes you might need to aggregate them on the master GPU for processing (dp, or ddp2).
And here is the accompanying code (in this case validation_epoch_end receives data accumulated across multiple GPUs from a single step; also see the comments):
# Done per-process (GPU)
def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.model(x)
    loss = F.cross_entropy(y_hat, y)
    pred = ...
    return {'loss': loss, 'pred': pred}

# Gathered data from all processes (per single step)
# Allows for accumulation so the whole data at the end of epoch
# takes less memory
def validation_step_end(self, batch_parts):
    gpu_0_prediction = batch_parts.pred[0]['pred']
    gpu_1_prediction = batch_parts.pred[1]['pred']

    # do something with both outputs
    return (batch_parts[0]['loss'] + batch_parts[1]['loss']) / 2

def validation_epoch_end(self, validation_step_outputs):
    for out in validation_step_outputs:
        ...  # do something with preds
Tips
Focus on per-device calculations and keep the number of between-GPU transfers as small as possible
Inside validation_step (or training_step if that's what you want, this is general) calculate f1, precision, recall and whatever else on a per-batch basis
Return those values (say, as a dict). Now each device returns 3 numbers instead of (batch, outputs), which could be significantly larger
Inside validation_step_end get those 3 values (actually a (2, 3) tensor if you have 2 GPUs), sum or average them, and return 3 values
Now validation_epoch_end will get (steps, 3) values that you can use to accumulate
It would be even better if, instead of operating on a list of values during validation_epoch_end, you could accumulate them into another 3 values (say you have a lot of validation steps, so the list could grow too large), but this should be enough; a sketch of this pattern follows below
AFAIK PyTorch-Lightning doesn't do this (e.g. instead of adding to list, apply some accumulator directly), but I might be mistaken, so any correction would be great.
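A rough sketch of those tips follows; it is my own illustration, not code from the thread. The confusion-count metric, the class name LitClassifier and its toy linear layer are assumptions, and it assumes the dp/ddp2-style validation_step_end aggregation quoted from the docs above.

import torch
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 1)

    def forward(self, x):
        return self.net(x).squeeze(-1)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = (torch.sigmoid(self(x)) > 0.5).long()
        # return per-batch confusion counts: 3 scalars per device instead of raw outputs
        tp = ((preds == 1) & (y == 1)).sum()
        fp = ((preds == 1) & (y == 0)).sum()
        fn = ((preds == 0) & (y == 1)).sum()
        return {'tp': tp, 'fp': fp, 'fn': fn}

    def validation_step_end(self, batch_parts):
        # under dp/ddp2 each value carries one entry per GPU; collapse them here
        return {k: v.sum() for k, v in batch_parts.items()}

    def validation_epoch_end(self, step_outputs):
        tp = sum(out['tp'] for out in step_outputs)
        fp = sum(out['fp'] for out in step_outputs)
        fn = sum(out['fn'] for out in step_outputs)
        precision = tp / (tp + fp + 1e-8)
        recall = tp / (tp + fn + 1e-8)
        f1 = 2 * precision * recall / (precision + recall + 1e-8)
        self.log('val_f1', f1, sync_dist=True)  # let Lightning reduce across ranks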

How to create a ring network in NEST?

I am trying to develop a simple model in NEST: a network of 10 identical neurons connected to one another so that they form a loop. I would like to use arrays to build this network, but I get error messages about mismatched argument types. Below I copy my code:
# SIMPLE NET -- FIRST TRIAL
# First example of a net made by 10 exicitatory neurons
# modeled by IAF models, which are connected in a loop
# and each neuron receives a synaptic current of double
# exponential type with rise and decay times respectively
# of 0.5 s. and 3 s.
###############################################################################
# Import the necessary modules
import pylab
import nest
import nest.raster_plot
import nest.voltage_trace
import numpy as np
###############################################################################
# Create the NODES
nest.SetDefaults("iaf_psc_alpha",
                 {"C_m": 10.0,
                  "tau_m": 15.58,
                  "t_ref": 2.0,
                  "E_L": -65.0,
                  "V_th": -40.0,
                  "V_reset": -65.0})
neurons = nest.Create("iaf_psc_alpha",10)
###############################################################################
# SYNAPTIC CURRENTS NODES
exc = 0.5
ini = 3.0
Istim = 0.0
Istim1 = 20.0
nest.SetStatus(neurons[:1], {"tau_syn_ex": exc, "tau_syn_in": ini, "I_e": Istim1})
nest.SetStatus(neurons[1:], {"tau_syn_ex": exc, "tau_syn_in": ini, "I_e": Istim})
###############################################################################
# OUTPUT DEVICE
voltmeter1 = nest.Create("voltmeter")
nest.Connect(voltmeter1, neurons[1])
voltmeter2 = nest.Create("voltmeter")
nest.Connect(voltmeter2, neurons[2])
voltmeter3 = nest.Create("voltmeter")
nest.Connect(voltmeter3, neurons[3])
spikes = nest.Create('spike_detector')
###############################################################################
# EXCITATORY CONNECTION BETWEEN NODES in a LOOP
weight = 200.0
delay = 1.0
#'excitatory',
for i in range(1,9,1):
    nest.Connect(neurons[i], neurons[i+1], syn_spec={'weight': weight, 'delay': delay})
    nest.Connect(neurons[10], neurons[1], syn_spec={'weight': weight, 'delay': delay})
###############################################################################
# SPIKE DETECTOR
nest.Connect(neurons, spikes)
###############################################################################
#SIMULATIONS AND OUTPUTS
nest.Simulate(400.0)
nest.voltage_trace.from_device(voltmeter1)
nest.voltage_trace.from_device(voltmeter2)
nest.voltage_trace.from_device(voltmeter3)
nest.voltage_trace.show()
nest.raster_plot.from_device(spikes, hist=True)
nest.raster_plot.show()
In Python, arrays are indexed from zero, so the ten neurons created are neurons[0] to neurons[9]. To close the ring you need to connect the last neuron, neurons[9], to the first, neurons[0]. neurons[10] does not exist.
Note also that you try to close the ring on every loop iteration (8 times), since the second Connect() call is also indented and is therefore inside the for loop.
If you use another Python feature, namely that negative indices count from the end of the array, your loop becomes very simple. Try something like this:
for i in range(10):
    print(f"connect {(i-1) % 10} -> {i}")
    nest.Connect([neurons[i-1]], [neurons[i]], syn_spec={'weight': weight, 'delay': delay})
(You can of course omit the print() call. I just added it to show what happens.)
As you see, this closes your ring automatically and you do not need the extra Connect() call. In Python lists and numpy arrays the convention is that neurons[-1] is the last element of the array, neurons[-2] is the one before the last, and so on. A nice side effect is that the 10 is easy to replace with a variable for the number of neurons, and there is no need anymore to make the range() call so complex.
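For example, here is the same loop with the ring size pulled out into a variable (the name n_neurons is mine, not from the original code):

n_neurons = 10
neurons = nest.Create("iaf_psc_alpha", n_neurons)

for i in range(n_neurons):
    nest.Connect([neurons[i - 1]], [neurons[i]],
                 syn_spec={'weight': weight, 'delay': delay})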
Additional note: For long rings it may be much faster to do array slicing:
nest.Connect(neurons[:-1], neurons[1:], 'one_to_one', syn_spec={'weight': weight, 'delay': delay})
nest.Connect([neurons[-1]], [neurons[0]], syn_spec={'weight': weight, 'delay': delay})
In this variant you need only two Connect() calls (and no Python for loop), so the connections can be generated by the NEST kernel without returning control to Python after each synapse. Generally, the fewer the calls, the better the kernel can make use of parallelization.

python - multiprocessing stops running after some batch operation

I am trying to do image processing on all the cores available on my machine (4 cores, 8 logical processors). I chose multiprocessing because this is a CPU-bound workload. As for the data: I have a CSV file that records file paths (local paths) and the image category (what the image shows). The CSV has exactly 9258 categories. My idea is to do batch processing: assign 10 categories to each process, loop through the images one by one, wait until all the processes complete their job, and then assign the next batch.
The categories are stored in this format: as_batches = [[C1, C2, ..., C10], [C11, C12, ..., C20], ..., [Cn-9, ..., Cn]]
Here is the function that starts the process.
def get_n_process(as_batches, processes, df, q):
    p = []
    for i in range(processes):
        work = Process(target=submit_job, args=(df, as_batches[i], q, i))
        p.append(work)
        work.start()
    as_batches = as_batches[processes:]
    return p, as_batches
Here is the main loop,
while len(as_batches) > 0:
    t = []
    # dynamically check the lists
    if len(as_batches) > 8:
        n_process = 8
    else:
        n_process = len(as_batches)
    print("For this it requires {} processes".format(n_process))
    process_obj_inlist, as_batches = get_n_process(as_batches, n_process, df, q)
    for ind_process in process_obj_inlist:
        ind_process.join()
    with open("logs.txt", "a") as f:
        f.write("\n")
        f.write("Log Recording at: {timestamp}, Remaining N = {remaining} yet to be processed".format(
            timestamp=datetime.datetime.now(),
            remaining=len(as_batches)
        ))
For log purposes, I am writing into a text file to see how many categories are there to process yet. And here is the main function
def do_something(fromprocess):
    time.sleep(1)
    print("Operation Ended for Process: {}, process Id: {}".format(
        current_process().name, os.getpid()
    ))
    return "msg"

def submit_job(df, list_items, q, fromprocess):
    a = []
    for items in list_items:
        oneitemdf = df[df['MinorCategory'] == items]['FilePath'].values.tolist()
        oneitemdf = [x for x in oneitemdf if x.endswith('.png')]
        result = do_something(fromprocess)
        a.append(result)
    q.put(a)
For now I am just printing to the console, but in the real code I will be using the KAZE algorithm to extract features from the images, store them in a list, and append that list to the Queue (shared memory) from all the processes. The script runs for a few minutes but then halts and doesn't run any further. I tried to exit it but couldn't. I think some deadlock might be happening, but I am not sure. I read online sources but couldn't figure out why it's happening or how to solve it.
For the full code, here is the gist: Full Source Code Link. What am I doing wrong here? I am new to multiprocessing and multithreading and would like to understand the concepts in depth. Links/resources related to this topic are much appreciated.
UPDATE - The same code works perfectly on macOS.
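One documented pitfall that matches these symptoms, offered here only as a hedged sketch and not as the poster's actual fix: the Python multiprocessing docs warn that joining a process which has put items on a multiprocessing.Queue can hang unless the queue is drained first, because the child cannot terminate until its queued data has been consumed. A minimal pattern that avoids this is to read from the queue before joining (the worker body below is a placeholder, not the real feature-extraction code):

import multiprocessing as mp

def worker(q, batch):
    # ... extract features for every image in `batch` ...
    q.put(["features for " + c for c in batch])   # placeholder result

if __name__ == '__main__':
    q = mp.Queue()
    batches = [['C1', 'C2'], ['C3', 'C4']]        # stand-in for the real category batches
    procs = [mp.Process(target=worker, args=(q, b)) for b in batches]
    for p in procs:
        p.start()

    # Drain the queue BEFORE joining: a child that has put data on the queue
    # will not terminate until that data has been consumed, so joining first can hang.
    results = [q.get() for _ in procs]

    for p in procs:
        p.join()
    print(results)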

Python: Spacy and memory consumption

1 - THE PROBLEM
I'm using "spacy" on python for text documents lemmatization.
There are 500,000 documents having size up to 20 Mb of clean text.
The problem is the following: spacy memory consuming is growing in time till the whole memory is used.
2 - BACKGROUND
My hardware configuration:
CPU: Intel i7-8700K 3.7 GHz (6 cores / 12 threads)
Memory: 16 GB
SSD: 1 TB
GPU is onboard but is not used for this task
I'm using multiprocessing to split the task among several processes (workers).
Each worker receives a list of documents to process.
The main process monitors the child processes.
I initialize spaCy in each child process once and use this single spaCy instance to handle the whole list of documents in the worker.
Memory tracing says the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
3 - EXPECTATIONS
I have seen a good recommendation to build a separate server-client solution here: Is possible to keep spacy in memory to reduce the load time?
Is it possible to keep memory consumption under control with the multiprocessing approach?
4 - THE CODE
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep
# START: memory trace
tracemalloc.start()
# Load spacy
spacyMorph = spacy.load("en_core_web_sm")
#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput

#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send the worker's current progress to the main process
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        if lock is not None:
            lock.acquire()
            try:
                conn.send(statusMessage)
                documentCount += 1
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity's sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------

        wordCount = 1
        wordTotalCount = len(wordList)
        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save them to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()
if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity's sake
    # Here I'm getting the full list of files "fileTotalList" which I need to lemmatize
    # ----------------

    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate them with the list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)   # worker total progress
        processData['count'] = 0               # worker documents done count
        processData['currentDocID'] = 0        # current document ID the worker is working on
        processData['comment'] = ''            # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName,
                                         args=(lock, processData['con_child'], [processName, fileList]))

        processList.append(processData)
        processData['handler'].start()

        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        #
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED  - worker has finished its job and closed the pipe connection
        #
        # COMMENTS for WORKING status:
        # DOCID,COUNT,COMMENTS
        # DOCID    - current document ID the worker is working on
        # COUNT    - count of done documents
        # COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # ... and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]
                        # ----------------
                        # Some code is excluded for clarity's sake
                        # Update the worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity's sake
                # Here I draw some progress lines per worker
                # ----------------
            else:
                # worker has finished its job. Close the connection.
                process['con_parent'].close()

        # Wait for some time and monitor again
        # (leaving this loop once all workers have finished is part of the excluded code)
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')

    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)
For people who land on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing

docs = ['Your documents']

def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()

    if not n_processes:
        n_processes = multiprocessing.cpu_count()

    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]

    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass

for x in range(10):
    # This will spin up a subprocess,
    # and every time it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess so all the memory gets released once the subprocess finishes. I know it's working because I can actually watch the memory get released back to the instance every time the subprocess finishes (also the instance no longer crashes xD).
Full disclosure: I have no idea why spaCy's memory use seems to grow over time. I've read all over trying to find a simple answer, and all the GitHub issues I've seen claim they've fixed the issue, yet I still see this happening when I use spaCy on AWS SageMaker instances.
Hope this helps someone! I know I spent hours pulling my hair out over this.
Credit to another SO answer that explains a bit more about subprocesses in Python.
Memory leaks with spacy
Memory problems when processing large amounts of data seem to be a known issue, see some relevant github issues:
https://github.com/explosion/spaCy/issues/3623
https://github.com/explosion/spaCy/issues/3556
Unfortunately, it doesn't look like there's a good solution yet.
Lemmatization
Looking at your particular lemmatization task, I think your example code is a bit over-simplified, because you're running the full spacy pipeline on single words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want to lemmatize, so in general, you want to disable the parts of the pipeline that you're not using as much as possible (especially parsing if you're only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle really long documents if you're using the parser or entity recognition, so you'll need to break up your texts somehow (or for just lemmatization/tagging you can just increase nlp.max_length as much as you need).
Breaking documents into individual words as in your example kind of defeats the purpose of most of spacy's analysis (you often can't meaningfully tag or parse single words), and it's also going to be very slow to call spacy this way.
Lookup lemmatization
If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can see if the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
It is just a static lookup table, so it won't do well on unknown words or on capitalization for words like "wugs" or "DUCKS", so you'll have to see whether it works well enough for your texts, but it would be much, much faster and without memory leaks. (You could also just use the table yourself without spacy, it's here: https://github.com/michmech/lemmatization-lists.)
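If you go that route, a rough sketch of using such a table directly (the file name and the tab-separated lemma/form layout are assumptions based on the linked repository):

lookup = {}
with open("lemmatization-en.txt", encoding="utf-8") as f:
    for line in f:
        lemma, form = line.rstrip("\n").split("\t")
        lookup[form] = lemma

print(lookup.get("ducks", "ducks"))   # fall back to the word itself for unknown words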
Better lemmatization
Otherwise, use something more like this to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR

for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
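As a small, hedged sketch of the "break up your texts somehow" suggestion (very_long_text is a placeholder, the chunk size is an arbitrary assumption, and naive character chunking can split sentences), one very long text can be fed to nlp.pipe in pieces:

def chunk_text(text, max_chars=100_000):
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]

lemmas = []
for doc in nlp.pipe(chunk_text(very_long_text)):
    lemmas.extend(token.lemma_ for token in doc)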
