Reduce execution time when parsing files - multithreading

I need to write a Python 2.7 script that parses 4 files.
It needs to be as fast as possible.
For the moment I use a loop and parse the 4 files one after another.
I need to understand one thing: if I created 4 parsing scripts (one for each file) and launched them in 4 different terminals, would that reduce the execution time or not?
Thanks,

If you have a slow PC, yes, it will reduce the execution time.
I suggest you use multithreading in every script to speed things up:
import threading
import time

def main():
    starttime = time.time()
    for x in range(1, 10000):  # print the numbers 1..9999 as a stand-in workload
        print x
    endtime = time.time()  # read the clock after the work, not before it
    print "Time spent: " + str(round(endtime - starttime, 2))  # show the elapsed time

threads = []
t = threading.Thread(target=main)
threads.append(t)
t.start()
t.join()  # wait for the thread to finish
You can then compare the timing with and without threading.
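One caveat worth checking: in CPython, threads share the global interpreter lock, so CPU-bound parsing usually gains little from threading, whereas separate processes (which is what four terminals give you) can run on separate cores. A minimal sketch of the process-based variant follows; `parse_file` is a hypothetical stand-in that only counts non-empty lines, and the worker count of 4 is an assumption taken from the question.

```python
import multiprocessing


def parse_file(path):
    """Hypothetical stand-in parser: count the non-empty lines in one file."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.strip():
                count += 1
    return path, count


def parse_all(paths, workers=4):
    """Parse every file in its own worker process and collect the results."""
    pool = multiprocessing.Pool(processes=workers)
    try:
        # map() blocks until all files are parsed and keeps the input order.
        return pool.map(parse_file, paths)
    finally:
        pool.close()
        pool.join()
```

Call `parse_all([...])` with your four file names from inside an `if __name__ == "__main__":` guard, and time it against a plain loop over `parse_file` to see whether it helps on your machine.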

Related

multiprocessing starts off fast and drastically slows down

I'm trying to train a forecasting model on several backtest dates and model parameters. I wrote a custom function that basically takes an average of ARIMA, ETS, and a few other univariate and multivariate forecasting models from a dataset that's about 10 years of quarterly data (40 data points). I want to run this model in parallel on thousands of different combinations.
The custom model I wrote looks like this
def train_test_func(model_params):
    data = read_data_from_pickle()
    data_train, data_test = train_test_split(data, model_params["backtest_date"])
    model1 = ARIMA.fit(data_train)
    data_pred1 = model1.predict(len(data_test))
    ...
    results = error_eval(data_pred1, ..., data_pred_i, data_test)
    save_to_aws_s3(results)
    logger.info("log steps here")
My multiprocessing script looks like this:
import multiprocessing

# Custom function I wrote that trains and tests
from my_custom_model import train_test_func

commands = []
if __name__ == '__main__':
    for backtest_date in target_backtest_dates:
        for param_a in target_drugs:
            for param_b in param_b_options:
                for param_c in param_c_options:
                    args = {
                        "backtest_date": backtest_date,
                        "param_a": param_a,
                        "param_b": param_b,
                        "param_c": param_c
                    }
                    commands.append(args)
    count = multiprocessing.cpu_count()
    with multiprocessing.get_context("spawn").Pool(processes=count) as pool:
        # the original snippet passed batched_args, which is not defined here
        pool.map(train_test_func, commands)
I can get relatively fast results for the first 200 or so iterations, roughly 50 iterations per min. Then, it drastically slows down to ~1 iteration per minute. For reference, running this on a single core gets me about 5 iterations per minute. Each process is independent and uses a relatively small dataset (40 data points). None of the processes need to depend on each other, either--they are completely standalone.
Can anyone help me understand where I'm going wrong with multiprocessing? Is there enough information here to identify the problem? At the moment, the multiprocessing versions are slower than single core versions.
Attaching performance output

I found the answer. Basically my model uses numpy, which, by default, is configured to use multicore. The clue was in my CPU usage from the top command.
This stackoverflow post led me to the correct answer. I added this code block to the top of my scripts that use numpy:
import os
ncore = "1"
os.environ["OMP_NUM_THREADS"] = ncore
os.environ["OPENBLAS_NUM_THREADS"] = ncore
os.environ["MKL_NUM_THREADS"] = ncore
os.environ["VECLIB_MAXIMUM_THREADS"] = ncore
os.environ["NUMEXPR_NUM_THREADS"] = ncore
import numpy
...
The key being that you have to add these configurations before you import numpy.
Performance increased from 50 cycles/min to 150 cycles/min, and I no longer saw any throttling after a few minutes. CPU usage also improved, with no process exceeding 100%.
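The reason this helps can be sketched as simple arithmetic: if every worker process lets its numerical backend start one thread per core, the thread count multiplies. The helper names below (`limit_math_threads`, `total_threads`) and the 8-core machine are assumptions for illustration only; the environment variable names are the ones from the answer above.

```python
import os

# Environment variables read by common numerical backends at import time.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_math_threads(threads_per_process=1):
    """Pin each listed backend to a fixed thread count for this process."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(threads_per_process)


def total_threads(worker_processes, threads_per_process):
    """Worst-case number of compute threads competing for the CPU cores."""
    return worker_processes * threads_per_process


limit_math_threads(1)  # must run before `import numpy`
# Hypothetical 8-core machine: 8 workers x 8 backend threads vs. 8 x 1.
print(total_threads(8, 8), total_threads(8, 1))
```

With one backend thread per worker, the number of busy threads matches the number of pool processes instead of exceeding the core count several times over.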

Why is my python time comparison wrong every minute?

Hello everyone, I'm making a time comparison project.
Basically, what I'm trying to do is have my code take a reading of the time and then take a separate reading. If the system were to go to sleep for any length of time, the code would see that a time difference has occurred.
The problem I'm running into is that the if statement triggers every minute with a value of 40 seconds plus the 1-second delay I added, and I don't know why.
Any thoughts?
import datetime
import time

while True:
    current_time_A = datetime.datetime.now()
    print("current time A ",
          int(current_time_A.strftime("%H%M%S")))
    time.sleep(1)
    current_time_B = datetime.datetime.now()
    print("current time B ",
          int(current_time_B.strftime("%H%M%S")))
    time_elapsed = ((int(current_time_B.strftime("%H%M%S"))) - (int(current_time_A.strftime("%H%M%S"))))
    print("time_elapsed = ",time_elapsed)
    if time_elapsed >= 5:
        print("time changed more then 5 seconds")
Here's the output you will get
current time A 1959
current time B 2000
time_elapsed = 41
time changed more then 5 seconds
current time A 2000
current time B 2001
1
current time A 2001

So the problem you're facing is due to how you're setting everything up. Basically, you're getting the 2 times' hour, minute, and second as a string, converting them to an int, and then subtracting their differences. Seconds only go up to 59 before going back to 00. So, when current_time_A's second is 59 and current_time_B's second is 00, the difference comes out to be 41 (because the minute changes as well).
Instead of converting back and forth between types (datetime.datetime -> str -> int), I suggest using time.time() to get the timestamp and do your calculations on that:
import time

while True:
    current_time_A = time.time()
    print(f"Current time A: {current_time_A}")
    time.sleep(1)
    current_time_B = time.time()
    print(f"Current time B: {current_time_B}")
    # calculate how many seconds have elapsed between the two readings
    elapsed_time = current_time_B - current_time_A
    print(f"Elapsed time: {elapsed_time}")
    # round to the nearest second
    print("Elapsed time (rounded):", round(elapsed_time))
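To make the arithmetic concrete, here is a small sketch comparing the digit-string subtraction with a real datetime subtraction. The two timestamps are made up for illustration; they are one second apart and straddle a minute boundary, like the output in the question.

```python
from datetime import datetime

# Hypothetical readings one second apart, straddling a minute boundary.
time_a = datetime(2024, 1, 1, 0, 19, 59)
time_b = datetime(2024, 1, 1, 0, 20, 0)

# Treating HHMMSS as a base-10 integer: 2000 - 1959
digit_diff = int(time_b.strftime("%H%M%S")) - int(time_a.strftime("%H%M%S"))

# Subtracting datetime objects yields a timedelta with the true duration.
real_diff = (time_b - time_a).total_seconds()

print(digit_diff, real_diff)
```

The digit-based difference comes out as 41 while the timedelta reports 1.0 second, which is why the check fires once a minute.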

python - multiprocessing stops running after some batch operation

I am trying to do image processing on all the cores available in my machine (which has 4 cores and 8 logical processors). I chose multiprocessing because the workload is CPU-bound. As for the data, I have a CSV file that records each image's local file path and its image category (what the image shows). The CSV has exactly 9258 categories. My idea is to do batch processing: assign 10 categories to each processor, loop through the images one by one, wait until all the processors complete their jobs, and then assign the next batch.
The categories are stored in this format: as_batches = [[C1, C2, ..., C10], [C11, C12, C13, ..., C20], [Cn-10, Cn-9,..., Cn]]
Here is the function that starts the process.
def get_n_process(as_batches, processes, df, q):
    p = []
    for i in range(processes):
        work = Process(target=submit_job, args=(df, as_batches[i], q, i))
        p.append(work)
        work.start()
    as_batches = as_batches[processes:]
    return p, as_batches
Here is the main loop,
while(len(as_batches) > 0):
    t = []
    # dynamically check the lists
    if len(as_batches) > 8:
        n_process = 8
    else:
        n_process = len(as_batches)
    print("For this it requires {} processes".format(n_process))
    process_obj_inlist, as_batches = get_n_process(as_batches, n_process, df, q)
    for ind_process in process_obj_inlist:
        ind_process.join()
    # the with block closes the file, so no explicit f.close() is needed
    with open("logs.txt", "a") as f:
        f.write("\n")
        f.write("Log Recording at: {timestamp}, Remaining N = {remaining} yet to be processed".format(
            timestamp=datetime.datetime.now(),
            remaining = len(as_batches)
        ))
For log purposes, I am writing into a text file to see how many categories are there to process yet. And here is the main function
def do_something(fromprocess):
    time.sleep(1)
    print("Operation Ended for Process:{}, process Id:{}".format(
        current_process().name, os.getpid()
    ))
    return "msg"

def submit_job(df, list_items, q, fromprocess):
    a = []
    for items in list_items:
        oneitemdf = df[df['MinorCategory']==items]['FilePath'].values.tolist()
        oneitemdf = [x for x in oneitemdf if x.endswith('.png')]
        result = do_something(fromprocess)
        a.append(result)
    q.put(a)
For now, I am just printing to the console, but in the real code I will use the KAZE algorithm to extract features from the images, store them in a list, and append it to the Queue (shared memory) from all the processes. The script runs for a few minutes, but after some time it halts and does not run any further. I tried to exit it but couldn't. I think a deadlock might be happening, but I am not sure. I read online sources but couldn't figure out the reason or the solution.
For the full code, here is the gist link Full Source Code Link. What am I doing wrong here? I am new to multiprocessing and multithreading. I would like to understand the concept in depth. Links/resources related to this topic are much appreciated.
UPDATE - The same code working perfectly on Mac OS.
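One documented pitfall matches this symptom, though it is an assumption about the cause rather than a confirmed diagnosis: the Python multiprocessing guidelines warn that a process which has put items on a Queue may not terminate until those items have been flushed to the underlying pipe, so joining it before the parent reads the queue can hang. A minimal sketch that drains the queue first; `worker` and `run_batch` are hypothetical names, and `worker` merely stands in for submit_job.

```python
import multiprocessing


def worker(items, q):
    """Hypothetical stand-in for submit_job: put one result list on the queue."""
    q.put([item * 2 for item in items])


def run_batch(batches):
    """Start one process per batch, drain the queue, and only then join."""
    q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=worker, args=(batch, q))
             for batch in batches]
    for p in procs:
        p.start()
    # Read exactly one result per process BEFORE joining, so no child is
    # left blocked while flushing its queued data to the underlying pipe.
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return results
```

Results arrive in completion order, not batch order, so include the batch index in each result if the order matters. Call `run_batch` from inside an `if __name__ == "__main__":` guard.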

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters; a starting value of the number of ambulances in my system starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, three calls for service….up to 100 calls for service. Second parameter is an arrival rate number which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )

#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()

#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))

The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note, your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101): # The end value is exclusive, so it's 101 not 100
    probability_arrivals.append([s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)])
    print(probability_arrivals[-1])
Now you don't need to manually worry about incrementing the counter.
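Putting the pieces together, here is a minimal sketch of how the returned list could be captured and processed further. It keeps the question's function name and CSV file name; the cumulative list and the variable names `rows` and `cumulative` are illustrative assumptions, not part of the original code.

```python
import csv
import math


def probability_x(start_value=0, arrival_rate=0):
    """Return [n, P(n arrivals)] rows for n = start_value..100 (Poisson)."""
    rows = []
    for n in range(start_value, 101):
        p = math.pow(arrival_rate, n) * math.exp(-arrival_rate) / math.factorial(n)
        rows.append([n, p])
    return rows


rows = probability_x(arrival_rate=5)

# Running total of the probabilities: cumulative[n] is P(at most n arrivals).
cumulative = []
total = 0.0
for n, p in rows:
    total += p
    cumulative.append([n, total])

# writerows() takes the whole list of rows at once.
with open("ExpectedProbability.csv", "w", newline="") as write_file:
    csv.writer(write_file).writerows(rows)

print(len(rows))
```

Because the function now returns all 101 rows, the same `rows` object can feed a plot, a formatted text file, or a filter on the cumulative values.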

Python: Spacy and memory consumption

1 - THE PROBLEM
I'm using spaCy in Python to lemmatize text documents.
There are 500,000 documents, each up to 20 MB of clean text.
The problem is the following: spaCy's memory consumption grows over time until all the memory is used.
2 - BACKGROUND
My hardware configuration:
CPU: Intel I7-8700K 3.7 GHz (12 cores)
Memory: 16 Gb
SSD: 1 Tb
GPU is onboard but is not used for this task
I'm using "multiprocessing" to split the task among several processes (workers).
Each worker receives a list of documents to process.
The main process performs monitoring of child processes.
I initiate "spacy" in each child process once and use this one spacy instance to handle the whole list of documents in the worker.
Memory tracing says the following:
[ Memory trace - Top 10 ]
/opt/develop/virtualenv/lib/python3.6/site-packages/thinc/neural/mem.py:68: size=45.1 MiB, count=99, average=467 KiB
/opt/develop/virtualenv/lib/python3.6/posixpath.py:149: size=40.3 MiB, count=694225, average=61 B
:487: size=9550 KiB, count=77746, average=126 B
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:33: size=7901 KiB, count=6, average=1317 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_nouns.py:7114: size=5273 KiB, count=57494, average=94 B
prepare_docs04.py:372: size=4189 KiB, count=1, average=4189 KiB
/opt/develop/virtualenv/lib/python3.6/site-packages/dawg_python/wrapper.py:93: size=3949 KiB, count=5, average=790 KiB
/usr/lib/python3.6/json/decoder.py:355: size=1837 KiB, count=20456, average=92 B
/opt/develop/virtualenv/lib/python3.6/site-packages/spacy/lang/en/lemmatizer/_adjectives.py:2828: size=1704 KiB, count=20976, average=83 B
prepare_docs04.py:373: size=1633 KiB, count=1, average=1633 KiB
3 - EXPECTATIONS
I have seen a good recommendation to build a separate server-client solution in the question "Is possible to keep spacy in memory to reduce the load time?"
Is it possible to keep memory consumption under control using "multiprocessing" approach?
4 - THE CODE
Here is a simplified version of my code:
import gc, os, subprocess, spacy, sys, tracemalloc
from multiprocessing import Pipe, Process, Lock
from time import sleep

# START: memory trace
tracemalloc.start()

# Load spacy
spacyMorph = spacy.load("en_core_web_sm")

#
# Get word's lemma
#
def getLemma(word):
    global spacyMorph
    lemmaOutput = spacyMorph(str(word))
    return lemmaOutput

#
# Worker's logic
#
def workerNormalize(lock, conn, params):
    documentCount = 1
    for filenameRaw in params[1]:
        documentTotal = len(params[1])
        documentID = int(os.path.basename(filenameRaw).split('.')[0])

        # Send to the main process the worker's current progress
        # (the message is built first so the else branch can print it too)
        statusMessage = "WORKING:{:d},{:d},".format(documentID, documentCount)
        if not lock is None:
            lock.acquire()
            try:
                conn.send(statusMessage)
                documentCount += 1
            finally:
                lock.release()
        else:
            print(statusMessage)

        # ----------------
        # Some code is excluded for clarity sake
        # I've got a "wordList" from file "filenameRaw"
        # ----------------
        wordCount = 1
        wordTotalCount = len(wordList)
        for word in wordList:
            lemma = getLemma(word)
            wordCount += 1

        # ----------------
        # Then I collect all lemmas and save it to another text file
        # ----------------

        # Here I'm trying to reduce memory usage
        del wordList
        del word
        gc.collect()
if __name__ == '__main__':
    lock = Lock()
    processList = []

    # ----------------
    # Some code is excluded for clarity sake
    # Here I'm getting full list of files "fileTotalList" which I need to lemmatize
    # ----------------
    while cursorEnd < (docTotalCount + stepSize):
        fileList = fileTotalList[cursorStart:cursorEnd]

        # ----------------
        # Create workers and populate it with list of files to process
        # ----------------
        processData = {}
        processData['total'] = len(fileList)  # worker total progress
        processData['count'] = 0  # worker documents done count
        processData['currentDocID'] = 0  # current document ID the worker is working on
        processData['comment'] = ''  # additional comment (optional)
        processData['con_parent'], processData['con_child'] = Pipe(duplex=False)
        processName = 'worker ' + str(count) + " at " + str(cursorStart)
        processData['handler'] = Process(target=workerNormalize, name=processName, args=(lock, processData['con_child'], [processName, fileList]))
        processList.append(processData)
        processData['handler'].start()
        cursorStart = cursorEnd
        cursorEnd += stepSize
        count += 1

    # ----------------
    # Run the monitor to look after the workers
    # ----------------
    while True:
        runningCount = 0

        # Worker communication format:
        # STATUS:COMMENTS
        # STATUS:
        # - WORKING - worker is working
        # - CLOSED - worker has finished his job and closed pipe-connection
        # COMMENTS:
        # - for WORKING status:
        # DOCID,COUNT,COMMENTS
        # DOCID - current document ID the worker is working on
        # COUNT - count of done documents
        # COMMENTS - additional comments (optional)

        # ----------------
        # Run through the list of workers ...
        # ----------------
        for i, process in enumerate(processList):
            if process['handler'].is_alive():
                runningCount += 1

                # ----------------
                # .. and check if there is something in the PIPE
                # ----------------
                if process['con_parent'].poll():
                    try:
                        message = process['con_parent'].recv()
                        status = message.split(':')[0]
                        comment = message.split(':')[1]
                        # ----------------
                        # Some code is excluded for clarity sake
                        # Update worker's information and progress in "processList"
                        # ----------------
                    except EOFError:
                        print("EOF----")

                # ----------------
                # Some code is excluded for clarity sake
                # Here I draw some progress lines per workers
                # ----------------
            else:
                # worker has finished his job. Close the connection.
                process['con_parent'].close()

        # Wait for some time and monitor again
        sleep(PARAM['MONITOR_REFRESH_FREQUENCY'])

    print("================")
    print("**** DONE ! ****")
    print("================")

    # ----------------
    # Here I'm measuring memory usage to find the most "gluttonous" part of the code
    # ----------------
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    print("[ Memory trace - Top 10 ]")
    for stat in top_stats[:10]:
        print(stat)

For people who land on this in the future, I found a hack that seems to work well:
import spacy
import en_core_web_lg
import multiprocessing

docs = ['Your documents']

def process_docs(docs, n_processes=None):
    # Load the model inside the subprocess,
    # as that seems to be the main culprit of the memory issues
    nlp = en_core_web_lg.load()
    if not n_processes:
        n_processes = multiprocessing.cpu_count()
    processed_docs = [doc for doc in nlp.pipe(docs, disable=['ner', 'parser'], n_process=n_processes)]
    # Then do what you wish beyond this point. I end up writing results out to s3.
    pass

for x in range(10):
    # This will spin up a subprocess,
    # and every time it finishes it will release all resources back to the machine.
    with multiprocessing.Manager() as manager:
        # args must be a tuple; (docs) without the trailing comma is just the list itself
        p = multiprocessing.Process(target=process_docs, args=(docs,))
        p.start()
        p.join()
The idea here is to put everything Spacy-related into a subprocess so all the memory gets released once the subprocess finishes. I know it's working because I can actually watch the memory get released back to the instance every time the subprocess finishes (also the instance no longer crashes xD).
Full disclosure: I have no idea why spaCy's memory usage seems to grow over time. I've read all over trying to find a simple answer, and all the GitHub issues I've seen claim the issue has been fixed, yet I still see this happening when I use spaCy on AWS SageMaker instances.
Hope this helps someone! I know I spent hours pulling my hair out over this.
Credit to another SO answer that explains a bit more about subprocesses in Python.
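A related standard-library option, which is not what the answer above uses, is the `maxtasksperchild` argument of `multiprocessing.Pool`: each worker process exits after completing that many tasks and is replaced with a fresh one. A rough sketch under that assumption; `chunked`, `run_with_recycled_workers`, and `handle_batch` are hypothetical names, and `handle_batch` would be your own top-level function that loads the spaCy model and processes one batch.

```python
import multiprocessing


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_with_recycled_workers(handle_batch, items, batch_size=100, processes=2):
    """Run handle_batch on each batch; every worker exits after one batch.

    handle_batch is a hypothetical, picklable top-level function that would
    load the spaCy model and process one batch of documents.
    """
    batches = chunked(items, batch_size)
    with multiprocessing.Pool(processes=processes, maxtasksperchild=1) as pool:
        # chunksize=1 so that one task corresponds to exactly one batch.
        return pool.map(handle_batch, batches, chunksize=1)
```

The trade-off is that the model is reloaded for every batch, so a larger `batch_size` spreads that cost over more documents while a smaller one returns memory more often.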

Memory leaks with spacy
Memory problems when processing large amounts of data seem to be a known issue, see some relevant github issues:
https://github.com/explosion/spaCy/issues/3623
https://github.com/explosion/spaCy/issues/3556
Unfortunately, it doesn't look like there's a good solution yet.
Lemmatization
Looking at your particular lemmatization task, I think your example code is a bit too over-simplified, because you're running the full spacy pipeline on single words and then not doing anything with the results (not even inspecting the lemma?), so it's hard to tell what you actually want to do.
I'll assume you just want to lemmatize, so in general, you want to disable the parts of the pipeline that you're not using as much as possible (especially parsing if you're only lemmatizing, see https://spacy.io/usage/processing-pipelines#disabling) and use nlp.pipe to process documents in batches. Spacy can't handle really long documents if you're using the parser or entity recognition, so you'll need to break up your texts somehow (or for just lemmatization/tagging you can just increase nlp.max_length as much as you need).
Breaking documents into individual words as in your example kind of defeats the purpose of most of spaCy's analysis (you often can't meaningfully tag or parse single words), plus it's going to be very slow to call spaCy this way.
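If you do need to break up long texts, one simple approach (a sketch, not a spaCy feature) is to cut at whitespace so that no piece exceeds a chosen character limit. The function name `split_text` and the limit are assumptions; a smarter splitter would cut at paragraph or sentence boundaries.

```python
def split_text(text, max_chars):
    """Split text into pieces of at most max_chars, cutting at whitespace when possible."""
    pieces = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text) and not text[end].isspace():
            # The window ends mid-word: prefer the last whitespace inside it.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        piece = text[start:end].strip()
        if piece:
            pieces.append(piece)
        start = end
    return pieces
```

The resulting pieces can then be passed to nlp.pipe as separate texts; a word longer than the limit is cut in the middle as a fallback.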
Lookup lemmatization
If you just need lemmas for common words out of context (where the tagger isn't going to provide any useful information), you can see if the lookup lemmatizer is good enough for your task and skip the rest of the processing:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LOOKUP
lemmatizer = Lemmatizer(lookup=LOOKUP)
print(lemmatizer(u"ducks", ''), lemmatizer(u"ducking", ''))
Output:
['duck'] ['duck']
It is just a static lookup table, so it won't do well on unknown words or capitalization for words like "wugs" or "DUCKS", so you'll have to see if it works well enough for your texts, but it would be much much faster without memory leaks. (You could also just use the table yourself without spacy, it's here: https://github.com/michmech/lemmatization-lists.)
Better lemmatization
Otherwise, use something more like this to process texts in batches:
nlp = spacy.load('en', disable=['parser', 'ner'])
# if needed: nlp.max_length = MAX_DOC_LEN_IN_CHAR
for doc in nlp.pipe(texts):
    for token in doc:
        print(token.lemma_)
If you process one long text (or use nlp.pipe() for lots of shorter texts) instead of processing individual words, you should be able to tag/lemmatize (many) thousands of words per second in one thread.
