How to confirm multiprocessing library is being used? - python-3.x

I am trying to use multiprocessing for the code below. It seems to run a bit faster than the plain for loop inside the function.
How can I confirm that I am actually using the library and not just the for loop?
from multiprocessing import Pool
from multiprocessing import cpu_count
import requests
import pandas as pd

data = pd.read_csv('~/Downloads/50kNAE000.txt.1', sep="\t", header=None)
data = data[0].str.strip("0 ")
lst = []

def request(x):
    for i, v in x.items():
        print(i)
        file = requests.get(v)
        lst.append(file.text)
        #time.sleep(1)

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(request(data))
    pool.close()  # 'TERM'
    pool.join()   # 'KILL'

Multiprocessing has overhead: it has to start the worker processes and transfer the function's data to them through an inter-process mechanism. Running a single function in another process is always going to be slower than running that same function normally. The advantage comes from real parallelism, with enough work inside the function to make the overhead negligible.
You can call multiprocessing.current_process().name inside the worker function to see the process name change.
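For illustration, here is a minimal sketch (not the asker's code; the fetch function and example URLs are hypothetical) that prints multiprocessing.current_process().name from inside the worker, so you can see different process names when the pool is really being used:

from multiprocessing import Pool, cpu_count, current_process

def fetch(url):
    # The process name shows which pool worker handled this item.
    print(current_process().name, "handling", url)
    return url.upper()  # stand-in for the real work (e.g. requests.get)

if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(8)]  # hypothetical inputs
    with Pool(cpu_count()) as pool:
        # Pass the function and the iterable separately; pool.map fans the
        # items out to the workers and collects the results in input order.
        results = pool.map(fetch, urls)
    print(results)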

Related

How to control memory consumption while multithreading

I am trying to scrape some websites using Python's threading module and the thread-safe queue module. I am observing an increase in memory usage as I test on more URLs. Below is my code for reference:
from collections import defaultdict
from queue import Queue
from threading import Thread
import itertools
from time import time
import newspaper
import requests
import pickle

data = defaultdict(list)

def get_links():
    return (url for url in pickle.load(open('urls.pkl', 'rb')))
    # for url in urls[:500]:
    #     yield url

def download_url(url):
    try:
        resp = requests.get(url)
        article = newspaper.Article(resp.url)
        article.download(input_html=resp.content)
        article.parse()
        data['url'].append(url)
        data['result'].append(article.text)
    except:
        pass

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            link = self.queue.get()
            try:
                download_url(link)
                print(link, "done")
            finally:
                self.queue.task_done()
                print(self.queue.qsize())

def main():
    ts = time()
    links = get_links()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create worker threads
    for x in range(4):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in itertools.islice(links, 1000):
        queue.put(link)
    # Causes the main thread to wait for the queue to finish processing all the tasks
    queue.join()
    pickle.dump(data, open('scrapped_results.pkl', 'wb'))
    print('Took %s mins' % ((time() - ts) / 60))

if __name__ == '__main__':
    main()
When tested on 100 URLs the memory consumption stays constant at 0.1%, but it increases as more URLs are tested (0.2%, 0.4%, 0.5%). The maximum number of URLs I have tested is 1000. My questions are:
Why does memory consumption increase?
Is memory increasing because the queue is not getting emptied before it gets filled? My understanding of the queue is that it empties itself as the data in it gets processed.
Is there a way to keep the memory usage of the threads constant? (A sketch of one approach follows this list.)
Is it because the data in the defaultdict keeps getting bigger?
Can a timeout help here? Where would I declare a timeout?
Is it the newspaper and requests libraries?
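No answer is shown for this one, but a common way to cap memory in this pattern (a minimal sketch, not the original code: the placeholder work, output path, and queue sizes are assumptions) is to bound the task queue so the producer blocks instead of loading everything up front, and to stream results to disk instead of accumulating them in the defaultdict:

import json
from queue import Queue
from threading import Thread

NUM_WORKERS = 4

def worker(tasks, results):
    while True:
        url = tasks.get()
        try:
            text = "..."  # placeholder for the requests/newspaper work
            results.put((url, text))
        finally:
            tasks.task_done()

def writer(results, path):
    # Write each result as soon as it arrives so nothing piles up in memory.
    with open(path, "a") as fh:
        while True:
            url, text = results.get()
            fh.write(json.dumps({"url": url, "result": text}) + "\n")
            fh.flush()
            results.task_done()

def main(urls):
    tasks = Queue(maxsize=2 * NUM_WORKERS)    # bounded: put() blocks when full
    results = Queue(maxsize=2 * NUM_WORKERS)
    for _ in range(NUM_WORKERS):
        Thread(target=worker, args=(tasks, results), daemon=True).start()
    Thread(target=writer, args=(results, "results.jsonl"), daemon=True).start()
    for url in urls:
        tasks.put(url)    # blocks here instead of growing without bound
    tasks.join()
    results.join()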

How to transfer data between two separate scripts in Multiprocessing?

I am using multiprocessing to run two Python scripts in parallel. p1.py continually updates a certain variable, and the latest value of that variable should be displayed by p2.py every 2 seconds. The multiprocessing code for the two scripts is given below:
import os
from multiprocessing import Process

def script1():
    os.system("p1.py")

def script2():
    os.system("p2.py")

if __name__ == '__main__':
    p = Process(target=script1)
    q = Process(target=script2)
    p.start()
    q.start()
    p.join()
    q.join()
I am unable to transfer the value of the variable being updated by p1.py to p2.py. How should I approach the problem in a very simple way?
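No answer is shown above, but one simple approach (a minimal sketch under the assumption that the logic of p1.py and p2.py can be turned into importable functions, since two scripts launched with os.system do not share memory) is a multiprocessing.Value shared between the two processes:

import time
from multiprocessing import Process, Value

def producer(shared):              # stand-in for the logic of p1.py
    for i in range(10):
        with shared.get_lock():
            shared.value = i       # continually update the variable
        time.sleep(0.5)

def consumer(shared):              # stand-in for the logic of p2.py
    for _ in range(5):
        print("latest value:", shared.value)
        time.sleep(2)              # display the latest value every 2 seconds

if __name__ == '__main__':
    latest = Value('i', 0)         # an integer shared between both processes
    p = Process(target=producer, args=(latest,))
    q = Process(target=consumer, args=(latest,))
    p.start()
    q.start()
    p.join()
    q.join()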

Multi-Processing to share memory between processes

I am trying to update a variable of a class by calling a method of that class from a function that is run across multiple processes.
To achieve the desired result, process (p1) needs to update the variable "transactions", which should then get modified by process (p2).
I tried the code below. I know I should use multiprocessing.Value or a Manager to achieve the desired result, but I am not sure how to do it, since the variable to be updated lives in another class.
Below is the code:
from multiprocessing import Process
from helper import Helper

camsource = ['a', 'b']
Pros = []

def sub(i):
    HC.trail_func(i)

def main():
    for i in camsource:
        print("Camera Thread {} Started!".format(i))
        p = Process(target=sub, args=(i))
        Pros.append(p)
        p.start()
    # block until all the threads finish (i.e. block until all function_x calls finish)
    for t in Pros:
        t.join()

if __name__ == "__main__":
    HC = Helper()
    main()
Here is the helper code:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

class Helper():
    def __init__(self):
        self.transactions = []

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        else:
            if self.transactions[0]['Apple'] == 1:
                self.transactions[0]['Apple'] = self.transactions[0]['Apple'] + 1
        print(self.transactions)
Desired Output:
p1:
transactions = {"Apple":1}
p2:
transactions = {"Apple":2}
I've recently released this module that can help you with your code: all data frames (data models that can hold any type of data) have locks on them, in order to solve concurrency issues. Anyway, take a look at the README file and the examples.
I've made an example here too, if you'd like to check.
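For reference, the more standard route is the multiprocessing Manager the asker already suspected. Here is a minimal sketch (not the module linked above, and not the original Helper): the transactions list comes from a Manager, so updates made in one process are visible in the other; note that the nested dict has to be reassigned for the proxy to pick up the change:

from multiprocessing import Manager, Process

class Helper():
    def __init__(self, transactions):
        self.transactions = transactions   # a Manager list proxy

    def trail_func(self, preview):
        if preview == 'a':
            self.transactions.append({"Apple": 1})
        else:
            first = self.transactions[0]
            if first['Apple'] == 1:
                first['Apple'] = 2
                self.transactions[0] = first   # reassign so the proxy sees the change
        print(list(self.transactions))

def sub(helper, i):
    helper.trail_func(i)

if __name__ == "__main__":
    with Manager() as manager:
        HC = Helper(manager.list())
        for i in ['a', 'b']:               # 'a' must run before 'b', as in the question
            p = Process(target=sub, args=(HC, i))
            p.start()
            p.join()                       # join right away to keep the order deterministic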

writing tfrecord with multithreading is not fast as expected

I tried to write a tfrecord with and without multithreading and found the speed difference is not much (with 4 threads: 434 seconds; without multithreading: 590 seconds). I am not sure if I used it correctly. Is there any better way to write tfrecords faster?
import tensorflow as tf
import numpy as np
import threading
import time

def generate_data(shape=[15, 28, 60, 1]):
    return np.random.uniform(size=shape)

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def write_instances_to_tfrecord(tfrecord_file, filenames):
    tfrecord_writer = tf.python_io.TFRecordWriter(tfrecord_file)
    for i, filename in enumerate(filenames):
        curr_MFCC = generate_data()
        curr_MFCC_raw = curr_MFCC.tostring()
        curr_filename_raw = str(filename) + '-' + str(i)
        example = tf.train.Example(features=tf.train.Features(
            feature={
                'MFCC': _bytes_feature(curr_MFCC_raw),
                'filename': _bytes_feature(curr_filename_raw)
            })
        )
        tfrecord_writer.write(example.SerializeToString())
    tfrecord_writer.close()

def test():
    threading_start = time.time()
    coord = tf.train.Coordinator()
    threads = []
    for thread_index in xrange(4):
        args = (str(thread_index), range(200000))
        t = threading.Thread(target=write_instances_to_tfrecord, args=args)
        t.start()
        threads.append(t)
    coord.join(threads)
    print 'w/ threading takes', time.time() - threading_start

    start = time.time()
    write_instances_to_tfrecord('5', range(800000))
    print 'w/o threading takes', time.time() - start

if __name__ == '__main__':
    test()
When using Python threading, CPU utilization will be capped at one core because of the GIL in the CPython implementation. No matter how many threads you add, you will not see a speed-up.
A simple solution in your case would be to use the multiprocessing module.
The code is almost exactly the same as what you have; just switch the threads to processes:
from multiprocessing import Process

coord = tf.train.Coordinator()
processes = []
for thread_index in xrange(4):
    args = (str(thread_index), range(200000))
    p = Process(target=write_instances_to_tfrecord, args=args)
    p.start()
    processes.append(p)
coord.join(processes)
I tested this on my own tfrecord writer code and got a linear scaling speed-up. The total number of processes is limited by memory.
It's better to use the TensorFlow computation graph to take advantage of multithreading, since each session and graph can be run in a different thread. With the computation graph it's about 40 times faster.

Python multiprocessing script partial output

I am following the principles laid down in this post to safely output results that will eventually be written to a file. Unfortunately, the code only prints 1 and 2, and not 3 to 6.
import os
import argparse
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block=False)
            res = doCalculation(par)
            queueOut.put((res))
            queueIn.task_done()
        except:
            break

def doCalculation(par):
    return par

def write(queue):
    while True:
        try:
            par = queue.get(block=False)
            print("response:", par)
        except:
            break

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
On running the code it prints,
$ python3 tst.py
Queue size 6
response: 1
response: 2
Also, is it possible to ensure that the write function always outputs 1,2,3,4,5,6 i.e. in the same order in which the data is fed into the feed queue?
The problem is the task_done() call: a plain multiprocessing.Queue has no task_done() method (only JoinableQueue does), so the call raises an AttributeError, which the bare except swallows, and each worker breaks after processing a single item. That is why only 1 and 2 are printed. If you remove that call it works, but the way it works then is that the queueIn.get(block=False) call throws an exception once the queue is empty. That might be just enough for your use case; a better way, though, would be to use sentinels (as suggested in the multiprocessing docs, see the last example). Here's a little rewrite so your program uses sentinels:
import os
import argparse
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist, nthreads):
    for par in parlist:
        queue.put(par)
    for i in range(nthreads):
        queue.put(None)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        par = queueIn.get()
        if par is None:
            break
        res = doCalculation(par)
        queueOut.put((res))

def doCalculation(par):
    return par

def write(queue):
    while not queue.empty():
        par = queue.get()
        print("response:", par)

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod, nthreads))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
A few things to note:
The sentinel is putting a None into the queue. Note that you need one sentinel for every worker process.
For the write function you don't need the sentinel handling, since there is only one writer process and no concurrency to deal with. (If you used the empty()-then-get() pattern in your calc function, you would run into a problem: with only one item left in the queue, both workers could see empty() return False at the same time, both would call get(), and one of them would be blocked forever.)
You don't need to put feed and write into separate processes; just call them from your main function, as you don't want to run them in parallel anyway.
how can I have the same order in output as in input? [...] I guess multiprocessing.map can do this
Yes, map keeps the order. Here is your program rewritten into something simpler (you don't need the workerQueue and writerQueue), with random sleeps added to prove that the output is still in order:
from multiprocessing import Pool
import time
import random

def calc(val):
    time.sleep(random.random())
    return val

if __name__ == "__main__":
    considerperiod = [1, 2, 3, 4, 5, 6]
    with Pool(processes=2) as pool:
        print(pool.map(calc, considerperiod))
