parallel process data from file - python-3.x

I'm facing a problem with parallel processing of data from a big CSV file. The issue is that reading the file cannot be parallelized, but the chunks of data read from it can be handed off for parallel computation. I tried multiprocessing.Pool with no result (Pool.imap does not seem to accept a yield-based generator).
I have a generator that reads chunks of data from the file. Fetching one chunk takes about 3 s, and processing a chunk takes about 2 s; there are 50 chunks in total. While waiting for the next chunk to be read, I could be computing the previous chunk in parallel.
Here is the concept in code (it does not work in practice):
def file_data_generator(path):
    # file reading, chunk by chunk
    yield datachunk

def compute(datachunk):
    # some heavy computation, ~2 s
    return partial_result

from multiprocessing import Pool
p = Pool()
result = p.imap(compute, file_data_generator(path))  # is yield the issue?
What am I doing wrong? Are there other tools I should use?
It's Python 3.5.
A simple code concept/skeleton would be appreciated :)

You were very close. The generator part with yield is correct: imap does take a generator as an argument and calls next() on it, so yield is fine in this context.
What you were missing is that imap is not blocking, which means the result = p.imap call returns even though the processes are not finished yet. You either need to do
p.close()
p.join()
And then do something with the results as a whole, or simply iterate over the result. Here is a working example:
from multiprocessing import Pool

def compute(line):
    # some heavy computation, ~2 s
    return len(line)

def file_data_generator(path):
    # file reading chunk by chunk (here: line by line)
    with open(path) as f:
        for line in f:
            yield line.strip()

if __name__ == '__main__':
    p = Pool()
    # start the processes; they are still blocked because the task queue is empty
    # results is a generator and is empty at the start
    results = p.imap(compute, file_data_generator('book.txt'))
    # now we tell the pool that we have finished filling the queue
    p.close()
    for res in results:
        print(res)
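If your real compute step works on multi-line chunks rather than single lines, the same pattern applies. Here is a minimal sketch along those lines; the chunk_size parameter and the line-length placeholder inside compute() are illustrative, and book.txt is reused from the example above:
from multiprocessing import Pool
from itertools import islice

def file_data_generator(path, chunk_size=10000):
    # yield lists of up to chunk_size lines; while the workers crunch one
    # chunk, the parent is already reading the next one from the file
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_size))
            if not chunk:
                return
            yield chunk

def compute(datachunk):
    # placeholder for the ~2 s of real work per chunk
    return sum(len(line) for line in datachunk)

if __name__ == '__main__':
    with Pool() as p:
        for partial_result in p.imap(compute, file_data_generator('book.txt')):
            print(partial_result)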

Related

python multiprocessing for loop is not running for all loop arguments

Here is the code I am trying to run. The output text file should contain 500 entries, but it always contains fewer (450, 476, or 429 on different runs). Any idea why this is happening and what I should do to get all 500 entries in the output? It would also be very helpful if the output were in order.
import numpy as np
from multiprocessing import Pool

def foo(j):
    output = [j]
    f = open('output.txt', 'a')
    f.write('\n')
    np.savetxt(f, output)
    f.close()

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(foo, range(500))
Try creating the chunks beforehand. For example:
def f_amp(inputs):
    chunks = [inputs for inputs in range(500)]
    pool = Pool(processes=4)
    result = pool.map(f, chunks)
Also you can refer here for solutions.
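Another way to avoid losing lines is to not have several worker processes appending to output.txt at the same time at all: let the workers return their values and have the parent write the file once pool.map has collected them. pool.map also returns the results in input order, which takes care of the ordering wish. A minimal sketch under those assumptions, with the per-item work reduced to a placeholder:
from multiprocessing import Pool

def foo(j):
    # do the per-item work here and return it, instead of appending
    # to the shared output file from inside the worker
    return j

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        results = pool.map(foo, range(500))  # all 500 results, in input order
    with open('output.txt', 'w') as f:
        for value in results:
            f.write('%s\n' % value)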

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (10 GB compressed, 100 GB uncompressed) that contains a number of reports separated by demarcation lines, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an I/O-bound one). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process:
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() so that child-process-1 reads/parses one report and releases the lock, then child-process-2 reads/parses the next report, then back to child-process-1, and so on. However, when I pass process.stdout as an argument to the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is far too slow (in fact it is slower than doing it serially in a single thread).
Any thoughts/examples on sending data to child processes efficiently would help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
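A minimal sketch of that no-IPC idea follows. The DELIM value and the parse() body are placeholders, big-file.gz is the file from the question, and worker k handles reports k, k + N, k + 2N, ...:
import subprocess
from multiprocessing import Pool

NWORKERS = 4
DELIM = 'xxxxx\n'  # placeholder for the real demarcation line

def parse(report):
    # placeholder for the real CPU-bound parsing of one report
    return len(report)

def reports(stream):
    # yield one report (a list of lines) at a time
    report = []
    for line in stream:
        if line == DELIM:
            yield report
            report = []
        else:
            report.append(line)
    if report:
        yield report

def worker(worker_id):
    # each worker runs its own gunzip and only parses every NWORKERS-th report
    proc = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                            shell=True,
                            text=True,  # needs Python 3.7+; use universal_newlines=True on older versions
                            stdout=subprocess.PIPE)
    total = 0
    for i, report in enumerate(reports(proc.stdout)):
        if i % NWORKERS != worker_id:
            continue  # this report belongs to another worker; skipping is cheap
        total += parse(report)
    proc.stdout.close()
    proc.wait()
    return total

if __name__ == '__main__':
    with Pool(NWORKERS) as pool:
        partial_results = pool.map(worker, range(NWORKERS))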
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overhead of passing chunks of stdout to worker processes is all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)

How to change the number of multiprocessing pool workers on the go

I want to change the number of workers currently used by the pool.
My current idea is:
while True:
    current_connection_number = get_connection_number()
    forced_break = False
    with mp.Pool(current_connection_number) as p:
        for data in p.imap_unordered(fun, some_infinite_generator):
            yield data
            if current_connection_number != get_connection_number():
                forced_break = True
                break
    if not forced_break:
        break
The problem is that it just terminates the workers, so the last items that were taken from some_infinite_generator but not yet processed are lost. Is there some standard way of doing this?
Edit: I've tried printing inside some_infinite_generator, and it turns out p.imap_unordered requests 1565 items with just 2 pool workers, even before anything is processed. How do I limit the number of items requested from the generator? If I use the code above and change the number of connections after just 2 items, I will lose 1563 items.
The problem is that the Pool consumes the generator internally, in a separate thread. You have no way to control that logic.
What you can do is feed the Pool.imap_unordered method a portion of the generator at a time, and get that portion consumed before scaling according to the available connections.
import itertools
import multiprocessing as mp

CHUNKSIZE = 100

def grouper(n, iterable):
    # yield successive chunks of n items taken from iterable
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

chunks = grouper(CHUNKSIZE, some_infinite_generator)

while True:
    current_connection_number = get_connection_number()
    with mp.Pool(current_connection_number) as p:
        while current_connection_number == get_connection_number():
            # hand the pool one chunk at a time, so control comes back
            # after each chunk and the connection count can be re-checked
            for data in p.imap_unordered(fun, next(chunks)):
                yield data
It's a bit less optimal, as the scaling happens every chunk instead of every iteration, but with a bit of fine-tuning of the CHUNKSIZE value you can easily get it right.
The grouper recipe.

Python Multiprocessing Queue Slow

I have a problem with Python multiprocessing queues.
I'm doing some heavy computation on some data. I have created a few processes to lower the calculation time, and the data is split evenly before being sent to the processes. This decreases the calculation time nicely, but when I want to return data from a process via a multiprocessing.Queue it takes ages, and the whole thing ends up slower than calculating in the main thread.
processes = []
proc = 8
for i in range(proc):
    processes.append(multiprocessing.Process(target=self.calculateTriangles, args=(inData[i], outData, timer)))
for p in processes:
    p.start()

results = []
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)

for p in processes:
    p.join()
print("Finish Threads")
print(datetime.datetime.now() - timer)
All of the processes print their finish time when they are done. Here is example output of this code:
0:00:00.017873 CalcDone
0:00:01.692940 CalcDone
0:00:01.777674 CalcDone
0:00:01.780019 CalcDone
0:00:01.796739 CalcDone
0:00:01.831723 CalcDone
0:00:01.842356 CalcDone
0:00:01.868633 CalcDone
0:00:05.497160 killing threads
60968 calculated triangles
As you can see, everything is quite simple until this code:
for i in range(proc):
    results.append(outData.get())
print("killing threads")
print(datetime.datetime.now() - timer)
Here are some observations I have made on my computer and on a slower one:
https://docs.google.com/spreadsheets/d/1_8LovX0eSgvNW63-xh8L9-uylAVlzY4VSPUQ1yP2F9A/edit?usp=sharing. On the slower one there isn't any improvement, as you can see.
Why does it take so much time to get items from the queue when the processes are finished? Is there a way to speed this up?
So I have solved it myself. The calculations are fast, but copying the objects from one process to another takes ages. I just made a method that clears all unnecessary fields in the objects before they are returned; also, using pipes is faster than multiprocessing queues. This brought the time on my slower computer down from 29 seconds to 15 seconds.
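For reference, the pipe-per-process pattern described above looks roughly like this; the heavy triangle computation and the field stripping are reduced to a placeholder, and each worker sends its slimmed-down result back over its own pipe in a single send:
import multiprocessing

def calculate(chunk, conn):
    # placeholder for the heavy computation; in the real code the result
    # objects were stripped of unnecessary fields before being sent back
    result = [x * x for x in chunk]
    conn.send(result)  # one send per worker keeps the IPC overhead low
    conn.close()

if __name__ == '__main__':
    chunks = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]
    pipes, processes = [], []
    for chunk in chunks:
        parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
        p = multiprocessing.Process(target=calculate, args=(chunk, child_conn))
        p.start()
        pipes.append(parent_conn)
        processes.append(p)
    results = [conn.recv() for conn in pipes]  # receive before joining
    for p in processes:
        p.join()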
This time is mainly spent on putting objects into the Queue one at a time and bumping up the semaphore count for each of them. If you are able to bulk-insert all the data into the Queue at once, you cut the time down to about 1/10 of the previous value.
I've dynamically assigned a new method to Queue, based on the old one. Go to the multiprocessing module for your Python version:
/usr/lib/pythonX.X/multiprocessing/queues.py
Copy the put method of the Queue class into your project, e.g. for Python 3.7:
def put(self, obj, block=True, timeout=None):
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    if not self._sem.acquire(block, timeout):
        raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer.append(obj)
        self._notempty.notify()
modify it:
def put_bla(self, obj, block=True, timeout=None):
    assert not self._closed, "Queue {0!r} has been closed".format(self)
    for el in obj:
        if not self._sem.acquire(block, timeout):  # spike the semaphore count once per element
            raise Full
    with self._notempty:
        if self._thread is None:
            self._start_thread()
        self._buffer += obj  # obj is a collections.deque; extend the buffer in one go
        self._notempty.notify()
The last step is to add the new method to the class. multiprocessing.Queue is a method of the default context that returns a Queue object, so it is easier to inject the method directly into the class of the created object. So:
from collections import deque

queue = Queue()
queue.__class__.put_bulk = put_bla  # injecting the new method
items = (500, 400, 450, 350) * count  # (500, 400, 450, 350, 500, 400, ...)
queue.put_bulk(deque(items))
Unfortunately, multiprocessing.Pool was always faster by about 10%, so just stick with that if you don't require everlasting workers to process your tasks. It is based on multiprocessing.SimpleQueue, which is based on multiprocessing.Pipe, and I have no idea why it is faster, because my SimpleQueue solution wasn't and it is not bulk-injectable :) Break that and you'll have the fastest worker ever :)

Multithreading in Python/BeautifulSoup scraping doesn't speed up at all

I have a CSV file ("SomeSiteValidURLs.csv") which lists all the links I need to scrape. The code works: it goes through the URLs in the CSV, scrapes the information, and records/saves it in another CSV file ("Output.csv"). However, since I am planning to do this for a large portion of the site (>10,000,000 pages), speed is important. For each link it takes about 1 s to crawl and save the info into the CSV, which is too slow for the magnitude of the project. So I incorporated the threading module, and to my surprise it doesn't speed things up at all; it still takes about 1 s per link. Did I do something wrong? Is there another way to speed up the processing?
Without multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

crawlToCSV("SomeSiteValidURLs.csv")
With multithreading:
import urllib2
import csv
from bs4 import BeautifulSoup
import threading

def crawlToCSV(FileName):
    with open(FileName, "rb") as f:
        for URLrecords in f:
            OpenSomeSiteURL = urllib2.urlopen(URLrecords)
            Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
            OpenSomeSiteURL.close()

            tbodyTags = Soup_SomeSite.find("tbody")
            trTags = tbodyTags.find_all("tr", class_="result-item ")

            placeHolder = []
            for trTag in trTags:
                tdTags = trTag.find("td", class_="result-value")
                tdTags_string = tdTags.string
                placeHolder.append(tdTags_string)

            with open("Output.csv", "ab") as f:
                writeFile = csv.writer(f)
                writeFile.writerow(placeHolder)

fileName = "SomeSiteValidURLs.csv"

if __name__ == "__main__":
    t = threading.Thread(target=crawlToCSV, args=(fileName, ))
    t.start()
    t.join()
You're not parallelizing this properly. What you actually want to do is have the work being done inside your for loop happen concurrently across many workers. Right now you're moving all the work into one background thread, which does the whole thing synchronously. That's not going to improve performance at all (it will just slightly hurt it, actually).
Here's an example that uses a ThreadPool to parallelize the network operation and parsing. It's not safe to try to write to the csv file across many threads at once, so instead we return the data that would have been written back to the parent, and have the parent write all the results to the file at the end.
import urllib2
import csv
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlToCSV(URLrecord):
    OpenSomeSiteURL = urllib2.urlopen(URLrecord)
    Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
    OpenSomeSiteURL.close()

    tbodyTags = Soup_SomeSite.find("tbody")
    trTags = tbodyTags.find_all("tr", class_="result-item ")

    placeHolder = []
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.string
        placeHolder.append(tdTags_string)
    return placeHolder

if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    pool = Pool(cpu_count() * 2)  # Creates a Pool with cpu_count * 2 threads.
    with open(fileName, "rb") as f:
        results = pool.map(crawlToCSV, f)  # results is a list of all the placeHolder lists returned from each call to crawlToCSV
    with open("Output.csv", "ab") as f:
        writeFile = csv.writer(f)
        for result in results:
            writeFile.writerow(result)
Note that in Python, threads only actually speed up I/O operations - because of the GIL, CPU-bound operations (like the parsing/searching BeautifulSoup is doing) can't actually be done in parallel via threads, because only one thread can do CPU-based operations at a time. So you still may not see the speed up you were hoping for with this approach. When you need to speed up CPU-bound operations in Python, you need to use multiple processes instead of threads. Luckily, you can easily see how this script performs with multiple processes instead of multiple threads; just change from multiprocessing.dummy import Pool to from multiprocessing import Pool. No other changes are required.
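For completeness, the only difference between the two variants is this import; everything else in the script stays the same:
# thread-based pool: fine for the I/O-bound part (fetching the pages)
from multiprocessing.dummy import Pool

# process-based pool: use this instead if the CPU-bound BeautifulSoup parsing dominates
# from multiprocessing import Pool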
Edit:
If you need to scale this up to a file with 10,000,000 lines, you're going to need to adjust this code a bit - Pool.map converts the iterable you pass into it to a list prior to sending it off to your workers, which obviously isn't going to work very well with a 10,000,000 entry list; having that whole thing in memory is probably going to bog down your system. Same issue with storing all the results in a list. Instead, you should use Pool.imap:
imap(func, iterable[, chunksize])
A lazier version of map().
The chunksize argument is the same as the one used by the map()
method. For very long iterables using a large value for chunksize can
make the job complete much faster than using the default value of 1.
if __name__ == "__main__":
    fileName = "SomeSiteValidURLs.csv"
    FILE_LINES = 10000000
    NUM_WORKERS = cpu_count() * 2
    chunksize = FILE_LINES // NUM_WORKERS * 4  # Try to get a good chunksize. You're probably going to have to tweak this, though. Try smaller and larger values and see how performance changes.
    pool = Pool(NUM_WORKERS)
    with open(fileName, "rb") as f:
        result_iter = pool.imap(crawlToCSV, f, chunksize=chunksize)
        # keep the input file open while we lazily iterate over the results
        with open("Output.csv", "ab") as out:
            writeFile = csv.writer(out)
            for result in result_iter:  # lazily iterate over results.
                writeFile.writerow(result)
With imap, we never pull all of f into memory at once, nor do we store all the results in memory at once. The most we ever have in memory is chunksize lines of f, which should be more manageable.
