I wanted to write a small program that would simulate lottery winning chances for me. After that I wanted to make it a bit faster by implementing multiprocessing like this, but two weird behaviors appeared.
import random as r
from multiprocessing.pool import ThreadPool
# winnerSequence = []
# mCombinations = []
howManyLists = 5
howManyTry = 1000000
combinations = 720/10068347520
possbilesNumConstantsConstant = []
for x in range(1, 50):
    possbilesNumConstantsConstant.append(x)
def getTicket():
    possbilesNumConstants = list(possbilesNumConstantsConstant)
    toReturn = []
    possiblesNum = list(possbilesNumConstants)
    for x in range(6):
        choice = r.choice(possiblesNum)
        toReturn.append(choice)
        possiblesNum.remove(choice)
    toReturn.sort()
    return toReturn
def sliceRange(rangeNum,num):
    """returns list of smaller ranges"""
    toReturn = []
    rest = rangeNum%num
    print(rest)
    toSlice = rangeNum - rest
    print(toSlice)
    n = toSlice/num
    print(n)
    for x in range(num):
        toReturn.append((int(n*x),int(n*(x+1)-1)))
    print(toReturn,"<---range")
    return toReturn
def Job(tupleRange):
    """Job returns list of tickets"""
    toReturn = list()
    print(tupleRange,"Start")
    for x in range(int(tupleRange[0]),int(tupleRange[1])):
        toReturn.append(getTicket())
    print(tupleRange,"End")
    return toReturn

result = list()
The first one: when I add Job(tupleRange) to the pool, it looks like the job is done in the main thread before the next job is added to the pool.
def start():
    """this fun() starts program"""
    #create pool of threads
    pool = ThreadPool(processes = howManyLists)
    #create list of tuples with smaller piece of range
    lista = sliceRange(howManyTry,howManyLists)
    #create list for storing job objects
    jobList = list()
    for tupleRange in lista:
        #add job to pool
        jobToList = pool.apply_async(Job(tupleRange))
        #add returned object to list for future callback
        jobList.append(jobToList)
        print('Adding to pool',tupleRange)
    #for all jobs in list get returned tickets
    for job in jobList:
        #print(job.get())
        result.extend(job.get())

if __name__ == '__main__':
    start()
Console output:
[(0, 199999), (200000, 399999), (400000, 599999), (600000, 799999), (800000, 999999)] <---range
(0, 199999) Start
(0, 199999) End
Adding to pool (0, 199999)
(200000, 399999) Start
(200000, 399999) End
Adding to pool (200000, 399999)
(400000, 599999) Start
(400000, 599999) End
And the second one: when I want to get data from the thread, I get this exception on this line:
for job in jobList:
    #print(job.get())
    result.extend(job.get()) #<---- this line
File "C:/Users/CrazyUrusai/PycharmProjects/TestLotka/main/kopia.py", line 79, in start
result.extend(job.get())
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
raise self._value
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
TypeError: 'list' object is not callable
Can somebody explain this to me? (I am new to multiprocessing.)
The problem is here:
jobToList = pool.apply_async(Job(tupleRange))
Job(tupleRange) executes first, and apply_async then receives its return value, which is a list (since Job returns a list). There are two problems here: this code is synchronous, and apply_async receives a list instead of the callable it expects. So it tries to execute the given list as a job and fails.
This is the signature of pool.apply_async:
def apply_async(self, func, args=(), kwds={}, callback=None,
        error_callback=None):
    ...
So you should pass the function func and its arguments args to apply_async separately, and you shouldn't call the function yourself before sending it to the pool.
I fixed this line and your code worked for me:
jobToList = pool.apply_async(Job, (tupleRange, ))
Or, with explicitly named args,
jobToList = pool.apply_async(func=Job, args=(tupleRange, ))
Don't forget to wrap the function arguments in a tuple (or another sequence).
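To make the difference concrete, here is a minimal, self-contained sketch of the corrected pattern (the job function and the ranges below are illustrative, not taken from the code above):

from multiprocessing.pool import ThreadPool

def job(task_range):
    # Pretend work: just expand the (start, stop) tuple into a list
    return list(range(task_range[0], task_range[1]))

if __name__ == '__main__':
    pool = ThreadPool(processes=4)
    # Wrong: job((0, 10)) would run immediately in the main thread and
    # apply_async would then receive a list, not a callable.
    # Right: pass the callable and its argument tuple separately, so a
    # worker calls job((start, stop)) for us.
    async_results = [pool.apply_async(job, ((i, i + 10),)) for i in range(0, 40, 10)]
    results = [r.get() for r in async_results]
    pool.close()
    pool.join()
    print(len(results), "jobs finished")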
I have 2 functions in a Python 3.7 script that search 2 separate network nodes and return very large data sets of strings in lists. The smaller data set has ~300K entries, while the larger one has ~1.5M. This script takes almost an hour to execute because of how it has to compile the data sets and because the second data set is significantly larger. I can't shorten the run time by changing how the compilation happens; there's no easier way for me to get the data from the network nodes. But I can cut almost 10 minutes if I run them simultaneously, so I'm trying to shorten the run time by using multiprocessing to run both of them at once.
I do not need them to start or finish at exactly the same second; I just want them to run at the same time.
Here's a breakdown of my first attempt at coding for multiprocessing:
def p_func(arg1, arg2, pval):
    ## Do Stuff
    return pval

def s_func(arg1, sval):
    ## Do Stuff
    return sval

# Creating variables to get return values that multiprocessing can handle
pval = multiprocessing.Value(list)
sval = multiprocessing.Value(list)

# setting up multiprocessing Processes for each function and passing arguments
p1 = multiprocessing.Process(target=p_func, args=(arg1, arg2, pval))
s2 = multiprocessing.Process(target=s_func, args=(arg3, sval))

p1.start()
s1.start()

p1.join()
s1.join()

print("Number of values in pval: ", len(pval))
print("Number of values in sval: ", len(sval))
I believe I have solved my list concerns, so....
Based on comments I've updated my code as follows:
#! python3
import multiprocessing as mp

def p_func(arg1, arg2, pval):
    # takes arg1 and arg2 and queries network node to return list of ~300K
    # values and assigns that list to pval for return to main()
    return pval

def s_func(arg1, sval):
    # takes arg1 and queries network node to return list of ~1.5M
    # values and assigns that list to sval for return to main()
    return sval

# Creating variables to get return values that multiprocessing can handle in
# main()
with mp.Manager() as mgr:
    pval = mgr.list()
    sval = mgr.list()

    # setting up multiprocessing Processes for each function and passing
    # arguments
    p1 = mp.Process(target=p_func, args=(arg1, arg2, pval))
    s1 = mp.Process(target=s_func, args=(arg3, sval))

    p1.start()
    s1.start()

    p1.join()
    s1.join()

# out of with block
print("Number of values in pval: ", len(pval))
print("Number of values in sval: ", len(sval))
Now I'm getting a TypeError: can't pickle _thread.lock objects on the p1.start() invocation. I'm guessing that one of the variables I passed in the p1 declaration is causing a problem with multiprocessing, but I'm not sure how to read the error or resolve the problem.
Use a Manager.list() instead:
import multiprocessing as mp

def p_func(pval):
    pval.extend(list(range(300000)))

def s_func(sval):
    sval.extend(list(range(1500000)))

if __name__ == '__main__':
    # Creating variables to get return values that mp can handle
    with mp.Manager() as mgr:
        pval = mgr.list()
        sval = mgr.list()

        # setting up mp Processes for each function and passing arguments
        p1 = mp.Process(target=p_func, args=(pval,))
        s2 = mp.Process(target=s_func, args=(sval,))

        p1.start()
        s2.start()

        p1.join()
        s2.join()

        print("Number of values in pval: ", len(pval))
        print("Number of values in sval: ", len(sval))
Output:
Number of values in pval: 300000
Number of values in sval: 1500000
Manager objects are slower than shared memory but more flexible. Shared memory is faster, so if you know an upper limit for your arrays, you could use a fixed-sized shared memory Array and a shared value indicating the used size instead, such as:
#!python3
import multiprocessing as mp

def p_func(parr,psize):
    for i in range(10):
        parr[i] = i
    psize.value = 10

def s_func(sarr,ssize):
    for i in range(5):
        sarr[i] = i
    ssize.value = 5

if __name__ == '__main__':
    # Creating variables to get return values that mp can handle
    parr = mp.Array('i',2<<20) # 2M
    sarr = mp.Array('i',2<<20)
    psize = mp.Value('i',0)
    ssize = mp.Value('i',0)

    # setting up mp Processes for each function and passing arguments
    p1 = mp.Process(target=p_func, args=(parr,psize))
    s2 = mp.Process(target=s_func, args=(sarr,ssize))

    p1.start()
    s2.start()

    p1.join()
    s2.join()

    print("parr: ", parr[:psize.value])
    print("sarr: ", sarr[:ssize.value])
Output:
parr: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
sarr: [0, 1, 2, 3, 4]
I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parallelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if index % batch_size == 0 and index != 0:
            yield batch
            batch = []
    if batch:
        yield batch

# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()
The above is a generic method. As a concrete example, you can use the above (omitting part 2) to parallelise batches of insert statements from two files. Appending this to the above should work:
import time  # needed for the timing below

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        open_file = open(file["file_path"], "r")  # Here we are assuming you have 1 Graql query per line!
        batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()
You should also be able to see how you can load from a CSV or any other file type in this way, by taking the values you find in that file and substituting them into Graql query string templates. Take a look at the migration example in the docs for more on that.
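As a rough sketch of that idea (the CSV columns and the insert template here are made up for illustration; generate_query_batches and BATCH_SIZE refer to the code above), you could build the query strings before batching them:

import csv

def csv_to_graql_queries(csv_path):
    """Turn each CSV row into an insert query string (hypothetical schema)."""
    queries = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            queries.append(
                f'insert $p isa person, has name "{row["name"]}", '
                f'has age {row["age"]};'
            )
    return queries

# queries = csv_to_graql_queries("/path/to/people.csv")
# batches = generate_query_batches(queries, BATCH_SIZE)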
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue

def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while not kill_event.is_set():
        try:
            batch = batch_queue.get(block=True, timeout=1)
            with session.transaction(TransactionType.WRITE) as tx:
                for query in batch:
                    tx.query().insert(query)
                tx.commit()
        except queue.Empty:
            continue
    print("Received kill event, exiting worker.")

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"
    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()[:]

    batch_statements = batch(statements, n=batch_size)
    total_batches = int(len(statements) / batch_size)
    if len(statements) % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)

    for i, batch in enumerate(batch_statements):
        batch_queue.put(batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))

    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()
    for proc in writers:
        proc.join()
    print("Done loading")
I've recently made a python program that would benefit a lot from a consumer/producer parallel computing strategy.
I've tried to develop a module (class) to ease the implementation of such a processing strategy, but I quickly ran into a problem.
My ProducerConsumer class:
from collections import deque
from multiprocessing import Pool

class ProducerConsumer(object):
    def __init__(self, workers_qt, producer, consumer, min_producer_qt=1):
        self.producer_functor = producer  # Pointer to the producer function
        self.consumer_functor = consumer  # Pointer to the consumer function
        self.buffer = deque([])  # Thread-safe double-ended queue for the intermediate result buffer
        self.workers_qt = workers_qt
        self.min_producer_qt = min_producer_qt  # Minimum quantity of active producers (if enough remaining input data)
        self.producers = []  # List of producers' async results
        self.consumers = []  # List of consumers' async results

    def produce(self, params, callback=None):
        result = self.producer_functor(*params)  # Execute the producer function
        if callback is not None:
            callback()  # Call the callback (if there is one)
        return result

    def consume(self, params, callback=None):
        result = self.consumer_functor(params)  # Execute the consumer function
        if callback is not None:
            callback()  # Call the callback (if there is one)
        return result

    # Map a list of producer's input data to a list of consumer's output data
    def map_result(self, producers_param):
        result = []  # Result container
        producers_param = deque(producers_param)  # Convert input to double-ended queue (for popleft() member)
        with Pool(self.workers_qt) as p:  # Create a worker pool
            while self.buffer or producers_param or self.consumers or self.producers:  # Work remaining
                # Create consumers
                if self.buffer and (len(self.producers) >= self.min_producer_qt or not producers_param):
                    consumer_param = self.buffer.popleft()  # Pop one set from the consumer param queue
                    if not isinstance(consumer_param, tuple):
                        consumer_param = (consumer_param,)  # Force tuple type
                    self.consumers.append(p.apply_async(func=self.consume, args=consumer_param))  # Start new consumer
                # Create producers
                elif producers_param:
                    producer_param = producers_param.popleft()  # Pop one set from the producer param queue
                    if not isinstance(producer_param, tuple):
                        producer_param = (producer_param,)  # Force tuple type
                    self.producers.append(p.apply_async(func=self.produce, args=producer_param))  # Start new producer

                # Filter finished async_tasks
                finished_producers = [r for r in self.producers if r.ready()] if self.producers else []
                finished_consumers = [r for r in self.consumers if r.ready()] if self.consumers else []

                # Remove finished async_tasks from the running tasks list
                self.producers = [r for r in self.producers if r not in finished_producers]
                self.consumers = [r for r in self.consumers if r not in finished_consumers]

                # Extract result from finished async_tasks
                for r in finished_producers:
                    assert r.ready()
                    self.buffer.append(r.get())  # Get the producer result and put it in the buffer
                for r in finished_consumers:
                    assert r.ready()
                    result.append(r.get())  # Get the consumer result and put it in the function-local result var
        return result
In the member map_result(), when I try to get() the result of the apply_async(...) call, I get the following error (note that I'm running Python 3):
Traceback (most recent call last):
File "ProducerConsumer.py", line 91, in <module>
test()
File "ProducerConsumer.py", line 85, in test
result = pc.map_result(input)
File "ProducerConsumer.py", line 64, in map_result
self.buffer.append(r.get()) # Get the producer result and put it in the buffer
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
put(task)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle _thread.lock objects
And here is some code to reproduce my error (dependent on the class, obviously):
def test_producer(val):
    return val*12

def test_consumer(val):
    return val/4

def test():
    pc = ProducerConsumer(4, test_producer, test_consumer)
    input = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]  # Input for the test of the ProducerConsumer class
    expected = [0, 3, 6, 9, 15, 18, 21, 23, 27]  # Expected output for the test of the ProducerConsumer class
    result = pc.map_result(input)
    print('got : {}'.format(result))
    print('expected : {}'.format(expected))

if __name__ == '__main__':
    test()
Note that in the map_result() member of my class I only "get()" results that are "ready()".
From what I know about pickling (which I admit is not that much), I'd say that the fact that I call Pool.apply_async(...) on a member function could play a role, but I'd really like to keep the class structure if I can.
Thank you for the help!
So, the problem was corrected when I also fixed some design errors:
My 3 buffer variables (buffer, producers, consumers) had no business being members of the class, since they were semantically bound to the map_result() member itself.
So the patch was to delete these members and create them as local variables of map_result().
The problem is, even if the design was faulty, I still have a hard time understanding why the worker couldn't pickle the lock (of the parameter, I now suppose), so...
If anyone has a clear explanation of what was going on (or a link to one), that would be really appreciated.
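My reading of it (an assumption on my part, not something the traceback states outright): apply_async(func=self.consume, ...) has to pickle the bound method, pickling a bound method pickles self along with it, and by that point self.producers / self.consumers already contain AsyncResult objects, which hold threading primitives internally. Moving those containers into local variables keeps self picklable. A tiny standalone sketch of that effect:

import pickle
import threading

class Holder:
    def __init__(self):
        self.results = []   # stands in for self.producers / self.consumers

    def work(self, x):
        return x * 2

h = Holder()
pickle.dumps(h.work)                   # fine: pickling the bound method pickles h, which is still plain data

h.results.append(threading.Event())    # an AsyncResult keeps a threading.Event internally
try:
    pickle.dumps(h.work)               # now h drags the event's lock along
except TypeError as err:
    print("pickling failed:", err)     # -> can't pickle _thread.lock objects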
I saw somewhere a hint on how to process a large dataset (say lines of text) faster with the multiprocessing module, something like:
... (form batch_set = nump batches [= lists of lines to process], batch_set
is a list of lists of strings (batches))

nump = len(batch_set)
output = mp.Queue()
processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i])) for i in range(nump)]
for p in processes:
    p.start()
for p in processes:
    p.join()
results = sorted([output.get() for p in processes])

... (do something with the processed outputs, ex print them in order,
given that each proc_lines function returns a couple (i, out_batch))
However, when I run the code with a small number of lines/batch it works fine [ex: './code.py -x 4:10' for nump=4 and numb=10 (lines/batch)], while after a certain number of lines/batch it hangs [ex: './code.py -x 4:4000'], and when I interrupt it I see a traceback hint about a _wait_for_tstate_lock and the system threading library. It seems that the code does not reach the last code line shown above.
I provide the code below, in case somebody needs it to answer why this is happening and how to fix it.
#!/usr/bin/env python3

import sys
import multiprocessing as mp

def fabl(numb, nump):
    '''
    Form And Batch Lines: form nump[roc] groups of numb[atch] indexed lines
    '<idx> my line here' with indexes from 1 to (nump x numb).
    '''
    ret = []
    idx = 1
    for _ in range(nump):
        cb = []
        for _ in range(numb):
            cb.append('%07d my line here' % idx)
            idx += 1
        ret.append(cb)
    return ret

def proc_lines(i, output, rows_in):
    ret = []
    for row in rows_in:
        row = row[0:8] + 'some other stuff\n'  # replacement for the post-idx part
        ret.append(row)
    output.put((i,ret))
    return

def mp_proc(batch_set):
    'given the batch, disperse it to the number of processes and ret the results'
    nump = len(batch_set)
    output = mp.Queue()
    processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i])) for i in range(nump)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print('waiting for procs to complete...')
    results = sorted([output.get() for p in processes])
    return results

def write_set(proc_batch_set, fout):
    'write p[rocessed]batch_set'
    for _, out_batch in proc_batch_set:
        for row in out_batch:
            fout.write(row)
    return

def main():
    args = sys.argv
    if len(args) < 2:
        print('''
    run with args: -x [ NumProc:BatchSize ]
    ( ex: '-x' | '-x 4:10' (default values) | '-x 4:4000' (hangs...) )
        ''')
        sys.exit(0)
    numb = 10  # suppose we need this number of lines/batch : BatchSize
    nump = 4   # number of processes to use : NumProcs
    if len(args) > 2 and ':' in args[2]:  # use another np:bs
        nump, numb = map(int, args[2].split(':'))
    batch_set = fabl(numb, nump)  # proc-batch made in here: nump (groups) x numb (lines)
    proc_batch_set = mp_proc(batch_set)
    with open('out-min', 'wt') as fout:
        write_set(proc_batch_set, fout)
    return

if __name__ == '__main__':
    main()
The Queue has a certain capacity and can get full if you do not empty it while the Processes are running. This does not block the execution of your processes, but you won't be able to join a Process whose put did not complete.
So I would just modify the mp_proc function like this:
def mp_proc(batch_set):
    'given the batch, disperse it to the number of processes and ret the results'
    n_process = len(batch_set)
    output = mp.Queue()
    processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i]))
                 for i in range(n_process)]
    for p in processes:
        p.start()

    # Empty the queue while the processes are running so there is no
    # issue with incomplete `put` operations.
    results = sorted([output.get() for p in processes])

    # Join the processes to make sure everything finished correctly
    for p in processes:
        p.join()

    return results
I am trying to learn how to use multiprocessing and have managed to get the code below to work. The goal is to work through every combination of the variables within CostlyFunction by setting n equal to some number (right now it is 100, so the first 100 combinations are tested). I was hoping I could manipulate w as each process returned its list (CostlyFunction returns a list of 7 values) and only keep the results in a given range. Right now, w holds all 100 lists and then lets me manipulate them, but when I use n=10MM, w becomes huge and costly to hold in memory. Is there a way to evaluate CostlyFunction's output as the workers return values and then throw out values I don't need?
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    #width = -36000000/1000
    #fronteir = [None]*1000
    currtime = time()
    n = 100
    po = Pool()
    res = po.map_async(CostlyFunction,((i,) for i in range(n)))
    w = res.get()
    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(w)
    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()
Unfortunately, Pool doesn't have a 'filter' method; otherwise, you might've been able to prune your results before they're returned. Pool.imap is probably the best solution you'll find for dealing with your memory issue: it returns an iterator over the results from CostlyFunction.
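Since the original goal was to keep only results that fall in a given range, you can also just filter the iterator as results stream in, so nothing but the kept rows ever accumulates. Here is a minimal sketch (costly_function_stub and the keep_result bounds are stand-ins, not your real CostlyFunction):

from multiprocessing import Pool

def costly_function_stub(args):
    # Stand-in for CostlyFunction: returns a 7-value list
    i, = args
    return [i, i, i, i, i, i * 0.1, i]

def keep_result(row, lo=2.0, hi=5.0):
    # Placeholder predicate: keep rows whose sixth value falls in [lo, hi]
    return lo <= row[5] <= hi

if __name__ == '__main__':
    n = 100
    kept = []
    with Pool() as po:
        # imap yields results one at a time, so only the kept rows stay in memory
        for row in po.imap(costly_function_stub, ((i,) for i in range(n)), chunksize=10):
            if keep_result(row):
                kept.append(row)
    print(len(kept), "rows kept")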
For sorting through the results, I made a simple list-based class called TopList that stores a fixed number of items. All of its items are the highest-ranked according to a key function.
from collections import UserList

def keyfunc(a):
    return a[5]  # This would be the sixth item in a result from CostlyFunction

class TopList(UserList):
    def __init__(self, key, *args, cap=10):  # cap is the largest number of results
        super().__init__(*args)               # you want to store
        self.cap = cap
        self.key = key
    def add(self, item):
        self.data.append(item)
        self.data.sort(key=self.key, reverse=True)
        if len(self.data) > self.cap:  # only drop the lowest-ranked item
            self.data.pop()            # once the cap is exceeded
Here's how your code might look:
if __name__ == "__main__":
    import csv
    csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
    n = 100
    currtime = time()
    po = Pool()
    best = TopList(keyfunc)

    result_iter = po.imap(CostlyFunction, ((i,) for i in range(n)))
    for result in result_iter:
        best.add(result)

    spamwriter = csv.writer(csvFile, delimiter=',')
    spamwriter.writerows(best)

    print(('2: parallel: time elapsed:', time() - currtime))
    csvFile.close()