Pool.apply_async().get() causes _thread.lock pickle error - python-3.x

I've recently made a Python program that would benefit a lot from a consumer/producer parallel computing strategy.
I've tried to develop a module (class) to ease the implementation of such a processing strategy, but I quickly ran into a problem.
My ProducerConsumer class:
from collections import deque       # deque is used as the intermediate buffer
from multiprocessing import Pool    # Pool provides the worker processes

class ProducerConsumer(object):
    def __init__(self, workers_qt, producer, consumer, min_producer_qt=1):
        self.producer_functor = producer        # Pointer to the producer function
        self.consumer_functor = consumer        # Pointer to the consumer function
        self.buffer = deque([])                 # Double-ended queue used as the intermediate result buffer
        self.workers_qt = workers_qt
        self.min_producer_qt = min_producer_qt  # Minimum quantity of active producers (if enough remaining input data)
        self.producers = []                     # List of producer async results
        self.consumers = []                     # List of consumer async results

    def produce(self, params, callback=None):
        result = self.producer_functor(*params)  # Execute the producer function
        if callback is not None:
            callback()  # Call the callback (if there is one)
        return result

    def consume(self, params, callback=None):
        result = self.consumer_functor(params)  # Execute the consumer function
        if callback is not None:
            callback()  # Call the callback (if there is one)
        return result

    # Map a list of producer's input data to a list of consumer's output data
    def map_result(self, producers_param):
        result = []  # Result container
        producers_param = deque(producers_param)  # Convert input to a double-ended queue (for popleft())
        with Pool(self.workers_qt) as p:  # Create a worker pool
            while self.buffer or producers_param or self.consumers or self.producers:  # Work remaining
                # Create consumers
                if self.buffer and (len(self.producers) >= self.min_producer_qt or not producers_param):
                    consumer_param = self.buffer.popleft()  # Pop one set from the consumer param queue
                    if not isinstance(consumer_param, tuple):
                        consumer_param = (consumer_param,)  # Force tuple type
                    self.consumers.append(p.apply_async(func=self.consume, args=consumer_param))  # Start a new consumer
                # Create producers
                elif producers_param:
                    producer_param = producers_param.popleft()  # Pop one set from the producer param queue
                    if not isinstance(producer_param, tuple):
                        producer_param = (producer_param,)  # Force tuple type
                    self.producers.append(p.apply_async(func=self.produce, args=producer_param))  # Start a new producer
                # Filter finished async tasks
                finished_producers = [r for r in self.producers if r.ready()] if self.producers else []
                finished_consumers = [r for r in self.consumers if r.ready()] if self.consumers else []
                # Remove finished async tasks from the running task lists
                self.producers = [r for r in self.producers if r not in finished_producers]
                self.consumers = [r for r in self.consumers if r not in finished_consumers]
                # Extract results from finished async tasks
                for r in finished_producers:
                    assert r.ready()
                    self.buffer.append(r.get())  # Get the producer result and put it in the buffer
                for r in finished_consumers:
                    assert r.ready()
                    result.append(r.get())  # Get the consumer result and put it in the local result list
        return result
In the member map_result(), when I try to get() the result of the apply_async(...) call, I get the following error (note that I'm running Python 3):
Traceback (most recent call last):
File "ProducerConsumer.py", line 91, in <module>
test()
File "ProducerConsumer.py", line 85, in test
result = pc.map_result(input)
File "ProducerConsumer.py", line 64, in map_result
self.buffer.append(r.get()) # Get the producer result and put it in the buffer
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/usr/lib/python3.5/multiprocessing/pool.py", line 385, in _handle_tasks
put(task)
File "/usr/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
TypeError: can't pickle _thread.lock objects
And here is some code to reproduce my error (dependent on the class, obviously):
def test_producer(val):
    return val * 12

def test_consumer(val):
    return val / 4

def test():
    pc = ProducerConsumer(4, test_producer, test_consumer)
    input = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]       # Input for the test of the ProducerConsumer class
    expected = [0, 3, 6, 9, 15, 18, 21, 23, 27]  # Expected output for the test of the ProducerConsumer class
    result = pc.map_result(input)
    print('got : {}'.format(result))
    print('expected : {}'.format(expected))

if __name__ == '__main__':
    test()
Note that in the map_result() member of my class I only "get()" results that are "ready()".
From what I know about pickling (which I admit is not much), I'd say the fact that I call Pool.apply_async(...) on a member function could play a role, but I'd really like to keep the class structure if I can.
Thank you for the help!

So, the problem was corrected once I also fixed some design errors:
My 3 buffer variables (buffer, producers, consumers) had no business being members of the class, since they were semantically bound to the map_result() member itself.
So the patch was to delete these members and recreate them as local variables of map_result().
Problem is, even if the design was faulty, I still have a hard time understanding why the worker couldn't pickle the lock (of the param, I now suppose), so...
If anyone has a clear explanation of what was going on (or a link to one), that would be really appreciated.
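A likely explanation (inferred from the traceback, so treat it as an educated guess): apply_async(func=self.consume, ...) has to pickle the bound method, which means pickling self as well. Once self.producers and self.consumers hold AsyncResult objects, which internally wrap a threading.Event and hence _thread.lock objects, pickling self fails with exactly this error. With the buffers as local variables, self only carries picklable attributes (two module-level functions and two integers). A minimal sketch of map_result() along those lines, keeping the rest of the class unchanged:

    # Map a list of producer's input data to a list of consumer's output data
    def map_result(self, producers_param):
        result = []
        producers_param = deque(producers_param)
        buffer = deque([])   # now local: never pickled along with self
        producers = []       # local list of producer AsyncResults
        consumers = []       # local list of consumer AsyncResults
        with Pool(self.workers_qt) as p:
            while buffer or producers_param or consumers or producers:
                if buffer and (len(producers) >= self.min_producer_qt or not producers_param):
                    consumer_param = buffer.popleft()
                    if not isinstance(consumer_param, tuple):
                        consumer_param = (consumer_param,)
                    consumers.append(p.apply_async(func=self.consume, args=consumer_param))
                elif producers_param:
                    producer_param = producers_param.popleft()
                    if not isinstance(producer_param, tuple):
                        producer_param = (producer_param,)
                    producers.append(p.apply_async(func=self.produce, args=producer_param))
                finished_producers = [r for r in producers if r.ready()]
                finished_consumers = [r for r in consumers if r.ready()]
                producers = [r for r in producers if r not in finished_producers]
                consumers = [r for r in consumers if r not in finished_consumers]
                for r in finished_producers:
                    buffer.append(r.get())   # producer output becomes consumer input
                for r in finished_consumers:
                    result.append(r.get())   # consumer output is the final result
        return result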

Related

How to access public attributes in RPyC on nested data?

I am trying to access public attributes on an RPyC call by following this document, but I don't see it working as described.
The documentation says that if you don't specify protocol_config={'allow_public_attrs': True,}, public attributes, even of builtin data types, won't be accessible. However, even when I specify this, public attributes of a nested data structure are not accessible.
RPyC Server code.
import json    # needed by exposed_hello_json below
import pickle
import rpyc

class MyService(rpyc.Service):
    def on_connect(self, conn):
        # code that runs when a connection is created
        # (to init the service, if needed)
        pass

    def on_disconnect(self, conn):
        # code that runs after the connection has already closed
        # (to finalize the service, if needed)
        pass

    def exposed_get_answer(self):  # this is an exposed method
        return 42

    exposed_the_real_answer_though = 43  # an exposed attribute

    def get_question(self):  # while this method is not exposed
        return "what is the airspeed velocity of an unladen swallow?"

    def exposed_hello(self, collection):
        print("Collection is ", collection)
        print("Collection type is ", type(collection).__name__)
        for item in collection:
            print("Item type is ", type(item).__name__)
            print(item)

    def exposed_hello2(self, collection):
        for item in collection:
            for key, val in item.items():
                print(key, val)

    def exposed_hello_json(self, collection):
        for item in collection:
            item = json.loads(item)
            for key, val in item.items():
                print(key, val)

if __name__ == "__main__":
    from rpyc.utils.server import ThreadedServer
    t = ThreadedServer(
        MyService(),
        port=3655,
        protocol_config={'allow_public_attrs': True, }
    )
    t.start()
Client Side Calls
>>> import rpyc
>>> rpyc.__version__
(4, 0, 2)
>>> c = rpyc.connect('a.b.c.d', 3655) ; client=c.root
Case 1: If the data is in a nested structure (using builtin data types), it doesn't work.
>>> data
[{'a': [1, 2], 'b': 'asa'}]
>>> client.hello2(data)
...
AttributeError: cannot access 'items'
========= Remote Traceback (2) =========
Traceback (most recent call last):
File "/root/lydian.egg/rpyc/core/protocol.py", line 329, in _dispatch_request
res = self._HANDLERS[handler](self, *args)
File "/root/lydian.egg/rpyc/core/protocol.py", line 590, in _handle_call
return obj(*args, **dict(kwargs))
File "sample.py", line 33, in exposed_hello2
for key, val in item.items():
File "/root/lydian.egg/rpyc/core/netref.py", line 159, in __getattr__
return syncreq(self, consts.HANDLE_GETATTR, name)
File "/root/lydian.egg/rpyc/core/netref.py", line 75, in syncreq
return conn.sync_request(handler, proxy, *args)
File "/root/lydian.egg/rpyc/core/protocol.py", line 471, in sync_request
return self.async_request(handler, *args, timeout=timeout).value
File "/root/lydian.egg/rpyc/core/async_.py", line 97, in value
raise self._obj
_get_exception_class.<locals>.Derived: cannot access 'items'
Case 2: Workaround - pass the nested data as a string using json (a poor man's pickle) and decode it at the server end.
>>> jdata = [json.dumps({'a': [1, 2], 'b': "asa"})]
>>> client.hello_json(jdata) # Prints following at remote endpoint.
a [1, 2]
b asa
Case 3: Interestingly, at the first level, builtin items are accessible, as in the case of the hello method. But calling that on nested data gives an error.
>>> client.hello([1,2,3,4]) # Prints following at remote endpoint.
Collection is [1, 2, 3, 4]
Collection type is list
Item type is int
1
Item type is int
2
Item type is int
3
Item type is int
4
I have a workaround/solution to the problem (case 2 above), but I'm looking for an explanation of why this is not allowed, or whether it is a bug. Thanks for any input.
The issue is not related to nested data.
Your problem is that you are not allowing public attributes on the client side.
The solution is simple:
c = rpyc.connect('a.b.c.d', 3655, config={'allow_public_attrs': True})
Keep in mind that rpyc is a symmetric protocol (see https://rpyc.readthedocs.io/en/latest/docs/services.html#decoupled-services).
In your case, the server tries to access the client's object, so allow_public_attrs must be set on the client side.
Actually, for your specific example, there is no need to set allow_public_attrs on the server side.
Regarding case 3:
In the line for item in collection:, the server tries to access two fields: collection.__iter__ and collection.__next__.
Both of these fields are considered "safe attributes" by default, which is why you didn't get an error there.
To inspect the default configuration dictionary in rpyc:
>>> import rpyc
>>> rpyc.core.protocol.DEFAULT_CONFIG
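For completeness, a minimal client-side sketch of the suggested fix (the host, port, and data mirror the question; exact DEFAULT_CONFIG contents may vary between rpyc versions):

import rpyc

# Allow the remote side to access public attributes of objects we pass to it
c = rpyc.connect('a.b.c.d', 3655, config={'allow_public_attrs': True})
client = c.root

data = [{'a': [1, 2], 'b': 'asa'}]
client.hello2(data)   # the server can now call item.items() on the proxied dicts

# __iter__ / __next__ are treated as safe attributes by default, which is why
# hello() already worked without any extra configuration
print(rpyc.core.protocol.DEFAULT_CONFIG['allow_public_attrs'])   # False unless overridden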

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parallelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial

def write_query_batch(session, batch):
    tx = session.transaction().write()
    for query in batch:
        tx.query(query)
    tx.commit()

def multi_thread_write_query_batches(session, query_batches, num_threads=8):
    pool = ThreadPool(num_threads)
    pool.map(partial(write_query_batch, session), query_batches)
    pool.close()
    pool.join()

def generate_query_batches(my_data_entries_list, batch_size):
    batch = []
    for index, data_entry in enumerate(my_data_entries_list):
        batch.append(data_entry)
        if index % batch_size == 0 and index != 0:
            yield batch
            batch = []
    if batch:
        yield batch

# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")

query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)

session.close()
client.close()
The above is a generic method. As a concrete example, you can use the above (omitting part 2) to parallelise batches of insert statements from two files. Appending this to the above should work:
import time  # needed for the timing below

files = [
    {
        "file_path": f"/path/to/your/file.gql",
    },
    {
        "file_path": f"/path/to/your/file2.gql",
    }
]

KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000

# Entry point where migration starts
def migrate_graql_files():
    start_time = time.time()

    for file in files:
        print('==================================================')
        print(f'Loading from {file["file_path"]}')
        print('==================================================')

        open_file = open(file["file_path"], "r")  # Here we are assuming you have 1 Graql query per line!
        batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)

        with GraknClient(uri=URI) as client:  # Using `with` auto-closes the client
            with client.session(KEYSPACE) as session:  # Using `with` auto-closes the session
                multi_thread_write_query_batches(session, batches, num_threads=16)  # Pick `num_threads` according to your machine

        elapsed = time.time() - start_time
        print(f'Time elapsed {elapsed:.1f} seconds')

    elapsed = time.time() - start_time
    print(f'Time elapsed {elapsed:.1f} seconds')

if __name__ == "__main__":
    migrate_graql_files()
You should also be able to see how you can load from a CSV or any other file type in this way, taking the values you find in that file and substituting them into Graql query string templates (see the sketch below). Take a look at the migration example in the docs for more on that.
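For illustration, a minimal sketch of that CSV-to-template idea; the person schema and the name/age column names are invented for the example, and the resulting strings can be fed to generate_query_batches exactly like the lines read from the .gql files above:

import csv

def csv_to_insert_queries(csv_path):
    """Turn each CSV row into a Graql insert string (hypothetical `person` schema)."""
    queries = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):  # assumes a header row with `name` and `age` columns
            queries.append(
                f'insert $p isa person, has name "{row["name"]}", has age {row["age"]};'
            )
    return queries

# batches = generate_query_batches(csv_to_insert_queries("/path/to/people.csv"), BATCH_SIZE)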
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue

def batch_writer(database, kill_event, batch_queue):
    client = TypeDB.core_client("localhost:1729")
    session = client.session(database, SessionType.DATA)
    while not kill_event.is_set():
        try:
            batch = batch_queue.get(block=True, timeout=1)
            with session.transaction(TransactionType.WRITE) as tx:
                for query in batch:
                    tx.query().insert(query)
                tx.commit()
        except queue.Empty:
            continue
    print("Received kill event, exiting worker.")

def start_writers(database, kill_event, batch_queue, parallelism=4):
    processes = []
    for _ in range(parallelism):
        proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
        processes.append(proc)
        proc.start()
    return processes

def batch(iterable, n=1000):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

if __name__ == '__main__':
    batch_size = 100
    parallelism = 1
    database = "<database name>"
    file_path = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"

    with open(file_path, "r") as file:
        statements = file.read().splitlines()[:]

    batch_statements = batch(statements, n=batch_size)
    total_batches = int(len(statements) / batch_size)
    if total_batches % batch_size > 0:
        total_batches += 1

    batch_queue = mp.Queue(parallelism * 4)
    kill_event = mp.Event()
    writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)

    for i, batch in enumerate(batch_statements):
        batch_queue.put(batch, block=True)
        if i * batch_size % 10000 == 0:
            print("Loaded: {0}/{1}".format(i * batch_size, total_batches * batch_size))

    kill_event.set()
    batch_queue.close()
    batch_queue.join_thread()
    for proc in writers:
        proc.join()
    print("Done loading")

I am getting the error: ValueError: not enough values to unpack (expected 2, got 1). What could be the problem?

I'm running a MapReduce job, but it keeps failing, saying input is missing. Unfortunately, it's not showing where it's missing either.
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

class flight_combination(MRJob):
    def steps(self):
        return [MRStep(mapper=self.mapper_1, reducer=self.reducer_1)]

    def mapper_1(self, _, value):
        group1 = {}
        group2 = {}
        parts = value.split(",")
        destination = parts[0]
        origin = parts[1]
        count = parts[2]
        group1[destination] = {'Origin': origin, 'count': count}
        group2[origin] = {'Destination': destination, 'count': count}
        yield group1
        yield group2

    def reducer_1(self, key, value):
        g1, g2 = data
        for key1 in g1:
            for key2 in g2:
                if g1[key1]['Origin'] == g2[key2]['Destination']:
                    total = int(g1[key1]['count']) * int(g2[key2]['count'])
                    yield (key1, key2, total)

if __name__ == '__main__':
    flight_combination.run()
Following is the error:
`File "wd.py", line 35, in <module>
flight_combination.run()
…...
File "/usr/lib/python3.6/site-packages/mrjob/job.py", line 536, in run_mapper
for out_key, out_value in mapper(key, value) or ():
ValueError: not enough values to unpack (expected 2, got 1)`
The run method of the flight_combination type expects 2 arguments but is provided with only 1. (Python by default passes self as the first argument for methods invoked on an object.)
To fix this:
As the method run is defined in the parent class, go through its definition and pass the other argument.
Or override the run method by redefining it in the flight_combination class and providing your own logic.

Exception Handling Error: TypeError: can only concatenate list (not "odict_keys") to list

I was running pytest and using someone else's exception handling library. It is supposed to run on an older version of Python, not sure which one. However, when I try to run it with Python 3, it spits out an error that I don't understand, and for some reason I have trouble finding the meaning of the error's keyword (odict_keys) on the web.
The following is the output from pytest. The exception handling inside the test_analysis procedure was calling run_with_timeout(timeoutwrapper_analysis, max_seconds_per_call, (), {}) before the error occurred. Inside run_with_timeout, the error happened when it raised e as an exception:
#pytest.mark.parametrize("inputs,outputs,description", portfolio_test_cases)
def test_analysis(inputs, outputs, description, grader):
    """Test get_portfolio_value() and get_portfolio_stats() return correct values.
    Requires test inputs, expected outputs, description, and a grader fixture.
    """
    points_earned = 0.0  # initialize points for this test case
    try:
        # Try to import student code (only once)
        if not main_code in globals():
            import importlib
            # * Import module
            mod = importlib.import_module(main_code)
            globals()[main_code] = mod
        # Unpack test case
        start_date_str = inputs['start_date'].split('-')
        start_date = datetime.datetime(int(start_date_str[0]), int(start_date_str[1]), int(start_date_str[2]))
        end_date_str = inputs['end_date'].split('-')
        end_date = datetime.datetime(int(end_date_str[0]), int(end_date_str[1]), int(end_date_str[2]))
        symbols = inputs['symbol_allocs'].keys()   # e.g.: ['GOOG', 'AAPL', 'GLD', 'XOM']
        allocs = inputs['symbol_allocs'].values()  # e.g.: [0.2, 0.3, 0.4, 0.1]
        start_val = inputs['start_val']
        risk_free_rate = inputs.get('risk_free_rate', 0.0)

        # the wonky unpacking here is so that we only pull out the values we say we'll test.
        def timeoutwrapper_analysis():
            student_rv = analysis.assess_portfolio(\
                sd=start_date, ed=end_date,\
                syms=symbols,\
                allocs=allocs,\
                sv=start_val, rfr=risk_free_rate, sf=252.0, \
                gen_plot=False)
            return student_rv

        # Error happens in the following line:
        result = run_with_timeout(timeoutwrapper_analysis, max_seconds_per_call, (), {})

grade_analysis.py:176:
func = <function test_analysis.<locals>.timeoutwrapper_analysis at 0x7f8c458347b8>, timeout_seconds = 5, pos_args = (), keyword_args = {}
def run_with_timeout(func, timeout_seconds, pos_args, keyword_args):
    rv_dict = timeout_manager.dict()
    p = multiprocessing.Process(target=proc_wrapper, args=(func, rv_dict, pos_args, keyword_args))
    p.start()
    p.join(timeout_seconds)
    if p.is_alive():
        p.terminate()
        raise TimeoutException("Exceeded time limit!")
    if not ('output' in rv_dict):
        if 'exception' in rv_dict:
            e = rv_dict['exception']
            e.grading_traceback = None
            if 'traceback' in rv_dict:
                e.grading_traceback = rv_dict['traceback']
            # Error occurred after the following line:
            raise e
E           TypeError: can only concatenate list (not "odict_keys") to list

grading.py:134: TypeError
It looks like the script didn't like the
raise e
statement. What does that have to do with odict_keys?
Regards
Two of the inputs of analysis.assess_portfolio, symbols and allocs, are of type odict_keys and odict_values. Apparently this worked in Python 2.7; however, when running with Python 3 they need to be lists, so changing the input statements from
symbols = inputs['symbol_allocs'].keys()
allocs = inputs['symbol_allocs'].values()
to
symbols = list(inputs['symbol_allocs'].keys())
allocs = list(inputs['symbol_allocs'].values())
fixed it.
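To see why, a minimal reproduction sketch (assuming assess_portfolio concatenates a list with these inputs somewhere, which the error message suggests): in Python 3, keys() and values() return view objects (odict_keys / odict_values for an OrderedDict) rather than lists, and concatenating a list with such a view raises exactly this TypeError:

from collections import OrderedDict

symbol_allocs = OrderedDict([('GOOG', 0.2), ('AAPL', 0.3), ('GLD', 0.4), ('XOM', 0.1)])
symbols = symbol_allocs.keys()       # odict_keys view in Python 3 (a plain list in Python 2)

try:
    ['SPY'] + symbols                # e.g. prepending a benchmark symbol to the symbol list
except TypeError as e:
    print(e)                         # can only concatenate list (not "odict_keys") to list

print(['SPY'] + list(symbols))       # converting the view to a list fixes it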

Error while getting list from another process

I wanted to write a small program that would simulate lottery winning chances for me. After that, I wanted to make it a bit faster by implementing multiprocessing, like this.
But two weird behaviors started.
import random as r
from multiprocessing.pool import ThreadPool

# winnerSequence = []
# mCombinations = []
howManyLists = 5
howManyTry = 1000000
combinations = 720/10068347520
possbilesNumConstantsConstant = []
for x in range(1, 50):
    possbilesNumConstantsConstant.append(x)

def getTicket():
    possbilesNumConstants = list(possbilesNumConstantsConstant)
    toReturn = []
    possiblesNum = list(possbilesNumConstants)
    for x in range(6):
        choice = r.choice(possiblesNum)
        toReturn.append(choice)
        possiblesNum.remove(choice)
    toReturn.sort()
    return toReturn

def sliceRange(rangeNum, num):
    """returns list of smaller ranges"""
    toReturn = []
    rest = rangeNum % num
    print(rest)
    toSlice = rangeNum - rest
    print(toSlice)
    n = toSlice / num
    print(n)
    for x in range(num):
        toReturn.append((int(n*x), int(n*(x+1)-1)))
    print(toReturn, "<---range")
    return toReturn

def Job(tupleRange):
    """Job returns list of tickets"""
    toReturn = list()
    print(tupleRange, "Start")
    for x in range(int(tupleRange[0]), int(tupleRange[1])):
        toReturn.append(getTicket())
    print(tupleRange, "End")
    return toReturn

result = list()
The first one: when I add Job(tupleRange) to the pool, it looks like the job is done in the main thread before another job is added to the pool.
def start():
    """this fun() starts the program"""
    # create pool of threads
    pool = ThreadPool(processes=howManyLists)
    # create list of tuples with smaller pieces of the range
    lista = sliceRange(howManyTry, howManyLists)
    # create list for storing job objects
    jobList = list()
    for tupleRange in lista:
        # add job to pool
        jobToList = pool.apply_async(Job(tupleRange))
        # add returned object to list for future callback
        jobList.append(jobToList)
        print('Adding to pool', tupleRange)
    # for all jobs in list get returned tickets
    for job in jobList:
        # print(job.get())
        result.extend(job.get())

if __name__ == '__main__':
    start()
Console output:
[(0, 199999), (200000, 399999), (400000, 599999), (600000, 799999), (800000, 999999)] <---range
(0, 199999) Start
(0, 199999) End
Adding to pool (0, 199999)
(200000, 399999) Start
(200000, 399999) End
Adding to pool (200000, 399999)
(400000, 599999) Start
(400000, 599999) End
And the second one: when I want to get data from the thread, I get this exception on this line:
for job in jobList:
    # print(job.get())
    result.extend(job.get())  # <---- this line

File "C:/Users/CrazyUrusai/PycharmProjects/TestLotka/main/kopia.py", line 79, in start
    result.extend(job.get())
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
TypeError: 'list' object is not callable
Can somebody explain this to me? (I am new to multiprocessing.)
The problem is here:
jobToList = pool.apply_async(Job(tupleRange))
Job(tupleRange) executes first, then apply_async gets the returned value, which is of list type (as Job returns a list). There are two problems here: this code is synchronous, and apply_async gets a list instead of the function it expects. So it tries to execute the given list as a job and fails.
That's the signature of pool.apply_async:
def apply_async(self, func, args=(), kwds={}, callback=None,
                error_callback=None):
    ...
...
So you should pass func and its arguments args to this function separately, and you shouldn't execute the function before sending it to the pool.
I fixed this line and your code worked for me:
jobToList = pool.apply_async(Job, (tupleRange, ))
Or, with explicitly named args,
jobToList = pool.apply_async(func=Job, args=(tupleRange, ))
Don't forget to wrap the function arguments in a tuple.
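For reference, a minimal self-contained sketch of the corrected pattern (a hypothetical job() stands in for Job(); the function and its arguments are submitted separately, and results are collected with get() only after everything has been queued):

from multiprocessing.pool import ThreadPool

def job(tuple_range):
    """Hypothetical stand-in for Job(): returns the squares of the numbers in the given range."""
    return [x * x for x in range(tuple_range[0], tuple_range[1])]

if __name__ == '__main__':
    with ThreadPool(processes=4) as pool:
        # Submit the function itself plus its argument tuple; nothing runs in the main thread here
        async_jobs = [pool.apply_async(job, (rng,)) for rng in [(0, 5), (5, 10), (10, 15)]]
        results = []
        for async_job in async_jobs:
            results.extend(async_job.get())  # get() blocks until that particular job has finished
    print(results)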
