Return results from asyncio.get_event_loop - python-3.x

I'm new to using the asyncio module. I have the following code that is querying a service to return IDs. How do I set a variable to return the results from the 'findIntersectingFeatures' function?
Also, how can I have the print statements execute after run_in_executor has finished. They are currently printing immediately after the first iteration.
import json, requests, time
import asyncio
startTime = time.clock()
out_json = "UML10kmbuffer.json"
intersections = []
def findIntersectingFeatures(coordinate):
coordinates = '{"rings":' + str(coordinate) + '}'
forestCoverURL = 'http://server1.ags.com/server/rest/services/Forest_Cover/MapServer/0/query'
params = {'f': 'json', 'where': "1=1", 'outFields': '*', 'geometry': coordinates, 'geometryType': 'esriGeometryPolygon', 'returnIdsOnly': 'true'}
r = requests.post(forestCoverURL, data = params, verify=False)
response = json.loads(r.content)
if response['objectIds'] != None:
intersections.append(response['objectIds'])
return intersections
with open(out_json, "r") as f_in:
for line in f_in:
json_res = json.loads(line)
coordinates = []
# Get features
feat_json = json_res["features"]
for item in feat_json:
coordinates.append(item["geometry"]["rings"])
loop = asyncio.get_event_loop()
for coordinate in coordinates:
loop.run_in_executor(None, findIntersectingFeatures, coordinate)
print("Intersecting Features: " + str(intersections))
endTime = time.clock()
elapsedTime =(endTime - startTime) / 60
print("Elapsed Time: " + str(elapsedTime))

To use asyncio, you shouldn't just get the event loop, you must also run it. You can use run_until_complete to run a coroutine to completion. Since you need to run many coroutines in parallel, you can use asyncio.gather to combine them into a single parallel task:
coroutines = []
for coordinate in coordinates:
coroutines.append(loop.run_in_executor(
None, findIntersectingFeatures, coordinate))
intersections = loop.run_until_complete(asyncio.gather(*coroutines))
Also, how can I have the print statements execute after run_in_executor has finished.
You can await the call to run_in_executor and place your print after it:
def find_features(coordinate):
inter = await loop.run_in_executor(None, findInterestingFeatures, coordinate)
print('found', inter)
return inter
# in the for loop, replace coroutines.append(loop.run_in_executor(...))
# with coroutines.append(find_features(coordinate)).

Related

How best to parallelize grakn queries with Python?

I run Windows 10, Python 3.7, and have a 6-core CPU. A single Python thread on my machine submits 1,000 inserts per second to grakn. I'd like to parallelize my code to insert and match even faster. How are people doing this?
My only experience with parellelization is on another project, where I submit a custom function to a dask distributed client to generate thousands of tasks. Right now, this same approach fails whenever the custom function receives or generates a grakn transaction object/handle. I get errors like:
Traceback (most recent call last):
File "C:\Users\dvyd\.conda\envs\activefiction\lib\site-packages\distributed\protocol\pickle.py", line 41, in dumps
return cloudpickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
...
File "stringsource", line 2, in grpc._cython.cygrpc.Channel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
I've never used Python's multiprocessing module directly. What are other people doing to parallelize their queries to grakn?
The easiest approach that I've found to execute a batch of queries is to pass a Grakn session to each thread in a ThreadPool. Within each thread you can manage transactions and of course do some more complex logic:
from grakn.client import GraknClient
from multiprocessing.dummy import Pool as ThreadPool
from functools import partial
def write_query_batch(session, batch):
tx = session.transaction().write()
for query in batch:
tx.query(query)
tx.commit()
def multi_thread_write_query_batches(session, query_batches, num_threads=8):
pool = ThreadPool(num_threads)
pool.map(partial(write_query_batch, session), query_batches)
pool.close()
pool.join()
def generate_query_batches(my_data_entries_list, batch_size):
batch = []
for index, data_entry in enumerate(my_data_entries_list):
batch.append(data_entry)
if index % batch_size == 0 and index != 0:
yield batch
batch = []
if batch:
yield batch
# (Part 2) Somewhere in your application open a client and a session
client = GraknClient(uri="localhost:48555")
session = client.session(keyspace="grakn")
query_batches_iterator = generate_query_batches(my_data_entries_list, batch_size)
multi_thread_write_query_batches(session, query_batches_iterator, num_threads=8)
session.close()
client.close()
The above is a generic method. As a concrete example, you can use the above (omitting part 2) to parallelise batches of insert statements from two files. Appending this to the above should work:
files = [
{
"file_path": f"/path/to/your/file.gql",
},
{
"file_path": f"/path/to/your/file2.gql",
}
]
KEYSPACE = "grakn"
URI = "localhost:48555"
BATCH_SIZE = 10
NUM_BATCHES = 1000
# ​Entry point where migration starts
def migrate_graql_files():
start_time = time.time()
for file in files:
print('==================================================')
print(f'Loading from {file["file_path"]}')
print('==================================================')
open_file = open(file["file_path"], "r") # Here we are assuming you have 1 Graql query per line!
batches = generate_query_batches(open_file.readlines(), BATCH_SIZE)
with GraknClient(uri=URI) as client: # Using `with` auto-closes the client
with client.session(KEYSPACE) as session: # Using `with` auto-closes the session
multi_thread_write_query_batches(session, batches, num_threads=16) # Pick `num_threads` according to your machine
elapsed = time.time() - start_time
print(f'Time elapsed {elapsed:.1f} seconds')
elapsed = time.time() - start_time
print(f'Time elapsed {elapsed:.1f} seconds')
if __name__ == "__main__":
migrate_graql_files()
You should also be able to see how you can load from a csv or any other file type in this way, but taking the values you find in that file and substitution them into Graql query string templates. Take a look at the migration example in the docs for more on that.
An alternative approach using multi-processing instead of multi-threading follows below.
We empirically found that multi-threading doesn't yield particularly large performance gains, compared to multi-processing. This is probably due to Python's GIL.
This piece of code assumes a file enumerating TypeQL queries that are independent of each other, so they can be parallelised freely.
from typedb.client import TypeDB, TypeDBClient, SessionType, TransactionType
import multiprocessing as mp
import queue
def batch_writer(database, kill_event, batch_queue):
client = TypeDB.core_client("localhost:1729")
session = client.session(database, SessionType.DATA)
while not kill_event.is_set():
try:
batch = batch_queue.get(block=True, timeout=1)
with session.transaction(TransactionType.WRITE) as tx:
for query in batch:
tx.query().insert(query)
tx.commit()
except queue.Empty:
continue
print("Received kill event, exiting worker.")
def start_writers(database, kill_event, batch_queue, parallelism=4):
processes = []
for _ in range(parallelism):
proc = mp.Process(target=batch_writer, args=(database, kill_event, batch_queue))
processes.append(proc)
proc.start()
return processes
def batch(iterable, n=1000):
l = len(iterable)
for ndx in range(0, l, n):
yield iterable[ndx:min(ndx + n, l)]
if __name__ == '__main__':
batch_size = 100
parallelism = 1
database = "<database name>"
# filePath = "<PATH TO QUERIES FILE - ONE QUERY PER NEW LINE>"
with open(file_path, "r") as file:
statements = file.read().splitlines()[:]
batch_statements = batch(statements, n=batch_size)
total_batches = int(len(statements) / batch_size)
if total_batches % batch_size > 0:
total_batches += 1
batch_queue = mp.Queue(parallelism * 4)
kill_event = mp.Event()
writers = start_writers(database, kill_event, batch_queue, parallelism=parallelism)
for i, batch in enumerate(batch_statements):
batch_queue.put(batch, block=True)
if i*batch_size % 10000 == 0:
print("Loaded: {0}/{1}".format(i*batch_size, total_batches*batch_size))
kill_event.set()
batch_queue.close()
batch_queue.join_thread()
for proc in writers:
proc.join()
print("Done loading")

Python 3 Multiprocessing and openCV problem with dictionary sharing between processor

I would like to use multiprocessing to compute the SIFT extraction and SIFT matching for object detection.
For now, I have a problem with the return value of the function that does not insert data in the dictionary.
I'm using Manager class and image that are open inside the function. But does not work.
Finally, my idea is:
Computer the keypoint for every reference image, use this keypoint as a parameter of a second function that compares and match with the keypoint and descriptors of the test image.
My code is:
# %% Import Section
import cv2
import numpy as np
from matplotlib import pyplot as plt
import os
from datetime import datetime
from multiprocessing import Process, cpu_count, Manager, Lock
import argparse
# %% path section
tests_path = 'TestImages/'
references_path = 'ReferenceImages2/'
result_path = 'ResultParametrizer/'
#%% Number of processor
cpus = cpu_count()
# %% parameter section
eps = 1e-7
useTwo = False # using the m and n keypoint better with False
# good point parameters
distanca_coefficient = 0.75
# gms parameter
gms_thresholdFactor = 3
gms_withRotation = True
gms_withScale = True
# flann parameter
flann_trees = 5
flann_checks = 50
#%% Locker
lock = Lock()
# %% function definition
def keypointToDictionaries(keypoint):
x, y = keypoint.pt
pt = float(x), float(y)
angle = float(keypoint.angle) if keypoint.angle is not None else None
size = float(keypoint.size) if keypoint.size is not None else None
response = float(keypoint.response) if keypoint.response is not None else None
class_id = int(keypoint.class_id) if keypoint.class_id is not None else None
octave = int(keypoint.octave) if keypoint.octave is not None else None
return {
'point': pt,
'angle': angle,
'size': size,
'response': response,
'class_id': class_id,
'octave': octave
}
def dictionariesToKeypoint(dictionary):
kp = cv2.KeyPoint()
kp.pt = dictionary['pt']
kp.angle = dictionary['angle']
kp.size = dictionary['size']
kp.response = dictionary['response']
kp.octave = dictionary['octave']
kp.class_id = dictionary['class_id']
return kp
def rootSIFT(dictionary, image_name, image_path,eps=eps):
# SIFT init
image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
sift = cv2.xfeatures2d.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
descriptors /= (descriptors.sum(axis=1, keepdims=True) + eps)
descriptors = np.sqrt(descriptors)
print('Finito di calcolare, PID: ', os.getpid())
lock.acquire()
dictionary[image_name]['keypoints'] = keypoints
dictionary[image_name]['descriptors'] = descriptors
lock.release()
def featureMatching(reference_image, reference_descriptors, reference_keypoints, test_image, test_descriptors,
test_keypoints, flann_trees=flann_trees, flann_checks=flann_checks):
# FLANN parameter
FLANN_INDEX_KDTREE = 1
index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=flann_trees)
search_params = dict(checks=flann_checks) # or pass empty dictionary
flann = cv2.FlannBasedMatcher(index_params, search_params)
flann_matches = flann.knnMatch(reference_descriptors, test_descriptors, k=2)
matches_copy = []
for i, (m, n) in enumerate(flann_matches):
if m.distance < distanca_coefficient * n.distance:
matches_copy.append(m)
gsm_matches = cv2.xfeatures2d.matchGMS(reference_image.shape, test_image.shape, keypoints1=reference_keypoints,
keypoints2=test_keypoints, matches1to2=matches_copy,
withRotation=gms_withRotation, withScale=gms_withScale,
thresholdFactor=gms_thresholdFactor)
#%% Starting reference list file creation
reference_init = datetime.now()
print('Start reference file list creation')
reference_image_process_list = []
manager = Manager()
reference_image_dictionary = manager.dict()
reference_image_list = manager.list()
for root, directories, files in os.walk(references_path):
for file in files:
if file.endswith('.DS_Store'):
continue
reference_image_path = os.path.join(root, file)
reference_name = file.split('.')[0]
image = cv2.imread(reference_image_path, cv2.IMREAD_GRAYSCALE)
reference_image_dictionary[reference_name] = {
'image': image,
'keypoints': None,
'descriptors': None
}
proc = Process(target=rootSIFT, args=(reference_image_list, reference_name, reference_image_path))
reference_image_process_list.append(proc)
proc.start()
for proc in reference_image_process_list:
proc.join()
reference_end = datetime.now()
reference_time = reference_end - reference_init
print('End reference file list creation, time required: ', reference_time)
I faced pretty much the same error. It seems that the code hangs at detectAndCompute in my case, not when creating the dictionary. For some reason, sift feature extraction is not multi-processing safe (to my understanding, it is the case in Macs but I am not totally sure.)
I found this in a github thread. Many people say it works but I couldn't get it worked. (Edit: I tried this later which works fine)
Instead I used multithreading which is pretty much the same code and works perfectly. Of course you need to take multithreading vs multiprocessing into account

Error while getting list from another process

I wanted to write small program that would symulate for me lottery winning chances. After that i wanted to make it a bit faster by implemening multiprocessing like this
But two weird behaviors started
import random as r
from multiprocessing.pool import ThreadPool
# winnerSequence = []
# mCombinations = []
howManyLists = 5
howManyTry = 1000000
combinations = 720/10068347520
possbilesNumConstantsConstant = []
for x in range(1, 50):
possbilesNumConstantsConstant.append(x)
def getTicket():
possbilesNumConstants = list(possbilesNumConstantsConstant)
toReturn = []
possiblesNum = list(possbilesNumConstants)
for x in range(6):
choice = r.choice(possiblesNum)
toReturn.append(choice)
possiblesNum.remove(choice)
toReturn.sort()
return toReturn
def sliceRange(rangeNum,num):
"""returns list of smaller ranges"""
toReturn = []
rest = rangeNum%num
print(rest)
toSlice = rangeNum - rest
print(toSlice)
n = toSlice/num
print(n)
for x in range(num):
toReturn.append((int(n*x),int(n*(x+1)-1)))
print(toReturn,"<---range")
return toReturn
def Job(tupleRange):
"""Job returns list of tickets """
toReturn = list()
print(tupleRange,"Start")
for x in range(int(tupleRange[0]),int(tupleRange[1])):
toReturn.append(getTicket())
print(tupleRange,"End")
return toReturn
result = list()
First one when i add Job(tupleRange) to pool it looks like job is done in main thread before another job is added to pool
def start():
"""this fun() starts program"""
#create pool of threads
pool = ThreadPool(processes = howManyLists)
#create list of tuples with smaller piece of range
lista = sliceRange(howManyTry,howManyLists)
#create list for storing job objects
jobList = list()
for tupleRange in lista:
#add job to pool
jobToList = pool.apply_async(Job(tupleRange))
#add retured object to list for future callback
jobList.append(jobToList)
print('Adding to pool',tupleRange)
#for all jobs in list get returned tickes
for job in jobList:
#print(job.get())
result.extend(job.get())
if __name__ == '__main__':
start()
Consol output
[(0, 199999), (200000, 399999), (400000, 599999), (600000, 799999), (800000, 999999)] <---range
(0, 199999) Start
(0, 199999) End
Adding to pool (0, 199999)
(200000, 399999) Start
(200000, 399999) End
Adding to pool (200000, 399999)
(400000, 599999) Start
(400000, 599999) End
and second one when i want to get data from thread i got this exception on this line
for job in jobList:
#print(job.get())
result.extend(job.get()) #<---- this line
File "C:/Users/CrazyUrusai/PycharmProjects/TestLotka/main/kopia.py", line 79, in start
result.extend(job.get())
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
raise self._value
File "C:\Users\CrazyUrusai\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
TypeError: 'list' object is not callable
Can sombody explain this to me?(i am new to multiprocessing)
The problem is here:
jobToList = pool.apply_async(Job(tupleRange))
Job(tupleRange) executes first, then apply_async gets some returned value, list type (as Job returns list). There are two problems here: this code is synchronous and async_apply gets list instead of job it expects. So it try to execute given list as a job but fails.
That's a signature of pool.apply_async:
def apply_async(self, func, args=(), kwds={}, callback=None,
error_callback=None):
...
So, you should send func and arguments args to this function separately, and shouldn't execute the function before you will send it to the pool.
I fix this line and your code have worked for me:
jobToList = pool.apply_async(Job, (tupleRange, ))
Or, with explicitly named args,
jobToList = pool.apply_async(func=Job, args=(tupleRange, ))
Don't forget to wrap function arguments in tuple or so.

python3 multiprocessing.Process approach fails

I saw somewhere a hint on how to process a large dataset (say lines of text) faster with the multiprocessing module, something like:
... (form batch_set = nump batches [= lists of lines to process], batch_set
is a list of lists of strings (batches))
nump = len(batch_set)
output = mp.Queue()
processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i])) for i in range(nump)]
for p in processes:
p.start()
for p in processes:
p.join()
results = sorted([output.get() for p in processes])
... (do something with the processed outputs, ex print them in order,
given that each proc_lines function returns a couple (i, out_batch))
However, when i run the code with a small number of lines/batch it works fine
[ex: './code.py -x 4:10' for nump=4 and numb=10 (lines/batch)] while after a
certain number of lines/batch is hangs [ex: './code.py -x 4:4000'] and when i
interrupt it i see a traceback hint about a _wait_for_tstate_lock and the system
threading library. It seems that the code does not reach the last shown code
line above...
I provide the code below, in case somebody needs it to answer why this is
happening and how to fix it.
#!/usr/bin/env python3
import sys
import multiprocessing as mp
def fabl(numb, nump):
'''
Form And Batch Lines: form nump[roc] groups of numb[atch] indexed lines
'<idx> my line here' with indexes from 1 to (nump x numb).
'''
ret = []
idx = 1
for _ in range(nump):
cb = []
for _ in range(numb):
cb.append('%07d my line here' % idx)
idx += 1
ret.append(cb)
return ret
def proc_lines(i, output, rows_in):
ret = []
for row in rows_in:
row = row[0:8] + 'some other stuff\n' # replacement for the post-idx part
ret.append(row)
output.put((i,ret))
return
def mp_proc(batch_set):
'given the batch, disperse it to the number of processes and ret the results'
nump = len(batch_set)
output = mp.Queue()
processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i])) for i in range(nump)]
for p in processes:
p.start()
for p in processes:
p.join()
print('waiting for procs to complete...')
results = sorted([output.get() for p in processes])
return results
def write_set(proc_batch_set, fout):
'write p[rocessed]batch_set'
for _, out_batch in proc_batch_set:
for row in out_batch:
fout.write(row)
return
def main():
args = sys.argv
if len(args) < 2:
print('''
run with args: -x [ NumProc:BatchSize ]
( ex: '-x' | '-x 4:10' (default values) | '-x 4:4000' (hangs...) )
''')
sys.exit(0)
numb = 10 # suppose we need this number of lines/batch : BatchSize
nump = 4 # number of processes to use. : NumProcs
if len(args) > 2 and ':' in args[2]: # use another np:bs
nump, numb = map(int, args[2].split(':'))
batch_set = fabl(numb, nump) # proc-batch made in here: nump (groups) x numb (lines)
proc_batch_set = mp_proc(batch_set)
with open('out-min', 'wt') as fout:
write_set(proc_batch_set, fout)
return
if __name__ == '__main__':
main()
The Queue have a certain capacity and can get full if you do not empty it while the Process are running. This does not block the execution of your processes but you won't be able to join the Process if the put did not complete.
So I would just modify the mp_proc function such that:
def mp_proc(batch_set):
'given the batch, disperse it to the number of processes and ret the results'
n_process = len(batch_set)
output = mp.Queue()
processes = [mp.Process(target=proc_lines, args=(i, output, batch_set[i]))
for i in range(process)]
for p in processes:
p.start()
# Empty the queue while the processes are running so there is no
# issue with uncomplete `put` operations.
results = sorted([output.get() for p in processes])
# Join the process to make sure everything finished correctly
for p in processes:
p.join()
return results

Python Multiprocessing throwing out results based on previous values

I am trying to learn how to use multiprocessing and have managed to get the code below to work. The goal is to work through every combination of the variables within the CostlyFunction by setting n equal to some number (right now it is 100 so the first 100 combinations are tested). I was hoping I could manipulate w as each process returned its list (CostlyFunction returns a list of 7 values) and only keep the results in a given range. Right now, w holds all 100 lists and then lets me manipulate those lists but, when I use n=10MM, w becomes huge and costly to hold. Is there a way to evaluate CostlyFunction's output as the workers return values and then 'throw out' values I don't need?
if __name__ == "__main__":
import csv
csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
#width = -36000000/1000
#fronteir = [None]*1000
currtime = time()
n=100
po = Pool()
res = po.map_async(CostlyFunction,((i,) for i in range(n)))
w = res.get()
spamwriter = csv.writer(csvFile, delimiter=',')
spamwriter.writerows(w)
print(('2: parallel: time elapsed:', time() - currtime))
csvFile.close()
Unfortunately, Pool doesn't have a 'filter' method; otherwise, you might've been able to prune your results before they're returned. Pool.imap is probably the best solution you'll find for dealing with your memory issue: it returns an iterator over the results from CostlyFunction.
For sorting through the results, I made a simple list-based class called TopList that stores a fixed number of items. All of its items are the highest-ranked according to a key function.
from collections import Userlist
def keyfunc(a):
return a[5] # This would be the sixth item in a result from CostlyFunction
class TopList(UserList):
def __init__(self, key, *args, cap=10): # cap is the largest number of results
super().__init__(*args) # you want to store
self.cap = cap
self.key = key
def add(self, item):
self.data.append(item)
self.data.sort(key=self.key, reverse=True)
self.data.pop()
Here's how your code might look:
if __name__ == "__main__":
import csv
csvFile = open('C:\\Users\\bryan.j.weiner\\Desktop\\test.csv', 'w', newline='')
n = 100
currtime = time()
po = Pool()
best = TopList(keyfunc)
result_iter = po.imap(CostlyFunction, ((i,) for i in range(n)))
for result in result_iter:
best.add(result)
spamwriter = csv.writer(csvFile, delimiter=',')
spamwriter.writerows(w)
print(('2: parallel: time elapsed:', time() - currtime))
csvFile.close()

Resources