How to run kafka-python consumers in parallel - python-3.x

I have a set of kafka-python consumers that consume from different Kafka topics continuously and in parallel.
My question is: how can I start the consumers in parallel from a single Python script?
And what is the best way to manage (start/stop/monitor) these consumers?
For example, if I write:
run.py
import consumer1, consumer2, consumer3
consumer1.start()
consumer2.start()
consumer3.start()
It just hangs on consumer1.start(), because that call never returns and keeps running.

You can run each consumer in its own thread so that they consume messages in parallel. For example:
import threading

consumer1_thread = threading.Thread(target=consumer1.start, args=())
consumer2_thread = threading.Thread(target=consumer2.start, args=())
consumer3_thread = threading.Thread(target=consumer3.start, args=())
consumer1_thread.start()
consumer2_thread.start()
consumer3_thread.start()
You can then see the logs of each thread in parallel and write some logic to stop an individual thread if required.
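As a rough illustration of the "logic to stop an individual thread" part, here is a minimal sketch that uses a shared threading.Event as a stop flag. The run_consumer function is a hypothetical stand-in for a real kafka-python poll loop, and start_consumers / stop_consumers are names made up for this example; none of them come from kafka-python.

```python
import threading
import time


def run_consumer(name, stop_event, received):
    """Hypothetical stand-in for a consumer loop; a real one would poll Kafka."""
    while not stop_event.is_set():
        received.append(name)  # pretend one message was handled
        time.sleep(0.01)


def start_consumers(names):
    """Start one thread per consumer name and return the handles needed to stop them."""
    stop_event = threading.Event()
    received = []
    threads = [
        threading.Thread(
            target=run_consumer,
            args=(name, stop_event, received),
            name=name,
            daemon=True,
        )
        for name in names
    ]
    for thread in threads:
        thread.start()
    return threads, stop_event, received


def stop_consumers(threads, stop_event, timeout=1.0):
    """Ask every consumer to stop, wait for it, and return the names still alive."""
    stop_event.set()
    for thread in threads:
        thread.join(timeout)
    return [thread.name for thread in threads if thread.is_alive()]
```

Monitoring can be as simple as checking thread.is_alive() periodically. Giving each consumer its own Event instead of a shared one would allow stopping them individually.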

Related

Handle concurrent access in multiple job queues with multiple workers

I have to design a job scheduler for a multi-tenant app. Each tenant will have its own job queue for processing background tasks. There are N workers, each of which listens to all the queues and takes up a job when idle.
e.g.
queue 1 : task - A, B, C
queue 2 : task - D
queue 3 : task - E, F
and I have 3 workers w1, w2, w3, all of which listen to all the queues. This whole design is going to be implemented in AWS.
It is important that each job is processed only once. Since all the workers are reading the queues, how can I prevent several workers from picking up the same job simultaneously?
Also, if the workers read all the queues sequentially, they will keep dequeuing only from the first queue until it is empty. How do I handle this situation?
I initially thought of using an SNS notification when a new task is added to a job queue, but since all workers will receive it, the core problem won't be solved.
For the first concern, SQS handles distributing tasks to individual workers automatically: a message that one worker has received is hidden from the other workers until its visibility timeout expires. Go read about Visibility Timeouts.
If you want to maintain separate queues, you need to put the queue-switching logic in the workers: an infinite loop that iterates over the 3 queues, checks each for new work, and processes only a single chunk / message before switching to the next queue:
while (true) {
    for (queue : queues) {
        message = getMessage(queue)
        if (message != null)
            processMessage(message)
    }
}
Make sure you aren't using long polling, as the worker would then just sit waiting on the first queue.
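The loop above could be sketched in Python roughly as follows. This is only an illustration: the queues are in-memory deques standing in for SQS queues, get_message is a hypothetical non-blocking fetch (the short-polling case), and the loop stops once every queue is empty so that the example terminates, whereas a real worker would keep looping.

```python
from collections import deque


def get_message(queue):
    """Hypothetical non-blocking fetch: return one message, or None if the queue is empty."""
    return queue.popleft() if queue else None


def drain_round_robin(queues):
    """Take at most one message from each queue per pass, until all queues are empty."""
    processed = []
    while any(queues):  # an empty deque is falsy
        for queue in queues:
            message = get_message(queue)
            if message is not None:
                processed.append(message)  # stands in for processMessage(message)
    return processed
```

With the three queues from the question, the tasks are picked up in the order A, D, E, B, F, C, so no single tenant's queue starves the others.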

Asynchronous Communication between few 'loops'

I have 3 classes that represent nearly isolated processes that can be run concurrently (meant to be persistent, like 3 main() loops).
class DataProcess:
    ...
    def runOnce(self):
        ...

class ComputeProcess:
    ...
    def runOnce(self):
        ...

class OtherProcess:
    ...
    def runOnce(self):
        ...
Here's the pattern I'm trying to achieve:
start various streams
start each process
allow each process to publish to any stream
allow each process to listen to any stream (at various points in its loop) and behave accordingly (allow for interruption of its current task or not, etc.)
For example, one 'process' listens for external data. Another process does computation on some of that data. The computation process might be busy for a while, so by the time it comes back to the start and checks the stream, many values may have piled up. I don't want to just use a queue because I don't want to be forced to process each one in order. I'd rather be able to implement logic like, "if there is one or multiple things waiting, just run your process one more time, otherwise go do this interruptible task while you wait for something to show up."
That's like a lot, right? So I was thinking of using an actor model until I discovered RxPy. I saw that a stream is like a subject
from reactivex.subject import BehaviorSubject
newData = BehaviorSubject()
newModel = BehaviorSubject()
Then I thought I'd start a thread for each of my 3 high-level processes:
import threading

threads = {}
threads['data'] = threading.Thread(target=data)
threads['compute'] = threading.Thread(target=compute)
threads['other'] = threading.Thread(target=other)
for thread in threads.values():
    thread.start()
and I thought the functions of those threads should listen to the streams:
def data():
    while True:
        DataProcess().runOnce()  # publishes to stream inside process

def compute():
    def run():
        ComputeProcess().runOnce()
    newData.events.subscribe(run())
    newModel.events.subscribe(run())

def other():
    ''' not done '''
    ComputeProcess().runOnce()
Ok, so that's what I have so far. Is this pattern going to give me what I'm looking for?
Should I use threading in conjunction with rxpy or just use rxpy scheduler stuff to achieve concurrency? If so how?
I hope this question isn't too vague, I suppose I'm looking for the simplest framework where I can have a small number of computational-memory units (like objects because they have internal state) that communicate with each other and work in parallel (or concurrently). At the highest level I want to be able to treat these computational-memory units (which I've called processes above) as like individuals who mostly work on their own stuff but occasionally broadcast or send a message to a specific other individual, requesting information or providing information.
Am I perhaps actually looking for an actor model framework? or is this RxPy setup versatile enough to achieve that without extreme complexity?
Thanks so much!
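For what it's worth, the "if one or multiple things are waiting, just run once more, otherwise do the idle task" logic described above can be sketched with plain threading primitives, independent of RxPy or an actor framework. Everything below (the Stream class, publish, take, compute_step) is a hypothetical illustration and not part of reactivex; it assumes the consumer only cares about the latest value and whether anything arrived.

```python
import threading


class Stream:
    """Hypothetical coalescing stream: keeps the latest value and a count of unseen publishes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0
        self._latest = None

    def publish(self, value):
        """Record a new value; safe to call from any thread."""
        with self._lock:
            self._latest = value
            self._pending += 1

    def take(self):
        """Return (count, latest) for everything published since the last take, then reset."""
        with self._lock:
            count, latest = self._pending, self._latest
            self._pending = 0
            return count, latest


def compute_step(stream, run_once, idle_task):
    """Run once if anything piled up (however many items), otherwise do the idle task."""
    count, latest = stream.take()
    if count:
        run_once(latest)
    else:
        idle_task()
    return count
```

Each high-level process would call something like compute_step at the points in its loop where it is willing to be interrupted.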

Is it a good idea to use ThreadPoolExecutor with one worker?

I have a simple REST service which allows you to create a task. When a client requests a task, it returns a unique task number and starts executing the task in a separate thread. The easiest way to implement it:
import json
from concurrent.futures import ThreadPoolExecutor

class Executor:
    def __init__(self, max_workers=1):
        self.executor = ThreadPoolExecutor(max_workers)

    def execute(self, body, task_number):
        # some logic
        pass

def some_rest_method(request):
    body = json.loads(request.body)
    task_id = generate_task_id()
    Executor(max_workers=1).execute(body, task_id)
    return Response({'taskId': task_id})
Is it a good idea to create a ThreadPoolExecutor with one (!) worker each time, given that one request is one new task (new thread)? Perhaps it is worth putting the tasks in a queue somehow? Maybe the best option is to create a regular thread every time?
Is it a good idea to create each time ThreadPoolExecutor...
No. That completely defeats the purpose of a thread pool. The reason for using a thread pool is so that you don't create and destroy a new thread for every request. Creating and destroying threads is expensive. The idea of a thread pool is that it keeps the "worker thread(s)" alive and re-uses it/them for each next request.
...with just one thread
There's a good use-case for a single-threaded executor, though it probably does not apply to your problem. The use-case is, you need a sequence of tasks to be performed "in the background," but you also need them to be performed sequentially. A single-thread executor will perform the tasks, one after another, in the same order that they were submitted.
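A small sketch of that ordering property in Python, using concurrent.futures. The run_in_order helper is made up for this example; it submits a few tasks to a one-worker pool and records the order in which they actually ran.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_order(items):
    """Submit one task per item to a single-worker pool and record execution order."""
    order = []
    # Leaving the with-block waits for all submitted tasks to finish.
    with ThreadPoolExecutor(max_workers=1) as executor:
        for item in items:
            executor.submit(order.append, item)
    return order
```

Because there is only one worker pulling from the pool's internal queue, the recorded order matches the submission order.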
Perhaps it is worth putting them in the queue somehow?
You already are putting them in a queue. Every thread pool has a queue of pending tasks. When you submit a task (i.e., executor.execute(...)) that puts the task into the queue.
what's the best way...in my case?
The bones of a simplistic server look something like this (pseudo-code):
POOL = ThreadPoolExecutor(...with however many threads seem appropriate...)

def service():
    socket = create_a_socket_that_listens_on_whatever_port()
    while True:
        client_connection = socket.accept()
        POOL.submit(request_handler, connection=client_connection)

def request_handler(connection):
    request = receive_request_from(connection)
    reply = generate_reply_based_on(request)
    send_reply_to(reply, connection)
    connection.close()

def main():
    initialize_stuff()
    service()
Of course, there are many details that I have left out. I can't design it for you. Especially not in Python. I've written servers like this in other languages, but I'm pretty new to Python.
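Applied to the REST handler from the question, the same idea might look roughly like this in Python: one module-level pool that is created once and reused by every request. This is a sketch under assumptions: create_task and run_task are hypothetical names, the pool size of 4 is arbitrary, and the web framework parts (request parsing, Response) are left out.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Created once at import time and shared by all requests (size chosen arbitrarily).
POOL = ThreadPoolExecutor(max_workers=4)
_task_ids = itertools.count(1)


def run_task(body, task_id):
    """Hypothetical task body; a real service would do its actual work here."""
    return {"taskId": task_id, "echo": body}


def create_task(body):
    """Queue the task on the shared pool and return its id without waiting for it."""
    task_id = next(_task_ids)
    future = POOL.submit(run_task, body, task_id)
    return task_id, future
```

The handler would return the task id to the client right away. Keeping the future around (for example in a dict keyed by task id) is one way to report status later.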

Do python-rq workers support multiprocessing module?

I currently have multiple python-rq workers executing jobs from a queue in parallel. Each job also utilizes the python multiprocessing module.
Job execution code is simply this:
from redis import Redis
from rq import Queue
q = Queue('calculate', connection=Redis())
job = q.enqueue(calculateJob, someArgs)
And calculateJob is defined as such:
import multiprocessing as mp
from functools import partial

def calculateJob(someArgs):
    pool = mp.Pool()
    result = partial(someFunc, someArgs=someArgs)

def someFunc(someArgs):
    # do something
    return output
So presumably when a job is being processed, all cores are automatically being utilized by that job. How does another worker processing another job in parallel execute its job if the first job is utilizing all cores already?
It depends on how your operating system schedules processes, just as opening a video plus five more processes doesn't completely freeze a 6-core computer. Each worker is a new process (a fork of a process, really). Instead of using multiprocessing inside a job, you can put each unit of work on the queue as its own job and let rq handle the parallelism by running multiple workers.
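A rough sketch of the "one job per unit of work" idea: split the input into chunks and enqueue each chunk separately, so that however many rq workers are running share the chunks between them. The helpers chunked and enqueue_chunks are hypothetical, and FakeQueue is an in-memory stand-in that runs jobs immediately so the example works without Redis; with rq, the queue argument would be the Queue object from the question, whose enqueue(func, args) call is used the same way.

```python
def some_func(chunk):
    """Hypothetical unit of work: square every number in the chunk."""
    return [x * x for x in chunk]


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def enqueue_chunks(queue, func, items, chunk_size):
    """Enqueue one job per chunk and return whatever enqueue() returned for each."""
    return [queue.enqueue(func, chunk) for chunk in chunked(items, chunk_size)]


class FakeQueue:
    """In-memory stand-in for a job queue (an assumption for this sketch, not rq)."""

    def __init__(self):
        self.jobs = []

    def enqueue(self, func, *args):
        self.jobs.append((func, args))
        return func(*args)  # a real queue would return a job handle instead
```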

Using Multiple Asyncio Queues Effectively

I am currently building a project that requires multiple requests made to various endpoints. I am wrapping these requests in aiohttp to make them asynchronous.
The problem:
I have three Queues: queue1, queue2, and queue3. Additionally, I have three worker functions (worker1, worker2, worker3) that are associated with their respective Queue. The first queue is populated immediately with a list of IDs that is known prior to running. When the request is finished and the data is committed to a database, worker1 passes the ID to queue2. A worker2 will take this ID and request more data. From this data it will begin to generate a list of IDs (different from the IDs in queue1/queue2). worker2 will put these IDs in queue3. Finally, worker3 will grab an ID from queue3 and request more data before committing to a database.
The issue arises from the fact that queue.join() is a blocking call. Each worker is tied to a separate Queue, so the join for queue1 will block until it's finished. This is fine, but it also defeats the purpose of using async. Without using join() the program is unable to detect when the Queues are totally empty. The other issue is that there may be silent errors when one of the Queues is empty but there is still data that hasn't been added yet.
The basic code outline is as follows:
queue1 = asyncio.Queue()
queue2 = asyncio.Queue()
queue3 = asyncio.Queue()
tasks = []
async with aiohttp.ClientSession() as session:
    for i in range(3):
        tasks.append(asyncio.create_task(worker1(queue1)))
    for i in range(3):
        tasks.append(asyncio.create_task(worker2(queue2)))
    for i in range(10):
        tasks.append(asyncio.create_task(worker3(queue3)))
    for i in IDs:
        queue1.put_nowait(i)
    await asyncio.gather(*tasks)
The worker functions sit in an infinite loop waiting for items to enter the queue.
When the data has all been processed there will be no exit and the program will hang.
Is there a way to effectively manage the workers and end properly?
As nicely explained in this answer, Queue.join serves to inform the producer when all the work injected into the queue has been completed. Since your first queue doesn't know when a particular item is done (it's multiplied and distributed to other queues), join is not the right tool for you.
Judging from your code, it seems that your workers need to run for only as long as it takes to process the queue's initial items. If that is the case, then you can use a shutdown sentinel to signal the workers to exit. For example:
async with aiohttp.ClientSession() as session:
    # ... create tasks as above ...
    for i in IDs:
        queue1.put_nowait(i)
    queue1.put_nowait(None)  # no more work
    await asyncio.gather(*tasks)
This is like your original code, but with an explicit shutdown request. Workers must detect the sentinel and react accordingly: propagate it to the next queue/worker and exit. For example, in worker1:
while True:
    item = await queue1.get()
    if item is None:
        # done with processing, propagate sentinel to worker2 and exit
        await queue2.put(None)
        break
    # ... process item as usual ...
Doing the same in the other two workers (except for worker3, which won't propagate because there's no next queue) will result in all three tasks completing once the work is done. Since queues are FIFO, the workers can safely exit after encountering the sentinel, knowing that no items have been dropped. The explicit shutdown also distinguishes a shut-down queue from one that happens to be empty, thus preventing workers from exiting prematurely due to a temporarily empty queue.
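Putting the pieces together, a self-contained sketch of the sentinel approach might look like this. It is deliberately simplified: there is one worker per queue, the HTTP requests are replaced by trivial transform functions, and stage and run_pipeline are names invented for this example.

```python
import asyncio

SENTINEL = None  # "no more work" marker, as in the snippets above


async def stage(inbox, outbox, transform, results):
    """Process items from inbox until the sentinel arrives, then pass it on and exit."""
    while True:
        item = await inbox.get()
        if item is SENTINEL:
            if outbox is not None:
                await outbox.put(SENTINEL)  # tell the next stage to shut down too
            break
        out = transform(item)  # stands in for the request + database commit
        if outbox is not None:
            await outbox.put(out)
        else:
            results.append(out)  # last stage: collect the final output


async def run_pipeline(ids):
    """Wire three stages together, feed the IDs, and wait for a clean shutdown."""
    queue1, queue2, queue3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    results = []
    tasks = [
        asyncio.create_task(stage(queue1, queue2, lambda x: x + 1, results)),
        asyncio.create_task(stage(queue2, queue3, lambda x: x * 10, results)),
        asyncio.create_task(stage(queue3, None, lambda x: x, results)),
    ]
    for i in ids:
        queue1.put_nowait(i)
    queue1.put_nowait(SENTINEL)  # no more work
    await asyncio.gather(*tasks)
    return results
```

Calling asyncio.run(run_pipeline([1, 2, 3])) returns once all three stages have exited. With several workers per queue, as in the question, each worker needs to see its own sentinel, and the sentinel should reach the next queue only after all workers of the current stage have finished their items.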
Up to Python 3.7, this technique was actually demonstrated in the documentation of Queue, but that example somewhat confusingly shows both the use of Queue.join and the use of a shutdown sentinel. The two are separate and can be used independently of one another. (And it might also make sense to use them together, e.g. to use Queue.join to wait for a "milestone", and then put other stuff in the queue, while reserving the sentinel for stopping the workers.)
