How to name thread for logging with concurrent.futures? - python-3.x

I am creating a web scraper that scrapes multiple domains in different threads. Because there are many domains, I would like to be able to search the logged output per thread.
UPDATE: the solution is implemented in the code below; see the lines marked # SOLUTION.
The script has been set up as follows:
import logging
from queue import Queue, Empty
from threading import current_thread  # SOLUTION
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(
    format='%(threadName)s %(levelname)s: %(message)s',
    level=logging.INFO
)

class Scraper:
    def __init__(self, max_workers):
        self.pool = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix='T')
        self.to_crawl = Queue()
        for task in self.setup_tasks(tasks=max_workers):
            logging.info('Putting task to queue:\n{}'.format(task))
            self.to_crawl.put(task)
        logging.info('Queue size after init: {}'.format(self.to_crawl.qsize()))

    def setup_tasks(self, cur, tasks=1):
        # Prepare tasks for the queue (body omitted in the original post)
        ...

    def run_task(self, task):
        # Function for executing the task
        current_thread().name = task['id']  # SOLUTION
        logging.info('Executing task:\n{}'.format(task))
        id = task['id']  # I want the task id to be reflected in the logging output while run_task runs

    def run_scraper(self):
        while True:
            logging.info('Launching new thread, queue size is {}'.format(self.to_crawl.qsize()))
            try:
                # get_nowait() raises Empty once the queue is drained;
                # a plain get() would block forever and never reach the except branch
                task = self.to_crawl.get_nowait()
                self.pool.submit(self.run_task, task)
            except Empty:
                break

if __name__ == '__main__':
    s = Scraper(max_workers=3)
    s.run_scraper()
I would like to use task['id'] in the logging format configuration in place of the default %(threadName)s, without adding it manually every time the script logs something in run_task.
Is there a way to assign task['id'] to the thread's %(threadName)s when the thread picks up the task in run_scraper?
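A minimal sketch of the approach marked # SOLUTION above, under a few assumptions: pool threads are reused, so the name is restored after each task. The helper names (thread_named, ListHandler, demo) and the example URLs are hypothetical and only serve the illustration.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def thread_named(name):
    """Temporarily rename the current thread; restore the old name afterwards."""
    thread = threading.current_thread()
    old_name = thread.name
    thread.name = str(name)
    try:
        yield
    finally:
        thread.name = old_name


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted lines in memory for inspection."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


def run_task(task, logger):
    # Every log call inside this block reports the task id as %(threadName)s.
    with thread_named(task['id']):
        logger.info('Executing task %s', task['url'])


def demo():
    logger = logging.getLogger('scraper_demo')
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    handler.setFormatter(logging.Formatter('%(threadName)s %(levelname)s: %(message)s'))
    logger.addHandler(handler)
    tasks = [
        {'id': 'site-a', 'url': 'https://a.example'},
        {'id': 'site-b', 'url': 'https://b.example'},
    ]
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix='T') as pool:
        list(pool.map(lambda task: run_task(task, logger), tasks))
    logger.removeHandler(handler)
    return sorted(handler.lines)
```

Restoring the name in a finally block keeps a later task on the same pool thread from logging under a stale id if it forgets to set its own.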

Related

Run asyncio loop in a separate thread

I have a component of an application that needs to run an IOLoop in a separate thread. I try to achieve that by creating a new event loop in a background thread and starting the loop. My original use case is to keep scheduling a bunch of async tasks periodically.
To achieve this, I:
Create an event loop in a background thread.
Start the thread and call asyncio.run_coroutine_threadsafe(self._start, self._loop)
import asyncio
from contextlib import suppress
from threading import Thread

class AsyncScheduler(object):
    """
    Async Schedule Class.
    This class:
    - Will run on a separate event loop on a separate thread.
    - Will periodically (every minute) schedule tasks for Requester.
    """
    def __init__(self, batch_manager, requester):
        self._requester = requester
        self._is_started = False
        self._tasks = []
        self._loop = None
        self.start()

    def start(self):
        """
        Start a new event loop in a thread.
        call eventloop.run(self._start)
        :return:
        """
        print("STARTING")
        self._loop = asyncio.new_event_loop()
        # start new loop in thread.
        Thread(target=self._loop.run_forever).start()
        asyncio.run_coroutine_threadsafe(self._start, self._loop)

    def stop(self):
        if self._loop:
            # cancel tasks
            self._loop.call_soon_threadsafe(self._stop)
            # stop the loop.
            self._loop.stop()

    async def _start(self):
        """
        Create three tasks for 3 API versions.
        Schedule each task on the event loop using
        asyncio.gather.
        :return:
        """
        versions = [1, 2, 3]
        print("ASYNC START")
        if not self._is_started:
            self._is_started = True
            for version in versions:
                self._tasks.append(
                    self.create_task(60, version)
                )
            await asyncio.gather(*self._tasks)

    async def _stop(self):
        for task in self._tasks:
            task.cancel()
            with suppress(asyncio.CancelledError):
                await task

    async def execute(self, api_version):
        """
        This method gets the batch to be executed and
        tells the requester to run it.
        :param api_version:
        :return:
        """
        await self._requester.run()

    async def create_task(self, sleep_time, api_version):
        """
        Calls the tasks in infinite loop.
        :param sleep_time:
        :param api_version:
        :return:
        """
        while True:
            print("EVER CALLED")
            await self.execute(api_version)
            await asyncio.sleep(sleep_time)
Steps done in the code:
Call start from init
In start, create an eventloop within a new thread and start the loop with an awaitable.
I thought this was the way to use an event loop inside a separate thread. Alas, my awaitable self._start is never called and I get an error [A coroutine object is required].
Any ideas, what am I messing up here?
Thanks & Regards & Happy Thanksgiving to folks who celebrate.
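The error message points at the call asyncio.run_coroutine_threadsafe(self._start, self._loop): that function expects a coroutine object, i.e. the result of calling self._start(), not the method itself. A minimal, self-contained sketch of the pattern follows; tick and run_in_background_loop are hypothetical names, not part of the code above.

```python
import asyncio
import threading


async def tick(results, count, delay):
    """Append count values to results, sleeping between them."""
    for i in range(count):
        results.append(i)
        await asyncio.sleep(delay)
    return len(results)


def run_in_background_loop():
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()

    results = []
    # Note the call: tick(...) creates the coroutine object that
    # run_coroutine_threadsafe requires. Passing the bare function fails.
    future = asyncio.run_coroutine_threadsafe(tick(results, 3, 0.01), loop)
    value = future.result(timeout=5)

    # Stop the loop from inside its own thread, then clean up.
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    loop.close()
    return value, results
```

The sketch also stops the loop through call_soon_threadsafe, since loop.stop() is not meant to be called directly from another thread.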

Python: Callback on the worker-queue not working

Apologies for the long post. I am trying to subscribe to a RabbitMQ queue and then create a worker queue to execute tasks. This is required because the incoming message rate on RabbitMQ will be high and processing each item takes 10-15 minutes. I want only 4 items in the worker queue at a time, with a callback method registered for processing them. The expectation is that when all 4 slots in the worker queue are busy, new incoming messages are blocked until a slot is free.
The RabbitMQ piece is working well. The problem is that I cannot figure out why the items in my worker queue are not being executed, i.e. the callback is not working. In fact, an item from the worker queue gets executed only once, when the program starts. After that, tasks keep getting added to the worker queue without being consumed. I would appreciate help understanding this.
I am attaching the code for rabbitmqConsumer, driver, and slaveConsumer. Some information has been redacted in the code for privacy issues.
# This is the driver
#!/usr/bin/env python
import time
from rabbitmqConsumer import BasicMessageReceiver

basic_receiver_object = BasicMessageReceiver()
basic_receiver_object.declare_queue()
while True:
    basic_receiver_object.consume_message()
    time.sleep(2)
# This is the rabbitmqConsumer
#!/usr/bin/env python
import pika
import ssl
import json
from slaveConsumer import slave

class BasicMessageReceiver:
    def __init__(self):
        # SSL Context for TLS configuration of Amazon MQ for RabbitMQ
        ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
        url = <url for the queue>
        parameters = pika.URLParameters(url)
        parameters.ssl_options = pika.SSLOptions(context=ssl_context)
        self.connection = pika.BlockingConnection(parameters)
        self.channel = self.connection.channel()
        # worker-queue object
        self.slave_object = slave()
        self.slave_object.start_task()

    def declare_queue(self, queue_name="abc"):
        print(f"Trying to declare queue inside consumer({queue_name})...")
        self.channel.queue_declare(queue=queue_name, durable=True)

    def close(self):
        print("Closing Receiver")
        self.channel.close()
        self.connection.close()

    def _consume_message_setup(self, queue_name):
        def message_consume(ch, method, properties, body):
            print(f"I am inside the message_consume")
            message = json.loads(body)
            self.slave_object.execute_task(message)
            ch.basic_ack(delivery_tag=method.delivery_tag)

        self.channel.basic_qos(prefetch_count=1)
        self.channel.basic_consume(on_message_callback=message_consume,
                                   queue=queue_name)

    def consume_message(self, queue_name="abc"):
        print("I am starting the rabbitmq start_consuming")
        self._consume_message_setup(queue_name)
        self.channel.start_consuming()
# This is the slaveConsumer
#!/usr/bin/env python
import pika
import ssl
import json
import requests
import threading
import queue
import os

class slave:
    def __init__(self):
        self.job_queue = queue.Queue(maxsize=3)
        self.job_item = ""

    def start_task(self):
        def _worker():
            while True:
                json_body = self.job_queue.get()
                self._parse_object_from_queue(json_body)
                self.job_queue.task_done()

        threading.Thread(target=_worker, daemon=True).start()

    def execute_task(self, obj):
        print("Inside execute_task")
        self.job_item = obj
        self.job_queue.put(self.job_item)
        # print(self.job_queue.queue)

    def _parse_object_from_queue(self, json_body):
        if bool(json_body['entity']):
            if json_body['entity'] == 'Hello':
                print("Inside Slave: Hello")
            elif json_body['entity'] == 'World':
                print("Inside Slave: World")
        self.job_queue.join()
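One thing that stands out in the code above: _parse_object_from_queue appears to call self.job_queue.join() from inside the worker thread, before task_done() has run for the current item, so the single worker ends up waiting on itself. The following is only a sketch of a bounded worker queue without that call, assuming four worker threads; WorkerPool, submit and wait are hypothetical names and the RabbitMQ side is left out.

```python
import queue
import threading


class WorkerPool:
    """Bounded job queue drained by a fixed number of daemon worker threads."""

    def __init__(self, handler, num_workers=4):
        self._handler = handler
        # maxsize bounds the backlog: put() blocks while the queue is full.
        self._jobs = queue.Queue(maxsize=num_workers)
        for _ in range(num_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            job = self._jobs.get()
            try:
                self._handler(job)
            finally:
                # Mark the item finished even if the handler raised.
                self._jobs.task_done()

    def submit(self, job):
        """Called by the producer; blocks until a queue slot is free."""
        self._jobs.put(job)

    def wait(self):
        """Called by the producer only, never from inside a worker."""
        self._jobs.join()


def demo():
    processed = []
    pool = WorkerPool(processed.append, num_workers=4)
    for item in range(10):
        pool.submit(item)
    pool.wait()
    return sorted(processed)
```

Because join() is only ever called from the producer side, a worker never waits for its own unfinished item.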

How to control memory consumption while multithreading

I am trying to scrape some websites using Python's threading module and the thread-safe queue module. I'm observing an increase in memory usage as I test on more URLs. Below is my code for reference:
from collections import defaultdict
from queue import Queue
from threading import Thread
import itertools
from time import time
import newspaper
import requests
import pickle

data = defaultdict(list)

def get_links():
    return (url for url in pickle.load(open('urls.pkl','rb')))
    # for url in urls[:500]:
    #     yield url

def download_url(url):
    try:
        resp = requests.get(url)
        article = newspaper.Article(resp.url)
        article.download(input_html=resp.content)
        article.parse()
        data['url'].append(url)
        data['result'].append(article.text)
    except:
        pass

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            link = self.queue.get()
            try:
                download_url(link)
                print(link,"done")
            finally:
                self.queue.task_done()
            print(self.queue.qsize())

def main():
    ts = time()
    links = get_links()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create worker threads
    for x in range(4):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in itertools.islice(links,1000):
        queue.put(link)
    # Causes the main thread to wait for the queue to finish processing all the tasks
    queue.join()
    pickle.dump(data, open('scrapped_results.pkl','wb'))
    print('Took %s mins' %((time() - ts)/60))

if __name__ == '__main__':
    main()
When tested on 100 URLs, memory consumption stays constant at 0.1%, but it increases as more URLs are tested (0.2%, 0.4%, 0.5%). The most URLs I have tested is 1000. My questions are:
Why does memory consumption increase?
Is memory increasing because the queue is not emptied before it gets filled? My understanding is that a queue empties itself as its items get processed.
Is there a way to keep the threads' memory usage constant?
Is it because the data in the defaultdict is getting bigger?
Can a timeout help here? Where can I declare a timeout?
Is it caused by newspaper and requests?
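One part of this can be stated with confidence: data keeps every article's text in memory until the final pickle.dump, so that portion of memory grows with the number of URLs processed. Below is a hedged sketch of one alternative, assuming results may be streamed to a JSON-lines file instead of being accumulated. The names process_all, fetch and demo are hypothetical, and fetch stands in for the requests/newspaper download step.

```python
import json
import os
import queue
import tempfile
import threading


def process_all(urls, fetch, out_path, num_workers=4, max_pending=8):
    """Fetch each URL in a worker thread and append the result to out_path."""
    jobs = queue.Queue(maxsize=max_pending)  # bounds the backlog of pending URLs
    write_lock = threading.Lock()

    with open(out_path, 'w', encoding='utf-8') as out_file:

        def worker():
            while True:
                url = jobs.get()
                if url is None:  # sentinel: no more work
                    return
                try:
                    text = fetch(url)
                except Exception:
                    text = None  # record the failure and keep going
                # Write immediately so the text is not held in memory.
                with write_lock:
                    out_file.write(json.dumps({'url': url, 'result': text}) + '\n')

        threads = [threading.Thread(target=worker) for _ in range(num_workers)]
        for thread in threads:
            thread.start()
        for url in urls:
            jobs.put(url)
        for _ in threads:
            jobs.put(None)  # one sentinel per worker
        for thread in threads:
            thread.join()


def demo():
    urls = ['https://example.com/page/%d' % i for i in range(20)]
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, 'results.jsonl')
        process_all(urls, fetch=lambda url: url.upper(), out_path=out_path)
        with open(out_path, encoding='utf-8') as f:
            return [json.loads(line) for line in f]
```

On the timeout question: requests.get accepts a timeout argument, which keeps a stalled connection from holding a worker indefinitely. Whether newspaper or requests contribute to the growth cannot be determined from the code shown.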

Python multiprocessing script partial output

I am following the principles laid down in this post to safely output the results, which will eventually be written to a file. Unfortunately, the code only prints 1 and 2, and not 3 to 6.
import os
import argparse
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        try:
            par=queueIn.get(block=False)
            res=doCalculation(par)
            queueOut.put((res))
            queueIn.task_done()
        except:
            break

def doCalculation(par):
    return par

def write(queue):
    while True:
        try:
            par=queue.get(block=False)
            print("response:",par)
        except:
            break

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod=[1,2,3,4,5,6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
On running the code it prints,
$ python3 tst.py
Queue size 6
response: 1
response: 2
Also, is it possible to ensure that the write function always outputs 1,2,3,4,5,6 i.e. in the same order in which the data is fed into the feed queue?
The error comes from the task_done() call: multiprocessing.Queue has no task_done() method (only multiprocessing.JoinableQueue does), so the call raises an AttributeError, which the bare except swallows, and each worker exits after its first item. If you remove that call, it works, but only because queueIn.get(block=False) eventually throws an exception when the queue is empty. This might be just enough for your use case; a better way, though, is to use sentinels (as suggested in the multiprocessing docs, see the last example). Here's a little rewrite so your program uses sentinels:
import os
import argparse
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist, nthreads):
    for par in parlist:
        queue.put(par)
    for i in range(nthreads):
        queue.put(None)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        par=queueIn.get()
        if par is None:
            break
        res=doCalculation(par)
        queueOut.put((res))

def doCalculation(par):
    return par

def write(queue):
    while not queue.empty():
        par=queue.get()
        print("response:",par)

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod=[1,2,3,4,5,6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod, nthreads))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
A few things to note:
the sentinel is a None put into the queue. Note that you need one sentinel for every worker process.
for the write function you don't need sentinel handling, as there's only one process and no concurrency to handle. (If you did the empty()-then-get() check in your calc function you would run into a problem: if, e.g., there's only one item left in the queue and both workers check empty() at the same time, both then call get() and one of them blocks forever.)
you don't need to put feed and write into processes; just put them into your main function, as you don't want to run them in parallel anyway.
how can I have the same order in output as in input? [...] I guess multiprocessing.map can do this
Yes, map keeps the order. Here is a simpler rewrite of your program (you don't need the workerQueue and writerQueue), with random sleeps added to show that the output is still in order:
from multiprocessing import Pool
import time
import random

def calc(val):
    time.sleep(random.random())
    return val

if __name__ == "__main__":
    considerperiod=[1,2,3,4,5,6]
    with Pool(processes=2) as pool:
        print(pool.map(calc, considerperiod))

Threaded result not giving same result as un-threaded result (python)

I have created a program to generate data points of functions that I later plot. The program takes a class that defines the function and creates a data-output object which, when called, writes the data to a text file. To make the whole process faster I put the jobs in threads; however, when I do, the generated data is not always correct. I have attached a picture to show what I mean:
Here are some of the relevant bits of code:
from queue import Queue
import threading
import time

queueLock = threading.Lock()
workQueue = Queue(10)

def process_data(threadName, q, queue_window, done):
    while not done.get():
        queueLock.acquire()  # check whether or not the queue is locked
        if not workQueue.empty():
            data = q.get()
            # data is the Plot object to be run
            queueLock.release()
            data.parent_window = queue_window
            data.process()
        else:
            queueLock.release()
        time.sleep(1)

class WorkThread(threading.Thread):
    def __init__(self, threadID, q, done):
        threading.Thread.__init__(self)
        self.ID = threadID
        self.q = q
        self.done = done

    def get_qw(self, queue_window):
        # gets the queue_window object
        self.queue_window = queue_window

    def run(self):
        # this is called when thread.start() is called
        print("Thread {0} started.".format(self.ID))
        process_data(self.ID, self.q, self.queue_window, self.done)
        print("Thread {0} finished.".format(self.ID))

class Application(Frame):
    def __init__(self, etc):
        self.threads = []
        # does some things

    def makeThreads(self):
        for i in range(1, int(self.threadNum.get()) +1):
            thread = WorkThread(i, workQueue, self.calcsDone)
            self.threads.append(thread)
    # more code which just processes the function etc, sorts out the gui stuff.
And in a separate class (as I'm using tkinter, so the actual code to get the threads to run is called in a different window) (self.parent is the Application class):
def run_jobs(self):
    if self.running == False:
        # threads are only initiated when jobs are to be run
        self.running = True
        self.parent.calcsDone.set(False)
        self.parent.threads = []  # just to make sure that it is initially empty, we want new threads each time
        self.parent.makeThreads()
        self.threads = self.parent.threads
        for thread in self.threads:
            thread.get_qw(self)
            thread.start()
        # put the jobs in the workQueue
        queueLock.acquire()
        for job in self.job_queue:
            workQueue.put(job)
        queueLock.release()
    else:
        messagebox.showerror("Error", "Jobs already running")
This is all the code which relates to the threads.
I don't know why when I run the program with multiple threads some data points are incorrect, whilst running it with just 1 single thread the data is all perfect. I tried looking up "threadsafe" processes, but couldn't find anything.
Thanks in advance!
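On the "threadsafe" point: queue.Queue does its own locking, so get() and put() can be called from several threads without an extra lock. The sketch below only illustrates that pattern, assuming each job is self-contained and returns its own data. It does not diagnose the incorrect data points, whose cause cannot be determined from the code shown. The names worker, run_jobs and the threading.Event flag are hypothetical.

```python
import queue
import threading


def worker(jobs, results, done):
    """Pull jobs until the done flag is set; no external lock is needed."""
    while not done.is_set():
        try:
            job = jobs.get(timeout=0.05)
        except queue.Empty:
            continue  # nothing to do yet; check the done flag again
        try:
            results.put(job())  # each job returns its own data
        finally:
            jobs.task_done()


def run_jobs(job_list, num_threads=3):
    jobs = queue.Queue()
    results = queue.Queue()
    done = threading.Event()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, done))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for job in job_list:
        jobs.put(job)
    jobs.join()  # wait until every job has been processed
    done.set()   # then tell the workers to exit
    for thread in threads:
        thread.join()
    collected = []
    while not results.empty():  # safe here: all workers have finished
        collected.append(results.get())
    return collected
```

If the job objects share mutable state (for example a common buffer or output file), that shared state would need its own protection; the queue only makes the hand-off of jobs safe.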
