Thread objects not freed from memory - multithreading

I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to db and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize db operations and to avoid conflict with other db processes, I only write every now and then.
The memory from the deleted variables is never returned and eventually becomes so large the script crashes (shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong but I don't know how I could do it differently. Minimal example below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc
tracemalloc.start()
def a():
    # collect data
    collection.update({int(time.time()): list(range(1,1000))})
    return
collection = {}
threads = []
start = time.time()
cycle = 0
while time.time() < start + 60:
    cycle += 1
    t = threading.Thread(target = a)
    threads.append(t)
    t.start()
    time.sleep(1)
    for t in threads:
        if t.is_alive() == False:
            t.join()
    # periodically delete data
    delete = []
    for key, val in collection.items():
        if key < time.time() - 10:
            delete.append(key)
    for delet in delete:
        print('DELETING:', delet)
        del collection[delet]
    gc.collect()
    print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
    print(tracemalloc.get_traced_memory())
all_objects = pympler.muppy.get_objects()
sum1 = pympler.summary.summarize(all_objects)
pympler.summary.print_(sum1)
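One possible contributor, offered as a guess rather than a confirmed diagnosis: the threads list in the example keeps a reference to every Thread ever started and nothing removes the finished ones, so those objects can never be freed. The sketch below prunes finished threads on each cycle and guards the shared dict with a lock. The names collect, prune_finished, run and collection_lock, and the shortened timings, are hypothetical and chosen only for illustration.

```python
import threading
import time

collection = {}
collection_lock = threading.Lock()  # hypothetical helper guarding the shared dict


def collect(key):
    # Stand-in for the real data collection.
    with collection_lock:
        collection[key] = list(range(1, 1000))


def prune_finished(threads):
    """Join finished threads and return only those still running."""
    alive = []
    for t in threads:
        if t.is_alive():
            alive.append(t)
        else:
            t.join()  # returns immediately; the thread has already ended
    return alive


def run(cycles, max_age):
    threads = []
    for cycle in range(cycles):
        t = threading.Thread(target=collect, args=(cycle,))
        threads.append(t)
        t.start()
        time.sleep(0.01)
        threads = prune_finished(threads)  # drop references to dead threads
        with collection_lock:
            for key in [k for k in collection if k < cycle - max_age]:
                del collection[key]
    for t in threads:
        t.join()
    # Number of threads still referenced and alive after the final join.
    return len(prune_finished(threads))


print(run(cycles=20, max_age=5))
```

Rebinding threads to the pruned list is the step that lets the interpreter reclaim finished Thread objects; whether this accounts for all of the growth seen in tracemalloc would still need to be measured.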

Related

Best way to keep creating threads on variable list argument

I have an event that I listen to every minute and that returns a list; it could be empty or contain one or more elements. For each element in that list, I'd like to run a function that monitors an event on that element every minute for 10 minutes.
For that I wrote this script:
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio
import Client
client = Client()
def handle_event(event):
    for i in range(10):
        client.get_info(event)
        sleep(60)
async def main():
    while True:
        entires = client.get_new_entry()
        if len(entires) > 0:
            with ThreadPoolExecutor(max_workers=len(entires)) as executor:
                executor.map(handle_event, entires)
        await asyncio.sleep(60)
if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    loop.run_until_complete(main())
However, instead of continuing to monitor for new entries, it blocks while the previous entries are still being monitored.
Any idea how I could do that please?
First let me explain why your program doesn't work the way you want it to: It's because you use the ThreadPoolExecutor as a context manager, which will not close until all the threads started by the call to map are finished. So main() waits there, and the next iteration of the loop can't happen until all the work is finished.
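A minimal sketch of that blocking behaviour, with a hypothetical slow_job standing in for handle_event: submit returns at once, but leaving the with block waits for the submitted work, so the elapsed time is at least the job's duration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(seconds):
    # Hypothetical stand-in for handle_event.
    time.sleep(seconds)
    return seconds


def time_with_block(seconds):
    """Return how long the `with` block takes when one slow job is submitted."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(slow_job, seconds)  # returns at once
    # Leaving the block calls executor.shutdown(wait=True),
    # so this line is reached only after slow_job has finished.
    return time.monotonic() - start


elapsed = time_with_block(0.2)
print(f"with-block took {elapsed:.2f}s")
```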
There are ways around this. Since you are using asyncio already, one approach is to move the creation of the Executor to a separate task. Each iteration of the main loop starts one copy of this task, which runs as long as it takes to finish. It's an async def function, so many copies of this task can run concurrently.
I changed a few things in your code. Instead of Client I just used some simple print statements. I pass a list of integers, of random length, to handle_event. I increment a counter each time through the while True: loop, and add 10 times the counter to every integer in the list. This makes it easy to see how old calls continue for a time, mixing with new calls. I also shortened your time delays. All of these changes were for convenience and are not important.
The important change is to move ThreadPoolExecutor creation into a task. To make it cooperate with other tasks, it must contain an await expression, and for that reason I use executor.submit rather than executor.map. submit returns a concurrent.futures.Future, which provides a convenient way to await the completion of all the calls. executor.map, on the other hand, returns an iterator; I couldn't think of any good way to convert it to an awaitable object.
To convert a concurrent.futures.Future to an asyncio.Future, an awaitable, there is a function asyncio.wrap_future. When all the futures are complete, I exit from the ThreadPoolExecutor context manager. That will be very fast since all of the Executor's work is finished, so it does not block other tasks.
import random
from concurrent.futures import ThreadPoolExecutor
from time import sleep
import asyncio
def handle_event(event):
    for i in range(10):
        print("Still here", event)
        sleep(2)
async def process_entires(counter, entires):
    print("Counter", counter, "Entires", entires)
    x = [counter * 10 + a for a in entires]
    with ThreadPoolExecutor(max_workers=len(entires)) as executor:
        futs = []
        for z in x:
            futs.append(executor.submit(handle_event, z))
        await asyncio.gather(*(asyncio.wrap_future(f) for f in futs))
async def main():
    counter = 0
    while True:
        entires = [0, 1, 2, 3, 4][:random.randrange(5)]
        if len(entires) > 0:
            counter += 1
            asyncio.create_task(process_entires(counter, entires))
        await asyncio.sleep(3)
if __name__ == "__main__":
    asyncio.run(main())

Multiprocessing and lists in python

I have a list of jobs, but due to certain conditions not all of them should run in parallel: sometimes it is important that a finishes before I start b, or vice versa (it doesn't actually matter which one runs first, only that they don't both run at the same time). So I thought I would keep a list of the currently running threads, and whenever a new one starts, it checks this list to see whether it can proceed. I wrote some sample code for that:
from time import sleep
from multiprocessing import Pool
def square_and_test(x):
    print(running_list)
    if not x in running_list:
        running_list = running_list.append(x)
        sleep(1)
        result_list = result_list.append(x**2)
        running_list = running_list.remove(x)
    else:
        print(f'{x} is currently worked on')
task_list = [1,2,3,4,1,1,4,4,2,2]
running_list = []
result_list = []
pool = Pool(2)
pool.map(square_and_test, task_list)
print(result_list)
This code fails with UnboundLocalError: local variable 'running_list' referenced before assignment, so I guess my threads don't have access to global variables. Is there a way around this? If not, is there another way to solve this problem?
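A small single-process sketch of where the UnboundLocalError comes from: assigning to a name anywhere inside a function makes that name local to the function, and list.append returns None in any case. The functions broken and works are hypothetical names used only for this illustration.

```python
running_list = []


def broken(x):
    # Assigning to running_list anywhere in the function makes the name
    # local, so reading it on the right-hand side fails first.
    running_list = running_list.append(x)


def works(x):
    # Mutating in place involves no assignment, so the global is used.
    running_list.append(x)


try:
    broken(1)
except UnboundLocalError as exc:
    print("broken:", type(exc).__name__)

works(1)
print(running_list)
print([].append(1))  # list.append returns None
```

This only explains the error message. With multiprocessing.Pool each worker is a separate process with its own copy of the module globals, so in-place appends made in a worker would still not be visible to the parent or to other workers; shared state would need something like a multiprocessing.Manager list, which is not shown here.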

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (10GB compressed, 100GB uncompressed) that contains reports separated by demarcations, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an IO-bound problem). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data to child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process.
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() to read/parse one report in child-process-1, then release the lock and switch to child-process-2 to read/parse the next report, and then switch back to child-process-1 for the report after that. When I pass process.stdout as args to the child processes, I get a pickling error.
I have tried to create multiprocessing.Queue() and multiprocessing.Pipe() to send data to child processes, but this is way too slow (in fact it is way slower than doing it in a single thread, i.e. serially).
Any thoughts/examples about sending data to child processes efficiently will help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed
seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from output here that the overheads of passing chunks of stdout to worker processes are all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp
NWORKERS = 3
DELIM = "xxxxx\n"
def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout
# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None
def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total
if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now
    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)
    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)
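The earlier suggestion, where each worker reads the whole stream itself and skips the reports that belong to other workers, could be sketched roughly as follows. This is only an illustration under assumptions: records and records_for_worker are hypothetical helper names, the sample list stands in for the line stream of a worker's own gunzip process, and the wiring to subprocess and multiprocessing is left out.

```python
DELIM = "xxxxx\n"


def records(lines, delim=DELIM):
    """Yield each record as a list of lines, splitting on delimiter lines."""
    record = []
    for line in lines:
        if line == delim:
            yield record
            record = []
        else:
            record.append(line)
    if record:  # trailing record without a closing delimiter
        yield record


def records_for_worker(lines, worker_id, nworkers, delim=DELIM):
    """Yield only the records whose index belongs to this worker."""
    for index, record in enumerate(records(lines, delim)):
        if index % nworkers == worker_id:
            yield record


sample = ["1\n", "2\n", DELIM, "3\n", DELIM, "4\n", "5\n", DELIM]
print(list(records_for_worker(sample, 0, 2)))  # records 0 and 2
print(list(records_for_worker(sample, 1, 2)))  # record 1
```

Each worker still decompresses and scans the entire file, so the cost of skipping is paid once per worker; whether that beats passing records between processes would have to be measured on the real data.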

Python 2.7: Is it possible to make a timer without threading.Timer?

So, basically I want to make a timer, but I don't want to use threading.Timer, for efficiency.
threading.Timer creates a thread by itself, which I understand is not efficient and better avoided.
I searched for articles about this and found that using threads is slow,
e.g. when a single process was divided into N parts and run in threads, it was slower.
However, I need to use a Thread for this.
class Works(object):
    def __init__(self):
        self.symbol_dict = config.ws_api.get("ASSET_ABBR_LIST")
        self.dict = {}
        self.ohlcv1m = []
    def on_open(self, ws):
        ws.send(json.dumps(config.ws_api.get("SUBSCRIPTION_DICT")))
Every time I get a message from the websocket server, I store it in self.dict:
    def on_message(self,ws,message):
        message = json.loads(message)
        if len(message) > 2 :
            ticker = message[2]
            pair = self.symbol_dict[(ticker[0])]
            baseVolume = ticker[5]
            timestmap = time.time()
            try:
                type(self.dict[pair])
            except KeyError as e:
                self.dict[pair] = []
            self.dict[pair].append({
                'pair':pair,
                'baseVolume' : baseVolume,
            })
    def run(self):
        websocket.enableTrace(True)
        ws = websocket.WebSocketApp(
            url = config.ws_api.get("WEBSOCK_HOST"),
            on_message = self.on_message,
            on_open = self.on_open
        )
        ws.run_forever(sslopt = {"cert_reqs":ssl.CERT_NONE})
The following runs once every 60s. It aggregates self.dict, saves the result into self.ohlcv1m, and sends it to the db. Afterwards self.dict and self.ohlcv1m are reset to store the next minute of data from the server:
    def every60s(self):
        threading.Timer(60, self.every60s).start()
        for symbol in self.dict:
            tickerLists = self.dict[symbol]
            self.ohlcv1m.append({
                "V": sum([
                    float(ticker['baseVolume']) for ticker in tickerLists])
            })
        #self.ohlcv1m will go to database every 1m
        self.ohlcv1m = [] #init again
        self.dict = {} #init again
if __name__ == "__main__":
    work=Works()
    t1 = threading.Thread(target=work.run)
    t1.daemon = True
    t1.start()
    work.every60s()
(sorry for the indentation)
I am connecting to the socket by running run_forever() and getting real-time data.
Every 60s I need to check and calculate the data.
Is there any way to run something every 60s without a thread in Python 2.7?
I would appreciate any advice.
Thank you
The answer comes down to whether you need the code to run exactly every 60 seconds, or whether you can just wait 60 seconds between runs (i.e. if the logic takes 5 seconds, it'll run every 65 seconds).
If you're happy with just a 60 second gap between runs, you could do
import time
while True:
    every60s()
    time.sleep(60)
If you're really set on not using threads but having it start every 60 seconds regardless of the last poll time, you could time the last execution and subtract that from 60 seconds to get the sleep time.
However, really, with the code you've got there you're not going to run into any of the issues with Python threads you might have read about. Those issues come in when you've got multiple threads all running at the same time and all CPU bound, which doesn't seem to be the case here unless there's some very slow, CPU intensive work that's not in your provided code.
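The fixed-interval variant mentioned above (time the last execution and subtract it from the interval) might look like the sketch below. The helper names remaining_sleep and run_fixed_rate are hypothetical, and the sketch assumes Python 3 for time.monotonic; on Python 2.7, time.time() would be the nearest substitute.

```python
import time


def remaining_sleep(interval, started, finished):
    """Seconds left in the current interval; never negative."""
    return max(0.0, interval - (finished - started))


def run_fixed_rate(func, interval, iterations):
    """Call func roughly every `interval` seconds, `iterations` times."""
    for _ in range(iterations):
        started = time.monotonic()
        func()
        time.sleep(remaining_sleep(interval, started, time.monotonic()))


calls = []
run_fixed_rate(lambda: calls.append(time.monotonic()), 0.05, 3)
print(len(calls))
```

If func ever takes longer than the interval, the sleep drops to zero and the next call starts late rather than overlapping the previous one.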

Python Fire dynamic urls using multithreading

I'm new to Python threading, and I've gone through multiple posts, but I really did not understand how to use it. However, I tried to complete my task, and I want to check whether I'm taking the right approach.
The task is:
Read a big CSV containing around 20K records, fetch the id from each record, and fire an HTTP API call for each record of the CSV.
t1 = time.time()
file_data_obj = csv.DictReader(open(file_path, 'rU'))
threads = []
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id={}".format(record.get("acc_id", ""))
    thread = threading.Thread(target=requests.get, args=(apiurl,))
    thread.start()
    threads.append(thread)
t2 = time.time()
for thread in threads:
    thread.join()
print("Total time required to process a file - {} Secs".format(t2-t1))
As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If yes, can we restrict it?
How can I collect the response returned by requests.get?
Would t2 - t1 really give me the time required to process the whole file?
As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If yes, can we restrict it?
Yes - it will start a thread for each iteration. The maximum number of threads depends on your OS.
How can I grab the response returned by requests.get?
If you want to use the threading module only, you'll have to make use of a Queue. A Thread does not hand its target's return value back to you, so you'll have to implement a line of communication between the thread and your main loop yourself.
from queue import Queue
from threading import Thread
import requests
q = Queue()
# Worker that puts its response on the queue
def return_get(q, apiurl):
    q.put(requests.get(apiurl))
threads = []
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id={}".format(record.get("acc_id", ""))
    t = Thread(target=return_get, args=(q, apiurl))
    t.start()
    threads.append(t)
for thread in threads:
    thread.join()
while not q.empty():
    r = q.get() # Fetches the first item on the queue
    print(r.text)
An alternative is to use a worker pool.
from concurrent.futures import ThreadPoolExecutor
import requests
futures = []
pool = ThreadPoolExecutor(10)
# Submit work to the pool
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id={}".format(record.get("acc_id", ""))
    futures.append(pool.submit(requests.get, apiurl))
# result() blocks until that request has finished
for f in futures:
    print(f.result())
You can use ThreadPoolExecutor:
import concurrent.futures
import urllib.request
# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()
# Create pool executor with N workers
# (URLS and N_workers are left for the caller to define)
with concurrent.futures.ThreadPoolExecutor(max_workers=N_workers) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
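On the third question, which the answers above leave open: in the question's code t2 is taken before the join loop, so t2 - t1 measures only the time needed to start the threads, not to finish the requests. A rough sketch that bounds the number of threads, collects every result, and takes the end timestamp only after all work is done. Here fake_fetch is a hypothetical stand-in for requests.get so that the example runs without a network.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(acc_id):
    # Stand-in for requests.get(...); sleeps instead of touching the network.
    time.sleep(0.01)
    return "response for {}".format(acc_id)


def fetch_all(acc_ids, max_workers=10):
    """Fetch every id with at most max_workers threads; return (results, seconds)."""
    t1 = time.time()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map yields results in input order
        results = list(executor.map(fake_fetch, acc_ids))
    t2 = time.time()  # taken after all work has finished
    return results, t2 - t1


results, seconds = fetch_all(range(50))
print(len(results), "responses")
```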

Resources