How to create a data loader which populates a dictionary in background? - python-3.x

I am trying to loop over a set of graphs as shown in the snippet from the main script below:
import pickle

for timeOfDay in range(1440):  # total number of minutes in a day = 1440
    with open("G_" + str(timeOfDay) + '.pickle', 'rb') as handle:
        G = pickle.load(handle)
    ## do something with G
Please note that the loop above cannot be parallelized because the graph in G_i.pickle must be processed before the one in G_{i+1}.pickle. The loading time of one graph is around 1 minute while the processing time for one graph is 30 seconds. So, my whole algorithm is bound by the loading time of the graphs. The RAM is sufficient for holding at least 10 graphs and the number of threads available is more than 30, so I am wondering if I can leverage parallel processing to keep updating a buffer dictionary which holds my graphs from time timeOfDay to timeOfDay+10.
I could load all the graphs G_1.pickle to G_1440.pickle into RAM, but that exhausts my RAM. So, I tried to create a dictionary G_RAM which has 1440 keys with G_RAM[i] = the graph loaded from G_{i}.pickle. But it runs into a whole lot of issues, sometimes getting stuck and sometimes not deleting the older graphs at the right time. I would like to know if there is a standard way to do this without having to re-invent the wheel. Any help is appreciated! Thanks in advance!
To better explain my effort, I have a background thread which does the following:
reads the global value of the timeOfDay variable
uses N processes to populate G_RAM for keys timeOfDay to timeOfDay + N, and deletes all keys G_RAM[i] for i < timeOfDay
My attempt is reproduced below:
import multiprocessing
import threading
import time
import pickle
import gc

HORIZON_PARALLEL_LOADING_OF_GRAPHS = 10
THREADS_PARALLEL_LOADING_OF_GRAPHS = 5

G_RAM = {}

##### GLOBAL VARIABLES
# global variable to keep track of which timeOfDays are already being worked at by the worker processes.
# Simply checking for the keys in G_RAM is not enough because it is possible that some process is
# already working to populate a key but has not finished, hence we keep track of which timeOfDays are
# already being worked at
already_being_worked_at = []
timeOfDay = 0

# Inspired by https://amalgjose.com/2018/07/18/run-a-background-function-in-python/
def background(f):
    '''
    a threading decorator
    use @background above the function you want to run in the background
    '''
    def backgrnd_func(*a, **kw):
        threading.Thread(target=f, args=a, kwargs=kw).start()
    return backgrnd_func

def mp_worker(key):
    # load graph at time "key"
    with open('G_' + str(key) + '.pickle', 'rb') as handle:
        read_data = pickle.load(handle)
    return (key, read_data)

def mp_handler(data):
    p = multiprocessing.Pool(THREADS_PARALLEL_LOADING_OF_GRAPHS)
    res = p.map(mp_worker, data)
    # populating the global dict here
    global G_RAM
    for k_v in res:
        G_RAM[k_v[0]] = k_v[1]
    p.close()
    p.join()

@background
def updateDict():
    # background thread: keeps G_RAM populated with the graphs from
    # timeOfDay to timeOfDay + HORIZON_PARALLEL_LOADING_OF_GRAPHS
    global timeOfDay
    global already_being_worked_at
    while True:
        time.sleep(0.01)
        gd_keys = list(G_RAM.keys())
        # remove the used graphs (before time timeOfDay) and anything beyond the horizon
        for key in gd_keys:
            if key < timeOfDay or key > min(timeOfDay + HORIZON_PARALLEL_LOADING_OF_GRAPHS, 1440):
                del G_RAM[key]
        gc.collect()
        # check which keys (up to timeOfDay + HORIZON_PARALLEL_LOADING_OF_GRAPHS) are missing from G_RAM
        data = []
        for key in range(timeOfDay, min(timeOfDay + HORIZON_PARALLEL_LOADING_OF_GRAPHS, 1440)):
            if key not in G_RAM:
                data.append(key)
        # removing the timeOfDays which are already being worked at, to avoid creating duplicate processes;
        # we want to call the workers again only for the timeOfDays which are not being worked at
        temp = []
        for key in data:
            if key not in already_being_worked_at:
                temp.append(key)
        data = temp
        # if the missing timeOfDays are already being worked at by the processes, we have no action item;
        # otherwise, we call the workers with the new timeOfDays which we need but which are not being
        # worked at
        if len(data) > 0:
            mp_handler(data)
            # we update our list of already_being_worked_at
            already_being_worked_at = already_being_worked_at + data

### main script follows
updateDict()
for dayNumber in range(100):
    already_being_worked_at = []  # reset the list of populated keys at the start of the new day
    for timeOfDay in range(1440):  # total number of minutes in a day = 1440
        with open("G_" + str(timeOfDay) + '.pickle', 'rb') as handle:
            G = pickle.load(handle)
        ## do something with G
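To make the goal concrete, here is a minimal sketch (not my actual code) of the kind of bounded prefetch buffer I am hoping a standard recipe exists for: a small process pool stays HORIZON graphs ahead of the consumer, while the main loop still consumes strictly in order. process_graph is a placeholder for my 30-second processing step.

import pickle
from concurrent.futures import ProcessPoolExecutor

HORIZON = 10   # how many graphs to keep loaded/loading ahead of the consumer
WORKERS = 5    # worker processes used purely for unpickling

def load_graph(timeOfDay):
    # runs in a worker process; only the loading is parallelised
    with open("G_" + str(timeOfDay) + ".pickle", "rb") as handle:
        return pickle.load(handle)

def process_graph(G):
    # placeholder for the sequential 30-second processing step
    pass

def run_day():
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        futures = {}
        for timeOfDay in range(1440):
            # keep the buffer topped up to HORIZON graphs ahead of the current minute
            for ahead in range(timeOfDay, min(timeOfDay + HORIZON, 1440)):
                if ahead not in futures:
                    futures[ahead] = pool.submit(load_graph, ahead)
            # block only if the graph for the current minute has not finished loading yet
            G = futures.pop(timeOfDay).result()
            process_graph(G)   # strictly in order: minute i before minute i+1

if __name__ == "__main__":
    run_day()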

Related

Plot a continuous graph of Number of Snort alerts against time

I have snort logging DDOS alerts to file; I use Syslog-ng to parse the logs and output them in json format into redis (I wanted to set it up as a buffer, so I use the 'setex' command with an expiry of 70 secs).
The whole thing seems not to be working well; any ideas to make it easier are welcome.
I wrote a simple python script to listen to redis KA events and count the number of snort alerts per second. I tried creating two other threads: one to retrieve the json-formatted alerts from snort and the second to count the alerts. The third is supposed to plot a graph using matplotlib.pyplot.
#import time
from redis import StrictRedis as sr
import os
import json
import matplotlib.pyplot as plt
import threading as th
import time

redis = sr(host='localhost', port=6379, decode_responses=True)
#file = open('/home/lucidvis/vis_app_py/log.json','w+')

# This function is still being worked on
def do_plot():
    print('do_plot loop running')
    while accumulated_data:
        x_values = [int(x['time_count']) for x in accumulated_data]
        y_values = [y['date'] for y in accumulated_data]
        plt.title('Attacks Alerts per time period')
        plt.xlabel('Time', fontsize=14)
        plt.ylabel('Snort Alerts/sec')
        plt.tick_params(axis='both', labelsize=14)
        plt.plot(y_values, x_values, linewidth=5)
        plt.show()
        time.sleep(0.01)

def accumulator():
    # first off, check the current json data and see if its 'sec' value is the same
    # as that of the last entry in the accumulated data list;
    # if it is the same, increase time_count by one, else pop that value
    pointer_data = {}
    print('accumulator loop running')
    while True:
        # pointer_data is the current sec of json data used for comparison
        # new_data is the latest json-formatted alert received
        # received_from_redis is a list declared in the main function
        if received_from_redis:
            new_data = received_from_redis.pop(0)
            if not pointer_data:
                pointer_data = new_data.copy()
                print(">>", type(pointer_data), " >> ", pointer_data)
            if pointer_data and pointer_data['sec'] == new_data["sec"]:
                pointer_data['time_count'] += 1
            elif pointer_data:
                accumulated_data.append(pointer_data)
                pointer_data = new_data.copy()
                pointer_data.setdefault('time_count', 1)
        else:
            time.sleep(0.01)

# main function creates the redis object and receives messages based on events
# this function calls two other functions and creates threads so they appear to run concurrently
def main():
    p = redis.pubsub()
    p.psubscribe('__keyspace@0__*')
    print('Starting message loop')
    while True:
        try:
            time.sleep(2)
            message = p.get_message()
            # Obtain the key from the redis-emitted event if the event is a set event
            if message and message['data'] == 'set':
                # the format emitted by redis is a dict;
                # the key is the value for the key 'channel'
                # and is in '__keyspace@0__:<key>' form, so
                # obtain the last field of the list returned by split
                key = message['channel'].split('__:')[-1]
                data_redis = json.loads(redis.get(str(key)))
                received_from_redis.append(data_redis)
        except Exception as e:
            print(e)
            continue

if __name__ == "__main__":
    accumulated_data = []
    received_from_redis = []
    thread_accumulator = th.Thread(target=accumulator, name='accumulator')
    do_plot_thread = th.Thread(target=do_plot, name='do_plot')
    while True:
        thread_accumulator.start()
        do_plot_thread.start()
        main()
        thread_accumulator.join()
        do_plot_thread.join()
I currently don't get errors per se; I just can't tell if the threads are created or are working well. I need ideas to make things work better.
A sample of the alert formatted in json and obtained from redis is below:
{"victim_port":"","victim":"192.168.204.130","protocol":"ICMP","msg":"Ping_Flood_Attack_Detected","key":"1000","date":"06/01-09:26:13","attacker_port":"","attacker":"192.168.30.129","sec":"13"}
I'm not sure I understand exactly your scenario, but if you want to count events that are essentially log messages, you can probably do that within syslog-ng. Either as a Python destination (since you are already working in python), or maybe even without additional programming using the grouping-by parser.
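If you do end up counting in Python instead, here is a minimal sketch of bucketing the alerts per second by their "date" field (which already has one-second resolution in the sample above). The raw_alerts list here just stands in for whatever you pop off redis:

import json
from collections import Counter

# raw_alerts stands in for the json strings fetched from redis
raw_alerts = [
    '{"victim":"192.168.204.130","protocol":"ICMP","msg":"Ping_Flood_Attack_Detected",'
    '"date":"06/01-09:26:13","attacker":"192.168.30.129","sec":"13"}',
]

counts = Counter()
for raw in raw_alerts:
    alert = json.loads(raw)
    counts[alert["date"]] += 1   # one bucket per second, since "date" has 1-second resolution

# (timestamp, alerts_in_that_second) pairs, ready to feed into matplotlib
for timestamp, n in sorted(counts.items()):
    print(timestamp, n)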

How can I use multithreading (or multiprocessing?) for faster data upload?

I have a list of issues (jira issues):
listOfKeys = [id1,id2,id3,id4,id5...id30000]
I want to get the worklogs of these issues; for this I used the jira-python library and this code:
listOfWorklogs = pd.DataFrame()  # I used the pandas (pd) lib
lst = {}  # helper dictionary where the worklogs will be stored
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        i += 1
    else:
        for j in range(len(worklogs)):
            lst = {
                'self': worklogs[j].self,
                'author': worklogs[j].author,
                'started': worklogs[j].started,
                'created': worklogs[j].created,
                'updated': worklogs[j].updated,
                'timespent': worklogs[j].timeSpentSeconds
            }
            listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################
########### Below there is the recording to the .xlsx file ################
So I simply go through the worklogs of each issue in a simple loop, which is equivalent to referring to the link:
https://jira.mycompany.com/rest/api/2/issue/issueid/worklogs and retrieving information from this link.
The problem is that there are more than 30,000 such issues, and the loop is very slow (approximately 3 seconds per issue).
Can I somehow start multiple loops / processes / threads in parallel to speed up the process of getting worklogs (maybe without jira-python library)?
I recycled a piece of code I made into your code, I hope it helps:
from multiprocessing import Manager, Process, cpu_count

def insert_into_list(worklog, queue):
    lst = {
        'self': worklog.self,
        'author': worklog.author,
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds
    }
    queue.put(lst)
    return

# Number of cpus in the pc
num_cpus = cpu_count()

# Manager and queue to hold the results
manager = Manager()
# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()

listOfWorklogs = pd.DataFrame()
lst = {}

for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        i += 1
    else:
        # This loop replaces your "for j in range(len(worklogs))" loop
        index = 0
        while index < len(worklogs):
            processes = []
            elements = min(num_cpus, len(worklogs) - index)
            # Create a process for each cpu
            for j in range(elements):
                process = Process(target=insert_into_list, args=(worklogs[j + index], queue))
                processes.append(process)
            # Run the processes
            for j in range(elements):
                processes[j].start()
            # Wait for them to finish
            for j in range(elements):
                processes[j].join(timeout=10)
            index += num_cpus

# Dump the queue into the dataframe
while queue.qsize() != 0:
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)
This should work and reduce the time by a factor a little less than the number of CPUs in your machine. You can try and change that number manually for better performance. In any case I find it very strange that it takes about 3 seconds per operation.
PS: I couldn't try the code because I have no examples, so it probably has some bugs.
I have some troubles:
1) the indentation in the code where the first "for" loop appears and the first "if" statement begins (this statement and everything below it should be inside the loop, right?)
for i in range(len(listOfKeys)-99):
    worklogs = jira.worklogs(listOfKeys[i])  # getting list of worklogs
    if len(worklogs) == 0:
        ....
2) cmd, the conda prompt and Spyder did not allow your code to work for some reason:
Python Multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'
After researching on google, I had to set __spec__ = None a bit higher in the code (but I'm not sure if this is correct) and this error disappeared.
By the way, the code in Jupyter Notebook worked without this error, but listOfWorklogs is empty and this is not right.
3) when I corrected the indentation and set __spec__ = None, a new error occurred in this place:
processes[i].start()
The error is like this:
"PicklingError: Can't pickle <class 'jira.resources.PropertyHolder'>: attribute lookup PropertyHolder on jira.resources failed"
If I remove the parentheses from the start and join methods the code runs, but I do not get any entries in listOfWorklogs.
I ask again for your help!
How about thinking about it not from a technical standpoint but a logical one? You know your code works, but at a rate of 3 sec per issue, which means it would take 25 hours to complete. If you have the ability to split up the number of Jira issues that are passed into the script (maybe by date or issue key, etc.), you could create multiple .py files with basically the same code; you would just pass each one a different list of Jira tickets. Then you could run, say, 4 of them at the same time and reduce the time to 6.25 hours each.
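Another option along the same lines, sketched below and not tested against a real Jira instance (it assumes the same jira client object, listOfKeys list and pandas import as in the question): since the slow part is the network round trip of each jira.worklogs() call, a thread pool can issue many of those calls at once while the DataFrame is still built in the main thread.

from concurrent.futures import ThreadPoolExecutor

def fetch_rows(key):
    # one network round trip per issue; runs in a worker thread
    rows = []
    for w in jira.worklogs(key):
        rows.append({
            'self': w.self,
            'author': w.author,
            'started': w.started,
            'created': w.created,
            'updated': w.updated,
            'timespent': w.timeSpentSeconds,
        })
    return rows

all_rows = []
with ThreadPoolExecutor(max_workers=20) as pool:
    # map returns results in the same order as listOfKeys
    for rows in pool.map(fetch_rows, listOfKeys):
        all_rows.extend(rows)

listOfWorklogs = pd.DataFrame(all_rows)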

Where to set the locks in apply pandas with multithreading?

I am trying to asynchronously read and write from a pandas df with an apply function. For this purpose I am using the multiprocessing.dummy package. Since I am reading and writing simultaneously (multithreaded) on my df, I am using multiprocessing.Lock() so that no more than one thread can edit the df at a given time. However, I am a bit confused about where I should be adding lock.acquire() and lock.release() with an apply function in pandas. I have tried doing it as below; however, it seems that this makes the entire process synchronous, which defeats the whole purpose of multithreading.
self._lock.acquire()
to_df[col_name] = to_df.apply(lambda row: getattr(Object(row['col_1'],
                                                         row['col_2'],
                                                         row['col_3']),
                                                  someattribute), axis=1)
self._lock.release()
Note: In my case I have to be doing getattr. someattribute is simply a @property in Object. Object takes 3 arguments, which come from columns 1, 2 and 3 of my df.
There are 2 possible solutions: 1 - locks, 2 - queues. The code below is just a skeleton; it may contain typos/errors and cannot be used as is.
First: locks, where they are actually needed:
def method_to_process_url(df):
    lock.acquire()
    url = df.loc[some_idx, some_col]
    lock.release()

    info = process_url(url)

    lock.acquire()
    # add info to df
    lock.release()
Second: queues instead of locks:
def method_to_process_url(df, url_queue, info_queue):
    while True:
        url = url_queue.get()
        info = process_url(url)
        info_queue.put(info)

url_queue = queue.Queue()
# add all urls to process to the url_queue
info_queue = queue.Queue()

# working_thread_1
threading.Thread(
    target=method_to_process_url,
    kwargs={'df': df, 'url_queue': url_queue, 'info_queue': info_queue},
    daemon=True).start()
# more working threads

counter = 0
while counter < amount_of_urls:
    info = info_queue.get()
    # add info to df
    counter += 1
In the second case you may even start a separate thread for every url without the url_queue (reasonable if the amount of urls is on the order of thousands or less). counter is a simple way to stop the program when all urls are processed.
I would use the second approach if you ask me. It is more flexible in my opinion.
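Applied to the apply/getattr case from the question, a lock-free single-writer version of the second approach could look roughly like the sketch below; Object, someattribute and the column names are taken from the question, and the pool size is arbitrary. Worker threads only compute values, and only the main thread writes to the DataFrame, so no lock is needed.

from multiprocessing.dummy import Pool as ThreadPool   # thread-based Pool

def compute(row_tuple):
    # runs in a worker thread; touches only its own inputs, never the df
    col_1, col_2, col_3 = row_tuple
    return getattr(Object(col_1, col_2, col_3), someattribute)

rows = list(zip(to_df['col_1'], to_df['col_2'], to_df['col_3']))
with ThreadPool(8) as pool:
    results = pool.map(compute, rows)   # results come back in row order

# only the main thread writes to the DataFrame, so no lock is needed
to_df[col_name] = results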

Thread objects not freed from memory

I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to db and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize db operations and to avoid conflict with other db processes, I only write every now and then.
The memory from the deleted variables is never released, and eventually the memory use becomes so large that the script crashes (shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong, but I don't know how I could do it differently. Minimal example below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc

tracemalloc.start()

def a():
    # collect data
    collection.update({int(time.time()): list(range(1, 1000))})
    return

collection = {}
threads = []
start = time.time()
cycle = 0

while time.time() < start + 60:
    cycle += 1
    t = threading.Thread(target=a)
    threads.append(t)
    t.start()
    time.sleep(1)

    for t in threads:
        if t.is_alive() == False:
            t.join()

    # periodically delete data
    delete = []
    for key, val in collection.items():
        if key < time.time() - 10:
            delete.append(key)
    for delet in delete:
        print('DELETING:', delet)
        del collection[delet]
    gc.collect()

    print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
    print(tracemalloc.get_traced_memory())
    all_objects = pympler.muppy.get_objects()
    sum1 = pympler.summary.summarize(all_objects)
    pympler.summary.print_(sum1)
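One detail visible in the minimal example, though not necessarily the whole story: the threads list keeps a reference to every Thread object ever created, so even joined threads (and anything still reachable from them) can never be garbage collected. A small self-contained sketch of pruning finished threads each cycle:

import threading
import time

def worker():
    time.sleep(0.5)   # stand-in for the data collection

threads = []
for cycle in range(5):
    t = threading.Thread(target=worker)
    threads.append(t)
    t.start()
    time.sleep(1)

    # join the finished threads, then drop our references to them;
    # otherwise every Thread object ever created stays reachable through the list
    for done in [t for t in threads if not t.is_alive()]:
        done.join()
    threads = [t for t in threads if t.is_alive()]

    print('CYCLE:', cycle, 'TRACKED THREADS:', len(threads),
          'ACTIVE THREADS:', threading.active_count())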

Is this a Python 3 regression in IPython Notebook?

I am attempting to create some simple asynchronously-executing animations based on ipythonblocks and I am trying to update the cell output area using clear_output() followed by a grid.show().
For text output the basis of the technique is discussed in Per-cell output for threaded IPython Notebooks, so my simplistic assumption was to use the same method to isolate HTML output. Since I want to repeatedly replace a grid with its updated HTML version, I try to use clear_output() to ensure that only one copy of the grid is displayed.
I verified that this proposed technique works for textual output with the following cells. First, the context manager:
import sys
from contextlib import contextmanager
import threading

stdout_lock = threading.Lock()
n = 0

@contextmanager
def set_stdout_parent(parent):
    """a context manager for setting a particular parent for sys.stdout
    (i.e. redirecting output to a specific cell). The parent determines
    the destination cell of output
    """
    global n
    save_parent = sys.stdout.parent_header
    # we need a lock, so that other threads don't snatch control
    # while we have set a temporary parent
    with stdout_lock:
        sys.stdout.parent_header = parent
        try:
            yield
        finally:
            # the flush is important, because that's when the parent_header actually has its effect
            n += 1; print("Flushing", n)
            sys.stdout.flush()
            sys.stdout.parent_header = save_parent
Then the test code
import threading
import time

class timedThread(threading.Thread):
    def run(self):
        # record the parent (including the stdout cell) when the thread starts
        thread_parent = sys.stdout.parent_header
        for i in range(3):
            time.sleep(2)
            # then ensure that the parent is the same as when the thread started
            # every time we print
            with set_stdout_parent(thread_parent):
                print(i)

timedThread().start()
This provided the output
0
Flushing 1
1
Flushing 2
2
Flushing 3
So I modified the code to clear the cell between cycles.
import IPython.core.display

class clearingTimedThread(threading.Thread):
    def run(self):
        # record the parent (including the stdout cell) when the thread starts
        thread_parent = sys.stdout.parent_header
        for i in range(3):
            time.sleep(2)
            # then ensure that the parent is the same as when the thread started
            # every time we print
            with set_stdout_parent(thread_parent):
                IPython.core.display.clear_output()
                print(i)

clearingTimedThread().start()
As expected the output area of the cell was repeatedly cleared, and ended up reading
2
Flushing 6
I therefore thought I was on safe ground in using the same technique to clear a cell's output area when using ipythonblocks. Alas no. This code
from ipythonblocks import BlockGrid

w = 10
h = 10

class clearingBlockThread(threading.Thread):
    def run(self):
        grid = BlockGrid(w, h)
        # record the parent (including the stdout cell) when the thread starts
        thread_parent = sys.stdout.parent_header
        for i in range(10):
            # then ensure that the parent is the same as when the thread started
            # every time we print
            with set_stdout_parent(thread_parent):
                block = grid[i, i]
                block.green = 255
                IPython.core.display.clear_output(other=True)
                grid.show()
            time.sleep(0.2)

clearingBlockThread().start()
does indeed produce the desired end state (a black matrix with a green diagonal), but the intermediate steps don't appear in the cell's output area. To complicate things slightly (?), this example is running on Python 3. In checking before posting here I discovered that the expected behavior (a simple animation) does in fact occur under Python 2.7. Hence I thought to ask whether this is an issue I need to report.
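For comparison, outside the threaded parent-header machinery, the usual way to animate a cell in more recent IPython versions is clear_output(wait=True) in a plain loop. A minimal sketch (text only, no ipythonblocks) that can be used to check whether output clearing itself behaves differently between Python 2.7 and 3:

import time
from IPython.display import clear_output

for i in range(10):
    clear_output(wait=True)   # wait for the new output before clearing the old frame
    print("frame", i)
    time.sleep(0.2)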
