Wandb first run start time is delayed - pytorch

I wanted to compare the execution speeds of three data types. The runs were executed in the order Original, DictList, DataFrame, so Original was the first run. The x-axis is set to Relative Time (Process).
The problem is that each run's starting time is different! How can I make sure they all start at time 0?
I can't post the entire code, so I'll post every point where the wandb API is called.
import wandb

if __name__ == "__main__":
    wandb.login()
    datatypes = {"Original": Original, "DictList": DictList, "DataFrame": DataFrame}
    for type_name, datatype in datatypes.items():
        wandb_run = wandb.init(project="Compare", name=type_name, reinit=True)
        with wandb_run:
            # Initialize RL training session
            storage = datatype()
            # Run RL training session
            # Log (loss, elapsed_steps, etc.)
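One way to make all three curves start at zero is to log a hand-computed relative time as a custom x-axis metric instead of relying on wandb's built-in Relative Time. A minimal sketch under that assumption (Original, DictList and DataFrame are the classes from the question; loss is a placeholder for whatever is logged inside the training loop):

import time
import wandb

if __name__ == "__main__":
    wandb.login()
    datatypes = {"Original": Original, "DictList": DictList, "DataFrame": DataFrame}
    for type_name, datatype in datatypes.items():
        with wandb.init(project="Compare", name=type_name, reinit=True) as run:
            # make the charts use our own clock, which starts at 0 for every run
            run.define_metric("relative_time")
            run.define_metric("loss", step_metric="relative_time")
            start = time.time()
            storage = datatype()
            # inside the RL training loop, log against the elapsed time:
            run.log({"loss": loss, "relative_time": time.time() - start})

In the UI you would then select relative_time as the x-axis, so all three runs are aligned at 0 by construction.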

Related

Slow multiprocessing when using torch

I have a function func that takes on average 9 seconds to run. But when I try to use multiprocessing to parallelize it (even using torch.multiprocessing), each inference takes on average 20 seconds. Why is that?
func is an inference function that takes a patient_name and runs a torch model in inference mode on that patient's data.
import numpy as np
import torch
from torch.multiprocessing import Pool

device = torch.device('cpu')

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    model_state = torch.load(my_dict[patient_name]['model_state_path'], map_location='cpu')
    model = my_net(my_dict[patient_name]['HPs'])
    model = model.to(device)
    model.load_state_dict(model_state)
    model.eval()
    result = model(torch.FloatTensor(data).to(device))
    return result

core_cnt = 10
pool = Pool(core_cnt)
out = pool.starmap(func, pool_args)
Please check whether inference on your model architecture, with the data you provide, already uses substantial computational power. If it does, the OS process scheduler has to switch between the processes, which takes additional time.
Also, you load the model every time inside your function. It is always faster to copy a data object between processes than to read it from disk (that is the strategy with multiprocessing), or to share the model altogether with torch.multiprocessing.
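A minimal sketch of that second suggestion, reusing the my_net and my_dict names from the question and assuming the same model weights apply to every patient (if each patient has its own weights, only the data loading can be hoisted this way): the model is loaded once per worker via a Pool initializer, so each call only does the forward pass.

import numpy as np
import torch
from torch.multiprocessing import Pool

_model = None  # populated once per worker process by init_worker

def init_worker(model_state_path, hps):
    global _model
    _model = my_net(hps)  # my_net / HPs come from the question
    _model.load_state_dict(torch.load(model_state_path, map_location='cpu'))
    _model.eval()

def func(patient_name):
    data = np.load(my_dict[patient_name]['data_path'])
    with torch.no_grad():
        return _model(torch.FloatTensor(data))

if __name__ == '__main__':
    patient_names = list(my_dict.keys())
    state_path = my_dict[patient_names[0]]['model_state_path']
    hps = my_dict[patient_names[0]]['HPs']
    with Pool(10, initializer=init_worker, initargs=(state_path, hps)) as p:
        out = p.map(func, patient_names)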

Dask: Submit continuously, work on all submitted data

I have 500 continuously growing DataFrames and would like to submit operations on the data (independent for each DataFrame) to Dask. My main question is: can Dask hold the continuously submitted data, so that I can submit a function over all of the submitted data, not just the newly submitted part?
Let me explain with an example:
Creating a dask_server.py:
from dask.distributed import Client, LocalCluster

HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()
Now I can connect from my_stream.py and start to submit and gather data:
import threading
import time

import pandas as pd
from dask.distributed import Client

DASK_CLIENT_IP = '127.0.0.1'
DASK_CLIENT_PORT = 8711  # scheduler port from dask_server.py
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(dask_con_string)

def my_dask_function(lines):
    return lines['a'].mean() + lines['b'].mean()

def async_stream_redis_to_d(max_chunk_size=1000):
    while True:
        # This is a redis queue, but it could be any queue/file-stream/syslog source;
        # queue_IN and streamer belong to the surrounding object in the full code
        lines = self.queue_IN.get(block=True, max_chunk_size=max_chunk_size)
        futures = []
        df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
        futures.append(dask_client.submit(my_dask_function, df))
        result = dask_client.gather(futures)
        print(result)
        time.sleep(0.1)

if __name__ == '__main__':
    max_chunk_size = 1000
    thread_stream_data_from_redis = threading.Thread(target=streamer.async_stream_redis_to_d, args=[max_chunk_size])
    # thread_stream_data_from_redis.setDaemon(True)
    thread_stream_data_from_redis.start()
    # Lets go
This works as expected and it is really quick!
But next, I would like to actually append the lines before the computation takes place, and I wonder whether this is possible. In this example, I would like to calculate the mean over all lines that have been submitted so far, not only over the most recently submitted ones.
Questions / Approaches:
Is this cumulative calculation possible?
Bad Alternative 1: I cache all lines locally and submit all of the data to the cluster every time a new row arrives. The overhead grows with every append; I tried it and it works, but it is slow!
Golden Option: Python Program 1 pushes the data. Then it would be possible to connect with another client (from another Python program) to that accumulated data and move the analysis logic away from the inserting logic. I think Published Datasets are the way to go, but are they applicable for such high-speed appends?
Maybe related: Distributed Variables, Actors Worker
Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up to date within a few milliseconds:
client.datasets["x"] = list_of_futures

def worker_function(...):
    futures = get_client().datasets["x"]
    data = get_client().gather(futures)
    ... work with data
As you mention, there are other systems like PubSub or Actors. From what you say, though, I suspect that futures plus published datasets are the simpler and more pragmatic option.
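A rough sketch of that pattern under the setup from the question (scheduler at tcp://127.0.0.1:8711, columns a/b/c); the dataset name stream_chunks and the helper functions are made up for illustration:

import pandas as pd
from dask.distributed import Client

client = Client('tcp://127.0.0.1:8711')
NAME = 'stream_chunks'  # arbitrary published-dataset name

# Producer side: scatter each new chunk and republish the growing list of futures.
def publish_chunk(lines):
    df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
    future = client.scatter(df)
    existing = list(client.get_dataset(NAME)) if NAME in client.list_datasets() else []
    existing.append(future)
    if NAME in client.list_datasets():
        client.unpublish_dataset(NAME)
    client.publish_dataset(**{NAME: existing})

# Consumer side (can be a separate process with its own Client): gather everything so far.
def mean_over_all_chunks():
    futures = list(client.get_dataset(NAME))
    df = pd.concat(client.gather(futures), ignore_index=True)
    return df['a'].mean() + df['b'].mean()

Only the list of futures is republished on each append, so the per-append cost stays small; the chunks themselves stay on the workers until a consumer gathers them.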

How to make python script update in realtime (run continuously)

Hello, I am a beginner in Python and I have just learned the fundamentals and basics of programming. I wanted to make a program that notifies me when my class ends; however, once I run the program it only executes once.
When I loop it, it does not continuously read the date and time (instead it keeps the time from when the code was executed). Any suggestions on how to resolve this problem?
import win10toast
import datetime

currentDT = datetime.datetime.now()
toaster = win10toast.ToastNotifier()

while (1):
    def Period_A():
        if currentDT.hour == 7 and currentDT.minute == 30:
            toaster.show_toast('Shedule Reminder', 'It is Period A time!', duration=10)
I wanted the code to run in the background and update the date and time continuously so that the notification appears at the desired time, not at the time the code was executed.
The line currentDT = datetime.datetime.now() is only called once in the whole program, so currentDT stays fixed at approximately the time you ran the script.
Since you want to continuously check the time and compare it to a set time, you have to move that line inside the loop.
Secondly, you define a function Period_A in the loop, which by itself does nothing because you never call it. If you don't need the abstraction a function gives you, there is no point in defining one just to call it once.
import datetime
import win10toast

toaster = win10toast.ToastNotifier()

while 1:
    currentDT = datetime.datetime.now()
    if currentDT.hour == 7 and currentDT.minute == 30:
        toaster.show_toast('Shedule Reminder', 'It is Period A time!', duration=10)
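One detail not covered above: the loop spins as fast as it can, and the toast would fire repeatedly during that whole minute. A small variation (my own addition, not part of the original answer) sleeps between checks and remembers whether the reminder has already been shown:

import datetime
import time
import win10toast

toaster = win10toast.ToastNotifier()
already_notified = False

while True:
    currentDT = datetime.datetime.now()
    if currentDT.hour == 7 and currentDT.minute == 30:
        if not already_notified:
            toaster.show_toast('Shedule Reminder', 'It is Period A time!', duration=10)
            already_notified = True
    else:
        already_notified = False
    time.sleep(5)  # check a few times per minute instead of busy-waiting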

Multiprocessing a function that tests a given dataset against a list of distributions. Returning function values from each iteration through list

I am working on processing a dataset that includes dense GPS data. My goal is to use parallel processing to test my dataset against all possible distributions and return the best one, with the parameters generated for that distribution.
Currently, I have code that does this serially, thanks to this answer: https://stackoverflow.com/a/37616966. Of course, it is going to take far too long to process my full dataset. I have been playing around with multiprocessing but can't seem to get it to work right. I want it to test multiple distributions in parallel, keeping track of the sum of squared errors. Then I want to select the distribution with the lowest SSE and return its name along with the parameters generated for it.
from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    # Block of code that tests the distribution and generates params
    return (distribution.name, best_params, sse)

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()
I need some help with how to actually make use of the return values from each of the iterations in the multiprocessing so I can compare them. I'm really new to Python, especially multiprocessing, so please be patient with me and explain as much as possible.
The problem I'm having is that it gives me an "UnboundLocalError" on the variables I'm trying to return from my fit_dist function. The DISTRIBUTIONS list contains 89 objects. Could this be related to the parallel processing, or is it something to do with the definition of fit_dist?
With the help of Tomerikoo's comment and some further struggling, I got the code working the way I wanted. The UnboundLocalError was due to me not putting the return statement in the correct block of code within my fit_dist function. To answer the question, I did the following.
from multiprocessing import Pool

def fit_dist(distribution, data=data, bins=200, ax=None):
    # put this return under the right section of this method
    return [distribution.name, params, sse]

if __name__ == '__main__':
    p = Pool()
    result = p.map(fit_dist, DISTRIBUTIONS)
    p.close()
    p.join()

    '''filter out the None object results. Due to the nature of the distribution fitting,
    some distributions are so far off that they result in None objects'''
    res = list(filter(None, result))

    # iterate over the nested list, storing the lowest sum of squared errors in best_sse
    best_sse = float('inf')
    for dist in res:
        if best_sse > dist[2] > 0:
            best_sse = dist[2]
        else:
            continue

    '''iterate over the list, pulling out the sublist of the distribution with the best sse.
    The sublists are made up of a string, a tuple with the parameters,
    and a float value for sse, so that's why sse is always index 2.'''
    for dist in res:
        if dist[2] == best_sse:
            best_dist_list = dist
        else:
            continue
The rest of the code simply uses that list to construct charts and plots of that best distribution on top of a histogram of my raw data.
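As an aside, the two selection loops can be collapsed into a single min() call over the filtered results; this is just a refactoring suggestion, not part of the original answer:

# keep only entries with a positive SSE, then take the one with the smallest SSE
candidates = [entry for entry in res if entry[2] > 0]
best_dist_list = min(candidates, key=lambda entry: entry[2])
best_name, best_params, best_sse = best_dist_list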

Plot a continuous graph of Number of Snort alerts against time

I have Snort logging DDoS alerts to file. I use syslog-ng to parse the logs and output them in JSON format into Redis (I wanted to set it up as a buffer; I use the 'setex' command with an expiry of 70 seconds).
The whole thing does not seem to be working well; any ideas to make it easier are welcome.
I wrote a simple Python script to listen to Redis keyspace events and count the number of Snort alerts per second. I tried creating two other threads: one to retrieve the JSON-formatted alerts and a second to count the alerts. A third is supposed to plot a graph using matplotlib.pyplot.
#import time
from redis import StrictRedis as sr
import os
import json
import matplotlib.pyplot as plt
import threading as th
import time

redis = sr(host='localhost', port=6379, decode_responses=True)
#file = open('/home/lucidvis/vis_app_py/log.json','w+')

# This function is still being worked on
def do_plot():
    print('do_plot loop running')
    while accumulated_data:
        x_values = [int(x['time_count']) for x in accumulated_data]
        y_values = [y['date'] for y in accumulated_data]
        plt.title('Attacks Alerts per time period')
        plt.xlabel('Time', fontsize=14)
        plt.ylabel('Snort Alerts/sec')
        plt.tick_params(axis='both', labelsize=14)
        plt.plot(y_values, x_values, linewidth=5)
        plt.show()
        time.sleep(0.01)

def accumulator():
    # first off, check the current json data and see if its 'sec' value is the same
    # as the last one in the accumulated data list;
    # if it is the same, increase time_count by one, else pop that value
    pointer_data = {}
    print('accumulator loop running')
    while True:
        # pointer_data is the current sec of json data used for comparison
        # new_data is the latest json-formatted alert received
        # received_from_redis is a list declared in the main block
        if received_from_redis:
            new_data = received_from_redis.pop(0)
            if not pointer_data:
                pointer_data = new_data.copy()
                print(">>", type(pointer_data), " >> ", pointer_data)
            if pointer_data and pointer_data['sec'] == new_data["sec"]:
                pointer_data['time_count'] += 1
            elif pointer_data:
                accumulated_data.append(pointer_data)
                pointer_data = new_data.copy()
                pointer_data.setdefault('time_count', 1)
        else:
            time.sleep(0.01)

# main function creates the redis pubsub object and receives messages based on events;
# this function is run alongside the two threads so they appear to run concurrently
def main():
    p = redis.pubsub()
    p.psubscribe('__keyspace#0__*')
    print('Starting message loop')
    while True:
        try:
            time.sleep(2)
            message = p.get_message()
            # Obtain the key from the redis-emitted event if the event is a set event
            if message and message['data'] == 'set':
                # the format emitted by redis is a dict;
                # the key is the value for the key 'channel'
                # and is in '__keyspace#0__*' form, so take
                # the last field of the list returned by split
                key = message['channel'].split('__:')[-1]
                data_redis = json.loads(redis.get(str(key)))
                received_from_redis.append(data_redis)
        except Exception as e:
            print(e)
            continue

if __name__ == "__main__":
    accumulated_data = []
    received_from_redis = []
    thread_accumulator = th.Thread(target=accumulator, name='accumulator')
    do_plot_thread = th.Thread(target=do_plot, name='do_plot')
    while True:
        thread_accumulator.start()
        do_plot_thread.start()
        main()
        thread_accumulator.join()
        do_plot_thread.join()
I currently do get errors; I just can't tell whether the threads are created or are working correctly. I need ideas to make things work better.
A sample of an alert, formatted as JSON and obtained from Redis, is below:
{"victim_port":"","victim":"192.168.204.130","protocol":"ICMP","msg":"Ping_Flood_Attack_Detected","key":"1000","date":"06/01-09:26:13","attacker_port":"","attacker":"192.168.30.129","sec":"13"}
I'm not sure I understand your scenario exactly, but if you want to count events that are essentially log messages, you can probably do that within syslog-ng itself: either with a Python destination (since you are already working in Python), or maybe even without additional programming by using the grouping-by parser.
