How to set the r_bar part of tqdm - python-3.x

I use tqdm to print a progress bar for a long running optimization process with hyperopt.
The process calls a function, say, 500 times, and each call takes around 10 to 20 minutes, so I wanted to make the progress display a bit more fine-grained. I added some tqdm.update statements in the loop, advancing the progress bar fraction-wise, to avoid having two nested progress bars while still being able to see immediately how many function calls have been performed so far.
Now the ugly result looks like this:
15%|███▌ | 73.69999999999993/500 [7:40:31<102:54:08, 868.98s/it, evaluating fold 2 of 2 folds...]Iteration 1, loss = 2.50358388
You can see above that it is the 73rd call of the function and that this 73rd call is about 70% finished. In fact I just estimated the number of substeps m in the function (which might vary from call to call) and used the fraction 1/m to update the progress bar. After the function call I synchronize the progress bar back to a whole integer to avoid accumulating rounding errors.
Of course accuracy is not an issue at all here. But I would like to display 73.70 rather than 73.69999999999993.
I already tried rounding my update value to two decimal places, but that doesn't fix the problem: because of float precision, a number that is not exactly representable as a float still gets ugly-long.
According to the tqdm documentation this part of the output is controlled by the r_bar part of the whole format string, but I couldn't find a way to set it. Can you help me with this?
According to the docs r_bar defaults to:
r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'
Here is my code:
with tqdm(iterable=None, initial=num_trials, maxinterval=maxinterval, total=max_evals, ascii=False, disable=show_progressbar is False) as progress_bar:
    def fn_to_minimize(*args, **kwargs):
        return fn(*args, **kwargs, _progress_bar=progress_bar)

    for num_trials in range(num_trials, max_evals):
        progress_bar.n = float(num_trials)
        progress_bar.refresh()
        best = fmin(**kwargs, fn=fn_to_minimize, trials=trials, max_evals=num_trials+1)
        # do some other stuff here
In the called function (one of the entries in kwargs btw) I update the progress bar just like this:
_progress_bar.update(round(update_value, 2))

For rounding issues in tqdm, you can directly edit the formatting of the r_bar section as part of the bar_format parameter. For example:
from tqdm import trange
for i in trange(int(7e7), bar_format = "{desc}: {percentage:.3f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}"):
pass
which shows the percentage with three decimal places.
For two decimal places, you can simply replace {n_fmt} with {n:.2f}. You can also edit other parameters such as {desc} or add additional decimal places to the percentage.
from tqdm import trange
for i in trange(int(7e7), bar_format = "{desc}: {percentage:.10f}%|{bar}| {n:.2f}/{total_fmt} [{elapsed}<{remaining}"):
pass
which shows n with two decimal places and the percentage with ten.
Looking through the source code of tqdm, n_fmt is actually just str(n), so passing in a pre-formatted version of n bypasses its intrinsic formatting:
if unit_scale:
    n_fmt = format_sizeof(n, divisor=unit_divisor)
    total_fmt = format_sizeof(total, divisor=unit_divisor) \
        if total is not None else '?'
else:
    n_fmt = str(n)
    total_fmt = str(total) if total is not None else '?'

try:
    postfix = ', ' + postfix if postfix else ''
except TypeError:
    pass
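Applied to the setup in the question, a minimal self-contained sketch could look like the following (the loop, the sleep and the numbers are placeholders standing in for the hyperopt calls, not the original code):

from tqdm import tqdm
import time

max_evals = 5   # stands in for the 500 function calls
substeps = 4    # stands in for the m substeps inside each call

fmt = "{desc}: {percentage:3.0f}%|{bar}| {n:.2f}/{total_fmt} [{elapsed}<{remaining}]"

with tqdm(total=max_evals, bar_format=fmt) as progress_bar:
    for call in range(max_evals):
        for _ in range(substeps):
            time.sleep(0.05)                      # simulated work inside the call
            progress_bar.update(1.0 / substeps)   # fractional progress per substep
        progress_bar.n = float(call + 1)          # resynchronize to a whole number
        progress_bar.refresh()

This always displays something like 2.75/5 instead of 2.7500000000000004/5.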

Related

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy; however, I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters: a starting value for the number of ambulances in my system, starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, two calls for service, and so on up to 100 calls for service. The second parameter is an arrival rate, which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further, but I don't know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of up to five calls for service, and so on.
A second example of how I would like to process this data is to format it as percentages and write the result out to a text file.
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )

#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()

#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note that your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101):  # the end value is exclusive, so it's 101, not 100
    probability_arrivals.append([s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)])
    print(probability_arrivals[-1])
Now you don't need to manually worry about incrementing the counter.
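Putting the two changes together, a minimal sketch of the whole function could look like this (the names follow the question; the cumulative-probability part at the end is just an illustration of the further processing the question asks about):

import math

def probability_x(arrival_rate=0, max_calls=100):
    # Return a list of [calls, probability] pairs for a Poisson arrival process.
    probability_arrivals = []
    for calls in range(max_calls + 1):
        p = math.pow(arrival_rate, calls) * math.exp(-arrival_rate) / math.factorial(calls)
        probability_arrivals.append([calls, p])
    return probability_arrivals

results = probability_x(arrival_rate=5)
print(len(results))          # 101 rows, one per call count

# Example of further processing: cumulative probability of up to five calls.
cumulative = 0.0
for calls, p in results:
    cumulative += p
    if calls == 5:
        print("P(X <= 5) = %.2f%%" % (cumulative * 100))   # roughly 61.6%

Because the function now returns the full list instead of printing and discarding each row, the same data can be reused for the CSV export, cumulative sums and bar charts mentioned in the question.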

Problem with high values for wxPython.ProgressDialog

up front: I am using wxPython-4.0.4, Python 3.7.2, wx.ProgressDialog, and Windows 10
I encountered a problem while trying to create a progress bar. I was reading a file with over 1 million lines and added a progress bar that updates every 10k lines or so. What I found is that the bar's advance was... let's just say unexpected. I looked at the values I got with GetValue() and noticed that they were way smaller than the values I had set with Update().
I deleted all unnecessary code, experimented a bit, and noticed that the effect depends on the maximum value of the progress bar. Everything seems to work fine for values smaller than 65,536, which happens to be 2^16. I haven't found a similar case during my search for a solution. You can find my example code below.
Is this supposed to happen? Is there a way to avoid this behaviour (besides reducing the maximum value)? Or am I missing something? I know that other types of progress bar do not act in this manner but I would like to use this type due to its simplicity.
Thank you already in advance!
import wx

def pb_test(maxVal):
    print("Progress bar test with maxVal=%d" % maxVal)
    pb = wx.ProgressDialog(title="", message="")
    pb.SetRange(maxVal)  # set maximum of progress bar
    val = 20000          # arbitrary value to set progress bar to
    pb.Update(val)       # set value
    print("Value/Range: %d/%d\tExpected Value: %d" % (pb.GetValue(), pb.GetRange(), val))
    pb.Update(maxVal)    # set maximum value
    print("Value/Range: %d/%d\tExpected Value: %d" % (pb.GetValue(), pb.GetRange(), maxVal))

app = wx.App()
pb_test(65535)  # <-- this one works as expected
print("------------------")
pb_test(65536)  # <-- this one shows only half the expected values
EDIT: This is my output:
Progress bar test with maxVal=65535
Value/Range: 20000/65535 Expected Value: 20000
Value/Range: 65535/65535 Expected Value: 65535
------------------
Progress bar test with maxVal=65536
Value/Range: 10000/65536 Expected Value: 20000
Value/Range: 32768/65536 Expected Value: 65536
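No accepted fix is quoted here, but given that the problem starts exactly at 2^16, a common workaround is to keep the dialog's own range small and map the real values onto it yourself. A rough sketch (pb_test_scaled and SCALE_MAX are made up for illustration, not part of the original code):

import wx

SCALE_MAX = 10000  # keep the dialog's range well below 2**16

def pb_test_scaled(maxVal):
    pb = wx.ProgressDialog(title="", message="", maximum=SCALE_MAX)
    for val in (20000, maxVal):
        scaled = int(val * SCALE_MAX / maxVal)   # map the real value onto 0..SCALE_MAX
        pb.Update(scaled)
        print("Value/Range: %d/%d\tReal Value: %d" % (pb.GetValue(), pb.GetRange(), val))
    pb.Destroy()

app = wx.App()
pb_test_scaled(65536)

The dialog then only ever sees values in 0..10000, so the 16-bit limit is never reached, at the cost of keeping track of the real line count yourself.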

PyEphem: How to test if an object is above the horizon?

I am writing a Python script that gives basic data for all the planets, the Sun and the Moon. My first function divides the planets into those that are above the horizon and those that have not yet risen:
planets = {
    'mercury': ephem.Mercury(),
    'venus': ephem.Venus(),
    'mars': ephem.Mars(),
    'jupiter': ephem.Jupiter(),
    'saturn': ephem.Saturn(),
    'uranus': ephem.Uranus(),
    'neptune': ephem.Neptune()
}

def findVisiblePlanets(obs):
    visiblePlanets = dict()
    notVisiblePlanets = dict()
    for obj in planets:
        planets[obj].compute(obs)
        if planets[obj].alt > 0:
            visiblePlanets[obj] = planets[obj]
        else:
            notVisiblePlanets[obj] = planets[obj]
    return (visiblePlanets, notVisiblePlanets)
This works alright; the tuple I receive from findVisiblePlanets corresponds to the actual sky for the given obs.
But in another function, I need to test the altitude of each planet. If it's above 0, the script displays 'setting at xxx', and if it's under 0, the script displays 'rising at xxx'. Here is the code:
if bodies[obj].alt > 0:
    print(' Sets at', setTime.strftime('%H:%M:%S'), deltaSet)
else:
    print(' Rises at', riseTime.strftime('%H:%M:%S'), deltaRise)
So I'm using the exact same condition, except that this time it doesn't work. I am sure I have the correct object behind bodies[obj], as the script displays name, magnitude, distance, etc. But for some reason, the altitude (.alt) is always below 0, so the script only displays the rising time.
I tried print(bodies[obj].alt), and I receive a negative figure in the form of '-0:00:07.8' (example). I tried using int(bodies[obj].alt) for the comparison but this ends up being a 0. How can I test if the altitude is negative? Am I missing something obvious here?
Thanks for your help.
I think I had a similar problem once. As I understand it, PyEphem advances the time of your observer when you call next_rising() or next_setting() on an object: it looks for the first point in time at which the object is above/below the horizon. If you then read the body's alt, it will always be just that little bit below/above the horizon.
You have to store your observer time somehow and set it again after calculating setting/rising times.
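A rough sketch of that idea (the observer location here is made up for illustration; only the save/restore and recompute steps matter):

import ephem

obs = ephem.Observer()
obs.lat, obs.lon = '48.8566', '2.3522'   # example location, replace with your own

body = ephem.Mars()
saved_date = obs.date                    # remember the observer's original time

set_time = obs.next_setting(body)        # these calls leave the body computed at the event time
rise_time = obs.next_rising(body)

obs.date = saved_date                    # restore the original time...
body.compute(obs)                        # ...and recompute the altitude at that time
if body.alt > 0:
    print('Sets at', ephem.localtime(set_time))
else:
    print('Rises at', ephem.localtime(rise_time))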

ValueError, though a check has already been performed for this

Getting a little stuck with NaN data. This program trawls through a folder on an external hard drive, loads each txt file as a dataframe, and should read the very last value of the last column. As some of the last rows are incomplete for whatever reason, I have chosen to take the row before (or that's what I hope to have done). Here is the code, and I have commented the lines that I think are giving the trouble:
#!/usr/bin/env python3
import glob
import math
import pandas as pd
import numpy as np

def get_avitime(vbo):
    try:
        df = pd.read_csv(vbo,
                         delim_whitespace=True,
                         header=90)
        row = next(df.iterrows())
        t = df.tail(2).avitime.values[0]
        return t
    except:
        pass

def human_time(seconds):
    secs = seconds/1000
    mins, secs = divmod(secs, 60)
    hours, mins = divmod(mins, 60)
    return '%02d:%02d:%02d' % (hours, mins, secs)

def main():
    path = 'Z:\\VBox_Backup\\**\\*.vbo'
    events = {}
    customers = {}
    for vbo_path in glob.glob(path, recursive=True):
        path_list = vbo_path.split('\\')
        event = path_list[2].upper()
        customer = path_list[3].title()
        avitime = get_avitime(vbo_path)
        if not avitime:  # this is to check there is a number
            continue
        else:
            if event not in events:
                events[event] = {customer: avitime}
                print(event)
            elif customer not in events[event]:
                events[event][last_customer] = human_time(events[event][last_customer])
                print(events[event][last_customer])
                events[event][customer] = avitime
            else:
                total_time = events[event][customer]
                total_time += avitime
                events[event][customer] = total_time
            last_customer = customer
    events[event][customer] = human_time(events[event][customer])
    df_events = pd.DataFrame(events)
    df.to_csv('event_track_times.csv')

main()
I put in a line to check for a value, but I am guessing that NaN is not a null value, hence it hasn't quite worked.
(C:\Users\rob.kinsey\AppData\Local\Continuum\Anaconda3) c:\Users\rob.kinsey\Programming>python test_single.py
BARCELONA
03:52:42
02:38:31
03:21:02
00:16:35
00:59:00
00:17:45
01:31:42
03:03:03
03:16:43
01:08:03
01:59:54
00:09:03
COTA
04:38:42
02:42:34
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
04:01:13
01:19:47
03:09:31
02:37:32
03:37:34
02:14:42
04:53:01
LAGUNA_SECA
01:09:10
01:34:31
01:49:27
03:05:34
02:39:03
01:48:14
SILVERSTONE
04:39:31
01:52:21
02:53:42
02:10:44
02:11:17
02:37:11
01:19:12
04:32:21
05:06:43
SPA
Traceback (most recent call last):
File "test_single.py", line 56, in <module>
main()
File "test_single.py", line 41, in main
events[event][last_customer] = human_time(events[event][last_customer])
File "test_single.py", line 23, in human_time
The output starts out correctly, apart from the sys:1 warning, and at least it carries on until the final error, which stalls the program completely. How can I get past this NaN issue? All the variables I am working with should be floats or should have been ignored; everything should be strings or floats until the time conversion, which works with integers.
Ok, even though no one answered, I am compelled to answer my own question as I am not convinced I am the only person that has had this problem.
There are three main reasons for receiving NaN in a dataframe. Most of them revolve around infinity, such as using 'inf' as a value or dividing by zero (0/0 in particular yields NaN). The Wikipedia page was the most helpful resource for me in solving this issue:
https://en.wikipedia.org/wiki/NaN
One other important point about NaN is that it works a little like a virus: anything it touches in any calculation also comes out as NaN, so the problem can spread quickly. What you are actually dealing with is missing data, and until you realize that is what it is, NaN is frustrating to track down, because it is a value rather than an error, yet any mathematical operation on it ends in NaN. Beware!
The reason on this occasion was that a specific line number was used to pick up the headers when reading in the file, and although that worked for the majority of these files, some of them had the headers I was after on a different line. As a result, the headers imported into the dataframe either were part of the data itself or were null values, so accessing a column by header name resulted in NaN, and, as discussed above, this proliferated through the program. One acceptable workaround was to add this line:
df = df.fillna(0)
after the first definition of the df variable, in this case:
df = pd.read_csv(vbo,
                 delim_whitespace=True,
                 header=90)
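Alternatively, a rough sketch (not tested against the original .vbo files) of making the missing-data check explicit in get_avitime rather than filling with zeros:

import pandas as pd

def get_avitime(vbo):
    df = pd.read_csv(vbo, delim_whitespace=True, header=90)
    t = df.tail(2).avitime.values[0]
    if pd.isna(t):       # explicit missing-data check instead of a bare except
        return None
    return float(t)

The caller's existing "if not avitime: continue" then skips files whose last rows are incomplete, without NaN ever entering the totals.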
The bottom line is that if you are receiving this value, the best thing really is to work out why you are getting NaN in the first place, then it is easier to make an informed decision as to whether or not replacing NaN with '0' is a viable choice.
I sincerely hope this helps anyone who finds it.
Regards
iFunction

Matplotlib - Stacked Bar Chart with ~1000 Bars

Background:
I'm working on a program to show a 2d cross section of 3d data. The data is stored in a simple text csv file in the format x, y, z1, z2, z3, etc. I take a start and end point and flick through the dataset (~110,000 lines) to create a line of points between these two locations, and dump them into an array. This works fine, and fairly quickly (takes about 0.3 seconds). To then display this line, I've been creating a matplotlib stacked bar chart. However, the total run time of the program is about 5.5 seconds. I've narrowed the bulk of it (3 seconds worth) down to the code below.
'values' is an array with the x, y and z values plus a leading identifier, which isn't used in this part of the code. The first plt.bar is plotting the bar sections, and the second is used to create an arbitrary floor of -2000. In order to generate a continuous looking section, I'm using an interval between each bar of zero.
import matplotlib.pyplot as plt

for values in crossSection:
    prevNum = None
    layerColour = None
    if values != None:
        for i in range(3, len(values)):
            if values[i] != 'n':
                num = float(values[i].strip())
                if prevNum != None:
                    plt.bar(spacing, prevNum-num, width=interval, \
                            bottom=num, color=layerColour, \
                            edgecolor=None, linewidth=0)
                prevNum = num
                layerColour = layerParams[i].strip()
        if prevNum != None:
            plt.bar(spacing, prevNum+2000, width=interval, bottom=-2000, \
                    color=layerColour, linewidth=0)
    spacing += interval
I'm sure there's a more efficient way to do this, but I'm new to Matplotlib and still unfamiliar with its capabilities. The other main use of time in the code is:
plt.savefig('output.png')
which takes about a second, but I figure this is to be expected to save the file and I can't do anything about it.
Question:
Is there a faster way of generating the same output (a stacked bar chart or something that looks like one) by using plt.bar() better, or a different Matplotlib function?
EDIT:
I forgot to mention in the original post that I'm using Python 3.2.3 and Matplotlib 1.2.0
Leaving this here in case someone runs into the same problem...
While it's not exactly the same as using bar(), with a sufficiently large dataset (large enough that bar() takes a few seconds) the result is visually indistinguishable from a stackplot(). If I sort the data into layers using the method given by tcaswell and feed it into stackplot(), the chart is created in 0.2 seconds rather than 3 seconds.
EDIT
Code provided by tcaswell to turn the data into layers:
accum_values = []
for values in crosssection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)
It looks like you are drawing each bar individually; you can pass sequences to bar (see this example).
I think something like:
import numpy as np
import matplotlib.pyplot as plt

accum_values = []
for values in crosssection:
    accum_values.append([float(v.strip()) for v in values[3:]])
accum_values = np.vstack(accum_values).T
layer_params = [l.strip() for l in layerParams]
bottom = np.zeros(accum_values[0].shape)

ax = plt.gca()
spacing = interval * np.arange(len(accum_values[0]))
for data, color in zip(accum_values, layer_params):
    ax.bar(spacing, data, bottom=bottom, color=color, linewidth=0, width=interval)
    bottom += data
will be faster (because each call to bar creates one BarContainer, and I suspect the source of your issue is that you were creating one for each bar instead of one for each layer).
I don't really understand what you are doing with the bars whose tops are below their bottoms, so I didn't try to implement that; you will have to adapt this a bit.
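For reference, a minimal self-contained stackplot() sketch along the lines of what the edit above describes (the data here is made up purely for illustration; each row is assumed to hold the thickness of one layer, so absolute elevations would need differencing first):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 3 layers at 5 positions along the section.
x = np.arange(5) * 10.0                      # spacing along the section
thicknesses = np.array([[5, 6, 5, 4, 5],     # layer 1
                        [3, 3, 4, 4, 3],     # layer 2
                        [2, 2, 2, 3, 2]])    # layer 3
colors = ['#c2b280', '#8b7355', '#555555']   # one colour per layer

plt.stackplot(x, thicknesses, colors=colors)
plt.savefig('output.png')

Because stackplot() draws one filled polygon per layer instead of one patch per bar, it scales much better with the number of positions along the section.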
