python3: why is print("\r" + "text") different on Linux and Windows terminals?

While learning Python 3, I wrote a little program that displays an ASCII-art bar graph on the console. I'm feeding this method some random numbers so I can watch the bar graph in action. Therefore I want it to print the bar graph over and over on the same line, NOT adding an LF.
What works fine on a Linux console does not on a Windows console.
WHY?! And how can I fix this for any platform?
for i in range(500):
    print("\r" + getProgressBar(progressPercentage=limitedRandGen(), width=consoleWidth), end="")
    time.sleep(50 / 1000)  # delay 50 ms between redraws
On Linux this code outputs every bar graph on the same single line (overwriting it over and over again, which is what I want). On Windows there is somehow an LF after each bar graph, resulting in something like this:
47% [==================================================================================
45% [==============================================================================
46% [=================================================================================
42% [=========================================================================
38% [==================================================================
40% [======================================================================
40% [=====================================================================
43% [===========================================================================
46% [===============================================================================
50% [=======================================================================================
48% [====================================================================================
49% [=====================================================================================
46% [================================================================================
47% [=================================================================================
46% [================================================================================
43% [==========================================================================
43% [==========================================================================
41% [=======================================================================
45% [==============================================================================
41% [=======================================================================
44% [============================================================================
42% [=========================================================================
41% [========================================================================
44% [============================================================================
46% [================================================================================
47% [==================================================================================
NOTE: The getProgressBar() method just returns a string with no CR+LF at all.

In the meantime, I found out that the Windows console inserts a line wrap (effectively an LF) as soon as a character is written to the last column of a line.
So my code was unable to stay on the same line until I reduced the width of what I print to at most maxConsoleWidth - 1, which fixed the problem.
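A minimal, self-contained sketch of that fix (my own illustration, not the original code; shutil is just one way to query the console width):

import shutil
import time

# leave the last column free so the Windows console never auto-wraps
width = shutil.get_terminal_size().columns - 1

for pct in range(101):
    bar = "{:3d}% [".format(pct) + "=" * ((width - 7) * pct // 100)
    print("\r" + bar.ljust(width), end="", flush=True)
    time.sleep(0.05)
print()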

The following works fine on macOS:
import sys
import time
import random

def getRandomLine():
    r = random.randrange(1, 20)
    return str(r) + '% ' + '=' * random.randrange(1, 20)

for i in range(20):
    print(getRandomLine().ljust(30) + '\r', end='')
    sys.stdout.flush()
    time.sleep(0.5)

Related

How to Fully Utilize CPU cores for skopt.forest_minimize

So I have the following code for running skopt.forest_minimize(), but the biggest challenge I am facing right now is that it is taking upwards of days to finish running even just 2 iterations.
SPACE = [skopt.space.Integer(4, max_neighbour, name='n_neighbors', prior='log-uniform'),
         skopt.space.Integer(6, 10, name='nr_cubes', prior='uniform'),
         skopt.space.Categorical(overlap_cat, name='overlap_perc')]

@skopt.utils.use_named_args(SPACE)
def objective(**params):
    score, scomp = tune_clustering(X_cont=X_cont, df=df, pl_brewer=pl_brewer, **params)
    if score == 0:
        print('saving new scomp')
        with open(scomp_file, 'w') as filehandle:
            json.dump(scomp, filehandle, default=json_default)
    return score

results = skopt.forest_minimize(objective, SPACE, n_calls=1, n_initial_points=1, callback=[scoring])
Is it possible to optimize the code above so that it computes faster? I noticed it was barely making use of my CPU; the highest CPU utilization was about 30% (it's a 9th-gen i7 with 8 cores).
Also, a question while I'm at it: is it possible to utilize a GPU for these computational tasks? I have a 3050 that I can use.
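One knob worth checking (my suggestion, not part of the question): forest_minimize takes an n_jobs argument that is forwarded to the underlying forest estimator, so tree fitting can use several cores. A minimal sketch, reusing objective, SPACE and scoring from above:

# n_jobs=-1 lets the surrogate forest fit on all available cores;
# note the objective itself still runs sequentially, so this only
# helps if fitting the surrogate (not tune_clustering) dominates
results = skopt.forest_minimize(
    objective, SPACE,
    n_calls=10, n_initial_points=5,
    n_jobs=-1,
    callback=[scoring],
)

As for the GPU: skopt's forest surrogate is scikit-learn based and, as far as I know, CPU-only, so the 3050 would not help unless tune_clustering itself can be moved to the GPU.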

question about combining def function() and PWM duty_ns() in micropython

As a MicroPython beginner, I combined a few pieces of code found on different forums in order to achieve a higher resolution for ESC signal control. The code generates pulses from a MIN of 1000000 ns to a MAX of 2000000 ns, but I could only do it in increments of 100. My code is kind of messy; sorry if that hurts your eyes. My question is: does it represent an actual 100 ns of resolution? And what's the trick to make it go in increments of 1? (Not sure whether that's even necessary, but I still hope someone can share some wisdom.)
from machine import Pin, PWM, ADC
from time import sleep

MIN = 10000
MAX = 20000

class setPin(PWM):
    def __init__(self, pin: Pin):
        super().__init__(pin)
    def duty(self, d):
        super().duty_ns(d * 100)
        print(d * 100)

pot = ADC(0)
esc = setPin(Pin(7))
esc.freq(500)
esc.duty(MIN)  # arming ESC at 1000 us.
sleep(1)

def map(x, in_min, in_max, out_min, out_max):
    return int((x - in_min) * (out_max - out_min) / (in_max - in_min) + out_min)

while True:
    pot_val = pot.read_u16()
    pulse_ns = map(pot_val, 256, 65535, 10000, 20000)
    if pot_val < 300:  # makes ESC more stable at startup.
        esc.duty(MIN)
        sleep(0.1)
    if pot_val > 65300:  # gives less tolerance when reaching MAX.
        esc.duty(MAX)
        sleep(0.1)
    else:
        esc.duty(pulse_ns)  # generates 1000000ns to 2000000ns of pulse.
        sleep(0.1)
Try changing:

esc.freq(500)  =>  esc.freq(250)

For example:

x = 3600
print(map(x, 256, 65535, 10000, 20000) * 100)
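On the increments-of-1 question, a minimal sketch of my own (not from the answer above; it reuses pot, esc, map and sleep from the question): the d * 100 scaling inside the duty() wrapper is what quantizes the pulse to 100 ns steps, so mapping the pot reading straight to nanoseconds and calling duty_ns() directly removes that quantization, at least as far as the API is concerned (the underlying hardware timer still bounds the real resolution):

while True:
    pot_val = pot.read_u16()
    # map directly into nanoseconds: 1000000..2000000 ns in steps of 1
    pulse_ns = map(pot_val, 256, 65535, 1000000, 2000000)
    esc.duty_ns(pulse_ns)  # assumption: the port's PWM exposes duty_ns()
    sleep(0.1)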

tkinter animation speed on windows and linux

I was trying to create a simple 2D game using tkinter, but ran into an interesting problem: the animation speed is quite different on various computers.
To test this, I've created a script that measures the duration of an animation:
import tkinter as tk
import datetime

root = tk.Tk()
can = tk.Canvas(height=500, width=1000)
can.pack()
rect = can.create_rectangle(0, 240, 20, 260, fill='#5F6A6A')

def act():
    global rect, can
    pos = can.coords(rect)
    if pos[2] < 1000:
        can.move(rect, 5, 0)
        can.update()
        can.after(1)
        act()

def key_down(key):
    t = datetime.datetime.now()
    act()
    print(datetime.datetime.now() - t)

can.bind("<Button-1>", key_down)
root.mainloop()
and get these results:
i3-7100u ubuntu 20.04 laptop python3.8.5 - 0.5 seconds
i3-7100u windows 10 laptop python3.9.4 - 3 seconds
i3-6006u ubuntu 20.10 laptop python3.9.x - 0.5 seconds
i3-6006u windows 10 laptop python3.8.x - 3 seconds
i5-7200u windows 10 laptop python3.6.x - 3 seconds
i5-8400 windows 10 desktop python3.9.x - 3 seconds
fx-9830p windows 10 laptop python3.8.x - 0.5 seconds
The tkinter version is the same everywhere - 8.6.
How can it be fixed, or at least explained?
tkinter.Canvas.after should be used like so:
def act():
    global rect, can
    pos = can.coords(rect)
    if pos[2] < 1000:
        can.move(rect, 5, 0)
        can.update()
        can.after(1, act)
The after method is not like time.sleep. Rather than recursively calling the function, the above code schedules it to be called later, so this will break your timing code.
If you want to time it again, try this:
def act():
    global rect, can, t
    pos = can.coords(rect)
    if pos[2] < 1000:
        can.move(rect, 5, 0)
        can.update()
        can.after(1, act)
    else:
        print(datetime.datetime.now() - t)

def key_down(key):
    global t
    t = datetime.datetime.now()
    act()
This may still take different amounts of time on different machines. The difference can be caused by a variety of things, like CPU speed or the implementation of tkinter for your OS. It can be reduced by increasing the delay between iterations: tkinter.Canvas.after takes a time in milliseconds, so a delay of 16 can still give over 60 frames per second.
If keeping the animation speed constant is important, I would recommend you use delta time in your motion calculations rather than assuming a constant frame rate.
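A minimal sketch of the delta-time approach (my own illustration; it reuses can and rect from the question, and the SPEED value is an arbitrary assumption):

import time

SPEED = 250  # pixels per second, chosen arbitrarily
last = time.perf_counter()

def act():
    global last
    now = time.perf_counter()
    dt, last = now - last, now
    pos = can.coords(rect)
    if pos[2] < 1000:
        # the distance moved depends on elapsed time, not on the frame rate
        can.move(rect, SPEED * dt, 0)
        can.after(16, act)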
As you can see from your data, it doesn't matter which Python version you use. It appears that Ubuntu systems can process Python more efficiently; however, I'm pretty sure it's really down to the processor, or how much RAM the computer has.

Torch.cuda.empty_cache() very very slow performance

I have a very slow performance problem when I execute an inference batch loop on a single GPU.
The slow behavior appears after the first batch has been processed - that is, when the GPU is already almost full and its memory needs to be recycled to accept the next batch.
From a pristine GPU state, the performance is super fast (as expected).
I hope both the following code snippet and the output illustrate the problem in a nutshell.
(I've removed the print and time measurements from the snippet for brevity.)
predictions = None
for i, batch in enumerate(self.test_dataloader):
    # if this line is active - the bottleneck after the first batch moves here, rather than below
    # i.e. when i > 0
    # torch.cuda.empty_cache()

    # HUGE PERFORMANCE HIT HAPPENS HERE - after the first batch
    # i.e. when i > 0
    # obviously tensor.to(device) uses torch.cuda.empty_cache() internally when needed
    # and it is inexplicably SLOW
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu

    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]
    logits = logits.detach()

    # that doesn't help alleviate the issue
    del outputs

    predictions = logits if predictions is None else torch.cat((predictions, logits), 0)

    # nor do all of the below - freeing references doesn't help speeding up
    del logits
    del b_input_ids
    del b_input_mask
    del b_labels
    for o in batch:
        del o
    del batch
Output:
start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179 - HERE - time is super fast (as expected)
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036
start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849 - HERE - time is ridiculously high - it's 16 seconds to move tensor to GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036
start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204 - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05
...
Have I missed something obvious or is this the expected GPU behavior?
I am posting this question before engaging in complex coding, juggling between a couple of GPUs and CPU available on my server.
Thanks in advance,
Albert
EDIT
RESOLVED. The issue was in the DataLoader constructor: I changed pin_memory to False (True was causing the issue). That made the .to(device) calls roughly 3.5x-4x faster:
self.test_dataloader = DataLoader(
    test_dataset,
    sampler=SequentialSampler(test_dataset),
    # batch_size=len(test_dataset)  # AKA - single batch - nope! no mem for that
    batch_size=BATCH_SIZE_AKA_MAX_ROWS_PER_GUESS_TO_FIT_GPU_MEM,
    # tests
    num_workers=8,
    # maybe this is the culprit as suggested by user12750353 in stackoverflow
    # pin_memory=True
    pin_memory=False,
)
You should not be required to clear the cache if you are properly clearing the references to the previously allocated variables. Cached memory is effectively free memory: it is memory your script can reuse for new variables.
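A quick way to see this (my addition, not part of the original answer), using PyTorch's memory counters:

import torch

x = torch.zeros(10**8, device='cuda')  # a ~400 MB tensor
print(torch.cuda.memory_allocated())   # bytes held by live tensors
print(torch.cuda.memory_reserved())    # bytes held by the caching allocator
del x
print(torch.cuda.memory_allocated())   # drops back down
print(torch.cuda.memory_reserved())    # stays high: cached and reusable, not leaked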
Also notice that
a = torch.zeros(10**9, dtype=torch.float)
a = torch.zeros(10**9, dtype=torch.float)
requires 8GB of memory, even though a uses 4GB (1B elements of 4 bytes each). This occurs because torch.zeros allocates the new memory before the previous content of a is released. This may be happening in your model on a larger scale, depending on how it is implemented.
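One way to sidestep that transient double allocation (a sketch of mine, assuming the buffer can simply be reused rather than replaced):

a = torch.zeros(10**9, dtype=torch.float)
a.zero_()  # zeroes a in place, reusing its storage; no second 4 GB allocation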
Edit 1
One suspicious thing is that you are loading your batch to the GPU one example at a time.
Just to illustrate what I mean
import torch
device = 'cuda'
batch = torch.zeros((4500, 10))
Creating the batch as a tuple
batch_gpu = tuple(t.to(device) for t in batch)
torch.cuda.synchronize()
254 ms ± 36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Creating the batch as a list
batch_gpu = list(t.to(device) for t in batch)
torch.cuda.synchronize()
235 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Copying the whole batch at once
batch_gpu = batch.to(device)
torch.cuda.synchronize()
115 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In this example it was about 2000x faster to copy the whole batch at once than to copy one example at a time.
Notice that the GPU works asynchronously with the CPU, so you may keep calling functions that return before the operation has finished. In order to make meaningful measurements you should call synchronize to make the time boundaries clear.
The code to be instrumented is this:

for i, batch in enumerate(self.test_dataloader):
    # torch.cuda.empty_cache()
    # torch.cuda.synchronize()  # if empty_cache is used

    # start timer for copy
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu
    torch.cuda.synchronize()
    # stop timer for copy

    b_input_ids, b_input_mask, b_labels = batch

    # start timer for inference
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    torch.cuda.synchronize()
    # stop timer for inference

    logits = outputs[0]
    logits = logits.detach()
    # if you copy outputs to CPU it will be synchronized

threads and function 'print'

I'm trying to parallelize a script that prints out how many documents, pictures and videos there are in a directory, as well as some other information. I've put the serial script at the end of this message. Here's one example that shows how it outputs the information about a given directory:
7 documents use 110.4 kb ( 1.55 % of total size)
2 pictures use 6.8 Mb ( 98.07 % of total size)
0 videos use 0.0 bytes ( 0.00 % of total size)
9 others use 26.8 kb ( 0.38 % of total size)
Now, I would like to use threads to minimize the execution time. I've tried this:
import threading
import tools
import time
import os
import os.path

directory_path = "Users/usersos/Desktop/j"
cv = threading.Lock()

type_ = ["documents", "pictures", "videos"]
e = {}
e["documents"] = [".pdf", ".html", ".rtf", ".txt"]
e["pictures"] = [".png", ".jpg", ".jpeg"]
e["videos"] = [".mpg", ".avi", ".mp4", ".mov"]

class type_thread(threading.Thread):
    def __init__(self, n, e_):
        super().__init__()
        self.extensions = e_
        self.name = n

    def __run__(self):
        files = tools.g(directory_path, self.extensions)
        n = len(files)
        s = tools.size1(files)
        p = s * 100 / tools.size2(directory_path)
        cv.acquire()
        print("{} {} use {} ({:10.2f} % of total size)".format(n, self.name, tools.compact(s), p))
        cv.release()

types = [type_thread(t, e[t]) for t in type_]
for t in types:
    t.start()
for t in types:
    t.join()
When I run that, nothing is printed out! And when I type t followed by Return in the interpreter, I get <type_thread(videos, stopped 4367323136)>. What's more, sometimes the interpreter returns the right statistics for those same keystrokes.
Why is that?
Initial script (serial):

import tools
import time
import os
import os.path

type_ = ["documents", "pictures", "videos"]
all_ = type_ + ["others"]
e = {}
e["documents"] = [".pdf", ".html", ".rtf", ".txt"]
e["pictures"] = [".png", ".jpg", ".jpeg"]
e["videos"] = [".mpg", ".avi", ".mp4", ".mov"]

def statistic(directory_path):
    # ----------------------------- Computing ---------------------------------
    d = {t: tools.g(directory_path, e[t]) for t in type_}
    d["others"] = [os.path.join(root, f)
                   for root, _, files_names in os.walk(directory_path)
                   for f in files_names
                   if os.path.splitext(f)[1].lower() not in e["documents"] + e["pictures"] + e["videos"]]
    n = {t: len(d[t]) for t in type_}
    n["others"] = len(d["others"])
    s = {t: tools.size1(d[t]) for t in type_}
    s["others"] = tools.size1(d["others"])
    s_dir = tools.size2(directory_path)
    p = {t: s[t] * 100 / s_dir for t in type_}
    p["others"] = s["others"] * 100 / s_dir
    # ----------------------------- Printing ---------------------------------
    for t in all_:
        print("{} {} use {} ({:10.2f} % of total size)".format(n[t], t, tools.compact(s[t]), p[t]))
    return s_dir
The start() method seems not to work. When I replace

for t in types:
    t.start()
for t in types:
    t.join()

with

for t in types:
    t.__run__()

it works fine (at least for now; I don't know if it still will once I add other commands).
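For what it's worth, the reason start() appears to do nothing (my addition): threading.Thread.start() invokes a method named run(), with no underscores. The class above defines __run__ instead, so each thread starts, executes the default empty run(), and exits without printing anything; calling __run__() by hand "works" only because it runs the code serially on the main thread. A minimal sketch of the rename, reusing tools, cv and directory_path from the question:

class type_thread(threading.Thread):
    def __init__(self, n, e_):
        super().__init__()
        self.extensions = e_
        self.name = n

    def run(self):  # 'run', not '__run__': this is what start() calls
        files = tools.g(directory_path, self.extensions)
        n = len(files)
        s = tools.size1(files)
        p = s * 100 / tools.size2(directory_path)
        with cv:  # 'with' acquires and releases the lock safely
            print("{} {} use {} ({:10.2f} % of total size)".format(
                n, self.name, tools.compact(s), p))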
