Python Counter too slow - python-3.x

I just wanted to code a little timer on my work PC. Funny thing is, the counter is too slow, meaning it runs longer than it should. I am really confused. The delay grows as the update interval gets smaller. Is my PC too slow? The CPU is around 30% while running this.
Python 3.6.3:
import time

def timer(sec):
    start = sec
    print(sec)
    while sec > 0:
        sec = sec - 0.1  # the smaller this value, the slower
        time.sleep(0.1)
        print(round(sec, 2))
    print("Done! {} Seconds passed.".format(start))

start = time.time()  # For testing
timer(10)
print(time.time() - start)

Sleeping your process requires a system call (a call to the kernel, which triggers a hardware interrupt to hand control to the kernel), and a hardware clock interrupt to wake the process up again once the interval is over. Sleeping may not involve much CPU computation, but waiting for the hardware interrupt and for the kernel to schedule the process again can take many CPU cycles. That overhead is paid on every call, so each time.sleep(0.1) lasts slightly longer than 0.1 seconds, and the more sleeps you do, the more the error accumulates.
Rather than waiting for a constant unit of time, I suggest you wait for the time required to hit the next milestone (by getting the current time, rounding it up to the next step and sleeping for the difference).
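A minimal sketch of that idea, using only the standard time module; the function name tick_timer and the 0.1-second step are illustrative choices, not part of the original question:

import time

def tick_timer(sec, step=0.1):
    # Schedule every tick against absolute time, so the per-call sleep
    # overhead does not accumulate across iterations.
    start = time.time()
    remaining = sec
    next_tick = start + step
    while remaining > 0:
        delay = next_tick - time.time()  # time left until the next milestone
        if delay > 0:
            time.sleep(delay)
        remaining = remaining - step
        next_tick = next_tick + step
        print(round(remaining, 2))
    print("Done! {} Seconds passed.".format(sec))

tick_timer(10)

Bracketed with the same time.time() check as in the question, this should stay within a few milliseconds of the requested 10 seconds rather than drifting by one sleep's overhead per tick.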

Try it this way; you can use normal arithmetic and comparisons on time.time() values:
import time

start = time.time()
seconds = 5

while True:
    if time.time() - start > seconds:
        print("{} seconds elapsed.".format(seconds))
        break

Related

Using time.time() in Python to measure time elapsed for code segments consumes significant time itself

OS: Ubuntu 20.04.3 LTS
In my code written in Python, I am using time.time() to calculate the time taken by various parts of the code. So, I have multiple blocks as shown below in my overall code:
start_time = time.time()
# some code
end_time = time.time()
Now, since there are multiple blocks like the one above in the overall code (because I need to measure the time consumed by various code segments, not just the overall time consumption), what I noticed is that the time.time() statements themselves consume so much time that the overall runtime of the code shoots up from, say, 10 seconds to 15 seconds. I was expecting time.time() to take an insignificant amount of time, with a negligible effect on the overall runtime. Could you please help me tackle this issue?
I wanted to see how much time each code block consumes compared to the overall runtime, but if the overall runtime itself is significantly affected by the time.time() statements, then that is a problem.
Also, I noticed that the sum of the times consumed by the individual code blocks is much less than the total time consumed, so I am clueless about what eats up the remaining time; where does it go?
Any help would be very much appreciated. Thanks!
start_time = time.time()
# Code
total_time = str(time.time() - start_time)
Or if you want to get multiple times:
start_time = time.time()
# some code
checkpoint1 = str(time.time() - start_time)
# more code
checkpoint2 = str(time.time() - start_time)
# ...
It seems this is expected behaviour.
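If you want to quantify that overhead on your machine, a quick check like the sketch below (using only the standard timeit and time modules; the exact numbers will vary by machine and OS) gives a rough per-call cost, and time.perf_counter() is generally the preferred clock for measuring elapsed time inside a process:

import timeit

calls = 1_000_000
t_time = timeit.timeit("time.time()", setup="import time", number=calls)
t_perf = timeit.timeit("time.perf_counter()", setup="import time", number=calls)
print("time.time():         {:.0f} ns per call".format(t_time / calls * 1e9))
print("time.perf_counter(): {:.0f} ns per call".format(t_perf / calls * 1e9))

If each call costs only tens of nanoseconds, the overhead only becomes noticeable when the timing statements sit inside hot loops around very cheap code.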

Handling RabbitMQ heartbeats when cpu is loaded 100% for a long time

I'm using pika 1.1 and graph-tool 3.4 in my Python application. It consumes tasks from RabbitMQ, which are then used to build graphs with graph-tool and run some calculations.
Some of the calculations, such as betweenness, take a lot of CPU power, which makes CPU usage hit 100% for a long time. Sometimes the RabbitMQ connection drops, which causes the task to start from the beginning.
Even though the calculations run in a separate process, my guess is that while the CPU is loaded at 100%, the client never gets a chance to send a heartbeat to RabbitMQ, which causes the connection to be terminated. This doesn't happen every time, which suggests that by chance it does manage to send heartbeats from time to time. This is only my guess; I am not sure what else could cause it.
I tried lowering the priority of the calculation process using nice(19), which didn't work. I'm assuming it doesn't affect the processes spawned by graph-tool, which parallelizes work on its own.
Since it's just one line of code, graph.calculate_betweenness(..., I don't have a place to manually send heartbeats or slow the execution down to create a chance for heartbeats.
Can my guess about heartbeats not getting sent because cpu is super busy be correct?
If yes, how can I handle this scenario?
Answering your questions:
Yes, that's basically it.
The solution we use is to run the CPU-intensive task in a separate process:
import time
from multiprocessing import Process

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='logs', exchange_type='fanout')

result = channel.queue_declare(queue='', exclusive=True)
queue_name = result.method.queue

channel.queue_bind(exchange='logs', queue=queue_name)

def cpu_intensive_task(ch, method, properties, body):
    def work(body):
        time.sleep(60)  # If I remember well, the default HB is 30 seconds
        print(" [x] %r" % body)

    p = Process(target=work, args=(body,))
    p.start()
    # Important to notice: if you do p.join() you will have the same problem.

channel.basic_consume(
    queue=queue_name, on_message_callback=cpu_intensive_task, auto_ack=True)

channel.start_consuming()
I wonder if this is the best solution to this problem, or if RabbitMQ is the best tool for CPU-intensive tasks. (For really long CPU-intensive tasks (more than 30 min), if you send manual ACKs you will also need to deal with this: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)
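A variation on the same idea, sketched under the assumption that the work function above does the actual calculation: keep servicing the connection from the consuming process while the child runs, so heartbeats keep flowing, and acknowledge the message manually only once the child has finished. This is a sketch, not tested against the original setup:

def cpu_intensive_task(ch, method, properties, body):
    p = Process(target=work, args=(body,))
    p.start()
    while p.is_alive():
        # Let pika service the connection (including heartbeats)
        # while the child process does the heavy lifting.
        connection.process_data_events(time_limit=1)
    p.join()
    # Ack only after the work is done; requires auto_ack=False below.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(
    queue=queue_name, on_message_callback=cpu_intensive_task, auto_ack=False)

This keeps at-least-once delivery (an unacked message is redelivered if the consumer dies), at the cost of having to respect the broker's acknowledgement timeout mentioned above.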

When running Numba code on a CUDA GPU, I notice one of my CPU cores stays at 100%. Is this limiting performance?

I have test code that is computationally intense and I run it on the GPU using Numba. I noticed that while it is running, one of my CPU cores goes to 100% and stays there the whole time, and the GPU appears to be at 100% as well.
My benchmark code is as follows:
from numba import *
import numpy as np
from numba import cuda
import time

def benchmark():
    input_list = np.random.randint(10, size=3200000).astype(np.intp)
    output_list = np.zeros(input_list.shape).astype(np.intp)
    d_input_array = cuda.to_device(input_list)
    d_output_array = cuda.to_device(output_list)
    run_test[32, 512](d_input_array, d_output_array)
    out = d_output_array.copy_to_host()
    print('Result: ' + str(out))

@cuda.jit("void(intp[::1], intp[::1])", fastmath=True)
def run_test(d_input_array, d_output_array):
    array_slice_len = len(d_input_array) / (cuda.blockDim.x * cuda.gridDim.x)
    thread_coverage = cuda.threadIdx.x * array_slice_len
    slice_start = thread_coverage + (cuda.blockDim.x * cuda.blockIdx.x * array_slice_len)
    for step in range(slice_start, slice_start + array_slice_len, 1):
        if step > len(d_input_array) - 1:
            return
        count = 0
        for item2 in d_input_array:
            if d_input_array[step] == item2:
                count = count + 1
        d_output_array[step] = count

if __name__ == '__main__':
    import timeit
    # make_multithread(benchmark, 64)
    print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
You can run the code above to reproduce this if you have Python 3.7, Numba and cudatoolkit installed. I'm on Linux Mint 20.
I've got 32 cores; it doesn't seem right to have one at 100% while all the others sit idle.
I'm wondering why that is, and whether there is a way to have the other cores help with whatever is being done, to increase performance.
How can I investigate what is taking 100% of a single core and find out what is going on?
CUDA kernel launches (and some other operations) are asynchronous from the point of view of the host thread. And as you say, you're running the computationally intense portion of the work on the GPU.
So the host thread has nothing to do, other than launch some work and wait for it to be finished. The waiting process here is a spin-wait which means the CPU thread is in a tight loop, waiting for a status condition to change.
The CPU thread will hit that spin-wait here:
out = d_output_array.copy_to_host()
which is the line of code after your kernel launch, and it expects to copy (valid) results back from the GPU to the CPU. In order for this to work, the CPU thread must wait there until the results are ready. Numba implements this with a blocking sync operation, between GPU and CPU activity. Therefore, for most of the duration of your program, the CPU thread is actually waiting at that line of code.
This waiting takes up 100% of that thread's activity, and thus one core is seen as fully utilized.
There wouldn't be any sense or reason to try to "distribute" this "work" to multiple threads/cores, so this is not a "performance" issue in the way you are suggesting.
Any CPU profiler that shows hotspots or uses PC sampling should be able to give you a picture of this. That line of code should show up near the top of the list of lines of code most heavily visited by your CPU/core/thread.
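To see where that wait happens, you can split the implicit synchronization out explicitly; numba.cuda.synchronize() is the documented call for blocking until all queued GPU work has finished, and moving the wait there does not change the total time (shown only as an illustration of the answer above):

run_test[32, 512](d_input_array, d_output_array)  # returns immediately: kernel launches are asynchronous
cuda.synchronize()                                # the host thread spin-waits here until the kernel finishes
out = d_output_array.copy_to_host()               # the copy itself is then quick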

Python multiprocessing taking the brakes off OSX

I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (which is a 6-core processor) devotes about 9% of its time to the program and is idle ~90% of the time. So it looked like a prime candidate for multiprocessing, and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working with multiprocessing (I now run it using PyCharm) and of the program freezing when a join is done on a process before it has finished, I got it to work.
However, it doesn't matter how many processes I use (5, 10, 25 or 50), the result is always the same. The CPU devotes about 18% of its time to the program, is idle about 75% of the time, and the execution time slightly more than doubles to a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time, and how I can get the execution time to reflect this? Below are the relevant sections of the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue

NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25

def analyse_hands(numofhands, q):
    # code removed as not relevant to the problem
    q.put((distribution, points, notrumps))

if __name__ == '__main__':
    processlist = []
    q = Queue()
    handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
    print(handsperprocess)
    # Set up the processes and get them to do their stuff
    start_time = time.time()
    for _ in range(NUM_OF_PROCESSES):
        p = Process(target=analyse_hands, args=(handsperprocess, q))
        processlist.append(p)
        p.start()
    # Allow q to get a few items
    time.sleep(.05)
    while not q.empty():
        while not q.empty():
            # code removed as not relevant to the problem
        # Allow q to be refreshed so allowing all processes to finish before
        # doing a join. It seems that doing a join before a process is
        # finished will cause the program to lock
        time.sleep(.05)
        counter['empty'] += 1
    for p in processlist:
        p.join()
    while not q.empty():
        # This is never executed as all the processes have finished and q
        # emptied before the join command above.
        # code removed as not relevant to the problem
    finish_time = time.time()
I have no answer as to why IDLE will not run a multiprocessing start instruction correctly, but I believe the answer to the doubling of the execution time lies in the type of problem I am dealing with. Perhaps others can comment, but it seems to me that the overhead involved in adding and removing items to and from the Queue is quite high, so performance improvements are best achieved when the amount of data being passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data, and I suppose it is the overhead of passing this number of items via the Queue that kills any performance improvement from obtaining the data via separate processes. Using a map, it seems all 10**7 items would need to be stored in the map before any further processing could be done. That might improve performance depending on the overhead of the map and of handling that amount of data, but for the time being I will stick with my original vanilla, single-process code.
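A minimal sketch of that trade-off, where each worker aggregates its own results into a Counter and puts a single summary on the Queue instead of one item per hand. The analyse_one_hand shape analysis here is a made-up stand-in for the removed code, and the process count is arbitrary:

import collections
import random
from multiprocessing import Process, Queue

def analyse_one_hand():
    # Deal 13 cards from a 52-card deck and return the suit shape, e.g. (5, 4, 3, 1).
    hand = random.sample(range(52), 13)
    suits = collections.Counter(card // 13 for card in hand)
    return tuple(sorted((suits[s] for s in range(4)), reverse=True))

def analyse_hands(numofhands, q):
    # Aggregate inside the worker: one Queue.put per process instead of
    # one per hand keeps the inter-process traffic small.
    distribution = collections.Counter()
    for _ in range(numofhands):
        distribution[analyse_one_hand()] += 1
    q.put(distribution)

if __name__ == '__main__':
    NUM_OF_HANDS = 10**6
    NUM_OF_PROCESSES = 4
    q = Queue()
    processes = [Process(target=analyse_hands,
                         args=(NUM_OF_HANDS // NUM_OF_PROCESSES, q))
                 for _ in range(NUM_OF_PROCESSES)]
    for p in processes:
        p.start()
    totals = collections.Counter()
    for _ in processes:
        totals += q.get()   # blocks until each worker has put its summary
    for p in processes:
        p.join()            # safe to join now that the queue has been drained
    print(totals.most_common(5))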

Expected duration of sleep(1)

What is the expected duration of a call to sleep with one as the argument? Is it some random time that doesn't exceed 1 second? Is it some random time that is at least one second?
Scenario:
Developer A writes code that performs some steps in sequence with an output device. The code is shipped and A leaves.
Developer B is advised from the field that steps j and k need a one-second interval between them. So he inserts a call to sleep(1) between those steps. The code is shipped and Developer B leaves.
Developer C wonders if the sleep(1) should be expected to sleep long enough, or whether a higher-resolution method should be used to make sure that at least 1000 milliseconds of delay occurs.
sleep() only guarantees that the process will sleep for at least the amount of time specified, so as you put it "some random time that is at least one second."
Similar behavior is mentioned in the man page for nanosleep:
nanosleep() suspends the execution of the calling thread until either at least the time specified in *req has elapsed...
You might also find the answers in this question useful.
My man page says this:
unsigned int sleep(unsigned int seconds);
DESCRIPTION
sleep() makes the calling thread sleep until seconds seconds have
elapsed or a signal arrives which is not ignored.
...
RETURN VALUE
Zero if the requested time has elapsed, or the number of seconds left
to sleep, if the call was interrupted by a signal handler.
So sleep() makes the thread sleep for as long as you tell it, but a signal wakes it up early. I see no further guarantees.
If you need a better, more precise waiting time, then sleep() is not good enough. There is nanosleep(), and (it sounds funny, but it is true) select() is the only POSIX-portable way to sleep for sub-second intervals (or with higher precision) that I am aware of.
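The question is about the C library sleep(), but Python's time.sleep() (used elsewhere on this page) documents the same behaviour: it suspends for at least the requested time, plus whatever the scheduler adds on top. A quick check, as a sketch:

import time

# Measure a few one-second sleeps; the elapsed time should never be
# below 1 second, only slightly above it.
for _ in range(3):
    start = time.perf_counter()
    time.sleep(1)
    elapsed = time.perf_counter() - start
    print("slept {:.4f} s ({:+.2f} ms over)".format(elapsed, (elapsed - 1) * 1000))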
