CPU Percentage Scan - python-3.x

I need to scan through all running processes and filter the ones with high CPU usage.
I currently have the code below, but it is taking too long to scan through them.
Is there a way this can be done faster?
```python
for proc in psutil.process_iter():
    try:
        cpu_usage = proc.cpu_percent(1)
        processName = proc.name()
        if cpu_usage > 2:
            processID = proc.pid
            print(processName, cpu_usage, processID)
    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
        pass
```

The reason it is taking so long is that proc.cpu_percent(1) is a blocking function that takes 1 second every time it is called. According to the docs, the function can be called once with interval=None to get a starting point and then again to get an ending point. This means you could call this function on every Process object in the iterator, then wait 1 second and call the function on all of the objects again. To my understanding this should achieve the same effect with only about 1 second of delay.
Edit for clarity:
Here is an example of what I explained above. I made it quickly, so I recommend you adapt it to your use case, but it does work.
```python
import psutil, time

def getTime(proc):
    try:
        cpu_usage = proc.cpu_percent(interval=None)
        processName = proc.name()
        if cpu_usage > 2:
            processID = proc.pid
            print(processName, cpu_usage, processID)
    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
        pass

for proc in psutil.process_iter():
    getTime(proc)
time.sleep(1)
for proc in psutil.process_iter():
    getTime(proc)
```
It runs the getTime function for every process initially, and, as the docs state, the first time proc.cpu_percent is called on a process with interval=None it returns 0.0 as a meaningless value, because the function must establish a starting point for its measurement. After this it waits 1 second and then runs getTime on every process again. This time it returns meaningful results, because it has had 1 second per process to measure their usage. Essentially the function needs to be run twice for each process, but doing it like this skips a massive amount of time compared to how you were doing it originally.
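The two-pass idea generalizes beyond psutil: take one snapshot of a cumulative counter for every item, wait once, then take a second snapshot and diff. Here is a minimal stdlib-only sketch of that pattern, with fake counters standing in for psutil's per-process CPU times (the process names and numbers are made up for illustration):

```python
import time

# Fake cumulative CPU-time counters standing in for psutil's per-process
# data; "procA" and "procB" are made-up names for illustration.
counters = {"procA": 0.0, "procB": 0.0}

def read_counter(name):
    return counters[name]

# First pass: record a starting point for every process at once.
start = {name: read_counter(name) for name in counters}

time.sleep(0.1)            # ONE shared wait instead of one wait per process
counters["procA"] += 0.05  # simulate procA burning 50 ms of CPU

# Second pass: usage over the interval = (end - start) / interval * 100.
usages = {name: round((read_counter(name) - start[name]) / 0.1 * 100, 1)
          for name in counters}
print(usages)
```

The key point is that the single sleep is shared by all processes, so total wall time stays near 1 interval no matter how many processes are scanned.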

Related

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (compressed 10GB, uncompressed 100GB) which has some reports separated by demarcations, and I have to parse it.
Parsing and processing the data takes a long time, so this is a CPU-bound problem (not an IO-bound problem). So I am planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process.
```python
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
```
I am thinking of using a Lock() to read/parse one report in child-process-1 and then release the lock, switch to child-process-2 to read/parse the next report, and then switch back to child-process-1 for the report after that. When I share process.stdout as an argument with the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is far too slow (in fact it is slower than doing it in a single thread, i.e. serially).
Any thoughts/examples about sending data to child processes efficiently will help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
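A sketch of that skip-every-other-record idea, with a StringIO standing in for each worker's private gunzip stream (the record format and the toy summing "parse" are assumptions for illustration; in real use each worker would spawn its own gunzip subprocess and the workers would run in separate processes):

```python
import io

DELIM = "xxxxx\n"

def records(stream):
    """Yield each group of lines found between DELIM separators."""
    rec = []
    for line in stream:
        if line == DELIM:
            yield rec
            rec = []
        else:
            rec.append(line)

def worker(make_stream, worker_id, nworkers):
    """Each worker opens its own stream and keeps only every
    nworkers-th record, skipping the rest."""
    total = 0
    for i, rec in enumerate(records(make_stream())):
        if i % nworkers != worker_id:
            continue  # another worker owns this record
        total += sum(int(line) for line in rec)  # stand-in for real parsing
    return total

# Toy data standing in for the output of `gunzip --stdout big-file.gz`.
data = "1\n2\nxxxxx\n3\nxxxxx\n4\n5\n6\nxxxxx\n"
make_stream = lambda: io.StringIO(data)

partials = [worker(make_stream, w, 2) for w in range(2)]
print(partials, sum(partials))
```

Each worker decompresses the whole stream but only parses its own share, so there is no interprocess communication at all; the trade-off is N redundant gunzip passes.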
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
```python
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
```
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
```
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
```
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overhead of passing chunks of stdout to worker processes is all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
```python
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)
```

A first-year project: A small error in my while loop is driving me CRAZY. Any help greatly appreciated

The goal of this code is to develop a scheduler that works the shortest job first. I am given a process object that is initiated w/ arrival time, a completion time, and a process ID number. The timer is incremented with each loop.
The program should run on idle if the timer has not reached the arrival time.
If a process has arrived (arrival <= timer), the program will loop as many times as is required by the completion time.
If multiple processes have arrived, I should be working through the shortest job first.
My issue right now is that all of my peeks and removes are contained within my while not process_queue.is_empty():
yet I still end up with an error because after the queue becomes empty, I try to peek into an empty queue.
Confused as to why, and how, I am able to peek into an empty queue given the condition of my while loop.
```python
def schedule_SJF(processes):
    """
    -------------------------------------------------------
    Description:
        Creates a schedule and an elapsed time to complete all
        processes using SJF implementation.
        (if >1 process has arrived, the one with the shortest
        completion time takes priority.)
    Use: schedule_SJF(processes)
    -------------------------------------------------------
    Parameters:
        processes - a list of Process objects pulled from a file (list)
                    (from: processes = read_processes(filename))
    Returns:
        None
    -------------------------------------------------------
    """
    timer = 0
    buffer = []
    length = len(processes)
    process_queue = pQueue(length, 'L')
    # initialize counter (timer), a temporary list to store all arrived
    # processes (buffer), a variable for length, and a pQueue in 'L' mode
    print('Scheduling processes1.txt')
    # starting statement
    for i in processes:
        process_queue.insert(i)
    # input all data into a queue for easy use
    while not process_queue.is_empty():
        # loop until queue is empty
        if timer == 0:
            print('[Timer: 0]: Starting SJF Scheduler')
            timer += 1
            # for program start up
        elif not process_queue.is_empty() and process_queue.peek().arrival <= timer:
            # we now know >= 1 process has arrived
            while not process_queue.peek().arrival > timer:
                buffer.append(process_queue.peek())
                process_queue.remove()
            # create a list containing all arrived processes; break when
            # arrival time becomes greater than the timer
            while not len(buffer) == 0:
                shortest = buffer[0]
                # loop until buffer is empty & initialize value for shortest
                for i in range(len(buffer)):
                    if buffer[i].time < shortest.time:
                        shortest = buffer[i]
                buffer.remove(shortest)
                # compare all values within the buffer, isolate the value
                # with the shortest time, remove that value
                print('Fetching Process: {}'.format(shortest))
                for _ in range(shortest.time):
                    print('[Timer:{}]: {}'.format(timer, shortest.PID))
                    timer += 1
                # loop for as many times as is necessary to 'complete' the task
        else:
            print('[Timer:{}]: {}'.format(timer, 'idle'))
            timer += 1
            # if not timer >= arrival, program should continue looping (on 'idle')
    return
```
As to why you're getting an error peeking into an empty queue, let's look at your code:
```python
elif not process_queue.is_empty() and process_queue.peek().arrival <= timer:
    # we now know >= 1 process has arrived
    while not process_queue.peek().arrival > timer:
        buffer.append(process_queue.peek())
        process_queue.remove()
```
Your problem is in that last while condition. It doesn't check to see if the queue is empty. The queue wasn't empty when you passed the elif, but the code inside the while removes an item from the queue. Imagine what happens if the queue contains a single item whose arrival time is less than timer. Execution enters the loop body, which removes that one item from the queue. Then it goes back to the conditional and tries to peek. But there's nothing in the queue.
The solution is to check for an empty queue before peeking.
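A minimal sketch of that guard, with a tiny deque-based stand-in for the pQueue API (is_empty/peek/remove are the only methods assumed, and the items are bare arrival times rather than Process objects): the loop can now empty the queue without ever peeking into an empty one.

```python
from collections import deque

# Tiny stand-in for the asker's pQueue, just enough to show the guard.
class Q:
    def __init__(self, items):
        self._d = deque(items)
    def is_empty(self):
        return not self._d
    def peek(self):
        return self._d[0]
    def remove(self):
        return self._d.popleft()

timer = 5
process_queue = Q([1, 3])   # both arrivals <= timer, so the loop drains it
buffer = []
# Check is_empty() BEFORE peeking, so draining the queue is safe:
# `and` short-circuits, and peek() is never reached on an empty queue.
while not process_queue.is_empty() and process_queue.peek() <= timer:
    buffer.append(process_queue.remove())
print(buffer, process_queue.is_empty())
```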
Your approach to this problem is a little bit confusing. What you're doing is, essentially:

```
Add processes to a queue
while the queue isn't empty:
    Remove all the processes from the queue that are earlier than the current time, and put them into a buffer
    while the buffer isn't empty:
        find the process with the shortest completion time, and process it
```

That looks like a lot of unnecessary complication.
Seems like it would be easier to do this:

```
Sort processes by arrival time
Add sorted processes to queue
while queue is not empty:
    Extract the next job, and process it
```

Or, even better:

```
Add processes to a priority queue (see heapq)
while priority queue is not empty:
    Extract the next job, and process it
```
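A sketch of that heapq version, with a namedtuple standing in for the asker's Process objects (the arrival, time, and PID fields are assumed from the question; everything else is illustrative):

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for the asker's Process objects.
Process = namedtuple("Process", "arrival time PID")

def schedule_sjf(processes):
    """Shortest-job-first: among arrived jobs, always run the one
    with the smallest completion time."""
    pending = sorted(processes, key=lambda p: p.arrival)
    ready = []   # heap of (time, PID, process) tuples for arrived jobs
    order = []
    timer = 0
    i = 0
    while i < len(pending) or ready:
        # Move every process that has arrived into the ready heap.
        while i < len(pending) and pending[i].arrival <= timer:
            p = pending[i]
            heapq.heappush(ready, (p.time, p.PID, p))
            i += 1
        if not ready:
            timer += 1  # idle until the next arrival
            continue
        _, _, p = heapq.heappop(ready)  # shortest job first
        order.append(p.PID)
        timer += p.time  # run it to completion
    return order

procs = [Process(0, 5, "A"), Process(1, 2, "B"), Process(2, 1, "C")]
print(schedule_sjf(procs))  # A runs first; C beats B once both have arrived
```

Keying the heap on completion time means peek/pop always return the shortest arrived job, which replaces the buffer-scanning inner loop entirely.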

How to set timeout for a block of code which is not a function python3

After spending many hours looking for a solution on Stack Overflow, I did not find a good way to set a timeout for a block of code. There are approaches for setting a timeout on a function. Nevertheless, I would like to know how to set a timeout without having a function. Let's take the following code as an example:
```python
print("Doing different things")
for i in range(0, 10):
    # Doing some heavy stuff
    pass
print("Done. Continue with the following code")
```
So, how would you break the for loop if it has not finished after x seconds? Just continue with the code (maybe setting some bool variable to record that the timeout was reached), despite the fact that the for loop did not finish properly.
I think implementing this efficiently without using functions is not possible. Look at this code:
```python
import datetime as dt

print("Doing different things")

# store the timeout and the start time
time_out_after = dt.timedelta(seconds=60)
start_time = dt.datetime.now()

for i in range(10):
    if dt.datetime.now() > start_time + time_out_after:
        break
    # Doing some heavy stuff

print("Done. Continue with the following code")
```
The problem: the timeout is only checked at the beginning of every loop cycle, so it may take longer than the specified timeout period to break out of the loop, or in the worst case it may never interrupt the loop at all, because it cannot interrupt code that never finishes an iteration.
Update:
As the OP replied that he wants a more efficient way, this is a proper way to do it, but using functions.
```python
import asyncio

async def test_func():
    print('doing thing here , it will take long time')
    await asyncio.sleep(3600)  # emulate a heavy task with an actual sleep of one hour
    return 'yay!'  # this will not be executed, as the timeout will occur first

async def main():
    # Wait for at most 1 second
    try:
        result = await asyncio.wait_for(test_func(), timeout=1.0)  # call your function with a specific timeout
        # do something with the result
    except asyncio.TimeoutError:
        # when the timeout happens, the program breaks out of the test
        # function and executes the code here
        print('timeout!')
    print('lets continue to do other things')

asyncio.run(main())
```
Expected output:

```
doing thing here , it will take long time
timeout!
lets continue to do other things
```
Note: the timeout will now happen after exactly the time you specify; in this example code, after one second.
You would replace this line:

```python
await asyncio.sleep(3600)
```

with your actual task code.
Try it and let me know what you think. Thank you.
read asyncio docs:
link
Update 24/2/2019:
As the OP noted, asyncio.run was introduced in Python 3.7, so he asked for an alternative for Python 3.6.
An asyncio.run alternative for Python older than 3.7:
Replace

```python
asyncio.run(main())
```

with this code for older versions (I think 3.4 to 3.6):

```python
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
```
You may try the following way:

```python
import time

start = time.time()
for val in range(10):
    # some heavy stuff
    time.sleep(.5)
    if time.time() - start > 3:  # 3 is the timeout in seconds
        print('loop stopped at', val)
        break  # stop the loop, or sys.exit() to stop the script
else:
    print('successfully completed')
```
I guess it is a fairly viable approach. The actual timeout is greater than 3 seconds and depends on the execution time of a single step.
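If wrapping the block in a function turns out to be acceptable after all, another stdlib option worth sketching is concurrent.futures, which gives the caller a hard deadline on .result() (caveat: on timeout the worker thread is not killed, it just keeps running in the background; heavy_block here is a made-up stand-in for the loop body):

```python
import concurrent.futures

def heavy_block():
    # Stand-in for the "heavy stuff" loop from the question.
    total = 0
    for i in range(10):
        total += i * i
    return total

timed_out = False
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
    future = ex.submit(heavy_block)
    try:
        result = future.result(timeout=3.0)  # wait at most 3 seconds
    except concurrent.futures.TimeoutError:
        timed_out = True  # the thread itself keeps running regardless
        result = None

print(timed_out, result)
```

Unlike the check-inside-the-loop approach, the deadline here is enforced by the waiting side, so it fires on time even if a single iteration stalls; the price is that the stalled work cannot actually be cancelled.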

How to reuse a multiprocessing pool?

At the bottom is the code I have now. It seems to work fine. However, I don't completely understand it. I thought that without .join(), I'd risk the code going on to the next for-loop before the pool finishes executing. Wouldn't we need those 3 commented-out lines?
On the other hand, if I were to go with the .close() and .join() way, is there any way to 'reopen' that closed pool instead of creating a new one with Pool(6) every time?
```python
import multiprocessing as mp
import random as rdm
from statistics import stdev, mean
import time

def mesh_subset(population, n_chosen=5):
    chosen = rdm.choices(population, k=n_chosen)
    return mean(chosen)

if __name__ == '__main__':
    population = [x for x in range(20)]
    N_iteration = 10
    start_time = time.time()
    pool = mp.Pool(6)
    for i in range(N_iteration):
        print([round(x, 2) for x in population])
        print(stdev(population))
        # pool = mp.Pool(6)
        population = pool.map(mesh_subset, [population]*len(population))
        # pool.close()
        # pool.join()
    print('run time:', time.time() - start_time)
```
A pool of workers is a relatively costly thing to set up, so it should be done (if possible) only once, usually at the beginning of the script.
The pool.map command blocks until all the tasks are completed. After all, it returns a list of the results. It couldn't do that unless mesh_subset has been called on all the inputs and has returned a result for each. In contrast, methods like pool.apply_async do not block. apply_async returns an ApplyResult object with a get method which blocks until it obtains a result from a worker process.
pool.close sets the worker handler's state to CLOSE. This causes the handler to signal the workers to terminate.
The pool.join blocks until all the worker processes have been terminated.
So you don't need to call -- in fact you shouldn't call -- pool.close and pool.join until you are finished with the pool. Once the workers have been sent the signal to terminate (by pool.close), there is no way to "reopen" them. You would need to start a new pool instead.
In your situation, since you do want the loop to wait until all the tasks are completed, there would be no advantage to using pool.apply_async instead of pool.map. But if you were to use pool.apply_async, you could obtain the same result as before by calling get instead of resorting to closing and restarting the pool:
```python
# you could do this, but using pool.map is simpler
for i in range(N_iteration):
    apply_results = [pool.apply_async(mesh_subset, [population])
                     for i in range(len(population))]
    # the call to result.get() blocks until its worker process (running
    # mesh_subset) returns a value
    population = [result.get() for result in apply_results]
```
When the loops complete, len(population) is unchanged.
If you did NOT want each loop to block until all the tasks are completed, you could use apply_async's callback feature:
```python
N_pop = len(population)
result = []
for i in range(N_iteration):
    for i in range(N_pop):
        pool.apply_async(mesh_subset, [population],
                         callback=result.append)
pool.close()
pool.join()
print(result)
```
Now, when any mesh_subset returns a return_value, result.append(return_value) is called. The calls to apply_async do not block, so N_iteration * N_pop tasks are pushed into the pool's task queue all at once. But since the pool has 6 workers, at most 6 calls to mesh_subset are running at any given time. As the workers complete the tasks, whichever worker finishes first calls result.append(return_value). So the values in result are unordered. This is different from pool.map, which returns a list whose return values are in the same order as its corresponding list of arguments.
Barring an exception, result will eventually contain N_iteration * N_pop return values once all the tasks complete. Above, pool.close() and pool.join() were used to wait for all the tasks to complete.

Is it possible to resume a generator function after python program exit and the program restarts?

I am wondering if there is a way for a generator function/iterator in Python to pause after a keyboard interrupt, and, whenever the program restarts, to resume from where it left off. Please be clear and simple when explaining a solution.
After a bit of reading on generators and 'yield', I've realized that generators only output a value, discard it, output another value, and so forth...
I was trying to find a way to resume output for the following function after Python quits:
```python
counter = 0

def product(*args, repeat=1):
    global counter
    pools = [tuple(pool) for pool in args] * repeat
    # yield pools
    result = [[]]
    for pool in pools:
        result = [x + [y] for x in result for y in pool]
    for prod in result:
        counter = counter + 1
        if counter > 11:
            yield tuple(prod)

def product_function():
    for i in product('abc', repeat=3):
        print(i)
    print(counter)

product_function()
```
I finally decided to put in a little variable called counter; once the counter is greater than the 11th word, all later values (words) are yielded and printed. I suppose I could write some code to store the counter variable in a separate file whenever the program quits, and whenever the program restarts it would pull the last counter value from the file so that output resumes. Hope this works.
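A sketch of that store-the-counter idea, using itertools.product and a state file whose path is made up for this example. The counter is written just before each yield, so after a restart the generator fast-forwards past everything already emitted (a crash between the write and the consumer actually using the item can lose at most one item):

```python
import itertools, os, tempfile

# Hypothetical state file used to survive a restart.
STATE_FILE = os.path.join(tempfile.gettempdir(), "resume_counter.txt")

def load_counter():
    try:
        with open(STATE_FILE) as f:
            return int(f.read())
    except (FileNotFoundError, ValueError):
        return 0  # no saved state yet: start from the beginning

def save_counter(n):
    with open(STATE_FILE, "w") as f:
        f.write(str(n))

def resumable_product(*args, repeat=1):
    """Yield products, skipping the ones already emitted on a previous run."""
    skip = load_counter()
    for n, prod in enumerate(itertools.product(*args, repeat=repeat)):
        if n < skip:
            continue  # already produced before the restart
        save_counter(n + 1)  # record progress before handing the item out
        yield prod

# First "run": consume 5 items, then pretend the program was killed.
save_counter(0)
first = list(itertools.islice(resumable_product("abc", repeat=3), 5))
# Second "run": a fresh generator picks up where the first left off.
second = next(resumable_product("abc", repeat=3))
print(first[-1], second)
```

Persisting only the count works here because itertools.product is deterministic; a generator whose output depends on external state would need the state itself saved, not just an index.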
