In Python 3, is it possible to use a subclass of Thread in the context of a concurrent.futures.ThreadPoolExecutor, so that they can be individually initialized before processing (presumably many) work items?
I'd like to use the convenient concurrent.futures API for a piece of code that syncs up files and S3 objects (each work item is one file to sync if the corresponding S3 object is missing or out of sync). I would like each worker thread to do some initialization first, such as setting up a boto3.session.Session. That pool of worker threads would then be ready to process potentially thousands of work items (files to sync).
BTW, if a thread dies for some reason, is it reasonable to expect a new thread to be automatically created and added back to the pool?
(Disclaimer: I am much more familiar with Java's multithreading framework than Python's one).
So, it seems that a simple solution to my problem is to use threading.local to store a per-thread "session" (in the mockup below, just a random int). Perhaps not the cleanest I guess but for now it will do. Here is a mockup (Python 3.5.1):
import time
import threading
import concurrent.futures
import random
import logging
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-0s) %(relativeCreated)d - %(message)s')
x = [0.1, 0.1, 0.2, 0.4, 1.0, 0.1, 0.0]
mydata = threading.local()
def do_work(secs):
    if 'session' in mydata.__dict__:
        logging.debug('re-using session "{}"'.format(mydata.session))
    else:
        mydata.session = random.randint(0,1000)
        logging.debug('created new session: "{}"'.format(mydata.session))
    time.sleep(secs)
    logging.debug('slept for {} seconds'.format(secs))
    return secs
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    y = executor.map(do_work, x)
print(list(y))
Produces the following output, showing that "sessions" are indeed local to each thread and reused:
(Thread-1) 29 - created new session: "855"
(Thread-2) 29 - created new session: "58"
(Thread-3) 30 - created new session: "210"
(Thread-1) 129 - slept for 0.1 seconds
(Thread-1) 130 - re-using session "855"
(Thread-2) 130 - slept for 0.1 seconds
(Thread-2) 130 - re-using session "58"
(Thread-3) 230 - slept for 0.2 seconds
(Thread-3) 230 - re-using session "210"
(Thread-3) 331 - slept for 0.1 seconds
(Thread-3) 331 - re-using session "210"
(Thread-3) 331 - slept for 0.0 seconds
(Thread-1) 530 - slept for 0.4 seconds
(Thread-2) 1131 - slept for 1.0 seconds
[0.1, 0.1, 0.2, 0.4, 1.0, 0.1, 0.0]
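If the code can target Python 3.7 or later, ThreadPoolExecutor also accepts initializer and initargs arguments, which run a callable once in each worker thread before it handles any work item. Here is a minimal variant of the mockup under that assumption. The names init_worker and worker_state are only illustrative, and the random integer again stands in for a real boto3.session.Session:

```python
import concurrent.futures
import random
import threading

# Per-thread storage, as in the mockup above.
worker_state = threading.local()

def init_worker():
    # Runs once in each worker thread, before that thread takes any work item.
    # The random int is a stand-in for an expensive per-thread session object.
    worker_state.session = random.randint(0, 1000)

def do_work(item):
    # Every call can rely on the session created by init_worker().
    return (threading.current_thread().name, worker_state.session, item)

with concurrent.futures.ThreadPoolExecutor(
        max_workers=3, initializer=init_worker) as executor:
    results = list(executor.map(do_work, range(7)))

for thread_name, session, item in results:
    print(thread_name, session, item)
```

This keeps the "create or re-use" branch out of do_work, at the cost of requiring Python 3.7+.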
Minor note about logging: in order to use this in an IPython notebook, the logging setup needs to be slightly modified (since IPython has already set up a root logger). A more robust logging setup would be:
IN_IPYNB = 'get_ipython' in vars()
if IN_IPYNB:
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    for h in logger.handlers:
        h.setFormatter(logging.Formatter(
            '(%(threadName)-0s) %(relativeCreated)d - %(message)s'))
else:
    logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-0s) %(relativeCreated)d - %(message)s')
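On the question of a thread dying: with concurrent.futures, an exception raised by a work item is captured and stored on its Future, so the worker thread itself keeps running and picks up the next item. A small sketch of that behaviour (flaky is a made-up function that fails for one input):

```python
import concurrent.futures

def flaky(n):
    # Hypothetical work item that fails for one particular input.
    if n == 2:
        raise ValueError("cannot process item 2")
    return n * 10

outcomes = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(flaky, n) for n in range(5)]
    for future in futures:
        # exception() waits for completion and returns None on success.
        error = future.exception()
        outcomes.append(error if error is not None else future.result())

print(outcomes)
```

Note that executor.map re-raises the first exception when its result is iterated, so submit() is the more convenient call when individual failures have to be inspected.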
I have measured the performance of a parallel read_pickle() execution on a Linux machine with 12 cores and Python 3.6 interpreter (code launched in JupyterLab). I simply open many pickled dataframes:
import pandas as pd
def my_read(filename):
    df = pd.read_pickle(path + filename)
    print(filename, df.shape)
    return df.iloc[:1, :]
files = ... # array of file names of about 130 pickled 1 000 000 x 43 dataframes
Since this is an IO-bound operation rather than a CPU-bound one, I would expect the threaded solution to win over the process-based one.
However, this cell:
%%time
from multiprocessing import Pool
with Pool(10) as pool:
    pool.map(my_read, files)
gave
CPU times: user 416 ms, sys: 267 ms, total: 683 ms
Wall time: 3min 37s
while this one:
from multiprocessing.pool import ThreadPool
with ThreadPool(10) as tpool:
    tpool.map(my_read, files)
ran in
user 7min 28s, sys: 1min 58s, total: 9min 27s
Wall time: 10min 25s
Why?
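One way to check the assumption that my_read is IO-bound is to compare CPU time with wall-clock time for a single call. A ratio near 1 suggests the call is mostly computing (and in CPython, pure-Python computation holds the GIL, so threads cannot run it in parallel), while a ratio near 0 suggests it is mostly waiting. The sketch below is only a hypothetical diagnostic, not a measurement of read_pickle itself; cpu_fraction, busy and waiting are made-up names:

```python
import time

def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used

def busy():
    # Stand-in for CPU-bound work, e.g. deserializing a large object.
    return sum(i * i for i in range(2_000_000))

def waiting():
    # Stand-in for IO-bound work: the thread waits instead of computing.
    time.sleep(0.2)

print("busy:   ", round(cpu_fraction(busy), 2))
print("waiting:", round(cpu_fraction(waiting), 2))
```

Calling cpu_fraction(my_read, files[0]) in the notebook would show which side of the line the real workload falls on.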
I usually use the shell command time. My goal is to measure how much time and memory the code uses for small, medium, large and very large data sets.
Are there any tools for Linux, or for Python itself, to do this?
Have a look at timeit, the python profiler and pycallgraph. Also make sure to have a look at the comment below by nikicc mentioning "SnakeViz". It gives you yet another visualisation of profiling data which can be helpful.
timeit
def test():
    """Stupid test function"""
    lst = []
    for i in range(100):
        lst.append(i)
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test"))
    # For Python>=3.5 one can also write:
    print(timeit.timeit("test()", globals=locals()))
Essentially, you pass it Python code as a string parameter, and it runs that code the specified number of times and returns the total execution time. The important bits from the docs:
timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)
Create a Timer instance with the given statement, setup
code and timer function and run its timeit method with
number executions. The optional globals argument specifies a namespace in which to execute the code.
... and:
Timer.timeit(number=1000000)
Time number executions of the main statement. This executes the setup
statement once, and then returns the time it takes to execute the main
statement a number of times, measured in seconds as a float.
The argument is the number of times through the loop, defaulting to one
million. The main statement, the setup statement and the timer function
to be used are passed to the constructor.
Note:
By default, timeit temporarily turns off garbage collection during the timing. The advantage of this approach is that
it makes independent timings more comparable. The disadvantage is
that GC may be an important component of the performance of the
function being measured. If so, GC can be re-enabled as the first
statement in the setup string. For example:
timeit.Timer('for i in range(10): oct(i)', 'gc.enable()').timeit()
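To see whether garbage collection matters for a given function, the same statement can be timed with GC left off (the default) and with GC re-enabled in the setup string. This is just a sketch: build_list is a made-up function that allocates many small objects, and the setup imports gc explicitly because a custom globals namespace is passed in:

```python
import timeit

def build_list():
    # Allocates many small container objects, which the GC has to track.
    return [[i] for i in range(1000)]

namespace = {"build_list": build_list}
number = 1000

# Default behaviour: garbage collection is disabled while timing.
gc_off = timeit.timeit("build_list()", globals=namespace, number=number)

# GC re-enabled as the first statement of the setup string.
gc_on = timeit.timeit("build_list()", setup="import gc; gc.enable()",
                      globals=namespace, number=number)

print("GC off: {:.4f} s, GC on: {:.4f} s".format(gc_off, gc_on))
```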
Profiling
Profiling will give you a much more detailed idea about what's going on. Here's the "instant example" from the official docs:
import cProfile
import re
cProfile.run('re.compile("foo|bar")')
Which will give you:
         197 function calls (192 primitive calls) in 0.002 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 re.py:212(compile)
        1    0.000    0.000    0.001    0.001 re.py:268(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
        4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
      3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)
Both of these modules should give you an idea about where to look for bottlenecks.
Also, to get to grips with the output of profile, have a look at this post
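The default ordering by "standard name" is rarely what you want when hunting bottlenecks. As a rough sketch, the standard pstats module can sort the same data by cumulative time and limit the report to the top few entries (the limit of 5 is an arbitrary choice):

```python
import cProfile
import io
import pstats
import re

profiler = cProfile.Profile()
profiler.enable()
re.compile("foo|bar")
profiler.disable()

# Collect the report in a string buffer instead of printing directly.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time

report = buffer.getvalue()
print(report)
```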
pycallgraph
NOTE pycallgraph has been officially abandoned since Feb. 2018. As of Dec. 2020 it was still working on Python 3.6, though. As long as there are no core changes in how Python exposes the profiling API, it should remain a helpful tool.
This module uses graphviz to create callgraphs like the following:
You can easily see which paths used up the most time by colour. You can either create them using the pycallgraph API, or using a packaged script:
pycallgraph graphviz -- ./mypythonscript.py
The overhead is quite considerable though. So for already long-running processes, creating the graph can take some time.
I use a simple decorator to time the func
import time
def st_time(func):
    """
    st decorator to calculate the total time of a func
    """
    def st_func(*args, **keyArgs):
        t1 = time.time()
        r = func(*args, **keyArgs)
        t2 = time.time()
        print("Function=%s, Time=%s" % (func.__name__, t2 - t1))
        return r
    return st_func
The timeit module was slow and weird, so I wrote this:
def timereps(reps, func):
    from time import time
    start = time()
    for i in range(0, reps):
        func()
    end = time()
    return (end - start) / reps
Example:
import os
listdir_time = timereps(10000, lambda: os.listdir('/'))
print("python can do %d os.listdir('/') per second" % (1 / listdir_time))
For me, it says:
python can do 40925 os.listdir('/') per second
This is a primitive sort of benchmarking, but it's good enough.
I usually do a quick time ./script.py to see how long it takes. That does not show you the memory though, at least not as a default. You can use /usr/bin/time -v ./script.py to get a lot of information, including memory usage.
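If you want a memory figure from inside Python without extra packages, the standard tracemalloc module reports current and peak memory allocated by Python code. A minimal sketch, keeping in mind that it only counts allocations made through Python's allocator, so the numbers will differ from the process-level maximum resident set size that /usr/bin/time -v prints (my_func is just an example workload):

```python
import tracemalloc

def my_func():
    # Example workload: two large lists, one of which is freed again.
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 6)
    del b
    return a

tracemalloc.start()
result = my_func()
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()

print("current: {:.1f} MB, peak: {:.1f} MB".format(current / 1e6, peak / 1e6))
```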
Memory Profiler for all your memory needs.
https://pypi.python.org/pypi/memory_profiler
Run a pip install:
pip install memory_profiler
Import the library:
import memory_profiler
Add a decorator to the item you wish to profile:
@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
if __name__ == '__main__':
    my_func()
Execute the code:
python -m memory_profiler example.py
Receive the output:
Line #    Mem usage   Increment   Line Contents
==============================================
     3                            @profile
     4      5.97 MB     0.00 MB   def my_func():
     5     13.61 MB     7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB   152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB  -152.59 MB       del b
     8     13.61 MB     0.00 MB       return a
Examples are from the docs, linked above.
snakeviz interactive viewer for cProfile
https://github.com/jiffyclub/snakeviz/
cProfile was mentioned at https://stackoverflow.com/a/1593034/895245 and snakeviz was mentioned in a comment, but I wanted to highlight it further.
It is very hard to debug program performance just by looking at cprofile / pstats output, because they can only total times per function out of the box.
However, what we really need in general is to see a nested view containing the stack traces of each call to actually find the main bottlenecks easily.
And this is exactly what snakeviz provides via its default "icicle" view.
First you have to dump the cProfile data to a binary file, and then you can run snakeviz on it:
pip install -U snakeviz
python -m cProfile -o results.prof myscript.py
snakeviz results.prof
This prints a URL to stdout, which you can open in your browser; the page shows the desired output, which looks like this:
and you can then:
hover each box to see the full path to the file that contains the function
click on a box to make that box show up on the top as a way to zoom in
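If you only want to profile one function rather than a whole script, the profile file can also be written from inside Python and then opened with snakeviz in the same way. A small sketch, where work and the temporary results.prof path are just placeholders:

```python
import cProfile
import os
import pstats
import tempfile

def work():
    # Placeholder for the code you actually want to profile.
    return sorted(str(i) for i in range(10_000))

profile_path = os.path.join(tempfile.mkdtemp(), "results.prof")

profiler = cProfile.Profile()
profiler.runcall(work)             # profile a single call
profiler.dump_stats(profile_path)  # same binary format as "python -m cProfile -o"

# The file can be loaded back with pstats, or passed to snakeviz.
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(3)
print("profile written to", profile_path)
```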
More profile oriented question: How can you profile a Python script?
Have a look at nose and at one of its plugins, this one in particular.
Once installed, nose provides a script in your path (nosetests) that you can call in a directory containing some Python scripts:
$: nosetests
This will look in all the python files in the current directory and will execute any function that it recognizes as a test: for example, it recognizes any function with the word test_ in its name as a test.
So you can just create a python script called test_yourfunction.py and write something like this in it:
$: cat > test_yourfunction.py
def test_smallinput():
    yourfunction(smallinput)
def test_mediuminput():
    yourfunction(mediuminput)
def test_largeinput():
    yourfunction(largeinput)
Then you have to run
$: nosetests --with-profile --profile-stats-file yourstatsprofile.prof test_yourfunction.py
and to read the profile file, use this python line:
python -c "import hotshot.stats ; stats = hotshot.stats.load('yourstatsprofile.prof') ; stats.sort_stats('time', 'calls') ; stats.print_stats(200)"
Be careful: timeit can look very slow. It takes 12 seconds on my mid-range processor to run the example from the accepted answer, because timeit executes the statement 1,000,000 times by default (see the number parameter). You can test it with the accepted answer's code:
def test():
    lst = []
    for i in range(100):
        lst.append(i)
if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test")) # about 12 seconds for 1,000,000 runs
For simple things I use time instead; on my PC a single call returns a result of about 0.0:
import time
def test():
    lst = []
    for i in range(100):
        lst.append(i)
t1 = time.time()
test()
result = time.time() - t1
print(result) # 0.000000xxxx
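A single time.time() measurement of a function this fast is mostly below the clock's resolution. A middle ground is to keep timeit but pass a smaller number and divide, or let Timer.autorange() (Python 3.6+) pick a loop count for you. A small sketch using the same test function:

```python
import timeit

def test():
    lst = []
    for i in range(100):
        lst.append(i)

# Run the function 10,000 times instead of the default 1,000,000.
number = 10_000
total = timeit.timeit(test, number=number)
per_call = total / number
print("about {:.2e} seconds per call".format(per_call))

# autorange() increases the loop count until the run takes at least 0.2 s.
loops, elapsed = timeit.Timer(test).autorange()
print("{} loops took {:.3f} s".format(loops, elapsed))
```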
If you don't want to write boilerplate code for timeit and get easy to analyze results, take a look at benchmarkit. Also it saves history of previous runs, so it is easy to compare the same function over the course of development.
# pip install benchmarkit
from benchmarkit import benchmark, benchmark_run
N = 10000
seq_list = list(range(N))
seq_set = set(range(N))
SAVE_PATH = '/tmp/benchmark_time.jsonl'
@benchmark(num_iters=100, save_params=True)
def search_in_list(num_items=N):
    return num_items - 1 in seq_list
@benchmark(num_iters=100, save_params=True)
def search_in_set(num_items=N):
    return num_items - 1 in seq_set
benchmark_results = benchmark_run(
    [search_in_list, search_in_set],
    SAVE_PATH,
    comment='initial benchmark search',
)
Prints to terminal and returns list of dictionaries with data for the last run. Command line entrypoints also available.
If you change N=1000000 and rerun
The easy way to quickly test any function in IPython or a Jupyter notebook is to use the %timeit magic:
%timeit my_code
For instance:
%timeit a = 1
13.4 ns ± 0.781 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
line_profiler (execution time line by line)
Installation
pip install line_profiler
Usage
Add a @profile decorator before the function. For example:
@profile
def function(base, index, shift):
    addend = index << shift
    result = base + addend
    return result
Use command kernprof -l <file_name> to create an instance of line_profiler. For example:
kernprof -l test.py
kernprof will print Wrote profile results to <file_name>.lprof on success. For example:
Wrote profile results to test.py.lprof
Use command python -m line_profiler <file_name>.lprof to print benchmark results. For example:
python -m line_profiler test.py.lprof
You will see detailed info about each line of code:
Timer unit: 1e-06 s
Total time: 0.0021632 s
File: test.py
Function: function at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def function(base, index, shift):
     3      1000        796.4      0.8     36.8      addend = index << shift
     4      1000        745.9      0.7     34.5      result = base + addend
     5      1000        620.9      0.6     28.7      return result
memory_profiler (memory usage line by line)
Installation
pip install memory_profiler
Usage
Add a @profile decorator before the function. For example:
@profile
def function():
    result = []
    for i in range(10000):
        result.append(i)
    return result
Use command python -m memory_profiler <file_name> to print benchmark results. For example:
python -m memory_profiler test.py
You will see detailed info about each line of code:
Filename: test.py
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     1   40.246 MiB   40.246 MiB           1   @profile
     2                                         def function():
     3   40.246 MiB    0.000 MiB           1       result = []
     4   40.758 MiB    0.008 MiB       10001       for i in range(10000):
     5   40.758 MiB    0.504 MiB       10000           result.append(i)
     6   40.758 MiB    0.000 MiB           1       return result
Good Practice
Call a function many times to minimize environment impact.
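For timing, one way to follow this advice is timeit.repeat, which runs the whole measurement several times so that you can look at the minimum and the spread instead of a single noisy number. A sketch, where the workload and the choice of 5 repeats of 1000 calls are arbitrary:

```python
import statistics
import timeit

def workload():
    # Arbitrary example workload.
    return sorted(range(1000), reverse=True)

number = 1000
# Five independent measurements of 1000 calls each.
timings = timeit.repeat(workload, repeat=5, number=number)
per_call = [t / number for t in timings]

# The minimum is usually the least disturbed by other activity on the machine.
print("min:    {:.2e} s per call".format(min(per_call)))
print("median: {:.2e} s per call".format(statistics.median(per_call)))
```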
Based on Danyun Liu's answer with some convenience features, perhaps it is useful to someone.
def stopwatch(repeat=1, autorun=True):
    """
    stopwatch decorator to calculate the total time of a function
    """
    import timeit
    import functools
    def outer_func(func):
        @functools.wraps(func)
        def time_func(*args, **kwargs):
            t1 = timeit.default_timer()
            for _ in range(repeat):
                r = func(*args, **kwargs)
            t2 = timeit.default_timer()
            print(f"Function={func.__name__}, Time={t2 - t1}")
            return r
        if autorun:
            try:
                time_func()
            except TypeError:
                raise Exception(f"{time_func.__name__}: autorun only works with no parameters, you may want to use @stopwatch(autorun=False)") from None
        return time_func
    if callable(repeat):
        func = repeat
        repeat = 1
        return outer_func(func)
    return outer_func
Some tests:
def is_in_set(x):
    return x in {"linux", "darwin"}
def is_in_list(x):
    return x in ["linux", "darwin"]
@stopwatch
def run_once():
    import time
    time.sleep(0.5)
@stopwatch(autorun=False)
def run_manually():
    import time
    time.sleep(0.5)
run_manually()
@stopwatch(repeat=10000000)
def repeat_set():
    is_in_set("windows")
    is_in_set("darwin")
@stopwatch(repeat=10000000)
def repeat_list():
    is_in_list("windows")
    is_in_list("darwin")
@stopwatch
def should_fail(x):
    pass
Result:
Function=run_once, Time=0.5005391679987952
Function=run_manually, Time=0.500624185999186
Function=repeat_set, Time=1.7064883739985817
Function=repeat_list, Time=1.8905151920007484
Traceback (most recent call last):
(some more traceback here...)
Exception: should_fail: autorun only works with no parameters, you may want to use @stopwatch(autorun=False)
(I edited my original post in order to give a simpler example.)
I use differential evolution (DE) from SciPy to optimize certain parameters.
I would like to use all of the PC's processors for this task, so I tried the option "workers=-1".
The condition is that the function called by DE must be picklable.
If I run the example in https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html#scipy.optimize.differential_evolution, the optimisation works.
from scipy.optimize import rosen, differential_evolution
import pickle
import dill
bounds = [(0,2), (0, 2)]
result = differential_evolution(rosen, bounds, updating='deferred',workers=-1)
result.x, result.fun
(array([1., 1.]), 0.0)
But if I define a custom function 'Ros_custom', the optimisation crashes (doesn't give a result)
def Ros_custom(X):
    x = X[0]
    y = X[1]
    a = 1. - x
    b = y - x*x
    return a*a + b*b*100
result = differential_evolution(Ros_custom, bounds, updating='deferred',workers=-1)
If I try to pickle.dumps and pickle.loads 'Ros_custom' I get the same behaviour (optimisation crash, no answer).
If I use dill
Ros_pick_1=dill.dumps(Ros_custom)
Ros_pick_2=dill.loads(Ros_pick_1)
result = differential_evolution(Ros_pick_2, bounds, updating='deferred',workers=-1)
result.x, result.fun
I get the following error message
PicklingError: Can't pickle <function Ros_custom at 0x0000020247F04C10>: it's not the same object as __main__.Ros_custom
My questions are:
Why do I get the error? And is there a way to make 'Ros_custom' picklable in order to use all of the PC's processors in DE?
Thank you in advance for any advice.
Two things:
I'm not able to reproduce the error you are seeing unless I first pickle/unpickle the custom function.
There's no need to pickle/unpickle the custom function before passing it to the solver.
This seems to work for me. Python 3.6.12 and scipy 1.5.2:
>>> from scipy.optimize import rosen, differential_evolution
>>> bounds = [(0,2), (0, 2)]
>>>
>>> def Ros_custom(X):
...     x = X[0]
...     y = X[1]
...     a = 1. - x
...     b = y - x*x
...     return a*a + b*b*100
...
>>> result = differential_evolution(Ros_custom, bounds, updating='deferred',workers=-1)
>>> result.x, result.fun
(array([1., 1.]), 0.0)
>>>
>>> result
fun: 0.0
message: 'Optimization terminated successfully.'
nfev: 4953
nit: 164
success: True
x: array([1., 1.])
>>>
I can even nest a function inside of the custom objective:
>>> def foo(a,b):
...     return a*a + b*b*100
...
>>> def custom(X):
...     x,y = X[0],X[1]
...     return foo(1.-x, y-x*x)
...
>>> result = differential_evolution(custom, bounds, updating='deferred',workers=-1)
>>> result
fun: 0.0
message: 'Optimization terminated successfully.'
nfev: 4593
nit: 152
success: True
x: array([1., 1.])
So, for me, at least the code works as expected.
You should have no need to serialize/deserialize the function ahead of its use in scipy. Yes, the function needs to be picklable, but scipy will do that for you. Basically, what's happening under the covers is that your function will get serialized, passed to multiprocessing as a string, then distributed to the processors, then unpickled and used on the target processors.
Like this, for 4 sets of inputs, run one per processor:
>>> import multiprocessing as mp
>>> res = mp.Pool().map(custom, [(0,1), (1,2), (4,9), (3,4)])
>>> list(res)
[101.0, 100.0, 4909.0, 2504.0]
>>>
Older versions of multiprocessing had difficulty serializing functions defined in the interpreter, and often needed to have the code executed in a __main__ block. If you are on Windows, this is still often the case... and you might also need to call mp.freeze_support(), depending on how the code in scipy is implemented.
I tend to like dill (I'm the author) because it can serialize a broader range of objects than pickle. However, as scipy uses multiprocessing, which uses pickle... I often choose to use mystic (I'm the author), which uses multiprocess (I'm the author), which uses dill. Very roughly equivalent code, but it all works with dill instead of pickle.
>>> from mystic.solvers import diffev2
>>> from pathos.pools import ProcessPool
>>> diffev2(custom, bounds, npop=40, ftol=1e-10, map=ProcessPool().map)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 42
Function evaluations: 1720
array([1.00000394, 1.00000836])
With mystic, you get some additional nice features, like a monitor:
>>> from mystic.monitors import VerboseMonitor
>>> mon = VerboseMonitor(5,5)
>>> diffev2(custom, bounds, npop=40, ftol=1e-10, itermon=mon, map=ProcessPool().map)
Generation 0 has ChiSquare: 0.065448
Generation 0 has fit parameters:
[0.769543181527466, 0.5810893880113548]
Generation 5 has ChiSquare: 0.065448
Generation 5 has fit parameters:
[0.588156685059123, -0.08325052939774935]
Generation 10 has ChiSquare: 0.060129
Generation 10 has fit parameters:
[0.8387858177101133, 0.6850849855634057]
Generation 15 has ChiSquare: 0.001492
Generation 15 has fit parameters:
[1.0904350077743412, 1.2027007403275813]
Generation 20 has ChiSquare: 0.001469
Generation 20 has fit parameters:
[0.9716429877952866, 0.9466681129902448]
Generation 25 has ChiSquare: 0.000114
Generation 25 has fit parameters:
[0.9784047411865372, 0.9554056558210251]
Generation 30 has ChiSquare: 0.000000
Generation 30 has fit parameters:
[0.996105436348129, 0.9934091068974504]
Generation 35 has ChiSquare: 0.000000
Generation 35 has fit parameters:
[0.996589586891175, 0.9938925277204567]
Generation 40 has ChiSquare: 0.000000
Generation 40 has fit parameters:
[1.0003791956048833, 1.0007133195321427]
Generation 45 has ChiSquare: 0.000000
Generation 45 has fit parameters:
[1.0000170425596364, 1.0000396089375592]
Generation 50 has ChiSquare: 0.000000
Generation 50 has fit parameters:
[0.9999013984263114, 0.9998041148375927]
STOP("VTRChangeOverGeneration with {'ftol': 1e-10, 'gtol': 1e-06, 'generations': 30, 'target': 0.0}")
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 54
Function evaluations: 2200
array([0.99999186, 0.99998338])
>>>
All of the above are running in parallel.
So, in summary, the code should work as is (and without pre-pickling) -- maybe unless you are on Windows, where you might need to use freeze_support and run the code in the __main__ block.
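A quick way to check whether plain pickle can handle a given callable, before handing it to a solver that uses multiprocessing, is simply to try pickle.dumps on it. pickle stores functions by reference (module name plus function name), so a function importable from a module passes while a lambda does not. A small sketch, with is_picklable as a made-up helper and json.dumps standing in for "some function that lives in an importable module":

```python
import json
import pickle

def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True

# A function that can be looked up by name in an importable module.
print(is_picklable(json.dumps))

# An anonymous function has no importable name to refer to.
print(is_picklable(lambda x: x * x))
```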
Writing the function separately from the code worked for me.
Create rosen_custom.py with this code inside:
import numpy as np
def rosen(x):
    x = np.array(x)
    r = np.sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0,
               axis=0)
    return r
Then use it in DE:
from scipy.optimize import differential_evolution
from rosen_custom import rosen
import numpy as np
bounds = [(0,2), (0, 2), (0, 2), (0, 2), (0, 2)]
result = differential_evolution(rosen, bounds,
                                updating='deferred',workers=-1)
print(result.x, result.fun)
I need to add a constant time offset of 5.30 hours to the time field in my model, so that 5.30 hours are added to the current time in seconds dynamically every time.
start_time_in_seconds = models.IntegerField(blank=True,null=True)
I'm not entirely sure if I understood your question correctly, but assuming that you want to add 5.5 hours every time an instance of your model is created and saved, a possible solution would be overriding the save method:
from django.db import models
class MyModel(models.Model):
    start_time_in_seconds = models.IntegerField(blank=True, null=True)
    def save(self, *args, **kwargs):
        if self.pk:
            # save() is being called on an already existing model.
            # we only want to trigger this on initial creation,
            # therefore skip
            super(MyModel, self).save(*args, **kwargs)
            return
        # this is run on initial creation. apply your custom
        # time-modifying logic here. The field is nullable, so treat
        # None as 0, and keep the value an int for the IntegerField.
        self.start_time_in_seconds = (self.start_time_in_seconds or 0) + int(5.5 * 60 * 60)
        # and save
        super(MyModel, self).save(*args, **kwargs)
On a side note, you may want to consider using a DateTimeField if you're handling times, depending on your use case.
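For the arithmetic itself, here is a minimal sketch outside of Django, assuming that "5.30 hours" means 5 hours 30 minutes (the 5.5 hours used above) and that the field stores seconds since the epoch. OFFSET_SECONDS and start_time_with_offset are made-up names:

```python
import time
from datetime import timedelta

# Assumption: "5.30 hours" means 5 hours and 30 minutes.
OFFSET_SECONDS = int(timedelta(hours=5, minutes=30).total_seconds())

def start_time_with_offset(now_seconds=None):
    """Return the given (or current) time in seconds plus the fixed offset."""
    if now_seconds is None:
        now_seconds = int(time.time())
    return now_seconds + OFFSET_SECONDS

print(OFFSET_SECONDS)
print(start_time_with_offset())
```

Using timedelta keeps the offset readable and avoids the float that 5.5 * 60 * 60 produces.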
Why does .dt.days take 100 times longer than .dt.total_seconds()?
df = pd.DataFrame({'a': pd.date_range('2011-01-01 00:00:00', periods=1000000, freq='1H')})
df.a = df.a - pd.to_datetime('2011-01-01 00:00:00')
df.a.dt.days # 12 sec
df.a.dt.total_seconds() # 0.14 sec
.dt.total_seconds is basically just a multiplication, and can be performed at numpythonic speed:
def total_seconds(self):
    """
    Total duration of each element expressed in seconds.
    .. versionadded:: 0.17.0
    """
    return self._maybe_mask_results(1e-9 * self.asi8)
Whereas if we abort the days operation, we see it's spending its time in a slow listcomp with a getattr and a construction of Timedelta objects (source):
360 else:
361 result = np.array([getattr(Timedelta(val), m)
--> 362 for val in values], dtype='int64')
363 return result
364
To me this screams "look, let's get it correct, and we'll cross the optimization bridge when we come to it."
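In the meantime, a possible workaround is to derive the days from an operation that is already vectorized. The sketch below assumes a timedelta column without missing values and uses a much smaller frame than the one in the question; it checks that both alternatives agree with .dt.days, and the timings on your pandas version may of course differ from those above:

```python
import pandas as pd

# Hourly timedeltas, similar in shape to the column in the question.
deltas = pd.Series(pd.to_timedelta(list(range(1000)), unit="h"))

# Alternative 1: whole days from the (fast) total_seconds().
days_from_seconds = (deltas.dt.total_seconds() // 86400).astype("int64")

# Alternative 2: floor division by a one-day Timedelta.
days_from_division = deltas // pd.Timedelta(days=1)

print((days_from_seconds == deltas.dt.days).all())
print((days_from_division == deltas.dt.days).all())
```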