I have measured the performance of a parallel read_pickle() execution on a Linux machine with 12 cores and a Python 3.6 interpreter (the code was launched in JupyterLab). I simply open many pickled dataframes:
import pandas as pd

def my_read(filename):
    df = pd.read_pickle(path + filename)
    print(filename, df.shape)
    return df.iloc[:1, :]

files = ...  # array of file names of about 130 pickled 1 000 000 x 43 dataframes
Since this is an IO-bound operation rather than a CPU-bound one, I would expect the threaded solution to win over the process-based one.
However, this cell:
%%time
from multiprocessing import Pool

with Pool(10) as pool:
    pool.map(my_read, files)
gave
CPU times: user 416 ms, sys: 267 ms, total: 683 ms
Wall time: 3min 37s
while this one:
%%time
from multiprocessing.pool import ThreadPool

with ThreadPool(10) as tpool:
    tpool.map(my_read, files)
ran in
CPU times: user 7min 28s, sys: 1min 58s, total: 9min 27s
Wall time: 10min 25s
Why?
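For context, a sequential baseline over a small sample (a sketch that reuses the question's my_read, path and files placeholders) gives a per-file cost to compare both pools against:

import time

t0 = time.perf_counter()
for f in files[:10]:              # a small sample is enough for a baseline
    my_read(f)
print('sequential: %.1f s for 10 files' % (time.perf_counter() - t0))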
Do I understand the following correctly?
When num_workers >=1, the main process pre-loads prefetch_factor * num_workers batches. When the training loop consumes one batch, the corresponding worker loads the next batch in its queue.
If this is the case, let's go through an example.
NOTE: I have chosen the numeric values for illustration purposes and
have ignored various overheads in this example. The accompanying code example uses these numbers.
Say I have num_workers=4, prefetch_factor=4, batch_size=128. Further assume it takes 0.003125 s to fetch an item from a source database and the train step takes 0.05 s.
Now, each batch would take 0.003125 * 128 = 0.4 s to load.
With a prefetch_factor=4 and num_workers=4, first, 4*4=16 batches will be loaded.
Once the 16 batches are loaded, the first train step consumes 1 batch and takes 0.05 s. Say worker[0] provided this batch and will start the process to generate a new batch to replenish the queue. Recall fetching a new batch takes 0.4 s.
Similarly, the second step consumes one more batch and the corresponding worker (worker[1] in this example) starts the data fetching process.
The first 8 train steps would take 0.05*8=0.4s. By this time, 8 batches have been
consumed and worker[0] has produced 1 batch. In the next step, 1 batch is consumed and worker[1] produces a new batch. worker[1] had started the data fetching process in the second train step which would now be completed.
Following this, we can see that each subsequent train step will consume 1 batch and one of the workers will produce 1 batch, keeping the dataloader queue at 8 batches at all times. This means that the train step is never waiting for the data loading process, as there are always 8 batches in the buffer.
I would expect this behavior regardless of the data size of the batch, given num_workers and prefetch_factor are large enough. However, in the following code example that is not the case.
In the code below, I define a custom iterable that returns a numpy array. As the size of the numpy array increases, increasing num_workers or prefetch_factor does not improve the time taken per batch.
I'm guessing this is because each worker serializes the batch to send to the main process, where it is de-serialized. As the data size increases, this process takes more time. However, I would think that if the queue size is large enough (num_workers, prefetch_factor), at some point there should be a break-even point where each training step's consumption of a batch is accompanied by replenishment via one of the workers, as I illustrated in the example above.
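As a rough check on that hypothesis (a sketch, separate from the benchmark below; the array shapes and the batch size of 128 are the ones used in the question), timing a pickle round trip of one batch worth of small vs. large arrays shows how the inter-process transfer cost grows with item size:

import pickle
import time

import numpy as np

for shape in [(10, 150), (1000, 150)]:
    batch = [np.random.random(shape) for _ in range(128)]   # one batch of items
    t0 = time.perf_counter()
    pickle.loads(pickle.dumps(batch))                       # serialize + deserialize
    print(shape, 'round trip: %.4f s' % (time.perf_counter() - t0))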
In the code below, when MyIterable returns a small object (np array of size (10, 150)), increasing num_workers helps as expected. But when the returned object is larger (np array of size (1000, 150)), num_workers or prefetch_factor does not do much.
# small np object
avg time per batch for num workers=0: 0.47068126868714444
avg time per batch for num workers=2: 0.20982365206225495
avg time per batch for num workers=4: 0.10560789656221914
avg time per batch for num workers=6: 0.07202646931250456
avg time per batch for num workers=8: 0.05311137337469063
# large np object
avg time per batch for num workers=0: 0.6090951558124971
avg time per batch for num workers=2: 0.4594530961876444
avg time per batch for num workers=4: 0.45023533212543043
avg time per batch for num workers=6: 0.3830978863124983
avg time per batch for num workers=8: 0.3811495694375253
Am I missing something here? Why doesn't the data loader queue have enough buffer such that data loading is not the bottleneck?
Even if the serialization and de-serialization process takes longer in the latter case, I'd expect a buffer large enough that the consumption and replenishment rates of batches are almost equal. Otherwise, what is the point of having prefetch_factor?
If the code is behaving as expected, are there any other ways to pre-load the next n batches in a buffer such that it is large enough and never depleted?
Thanks
import time
import torch
import numpy as np
from time import sleep
from torch.utils.data import DataLoader, IterableDataset

def collate_fn(records):
    # some custom collation function
    return records

class MyIterable(object):
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            self.i += 1
            sleep(0.003125)  # simulates data fetch time
            # return np.random.random((10, 150))  # small data item
            return np.random.random((1000, 150))  # large data item
        else:
            raise StopIteration

class MyIterableDataset(IterableDataset):
    def __init__(self, n):
        super(MyIterableDataset).__init__()
        self.n = n

    def __iter__(self):
        return MyIterable(self.n)

def get_performance_metrics(num_workers):
    ds = MyIterableDataset(n=10000)
    if num_workers == 0:
        dl = torch.utils.data.DataLoader(ds, num_workers=0, batch_size=128, collate_fn=collate_fn)
    else:
        dl = torch.utils.data.DataLoader(ds, num_workers=num_workers, prefetch_factor=4, persistent_workers=True,
                                         batch_size=128, collate_fn=collate_fn,
                                         multiprocessing_context='spawn')
    warmup = 5
    times = []
    t0 = time.perf_counter()
    for i, batch in enumerate(dl):
        sleep(0.05)  # simulates train step
        e = time.perf_counter()
        if i >= warmup:
            times.append(e - t0)
        t0 = time.perf_counter()
        if i >= 20:
            break
    print(f'avg time per batch for num workers={num_workers}: {sum(times) / len(times)}')

if __name__ == '__main__':
    num_worker_options = [0, 2, 4, 6, 8]
    for n in num_worker_options:
        get_performance_metrics(n)
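One generic way to keep n batches pre-loaded, independent of the DataLoader internals discussed above, is to wrap the loader in a background thread that fills a bounded queue. A minimal sketch of that idea (plain threading, not a DataLoader feature):

import queue
import threading

def prefetch(iterable, buffer_size=16):
    """Yield items from `iterable` while a background thread keeps a bounded buffer filled."""
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for item in iterable:
            buf.put(item)          # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item

# usage: for batch in prefetch(dl, buffer_size=16): ...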
Usually I use the shell command time. My purpose is to test how much time and memory the code takes for small, medium, large and very large data sets.
Are there any tools for Linux, or just for Python, that do this?
Have a look at timeit, the python profiler and pycallgraph. Also make sure to have a look at the comment below by nikicc mentioning "SnakeViz". It gives you yet another visualisation of profiling data which can be helpful.
timeit
def test():
    """Stupid test function"""
    lst = []
    for i in range(100):
        lst.append(i)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test"))

    # For Python>=3.5 one can also write:
    print(timeit.timeit("test()", globals=locals()))
Essentially, you can pass it Python code as a string parameter, and it will run it the specified number of times and print the execution time. The important bits from the docs:
timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)
Create a Timer instance with the given statement, setup
code and timer function and run its timeit method with
number executions. The optional globals argument specifies a namespace in which to execute the code.
... and:
Timer.timeit(number=1000000)
Time number executions of the main statement. This executes the setup
statement once, and then returns the time it takes to execute the main
statement a number of times, measured in seconds as a float.
The argument is the number of times through the loop, defaulting to one
million. The main statement, the setup statement and the timer function
to be used are passed to the constructor.
Note:
By default, timeit temporarily turns off garbage collection during the timing. The advantage of this approach is that
it makes independent timings more comparable. The disadvantage is
that GC may be an important component of the performance of the
function being measured. If so, GC can be re-enabled as the first
statement in the setup string. For example:
timeit.Timer('for i in xrange(10): oct(i)', 'gc.enable()').timeit()
Profiling
Profiling will give you a much more detailed idea about what's going on. Here's the "instant example" from the official docs:
import cProfile
import re
cProfile.run('re.compile("foo|bar")')
Which will give you:
197 function calls (192 primitive calls) in 0.002 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.001 0.001 <string>:1(<module>)
1 0.000 0.000 0.001 0.001 re.py:212(compile)
1 0.000 0.000 0.001 0.001 re.py:268(_compile)
1 0.000 0.000 0.000 0.000 sre_compile.py:172(_compile_charset)
1 0.000 0.000 0.000 0.000 sre_compile.py:201(_optimize_charset)
4 0.000 0.000 0.000 0.000 sre_compile.py:25(_identityfunction)
3/1 0.000 0.000 0.000 0.000 sre_compile.py:33(_compile)
Both of these modules should give you an idea about where to look for bottlenecks.
Also, to get to grips with the output of profile, have a look at this post
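For example, the pstats module lets you sort and trim the same data programmatically (a short sketch; 'out.prof' is just a throwaway file name):

import cProfile
import pstats
import re

cProfile.run('re.compile("foo|bar")', 'out.prof')   # write the stats to a file
stats = pstats.Stats('out.prof')
stats.sort_stats('cumulative').print_stats(10)      # top 10 entries by cumulative time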
pycallgraph
NOTE: pycallgraph has been officially abandoned since Feb. 2018. As of Dec. 2020 it was still working on Python 3.6. As long as there are no core changes in how Python exposes the profiling API, it should remain a helpful tool.
This module uses Graphviz to create call graphs of a program run; the colouring makes it easy to see which paths used up the most time. You can either create the graphs using the pycallgraph API, or using a packaged script:
pycallgraph graphviz -- ./mypythonscript.py
The overhead is quite considerable though. So for already long-running processes, creating the graph can take some time.
I use a simple decorator to time a function:
import time

def st_time(func):
    """
    st decorator to calculate the total time of a func
    """

    def st_func(*args, **keyArgs):
        t1 = time.time()
        r = func(*args, **keyArgs)
        t2 = time.time()
        print("Function=%s, Time=%s" % (func.__name__, t2 - t1))
        return r

    return st_func
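Applying it looks like this (a small usage sketch; build_squares is just an example function):

@st_time
def build_squares(n):
    return [i * i for i in range(n)]

build_squares(1_000_000)
# prints: Function=build_squares, Time=<elapsed seconds>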
The timeit module was slow and weird, so I wrote this:
def timereps(reps, func):
    from time import time
    start = time()
    for i in range(0, reps):
        func()
    end = time()
    return (end - start) / reps
Example:
import os
listdir_time = timereps(10000, lambda: os.listdir('/'))
print "python can do %d os.listdir('/') per second" % (1 / listdir_time)
For me, it says:
python can do 40925 os.listdir('/') per second
This is a primitive sort of benchmarking, but it's good enough.
I usually do a quick time ./script.py to see how long it takes. That does not show you the memory though, at least not by default. You can use /usr/bin/time -v ./script.py to get a lot of information, including memory usage.
Memory Profiler for all your memory needs.
https://pypi.python.org/pypi/memory_profiler
Run a pip install:
pip install memory_profiler
Import the library:
import memory_profiler
Add a decorator to the item you wish to profile:
@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()
Execute the code:
python -m memory_profiler example.py
Receive the output:
Line #    Mem usage    Increment   Line Contents
================================================
     3                             @profile
     4      5.97 MB      0.00 MB   def my_func():
     5     13.61 MB      7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB    152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB   -152.59 MB       del b
     8     13.61 MB      0.00 MB       return a
Examples are from the docs, linked above.
snakeviz interactive viewer for cProfile
https://github.com/jiffyclub/snakeviz/
cProfile was mentioned at https://stackoverflow.com/a/1593034/895245 and snakeviz was mentioned in a comment, but I wanted to highlight it further.
It is very hard to debug program performance just by looking at cProfile / pstats output, because out of the box they only give total times per function.
However, what we really need in general is to see a nested view containing the stack traces of each call to actually find the main bottlenecks easily.
And this is exactly what snakeviz provides via its default "icicle" view.
First you have to dump the cProfile data to a binary file, and then you can run snakeviz on it:
pip install -U snakeviz
python -m cProfile -o results.prof myscript.py
snakeviz results.prof
This prints a URL to stdout which you can open in your browser; it shows an interactive icicle view of the profile, and you can then:
hover each box to see the full path to the file that contains the function
click on a box to make that box show up on the top as a way to zoom in
More profile oriented question: How can you profile a Python script?
Have a look at nose and at one of its plugins, this one in particular.
Once installed, nose is a script in your path that you can call in a directory which contains some Python scripts:
$: nosetests
This will look in all the python files in the current directory and will execute any function that it recognizes as a test: for example, it recognizes any function with the word test_ in its name as a test.
So you can just create a python script called test_yourfunction.py and write something like this in it:
$: cat > test_yourfunction.py
def test_smallinput():
    yourfunction(smallinput)

def test_mediuminput():
    yourfunction(mediuminput)

def test_largeinput():
    yourfunction(largeinput)
Then you have to run
$: nosetests --with-profile --profile-stats-file yourstatsprofile.prof test_yourfunction.py
and to read the profile file, use this python line:
python -c "import hotshot.stats ; stats = hotshot.stats.load('yourstatsprofile.prof') ; stats.sort_stats('time', 'calls') ; stats.print_stats(200)"
Be careful: timeit is very slow, it takes 12 seconds on my mid-range processor just to initialize (or maybe to run the function), mostly because it runs the statement one million times by default. You can test this with the accepted answer:
def test():
    lst = []
    for i in range(100):
        lst.append(i)

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("test()", setup="from __main__ import test"))  # 12 seconds
For simple things I use time instead; on my PC it returns roughly 0.0:
import time

def test():
    lst = []
    for i in range(100):
        lst.append(i)

t1 = time.time()
test()
result = time.time() - t1
print(result)  # 0.000000xxxx
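A small refinement of the same idea (a sketch): time.perf_counter() is a better clock for short measurements, since it is monotonic and has higher resolution than time.time():

import time

def test():
    lst = []
    for i in range(100):
        lst.append(i)

t1 = time.perf_counter()
test()
print(time.perf_counter() - t1)   # elapsed seconds with a high-resolution clock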
If you don't want to write boilerplate code for timeit and want easy-to-analyze results, take a look at benchmarkit. It also saves a history of previous runs, so it is easy to compare the same function over the course of development.
# pip install benchmarkit
from benchmarkit import benchmark, benchmark_run
N = 10000
seq_list = list(range(N))
seq_set = set(range(N))
SAVE_PATH = '/tmp/benchmark_time.jsonl'
@benchmark(num_iters=100, save_params=True)
def search_in_list(num_items=N):
    return num_items - 1 in seq_list

@benchmark(num_iters=100, save_params=True)
def search_in_set(num_items=N):
    return num_items - 1 in seq_set

benchmark_results = benchmark_run(
    [search_in_list, search_in_set],
    SAVE_PATH,
    comment='initial benchmark search',
)
It prints to the terminal and returns a list of dictionaries with the data for the last run. Command line entry points are also available.
If you change N=1000000 and rerun, the gap between the list lookup and the set lookup becomes much more pronounced.
The easy way to quickly time any code in IPython or Jupyter is the %timeit magic:
%timeit my_code
For instance:
%timeit a = 1
13.4 ns ± 0.781 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
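For multi-line snippets there is also the %%timeit cell magic, which times the whole cell body:

%%timeit
lst = []
for i in range(100):
    lst.append(i)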
line_profiler (execution time line by line)
Installation
pip install line_profiler
Usage
Add a @profile decorator before the function you want to profile. For example:
@profile
def function(base, index, shift):
    addend = index << shift
    result = base + addend
    return result
Use command kernprof -l <file_name> to create an instance of line_profiler. For example:
kernprof -l test.py
kernprof will print Wrote profile results to <file_name>.lprof on success. For example:
Wrote profile results to test.py.lprof
Use command python -m line_profiler <file_name>.lprof to print benchmark results. For example:
python -m line_profiler test.py.lprof
You will see detailed info about each line of code:
Timer unit: 1e-06 s
Total time: 0.0021632 s
File: test.py
Function: function at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def function(base, index, shift):
     3      1000        796.4      0.8     36.8      addend = index << shift
     4      1000        745.9      0.7     34.5      result = base + addend
     5      1000        620.9      0.6     28.7      return result
memory_profiler (memory usage line by line)
Installation
pip install memory_profiler
Usage
Add a @profile decorator before the function you want to profile. For example:
@profile
def function():
    result = []
    for i in range(10000):
        result.append(i)
    return result
Use command python -m memory_profiler <file_name> to print benchmark results. For example:
python -m memory_profiler test.py
You will see detailed info about each line of code:
Filename: test.py
Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
     1   40.246 MiB   40.246 MiB           1   @profile
     2                                         def function():
     3   40.246 MiB    0.000 MiB           1       result = []
     4   40.758 MiB    0.008 MiB       10001       for i in range(10000):
     5   40.758 MiB    0.504 MiB       10000           result.append(i)
     6   40.758 MiB    0.000 MiB           1       return result
Good Practice
Call a function many times to minimize environment impact.
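For example, a small driver that calls the profiled function many times (a sketch; the argument values are arbitrary, and the 1000 calls match the Hits column in the line_profiler output above; run it with kernprof -l as described earlier):

@profile
def function(base, index, shift):
    addend = index << shift
    result = base + addend
    return result

if __name__ == '__main__':
    for _ in range(1000):   # repeat so the per-line timings average out
        function(2, 3, 4)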
Based on Danyun Liu's answer, with some convenience features added. Perhaps it is useful to someone.
def stopwatch(repeat=1, autorun=True):
    """
    stopwatch decorator to calculate the total time of a function
    """
    import timeit
    import functools

    def outer_func(func):
        @functools.wraps(func)
        def time_func(*args, **kwargs):
            t1 = timeit.default_timer()
            for _ in range(repeat):
                r = func(*args, **kwargs)
            t2 = timeit.default_timer()
            print(f"Function={func.__name__}, Time={t2 - t1}")
            return r

        if autorun:
            try:
                time_func()
            except TypeError:
                raise Exception(f"{time_func.__name__}: autorun only works with no parameters, you may want to use @stopwatch(autorun=False)") from None

        return time_func

    if callable(repeat):
        func = repeat
        repeat = 1
        return outer_func(func)

    return outer_func
Some tests:
def is_in_set(x):
    return x in {"linux", "darwin"}

def is_in_list(x):
    return x in ["linux", "darwin"]

@stopwatch
def run_once():
    import time
    time.sleep(0.5)

@stopwatch(autorun=False)
def run_manually():
    import time
    time.sleep(0.5)

run_manually()

@stopwatch(repeat=10000000)
def repeat_set():
    is_in_set("windows")
    is_in_set("darwin")

@stopwatch(repeat=10000000)
def repeat_list():
    is_in_list("windows")
    is_in_list("darwin")

@stopwatch
def should_fail(x):
    pass
Result:
Function=run_once, Time=0.5005391679987952
Function=run_manually, Time=0.500624185999186
Function=repeat_set, Time=1.7064883739985817
Function=repeat_list, Time=1.8905151920007484
Traceback (most recent call last):
(some more traceback here...)
Exception: should_fail: autorun only works with no parameters, you may want to use @stopwatch(autorun=False)
I'm trying to set up a TensorFlow input pipeline for feeding images into an AlexNet for feature extraction (not for training, this is a one-off thing). Since AlexNet is rather small, it is crucial to provide input data at a high rate to achieve acceptable performance (~1000 images / second).
My images are 400x300 JPEGs, about 24 KB per image on average.
Unfortunately it seems that the TensorFlow input pipeline can't keep up with a GTX 1080 running the AlexNet.
My input pipeline is simple: load a file, decode the image, resize it, and batch the results.
I created a small benchmark to show the issue:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import time
import glob
import os

IMAGE_DIR = 'images'
EPOCHS = 1

def main():
    print('batch_size\tnum_threads\tms/image')
    for batch_size in [16, 32, 64, 128]:
        for num_threads in [1, 2, 4, 8]:
            run(batch_size, num_threads)

def run(batch_size, num_threads):
    filenames = glob.glob(os.path.join(IMAGE_DIR, '*.jpg'))

    (filename,) = tf.train.slice_input_producer(
        [filenames],
        capacity=2 * batch_size * num_threads,
        num_epochs=EPOCHS)

    raw = tf.read_file(filename)
    decoded = tf.image.decode_jpeg(raw, channels=3)
    resized = tf.image.resize_images(decoded, [227, 227])

    batch = tf.train.batch(
        [resized],
        batch_size,
        num_threads,
        2 * batch_size * num_threads,
        enqueue_many=True)

    init_op = tf.group(
        tf.global_variables_initializer(),
        tf.local_variables_initializer())

    with tf.Session() as sess:
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        t = time.time()
        try:
            while not coord.should_stop():
                sess.run(batch)
        except tf.errors.OutOfRangeError:
            pass
        finally:
            coord.request_stop()

        tpe = (time.time() - t) / (len(filenames) * EPOCHS) * 1000
        print('{: <11}\t{: <10}\t{: <8}'
              .format(batch_size, num_threads, tpe))

        coord.join(threads)

if __name__ == "__main__":
    main()
Running this on a MacBook Pro (early 2015, 2.9 GHz Intel Core i5) yields the following results:
batch_size num_threads ms/image
16 1 4.81571793556
16 2 3.00584602356
16 4 2.94281005859
16 8 2.94555711746
32 1 3.51123785973
32 2 1.82255005836
32 4 1.85884213448
32 8 1.88741898537
64 1 2.9537730217
64 2 1.58108997345
64 4 1.57125210762
64 8 1.57615303993
128 1 2.71797513962
128 2 1.67120599747
128 4 1.6521999836
128 8 1.6885869503
It shows overall bad performance, far from the target of 1 ms per image. Also, it does not scale beyond two threads, which in this case is to be expected, since it is only a dual-core processor.
Running the same benchmark on a 2.5 GHz AMD Opteron 6180 SE with 24 cores yields the following:
batch_size num_threads ms/image
16 1 13.983194828
16 2 6.80965399742
16 4 6.67097783089
16 8 6.63090395927
32 1 12.0395629406
32 2 5.72535085678
32 4 4.94155502319
32 8 4.99696803093
64 1 10.9073989391
64 2 4.96317911148
64 4 3.76832485199
64 8 3.82816386223
128 1 10.2617599964
128 2 5.20488095284
128 4 3.16122984886
128 8 3.51550602913
Here, too, single threaded / overall performance is very bad and it does not scale beyond 2/4 threads.
The systems are neither IO nor CPU bound in any of the cases. For both systems, loading and resizing the images with OpenCV gives far better numbers (~0.86 ms/image on the MacBook, which in this case is CPU bound, and up to ~0.22 ms/image on the server, which in this case is IO bound).
What's going on with Tensorflow here? How can I speed this up?
I already tried to assemble a batch of images manually and use enqueue_many for batching, but this made things even worse. I also tried adding a small sleep before running the loop, just to make sure the queues are filled, but no luck.
Any help is greatly appreciated.
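For reference, an OpenCV baseline along the lines mentioned above might look roughly like this (a sketch, not the exact code used for the numbers quoted; it assumes OpenCV (cv2) is installed and reuses the 'images' directory and 227x227 target size from the benchmark):

import glob
import os
import time

import cv2

files = glob.glob(os.path.join('images', '*.jpg'))
t0 = time.time()
for f in files:
    img = cv2.imread(f)                 # load + decode the JPEG
    img = cv2.resize(img, (227, 227))   # resize to the AlexNet input size
print('%.2f ms/image' % ((time.time() - t0) / len(files) * 1000))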
Why does .dt.days take 100 times longer than .dt.total_seconds()?
df = pd.DataFrame({'a': pd.date_range('2011-01-01 00:00:00', periods=1000000, freq='1H')})
df.a = df.a - pd.to_datetime('2011-01-01 00:00:00')
df.a.dt.days # 12 sec
df.a.dt.total_seconds() # 0.14 sec
.dt.total_seconds is basically just a multiplication, and can be performed at numpythonic speed:
def total_seconds(self):
    """
    Total duration of each element expressed in seconds.

    .. versionadded:: 0.17.0
    """
    return self._maybe_mask_results(1e-9 * self.asi8)
Whereas if we interrupt the days operation, we see it's spending its time in a slow list comprehension with a getattr and the construction of Timedelta objects (source):
360         else:
361             result = np.array([getattr(Timedelta(val), m)
--> 362                                for val in values], dtype='int64')
363         return result
364
To me this screams "look, let's get it correct, and we'll cross the optimization bridge when we come to it."
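A practical workaround suggested by that observation (a sketch, valid for non-negative timedeltas like the ones in the question): compute whole days yourself from the fast path instead of going through .dt.days:

# stays on the vectorized total_seconds() path shown above
days = (df.a.dt.total_seconds() // 86400).astype('int64')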
In Python 3, is it possible to use a subclass of Thread in the context of a concurrent.futures.ThreadPoolExecutor, so that they can be individually initialized before processing (presumably many) work items?
I'd like to use the convenient concurrent.futures API for a piece of code that syncs up files and S3 objects (each work item is one file to sync if the corresponding S3 object is nonexistent or out of sync). I would like each worker thread to do some initialization first, such as setting up a boto3.session.Session. Then that thread pool of workers would be ready to process potentially thousands of work items (files to sync).
BTW, if a thread dies for some reason, is it reasonable to expect a new thread to be automatically created and added back to the pool?
(Disclaimer: I am much more familiar with Java's multithreading framework than with Python's.)
So, it seems that a simple solution to my problem is to use threading.local to store a per-thread "session" (in the mockup below, just a random int). Perhaps not the cleanest I guess but for now it will do. Here is a mockup (Python 3.5.1):
import time
import threading
import concurrent.futures
import random
import logging

logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-0s) %(relativeCreated)d - %(message)s')

x = [0.1, 0.1, 0.2, 0.4, 1.0, 0.1, 0.0]

mydata = threading.local()

def do_work(secs):
    if 'session' in mydata.__dict__:
        logging.debug('re-using session "{}"'.format(mydata.session))
    else:
        mydata.session = random.randint(0, 1000)
        logging.debug('created new session: "{}"'.format(mydata.session))
    time.sleep(secs)
    logging.debug('slept for {} seconds'.format(secs))
    return secs

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    y = executor.map(do_work, x)

print(list(y))
Produces the following output, showing that "sessions" are indeed local to each thread and reused:
(Thread-1) 29 - created new session: "855"
(Thread-2) 29 - created new session: "58"
(Thread-3) 30 - created new session: "210"
(Thread-1) 129 - slept for 0.1 seconds
(Thread-1) 130 - re-using session "855"
(Thread-2) 130 - slept for 0.1 seconds
(Thread-2) 130 - re-using session "58"
(Thread-3) 230 - slept for 0.2 seconds
(Thread-3) 230 - re-using session "210"
(Thread-3) 331 - slept for 0.1 seconds
(Thread-3) 331 - re-using session "210"
(Thread-3) 331 - slept for 0.0 seconds
(Thread-1) 530 - slept for 0.4 seconds
(Thread-2) 1131 - slept for 1.0 seconds
[0.1, 0.1, 0.2, 0.4, 1.0, 0.1, 0.0]
Minor note about logging: in order to use this in an IPython notebook, the logging setup needs to be slightly modified (since IPython has already set up a root logger). A more robust logging setup would be:
IN_IPYNB = 'get_ipython' in vars()

if IN_IPYNB:
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    for h in logger.handlers:
        h.setFormatter(logging.Formatter(
            '(%(threadName)-0s) %(relativeCreated)d - %(message)s'))
else:
    logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-0s) %(relativeCreated)d - %(message)s')
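As an aside: on Python 3.7+ ThreadPoolExecutor also accepts an initializer/initargs pair, which combines naturally with threading.local for the per-thread setup described above. A minimal sketch (the random int again stands in for a real boto3 session):

import concurrent.futures
import random
import threading

mydata = threading.local()

def init_worker():
    # runs exactly once in each worker thread, before it handles any work item
    mydata.session = random.randint(0, 1000)

def do_work(secs):
    return threading.current_thread().name, mydata.session, secs

with concurrent.futures.ThreadPoolExecutor(max_workers=3, initializer=init_worker) as executor:
    print(list(executor.map(do_work, [0.1, 0.2, 0.3])))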