Is Python's time.monotonic() consistent across multiple threads? - python-3.x

Python's documentation for time.monotonic() states:
Return the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards. The clock is not affected by system clock updates. The reference point of the returned value is undefined, so that only the difference between the results of consecutive calls is valid.
New in version 3.3.
Changed in version 3.5: The function is now always available and always system-wide.
Does the last sentence imply that for Python >= 3.5 the values returned by time.monotonic() can be safely used and compared across multiple threads?

I came across the same problem and it seems NO , getting around 10 microseconds difference across threads.
I'm on Linux, Python 3.7.5, Kernel 5.3.0-40-generic , CPU AMD Ryzen 5 3600

Related

What timeframe does the "perf sched record" use?

I've been trying to analyze the output of perf sched record but I don't understand with what frame of reference do I try to understand the "20624.983302 secs". It isn't Unix time for sure, so what is it? How would I go about converting this into Unix time?
*A0 20624.983302 secs A0 => migration/0:12
*. 20624.983311 secs . => swapper:0
*B0 20624.983318 secs B0 => IPC I/O Child:33924
*. 20624.983355 secs
*C0 20624.983485 secs C0 => WRScene~lder#15:39974
*. 20624.983581 secs
*D0 20624.983972 secs D0 => IPC I/O Parent:33780
These timestamps are captured using the kernel scheduler clock, which counts in nanoseconds since boot. The exact details depend on the compile-time parameters chosen to build a particular Linux distribution and the target architecture.
In general, the timestamp of a sample is captured around the same time when it's recorded. Timestamps on the same core are guaranteed to be monotonically increasing as long the core remains in an active state. The samples you've shown were all captured on the same core and the core remained active from the first sample to the last sample. So the timestamps are guaranteed to be monotonic in this case irrespective of the platform and distribution. When profiling on multiple cores, there is no guarantee that the clocks on all cores are in sync.
All perf tools use the same clock to capture timestamps, but they may differ in the way timestamps are printed and it may happen that two tools print timestamps from the same sample file differently. This depends on the kernel version.
It's possible to specify a clock source when calling perf_event_open() by setting use_clockid to 1 and setting clockid to one of the clock sources defined in linux/time.h, such as CLOCK_MONOTONIC. perf record provides the -k or --clockid option to specify the clock source for capturing timestamps.
Modern distributions on x86 typically use TSC as the source for the scheduler clock (check /sys/devices/system/clocksource/clocksource0/current_clocksource). So if you're on an x86 processor, most probably the TSC of the profiled core was used to capture the current value of TSC cycles, which internally gets converted into nanoseconds. When a timestamp is printed, it may get converted to a different unit. In this case, timestamps are printed in the format "seconds.microseconds". A summary of the behavior of TSC on Intel processors can be found at: Can constant non-invariant tsc change frequency across cpu states?.

scikit learn unwanted parallel processing

I have a problem with nested multiprocessing witch starts when I use scikit-learn (v. 0.22) Quadratic Discriminant Analysis. Necessary is system configuration: 24 thread Xeon machine running fedora 30.
I run consecutively qda on the randomly selected subset of predictors:
def process(X,y,n_features,i=1):
comb = np.random.choice(range(X.shape[1]),n_features,replace=False)
qda = QDA(tol=1e-8)
qda.fit(X[:,comb],y)
y_pred = qda.predict(X[:,comb])
return (accuracy_score(y,y_pred),comb,i)
where n_features is number of features randomly selected from the full set of possible predictors, X,y explanatory and depended variables.
When n_features is 18 or less process works in single-thread mode, which means that I can use any tool to parallel processing (I use ray). When n_features is 19, and above for unknown reason it (not me) starts all available threads, and entire calculation get more time even in comparison to a single thread.
tmp = [process(X,y,n_features,i=1) for _ in range(1000)]
Based on my previous experiences with other Linux libraries (R gstat precisely) the same situation (uncontrolled multithreading mode) was caused by Linux implementation of blas, but here it could not be the case. In general, the question is: what starts this multithreading and how to control it even if LDA/QDA hasn't n_jobs parameter to avoid nested multiprocessing.
QDA in scikit-learn does not expose n_jobs meaning that you cannot set anything. However, it could be due to numpy which does not restrict the number of threads.
The solution to limit the number of threads are:
set the environment variable OMP_NUM_THREADS, MKL_NUM_THREADS, or OPENBLAS_NUM_THREADS to be sure that you will limit the number of threads;
you can use threadpoolctl which provides a context manager to set the number of threads.

Can we use `shuffle()` instruction for reg-to-reg data-exchange between items (threads) in WaveFront?

As we known, WaveFront (AMD OpenCL) is very similar to WARP (CUDA): http://research.cs.wisc.edu/multifacet/papers/isca14-channels.pdf
GPGPU languages, like OpenCLâ„¢ and CUDA, are called SIMT because they
map the programmer’s view of a thread to a SIMD lane. Threads
executing on the same SIMD unit in lockstep are called a wavefront
(warp in CUDA).
Also known, that AMD suggested us the (Reduce) addition of numbers using a local memory. And for accelerating of addition (Reduce) suggests using vector types: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/AMD_OpenCL_Tutorial_SAAHPC2010.pdf
But are there any optimized register-to-register data-exchage instructions between items (threads) in WaveFront:
such as int __shfl_down(int var, unsigned int delta, int width=warpSize); in WARP (CUDA): https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
or such as __m128i _mm_shuffle_epi8(__m128i a, __m128i b); SIMD-lanes on x86_64: https://software.intel.com/en-us/node/524215
This shuffle-instruction can, for example, execute Reduce (add up the numbers) of 8 elements from 8 threads/lanes, for 3 cycles without any synchronizations and without using any cache/local/shared-memory (which has ~3 cycles latency for each access).
I.e. threads sends its value directly to register of other threads: https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
Or in OpenCL we can use only instruction gentypen shuffle( gentypem x, ugentypen mask ) which can be used only for vector-types such as float16/uint16 into each item (thread), but not between items (threads) in WaveFront: https://www.khronos.org/registry/OpenCL/sdk/1.1/docs/man/xhtml/shuffle.html
Can we use something looks like shuffle() for reg-to-reg data-exchange between items (threads) in WaveFront which more faster than data-echange via Local memory?
Are there in AMD OpenCL instructions for register-to-register data-exchange intra-WaveFront such as instructions __any(), __all(), __ballot(), __shfl() for intra-WARP(CUDA): http://on-demand.gputechconf.com/gtc/2015/presentation/S5151-Elmar-Westphal.pdf
Warp vote functions:
__any(predicate) returns non-zero if any of the predicates for the
threads in the warp returns non-zero
__all(predicate) returns non-zero if all of the predicates for the
threads in the warp returns non-zero
__ballot(predicate) returns a bit-mask with the respective bits
of threads set where predicate returns non-zero
__shfl(value, thread) returns value from the requested thread
(but only if this thread also performed a __shfl()-operation)
CONCLUSION:
As known, in OpenCL-2.0 there is Sub-groups with SIMD execution model akin to WaveFronts: Does the official OpenCL 2.2 standard support the WaveFront?
For Sub-Group there are - page-160: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
int sub_group_all(int predicate) the same as CUDA-__all(predicate)
int sub_group_any(int predicate); the same as CUDA-__any(predicate)
But in OpenCL there is no similar functions:
CUDA-__ballot(predicate)
CUDA-__shfl(value, thread)
There is only Intel-specified built-in shuffle functions in Version 4, August 28, 2016 Final Draft OpenCL Extension #35: intel_sub_group_shuffle, intel_sub_group_shuffle_down, intel_sub_group_shuffle_down, intel_sub_group_shuffle_up: https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt
Also in OpenCL there are functions, which usually implemented by shuffle-functions, but there are not all of functions which can be implemented by using shuffle-functions:
<gentype> sub_group_broadcast( <gentype> x, uint sub_group_local_id );
<gentype> sub_group_reduce_<op>( <gentype> x );
<gentype> sub_group_scan_exclusive_<op>( <gentype> x );
<gentype> sub_group_scan_inclusive_<op>( <gentype> x );
Summary:
shuffle-functions remain more flexible functions , and ensure the fastest possible communication between threads with direct register-to-register data-exchanging.
But functions sub_group_broadcast/_reduce/_scan doesn't guarantee direct register-to-register data-exchanging, and these sub-group-functions less flexible.
There is
gentype work_group_reduce<op> ( gentype x)
for version >=2.0
but its definition doesn't say anything about using local memory or registers. This just reduces each collaborator's x value to a single sum of all. This function must be hit by all workgroup-items so its not on a wavefront level approach. Also the order of floating-point operations is not guaranteed.
Maybe some vendors do it register way while some use local memory. Nvidia does with register I assume. But an old mainstream Amd gpu has local memory bandwidth of 3.7 TB/s which is still good amount. (edit: its not 22 TB/s) For 2k cores, this means nearly 1.5 byte per cycle per core or much faster per cache line.
For %100 register(if not spills to global memory) version, you can reduce number of threads and do vectorized reduction in threads themselves without communicating with others if number of elements are just 8 or 16. Such as
v.s0123 += v.s4567
v.s01 += v.s23
v.s0 += v.s1
which should be similar to a __m128i _mm_shuffle_epi8 and its sum version when compiled on a CPU and non-scalar implementations will use same SIMD on a GPU to do these 3 operations.
Also using these vector types tend to use efficient memory transactions even for global and local, not just registers.
A SIMD works on only a single wavefront at a time, but a wavefront may be processed by multiple SIMDs, so, this vector operation does not imply a whole wavefront is being used. Or even whole wavefront may be computing 1st elements of all vectors in a cycle. But for a CPU, most logical option is SIMD computing work items one by one(avx,sse) instead of computing them in parallel by their same indexed elements.
If main work group doesn't fit ones requirements, there are child kernels to spawn and use dynamic width kernels for this kind of operations. Child kernel works on another group called sub-group concurrently. This is done within device-side queue and needs OpenCl version to be at least 2.0.
Look for "device-side enqueue" in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_User_Guide2.pdf
AMD APP SDK supports Sub-Group

Different number of cycles when running a benchmark more than once on C++ emulator

When running a benchmark e.g. dhrystone with the command:
make output/dhrystone.riscv.out
as described at: http://riscv.org/download.html#tab_rocket,
on the C++ emulator. I get the following output:
When running it for the first time:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533718
instret = 148672
and the second time:
Microseconds for one run through Dhrystone: 1064
Dhrystones per Second: 939
cycle = 533715
instret = 148672
Why do the cycles differ? Shouldn't they be exactly the same. I have tried this with other benchmarks too and had even higher deviations. If this is normal where do the deviations come from?
There are small amounts of nondeterminism from randomly initialized registers (e.g., the clock that is recovered by the HTIF is initialized to a random phase). It doesn't seem like these minor deviations would impact any performance benchmarking.
If you need identical results each time (e.g., for verification?), you could modify the emulator code to initialize registers to some known value each time.

Measuring time: differences among gettimeofday, TSC and clock ticks

I am doing some performance profiling for part of my program. And I try to measure the execution with the following four methods. Interestingly they show different results and I don't fully understand their differences. My CPU is Intel(R) Core(TM) i7-4770. System is Ubuntu 14.04. Thanks in advance for any explanation.
Method 1:
Use the gettimeofday() function, result is in seconds
Method 2:
Use the rdtsc instruction similar to https://stackoverflow.com/a/14019158/3721062
Method 3 and 4 exploits Intel's Performance Counter Monitor (PCM) API
Method 3:
Use PCM's
uint64 getCycles(const CounterStateType & before, const CounterStateType &after)
Its description (I don't quite understand):
Computes the number core clock cycles when signal on a specific core is running (not halted)
Returns number of used cycles (halted cyles are not counted). The counter does not advance in the following conditions:
an ACPI C-state is other than C0 for normal operation
HLT
STPCLK+ pin is asserted
being throttled by TM1
during the frequency switching phase of a performance state transition
The performance counter for this event counts across performance state transitions using different core clock frequencies
Method 4:
Use PCM's
uint64 getInvariantTSC (const CounterStateType & before, const CounterStateType & after)
Its description:
Computes number of invariant time stamp counter ticks.
This counter counts irrespectively of C-, P- or T-states
Two samples runs generate result as follows:
(Method 1 is in seconds. Methods 2~4 are divided by a (same) number to show a per-item cost).
0.016489 0.533603 0.588103 4.15136
0.020374 0.659265 0.730308 5.15672
Some observations:
The ratio of Method 1 over Method 2 is very consistent, while the others are not. i.e., 0.016489/0.533603 = 0.020374/0.659265. Assuming gettimeofday() is sufficiently accurate, the rdtsc method exhibits the "invariant" property. (Yep I read from Internet that current generation of Intel CPU has this feature for rdtsc.)
Methods 3 reports higher than Method 2. I guess its somehow different from the TSC. But what is it?
Methods 4 is the most confusing one. It reports an order of magnitude larger number than Methods 2 and 3. Shouldn't it be also kind of cycle counts? Let alone it carries the "Invariant" name.
gettimeofday() is not designed for measuring time intervals. Don't use it for that purpose.
If you need wall time intervals, use the POSIX monotonic clock. If you need CPU time spent by a particular process or thread, use the POSIX process time or thread time clocks. See man clock_gettime.
PCM API is great for fine tuned performance measurement when you know exactly what you are doing. Which is generally obtaining a variety of separate memory, core, cache, low-power, ... performance figures. Don't start messing with it if you are not sure what exact services you need from it that you can't get from clock_gettime.

Resources