Why does doubling the CPU limit lead to only a 20% improvement in time cost? - python-3.x

I use Python 3 to do some encrypted calculation with Microsoft SEAL and am looking for some performance improvement.
I do it like this:
create a shared memory block to hold the plaintext data
(use a numpy array in shared memory for multiprocessing)
start multiple processes with multiprocessing.Process (a parameter controls the number of processes and thus limits the CPU usage)
the processes read from shared memory and do some encrypted calculation
wait for the calculation to end and join the processes
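A rough sketch of this setup (the names are illustrative rather than my actual code, and it assumes Python 3.8+ for multiprocessing.shared_memory):

import numpy as np
from multiprocessing import Process, Queue, shared_memory

def worker(shm_name, shape, dtype, result_queue):
    # Attach to the existing shared memory block and view it as a numpy array.
    shm = shared_memory.SharedMemory(name=shm_name)
    plaintext = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # ... the encrypted calculation on `plaintext` would go here ...
    result_queue.put(float(plaintext.sum()))    # placeholder result
    shm.close()

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data                            # copy the plaintext into shared memory

    n_procs = 13                                # the parameter that limits CPU usage
    results = Queue()
    procs = [Process(target=worker, args=(shm.name, data.shape, data.dtype, results))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    partials = [results.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()                                # wait for the calculation to end

    shm.close()
    shm.unlink()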
I run this program on a 32-vCPU / 64 GB x86 Linux server; the CPU model is Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz.
I notice that if I double the number of processes there is only about a 20% improvement in time cost.
I've tried three process counts:
| process nums | 7 | 13 | 27 |
| time ratio | 0.8 | 1 | 1.2 |
Why is this improvement disproportionate to the resources I use (CPU & memory)?
Conceptual explanations or specific Linux command lines are both welcome.
Thanks.
FYI:
The code of my sub-processes looks like this:
def sub_process_main(encrypted_bytes, plaintext_array, result_queue):
    # init
    # status_sign
    while shared_int > 0:
        # seal load and some other calculation
        encrypted_matrix_list = seal.Ciphertext.load(encrypted_bytes)
        shared_plaintext_matrix = seal.Encoder.encode(plaintext_array)
        # ... do something
        for some loop:
            time1 = time.time()
            res = []
            for i in range(len(encrypted_matrix_list)):
                enc = seal.evaluator.multiply_plain(encrypted_matrix_list[i], shared_plaintext_matrix[i])
                res.append(enc)
            time2 = time.time()
            print(f'time usage: {time2 - time1}')
            # ... do something
    result_queue.put(final_result)
I actually print the time for every part of my code, and here is the time cost for this part:
| process nums | 13 | 27 |
| occurrences | 1791 | 864 |
| total time (s) | 1698.2140 | 1162.8330 |
| average (s) | 0.9482 | 1.3459 |
I've monitored some metrics but I don't know if there are any abnormal ones.
13 cores: [screenshots of top, pidstat and vmstat output]
27 cores: [screenshots of top, pidstat and vmstat output] (Why does top show activity on all cores rather than exactly 27? Does it have anything to do with Hyper-Threading?)
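For completeness, here is a rough sketch of how similar per-core metrics could be sampled from Python alongside a run (it assumes the third-party psutil package, which is not part of my actual program):

import time
import psutil  # third-party; assumed to be installed

def sample(interval=1.0, duration=30):
    # Print per-core utilization and the 1-minute load average once per interval.
    end = time.time() + duration
    while time.time() < end:
        per_core = psutil.cpu_percent(interval=interval, percpu=True)
        load1, _, _ = psutil.getloadavg()
        busy = sum(1 for c in per_core if c > 50.0)
        print(f"load1={load1:.2f} cores>50%={busy} per-core={per_core}")

if __name__ == "__main__":
    sample()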

Related

How to ensure that my model is using all available GPUs in a Jupyter notebook

I am using TensorFlow 2.3 on a dedicated machine with 2 GPUs. I am using the Styleformer model to turn informal sentences into formal ones. I want to use both GPUs for this task.
Here is the information about GPU:
!nvidia-smi
| NVIDIA-SMI XXX.XX.XX Driver Version: ******** CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

get_available_gpus()
# ['/device:GPU:0', '/device:GPU:1']
Code that I am using on the GPU:
from styleformer import Styleformer
import torch
import warnings

sf = Styleformer(style=0)
source_sentences = [
    "I am quitting my job",
    "Jimmy is on crack and can't trust him",
    "What do guys do to show that they like a gal?"
]
for source_sentence in source_sentences:
    target_sentence = sf.transfer(source_sentence, inference_on=1, quality_filter=0.95, max_candidates=5)
In the above code, inference_on=1 means we are using the GPU. But how can I ensure it's using both GPUs? I went to the transfer function inside the styleformer package and found this:
def transfer(self, input_sentence, inference_on=0, quality_filter=0.95, max_candidates=5):
    if self.model_loaded:
        if inference_on == 0:
            device = "cpu"
        elif inference_on == 1:
            device = "cuda:0"
        else:
            device = "cpu"
            print("Onnx + Quantisation is not supported in the pre-release...stay tuned.")
How can I change the above code to use both GPUs?
If your model is not using the GPU, there could be multiple reasons:
the CUDA toolkit & drivers for the GPU are not installed
an NVIDIA driver issue (uninstall & reinstall)
a TensorFlow version issue
Check all of these and try again. The notebook will use the GPU automatically if it is available and you have everything installed.
When it is running on the GPU, you will see 0MiB / 32510MiB in nvidia-smi change to more than 0MiB.
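A quick way to double-check what the frameworks can actually see (a minimal sketch; tf.config.list_physical_devices and the torch.cuda calls are standard APIs, everything else is illustrative):

import tensorflow as tf
import torch

# GPUs visible to TensorFlow
print("TF GPUs:", tf.config.list_physical_devices('GPU'))

# GPUs visible to PyTorch (the question's code imports torch, which Styleformer runs on)
print("CUDA available:", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))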

Use linux perf utility to report counters every second like vmstat

There is the perf command-line utility in Linux to access hardware performance-monitoring counters; it works through the perf_events kernel subsystem.
perf itself has basically two modes: perf record/perf top to record a sampling profile (a sample is taken, for example, every 100,000th CPU clock cycle or executed instruction), and perf stat to report the total count of cycles/executed instructions for the application (or for the whole system).
Is there a mode of perf that prints a system-wide or per-CPU summary of the total counts every second (or every 3, 5, 10 seconds), the way vmstat and the sysstat-family tools do (iostat, mpstat, sar -n DEV... as listed in http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html)? For example, with the cycles and instructions counters I would get the mean IPC for every second of the system (or of every CPU).
Is there any non-perf tool (among those in https://perf.wiki.kernel.org/index.php/Tutorial or http://www.brendangregg.com/perf.html) that can collect such statistics from the perf_events kernel subsystem? What about system-wide per-process IPC calculation with a resolution of seconds?
There is the perf stat option "interval-print", -I N, where N is a millisecond interval, to print the counters every N milliseconds (N >= 10): http://man7.org/linux/man-pages/man1/perf-stat.1.html
-I msecs, --interval-print msecs
    Print count deltas every N milliseconds (minimum: 10ms). The overhead percentage could be high in some cases, for instance with small, sub-100ms intervals. Use with caution. Example: perf stat -I 1000 -e cycles -a sleep 5
For best results it is usually a good idea to use it with interval mode like -I 1000, as the bottleneck of workloads can change often.
There is also machine-readable output; with -I, the first field is the timestamp:
With -x, perf stat is able to output a not-quite-CSV format output ... optional usec time stamp in fractions of second (with -I xxx)
The periodic printing of vmstat and the sysstat-family tools (iostat, mpstat, etc.) corresponds to perf stat with -I 1000 (every second), for example system-wide (add -A to separate per-CPU counters):
perf stat -a -I 1000
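As a sketch of the per-second IPC idea from the question, the -I/-x output can be post-processed with a small script (perf stat writes these lines to stderr; the exact CSV field layout can vary between perf versions, so the parsing below is an assumption to check against your perf):

import subprocess
from collections import defaultdict

# Run: perf stat -a -I 1000 -x, -e cycles,instructions sleep 10
cmd = ["perf", "stat", "-a", "-I", "1000", "-x", ",",
       "-e", "cycles,instructions", "sleep", "10"]
proc = subprocess.run(cmd, capture_output=True, text=True)

# Assumed layout per line: <time>,<counter value>,<unit>,<event name>,...
samples = defaultdict(dict)
for line in proc.stderr.splitlines():
    fields = line.split(",")
    if len(fields) < 4:
        continue
    ts, value, event = fields[0].strip(), fields[1].strip(), fields[3].strip()
    try:
        samples[ts][event] = float(value)   # skips "<not counted>" and header lines
    except ValueError:
        continue

for ts, counters in sorted(samples.items(), key=lambda kv: float(kv[0])):
    if counters.get("cycles") and "instructions" in counters:
        print(f"t={ts}s  IPC={counters['instructions'] / counters['cycles']:.2f}")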
The option is implemented in builtin-stat.c (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8), in the __run_perf_stat function:
531 static int __run_perf_stat(int argc, const char **argv)
532 {
533 int interval = stat_config.interval;
For perf stat -I 1000 with a program argument (forks=1), for example perf stat -I 1000 sleep 10, there is an interval loop (ts is the millisecond interval converted to a struct timespec):
639     enable_counters();
641     if (interval) {
642             while (!waitpid(child_pid, &status, WNOHANG)) {
643                     nanosleep(&ts, NULL);
644                     process_interval();
645             }
646     }
666     disable_counters();
For the variant of system-wide hardware performance counter monitoring with forks=0, there is another interval loop:
658     enable_counters();
659     while (!done) {
660             nanosleep(&ts, NULL);
661             if (interval)
662                     process_interval();
663     }
666     disable_counters();
process_interval() (http://lxr.free-electrons.com/source/tools/perf/builtin-stat.c?v=4.8#L347), in the same file, uses read_counters(), which loops over the event list and invokes read_counter(), which in turn loops over all known threads and all CPUs and calls the actual reading function:
306     for (thread = 0; thread < nthreads; thread++) {
307             for (cpu = 0; cpu < ncpus; cpu++) {
...
310                     count = perf_counts(counter->counts, cpu, thread);
311                     if (perf_evsel__read(counter, cpu, thread, count))
312                             return -1;
perf_evsel__read is the actual counter read while the program is still running:
1207 int perf_evsel__read(struct perf_evsel *evsel, int cpu, int thread,
1208                      struct perf_counts_values *count)
1209 {
1210         memset(count, 0, sizeof(*count));
1211
1212         if (FD(evsel, cpu, thread) < 0)
1213                 return -EINVAL;
1214
1215         if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
1216                 return -errno;
1217
1218         return 0;
1219 }

Child event order in Node.js

I have an API; its processing works like this:
do some logic, using 1 second of CPU time
wait for network IO, which needs 1 second too
So normally this API needs about 2 seconds to respond.
Then I did a test.
I started 10 requests at the same time.
EVERY ONE OF THEM needed more than 10 seconds to respond.
This test means that
Node finishes the CPU-costly part of all 10 requests first.
WHY?
Why doesn't it respond to a request immediately after its IO is done?
Thanks for the comments. I think I need to explain my concern.
What I'm concerned about is what happens when the request count is not 10 but 100 requests arrive at the same time.
All of them will time out!
If Node responded to the child IO event immediately, I think at least 20% of them would not time out.
I think Node needs some event-priority mechanism.
router.use('/test/:id', function (req, res) {
    var id = req.param('id');
    console.log('start cpu code for ' + id);
    for (var x = 0; x < 10000; x++) {
        for (var x2 = 0; x2 < 30000; x2++) {
            x2 -= 1;
            x2 += 1;
        }
    }
    console.log('cpu code over for ' + id);
    request('http://terranotifier.duapp.com/wait3sec/' + id, function (a, b, data) {
        // how can I make this code run immediately after the server responds to me?
        console.log('IO over for ' + data);
        res.send('over');
    });
});
Node.js is single-threaded. Therefore, as long as you have a long-running routine, it cannot process other pieces of code. The offending piece of code in this instance is your double for loop, which takes up a lot of CPU time.
To understand what you're seeing, first let me explain how the event loop works.
Node.js's event loop evolved out of JavaScript's event loop, which evolved out of the web browser's event loop. The web browser event loop was originally implemented not for JavaScript but to allow progressive rendering of images. The event loop looks a bit like this:
,-> is there anything from the network?
| | |
| no yes
| | |
| | '-----------> read network data
| V |
| does the DOM need updating? <-------------'
| | |
| no yes
| | |
| | v
| | update the DOM
| | |
'------'--------------'
When javascript was added the script processing was simply inserted into the event loop:
,-> is there anything from the network?
| | |
| no yes
| | |
| | '-----------> read network data
| V |
| any javascript to run? <------------------'
| | |
| no yes
| | '-----------> run javascript
| V |
| does the DOM need updating? <-------------'
| | |
| no yes
| | |
| | v
| | update the DOM
| | |
'------'--------------'
When the javascript engine is made to run outside of the browser, as in Node.js, the DOM related parts are simply removed and the I/O becomes generalized:
,-> any javascript to run?
| | |
| no yes
| | |
| | '--------> RUN JAVASCRIPT
| V |
| is there any I/O <------------'
| | |
| no yes
| | |
| | v
| | read I/O
| | |
'------'--------------'
Note that all your javascript code is executed in the RUN JAVASCRIPT part.
So, what happens with your code when you make 10 connections?
connection1: node accepts your request, processes the double for loops
connection2: node is still processing the for loops, the request gets queued
connection3: node is still processing the for loops, the request gets queued
(at some point the for loop for connection 1 finishes)
node notices that connection2 is queued so connection2 gets accepted,
process the double for loops
...
connection10: node is still processing the for loops, the request gets queued
(at this point node is still busy processing some other for loop,
probably for connection 7 or something)
request1: node is still processing the for loops, the request gets queued
request2: node is still processing the for loops, the request gets queued
(at some point all connections for loops finishes)
node notices that response from request1 is queued so request1 gets processed,
console.log gets printed and res.send('over') gets executed.
...
request10: node is busy processing some other request, request10 gets queued
(at some point request10 gets executed)
This is why you see Node taking 10 seconds to answer 10 requests. It's not that the requests themselves are slow, but their responses are queued behind all the for loops, and the for loops get executed first (because we're still in the current iteration of the event loop).
To counter this, you should make the for loops asynchronous to give Node a chance to process the event loop. You can write them in C and use C to run independent threads for each of them, use one of the thread modules from npm to run JavaScript in separate threads, use worker-threads (a web-worker-like API implemented for Node.js), fork a cluster of processes to execute them, or simply loop them with setTimeout if parallelism is not critical:
router.use('/test/:id', function (req, res) {
    var id = req.param('id');
    console.log('start cpu code for ' + id);

    function async_loop (count, callback, done_callback) {
        if (count) {
            callback();
            // pass done_callback along so it actually fires when the loop finishes
            setTimeout(function () { async_loop(count - 1, callback, done_callback); }, 1);
        }
        else if (done_callback) {
            done_callback();
        }
    }

    var outer_loop_done = 0;
    var request_sent = 0;
    var x1 = 0;
    var x2 = 0;
    async_loop(10000, function () {
        x1++;
        async_loop(30000, function () {
            x2++;
        }, function () {
            if (outer_loop_done && !request_sent) {
                request_sent = 1;           // make sure we respond only once
                console.log('cpu code over for ' + id);
                request('http://terranotifier.duapp.com/wait3sec/' + id,
                    function (a, b, data) {
                        console.log('IO over for ' + data);
                        res.send('over');
                    }
                );
            }
        });
    }, function () {
        outer_loop_done = 1;
    });
});
The above code will process the response from request() as soon as possible, rather than waiting for all the async_loops to run to completion, without using threads (so no parallelism) but simply by using the ordering of the event queue.

What does this DTrace script output mean?

I am tracing DTrace probes in my restify.js application (restify is an HTTP server framework for Node.js that provides DTrace support). I am using the sample DTrace script from the restify documentation:
#!/usr/sbin/dtrace -s
#pragma D option quiet

restify*:::route-start
{
        track[arg2] = timestamp;
}

restify*:::handler-start
/track[arg3]/
{
        h[arg3, copyinstr(arg2)] = timestamp;
}

restify*:::handler-done
/track[arg3] && h[arg3, copyinstr(arg2)]/
{
        @[copyinstr(arg2)] = quantize((timestamp - h[arg3, copyinstr(arg2)]) / 1000000);
        h[arg3, copyinstr(arg2)] = 0;
}

restify*:::route-done
/track[arg2]/
{
        @[copyinstr(arg1)] = quantize((timestamp - track[arg2]) / 1000000);
        track[arg2] = 0;
}
And the output is:
use_restifyRequestLogger
value ------------- Distribution ------------- count
-1 | 0
0 |######################################## 2
1 | 0
use_validate
value ------------- Distribution ------------- count
-1 | 0
0 |######################################## 2
1 | 0
pre
value ------------- Distribution ------------- count
0 | 0
1 |#################### 1
2 |#################### 1
4 | 0
handler
value ------------- Distribution ------------- count
128 | 0
256 |######################################## 2
512 | 0
route_user_read
value ------------- Distribution ------------- count
128 | 0
256 |######################################## 2
512 | 0
I was wondering what the value field is - what does it mean?
Why is there 128/256/512, for example? I guess it means the time/duration, but it is in a strange format - is it possible to show milliseconds, for example?
The output is a histogram. You are getting a histogram because you are using the quantize function in your D script. The DTrace documentation says the following on quantize:
A power-of-two frequency distribution of the values of the specified expressions. Increments the value in the highest power-of-two bucket that is less than the specified expression.
The 'value' column is the result of (timestamp - track[arg2]) / 1000000, where timestamp is the current time in nanoseconds. So the value shown is the duration in milliseconds.
Putting this all together, the route_user_read result graph is telling you that you had 2 requests that took between 128 and 256 milliseconds.
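To illustrate the bucketing, here is a small sketch in plain Python (not DTrace itself, just mimicking how quantize assigns a positive value to a power-of-two bucket):

def quantize_bucket(value):
    # Return the lower bound b of the bucket [b, 2b) that contains the value,
    # which is what each row of the quantize() histogram represents.
    bucket = 1
    while bucket * 2 <= value:
        bucket *= 2
    return bucket

# A request that took 180 ms is counted in the 128 bucket,
# i.e. "between 128 and 256 milliseconds".
print(quantize_bucket(180))  # -> 128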
This output is useful when you have a lot of requests and want to get a general sense of how your server is performing (you can quickly identify a bi-modal distribution for example). If you just want to see how long each request is taking, try using the printf function instead of quantize.

Poor memcpy performance in user space for mmap'ed physical memory in Linux

Of the 192GB of RAM installed in my computer, I have the 188GB above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel module accumulates data into this large area, used as a ring buffer, via DMA. A user space application mmaps this ring buffer into user space, then copies blocks from the ring buffer at the current location for processing once they are ready.
Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which implements the mmap file operation:
module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
        remap_pfn_range(vma, vma->vm_start,
                        resmem_hwaddr >> PAGE_SHIFT,
                        resmem_length, vma->vm_page_prot);
        return 0;
}
and a test application, which does in essence (with the checks removed):
#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:
| | 1GB | 4GB | 16GB | 64GB | 128GB | 188GB
|run 1 | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) | 9.326ms (1798.97MB/s) | 213.892ms ( 78.43MB/s) | 206.476ms ( 81.25MB/s)
|run 2 | 4.255ms (3942.94MB/s) | 4.249ms (3948.51MB/s) | 4.257ms (3941.09MB/s) | 4.298ms (3903.49MB/s) | 208.269ms ( 80.55MB/s) | 200.627ms ( 83.62MB/s)
My observations are:
From the first to the second run, memcpy from mmap'ed to malloc'ed seems to benefit from the contents already being cached somewhere.
There is a significant performance degradation above 64GB, which can be noticed when using memcpy.
I would like to understand why that is so. Perhaps somebody in the Linux kernel developer group thought: 64GB should be enough for anybody (does this ring a bell?).
Kind regards,
peter
Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
If I lock my application to CPU0, I can observe different memcpy speeds depending on what memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):
resmem=64G$4G (inside CPU0 realm): 3949MB/s
resmem=64G$96G (outside CPU0 realm): 82MB/s
resmem=64G$128G (outside CPU0 realm): 3948MB/s
resmem=92G$4G (inside CPU0 realm): 3966MB/s
resmem=92G$100G (outside CPU0 realm): 57MB/s
It nearly makes sense; only the third case, 64G$128G, meaning the uppermost 64GB, also yields good results, which somewhat contradicts the theory.
Regards,
peter
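For reference, "locking the application to CPU0" can be done with taskset -c 0 when launching the program or, from a Python harness, with os.sched_setaffinity; a minimal sketch (illustration only: it pins the CPU but does not control where the copied memory is allocated):

import os
import time

# Pin the current process to CPU 0 (Linux-specific), mirroring the experiment above.
os.sched_setaffinity(0, {0})
print("running on CPUs:", os.sched_getaffinity(0))

BLOCKSIZE = 16 * 1024 * 1024
src = bytearray(BLOCKSIZE)

start = time.perf_counter()
dst = bytes(src)                    # one 16 MiB copy
elapsed = time.perf_counter() - start
print("copied 16 MiB in %.3f ms (%.0f MB/s)" % (elapsed * 1000, BLOCKSIZE / 1e6 / elapsed))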
Your CPU probably doesn't have enough cache to deal with it efficiently. Either use lower memory, or get a CPU with a bigger cache.
