Running multiple processes doesn't scale on Linux

There are two C++ processes, with one thread in each process. The thread handles network traffic (Diameter) from 32 incoming TCP connections, parses it, and forwards the split messages via 32 outgoing TCP connections. Let's call this C++ process a DiameterFE.
If only one DiameterFE process is running, it can handle 70 000 messages/sec.
If two DiameterFE processes are running, they can handle 35 000 messages/sec each, so the same 70 000 messages/sec in total.
Why don't they scale? What is a bottleneck?
Details:
There are 32 Clients (seagull) and 32 servers (seagull) for each Diameter Front End process, running on separate hosts.
A dedicated host is given for these two processes: 2 E5-2670 @ 2.60 GHz CPUs x 8 cores/socket x 2 HW threads/core = 32 threads in total.
10 GBit/sec network.
Average Diameter message size is 700 bytes.
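For reference, the offered load is nowhere near the 10 Gbit/s link capacity, so raw bandwidth is unlikely to be the bottleneck. A quick back-of-the-envelope check, using only the figures above (payload only, ignoring TCP/IP framing overhead):

```python
# Back-of-the-envelope: total payload bandwidth at the peak message rate.
# Uses the figures from the question: 70 000 msg/s, 700 bytes/msg.
msgs_per_sec = 70_000
bytes_per_msg = 700

bits_per_sec = msgs_per_sec * bytes_per_msg * 8
gbits_per_sec = bits_per_sec / 1e9

print(f"{gbits_per_sec:.3f} Gbit/s")  # 0.392 Gbit/s, far below the 10 Gbit/s link
```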
It looks like only Cpu0 handles network traffic (58.7% si). Do I have to explicitly assign different network queues to different CPUs?
The first process (PID=7615) takes 89.0 % CPU, it is running on Cpu0.
The second process (PID=59349) takes 70.8 % CPU, it is running on Cpu8.
On the other hand, Cpu0 is loaded at: 95.2% = 9.7%us + 26.8%sy + 58.7%si,
whereas Cpu8 is loaded only at 70.3% = 14.8%us + 55.5%sy
It looks like Cpu0 is also doing the work for the second process. There is a very high softirq load, and only on Cpu0 (58.7%). Why?
Here is the top output with key "1" pressed:
top - 15:31:55 up 3 days, 9:28, 5 users, load average: 0.08, 0.20, 0.47
Tasks: 973 total, 3 running, 970 sleeping, 0 stopped, 0 zombie
Cpu0 : 9.7%us, 26.8%sy, 0.0%ni, 4.8%id, 0.0%wa, 0.0%hi, 58.7%si, 0.0%st
...
Cpu8 : 14.8%us, 55.5%sy, 0.0%ni, 29.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
...
Cpu31 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 396762772k total, 5471576k used, 391291196k free, 354920k buffers
Swap: 1048568k total, 0k used, 1048568k free, 2164532k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7615 test1 20 0 18720 2120 1388 R 89.0 0.0 52:35.76 diameterfe
59349 test1 20 0 18712 2112 1388 R 70.8 0.0 121:02.37 diameterfe
610 root 20 0 36080 1364 1112 S 2.6 0.0 126:45.58 plymouthd
3064 root 20 0 10960 788 432 S 0.3 0.0 2:13.35 irqbalance
16891 root 20 0 15700 2076 1004 R 0.3 0.0 0:01.09 top
1 root 20 0 19364 1540 1232 S 0.0 0.0 0:05.20 init
...

The fix for this issue was to upgrade the kernel to 2.6.32-431.20.3.el6.x86_64.
After that, network interrupts and message queues are distributed among different CPUs.
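On kernels where the interrupts stay pinned to one CPU, the usual manual remedy is to spread the NIC queue IRQs over CPUs by writing a hex bitmask to /proc/irq/&lt;N&gt;/smp_affinity (or to enable RPS). A minimal sketch; the IRQ numbers are hypothetical examples, writing requires root, and a running irqbalance daemon may overwrite the setting:

```python
# Sketch: pin each NIC RX-queue IRQ to its own CPU by writing a hex bitmask
# to /proc/irq/<irq>/smp_affinity. IRQ numbers 40..43 below are made up;
# find the real ones for your NIC in /proc/interrupts.

def cpu_affinity_mask(cpu: int) -> str:
    """Hex bitmask with only the given CPU's bit set, e.g. cpu 8 -> '100'."""
    return format(1 << cpu, "x")

def pin_irq(irq: int, cpu: int) -> None:
    # Requires root; irqbalance may override this unless it is stopped.
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(cpu_affinity_mask(cpu))

# Example (commented out, needs root): spread IRQs 40..43 over CPUs 0..3.
# for cpu, irq in enumerate(range(40, 44)):
#     pin_irq(irq, cpu)

print(cpu_affinity_mask(8))  # -> 100
```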

Related

What are llvm_pipe threads?

I'm writing a Rust app that uses a lot of threads. I noticed the CPU usage was high so I did top and then hit H to see the threads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
247759 root 20 0 3491496 104400 64676 R 32.2 1.0 0:02.98 my_app
247785 root 20 0 3491496 104400 64676 S 22.9 1.0 0:01.89 llvmpipe-0
247786 root 20 0 3491496 104400 64676 S 21.9 1.0 0:01.71 llvmpipe-1
247792 root 20 0 3491496 104400 64676 S 20.9 1.0 0:01.83 llvmpipe-7
247789 root 20 0 3491496 104400 64676 S 20.3 1.0 0:01.60 llvmpipe-4
247790 root 20 0 3491496 104400 64676 S 20.3 1.0 0:01.64 llvmpipe-5
247787 root 20 0 3491496 104400 64676 S 19.9 1.0 0:01.70 llvmpipe-2
247788 root 20 0 3491496 104400 64676 S 19.9 1.0 0:01.61 llvmpipe-3
What are these llvmpipe-n threads? Why does my_app launch them? Are they even from my_app for sure?
As HHK links to, the llvmpipe threads are from your OpenGL driver, which is Mesa.
You said you are running this in a VM. VMs usually don't virtualize GPU hardware, so the Mesa OpenGL driver is doing software rendering. To achieve better performance, Mesa spawns threads to do parallel computations on the CPU.

Can't explain this Node clustering behavior

I'm learning about threads and how they interact with Node's native cluster module. I saw some behavior I can't explain that I'd like some help understanding.
My code:
process.env.UV_THREADPOOL_SIZE = 1;
const cluster = require('cluster');

if (cluster.isMaster) {
  cluster.fork();
} else {
  const crypto = require('crypto');
  const express = require('express');
  const app = express();

  app.get('/', (req, res) => {
    crypto.pbkdf2('a', 'b', 100000, 512, 'sha512', () => {
      res.send('Hi there');
    });
  });

  app.listen(3000);
}
I benchmarked this code with one request using apache benchmark.
ab -c 1 -n 1 localhost:3000/ yielded these connection times
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 605 605 0.0 605 605
Waiting: 605 605 0.0 605 605
Total: 605 605 0.0 605 605
So far so good. I then ran ab -c 2 -n 2 localhost:3000/ (doubling the number of calls from the benchmark). I expected the total time to double, since I limited the libuv thread pool to one thread per child process and I only started one child process. But nothing really changed. Here are those results.
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 0
Processing: 608 610 3.2 612 612
Waiting: 607 610 3.2 612 612
Total: 608 610 3.3 612 612
For extra info, when I further increase the number of calls with ab -c 3 -n 3 localhost:3000/, I start to see a slow down.
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 599 814 352.5 922 1221
Waiting: 599 814 352.5 922 1221
Total: 599 815 352.5 922 1221
I'm running all this on a quad-core Mac using Node v14.13.1.
tldr: how did my benchmark not use up all my threads? I forked one child process with one thread in its libuv pool, so the one call in my benchmark should have been all it could handle without taking longer. And yet the second test (the one that doubled the number of calls) took the same amount of time as the benchmark.
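The serialization the asker expected is easy to see in isolation. Here is a sketch (in Python rather than Node, purely to illustrate the expectation, not to explain the observed Node behavior): with a pool of one worker, two concurrent tasks run back to back, so the total wall time is roughly the sum of their durations.

```python
# With a pool of one worker, two concurrently submitted tasks serialize,
# so total wall time is roughly twice one task's duration.
import time
from concurrent.futures import ThreadPoolExecutor

def work():
    time.sleep(0.1)  # stand-in for a long-running pbkdf2-style task

start = time.monotonic()
with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(work) for _ in range(2)]
    for f in futures:
        f.result()
elapsed = time.monotonic() - start
print(f"{elapsed:.2f}s")  # ~0.2s: the second task waited for the first
```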

Tasklet counts in /proc/softirqs increase very rapidly on USB operation in Linux

I have a legacy device with following configuration:
Chipset Architecture : Intel NM10 express
CPU : Atom D2250 Dual Core
Volatile Memory : 1GB
CPU core : 4
USB Host controller driver : ehci-pci
When I perform any USB operation, I observe the tasklet count increasing linearly, and if the USB operation continues for a long time (approx. half an hour), the tasklet count crosses a million, which seems very strange to me.
I read that the interrupt handling mechanism used by ehci-pci is not the latest (i.e., it is PCI pin-based), but even so the tasklet counts are very high.
I use /proc/softirqs to read the tasklet count.
Any lead on this?
root@panther1:~# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
23: 0 0 0 1411443 IO-APIC 23-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
root@panther1:~# cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 7 3 13 1125529
TIMER: 2352846 2325384 2533628 2675821
NET_TX: 0 0 2 1703
NET_RX: 1161 1193 2730 51184
BLOCK: 0 0 0 0
IRQ_POLL: 0 0 0 0
TASKLET: 256 164 90 1162220
SCHED: 1078965 1015261 1155661 1207484
HRTIMER: 0 0 0 0
RCU: 1370647 1367098 1485356 1503762
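To quantify the growth rate rather than eyeball it, /proc/softirqs can be sampled twice and the per-CPU TASKLET counters diffed. A small sketch of a parser for that output, exercised here on an abbreviated sample taken from the numbers above:

```python
# Parse /proc/softirqs-style output into {row_name: [per-CPU counts]} so
# two samples taken some seconds apart can be diffed to get a rate per CPU.

def parse_softirqs(text: str) -> dict:
    rows = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:              # first line is the CPU header row
        name, *counts = line.split()
        rows[name.rstrip(":")] = [int(c) for c in counts]
    return rows

sample = """\
        CPU0   CPU1   CPU2   CPU3
HI:        7      3     13 1125529
TASKLET: 256    164     90 1162220
"""
print(parse_softirqs(sample)["TASKLET"])  # -> [256, 164, 90, 1162220]
```

In practice you would read /proc/softirqs, sleep a few seconds, read it again, and subtract the two TASKLET rows element-wise.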

Is it possible to get current CPU utilisation from a specific core via /sys in Linux?

I would like to write a shellscript that reads the current CPU utilisation on a per-core basis. Is it possible to read this from the /sys directory in Linux (CentOS 8)? I have found /sys/bus/cpu/drivers/processor/cpu0 which does give me a fair bit of information (like current frequency), but I've yet to figure out how to read CPU utilisation.
In other words: Is there a file that gives me current utilisation of a specific CPU core in Linux, specifically CentOS 8?
I believe you should be able to extract this information from /proc/stat, from the lines that start with cpuN, where N is 0, 1, 2, ... (I also strongly suggest reading the articles referenced in the other answer.) For example:
cpu0 101840 1 92875 80508446 4038 0 4562 0 0 0
cpu1 81264 0 68829 80842548 4424 0 2902 0 0 0
Repeated call will show larger values:
cpu 183357 1 162020 161382289 8463 0 7470 0 0 0
cpu0 102003 1 93061 80523961 4038 0 4565 0 0 0
cpu1 81354 0 68958 80858328 4424 0 2905 0 0 0
Notice cpu0's 5th column (the idle count) moving from 80508446 to 80523961.
The format of each line is:
cpuN user nice system idle iowait irq softirq steal guest guest_nice
So a basic solution:
while true; do
    for each cpu:
        read the current counters (at least user, system, and idle)
        usage = current(user + system) - prev(user + system)
        idle = current(idle) - prev(idle)
        utilization = usage / (usage + idle)
        # print or whatever
        prev = current
done

write error: No space left on device in embedded linux

Hi all,
I have an embedded board running Linux, and I use yaffs2 as the root filesystem.
I run a program on it, but after some time it gets the error "No space left on device". But I checked the flash, and there is still a lot of free space.
I only write some config files, and the config files are rarely updated. The program also writes some logs to flash; the log size is limited to 2 MB.
I don't know why this happens or how to solve it.
Please help! (My first language is not English, sorry. I hope you understand what I'm saying.)
some debug info:
# ./write_test
version 1.0
close file :: No space left on device
return errno 28
# cat /proc/yaffs
YAFFS built:Nov 23 2015 16:57:34
Device 0 "rootfs"
start_block........... 0
end_block............. 511
total_bytes_per_chunk. 2048
use_nand_ecc.......... 1
no_tags_ecc........... 1
is_yaffs2............. 1
inband_tags........... 0
empty_lost_n_found.... 0
disable_lazy_load..... 0
refresh_period........ 500
n_caches.............. 10
n_reserved_blocks..... 5
always_check_erased... 0
data_bytes_per_chunk.. 2048
chunk_grp_bits........ 0
chunk_grp_size........ 1
n_erased_blocks....... 366
blocks_in_checkpt..... 0
n_tnodes.............. 749
n_obj................. 477
n_free_chunks......... 23579
n_page_writes......... 6092
n_page_reads.......... 11524
n_erasures............ 96
n_gc_copies........... 5490
all_gcs............... 1136
passive_gc_count...... 1136
oldest_dirty_gc_count. 95
n_gc_blocks........... 96
bg_gcs................ 96
n_retired_writes...... 0
n_retired_blocks...... 0
n_ecc_fixed........... 0
n_ecc_unfixed......... 0
n_tags_ecc_fixed...... 0
n_tags_ecc_unfixed.... 0
cache_hits............ 0
n_deleted_files....... 0
n_unlinked_files...... 289
refresh_count......... 1
n_bg_deletions........ 0
Device 2 "data"
start_block........... 0
end_block............. 927
total_bytes_per_chunk. 2048
use_nand_ecc.......... 1
no_tags_ecc........... 1
is_yaffs2............. 1
inband_tags........... 0
empty_lost_n_found.... 0
disable_lazy_load..... 0
refresh_period........ 500
n_caches.............. 10
n_reserved_blocks..... 5
always_check_erased... 0
data_bytes_per_chunk.. 2048
chunk_grp_bits........ 0
chunk_grp_size........ 1
n_erased_blocks....... 10
blocks_in_checkpt..... 0
n_tnodes.............. 4211
n_obj................. 24
n_free_chunks......... 658
n_page_writes......... 430
n_page_reads.......... 467
n_erasures............ 7
n_gc_copies........... 421
all_gcs............... 20
passive_gc_count...... 13
oldest_dirty_gc_count. 3
n_gc_blocks........... 6
bg_gcs................ 4
n_retired_writes...... 0
n_retired_blocks...... 0
n_ecc_fixed........... 0
n_ecc_unfixed......... 0
n_tags_ecc_fixed...... 0
n_tags_ecc_unfixed.... 0
cache_hits............ 0
n_deleted_files....... 0
n_unlinked_files...... 2
refresh_count......... 1
n_bg_deletions........ 0
#
log and config file stored in "data".
thanks!!
In general this could be your disk space (here, flash); first of all, check your flash space with df -h (or other commands you have; df is present in BusyBox). But if your flash space (especially on your program's partition) is OK, this could be an "inode" (directory) space problem; you can see your inode usage with the df -i command. (A good link for this: https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available)
If none of these is the cause, I think you have to take a deeper look at your code, especially where you deal with disk I/O!
It is also worth mentioning: be aware of memory and heap usage, and free all allocated memory in your functions.
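Both checks (block space and inode space) can also be done programmatically. A small sketch using os.statvfs (POSIX); note that on some embedded filesystems the inode figures may be approximations:

```python
# Report free blocks and free inodes for a path, mirroring `df -h` and
# `df -i`: either one hitting zero can yield ENOSPC (errno 28).
import os

def space_report(path: str) -> dict:
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # blocks available to non-root
        "free_inodes": st.f_favail,               # inodes available to non-root
    }

report = space_report("/")
print(report)
```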
