How to reliably detect dropped netlink requests or responses - linux

I'm interested in using netlink for a straightforward application (reading cgroup stats at high frequency).
The man page cautions that the protocol is not reliable, hinting that the application needs to be prepared to handle dropped packets:
However, reliable transmissions from kernel to user are impossible in
any case. The kernel can't send a netlink message if the socket buffer
is full: the message will be dropped and the kernel and the user-space
process will no longer have the same view of kernel state. It
is up to the application to detect when this happens (via the ENOBUFS
error returned by recvmsg(2)) and resynchronize.
Since my requirements are simple, I'm fine with just destroying the socket and creating a new one whenever anything unexpected happens. But I can't find any documentation on what is expected of my program; the man page for recvmsg(2) doesn't even mention ENOBUFS, for example.
What all do I need to worry about in order to make sure I can tell that a request from my application or a response from the kernel has been dropped, so that I can reset everything and start over? It's clear to me that I could do so whenever I receive an error from any of the syscalls involved, but for example what happens if my request is dropped on the way to the kernel? Will I just never receive a response? Do I need to build a timeout mechanism where I wait only so long for a response?

I found the following in Communicating between the kernel and user-space in Linux using Netlink sockets by Ayuso, Gasca, and Lefevre:
If Netlink fails to deliver a message that goes from kernel to user-space, the recvmsg() function returns the No buffer space available (ENOBUFS) error. Thus, the user-space process knows that it is losing messages [...]
On the other hand, buffer overruns cannot occur in communications from user to kernel-space since sendmsg() synchronously passes the Netlink message to the kernel subsystem. If blocking sockets are used, Netlink is completely reliable in communications from user to kernel-space since memory allocations would wait, so no memory exhaustion is possible.
Regarding acks, it looks like worrying about them is optional:
NLM_F_ACK: the user-space application requested a confirmation message from
kernel-space to make sure that a given request was successfully performed. If this
flag is not set, the kernel-space reports the error synchronously via sendmsg() as errno value.
So it sounds like for my simplistic use case I can just use sendmsg and recvmsg naively, reacting to any error (except EINTR) by starting the whole thing over, perhaps with backoff. My guess is that since I only get one response per request and the responses are tiny, I should never even see ENOBUFS as long as I have only one request in flight at a time.
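To make that concrete, below is a minimal sketch of the pattern (my own, not from any of the sources above). It uses an RTM_GETLINK dump over NETLINK_ROUTE purely as a stand-in request; the cgroup-stats case would use its own netlink family and payload, but the "reset on any error" handling is the same.
/* Sketch of the "reset on any error" strategy.  The request here is a
 * stand-in (RTM_GETLINK dump on NETLINK_ROUTE); substitute the real
 * family and payload as needed. */
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

static int query_once(void)
{
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd < 0)
        return -1;

    struct {
        struct nlmsghdr nlh;
        struct rtgenmsg gen;
    } req = {
        .nlh = {
            .nlmsg_len   = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
            .nlmsg_type  = RTM_GETLINK,
            .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
            .nlmsg_seq   = 1,
        },
        .gen = { .rtgen_family = AF_UNSPEC },
    };

    /* One request in flight at a time: send it, then drain the reply. */
    if (send(fd, &req, req.nlh.nlmsg_len, 0) < 0)
        goto fail;

    char buf[16384];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* harmless; retry the read        */
            goto fail;                  /* ENOBUFS or anything else: reset */
        }
        for (struct nlmsghdr *nh = (struct nlmsghdr *)buf;
             NLMSG_OK(nh, n); nh = NLMSG_NEXT(nh, n)) {
            if (nh->nlmsg_type == NLMSG_ERROR)
                goto fail;
            if (nh->nlmsg_type == NLMSG_DONE) {
                close(fd);
                return 0;               /* complete response received */
            }
            /* ...process one message here... */
        }
    }

fail:
    close(fd);
    return -1;                          /* caller starts over, with backoff */
}

int main(void)
{
    for (;;) {
        if (query_once() < 0)
            sleep(1);                   /* crude backoff before retrying */
        else
            usleep(100 * 1000);         /* poll at ~10 Hz */
    }
}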

As a side note, dropped netlink packets can be seen in the Drops column of /proc/net/netlink (a small parser sketch follows the listing below). For example:
# cat /proc/net/netlink
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode
76f0e0ed 0 1966 00080551 0 0 0 2 0 36968
36a83ab1 0 1431 00000001 0 0 0 2 0 30297
d7d5db8e 0 563 00000440 0 0 0 2 0 19572
a10eb5c0 0 795 00000515 704 0 0 2 0 23584
c52bbce9 0 474 00000001 0 0 0 2 0 17511
3c5a89a5 0 989856248 00000001 0 0 0 2 0 31686
051108c1 0 0 00000000 0 0 0 2 0 25
ed401538 0 562 00000440 0 0 0 2 0 19576
38699987 0 469 00000557 0 0 0 2 0 19806
d7bbb203 0 728 00000113 0 0 0 2 0 22988
4d31126f 2 795 40000000 0 0 0 2 0 31092
febb9674 2 2100 00000001 0 0 0 2 0 37904
8c18eb5b 2 728 40000000 0 0 0 2 0 22989
922a7fcf 4 0 00000000 0 0 0 2 0 8681
16cfa740 7 0 00000000 0 0 0 2 0 7680
4e55a095 9 395 00000000 0 0 0 2 0 15142
0b2c5994 9 1 00000000 0 0 0 2 0 10840
94fe571b 9 0 00000000 0 0 0 2 0 7673
a7a1d82c 9 396 00000000 0 0 0 2 0 14484
b6a3f183 10 0 00000000 0 0 0 2 0 7517
1c2dc7e3 11 0 00000000 0 0 0 2 0 640
6dafb596 12 469 00000007 0 0 0 2 0 19810
2fe2c14c 12 3676872482 00000007 0 0 0 2 0 19811
3f245567 12 0 00000000 0 0 0 2 0 8682
0da2ddc4 12 3578344684 00000000 0 0 0 2 0 43683
f9720247 15 489 00000001 0 0 0 2 11 17781
e84b6d30 15 519 00000002 0 0 0 2 0 19071
c7d75154 15 1550 ffffffff 0 0 0 2 0 31970
02c1c3db 15 4070855316 00000001 0 0 0 2 0 10852
e0d7b09a 15 1 00000002 0 0 0 2 0 10821
78649432 15 0 00000000 0 0 0 2 0 30
8182eaf3 15 504 00000002 0 0 0 2 0 22047
40263df1 15 1858 ffffffff 0 0 0 2 0 34001
49283e31 16 0 00000000 0 0 0 2 0 696
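If it helps, a small hypothetical parser for that file could look like the following; the field order matches the header line shown above (newer kernels may differ slightly).
/* Hypothetical helper: list netlink sockets whose Drops counter is non-zero.
 * Column order follows the header shown above:
 *   sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/net/netlink", "r");
    if (!f) {
        perror("/proc/net/netlink");
        return 1;
    }

    char line[512];
    if (!fgets(line, sizeof(line), f)) {        /* skip the header line */
        fclose(f);
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        unsigned long sk, groups, rmem, wmem, dump, locks, drops, inode;
        unsigned int pid;
        int proto;
        if (sscanf(line, "%lx %d %u %lx %lu %lu %lu %lu %lu %lu",
                   &sk, &proto, &pid, &groups, &rmem, &wmem,
                   &dump, &locks, &drops, &inode) == 10 && drops > 0)
            printf("protocol %d portid %u drops %lu\n", proto, pid, drops);
    }
    fclose(f);
    return 0;
}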

Related

Native Transport Requests in Cassandra

I learned a bit about Native Transport Requests in Cassandra from this link: What are native transport requests in Cassandra?
As I understand it, any query I execute in Cassandra is a Native Transport Request.
I frequently get Request Timed Out errors in Cassandra, and I observed the following in the Cassandra debug log as well as via nodetool tpstats:
/var/log/cassandra# nodetool tpstats
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 186933949 0 0
ViewMutationStage 0 0 0 0 0
ReadStage 0 0 781880580 0 0
RequestResponseStage 0 0 5783147 0 0
ReadRepairStage 0 0 0 0 0
CounterMutationStage 0 0 14430168 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 366708 0 0
MemtableReclaimMemory 0 0 788 0 0
PendingRangeCalculator 0 0 1 0 0
GossipStage 0 0 0 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 0 0 0
MigrationStage 0 0 0 0 0
MemtablePostFlush 0 0 799 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 788 0 0
InternalResponseStage 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 477629331 0 1063468
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE 0
READ_REPAIR 0
1) What is the All time blocked state?
2) What does the value 1063468 denote? How harmful is it?
3) How can I tune this?
Each request is processed by the NTR stage before being handed off to the read/mutation stages, but it still blocks while waiting for completion. To avoid being overloaded, the stage starts blocking tasks from being added to its queue, applying back pressure to the client. Every time a request is blocked, the all time blocked counter is incremented. So 1063468 requests have at some point been blocked for some period of time because too many requests were backed up.
In situations where the app has spikes of queries, this blocking is unnecessary and can cause issues, so you can increase the queue limit with something like -Dcassandra.max_queued_native_transport_requests=4096 (default 128). You can also throttle requests on the client side, but I'd try increasing the queue size first.
There may also be some exceptionally slow request that is clogging up your system. If you have monitoring set up, look at high-percentile read/write coordinator latencies. You can also use nodetool proxyhistograms. There may be something in your data model or queries that is causing issues.

Is it possible to change which core timer interrupts happen on?

On my Debian 8 system, when I run the command watch -n0.1 --no-title cat /proc/interrupts, I get the output below.
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 46 0 0 10215 0 0 0 0 IO-APIC-edge timer
1: 1 0 0 2 0 0 0 0 IO-APIC-edge i8042
8: 0 0 0 1 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 0 0 0 4 0 0 0 0 IO-APIC-edge i8042
18: 0 0 0 0 8 0 0 0 IO-APIC-fasteoi i801_smbus
19: 7337 0 0 0 0 0 0 0 IO-APIC-fasteoi ata_piix, ata_piix
21: 0 66 0 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1
23: 0 0 35 0 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb2
40: 208677 0 0 0 0 0 0 0 HPET_MSI-edge hpet2
41: 0 4501 0 0 0 0 0 0 HPET_MSI-edge hpet3
42: 0 0 2883 0 0 0 0 0 HPET_MSI-edge hpet4
43: 0 0 0 1224 0 0 0 0 HPET_MSI-edge hpet5
44: 0 0 0 0 1029 0 0 0 HPET_MSI-edge hpet6
45: 0 0 0 0 0 0 0 0 PCI-MSI-edge aerdrv, PCIe PME
46: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
47: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
48: 0 0 0 0 0 0 0 0 PCI-MSI-edge PCIe PME
49: 0 0 0 0 0 8570 0 0 PCI-MSI-edge eth0-rx-0
50: 0 0 0 0 0 0 1684 0 PCI-MSI-edge eth0-tx-0
51: 0 0 0 0 0 0 0 2 PCI-MSI-edge eth0
NMI: 8 2 2 2 1 2 1 49 Non-maskable interrupts
LOC: 36 31 29 26 21 7611 886 1390 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 8 2 2 2 1 2 1 49 Performance monitoring interrupts
IWI: 0 0 0 1 1 0 1 0 IRQ work interrupts
RTR: 7 0 0 0 0 0 0 0 APIC ICR read retries
RES: 473 1027 1530 739 1532 3567 1529 1811 Rescheduling interrupts
CAL: 846 1012 1122 1047 984 1008 1064 1145 Function call interrupts
TLB: 2 7 5 3 12 15 10 6 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 4 4 4 4 4 4 4 4 Machine check polls
HYP: 0 0 0 0 0 0 0 0 Hypervisor callback interrupts
ERR: 0
MIS: 0
Observe that the timer interrupt is firing mostly on CPU3.
Is it possible to move the timer interrupt to CPU0?
The name of the concept is IRQ SMP affinity.
It's possible to set the smp_affinity of an IRQ by setting the affinity mask in /proc/irq/<IRQ_NUMBER>/smp_affinity or the affinity list in /proc/irq/<IRQ_NUMBER>/smp_affinity_list.
The affinity mask is a bit field where each bit represents a core; the IRQ is allowed to be serviced on the cores whose bits are set.
The command
echo 1 > /proc/irq/0/smp_affinity
executed as root should pin the IRQ0 to CPU0.
The word "should" is deliberate: setting the affinity for an IRQ is subject to a set of prerequisites, including an interrupt controller that supports a redirection table (like the IO-APIC), an affinity mask that contains at least one active CPU, an IRQ whose affinity is not managed by the kernel, and the feature being enabled.
In my virtualised Debian 8 system I was unable to set the affinity of IRQ0; the write failed with an EIO error.
I was also unable to track down the exact reason.
If you are willing to dive into the Linux source code, you can start from write_irq_affinity() in kernel/irq/proc.c.
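As a rough illustration (my own sketch, not part of the answer above), the write that echo performs can be done directly in C, which at least makes the errno visible when one of those prerequisites is not met:
/* Sketch: try to pin an IRQ to CPU0, i.e. what
 * `echo 1 > /proc/irq/N/smp_affinity` does, and report the error the
 * kernel returns (EIO, EINVAL, ...). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *irq = argc > 1 ? argv[1] : "0";  /* IRQ number, default 0 */
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%s/smp_affinity", irq);

    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror(path);
        return 1;
    }

    /* Affinity mask "1" = CPU0 only. */
    if (write(fd, "1", 1) < 0)
        fprintf(stderr, "cannot set affinity of IRQ %s: %s\n",
                irq, strerror(errno));
    close(fd);
    return 0;
}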
Use isolcpus. It may not reduce your timer interrupts to 0, but on our servers they are greatly reduced.
If you use isolcpus, the kernel will not affine interrupts to the isolated CPUs as it otherwise might. For example, we have systems with dual 12-core CPUs. We noticed NVMe interrupts on our CPU1 (the second CPU), even with the CPUs isolated via tuned and its cpu-partitioning scheme. The NVMe drives on our Dell systems are connected to the PCIe lanes on CPU1, hence the interrupts on those cores.
As per my ticket with Red Hat (and Margaret Bloom, who wrote an excellent answer here), if you don't want the interrupts to be affined to your CPUs, you need to use isolcpus on the kernel boot line. And lo and behold, I tried it and our interrupts went to 0 for the NVME drives on all isolated CPU cores.
I have not attempted to isolate ALL cores on CPU1; I don't know if they'll simply be affined to CPU0 or what.
And, in short: any interrupt in /proc/interrupts with "MSI" in the name is managed by the kernel.

Linux Top command giving output of both the cpu cores using command line [closed]

I have a system with 2 cores running Linux. I want to log the CPU usage of the individual cores at regular intervals of, say, 15 minutes.
I can use top and a regex to get the info, but it gives me the overall CPU info. When I manually press "1", the usage of each core is shown separately.
My question is: how can I display each core's CPU usage without manually pressing "1" after invoking the top command?
My research so far:
I can use the -b option to run in batch mode and output to a file. But the next question is how I can send input to the top command in batch mode. Is there a script that top reads to run in batch mode?
The Linux top command obtains its information from /proc/stat, whose format is (somewhat) dependent on the kernel version. Perhaps you could write a program which reads from that (a sketch follows at the end of this answer). Here is a sample from a 2.6.32 system with 20 cores:
cpu 46832272 794980 8521784 1312627944 853989 247 34947 0 0
cpu0 6404288 173468 806918 60455445 377313 1 1799 0 0
cpu1 2980140 137898 937163 64278592 68373 0 118 0 0
cpu2 5099227 86676 841568 62395343 27685 0 64 0 0
cpu3 11255325 20062 767603 56427120 9388 0 85 0 0
cpu4 2618170 1002 501629 65394095 4369 0 62 0 0
cpu5 635453 867 154898 67725523 2981 212 58 0 0
cpu6 343657 32 66510 68113208 2769 0 64 0 0
cpu7 327935 688 38431 68158263 1703 0 55 0 0
cpu8 118687 78 27436 68382190 1992 0 33 0 0
cpu9 329990 49 42224 68138515 1643 0 49 0 0
cpu10 3462177 160918 814788 63701724 202763 3 5444 0 0
cpu11 3006524 112533 484490 64877526 37455 0 6840 0 0
cpu12 2696919 61285 695966 65004324 17277 0 133 0 0
cpu13 3453005 34509 957663 64035215 10938 0 101 0 0
cpu14 2068954 2039 679830 65764151 6418 0 50 0 0
cpu15 628390 159 367213 67531841 2593 0 41 0 0
cpu16 331139 77 76690 68120995 2971 0 51 0 0
cpu17 616895 2482 182239 67595814 70070 29 19797 0 0
cpu18 343472 51 38712 68148369 2481 0 46 0 0
cpu19 111916 96 39803 68379681 2797 0 47 0 0
intr 1991637171 173 2 0 0 2 0 0 0 1 0 0 0 4 0 0 0 0 1 56 1416833 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1644 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2285 0 0 0 0 0 0 0 3211641 4799987 3235 31624105 11000098 0 ...
ctxt 3201588026
btime 1460672984
processes 2430161
procs_running 2
procs_blocked 0
softirq 1391193131 0 626556634 166050 33864038 3892307 0 11210298 67287467 2880340 645335997
According to the man page (man 5 proc then search for /proc/stat), the lines for cpu entries are:
The amount of time, measured in units of USER_HZ (1/100ths of a second on most architectures, use sysconf(_SC_CLK_TCK) to obtain the right value), that the system spent in user mode, user mode with low priority (nice), system mode, and the idle task, respectively. The last value should be USER_HZ times the second entry in the uptime pseudo-file.
iowait - time waiting for I/O to complete; irq - time servicing interrupts; softirq - time servicing softirqs.
steal - stolen time, which is the time spent in other operating systems when running in a virtualized environment
guest, which is the time spent running a virtual CPU for guest operating systems under the control of the Linux kernel.
guest_nice - time spent running a niced guest (a virtual CPU for guest operating systems under the control of the Linux kernel).
I looked at a 4.4.6 kernel system too; its cpu entries have the tenth field (guest_nice).
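Below is a rough sketch of such a program (an assumption-laden starting point, not a drop-in replacement for top): it samples /proc/stat twice and prints a busy percentage per core, using the column order quoted above.
/* Sketch: sample /proc/stat twice and print per-core busy percentage.
 * Columns follow the man 5 proc order quoted above: user nice system idle
 * iowait irq softirq steal guest guest_nice. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

struct cpu_sample { unsigned long long busy, total; };

static int read_cpus(struct cpu_sample *s, int max)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;

    char line[512];
    int n = 0;
    while (n < max && fgets(line, sizeof(line), f)) {
        /* Only the per-core "cpuN ..." lines; skip the aggregate "cpu " line. */
        if (strncmp(line, "cpu", 3) != 0 || line[3] < '0' || line[3] > '9')
            continue;
        unsigned long long v[10] = {0};
        if (sscanf(line + 3,
                   "%*d %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8], &v[9]) < 4)
            continue;
        unsigned long long total = 0;
        for (int i = 0; i < 10; i++)
            total += v[i];
        s[n].total = total;
        s[n].busy  = total - v[3] - v[4];   /* everything but idle and iowait */
        n++;
    }
    fclose(f);
    return n;
}

int main(void)
{
    struct cpu_sample a[MAX_CPUS], b[MAX_CPUS];
    if (read_cpus(a, MAX_CPUS) < 0)
        return 1;
    sleep(1);                               /* sampling interval */
    int n = read_cpus(b, MAX_CPUS);

    for (int i = 0; i < n; i++) {
        unsigned long long dt = b[i].total - a[i].total;
        unsigned long long db = b[i].busy  - a[i].busy;
        printf("cpu%d %5.1f%% busy\n", i, dt ? 100.0 * db / dt : 0.0);
    }
    return 0;
}
You could run something like this from cron every 15 minutes, with the one-second sleep serving as the sampling window.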

Cassandra2.1 write slow in a 1TB data table

I am doing some tests in a Cassandra cluster, and I now have a table with 1 TB of data per node. When I used YCSB to do more insert operations, I found the throughput was really low (about 10,000 ops/sec) compared to an identical, new table in the same cluster (about 80,000 ops/sec). While inserting, CPU usage was about 40% and there was almost no disk usage.
I used nodetool tpstats to get task details; it showed:
Pool Name Active Pending Completed Blocked All time blocked
CounterMutationStage 0 0 0 0 0
ReadStage 0 0 102 0 0
RequestResponseStage 0 0 41571733 0 0
MutationStage 384 21949 82375487 0 0
ReadRepairStage 0 0 0 0 0
GossipStage 0 0 247100 0 0
CacheCleanupExecutor 0 0 0 0 0
AntiEntropyStage 0 0 0 0 0
MigrationStage 0 0 6 0 0
Sampler 0 0 0 0 0
ValidationExecutor 0 0 0 0 0
CommitLogArchiver 0 0 0 0 0
MiscStage 0 0 0 0 0
MemtableFlushWriter 16 16 4745 0 0
MemtableReclaimMemory 0 0 4745 0 0
PendingRangeCalculator 0 0 4 0 0
MemtablePostFlush 1 163 9394 0 0
CompactionExecutor 8 29 13713 0 0
InternalResponseStage 0 0 0 0 0
HintedHandoff 2 2 5 0 0
I found there were a large number of pending MutationStage and MemtablePostFlush tasks.
I have read some related articles about Cassandra write limitations but found no useful information. I want to know why there is such a huge difference in Cassandra throughput between two otherwise identical tables that differ only in data size.
In addition, I use SSDs on my server; however, this phenomenon also occurs in another cluster using HDDs.
While Cassandra was running, I found that both %user and %nice CPU utilization were around 10% while only compaction tasks were running, with compaction throughput of about 80 MB/s, even though I have set the nice value to 0 for my Cassandra process.
Wild guess: your system is busy compacting the sstable.
Check it out with nodetool compactionstats
BTW, YCSB does not use prepared statements, which makes it a bad estimator of actual application load.

smp affinity setting in linux

I want to load-balance the interrupts for IRQ 75 on my virtual machine. It runs 64-bit Red Hat 5.8 with kernel 2.6.18. There are 8 CPUs in the virtual machine.
When I run:
cat /proc/interrupts
75: 9189 0 0 0 0 0 0 0 IO-APIC-level eth0
I saw that IRQ 75 is handled by CPU0 only. Then I changed the smp_affinity for IRQ 75:
echo ff > /proc/irq/75/smp_affinity
cat /proc/irq/75/smp_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff
But I saw again that the interrupts for IRQ 75 were going to CPU0 only.
75: 157228 0 0 0 0 0 0 0 IO-APIC-level eth0
There is no IRQ balancing between CPUs. I want to distribute the interrupts (IRQ 75) across all CPUs. Am I doing something wrong?
The value is a hex representation of a bitmask, usually 64-bit.
First, stop irqbalance.
Now try (bit pattern 10 = 0x2 in hex):
echo 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002 > /proc/irq/75/smp_affinity
This should work if you have a 2-core processor.
If you are using VMware, change the Ethernet driver to VMXNET3; you will then have interrupts like the following:
cat /proc/interrupts | grep eth3
57: 0 0 0 0 5 101198492 0 0 PCI-MSI-edge eth3-rxtx-0
58: 0 0 0 0 0 2 82962355 0 PCI-MSI-edge eth3-rxtx-1
59: 0 0 0 0 0 0 1 112986304 PCI-MSI-edge eth3-rxtx-2
60: 120252394 0 0 0 0 0 0 1 PCI-MSI-edge eth3-rxtx-3
61: 1 118585532 0 0 0 0 0 0 PCI-MSI-edge eth3-rxtx-4
62: 0 1 151440277 0 0 0 0 0 PCI-MSI-edge eth3-rxtx-5
63: 0 0 1 94639274 0 0 0 0 PCI-MSI-edge eth3-rxtx-6
64: 0 0 0 1 63577471 0 0 0 PCI-MSI-edge eth3-rxtx-7
65: 0 0 0 0 0 0 0 0 PCI-MSI-edge eth3-event-8
You will have different "rxtx" queues, each assigned to a CPU.
In my case the load became balanced among all CPUs.
