kprobes, backtrace, same symbol adjacent to itself - linux

I'm trying to understand the output from a backtrace I captured using kprobes. I can post the full stack trace if you'd like to see it, but it's not necessary for the question. Below is an excerpt:
Jul 24 16:31:34 scilinx kernel: [<ffffffff813a2b2e>] ? ata_scsi_port_error_handler+0x4be/0x710
Jul 24 16:31:34 scilinx kernel: [<ffffffff813a2ea8>] ? ata_scsi_cmd_error_handler+0x128/0x180
Jul 24 16:31:34 scilinx kernel: [<ffffffff813a2f98>] ? ata_scsi_error+0x98/0xd0
Jul 24 16:31:34 scilinx kernel: [<ffffffff81386cfa>] ? scsi_error_handler+0x12a/0x810
Jul 24 16:31:34 scilinx kernel: [<ffffffff81386bd0>] ? scsi_error_handler+0x0/0x810
Jul 24 16:31:34 scilinx kernel: [<ffffffff8109aef6>] ? kthread+0x96/0xa0
You'll notice that scsi_error_handler appears adjacent to itself in the call stack, but I cannot figure out why. Here is the scsi_error_handler function for this kernel; as you can see, it does not call itself. So why does the stack trace show it adjacent to itself like this?
Thanks.

Related

Kafka broker crash every day - OOM killer

I have a cluster of 3 Kafka brokers, version 0.10.2.1. Each broker has its own host with 2 CPUs / 16 GB RAM, and we use Docker to wrap the broker process.
The problem is as follows:
Almost every day, at the same time, we see all of our Kafka clients fail for 10 minutes.
At first I thought it was related to Kafka No broker in ISR for partition,
but after a while I discovered that the brokers simply crash due to the OOM killer.
I also played with Xmx and Xms before I discovered that it was the OOM killer. I tried:
-Xmx2048M -Xms2048M
-Xmx4096M -Xms2048M
Same behavior for both.
In addition, we currently have no ulimit set:
>> ulimit
unlimited
less kern.log
LOGS:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761019] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761022] run-parts cpuset=/ mems_allowed=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761026] CPU: 1 PID: 12266 Comm: run-parts Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761027] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761029] 0000000000000286 000000004811d7da ffff880036967af0 ffffffff813f7583
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761031] ffff880036967cc8 ffff880439f2f000 ffff880036967b60 ffffffff8120ad5e
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761033] ffffffff81cd2dc7 0000000000000000 ffffffff81e67760 0000000000000206
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761036] Call Trace:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761041] [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761044] [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761048] [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761049] [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761052] [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761054] [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761056] [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761060] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761063] [<ffffffff81391bcc>] ? apparmor_file_alloc_security+0x5c/0x220
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761066] [<ffffffff811ed05a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761070] [<ffffffff81347bd3>] ? security_file_alloc+0x33/0x50
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761073] [<ffffffff810caf11>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761074] [<ffffffff810805a0>] _do_fork+0x80/0x360
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761076] [<ffffffff81080929>] SyS_clone+0x19/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761080] [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
And ....
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.954463] Out of memory: Kill process 16123 (java) score 134 or sacrifice child
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] Killed process 16123 (java) total-vm:11977548kB, anon-rss:2035780kB, file-rss:67848kB
Any suggestions on how to approach this?
We found the problem.
First, I will say that adding more RAM to the machine also solved the problem, but it is an "expensive" solution.
The problem was as follows:
Since I was working with the EC2 Ubuntu distribution, daily cron jobs ran on every node of my cluster at exactly the same time. One of the scripts was mlocate, and this script apparently took too many resources.
I assume that since the whole Kafka cluster hit I/O and memory pressure at the same moment, the brokers tried to use more memory, and the OOM killer then killed them.
When 2 of my 3 brokers were down, some services went down as well.
So the solution was (see the sketch below):
1. Change the crontab to run at different hours of the day on each broker.
2. Disable mlocate.
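Here is a minimal sketch of both changes on a stock Ubuntu host, where the daily jobs are started by the run-parts entry in /etc/crontab (run-parts is the very process that invoked the OOM killer in the log above); the times and editor are illustrative, not taken from the original setup.
# 1. Stagger the daily cron run: on each broker, edit /etc/crontab and move the
#    "25 6 * * * root ... run-parts --report /etc/cron.daily" line to a different
#    hour, e.g. 06:25 / 07:25 / 08:25 across the three brokers.
sudo vi /etc/crontab
# 2. Disable mlocate's daily updatedb job (run-parts skips non-executable files),
#    or remove the package entirely.
sudo chmod -x /etc/cron.daily/mlocate
sudo apt-get remove --purge mlocate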
I also faced the same issue; the resources mentioned below helped me out:
https://docs.confluent.io/current/kafka/deployment.html
How to decide Kafka Cluster size
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
And please make sure that swap is enabled on all the brokers.

CentOS 6.5 yum error: Input/output error

When I run the yum command:
> yum
There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:
/usr/lib64/python2.6/lib-dynload/arraymodule.so: cannot read file data: Input/output error
Please install a package which provides this module, or
verify that the module is installed correctly.
It's possible that the above module doesn't match the
current version of Python, which is:
2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)]
The current version of Python is 2.6.6, not anything else.
System logs:
Oct 16 09:56:50 localhost kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Oct 16 09:56:50 localhost kernel: LSI Debug log info 31080000 for channel 0 id 0
Oct 16 09:56:50 localhost kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Oct 16 09:56:50 localhost kernel: LSI Debug log info 31080000 for channel 0 id 0
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Sense Key : Medium Error [current]
Oct 16 09:56:50 localhost kernel: Info fld=0x4d59fc8
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Add. Sense: Unrecovered read error
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] CDB: Read(10): 28 00 04 d5 9f c8 00 00 08 00
Oct 16 09:56:50 localhost kernel: end_request: critical medium error, dev sda, sector 81108936
Does anyone know how to fix this? Thank you!
Input/output error indicates that your system cannot read the file, and your log indicates that the hard drive is failing. Reinstall yum through RPM if you must, but ultimately back up your critical data and salvage the storage array.
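A rough sketch of how to confirm the diagnosis and recover, assuming the drive still responds and smartmontools is available (the module path is the one from the error above; the RPM file name is a placeholder):
# Which package owns the unreadable module, and does it still verify?
rpm -qf /usr/lib64/python2.6/lib-dynload/arraymodule.so
rpm -Vf /usr/lib64/python2.6/lib-dynload/arraymodule.so
# Check the drive's own error counters (they should match the medium errors in the kernel log)
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'
# If the disk is still usable, download the RPM reported by rpm -qf on another
# machine and reinstall it with rpm to rewrite the damaged file:
rpm -Uvh --replacepkgs <owning-package>.rpm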

How does a LUN rescan on Linux work

We can issue a host bus scan on a Linux host to get the /dev/sd* devices on the host.
The scan is issued using this command:
echo "- - -" > /sys/class/scsi_host/host0/scan
Can someone please tell me the flow of events in the Linux userspace/kernel code that leads to the creation of the /dev/sd* devices after a SCSI scan?
Is this a PCI bus scan, or SCSI commands sent to the storage controller, or something else?
I looked at the code to see how the iscsid scan works. __scsi_scan_target is the key function that performs the SCSI scan.
This function first tries to probe and add LUN 0 using the scsi_probe_and_add_lun function:
Dec 24 03:24:28 localhost kernel: [<ffffffff813e8c94>] scsi_probe_and_add_lun+0x3f4/0xc80
Dec 24 03:24:28 localhost kernel: [<ffffffff815e5a51>] ? printk+0x77/0x8e
Dec 24 03:24:28 localhost kernel: [<ffffffff813e9a0e>] __scsi_scan_target+0x13e/0x270
Dec 24 03:24:28 localhost kernel: [<ffffffff813e9c30>] scsi_scan_target+0xf0/0x110
Dec 24 03:24:28 localhost kernel: [<ffffffffa050eedf>] iscsi_user_scan_session.part.13+0x10f/0x150 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffffa050ef20>] ? iscsi_user_scan_session.part.13+0x150/0x150 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffffa050ef41>] iscsi_user_scan_session+0x21/0x30 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffff813b1b63>] device_for_each_child+0x53/0x90
Dec 24 03:24:28 localhost kernel: [<ffffffffa050cb6c>] iscsi_user_scan+0x3c/0x60 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffff813eb835>] store_scan+0xa5/0x100
Dec 24 03:24:28 localhost kernel: [<ffffffff813b1138>] dev_attr_store+0x18/0x30
Dec 24 03:24:28 localhost kernel: [<ffffffff81225686>] sysfs_write_file+0xc6/0x140
Dec 24 03:24:28 localhost kernel: [<ffffffff811afa7d>] vfs_write+0xbd/0x1e0
Dec 24 03:24:28 localhost kernel: [<ffffffff811b04c8>] SyS_write+0x58/0xb0
Dec 24 03:24:28 localhost kernel: [<ffffffff815fc9d9>] system_call_fastpath+0x16/0x1b
The scsi_probe_and_add_lun function will:
1. Allocate a scsi_device data structure.
2. Call scsi_probe_lun, which sends SCSI INQUIRY commands to LUN 0.
3. Call scsi_add_lun to fill in the scsi_device data structure from the INQUIRY responses; this data structure is registered using the scsi_sysfs_add_sdev function.
It then sends a SCSI REPORT LUNS command in scsi_report_lun_scan to get the list of LUNs, and adds each reported LUN by calling scsi_probe_and_add_lun again.
The scsi_sysfs_add_sdev function adds the device by calling scsi_target_add and device_add for the scsi_device.
This device_add call invokes the driver's probe function (sd_probe):
Dec 24 03:24:28 localhost kernel: [<ffffffffa01d77d0>] sd_probe+0x320/0x380 [sd_mod]
Dec 24 03:24:28 localhost kernel: [<ffffffff813b6917>] driver_probe_device+0x87/0x390
Dec 24 03:24:28 localhost kernel: [<ffffffff813b6c20>] ? driver_probe_device+0x390/0x390
Dec 24 03:24:28 localhost kernel: [<ffffffff813b6c5b>] __device_attach+0x3b/0x40
Dec 24 03:24:28 localhost kernel: [<ffffffff813b477b>] bus_for_each_drv+0x6b/0xb0
Dec 24 03:24:28 localhost kernel: [<ffffffff813b6818>] device_attach+0x88/0xa0
Dec 24 03:24:28 localhost kernel: [<ffffffff813b5b18>] bus_probe_device+0x98/0xc0
Dec 24 03:24:28 localhost kernel: [<ffffffff813b3584>] device_add+0x4c4/0x7a0
Dec 24 03:24:28 localhost kernel: [<ffffffff813c2ebc>] ? __pm_runtime_resume+0x5c/0x80
Dec 24 03:24:28 localhost kernel: [<ffffffff813eba9c>] scsi_sysfs_add_sdev+0xac/0x320
Dec 24 03:24:28 localhost kernel: [<ffffffff813e9317>] scsi_probe_and_add_lun+0xa77/0xc80
Dec 24 03:24:28 localhost kernel: [<ffffffff815eceff>] scsi_report_lun_scan+0x39a/0x5f1
Dec 24 03:24:28 localhost kernel: [<ffffffff813e9a24>] __scsi_scan_target+0x154/0x270
Dec 24 03:24:28 localhost kernel: [<ffffffff813e9c30>] scsi_scan_target+0xf0/0x110
Dec 24 03:24:28 localhost kernel: [<ffffffffa050eedf>] iscsi_user_scan_session.part.13+0x10f/0x150 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffffa050ef20>] ? iscsi_user_scan_session.part.13+0x150/0x150 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffffa050ef41>] iscsi_user_scan_session+0x21/0x30 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffff813b1b63>] device_for_each_child+0x53/0x90
Dec 24 03:24:28 localhost kernel: [<ffffffffa050cb6c>] iscsi_user_scan+0x3c/0x60 [scsi_transport_iscsi]
Dec 24 03:24:28 localhost kernel: [<ffffffff813eb835>] store_scan+0xa5/0x100
Dec 24 03:24:28 localhost kernel: [<ffffffff813b1138>] dev_attr_store+0x18/0x30
Dec 24 03:24:28 localhost kernel: [<ffffffff81225686>] sysfs_write_file+0xc6/0x140
Dec 24 03:24:28 localhost kernel: [<ffffffff811afa7d>] vfs_write+0xbd/0x1e0
Dec 24 03:24:28 localhost kernel: [<ffffffff811b04c8>] SyS_write+0x58/0xb0
Dec 24 03:24:28 localhost kernel: [<ffffffff815fc9d9>] system_call_fastpath+0x16/0x1b
linux-3.14.69/drivers/scsi/sd.c
This sd_probe function allocates the disk and schedules the asynchronous sd_probe_async work:
async_schedule_domain(sd_probe_async, sdkp, &scsi_sd_probe_domain);
sd_probe_async calls add_disk and prints the message:
"Attached SCSI %sdisk"

INFO: task nginx:22992 blocked for more than 120 seconds

I'm running an Ubuntu VM on Azure.
2 days ago my server was down. I found this inside my syslog:
Dec 11 06:45:28 myservice kernel: [4525694.437314] INFO: task nginx:22992 blocked for more than 120 seconds.
Dec 11 06:45:28 myservice kernel: [4525694.442895] Not tainted 3.16.0-29-generic #39-Ubuntu
Dec 11 06:45:28 myservice kernel: [4525694.447905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 11 06:45:28 myservice kernel: [4525694.453525] nginx D ffff8801bb633840 0 22992 22990 0x00000000
Dec 11 06:45:28 myservice kernel: [4525694.453531] ffff8801a0a7bd60 0000000000000082 ffff8801a0ebf010 0000000000013840
Dec 11 06:45:28 myservice kernel: [4525694.453534] ffff8801a0a7bfd8 0000000000013840 ffff8801a0ebf010 ffff8801b88d8d10
Dec 11 06:45:28 myservice kernel: [4525694.453536] ffff8801b88d8d14 ffff8801a0ebf010 00000000ffffffff ffff8801b88d8d18
Dec 11 06:45:28 myservice kernel: [4525694.453539] Call Trace:
Dec 11 06:45:28 myservice kernel: [4525694.453547] [<ffffffff817858a9>] schedule_preempt_disabled+0x29/0x70
Dec 11 06:45:28 myservice kernel: [4525694.453551] [<ffffffff81787e45>] __mutex_lock_slowpath+0xd5/0x1f0
Dec 11 06:45:28 myservice kernel: [4525694.453562] [<ffffffff81787f7f>] mutex_lock+0x1f/0x30
Dec 11 06:45:28 myservice kernel: [4525694.453580] [<ffffffffc048fe90>] cifs_strict_writev+0xf0/0x250 [cifs]
Dec 11 06:45:28 myservice kernel: [4525694.453585] [<ffffffff811e0991>] new_sync_write+0x81/0xb0
Dec 11 06:45:28 myservice kernel: [4525694.453588] [<ffffffff811e1177>] vfs_write+0xb7/0x1f0
Dec 11 06:45:28 myservice kernel: [4525694.453592] [<ffffffff811ffdcb>] ? set_close_on_exec+0x4b/0x60
Dec 11 06:45:28 myservice kernel: [4525694.453595] [<ffffffff811e1d26>] SyS_write+0x46/0xb0
Dec 11 06:45:28 myservice kernel: [4525694.453598] [<ffffffff8178a1ad>] system_call_fastpath+0x1a/0x1f
"Google" told me, it has probably sth. to do with high Disk I/O rates. But my Azure monitoring showed me very low disk read/write values in the problematic timerange. Also low CPU and low memory usage.
Another guess to this issue was a faulty hardware.
How can I check if this really was the reason - and if it was: how can I solve this problem when my VM is in the cloud? Migrate to a new VM ?!
I also have a very old nginx version which I want to update - but I don't think this is the reason for this issue, is it?
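For what it's worth, the call trace above shows the nginx write blocked on a mutex inside cifs_strict_writev [cifs], i.e. a write to a CIFS/SMB mount rather than the local disk, which is one reason the local disk graphs can look idle. A few commands that can help confirm this the next time it hangs (a sketch; requires root and an enabled sysrq):
cat /proc/sys/kernel/hung_task_timeout_secs     # the 120-second threshold from the message
echo w > /proc/sysrq-trigger                    # dump all blocked (D-state) tasks to the kernel log
ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /D/'   # which tasks are in uninterruptible sleep, and on what
mount -t cifs                                   # which CIFS shares the VM has mounted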

Node.js segfault - What does the stacktrace mean?

I've got a Node.js server (version v0.10.28) running that stops from time to time without an exception.
The kern.log (Debian Linux) shows these errors:
May 28 05:01:20 pro1739 kernel: [29083831.961652] node[32519]: segfault at 0 ip (null) sp 00007fff520fa478 error 14 in node[400000+80d000]
May 28 05:41:29 pro1739 kernel: [29086239.406993] node[1893] general protection ip:7c9334 sp:7fffc7644440 error:0 in node[400000+80d000]
May 28 05:50:45 pro1739 kernel: [29086794.741280] node[4227]: segfault at 7000000000b ip 00000000007c9334 sp 00007fff49b6d9b0 error 4 in node[400000+80d000]
May 28 06:26:27 pro1739 kernel: [29088936.031535] node[4732]: segfault at 0 ip (null) sp 00007fffe25a9978 error 14 in node[400000+80d000]
May 28 07:02:48 pro1739 kernel: [29091115.229410] node[6904]: segfault at 700000007 ip 00000000007bdd0d sp 00007fff7d722160 error 4 in node[400000+80d000]
May 28 08:00:58 pro1739 kernel: [29094603.815258] node[8970]: segfault at 40000000b ip 00000000007c9334 sp 00007fff457b8950 error 4 in node[400000+80d000]
May 28 08:37:27 pro1739 kernel: [29096791.454732] node[12482] general protection ip:7c8edf sp:7fff612a5ab0 error:0 in node[400000+80d000]
May 28 10:37:41 pro1739 kernel: [29104001.982760] node[14603]: segfault at 0 ip (null) sp 00007fffcb7ace18 error 14 in node[400000+80d000]
I managed to capture a stack trace:
PID 22101 received SIGSEGV for address: 0x7
/home/nodejs/node_modules/segvhandler/build/Release/segvhandler.node(+0x2fcd)[0x7fe355d98fcd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7fe356538030]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal18IncrementalMarking4StepElNS1_16CompletionActionE+0x2e4)[0x681fa4]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal8NewSpace15SlowAllocateRawEi+0x71)[0x774e31]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal4Heap8AllocateEPNS0_3MapENS0_15AllocationSpaceE+0x1e8)[0x621c08]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal4Heap23AllocateJSObjectFromMapEPNS0_3MapENS0_13PretenureFlagE+0x56)[0x624c96]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal4Heap27AllocateJSArrayWithElementsEPNS0_14FixedArrayBaseENS0_12ElementsKindENS0_13PretenureFlagE+0x20)[0x624da0]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal7Factory22NewJSArrayWithElementsENS0_6HandleINS0_14FixedArrayBaseEEENS0_12ElementsKindENS0_13PretenureFlagE+0x37)[0x6051e7]
/home/nodejs/.nvm/v0.10.28/bin/node(_ZN2v88internal19Runtime_StringMatchENS0_9ArgumentsEPNS0_7IsolateE+0x3dd)[0x75279d]
[0x3e9691a06362]
Could somebody help me with that stack trace?
To me it looks like node is trying to allocate heap memory for an array?
Thank you!
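Two things that usually make such a trace more actionable (a sketch; server.js stands in for your actual entry script, and the core file location depends on /proc/sys/kernel/core_pattern):
# Demangle the C++ frames from the handler output, e.g. the top V8 frame:
echo _ZN2v88internal18IncrementalMarking4StepElNS1_16CompletionActionE | c++filt
# -> v8::internal::IncrementalMarking::Step(long, v8::internal::IncrementalMarking::CompletionAction)
# Enable core dumps, re-run the server, and inspect the next crash with gdb:
ulimit -c unlimited
/home/nodejs/.nvm/v0.10.28/bin/node server.js
gdb -batch -ex 'bt' /home/nodejs/.nvm/v0.10.28/bin/node core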
