Child process does not generate a core ONLY for a SIGBUS error and becomes a zombie process - Linux

My child process is trying to access a PCI address space. It works fine most of the time.
But sometimes the child process goes into the zombie state. The dmesg log shows the following bus error:
[ 501.134156] Caused by (from MCSR=10008): Bus - Read Data Bus Error
[ 501.134169] Oops: Machine check, sig: 7 [#1]
There is no core file generated in this case.
[Linux:/]$ ps -axl | grep test1
4 0 6805 32495 20 0 0 0 exit Zl ? 0:05 [test1] <defunct>
[Linux:/]$
A core is generated when the child process hits a SIGSEGV, so I assume this has nothing to do with permission/ulimit settings.
Can someone help me understand why no core is generated in this case?
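For what it's worth, here is a minimal sketch (not part of the original program, purely illustrative) that the child can run at startup to confirm the core-file limit and the dumpable flag, so ulimit/prctl settings can be ruled out explicitly:

/* Illustrative only: print RLIMIT_CORE and the "dumpable" flag from inside
 * the child, to rule out core-limit or prctl settings as the reason no core
 * file is written. */
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) == 0)
        printf("RLIMIT_CORE: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    /* PR_GET_DUMPABLE returns 0 when core dumps are suppressed for this task */
    printf("dumpable: %d\n", prctl(PR_GET_DUMPABLE, 0, 0, 0, 0));
    return 0;
}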
Child Process:
--------------
[Linux:/]$ cat /proc/6805/status
Name: test1
State: Z (zombie)
Tgid: 6805
Pid: 6805
PPid: 32495
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups:
Threads: 2
SigQ: 18/13007
SigPnd: 0000000002000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001006
SigCgt: 0000000182000200
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: 3
Cpus_allowed_list: 0-1
voluntary_ctxt_switches: 8998
nonvoluntary_ctxt_switches: 857
Stack:
-------
[Linux:/]$ cat /proc/6805/stack
[<00000000>] (nil)
[<c0008640>] __switch_to+0xc0/0x160
[<c004b4f4>] do_exit+0x5d4/0xa70
[<c000c694>] die+0x224/0x310
[<c000ce44>] machine_check_exception+0x124/0x1e0
[<c00123bc>] ret_from_mcheck_exc+0x0/0x14c
[Linux:/]$
Parent Process:
---------------
[Linux:/]$ cat /proc/32495/status
Name: test
State: S (sleeping)
Tgid: 32495
Pid: 32495
PPid: 21911
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 256
Groups:
VmPeak: 4820 kB
VmSize: 4820 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 2548 kB
VmRSS: 2548 kB
VmData: 1284 kB
VmStk: 132 kB
VmExe: 900 kB
VmLib: 1976 kB
VmPTE: 24 kB
VmSwap: 0 kB
Threads: 1
SigQ: 19/13007
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000010000
SigIgn: 0000000000001006
SigCgt: 0000000043816ef9
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: 3
Cpus_allowed_list: 0-1
voluntary_ctxt_switches: 274
nonvoluntary_ctxt_switches: 145
[Linux:/]$

I understand that the mmapped PCI hardware is not responding, so it is appropriate that only the kernel deals with the error.
The error is not propagated to user level because this is not a software fault, which is also why no core dump (kernel or user space) is generated.
The machine check exception handler in the kernel reports what the hardware failure was and which address/data is relevant (depending on the cause); this needs to be investigated further from the hardware side.
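If some user-space record of the fault is wanted even though no core file is produced, one option is a SIGBUS handler installed with SA_SIGINFO that logs the faulting address. This is only a hedged sketch, not taken from the program above, and it assumes the fault is actually delivered to user space as a signal; if the machine-check handler kills the task directly inside the kernel, the handler never runs:

/* Sketch: log the faulting address on SIGBUS before exiting. fprintf is not
 * strictly async-signal-safe, but is acceptable for a one-shot diagnostic. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)ctx;
    fprintf(stderr, "caught signal %d, fault address %p\n", sig, info->si_addr);
    _exit(EXIT_FAILURE);
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    raise(SIGBUS);               /* demonstration only */
    return 0;
}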

Related

cgroup and process memory usage mismatch

I have a memory cgroup with one process in it.
When I look at the rss memory usage of that cgroup (in memory.stat), it is much bigger than the rss memory of the process (from /proc/[pid]/status).
The only process pid in cgroup:
$ cat /sys/fs/cgroup/memory/karim/master/cgroup.procs
3744924
The memory limit in cgroup:
$ cat /sys/fs/cgroup/memory/karim/master/memory.limit_in_bytes
7340032000
rss of the cgroup is 990 MB:
$ cat /sys/fs/cgroup/memory/karim/master/memory.stat
cache 5990449152
rss 990224384
rss_huge 0
shmem 0
mapped_file 13516800
dirty 1081344
writeback 270336
pgpgin 4195191
pgpgout 2490628
pgfault 5264589
pgmajfault 0
inactive_anon 0
active_anon 990240768
inactive_file 5862830080
active_file 127021056
unevictable 0
hierarchical_memory_limit 7340032000
total_cache 5990449152
total_rss 990224384
total_rss_huge 0
total_shmem 0
total_mapped_file 13516800
total_dirty 1081344
total_writeback 270336
total_pgpgin 4195191
total_pgpgout 2490628
total_pgfault 5264589
total_pgmajfault 0
total_inactive_anon 0
total_active_anon 990240768
total_inactive_file 5862830080
total_active_file 127021056
total_unevictable 0
rss of the process is 165 MB:
$ cat /proc/3744924/status
Name: [main] /h
Umask: 0002
State: S (sleeping)
Tgid: 3744924
Ngid: 0
Pid: 3744924
PPid: 3744912
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1001 1001 1001 1001
FDSize: 256
Groups: 1000 1001
NStgid: 3744924
NSpid: 3744924
NSpgid: 3744912
NSsid: 45028
VmPeak: 2149068 kB
VmSize: 2088876 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 245352 kB
VmRSS: 198964 kB
RssAnon: 165248 kB
RssFile: 33660 kB
RssShmem: 56 kB
VmData: 575400 kB
VmStk: 132 kB
VmExe: 3048 kB
VmLib: 19180 kB
VmPTE: 1152 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 17
SigQ: 0/241014
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000001001000
SigCgt: 0000000180000002
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 94902
nonvoluntary_ctxt_switches: 1903
Why is there such a big difference?
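As a first diagnostic step it can help to compare like with like: sum VmRSS over every PID listed in cgroup.procs and set that against the rss counter in memory.stat. The sketch below is illustrative only; it assumes the cgroup v1 memory controller and reuses the path from the question:

/* Illustrative sketch: sum VmRSS (in kB) over all PIDs in cgroup.procs. */
#include <stdio.h>

int main(void)
{
    const char *procs = "/sys/fs/cgroup/memory/karim/master/cgroup.procs";
    FILE *fp = fopen(procs, "r");
    long pid, total_kb = 0;

    if (!fp) { perror(procs); return 1; }

    while (fscanf(fp, "%ld", &pid) == 1) {
        char path[64], line[256];
        FILE *st;
        long kb;

        snprintf(path, sizeof(path), "/proc/%ld/status", pid);
        st = fopen(path, "r");
        if (!st)
            continue;                     /* task may have exited */
        while (fgets(line, sizeof(line), st)) {
            if (sscanf(line, "VmRSS: %ld kB", &kb) == 1) {
                total_kb += kb;
                break;
            }
        }
        fclose(st);
    }
    fclose(fp);

    printf("sum of VmRSS over cgroup.procs: %ld kB\n", total_kb);
    return 0;
}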

Kernel Panic after SMP Implementation - Attempted to kill init

I am working on implementing SMP support in the Linux kernel for the Marvell PXA2128 ARM SoC, using Linus Torvalds' mainline kernel, version 3.5, as the base. I have added SMP support and can boot with the second core, but sometimes the kernel crashes with an "Attempted to kill init" message, i.e. the init process of the initramfs dies somehow, and I don't know why. I suspected the second core's L1 cache was corrupted, so I invalidate it before the core enters kernel execution. The kernel crash log looks like this:
[ 16.413024] Freeing init memory: 268K
[ 16.658111] tmpfs: No value for mount option 'strictatime'
[ 16.809997] scsi 0:0:0:0: Direct-Access SanDisk Cruzer Blade 1.27 PQ: 0 ANSI: 6
[ 16.827545] tmpfs: No value for mount option 'strictatime'
[ 16.972473] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 16.972930] sd 0:0:0:0: [sda] 15330304 512-byte logical blocks: (7.84 GB/7.30 GiB)
[ 16.974487] sd 0:0:0:0: [sda] Write Protect is off
[ 16.976104] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[ 17.369537] sda: sda1 sda2
[ 17.377258] sd 0:0:0:0: [sda] Attached SCSI removable disk
[ 17.602966] tmpfs: No value for mount option 'strictatime'
[ 18.074981] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[ 18.074981]
[ 18.222442] [<c001691c>] (unwind_backtrace+0x0/0x128) from [<c046beb0>] (dump_stack+0x20/0x24)
[ 18.301177] [<c046beb0>] (dump_stack+0x20/0x24) from [<c046bfcc>] (panic+0x94/0x1d4)
[ 18.378265] [<c046bfcc>] (panic+0x94/0x1d4) from [<c002ad08>] (do_exit+0x390/0x7ac)
[ 18.454956] [<c002ad08>] (do_exit+0x390/0x7ac) from [<c002b37c>] (do_group_exit+0x0/0xc4)
[ 18.533111] CPU0: stopping
[ 18.605621] [<c001691c>] (unwind_backtrace+0x0/0x128) from [<c046beb0>] (dump_stack+0x20/0x24)
[ 18.687225] [<c046beb0>] (dump_stack+0x20/0x24) from [<c00145ac>] (handle_IPI+0x104/0x174)
[ 18.770141] [<c00145ac>] (handle_IPI+0x104/0x174) from [<c0008590>] (gic_handle_irq+0x60/0x68)
[ 18.854888] [<c0008590>] (gic_handle_irq+0x60/0x68) from [<c000e6c0>] (__irq_svc+0x40/0x70)
[ 18.940093] Exception stack(0xc0685f38 to 0xc0685f80)
[ 19.022155] 5f20: c06c4c28 a0000093
[ 19.108398] 5f40: 00000001 60400100 c0684000 c06c48c8 c047560c c0691438 c177a080 562f5842
[ 19.188018] SMP: failed to stop secondary CPUs
[ 19.275970] 5f60: 00000000 c0685f9c c0685f40 c0685f80 c0020668 c0010068 60000013 ffffffff
[ 19.363708] [<c000e6c0>] (__irq_svc+0x40/0x70) from [<c0010068>] (cpu_idle+0x94/0xdc)
[ 19.450805] [<c0010068>] (cpu_idle+0x94/0xdc) from [<c0462d24>] (rest_init+0x7c/0x94)
[ 19.537292] [<c0462d24>] (rest_init+0x7c/0x94) from [<c063f89c>] (start_kernel+0x328/0x380)
When the kernel does boot successfully, the output of cat /proc/interrupts looks like this:
bash-4.2# cat /proc/interrupts
CPU0 CPU1
39: 101 0 GIC pxa_i2c-i2c
45: 0 26692 GIC timer0
46: 26578 0 GIC timer1
52: 27 0 GIC olpc-ec-1.75
58: 0 0 GIC mmp-vmeta
60: 1142 0 GIC UART3
71: 0 0 GIC mmc2
72: 0 0 GIC olpc-kbd
73: 25978 0 GIC pxa168fb-dss
76: 2282 0 GIC ehci_hcd:usb1
84: 246 0 GIC mmc0
85: 9820 0 GIC mmc1
132: 0 0 ICU rtc Alrm
133: 0 0 ICU rtc 1Hz
137: 0 0 ICU galcore interrupt service
139: 88 0 ICU galcore interrupt service for 2D
141: 20 0 ICU pxa_i2c-i2c
143: 0 0 ICU pxa_i2c-i2c
145: 328 0 ICU pxa_i2c-i2c
186: 52 0 ICU mmc3
252: 0 0 GPIO hsdet-gpio
253: 0 0 GPIO hdmi-hpd
270: 0 0 GPIO d4280000.sdhci cd
278: 0 0 GPIO sdhci_wakeup_irq, sdhci_wakeup_irq, sdhci_wakeup_irq, sdhci_wakeup_irq
335: 0 0 GPIO micdet-gpio
365: 0 0 GPIO DCON
368: 0 0 GPIO olpc-switch-1.75-lid
369: 0 0 GPIO olpc-switch-1.75-ebook
393: 0 0 GPIO olpc-ec-1.75-wake
IPI0: 0 0 Timer broadcast interrupts
IPI1: 3508 4069 Rescheduling interrupts
IPI2: 0 0 Function call interrupts
IPI3: 7 429 Single function call interrupts
IPI4: 0 0 CPU stop interrupts
Err: 0
bash-4.2#
Please give me a clue about what important step I am missing in my SMP implementation.

Valgrind support for TMS320DM365

We have developed an application using ipnc_rdk version 5.0 for the TMS320DM365.
We have a memory leak in the application.
We cross-compiled Valgrind for ARM using the arm-arago toolchain.
But when we run Valgrind on the device, it shows an illegal instruction error.
We saw a couple of posts saying that Valgrind doesn't support ARMv5.
We found Valgrind patches for ARMv5 at the link below, but they fail to apply to the Valgrind source.
https://bugs.kde.org/show_bug.cgi?id=248998
We tried applying the patches to Valgrind versions 3.9.0, 3.8.1 and 3.2.1.
Is there a Valgrind version for ARMv5?
If not, how can we debug the memory leak in our application?
I also ran the top utility to check memory usage, and memory usage kept increasing. The kernel also invokes the OOM killer. Please find the attached log, which gives the top utility output and the out-of-memory log.
Top utility output just before oom killer was called:
shrd, 8K buff, 8352K cached
CPU: 86% usr 13% sys 0% nic 0% idle 0% io 0% irq 0% sirq
Load average: 2.86 1.31 0.51 4/54 690
PID PPID USER STAT VSZ %MEM %CPU COMMAND
663 655 root R 59184 132% 97% /opt/ipnc/stillImage.out
668 655 root R 3080 7% 3% top
1 0 root S 1624 4% 0% init [5]
269 2 root SW 0 0% 0% [kswapd0]
627 1 root S 6152 14% 0% ./boa -c /etc
655 1 root S 3080 7% 0% -sh
529 1 root S 2976 7% 0% /usr/sbin/inetd
651 1 root S 2964 7% 0% /sbin/syslogd -n -C64 -m 20
653 1 root S 2900 6% 0% /sbin/klogd -n
646 1 root S 2900 6% 0% /usr/sbin/telnetd
634 1 root S 2704 6% 0% avahi-daemon: running [10.local]
637 1 root S 1940 4% 0% /usr/sbin/avahi-dnsconfd -D
630 1 root S 1764 4% 0% avahi-autoipd: [eth0] bound 169.254.11
631 630 root S 1764 4% 0% avahi-autoipd: [eth0] callout dispatch
622 2 root SW 0 0% 0% [flush-1:0]
620 2 root SW 0 0% 0% [flush-ubifs_0_0]
5 2 root SW 0 0% 0% [kworker/u:0]
4 2 root SW 0 0% 0% [kworker/0:0]
3 2 root SW 0 0% 0% [ksoftirqd/0]
2 0 root SW 0 0% 0% [kthreadd]
stillImage.out invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
[ 96.003955] Backtrace:
[ 96.006595] Function entered at [<c0030504>] from [<c03330c8>]
[ 96.012605] r7:00000042 r6:00000000 r5:000200da r4:c1f06000
[ 96.018703] Function entered at [<c03330b0>] from [<c007cad0>]
[ 96.024712] Function entered at [<c007ca58>] from [<c007cf50>]
[ 96.030710] Function entered at [<c007cf00>] from [<c007d478>]
[ 96.036727] Function entered at [<c007d1b4>] from [<c00809bc>]
[ 96.042602] Function entered at [<c008056c>] from [<c007a5d0>]
[ 96.048656] Function entered at [<c007a574>] from [<c0132024>]
[ 96.054670] Function entered at [<c0131fb8>] from [<c00794bc>]
[ 96.060669] Function entered at [<c00793cc>] from [<c007b6c8>]
[ 96.066683] Function entered at [<c007b220>] from [<c007b78c>]
[ 96.072694] Function entered at [<c007b718>] from [<c01317a8>]
[ 96.078718] Function entered at [<c0131620>] from [<c00a2b34>]
[ 96.084813] Function entered at [<c00a2a90>] from [<c00a35e8>]
[ 96.090777] r8:400b4000 r7:c1f07f70 r6:400b4000 r5:00001000 r4:c1d85680
[ 96.098026] Function entered at [<c00a3534>] from [<c00a3738>]
[ 96.103955] r8:400b4000 r7:00001000 r6:c1d85680 r5:00000000 r4:0004e000
[ 96.111222] Function entered at [<c00a36f4>] from [<c002d020>]
[ 96.117225] r8:c002d1a4 r7:00000004 r6:00351d08 r5:400b4000 r4:00001000
[ 96.124460] Mem-info:
[ 96.126777] DMA per-cpu:
[ 96.129412] CPU 0: hi: 6, btch: 1 usd: 4
[ 96.134472] active_anon:2952 inactive_anon:179 isolated_anon:0
[ 96.134504] active_file:95 inactive_file:192 isolated_file:0
[ 96.134531] unevictable:0 dirty:191 writeback:0 unstable:0
[ 96.134557] free:220 slab_reclaimable:86 slab_unreclaimable:437
[ 96.134584] mapped:154 shmem:1911 pagetables:71 bounce:0
[ 96.163348] DMA free:880kB min:880kB low:1100kB high:1320kB active_anon:11808kB inactive_anon:716kB active_file:380kB inactive_file:768kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:48768kB mlocked:0kB dirty:764kB writeback:0kB mapped:616kB shmem:7644kB slab_reclaimable:344kB slab_unreclaimable:1748kB kernel_stack:432kB pagetables:284kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1760 all_unreclaimable? yes
[ 96.202321] lowmem_reserve[]: 0 0 0
[ 96.206059] DMA: 6*4kB 7*8kB 8*16kB 1*32kB 4*64kB 3*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 880kB
[ 96.216739] 2198 total pagecache pages
[ 96.224107] 12288 pages of RAM
[ 96.227313] 317 free pages
[ 96.230128] 1096 reserved pages
[ 96.233410] 389 slab pages
[ 96.236289] 187 pages shared
[ 96.239319] 0 pages swap cached
[ 96.242494] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 96.250121] [ 529] 0 529 744 20 0 0 0 inetd
[ 96.257877] [ 627] 0 627 1538 22 0 0 0 boa
[ 96.265408] [ 630] 0 630 441 20 0 0 0 avahi-autoipd
[ 96.273789] [ 631] 0 631 441 15 0 0 0 avahi-autoipd
[ 96.282186] [ 634] 0 634 676 54 0 0 0 avahi-daemon
[ 96.290549] [ 637] 0 637 485 22 0 0 0 avahi-dnsconfd
[ 96.299030] [ 646] 0 646 725 15 0 0 0 telnetd
[ 96.306996] [ 651] 0 651 741 34 0 0 0 syslogd
[ 96.314876] [ 653] 0 653 725 22 0 0 0 klogd
[ 96.322563] [ 655] 0 655 770 29 0 0 0 sh
[ 96.330018] [ 663] 0 663 14796 1080 0 0 0 stillImage.out
[ 96.338552] [ 668] 0 668 770 54 0 0 0 top
[ 96.346062] Out of memory: Kill process 663 (stillImage.out) score 66 or sacrifice child
[ 96.354411] Killed process 663 (stillImage.out) total-vm:59184kB, anon-rss:3824kB, file-rss:496kB
Thanks and Regards,
Arpitha
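One possible fallback when Valgrind cannot run on ARMv5, assuming the target libc is glibc (which ships mtrace), is to bracket the suspect code with mtrace()/muntrace() and post-process the resulting log on the host. This is only a sketch; the log path is illustrative:

/* Sketch (assumes glibc): record every malloc/free between mtrace() and
 * muntrace() into the file named by MALLOC_TRACE. */
#include <mcheck.h>
#include <stdlib.h>

int main(void)
{
    /* Normally set in the environment before starting the program:
     *   MALLOC_TRACE=/tmp/malloc.log ./stillImage.out                */
    setenv("MALLOC_TRACE", "/tmp/malloc.log", 1);

    mtrace();                        /* start logging allocations */

    void *leak = malloc(4096);       /* deliberately never freed */
    (void)leak;

    muntrace();                      /* stop logging */
    return 0;
}

On the host, something like "mtrace stillImage.out /tmp/malloc.log" then lists the blocks that were never freed.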

SIGBUS while doing memcpy from an mmapped buffer which is in RAM as identified by mincore

I am mmapping a block as:
mapAddr = mmap((void*) 0, curMapSize, PROT_NONE, MAP_LOCKED|MAP_SHARED, fd, curMapOffset);
If this does not fail (mapAddr != MAP_FAILED), I query mincore as:
err = mincore((char*) mapAddr, pageSize, &mincoreRet);
to find out whether it is in RAM. If it is in RAM (err == 0 && (mincoreRet & 0x01)), I mmap it again for reading as:
copyAddr = mmap((void*) 0, curMapSize, PROT_READ, MAP_LOCKED|MAP_SHARED, fd, curMapOffset);
and then I try to copy it out to my buffer as:
memcpy(data, copyAddr, pageSize);
Everything works fine, except that once in a while I get SIGBUS in the last memcpy. When I check /proc/<pid>/smaps at the time of the failure, I notice that both the Rss and Locked fields are 0, as listed below:
7f4a4c118000-7f4a4c119000 r--s 00326000 00:17 6 <file name>
Size: 4 kB
Rss: 0 kB
Pss: 0 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 0 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
Any thoughts? This is happening on Ubuntu 12.04 with kernel version 3.5.0-36.
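For reference, here is the sequence from the question consolidated into one compilable sketch; fd, curMapOffset, curMapSize and data are placeholders and error handling is trimmed. One detail the sketch makes explicit is that mincore() reports residency only at the instant it is called, before the second mapping and the memcpy:

/* Consolidated sketch of the mmap/mincore/mmap/memcpy sequence. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static int copy_if_resident(int fd, off_t curMapOffset, size_t curMapSize,
                            void *data)
{
    size_t pageSize = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char mincoreRet = 0;
    int ret = -1;

    /* First mapping: only used to ask the kernel about residency. */
    void *mapAddr = mmap(NULL, curMapSize, PROT_NONE, MAP_LOCKED | MAP_SHARED,
                         fd, curMapOffset);
    if (mapAddr == MAP_FAILED)
        return -1;

    if (mincore((char *)mapAddr, pageSize, &mincoreRet) == 0 &&
        (mincoreRet & 0x01)) {
        /* Second mapping: readable, used for the actual copy. */
        void *copyAddr = mmap(NULL, curMapSize, PROT_READ,
                              MAP_LOCKED | MAP_SHARED, fd, curMapOffset);
        if (copyAddr != MAP_FAILED) {
            memcpy(data, copyAddr, pageSize);    /* SIGBUS is reported here */
            munmap(copyAddr, curMapSize);
            ret = 0;
        }
    }

    munmap(mapAddr, curMapSize);
    return ret;
}

int main(int argc, char **argv)
{
    char data[4096];
    int fd;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        return 1;

    /* Offset and size are illustrative. */
    printf("copy_if_resident returned %d\n",
           copy_if_resident(fd, 0, 4096, data));
    close(fd);
    return 0;
}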

VmSize = physical memory + swap?

I have a little question regarding VmSize: according to the documentation, it is supposed to be the application's memory usage.
However, on my system:
VmSize = physical memory + swap
VmHWM seems closer to what the application is actually using.
[root#sun ~]# free -m
total used free shared buffers cached
Mem: 12012 9223 2788 0 613 1175
-/+ buffers/cache: 7434 4577
Swap: 3967 0 3967
[root#sun ~]# cat /proc/8268/status
Name: mysqld
State: S (sleeping)
Tgid: 8268
Pid: 8268
PPid: 1
TracerPid: 0
Uid: 89 89 89 89
Gid: 89 89 89 89
FDSize: 512
Groups: 89
VmPeak: 15878128 kB
VmSize: 15878128 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 7036312 kB
VmRSS: 7036312 kB
VmData: 15839272 kB
VmStk: 136 kB
VmExe: 10744 kB
VmLib: 6356 kB
VmPTE: 16208 kB
VmSwap: 0 kB
Threads: 265
SigQ: 0/96048
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000087007
SigIgn: 0000000000001000
SigCgt: 00000001800066e9
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000001fffffffff
Seccomp: 0
Cpus_allowed: fff
Cpus_allowed_list: 0-11
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 2567
nonvoluntary_ctxt_switches: 77
Any idea why?
I am trying to get the memory usage of this particular application, but this result doesn't really make sense.
Thanks.
VmSize is the "address space" that the process has in use: the number of available addresses. These addresses do not have to have any physical memory attached to them. (Attached physical memory is the RSS figure.)
You can verify this by allocating a chunk of memory with p = malloc(4 * 1024 * 1024); and not doing anything with *p: VmSize will increase by 1K pages (4 MiB), but the RSS will stay (about) the same. (Your program has more addressable memory, but it does not address it, so the memory does not need to be attached.)
VmSize is the sum of all mapped memory (/proc/pid/maps)
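The malloc experiment described above can be made concrete with a small sketch that prints VmSize and VmRSS from /proc/self/status before the allocation, after the allocation, and after touching the pages:

/* VmSize grows as soon as the 4 MiB is allocated; VmRSS only grows once the
 * pages are actually touched. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void print_vm(const char *tag)
{
    char line[256];
    FILE *fp = fopen("/proc/self/status", "r");

    if (!fp)
        return;
    while (fgets(line, sizeof(line), fp))
        if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
            printf("%s %s", tag, line);
    fclose(fp);
}

int main(void)
{
    print_vm("before malloc:");

    char *p = malloc(4 * 1024 * 1024);
    if (!p)
        return 1;
    print_vm("after malloc: ");

    memset(p, 0, 4 * 1024 * 1024);   /* touch every page */
    print_vm("after touch:  ");

    free(p);
    return 0;
}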
