Bluetooth on Raspberry Pi 4 without Linux

I'm working on a non-Linux OS and am trying to enable Bluetooth on a Raspberry Pi 4.
I already have the necessary drivers: GPIO, UART (PL011 and mini-UART), mailbox, and expgpio through that mailbox.
To enable Bluetooth I take the following steps:
I configure the GPIOs as described in Linux's DTS so that UART0 is connected to the BT/WiFi chip;
I set the BT_ON expgpio to 1 through the mailbox (it is set by default, I just make sure);
I write a command to UART0, and nothing happens. The UART driver returns success, but reading the command's response always times out.
I suspect I'm missing a step in the initialization procedure, but as far as I can see in the Linux log, there is only a firmware download, and many commands, such as reading the device name, can be executed before it.
Maybe I forgot to enable a clock source or a regulator, but I have no idea where to start looking.
Here is part of the Raspbian kernel log with additional debug info:
Jan 28 05:17:13 raspberrypi kernel: [ 15.321055] Bluetooth: Core ver 2.22
Jan 28 05:17:13 raspberrypi kernel: [ 15.321093] device class 'bluetooth': registering
Jan 28 05:17:13 raspberrypi kernel: [ 15.321149] NET: Registered PF_BLUETOOTH protocol family
Jan 28 05:17:13 raspberrypi kernel: [ 15.321158] Bluetooth: HCI device and connection manager initialized
Jan 28 05:17:13 raspberrypi kernel: [ 15.321176] Bluetooth: HCI socket layer initialized
Jan 28 05:17:13 raspberrypi kernel: [ 15.321189] Bluetooth: L2CAP socket layer initialized
Jan 28 05:17:13 raspberrypi kernel: [ 15.321208] Bluetooth: SCO socket layer initialized
Jan 28 05:17:13 raspberrypi kernel: [ 15.335356] Bluetooth: HCI UART driver ver 2.3
Jan 28 05:17:13 raspberrypi kernel: [ 15.335377] Bluetooth: HCI UART protocol H4 registered at id 0
Jan 28 05:17:13 raspberrypi kernel: [ 15.335387] bus: 'serial': add driver hci_uart_h5
Jan 28 05:17:13 raspberrypi kernel: [ 15.335456] Bluetooth: HCI UART protocol Three-wire (H5) registered at id 2
Jan 28 05:17:13 raspberrypi kernel: [ 15.335480] bus: 'platform': add driver hci_bcm
Jan 28 05:17:13 raspberrypi kernel: [ 15.335641] bus: 'serial': add driver hci_uart_bcm
Jan 28 05:17:13 raspberrypi kernel: [ 15.335679] Bluetooth: HCI UART protocol Broadcom registered at id 7
Jan 28 05:17:13 raspberrypi kernel: [ 15.337922] Bluetooth: TTY name ttyAMA0
Jan 28 05:17:13 raspberrypi kernel: [ 15.338543] Bluetooth: hci_uart_register_dev
Jan 28 05:17:13 raspberrypi kernel: [ 15.338599] device: 'hci0': device_add
Jan 28 05:17:13 raspberrypi kernel: [ 15.345358] device: 'rfkill1': device_add
Jan 28 05:17:13 raspberrypi kernel: [ 15.345497] Bluetooth: HCI UART protocol set. Proto H4; id 0
Jan 28 05:17:13 raspberrypi kernel: [ 15.345530] Bluetooth: hci_uart_open hci0 5d898f04
Jan 28 05:17:13 raspberrypi kernel: [ 15.345543] Bluetooth: hci_uart_setup: START
Jan 28 05:17:13 raspberrypi kernel: [ 15.345550] Bluetooth: hci_uart_setup: init speed = 0
Jan 28 05:17:13 raspberrypi kernel: [ 15.345557] Bluetooth: hci_uart_setup: oper speed = 0
Jan 28 05:17:13 raspberrypi kernel: [ 15.352975] Bluetooth: hci0: type 1 len 3
Jan 28 05:17:13 raspberrypi kernel: [ 15.353010] Bluetooth skb: 00000000: 01 03 10 00
Jan 28 05:17:13 raspberrypi kernel: [ 15.353026] Bluetooth: hci_uart_write_work written 4
Jan 28 05:17:13 raspberrypi kernel: [ 15.353760] Bluetooth: hci0: type 1 len 3
Jan 28 05:17:13 raspberrypi kernel: [ 15.353826] Bluetooth skb: 00000000: 01 01 10 00
....
a lot of lines
....
Jan 28 05:17:13 raspberrypi btuart[479]: bcm43xx_init
Jan 28 05:17:13 raspberrypi btuart[479]: Flash firmware /lib/firmware/brcm/BCM4345C0.hcd
Jan 28 05:17:13 raspberrypi btuart[479]: Set Controller UART speed to 3000000 bit/s
Jan 28 05:17:13 raspberrypi btuart[479]: Device setup complete
Jan 28 05:17:13 raspberrypi systemd[1]: Starting Load/Save RF Kill Switch Status...
Jan 28 05:17:13 raspberrypi systemd[1]: Started Configure Bluetooth Modems connected by UART.
Jan 28 05:17:13 raspberrypi systemd[1]: Reached target Multi-User System.
Jan 28 05:17:13 raspberrypi systemd[1]: Reached target Graphical Interface.
Jan 28 05:17:13 raspberrypi systemd[1]: Starting Update UTMP about System Runlevel Changes...
Jan 28 05:17:13 raspberrypi systemd[625]: Reached target Bluetooth.
Jan 28 05:17:13 raspberrypi systemd[1]: Started Load/Save RF Kill Switch Status.
Jan 28 05:17:13 raspberrypi systemd[1]: Created slice system-bthelper.slice.
Jan 28 05:17:13 raspberrypi systemd[1]: Starting Raspberry Pi bluetooth helper...
Jan 28 05:17:13 raspberrypi systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Jan 28 05:17:13 raspberrypi systemd[1]: Finished Update UTMP about System Runlevel Changes.
Jan 28 05:17:13 raspberrypi bthelper[774]: Raspberry Pi BDADDR already set
Jan 28 05:17:13 raspberrypi systemd[1]: Finished Raspberry Pi bluetooth helper.
Jan 28 05:17:13 raspberrypi kernel: [ 15.490868] Bluetooth: hci0: type 1 len 8
Jan 28 05:17:13 raspberrypi kernel: [ 15.490909] Bluetooth skb: 00000000: 01 1c fc 05 01 02 00 01 01
Jan 28 05:17:13 raspberrypi kernel: [ 15.490930] Bluetooth: hci_uart_write_work written 9
Thank you in advance
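For reference, the hex dumps in the log above are HCI commands in H4 framing: a packet indicator byte (0x01 for a command), a 16-bit little-endian opcode built from OGF and OCF, a parameter length byte, and then the parameters. Below is a minimal Python sketch of that framing; the helper name h4_command is hypothetical and only meant to show how bytes such as 01 03 10 00 are put together.

```python
import struct

H4_COMMAND = 0x01  # H4 packet indicator for an HCI command


def h4_command(ogf: int, ocf: int, params: bytes = b"") -> bytes:
    """Frame an HCI command for the H4 (UART) transport."""
    opcode = (ogf << 10) | ocf  # 6-bit OGF in the top bits, 10-bit OCF below
    # indicator, little-endian opcode, parameter length, then the parameters
    return struct.pack("<BHB", H4_COMMAND, opcode, len(params)) + params


# First command in the log: OGF 0x04 (informational parameters), OCF 0x0003
print(h4_command(0x04, 0x0003).hex(" "))  # 01 03 10 00
# Vendor-specific command from the end of the log: OGF 0x3f, OCF 0x01c
print(h4_command(0x3F, 0x01C, bytes.fromhex("0102000101")).hex(" "))
```

Sending one of these short commands and waiting for the matching event is a simple way to check that the UART link to the controller works before attempting the firmware download.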

For the H4 protocol, a UART with hardware flow control (RTS/CTS) must be used. Adding hardware flow control support to the PL011 UART driver resolved the problem.
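As a rough illustration of what hardware flow control support means on the PL011 side, here is a small Python sketch that computes the control register (UARTCR, offset 0x30) value with and without the RTSEn/CTSEn bits. The bit positions come from the ARM PL011 reference manual; the function name pl011_cr is hypothetical, and a real driver would write this value to the register in whatever language the OS uses. I assume the CTS/RTS pins also have to be routed to UART0 in the GPIO setup, the same way as TXD/RXD; the Linux device tree is the place to check which pins and alternate functions are used.

```python
# UARTCR bit positions, taken from the ARM PL011 technical reference manual
UARTEN = 1 << 0   # UART enable
TXE = 1 << 8      # transmit enable
RXE = 1 << 9      # receive enable
RTSEN = 1 << 14   # hardware RTS flow control enable
CTSEN = 1 << 15   # hardware CTS flow control enable


def pl011_cr(hw_flow_control: bool) -> int:
    """Return the UARTCR value for an enabled UART, optionally with RTS/CTS."""
    cr = UARTEN | TXE | RXE
    if hw_flow_control:
        cr |= RTSEN | CTSEN
    return cr


print(hex(pl011_cr(False)))  # 0x301
print(hex(pl011_cr(True)))   # 0xc301
```

One plausible explanation for the timeouts in the question is that the controller waits for the host's RTS line before it sends anything, so commands go out but no response ever arrives.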

ElasticSearch docker container remains in Exited status

I recently installed Docker and ElasticSearch 7.17.6. docker-compose up -d worked fine,
but when I try to bring up the ElasticSearch container, its status remains Exited (1) and the container won't start.
Command to start: sudo docker container start <container-ID>
The output of sudo docker logs <Container-ID> shows the exception below:
Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/elasticsearch/config/jvm.options
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
at java.base/java.nio.file.Files.newByteChannel(Files.java:380)
at java.base/java.nio.file.Files.newByteChannel(Files.java:432)
at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422)
at java.base/java.nio.file.Files.newInputStream(Files.java:160)
at org.elasticsearch.tools.launchers.JvmOptionsParser.readJvmOptionsFiles(JvmOptionsParser.java:168)
at org.elasticsearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:124)
at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:86)
The /var/log/messages file shows the errors below:
msg="error detaching from network es_elastic: could not find network attachment for container <Container-ID> to network es_elastic"
Nov 1 13:11:07 ES11 dockerd: time="2022-11-01T13:11:07.567246932-04:00" level=info msg="initialized VXLAN UDP port to 4789 "
Nov 1 13:11:07 ES11 kernel: br0: port 2(<Device_name2>) entered disabled state
Nov 1 13:11:07 ES11 kernel: br0: port 1(<Device_name1>) entered disabled state
Nov 1 13:11:07 ES11 kernel: ov-1-f: renamed from br0
Nov 1 13:11:07 ES11 kernel: device <Device_name2> left promiscuous mode
Nov 1 13:11:07 ES11 kernel: ov-1-f: port 2(<Device_name2>) entered disabled state
Nov 1 13:11:07 ES11 kernel: device <Device_name1> left promiscuous mode
Nov 1 13:11:07 ES11 kernel: ov-1-f: port 1(<Device_name1>) entered disabled state
Nov 1 13:11:07 ES11 kernel: vx-1-f: renamed from <Device_name1>
Nov 1 13:11:07 ES11 kernel: : renamed from <Device_name2>
Nov 1 13:11:07 ES11 avahi-daemon[891]: Withdrawing workstation service for vx-1-f.
Nov 1 13:11:07 ES11 NetworkManager[999]: <info> [ID.7289] manager: (): new Veth device (/org/freedesktop/NetworkManager/Devices/144)
Nov 1 13:11:07 ES11 kernel: : renamed from eth0
Nov 1 13:11:07 ES11 NetworkManager[999]: <info> [ID.7693] manager: (): new Veth device (/org/freedesktop/NetworkManager/Devices/145)
Nov 1 13:11:07 ES11 dockerd: time="2022-11-01T13:11:07.ID-04:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object
The jvm.options file was missing from the config directory at /u01/es11/config. I'm unsure why the error shows a different location and why it worked on the other nodes, es12 and es13, but the containers started fine after I put the missing file in place.
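A likely reason for the different location, assuming /u01/es11/config is bind-mounted into the container: /usr/share/elasticsearch/config is the config path inside the official image, so the error reports the container-side path of the same directory. A small pre-flight check along these lines could catch the problem before starting the container. This is only a sketch; the function name and the list of required files are my assumptions, based on the files a standard Elasticsearch config directory ships with.

```python
from pathlib import Path
import tempfile

# Files a standard Elasticsearch config directory ships with (assumed list)
REQUIRED = ("elasticsearch.yml", "jvm.options", "log4j2.properties")


def missing_config_files(config_dir):
    """Return the required config files that are absent from config_dir."""
    base = Path(config_dir)
    return [name for name in REQUIRED if not (base / name).is_file()]


# Demo on a throwaway directory that only contains elasticsearch.yml
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "elasticsearch.yml").touch()
    demo_missing = missing_config_files(demo_dir)
print(demo_missing)  # ['jvm.options', 'log4j2.properties']
```

On the affected node this would be called as missing_config_files("/u01/es11/config") and compared with the result on es12 and es13.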

Pop OS / Dell XPS 9310 -- battery drained overnight on suspend

My laptop is suspending on lid close successfully, but if I don't have it plugged in overnight, the battery is drained by the morning.
I'm including logs from a short suspend I ran just now. I can suspend it overnight and look at the logs afterward, but is there anything immediately suspicious here? I validated that all suspend-related targets are loaded via sudo systemctl status sleep.target suspend.target hibernate.target hybrid-sleep.target
Apr 11 22:09:29 pop-os systemd[1]: Reached target Sleep.
Apr 11 22:09:29 pop-os systemd[1]: Starting Suspend...
Apr 11 22:09:29 pop-os kernel: [ 44.986190] PM: suspend entry (s2idle)
Apr 11 22:09:29 pop-os systemd-sleep[3730]: Suspending system...
Apr 11 22:09:29 pop-os kernel: [ 44.991600] Filesystems sync: 0.005 seconds
Apr 11 22:09:57 pop-os kernel: [ 44.994638] Freezing user space processes ... (elapsed 0.002 seconds) done.
Apr 11 22:09:57 pop-os kernel: [ 44.996920] OOM killer disabled.
Apr 11 22:09:57 pop-os kernel: [ 44.996921] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Apr 11 22:09:57 pop-os kernel: [ 44.998055] printk: Suspending console(s) (use no_console_suspend to debug)
Apr 11 22:09:57 pop-os kernel: [ 45.315954] psmouse serio1: Failed to disable mouse on isa0060/serio1
Apr 11 22:09:57 pop-os kernel: [ 46.377203] ACPI: EC: interrupt blocked
Apr 11 22:09:57 pop-os kernel: [ 72.605807] ACPI: EC: interrupt unblocked
Apr 11 22:09:57 pop-os kernel: [ 73.107660] pcieport 10000:e0:06.0: can't derive routing for PCI INT A
Apr 11 22:09:57 pop-os kernel: [ 73.107666] nvme 10000:e1:00.0: PCI INT A: no GSI
Apr 11 22:09:57 pop-os kernel: [ 73.114494] nvme nvme0: 8/0/0 default/read/poll queues
Apr 11 22:09:57 pop-os kernel: [ 73.363725] OOM killer enabled.
Apr 11 22:09:57 pop-os kernel: [ 73.363728] Restarting tasks ...
Apr 11 22:09:57 pop-os kernel: [ 73.364154] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
Apr 11 22:09:57 pop-os kernel: [ 73.367166] done.
Apr 11 22:09:57 pop-os touchegg[1000]: libinput error: event0 - Lid Switch: client bug: event processing lagging behind by 1279ms, your system is too slow
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): EDID vendor "SHP", prod id 5370
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Printing DDC gathered Modelines:
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Modeline "3840x2400"x0.0 592.50 3840 3888 3920 4000 2400 2403 2409 2469 -hsync -vsync (148.1 kHz eP)
Apr 11 22:09:57 pop-os /usr/libexec/gdm-x-session[1823]: (II) modeset(0): Modeline "3840x2400"x0.0 474.00 3840 3888 3920 4000 2400 2403 2409 2469 -hsync -vsync (118.5 kHz e)
Apr 11 22:09:57 pop-os systemd-sleep[3730]: System resumed.
Apr 11 22:09:57 pop-os bluetoothd[961]: Controller resume with wake event 0x0
Apr 11 22:09:57 pop-os kernel: [ 73.413202] PM: suspend exit
Apr 11 22:09:57 pop-os systemd[1]: systemd-suspend.service: Succeeded.
Apr 11 22:09:57 pop-os systemd[1]: Finished Suspend.
Apr 11 22:09:57 pop-os systemd[1]: Stopped target Sleep.
Apr 11 22:09:57 pop-os systemd[1]: Reached target Suspend.
Apr 11 22:09:57 pop-os systemd[1]: Stopped target Suspend.
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.3461] manager: sleep: wake requested (sleeping: yes enabled: yes)
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.3461] device (wlp113s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Apr 11 22:09:57 pop-os ModemManager[1079]: <info> [sleep-monitor] system is resuming
Apr 11 22:09:57 pop-os NetworkManager[968]: <info> [1649729397.4258] manager: NetworkManager state is now DISCONNECTED
The hardware on this system only supports s2idle sleep, and not deep sleep for less energy consumption (details on different sleep states here https://www.kernel.org/doc/Documentation/power/states.txt).
pop-os:$~ sudo cat /sys/power/mem_sleep
[s2idle]
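To compare machines or check the result after a BIOS change, the content of that file can be parsed with a few lines of Python. The kernel lists all available modes and puts the active one in square brackets. This is a minimal sketch; parse_mem_sleep is a hypothetical helper name.

```python
def parse_mem_sleep(text):
    """Return (available_modes, active_mode) from /sys/power/mem_sleep content."""
    modes, active = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the bracketed entry is the active mode
            active = token
        modes.append(token)
    return modes, active


print(parse_mem_sleep("[s2idle]"))       # (['s2idle'], 's2idle')
print(parse_mem_sleep("s2idle [deep]"))  # (['s2idle', 'deep'], 'deep')
```

If "deep" does not appear in the list at all, as in the output above, the firmware is not offering it to the kernel in the current configuration.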
I found this thread: https://www.dell.com/community/XPS/XPS-13-9310-Ubuntu-deep-sleep-missing/td-p/7734008 It suggests changing the disk management from RAID (Dell's default) to AHCI via the Dell BIOS.
So far this has worked as a solution! I've lost only 10% battery overnight, and can go 3 days idling in suspend without a charge.
(Before this, I did try enabling hibernate through these instructions from System76 https://support.system76.com/articles/enable-hibernation/. This does not work great, because the Killer wifi driver does not load on wake from hibernate.)
With hybrid suspend, the machine's state is saved to swap space and then suspend-to-RAM (aka sleep) is invoked. This keeps power usage to a minimum.
The reason for doing so: waking up from hibernate is slower than waking up from sleep. To make sure the system state is not lost, the state is written to swap space, and then sleep is invoked, which uses minimal power and does not shut the machine off. The state is also kept in RAM, so if the battery does not die, the machine wakes up from RAM, which is faster.
Read More : https://wiki.archlinux.org/title/Power_management/Suspend_and_hibernate
If you don't want your battery to drain, switch your lid-close action from sleep/suspend to hibernate. A hibernated machine is powered off, so it draws essentially no power. Follow the steps below.
$ grep HandleLidSwitch /etc/systemd/logind.conf
HandleLidSwitch=suspend
If the line is commented out, uncomment it by removing the "#" and change the option to hibernate.
HandleLidSwitch=hibernate
If you are new to Linux, you can use gedit to edit the file.
sudo gedit /etc/systemd/logind.conf
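If you would rather script the change than edit by hand, the edit described above can be sketched in Python as a pure text transformation. The function name set_lid_switch is hypothetical; it only rewrites the text you pass in, so reading and writing /etc/systemd/logind.conf (as root) is left to the caller.

```python
import re


def set_lid_switch(conf_text, action="hibernate"):
    """Uncomment (if needed) and set HandleLidSwitch= in logind.conf text."""
    # Matches "HandleLidSwitch=..." or "#HandleLidSwitch=...", but not
    # HandleLidSwitchDocked= or HandleLidSwitchExternalPower=
    pattern = re.compile(r"^#?[ \t]*HandleLidSwitch=.*$", re.MULTILINE)
    line = f"HandleLidSwitch={action}"
    if pattern.search(conf_text):
        return pattern.sub(line, conf_text, count=1)
    # Key not present at all: append it at the end of the file
    return conf_text.rstrip("\n") + "\n" + line + "\n"


sample = "[Login]\n#HandleLidSwitch=suspend\n#HandleLidSwitchDocked=ignore\n"
print(set_lid_switch(sample))
```

The new setting takes effect after a reboot (or after restarting systemd-logind), and hibernate itself still has to be working on the machine for this to help.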

Booting Arch gets stuck with "Reading supported features failed (-16)"

My new Arch installation always gets stuck after rebooting with the message [x.yz] Bluetooth: hci0: Reading supported features failed (-16). It does nothing until I switch to tty2 by pressing CTRL+ALT+F2 and then go back with CTRL+ALT+F1, without logging in on tty2 (by the way, I have installed GNOME with GDM). Bluetooth works well once I'm logged in...
Here are all journalctl -b output lines containing 'Bluetooth':
Mar 06 14:56:23 archBook kernel: Bluetooth: Core ver 2.22
Mar 06 14:56:23 archBook kernel: Bluetooth: HCI device and connection manager initialized
Mar 06 14:56:23 archBook kernel: Bluetooth: HCI socket layer initialized
Mar 06 14:56:23 archBook kernel: Bluetooth: L2CAP socket layer initialized
Mar 06 14:56:23 archBook kernel: Bluetooth: SCO socket layer initialized
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Bootloader revision 0.0 build 26 week 38 2015
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Device revision is 16
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Secure boot is enabled
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: OTP lock is enabled
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: API lock is enabled
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Debug lock is disabled
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Minimum firmware build 1 week 10 2014
Mar 06 14:56:23 archBook kernel: Bluetooth: hci0: Found device firmware: intel/ibt-12-16.sfi
Mar 06 14:56:23 archBook systemd[1]: Starting Bluetooth service...
Mar 06 14:56:23 archBook bluetoothd[392]: Bluetooth daemon 5.56
Mar 06 14:56:23 archBook systemd[1]: Started Bluetooth service.
Mar 06 14:56:23 archBook systemd[1]: Reached target Bluetooth.
Mar 06 14:56:23 archBook kernel: Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Mar 06 14:56:23 archBook kernel: Bluetooth: BNEP filters: protocol multicast
Mar 06 14:56:23 archBook kernel: Bluetooth: BNEP socket layer initialized
Mar 06 14:56:23 archBook bluetoothd[392]: Bluetooth management interface 1.19 initialized
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Waiting for firmware download to complete
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Firmware loaded in 1675919 usecs
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Waiting for device to boot
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Device booted in 11722 usecs
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-12-16.ddc
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Applying Intel DDC parameters completed
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Reading supported features failed (-16)
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Telemetry exception format not supported
Mar 06 14:56:25 archBook kernel: Bluetooth: hci0: Firmware revision 0.1 build 50 week 12 2019
Mar 06 15:01:32 archBook kernel: Bluetooth: RFCOMM TTY layer initialized
Mar 06 15:01:32 archBook kernel: Bluetooth: RFCOMM socket layer initialized
Mar 06 15:01:32 archBook kernel: Bluetooth: RFCOMM ver 1.11
Mar 06 15:01:36 archBook systemd[749]: Starting Bluetooth OBEX service...
Mar 06 15:01:36 archBook systemd[749]: Started Bluetooth OBEX service.
I hope somebody can help me fix this, since it is annoying to always switch to tty2 before I can log in via the GUI...
Thanks for any advice, and tell me if you need more information.
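A side note on reading the message itself: the number in parentheses is a negative errno value returned inside the kernel, so it can be translated to a name with Python's standard errno module. This is just a lookup sketch; it says nothing about why the boot stalls.

```python
import errno
import os

code = -16  # value printed in "Reading supported features failed (-16)"
name = errno.errorcode[-code]  # symbolic name for errno 16
print(name, "-", os.strerror(-code))
```

On Linux this resolves to EBUSY, that is, the request was rejected because the device or resource was busy at that moment.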

Kafka broker crashes every day - OOM killer

I have a cluster of 3 Kafka brokers, version 0.10.2.1. Each broker has its own host (2 CPUs / 16G RAM). In addition, we use Docker to wrap the broker process.
The problem is as follows:
Almost every day at the same time, all of our Kafka clients fail for 10 minutes.
At first I thought it was related to "Kafka No broker in ISR for partition",
but after a while I discovered that the broker simply crashes because of the OOM killer.
I also experimented with Xmx and Xms before I discovered that it was the OOM killer. I tried:
-Xmx2048M -Xms2048M
-Xmx4096M -Xms2048M
The behavior was the same for both.
In addition, we currently have no ulimit set:
>> ulimit
unlimited
less kern.log
LOGS:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761019] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761022] run-parts cpuset=/ mems_allowed=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761026] CPU: 1 PID: 12266 Comm: run-parts Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761027] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761029] 0000000000000286 000000004811d7da ffff880036967af0 ffffffff813f7583
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761031] ffff880036967cc8 ffff880439f2f000 ffff880036967b60 ffffffff8120ad5e
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761033] ffffffff81cd2dc7 0000000000000000 ffffffff81e67760 0000000000000206
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761036] Call Trace:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761041] [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761044] [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761048] [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761049] [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761052] [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761054] [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761056] [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761060] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761063] [<ffffffff81391bcc>] ? apparmor_file_alloc_security+0x5c/0x220
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761066] [<ffffffff811ed05a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761070] [<ffffffff81347bd3>] ? security_file_alloc+0x33/0x50
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761073] [<ffffffff810caf11>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761074] [<ffffffff810805a0>] _do_fork+0x80/0x360
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761076] [<ffffffff81080929>] SyS_clone+0x19/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761080] [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
And ....
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.954463] Out of memory: Kill process 16123 (java) score 134 or sacrifice child
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] Killed process 16123 (java) total-vm:11977548kB, anon-rss:2035780kB, file-rss:67848kB
Any suggestions on how to approach this?
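When going through kern.log like this, it can help to pull the numbers out of the "Killed process" line and convert them to something readable. Below is a minimal Python sketch written against the exact line format shown in the log above (other kernel versions print additional fields); parse_oom_kill is a hypothetical helper name.

```python
import re

KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Extract pid, command and memory figures (in kB) from an OOM kill line."""
    m = KILLED.search(line)
    if m is None:
        return None
    return {k: (v if k == "comm" else int(v)) for k, v in m.groupdict().items()}


line = ("Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] "
        "Killed process 16123 (java) total-vm:11977548kB, "
        "anon-rss:2035780kB, file-rss:67848kB")
info = parse_oom_kill(line)
print(info["comm"], round(info["anon_rss"] / 1024), "MiB anonymous RSS")
```

For the line above this gives about 1988 MiB of anonymous memory for the java process, which is roughly the size of the 2048M heap setting and far below the host's 16G.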
We found the problem.
First, I will say that adding more RAM to the machines also solved the problem, but that is an "expensive solution".
The problem was as follows:
Since I was working with the EC2 Ubuntu distribution, the daily cron jobs ran on all machines in my cluster at exactly the same time. One of the scripts was mlocate, which apparently took too many resources.
I assume that since every Kafka broker in the cluster had issues with I/O and memory at that moment, the brokers tried to use more memory, and then the OOM killer killed them.
When 2 of my 3 brokers were down, some services were down.
So the solution was:
Change the crontab on each broker so that the daily jobs run at a different hour of the day.
Disable mlocate.
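The staggering can be sketched as follows. This is only an illustration in Python: the function name, the broker names, and the chosen hours are made up, and the generated line is modeled on the stock Ubuntu /etc/crontab entry for cron.daily, so compare it with the entry on your own hosts before changing anything.

```python
def staggered_daily_cron(brokers, first_hour=2, step_hours=2, minute=25):
    """Return one cron.daily crontab line per broker, each at a different hour."""
    lines = {}
    for i, broker in enumerate(brokers):
        hour = (first_hour + i * step_hours) % 24
        lines[broker] = (
            f"{minute} {hour} * * * root test -x /usr/sbin/anacron || "
            "( cd / && run-parts --report /etc/cron.daily )"
        )
    return lines


# Hypothetical broker names, used only for the demo
for broker, line in staggered_daily_cron(["kafka1", "kafka2", "kafka3"]).items():
    print(broker, "->", line)
```

With the defaults the three brokers run their daily jobs at 02:25, 04:25 and 06:25, so at most one broker is under that load at a time.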
I also faced the same issue; the pages below helped me out:
https://docs.confluent.io/current/kafka/deployment.html
How to decide Kafka Cluster size
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
Also, please make sure that swap is enabled on all the brokers.

CentOS 6.5 yum error: Input/output error

When I run the yum command:
> yum
There was a problem importing one of the Python modules
required to run yum. The error leading to this problem was:
/usr/lib64/python2.6/lib-dynload/arraymodule.so: cannot read file data: Input/output error
Please install a package which provides this module, or
verify that the module is installed correctly.
It's possible that the above module doesn't match the
current version of Python, which is:
2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)]
The current version of Python is 2.6.6, not another one.
System logs:
Oct 16 09:56:50 localhost kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Oct 16 09:56:50 localhost kernel: LSI Debug log info 31080000 for channel 0 id 0
Oct 16 09:56:50 localhost kernel: mptbase: ioc0: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000) cb_idx mptscsih_io_done
Oct 16 09:56:50 localhost kernel: LSI Debug log info 31080000 for channel 0 id 0
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Sense Key : Medium Error [current]
Oct 16 09:56:50 localhost kernel: Info fld=0x4d59fc8
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] Add. Sense: Unrecovered read error
Oct 16 09:56:50 localhost kernel: sd 6:0:0:0: [sda] CDB: Read(10): 28 00 04 d5 9f c8 00 00 08 00
Oct 16 09:56:50 localhost kernel: end_request: critical medium error, dev sda, sector 81108936
Does anyone know how to fix this? Thank you!
An Input/output error indicates that your system cannot read the file. Your log indicates that the hard drive is failing. Reinstall yum through RPM if you must, but ultimately back up your critical data and salvage the storage array.
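To see that the three numbers in the log point at the same spot on the disk, here is a small Python sketch that decodes the READ(10) command bytes and compares them with the "Info fld" value and the sector in the end_request line. The 512-byte sector size is an assumption, and decode_read10 is a hypothetical helper name.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_read10(cdb_hex):
    """Return (start_lba, block_count) from a SCSI READ(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


lba, blocks = decode_read10("28 00 04 d5 9f c8 00 00 08 00")
print(lba, blocks)          # 81108936 8
print(int("4d59fc8", 16))   # 81108936, the "Info fld" value from the log
print(lba * SECTOR_SIZE)    # 41527775232, byte offset of the failing read
```

All three agree on sector 81108936, so the failed read is a real medium error at a fixed location on sda and not a problem with yum or Python.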
