BadIDChoice RENDER in python 3.3 and tk/tcl displayed on X - linux

I have a fairly complicated GUI written with Python's tkinter running on Linux, and one of the components (which has a Text widget that updates frequently) causes the GUI to crash infrequently (about once a day).
The GUIs are displayed on X servers on both Mac OS X (through X11) and Gnome 2.28.2, with the same behavior. My Python version is 3.3 and the Tk/Tcl version is 8.5. The error I get is:
X Error of failed request: BadIDChoice (invalid resource ID chosen for this connection)
Major opcode of failed request: 148 (RENDER)
Minor opcode of failed request: 4 (RenderCreatePicture)
Resource id in failed request: 0x116517f
Serial number of failed request: 15106831
Current serial number in output stream: 15106872
An strace looks like:
11:03:29.632041 recvfrom(13, 0x3bae1d4, 4096, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:03:29.632059 recvfrom(13, 0x3bae1d4, 4096, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
11:03:29.632147 poll([{fd=13, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=13, revents=POLLOUT}])
11:03:29.632164 writev(13, [{"\224\4\5\0D\304\361\0\17\274\361\0i\4\0\0\0\0\0\0\224\27\n\0\3\f\340\0\301\v\340\0"..., 5032}, {NULL, 0}, {"", 0}], 3) = 5032
11:03:29.632193 poll([{fd=13, events=POLLIN}], 1, -1) = 1 ([{fd=13, revents=POLLIN}])
11:03:29.637040 recvfrom(13, "\0\16\302\276x\304\361\0\4\0\224\0\1\0\0\0`\16\330\3\1\0\0\0\243\304\342\210\377\177\0\0"..., 4096, 0, NULL, NULL) = 136
11:03:29.637135 open("/usr/share/X11/XErrorDB", O_RDONLY) = 35
11:03:29.637217 fstat(35, {st_mode=S_IFREG|0644, st_size=41532, ...}) = 0
11:03:29.637360 read(35, "!\n! Copyright 1993, 1995, 1998 "..., 41532) = 41532
11:03:29.637387 close(35) = 0
11:03:29.637820 write(2, "X Error of failed request: BadI"..., 91) = 91
...
My GUI is single-threaded (and uses the after() call to monitor sockets for I/O).
Does anyone know what might be wrong? Is there any better debugging that I could be doing to figure out what the X Error part means?

Infrequent crashes (once a day) with the following logs...
X Error of failed request: BadIDChoice (invalid resource ID chosen for this connection)
Major opcode of failed request: 148 (RENDER)
Minor opcode of failed request: 4 (RenderCreatePicture)
...appear to be a telltale signature of a known issue within xcb as mentioned in the following thread:
Bug 458092 - Crashes with BadIdChoice X errors
The patch for it is available here.
Based on the git history, this xcb bug should be fixed in libX11-1.1.99.2 and above (about 8 years ago).
For further reference here is the email-thread with the complete discussion.
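To check whether a given machine predates that fix, the installed libX11 version can be compared against 1.1.99.2. The following is only a minimal sketch: the helper names are made up for illustration, and it assumes the version string is purely numeric and dotted (for example, as printed by `pkg-config --modversion x11` where pkg-config and the X11 development files are installed).

```python
def parse_version(text):
    """Turn a dotted version such as '1.1.99.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# First libX11 release that the answer above says contains the fix.
FIXED_IN = parse_version("1.1.99.2")


def has_fix(installed):
    """Return True if the installed version is at or past the fixed release."""
    return parse_version(installed) >= FIXED_IN


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    for candidate in ("1.1.5", "1.1.99.2", "1.6.2"):
        print(candidate, "has fix" if has_fix(candidate) else "predates fix")
```

Note that the version that matters is the libX11 on the machine where the Python process runs, not the one on the machine displaying the window.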

Related

Strange behavior of "getgroups"

I'm working on a small learning project and need to call "getgroups(int gidsetsize, gid_t grouplist[])".
I got a result of "0" for an ID that should have a list of groups.
While checking all the possibilities, I found that the user's group must not be "0", or the function won't return a non-zero result.
But I only see this problem on my own computer, which runs Arch Linux.
I checked virtual machines running Manjaro and Ubuntu; none of them has the problem.
A colleague has a VPS that also runs Arch, and it does not have the problem either.
The Arch BBS replied "the gid of user should not be 0", but that doesn't explain why my machine is the only one with the problem.
I compared the id output before posting.
Only the physical machine gives me an empty list, and the strace output is different.
My machine's result:
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
getgroups(0, NULL) = 0
getgroups(0, []) = 0
newfstatat(AT_FDCWD, "/etc/nsswitch.conf", {st_mode=S_IFREG|0644, st_size=359, ...}, 0) = 0
On the other machines, the first getgroups call returns a non-zero result, and the second call then uses that result as its first parameter to get the list.
I couldn't find the difference by myself.
The Ubuntu VM result:
getgroups(0, NULL) = 7
getgroups(7, [0, 4, 24, 27, 30, 46, 110]) = 7
The "id" command only returns the non-zero result to get the "self" result, but not for other users, because of a different execution branch.
I tried changing the user's gid to a non-zero value; after a reboot, the result went back to normal.
After changing the gid back to 0 and rebooting, the result is empty again.
Might it be something to do with user namespaces?
Any suggestions?
Edit, more info:
It seems the problem is related to which process is the parent.
When the process is a child (or grandchild, etc.) of
/usr/lib/systemd/systemd --user
getgroups gives a bad result.
When it is not (running under i3wm, the process has no parent), the result is good.
Following the systemd trail, the Arch BBS pointed me to this:
User service not running with supplementary groups
------------EDIT---------------
It's not the same problem; possibly another bug.
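For anyone trying to reproduce this, a small script that prints the supplementary groups together with the parent PID makes it easier to compare a shell started under "systemd --user" with one started elsewhere. This is just a sketch; `group_report` is a name I made up, and `os.getgroups()` is Python's wrapper around the same getgroups call.

```python
import os


def group_report():
    """Collect the IDs that matter when comparing two environments."""
    return {
        "pid": os.getpid(),
        "ppid": os.getppid(),              # parent process, to spot "systemd --user"
        "gid": os.getgid(),
        "egid": os.getegid(),
        "groups": sorted(os.getgroups()),  # supplementary groups of this process
    }


if __name__ == "__main__":
    for key, value in group_report().items():
        print(f"{key}: {value}")
```

Running it once from each environment and diffing the output should show whether the empty list follows the parent process or the gid.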

Perf record hanging on armv7

I have a device with embedded Linux. The base image is built using ptxdist 2019.01 with the OSELAS toolchain build 2018.02 with gcc 7.3.1. Ptxdist has native option to enable perf support, so I enabled it and installed it on the device. It is using Linux 4.19.72.
However, when I run perf record -g (with process to trace) without explicitly specifying events, it seems to hang using a lot of CPU and not responding to SIGINT. I am not sure what the default event is; it does not seem to say anywhere. How can I find what it is hanging on and/or which events to specify that will actually work?
Update #1: Running strace perf record -g app… shows:
openat(AT_FDCWD, "/proc/sys/kernel/kptr_restrict", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
read(3, "0\n", 1024) = 2
geteuid32() = 0
getuid32() = 0
close(3) = 0
statfs64("/sys", 88, 0xbef2f388) = 0
stat64("/sys/bus/event_source/devices/cs_etm/format", 0xbef2f440) = -1 ENOENT (No such file or directory)
stat64("/sys/bus/event_source/devices/cs_etm/type", 0xbef2f440) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_CLOEXEC|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(3, /* 12 entries */, 32768) = 360
getdents64(3, /* 0 entries */, 32768) = 0
close(3) = 0
stat64("/sys/bus/event_source/devices/arm_spe_0/format", 0xbef2f440) = -1 ENOENT (No such file or directory)
stat64("/sys/bus/event_source/devices/arm_spe_0/type", 0xbef2f440) = -1 ENOENT (No such file or directory)
geteuid32() = 0
perf_event_open(
Unfortunately, the arguments of perf_event_open never get written out. Listing with ps shows it in the R state.
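The trace shows perf probing entries under /sys/bus/event_source/devices, so one cheap check is to list which event sources (PMUs) the kernel actually exposes on the device. A minimal sketch, assuming Python is available on the target (otherwise a plain ls of the same directory gives the same information); `list_pmus` is a hypothetical helper name.

```python
import os

# Directory where the kernel exposes perf event sources (PMUs).
EVENT_SOURCE_DIR = "/sys/bus/event_source/devices"


def list_pmus(base=EVENT_SOURCE_DIR):
    """Return the names of the exposed event sources, or [] if none are visible."""
    try:
        return sorted(os.listdir(base))
    except OSError:
        # Directory missing or unreadable, e.g. perf events not enabled.
        return []


if __name__ == "__main__":
    for name in list_pmus():
        print(name)
```

If no hardware PMU shows up in the list, explicitly requesting a software event, for example `perf record -e cpu-clock -g`, is one thing to try.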

What does -1000 mean in spark exit status

I'm doing something with Spark-SQL and got error below:
YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to
remove executor 1 for reason Container marked as failed:
container_1568946404896_0002_02_000002 on host: worker1. Exit status:
-1000. Diagnostics: [2019-09-20 10:43:11.474]Task java.util.concurrent.ExecutorCompletionService$QueueingFuture#76430b7c
rejected from
org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor#16970b[Terminated,
pool size = 0, active threads = 0, queued tasks = 0, completed tasks =
1]
I'm trying to figure it out by looking up the meaning of Exit status: -1000; however, googling returned no useful information.
According to this thread, -1000 is not even mentioned.
Any comments are welcome, thanks.
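For reference, as far as I recall the negative values come from Hadoop's ContainerExitStatus constants, where -1000 is INVALID, the initial placeholder used when a container has no real exit code. The table below is a partial sketch written from memory, so please verify it against the Hadoop version in use; the helper names are hypothetical.

```python
import re

# Partial table, recalled from org.apache.hadoop.yarn.api.records.ContainerExitStatus.
# Treat these as assumptions and check them against your Hadoop release.
YARN_EXIT_STATUS = {
    0: "SUCCESS",
    -1000: "INVALID",
    -100: "ABORTED",
    -101: "DISKS_FAILED",
    -102: "PREEMPTED",
}


def extract_exit_status(log_text):
    """Pull the number after 'Exit status:' out of a (possibly wrapped) log line."""
    match = re.search(r"Exit status:\s*(-?\d+)", log_text)
    return int(match.group(1)) if match else None


def describe(status):
    """Map an exit status to a name, or 'UNKNOWN' if it is not in the table."""
    return YARN_EXIT_STATUS.get(status, "UNKNOWN")


if __name__ == "__main__":
    sample = "on host: worker1. Exit status:\n-1000. Diagnostics: ..."
    status = extract_exit_status(sample)
    print(status, describe(status))
```

If that reading is right, the interesting part of the message is the Diagnostics text rather than the number itself.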

ERROR: GETH_NEWDATE: Strange length for ODATE: 20

While setting up WPS for WRF, I get the following error when running metgrid.exe.
Processing domain 1 of 1
ERROR: GETH_NEWDATE: Strange length for ODATE: 20
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
I had this same problem, and realized it was caused by either an extra or a missing quote on start_date or end_date in namelist.wps.
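A quick way to spot such a quoting mistake is to scan the start_date and end_date lines and flag any value that is not a cleanly quoted date. This is only a sketch under assumptions: it expects single-quoted values of the form 'YYYY-MM-DD_HH:MM:SS' separated by commas, and `find_bad_dates` is a name made up for illustration.

```python
import re

# A correctly quoted date: 19 characters between one pair of single quotes.
QUOTED_DATE = re.compile(r"^'\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}'$")


def find_bad_dates(namelist_text):
    """Return (name, raw_value) pairs for start_date/end_date values that look malformed."""
    bad = []
    for line in namelist_text.splitlines():
        name, sep, values = line.partition("=")
        name = name.strip().lower()
        if not sep or name not in ("start_date", "end_date"):
            continue
        for raw in values.split(","):
            raw = raw.strip()
            if raw and not QUOTED_DATE.match(raw):
                bad.append((name, raw))
    return bad


if __name__ == "__main__":
    # Hypothetical namelist fragment with a stray quote on end_date.
    sample = (
        " start_date = '2019-09-20_00:00:00',\n"
        " end_date   = '2019-09-21_00:00:00'',\n"
    )
    for name, raw in find_bad_dates(sample):
        print(f"{name}: suspicious value {raw}")
```

In the sample, the stray quote turns the 19-character date into a 20-character string, which would be consistent with the "Strange length for ODATE: 20" message.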

Could it be that users are added/removed in the background from /etc/passwd?

from pwd import getpwuid
getpwuid(48).pw_name
This Python program prints apache 99% of the time. 48 is the id that appears in /etc/passwd for the apache user. Without any apparent reason, Python sometimes prints the error:
KeyError: 'getpwuid(): uid not found: 48'
I need to understand why this happens sometimes. Can the apache user be removed from the file for some reason?
Here is the CPython 2.7 source code for the pwd module, particularly the getpwuid() call: https://github.com/python/cpython/blob/2.7/Modules/pwdmodule.c#L114 It looks like a wrapper around the system getpwuid call with not very much code - Python doesn't read from /etc/passwd directly.
Here's a current Ubuntu manpage (you didn't mention any particular OS) for (3) getpwuid: http://manpages.ubuntu.com/manpages/wily/man3/getpwuid.3posix.html which includes:
ERRORS
The getpwuid() and getpwuid_r() functions may fail if:
EIO An I/O error has occurred.
EINTR A signal was caught during getpwuid().
EMFILE {OPEN_MAX} file descriptors are currently open in the calling
process.
ENFILE The maximum allowable number of files is currently open in the
system.
Since you haven't mentioned any user management processes which might be regenerating your user accounts, I'm going to answer that no, apache doesn't get removed from /etc/passwd, but your webserver does hit some heavy IO or too many open files condition where reading /etc/passwd becomes impossible.
That’s a very interesting phenomenon (and a great question) but I doubt that Apache is being removed from your /etc/passwd file.
On a GNU/Linux system, there are a number of different authentication mechanisms that can be used. In modern systems, the Name Service Switch (NSS) is used to resolve user names and IDs. This is configured in the passwd line of /etc/nsswitch.conf, e.g., the following configuration means that the /etc/passwd will be searched first and if the user or ID is not found, then a configured NIS server is used to determine the user name/ID.
passwd: files nis
However, in some systems, the NSS library functions might not actually be used to resolve a name request. Some systems may have a service such as nscd running. This is a daemon that caches name service requests, e.g., if the Apache user had previously been looked up, its name would be stored in the nscd cache and it would return the correct name or ID without /etc/passwd being searched.
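To see which sources are configured on a given machine, the passwd line of /etc/nsswitch.conf can be read with a few lines of Python. This is a minimal sketch; `passwd_sources` is a hypothetical helper, and it does no special handling of action items such as [NOTFOUND=return], which would simply appear as tokens in the result.

```python
def passwd_sources(nsswitch_text):
    """Return the lookup sources listed on the 'passwd:' line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line.startswith("passwd:"):
            return line[len("passwd:"):].split()
    return []


if __name__ == "__main__":
    try:
        with open("/etc/nsswitch.conf") as handle:
            print(passwd_sources(handle.read()))
    except OSError as exc:
        print("could not read /etc/nsswitch.conf:", exc)
```

For the example configuration above, the function returns the sources in lookup order, files first and then nis.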
Debugging
I would try debugging this issue by running the Python program through strace. At the very end of the output file, you should see the system calls that are used to retrieve the name of the user.
strace -o getpwuid_test.trace getpwuid_test.py
You would need to run this command enough times to catch the call to getpwuid failing to see why it failed. I, for one, would be interested to see the results.
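Since the failure is intermittent, it may also help to repeat the lookup in a loop and record some context whenever it fails. The sketch below is one way to do that; the function names are made up, and counting entries in /proc/self/fd is a Linux-specific assumption for checking the "too many open files" theory.

```python
import os
import pwd
import time


def lookup_name(uid):
    """Return the user name for uid, or None if the lookup fails."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def open_fd_count():
    """Number of open file descriptors in this process (Linux only), else None."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def watch(uid, attempts, delay=0.0):
    """Repeat the lookup and return (attempt, timestamp, fd_count) for each failure."""
    failures = []
    for attempt in range(attempts):
        if lookup_name(uid) is None:
            failures.append((attempt, time.time(), open_fd_count()))
        if delay:
            time.sleep(delay)
    return failures


if __name__ == "__main__":
    # 48 is the apache uid from the question.
    print(watch(48, attempts=1000, delay=0.01))
```

Running the loop inside the same process that sees the error (for example, the web application itself) is more telling than running it from a separate shell, because file descriptor limits are per process.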
Examples
Here’s an example of the output where no caching daemon is running and NSS is used to read the /etc/passwd file:
open("/etc/nsswitch.conf", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=1717, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fa9000
read(3, "#\n# /etc/nsswitch.conf\n#\n# An ex"..., 4096) = 1717
read(3, "", 4096) = 0
close(3) = 0
...
open("/etc/passwd", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=3012, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fa9000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 3012
close(3) = 0
...
write(1, "apache\n", 7)
Here’s an example where the nscd service is running and the NSS library is bypassed:
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = 0
send(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"passwd\0", 7}, {"\270O\3\0\0\0\0\0", 8}], msg_controllen=16, {cmsg_len=16, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=0}, 0) = 15
...
write(1, "apache\n", 7)

Resources