What causes BitBake worker processes to exit unexpectedly?

I have a BitBake build that runs in a Docker container (CentOS 7). The build fails on recipe gcc-cross-i586-5.2.0-r0, task do_compile, on every run I try.
An example of bitbake's output:
NOTE: recipe gcc-cross-i586-5.2.0-r0: task do_compile: Started
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
ERROR: Worker process (367) exited unexpectedly (-9), shutting down...
NOTE: Tasks Summary: Attempted 1538 tasks of which 17 didn't need to be rerun and all succeeded.
Is this a problem with the gcc-cross-i586-5.2.0-r0 do_compile task? Perhaps an out-of-memory error? I don't know what the -9 refers to, or how to find out more about it.

Try:
$ bitbake -c cleansstate gcc-cross ; bitbake -k gcc-cross
How much RAM do you have?
Please post the error log here.

This worked for me:
Decrease the number of worker threads by adding the following to your conf/local.conf file (under the build directory):
BB_NUMBER_THREADS = "6"
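If lowering the BitBake thread count alone doesn't help, the make-level parallelism inside each task can be capped as well; a minimal sketch of the extra conf/local.conf line (the value is just an example; PARALLEL_MAKE sets the -j flag passed to make, and gcc's do_compile is particularly memory-hungry):
# Cap parallel make jobs inside each compile task.
PARALLEL_MAKE = "-j 4"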

A note on the -9: errno 9 is EBADF (bad file number), but a negative exit status reported by BitBake is a signal number, so -9 means the worker was killed by signal 9 (SIGKILL), which inside a container is very often the kernel's OOM killer. Beyond that: is it possible you have done some operations as root, so that some files are not accessible during the build? Is the issue reproducible, i.e. if you rm -rf tmp, does it happen again? Make sure you don't have any permission issues in your project directory and associated file system(s).
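To confirm an out-of-memory kill, the kernel log can be checked right after the failure (a quick sketch; the exact message wording varies by kernel version):
# Look for OOM-killer activity around the time of the build failure.
dmesg | grep -iE 'out of memory|killed process'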

Related

xwininfo -tree always shows 0 children on lambda container

As the title states, I'm finding that whenever I issue xwininfo -tree on a container running Ubuntu in Lambda, there are never child windows, though when I run the same container locally it works fine (listing all the windows used by said application). The issue must lie somewhere between application start and its registering with the X server. Does anyone have any insight into what could be going on, or how I can get this to work?
To attach some code to this problem, the gist is essentially:
# Start Xvfb
Xvfb $DISPLAY -screen 0 1920x1080x24 -nolisten tcp -nolisten unix &
# Open Visual Studio Code (just as an example application)
code .
# List all windows
xwininfo -root -tree
As mentioned, locally xwininfo produces something like this:
Root window id: 0x50d (the root window) (has no name)
  Parent window id: 0x0 (none)
     6 children:
     0x600006 "Code": ("code" "Code")  800x600+560+240  +560+240
     0x600002 "Get Started - src - Visual Studio Code": ("code" "Code")  1024x768+448+156  +448+156
     0x600003 (has no name): ()  1x1+0+0  +0+0
     0x800003 "code": ("code" "Code")  200x200+0+0  +0+0
        1 child:
        0x800004 (has no name): ()  1x1+-1+-1  +-1+-1
     0x800001 "code": ("code" "Code")  10x10+10+10  +10+10
     0x600000 "Chromium clipboard": ()  10x10+-100+-100  +-100+-100
but in lambda, the output only shows the root window, with no children:
Root window id: 0x50d (the root window) (has no name)
  Parent window id: 0x0 (none)
     0 children.
What could be going on here? Why are my application windows not appearing in the output of xwininfo -tree?
This is a sort of half answer, in that I'm unable to definitively solve the root cause of the original question, but it explains why Visual Studio Code fails to launch (and therefore does not appear in xwininfo). If you pass the --verbose flag to Visual Studio Code when it launches in a Lambda environment, you'll get output like this:
prctl(PR_SET_NO_NEW_PRIVS) failed
prctl(PR_SET_NO_NEW_PRIVS) failed
Unable to create argv.json configuration file in /home/sbx_user1051/.vscode/argv.json, falling back to defaults (Error: ENOENT: no such file or directory, mkdir '/home/sbx_user1051/.vscode')
[94:1213/213845.593392:ERROR:bus.cc(398)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[94:1213/213845.593463:ERROR:bus.cc(398)] Failed to connect to the bus: Could not parse server address: Unknown address type (examples of valid types are "tcp" and on UNIX "unix")
[94:1213/213845.626853:ERROR:bus.cc(398)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
[94:1213/213845.626936:ERROR:bus.cc(398)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
[94:1213/213845.671433:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[94:1213/213845.671514:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213845.671522:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 1 time(s)
[94:1213/213845.672685:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213845.672708:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 2 time(s)
[94:1213/213845.688115:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213845.688143:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 3 time(s)
[94:1213/213845.689768:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213845.689790:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 4 time(s)
[94:1213/213845.690646:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213845.690662:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 5 time(s)
[94:1213/213846.470027:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213846.470054:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 6 time(s)
[94:1213/213846.471084:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213846.471101:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 7 time(s)
[94:1213/213846.471742:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213846.471761:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 8 time(s)
[94:1213/213846.472344:ERROR:gpu_process_host.cc(974)] GPU process launch failed: error_code=1002
[94:1213/213846.472357:WARNING:gpu_process_host.cc(1282)] The GPU process has crashed 9 time(s)
[94:1213/213846.472371:FATAL:gpu_data_manager_impl_private.cc(450)] GPU process isn't usable. Goodbye.
The problem is that even though Visual Studio Code is built on Electron / Chromium, it doesn't seem to support a --headless option yet (as opposed to something like Chrome, which can be launched in a Lambda function with some careful tinkering). So, due to the restrictions of the Lambda environment, it seems that for now it's not possible to launch Visual Studio Code there. (I know there is the web version, but I don't want that.) It would have been nice, as a single instance per invocation would have been perfect for what I was trying to accomplish. I'm going to move on to some other backend infrastructure to do what I want.
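For contrast, this is roughly what the "careful tinkering" for Chrome amounts to; a sketch with flags commonly needed in locked-down environments (not Lambda-specific advice):
# Headless Chromium gets by without a GPU, D-Bus, or a window manager.
chromium --headless --disable-gpu --no-sandbox --dump-dom https://example.com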

How to get a basic direvent watcher working?

I have read through the direvent documentation and am trying to get a simple watch working. Since I am having so much trouble with it, I am wondering if the issue has to do with the fact that the system I am using is NixOS.
Here is the simple watcher file, watcher, I've created:
watcher {
    path ./dir;
    command "echo $file";
}
I run it in the foreground, so I can see the output, with direvent --foreground watcher. Once it's running, I create a file in dir, thus creating an event for it to respond to. However, it fails with the following output:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8552 failed with status 127
direvent: [ERROR] process 8555 failed with status 127
direvent: [ERROR] process 8557 failed with status 127
Since 127 usually means 'command not found', I tried specifying the path to echo, i.e. running this watcher instead:
watcher {
    path ./dir;
    command "/run/current-system/sw/bin/echo $file";
}
Then the output still gives an error, albeit a different one:
$ direvent --foreground watcher
direvent: [INFO] direvent 5.2 started
direvent: [ERROR] process 8645 failed with status 1
direvent: [ERROR] process 8651 failed with status 1
direvent: [ERROR] process 8652 failed with status 1
So the failure is now status 1, and I am not sure what to try next. I'm wondering if this issue is due to the fact that I am running NixOS. Does anyone know what I might try next to get direvent working?
direvent has two other flags that may be useful to you.
--debug (-d) gives extra information.
There's also --lint (-t), which checks the configuration file for errors, but I suspect this isn't your issue if direvent is running.
Source: https://www.gnu.org.ua/software/direvent/manual/direvent.html
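Putting those together, a debugging session might look like the sketch below (repeating -d to raise the debug level is an assumption borrowed from other GNU tools):
# Validate the configuration file first.
direvent --lint watcher
# Then run in the foreground with extra debugging to see exactly
# which command direvent tries to execute and why it fails.
direvent --foreground -d -d -d watcher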

mpirun launching a runtime script for thread binding

I am running a program as an MPI process; its executable is myapp_mpi. The command line below launches the app correctly:
mpirun --bind-to none -np 128 myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I now wish to constrain the thread binding with a script, pin.sh. The command line below then produces an error:
mpirun --bind-to none -np 128 pin.sh myapp_mpi apprun -resethway -noconfout -nsteps 8000 -s benchData.tpr -cpo state.cpt -e ener.edr -dlb no -pin off -v
I get the following error
--------------------------------------------------------------------------
Open MPI tried to fork a new process via the "execve" system call but
failed. Open MPI checks many things before attempting to launch a
child process, but nothing is perfect. This error may be indicative
of another problem on the target host, or even something as silly as
having specified a directory for your application. Your job will now
abort.
Local host: machine001
Working dir: /home/user/myapp/bin
Application name: /home/user/myapp/bin/pin.sh
Error: Exec format error
--------------------------------------------------------------------------
mpirun: Forwarding signal 18 to job
mpirun: Forwarding signal 18 to job
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 1
Error name: (null)
Node: machine001
when attempting to start process rank 0.
--------------------------------------------------------------------------
125 total processes failed to start
File locations are correct, a priori. Any clue?
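"Exec format error" from execve usually means the kernel does not recognize the file as an executable; for a shell script, the most common causes are a missing shebang on the first line, CRLF line endings, or a missing execute bit (chmod +x pin.sh). A hypothetical pin.sh that mpirun could exec directly (the taskset mapping is just an example):
#!/bin/bash
# pin.sh: per-rank binding wrapper (hypothetical).
# Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK for each local process.
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
# Pin this rank and its threads to one core, then exec the real program.
exec taskset -c "$rank" "$@"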

Yocto: bitbake exit code confusion

I get an error while building an image using Yocto (dizzy):
ERROR: Creation of tar /mnt/workspace/build/tmp/deploy/tar/xev-dbg-1.2.1-r0.tar.gz failed.
and bitbake command fails with the following report:
No currently running tasks (6291 of 6292)
NOTE: Tasks Summary: Attempted 6292 tasks of which 18 didn't need to be rerun and all succeeded.
Summary: There were 13 WARNING messages shown.
Summary: There were 3 ERROR messages shown, returning a non-zero exit code.
If I check the file xev-dbg-1.2.1-r0.tar.gz, I get:
$ file /mnt/workspace/build/tmp/deploy/tar/xev-dbg-1.2.1-r0.tar.gz
/mnt/workspace/build/tmp/deploy/tar/xev-dbg-1.2.1-r0.tar.gz: gzip compressed data, from Unix, last modified: Mon Mar 27 20:19:55 201
and it is the same case for the remaining two errors.
I am confused:
if there was an error, why is bitbake reporting that all tasks succeeded?
If the file was successfully created, why does bitbake exit with a non-zero value?
Bitbake did not return a 0 exit code. This means there were errors in the bitbake process: three errors occurred while it was trying to create the tar files, as shown.
The compressed file is there, but it is not complete, just like an interrupted download still leaves a partial file on disk. That is why a checksum such as md5sum is usually used to verify that a file is complete.
A better reading of the summary might be: bitbake attempted to run 6292 tasks; 18 of them did not need to be rerun; bitbake ran the remaining 6274 (6292 - 18) and succeeded in running them. This does not mean that all of them compiled successfully: in the process, 13 warnings and 3 errors appeared, and because of the 3 errors, bitbake returned a non-zero exit code.
No currently running tasks (6291 of 6292)
NOTE: Tasks Summary: Attempted 6292 tasks of which 18 didn't need to be rerun and all succeeded.
Summary: There were 13 WARNING messages shown.
Summary: There were 3 ERROR messages shown, returning a non-zero exit code.
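To check whether such a tarball is actually complete, it is enough to try reading it end to end (paths taken from the question):
# gzip -t verifies the compressed stream; tar -tzf additionally walks the archive.
gzip -t /mnt/workspace/build/tmp/deploy/tar/xev-dbg-1.2.1-r0.tar.gz && echo "gzip stream OK"
tar -tzf /mnt/workspace/build/tmp/deploy/tar/xev-dbg-1.2.1-r0.tar.gz > /dev/null && echo "archive readable"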

How to know if app is in RUNNING state to kill spark-submit process?

I am creating a shell script that will be executed from Jenkins, because we have many streaming jobs and it seems easier to manage them from Jenkins. So I have created the script below.
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
processId=$!
echo $processId
sleep 5m
kill $processId
If I don't have a sleep, the spark-submit process is killed immediately and no Spark application is submitted. And if there is a sleep, the spark-submit process gets enough time to submit the Spark application.
My question is: is there a better way to know whether the Spark application is in the RUNNING state, so that the spark-submit process can be killed at the right time?
Spark 1.6.0 with YARN
You should spark-submit your Spark application and use yarn application -status <ApplicationId>, as described in the application section of the YARN commands documentation:
Prints the status of the application.
You could get <ApplicationId> from the logs of spark-submit (in client deploy mode), or use yarn application -list -appTypes SPARK -appStates RUNNING.
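A polling version of the script might look like the sketch below (the awk filter simply grabs the first RUNNING SPARK application id, which is only safe if one job is submitted at a time):
#!/bin/bash
spark-submit "spark parameters here" > /dev/null 2>&1 &
launcher_pid=$!
app_id=""
while [ -z "$app_id" ]; do
  sleep 10
  # Take the first RUNNING SPARK application id from the list output.
  app_id=$(yarn application -list -appTypes SPARK -appStates RUNNING 2>/dev/null \
           | awk '/application_/ {print $1; exit}')
done
echo "Application $app_id is RUNNING; killing launcher $launcher_pid"
kill "$launcher_pid"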
I don't know what Spark version you are using or whether you are running in standalone mode, but in any case you can use the REST API for submitting/killing your apps. The last time I checked, it was pretty much undocumented, but it worked properly.
When you submit an application, you get back a submissionId, which you can use later to query its current state or to kill it. The possible states are documented here:
// SUBMITTED: Submitted but not yet scheduled on a worker
// RUNNING: Has been allocated to a worker to run
// FINISHED: Previously ran and exited cleanly
// RELAUNCHING: Exited non-zero or due to worker failure, but has not yet started running again
// UNKNOWN: The state of the driver is temporarily not known due to master failure recovery
// KILLED: A user manually killed this driver
// FAILED: The driver exited non-zero and was not supervised
// ERROR: Unable to run or restart due to an unrecoverable error (e.g. missing jar file)
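For illustration, the status and kill endpoints look roughly like this (host, port, and the driver id here are placeholders; 6066 is the standalone REST server's default port):
MASTER=http://spark-master:6066
# Query the current state of a submission returned by an earlier create call.
curl -s "$MASTER/v1/submissions/status/driver-20170101000000-0000"
# Kill the submission.
curl -s -X POST "$MASTER/v1/submissions/kill/driver-20170101000000-0000"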
This is especially useful for long-running apps (e.g. streaming), since you don't have to babysit the shell script.
