I am trying to deploy a prediction web service to Azure with ML Workbench in cluster mode, following this tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/preview/tutorial-classifying-iris-part-3#prepare-to-operationalize-locally
The model, scoring script, and schema all make it into the manifest, but creating the service then fails:
Creating service..........................................................Error occurred: {'Error': {'Code': 'KubernetesDeploymentFailed', 'Details': [{'Message': 'Back-off 40s restarting failed container=...pod=...', 'Code': 'CrashLoopBackOff'}], 'StatusCode': 400, 'Message': 'Kubernetes Deployment failed'}, 'OperationType': 'Service', 'State': 'Failed', 'Id': '...', 'ResourceLocation': '/api/subscriptions/...', 'CreatedTime': '2017-10-26T20:30:49.77362Z', 'EndTime': '2017-10-26T20:36:40.186369Z'}
Here is the result of checking the ML service real-time logs:
C:\Users\userguy\Documents\azure_ml_workbench\projecto>az ml service logs realtime -i projecto
2017-10-26 20:47:16,118 CRIT Supervisor running as root (no user in config file)
2017-10-26 20:47:16,120 INFO supervisord started with pid 1
2017-10-26 20:47:17,123 INFO spawned: 'rsyslog' with pid 9
2017-10-26 20:47:17,124 INFO spawned: 'program_exit' with pid 10
2017-10-26 20:47:17,124 INFO spawned: 'nginx' with pid 11
2017-10-26 20:47:17,125 INFO spawned: 'gunicorn' with pid 12
2017-10-26 20:47:18,160 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:18,160 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:22,164 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-10-26T20:47:22.519159Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2017-10-26T20:47:22.520097Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (12)
2017-10-26T20:47:22.520375Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2017-10-26T20:47:22.521757Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22
2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request
2017-10-26 20:47:27,669 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26T20:47:27.669556Z, INFO, 00000000-0000-0000-0000-000000000000, , Handling signal: term
2017-10-26 20:47:30,673 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:33,675 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
Initializing logger
2017-10-26T20:47:36.564469Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2017-10-26T20:47:36.564991Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2017-10-26T20:47:36.565316Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2017-10-26T20:47:36.565642Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2017-10-26 20:47:36.715933: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36,716 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:36.716376: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716542: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716860: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
this is the init
2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully
Using TensorFlow backend.
2017-10-26T20:47:37.553751Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 22)
2017-10-26T20:47:37.885303Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL
2017-10-26 20:47:37,886 INFO stopped: gunicorn (terminated by SIGKILL)
2017-10-26 20:47:37,889 INFO stopped: nginx (exit status 0)
2017-10-26 20:47:37,890 INFO stopped: program_exit (terminated by SIGTERM)
2017-10-26 20:47:37,891 INFO stopped: rsyslog (exit status 0)
Received 41 lines of log
My best guess is there's something happening silently that causes the "WARN received SIGTERM indicating exit request". The rest of the scoring.py script seems to kick off: TensorFlow gets initialized and the "this is the init" print statement runs.
http://127.0.0.1:63437 is accessible from my local machine, but the UI endpoint is blank.
Any ideas on how to get this up and running in an Azure cluster? I'm not very familiar with how Kubernetes works, so any basic debugging guidance would be appreciated.
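For anyone who wants to poke at the failing pod directly, here is a rough sketch of the standard kubectl workflow, assuming kubectl has been pointed at the Kubernetes cluster the ML CLI deployed to (the pod name below is a placeholder):
# List the pods and find the one stuck in CrashLoopBackOff
kubectl get pods
# Show the pod's events (image pulls, probe failures, restart back-off)
kubectl describe pod <pod-name>
# Dump the logs of the previous, crashed container instance
kubectl logs <pod-name> --previous
If the container is being killed because a liveness/readiness probe fails while the model is still loading, the events section of the describe output will normally show it.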
We discovered a bug in our system that could have caused this. The fix was deployed last night. Can you please try again and let us know if you still encounter this issue?
I'm working with Flutter on Ubuntu 18.04. I can't run my project even though flutter doctor reports no problems. I'm stuck on a Gradle daemon problem. I have tried the method from the Gradle user guide and from other questions on SO, adding org.gradle.daemon=false to gradle.properties, but I get the same problem. What's wrong?
Launching lib/main.dart on M2006C3LC in debug mode...
The message received from the daemon indicates that the daemon has disappeared.
Build request sent: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
Attempting to read last messages from the daemon log...
Daemon pid: 3422
log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
----- Last 20 lines from daemon log file - daemon-3422.out.log -----
11:38:04.753 [DEBUG] [org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler] Starting executing command: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android} with connection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484.
11:38:04.755 [ERROR] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Command execution: started DaemonCommandExecution[command = Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}, connection = DefaultDaemonConnection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484] after 0.0 minutes of idle
11:38:04.756 [INFO] [org.gradle.launcher.daemon.server.DaemonRegistryUpdater] Marking the daemon as busy, address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.763 [DEBUG] [org.gradle.launcher.daemon.registry.PersistentDaemonRegistry] Marking busy by address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.790 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire exclusive lock on daemon addresses registry.
11:38:04.791 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
11:38:04.804 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] resetting idle timer
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon is running. Sleeping until state changes.
11:38:04.810 [INFO] [org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy] Daemon is about to start building Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}. Dispatching build started information...
11:38:04.811 [DEBUG] [org.gradle.launcher.daemon.server.SynchronizedDispatchConnection] thread 17: dispatching class org.gradle.launcher.daemon.protocol.BuildStarted
11:38:04.812 [DEBUG] [org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment] Configuring env variables: {PATH=/home/byhuang/src/FlutterSDK/flutter/bin/cache/dart-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/byhuang/.pub-cache/bin:~/src/FlutterSDK/flutter/bin, XAUTHORITY=/run/user/1000/gdm/Xauthority, INVOCATION_ID=bb1359976cdb403b93f2e6bd5fa0d6e7, XMODIFIERS=#im=ibus, GDMSESSION=ubuntu, XDG_DATA_DIRS=/usr/share/ubuntu:/usr/local/share:/usr/share:/var/lib/snapd/desktop, TEXTDOMAINDIR=/usr/share/locale/, GTK_IM_MODULE=ibus, DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, PUB_HOSTED_URL=https://pub.flutter-io.cn, XDG_CURRENT_DESKTOP=ubuntu:GNOME, JOURNAL_STREAM=9:49578, SSH_AGENT_PID=1797, COLORTERM=truecolor, QT4_IM_MODULE=xim, SESSION_MANAGER=local/byhuang-virtual-machine:#/tmp/.ICE-unix/1702,unix/byhuang-virtual-machine:/tmp/.ICE-unix/1702, USERNAME=byhuang, LOGNAME=byhuang, PWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, MANAGERPID=1667, IM_CONFIG_PHASE=2, LANGUAGE=zh_CN:zh, LESSOPEN=| /usr/bin/lesspipe %s, SHELL=/bin/bash, OLDPWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, GNOME_DESKTOP_SESSION_ID=this-is-deprecated, GTK_MODULES=gail:atk-bridge, GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/1c43d352_e494_406b_b4b1_8e1a6bf95400, CLUTTER_IM_MODULE=xim, TEXTDOMAIN=im-config, DBUS_STARTER_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, FLUTTER_STORAGE_BASE_URL=https://storage.flutter-io.cn, XDG_SESSION_DESKTOP=ubuntu, LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:, SHLVL=2, LESSCLOSE=/usr/bin/lesspipe %s %s, QT_IM_MODULE=xim, JAVA_HOME=/home/byhuang/software/android-studio/jre, TERM=xterm-256color, FLUTTER_ROOT=/home/byhuang/src/FlutterSDK/flutter, XDG_CONFIG_DIRS=/etc/xdg/xdg-ubuntu:/etc/xdg, GNOME_TERMINAL_SERVICE=:1.59, LANG=zh_CN.UTF-8, XDG_SESSION_TYPE=x11, XDG_SESSION_ID=4, DISPLAY=:1, FLUTTER_SUPPRESS_ANALYTICS=true, 
GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1, DESKTOP_SESSION=ubuntu, USER=byhuang, XDG_MENU_PREFIX=gnome-, VTE_VERSION=5202, WINDOWPATH=1, QT_ACCESSIBILITY=1, XDG_SEAT=seat0, SSH_AUTH_SOCK=/run/user/1000/keyring/ssh, FLUTTER_ALREADY_LOCKED=true, GNOME_SHELL_SESSION_MODE=ubuntu, XDG_RUNTIME_DIR=/run/user/1000, XDG_VTNR=1, DBUS_STARTER_BUS_TYPE=session, HOME=/home/byhuang}
11:38:04.826 [DEBUG] [org.gradle.launcher.daemon.server.exec.LogToClient] About to start relaying all logs to the client via the connection.
11:38:04.826 [INFO] [org.gradle.launcher.daemon.server.exec.LogToClient] The client will now receive all logging from the daemon (pid: 3422). The daemon log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.exec.RequestStopIfSingleUsedDaemon] Requesting daemon stop after processing Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
11:38:04.832 [LIFECYCLE] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Daemon will be stopped at the end of the build stopping after processing
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Stop as soon as idle requested. The daemon is busy: true
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon stop has been requested. Sleeping until state changes.
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] The daemon has started executing the build.
11:38:04.839 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] Executing build with daemon context: DefaultDaemonContext[uid=cbbde454-54e4-47ec-b258-28179c1fddfe,javaHome=/home/byhuang/software/android-studio/jre,daemonRegistryDir=/home/byhuang/.gradle/daemon,pid=3422,idleTimeout=120000,priority=NORMAL,daemonOpts=-Xmx1536M,-Dfile.encoding=UTF-8,-Duser.country=CN,-Duser.language=zh,-Duser.variant]
----- End of the daemon log -----
FAILURE: Build failed with an exception.
* What went wrong:
Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Running Gradle task 'assembleDebug'...
Running Gradle task 'assembleDebug'... Done 29.2s
Exception: Gradle task assembleDebug failed with exit code 1
I faced the same error today on Ubuntu. To temporarily bypass it, adding --no-daemon to whichever Gradle command I was running worked. Deleting the daemon subfolder in the /home/{username}/.gradle folder permanently fixed the error.
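A minimal sketch of both steps, assuming they are run from the project's android directory and the default Gradle home of ~/.gradle:
# Run the build once without the daemon
./gradlew assembleDebug --no-daemon
# Or clear the cached daemon state so a fresh daemon starts next time
rm -rf ~/.gradle/daemon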
Go to the project folder -> android -> delete the .gradle folder.
As the other answers suggest, delete the .gradle folder. That alone didn't work for me; I had to delete it and then shut down and restart my Ubuntu machine.
I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management, I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried:
A Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm gives me errors:
/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2017-10-29 15:27:45,436 CRIT Supervisor running as root (no user in config file)
2017-10-29 15:27:45,437 INFO supervisord started with pid 1
2017-10-29 15:27:46,439 INFO spawned: 'slurmd' with pid 9
2017-10-29 15:27:46,441 INFO spawned: 'sshd' with pid 10
2017-10-29 15:27:46,443 INFO spawned: 'munge' with pid 11
2017-10-29 15:27:46,443 INFO spawned: 'slurmctld' with pid 12
2017-10-29 15:27:46,452 INFO exited: munge (exit status 0; not expected)
2017-10-29 15:27:46,452 CRIT reaped unknown pid 13)
2017-10-29 15:27:46,530 INFO gave up: munge entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,531 INFO exited: slurmd (exit status 1; not expected)
2017-10-29 15:27:46,535 INFO gave up: slurmd entered FATAL state, too many start retries too quickly
2017-10-29 15:27:46,536 INFO exited: slurmctld (exit status 0; not expected)
2017-10-29 15:27:47,537 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-29 15:27:47,537 INFO gave up: slurmctld entered FATAL state, too many start retries too quickly
This guide to starting a SLURM VM via Vagrant. I tried it, but copying over my munge key timed out:
sudo scp /etc/munge/munge.key vagrant@server:/home/vagrant/
ssh: connect to host server port 22: Connection timed out
lost connection
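For what it's worth, scp usually cannot reach a Vagrant VM as vagrant@server on port 22; a sketch of the usual workaround, assuming the box is up and is named default, is to pull the real SSH settings from Vagrant first:
# Export the VM's actual hostname, forwarded port and key
vagrant ssh-config > vagrant-ssh-config
# Copy the munge key using those settings
sudo scp -F vagrant-ssh-config /etc/munge/munge.key default:/home/vagrant/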
So ... we have an existing cluster here but it runs an older Ubuntu version which does not mesh well with my workstation running 17.04.
So on my workstation, I just made sure slurmctld (the backend) and slurmd were installed, and then set up a trivial slurm.conf with
ControlMachine=mybox
# ...
NodeName=DEFAULT CPUs=4 RealMemory=4000 TmpDisk=50000 State=UNKNOWN
NodeName=mybox CPUs=4 RealMemory=16000
after which I restarted slurmctld and then slurmd. Now all is fine:
root@mybox:/etc/slurm-llnl$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
demo up infinite 1 idle mybox
root@mybox:/etc/slurm-llnl$
This is a degenerate setup; our real one has a mix of dev and prod machines and appropriate partitions. But this should answer your "can the backend really be a client" question. Also, my machine is not really called mybox, but that is not pertinent to the question either way.
Using Ubuntu 17.04, all stock, with munge to communicate (which is the default anyway).
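For completeness, a sketch of that restart sequence, assuming the stock systemd units shipped with the Ubuntu SLURM packages:
# Restart the controller first, then the compute daemon
sudo systemctl restart slurmctld
sudo systemctl restart slurmd
# Confirm the node shows up as idle
sinfo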
Edit: To wit:
me@mybox:~$ COLUMNS=90 dpkg -l '*slurm*' | grep ^ii
ii slurm-client 16.05.9-1ubun amd64 SLURM client side commands
ii slurm-wlm-basic- 16.05.9-1ubun amd64 SLURM basic plugins
ii slurmctld 16.05.9-1ubun amd64 SLURM central management daemon
ii slurmd 16.05.9-1ubun amd64 SLURM compute node daemon
me@mybox:~$
I would still prefer to run SLURM natively, but I caved and spun up a Debian 9.2 VM. See here for my efforts to troubleshoot a native installation. The directions here worked smoothly, but I needed to make the following changes to slurm.conf. Below, Debian64 is the hostname, and wlandau is my user name.
ControlMachine=Debian64
SlurmUser=wlandau
NodeName=Debian64
Here is the complete slurm.conf. An analogous slurm.conf did not work on my native Ubuntu 16.04.
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=Debian64
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/usr/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=wlandau
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=Debian64 CPUs=1 RealMemory=744 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=Debian64 Default=YES MaxTime=INFINITE State=UP
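With a configuration like this in place, a quick smoke test of the single-node setup might look like the following (a sketch; it assumes munge, slurmctld and slurmd are already running):
# Check that the node has registered with the controller
sinfo
# Run a trivial job interactively
srun -N1 hostname
# Submit a trivial batch job and watch the queue
sbatch --wrap="sleep 30 && hostname"
squeue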
Note: This error was thrown before the components were executed by Spark.
Logs
Worker Node1:
17/05/18 23:12:52 INFO Worker: Successfully registered with master spark://spark-master-1.com:7077
17/05/18 23:58:41 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
Master Node:
17/05/18 23:12:52 INFO Master: Registering worker spark-worker-1com:56056 with 2 cores, 14.5 GB RAM
17/05/18 23:14:20 INFO Master: Registering worker spark-worker-2.com:53986 with 2 cores, 14.5 GB RAM
17/05/18 23:59:42 WARN Master: Removing spark-worker-1com-56056 because we got no heartbeat in 60 seconds
17/05/18 23:59:42 INFO Master: Removing spark-worker-2.com:56056
17/05/19 00:00:03 ERROR Master: RECEIVED SIGNAL 15: SIGTERM
Worker Node2:
17/05/18 23:14:20 INFO Worker: Successfully registered with master spark://spark-master-node-2.com:7077
17/05/18 23:59:40 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
TL;DR I think someone explicitly called the kill command or sbin/stop-worker.sh.
"RECEIVED SIGNAL 15: SIGTERM" is reported by a shutdown hook that logs TERM, HUP, and INT signals on UNIX-like systems:
/** Register a signal handler to log signals on UNIX-like systems. */
def registerLogger(log: Logger): Unit = synchronized {
  if (!loggerRegistered) {
    Seq("TERM", "HUP", "INT").foreach { sig =>
      SignalUtils.register(sig) {
        log.error("RECEIVED SIGNAL " + sig)
        false
      }
    }
    loggerRegistered = true
  }
}
In your case it means that the process received SIGTERM to stop itself:
The SIGTERM signal is a generic signal used to cause program termination. Unlike SIGKILL, this signal can be blocked, handled, and ignored. It is the normal way to politely ask a program to terminate.
That's what is sent when you execute kill, or when you use the ./sbin/stop-master.sh or ./sbin/stop-worker.sh shell scripts, which in turn call sbin/spark-daemon.sh with the stop command, which kills the JVM process of a master or a worker:
kill "$TARGET_ID" && rm -f "$pid"
During an investigation into a different problem (Inconsistent systemd startup of freeswitch) I discovered that both the latest FreeSWITCH 1.6 and 1.7 paused for several minutes at a time (between 4 and 14) during boot-up on CentOS 7.1. While it was intermittent, it happened as often as one time in 3 or 4.
Running this from the command line:
/usr/bin/freeswitch -nonat -db /dev/shm -log /usr/local/freeswitch/log -conf /usr/local/freeswitch/conf -run /usr/local/freeswitch/run
caused the following output (note the time difference between the "Added task 2" line and the line after it):
2015-10-23 15:40:14.160101 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 15:40:14.170805 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 15:40:14.272850 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 15:40:14.282317 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 15:40:14.285266 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 15:40:14.293743 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445611214
2015-10-23 15:40:14.293837 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445611214
2015-10-23 15:49:47.883158 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
When I ran 1.6 on CentOS 6.7 using the same command line as above I got this - note the delay is a more reasonable 14 seconds:
2015-10-23 10:31:00.274533 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 10:31:00.285807 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 10:31:00.434780 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 10:31:00.465158 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 10:31:00.481306 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445610660
2015-10-23 10:31:00.481446 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445610660
2015-10-23 10:31:00.481723 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 10:31:14.286702 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
It's the same on FS 1.7 as well.
This strongly suggests that CentOS 7.1 and FS have an issue together. Can anyone help me diagnose further or shed some more light on this, please?
This all came to light as I tried to understand why FS would not accept the CLI connection for several minutes after I thought it had booted up (using -nc from the systemd service).
Thanks to the FS userlist and ultimately Anthony Minessale, the issue was to do with RNG entropy.
This is a good explanation -
https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged
Here are some extracts:
There are two general random devices on Linux: /dev/random and
/dev/urandom. The best randomness comes from /dev/random, since it's a
blocking device, and will wait until sufficient entropy is available
to continue providing output.
The key here is that it's a blocking device, so any program waiting for a random number from /dev/random will pause until sufficient entropy is available for a "safe" random number.
This is a headless server, so the usual sources of entropy such as mouse/keyboard activity (and many others) do not apply. Thus the delays.
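A quick way to confirm entropy starvation, assuming the stock kernel interface, is to watch the available entropy pool; values stuck in the low hundreds while FreeSWITCH is hanging point at this problem:
# Show how many bits of entropy the kernel currently has pooled
cat /proc/sys/kernel/random/entropy_avail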
The fix is this:
Based on the HAVEGE principle, and previously based on its associated
library, haveged allows generating randomness based on variations in
code execution time on a processor......(google the rest!)
Install like this:
yum install haveged
and start it up like this:
haveged -w 1024
making sure it restarts on reboot:
chkconfig haveged on
Hope this helps someone.
I am trying to run Spark on a Mesos cluster.
When I run ./bin/spark-shell --master mesos://host:5050 from the machine where the Mesos master runs, everything works. However, if I run the same command from a different machine, the process ends up hanging after trying to connect:
I0825 07:30:10.184141 27380 sched.cpp:126] Version: 0.19.0
I0825 07:30:10.187476 27385 sched.cpp:222] New master detected at master@192.168.0.241:5050
I0825 07:30:10.187619 27385 sched.cpp:230] No credentials provided. Attempting to register without authentication
On the mesos master I see the following output:
[...]
I0825 15:30:23.928402 23214 master.cpp:684] Giving framework 20140825-143817-4043352256-5050-23194-0002 0ns to failover
I0825 15:30:23.929033 23210 master.cpp:2849] Framework failover timeout, removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929095 23210 master.cpp:3344] Removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929687 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):6831; disk(*):455983; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-31 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.935073 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):15001; disk(*):917264; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-29 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938248 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):6823; disk(*):455991; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-32 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938356 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):4939; disk(*):457873; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-28 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938397 23210 hierarchical_allocator_process.hpp:362] Removed framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:27.952940 23215 http.cpp:452] HTTP request for '/master/state.json'
W0825 15:30:29.595441 23208 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-32 on slave 20140822-144404-4043352256-5050-15999-32 at slave(1)@192.168.0.233:5051 (cluster2)
W0825 15:30:29.596709 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-29 on slave 20140822-144404-4043352256-5050-15999-29 at slave(1)@192.168.0.241:5051 (cluster4)
W0825 15:30:29.615630 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-31 on slave 20140822-144404-4043352256-5050-15999-31 at slave(1)@192.168.0.213:5051 (cluster3)
W0825 15:30:29.935130 23214 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-28 on slave 20140822-144404-4043352256-5050-15999-28 at slave(1)@192.168.0.212:5051 (cluster1)
Whereas the slaves output the following:
[...]
I0825 15:30:08.450343 980 slave.cpp:1337] Asked to shut down framework 20140825-143817-4043352256-5050-23194-0002 by master@192.168.0.241:5050
I0825 15:30:08.455153 980 slave.cpp:1362] Shutting down framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:08.455401 980 slave.cpp:2698] Shutting down executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456045 982 slave.cpp:2768] Killing executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456217 982 mesos_containerizer.cpp:992] Destroying container '37cc2b09-0e6d-4738-a837-7956367bba2b'
I0825 15:30:14.134845 977 mesos_containerizer.cpp:1108] Executor for container '37cc2b09-0e6d-4738-a837-7956367bba2b' has exited
I0825 15:30:14.135220 978 slave.cpp:2413] Executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002 has terminated with signal Killed
I0825 15:30:14.135356 978 slave.cpp:2552] Cleaning up executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135499 978 slave.cpp:2627] Cleaning up framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135627 976 status_update_manager.cpp:282] Closing status update streams for framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135571 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31/runs/37cc2b09-0e6d-4738-a837-7956367bba2b' for gc 6.99999843242074days in the future
I0825 15:30:14.135910 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31' for gc 6.99999843187556days in the future
I0825 15:30:14.135980 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002' for gc 6.99999843111111days in the future
I0825 15:31:04.450660 978 slave.cpp:2873] Current usage 60.67%. Max allowed age: 2.053113079446458days
Has anyone seen anything similar?
The problem turned out to be caused not by network connectivity issues but by the Mesos slave recovery policy, as outlined here: http://mesos.apache.org/documentation/latest/slave-recovery/
I would initially connect slaves to the master and then disconnect them because of an unrelated problem; however, when I later tried to connect the slaves again, they were being dropped by the master. To quote the documentation linked above:
A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using monit).
I solved the problem by connecting the slaves with the --strict option set to false.
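For reference, a sketch of what that looks like when starting a slave by hand; the master address is taken from the logs above, and depending on the Mesos version the flag may also be accepted as --no-strict:
# Start the slave with recovery strictness disabled, as described above
mesos-slave --master=192.168.0.241:5050 --strict=false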