Freeswitch pauses on check_ip at boot on centos 7.1 - linux

During an investigation into a different problem (Inconsistent systemd startup of freeswitch) I discovered that both the latest freeswitch 1.6 and 1.7 paused for several minutes at a time (between 4 and 14) during boot up on centos 7.1. Whilst it was intermittent, it was as often as one time in 3 or 4.
Running this from the command line :
/usr/bin/freeswitch -nonat -db /dev/shm -log /usr/local/freeswitch/log -conf /usr/local/freeswitch/conf -run /usr/local/freeswitch/run
caused the following output (note the time difference between the Add task 2 and the line after it) :
2015-10-23 15:40:14.160101 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 15:40:14.170805 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 15:40:14.272850 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 15:40:14.282317 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 15:40:14.285266 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 15:40:14.293743 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445611214
2015-10-23 15:40:14.293837 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445611214
2015-10-23 15:49:47.883158 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
When I ran it from 1.6 on centos6.7 using the same command line as above I got this - note the delay is a more reasonable 14 seconds :
2015-10-23 10:31:00.274533 [INFO] switch_event.c:685 Activate Eventing Engine.
2015-10-23 10:31:00.285807 [WARNING] switch_event.c:656 Create additional event dispatch thread 0
2015-10-23 10:31:00.434780 [INFO] switch_core_sqldb.c:3381 Opening DB
2015-10-23 10:31:00.465158 [INFO] switch_core_sqldb.c:1693 CORE Starting SQL thread.
2015-10-23 10:31:00.481306 [DEBUG] switch_scheduler.c:249 Added task 1 heartbeat (core) to run at 1445610660
2015-10-23 10:31:00.481446 [DEBUG] switch_scheduler.c:249 Added task 2 check_ip (core) to run at 1445610660
2015-10-23 10:31:00.481723 [NOTICE] switch_scheduler.c:183 Starting task thread
2015-10-23 10:31:14.286702 [NOTICE] switch_core.c:1386 Created ip list rfc6598.auto default (deny)
It's the same on FS 1.7 as well.
This suggests heavily that centos 7.1 & FS have an issue together. Can anyone help me diagnose further or shine some more light on this, please?
This all came to light as I tried to understand why FS would not accept the cli connection for several minutes after I thought it had booted up (using -nc from systemd service).

Thanks to the FS userlist and ultimately Anthony Minessale, the issue was to do with RNG entropy.
This is a good explanation -
https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged
Here are some extracts :
There are two general random devices on Linux: /dev/random and
/dev/urandom. The best randomness comes from /dev/random, since it's a
blocking device, and will wait until sufficient entropy is available
to continue providing output.
The key here is that it's a blocking device, so any program waiting for a random number from /dev/random will pause until sufficient entropy is available for a "safe" random number.
This is a headless server, so the usual sources of entropy such as mouse/keyboard activity (and many others) do not apply. Thus the delays,
The fix is this :
Based on the HAVEGE principle, and previously based on its associated
library, haveged allows generating randomness based on variations in
code execution time on a processor......(google the rest!)
Install like this :
yum install haveged
and start it up like this :
haveged -w 1024
making sure it restarts on reboot :
chkconfig haveged on
Hope this helps someone.

Related

arangodb starter mode does not start

I have d/l'd arangodb3-linux-3.9.2 from GIT on Centos 7. I created a database dir and ran the README instructions for a standalone start. The first time it runs, I get 100 failures, the key INFO log lines seem to be
... [INFO] server started component=arangodb pid=49827 type=single
... [INFO] Wait on 49827 returned component=arangodb exit-status=1 trap-cause=-1
It creates the log file, setup.json and a single8529 dir in the database dir I sped'd. Is it just taking too long to start? The whole 100 fails take about 1 or 2 seconds.
If I try to run it again with the same README instructions, the next time I get this error:
... [FATAL] Failed to run service error="open /.../single8529/data/ENGINE: no such file"
I have also tried with --starter.host 127.0.0.1 -- to simplify
Also I and can confirm that port 8529 is open
I couldn't get arangodb 'starter' according to their README to work, but this does start the server:
arangod --database.directory MYDIR --rocksdb.max-background-jobs 4

flutter: Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)

I'm working with flutter in ubuntu 18.04. I can't run my project although flutter doctor is no problem. I stuck in the problem of Gradle daemon. I have tried the method in gradle user guide and other questions in SO to add org.gradle.daemon=false in gradle.properties, but the same problem. What's wrong?
Launching lib/main.dart on M2006C3LC in debug mode...
The message received from the daemon indicates that the daemon has disappeared.
Build request sent: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
Attempting to read last messages from the daemon log...
Daemon pid: 3422
log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
----- Last 20 lines from daemon log file - daemon-3422.out.log -----
11:38:04.753 [DEBUG] [org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler] Starting executing command: Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android} with connection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484.
11:38:04.755 [ERROR] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Command execution: started DaemonCommandExecution[command = Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}, connection = DefaultDaemonConnection: socket connection from /0:0:0:0:0:0:0:1:41393 to /0:0:0:0:0:0:0:1:46484] after 0.0 minutes of idle
11:38:04.756 [INFO] [org.gradle.launcher.daemon.server.DaemonRegistryUpdater] Marking the daemon as busy, address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.763 [DEBUG] [org.gradle.launcher.daemon.registry.PersistentDaemonRegistry] Marking busy by address: [f54344e9-fcb9-455f-a518-b1d7af504663 port:41393, addresses:[/0:0:0:0:0:0:0:1%lo, /127.0.0.1]]
11:38:04.790 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Waiting to acquire exclusive lock on daemon addresses registry.
11:38:04.791 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Lock acquired on daemon addresses registry.
11:38:04.804 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] Releasing lock on daemon addresses registry.
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] resetting idle timer
11:38:04.805 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon is running. Sleeping until state changes.
11:38:04.810 [INFO] [org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy] Daemon is about to start building Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}. Dispatching build started information...
11:38:04.811 [DEBUG] [org.gradle.launcher.daemon.server.SynchronizedDispatchConnection] thread 17: dispatching class org.gradle.launcher.daemon.protocol.BuildStarted
11:38:04.812 [DEBUG] [org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment] Configuring env variables: {PATH=/home/byhuang/src/FlutterSDK/flutter/bin/cache/dart-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/byhuang/.pub-cache/bin:~/src/FlutterSDK/flutter/bin, XAUTHORITY=/run/user/1000/gdm/Xauthority, INVOCATION_ID=bb1359976cdb403b93f2e6bd5fa0d6e7, XMODIFIERS=#im=ibus, GDMSESSION=ubuntu, XDG_DATA_DIRS=/usr/share/ubuntu:/usr/local/share:/usr/share:/var/lib/snapd/desktop, TEXTDOMAINDIR=/usr/share/locale/, GTK_IM_MODULE=ibus, DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, PUB_HOSTED_URL=https://pub.flutter-io.cn, XDG_CURRENT_DESKTOP=ubuntu:GNOME, JOURNAL_STREAM=9:49578, SSH_AGENT_PID=1797, COLORTERM=truecolor, QT4_IM_MODULE=xim, SESSION_MANAGER=local/byhuang-virtual-machine:#/tmp/.ICE-unix/1702,unix/byhuang-virtual-machine:/tmp/.ICE-unix/1702, USERNAME=byhuang, LOGNAME=byhuang, PWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, MANAGERPID=1667, IM_CONFIG_PHASE=2, LANGUAGE=zh_CN:zh, LESSOPEN=| /usr/bin/lesspipe %s, SHELL=/bin/bash, OLDPWD=/home/byhuang/flutterOpenProject/flutter_init_demo/android, GNOME_DESKTOP_SESSION_ID=this-is-deprecated, GTK_MODULES=gail:atk-bridge, GNOME_TERMINAL_SCREEN=/org/gnome/Terminal/screen/1c43d352_e494_406b_b4b1_8e1a6bf95400, CLUTTER_IM_MODULE=xim, TEXTDOMAIN=im-config, DBUS_STARTER_ADDRESS=unix:path=/run/user/1000/bus,guid=e1b93d29865be954f2c06fce5fe015cb, FLUTTER_STORAGE_BASE_URL=https://storage.flutter-io.cn, XDG_SESSION_DESKTOP=ubuntu, LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:, SHLVL=2, LESSCLOSE=/usr/bin/lesspipe %s %s, QT_IM_MODULE=xim, JAVA_HOME=/home/byhuang/software/android-studio/jre, TERM=xterm-256color, FLUTTER_ROOT=/home/byhuang/src/FlutterSDK/flutter, XDG_CONFIG_DIRS=/etc/xdg/xdg-ubuntu:/etc/xdg, GNOME_TERMINAL_SERVICE=:1.59, LANG=zh_CN.UTF-8, XDG_SESSION_TYPE=x11, XDG_SESSION_ID=4, DISPLAY=:1, FLUTTER_SUPPRESS_ANALYTICS=true, GPG_AGENT_INFO=/run/user/1000/gnupg/S.gpg-agent:0:1, DESKTOP_SESSION=ubuntu, USER=byhuang, XDG_MENU_PREFIX=gnome-, VTE_VERSION=5202, WINDOWPATH=1, QT_ACCESSIBILITY=1, XDG_SEAT=seat0, SSH_AUTH_SOCK=/run/user/1000/keyring/ssh, FLUTTER_ALREADY_LOCKED=true, GNOME_SHELL_SESSION_MODE=ubuntu, XDG_RUNTIME_DIR=/run/user/1000, XDG_VTNR=1, DBUS_STARTER_BUS_TYPE=session, HOME=/home/byhuang}
11:38:04.826 [DEBUG] [org.gradle.launcher.daemon.server.exec.LogToClient] About to start relaying all logs to the client via the connection.
11:38:04.826 [INFO] [org.gradle.launcher.daemon.server.exec.LogToClient] The client will now receive all logging from the daemon (pid: 3422). The daemon log file: /home/byhuang/.gradle/daemon/5.6.2/daemon-3422.out.log
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.exec.RequestStopIfSingleUsedDaemon] Requesting daemon stop after processing Build{id=0846277b-6ded-4171-9a9a-ce34e51b07be, currentDir=/home/byhuang/flutterOpenProject/flutter_init_demo/android}
11:38:04.832 [LIFECYCLE] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Daemon will be stopped at the end of the build stopping after processing
11:38:04.832 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] Stop as soon as idle requested. The daemon is busy: true
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.DaemonStateCoordinator] daemon stop has been requested. Sleeping until state changes.
11:38:04.835 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] The daemon has started executing the build.
11:38:04.839 [DEBUG] [org.gradle.launcher.daemon.server.exec.ExecuteBuild] Executing build with daemon context: DefaultDaemonContext[uid=cbbde454-54e4-47ec-b258-28179c1fddfe,javaHome=/home/byhuang/software/android-studio/jre,daemonRegistryDir=/home/byhuang/.gradle/daemon,pid=3422,idleTimeout=120000,priority=NORMAL,daemonOpts=-Xmx1536M,-Dfile.encoding=UTF-8,-Duser.country=CN,-Duser.language=zh,-Duser.variant]
----- End of the daemon log -----
FAILURE: Build failed with an exception.
* What went wrong:
Gradle build daemon disappeared unexpectedly (it may have been killed or may have crashed)
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
Running Gradle task 'assembleDebug'...
Running Gradle task 'assembleDebug'... Done 29.2s
Exception: Gradle task assembleDebug failed with exit code 1
I faced the same error today in Ubuntu. To temporarily bypass the error, adding --no-daemon to whichever gradle command I was running worked. Deleting the daemon subfolder in /home/{username}/.gradle folder permanently fixed the error.
go to project file -> android-> delete .gradle file.
As the other answers suggest delete the .gradle folder. But that didn't work for me, I had to do that and then shut down and restart my ubuntu.

could not run codedeployagent on windows 10

I am following this note: https://docs.aws.amazon.com/codedeploy/latest/userguide/register-on-premises-instance-iam-user-arn.html however could not run AWS CodeDeployAgent on Windows 10 machine.
Log file:
2019-11-23T20:28:51 INFO [codedeploy-agent(2988)]: Version file found in C:/ProgramData/Amazon/CodeDeploy/.version with agent version OFFICIAL_1.0.1.1597_msi.
2019-11-23T20:28:51 ERROR [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Missing credentials - please check if this instance was started with an IAM instance profile
2019-11-23T20:28:51 DEBUG [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Sleeping 89 seconds.
2019-11-23T20:30:03 INFO [codedeploy-agent(2988)]: CodeDeploy Instance Agent Service: stopping the agent
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: Gracefully shutting down agent child threads now, will wait up to 7200 seconds
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: InstanceAgent::Plugins::CodeDeployPlugin::CommandPoller: All agent child threads have been shut down
2019-11-23T20:30:20 INFO [codedeploy-agent(2988)]: CodeDeploy Instance Agent Service: command execution threads shutdown, agent exiting now
Any idea please?
The CodeDeploy agent is not tested on Windows 10 for EC2 or OnPrem installation. Please refer to this table for supported OS list:
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent.html#codedeploy-agent-supported-operating-systems-ec2
https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent.html#codedeploy-agent-supported-operating-systems-on-premises

How to enable a Spark-Mesos job to be launched from inside a Docker container?

Summary:
Is it possible to submit a Spark job on Mesos from inside a Docker container with 1 Mesos master (no Zookeeper) and 1 Mesos agent also each running in separate Docker containers (on the same host for now)? The Mesos containerizer described at http://mesos.apache.org/documentation/latest/container-image/ seems to apply to the case where the Mesos application is simply encapsulated in a Docker container and run. My Docker application is more interactive with multiple PySpark Mesos jobs being instantiated at run-time based on user input. The driver program in the Docker container is not itself run as a Mesos app. Only the user-initiated job requests are handled as PySpark Mesos apps.
Specifics:
I have 3 Docker containers based on centos:7 linux, and running on the same host machine for now:
Container "Master" running a Mesos Master.
Container "Agent" running a Mesos Agent.
Container "Test" with Spark and Mesos installed where I run a bash shell and launch the following PySpark test program from the command line.
from pyspark import SparkContext, SparkConf
from operator import add
# Configure Spark
sp_conf = SparkConf()
sp_conf.setAppName("spark_test")
sp_conf.set("spark.scheduler.mode", "FAIR")
sp_conf.set("spark.dynamicAllocation.enabled", "false")
sp_conf.set("spark.driver.memory", "500m")
sp_conf.set("spark.executor.memory", "500m")
sp_conf.set("spark.executor.cores", 1)
sp_conf.set("spark.cores.max", 1)
sp_conf.set("spark.mesos.executor.home", "/usr/local/spark-2.1.0")
sp_conf.set("spark.executor.uri", "file://usr/local/spark-2.1.0-bin-without-hadoop.tgz")
sc = SparkContext(conf=sp_conf)
# Simple computation
x = [(1.5,100.),(1.5,200.),(1.5,300.),(2.5,150.)]
rdd = sc.parallelize(x,1)
tot = rdd.foldByKey(0,add).collect()
cnt = rdd.countByKey()
time = [t[0] for t in tot]
avg = [t[1]/cnt[t[0]] for t in tot]
print 'tot=', tot
print 'cnt=', cnt
print 't=', time
print 'avg=', avg
The relevant software versions I am using are as follows:
Hadoop: 2.7.3
Spark: 2.1.0
Mesos: 1.2.0
Docker: 17.03.1-ce, build c6d412e
The following works fine:
I can run the simple PySpark test program above from inside the Test container with Spark's MASTER=local[N] for N=1 or N=4.
I can see in the Mesos logs and in the Mesos user interface (UI) that the Mesos agent and master come up fine. The Mesos UI shows that the agent is connected with plenty of resources (cpu, memory, disk).
I can run the Mesos Python tests successfully from inside the Test container with /usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050. This seems to confirm that the Mesos containers can be accessed from within my Test container, but these tests are not using Spark.
This is the Failure:
With Spark's MASTER=mesos://127.0.0.1:5050, when I launch my PySpark test program from inside the Test container there is activity in the logs of both the Mesos Master and Agent, and in the couple seconds before failure, the Mesos UI shows resources assigned for the job that are well within what is available. However, the PySpark test program then fails with: WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
The steps I followed are as follows.
Start Mesos Master:
docker run -it --net=host -p 5050:5050 the_master
Relevant excerpts from the master's log shows:
I0418 01:05:08.540192 27 master.cpp:383] Master 15b354eb-6a20-4bc9-a13b-6533b1e91bd2 (localhost) started on 127.0.0.1:5050
I0418 01:05:08.540210 27 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/mesos-1.2.0/build/../src/webui" --work_dir="/var/lib/mesos" --zk_session_timeout="10secs"
Start Mesos Agent:
docker run -it --net=host -e MESOS_AGENT_PORT=5051 the_agent
The agent's log shows:
I0418 01:42:00.234244 40 slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_mesos_image="spark-mesos-agent-test" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/local/mesos-1.2.0/build/src" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" --systemd_enable_support="false" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I get the following warning for both the Mesos Master and Agent, but ignore it because I am running everything on the same host for now:
Master/Agent bound to loopback interface! Cannot communicate with remote schedulers or agents. You might want to set '--ip' flag to a routable IP address.
In fact, my tests with assigning a routable IP address instead of 127.0.0.1 failed to change any of the behavior I describe here.
Start Test Container (with bash shell for testing):
docker run -it --net=host the_test /bin/bash
Some relevant environment variables set inside all three container (Master, Agent, and Test):
HADOOP_HOME=/usr/local/hadoop-2.7.3
HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
SPARK_HOME=/usr/local/spark-2.1.0
SPARK_EXECUTOR_URI=file:////usr/local/spark-2.1.0-bin-without-hadoop.tgz
MASTER=mesos://127.0.0.1:5050
PYSPARK_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_DRIVER_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_SUBMIT_ARGS=--driver-memory=4g pyspark-shell
MESOS_PORT=5050
MESOS_IP=127.0.0.1
MESOS_WORKDIR=/var/lib/mesos
MESOS_HOME=/usr/local/mesos-1.2.0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
MESOS_MASTER=mesos://127.0.0.1:5050
PYTHONPATH=:/usr/local/spark-2.1.0/python:/usr/local/spark-2.1.0/python/lib/py4j-0.10.1-src.zip
Run Mesos (non-Spark) tests from inside the Test container:
/usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050
This produces the following log output (as expected I think):
I0417 21:28:36.912542 20 sched.cpp:232] Version: 1.2.0
I0417 21:28:36.920013 62 sched.cpp:336] New master detected at master#127.0.0.1:5050
I0417 21:28:36.920472 62 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:28:36.924165 62 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Registered with framework ID be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Received offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0 with cpus: 16.0 and mem: 119640.0
Launching task 0 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 1 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 2 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 3 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 4 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Task 0 is in state TASK_RUNNING
Task 1 is in state TASK_RUNNING
Task 2 is in state TASK_RUNNING
Task 3 is in state TASK_RUNNING
Task 4 is in state TASK_RUNNING
Task 0 is in state TASK_FINISHED
Task 1 is in state TASK_FINISHED
Task 2 is in state TASK_FINISHED
Task 3 is in state TASK_FINISHED
Task 4 is in state TASK_FINISHED
All tasks done, waiting for final framework message
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
All tasks done, and all messages received, exiting
Run PySpark test program from inside the Test container:
python spark_test.py
This produces the following log output:
17/04/17 21:29:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0417 21:29:19.187747 205 sched.cpp:232] Version: 1.2.0
I0417 21:29:19.196535 188 sched.cpp:336] New master detected at master#127.0.0.1:5050
I0417 21:29:19.197453 188 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:29:19.201884 195 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0001
17/04/17 21:29:34 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I searched for this error on the internet but every page I found indicates that it is a common error caused by insufficient resources being allocated to the Mesos agent. As I mentioned, the Mesos UI indicates that there are sufficient resources. Please respond if you have any idea why my Spark job is not accepting resources from Mesos or if you have any suggestions of things I could try.
Thank you for your help.
This error is now resolved. In case anybody encounters a similar problem, I wanted to post that in my case it was caused by the HADOOP CLASSPATH not being set in the Mesos Master and Agent containers. Once set, everything works as expected.

(bdutil) Unable to get hadoop/spark cluster working with a fresh install

I'm setting up a tiny cluster in GCE to play around with it but although instances are created some failures prevent to get it working. I'm following the steps in https://cloud.google.com/hadoop/downloads
So far I'm using (as of now) lastest versions of gcloud (143.0.0) and bdutil (1.3.5), freshly installed.
./bdutil deploy -e extensions/spark/spark_env.sh
using debian-8 as image (as bdutil still uses debian-7-backports).
At some point I got
Fri Feb 10 16:19:34 CET 2017: Command failed: wait ${SUBPROC} on line 326.
Fri Feb 10 16:19:34 CET 2017: Exit code of failed command: 1
full debug output is in https://gist.github.com/jlorper/4299a816fc0b140575ed70fe0da1f272
(project id and bucket names changed)
Instances are created, but spark not even installed. Digging a bit I've managed to run spark installation and start hadoop commands in the master after after ssh. But it fails badly when starting the spark-shell:
17/02/10 15:53:20 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.5-hadoop1
17/02/10 15:53:20 INFO gcsio.FileSystemBackedDirectoryListCache: Creating '/hadoop_gcs_connector_metadata_cache' with createDirectories()...
java.lang.RuntimeException: java.lang.RuntimeException: java.nio.file.AccessDeniedException: /hadoop_gcs_connector_metadata_cache
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
and not able to import sparkSQL. For what I've read everything should be started automatically.
Up to this point I'm a bit lost and don't know what else to do.
Am I missing any step? Is any of the commands faulty? Thanks in advance.
Update: solved
As pointed out in accepted solution I cloned the repo and cluster was created without issues. When trying to start the spark-shell though it gave
java.lang.RuntimeException: java.io.IOException: GoogleHadoopFileSystem has been closed or not initialized.`
That sounded to me like connectors were not initialized properly, so after running
./bdutil --env_var_files extensions/spark/spark_env.sh,bigquery_env.sh run_command_group install_connectors
it worked as expected.
The last version of bdutil on https://cloud.google.com/hadoop/downloads is a bit stale and I'd instead recommend using the version of bdutil at head on github: https://github.com/GoogleCloudPlatform/bdutil.

Resources