I am trying to run Spark on a Mesos cluster.
When I run ./bin/spark-shell --master mesos://host:5050 from the machine where I run the Mesos master, everything works. However, if I run the same command from a different machine, the process ends up hanging after trying to connect:
I0825 07:30:10.184141 27380 sched.cpp:126] Version: 0.19.0
I0825 07:30:10.187476 27385 sched.cpp:222] New master detected at master@192.168.0.241:5050
I0825 07:30:10.187619 27385 sched.cpp:230] No credentials provided. Attempting to register without authentication
On the Mesos master I see the following output:
[...]
I0825 15:30:23.928402 23214 master.cpp:684] Giving framework 20140825-143817-4043352256-5050-23194-0002 0ns to failover
I0825 15:30:23.929033 23210 master.cpp:2849] Framework failover timeout, removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929095 23210 master.cpp:3344] Removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929687 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):6831; disk(*):455983; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-31 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.935073 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):15001; disk(*):917264; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-29 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938248 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):6823; disk(*):455991; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-32 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938356 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):4939; disk(*):457873; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-28 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938397 23210 hierarchical_allocator_process.hpp:362] Removed framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:27.952940 23215 http.cpp:452] HTTP request for '/master/state.json'
W0825 15:30:29.595441 23208 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-32 on slave 20140822-144404-4043352256-5050-15999-32 at slave(1)@192.168.0.233:5051 (cluster2)
W0825 15:30:29.596709 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-29 on slave 20140822-144404-4043352256-5050-15999-29 at slave(1)@192.168.0.241:5051 (cluster4)
W0825 15:30:29.615630 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-31 on slave 20140822-144404-4043352256-5050-15999-31 at slave(1)@192.168.0.213:5051 (cluster3)
W0825 15:30:29.935130 23214 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-28 on slave 20140822-144404-4043352256-5050-15999-28 at slave(1)@192.168.0.212:5051 (cluster1)
Whereas the slaves output:
[...]
I0825 15:30:08.450343 980 slave.cpp:1337] Asked to shut down framework 20140825-143817-4043352256-5050-23194-0002 by master@192.168.0.241:5050
I0825 15:30:08.455153 980 slave.cpp:1362] Shutting down framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:08.455401 980 slave.cpp:2698] Shutting down executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456045 982 slave.cpp:2768] Killing executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456217 982 mesos_containerizer.cpp:992] Destroying container '37cc2b09-0e6d-4738-a837-7956367bba2b'
I0825 15:30:14.134845 977 mesos_containerizer.cpp:1108] Executor for container '37cc2b09-0e6d-4738-a837-7956367bba2b' has exited
I0825 15:30:14.135220 978 slave.cpp:2413] Executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002 has terminated with signal Killed
I0825 15:30:14.135356 978 slave.cpp:2552] Cleaning up executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135499 978 slave.cpp:2627] Cleaning up framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135627 976 status_update_manager.cpp:282] Closing status update streams for framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135571 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31/runs/37cc2b09-0e6d-4738-a837-7956367bba2b' for gc 6.99999843242074days in the future
I0825 15:30:14.135910 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31' for gc 6.99999843187556days in the future
I0825 15:30:14.135980 975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002' for gc 6.99999843111111days in the future
I0825 15:31:04.450660 978 slave.cpp:2873] Current usage 60.67%. Max allowed age: 2.053113079446458days
Has anyone seen anything similar?
The problem turned out to be caused not by network connectivity issues but by the Mesos slave recovery policy, as outlined here: http://mesos.apache.org/documentation/latest/slave-recovery/
I had initially connected the slaves to the master and then disconnected them because of an unrelated problem; when I later tried to connect the slaves again, they were being dropped by the master. To quote the documentation linked above:
A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using monit).
I solved the problem by connecting the slaves with the --strict option set to false.
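For reference, the slave start command now looks roughly like the sketch below; it assumes the stock mesos-slave binary and the master address from the logs above, so adjust the paths and any other flags to your own deployment:

mesos-slave --master=192.168.0.241:5050 \
            --work_dir=/tmp/mesos \
            --strict=false   # per the fix above: start the slave with strict recovery disabled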
Related
I came in today only to find that all 3 of our Cassandra nodes in one of our labs were down. I keep seeing this INFO message in the logs. Does this mean that Cassandra is running out of memory?
INFO [main] 2020-10-12 15:11:56,014 CassandraDaemon.java:493 - JVM Arguments: [-Xloggc:/opt/cassandra/logs/gc.log, -ea, -XX:+UseThreadPriorities, -XX:ThreadPriorityPolicy=42, -XX:+HeapDumpOnOutOfMemoryError, -Xss256k, -XX:StringTableSize=1000003, -XX:+AlwaysPreTouch, -XX:-UseBiasedLocking, -XX:+UseTLAB, -XX:+ResizeTLAB, -XX:+UseNUMA, -XX:+PerfDisableSharedMem, -Djava.net.preferIPv4Stack=true, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:CMSWaitDuration=10000, -XX:+CMSParallelInitialMarkEnabled, -XX:+CMSEdenChunksRecordAlways, -XX:+CMSClassUnloadingEnabled, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -XX:+PrintPromotionFailure, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=10, -XX:GCLogFileSize=10M, -Xms899M, -Xmx899M, -Xmn200M, -XX:+UseCondCardMark, -XX:CompileCommandFile=/opt/cassandra/conf/hotspot_compiler, -javaagent:/opt/cassandra/lib/jamm-0.3.0.jar, -Dcassandra.jmx.remote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.authenticate=false, -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password, -Djava.library.path=/opt/cassandra/lib/sigar-bin, -XX:OnOutOfMemoryError=kill -9 %p, -Dlogback.configurationFile=logback.xml, -Dcassandra.logdir=/opt/cassandra/logs, -Dcassandra.storagedir=/opt/cassandra/data, -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid]
They're running in Amazon EC2. Would the next logical course of action be to increase the node size?
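If it helps, I assume something like the following (run on each node) would show heap used vs. max, though I haven't captured that output yet:

nodetool info | grep -i heap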
I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as:
...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
...
Caused by java.io.EOFException
...
...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node
...spark.storage.BlockManagerMasterEndpoint: Error try to remove broadcast 3 from block manager BlockManagerId(...)
Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.
My suspicion is that these opaque errors are coming from the nodes or executors running out of memory, but there don't seem to be any granular memory-related metrics exposed by Dataproc.
How can I determine why a node was considered lost? Is there a way I can inspect memory usage per node or executor to validate whether these errors are being caused by high memory usage? If YARN is the one which is killing containers / determining nodes are lost, then hopefully there's a way to introspect why?
This is because you are using preemptible VMs, which are short-lived: they can be reclaimed at any time and last at most 24 hours. When GCE shuts down a preemptible VM, you see errors like this:
YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a lost node
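To confirm that a node was actually preempted (rather than lost for some other reason), you can list the preemption operations that GCE records; this is a sketch assuming the gcloud SDK is configured for your project:

gcloud compute operations list \
    --project=${PROJECT} \
    --filter="operationType=compute.instances.preempted"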
To check YARN's view of the nodes, open an SSH session from your machine to the cluster. You'll need the gcloud SDK installed for that.
gcloud compute ssh ${HOSTNAME}-m --project=${PROJECT}
Then run the following commands in the cluster.
List all nodes in the cluster
yarn node -list
Then use ${NodeID} to get a report on the node's state.
yarn node -status ${NodeID}
You could also set up local port forwarding via SSH to the YARN web UI instead of running commands directly on the cluster.
gcloud compute ssh ${HOSTNAME}-m \
--project=${PROJECT} -- \
-L 8088:${HOSTNAME}-m:8088 -N
Then go to http://localhost:8088/cluster/apps in your browser.
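If you also want to see whether YARN itself killed containers (e.g. for exceeding memory limits), the aggregated application logs are worth checking; this is a sketch assuming log aggregation is enabled, with a placeholder application ID (find the real one with yarn application -list):

yarn logs -applicationId application_1234567890123_0001 | grep -i -E "killed|memory|exit status"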
I am trying to deploy a prediction web service to Azure with the ML Workbench process, using cluster mode as described in this tutorial (https://learn.microsoft.com/en-us/azure/machine-learning/preview/tutorial-classifying-iris-part-3#prepare-to-operationalize-locally).
The model, the scoring script, and the schema all get sent into the manifest, but the service creation then fails:
Creating service..........................................................Error occurred: {'Error': {'Code': 'KubernetesDeploymentFailed', 'Details': [{'Message': 'Back-off 40s restarting failed container=...pod=...', 'Code': 'CrashLoopBackOff'}], 'StatusCode': 400, 'Message': 'Kubernetes Deployment failed'}, 'OperationType': 'Service', 'State':'Failed', 'Id': '...', 'ResourceLocation': '/api/subscriptions/...', 'CreatedTime': '2017-10-26T20:30:49.77362Z','EndTime': '2017-10-26T20:36:40.186369Z'}
Here is the result of checking the ML service realtime logs:
C:\Users\userguy\Documents\azure_ml_workbench\projecto>az ml service logs realtime -i projecto
2017-10-26 20:47:16,118 CRIT Supervisor running as root (no user in config file)
2017-10-26 20:47:16,120 INFO supervisord started with pid 1
2017-10-26 20:47:17,123 INFO spawned: 'rsyslog' with pid 9
2017-10-26 20:47:17,124 INFO spawned: 'program_exit' with pid 10
2017-10-26 20:47:17,124 INFO spawned: 'nginx' with pid 11
2017-10-26 20:47:17,125 INFO spawned: 'gunicorn' with pid 12
2017-10-26 20:47:18,160 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:18,160 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:22,164 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-10-26T20:47:22.519159Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2017-10-26T20:47:22.520097Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (12)
2017-10-26T20:47:22.520375Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2017-10-26T20:47:22.521757Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22
2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request
2017-10-26 20:47:27,669 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26T20:47:27.669556Z, INFO, 00000000-0000-0000-0000-000000000000, , Handling signal: term
2017-10-26 20:47:30,673 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:33,675 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
Initializing logger
2017-10-26T20:47:36.564469Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2017-10-26T20:47:36.564991Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2017-10-26T20:47:36.565316Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2017-10-26T20:47:36.565642Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2017-10-26 20:47:36.715933: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36,716 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:36.716376: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716542: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716860: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
this is the init
2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully
Using TensorFlow backend.
2017-10-26T20:47:37.553751Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 22)
2017-10-26T20:47:37.885303Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL
2017-10-26 20:47:37,886 INFO stopped: gunicorn (terminated by SIGKILL)
2017-10-26 20:47:37,889 INFO stopped: nginx (exit status 0)
2017-10-26 20:47:37,890 INFO stopped: program_exit (terminated by SIGTERM)
2017-10-26 20:47:37,891 INFO stopped: rsyslog (exit status 0)
Received 41 lines of log
My best guess is that there's something silent happening that causes "WARN received SIGTERM indicating exit request". The rest of the scoring.py script seems to kick off: TensorFlow gets initialized and the "this is the init" print statement runs.
http://127.0.0.1:63437 is accessible from my local machine, but the UI endpoint is blank.
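From what I've read, the usual way to dig into a CrashLoopBackOff pod would be something like the commands below (the pod name is a placeholder, and I'm not sure how to point kubectl at the cluster that ML Workbench created):

kubectl get pods
kubectl describe pod projecto-scoring-pod      # placeholder pod name
kubectl logs projecto-scoring-pod --previous   # logs from the previously crashed container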
Any ideas on how to get this up and running in an Azure cluster? I'm not very familiar with how Kubernetes works, so any basic debugging guidance would be appreciated.
We discovered a bug in our system that could have caused this. The fix was deployed last night. Can you please try again and let us know if you still encounter this issue?
Note: This error was thrown before the components were executed by Spark.
Logs
Worker Node1:
17/05/18 23:12:52 INFO Worker: Successfully registered with master spark://spark-master-1.com:7077
17/05/18 23:58:41 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
Master Node:
17/05/18 23:12:52 INFO Master: Registering worker spark-worker-1.com:56056 with 2 cores, 14.5 GB RAM
17/05/18 23:14:20 INFO Master: Registering worker spark-worker-2.com:53986 with 2 cores, 14.5 GB RAM
17/05/18 23:59:42 WARN Master: Removing spark-worker-1.com-56056 because we got no heartbeat in 60 seconds
17/05/18 23:59:42 INFO Master: Removing spark-worker-2.com:56056
17/05/19 00:00:03 ERROR Master: RECEIVED SIGNAL 15: SIGTERM
Worker Node2:
17/05/18 23:14:20 INFO Worker: Successfully registered with master spark://spark-master-node-2.com:7077
17/05/18 23:59:40 ERROR Worker: RECEIVED SIGNAL 15: SIGTERM
TL;DR I think someone explicitly ran the kill command or sbin/stop-worker.sh.
"RECEIVED SIGNAL 15: SIGTERM" is reported by the signal handler that Spark registers to log TERM, HUP, and INT signals on UNIX-like systems:
/** Register a signal handler to log signals on UNIX-like systems. */
def registerLogger(log: Logger): Unit = synchronized {
  if (!loggerRegistered) {
    Seq("TERM", "HUP", "INT").foreach { sig =>
      SignalUtils.register(sig) {
        log.error("RECEIVED SIGNAL " + sig)
        false
      }
    }
    loggerRegistered = true
  }
}
In your case it means that the process received SIGTERM to stop itself:
The SIGTERM signal is a generic signal used to cause program termination. Unlike SIGKILL, this signal can be blocked, handled, and ignored. It is the normal way to politely ask a program to terminate.
That's what is sent when you execute kill, or when you use the ./sbin/stop-master.sh or ./sbin/stop-worker.sh shell scripts, which in turn call sbin/spark-daemon.sh with the stop command, which kills the JVM process of a master or a worker:
kill "$TARGET_ID" && rm -f "$pid"
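In other words, the stop script ends up doing roughly the following (a sketch; the exact pid file name depends on SPARK_PID_DIR, which defaults to /tmp, plus the user name and the instance number):

./sbin/stop-worker.sh
# which resolves the worker's pid file and effectively runs:
kill -TERM "$(cat /tmp/spark-$USER-org.apache.spark.deploy.worker.Worker-1.pid)"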
Summary:
Is it possible to submit a Spark job on Mesos from inside a Docker container, with 1 Mesos master (no ZooKeeper) and 1 Mesos agent each also running in separate Docker containers (on the same host for now)? The Mesos containerizer described at http://mesos.apache.org/documentation/latest/container-image/ seems to apply to the case where the Mesos application is simply encapsulated in a Docker container and run. My Docker application is more interactive, with multiple PySpark Mesos jobs being instantiated at run-time based on user input. The driver program in the Docker container is not itself run as a Mesos app; only the user-initiated job requests are handled as PySpark Mesos apps.
Specifics:
I have 3 Docker containers based on centos:7, all running on the same host machine for now:
Container "Master" running a Mesos Master.
Container "Agent" running a Mesos Agent.
Container "Test" with Spark and Mesos installed where I run a bash shell and launch the following PySpark test program from the command line.
from pyspark import SparkContext, SparkConf
from operator import add
# Configure Spark
sp_conf = SparkConf()
sp_conf.setAppName("spark_test")
sp_conf.set("spark.scheduler.mode", "FAIR")
sp_conf.set("spark.dynamicAllocation.enabled", "false")
sp_conf.set("spark.driver.memory", "500m")
sp_conf.set("spark.executor.memory", "500m")
sp_conf.set("spark.executor.cores", 1)
sp_conf.set("spark.cores.max", 1)
sp_conf.set("spark.mesos.executor.home", "/usr/local/spark-2.1.0")
sp_conf.set("spark.executor.uri", "file://usr/local/spark-2.1.0-bin-without-hadoop.tgz")
sc = SparkContext(conf=sp_conf)
# Simple computation
x = [(1.5,100.),(1.5,200.),(1.5,300.),(2.5,150.)]
rdd = sc.parallelize(x,1)
tot = rdd.foldByKey(0,add).collect()
cnt = rdd.countByKey()
time = [t[0] for t in tot]
avg = [t[1]/cnt[t[0]] for t in tot]
print 'tot=', tot
print 'cnt=', cnt
print 't=', time
print 'avg=', avg
The relevant software versions I am using are as follows:
Hadoop: 2.7.3
Spark: 2.1.0
Mesos: 1.2.0
Docker: 17.03.1-ce, build c6d412e
The following works fine:
I can run the simple PySpark test program above from inside the Test container with Spark's MASTER=local[N] for N=1 or N=4.
I can see in the Mesos logs and in the Mesos user interface (UI) that the Mesos agent and master come up fine. The Mesos UI shows that the agent is connected with plenty of resources (cpu, memory, disk).
I can run the Mesos Python tests successfully from inside the Test container with /usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050. This seems to confirm that the Mesos containers can be accessed from within my Test container, but these tests are not using Spark.
This is the Failure:
With Spark's MASTER=mesos://127.0.0.1:5050, when I launch my PySpark test program from inside the Test container, there is activity in the logs of both the Mesos Master and Agent, and in the couple of seconds before failure the Mesos UI shows resources assigned for the job that are well within what is available. However, the PySpark test program then fails with: WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
The steps I followed are as follows.
Start Mesos Master:
docker run -it --net=host -p 5050:5050 the_master
Relevant excerpts from the master's log show:
I0418 01:05:08.540192 27 master.cpp:383] Master 15b354eb-6a20-4bc9-a13b-6533b1e91bd2 (localhost) started on 127.0.0.1:5050
I0418 01:05:08.540210 27 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/mesos-1.2.0/build/../src/webui" --work_dir="/var/lib/mesos" --zk_session_timeout="10secs"
Start Mesos Agent:
docker run -it --net=host -e MESOS_AGENT_PORT=5051 the_agent
The agent's log shows:
I0418 01:42:00.234244 40 slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_mesos_image="spark-mesos-agent-test" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/local/mesos-1.2.0/build/src" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" --systemd_enable_support="false" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I get the following warning for both the Mesos Master and Agent, but ignore it because I am running everything on the same host for now:
Master/Agent bound to loopback interface! Cannot communicate with remote schedulers or agents. You might want to set '--ip' flag to a routable IP address.
In fact, my tests with assigning a routable IP address instead of 127.0.0.1 failed to change any of the behavior I describe here.
Start Test Container (with bash shell for testing):
docker run -it --net=host the_test /bin/bash
Some relevant environment variables set inside all three containers (Master, Agent, and Test):
HADOOP_HOME=/usr/local/hadoop-2.7.3
HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
SPARK_HOME=/usr/local/spark-2.1.0
SPARK_EXECUTOR_URI=file:////usr/local/spark-2.1.0-bin-without-hadoop.tgz
MASTER=mesos://127.0.0.1:5050
PYSPARK_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_DRIVER_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_SUBMIT_ARGS=--driver-memory=4g pyspark-shell
MESOS_PORT=5050
MESOS_IP=127.0.0.1
MESOS_WORKDIR=/var/lib/mesos
MESOS_HOME=/usr/local/mesos-1.2.0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
MESOS_MASTER=mesos://127.0.0.1:5050
PYTHONPATH=:/usr/local/spark-2.1.0/python:/usr/local/spark-2.1.0/python/lib/py4j-0.10.1-src.zip
Run Mesos (non-Spark) tests from inside the Test container:
/usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050
This produces the following log output (as expected I think):
I0417 21:28:36.912542 20 sched.cpp:232] Version: 1.2.0
I0417 21:28:36.920013 62 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0417 21:28:36.920472 62 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:28:36.924165 62 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Registered with framework ID be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Received offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0 with cpus: 16.0 and mem: 119640.0
Launching task 0 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 1 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 2 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 3 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 4 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Task 0 is in state TASK_RUNNING
Task 1 is in state TASK_RUNNING
Task 2 is in state TASK_RUNNING
Task 3 is in state TASK_RUNNING
Task 4 is in state TASK_RUNNING
Task 0 is in state TASK_FINISHED
Task 1 is in state TASK_FINISHED
Task 2 is in state TASK_FINISHED
Task 3 is in state TASK_FINISHED
Task 4 is in state TASK_FINISHED
All tasks done, waiting for final framework message
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
All tasks done, and all messages received, exiting
Run PySpark test program from inside the Test container:
python spark_test.py
This produces the following log output:
17/04/17 21:29:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0417 21:29:19.187747 205 sched.cpp:232] Version: 1.2.0
I0417 21:29:19.196535 188 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0417 21:29:19.197453 188 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:29:19.201884 195 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0001
17/04/17 21:29:34 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I searched for this error on the internet but every page I found indicates that it is a common error caused by insufficient resources being allocated to the Mesos agent. As I mentioned, the Mesos UI indicates that there are sufficient resources. Please respond if you have any idea why my Spark job is not accepting resources from Mesos or if you have any suggestions of things I could try.
Thank you for your help.
This error is now resolved. In case anybody encounters a similar problem, I wanted to post that in my case it was caused by the Hadoop classpath not being set in the Mesos Master and Agent containers. Once it was set, everything worked as expected.
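Concretely, the kind of thing that had to be set in the Master and Agent containers looks roughly like the sketch below; the exact variable names and paths are assumptions based on the environment variables listed earlier, and for a "without-hadoop" Spark build the variable Spark documents is SPARK_DIST_CLASSPATH:

export HADOOP_HOME=/usr/local/hadoop-2.7.3
export HADOOP_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
# Spark's "Hadoop free" builds pick up the Hadoop jars via SPARK_DIST_CLASSPATH
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)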