Azure Service Fabric: cannot run local Service Fabric Cluster

I have a problem with Azure Service Fabric.
I have installed it (on Windows 7) as described in https://azure.microsoft.com/en-gb/documentation/articles/service-fabric-get-started/.
Then I tried to run a Service Fabric application from Visual Studio 2015 and got the error "Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue".
Here is the full log of that run:
1>------ Build started: Project: Application2, Configuration: Debug x64 ------
2>------ Deploy started: Project: Application2, Configuration: Debug x64 ------
-------- Package started: Project: Application2, Configuration: Debug x64 ------
Application2 -> c:\temp\Application2\Application2\pkg\Debug
-------- Package: Project: Application2 succeeded, Time elapsed: 00:00:01.7361084 --------
2>Started executing script 'Set-LocalClusterReady'.
2>Import-Module 'C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\DefaultLocalClusterSetup.psm1'; Set-LocalClusterReady
2>--------------------------------------------
2>Local Service Fabric Cluster is not setup...
2>Please wait while we setup the Local Service Fabric Cluster. This may take few minutes...
2>
2>Using Cluster Data Root: C:\SfDevCluster\Data
2>Using Cluster Log Root: C:\SfDevCluster\Log
2>
2>Create node configuration succeeded
2>Starting service FabricHostSvc. This may take a few minutes...
2>
2>Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
2>Local Cluster ready status: 4% completed.
2>Local Cluster ready status: 8% completed.
2>Local Cluster ready status: 12% completed.
2>Local Cluster ready status: 17% completed.
2>Local Cluster ready status: 21% completed.
2>Local Cluster ready status: 25% completed.
2>Local Cluster ready status: 29% completed.
2>Local Cluster ready status: 33% completed.
2>Local Cluster ready status: 38% completed.
2>Local Cluster ready status: 42% completed.
2>Local Cluster ready status: 46% completed.
2>Local Cluster ready status: 50% completed.
2>Local Cluster ready status: 54% completed.
2>Local Cluster ready status: 58% completed.
2>Local Cluster ready status: 62% completed.
2>Local Cluster ready status: 67% completed.
2>Local Cluster ready status: 71% completed.
2>Local Cluster ready status: 75% completed.
2>Local Cluster ready status: 79% completed.
2>Local Cluster ready status: 83% completed.
2>Local Cluster ready status: 88% completed.
2>Local Cluster ready status: 92% completed.
2>Local Cluster ready status: 96% completed.
2>Local Cluster ready status: 100% completed.
2>WARNING: Service Fabric Cluster is taking longer than expected to connect.
2>
2>Waiting for Naming Service to be ready. This may take a few minutes...
2>Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check
2>if there is connectivity/firewall/DNS issue.
2>At C:\Program Files\Microsoft SDKs\Service
2>Fabric\Tools\Scripts\ClusterSetupUtilities.psm1:521 char:12
2>+ [void](Connect-ServiceFabricCluster @connParams)
2>+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2> + CategoryInfo : InvalidOperation: (:) [Connect-ServiceFabricClus
2> ter], FabricException
2> + FullyQualifiedErrorId : TestClusterConnectionErrorId,Microsoft.ServiceFa
2> bric.Powershell.ConnectCluster
2>
2>Naming Service ready status: 8% completed.
2>Naming Service ready status: 17% completed.
2>Naming Service ready status: 25% completed.
2>Naming Service ready status: 33% completed.
2>Naming Service ready status: 42% completed.
2>Naming Service ready status: 50% completed.
2>Naming Service ready status: 58% completed.
2>Naming Service ready status: 67% completed.
2>Naming Service ready status: 75% completed.
2>Naming Service ready status: 83% completed.
2>Naming Service ready status: 92% completed.
2>Naming Service ready status: 100% completed.
2>WARNING: Naming Service is taking longer than expected to be ready...
2>Local Service Fabric Cluster created successfully.
2>--------------------------------------------------
2>Launching Service Fabric Local Cluster Manager...
2>You can use Service Fabric Local Cluster Manager (system tray application) to manage your local dev cluster.
2>Finished executing script 'Set-LocalClusterReady'.
2>Time elapsed: 00:07:01.8147993
2>The PowerShell script failed to execute.
========== Build: 1 succeeded, 0 failed, 1 up-to-date, 0 skipped ==========
========== Deploy: 0 succeeded, 1 failed, 0 skipped ==========

This worked for me; I stumbled upon the solution on an MSDN forum.
Most likely your C++ runtime is corrupt and needs to be reinstalled.
You will have to manually execute vcredist_x64.exe, which can be found at
C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\vcredist_x64.exe
Once that is done, you can choose whether to reboot your machine; I chose to reboot, and then ran the following scripts:
C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\CleanCluster.ps1
C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\DevClusterSetup.ps1
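For example, from an elevated PowerShell prompt (a minimal sketch; the paths assume the default SDK install location and may differ on your machine):

# Launch the C++ runtime installer that ships with the Service Fabric runtime
# (choose Repair in the installer UI, or pass /repair for an unattended repair)
& "C:\Program Files\Microsoft Service Fabric\bin\Fabric\Fabric.Code\vcredist_x64.exe"

# After the repair (and an optional reboot), recreate the local dev cluster
cd "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup"
.\CleanCluster.ps1
.\DevClusterSetup.ps1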
I hope this helps!

In addition to following the directions Varun shared above, I had to disconnect the VPN to my office.
Thanks @Varun for sharing this!
Once I reconnected my VPN I could not run the cluster again.
Hope this helps someone.

After trying many solutions, I found a GitHub issue that suggested adding an entry to the setup configuration; after adding it I had to recreate the cluster.
Configuration File
C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup\NonSecure\OneNode\ClusterManifestTemplate.json:
Entry to add
{
"name": "FabricContainerAppsEnabled",
"value": "false"
}
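For context, I added the entry to the parameters of the Setup section under fabricSettings in the template. A rough sketch of how it sits there (the section name and surrounding entries are from my copy of the template and may differ between SDK versions):

{
  "name": "Setup",
  "parameters": [
    {
      "name": "FabricContainerAppsEnabled",
      "value": "false"
    }
  ]
}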

By default the Service Fabric development cluster listens only on the loopback address (run netstat -an in CMD to see which ports are open).
After the change described below, my SF instance listens on port 19000 on all addresses, so external connections are possible; by default, port 19000 listens only on the loopback addresses 127.0.0.1 and [::1].
All you need to do is change IPAddressOrFQDN from localhost to your external IPv4 address. The change should be visible in the manifest file.
All SF cluster configuration files on Windows 7 are located in C:\SfDevCluster\Data. There you can find many XML documents, such as:
clusterManifest
FabricHostSettings
To change IPAddressOrFQDN I replaced all occurrences of the word 'localhost' with 'my.ip.address.here'. The files to be modified are also nested deeper:
C:\SfDevCluster\Data\_Node_0\Fabric
C:\SfDevCluster\Data\_Node_0\Fabric\Fabric.Data
C:\SfDevCluster\Data\_Node_0\Fabric\Fabric.Config.0.0
and so on
Before making the changes, stop your cluster; when you are done, start it again. A minimal sketch of the whole procedure follows.
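A rough PowerShell sketch, assuming the default C:\SfDevCluster\Data layout; 'my.ip.address.here' is a placeholder for your own address, and the file filter may need adjusting to match every config file in your tree:

# Stop the local cluster (the dev cluster runs under the FabricHostSvc service)
Stop-Service FabricHostSvc

# Replace 'localhost' with your external IPv4 address in the config files under the data root
Get-ChildItem -Path C:\SfDevCluster\Data -Recurse -Include *.xml |
    ForEach-Object {
        (Get-Content $_.FullName) -replace 'localhost', 'my.ip.address.here' |
            Set-Content $_.FullName
    }

# Start the cluster again and verify the endpoint now listens on all addresses
Start-Service FabricHostSvc
netstat -an | findstr 19000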
There is another, better method: you can change the generating scripts in the directory C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup.
After a cluster reset these settings will be lost. Remember that you do this at your own risk; the development setup is insecure and should not be used in a production environment.
If after the changes you cannot connect to SF, make sure the firewall doesn't block the port (you can temporarily disable your firewall).

If reinstalling the C++ runtime does not solve your issue, you can narrow it down to a firewall problem. The network may be preventing the FabricHostSvc service from running (you can check it in services.msc), usually because of security restrictions. If you can disable your firewall or switch to a network without restrictions, the issue should resolve on its own.
If you are able to get FabricHostSvc started, your issue will be solved.
Hope this helps.

The first answer (the one @Varun Rathore gave) worked for me. But as I was deploying containers on Service Fabric locally, switching from a one-node cluster to a five-node cluster again and again, this error came up often. So I opened PowerShell as Administrator and followed the next two steps.
First, go to the following path:
cd "C:\Program Files\Microsoft SDKs\Service Fabric\ClusterSetup"
Second, run these two scripts:
.\CleanCluster.ps1
.\DevClusterSetup.ps1
This will set up a one-node cluster on your local machine.
You can do this from the Service Fabric Local Cluster Manager, but it did not work for me, and repairing vcredist_x64.exe and rebooting the computer (Windows 10 Pro) every time is frustrating. This works for me.

Either you can reset the Fabric cluster (if you don't have a stateful service and don't need data to persist), or you can re-launch the Fabric cluster; sometimes switching the node count (one node vs. five nodes) also helps.

Related

Server process being killed on a Linux DigitalOcean VM

I am trying to run a Next.js server on a DigitalOcean virtual machine. The server works, but when I run npm run start, the logs say Killed after ~1 minute.
Here is an example log of what happens:
joey@mydroplet:~/Server$ sudo node server
info - SWC minify release candidate enabled. https://nextjs.link/swcmin
event - compiled client and server successfully in 3.3s (196 modules)
wait - compiling...
event - compiled client and server successfully in 410 ms (196 modules)
> Ready on https://localhost:443
> Ready on http://localhost:8080
wait - compiling / (client and server)...
event - compiled client and server successfully in 1173 ms (261 modules)
Killed
joey@mydroplet:~/Server$
After some research, I came across a couple of threads describing a server lacking enough memory/resources to continue the operation. I upgraded the droplet from 512 MB to 1 GB of RAM, but this still happens.
Do I need to further upgrade the memory?
It was the memory. Upgrading the server from 1 GB to 2 GB of RAM solved the problem.
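If you want to confirm that the kernel's out-of-memory killer terminated the process before paying for a bigger droplet, a quick check from the shell (assuming a standard Linux image) looks like this:

# Recent kernel messages from the OOM killer; the killed process and its memory usage are logged here
sudo dmesg | grep -i -E "out of memory|oom-killer|killed process"

# Snapshot of current memory and swap usage
free -h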

Oracle listener in Pacemaker cluster

I need some help. I have created a cluster with 2 nodes and created all the resources, but the listener has errors and the Pacemaker cluster status shows the Oracle listener as stopped.
In the web interface I have the following error messages:
Failed to start listener_or on Mon Oct 25 16:30:52 2021 on node lha1: Listener pdb1 appears to have started, but is not running properly:
and
Unable to get metadata for resource agent 'ocf:heartbeat:oralsnr' (timeout)
In pacemaker cluster status I have the following error message:
listener_or_start_0 on lha1 'error' (1): call=35, status='complete', exitreason='Listener pdb1 appears to have started, but is not running properly: ', last-rc-change='2021-10-25 16:30:52 +03:00', queued=0ms, exec=393ms
However, I can connect to the database.
Do you have any ideas how I could solve this issue?
The issue was that the resource agent couldn't see the Oracle listener.
The solution was to:
Create tnsnames.ora
Write the configuration
Reload the services
After this, pcs did not report any errors. A hedged sketch of the configuration follows.
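The original answer does not show the actual files, so the following is only an illustrative sketch; the alias, service name, host and port are placeholders that must match your own listener setup (only the resource name listener_or and node lha1 come from the question above):

# $ORACLE_HOME/network/admin/tnsnames.ora -- an alias the agent can resolve
PDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = lha1)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = pdb1)
    )
  )

# Then, from the shell: reload the listener and clear the failed resource state
lsnrctl reload
pcs resource cleanup listener_or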

How can I inspect per executor/node memory usage metrics of a pyspark job on Dataproc?

I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as:
...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
...
Caused by java.io.EOFException
...
...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node
...spark.storage.BlockManagerMasterEndpoint: Error try to remove broadcast 3 from block manager BlockManagerId(...)
Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.
My suspicion is that these opaque errors are coming from the node or executors running out of memory, but there don't seem to be any granular memory related metrics exposed by Dataproc.
How can I determine why a node was considered lost? Is there a way I can inspect memory usage per node or executor to validate whether these errors are being caused by high memory usage? If YARN is the one which is killing containers / determining nodes are lost, then hopefully there's a way to introspect why?
This is because you are using preemptible VMs, which are short-lived and last at most 24 hours. When GCE shuts down a preemptible VM, you see errors like this:
YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a lost node
Open a secure shell from your machine to the cluster. You'll need the gcloud SDK installed for that.
gcloud compute ssh ${HOSTNAME}-m --project=${PROJECT}
Then run the following commands in the cluster.
List all nodes in the cluster
yarn node -list
Then use ${NodeID} to get a report on the node state.
yarn node -status ${NodeID}
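If YARN log aggregation is enabled on the cluster (it usually is on Dataproc), you can also pull the logs of a finished application; the container diagnostics there often say whether a container was killed for exceeding memory limits or lost with its node. The ${ApplicationId} placeholder follows the same convention as above:

yarn logs -applicationId ${ApplicationId}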
You could also set up local port forwarding over SSH to the YARN web UI instead of running commands directly on the cluster.
gcloud compute ssh ${HOSTNAME}-m \
--project=${PROJECT} -- \
-L 8088:${HOSTNAME}-m:8088 -N
Then go to http://localhost:8088/cluster/apps in your browser.

No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS

I'm new to Azure and Service Fabric.
Today I installed the VS Service Fabric tools and the SDK (latest 3.3.617)
My home desktop has Win10 pro x64 with 8GB RAM.
I tried to run a "hello world" app from the "new project" using VS. Compilation works and then it starts to spin up the local cluster.
I get this error:
1>------ Build started: Project: Stateless1, Configuration: Debug x64 ------
1> Stateless1 -> C:\Users\USER\source\repos\Application2\Stateless1\bin\x64\Debug\Stateless1.exe
2>------ Build started: Project: Application2, Configuration: Debug x64 ------
3>------ Deploy started: Project: Application2, Configuration: Debug x64 ------
3>Started executing script 'GetApplicationExistence'.
3>Finished executing script 'GetApplicationExistence'.
3>Time elapsed: 00:00:00.4751003
3>Started executing script 'Set-LocalClusterReady'.
3>powershell -NonInteractive -NoProfile -WindowStyle Hidden -ExecutionPolicy Bypass -Command "Import-Module 'C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\DefaultLocalClusterSetup.psm1'; Set-LocalClusterReady -createOneNodeCluster $true"
3>--------------------------------------------
3>Local Service Fabric Cluster is not setup...
3>Please wait while we setup the Local Service Fabric Cluster. This may take few minutes...
3>Performing Stop-Service on service: FabricHostSvc . This may take a few minutes...
3>
3>Using Cluster Data Root: C:\SfDevCluster\Data
3>Using Cluster Log Root: C:\SfDevCluster\Log
3>
3>The generated json path is C:\Users\USER\AppData\Local\Temp\tmpE26D.tmp.json
3>Processing and validating cluster config.
3>Check if machine 'ComputerFullName' is IOT Core: False. Could not open registry key.
3>Performing Stop-Service on service: FabricHostSvc . This may take a few minutes...
3>Create node configuration succeeded
3>Performing Start-Service on service: FabricHostSvc . This may take a few minutes...
3>
3>Waiting for Service Fabric Cluster to be ready. This may take a few minutes...
3>Local Cluster ready status: 4% completed.
3>Local Cluster ready status: 8% completed.
3>Local Cluster ready status: 100% completed.
3>WARNING: Service Fabric Cluster is taking longer than expected to connect.
3>
3>Waiting for fabric:/System/NamingService to be ready. This may take a few minutes...
3>Connect-ServiceFabricCluster : No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS
3>issue.
3>At C:\Program Files\Microsoft SDKs\Service Fabric\Tools\Scripts\ClusterSetupUtilities.psm1:979 char:12
3>+ [void](Connect-ServiceFabricCluster @connParams)
3>+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3> + CategoryInfo : InvalidOperation: (:) [Connect-ServiceFabricCluster], FabricException
3> + FullyQualifiedErrorId : TestClusterConnectionErrorId,Microsoft.ServiceFabric.Powershell.ConnectCluster
3>
3>fabric:/System/NamingService ready status: 8% completed.
3>fabric:/System/NamingService ready status: 17% completed.
3>fabric:/System/NamingService ready status: 92% completed.
3>fabric:/System/NamingService ready status: 100% completed.
3>WARNING: fabric:/System/NamingService is taking longer than expected to be ready...
3>Local Service Fabric Cluster created successfully.
3>--------------------------------------------------
3>Launching Service Fabric Local Cluster Manager...
3>You can use Service Fabric Local Cluster Manager (system tray application) to manage your local dev cluster.
3>Finished executing script 'Set-LocalClusterReady'.
3>Time elapsed: 00:12:29.9792319
3>The PowerShell script failed to execute.
========== Build: 2 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========
========== Deploy: 0 succeeded, 1 failed, 0 skipped ==========
I tried turning off my Windows firewall and debugging again; now I get:
Unable to determine whether the application is installed on the cluster or not
I have encountered this error before and solved it in two ways:
I simply redeployed to the local cluster.
I stopped the local cluster from the tray menu before redeploying.

How to enable a Spark-Mesos job to be launched from inside a Docker container?

Summary:
Is it possible to submit a Spark job on Mesos from inside a Docker container, with 1 Mesos master (no Zookeeper) and 1 Mesos agent each also running in separate Docker containers (on the same host for now)? The Mesos containerizer described at http://mesos.apache.org/documentation/latest/container-image/ seems to apply to the case where the Mesos application is simply encapsulated in a Docker container and run. My Docker application is more interactive, with multiple PySpark Mesos jobs being instantiated at run-time based on user input. The driver program in the Docker container is not itself run as a Mesos app; only the user-initiated job requests are handled as PySpark Mesos apps.
Specifics:
I have 3 Docker containers based on centos:7 linux, and running on the same host machine for now:
Container "Master" running a Mesos Master.
Container "Agent" running a Mesos Agent.
Container "Test" with Spark and Mesos installed where I run a bash shell and launch the following PySpark test program from the command line.
from pyspark import SparkContext, SparkConf
from operator import add
# Configure Spark
sp_conf = SparkConf()
sp_conf.setAppName("spark_test")
sp_conf.set("spark.scheduler.mode", "FAIR")
sp_conf.set("spark.dynamicAllocation.enabled", "false")
sp_conf.set("spark.driver.memory", "500m")
sp_conf.set("spark.executor.memory", "500m")
sp_conf.set("spark.executor.cores", 1)
sp_conf.set("spark.cores.max", 1)
sp_conf.set("spark.mesos.executor.home", "/usr/local/spark-2.1.0")
sp_conf.set("spark.executor.uri", "file://usr/local/spark-2.1.0-bin-without-hadoop.tgz")
sc = SparkContext(conf=sp_conf)
# Simple computation
x = [(1.5,100.),(1.5,200.),(1.5,300.),(2.5,150.)]
rdd = sc.parallelize(x,1)
tot = rdd.foldByKey(0,add).collect()
cnt = rdd.countByKey()
time = [t[0] for t in tot]
avg = [t[1]/cnt[t[0]] for t in tot]
print 'tot=', tot
print 'cnt=', cnt
print 't=', time
print 'avg=', avg
The relevant software versions I am using are as follows:
Hadoop: 2.7.3
Spark: 2.1.0
Mesos: 1.2.0
Docker: 17.03.1-ce, build c6d412e
The following works fine:
I can run the simple PySpark test program above from inside the Test container with Spark's MASTER=local[N] for N=1 or N=4.
I can see in the Mesos logs and in the Mesos user interface (UI) that the Mesos agent and master come up fine. The Mesos UI shows that the agent is connected with plenty of resources (cpu, memory, disk).
I can run the Mesos Python tests successfully from inside the Test container with /usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050. This seems to confirm that the Mesos containers can be accessed from within my Test container, but these tests are not using Spark.
This is the Failure:
With Spark's MASTER=mesos://127.0.0.1:5050, when I launch my PySpark test program from inside the Test container there is activity in the logs of both the Mesos Master and Agent, and in the couple seconds before failure, the Mesos UI shows resources assigned for the job that are well within what is available. However, the PySpark test program then fails with: WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources.
The steps I followed are as follows.
Start Mesos Master:
docker run -it --net=host -p 5050:5050 the_master
Relevant excerpts from the master's log shows:
I0418 01:05:08.540192 27 master.cpp:383] Master 15b354eb-6a20-4bc9-a13b-6533b1e91bd2 (localhost) started on 127.0.0.1:5050
I0418 01:05:08.540210 27 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/mesos-1.2.0/build/../src/webui" --work_dir="/var/lib/mesos" --zk_session_timeout="10secs"
Start Mesos Agent:
docker run -it --net=host -e MESOS_AGENT_PORT=5051 the_agent
The agent's log shows:
I0418 01:42:00.234244 40 slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_mesos_image="spark-mesos-agent-test" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/local/mesos-1.2.0/build/src" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" --systemd_enable_support="false" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I get the following warning for both the Mesos Master and Agent, but ignore it because I am running everything on the same host for now:
Master/Agent bound to loopback interface! Cannot communicate with remote schedulers or agents. You might want to set '--ip' flag to a routable IP address.
In fact, my tests with assigning a routable IP address instead of 127.0.0.1 failed to change any of the behavior I describe here.
Start Test Container (with bash shell for testing):
docker run -it --net=host the_test /bin/bash
Some relevant environment variables set inside all three containers (Master, Agent, and Test):
HADOOP_HOME=/usr/local/hadoop-2.7.3
HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
SPARK_HOME=/usr/local/spark-2.1.0
SPARK_EXECUTOR_URI=file:////usr/local/spark-2.1.0-bin-without-hadoop.tgz
MASTER=mesos://127.0.0.1:5050
PYSPARK_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_DRIVER_PYTHON=/usr/local/anaconda2/bin/python
PYSPARK_SUBMIT_ARGS=--driver-memory=4g pyspark-shell
MESOS_PORT=5050
MESOS_IP=127.0.0.1
MESOS_WORKDIR=/var/lib/mesos
MESOS_HOME=/usr/local/mesos-1.2.0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
MESOS_MASTER=mesos://127.0.0.1:5050
PYTHONPATH=:/usr/local/spark-2.1.0/python:/usr/local/spark-2.1.0/python/lib/py4j-0.10.1-src.zip
Run Mesos (non-Spark) tests from inside the Test container:
/usr/local/mesos-1.2.0/build/src/examples/python/test-framework 127.0.0.1:5050
This produces the following log output (as expected I think):
I0417 21:28:36.912542 20 sched.cpp:232] Version: 1.2.0
I0417 21:28:36.920013 62 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0417 21:28:36.920472 62 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:28:36.924165 62 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Registered with framework ID be89e739-be8d-430e-b1e9-3fe55fa18459-0000
Received offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0 with cpus: 16.0 and mem: 119640.0
Launching task 0 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 1 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 2 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 3 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Launching task 4 using offer be89e739-be8d-430e-b1e9-3fe55fa18459-O0
Task 0 is in state TASK_RUNNING
Task 1 is in state TASK_RUNNING
Task 2 is in state TASK_RUNNING
Task 3 is in state TASK_RUNNING
Task 4 is in state TASK_RUNNING
Task 0 is in state TASK_FINISHED
Task 1 is in state TASK_FINISHED
Task 2 is in state TASK_FINISHED
Task 3 is in state TASK_FINISHED
Task 4 is in state TASK_FINISHED
All tasks done, waiting for final framework message
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
Received message: 'data with a \x00 byte'
All tasks done, and all messages received, exiting
Run PySpark test program from inside the Test container:
python spark_test.py
This produces the following log output:
17/04/17 21:29:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0417 21:29:19.187747 205 sched.cpp:232] Version: 1.2.0
I0417 21:29:19.196535 188 sched.cpp:336] New master detected at master@127.0.0.1:5050
I0417 21:29:19.197453 188 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0417 21:29:19.201884 195 sched.cpp:759] Framework registered with be89e739-be8d-430e-b1e9-3fe55fa18459-0001
17/04/17 21:29:34 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I searched for this error on the internet but every page I found indicates that it is a common error caused by insufficient resources being allocated to the Mesos agent. As I mentioned, the Mesos UI indicates that there are sufficient resources. Please respond if you have any idea why my Spark job is not accepting resources from Mesos or if you have any suggestions of things I could try.
Thank you for your help.
This error is now resolved. In case anybody encounters a similar problem, I wanted to post that in my case it was caused by the Hadoop classpath not being set in the Mesos Master and Agent containers. Once set, everything works as expected.
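The answer does not show the exact setting, but with a "without-hadoop" Spark build like the one used here, the usual approach is to derive the classpath from the Hadoop install in each container, for example in the shell profile or Dockerfile of the Master and Agent images (this reuses the HADOOP_HOME value listed above; adapt as needed):

# Make Hadoop's jars visible to Spark and the Mesos executors
export HADOOP_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)
export SPARK_DIST_CLASSPATH=${HADOOP_CLASSPATH}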
