I'm using Spark Standalone Mode tutorial page to install Spark in Standalone mode.
1- I have started a master by:
./sbin/start-master.sh
2- I have started a worker by:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
Note: spark://ubuntu:7077 is my master name, which I can see it in Master-WebUI.
Problem: By second command, a worker started successfully. But it couldn't associate with master. It tries repeatedly and then give this message:
15/02/08 11:30:04 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#ubuntu:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ubuntu/127.0.1.1:7077
15/02/08 11:30:04 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1296628173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [20] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/02/08 11:31:15 ERROR Worker: All masters are unresponsive! Giving up.
What is the problem?
Thanks
I usually start from spark-env.sh template. And I set, properties that I need. For simple cluster you need:
SPARK_MASTER_IP
Then, create a file called "slaves" in the same directory as spark-env.sh and slaves ip's (one per line). Assure you reach all slaves through ssh.
Finally, copy this configuration in every machine of your cluster. Then start the entire cluster executing start-all.sh script and try spark-shell to check your configuration.
> sbin/start-all.sh
> bin/spark-shell
You can set export SPARK_LOCAL_IP="You-IP" #to set the IP address Spark binds to on this node in $SPARK_HOME/conf/spark-env.sh
In my case, using spark 2.4.7 in standalone mode, I've created a passwordless ssh key using ssh-keygen, but still got asked for worker password when starting the cluster.
What I did was follow the instructions here
https://www.cyberciti.biz/faq/how-to-set-up-ssh-keys-on-linux-unix/
This line solved the problem:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user#server-ip
Related
I'm running a PySpark job in Google Cloud Dataproc, in a cluster with half the nodes being preemptible, and seeing several errors in the job output (the driver output) such as:
...spark.scheduler.TaskSetManager: Lost task 9696.0 in stage 0.0 ... Python worker exited unexpectedly (crashed)
...
Caused by java.io.EOFException
...
...YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a *lost* node
...spark.storage.BlockManagerMasterEndpoint: Error try to remove broadcast 3 from block manager BlockManagerId(...)
Perhaps by coincidence, the errors mostly seem to be coming from preemptible nodes.
My suspicion is that these opaque errors are coming from the node or executors running out of memory, but there don't seem to be any granular memory related metrics exposed by Dataproc.
How can I determine why a node was considered lost? Is there a way I can inspect memory usage per node or executor to validate whether these errors are being caused by high memory usage? If YARN is the one which is killing containers / determining nodes are lost, then hopefully there's a way to introspect why?
Because you are using Preemptible VMs which are short-lived and guaranteed to last for up to 24 hours. This means that when GCE shutdowns Preemptible VMs you see errors like this:
YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 177 for reason Container marked as failed: ... Exit status: -100. Diagnostics: Container released on a lost node
Open a secure shell from your machine to the cluster. You'll require gcloud sdk installed for that.
gcloud compute ssh ${HOSTNAME}-m --project=${PROJECT}
Then run the following commands in the cluster.
List all nodes in the cluster
yarn node -list
Then using ${NodeID} to get report on the node state.
yarn node -status ${NodeID}
You could also set up local port forwarding via SSH to Yarn WebUI server instead of running commands directly in the cluster.
gcloud compute ssh ${HOSTNAME}-m \
--project=${PROJECT} -- \
-L 8088:${HOSTNAME}-m:8088 -N
Then go to http://localhost:8088/cluster/apps in your browser.
I'm trying to send spark job to yarn (without HDFS) in HA mode.
For submitting I'm using org.apache.spark.deploy.SparkSubmit.
When I send request from machine with active Resource Manager, it works well. But if I' trying to send from machine with standby Resource Manager, job fails with error:
DEBUG org.apache.hadoop.ipc.Client - Connecting to spark2-node-dev/10.10.10.167:8032
DEBUG org.apache.hadoop.ipc.Client - Connecting to /0.0.0.0:8032
org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
However, when I send request via command line (spark-submit), it works well through both active and standby machine.
What can cause the problem?
P.S. Use the same parameters for both type of sending job: org.apache.spark.deploy.SparkSubmit and spark-submit command line request. And properties yarn.resourcemanager.hostname.rm_id defined for all rm hosts
The problem was with absence of yarn-site.xml within class path for spark-submitter jar. Actually spark submitter jar does not take to account YARN_CONF_DIR or HADOOP_CONF_DIR env var, so cannot see yarn-site.
One solution that I found was to put yarn-site into classpath of jar.
I am trying hadoop/spark cluster in Google Compute Engine through "Launch click-to-deploy software" feature .
I have created 1 master and 2 slave node and i can launch spark-shell on the cluster but when i want to launch spark-shell since my computer, i failed.
I launch :
./bin/spark-shell --master spark://IP or Hostname:7077
And i have this stackTrace :
15/04/09 10:58:06 INFO AppClient$ClientActor: Connecting to master
akka.tcp://sparkMaster#IP or Hostname:7077/user/Master...
15/04/09 10:58:06 WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster#IP or Hostname:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#IP or Hostname:7077
15/04/09 10:58:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#IP or Hostname:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: IP or Hostname: unknown error
please let me know how to overcome this problem .
See comment from Daniel Darabos. By default, all incoming connections are blocked except for SSH, RDP and ICMP. To be able to connect from the Internet to the hadoop master instance, you must open port 7077 for 'hadoop-master' tag in your project first:
gcloud compute --project PROJECT firewall-rules create allow-spark \
--allow TCP:7077 \
--target-tags hadoop-master
See Firewalls, Adding a firewall and gcloud compute firewall-rules create at GCE public documentation for further details and all the possibilities.
On the spark master machine, I have the following config in my conf/slaves:
spark-slave1.com
spark-slave2.com
localhost
In conf/spark-env.sh, I have
export SPARK_WORKER_INSTANCES=1
That I intended to have 1 worker from each of the host machine, in total 3 workers, when spark master is started.
Then I start the cluster by: ./sbin/start-all.sh,
yielding:
starting org.apache.spark.deploy.master.Master, logging to ...
spark-slave1.com: starting org.apache.spark.deploy.worker.Worker, logging to ...
localhost: starting org.apache.spark.deploy.worker.Worker, logging to ...
spark-slave2.com: starting org.apache.spark.deploy.worker.Worker, logging to ...
Visiting the spark monitorying web interface at localhost:8080 shows 5 workers registered.
1 from localhost
2 from spark-slave1.com
2 from spark-slave2.com
All of them are having status ALIVE
What I have done wrong?
Let me know if any additional information is needed. I changed the hostname for illustration purpose. It is actually a local ip.
Edit 1 - Added screen capture for reference
I have been experienced the some issues, that is because at your configuration file spark-env.sh, you has set multiple worker instances
modify to export SPARK_WORKER_INSTANCES=1 your problem will solve .
I am trying to configure two windows servers in my network as Cassandra cluster.
I did some reading in various sites and changed the below in Cassandra.yalm
after changing the default value of 127.0.0.1 to actual IP the Cassandra service is not starting.
I also added the map to actual IP to localhost in (windows) hosts file.
After doing the above change, the service is coming up when I start the service. it is stopping immediately.
The reason I am changing this IP is to make this a cluster with two node setup,
Please let me know if I miss some thing.
Version: Datastax community version of Cassandra
Server : windows.
Thx
Muthu
Message from Cassandra.txt in logs dir:
ERROR [main] 2014-09-18 11:43:12,155 DatabaseDescriptor.java (line 116) Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: Invalid yaml Caused by: Can't construct a java object for tag:yaml.org,2002:org.apache.cassandra.config.Config; exception=Cannot create property=seed_provider for JavaBean=org.apache.cassandra.config.Config#34e5190a; No suitable constructor with 2 arguments found for class org.apache.cassandra.config.SeedProviderDef in 'reader', line 8, column 1: cluster_name: 'Test Cluster'
If you want to create Cassandra cluster you must have at least two nodes and configure /etc/cassandra/cassandra.yaml
cassandra.yaml
cluster_name: 'Some Cluster Name'
listen_address: [Current IP]
rpc_address: [Current IP]
seed_provuder:
- seeds: "[Current IP], [Remote IP]"
Note: seeds must have at least two IPs which must be reachable for each other
Clean and start Cassandra instance
sudo rm -rf /var/lib/cassandra/* /var/log/cassandra/*
Note: Cassandra instance must be killed before cleaning those folders.