I saw a similar question, but it didn't solve my problem, so I am posting this. Please help if you can.
I have one worker node and a separate master in my Apache Spark setup. After running start-all.sh, the master web UI lists only the worker running on the master, not the one on the slave.
My /etc/hosts on the master node:
10.0.0.6 master
10.0.0.20 slave01
on master:
ll /etc/alternatives/java
lrwxrwxrwx. 1 root root 73 Jan 13 00:14 /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64/jre/bin/java
On my master, /opt/spark/conf/spark-env.sh contains:
export SPARK_MASTER_HOST='10.0.0.6'
export JAVA_HOME='/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64'
When I run start-all.sh, here's the output:
[sudip@master sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.master.Master-1-master.out
slave01: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
master: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-master.out
On master:
[sudip@master sbin]$ cat /opt/spark/logs/spark-sudip-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
/opt/spark/bin/spark-class: line 71: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64//bin/java: No such file or directory
I have no idea where it gets "java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64" from, as shown in the log. Any idea where I am going wrong?
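The start-all.sh output above associates that log file with slave01, so one thing worth checking is which JAVA_HOME the spark-env.sh on that machine exports and whether the java binary under it exists. A minimal sketch, assuming the file contents are available as a string; find_java_home and java_binary_exists are hypothetical helper names, not part of Spark:

```python
import os
import re

# Matches lines such as: export JAVA_HOME='/usr/lib/jvm/...'
# Commented-out lines (starting with '#') are not matched.
_JAVA_HOME_RE = re.compile(r"""^\s*(?:export\s+)?JAVA_HOME=["']?([^"'\s#]+)["']?""")


def find_java_home(env_text):
    """Return the last JAVA_HOME assigned in spark-env.sh-style text, or None."""
    value = None
    for line in env_text.splitlines():
        match = _JAVA_HOME_RE.match(line)
        if match:
            value = match.group(1)  # a later assignment overrides an earlier one
    return value


def java_binary_exists(java_home):
    """True if <java_home>/bin/java is an existing file on this machine."""
    return os.path.isfile(os.path.join(java_home, "bin", "java"))
```

Running both helpers against the spark-env.sh of each node (master and slave01) would show whether one of them still points at a JDK directory that is no longer installed.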
Update:
I deleted the log file and created it again. Now the log file for the slave is empty even when I run start-all.sh or start-slaves.sh. I also don't see the slave entry in http://:8080
Thanks and Regards,
Sudip
Cassandra service (3.11.5) stops automatically after start/restart on AWS Linux.
I have a fresh installation of Cassandra on a new AWS Linux instance (t3.xlarge), and when I run
sudo service cassandra start
or
sudo service cassandra restart
after 1 or 2 seconds, the service stops automatically. I looked into the logs and found these.
I am not sure why: I haven't changed any snitch-related configuration, and it has always been SimpleSnitch. I don't run multiple Cassandra nodes, just one on a single EC2 instance.
Logs
INFO [main] 2020-02-12 17:40:50,833 ColumnFamilyStore.java:426 - Initializing system.schema_aggregates
INFO [main] 2020-02-12 17:40:50,836 ViewManager.java:137 - Not submitting build tasks for views in keyspace system as storage service is not initialized
INFO [main] 2020-02-12 17:40:51,094 ApproximateTime.java:44 - Scheduling approximate time-check task with a precision of 10 milliseconds
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
Installation steps
sudo curl -OL https://www.apache.org/dist/cassandra/redhat/311x/cassandra-3.11.5-1.noarch.rpm
sudo rpm -i cassandra-3.11.5-1.noarch.rpm
sudo pip install cassandra-driver
export CQLSH_NO_BUNDLED=true
sudo chkconfig --levels 3 cassandra on
The issue is in your log file:
ERROR [main] 2020-02-12 17:40:51,137 CassandraDaemon.java:759 - Cannot start node if snitch's data center (datacenter1) differs from previous data center (dc1). Please fix the snitch configuration, decommission and rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
It seems that you started the cluster, stopped it and renamed the datacenter from dc1 to datacenter1.
In order to fix:
If no data is stored, delete the data directories
If data is stored, rename the datacenter back to dc1 in the config
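To decide which of the two fixes applies, it can help to pull the two data center names out of the error message. A minimal sketch, assuming the message wording matches the log line quoted above; parse_dc_mismatch is a hypothetical helper, not part of Cassandra:

```python
import re

# Pattern based on the wording of the error line quoted in the question.
_DC_MISMATCH_RE = re.compile(
    r"snitch's data center \((?P<configured>[^)]+)\) "
    r"differs from previous data center \((?P<previous>[^)]+)\)"
)


def parse_dc_mismatch(log_line):
    """Return (configured_dc, previous_dc) from the startup error, or None."""
    match = _DC_MISMATCH_RE.search(log_line)
    if match is None:
        return None
    return match.group("configured"), match.group("previous")
```

For the log line in the question, the first value is the name the snitch reports now and the second is the name stored from the earlier start.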
I had the same problem: the Cassandra service stopped immediately after it was started.
In the Cassandra configuration file located at /etc/cassandra/cassandra.yaml, change cluster_name back to the previous one, like this:
...
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'dc1'
# This defines the number of tokens randomly assigned to this node on the ring
# The more tokens, relative to other nodes, the larger the proportion of data
...
I am running PySpark code on an HDInsight cluster and getting this error:
The code failed because of a fatal error: Session 681 unexpectedly
reached final status 'dead'. See logs:
I don't have experience with YARN or Hadoop. I tried a few links from Stack Overflow, but none of them helped. One strange thing is that I was able to run the same code yesterday without this error.
I just ran this import
from pyspark.sql import SparkSession
This is the error I am getting:
19/06/21 20:35:35 INFO Client:
client token: N/A
diagnostics: [Fri Jun 21 20:35:35 +0000 2019] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:819200, vCores:240> ; Queue's Absolute capacity = 50.0 % ; Queue's Absolute used capacity = 99.1875 % ; Queue's Absolute max capacity = 100.0 % ;
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1561149335158
final status: UNDEFINED
tracking URL: https://mmsorderpredhdi.azurehdinsight.net/yarnui/hn/proxy/application_1560840076505_0062/
user: livy
19/06/21 20:35:35 INFO ShutdownHookManager: Shutdown hook called
19/06/21 20:35:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-bb63c5f0-7579-4456-b32a-0e643ca97ecc
YARN Diagnostics:
Application killed by user..
Question: Does this have something to do with the queue's absolute used capacity?
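The diagnostics line above reports the queue's capacity figures while the application waits for ApplicationMaster resources. A minimal sketch for extracting those numbers so they can be compared, assuming the diagnostics text keeps the format shown in the log; parse_queue_capacity and queue_headroom are hypothetical helpers, and the log alone does not confirm that capacity is the cause:

```python
import re

# Captures "capacity", "used capacity" and "max capacity" with their percentages.
_CAPACITY_RE = re.compile(r"Queue's Absolute ((?:used |max )?capacity) = ([0-9.]+) %")


def parse_queue_capacity(diagnostics):
    """Return a dict mapping each capacity label to its percentage as a float."""
    return {name: float(value) for name, value in _CAPACITY_RE.findall(diagnostics)}


def queue_headroom(capacities):
    """Percentage points left between used capacity and max capacity."""
    return capacities["max capacity"] - capacities["used capacity"]
```

A headroom close to zero would be consistent with the application sitting in the "waiting for resources" state, but other applications in the queue and the YARN logs would need to be checked to be sure.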
Could you please check the logs to find the exact issue?
Where do I find the log file?
On an Azure HDInsight cluster, you can find the Livy logs by connecting to one of the head nodes over SSH and downloading the files at this path:
hdfs dfs -ls /app-logs/livy/logs-ifile
For more details, refer to "Access Apache Hadoop YARN application logs on Linux-based HDInsight".
You may also refer to "How to start sparksession in pyspark".
Hope this helps.
I am getting some triggers that report the process as unavailable, but when I check on the host it is running fine. Here is how the trigger expression is set:
{$hostname:proc.num[,,,/etc/alternatives/java].last()}=0
It seems to work fine for some hosts, but on others it fires "process unavailable" and sends the alert.
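In the expression above only the fourth parameter of proc.num (the command-line filter) is set, so the item counts processes whose command line matches /etc/alternatives/java. A minimal sketch that loosely mimics that count, assuming a Linux /proc layout; count_matching_processes and read_cmdlines are hypothetical helpers and not how the Zabbix agent is actually implemented:

```python
import os
import re


def count_matching_processes(cmdlines, pattern):
    """Count command lines that match `pattern` (a regular expression)."""
    regex = re.compile(pattern)
    return sum(1 for cmdline in cmdlines if regex.search(cmdline))


def read_cmdlines(proc_root="/proc"):
    """Collect command lines of processes readable by the current user."""
    cmdlines = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # process exited, or this user may not read it
        # Arguments are NUL-separated in /proc/<pid>/cmdline.
        cmdlines.append(raw.replace(b"\0", b" ").decode(errors="replace").strip())
    return cmdlines
```

Running count_matching_processes(read_cmdlines(), "/etc/alternatives/java") as the same user the agent runs as, and again as root, is one way to see whether the count differs because some process entries are not readable.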
Affected host:
# ps ax | grep java
1717 ? Ssl 119:15 /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Djsse.enableSNIExtension=false -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=-1 --httpsPort=8443 --ajp13Port=8009 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --httpsCertificate=/var/lib/jenkins/.ssl/hostssl.crt --httpsPrivateKey=/var/lib/jenkins/.ssl/hostssl.key
Zabbix log:
2000:20160901:081336.721 Starting Zabbix Agent [$hostname]. Zabbix 2.2.8 (revision 51174).
2000:20160901:081336.721 using configuration file: /etc/zabbix/zabbix_agentd.conf
2002:20160901:081336.724 agent #0 started [collector]
2004:20160901:081336.724 agent #2 started [listener #2]
2005:20160901:081336.725 agent #3 started [listener #3]
2006:20160901:081336.725 agent #4 started [active checks #1]
2003:20160901:081336.725 agent #1 started [listener #1]
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
cat: /proc//status: No such file or directory
Host sending zabbix data properly:
# ps ax | grep java
2472 ? Ssl 1279:26 /etc/alternatives/java -Dcom.sun.akuma.Daemon=daemonized -Djava.awt.headless=true -Djsse.enableSNIExtension=false -Dorg.apache.commons.jelly.tags.fmt.timeZone=Europe/Dublin -DJENKINS_HOME=/var/lib/jenkins -jar /usr/lib/jenkins/jenkins.war --logfile=/var/log/jenkins/jenkins.log --webroot=/var/cache/jenkins/war --daemon --httpPort=-1 --httpsPort=8443 --ajp13Port=8009 --debug=5 --handlerCountMax=100 --handlerCountMaxIdle=20 --httpsCertificate=/var/lib/jenkins/.security/hostssl.crt --httpsPrivateKey=/var/lib/jenkins/.security/hostssl.key --httpsPort=8443
The Zabbix log does not contain the line cat: /proc//status: No such file or directory
As I understand it, the problem is that the PID of the process is not discovered, so the trigger fires an alert action.
Is there any way to troubleshoot this further to see why the Zabbix agent cannot detect the PID of the running process on the affected machines?
The problem is resolved now.
I used zabbix_get to get results from the Zabbix agent. There I found that it could not see any processes owned by jenkins or any other non-zabbix user.
Googling brought me to this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1032691
Applying a custom SELinux policy resolved the issue.
On the spark master machine, I have the following config in my conf/slaves:
spark-slave1.com
spark-slave2.com
localhost
In conf/spark-env.sh, I have
export SPARK_WORKER_INSTANCES=1
My intent was to have one worker on each host machine, three workers in total, when the Spark master is started.
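The expected number of workers can be worked out from the two settings above. A minimal sketch, assuming the start scripts launch SPARK_WORKER_INSTANCES workers on every host listed in conf/slaves; parse_slaves and expected_worker_count are hypothetical helpers, not part of Spark:

```python
def parse_slaves(text):
    """Return the host names in conf/slaves-style text, skipping blanks and comments."""
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()  # drop trailing comments
        if host:
            hosts.append(host)
    return hosts


def expected_worker_count(slaves_text, worker_instances=1):
    """Hosts in conf/slaves multiplied by the per-host worker instances."""
    return len(parse_slaves(slaves_text)) * worker_instances
```

With the three hosts listed above and SPARK_WORKER_INSTANCES=1 this gives 3, so a higher count in the web UI would not be explained by these two settings alone.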
Then I start the cluster by: ./sbin/start-all.sh,
yielding:
starting org.apache.spark.deploy.master.Master, logging to ...
spark-slave1.com: starting org.apache.spark.deploy.worker.Worker, logging to ...
localhost: starting org.apache.spark.deploy.worker.Worker, logging to ...
spark-slave2.com: starting org.apache.spark.deploy.worker.Worker, logging to ...
Visiting the Spark monitoring web interface at localhost:8080 shows 5 workers registered:
1 from localhost
2 from spark-slave1.com
2 from spark-slave2.com
All of them have status ALIVE.
What have I done wrong?
Let me know if any additional information is needed. I changed the hostnames for illustration purposes; they are actually local IPs.
Edit 1 - Added screen capture for reference
I have experienced the same issue. It happens because multiple worker instances are set in your configuration file spark-env.sh.
Change it to export SPARK_WORKER_INSTANCES=1 and your problem should be solved.
I'm using the Spark Standalone Mode tutorial page to install Spark in standalone mode.
1- I have started a master by:
./sbin/start-master.sh
2- I have started a worker by:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
Note: spark://ubuntu:7077 is my master URL, which I can see in the master web UI.
Problem: With the second command, the worker starts successfully, but it cannot associate with the master. It retries repeatedly and then gives this message:
15/02/08 11:30:04 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster@ubuntu:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: ubuntu/127.0.1.1:7077
15/02/08 11:30:04 INFO RemoteActorRefProvider$RemoteDeadLetterActorRef: Message [org.apache.spark.deploy.DeployMessages$RegisterWorker] from Actor[akka://sparkWorker/user/Worker#-1296628173] to Actor[akka://sparkWorker/deadLetters] was not delivered. [20] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
15/02/08 11:31:15 ERROR Worker: All masters are unresponsive! Giving up.
What is the problem?
Thanks
I usually start from the spark-env.sh template and set the properties that I need. For a simple cluster you need:
SPARK_MASTER_IP
Then create a file called "slaves" in the same directory as spark-env.sh and add the slaves' IPs (one per line). Make sure you can reach all slaves through SSH.
Finally, copy this configuration to every machine in your cluster. Then start the entire cluster by executing the start-all.sh script and try spark-shell to check your configuration.
> sbin/start-all.sh
> bin/spark-shell
You can set export SPARK_LOCAL_IP="Your-IP" in $SPARK_HOME/conf/spark-env.sh to choose the IP address Spark binds to on this node.
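The worker log in the question shows the name ubuntu resolving to 127.0.1.1, a loopback address. A minimal sketch for checking how a host name resolves before choosing the address to bind to; is_loopback_address and hostname_resolves_to_loopback are hypothetical helper names, and the result depends on the /etc/hosts and DNS setup of the machine where it runs:

```python
import ipaddress
import socket


def is_loopback_address(address):
    """True if the IP address string lies in a loopback range (e.g. 127.0.0.0/8)."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback(hostname):
    """Resolve `hostname` to an IPv4 address and report whether it is loopback."""
    return is_loopback_address(socket.gethostbyname(hostname))
```

If the master's host name resolves to a loopback address, setting an explicit non-loopback IP as described above is one option to try.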
In my case, using Spark 2.4.7 in standalone mode, I had created a passwordless SSH key using ssh-keygen, but was still asked for the worker password when starting the cluster.
What I did was follow the instructions here:
https://www.cyberciti.biz/faq/how-to-set-up-ssh-keys-on-linux-unix/
This line solved the problem:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub user@server-ip