My OpsCenter always gets stuck at "Loading OpsCenter...". By the way, this is my first installation, and so far I have not gotten OpsCenter to run.
All three of these run normally:
nodetool status
service dse
service datastax-agent
I can reproduce it in both Google Chrome and Mozilla Firefox, both remotely and on localhost.
opscenterd.log:
2017-04-12 15:20:15,877 [myclustername] WARN: These nodes reported this message, Nodes: ['10.35.21.207'] Message: HTTP request http://10.35.21.207:61621/connection-status? failed:
An error occurred while connecting: 107: Transport endpoint is not connected. (MainThread)
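The "Transport endpoint is not connected" error means opscenterd could not open a TCP connection to the agent's HTTP port (61621). As a quick first check, a minimal probe along these lines (the IP and port are taken from the log line above) run from the OpsCenter host can confirm whether the port is reachable at all:
# Minimal connectivity probe for the DataStax agent HTTP port (61621),
# run from the OpsCenter host. Host and port taken from the log above.
import socket

host, port = "10.35.21.207", 61621
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as exc:
    # A failure here points at the agent being down or a firewall
    # blocking the port, matching the 107 error in opscenterd.log.
    print(f"TCP connection to {host}:{port} failed: {exc}")
If the probe fails, check that the datastax-agent process is actually listening on that node and that no firewall sits between the hosts.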
When using Lifecycle Manager, it sees the cluster name I picked but cannot connect. Here's what the log looks like when I attempt to start managing the unmanaged cluster:
[opscenterd] ERROR: Problem while calling ImportClusterIntoLifecycleManagerController (AgentCommunicationFailure): Cluster Import Failure: Unable to determine the DSE version for the specified cluster. Please verify that the Agents for this cluster are properly communicating with Opscenter.
File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1122, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/WebServer.py", line 2598, in ImportClusterIntoLifecycleManagerController
I'm trying to submit a Spark job to YARN (without HDFS) in HA mode.
For submitting, I'm using org.apache.spark.deploy.SparkSubmit.
When I submit from the machine with the active ResourceManager, it works well. But if I try to submit from the machine with the standby ResourceManager, the job fails with this error:
DEBUG org.apache.hadoop.ipc.Client - Connecting to spark2-node-dev/10.10.10.167:8032
DEBUG org.apache.hadoop.ipc.Client - Connecting to /0.0.0.0:8032
org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
However, when I submit via the command line (spark-submit), it works well from both the active and the standby machine.
What could be causing the problem?
P.S. I use the same parameters for both ways of submitting the job (org.apache.spark.deploy.SparkSubmit and the spark-submit command line), and the yarn.resourcemanager.hostname.rm_id properties are defined for all RM hosts.
The problem was the absence of yarn-site.xml from the classpath of the spark-submitter jar. The spark-submitter jar does not take the YARN_CONF_DIR or HADOOP_CONF_DIR environment variables into account, so it cannot see yarn-site.xml.
One solution I found was to put yarn-site.xml on the jar's classpath; a sketch of the idea follows.
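As an illustration (a hedged sketch, not the exact code from my project; all paths and the application class are placeholders), one way to do this when launching SparkSubmit as a plain Java class is to prepend the directory containing yarn-site.xml to the classpath yourself:
# Sketch: invoke org.apache.spark.deploy.SparkSubmit with the YARN
# config directory explicitly on the classpath, since a direct call
# does not honor YARN_CONF_DIR/HADOOP_CONF_DIR the way the
# spark-submit script does. All paths below are placeholders.
import os
import subprocess

yarn_conf_dir = "/etc/hadoop/conf"   # directory containing yarn-site.xml
spark_jars = "/opt/spark/jars/*"     # Spark's own jars

classpath = os.pathsep.join([yarn_conf_dir, spark_jars])

subprocess.check_call([
    "java", "-cp", classpath,
    "org.apache.spark.deploy.SparkSubmit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--class", "com.example.MyApp",  # placeholder application class
    "/path/to/app.jar",              # placeholder application jar
])
With yarn-site.xml visible on the classpath, the YARN client can resolve the HA ResourceManager addresses instead of falling back to the 0.0.0.0 default seen in the log above.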
I am facing an issue when I try to connect the opscenter-agent to OpsCenter. In the UI, when we enter any node IP, it shows: Error creating cluster: Timeout while adding cluster. Please check the log for details on the problem.
In the OpsCenter log I am getting the error below:
2016-10-18 09:35:12+0000 [MW_Dev_Cluster] WARN: Unable to collect datacenter, rack information: Failed query to http://10.164.120.116:61621/cluster/topology?node_ip=10.184.27.237 : [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
2016-10-18 09:35:12+0000 [MW_Dev_Cluster] WARN: HTTP request http://10.185.183.5:61621/cluster/topology?node_ip=10.112.73.35 failed: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionDone'>>]
In the OpsCenter agent config file address.yaml, only the entry below is present:
stomp_interface: "10.166.71.92"
After starting the Repair Service, it shows a percentage indicating the progress of the current repair. When the whole cluster is repaired, it goes OFF again.
I thought it would keep repairing the whole cluster continuously, starting over again and again, but it appears to "finish", which is not what I expected.
Did I miss something?
OpsCenter 5.2.0
DSE 4.6.7
Edit:
Logs:
2015-09-02 08:33:34+0000 [XX] INFO: Detected a topology change. The Repair Service will stop now and check the cluster topology every 5 minutes. If the cluster is stable, the Repair Service will start again.
2015-09-02 08:33:34+0000 [XX] INFO: Stopping Repair Service
2015-09-02 08:48:34+0000 [] INFO: Unhandled error in Deferred:
2015-09-02 08:48:34+0000 [] Unhandled Error
Traceback (most recent call last):
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1076, in gotResult
_inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python2.7/dist-packages/opscenterd/cluster/Repair.py", line 909, in startRepairService
opscenterd.cluster.Repair.RepairServiceAlreadyRunning: The Repair Service is already running.
It seems that OpsCenter is failing to restart the Repair Service after a topology change (adding a node).
You are not experiencing the expected behavior of the repair service as detailed in the documentation:
http://docs.datastax.com/en/opscenter/5.2//opsc/online_help/services/repairService.html
I tested the Repair Service with OpsCenter 5.2.0 and DSE 4.7.3, and it behaved as expected: after completing a repair run, it promptly started a new one. This was visible in OpsCenter on the Services screen (it is not visible in Activities).
As stated in the comments, you should review the logs and see what "bread crumbs" you can find.
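If you want to check the Repair Service state outside the UI, something like the following may also help. This is only a sketch: it assumes the OpsCenter REST API exposes the Repair Service status at /{cluster_id}/services/repair, and the host, port, and cluster id are placeholders:
# Hedged sketch: query the Repair Service status via the OpsCenter
# REST API. The endpoint path is my assumption about the OpsCenter 5.x
# API; "MyCluster" and the host/port are placeholders.
import json
import urllib.request

url = "http://opscenter-host:8888/MyCluster/services/repair"
with urllib.request.urlopen(url, timeout=10) as resp:
    status = json.load(resp)
print(status)  # should indicate whether the service is currently running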
I installed Cassandra on my local Mac, together with OpsCenter from DataStax. When I create the new cluster, it gives me the error:
"Error
Error provisioning cluster: Unable to login to some nodes. Please check your user/creds"
though I am very sure the "Node Credentials (sudo)" username and password are entered correctly. Any ideas?
The server-side logs show:
"2015-06-02 17:00:48+0800 [] INFO: Sleeping before retrying ssh login.
2015-06-02 17:00:53+0800 [] There was a problem verifying an ssh login on 127.0.0.1
Traceback (most recent call last):
Failure: opscenterd.SecureShell.SshFailed: ssh to u'127.0.0.1' failed"
However, when I open a terminal and try ssh username@127.0.0.1, I can access it without a problem.
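One hedged guess: an interactive ssh from your terminal may be succeeding through key-based authentication, while OpsCenter's provisioning logs in with the username and password you entered in the UI. A minimal paramiko sketch (credentials are placeholders) isolates the password-only path:
# Sketch: test password-only ssh login to localhost, the path
# OpsCenter's provisioning takes. Credentials are placeholders.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
try:
    client.connect(
        "127.0.0.1",
        username="youruser",      # placeholder
        password="yourpassword",  # placeholder
        look_for_keys=False,      # do not fall back to ~/.ssh keys
        allow_agent=False,        # do not use the ssh agent
    )
    print("password-based login OK")
finally:
    client.close()
If this fails while plain ssh works, check that PasswordAuthentication is set to yes in your sshd_config.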
I am running a Spark (1.2.1) standalone cluster on my virtual machines (Ubuntu 12.04). I can run examples such as als.py and pi.py successfully, but I can't run the wordcount.py example because a connection error occurs.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
.reduceByKey(add)
File "/home/spark/spark/lib/spark-assembly-1.2.1 hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I didn't use YARN or ZooKeeper, and all the virtual machines can connect to each other via ssh without a password. I also set SPARK_LOCAL_IP for the master and workers.
I think the wordcount.py example is accessing HDFS to read the lines of a file (and then count the words).
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is usually used for HDFS.
Please make sure that this file is accessible, or don't use HDFS for that example :).
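For instance, here is a minimal local-file variant of the example (the input path is a placeholder and must be readable at the same location on every node):
# Sketch: word count over a local file instead of HDFS. The file path
# is a placeholder and must exist on every node at the same path.
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="WordCountLocal")
counts = (sc.textFile("file:///home/spark/Documents/Spark_Examples/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))
for word, count in counts.collect():
    print(word, count)
sc.stop()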
I hope it helps.