After starting the repair service, it shows a percentage illustrating the current repair process going. When the whole cluster is repaired, it goes OFF again.
I thought it was repairing the whole cluster smoothly, forever, starting again and again, but it appears to "finish"... which is not my expectation
Did I miss something?
OpsCenter 5.2.0
DSE 4.6.7
Edit:
Logs:
2015-09-02 08:33:34+0000 [XX] INFO: Detected a topology change. The Repair Service will stop now and check the cluster topology every 5 minutes. If the cluster is stable, the Repair Service will start again.
2015-09-02 08:33:34+0000 [XX] INFO: Stopping Repair Service
2015-09-02 08:48:34+0000 [] INFO: Unhandled error in Deferred:
2015-09-02 08:48:34+0000 [] Unhandled Error
Traceback (most recent call last):
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 361, in callback
self._startRunCallbacks(result)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 455, in _startRunCallbacks
self._runCallbacks()
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 542, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1076, in gotResult
_inlineCallbacks(r, g, deferred)
--- <exception caught here> ---
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python2.7/dist-packages/opscenterd/cluster/Repair.py", line 909, in startRepairService
opscenterd.cluster.Repair.RepairServiceAlreadyRunning: The Repair Service is already running.
It seems that OpsCenter is failing at starting again the repair service after a topology change (Adding a node)
You are not experiencing the expected behavior of the repair service as detailed in the documentation:
http://docs.datastax.com/en/opscenter/5.2//opsc/online_help/services/repairService.html
I did a test of the repair service with opscenter 5.2.0 and DSE 4.7.3, and it did behave appropriately. After completing the repair service, it started a new one promptly. This was seen in opscenter on the services screen (not visible in activities).
As stated in the comments, you should review the logs and see what "bread crumbs" you can find.
Related
[Question posted by a user on YugabyteDB Community Slack]
Is it possible to run YugabyteDB in the Virtuozzo Container?
I am getting error at startup but don’t know how to resolve problem.
Is this error related to prerequisite checks as such failures are visible in the log?
$ ./bin/yugabyted start
Starting yugabyted...
/ Running system checks...yugabyted crashed. For troubleshooting, contact us on https://www.yugabyte.com/slack or check our FAQ at https://docs.yugabyte.com/latest/faq/
Traceback (most recent call last):
File "./bin/yugabyted", line 1621, in run
args.func()
File "./bin/yugabyted", line 254, in start
self.start_processes()
File "./bin/yugabyted", line 788, in start_processes
prereqs_check_result = self.prereqs_check(ulimits=ulimits_failed)
File "./bin/yugabyted", line 511, in prereqs_check
check = self.linux_prereqs_check()
File "./bin/yugabyted", line 473, in linux_prereqs_check
transparent_hugepages = re.search("\[(.*)\]", transparent_hugepages_check[0]).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Outside of the container, YugabyteDB starts without a problem.
Looks like yugabyted cli is checking for hugepages and Virtuozzo container doesn’t have them. Try running the db manually and it should work.
My opscenter always gets stuck with Loading OpsCenter... BTW this is my first installation and so far I did not get OpsCenter to run.
All three of these run normally.
nodetool status
service dse
service datastax-agent
I could reproduce it on both GChrome & MFirefox. Both remotely & running on localhost.
opscenterd.log :
2017-04-12 15:20:15,877 [myclustername] WARN: These nodes reported this message, Nodes: ['10.35.21.207'] Message: HTTP request http://10.35.21.207:61621/connection-status? failed:
An error occurred while connecting: 107: Transport endpoint is not connected. (MainThread)
when using life cycle manager , it sees my cluster name I picked but could not connect. Here's what the log looks like when I attempt to start managing the un-managed cluster.
[opscenterd] ERROR: Problem while calling ImportClusterIntoLifecycleManagerController (AgentCommunicationFailure): Cluster Import Failure: Unable to determine the DSE version for the specified cluster. Please verify that the Agents for this cluster are properly communicating with Opscenter.
File "/usr/share/opscenter/lib/py/twisted/internet/defer.py", line 1122, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/share/opscenter/jython/Lib/site-packages/opscenterd/WebServer.py", line 2598, in ImportClusterIntoLifecycleManagerController
This code runs perfect when I set master to localhost. The problem occurs when I submit on a cluster with two worker nodes.
All the machines have same version of python and packages. I have also set the path to point to the desired python version i.e. 3.5.1. when I submit my spark job on the master ssh session. I get the following error -
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, .c..internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
File "/hadoop/yarn/nm-local-dir/usercache//appcache/application_1469113139977_0011/container_1469113139977_0011_01_000004/pyspark.zip/pyspark/mllib/init.py", line 25, in
import numpy
ImportError: No module named 'numpy'
I saw other posts where people did not have access to their worker nodes. I do. I get the same message for the other worker node. not sure if I am missing some environment setting. Any help will be much appreciated.
Not sure if this qualifies as a solution. I submitted the same job using dataproc on google platform and it worked without any problem. I believe the best way to run jobs on google cluster is via the utilities offered on google platform. The dataproc utility seems to iron out any issues related to the environment.
I get this error every time...
I use sparkling water...
My conf-file:
***"spark.driver.memory 65g
spark.python.worker.memory 65g
spark.master local[*]"***
The amount of data is about 5 Gb.
There is no another information about this error...
Does anybody know why it happens? Thank you!
***"ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server
Traceback (most recent call last):
File "/data/analytics/Spark1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 690, in start
self.socket.connect((self.address, self.port))
File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused"***
Have you tried setting spark.executor.memory and spark.driver.memory in your Spark configuration file?
See https://stackoverflow.com/a/22742982/5453184 for more info.
Usually, you'll see this error when the Java process get silently killed by the OOM Killer.
The OOM Killer (Out of Memory Killer) is a Linux process that kicks in when the system becomes critically low on memory. It selects a process based on its "badness" score and kills it to reclaim memory.
Read more on OOM Killer here.
Increasing spark.executor.memory and/or spark.driver.memory values will only make things worse in this case, i.e. you may want to do the opposite!
Other options would be to:
increase the number of partitions if you're working with very big data sources;
increase the number of worker nodes;
add more physical memory to worker/driver nodes;
Or, if you're running your driver/workers using docker:
increase docker memory limit;
set --oom-kill-disable on your containers, but make sure you understand possible consequences!
Read more on --oom-kill-disable and other docker memory settings here.
Another point to note if you are on wsl2 using pyspark. Ensure that your wsl2 config file has an increased memory.
# Settings apply across all Linux distros running on WSL 2
[wsl2]
# Limits VM memory to use no more than 4 GB, this can be set as whole numbers using GB or MB
memory=12GB # This was originally set to 3gb which caused me to fail since spark.executor.memory and spark.driver.memory was only able to MAX of 3gb regardless of how high i set it.
# Sets the VM to use eight virtual processors
processors=8
for reference. your .wslconfig config file should be located in C:\Users\USERNAME
We recently upgraded to cassandra 2.0.1 with cqlsh 4.0.1. I am seeing timeout errors/ broken pipe while using the cqlsh client. Please see error trace below. I have verified that the cluster is Up using nodetool and I am able to read/write using mapreduce. Please advice.
Thanks,
Prateek
Traceback (most recent call last):
File "./bin/cqlsh", line 897, in perform_statement_untraced
self.cursor.execute(statement, decoder=decoder)
File "./bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cursor.py", line 80, in execute
response = self.get_response(prepared_q, cl)
File "./bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 77, in get_response
return self.handle_cql_execution_errors(doquery, compressed_q, compress, cl)
File "./bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 96, in handle_cql_execution_errors
return executor(*args, **kwargs)
File "./bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cassandra/Cassandra.py", line 1782, in execute_cql3_query
self.send_execute_cql3_query(query, compression, consistency)
File "./bin/../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cassandra/Cassandra.py", line 1793, in send_execute_cql3_query
self._oprot.trans.flush()
File "./bin/../lib/thrift-python-internal-only-0.9.1.zip/thrift/transport/TTransport.py", line 292, in flush
self.__trans.write(buf)
File "./bin/../lib/thrift-python-internal-only-0.9.1.zip/thrift/transport/TSocket.py", line 128, in write
plus = self.handle.send(buff)
error: [Errno 32] Broken pipe
If you have an open cqlsh session, it will always give you Errno 32 if the Cassandra instance that it connected to was stopped or even just restarted. You will have to restart cqlsh in order to re-establish a connection to the server.
If you see this problem without having stopped or restarted a Cassandra server, then please supply and additional details about conditions that lead up to this error.