Error while initializing Ray on an EC2 master node

I am using Ray to run a parallel loop on an Ubuntu 14.04 cluster on AWS EC2. The following Python 3 script works well on my local machine with just 4 workers (imports and local initializations left out):-
ray.init() #initialize Ray

@ray.remote
def test_loop(n):
    c=tests[n,0]
    tout=100
    rc=-1
    with tmp.TemporaryDirectory() as path: #Create a temporary directory
        for files in filelist:            #then copy in all of the
            sh.copy(files,path)           #files
        txtfile=path+'/inputf.txt'        #create the external
        fileId=open(txtfile,'w')          #data input text file,
        s='Number = '+str(c)+"\n"         #write test number,
        fileId.write(s)
        fileId.close()                    #close external parameter file,
        os.chdir(path)                    #and change working directory
        try:                              #Try running simulation:
            rc=sp.call('./simulation.run',timeout=tout,stdout=sp.DEVNULL,
                       stderr=sp.DEVNULL,shell=True) #(must use .call for timeout)
            outdat=sio.loadmat('outputf.dat')  #get the output data struct
            rt_Data=outdat.get('rt_Data')      #extract simulation output
            err=float(rt_Data[-1])             #use final value of error
        except:                           #If system fails to execute,
            err=deferr                    #use failure default
        #end try
        if (err<=0) or (err>deferr) or (rc!=0):
            err=deferr                    #Catch other types of failure
    return err

if __name__=='__main__':
    result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
    print(result)
The unusual bit here is that simulation.run has to read a different test number from an external text file each time it runs. The file name is the same for every iteration of the loop, but the test number differs.
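For example, if c = 42 for a given test, the inputf.txt written into that worker's temporary directory contains the single line:
Number = 42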
I launched an EC2 cluster using Ray, with the number of CPUs available equal to n (I am trusting that Ray will not default to multi-threading). I then had to copy the filelist (which includes the Python script) from my local machine to the master node using rsync, because I couldn't do this from the config (see recent question: "Workers not being launched on EC2 by Ray"). Then I ssh into that node and run the script. The result is a file-finding error:-
~$ python3 test_small.py
2019-04-29 23:39:27,065 WARNING worker.py:1337 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-04-29 23:39:27,065 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,172 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:42930 to respond...
2019-04-29 23:39:27,281 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:47779 to respond...
2019-04-29 23:39:27,282 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-29 23:39:27,296 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,296 INFO services.py:1427 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
(pid=3917) sh: 0: getcwd() failed: No such file or directory
2019-04-29 23:39:44,960 ERROR worker.py:1672 -- Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 909, in _process_task
self._store_outputs_in_object_store(return_object_ids, outputs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 820, in _store_outputs_in_object_store
self.put_object(object_ids[i], outputs[i])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 375, in put_object
self.store_and_register(object_id, value)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 309, in store_and_register
self.task_driver_id))
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 238, in get_serialization_context
_initialize_serialization(driver_id)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 1148, in _initialize_serialization
serialization_context = pyarrow.default_serialization_context()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 326, in default_serialization_context
register_default_serialization_handlers(context)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 321, in register_default_serialization_handlers
_register_custom_pandas_handlers(serialization_context)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 129, in _register_custom_pandas_handlers
import pandas as pd
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/__init__.py", line 42, in <module>
from pandas.core.api import *
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/api.py", line 10, in <module>
from pandas.core.groupby import Grouper
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/groupby.py", line 49, in <module>
from pandas.core.frame import DataFrame
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 74, in <module>
from pandas.core.series import Series
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 3042, in <module>
import pandas.plotting._core as _gfx # noqa
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/__init__.py", line 8, in <module>
from pandas.plotting import _converter
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py", line 7, in <module>
import matplotlib.units as units
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1060, in <module>
rcParams = rc_params()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 892, in rc_params
fname = matplotlib_fname()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 736, in matplotlib_fname
for fname in gen_candidates():
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 725, in gen_candidates
yield os.path.join(six.moves.getcwd(), 'matplotlibrc')
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
The problem then seems to repeat for all the other workers and finally gives up:-
AttributeError: module 'pandas' has no attribute 'core'
This error is unexpected and should not have happened. Somehow a worker
crashed in an unanticipated way causing the main_loop to throw an exception,
which is being caught in "python/ray/workers/default_worker.py".
2019-04-29 23:44:08,489 ERROR worker.py:1672 -- A worker died or was killed while executing task 000000002d95245f833cdbf259672412d8455d89.
Traceback (most recent call last):
File "test_small.py", line 82, in <module>
result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2184, in get
raise value
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.
I suspect that I am not initializing Ray correctly. I tried with ray.init(redis_address="172.31.50.149:6379") - which was the redis address given when the cluster was formed, but the error was more or less the same. I also tried starting Ray on the master (in case it needed starting):-
~$ ray start --redis-address 172.31.50.149:6379 #Start Ray
2019-04-29 23:46:20,774 INFO services.py:407 -- Waiting for redis server at 172.31.50.149:6379 to respond...
2019-04-29 23:48:29,076 INFO services.py:412 -- Failed to connect to the redis server, retrying.
....etc.

Installing pandas and matplotlib on the master node seems to have solved the problem. Ray now initializes successfully.
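In concrete terms that was just the following, assuming pip targets the same Anaconda environment the Ray workers run in (setproctitle is optional and only silences the worker-name warning in the log above):
~$ pip install pandas matplotlib setproctitle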

Related

starting celery worker from python script results in error - "click.exceptions.UsageError: No such command" using celery==5.1.2

Directory structure:
Here is my cw_manage_integration/psa_integration/api_service/sync_config/__init__.py:
from celery import Celery
from kombu import Queue
from psa_integration.celery_config import QUEUE, USER, MAX_PRIORITIES_SUPPORT_AT_TIME

BROKER = "amqp://{0}:{1}@{2}/xyz".format("abc", "pqrst", "x.x.x.x")

APP = Celery(
    "sync service",
    broker=BROKER,
    backend='rpc://',
    include=["psa_integration.sync_service.alert_sync.alert",
             "psa_integration.sync_service.tenant_sync.tenant",
             "psa_integration.sync_service.alert_sync.update_status"]
)

APP.conf.task_queues = [
    Queue(QUEUE, queue_arguments={'x-max-priority': MAX_PRIORITIES_SUPPORT_AT_TIME}),
]
The below is the cw_manage_integration/start_service.py:
"""Script to start Sync service via Celery."""
from psa_integration.utils.logger import *
from psa_integration import sync_service
from psa_integration.celery_config import CELERY_CONCURRENCY

APP = sync_service.APP

try:
    APP.start(["__init__.py", "worker", "-c", str(CELERY_CONCURRENCY)])
except Exception as scheduler_exception:
    logging.exception("Exception occurred while starting services. Exception = {}".format(scheduler_exception))
When I run python3 start_service.py with celery==4.4.5, it works fine and starts the Celery workers.
But when the same start_service.py is run with celery==5.1.2, it throws the error below:
>python3 start_service.py
MainProcess INFO 2021-07-07 16:27:42,725 all_logs 79 : started
MainProcess INFO 2021-07-07 16:27:42,725 all_logs 80 : log file name: /home/sdodmane/PycharmProjects/cw_manage_integration1/cw_manage_integration/psa_integration/logs/worker_2021-07-07.log
MainProcess INFO 2021-07-07 16:27:42,725 all_logs 81 : Level: 4
Traceback (most recent call last):
  File "/home/sdodmane/.local/lib/python3.8/site-packages/click_didyoumean/__init__.py", line 34, in resolve_command
    return super(DYMMixin, self).resolve_command(ctx, args)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1188, in resolve_command
    ctx.fail('No such command "%s".' % original_cmd_name)
  File "/usr/lib/python3/dist-packages/click/core.py", line 496, in fail
    raise UsageError(message, self)
click.exceptions.UsageError: No such command "__init__.py".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "start_service.py", line 10, in <module>
    APP.start(["__init__.py", "worker", "-c", str(CELERY_CONCURRENCY)])
  File "/usr/local/lib/python3.8/dist-packages/celery/app/base.py", line 371, in start
    celery.main(args=argv, standalone_mode=False)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1132, in invoke
    cmd_name, cmd, args = self.resolve_command(ctx, args)
  File "/home/sdodmane/.local/lib/python3.8/site-packages/click_didyoumean/__init__.py", line 42, in resolve_command
    raise click.exceptions.UsageError(error_msg, error.ctx)
click.exceptions.UsageError: No such command "__init__.py".
I am not able to work out what differs between celery==4.4.5 and celery==5.1.2 in this context.
Please help me solve this problem.
The start method is broken in the current (5.1.2) release: Celery 5 moved its command-line handling to click, which treats the first element of the argument list as a subcommand, so "__init__.py" is rejected as an unknown command.
It has been fixed (https://github.com/celery/celery/pull/6825/files), but the fix has not been released yet. Hopefully the next release, v5.1.3, will resolve this issue.
I had a similar problem and was able to fix it with the following change:
# celery==5.1.1
APP.start()
# celery==5.2.6
import sys
APP.start(argv=sys.argv[1:])
For you, that may mean removing the __init__.py in your list of args:
APP.start(["worker", "-c", str(CELERY_CONCURRENCY)])

Getting "PreconditionFailed - inequivalent arg 'x-max-priority' for queue" error when trying to set up priority queues with Celery+RabbitMQ

I have RabbitMQ set up with two queues: low and high. I want my Celery workers to consume from the high-priority queue before consuming tasks from the low-priority queue. I get the following error when trying to push a message into RabbitMQ:
>>> import tasks
>>> tasks.high.apply_async()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/vagrant/.local/lib/python3.6/site-packages/celery/app/task.py", line 570, in apply_async
**options
File "/home/vagrant/.local/lib/python3.6/site-packages/celery/app/base.py", line 756, in send_task
amqp.send_task_message(P, name, message, **options)
File "/home/vagrant/.local/lib/python3.6/site-packages/celery/app/amqp.py", line 552, in send_task_message
**properties
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
exchange_name, declare,
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/connection.py", line 510, in _ensured
return fun(*args, **kwargs)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/messaging.py", line 194, in _publish
[maybe_declare(entity) for entity in declare]
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/messaging.py", line 194, in <listcomp>
[maybe_declare(entity) for entity in declare]
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/messaging.py", line 102, in maybe_declare
return maybe_declare(entity, self.channel, retry, **retry_policy)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/common.py", line 121, in maybe_declare
return _maybe_declare(entity, channel)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/common.py", line 145, in _maybe_declare
entity.declare(channel=channel)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/entity.py", line 609, in declare
self._create_queue(nowait=nowait, channel=channel)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/entity.py", line 618, in _create_queue
self.queue_declare(nowait=nowait, passive=False, channel=channel)
File "/home/vagrant/.local/lib/python3.6/site-packages/kombu/entity.py", line 653, in queue_declare
nowait=nowait,
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/channel.py", line 1154, in queue_declare
spec.Queue.DeclareOk, returns_tuple=True,
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 80, in wait
self.connection.drain_events(timeout=timeout)
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/connection.py", line 500, in drain_events
while not self.blocking_read(timeout):
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/connection.py", line 506, in blocking_read
return self.on_inbound_frame(frame)
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
callback(channel, method_sig, buf, None)
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/connection.py", line 510, in on_inbound_method
method_sig, payload, content,
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/abstract_channel.py", line 126, in dispatch_method
listener(*args)
File "/home/vagrant/.local/lib/python3.6/site-packages/amqp/channel.py", line 282, in _on_close
reply_code, reply_text, (class_id, method_id), ChannelError,
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'high' in vhost '/': received none but current is the value '10' of type 'signedint'
Here is my celery configuration
import ssl
broker_url="amqps://"
result_backend="amqp://"
include=["tasks"]
task_acks_late=True
task_default_rate_limit="150/m"
task_time_limit=300
worker_prefetch_multiplier=1
worker_max_tasks_per_child=2
timezone="UTC"
broker_use_ssl = {'keyfile': '/usr/local/share/private/my_key.key', 'certfile': '/usr/local/share/ca-certificates/my_cert.crt', 'ca_certs': '/usr/local/share/ca-certificates/rootca.crt', 'cert_reqs': ssl.CERT_REQUIRED, 'ssl_version': ssl.PROTOCOL_TLSv1_2}
from kombu import Exchange, Queue
task_default_priority=5
task_queue_max_priority = 10
task_queues = [Queue('high', Exchange('high'), routing_key='high', queue_arguments={'x-max-priority': 10}),]
task_routes = {'tasks.high': {'queue': 'high'}}
I have a tasks.py script with the following tasks defined
from __future__ import absolute_import, unicode_literals
from celery_app import celery_app

@celery_app.task
def low(queue='low'):
    print("Low Priority")

@celery_app.task(queue='high')
def high():
    print("HIGH PRIORITY")
And my celery_app.py script:
from __future__ import absolute_import, unicode_literals
from celery import Celery
from celery_once import QueueOnce
import celeryconfig

celery_app = Celery("test")

if __name__ == '__main__':
    celery_app.start()
I am starting the celery workers with this command
celery -A celery_app worker -l info --config celeryconfig --concurrency=16 -n "%h:celery" -O fair -Q high,low
I'm using:
RabbitMQ: 3.7.17
Celery: 4.3.0
Python: 3.6.7
OS: Ubuntu 18.04.3 LTS bionic
I recently got stuck with the same issue and found this question, so I decided to post a possible solution for anyone who finds it in the future.
The error message means that the queue has already been declared with x-max-priority 10, but the new declaration contains no priority. For example, here is a similar issue with x-expires, with a good explanation:
Celery insists that every client knows in advance how a queue was created.
To fix the issue you can vary the following things:
change task_queue_max_priority (which defines the default value of a queue's x-max-priority), or get rid of it.
declare queue low with queue_arguments={'x-max-priority': 10}, as you did for queue high (see the sketch below).
For me the problem was solved once all queue declarations matched the previously created queues.
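A minimal sketch of the second option; the Exchange and routing_key for low are assumptions by analogy with the existing high declaration:
from kombu import Exchange, Queue

# both queues now carry the same x-max-priority that the broker
# already has on record for 'high'
task_queues = [
    Queue('high', Exchange('high'), routing_key='high',
          queue_arguments={'x-max-priority': 10}),
    Queue('low', Exchange('low'), routing_key='low',
          queue_arguments={'x-max-priority': 10}),
]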

Cassandra retry_policy is not enforced in python client?

I have a single-node Apache Cassandra 3.11.6 cluster.
First, an Apache Airflow DAG task writes 2750 rows of 10000 columns each (this passes successfully).
Immediately afterwards I try to connect to Cassandra to perform various reads inside another set of parallel tasks of the same Airflow DAG, and it fails with:
ERROR - ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
I have configured retries in execution_profiles, but it seems they're not enforced, or I am misreading how the "retry" is supposed to work on the client side.
nodetool status shows UN, meaning Up/Normal.
I have multiple DAG tasks running in parallel pulling info from Cassandra. Some of them finish successfully (green), but some fail with the OperationTimedOut exception.
Failed connections are not retried, as you can see in the following Apache Airflow log:
it started at [2020-07-22 22:16:28,345]
and it errored at [2020-07-22 22:16:31,546]
which is just 3 seconds later. However, in the profile I set:
retry_policy=ConstantReconnectionPolicy(delay=10),
Log
[2020-07-22 22:16:28,345] {{taskinstance.py:880}} INFO - Starting attempt 1 of 1
[2020-07-22 22:16:28,345] {{taskinstance.py:881}} INFO -
--------------------------------------------------------------------------------
[2020-07-22 22:16:28,359] {{taskinstance.py:900}} INFO - Executing <Task(DjangoOperator): RespondentMediaValueMatrixImportStep> on 2020-07-22T22:02:20+00:00
[2020-07-22 22:16:28,363] {{standard_task_runner.py:53}} INFO - Started process 651 to run task
[2020-07-22 22:16:28,622] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: etl-run-dag.RespondentMediaValueMatrixImportStep 2020-07-22T22:02:20+00:00 [running]> 10.0.102.143
[2020-07-22 22:16:28,803] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:28,802] {{connection.py:101}} WARNING - Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['cassandra-node0.dev.emotionaldna.host'], EPs without explicit LBPs = ('EXEC_PROFILE_DEFAULT',))
[2020-07-22 22:16:29,543] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:29,543] {{policies.py:292}} INFO - Using datacenter 'us-east-2' for DCAwareRoundRobinPolicy (via host '10.0.1.135:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
[2020-07-22 22:16:31,545] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,545] {{connection.py:103}} WARNING - [control connection] Error connecting to 10.0.1.135:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
File "cassandra/cluster.py", line 3591, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 3588, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 3690, in cassandra.cluster.ControlConnection._refresh_schema
File "cassandra/metadata.py", line 142, in cassandra.metadata.Metadata.refresh
File "cassandra/metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
File "cassandra/metadata.py", line 2522, in get_all_keyspaces
File "cassandra/metadata.py", line 2031, in get_all_keyspaces
File "cassandra/metadata.py", line 2719, in cassandra.metadata.SchemaParserV3._query_all
File "cassandra/connection.py", line 985, in cassandra.connection.Connection.wait_for_responses
File "cassandra/connection.py", line 983, in cassandra.connection.Connection.wait_for_responses
File "cassandra/connection.py", line 1435, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
[2020-07-22 22:16:31,546] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,545] {{connection.py:103}} ERROR - Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
File "cassandra/cluster.py", line 1690, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 3488, in cassandra.cluster.ControlConnection.connect
File "cassandra/cluster.py", line 3533, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
[2020-07-22 22:16:31,546] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,546] {{connection.py:107}} WARNING - [Connection: default] connect failed, setting up for re-attempt on first use
[2020-07-22 22:16:31,546] {{taskinstance.py:1145}} ERROR - ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 978, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/app/dags/etl/workflow.py", line 126, in run_import_step
keyspace=task_instance.get('cassandra_keyspace')
File "/app/etl_process/import_steps/mixins.py", line 469, in __init__
super().__init__(*args, **kwargs)
File "/app/etl_process/import_steps/mixins.py", line 268, in __init__
super().__init__(*args, **kwargs)
File "/app/etl_process/import_steps/abstract.py", line 177, in __init__
self._cas = get_session()
File "/app/etl_process/cassandra/client.py", line 60, in get_session
execution_profiles={EXEC_PROFILE_DEFAULT: profile},
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 326, in setup
retry_connect=retry_connect, cluster_options=kwargs, default=True)
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 195, in register_connection
conn.setup()
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 103, in setup
self.session = self.cluster.connect()
File "cassandra/cluster.py", line 1667, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1703, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1690, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 3488, in cassandra.cluster.ControlConnection.connect
File "cassandra/cluster.py", line 3533, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
settings.CASSANDRA is
CASSANDRA_REQUEST_TIMEOUT = 90000

CASSANDRA = {
    'NAME': 'cassandra',
    'USER': user,
    'PASSWORD': password,
    'TEST_NAME': 'test_db',
    'HOST': host,
    'OPTIONS': {
        'replication': {
            'strategy_class': 'SimpleStrategy',
            'replication_factor': 1,
        },
        'connection': {
            'consistency': CASSANDRA_CONSISTENCY_LEVEL,
            'retry_connect': True,
        },
        'session': {
            'default_timeout': CASSANDRA_REQUEST_TIMEOUT,
            'default_fetch_size': 10000,
        },
    },
}
get_session()
from django.conf import settings
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.cqlengine import connection
from cassandra.policies import (
ConstantReconnectionPolicy, DowngradingConsistencyRetryPolicy
)
from cassandra.query import tuple_factory
__all__ = ['get_session']
def get_session(
    keyspace: str = None,
    consistency_level=settings.CASSANDRA_CONSISTENCY_LEVEL,
    request_timeout=settings.CASSANDRA_REQUEST_TIMEOUT,
) -> connection:
    """Initiate connection with apache cassandra cluster.

    Arguments:
    :param str keyspace: default keyspace to connect to
    :param int consistency_level: desired consistency level of the connection
    :param int request_timeout: cassandra request timeout. If wait time exceeds
        this number, then cassandra will send 1300 error code with 0 nodes
        replied statement in the response.
    """
    dbconf = settings.CASSANDRA
    auth_provider = PlainTextAuthProvider(
        username=dbconf['USER'],
        password=dbconf['PASSWORD'],
    )
    host = dbconf['HOST']
    # define execution profile for the cluster/session
    profile = ExecutionProfile(
        retry_policy=ConstantReconnectionPolicy(delay=10),
        consistency_level=consistency_level,
        request_timeout=request_timeout,
        row_factory=tuple_factory
    )
    # the host should always be passed as a LIST in the
    # connection setup
    if isinstance(host, str):
        host = [host]
    # setup the connection
    connection.setup(
        host,
        keyspace,
        retry_connect=True,
        protocol_version=4,
        auth_provider=auth_provider,
        consistency=consistency_level,
        execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    )
    return connection.session
cassandra.yaml
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 600000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 600000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 600000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 100000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 100000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 600000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 600000
# How long before a node logs slow queries. Select queries that take longer than
# this timeout to execute, will generate an aggregated log message, so that slow queries
# can be identified. Set this value to zero to disable slow query logging.
slow_query_log_timeout_in_ms: 3000
# Enable operation timeout information exchange between nodes to accurately
# measure request timeouts. If disabled, replicas will assume that requests
# were forwarded to them instantly by the coordinator, which means that
# under overload conditions we will waste that much extra time processing
# already-timed-out requests.
#
# Warning: before enabling this property make sure to ntp is installed
# and the times are synchronized between the nodes.
cross_node_timeout: false
You're using the incorrect class as the parameter for retry_policy. What you're specifying is a reconnection policy, which defines how to re-connect to a node that has been marked DOWN. A retry policy defines what to do with failed statements. You can also simply omit it: by default it is set to the RetryPolicy class, which may retry some statements, but only if they are marked is_idempotent=True (see the documentation).
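As a rough sketch of the distinction, using the plain driver API rather than cqlengine's connection.setup, so not a drop-in replacement for the code above:
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import ConstantReconnectionPolicy, RetryPolicy

# retry policy: what to do with failed statements
# (this is the default and may simply be omitted)
profile = ExecutionProfile(
    retry_policy=RetryPolicy(),
    request_timeout=90,
)

# reconnection policy: how to re-connect to a node marked DOWN;
# it is a Cluster-level option, not an ExecutionProfile one
cluster = Cluster(
    ['10.0.1.135'],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    reconnection_policy=ConstantReconnectionPolicy(delay=10.0),
)
session = cluster.connect()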

Conflicting logging with Python pid package

I've been working on a Python daemon using pid, and after initially logging directly to the console I wanted to switch to file logging using Python's logging module. This is when I ran into the problem.
I have start/stop functions to manage the daemon:
import os
import sys
import time
import signal
import lockfile
import logging
import logging.config
import daemon

from pid import PidFile
from mpmonitor.monitor import MempoolMonitor

# logging.config.fileConfig(fname="logging.conf", disable_existing_loggers=False)
# log = logging.getLogger("mpmonitor")

def start():
    print("Starting Mempool Monitor")
    _pid_file = PidFile(pidname="mpmonitor.pid", piddir=curr_dir)
    with daemon.DaemonContext(stdout=sys.stdout,
                              stderr=sys.stderr,
                              stdin=sys.stdin,
                              pidfile=_pid_file):
        # Start the monitor:
        mpmonitor = MempoolMonitor()
        mpmonitor.run()

def stop():
    print("\n{}\n".format(pid_file))
    try:
        with open(pid_file, "r") as f:
            content = f.read()
    except FileNotFoundError as fnf_err:
        print("WARNING - PID file not found, cannot stop daemon.\n({})".format(pid_file))
        sys.exit()
    print("Stopping Mempool Monitor")
    # log.info("Stopping Mempool Monitor")
    pid = int(content)
    os.kill(pid, signal.SIGTERM)
    sys.exit()
which works as you would expect it to. (Note the logging code is commented.)
Uncommenting the logging code breaks everything, and some pretty random stuff happens. The error message (trimmed down; the full traceback looks like spam):
--- Logging error ---
OSError: [Errno 9] Bad file descriptor
Call stack:
File "/home/leilerg/.local/lib/python3.6/site-packages/pid/__init__.py", line 77, in setup
self.logger.debug("%r entering setup", self)
Message: '%r entering setup'
Arguments: (<pid.PidFile object at 0x7fc8faa479e8>,)
--- Logging error ---
OSError: [Errno 9] Bad file descriptor
Call stack:
File "/home/leilerg/.local/lib/python3.6/site-packages/pid/__init__.py", line 170, in create
self.logger.debug("%r create pidfile: %s", self, self.filename)
Message: '%r create pidfile: %s'
Arguments: (<pid.PidFile object at 0x7fc8faa479e8>, '/home/leilerg/python/mempoolmon/mpmonitor.pid')
Traceback (most recent call last):
File "/home/leilerg/.local/lib/python3.6/site-packages/pid/__init__.py", line 136, in inner_check
pid = int(pid_str)
ValueError: invalid literal for int() with base 10: 'DEBUG - 2020-04-'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/leilerg/.local/lib/python3.6/site-packages/pid/__init__.py", line 139, in inner_check
raise PidFileUnreadableError(exc)
pid.PidFileUnreadableError: invalid literal for int() with base 10: 'DEBUG - 2020-04-'
--- Logging error ---
Traceback (most recent call last):
OSError: [Errno 9] Bad file descriptor
Call stack:
File "/home/leilerg/.local/lib/python3.6/site-packages/pid/__init__.py", line 197, in close
self.logger.debug("%r closing pidfile: %s", self, self.filename)
Message: '%r closing pidfile: %s'
Arguments: (<pid.PidFile object at 0x7fc8faa479e8>, '/home/leilerg/python/mempoolmon/mpmonitor.pid')
The random stuff I was referring to: the file mpmonitor.pid no longer contains a PID, but instead some attempted log/error messages:
user@mylaptor:mempoolmon: cat mpmonitor.pid
DEBUG - 2020-04-05 10:52:55,676 - PidFile: <pid.PidFile object at 0x7fc8faa479e8> entering setup
DEBUG - 2020-04-05 10:52:55,678 - PidFile: <pid.PidFile object at 0x7fc8faa479e8> create pidfile: /home/leilerg/python/mempoolmon/mpmonitor.pid
DEBUG - 2020-04-05 10:52:55,678 - PidFile: <pid.PidFile object at 0x7fc8faa479e8> check pidfile: /home/leilerg/python/mempoolmon/mpmonitor.pid
DEBUG - 2020-04-05 10:52:55,678 - PidFile: <pid.PidFile object at 0x7fc8faa479e8> closing pidfile: /home/leilerg/python/mempoolmon/mpmonitor.pid
To me this looks like the pid logfile got confused with the PID file somehow. This is odd as I explicitly set disable_existing_loggers=False.
Any ideas?
If relevant, I'm on the latest Linux Mint. I also posted the question on the pid project GitHub, as I suspect this is a bug.
The problem has been solved on the project's GitHub page, in issue 31.
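For readers who don't want to chase the link: the symptom (log records landing in the pidfile) suggests the daemonization step closed or reused the file descriptors underneath the logger. A common workaround for python-daemon, sketched here under that assumption (it is not necessarily the exact fix from issue 31), is to preserve the log handler's stream across daemonization:
import logging
import daemon
from pid import PidFile
from mpmonitor.monitor import MempoolMonitor

# configure file logging BEFORE entering the daemon context
handler = logging.FileHandler("mpmonitor.log")
log = logging.getLogger("mpmonitor")
log.addHandler(handler)

# files_preserve keeps the handler's file descriptor open when
# DaemonContext closes everything else, so log output can no longer
# bleed into whatever descriptor the pidfile happens to get
with daemon.DaemonContext(pidfile=PidFile(pidname="mpmonitor.pid"),
                          files_preserve=[handler.stream]):
    mpmonitor = MempoolMonitor()
    mpmonitor.run()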

GlfwError: Failed to create GLFW window (windows)

I am using OpenAI Gym on Windows 10 x64, with Python 3.6.7, through Windows Remote Desktop.
I succeeded in installing atari-py and mujoco-py, but when I tried running this code:
import gym

env = gym.make('Humanoid-v2')
for i_episode in range(100):
    env.reset()
    for t in range(100):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
I got this error:
GLFW error (code %d): %s 65544 b'Vulkan: Failed to query instance extension count: The requested version of Vulkan is not supported by the driver or is otherwise incompatible'
Creating window glfw
GLFW error (code %d): %s 65542 b'WGL: The driver does not appear to support OpenGL'
Traceback (most recent call last):
File "test.py", line 7, in <module>
env.render()
File "D:\ReinforceLearning\RLenv\lib\site-packages\gym\core.py", line 275, in render
return self.env.render(mode, **kwargs)
File "D:\ReinforceLearning\RLenv\lib\site-packages\gym\envs\mujoco\mujoco_env.py", line 118, in render
self._get_viewer(mode).render()
File "D:\ReinforceLearning\RLenv\lib\site-packages\gym\envs\mujoco\mujoco_env.py", line 130, in _get_viewer
self.viewer = mujoco_py.MjViewer(self.sim)
File "D:\ReinforceLearning\RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\mjviewer.py", line 130, in __init__
super().__init__(sim)
File "D:\ReinforceLearning\RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\mjviewer.py", line 25, in __init__
super().__init__(sim)
File "RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\mjrendercontext.pyx", line 244, in mujoco_py.cymj.MjRenderContextWindow.__init__
super().__init__(sim, offscreen=False)
File "RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\mjrendercontext.pyx", line 43, in mujoco_py.cymj.MjRenderContext.__init__
self._setup_opengl_context(offscreen)
File "RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\mjrendercontext.pyx", line 92, in mujoco_py.cymj.MjRenderContext._setup_opengl_context
self._opengl_context = GlfwContext(offscreen=offscreen)
File "RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\opengl_context.pyx", line 48, in mujoco_py.cymj.GlfwContext.__init__
self.window = self._create_window(offscreen)
File "RLenv\lib\site-packages\mujoco_py-1.50.1.0-py3.6.egg\mujoco_py\opengl_context.pyx", line 97, in mujoco_py.cymj.GlfwContext._create_window
raise GlfwError("Failed to create GLFW window")
mujoco_py.cymj.GlfwError: Failed to create GLFW window
OpenGL over Windows Remote is not supported on NVIDIA GPUs for OpenGL versions after 1.1.
I did a writeup on what workarounds exist:
Current state and solutions for OpenGL over Windows Remote
To rub extra salt into the wound: you can launch an OpenGL context and then connect via Windows Remote, but launching directly inside the session is impossible without workarounds.
