I have a single-node Apache Cassandra 3.11.6 cluster.
First, an Apache Airflow DAG task writes 2,750 rows of 10,000 columns each (this step passes successfully).
Immediately afterwards, another set of parallel tasks in the same Airflow DAG tries to connect to Cassandra to perform various reads, and it fails with
ERROR - ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
I have configured retries in execution_profiles, but it seems they're not enforced or I am misreading how the "retry" is supposed to work on the client side.
nodetool status shows UN, which means Up/Normal.
I have multiple DAG tasks running in parallel pulling data from Cassandra. Some of them finish successfully (green), but some fail with the OperationTimedOut exception.
Failed connections are not retried, as you can see in the following Apache Airflow log:
the task started at [2020-07-22 22:16:28,345]
and errored at [2020-07-22 22:16:31,546],
which is just 3 seconds. However, in the profile I set:
retry_policy=ConstantReconnectionPolicy(delay=10),
Log
[2020-07-22 22:16:28,345] {{taskinstance.py:880}} INFO - Starting attempt 1 of 1
[2020-07-22 22:16:28,345] {{taskinstance.py:881}} INFO -
--------------------------------------------------------------------------------
[2020-07-22 22:16:28,359] {{taskinstance.py:900}} INFO - Executing <Task(DjangoOperator): RespondentMediaValueMatrixImportStep> on 2020-07-22T22:02:20+00:00
[2020-07-22 22:16:28,363] {{standard_task_runner.py:53}} INFO - Started process 651 to run task
[2020-07-22 22:16:28,622] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: etl-run-dag.RespondentMediaValueMatrixImportStep 2020-07-22T22:02:20+00:00 [running]> 10.0.102.143
[2020-07-22 22:16:28,803] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:28,802] {{connection.py:101}} WARNING - Cluster.__init__ called with contact_points specified, but load-balancing policies are not specified in some ExecutionProfiles. In the next major version, this will raise an error; please specify a load-balancing policy. (contact_points = ['cassandra-node0.dev.emotionaldna.host'], EPs without explicit LBPs = ('EXEC_PROFILE_DEFAULT',))
[2020-07-22 22:16:29,543] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:29,543] {{policies.py:292}} INFO - Using datacenter 'us-east-2' for DCAwareRoundRobinPolicy (via host '10.0.1.135:9042'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
[2020-07-22 22:16:31,545] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,545] {{connection.py:103}} WARNING - [control connection] Error connecting to 10.0.1.135:9042:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3522, in cassandra.cluster.ControlConnection._reconnect_internal
File "cassandra/cluster.py", line 3591, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 3588, in cassandra.cluster.ControlConnection._try_connect
File "cassandra/cluster.py", line 3690, in cassandra.cluster.ControlConnection._refresh_schema
File "cassandra/metadata.py", line 142, in cassandra.metadata.Metadata.refresh
File "cassandra/metadata.py", line 165, in cassandra.metadata.Metadata._rebuild_all
File "cassandra/metadata.py", line 2522, in get_all_keyspaces
File "cassandra/metadata.py", line 2031, in get_all_keyspaces
File "cassandra/metadata.py", line 2719, in cassandra.metadata.SchemaParserV3._query_all
File "cassandra/connection.py", line 985, in cassandra.connection.Connection.wait_for_responses
File "cassandra/connection.py", line 983, in cassandra.connection.Connection.wait_for_responses
File "cassandra/connection.py", line 1435, in cassandra.connection.ResponseWaiter.deliver
cassandra.OperationTimedOut: errors=None, last_host=None
[2020-07-22 22:16:31,546] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,545] {{connection.py:103}} ERROR - Control connection failed to connect, shutting down Cluster:
Traceback (most recent call last):
File "cassandra/cluster.py", line 1690, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 3488, in cassandra.cluster.ControlConnection.connect
File "cassandra/cluster.py", line 3533, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
[2020-07-22 22:16:31,546] {{logging_mixin.py:112}} INFO - [2020-07-22 22:16:31,546] {{connection.py:107}} WARNING - [Connection: default] connect failed, setting up for re-attempt on first use
[2020-07-22 22:16:31,546] {{taskinstance.py:1145}} ERROR - ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 978, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 113, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python_operator.py", line 118, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/app/dags/etl/workflow.py", line 126, in run_import_step
keyspace=task_instance.get('cassandra_keyspace')
File "/app/etl_process/import_steps/mixins.py", line 469, in __init__
super().__init__(*args, **kwargs)
File "/app/etl_process/import_steps/mixins.py", line 268, in __init__
super().__init__(*args, **kwargs)
File "/app/etl_process/import_steps/abstract.py", line 177, in __init__
self._cas = get_session()
File "/app/etl_process/cassandra/client.py", line 60, in get_session
execution_profiles={EXEC_PROFILE_DEFAULT: profile},
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 326, in setup
retry_connect=retry_connect, cluster_options=kwargs, default=True)
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 195, in register_connection
conn.setup()
File "/usr/local/lib/python3.7/site-packages/cassandra/cqlengine/connection.py", line 103, in setup
self.session = self.cluster.connect()
File "cassandra/cluster.py", line 1667, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1703, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 1690, in cassandra.cluster.Cluster.connect
File "cassandra/cluster.py", line 3488, in cassandra.cluster.ControlConnection.connect
File "cassandra/cluster.py", line 3533, in cassandra.cluster.ControlConnection._reconnect_internal
cassandra.cluster.NoHostAvailable: ('Unable to connect to any servers', {'10.0.1.135:9042': OperationTimedOut('errors=None, last_host=None')})
settings.CASSANDRA
CASSANDRA_REQUEST_TIMEOUT = 90000
CASSANDRA = {
'NAME': 'cassandra',
'USER': user,
'PASSWORD': password,
'TEST_NAME': 'test_db',
'HOST': host,
'OPTIONS': {
'replication': {
'strategy_class': 'SimpleStrategy',
'replication_factor': 1,
},
'connection': {
'consistency': CASSANDRA_CONSISTENCY_LEVEL,
'retry_connect': True,
},
'session': {
'default_timeout': CASSANDRA_REQUEST_TIMEOUT,
'default_fetch_size': 10000,
},
},
}
get_session()
from django.conf import settings
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.cqlengine import connection
from cassandra.policies import (
ConstantReconnectionPolicy, DowngradingConsistencyRetryPolicy
)
from cassandra.query import tuple_factory
__all__ = ['get_session']
def get_session(
keyspace: str = None,
consistency_level=settings.CASSANDRA_CONSISTENCY_LEVEL,
request_timeout=settings.CASSANDRA_REQUEST_TIMEOUT,
) -> connection:
"""Initiate connection with apache cassandra cluster.
Arguments:
:param str keyspace: default keyspace to connect to
:param int consistency_level: desired consistency level of the connection
:param int request_timeout: cassandra request timeout. If wait time exceeds
this number, then cassandra will send 1300 error code with 0 nodes
replied statement in the response.
"""
dbconf = settings.CASSANDRA
auth_provider = PlainTextAuthProvider(
username=dbconf['USER'],
password=dbconf['PASSWORD'],
)
host = dbconf['HOST']
# define execution profile for the cluster/session
profile = ExecutionProfile(
retry_policy=ConstantReconnectionPolicy(delay=10),
consistency_level=consistency_level,
request_timeout=request_timeout,
row_factory=tuple_factory
)
# the host must always be passed to connection.setup() as a list
if isinstance(host, str):
host = [host]
# setup the connection
connection.setup(
host,
keyspace,
retry_connect=True,
protocol_version=4,
auth_provider=auth_provider,
consistency=consistency_level,
execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
return connection.session
cassandra.yaml
# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 600000
# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 600000
# How long the coordinator should wait for writes to complete
write_request_timeout_in_ms: 600000
# How long the coordinator should wait for counter writes to complete
counter_write_request_timeout_in_ms: 100000
# How long a coordinator should continue to retry a CAS operation
# that contends with other proposals for the same row
cas_contention_timeout_in_ms: 100000
# How long the coordinator should wait for truncates to complete
# (This can be much longer, because unless auto_snapshot is disabled
# we need to flush first so we can snapshot before removing the data.)
truncate_request_timeout_in_ms: 600000
# The default timeout for other, miscellaneous operations
request_timeout_in_ms: 600000
# How long before a node logs slow queries. Select queries that take longer than
# this timeout to execute, will generate an aggregated log message, so that slow queries
# can be identified. Set this value to zero to disable slow query logging.
slow_query_log_timeout_in_ms: 3000
# Enable operation timeout information exchange between nodes to accurately
# measure request timeouts. If disabled, replicas will assume that requests
# were forwarded to them instantly by the coordinator, which means that
# under overload conditions we will waste that much extra time processing
# already-timed-out requests.
#
# Warning: before enabling this property make sure NTP is installed
# and the times are synchronized between the nodes.
cross_node_timeout: false
You're using the incorrect class as the parameter for retry_policy. What you're specifying is a reconnection policy, which defines how the driver tries to re-connect to a node that is marked as DOWN. A retry policy defines what to do with failed statements. And you can omit it, as by default it's set to the RetryPolicy class, which may retry some statements, but only if they are marked with is_idempotent=True (see the documentation).
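To make the distinction concrete, here is a minimal sketch of how the two policies would be wired up separately, based on the get_session() above (the contact point, keyspace and timeout are placeholders, and the auth/consistency settings are left out for brevity): the reconnection policy goes into the Cluster options that connection.setup() forwards, while the execution profile either gets a real RetryPolicy or simply omits it to use the driver's default.

# Sketch only (placeholder values): reconnection to a DOWN node is a Cluster
# option; the profile's retry_policy -- if set at all -- must be a RetryPolicy.
from cassandra.cluster import EXEC_PROFILE_DEFAULT, ExecutionProfile
from cassandra.cqlengine import connection
from cassandra.policies import ConstantReconnectionPolicy, RetryPolicy
from cassandra.query import tuple_factory

profile = ExecutionProfile(
    retry_policy=RetryPolicy(),     # statement retries; this is the default anyway
    request_timeout=90,             # placeholder request timeout, in seconds
    row_factory=tuple_factory,
)

connection.setup(
    ['10.0.1.135'],                 # contact points
    'my_keyspace',                  # placeholder keyspace
    retry_connect=True,
    protocol_version=4,
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
    # how to re-connect to a node marked DOWN -- configured on the Cluster itself
    reconnection_policy=ConstantReconnectionPolicy(delay=10.0),
)
session = connection.session

As noted above, the retry policy will only re-run the read statements if they are marked is_idempotent=True.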
My goal is to make an Airflow DAG check whether a file exists in a directory on a different server (in this case, an edge node of a cluster).
My first approach was to make an SSHOperator which triggered a bash script (on the edge-node server) that checks if the directory is empty. This worked: I was able to receive the output from the bash script in the DAG logs, telling me whether the dir is empty or not. However, when the SSHOperator fails (i.e., the script did not find a file in the dir), the current DAG run is interrupted and a new DAG run starts. If this happens multiple times (which is expected), I will end up with a tonne of interrupted DAG runs in the tree view =/
So, my second approach is to use a proper sensor. In this case, the SFTPSensor seems to be the best option.
So here is my python DAG code:
from airflow import DAG
from datetime import timedelta, datetime
from airflow.utils.dates import days_ago
from airflow.models import Variable
import requests
import logging
import time
from airflow.contrib.sensors.sftp_sensor import SFTPSensor
from airflow.operators.python_operator import PythonOperator
def say_bye(**context):
print("byebyeeee!")
default_args = {
'owner': 'airflow',
"start_date": days_ago(1),
}
ssh_id = Variable.get("ssh_connection_id_imb")
source_path = "/trf/cq/millennium/rcp/"
dag = DAG(dag_id='ing_cgd_millennium_t_ukajrnl_imb_test4', default_args=default_args, schedule_interval=None)
with dag:
s0 = SFTPSensor(
task_id='sensing_task',
path=source_path,
fs_conn_id=ssh_id,
poke_interval=60,
mode='reschedule',
retries=1
)
t1 = PythonOperator(task_id='run_this_goodbye',python_callable=say_bye,provide_context=True)
s0 >> t1
My SSH connection (ssh_connection_id_imb) looks like this: https://i.stack.imgur.com/x7iLu.png
And the error:
[2021-03-09 11:56:07,662] {base_hook.py:89} INFO - Using connection to: id: sftp_default. Host: localhost, Port: 22, Schema: None, Login: airflow, Password: None, extra: XXXXXXXX
[2021-03-09 11:56:07,664] {base_hook.py:89} INFO - Using connection to: id: sftp_default. Host: localhost, Port: 22, Schema: None, Login: airflow, Password: None, extra: XXXXXXXX
[2021-03-09 11:56:07,665] {sftp_sensor.py:46} INFO - Poking for lpc600.group.com:/trf/cq/millenium/rcp/C.PGMLNGL.FKM001.041212.20201123.gz
[2021-03-09 11:56:07,665] {logging_mixin.py:112} WARNING - /opt/miniconda/lib/python3.7/site-packages/pysftp/__init__.py:61: UserWarning: Failed to load HostKeys from /root/.ssh/known_hosts. You will need to explicitly load HostKeys (cnopts.hostkeys.load(filename)) or disableHostKey checking (cnopts.hostkeys = None).
warnings.warn(wmsg, UserWarning)
[2021-03-09 11:56:07,666] {taskinstance.py:1150} ERROR - Unable to connect to localhost: [Errno 101] Network is unreachable
Traceback (most recent call last):
File "/opt/miniconda/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
result = task_copy.execute(context=context)
File "/opt/miniconda/lib/python3.7/site-packages/airflow/sensors/base_sensor_operator.py", line 107, in execute
while not self.poke(context):
File "/opt/miniconda/lib/python3.7/site-packages/airflow/contrib/sensors/sftp_sensor.py", line 48, in poke
self.hook.get_mod_time(self.path)
File "/opt/miniconda/lib/python3.7/site-packages/airflow/contrib/hooks/sftp_hook.py", line 219, in get_mod_time
conn = self.get_conn()
File "/opt/miniconda/lib/python3.7/site-packages/airflow/contrib/hooks/sftp_hook.py", line 114, in get_conn
self.conn = pysftp.Connection(**conn_params)
File "/opt/miniconda/lib/python3.7/site-packages/pysftp/__init__.py", line 140, in __init__
self._start_transport(host, port)
File "/opt/miniconda/lib/python3.7/site-packages/pysftp/__init__.py", line 176, in _start_transport
self._transport = paramiko.Transport((host, port))
File "/opt/miniconda/lib/python3.7/site-packages/paramiko/transport.py", line 416, in __init__
"Unable to connect to {}: {}".format(hostname, reason)
paramiko.ssh_exception.SSHException: Unable to connect to localhost: [Errno 101] Network is unreachable
I noticed that the base_hook is pointing to localhost while the sftp_sensor is poking the correct server... Do I need to set up the base hook? Am I missing a step? Thanks for the help! =)
Just realized my errors...
Problem #1: wrong connection argument name:
s0 = SFTPSensor(
task_id='sensing_task',
path=source_path,
sftp_conn_id=ssh_id, # instead of fs_conn_id
poke_interval=60,
mode='reschedule',
retries=1
)
Problem #2: the Extra field needs to be defined in the connection.
I created a public key and added this to the Extra field:
{"key_file": "/airflow/generated_sshkey_dir/id_rsa.pub", "no_host_key_check": true}
Sooo, this makes my connection prone to a man-in-the-middle attack, since I'm not checking the host key. In my case, this solution is sufficient.
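As a side note (not from the original answer): when debugging this kind of problem it can help to exercise the hook the sensor uses directly, outside any DAG run. A hypothetical smoke test, assuming an Airflow 1.10-era install and that the conn id below is replaced with the value stored in the ssh_connection_id_imb Variable:

# Hypothetical smoke test: opens the same SFTP hook the sensor uses, so a wrong
# conn id or a bad Extra field fails fast outside of any DAG run.
from airflow.contrib.hooks.sftp_hook import SFTPHook

hook = SFTPHook(ftp_conn_id="my_sftp_connection")     # assumed conn id
print(hook.get_mod_time("/trf/cq/millennium/rcp/"))   # raises if the connection is broken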
I tried, and sadly failed, to use the Python SQLAlchemy ORM with MonetDB together with a database schema.
A minimal example to demonstrate my problem is the following:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
engine = create_engine("monetdb://monetdb:monetdb@localhost:50000/demo")
connection = engine.connect()
Session = sessionmaker(bind=engine)
session = Session()
class Template(object):
__table_args__ = ({"schema": "test"}, )
Base = declarative_base(cls=Template)
class User(Base):
__tablename__ = "users"
id = Column(Integer, primary_key=True)
name = Column(String)
schemas = [name[0] for name in connection.execute("SELECT name FROM sys.schemas")]
if not "test" in schemas:
connection.execute("CREATE SCHEMA test")
Base.metadata.create_all(bind=engine)
session.add_all([User(name="a"), User(name="b"), User(name="c")])
session.commit()
print(session.query(User).one())
This should work with a clean/empty MonetDB database (e.g. the demo one in Windows).
If the above example is run, it throws an error similar to the following:
Traceback (most recent call last):
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1278, in _execute_context
cursor, statement, parameters, context
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 593, in do_execute
cursor.execute(statement, parameters)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\cursors.py", line 165, in execute
block = self.connection.execute(query)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\connections.py", line 140, in execute
return self.command('s' + query + '\n;')
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\connections.py", line 145, in command
return self.mapi.cmd(command)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\mapi.py", line 266, in cmd
raise exception(msg)
pymonetdb.exceptions.OperationalError: 42000!TODO: column names of level >= 3
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 42, in <module>
print(session.query(User).one())
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3458, in one
ret = self.one_or_none()
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3427, in one_or_none
ret = list(self)
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3503, in __iter__
return self._execute_and_instances(context)
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\orm\query.py", line 3528, in _execute_and_instances
result = conn.execute(querycontext.statement, self._params)
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1014, in execute
return meth(self, multiparams, params)
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 298, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1133, in _execute_clauseelement
distilled_params,
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1318, in _execute_context
e, statement, parameters, cursor, context
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1512, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 178, in raise_
raise exception
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1278, in _execute_context
cursor, statement, parameters, context
File "C:\some\path\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 593, in do_execute
cursor.execute(statement, parameters)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\cursors.py", line 165, in execute
block = self.connection.execute(query)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\connections.py", line 140, in execute
return self.command('s' + query + '\n;')
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\sql\connections.py", line 145, in command
return self.mapi.cmd(command)
File "C:\some\path\Anaconda3\lib\site-packages\pymonetdb\mapi.py", line 266, in cmd
raise exception(msg)
sqlalchemy.exc.OperationalError: (pymonetdb.exceptions.OperationalError) 42000!TODO: column names of level >= 3
[SQL: SELECT test.users.id AS test_users_id, test.users."name" AS test_users_name
FROM test.users]
(Background on this error at: http://sqlalche.me/e/13/e3q8)
And here is what the log from a freshly started MonetDB server on Windows could look like in this scenario:
# MonetDB 5 server v11.37.7 (Jun2020)
# Serving database 'demo', using 8 threads
# Compiled for x86_64-pc-winnt/64bit
# Found 63.847 GiB available main-memory of which we use 52.036 GiB
# Copyright (c) 1993 - July 2008 CWI.
# Copyright (c) August 2008 - 2020 MonetDB B.V., all rights reserved
# Visit https://www.monetdb.org/ for further information
# Listening for connection requests on mapi:monetdb://127.0.0.1:50000/
# SQL catalog created, loading sql scripts once
# loading sql script: 09_like.sql
# loading sql script: 10_math.sql
# loading sql script: 12_url.sql
# loading sql script: 13_date.sql
# loading sql script: 14_inet.sql
# loading sql script: 15_querylog.sql
# loading sql script: 16_tracelog.sql
# loading sql script: 17_temporal.sql
# loading sql script: 18_index.sql
# loading sql script: 20_vacuum.sql
# loading sql script: 21_dependency_views.sql
# loading sql script: 22_clients.sql
# loading sql script: 23_skyserver.sql
# loading sql script: 25_debug.sql
# loading sql script: 26_sysmon.sql
# loading sql script: 27_rejects.sql
# loading sql script: 39_analytics.sql
# loading sql script: 40_json.sql
# loading sql script: 41_md5sum.sql
# loading sql script: 45_uuid.sql
# loading sql script: 46_profiler.sql
# loading sql script: 51_sys_schema_extension.sql
# loading sql script: 58_hot_snapshot.sql
# loading sql script: 60_wlcr.sql
# loading sql script: 61_wlcr.sql
# loading sql script: 75_storagemodel.sql
# loading sql script: 80_statistics.sql
# loading sql script: 80_udf.sql
# loading sql script: 81_tracer.sql
# loading sql script: 90_generator.sql
# loading sql script: 99_system.sql
# MonetDB/SQL module loaded
# MonetDB server is started. To stop server press Ctrl-C.
#client1: createExceptionInternal: !ERROR: ParseException:SQLparser:42000!TODO: column names of level >= 3
It seems that the query
SELECT test.users.id AS test_users_id, test.users."name" AS test_users_name FROM test.users
can't be handled correctly by the MonetDB API/driver.
Related bug reports can be found, too:
https://www.monetdb.org/bugzilla/show_bug.cgi?id=2526
https://www.monetdb.org/bugzilla/show_bug.cgi?id=2854
https://www.monetdb.org/bugzilla/show_bug.cgi?id=3062
Sadly, as the bugs were first reported around 2010, this issue probably won't be fixed soon (or ever).
And finally here is some version information:
System: Windows 10 1809
MonetDB: 20200529
python: 3.7.7
pymonetdb: 1.3.1
sqlalchemy: 1.3.18
sqlalchemy-monetdb: 1.0.0
Does anyone know a way to work around this issue, e.g. by telling the SQLAlchemy ORM to use temporary aliases, etc.?
Indeed, MonetDB still doesn't support more than two levels of naming. It's still on our list though and your question has just increased its position.
I don't know much about SQLAlchemy. Any chance you can find a workaround for this problem?
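For anyone hitting the same wall: one possible workaround is to avoid three-level names altogether by dropping the schema qualifier from the ORM model and switching the connection's current schema on the MonetDB side instead. The sketch below is only that, a sketch; it assumes all tables live in one schema per connection, that SET SCHEMA behaves as expected, and it binds the session to the same connection that ran SET SCHEMA.

# Workaround sketch: no {"schema": "test"} in __table_args__; instead the
# connection's current schema is switched to "test" so unqualified names
# (two-level: users.id) resolve there.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

engine = create_engine("monetdb://monetdb:monetdb@localhost:50000/demo")
connection = engine.connect()

schemas = [name[0] for name in connection.execute("SELECT name FROM sys.schemas")]
if "test" not in schemas:
    connection.execute("CREATE SCHEMA test")
connection.execute('SET SCHEMA "test"')            # current schema for this connection

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

Base.metadata.create_all(bind=connection)          # emits CREATE TABLE users (...)
Session = sessionmaker(bind=connection)            # reuse the connection that ran
session = Session()                                # SET SCHEMA
session.add_all([User(name="a"), User(name="b"), User(name="c")])
session.commit()
print(session.query(User).all())                   # SELECT users.id, users."name" FROM users

The trade-off is that everything on that connection lives in a single schema, so this only helps if you don't need cross-schema queries from the ORM.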
I have a Docker image that creates a connection to Azure Data Lake using adlCreds = lib.auth(tenant_id=tenantId, client_secret=application_key, client_id=application_id) in Python and accesses a file. I am executing the Docker image in a Kubernetes pod using Airflow's KubernetesPodOperator. When I instantiate the pod operator, the container in Kubernetes terminates with the error OAuth2Client:Get Token request failed.
I have verified the Azure creds, and the connection works fine when I execute it locally. I get this error only when I run it in the cluster in a Kubernetes pod.
# Creating ADL connection
import logging

from azure.datalake.store import core, lib  # azure-datalake-store package

def make_azure_enduser():
    logging.info('Creating the Azure Datalake Client...')
    config = dict(
        azure_tenantid='<tenant_id>',
        application_key='<azure_application_key>',
        application_id='<azure_applicationid>',
        subscriptionId='<azure_subid>',
        adlAccountName='<azure_datalake_accountname>',
    )
    # Format the tenant ID and Account Name as strings to be passed into the object
    application_id = "{}".format(config['application_id'])
    application_key = "{}".format(config['application_key'])
    tenantId = "{}".format(config['azure_tenantid'])
    store_name = "{}".format(config['adlAccountName'])
    # Create the token and the Azure Datalake File System Client
    adlCreds = lib.auth(tenant_id=tenantId, client_secret=application_key, client_id=application_id)
    adlsFileSystemClient = core.AzureDLFileSystem(adlCreds, store_name=store_name)
    return adlsFileSystemClient
# instantiating Kubernetes_pod_operator in Airflow DAG
get_adl_file = KubernetesPodOperator(namespace='default',
image="get_adl_file:latest",
image_pull_policy='IfNotPresent',
image_pull_secrets='acr-registry',
# cmds=["/bin/sh", "-c", "sleep 500"],
# arguments=["sleep 500"],
labels={"foo": "bar"},
name="get_file",
task_id="get_adl_file",
get_logs=True,
dag=dag
)
I expect the pod to create a connection to the Data Lake, but it fails with the error below.
Kubernetes Pod logs
Creating the Azure Datalake Client...
ed2a3672-9d3b-456e-924f-cdf6e5f60ce8 - TokenRequest:Getting token with client credentials.
/opt/conda/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
ed2a3672-9d3b-456e-924f-cdf6e5f60ce8 - OAuth2Client:Get Token request failed
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, extra_kw)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/conda/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
    conn.connect()
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connection.py", line 301, in connect
    conn = self._new_conn()
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f0941ce8160>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
Please help me to fix this connection error. Thanks
This seems like a very specific issue.
Try the Kubernetes Slack; there are skilled users there who are glad to help out!
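In case it helps narrow things down: the traceback above bottoms out in socket.getaddrinfo, i.e. DNS resolution inside the pod, not the Azure credentials themselves. A throwaway check like the following (purely illustrative; the hostname is the standard Azure AD token endpoint) can be run in the same image and namespace to confirm whether the pod can resolve external names at all:

# Hypothetical diagnostic, not part of the original answer: run inside the same
# pod/image to see whether external DNS resolution works at all.
import socket

try:
    addrs = socket.getaddrinfo("login.microsoftonline.com", 443)
    print("DNS OK:", addrs[0][4])
except socket.gaierror as exc:
    # Same failure as in the traceback -> look at the cluster's DNS setup
    # (kube-dns/CoreDNS, the pod's dnsPolicy, network policies), not the creds.
    print("DNS failure:", exc)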
I am using Ray to run a parallel loop on an Ubuntu 14.04 cluster on AWS EC2. The following Python 3 script works well on my local machine with just 4 workers (imports and local initializations left out):-
ray.init() #initialize Ray
@ray.remote
def test_loop(n):
c=tests[n,0]
tout=100
rc=-1
with tmp.TemporaryDirectory() as path: #Create a temporary directory
for files in filelist: #then copy in all of the
sh.copy(filelist,path) #files
txtfile=path+'/inputf.txt' #create the external
fileId=open(txtfile,'w') #data input text file,
s='Number = '+str(c)+"\n" #write test number,
fileId.write(s)
fileId.close() #close external parameter file,
os.chdir(path) #and change working directory
try: #Try running simulation:
rc=sp.call('./simulation.run',timeout=tout,stdout=sp.DEVNULL,\
stderr=sp.DEVNULL,shell=True) #(must use .call for timeout)
outdat=sio.loadmat('outputf.dat') #get the output data struct
rt_Data=outdat.get('rt_Data') #extract simulation output
err=float(rt_Data[-1]) #use final value of error
except: #If system fails to execute,
err=deferr #use failure default
#end try
if (err<=0) or (err>deferr) or (rc!=0):
err=deferr #Catch other types of failure
return err
if __name__=='__main__':
result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
print(result)
The unusual bit here is that the simulation.run has to read in a different test number from an external text file when it runs. The file name is the same for all iterations of the loop, but the test number is different.
I launched an EC2 cluster using Ray, with the number of CPUs available equal to n (I am trusting that Ray will not default to multi-threading). Then I had to copy the filelist (which includes the Python script) from my local machine to the master node using rsync, because I couldn't do this from the config (see recent question: "Workers not being launched on EC2 by Ray"). Then ssh into that node, and run the script. The result is a file-finding error:-
~$ python3 test_small.py
2019-04-29 23:39:27,065 WARNING worker.py:1337 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-04-29 23:39:27,065 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,172 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:42930 to respond...
2019-04-29 23:39:27,281 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:47779 to respond...
2019-04-29 23:39:27,282 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-29 23:39:27,296 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,296 INFO services.py:1427 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
(pid=3917) sh: 0: getcwd() failed: No such file or directory
2019-04-29 23:39:44,960 ERROR worker.py:1672 -- Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 909, in _process_task
self._store_outputs_in_object_store(return_object_ids, outputs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 820, in _store_outputs_in_object_store
self.put_object(object_ids[i], outputs[i])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 375, in put_object
self.store_and_register(object_id, value)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 309, in store_and_register
self.task_driver_id))
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 238, in get_serialization_context
_initialize_serialization(driver_id)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 1148, in _initialize_serialization
serialization_context = pyarrow.default_serialization_context()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 326, in default_serialization_context
register_default_serialization_handlers(context)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 321, in register_default_serialization_handlers
_register_custom_pandas_handlers(serialization_context)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 129, in _register_custom_pandas_handlers
import pandas as pd
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/__init__.py", line 42, in <module>
from pandas.core.api import *
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/api.py", line 10, in <module>
from pandas.core.groupby import Grouper
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/groupby.py", line 49, in <module>
from pandas.core.frame import DataFrame
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 74, in <module>
from pandas.core.series import Series
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 3042, in <module>
import pandas.plotting._core as _gfx # noqa
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/__init__.py", line 8, in <module>
from pandas.plotting import _converter
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py", line 7, in <module>
import matplotlib.units as units
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1060, in <module>
rcParams = rc_params()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 892, in rc_params
fname = matplotlib_fname()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 736, in matplotlib_fname
for fname in gen_candidates():
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 725, in gen_candidates
yield os.path.join(six.moves.getcwd(), 'matplotlibrc')
FileNotFoundError: [Errno 2] No such file or directory
During handling of the above exception, another exception occurred:
The problem then seems to repeat for all the other workers and finally gives up:-
AttributeError: module 'pandas' has no attribute 'core'
This error is unexpected and should not have happened. Somehow a worker
crashed in an unanticipated way causing the main_loop to throw an exception,
which is being caught in "python/ray/workers/default_worker.py".
2019-04-29 23:44:08,489 ERROR worker.py:1672 -- A worker died or was killed while executing task 000000002d95245f833cdbf259672412d8455d89.
Traceback (most recent call last):
File "test_small.py", line 82, in <module>
result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2184, in get
raise value
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.
I suspect that I am not initializing Ray correctly. I tried with ray.init(redis_address="172.31.50.149:6379") - which was the redis address given when the cluster was formed, but the error was more or less the same. I also tried starting Ray on the master (in case it needed starting):-
~$ ray start --redis-address 172.31.50.149:6379 #Start Ray
2019-04-29 23:46:20,774 INFO services.py:407 -- Waiting for redis server at 172.31.50.149:6379 to respond...
2019-04-29 23:48:29,076 INFO services.py:412 -- Failed to connect to the redis server, retrying.
....etc.
The installation of pandas and matplotlib on the master node seems to have solved the problem. Ray now initializes successfully.
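For what it's worth, a quick way to confirm that every node (not just the head node) has the required packages is to import them inside a remote task. This is only an illustrative sketch, assuming the cluster from the question is up and reachable at the same Redis address:

# Illustrative only (not from the original answer): import the heavy
# dependencies inside a remote task so any worker node missing them fails
# loudly, and report which host each task ran on.
import socket
import ray

ray.init(redis_address="172.31.50.149:6379")   # attach to the running cluster

@ray.remote
def check_deps():
    import pandas            # raises ImportError on a node lacking it
    import matplotlib        # likewise
    return socket.gethostname()

print(set(ray.get([check_deps.remote() for _ in range(8)])))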
I'm benchmarking Crate and inserting a lot of records at the same time. It seems I'm hitting some limit (queue capacity 50), and I couldn't find out how to change the configuration.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "createdata.py", line 60, in worker
cursor.execute(ins, params)
File "/Users/jodok/sandbox/crate-demo/amsterdam/pyenv/lib/python2.7/site-packages/crate/client/cursor.py", line 48, in execute
self._result = self.connection.client.sql(sql, parameters)
File "/Users/jodok/sandbox/crate-demo/amsterdam/pyenv/lib/python2.7/site-packages/crate/client/http.py", line 190, in sql
content = self._json_request('POST', self.sql_path, data=data)
File "/Users/jodok/sandbox/crate-demo/amsterdam/pyenv/lib/python2.7/site-packages/crate/client/http.py", line 345, in _json_request
self._raise_for_status(response)
File "/Users/jodok/sandbox/crate-demo/amsterdam/pyenv/lib/python2.7/site-packages/crate/client/http.py", line 331, in _raise_for_status
raise ProgrammingError(error.get('message', ''))
ProgrammingError: SQLActionException[RemoteTransportException[[nuc2][inet[/192.168.42.72:4300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1#23c7247f]; ]
Bulk inserts use the bulk thread pool, so add this to your crate.yml configuration file to change its queue size:
threadpool.bulk.queue_size: 100
With the current master this shouldn't be needed anymore, though, because Crate now retries the bulk request when it is rejected because of a full queue.
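Until you're on a version that retries server-side, a small client-side backoff in the benchmark script is an easy stopgap. This is only a sketch: the table, statement and retry limits are made up, and it assumes the crate-python ProgrammingError seen in the traceback above.

# Hypothetical client-side retry around the insert from the benchmark script.
import time

from crate import client
from crate.client.exceptions import ProgrammingError  # the error raised in the traceback above

def insert_with_retry(cursor, stmt, params, attempts=5, backoff=0.5):
    for attempt in range(attempts):
        try:
            cursor.execute(stmt, params)
            return
        except ProgrammingError as err:
            # Only retry the thread-pool rejections seen in the traceback.
            if "EsRejectedExecutionException" not in str(err) or attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))   # exponential backoff

conn = client.connect("localhost:4200")            # assumed Crate HTTP endpoint
cursor = conn.cursor()
insert_with_retry(cursor, "INSERT INTO demo (id, val) VALUES (?, ?)", [1, "x"])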