Everything was fine until "something" happened. After that "something", I could use docker and docker-compose with sudo, but without it docker would hang indefinitely and docker-compose would return connection stack traces.
Doesn't work:
Running docker without sudo hangs indefinitely. Temporarily fixed by exporting DOCKER_HOST=tcp://127.0.0.1:2375.
docker-compose without sudo returns connection refused. It's not about connecting to a specific container; it happens for all basic commands such as docker-compose ps.
Proof that user belongs to docker group:
kretyn@junk$ groups kretyn
kretyn : kretyn adm cdrom sudo dip plugdev lpadmin lxd sambashare docker
Solution hints
I had a Docker context activated that pointed at an existing but inactive host. Removing that context and unsetting DOCKER_HOST allowed the docker command to work correctly. That's a surprise, since I had reinstalled Docker (purged everything I could find) before writing this question.
Removing the context isn't sufficient. Docker Compose required an explicit docker context use default, otherwise it still couldn't connect.
The problem with docker-compose ps is still present.
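For reference, a quick way to see which endpoint the client is actually targeting (a minimal sketch; default is simply the standard context name):
docker context ls                # the active context is marked with an asterisk
docker context inspect default   # shows the endpoint that context points at
echo $DOCKER_HOST                # if set, this overrides the active context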
Stack trace on docker-compose
kretyn@junk$ docker-compose ps
Traceback (most recent call last):
File "urllib3/connection.py", line 159, in _new_conn
File "urllib3/util/connection.py", line 84, in create_connection
File "urllib3/util/connection.py", line 74, in create_connection
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "urllib3/connectionpool.py", line 670, in urlopen
File "urllib3/connectionpool.py", line 392, in _make_request
File "http/client.py", line 1255, in request
File "http/client.py", line 1301, in _send_request
File "http/client.py", line 1250, in endheaders
File "http/client.py", line 1010, in _send_output
File "http/client.py", line 950, in send
File "urllib3/connection.py", line 187, in connect
File "urllib3/connection.py", line 171, in _new_conn
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f762c4e3e20>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "requests/adapters.py", line 439, in send
File "urllib3/connectionpool.py", line 726, in urlopen
File "urllib3/util/retry.py", line 446, in increment
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f762c4e3e20>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker/api/client.py", line 214, in _retrieve_server_version
File "docker/api/daemon.py", line 181, in version
File "docker/utils/decorators.py", line 46, in inner
File "docker/api/client.py", line 237, in _get
File "requests/sessions.py", line 543, in get
File "requests/sessions.py", line 530, in request
File "requests/sessions.py", line 643, in send
File "requests/adapters.py", line 516, in send
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f762c4e3e20>: Failed to establish a new connection: [Errno 111] Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker-compose", line 3, in <module>
File "compose/cli/main.py", line 80, in main
File "compose/cli/main.py", line 189, in perform_command
File "compose/cli/command.py", line 60, in project_from_options
File "compose/cli/command.py", line 152, in get_project
File "compose/cli/docker_client.py", line 41, in get_client
File "compose/cli/docker_client.py", line 170, in docker_client
File "docker/api/client.py", line 197, in __init__
File "docker/api/client.py", line 221, in _retrieve_server_version
docker.errors.DockerException: Error while fetching server API version: HTTPConnectionPool(host='127.0.0.1', port=2375): Max retries exceeded with url: /version (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f762c4e3e20>: Failed to establish a new connection: [Errno 111] Connection refused'))
[46145] Failed to execute script docker-compose
Versions/builds
Linux distro
kretyn@junk$ uname -a
Linux junk 5.8.0-43-generic #49~20.04.1-Ubuntu SMP Fri Feb 5 09:57:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker
kretyn@junk$ sudo docker system info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
Server:
Containers: 9
Running: 8
Paused: 0
Stopped: 1
Images: 44
Server Version: 20.10.3
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.8.0-43-generic
Operating System: Ubuntu 20.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.29GiB
Name: meland
ID: 3PE6:4AWU:LGRR:HOBJ:XUCJ:ER4H:ZEF5:MWUJ:COEG:WPK5:YH2U:FKUF
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio weight support
WARNING: No blkio weight_device support
Docker-compose
kretyn@junk$ sudo docker-compose version
docker-compose version 1.28.0, build d02a7b1a
docker-py version: 4.4.1
CPython version: 3.9.0
OpenSSL version: OpenSSL 1.1.1d 10 Sep 2019
The problem was a dead context entry.
Solution
Removed the dead context named remote using docker context rm remote.
Even though docker context ls showed default as the default, it was necessary to run docker context use default for the context to be properly updated.
As a partial fix I had previously set DOCKER_HOST to tcp://127.0.0.1:2375. After updating the context, the environment variable had to be unset, i.e. unset DOCKER_HOST.
Lazy commands, assuming the dead context is named remote:
$ docker context rm remote
$ docker context use default
$ unset DOCKER_HOST
Background
I had a remote context based on an SSH endpoint. My SSH config pointed that name at my domain, but the host behind it was recently removed.
Interestingly, I had been "fixing" this for a while, and one of the steps was to completely purge Docker from my local Ubuntu host, including the apt packages, /var/lib/docker, /etc/docker and random scripts in ~/.local/. The context, however, wasn't cleaned up.
The solution by Dawid was perfect for me.
I want to mention that even when sudo docker context ls only shows the default context, your docker context ls (without sudo) might show different, and potentially broken, contexts. You want to delete them.
But the problem is that your docker ps hangs.
So before you try docker context rm, you need to override the context with DOCKER_HOST=tcp://127.0.0.1:2375.
As a result, the full solution is:
export DOCKER_HOST=tcp://127.0.0.1:2375
docker context ls
docker context rm <broken context name>
docker context use <valid context name(maybe default?)>
unset DOCKER_HOST
and then make sure all docker commands work without sudo.
docker ps
You don't have the necessary rights to use docker or docker-compose without sudo; check the documentation.
sudo usermod -aG docker $USER
sudo chmod +x /usr/local/bin/docker-compose
You have to log out and log back in (or restart your session) after that, so the group change takes effect.
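To apply the group change without logging out, a minimal sketch (newgrp only affects the current shell):
newgrp docker    # pick up the new group membership in this shell
groups           # should now include docker
docker ps        # should work without sudo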
Related
I tried setting up Odoo and Postgres containers in Azure using docker-compose. When running them, I have an issue with the server closing the connection.
This is what I get from the log at the start of the Postgres container:
The files belonging to this database system will be owned by the user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /tmp ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /tmp -l logfile start
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....2022-10-17 14:27:53.717 UTC [758] LOG: starting PostgreSQL 13.8 (Debian 13.8-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-17 14:27:53.735 UTC [758] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-17 14:27:53.761 UTC [759] LOG: database system was shut down at 2022-10-17 14:27:46 UTC
2022-10-17 14:27:53.774 UTC [758] LOG: database system is ready to accept connections
done
server started
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
waiting for server to shut down...2022-10-17 14:27:53.985 UTC [758] LOG: received fast shutdown request
.2022-10-17 14:27:53.999 UTC [758] LOG: aborting any active transactions
2022-10-17 14:27:54.006 UTC [758] LOG: background worker "logical replication launcher" (PID 765) exited with exit code 1
2022-10-17 14:27:54.007 UTC [760] LOG: shutting down
2022-10-17 14:27:54.090 UTC [758] LOG: database system is shut down
done
server stopped
PostgreSQL init process complete; ready for start up.
2022-10-17 14:27:54.241 UTC [696] LOG: starting PostgreSQL 13.8 (Debian 13.8-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-10-17 14:27:54.242 UTC [696] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-10-17 14:27:54.242 UTC [696] LOG: listening on IPv6 address "::", port 5432
2022-10-17 14:27:54.267 UTC [696] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-10-17 14:27:54.307 UTC [770] LOG: database system was shut down at 2022-10-17 14:27:54 UTC
2022-10-17 14:27:54.317 UTC [696] LOG: database system is ready to accept connections
These are the resulting logs from the Odoo container:
2022-10-17 14:27:02,735 445 INFO ? odoo: Odoo version 15.0-20221012
2022-10-17 14:27:02,735 445 INFO ? odoo: Using configuration file at /etc/odoo/odoo.conf
2022-10-17 14:27:02,735 445 INFO ? odoo: addons paths: ['/usr/lib/python3/dist-packages/odoo/addons', '/var/lib/odoo/addons/15.0', '/mnt/extra-addons']
2022-10-17 14:27:02,735 445 INFO ? odoo: database: odoo@db:5432
2022-10-17 14:27:02,960 445 INFO ? odoo.addons.base.models.ir_actions_report: Will use the Wkhtmltopdf binary at /usr/local/bin/wkhtmltopdf
2022-10-17 14:27:03,367 445 INFO ? odoo.service.server: HTTP service (werkzeug) running on localhost:8069
Exception in thread odoo.service.cron.cron0:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 445, in cron_thread
    pg_conn.poll()
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 473, in target
    self.cron_thread(i)
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 457, in cron_thread
    thread.start_time = None
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 162, in __exit__
    self.close()
  File "", line 2, in close
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 90, in check
    return f(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 379, in close
    return self._close(False)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 405, in _close
    self.rollback()
  File "", line 2, in rollback
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 90, in check
    return f(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 484, in rollback
    result = self._cnx.rollback()
psycopg2.InterfaceError: connection already closed
Exception in thread odoo.service.cron.cron1:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 445, in cron_thread
    pg_conn.poll()
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 473, in target
    self.cron_thread(i)
  File "/usr/lib/python3/dist-packages/odoo/service/server.py", line 457, in cron_thread
    thread.start_time = None
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 162, in __exit__
    self.close()
  File "", line 2, in close
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 90, in check
    return f(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 379, in close
    return self._close(False)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 405, in _close
    self.rollback()
  File "", line 2, in rollback
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 90, in check
    return f(self, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/odoo/sql_db.py", line 484, in rollback
    result = self._cnx.rollback()
psycopg2.InterfaceError: connection already closed
When I check the Postgres logs again (since the connection was open the first time), this is what I get from it after a restart:
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /tmp ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
My docker-compose.yml:
version: '3.2'
services:
  db:
    image: registryodoo.azurecr.io/samples/postgres:13
    volumes:
      - db:/var/lib/postgresql/data/pgdata
    deploy:
      restart_policy:
        condition: always
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: odoo
      POSTGRES_DB: postgres
      POSTGRES_USER: odoo
      PGDATA: /tmp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U odoo"]
      interval: 5s
      timeout: 5s
      retries: 5
  odoocontainer:
    depends_on:
      db:
        condition: service_healthy
    image: registryodoo.azurecr.io/samples/odoo:latest
    volumes:
      - data:/var/lib/odoo
      - extra-addons:/mnt/extra-addons
    ports:
      - 8069:8069
    deploy:
      restart_policy:
        condition: always
    environment:
      HOST: db
      USER: odoo
      PASSWORD: odoo
volumes:
  data:
    driver: azure_file
    driver_opts:
      share_name: datafileshare
      storage_account_name: odooaccount1
  db:
    driver: azure_file
    driver_opts:
      share_name: dbfileshare
      storage_account_name: odooaccount1
  extra-addons:
    driver: azure_file
    driver_opts:
      share_name: odoofileshare
      storage_account_name: odooaccount1
I can't figure out what the root issue is or how to solve it.
TL;DR
Is it possible to connect two pods in Kubernetes as if they were on the same local network, with all ports open?
Motivation
Currently we have Airflow deployed in a Kubernetes cluster, and since we aim to use TensorFlow Extended we need Apache Beam. For our use case Spark would be the appropriate runner, and as Airflow and TensorFlow are written in Python we need to use Apache Beam's Portable Runner (https://beam.apache.org/documentation/runners/spark/#portability).
The problem
Communication between the Airflow pod and the job server pod results in transmission errors (probably because of some random ports used by the job server).
Setup
To follow good isolation practice and to imitate the common Spark-on-Kubernetes setup (running the driver inside the cluster in a pod), the job server was deployed as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: beam-spark-job-server
  labels:
    app: airflow-k8s
spec:
  selector:
    matchLabels:
      app: beam-spark-job-server
  replicas: 1
  template:
    metadata:
      labels:
        app: beam-spark-job-server
    spec:
      restartPolicy: Always
      containers:
        - name: beam-spark-job-server
          image: apache/beam_spark_job_server:2.27.0
          args: ["--spark-master-url=spark://spark-master:7077"]
          resources:
            limits:
              memory: "1Gi"
              cpu: "0.7"
          env:
            - name: SPARK_PUBLIC_DNS
              value: spark-client
          ports:
            - containerPort: 8099
              protocol: TCP
              name: job-server
            - containerPort: 7077
              protocol: TCP
              name: spark-master
            - containerPort: 8098
              protocol: TCP
              name: artifact
            - containerPort: 8097
              protocol: TCP
              name: java-expansion
---
apiVersion: v1
kind: Service
metadata:
  name: beam-spark-job-server
  labels:
    app: airflow-k8s
spec:
  type: ClusterIP
  selector:
    app: beam-spark-job-server
  ports:
    - port: 8099
      protocol: TCP
      targetPort: 8099
      name: job-server
    - port: 7077
      protocol: TCP
      targetPort: 7077
      name: spark-master
    - port: 8098
      protocol: TCP
      targetPort: 8098
      name: artifact
    - port: 8097
      protocol: TCP
      targetPort: 8097
      name: java-expansion
Development/Errors
If I execute the command python -m apache_beam.examples.wordcount --output ./data_test/ --runner=PortableRunner --job_endpoint=beam-spark-job-server:8099 --environment_type=LOOPBACK from the airflow pod, I get no logs on the job server and this error in the terminal:
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.client:Timeout attempting to reach GCE metadata service.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
INFO:apache_beam.runners.worker.worker_pool_main:Listening for workers at localhost:46569
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.27.0
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/examples/wordcount.py", line 99, in <module>
run()
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/examples/wordcount.py", line 94, in run
ERROR:grpc._channel:Exception iterating requests!
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/grpc/_channel.py", line 195, in consume_request_iterator
request = next(request_iterator)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/artifact_service.py", line 355, in __next__
raise self._queue.get()
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 561, in run
return self.runner.run_pipeline(self, self._options)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 421, in run_pipeline
job_service_handle.submit(proto_pipeline)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 115, in submit
prepare_response.staging_session_token)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 214, in stage
staging_session_token)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/artifact_service.py", line 241, in offer_artifacts
for request in requests:
File "/home/airflow/.local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/home/airflow/.local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Unknown staging token job_b6f49cc2-6732-4ea3-9aef-774e3d22867b"
debug_error_string = "{"created":"@1613765341.075846957","description":"Error received from peer ipv4:127.0.0.1:8098","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Unknown staging token job_b6f49cc2-6732-4ea3-9aef-774e3d22867b","grpc_status":3}"
>
output | 'Write' >> WriteToText(known_args.output)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 582, in __exit__
self.result = self.run()
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/pipeline.py", line 561, in run
return self.runner.run_pipeline(self, self._options)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 421, in run_pipeline
job_service_handle.submit(proto_pipeline)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 115, in submit
prepare_response.staging_session_token)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/portable_runner.py", line 214, in stage
staging_session_token)
File "/home/airflow/.local/lib/python3.7/site-packages/apache_beam/runners/portability/artifact_service.py", line 241, in offer_artifacts
for request in requests:
File "/home/airflow/.local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/home/airflow/.local/lib/python3.7/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INVALID_ARGUMENT
details = "Unknown staging token job_b6f49cc2-6732-4ea3-9aef-774e3d22867b"
debug_error_string = "{"created":"@1613765341.075846957","description":"Error received from peer ipv4:127.0.0.1:8098","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Unknown staging token job_b6f49cc2-6732-4ea3-9aef-774e3d22867b","grpc_status":3}"
This indicates an error transmitting the job. If I implement the job server in the same pod as Airflow, I get fully working communication between these two containers; I would like the same behavior but with them in different pods.
You need to deploy two containers in one pod
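A minimal sketch of that idea, reusing the job server container from the question; the Airflow image and tag are assumptions. Both containers share the pod's network namespace, so the Beam SDK inside the Airflow container reaches the job server at localhost:8099 (and the artifact endpoint at localhost:8098) without any Service in between:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-with-beam-job-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow-with-beam-job-server
  template:
    metadata:
      labels:
        app: airflow-with-beam-job-server
    spec:
      containers:
        # Airflow worker that submits Beam pipelines (hypothetical image/tag)
        - name: airflow-worker
          image: apache/airflow:2.1.4
        # Same job server as in the question, now a sidecar in the same pod
        - name: beam-spark-job-server
          image: apache/beam_spark_job_server:2.27.0
          args: ["--spark-master-url=spark://spark-master:7077"]
          ports:
            - containerPort: 8099   # job endpoint
            - containerPort: 8098   # artifact endpoint
            - containerPort: 8097   # java expansion service
With this layout, the wordcount command from the question would point at --job_endpoint=localhost:8099 instead of the service name.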
I am trying to connect to Cassandra with cqlsh but am not able to connect, even though the Cassandra service is running. Below is the status of the Cassandra service.
(base) kuldeep@kuldeep-OptiPlex-3050:~$ systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
Loaded: loaded (/etc/init.d/cassandra; generated)
Active: active (running) since Mon 2020-12-28 18:13:35 IST; 25min ago
Docs: man:systemd-sysv-generator(8)
Process: 21770 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
Tasks: 52 (limit: 4915)
CGroup: /system.slice/cassandra.service
└─21858 java -ea -da:net.openhft... -XX:+UseThreadPriorities -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-Us
Dec 28 18:13:35 kuldeep-OptiPlex-3050 systemd[1]: Starting LSB: distributed storage system for structured data...
Dec 28 18:13:35 kuldeep-OptiPlex-3050 systemd[1]: Started LSB: distributed storage system for structured data.
But when I try to connect to Cassandra with cqlsh I get the following error.
(base) kuldeep@kuldeep-OptiPlex-3050:~$ cqlsh
Traceback (most recent call last):
File "/usr/local/bin/cqlsh", line 115, in <module>
from cqlshlib import cqlhandling, cql3handling, pylexotron
File "/usr/lib/python2.7/dist-packages/cqlshlib/cqlhandling.py", line 22, in <module>
from cassandra.metadata import cql_keywords_reserved
ImportError: No module named cassandra.metadata
After doing some research, someone suggested that I install the Cassandra driver, and I did so.
(base) kuldeep@kuldeep-OptiPlex-3050:~$ pip install cassandra_driver
Requirement already satisfied: cassandra_driver in ./anaconda3/lib/python3.7/site-packages (2.7.0)
Requirement already satisfied: six>=1.6 in ./anaconda3/lib/python3.7/site-packages (from cassandra_driver) (1.14.0)
Requirement already satisfied: futures in ./anaconda3/lib/python3.7/site-packages (from cassandra_driver) (3.1.1)
But again I am getting the same error.
cqlsh
Traceback (most recent call last):
File "/usr/local/bin/cqlsh", line 115, in <module>
from cqlshlib import cqlhandling, cql3handling, pylexotron
File "/usr/lib/python2.7/dist-packages/cqlshlib/cqlhandling.py", line 22, in <module>
from cassandra.metadata import cql_keywords_reserved
ImportError: No module named cassandra.metadata
Can anyone help me get out of this? Any help will be greatly appreciated.
I'm struggling to start a Cassandra cluster with the 'ccm start' command.
I created a cluster named Gdelt with 3 nodes, as follows:
ccm status gives:
Cluster: 'Gdelt'
-------------------
node1: DOWN (Not initialized)
node3: DOWN (Not initialized)
node2: DOWN (Not initialized)
node4: DOWN (Not initialized)
but ccm start raises the following error:
Traceback (most recent call last):
File "/usr/local/bin/ccm", line 112, in <module>
cmd.run()
File "/usr/local/lib/python2.7/dist-packages/ccmlib/cmds/cluster_cmds.py", line 510, in run
allow_root=self.options.allow_root) is None:
File "/usr/local/lib/python2.7/dist-packages/ccmlib/cluster.py", line 390, in start
common.assert_socket_available(itf)
File "/usr/local/lib/python2.7/dist-packages/ccmlib/common.py", line 521, in assert_socket_available
raise UnavailableSocketError("Inet address %s:%s is not available: %s; a cluster may already be running or you may need to add the loopback alias" % (addr, port, msg))
ccmlib.common.UnavailableSocketError: Inet address 127.0.0.1:9042 is not available: [Errno 98] Address already in use; a cluster may already be running or you may need to add the loopback alias
I've tried creating loopback aliases with the following bash script and executing it:
#!/bin/bash
sudo ifconfig lo0 alias 127.0.0.2 up
sudo ifconfig lo0 alias 127.0.0.3 up
sudo ifconfig lo0 alias 127.0.0.4 up
sudo ifconfig lo0 alias 127.0.0.5 up
sudo ifconfig lo0 alias 127.0.0.6 up
which raises the following error upon bash script execution:
alias: Host name lookup failure
ifconfig: `--help' gives usage information.
I've also tried ifconfig directly on the command line as follows:
sudo ifconfig lo:0 127.0.0.1 up
which gives the following error:
SIOCSIFADDR: File exists
SIOCSIFFLAGS: Cannot assign requested address
SIOCSIFFLAGS: Cannot assign requested address
Is this clear? Please tell me if not, so that I can clarify further.
In the end, I don't know how to get my Cassandra cluster running.
Thank you very much for your help.
Habib
So, Cassandra uses port 9042 by default. However, as your output shows, something appears to be using that port already when you try to start your node. That is evident from the following message:
Inet address 127.0.0.1:9042 is not available: [Errno 98] Address already in use
In this scenario you have a couple of options:
Find the process using port 9042 and kill it (assuming it's something unimportant)
Use a different port to start the node
For (1), you can use:
netstat -nap | grep 9042 | grep LISTEN
You may need to be the "root" user to find the process as netstat will only display things that your current user "owns". If you're root, you'll see all processes.
For (2), you can simply change the port in the cassandra.yaml file (native_transport_port). You should also change the "listen_address" parameter and the "native_transport_broadcast_address" to your host IP address so you're not using the loopback address (you can use "hostname -i" to find the IP address of your local host), and set your "native_transport_address" to "0.0.0.0" if you want it to listen on all interfaces. That's what we do. Hopefully that will get your node up and running.
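For example, the relevant cassandra.yaml entries could look like the sketch below, where 10.0.0.5 stands in for the host IP reported by hostname -i (exact parameter names differ slightly between Cassandra versions; older ones use rpc_address/broadcast_rpc_address):
native_transport_port: 9043
listen_address: 10.0.0.5
native_transport_address: 0.0.0.0
native_transport_broadcast_address: 10.0.0.5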
Cassandra uses the following ports for different processes:
thrift=('127.0.0.1', 9160)
binary=('127.0.0.1', 9042)
storage=('127.0.0.1', 7000)
Make sure none of these ports is already in use on any of the IPs in your cluster.
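A quick sketch for checking all three at once, using the same netstat approach as above:
for port in 9160 9042 7000; do
  sudo netstat -nap | grep ":$port " | grep LISTEN
done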
I'm attempting to give a script the cap_net_bind_service Linux capability. However using setcap doesn't seem to be working.
$ cat listen.sh
#!/bin/bash
python -m SimpleHTTPServer 80
$ getcap listen.sh
listen.sh =
$ sudo setcap cap_net_bind_service=+eip ./listen.sh
$ getcap listen.sh
listen.sh = cap_net_bind_service+eip
$ ls -al listen.sh
-rwxrwxr-x. 1 eric eric 43 Jul 11 23:01 listen.sh
$ ./listen.sh
Traceback (most recent call last):
File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
...
File "/usr/lib64/python2.7/SocketServer.py", line 434, in server_bind
self.socket.bind(self.server_address)
File "/usr/lib64/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied
Using sudo still works fine.
$ sudo ./listen.sh
Serving HTTP on 0.0.0.0 port 80 ...
This is on Fedora 23 workstation.
$ cat /proc/version
Linux version 4.4.9-300.fc23.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) ) #1 SMP Wed May 4 23:56:27 UTC 2016
I'm a little lost at this point; I have tried turning off firewalld to no effect, and I can't figure out how to debug this.
setcap(8) only sets capabilities on files. When it comes to interpreters, I think you're in for a rough ride. capabilities(7) -- I'm reading from RHEL 7.4 -- lists 'Thread' capability sets as well as 'File' capabilities. In the 'Thread' capability sets, there is a notion of 'Ambient' sets, as well as 'Inheritable'. The important distinction is that 'inheritable capabilities are not generally preserved across execve(2) when running as a non-root user', so you should set the Ambient capability set instead.
Note: RHEL 7 (7.4) has this backported. 'Ambient' capabilities apparently came out in Linux 4.3, and RHEL 7 is nominally a 3.10 series kernel.
I struck a similar issue as yourself, except I was trying to use a Python server under Systemd. I found that I needed to set the 'Ambient' capability set. I imagine this is what gets inherited. Here's an example Systemd unit file:
[Unit]
Description=Docker Ident Service
After=network.target
Wants=docker.service
[Service]
Type=simple
ExecStart=/usr/local/sbin/docker-identd
Restart=on-failure
RestartSec=43s
User=docker-identd
Group=docker
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Nice=12
StandardOutput=syslog
StandardError=syslog
SyslogFacility=daemon
SyslogIdentifier=docker-identd
SyslogLevel=info
[Install]
WantedBy=multi-user.target
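To use it, the usual systemd steps apply (a sketch; the unit file name and install path are assumptions):
sudo cp docker-identd.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now docker-identd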
Thanks for the question. It gave me an opportunity to learn a bit more.
For some further reading as to how Ambient Capabilities work in practice, have a look at Kubernetes should configure the ambient capability set #56374
Cheers,
Cameron
File capabilities applied to scripts (executables with shebang headers) won't have any effect. Instead you need to apply the capabilities to the binary interpreter used to execute the script (usually the command mentioned in the shebang header).
In your case just apply the capabilities to the python executable.
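A sketch of that approach; note that setcap must target the real interpreter file (not a symlink), and that this grants the capability to every script executed by that interpreter:
PYBIN=$(readlink -f "$(which python)")    # resolve to the actual binary behind any symlink
sudo setcap cap_net_bind_service=+eip "$PYBIN"
getcap "$PYBIN"                           # confirm the capability was applied
./listen.sh                               # should now bind to port 80 without sudo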