Running Cassandra on a Mesos cluster

I'm trying to deploy Cassandra on a small (test) Mesos cluster. I have one master node (say 10.10.10.1) and three worker nodes: 10.10.10.2-4.
On the official Apache Mesos site there is a link to a Cassandra framework developed for Mesos (it is here: https://github.com/mesosphere/cassandra-mesos).
I'm following the tutorial they provide there. In step 3 they say I should edit the conf/mesos.yaml file, specifically that I should set mesos.master.url so that it points to the master node (on which I also have the conf file).
The first thing I tried was simply replacing localhost with the master node's IP, so I had
mesos.master.url: 'zk://10.10.10.1:2181/mesos'
but when I then start the deployment script (by running bin/cassandra-mesos, as step 5 says I should), I get the following error:
2015-02-24 09:18:24,262:12041(0x7fad617fa700):ZOO_ERROR#handle_socket_error_msg#1697: Socket [10.10.10.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
It keeps retrying and displays the same error until I terminate it.
I tried removing 'zk' from the URL or replacing it with 'mesos', changing (or removing altogether) the port, and removing the 'mesos' word from the URL, but I keep getting the same error.
I also tried looking at how other frameworks do it (specifically Spark, which I am hoping to deploy next) but didn't find anything helpful. Any ideas on how to run it? Thanks!

The URL provided to mesos.master.url is passed directly to the underlying Mesos Native Java Library. The format listed in your example looks correct.
The next step in debugging the connection issue would be to verify which IP address the ZooKeeper server has bound to. You can find out by running sudo netstat -ntplv | grep 2181 on the server that is running ZooKeeper.
I would expect to see something like the following:
tcp 0 0 0.0.0.0:2181 0.0.0.0:* LISTEN 3957/java
Another possibility could be that ZooKeeper is binding specifically to localhost:
tcp 0 0 127.0.0.1:2181 0.0.0.0:* LISTEN 3957/java
If ZooKeeper has bound to localhost, a client will only be able to connect to it with the URL zk://127.0.0.1:2181/mesos.
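A quick way to check reachability from the framework host is ZooKeeper's four-letter-word protocol (a sketch; this assumes nc is available and the four-letter-word commands are enabled, which they are by default on ZooKeeper releases of that era):
echo ruok | nc 10.10.10.1 2181
A healthy, reachable server answers imok. If the connection is refused here too, ZooKeeper either isn't running or is bound to a different interface; if it turns out to be bound to localhost only, setting clientPortAddress=0.0.0.0 in zoo.cfg and restarting ZooKeeper is one way to make it listen on all interfaces.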
A note about the future of the Cassandra Mesos Framework.
I am one of the developers working on rewriting the cassandra-mesos project to be more robust, stable, and easier to run. The code in current master (6aa82acfac) is end-of-life and will be replaced within the next couple of weeks by the code that is in the rewrite branch.
If you would like to try out the latest build of the rewrite branch, a marathon.json for running the framework can be found here. After downloading the marathon.json, update the values for MESOS_ZK and CASSANDRA_ZK (and any resource values you want to update), then POST the JSON to Marathon at /v2/apps.
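As a sketch, assuming Marathon is reachable at marathon-host on its default port 8080 (the hostname is a placeholder), that POST could look like:
curl -X POST http://marathon-host:8080/v2/apps -H 'Content-Type: application/json' -d @marathon.json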

If you have one master and no zk, how about setting the mesos.master.url to 10.10.10.1:5050 (where 5050 is the default mesos-master port)?
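In conf/mesos.yaml that suggestion would read (a sketch; only the master address changes from the question's config):
mesos.master.url: '10.10.10.1:5050'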
If Ben is right and the Cassandra framework otherwise requires ZK for its own persistence/HA, then try disabling that feature if possible. Otherwise you may have to rip the ZK code out yourself and recompile, if you really want a ZK-free setup (consequently without any HA features).

Related

Minifabric problem with API server and Fabric 2.2.1: error joining the channel in a multi-host configuration

I had created a custom network with my organizations and my peers running 100% on one host, which I was interfacing with via an API server based on this: https://kctheservant.medium.com/rework-an-implementation-of-api-server-for-hyperledger-fabric-network-fabric-v2-2-a747884ce3dc
The problem appeared when I switched to networking across 2 hosts using Docker Swarm. When I join the channel from the second host I get the error: "unable to contact the endpoint". So I switched to using Minifabric, which promises easy use, and indeed the network can be customized in a short time. There too it gave me an error when joining the channel from the second host, which I solved by setting the variable EXPOSE_ENDPOINT=true.
The problem is that now I can no longer get my API server to work. What I did (as indicated in the README) is replace the contents of the "main.js" file with my server code and run the "apprun" command. This gives an error if I leave the server listening on a port, while it succeeds if I comment out the last 2 lines of the code. But I have no way to query the server if it isn't listening on a port. Summarizing, my questions are:
how can I create an API server like that on Minifabric?
alternatively, how can I solve the problem on the Fabric base (I can't find an EXPOSE_ENDPOINT variable to set)? The problem is probably the same as the one I had on Minifabric. Thanks to those who will help me.

Why is a Node.js 12 docker app connection to MongoDB 4 via the docker network giving a timeout while a connection via the public network works?

I'm seeing a problem I can't explain at all:
After upgrading a Meteor app to v1.9 and therefore Node.js 12, we also had to switch the Docker containers to Node.js 12-based ones. In our case we use abernix/meteord:node-12-base (git).
After booting up the updated app we get a DB timeout in the docker container of the app:
/bundle/bundle/programs/server/node_modules/fibers/future.js:313
throw(ex);
^
MongoTimeoutError: Server selection timed out after 10000 ms
at Timeout._onTimeout (/bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/sdam/topology.js:773:16)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7) {
name: 'MongoTimeoutError',
[Symbol(mongoErrorContextSymbol)]: {}
}
This happens with the following MONGO_URL:
❌ mongodb://root:OurPw@mongo-docker-alias:27017/meteor?authSource=admin
Funnily enough, when we expose port 27017 on the MongoDB container, the following MONGO_URL just works:
✔️ mongodb://root:OurPw@docker-host:27017/meteor?authSource=admin
Now I thought we are having a docker problem but if I attach to a bash inside the Node.js 12 meteord container, apt install the MongoDB shell and try to connect with:
✔️ mongo "mongodb://root:OurPw@mongo-docker-alias:27017/meteor?authSource=admin"
that also just works.
And now I'm left without a clue. I tried multiple MongoDB Docker images between v4.0 and v4.2.3, as well as Node.js 12.14 and 12.10. I also tried once without MongoDB auth, just to rule it out as the problem, but the outcome is always the same.
Any idea would be very much appreciated since I'd like to avoid having to connect via an exposed port and the docker host's name because that is prone to errors obviously...
Check the /etc/mongod.conf file for the network binding. You may need to allow it to respond on all network interfaces, as the network might be on a different IP/subnet when exposed (or not), which might explain why it works in some scenarios.
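For reference, the binding is controlled by net.bindIp in mongod.conf; a minimal sketch that listens on all interfaces (with the usual security caveats) would be:
net:
  bindIp: 0.0.0.0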
tl;dr:
setting dns_search: '' for the backend in the docker-compose.yml (or via the docker CLI or other means) fixes the problem.
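A minimal docker-compose.yml sketch of that fix (the service and image names are taken from the question; dns_search is the relevant line):
version: '3'
services:
  backend:
    image: abernix/meteord:node-12-base
    dns_search: ''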
More information:
Docker seems to add any search directive from the host's /etc/resolv.conf to the container's /etc/resolv.conf by default. However, resolving something like mongo-docker-alias.actual-intranet-domain.tld is likely to be problematic, since the outside network and DNS have no knowledge of this subdomain. We actually found out that it still got resolved inside the container in our case; it just took a few seconds (vs. <1 ms normally). And since the backend tries to establish multiple DB connections, it always runs into the timeout.
Docker's DNS search option luckily allows diverting from the default behavior, including setting a blank value. Knowing the problem, another workaround should be to use Docker aliases with a dot in them, since then the search domains shouldn't be applied, but we haven't tried that.
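We haven't tried it, but as a sketch that alias workaround could look like this in the compose file (the alias name is made up; the point is that it contains a dot):
services:
  mongo:
    networks:
      default:
        aliases:
          - mongo.docker.local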
A few questions remain, but they are not so important. For instance, why did this happen with the Meteor update in our case? Maybe the actual trigger was that the Docker version on the host also changed, since we aren't aware of any infrastructure change. And in general, why does Docker add these entries to /etc/resolv.conf? It doesn't seem very useful, but if it is, maybe there is a better approach for this in general?
A very helpful blog post on this matter was also published by davd.io.

How to make a dependency between 2 services on 2 different CentOS hosts without a big stack?

I have two servers:
On the first one there is a Tomcat that contains my application (Spring Boot)
On the second server there is my database server (MySQL)
How can Tomcat start robustly while the MySQL server is not ready yet?
In fact I ran into this trouble during a power failure: the two services started at the same time, and Tomcat failed.
What is an elegant way to manage this problem of dependencies between services on different hosts? Is there a native way in Unix to do that?
There already exist answers to your question.
Links:
https://unix.stackexchange.com/questions/433113/how-do-you-use-systemd-to-ensure-remote-database-is-available
https://serverfault.com/questions/867830/systemd-start-service-only-after-dns-is-available
Basically you need to check whether MySQL answers on the needed port.
So you can modify tomcat systemd unit file with construction like this:
ExecStartPre=/bin/bash -c 'until host example.com; do sleep 1; done'
This will work on hosts with systemd.
In general you will need to create a simple script which tries to connect to the remote database and returns exit code 0 if it succeeds.
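A minimal sketch of such a script, probing the port with nc (db-host and 3306 are placeholders for your actual MySQL host and port):
#!/bin/bash
# wait-for-mysql.sh -- block until the remote MySQL port accepts TCP connections
until nc -z db-host 3306; do
    sleep 1
done
exit 0
You would then point the Tomcat unit at it, e.g. ExecStartPre=/usr/local/bin/wait-for-mysql.sh, so Tomcat only starts once the database port is reachable.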

Setting up a dask distributed scheduler on two IP addresses?

I am new to dask, and have what appears to be a bit of an odd use-case, where I would like to set up a dask scheduler on a "bridging" machine with two network interfaces, such that clients can connect to one of the interfaces (the "front"), and workers will live on multiple machines connected to another interface (the "back"). The interfaces have separate IP addresses and hostnames.
Essentially, I want the setup in this picture, where the brown and blue pieces have no route between them, except through the machine with the scheduler on it. (The picture is from some old documentation for dask distributed, 0.7 I think, when things were apparently less settled than now.)
Everything is 64-bit Linux (Debian 8 "jessie"), and I'm working with version 0.14.0 of dask and 1.16.0 of distributed, installed in an anaconda environment.
The dask-scheduler command-line tool does not seem to have a way to listen on more than one hostname, which I think is what I want.
I can get the effect I want by SSH port-forwarding.
For example, suppose the relevant interfaces are machines worker, scheduler-front, scheduler-back, and client. The two scheduler-* interfaces are different NICs on the same machine, and there is a TCP route from client to scheduler-front, and one from scheduler-back to worker, but there is no route from client to worker, from scheduler-front to worker, or from scheduler-back to client.
Then, the following works (the leading bit below is meant to be a command-line prompt indicating which machine the command is run on, with '#' meaning the shell, and '>>>' meaning Python):
First, start a scheduler listening on the "back" of the bridge host:
scheduler# dask-scheduler --host scheduler-back
Second, start a worker and connect it to the scheduler in the ordinary way:
worker# dask-worker scheduler-back:8786
Third, forward localhost:8786 on the client to scheduler-back:8786 on the scheduler machine, ssh-ing in through the scheduler-front interface:
client# ssh -L 8786:scheduler-back:8786 scheduler-front
Finally, start up the client on the client machine, and connect to the near end of the forwarded port whose other end can see the scheduler.
client>>> from distributed import Client
client>>> cl = Client('127.0.0.1:8786')
client>>> ...
As I say, this works, I can do maps and gathers and get results.
But I can't help thinking that I'm overdoing it, and that maybe I missed something simple that allows multi-homed schedulers. Private subnets aren't all that strange; they come up in the context of containers and clusters.
Is there a smarter way to do this?
In case it's of interest, the reason for not using the cluster queuing system is that the target "worker" machine is the one with a GPU, and we are having some difficulty getting the queuing system to allocate it properly, so at the moment, that machine is working outside of the queuing system. We will eventually solve that problem, but for now, we're trying to do this.
Also, for completeness, the reason for not having the client be on the scheduler machine is that, in our scenario, the client needs to do visualizations, and the scheduler is a cluster head-node that's in rack in the machine room and is not physically accessible to users.
If you don't specify any --host to dask-scheduler, it will listen on all interfaces by default. For instance:
$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Scheduler at: tcp://192.168.1.68:8786
distributed.scheduler - INFO - http at: 0.0.0.0:9786
distributed.scheduler - INFO - bokeh at: 0.0.0.0:8788
distributed.bokeh.application - INFO - Web UI: http://127.0.0.1:8787/status/
distributed.scheduler - INFO - -----------------------------------------------
and:
$ netstat -tnlp | \grep 8786
tcp 0 0 0.0.0.0:8786 0.0.0.0:* LISTEN 23969/python
tcp6 0 0 :::8786 :::* LISTEN 23969/python
So you can then connect from the subnetwork you want, using the right IP (v4 or v6) address to contact the scheduler. Your workers might use tcp://192.168.1.68:8786 and your clients tcp://10.1.2.3:8786, for instance.
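Concretely, reusing the example addresses from the scheduler output above (a sketch; both addresses are illustrative, and the prompts follow the question's convention):
worker# dask-worker tcp://192.168.1.68:8786
client>>> from distributed import Client
client>>> cl = Client('tcp://10.1.2.3:8786')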
If you want to listen on more than one interface, but not on all of them, however, that is not currently possible.

UnknownHostException on tasktracker in Hadoop cluster

I have set up a pseudo-distributed Hadoop cluster (with jobtracker, a tasktracker, and namenode all on the same box) per tutorial instructions and it's working fine. I am now trying to add in a second node to this cluster as another tasktracker.
When I examine the logs on Node 2, all the logs look fine except for the tasktracker's. I'm getting an infinite loop of the error message listed below. It seems that the TaskTracker is trying to use the hostname SSP-SANDBOX-1.mysite.com rather than the IP address. This hostname is not in /etc/hosts, so I'm guessing this is where the problem is coming from. I do not have root access to add it to /etc/hosts.
Is there any property or configuration I can change so that it will stop trying to connect using the hostname?
Thanks very much,
2011-01-18 17:43:22,896 ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.UnknownHostException: unknown host: SSP-SANDBOX-1.mysite.com
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:195)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:850)
at org.apache.hadoop.ipc.Client.call(Client.java:720)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy5.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1033)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2833)
This blog posting might be helpful:
http://western-skies.blogspot.com/2010/11/fix-for-exceeded-maxfaileduniquefetches.html
The short answer is that Hadoop performs reverse hostname lookups even if you specify IP addresses in your configuration files. In your environment, in order for you to make Hadoop work, SSP-SANDBOX-1.mysite.com must resolve to the IP address of that machine, and the reverse lookup for that IP address must resolve to SSP-SANDBOX-1.mysite.com.
So you'll need to talk to whoever is administering those machines to either fudge the hosts file or to provide a DNS server that will do the right thing.
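If the admins take the hosts-file route, the entry would look something like this on each node (a sketch; 192.168.0.10 is a placeholder for that machine's actual address), and both lookup directions can then be checked from a shell:
192.168.0.10   SSP-SANDBOX-1.mysite.com
getent hosts SSP-SANDBOX-1.mysite.com   # forward lookup
host 192.168.0.10                       # reverse lookup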
