I have two servers in different networks, one in USA and another in Germany.
Can I put together in a failover cluster?
Yes, no issue. you can also have SQL Server AG as well. Make sure you have the firewalls opened properly between these two servers. And also you need to have a common DNS/AD for the failover cluster to work. Also have a 3rd server or cluster disk for quorum or common data files.
Related
Let's say I have a cluster of 3 nodes for ScyllaDB in my local network (it can be AWS VPC).
I have my Java application running in the same local network.
I am concerned how to properly connect app to DB.
Do I need to specify all 3 IP addresses of DB nodes for the app?
What if over time one or several nodes die and get resurrected on other IPs? Do I have to manually reconfigure application?
How is it done properly in big real production cases with tens of DB servers, possibly in different data centers?
I would be much grateful for a code sample of how to connect Java app to multi-node cluster.
You need to specify contact points (you can use DNS names instead of IPs) - several nodes (usually 2-3), and driver will connect to one of them, and will discover the all nodes of the cluster after connection (see the driver's documentation). After connection is established, driver keeps the separate control connection opened, and via it receives the information about nodes that are going up & down, joining or leaving the cluster, etc., so it's able to keep information about cluster topology up-to-date.
If you're specifying DNS names instead of the IP addresses, then it's better to specify configuration parameter datastax-java-driver.advanced.resolve-contact-points as true (see docs), so the names will be resolved to IPs on every reconnect, instead of resolving at the start of application.
Alex Ott's answer is correct, but I wanted to add a bit more background so that it doesn't look arbitrary.
The selection of the 2 or 3 nodes to connect to is described at
https://docs.scylladb.com/kb/seed-nodes/
However, going forward, Scylla is looking to move away from differentiating between Seed and non-Seed nodes. So, in future releases, the answer will likely be different. Details on these developments at:
https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/
Answering the specific questions:
Do I need to specify all 3 IP addresses of DB nodes for the app?
No. Your app just needs one to work. But it might not be a bad idea to have a few, just in case one is down.
What if over time one or several nodes die and get resurrected on other IPs?
As long as your app doesn't stop, it maintains its own version of gossip. So it will see the new nodes being added and connect to them as it needs to.
Do I have to manually reconfigure application?
If you're specifying IP addresses, yes.
How is it done properly in big real production cases with tens of DB servers, possibly in different data centers?
By abstracting the need for a specific IP, using something like Consul. If you wanted to, you could easily build a simple restful service to expose an inventory list or even the results of nodetool status.
We are looking at moving around 100 websites that we have on a dedicated web server, from our current hosting company; and host these sites on a EC2 Windows 2012 server.
I've looked at the type of EC2 instances available. Am I better going for a m1.small (or t1.micro with auto scaling). With regards auto scaling, how does it work, if I upload a file to the master instance, when are the other instances updated ? Is it when the instances are auto scaled again ?
Also, I will be needing to host a mail enable (mail server) application. Any thoughts on best practice for this ? Am I better off hosting 1 server for everything, or splitting it across instances...?
When you are working with EC2, you need to start thinking about how your applications are designed and deployed differently.
Autoscaling works best when your instances follow shared nothing architecture. The instances themselves should never store persistent data. They should also be able to be automatically set up at launch.
Some applications are not designed to work in this environment. They require local file storage, or other issues.
You probably wont be using micro instances. They are mostly designed for very specific low utilization workloads.
You can run a mail server on ec2, but you will have to use an Elastic IP and whitelist the instances sending mail. By default, EC2 instances are on the spamhaus block list.
I'm currently managing a cluster of PHP-FPM servers, all of which tend to get out of sync with each other. The application that I'm using on top of the app servers (Magento) allows for admins to modify various files on the system, but now that the site is in a clustered set up modifying a file only modifies it on a single instance (on one of the app servers) of the various machines in the cluster.
Is there an open-source application for Linux that may allow me to keep all of these servers in sync? I have no problem with creating a small VM instance that can listen for changes from machines to sync. In theory, the perfect application would have small clients that run on each machine to be synced, which would talk to the master server which would then decide how/what to sync from each machine.
I have already examined the possibilities of running a centralized file server, but unfortunately my app servers are spread out between EC2 and physical machines, which makes this unfeasible. As there are multiple app servers (some of which are dynamically created depending on the load of the site), simply setting up a rsync cron job is not efficient as the cron job would have to be modified on each machine to send files to every other machine in the cluster, and that would just be a whole bunch of unnecessary data transfers/ssh connections.
I'm dealing with setting up a similar solution. I'm half way there. I would recommend you use lsyncd, which basically monitors the disk for changes and then immediately (or whatever interval you want) automatically syncs files to a list of servers using rsync.
The only issue I'm having is keeping the server lists up to date, since I can spin up additional servers at any time, I would need to have each machine in the cluster notified whenever a machine is added or removed from the cluster.
I think lsyncd is a great solution that you should look into. The issue I'm having may turn out to be a problem for you as well, and that remains to be solved.
Instead of keeping tens or hundreds of servers cross-synchronized it would be much more efficient, reliable, and most of all simple maintaining just one "admin node" and replicating changes from that to all your "worker nodes".
For instance at our company we use a Development server -> Staging server -> Live backends workflow where all the changes are transferred across servers using a custom php+rsync front end. That allows the developers to push updates to a Staging server in the live environment, test out changes, and roll them to Live backends incrementally.
A similar approach could very well work in your case as well. Obviously it's not a plug-and-play solution, but I see it as the easiest way to go - both in terms of maintainability and scalability.
We would like to protect the Cassandra against man-in-the-middle attacks. Is there any way to configure Cassandra in a way that the client-server and server-server (replication) communications are SSL encrypted?
thank you
short answer: no :)
For client - server : THRIFT-151
Edit: You might want to follow this thread on the ML
Encrypted server server communication seems to be available now:
https://issues.apache.org/jira/browse/CASSANDRA-1567
Provide configurable encryption support for internode communication
Resolution: Fixed
Fix Version/s: 0.8 beta 1
Resolved: 19/Jan/11 18:11
The strategy I employ is to have Apache Cassandra nodes communicate through a site to site VPN tunnel.
Specific configurations for the cassandra.yaml file:
listen_address: 10.x.x.x # vpn network ip
rpc_address: 172.16.x.x. # non-vpn network for client access although, I leave it blank so that it listens on all interfaces
The benefits to this approach is that you can deploy Apache Cassandra to many different environments and you become provider agnostic. For example, hosting nodes in various Amazon EC2 environments and hosting nodes in your own physical data center and hosting a few others under your desk!
Cost an issue preventing you from looking into this approach? Check out Vyatta ...
As KajMagnus pointed out, there is a JIRA ticket resolved and available in the stable version of Apache Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-1567 which enables you to accomplish what you would like via TLS/SSL .. but there are a few ways to accomplish what you would like.
Finally, if you want to host your instance on Amazon EC2, region to region can be problematic and although there is a patch available in 1.x.x, is it really the right approach? I have found the VPN approach reduces latency between nodes in different regions and still maintains the necessary level of security.
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Running-across-multiple-EC2-regions-td6634995.html
Finally -- part 2 --
If you want to secure client to server communications, have your clients (web servers) communicate through the same VPN .. The configuration I have:
Front end webservers communicate via internal network to application servers
Application servers sit on their own internal network and VPN network and communicate to the Data Layer via the VPN tunnel and between each other on the internal network
Data Layer exists on it's own network per Data Centre / Rack and receives requests via the VPN network
Node to Node (gossip) communication can be secured per the issue above. Client and server will both soon support Kerberos (In Hector master as of commit: https://github.com/rantav/hector/commit/08149a03c81b559cba5680d115943dbf334f58fa should hit Cassandra side shortly).
Can I used cassandra on EC2 instances without Elastic IP addresses? I believe in that case any instance that goes down, would create an issue.
If I use Elastic IP addresses for the cassandra nodes, I have to configure them such that they use the Public IP address for internal communication (gossip etc.). But that will increase the network latency.
Please suggest how should I configure my nodes such that the problems can be minimized.
My answer would be, use Rackspace Cloud Servers instead because you get better i/o performance as well as both public and internal IPs.
But there are several people in the community using EC2; I'd ask on the cassandra-user list if you insist on that. :)
Many people (me included) use cassandra on Amazon's EC2 without issue. As the internal IP addresses are prone to change at will you'll just need to use your internal EC2 DNS names (not your public IP address or public DNS name as that would both be a security hole and also Amazon would charge you for all your Cassandra traffic).
It does mean that if your Cassandra node goes down for any reason then you'll lose the data on that node (unless you're using persistent storage which is slower) but this can easily be fixed by increasing your replication factor (we use RF=3)
We use a VPN solution that has worked exceptionally well with EC2 & Cassandra which allows us to not use Elastic IP's. Some information about it can be found here.
Spreading MongoDB across EC2 regions
http://mail-archives.apache.org/mod_mbox/cassandra-user/201105.mbox/%3CBANLkTi=mB5joMKZHasodJcPUJOQ7+V7O-Q#mail.gmail.com%3E
Now, this option isn't for everyone, but allows us to mitigate the issues with EC2 nodes changing IP addresses and not having to use Elastic IP. We also find that we receive performance increases by NOT using the Amazon interfaces (internal/external). Don't ask me why .. I don't know enough about their architecture to be able to explain it -- expcet that it just does!
Further, it allows us to leverage nicely the suggestions #jbellis proposes, to use RackSpace. We have a mixed provider setup .. so that we can leverage EC2, RackSpace and our own internal hosted nodes. Being vendor / service provider agnostic is quite important to me...