Cassandra on Amazon EC2 with Elastic IP addresses

Can I use Cassandra on EC2 instances without Elastic IP addresses? I believe that, in that case, any instance that goes down would create an issue.
If I use Elastic IP addresses for the Cassandra nodes, I have to configure them to use the public IP address for internal communication (gossip, etc.), but that will increase network latency.
Please suggest how I should configure my nodes so that these problems are minimized.

My answer would be: use Rackspace Cloud Servers instead, because you get better I/O performance as well as both public and internal IPs.
But there are several people in the community using EC2; I'd ask on the cassandra-user list if you insist on that. :)

Many people (me included) run Cassandra on Amazon's EC2 without issue. As the internal IP addresses are prone to change at will, you'll just need to use your internal EC2 DNS names (not your public IP address or public DNS name, since that would be both a security hole and Amazon would charge you for all your Cassandra traffic).
It does mean that if your Cassandra node goes down for any reason, you'll lose the data on that node (unless you're using persistent storage, which is slower), but this can easily be fixed by increasing your replication factor (we use RF=3).
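For illustration, a minimal cassandra.yaml sketch using internal EC2 DNS names (the hostnames here are placeholders, and option names vary between Cassandra versions, so check your version's defaults):
# Hypothetical node; ip-10-0-0-12.ec2.internal resolves to the instance's private IP
listen_address: ip-10-0-0-12.ec2.internal
rpc_address: ip-10-0-0-12.ec2.internal
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "ip-10-0-0-12.ec2.internal,ip-10-0-0-13.ec2.internal"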

We use a VPN solution that has worked exceptionally well with EC2 & Cassandra and that allows us to avoid Elastic IPs. Some information about it can be found here:
Spreading MongoDB across EC2 regions
http://mail-archives.apache.org/mod_mbox/cassandra-user/201105.mbox/%3CBANLkTi=mB5joMKZHasodJcPUJOQ7+V7O-Q#mail.gmail.com%3E
Now, this option isn't for everyone, but it allows us to mitigate the issues with EC2 nodes changing IP addresses without having to use Elastic IPs. We also find that we get performance increases by NOT using the Amazon interfaces (internal/external). Don't ask me why .. I don't know enough about their architecture to be able to explain it -- except that it just does!
Further, it lets us nicely leverage the suggestion @jbellis proposes, to use Rackspace. We have a mixed-provider setup, so we can leverage EC2, Rackspace and our own internally hosted nodes. Being vendor / service provider agnostic is quite important to me...

Is a DNS Cache Manager necessary to accurately test the performance of my web application?

What I'm doing:
I have several servers that sit behind my load balancer, hosting a dummy website. I am trying to profile the performance of the servers and the load balancer with JMeter.
What I'm unsure of:
I was recommended to use the DNS Cache Manager option in my JMeter configuration. However, is this necessary? Since I'm using a load balancer, doesn't JMeter only see the LB's IP address?
If your LB has only one IP address, it doesn't really make a lot of sense. If it has more than one (as described here), it's highly recommended (if not mandatory).
Moreover, the DNS Cache Manager has a nice DNS-entry override feature, so you can map IP addresses to human-readable, meaningful values like "appserver1", "appserver2", "LB", "DB", etc.
More information: The DNS Cache Manager: The Right Way To Test Load Balanced Apps

How to properly connect client application to Scylla or Cassandra?

Let's say I have a cluster of 3 ScyllaDB nodes in my local network (it could be an AWS VPC).
I have my Java application running in the same local network.
I am concerned about how to properly connect the app to the DB.
Do I need to specify all 3 IP addresses of the DB nodes for the app?
What if, over time, one or several nodes die and get resurrected on other IPs? Do I have to manually reconfigure the application?
How is it done properly in big real-world production setups with tens of DB servers, possibly in different data centers?
I would be very grateful for a code sample showing how to connect a Java app to a multi-node cluster.
You need to specify contact points (you can use DNS names instead of IPs) - several nodes (usually 2-3) - and the driver will connect to one of them and discover all the nodes of the cluster after connecting (see the driver's documentation). After the connection is established, the driver keeps a separate control connection open, through which it receives information about nodes going up and down, joining or leaving the cluster, etc., so it's able to keep its picture of the cluster topology up to date.
If you're specifying DNS names instead of IP addresses, then it's better to set the configuration parameter datastax-java-driver.advanced.resolve-contact-points to false (see docs), so the names are re-resolved to IPs on every reconnect instead of being resolved once at application start.
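A minimal sketch with the DataStax Java driver 4.x (which Scylla also supports); the host names, port and datacenter name below are placeholders:
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;

public class ClusterConnect {
    public static void main(String[] args) {
        // Two or three contact points are enough; the driver discovers the rest of the cluster.
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("node1.example.internal", 9042))
                .addContactPoint(new InetSocketAddress("node2.example.internal", 9042))
                .withLocalDatacenter("datacenter1") // must match the DC name in nodetool status
                .build()) {
            Row row = session.execute("SELECT release_version FROM system.local").one();
            System.out.println(row.getString("release_version"));
        }
    }
}
The resolve-contact-points setting mentioned above goes in the driver's application.conf, as datastax-java-driver { advanced.resolve-contact-points = false }.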
Alex Ott's answer is correct, but I wanted to add a bit more background so that it doesn't look arbitrary.
The selection of the 2 or 3 nodes to connect to is described at
https://docs.scylladb.com/kb/seed-nodes/
However, going forward, Scylla is looking to move away from differentiating between Seed and non-Seed nodes. So, in future releases, the answer will likely be different. Details on these developments at:
https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/
Answering the specific questions:
Do I need to specify all 3 IP addresses of DB nodes for the app?
No. Your app just needs one to work. But it might not be a bad idea to have a few, just in case one is down.
What if over time one or several nodes die and get resurrected on other IPs?
As long as your app doesn't stop, it maintains its own view of the cluster topology (fed by the driver's control connection). So it will see the new nodes being added and connect to them as it needs to.
Do I have to manually reconfigure application?
If you're specifying IP addresses, yes.
How is it done properly in big real production cases with tens of DB servers, possibly in different data centers?
By abstracting away the need for a specific IP, using something like Consul. If you wanted to, you could easily build a simple RESTful service to expose an inventory list or even the results of nodetool status, as sketched below.
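For instance, a minimal sketch of such an inventory endpoint using the JDK's built-in HTTP server (the path, port and node names are hypothetical; in a real setup the list would come from Consul, DNS or nodetool status):
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import com.sun.net.httpserver.HttpServer;

public class ContactPointService {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        // Clients GET /contact-points at startup instead of hard-coding node IPs.
        server.createContext("/contact-points", exchange -> {
            byte[] body = "node1.example.internal:9042,node2.example.internal:9042"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}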

NodeJS TLS/TCP server in need of an external firewall

Problem:
I have an AWS EC2 instance running FreeBSD. In there, I'm running a NodeJS TLS/TCP server. I'd like to create a set of rules (in my NodeJS application) to be able to individually block IP addresses programmatically based on a few logical conditions.
I'd like to run an external (not on the same machine/instance) firewall or load-balancer, that I can control from NodeJS programmatically, such that when certain conditions are given, I can block a specific remote-address(IP) before it reaches the NodeJS instance.
Things I've tried:
I initially looked into nginx as an option, running it on a second instance and placing my NodeJS server behind it, but after skimming through the NGINX Cookbook: Advanced Recipes for High-Performance Load Balancing, I learned that only NGINX Plus (the paid version) allows remote/API control & customization. While I believe that $3500/license is not too much (considering all of NGINX Plus' features), I simply cannot afford to buy it at this point in time; in addition, the only features I'd be using (at this point) would be the remote API control and the IP address blocking.
My second thought was to go with AWS ELB (Elastic Load Balancer) by integrating the AWS SDK into my project. That sounded feasible; unfortunately, after reading a few forum threads and part of their documentation (unless I'm mistaken), it seems the two features I need are not available on the ELB. AWS seems to offer an entirely different service called WAF, which I honestly don't understand very well (both as a service and from a feature standpoint).
I have also (briefly) looked into CloudFlare, as it was recommended in one of the posts here on Stack Overflow, though I can't really tell if their firewall would allow this level of (remote) control.
Question:
What are my options? What would you guys recommend I do?
I think nginx provides this kind of functionality; please refer to the link.
If you want to block an IP for your Node TCP server, you can just edit the nginx config file and deny the IP address there.
Frankly speaking, if I were you, I would use AWS WAF; but if you don't want to use it, you can simply do it in Node.js.
In Node.js, you could keep a global array (or set) holding all blocked IP addresses, and upon each connection check whether the connecting host's IP is in it. However, a problem occurs when the machine or application is restarted: you lose all information about blocked IPs. As a solution, you can set up a Redis database (a key-value store that also supports other data types) and store the blocked IPs there. Since Redis keeps its data in RAM, all interaction with the DB is nearly instant, and when the machine or node is restarted, Redis restores from its on-disk backup and continues to work in RAM with the old data.
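A minimal sketch of that pattern (shown in Java with the Jedis client for illustration; the key name and host are placeholders, and the same logic applies in Node with a Redis client such as ioredis):
import redis.clients.jedis.Jedis;

public class IpBlocklist {
    private static final String KEY = "blocked:ips"; // hypothetical Redis set name

    private final Jedis jedis = new Jedis("localhost", 6379);

    // The set lives in Redis, so the blocklist survives application restarts.
    public void block(String ip) {
        jedis.sadd(KEY, ip);
    }

    // Check on every new connection before handing the socket to the application.
    public boolean isBlocked(String ip) {
        return jedis.sismember(KEY, ip);
    }
}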

Sync MongoDB instances

What is the best solution for syncing a MongoDB instance on a local server with a dynamic IP (set by the ISP) with a MongoDB instance on a public server (e.g., Amazon AWS)? Can I do that from Node.js?
You can do this in a number of ways, but first, to address the public/dynamic IP issue, you will want to either use a hostname --> IP address mapping that you maintain (/etc/hosts or your own DNS servers) or look into one of the dynamic DNS solutions.
Once you have the changing-IP-address problem solved, the question is how to keep the systems in sync. The most obvious way is to have the two nodes in a replica set - if your connection is reliable enough this might work, though you will probably want to put an arbiter locally or remotely on whichever side of the connection you want to take writes when the connection is flaky (in a 2-node set, if either node is down then both are secondary and cannot take writes).
Another option is to use the mongo connector which lets you sync to arbitrary destinations, including another MongoDB instance.
That project will give you a pretty good idea of what you need to do (in Python) to provide such a syncing service. You would need to write something similar in Node.js to achieve a proper sync; essentially, you will need to tail the oplog on one host and apply its operations to the other on a regular basis, depending on your requirements.
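To illustrate the oplog-tailing idea, here is a rough sketch using the MongoDB Java driver (the connection string and resume handling are placeholder assumptions; the same approach works from Node.js with a tailable cursor on local.oplog.rs):
import com.mongodb.CursorType;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import org.bson.BsonTimestamp;
import org.bson.Document;

public class OplogTail {
    public static void main(String[] args) {
        // The source must be a replica set member; the oplog is a capped collection in "local".
        MongoClient source = MongoClients.create("mongodb://local-server:27017");
        MongoCollection<Document> oplog =
                source.getDatabase("local").getCollection("oplog.rs");

        BsonTimestamp lastSeen = new BsonTimestamp(0, 0); // in practice, persist the resume point

        try (MongoCursor<Document> cursor = oplog.find(Filters.gt("ts", lastSeen))
                .cursorType(CursorType.TailableAwait) // block and wait for new oplog entries
                .iterator()) {
            while (cursor.hasNext()) {
                Document op = cursor.next();
                lastSeen = op.get("ts", BsonTimestamp.class);
                // Apply `op` to the remote instance here, switching on op.get("op") (i/u/d).
            }
        }
    }
}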

Securing Cassandra communication with TLS/SSL

We would like to protect Cassandra against man-in-the-middle attacks. Is there any way to configure Cassandra so that the client-server and server-server (replication) communications are SSL encrypted?
thank you
short answer: no :)
For client-server: THRIFT-151
Edit: You might want to follow this thread on the ML
Encrypted server-server communication seems to be available now:
https://issues.apache.org/jira/browse/CASSANDRA-1567
Provide configurable encryption support for internode communication
Resolution: Fixed
Fix Version/s: 0.8 beta 1
Resolved: 19/Jan/11 18:11
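For reference, the options that ticket added look roughly like this in cassandra.yaml (option names as of 0.8; later versions renamed the section to server_encryption_options, so check your version - the passwords here are obviously placeholders):
encryption_options:
    internode_encryption: all        # none or all
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra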
The strategy I employ is to have the Apache Cassandra nodes communicate through a site-to-site VPN tunnel.
Specific configurations for the cassandra.yaml file:
listen_address: 10.x.x.x # VPN network IP
rpc_address: 172.16.x.x # non-VPN network for client access (although I leave it blank so that it listens on all interfaces)
The benefit of this approach is that you can deploy Apache Cassandra to many different environments and become provider-agnostic: for example, hosting nodes in various Amazon EC2 environments, hosting nodes in your own physical data center, and hosting a few others under your desk!
Cost an issue preventing you from looking into this approach? Check out Vyatta ...
As KajMagnus pointed out, there is a JIRA ticket resolved and available in the stable version of Apache Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-1567, which enables you to secure internode communication via TLS/SSL. But there are a few ways to accomplish what you would like.
Finally, if you want to host your instances on Amazon EC2, region-to-region communication can be problematic, and although a patch is available in 1.x.x, is it really the right approach? I have found that the VPN approach reduces latency between nodes in different regions while still maintaining the necessary level of security.
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Running-across-multiple-EC2-regions-td6634995.html
Finally -- part 2 --
If you want to secure client-to-server communications, have your clients (web servers) communicate through the same VPN. The configuration I have:
Front end webservers communicate via internal network to application servers
Application servers sit on their own internal network and VPN network and communicate to the Data Layer via the VPN tunnel and between each other on the internal network
Data Layer exists on its own network per Data Centre / Rack and receives requests via the VPN network
Node-to-node (gossip) communication can be secured per the issue above. Client and server will both soon support Kerberos (in Hector master as of commit https://github.com/rantav/hector/commit/08149a03c81b559cba5680d115943dbf334f58fa; support should hit the Cassandra side shortly).
