Cassandra multi datacenter cluster healthcheck from f5 load balancer - cassandra

I have a working cassandra cluster across two data centers. Each DC has 3 nodes with replication factor as 3 and READ/WRITE consistency as LOCAL_QUORUM.
I want to stop the traffic to a particular DC when two nodes in the DC are down, because quorum is no longer met. I expected this to be handled by my application(client) i.e. connect to other DC cassandra when local quorum is not met but it is not possible from there.
Can we setup some kind of rule at f5 load balancer to achieve this?

You can setup an external monitor on the BIG-IP to run a script determining cluster health and then load balance on the results. If you're using BIG-IP 11.x+ you create your script and import it, adding any needed arguments it may require. Then you create a monitor profile to call that external monitor.
If you have a DevCentral account, check out this page:
DevCentral Wiki: External Monitor
Scroll down and you'll see a ton of examples to build off. Examples to note are the MySql monitors. This is the path I would recommend for cluster health checks for BIG-IP.
Alternatively, you can simply query a web page looking for a success/failure message so if you already have a cluster health status page, you can have an HTTP monitor validate the message. You can customize the receive string to look for specific content or use regex to look for any specific string (such as clusterFailure or whatnot). From there, you can make the appropriate LB decisions. I ran a similar monitor that read a nagios status page and if it read a failure on a specific message, it would LB connections from that node.
Here's some info on regex with http monitors.

Related

pgbouncer - auroraDB cluster not load balancing correctly

I as using AuroraDB cluster with 2 readers and pgBouncer to maintain a connection pool.
My application is very read intensive and fires a lot of select queries.
the problem I am facing is my 2 read replicas are not getting used completely in parallel.
I can see the trends where all connections get moved to 1 replica where other replica is serving 0 connections and after some time the situation shift when 2nd replica serves all connections and 1st serves 0.
I investigated this and found that auroraDB cluster load balancing is done on by time slicing 1-second intervals.
My guess is when pgBouncer creates connection pool all connection are created within 1 second window and all connections end up on 1 read replica.
is there any way I can correct this?
The DB Endpoint is a Route 53 DNS and load balancing is done basically via DNS round robin, each time you resolve the DNS. When you use pgBouncer, is it resolving the DNS once and trying to open connections to the resolved IP? If yes, then this is expected that all your connections are resolved to the same instance. You could fix this conceptually in multiple ways (I'm not too familiar with pgBouncer), but you basically need to somehow make the library resolve the DNS explicitly for each connection, or explicitly add all the instance endpoints to the configuration. The latter is not recommended if you plan on issuing writes using this Connection pool. You don't have any control over who stays as the writer, so you may inadvertently end up sending your writes to a replica.
AuroraDB cluster load balancing is done on by time slicing 1-second intervals
I'm not too sure where you read that. Could you share some references?

Load test on Azure

I am running a load test using JMeter on my Azure web services.
I scale my services on S2 with 4 instances and run JMeter 4 instances with 500 threads on each.
It starts perfectly fine but after a while calls start failing and giving Timeout error (HTTP status:500).
I have checked HTTP request queue on azure and found that on 2nd instance it is very high and two instances it is very low.
Please help me to success my load test.
I assume you are using Azure App Service. If you check the settings of your App, you will notice ARR’s Instance Affinity will be enabled by default. A brief explanation:
ARR cleverly keeps track of connecting users by giving them a special cookie (known as an affinity cookie), which allows it to know, upon subsequent requests, to which server instance they were talking to. This way, we can be sure that once a client establishes a session with a specific server instance, it will keep talking to the same server as long as his session is active.
This is an important feature for session-sensitive applications, but if it's not your case then you can safely disable it to improve the load balance between your instances and avoid situations like the one you've described.
Disabling ARR’s Instance Affinity in Windows Azure Web Sites
It might be due to caching of network names resolution on JVM or OS level so all your requests are hitting only one server. If it is the case - add DNS Cache Manager to your Test Plan and it should resolve your issue.
See The DNS Cache Manager: The Right Way To Test Load Balanced Apps article for more detailed explanation and configuration instructions.

Does Azure load balancer allow connection draining

I cant seem to find any documentation for it.
If connection draining is not available how is one supposed to do zero-downtime deployments?
Rick Rainey answered essentially the same question on Server Fault. He states:
The recommended way to do this is to have a custom health probe in
your load balanced set. For example, you could have a simple
healthcheck.html page on each of your VM's (in wwwroot for example)
and direct the probe from your load balanced set to this page. As long
as the probe can retrieve that page (HTTP 200), the Azure load
balancer will keep sending user requests to the VM.
When you need to update a VM, then you can simply rename the
healthcheck.html to a different name such as _healthcheck.html. This
will cause the probe to start receiving HTTP 404 errors and will take
that machine out of the load balanced rotation because it is not
getting HTTP 200. Existing connections will continue to be serviced
but the Azure LB will stop sending new requests to the VM.
After your updates on the VM have been completed, rename
_healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending
requests to this VM again.
Repeat this for each VM in the load balanced set.
Note, however, that Kevin Williamson from Microsoft states in his MSDN blog post Heartbeats, Recovery, and the Load Balancer, "Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (eg. Try to connect to your SQL database)." So you may actually want an aspx page that can check several factors, including a custom "drain" flag you put somewhere.
Your clients need to simply retry.
The load balancer only forwards a request to an instance that is alive (determined by pings), it doesn't keep track of the connections. So if you have long-standing connections, it is your responsibility to clean them up on restart events or leave it to the OS to clean them up on restarts (which is obviously not gracefully in most of the cases).
Zero-downtime means that you'll always be able to reach an instance that is alive, nothing more- it gives you no guarantees on long running requests.
Note that when a probe is down, only new connections will go to other VMs
Existing connections are not impacted.

How to measure effectiveness of using Token Aware connection pool?

My team is testing the token aware connection pool of Astyanax. How can we measure effectiveness of the connection pool type, i.e. how can we know how the tokens are distributed in a ring and how client connections are distributed across them?
Our initial tests by counting the number of open connection on network cards show that only 3 out of 4 or more Cassandra instances in a ring are used and the other nodes participate in request processing in a very limited scope.
What other information would help making a valid judgment/verification? Is there an Cassandra/Astyanax API or command line tools to help us out?
Use Opscenter. This will show you how balanced your cluster is, i.e. whether each node has the same amount of data, as well asbeing able to graph the incoming read / write request per node and for your entire cluster. It is free and works with open source Cassandra as well as DSE. http://www.datastax.com/what-we-offer/products-services/datastax-opscenter

How to handle read/write request in cassandra

I have 5 node cluster with 2 Cassandra,2 solr and 1 hadoop on EC2 with DSE4.5.
My requirement is I dont want to hard code node IP address while requesting for Reading/writing from Cluster. I have to develop web service, thru which requester can send read/write request to my cluster and web service has to determine following
1) route read request to appropriate node.
2) route write request to appropriate node.
If there is any write request then it should direct to Cassandra node on basis of keyspace and replication factor. if it is a read request then request should route to Solr node (as I done indexing on solr) and if there is any analytic query then request should route to hadoop.
And if any node goes down in that case response will not affect.
Apart from dedicated request, is there any way to request a cluster ?
by dedicated mean giving specific IP address for read and write.
Is any method or algorithm exist in DSE? or Is there any tool available in for this?
The Java driver should take care of all of that for you:
http://www.datastax.com/documentation/developer/java-driver/2.0/common/drivers/introduction/introArchOverview_c.html
For example:
Nodes discovery: the driver automatically discovers and uses all nodes of the Cassandra cluster, including newly bootstrapped ones
Configurable load balancing: the driver allows for custom routing and load balancing of queries to Cassandra nodes. Out of the box, round robin is provided with optional data-center awareness (only nodes from the local data-center are queried (and have connections maintained to)) and optional token awareness (that is, the ability to prefer a replica for the query as coordinator).
Transparent failover: if Cassandra nodes fail or become unreachable, the driver automatically and transparently tries other nodes and schedules reconnection to the dead nodes in the background.
On the Solr query side, you can use the SolrJ load balancer, but you have to hard-wire the list of nodes to be used as coordinator nodes, but SolrJ will round robin for you.

Resources