I have a use case that requires data backup across multiple data centers and strong consistency. Ideally, each segment would be replicated to three clusters located in three different data centers. Pulsar supports using multiple clusters as one large bookie pool, but I didn't find how to configure replicas across different clusters. Has anyone had a similar use case before? I think it shouldn't be hard to do, considering Pulsar separates brokers from storage and supports replicas in different clusters.
It's possible to enable a region-aware placement policy for bookies (parameter bookkeeperClientRegionawarePolicyEnabled). You'll also need to configure each bookie's region with the admin command set-bookie-rack. This is not well documented in the Pulsar/BookKeeper docs. See this blog post for more details: https://techblog.cdiscount.com/ensure-cross-datacenter-guaranteed-message-delivery-and-resilience-with-apache-pulsar/
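As a rough sketch of the configuration involved (the parameter names come from broker.conf and the pulsar-admin CLI; the hostnames, group, and region/rack values are placeholders to adapt to your setup):

```
# broker.conf: spread each ledger's ensemble across regions
bookkeeperClientRegionawarePolicyEnabled=true
# 3 copies per segment, so one copy can land in each data center
managedLedgerDefaultEnsembleSize=3
managedLedgerDefaultWriteQuorum=3
managedLedgerDefaultAckQuorum=3

# Tag each bookie with its region/rack (repeat per bookie, per region)
bin/pulsar-admin bookies set-bookie-rack \
  --bookie bk1.dc1.example.com:3181 \
  --hostname bk1.dc1.example.com \
  --group default \
  --rack dc1/rack1
```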
Beware that due to the cross-region latency between the brokers and the bookies, the throughput will drop but that can't really be helped if you need strong consistency even in the case of a region failure.
Related
In our organization, we are trying to have two Cassandra datacenters with only one node on each side. From preliminary investigation, I can see that replication is happening, but I want to know whether we can use this deployment in production. Will there be any performance issues with replication?
We have already set up two datacenters with one node in each, and replication is working fine.
I want to know whether this kind of setup is recommended for a production deployment.
Not sure what your use case is.
But in general, multiple data centers are used for several reasons:
1) Disaster recovery (DR).
2) To run different kinds of workloads, like analytics or search.
3) To decrease latency if your users are spread across the world.
In general, a minimum of three nodes per data center is recommended in production. Again, it depends on the use case.
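For illustration, this is roughly how cross-datacenter replication is declared; the keyspace and data center names are placeholders, and each per-DC replication factor can't usefully exceed the number of nodes in that DC, which is why one node per DC is limiting:

```
CREATE KEYSPACE my_app          -- hypothetical keyspace name
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,                   -- needs at least 3 nodes in dc1
    'dc2': 3                    -- needs at least 3 nodes in dc2
  };
```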
Does it make sense to create a separate Kubernetes cluster for my Cassandra instances and one cluster for the application layer? Is the DB cluster accessible from the service cluster when both are in the same region and zone?
Or is it better to have one cluster with different pools - one pool for the service layer and one pool for the DB nodes?
Thanks
This is more of a toss-up, or a matter of opinion, in terms of how you want to design your whole architecture. Here are some things to consider (a node-pool sketch follows the lists):
Same cluster:
Pros
Workloads don't need to go to a different pod CIDR to get their data.
You can optimize your resources in the same set of servers.
This is one of the main reasons people use containers orchestrators and containers.
It allows you to run multiple different types of workloads on the same set of resources.
Cons
If you have an issue with the cluster running Cassandra, you risk losing your data, or temporarily losing access to it if you have backups (longer downtime).
If you'd like to strictly isolate the DB and the app in terms of security, it may be harder.
Different clusters:
Pros
'Safer' if one of your clusters goes down.
More separation in terms of security for your data at rest.
Cons
Resources may not be optimally utilized, leaving some CPU, memory, etc. idle.
More infrastructure management.
Different node pools:
Pros
Separation of data at rest
Traffic still goes through the same pod CIDR.
Cons
More management of different node pools.
Resources may not be optimally utilized.
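To make the node-pool option concrete, here is a hedged excerpt of a Cassandra StatefulSet pinned to a dedicated pool; the `pool: db` label, the taint, and the image tag are placeholders for whatever your cluster actually uses:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      nodeSelector:
        pool: db              # placeholder label applied to the DB node pool
      tolerations:
        - key: dedicated      # only needed if the DB pool is tainted
          operator: Equal
          value: db
          effect: NoSchedule
      containers:
        - name: cassandra
          image: cassandra:3.11
          ports:
            - containerPort: 9042
```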
When I create a Cassandra cluster builder in Java, I provide a list of multiple Cassandra nodes as shown below:
Cluster cluster = Cluster.builder().addContactPoints(host1, host2, host3, host4).build();
But from what I understand, the connector connects only to the first host in the list that is available, and that host becomes my connection point to the Cassandra cluster.
Now, my question is: if my Java application reads/writes a huge amount of data from/to Cassandra, doesn't it overwhelm the node that it is connected to?
Is there a way to configure my connection such that it uses multiple nodes of Cassandra for its reads/writes? What is the common practice?
It uses the contact point to find the rest of the nodes in the cluster, then creates a pool of connections to all the hosts and balances the requests among them. It doesn't only connect to the hosts you provide unless you use the whitelist load balancing policy or a custom one.
If you're worried about overwhelming nodes, use the RoundRobinPolicy (or DCAwareRoundRobinPolicy if you have multiple DCs) and the driver will distribute requests among all the nodes evenly. If you have hot spots in your data and use the TokenAwarePolicy, the load may be uneven, but you shouldn't need to worry about it.
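For reference, here is a minimal sketch with the DataStax Java driver 3.x; the contact points and the local DC name "DC1" are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class ClusterFactory {
    public static Cluster build() {
        // Contact points are only used for discovery; the driver then opens
        // connections to every node it finds and balances requests across them.
        return Cluster.builder()
                .addContactPoints("host1", "host2", "host3")
                .withLoadBalancingPolicy(
                        // Prefer a replica that owns the token when possible,
                        // falling back to round-robin within the local data center.
                        new TokenAwarePolicy(
                                DCAwareRoundRobinPolicy.builder()
                                        .withLocalDc("DC1") // placeholder DC name
                                        .build()))
                .build();
    }
}
```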
Assuming a live cluster with several DCs, what's the best way to set up some nodes that are dedicated to analytics queries?
Analytics nodes will be hosted in a separate (routed) network and must not write any data back to the production nodes. They also must not count toward any consistency level. This especially applies to EACH_QUORUM, which will be used for some writes. Analytics nodes may be offline at any time.
All solutions I've looked into seem to have their own drawbacks.
1) Take snapshots on production and transfer to independent analytics cluster
Significant update delay
IO intensive either on network or disk (e.g. rsync)
Lots of duplicate data due to different replication factors (3:1 prod. vs analytics)
Mismatch in SSTable row ranges and cluster topology on analytics cluster may require to use sstableloader
2) Use write survey mode to establish read-only nodes
Not 100% sure how this could be done for setting up multiple survey nodes to cover the whole ring
Queries can only be executed against each node locally as they could not be part of a coordinated execution
3) Add regular DC dedicated for analytics
EACH_QUORUM will fail in case analytics cluster is not available
Queries on production should not be served from analytics
Would require a way to prevent users on analytics to be able to execute queries or updates on production
Any other options or existing tools that could be used?
I have a Cassandra cluster deployed with 3 nodes and a replication factor of 3. A lot of data is written to Cassandra on a daily basis (10-15 GB). I have provisioned these Cassandra nodes on commodity hardware, as suggested by the big data community, and I expect the nodes to go down frequently, which is handled by the redundancy Cassandra provides.
My problem is that I have observed Cassandra writes slowing down when a new node is provisioned and data is being streamed to it during bootstrap. To overcome this hurdle, we have decided to use one network interface for inter-node communication and a separate one for client applications writing to Cassandra. My question is how this can be configured, if it is possible at all.
Any help is appreciated.
I think you are chasing the wrong solution.
I am confused by the fact that you only have 3 nodes, yet your concern is around slow writes while bootstrapping. Why? Are you planning to grow your cluster regularly? What is your consistency level on write, as this has a big impact on performance? Obviously if you only have 2 or 3 nodes and you're trying to bootstrap, you will see a slowdown, because you're tying up a significant percentage of your cluster to do the streaming.
Note that "commodity hardware" doesn't mean cheap, low-performance hardware. It just means you don't need the super high-end database-class machines used for databases like Oracle. You should still use really good commodity hardware. You may also need more nodes, as setting RF equal to cluster size is not typically a great idea.
Having said that, you can set your listen_address to the inter-node interface and rpc_address to the client address if you feel that will help.
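For example, a minimal cassandra.yaml sketch; the addresses are placeholders for your two interfaces:

```yaml
# cassandra.yaml (per node; addresses are placeholders)
listen_address: 10.0.1.15       # inter-node interface (gossip, streaming)
rpc_address: 192.168.1.15       # client-facing interface (CQL / native protocol)
native_transport_port: 9042     # clients connect to rpc_address on this port
```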