Cassandra 2+ HPC Deployment

I am trying to deploy Cassandra on a Linux-based HPC cluster and I need some guidelines if possible. Specifically, what is the difference between running Cassandra locally and in a cluster?
When managing it locally (in which case it runs smoothly) we duplicate the original files for every node inside our Cassandra directory and apply the appropriate changes for IP address, RPC, JMX, etc. However, when managing a network, which files do we need to install on each node: the whole package with all the files, or just some of the required ones,
like bin/cassandra.in.sh, conf/cassandra.yaml, and bin/cassandra?
I am a little confused about what to store on each node separately so I can start working on the cluster.

You need to install Cassandra on each node (VM), i.e. the whole package, and then update the config files as necessary. As described in the documentation, to configure a cluster in a single data center you need to:
Install Cassandra on each node
Configure cluster name
Configure seeds
Configure snitch, if needed
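As an illustration of those steps, these are the cassandra.yaml properties you would typically edit on each node (the cluster name, addresses and snitch below are placeholders, not values from the question):
# conf/cassandra.yaml (same file on every node; only the addresses differ)
cluster_name: 'MyHPCCluster'                    # must be identical on all nodes
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1,10.0.0.2"          # same short list of 2-3 stable nodes everywhere
listen_address: 10.0.0.3                        # this node's own IP
rpc_address: 10.0.0.3                           # this node's own IP
endpoint_snitch: GossipingPropertyFileSnitch    # only if you need rack/DC awareness; default is SimpleSnitch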

Related

Cassandra in Production

Anybody supporting a Cassandra application in production? Curious to know how you handle the cassandra.yaml file. Also, do you think a "seed node" gets the status of a master node (partially)?
Anybody supporting a Cassandra application in production?
Yes, my team supports several applications which use Cassandra in production.
Curious to know how you handle the cassandra.yaml file.
By "handle" the cassandra.yaml file, I assume you mean deploy with different values with automation at large scale. We use an open source tool called Rundeck for that.
Rundeck allows you to build options into your jobs, which is useful for properties like cluster_name, seeds, etc. Then, you inject those options into your deploy scripts, using a regex replace (sed) to get them into specific properties in the yaml. Ex:
sed -i "s/cluster_name: 'Test Cluster'/cluster_name: '#cluster_name#'/" cassandra.yaml
Also, do you think "seed node" get's a status of master node (partially).
No, a seed node is not any kind of "master" node.
A seed node is no different from any other node.
In theory, every node in your cluster could be a seed node for another node. A seed is simply a way for a new node to discover the network topology of the cluster; think of it as an entry point to the cluster.

How to sync configuration between Hadoop worker machines

We have a huge Hadoop cluster on which we installed one Presto coordinator node
and 850 Presto worker nodes. Now we want to change the values in the file config.properties, but this needs to be done on all the workers!
Under
/opt/DBtasks/presto/presto-server-0.216/etc
the file looks like this:
[root@worker01 etc]# more config.properties
#
coordinator=false
http-server.http.port=8008
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://master01.sys76.com:8008
and we want to change it to
coordinator=false
http-server.http.port=8008
query.max-memory=500GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=20GB
discovery.uri=http://master01.sys76.com:8008
But this was done only on the first node, worker01, and we need to do it on all the workers as well. We could copy the file to the other workers with scp, but not when root access is restricted. What I want to know is whether Presto already offers a more elegant approach that syncs the configuration across all worker nodes, since, as we all know, after setting the new values we also need to restart the Presto launcher script.
Does Presto have a solution for this?
I should mention that root is restricted on my cluster, so we can't copy the files via SSH.
Presto does not have the ability to sync the configurations. This is something you would need to manage externally, e.g. using a tool like Ansible. There is also the command-line tool presto-admin (https://github.com/prestosql/presto-admin) that can assist with deploying the configs across the cluster.
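For illustration, a minimal Ansible playbook sketch that pushes config.properties to every worker and restarts the launcher; the inventory group name presto_workers is an assumption, the paths are taken from the question, and you still need SSH access to the workers with sufficient privileges:
# push-presto-config.yml (hypothetical playbook; adjust group, paths and privileges)
- hosts: presto_workers
  become: yes
  tasks:
    - name: Copy the updated config.properties to each worker
      copy:
        src: files/config.properties
        dest: /opt/DBtasks/presto/presto-server-0.216/etc/config.properties

    - name: Restart the Presto launcher so the new values take effect
      command: /opt/DBtasks/presto/presto-server-0.216/bin/launcher restart
You would run it with something like: ansible-playbook -i inventory push-presto-config.yml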
Additionally, if you are using public clouds such as AWS, there are commercial solutions from Starburst (https://www.starburstdata.com/) that can assist with managing the configurations as well.

Thingsboard cluster setup

Building a Thingsboard cluster
I need help setting up a Thingsboard cluster, the documentation online is very limited.
The cluster will contain 2 Zookeeper nodes and 4 Thingsboard nodes with Cassandra DB.
Should Zookeeper be installed separately?
A step-by-step guide would be much appreciated!
I cannot provide you with detailed step-by-step instructions to set up a ThingsBoard cluster, but I can point you in the right direction by sharing the different documents you need.
Bottom line, the following tasks must be completed:
Install and configure a ZooKeeper ensemble.
Check the ZooKeeper documentation for further installation details. Keep in mind that you need at least three different ZK nodes in a clustered environment and that you should always use an odd number of ZK nodes (3, 5, 7, ...). It is a very, very bad idea to build a cluster consisting of only two ZK nodes; check the split-brain condition that can appear under these circumstances! Basically, you set up the number of individual nodes you wish to use and change the configuration file to join the different nodes into an ensemble. This is documented quite well in the ZK docs.
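For illustration, the ensemble part of conf/zoo.cfg for a three-node setup looks roughly like this (hostnames and paths are placeholders); each node additionally needs a myid file under dataDir containing its own server id:
# conf/zoo.cfg (identical on all three ZooKeeper nodes)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888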
Install and configure a Cassandra cluster.
Again, you set up the number of individual nodes you need for your Cassandra cluster and modify the individual configuration files to join them into a cluster. Check the Cassandra documentation for details. Be sure to verify the configuration using the nodetool status command as described at the end of that document; all your nodes should be up and running.
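As a quick illustration of that check, run nodetool status on any node and expect every node to be reported as UN (Up/Normal); the addresses and figures below are made up:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns    Host ID   Rack
UN  10.0.0.11   1.2 GB   256     33.4%   ...       rack1
UN  10.0.0.12   1.1 GB   256     33.3%   ...       rack1
UN  10.0.0.13   1.3 GB   256     33.3%   ...       rack1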
Install and configure a ThingsBoard cluster.
Use the instructions provided with ThingsBoard single node setup.
Install Java
Skip External database installation
ThingsBoard service installation
Configure ThingsBoard to use the external database - Cassandra
Go to Cluster setup and apply the configuration steps described there (ZK, Cassandra and RPC); a sketch of these settings follows this list. Keep in mind to point to ALL members of your ZK and Cassandra clusters. You can also use IP addresses instead of host names.
Return to single node setup and run the installation script at ONE NODE only!
Start ThingsBoard service
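As a sketch of the cluster-setup step referenced above, the ThingsBoard configuration (e.g. /etc/thingsboard/conf/thingsboard.conf in a .deb/.rpm install) typically exports environment variables along these lines. The exact variable names depend on your ThingsBoard version, so treat them and the hosts below as placeholders and follow your version's Cluster setup page:
# thingsboard.conf (sketch only; verify variable names against your ThingsBoard version)
export DATABASE_TYPE=cassandra
export CASSANDRA_URL=cass1:9042,cass2:9042,cass3:9042,cass4:9042   # ALL Cassandra nodes
export ZOOKEEPER_ENABLED=true
export ZOOKEEPER_URL=zk1:2181,zk2:2181,zk3:2181                    # ALL ZooKeeper nodes
export RPC_HOST=<this_node_ip>                                     # this ThingsBoard node's own address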
If everything went well, you should be able to access your ThingsBoard nodes directly using the URL http://[NODE_IP]:8080. You can verify proper cluster operation by creating a tenant on one node and check its presence on another node.
I don't know if using an even number of ThingsBoard nodes is a good idea. The documentation does not mention anything about this.
One final remark: you could/should consider putting a proxy in front of your ThingsBoard cluster to provide load balancing for your web clients and improve the user experience. This way you don't have to share the individual host addresses with your users, and you prevent overloading a single node because everybody is using the same web address to access your dashboard(s). You could also proxy your MQTT broker to provide load balancing as well.
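For instance, a minimal HTTP load-balancing sketch with nginx in front of the four ThingsBoard nodes could look like this (hostnames are placeholders; WebSocket and MQTT proxying need additional configuration):
upstream thingsboard {
    server tb1.example.com:8080;
    server tb2.example.com:8080;
    server tb3.example.com:8080;
    server tb4.example.com:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://thingsboard;
        proxy_set_header Host $host;
    }
}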
Good luck in setting up your cluster!
ZooKeeper needs at least 3 nodes to run in cluster mode. Each node votes, and a majority (quorum) of the ensemble must be available; with 3 nodes the quorum is 2, so the cluster can tolerate the loss of one node.

Enable Spark on Same Node As Cassandra

I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through all the DataStax tutorials and they are very vague as to how you actually enable Spark. The only indication I can find is that it comes enabled automatically when you select the "Analytics" node type during install. However, I have an existing Cassandra node and I don't want to have to use a different machine for testing, as I am just evaluating everything on my laptop.
Is it possible to just enable Spark on the same node and deal with any performance implications? If so how can I enable it so that it can be tested?
I see the folders there for Spark (although I'm not positive all the files are present), but when I check whether it's set as a Spark master, it says that no Spark nodes are enabled:
dsetool sparkmaster
I am using Linux Ubuntu Mint.
I'm just looking for a quick-and-dirty way to get my data averaged and so forth, and Spark seems like the way to go since it's a massive amount of data, but I want to avoid having to pay to host multiple machines (at least for now, while testing).
Yes, Spark is also able to interact with a cluster even if it is not on all the nodes.
Package install
Edit the /etc/default/dse file and set the appropriate lines, depending on the type of node you want:
...
Spark nodes:
SPARK_ENABLED=1
HADOOP_ENABLED=0
SOLR_ENABLED=0
Then restart the DSE service
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
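For example, on a package install you could flip the flag and restart roughly like this (a sketch; double-check the file path and service name on your installation):
$ sudo sed -i 's/^SPARK_ENABLED=.*/SPARK_ENABLED=1/' /etc/default/dse
$ sudo service dse restart
$ dsetool sparkmaster        # should now print the Spark master address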
Tar Install
Stop DSE on the node and then restart it using the following command from the install directory:
...
Spark only node: $ bin/dse cassandra -k - Starts Spark trackers on a cluster of Analytics nodes.
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
Enable Spark by changing SPARK_ENABLED=1
using the command: sudo nano /usr/share/dse/resources/dse/conf/dse.default

How to create a Cassandra copy to a test machine?

We have a staging environment which runs a one node cluster completely separate from our production environment. What I'd like to do is copy this one node cluster over to a test machine that I have for the sole purpose of testing.
What is the correct way to do this? Both the server and the test server are running CentOS 6.x; the DSE version is 4.5.1 with Cassandra 2.0.8.39.
All you need is to follow the steps described in this document:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html.
If your test cluster's topology is different from the original cluster's, then you will need to use a tool like sstableloader:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_backup_snapshot_restore_t.html
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_move_cluster.html
http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsBulkloader_t.html
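As a rough sketch of that workflow (keyspace, table and host names are placeholders): take a snapshot on the staging node, copy the snapshot files to the test machine, and either restore them in place (same topology) or stream them in with sstableloader:
# On the staging node:
$ nodetool snapshot -t staging_copy my_keyspace
# Snapshot files end up under:
#   /var/lib/cassandra/data/my_keyspace/<table>/snapshots/staging_copy/

# After copying those files to the test machine, lay them out as
# my_keyspace/<table>/ and stream them into the test node:
$ sstableloader -d <test_node_ip> my_keyspace/<table>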
