Setting up a 3-4 node, resource-light Cassandra test cluster (e.g. in Linux containers)?

Trying to see if it is possible to set up a 3 or 4 node Cassandra cluster with minimal resource requirements, installed on something like a single VM, either inside Linux containers or directly on the single VM using different port numbers or virtual NICs/IPs.
This will be used for an application demonstration where I'd like to show datastore high availability, data partitioning, and dynamic addition/removal of cluster nodes.
The setup would run on a VM on a laptop, so resources are a constraint (i.e. the RAM and vCPUs that can be allocated for this purpose). The actual data stored would also be quite limited: let's say everything in a single keyspace, about 10 tables with 10-odd columns each, and around 1000 rows.

From your description, it sounds like ccm might be the tool for you. With it, you can create local clusters on your laptop (or in a VM, I suppose), add nodes, delete nodes, etc. It can be easily installed on Linux, macOS, or Windows. You don't need to use a VM; that's your choice, though I imagine you would see some performance degradation in one.
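For example, a rough sketch of a ccm session for this kind of demo (the cluster name, Cassandra version and node addresses below are just placeholders, pick whatever suits you):

# create and start a 3-node local cluster on one machine
ccm create demo -v 3.11.14 -n 3 -s
ccm status                              # show nodes and their state

# demonstrate node failure / removal and dynamic addition
ccm node3 stop                          # simulate a node going down
ccm add node4 -i 127.0.0.4 -j 7400 -b   # add a 4th node that bootstraps into the ring
ccm node4 start

# tear everything down after the demo
ccm remove demo

ccm binds each node to its own loopback address (127.0.0.1, 127.0.0.2, ...), so the whole cluster fits on a single host without extra NICs or containers.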

Related

cassandra 3.11.x mixing versions

We have a 6-node Cassandra 3.11.3 cluster on Ubuntu 16.04. These are virtual machines.
We are switching to physical machines on brand (8!) new servers that will have Debian 11 and presumably Cassandra 3.11.12.
Since the main version is always 3.11.x and Ubuntu 16.04 is out of support, the question is: can we just let the new machines join the old cluster and then decommission the outdated ones?
I hope to get some tips about this because intuitively it seems fine, but we are not too sure about that.
Thank you.
We have a 6-node Cassandra 3.11.3 cluster on Ubuntu 16.04. These are virtual machines. We are switching to physical machines on brand (8!) new servers.
Quick tip here: it's a good idea to build your clusters in multiples of your RF. Not sure what your RF is, but if RF=3, I'd either stay with six nodes or get one more and go to nine. It's all about even data distribution.
can we just let the new machines join the old cluster and then decommission the outdated ones?
In short, no. You'll want to upgrade the existing nodes to 3.11.12 first. I can't recall whether 3.11.3 and 3.11.12 are SSTable compatible, but I wouldn't risk it.
Secondly, the best way to do this is to build your new (physical) nodes into the cluster as their own logical data center. Start them up empty, then run a nodetool rebuild on each. Once that's complete, decommission the old nodes.
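A rough sketch of that sequence, assuming the old data center is called DC1, the new one DC2, and a keyspace named my_ks (all placeholders):

# make sure each keyspace replicates into the new DC before rebuilding
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};"

# on each new node in DC2, stream its data from the old data center
nodetool rebuild -- DC1

# once every new node has finished rebuilding, on each old node in turn:
nodetool decommission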
There is a bit simpler solution: move the data from each virtual machine onto a physical server, as follows:
1. Prepare the Cassandra installation on the physical machine, configuring the same cluster name, etc.
2. Stop Cassandra in the virtual machine and make sure that it won't start again.
3. Copy all Cassandra data (/var/lib/cassandra or similar) from the VM to the physical server.
4. Start the Cassandra process on the physical server.
Repeat that process for all VM nodes, updating seeds, etc. along the way. After the process is finished, you can add the two physical servers that are left over. Also, to speed up the process, you can do an initial copy of the data before stopping Cassandra in the VM and then, after it's stopped, re-sync the data with rsync or similar. This way you can minimize the downtime.
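A minimal sketch of that copy for one node, assuming the default /var/lib/cassandra data directory and a made-up target hostname (adjust paths, hosts and the service command to your setup):

# first pass while Cassandra is still running on the VM, to move the bulk of the data
rsync -avz /var/lib/cassandra/ phys-node1:/var/lib/cassandra/

# stop Cassandra on the VM and keep it from restarting
sudo systemctl stop cassandra
sudo systemctl disable cassandra

# second pass to pick up anything written since the first copy
rsync -avz --delete /var/lib/cassandra/ phys-node1:/var/lib/cassandra/

# start Cassandra on the physical server
ssh phys-node1 'sudo systemctl start cassandra'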
This approach would be much faster than adding a new node and decommissioning the old one, as we won't need to stream the data twice. It works because, after a node is initialized, Cassandra identifies nodes by their assigned host UUID, not by IP address.
Another approach is to follow the instructions for replacement of a dead node. In this case streaming of the data will happen only once, but it could be a bit slower compared to a direct copy of the data.
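If you go the dead-node-replacement route, the usual mechanism is the replace_address_first_boot JVM option on the new, empty node; a sketch (the IP is a placeholder for the dead node's address):

# in cassandra-env.sh (or jvm.options) on the replacement node, before its first start:
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"

# then start Cassandra; the node streams the dead node's data and takes over its tokens
sudo systemctl start cassandra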

How to scale Azure VM Cores

I have Python code that I need to run on 1000 CSVs in parallel to do calculations. One CPU core can finish running the code over each CSV in 8 hours.
Thus I am looking for a way to use Azure for this. I would like to create several virtual machines, say 4x D5v2 with 16 cores each, and use them as if they were a single Windows Server running on a 64-core machine.
I tried to create these VMs in the same Cloud Service and put them into the same Availability Set, which worked fine. When all VMs are running and I access any one of them, I see that the cores on all other VMs are allocated to "Other Roles".
My questions are:
1) Is it possible to create a hypothetical VM out of 4 VMs to use more cores?
2) How can I manually allocate all cores in the Cloud Service to one specific VM?
Your best solution would be to use Azure Batch. With Batch you create a job, and it will run on as many CPUs as you specify.
Taken from the Batch front page
When you are ready to run a job, Batch starts a pool of compute virtual machines for you, installing applications and staging data, running jobs with as many tasks as you have, identifying failures and re-queuing work and scaling down the pool as work completes. You have control over scale to meet deadlines, manage costs, and run at the right scale for your application.
1) Is it possible to create a hypothetical VM out of 4 VMs to use more cores?
No, you cannot.
2) How can I manually allocate all cores in the Cloud Service to one specific VM?
You cannot do this. You need to use a cloud-native solution to scale your process over multiple resources.
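As a rough illustration of the Batch model (the pool/job names, VM image and script name are made up, and exact flags can differ between az versions, so treat this as a sketch and check the az batch docs):

# assumes a Batch account already exists and you are logged in to it
az batch pool create --id csv-pool --vm-size Standard_D5_v2 --target-dedicated-nodes 4 \
    --image canonical:ubuntuserver:18.04-lts --node-agent-sku-id "batch.node.ubuntu 18.04"
# (to run several tasks per node at once, you would also raise the pool's task-slots-per-node setting)

# create a job bound to that pool
az batch job create --id csv-job --pool-id csv-pool

# one task per CSV; Batch schedules the 1000 tasks across every core in the pool
for i in $(seq 1 1000); do
    az batch task create --job-id csv-job --task-id "csv-$i" \
        --command-line "python process.py data$i.csv"
done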

Cassandra cluster on budget

I am learning Cassandra and want to run a cloud based cluster. I don't care much about speed.
What I want to really test is the replication and recovery features.
I would be running tests like
taking nodes offline every once in a while
kill -9 cassandra
powering off server
manually corrupting sstables/commitlog (not sure if this is recoverable)
I am thinking of going for a 4 node cluster.
Each node will have the following config:
2 GB RAM
10 GB SSD
2 CPUs (Virtual)
Two nodes will be in a European datacenter and other two will be in a North American data center.
I know 8GB is the recommended minimum for Cassandra. But that config would be quite expensive.
If it helps, I can run one more VM on a dedicated box. This VM can have 16 GB RAM and 8 virtual CPUs. I could also run 4 VMs with 4 GB RAM each on this box. But I guess having 4 separate VMs in different data centers would make a more realistic setup and bring to the fore any issues that may arise from network problems, latencies, etc.
Is it okay to run Cassandra on machines with this config? Please share your thoughts.
Many people run multiple instances of Cassandra on modern laptops using ccm ( https://github.com/pcmanus/ccm ). If you just want to get an idea of what it does (create a 3-node cluster, add data, add a 4th node, create a snapshot, remove a node, add it back, restore the snapshot, etc.), using ccm on a PC may be 'good enough'.
Otherwise, you can certainly run with less than 1 GB of RAM, but it's not always fun. There have been some clusters on tiny hardware ( http://www.datastax.com/dev/blog/32-node-raspberry-pi-cassandra-cluster ). Depending on your budget, making a cluster of Raspberry Pis may be as cost effective as your 2-VM cluster.
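If you do go with the small VMs, you would typically cap the JVM heap explicitly so a 2 GB node doesn't over-commit memory; a minimal sketch (the values are just a starting point, not a recommendation):

# in conf/cassandra-env.sh on each node (or exported before starting Cassandra)
MAX_HEAP_SIZE="1G"      # total JVM heap; leaves the rest of the 2 GB for off-heap structures and the OS
HEAP_NEWSIZE="256M"     # young-generation size, commonly about 1/4 of MAX_HEAP_SIZE with CMS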

Cassandra compaction tasks stuck

I'm running DataStax Enterprise in a cluster consisting of 3 nodes. They are all running on the same hardware: 2-core Intel Xeon 2.2 GHz, 7 GB RAM, 4 TB RAID-0.
This should be enough for running a cluster with a light load, storing less than 1 GB of data.
Most of the time everything is just fine, but it appears that the running tasks related to the Repair Service in OpsCenter sometimes get stuck; this causes instability in that node and an increase in load.
However, if the node is restarted, the stuck tasks don't show up and the load is back at normal levels.
Because we don't have much data in our cluster, we're using the min_repair_time parameter defined in opscenterd.conf to delay the repair service so that it doesn't complete too often.
It really seems a little bit weird that tasks which are marked as "Complete" and show a progress of 100% don't go away. And yes, we've waited hours for them to disappear, but they won't; the only way we've found to solve this is to restart the nodes.
Edit:
Here's the output from nodetool compactionstats
Edit 2:
I'm running under Datastax Enterprise v. 4.6.0 with Cassandra v. 2.0.11.83
Edit 3:
This is the output from dstat on a node that is behaving normally
This is the output from dstat on a node with stuck compaction
Edit 4:
Output from iostat on a node with stuck compaction, note the high "iowait"
Azure storage
Azure divides disk resources among storage accounts under an individual user account. There can be many storage accounts in an individual user account.
For the purposes of running DSE [or Cassandra], it is important to note that a single storage account should not be shared between more than two nodes if DSE [or Cassandra] is configured like the examples in the scripts in this document. This document configures each node to have 16 disks. Each disk has a limit of 500 IOPS. This yields 8,000 IOPS when configured in RAID-0. So, two nodes will hit 16,000 IOPS and three would exceed the limit.
See details here
So, this has been an issue under investigation for a long time now, and we've found a solution. However, we aren't sure what the underlying problem causing the issues was, but we got a clue, even though nothing can be confirmed.
Basically what we did was set up a RAID-0, also known as striping, consisting of four disks, each 1 TB in size. We should have seen somewhere around 4x a single disk's IOPS when using the stripe, but we didn't, so something was clearly wrong with the setup of the RAID.
We used multiple utilities to confirm that the CPU was waiting for the I/O to respond most of the time whenever we considered the node "stuck". Clearly something with the I/O, and most probably our RAID setup, was causing this. We tried a few different mdadm settings etc., but didn't manage to solve the problems using the RAID setup.
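For reference, the kind of commands involved in that setup and diagnosis (the device names are placeholders for the four data disks):

# build the 4-disk RAID-0 stripe described above
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf

# watch per-device utilisation and iowait while the node is under load
iostat -x 5
dstat --disk --cpu 5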
We started investigating Azure Premium Storage (which is still in preview). This enables attaching disks to VMs whose underlying physical storage actually consists of SSDs. So we said, well, SSDs => more IOPS, so let us give this a try. We did not set up any RAID using the SSDs; we are only using a single SSD disk per VM.
We've been running the cluster for almost 3 days now and have stress-tested it a lot, but we haven't been able to reproduce the issues.
I guess we didn't get down to the real cause, but the conclusion is that some of the following must have been the underlying cause of our problems:
Too slow disks (writes > IOPS)
RAID set up incorrectly, which caused the disks to perform abnormally
These two problems go hand in hand, and most likely we simply set up the disks in the wrong way. However, SSDs = more power to the people, so we will definitely continue using SSDs.
If anyone experiences the same problems we had on Azure with RAID-0 on large disks, don't hesitate to add to this thread.
Part of the problem you have is that you do not have a lot of memory on those systems, and it is likely that even with only 1 GB of data per node, your nodes are experiencing GC pressure. Check the system.log for errors and warnings, as this will provide clues as to what is happening on your cluster.
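A quick way to spot that, assuming the default packaged log location (the path may differ on a DSE install):

# long GC pauses and dropped messages show up as GCInspector / StatusLogger warnings
grep -E "GCInspector|StatusLogger|dropped" /var/log/cassandra/system.log | tail -50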
The rollups_60 table in the OpsCenter schema contains the lowest (minute level) granularity time series data for all your Cassandra, OS, and DSE metrics. These metrics are collected regardless of whether you have built charts for them in your dashboard so that you can pick up historical views when needed. It may be that this table is outgrowing your small hardware.
You can try tuning OpsCenter to avoid this kind of issue. Here are some options for configuration in your opscenterd.conf file:
Adding keyspaces (for example the opsc keyspace) to your ignored_keyspaces setting
You can also decrease the TTL on this table by tuning the 1min_ttl setting
Sources:
Opscenter Config DataStax docs
Metrics Config DataStax Docs
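A hedged sketch of what those two settings might look like in opscenterd.conf (section and option names as described in the linked docs; the keyspace list and TTL value are placeholders, and opscenterd needs a restart afterwards):

# append metric-collection tuning to the OpsCenter daemon config
cat >> /etc/opscenter/opscenterd.conf <<'EOF'
[cassandra_metrics]
ignored_keyspaces = system, system_traces, OpsCenter
1min_ttl = 86400
EOF
sudo service opscenterd restart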

Can HBase, MapReduce and HDFS work on a single machine that has Hadoop installed and running on it?

I am working on a search engine design, which is to be run on the cloud.
We have just started, and don't have much of an idea about Hadoop.
Can anyone tell me whether HBase, MapReduce and HDFS can work on a single machine that has Hadoop installed and running on it?
Yes you can. You can even create a virtual machine and run it there, on a single "computer" (which is what I have :) ).
The key is to simply install Hadoop in "Pseudo Distributed Mode", which is described in the Hadoop Quickstart.
If you use the Cloudera distribution, they have even created the configs needed for that in an RPM. Look here for more info on that.
HTH
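For the classic Hadoop 1.x layout described in the answers here (JobTracker/TaskTracker), the pseudo-distributed startup boils down to something like this (it assumes you are in the Hadoop install directory and have already edited the conf files as the Quickstart shows):

# format the namenode once, then start all HDFS and MapReduce daemons on localhost
bin/hadoop namenode -format
bin/start-all.sh

# HBase (with its bundled ZooKeeper) is started separately, from the HBase install directory
bin/start-hbase.sh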
Yes. In my development environment, I run
NameNode (HDFS)
SecondaryNameNode (HDFS)
DataNode (HDFS)
JobTracker (MapReduce)
TaskTracker (MapReduce)
Master (HBase)
RegionServer (HBase)
QuorumPeer (ZooKeeper - needed for HBase)
In addition, I run my applications, and map and reduce tasks launched by the task tracker.
Running so many processes on the same machine results in a lot of contention for CPU cores, memory, and disk I/O, so it's definitely not great for high performance, but there is no limitation other than the amount of resources available.
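If you want to confirm that all of those daemons are actually up, jps (which ships with the JDK) lists the running Java processes by class name:

# expect one entry per daemon listed above (the HDFS, MapReduce, HBase and ZooKeeper processes)
jps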
Same here, I am running Hadoop/HBase/Hive on a single computer.
If you really, really want to see distributed computing on a single computer, grab lots of RAM and some hard disk space and go like this:
make one or two virtual machines (use VirtualBox)
install Hadoop on each of them, make your real installation (not a virtual one) the master and the rest slaves
configure Hadoop for a real distributed environment
now when Hadoop starts, you should actually have a cluster of multiple computers (one real, the rest virtual)
this could just be an experiment, because unless you have a decent multi-CPU or multi-core system, such a configuration will actually spend more on maintaining itself than on giving you any performance.
Good luck.
--l4l
