Guide on smallest VM needed for a test/learning Cassandra setup

I want to rent a small collection of VMs so I can play around with and learn Cassandra. What's the minimum memory (and maybe disk) footprint you can get away with?
For the sake of the exercise, let's assume I'm starting with three nodes.

People have been running Cassandra on Raspberry Pis (watch a talk on YouTube, or look at some slides on SlideShare), so the minimum specs required are pretty low. The Raspberry Pi has about 500 MB of RAM and an ARM CPU running at 700 MHz.
For disk space, very little is needed before you start writing data; after that it scales roughly linearly with the data (but remember that removing things actually increases disk usage temporarily, and compactions require some extra space while they are running).
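If you want to keep an eye on that overhead, here is a hedged pair of checks (assuming a package install with data under /var/lib/cassandra):

    nodetool compactionstats          # pending compactions and bytes remaining
    du -sh /var/lib/cassandra/data    # current on-disk footprint of the data directory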
For playing around you probably won't need much more than the equivalent of a Raspberry Pi, but running it on, say, three EC2 m1.small instances works absolutely fine. I haven't run it on t1.micro, but it should work (the t1.micro is actually spec'd lower than a Raspberry Pi, although it probably has faster disk I/O).

Cassandra can scale down well for testing, so EC2 Small instances or equivalent will be fine. Later, if you plan on performance/load testing, you can scale up. Be sure to check and adjust the config, especially the JVM heap: size it according to the server spec, as in the sketch below.
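As a hedged example of pinning the heap for a small test node (assuming a tarball install of a 2.x/3.x release, where conf/cassandra-env.sh uses these variables instead of auto-sizing when both are set; the values are illustrative):

    export MAX_HEAP_SIZE="512M"   # total JVM heap for the node
    export HEAP_NEWSIZE="128M"    # young-generation size
    bin/cassandra -f              # start Cassandra in the foreground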
If you just want to learn the software, data modelling, CQL, etc., and replication isn't important just yet, then a single server would work fine too: just create keyspaces with a replication factor of 1. I run Cassandra on my laptop for development work and it's quite happy with 0.5 GB of JVM heap.
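A minimal sketch of such a keyspace (the keyspace name is illustrative):

    cqlsh -e "CREATE KEYSPACE learning WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"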

If you just want to play around with a cluster, and don't necessarily need something running all of the time, you can run a whole cluster on your own machine, without separate VMs for each node, using ccm.
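For instance, something like the following should bring up and tear down a local three-node cluster (the Cassandra version is an illustrative assumption):

    ccm create test -v 3.11.4 -n 3 -s   # create and start 3 local nodes
    ccm status                          # confirm all nodes are up
    ccm node1 cqlsh                     # open a CQL shell against node1
    ccm remove test                     # tear the cluster down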

Related

How to determine infra needs for a spark cluster

I am looking for some suggestions or resources on how to size up servers for a Spark cluster. We have enterprise requirements that force us to use on-prem servers only, so I can't try the task on a public cloud (and even if I used fake data for a PoC, I would still need to buy the physical hardware later). The org also doesn't have a shared distributed compute environment that I could use, and I wasn't able to get good internal guidance on what to buy. I'd like to have some idea of what we need before I talk to a vendor who would try to up-sell me.
Our workload
We currently have a data preparation task that is very parallel. We implement it in python/pandas/sklearn plus the multiprocessing package on a set of servers with 40 Skylake cores/80 threads and ~500 GB RAM each. We're able to complete the task in about 5 days by manually running it over 3 servers (each one working on a separate part of the dataset). The tasks are CPU-bound (100% utilization on all threads) and memory usage is usually low-ish (in the 100-200 GB range). Everything is scalable to a few thousand parallel processes, and some subtasks are even more parallelizable. A single chunk of data is in the 10-60 GB range (different keys can have very different sizes, and a single chunk of data has multiple things that can be done to it in parallel). All of this parallelism is currently very manual and clearly should be done using a real distributed approach. Ideally we would like to complete this task in under 12 hours.
Potential of using existing servers
The servers we use for this processing workload are often used on an individual basis. They each have dual V100s and do (single-node, multi-GPU) GPU-accelerated training for a big portion of their workload. They are operated bare metal, with no VMs. We don't want to lose the ability to use the servers on an individual basis.
Looking at typical Spark requirements, they also have the issues that (1) there is only 1 Gb Ethernet connections/switches between them, and (2) their SSDs are configured into a giant 11 TB RAID 10, and we probably don't want to change what the file system looks like when the servers are used on an individual basis.
Is there a software solution that could transform our servers into a cluster and back on demand, or do we need to reformat everything into some underlying Hadoop cluster (or something else)?
Potential of buying new servers
With the target of completing the workload in 12 hours, how do we go about selecting the correct number of nodes/node size?
For compute nodes:
- How do we choose the number of nodes?
- CPU/RAM/storage?
- Networking between nodes (our DC provides 1 Gb switches but we can buy custom)?
- Other considerations?
For storage nodes:
- Are they the same as compute nodes?
- If not, how do we choose what is appropriate (our raw dataset is actually small, <1 TB)?
- We extensively use a NAS as shared storage between the servers; are there special considerations for how this needs to work with a cluster?
I'd also like to understand how I can scale these numbers up/down while still being able to viably complete the parallel processing workload. This way I can get a range of quotes => generate a budget proposal for 2021 => buy servers ~Q1.
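For concreteness, this is the kind of executor arithmetic I'm trying to get a handle on, sketched as hypothetical spark-submit flags (all numbers are illustrative, not a recommendation, and prepare_data.py is a placeholder for our job):

    # If one server contributes 40 cores / ~500 GB RAM, a common carve-up
    # is 5 cores and ~32 GB per executor, i.e. ~8 executors per node.
    # Three such nodes would then give 24 executors / 120 cores:
    spark-submit \
      --master yarn \
      --num-executors 24 \
      --executor-cores 5 \
      --executor-memory 32g \
      prepare_data.py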

Is it a good idea to run Cassandra inside an LXC or Docker, in production?

I know it runs just fine, so it's OK for development, which is great, but won't it have considerably worse disk and/or network I/O performance because of AUFS?
If you put Cassandra data on a volume, disk I/O performance will be exactly the same as outside of containers, since AUFS will be bypassed entirely.
And even if you don't use a volume, performance will be fine as long as you don't commit Cassandra data into a new image to run that image later. And even if you do that, performance will be affected only during the first writes on each file; after that, it will be native.
You will not see any difference in network I/O performance, unless your containers are dealing with 100s of Mb/s of network traffic and/or 1000s of connections per second. In that case, you can use tools like Pipework to assign MACVLAN interfaces or even native physical interfaces to your containers.
We are actually running Cassandra in Docker in production and have had to work through a lot of performance issues.
Networking: you should run with --net=host to use the host's network stack; otherwise you will take a substantial hit to your network speeds. See this article for more information on recommended best practices.
Data volume: you should expose your data volume to the physical host. If you're operating in the cloud, note that where you place your data volume may limit your IOPS.
JVM: just because you run Cassandra in a container doesn't mean you can get away from tuning your JVM. You still need to modify it to account for the system resources of the host machine.
Cluster Name/Seeds: these need to be configured; hard-coded values in the config have to be found and replaced with environment variables (e.g. using sed) at container startup.
The big takeaway is that, like any software, you need to do some configuration; it's not 100% plug and play. A sketch pulling these points together is below.
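A hedged sketch of the above using the official cassandra image (the tag, paths, and addresses are illustrative assumptions, not our exact setup):

    # --net=host avoids the bridge-networking penalty; the volume keeps data
    # on the host filesystem; the heap variables size the JVM to the host
    # machine; cluster name and seeds come in as environment variables.
    docker run -d --name cass1 \
      --net=host \
      -v /data/cassandra:/var/lib/cassandra \
      -e MAX_HEAP_SIZE=8G \
      -e HEAP_NEWSIZE=800M \
      -e CASSANDRA_CLUSTER_NAME=mycluster \
      -e CASSANDRA_SEEDS=10.0.0.1,10.0.0.2 \
      cassandra:3.11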
Looking into the same thing; just found this on SlideShare:
"Docker uses Linux Ethernet Bridges for basic software routing. This will hose your network throughput. (50% hit)
Use the host network stack instead (10% hit)"

Azure compute instances

On Azure I can get 3 Extra Small instances for the price of 1 Small. I'm not worried about my site not scaling.
Are there any other reasons I should not go for 3 Extra Small instances instead of 1 Small?
See: Azure pricing calculator.
An Extra Small instance is limited to approx. 5 Mbps of bandwidth on the NIC (vs. approx. 100 Mbps per core with Small, Medium, Large, and XL) and has less than 1 GB of RAM. So, say you're running something that's very storage-intensive: you could run into bottlenecks accessing SQL Azure or Windows Azure storage.
As for RAM: if you're running 3rd-party apps, such as MongoDB, you'll likely run into memory issues.
From a scalability standpoint, you're right that you can spread the load across 2 or 3 Extra Small instances, and you'll have a good SLA. Just make sure your memory and bandwidth are good enough for your performance targets.
For more details on exact specs for each instance size, including NIC bandwidth, see this MSDN article.
Look at the fine print: the I/O performance is supposed to be much better with the Small instance than with the Extra Small instance. I'm not sure whether this is due to a technology bottleneck or a business decision, but that's the way it is.
Also, the OS presumably takes up a bit of RAM in each instance, so with 3 Extra Small instances you pay that cost three times instead of just once with a Small instance. That reduces the resources actually available to your application.
While 3 Extra Small instances may theoretically equal, or even beat, one Small instance on paper, remember that Extra Small instances do not have dedicated cores, and their raw computing resources are shared with other tenants. I've tried these Extra Small instances in an attempt to save money on a tiny-load website, and there were simply outages, or periods of horrible performance, that I found unacceptable.
In short: I would not use Extra Small instances for any sort of production environment.

spikes in traffic

I have a bare-bones setup on Amazon and wanted to know the better approach coming out of the gate on a new site, where we anticipate an occasional spike of traffic (from tech press) before we gradually build up 'real' membership traffic to a reasonable level.
I'm currently toying with two starter options:
1) one Node app (micro EC2) pointing to a separate EC2 server running both redis-server and mongod (mounting one combined 10 GB EBS volume), or
2) one Node app (micro EC2) running redis-server and mongod locally (but with two 10 GB EBS mounts: one for Redis and one for Mongo).
If traffic went crazy (tech press etc.), which is easiest/fastest to scale to handle the spike? I anticipate roughly equal reads and writes for Mongo and Redis, by the way, and I have no caching (other than that provided by CloudFront for assets like images and some CSS).
I can't speak to Redis, but for MongoDB you'll want to be sure that you run on an instance with sufficient RAM to hold your "working set" of data in memory. "Working set" means, roughly, the full set of data that your application accesses frequently. Consider Twitter, for instance: the working set of Twitter data is the most recent set of status updates across all users, as this is what is shown on web pages and provided via its APIs. For your application, the definition of working set may differ.
Mongo uses memory-mapped files for data access, which means that its performance is great when there is enough memory to hold the data you are accessing frequently, and can degrade when there is not. If you expect your data set to grow beyond about 2.5 gigabytes, you will also want to ensure that you are on a 64-bit instance -- on 32-bit instances, Mongo is limited to around 2.5 gigabytes of data, due to the limited memory address space available on such a platform. For more on MongoDB on EC2, see the Mongo docs on EC2 deployment on the wiki.
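If you want to watch this from the shell, here is a hedged check (the mem fields are reported in MB; "mapped" reflects the memory-mapped data files):

    # Hedged sketch: resident/virtual/mapped memory as MongoDB reports it
    mongo --eval 'printjson(db.serverStatus().mem)'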
I would also caution against using EC2 Micro instances in your production environment. The nature of Micros is that they have "burstable" but very limited CPU resources. If you get a spike of traffic due to tech press, it's likely that your application would be limited by EC2 to a very low amount of available CPU, which will cause performance to suffer. You can mitigate this to a certain extent with load balancing and many Micro instances, but it may be more cost-effective and less complex to simply use Large instances for both Mongo/Redis and your application servers.
You may want to have a look at this question, since IMO, the answer also applies to your situation:
Benefits of deploying multiple instances for serving/data/cache
I would never put mongod and redis-server on the same box. MongoDB is expected to swap due to its use of memory-mapped files, and will generate swapping activity if the data cannot fit in RAM. Redis does not use data structures that are compatible with swapping (the way MongoDB does with its B-trees), and will become unresponsive if its memory is swapped out. Currently, there is no easy way to lock Redis in memory.
So I would put Redis and the app server on the same box, and isolate MongoDB on its own box.
Depending on the size of the data you want to store in Redis, I would pick a Large or a Small EC2 instance. Redis works well on 32 bits, but memory is limited. For MongoDB, a 64-bit box is almost mandatory. In any case, I would avoid Micro instances like the plague.

MongoDB is using too much memory

I'm working with MongoDB on a 32-bit CentOS VPS with 1 GB of memory.
It works fine most of the time, but every now and then its memory usage spikes and crashes my server.
Is there a way to prevent this, for example by limiting the memory and CPU that the MongoDB daemon uses?
I was thinking of running the Mongo daemon with ionice and giving it a low priority, but I'm not sure if that would work.
Any help or pointers are welcome!
It is not currently possible to limit the amount of memory MongoDB uses. MongoDB accesses its data files through a memory-mapped file mechanism, so the amount of memory used is governed by the system: the more data you touch, the more RAM you need.
I'm guessing you're also running everything else on that same server?
Really, the best way to run Mongo is to put it on its own server, where things like Apache, MySQL, etc. won't jump up and interfere with the RAM it wants to use. I had the same problem myself: under heavy use, the server would go into swap and choke itself every once in a while.
You'd probably be better off with two 512 MB servers, which is hopefully comparable in price (one running Mongo, and one running the rest). I also thought about running a VM with Mongo on it inside the VPS, but that fell into the "too much effort" category for me.
And yeah, as dcrosta says, use 64-bit, unless you want to limit your data size to under 2GB (and why would you want to do that?)
I did have similar problems when I was using lots of map/reduce, where memory leaks and crashes were frequent. I don't use map/reduce anymore, and there have been no memory leaks or crashes for many months now.
