We have a 10-node Cassandra cluster and would like to monitor its CPU load.
OpsCenter shows the load on the nodes ranging from 0 to the 20s, and sometimes it climbs into the 50s and 60s, but normally it stays below 25.
How high does the load need to go before we should consider it alarming?
It's fine as long as you don't have performance issues or timeouts.
Going above 80% is not recommended, because it leaves you no time to react. Also consider looking into what causes your 60% spikes.
It depends on your hardware and installation.
I recommend stress-testing your system, because we don't know how many writes and reads you are doing. As a rule of thumb, reserve one core for Linux and keep the load under 80%.
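The thresholds mentioned in these answers can be turned into a simple check. The following is a minimal sketch, not an OpsCenter feature: the function names and the "warn"/"alarm" labels are hypothetical, the 60% and 80% cut-offs come from the answers above, and the load-average ceiling assumes that "reserve one core" means applying the 80% target to the remaining cores.

```python
import os


def cpu_alert_level(cpu_percent, warn_at=60.0, alarm_at=80.0):
    """Classify one CPU utilization sample (0-100).

    warn_at and alarm_at default to the 60% and 80% figures from the
    answers; tune them after stress-testing your own cluster.
    """
    if cpu_percent >= alarm_at:
        return "alarm"
    if cpu_percent >= warn_at:
        return "warn"
    return "ok"


def load_average_ceiling(cores, reserved_cores=1, target=0.8):
    """Load average that corresponds to `target` utilization of the
    cores left over after reserving some for the operating system.

    Assumption: one runnable process per core is "100%".
    """
    usable = max(cores - reserved_cores, 1)
    return usable * target


if __name__ == "__main__":
    for sample in (20, 55, 65, 85):
        print(sample, cpu_alert_level(sample))
    print("load ceiling:", load_average_ceiling(os.cpu_count() or 1))
```

On an 8-core node, for example, this would put the load-average ceiling at roughly 5.6, so the occasional readings in the 50s and 60s would only be acceptable if OpsCenter is reporting a percentage rather than a load average.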
We have been running a 3-node Cassandra cluster in production for 4-5 months. We haven't faced any major issues; however, every day at 2200 hours (UTC) we see a spike in CPU utilization (see the attached graph).
I have confirmed there is no major load from our end at that time. Most of the time CPU utilization is < 1% (< 15% while querying and running a few jobs). We take snapshots at a different time, nowhere close to 2200 hours.
There is nothing unusual in system.log and debug.log.
One more important thing: the spike is pronounced on only one node. All of the nodes reach their highest CPU utilization at that time, but one node is always about three times as high as the others.
(Orange colors are first node. All the peaks are at 2200 hours)
Graph for all 3 nodes.
Graph for 2 nodes, excluding the first one (orange), as @PhillipBlum asked.
Is there a way to set, for each stage, how many failures I can tolerate when running a Spark job? For example, if I have 1000 nodes and tolerate 10 failures, then in a case where 5 nodes have failed, the job will not rerun their tasks and will simply ignore their results.
As a result, I will get a slightly less accurate answer, but this would shorten the running time, since I get a result without waiting for the failing nodes, assuming their execution is taking too long.
Thanks!
I think what you're looking for is
spark.speculation=true
which uses a heuristic to relaunch a task on another machine if it is clearly lagging.
This is from http://spark.apache.org/docs/1.2.0/configuration.html#scheduling
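To make the heuristic more concrete, here is a simplified model of it in plain Python. It is a sketch of the idea, not Spark's actual scheduler code: the function name and arguments are hypothetical, and the defaults only mirror the documented defaults of spark.speculation.multiplier (1.5) and spark.speculation.quantile (0.75). Note that speculation re-runs slow tasks elsewhere; it does not drop their results.

```python
from statistics import median


def speculation_candidates(finished, running, multiplier=1.5, quantile=0.75):
    """Return the ids of running tasks that look like stragglers.

    finished:   durations (seconds) of tasks already completed in the stage
    running:    dict mapping task id -> seconds elapsed so far
    multiplier: how many times slower than the median counts as lagging
    quantile:   fraction of the stage that must finish before checking
    """
    total = len(finished) + len(running)
    # Do nothing until enough of the stage has completed to trust the median.
    if total == 0 or len(finished) < quantile * total:
        return []
    threshold = multiplier * median(finished)
    return sorted(task for task, elapsed in running.items() if elapsed > threshold)


if __name__ == "__main__":
    done = [10, 11, 9, 10, 12, 10, 11, 10]
    in_flight = {"t8": 12, "t9": 40}
    print(speculation_candidates(done, in_flight))
```

In this example 8 of 10 tasks have finished with a median of 10 seconds, so only the task that has been running for 40 seconds exceeds the 15-second threshold and would be relaunched.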
I'm running a scikit-learn random forest on some rather large training datasets: ~1,600,000,000 rows with ~500 features. The platform is Ubuntu Server 14.04, and the hardware has 100 GB of RAM and 20 CPU cores.
The test datasets have about half as many rows.
I set n_jobs = 10 and forest_size = 3*number_of_features, so about 1700 trees.
If I reduce the number of features to about 350 it works fine, but with the full feature set of 500+ it never completes the training phase. The process is still executing and using about 20 GB of RAM, but it is using 0% of CPU. I have also successfully trained on datasets with ~400,000 rows but twice as many features, which completes after only about 1 hour.
I am being careful to delete any arrays/objects that are not in use.
Does anyone have any ideas I might try?
Installing the current master branch version, as suggested by orgrisel, worked for me. I did have to run "make clean" as described here.
The new version seems to be a really big improvement. I hope it is released soon.
Many thanks to orgrisel and the other contributors for such a great piece of software!
I had been using a 6GB dataset, with each source record ~1KB in length, when I accidentally added an index on a column that I am pretty sure has 100% cardinality.
I tried dropping the index from cqlsh, but by that point the two-node cluster had gone into a runaway death spiral, with loadavg surpassing 20 on each node, and cqlsh hung on the drop command for 30 minutes. Since this was just a test setup, I shut down and destroyed the cluster and restarted.
This is a fairly disconcerting problem, as it makes me fear a scenario where a junior developer is on a production cluster and sets an index on a similarly high-cardinality column. I scanned through the documentation and looked at the options in nodetool, but there didn't seem to be anything along the lines of "abort job" or "abort building index".
Test environment:
2x m1.xlarge EC2 instances, each with 2 ephemeral disks in RAID 0
Dataset was 6GB, 1KB per record.
My question in summary: is it possible to abort the process of building a secondary index, and/or to stop or postpone running builds (indexing, compaction) until a later date?
nodetool -h node_address stop index_build
See: http://www.datastax.com/docs/1.2/references/nodetool#nodetool-stop
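Because nodetool talks to one node at a time, the stop command generally has to be issued against each node that is still building. Below is a small sketch of how that could be scripted. The helper names are hypothetical, and the set of operation types is an assumption based on the nodetool 1.2 documentation linked above; check `nodetool help` on your version before relying on it.

```python
import subprocess

# Assumption: operation types accepted by `nodetool stop` in Cassandra 1.2.
STOPPABLE = {"COMPACTION", "VALIDATION", "CLEANUP", "SCRUB", "INDEX_BUILD"}


def nodetool_stop_argv(host, operation="INDEX_BUILD"):
    """Build the argument list for `nodetool -h <host> stop <operation>`."""
    op = operation.upper()
    if op not in STOPPABLE:
        raise ValueError("unsupported operation type: %s" % operation)
    return ["nodetool", "-h", host, "stop", op]


def stop_on_all(hosts, operation="INDEX_BUILD", run=subprocess.run):
    """Issue the stop command against every host in turn.

    `run` is injectable so the function can be exercised without a cluster.
    """
    for host in hosts:
        run(nodetool_stop_argv(host, operation), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    stop_on_all(["10.0.0.1", "10.0.0.2"], run=lambda argv, check: print(" ".join(argv)))
```

The dry run at the bottom only prints the commands; drop the `run=` argument to execute them on a machine where nodetool is on the PATH.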
I've got pretty unusual latency patterns in my production setup:
the whole cluster (3 machines: 48 GB RAM, 7500 rpm disk, 6 cores each) shows latency spikes every 10 minutes, on all machines at the same time.
See this screenshot.
I checked the log files and it seems there are no compactions taking place at that time.
I've got 2k reads and 5k reads/sec. No optimizations have been made so far.
Caching is set to "ALL"; the hit rate for the row cache is ~0.7.
Any ideas? Is tuning memtable size an option?
Best,
Tobias