Don't understand the Cassandra JMX MBean metrics - cassandra

There are several questions here about the Cassandra JMX MBeans.
1. How is the Mean attribute calculated?
I have been monitoring metrics with JConsole and looking at the values of the write MBean.
There is a Mean attribute on the write MBean, and I don't know how Cassandra computes that value; I doubt that the value is right.
I made a JUnit test:
Timer latency = new Timer();
latency.update(timeTaken, TimeUnit.MILLISECONDS);
I fed in three values, 0, 1 and 2 milliseconds,
and expected the mean to be 1000 microseconds, but it is actually 1131.752 microseconds, which confuses me a lot.
3. There are the Mean attribute, the 50thPercentile attribute, etc.,
but I cannot get an instantaneous value when I want to see a quick change in the cluster; none of these attributes reflects the change.

It's important to note that the latencies are estimates, not exact values. Cassandra cannot store every latency that has occurred or it would run out of memory, so it keeps an approximate reservoir or histogram of the latencies (depending on the version) that it uses to calculate the statistics. Assuming you're on C* 2.2 or later, it stores a histogram of the latencies and can calculate an approximate mean, min, max and percentiles within a given error percentage.
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/EstimatedHistogram.java#L227 is the mean calculation. Since each bucket represents a range of latencies, it uses the high end of the bucket, so at worst it will always be higher than the actual latency.
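To make the bias concrete, here is a minimal sketch of that style of bucketed mean; the bucket boundaries and names are made up for illustration, not taken from the real EstimatedHistogram:
// Illustrative only: bucketOffsets[i] is treated as the UPPER bound of bucket i,
// so every sample is counted at the high end of its bucket and the resulting
// mean can only err on the high side.
static long approximateMean(long[] bucketOffsets, long[] buckets) {
    long elements = 0;      // total number of samples
    long weightedSum = 0;   // samples weighted by their bucket's upper bound
    for (int i = 0; i < buckets.length; i++) {
        elements += buckets[i];
        weightedSum += buckets[i] * bucketOffsets[i];
    }
    return elements == 0 ? 0 : (long) Math.ceil((double) weightedSum / elements);
}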
Before 2.2 this was kept differently (see http://metrics.dropwizard.io/3.1.0/ for details).
Aside: the mean is a pretty bad statistic to go by for latencies, so you shouldn't put too much stock in it; percentiles are better to look at.

Related

Controlling batch size to hint scheduler performance

I would like to manually tune how big my mini-batches are (in terms of cardinality). A way to set the maximum number of events would be enough, but being able to set both a maximum and a minimum would be better.
The reason I want to play around with this is that I know for a fact that my processing code does not scale linearly.
In my particular case I'm not doing time-frame aggregation, so I don't really care about it; what I care about is draining the "input queue" as soon as possible (by hinting to the engine how many elements to process at a time).
However, if there's no way to set the max/min batch cardinality directly, I could probably work around the limitation with a dummy time-aggregation approach by stamping my input data before Spark consumes it.
Thanks

How to obtain row count estimates in Cassandra using the Java client driver

If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of the row count in a Cassandra table from a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX, but I'd rather not assume JMX has been configured. (I think that result must be multiplied by the number of nodes and divided by the replication factor.) Ideally the estimate would include clustering keys too, but everything is on the table.
I also see there's a size_estimates table in the system keyspace, but I don't see much documentation on it. Is it periodically refreshed, or do the admins need to run something like nodetool flush?
Aside from not including clustering keys, what's wrong with using this as a very rough estimate?
select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';
The size_estimates table is updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).
This is a very rough estimate, but from the token of the partition key you could find the range it belongs to, pull this table from each of the replicas (it is local and unique to each node, not global), and divide the size by the number of partitions for a very vague, approximate estimate of the partition size. There are many assumptions and a lot of averaging on this path even before anything is written to this table. Cassandra errs on the side of efficiency at the cost of accuracy here, and the table is intended more for general uses like Spark bulk reading, so take it with a grain of salt.
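If you do want to query it from the driver, a rough sketch (assuming a com.datastax.driver.core.Session, and that you supply the node count and replication factor yourself; the helper name is made up) could look like this:
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Very rough partition-count estimate from system.size_estimates. The table is
// local to the node that answers the query, hence the scale-up by node count
// and the division by replication factor, both of which you must know yourself.
static long estimatePartitions(Session session, String keyspace, String table,
                               int numNodes, int replicationFactor) {
    ResultSet rs = session.execute(
        "SELECT partitions_count FROM system.size_estimates"
        + " WHERE keyspace_name = ? AND table_name = ?",
        keyspace, table);
    long localPartitions = 0;
    for (Row row : rs) {
        localPartitions += row.getLong("partitions_count");
    }
    return localPartitions * numNodes / replicationFactor;
}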
It's not useful right now, but looking towards the future after the 4.0 freeze, there will be many new virtual tables, possibly including ones that give accurate statistics for specific partitions and ranges of partitions on demand.

Spark count dataframe to estimate output partitions, then write, efficiently without caching?

As my Spark program runs on more data, I think I am crashing because I'm picking up the default number of output partitions for aggregation, namely 200. I've learned how to control this, but ideally I would set the number of output partitions based on the amount of data I'm writing. Herein lies the conundrum: I need to first call count() on the dataframe and then write it. That means I may re-read it from S3 twice. I could cache and then count, but I've seen Spark crash when I cache this data; caching seems to use the most resources, whereas if I just write it, it can do something more optimal.
So my questions are: do you think this is a decent approach, doing a count first (the count is a proxy for the size on disk), or should you just hard-code some numbers and change them when you need to? And if I am going to count first, is there some clever way to optimize things so that the count and the write share work, other than caching the whole dataframe?
Yes, the count approach is actually the correct way to go. Ideally you want your RDD partitions to be of some considerable size, like 50 MB, before writing. Otherwise you will end up with the "small file problem".
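As a rough illustration of the count-then-size approach with the Java Dataset API (the S3 paths, the assumed average row size and the 50 MB target are all placeholders you would tune for your own data):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CountThenWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("count-then-write").getOrCreate();

        Dataset<Row> df = spark.read().parquet("s3a://bucket/input/");   // placeholder path

        long rows = df.count();                          // first pass over the input

        // Assumed knobs: average serialized row size and a ~50 MB target per output file.
        long avgRowBytes = 200L;
        long targetBytesPerPartition = 50L * 1024 * 1024;
        int numPartitions = (int) Math.max(1, (rows * avgRowBytes) / targetBytesPerPartition);

        df.repartition(numPartitions)                    // second pass re-reads from S3
          .write()
          .parquet("s3a://bucket/output/");              // placeholder path
    }
}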
Now if you have a lot of data, caching it in memory can be hard. You could try MEMORY_AND_DISK, but then the data will spill to disk and cause a slowdown.
I have faced this predicament multiple times, and every time I have chosen a "magic number" for the number of partitions. The number is parameterized, so when I need to change it I don't need to change the code; I just pass a different parameter.
If you know your data size is generally within a particular range, you could hard-code the partition number. It is not ideal, but it gets the job done.
You could also publish metrics such as the size of the data in S3 and raise an alarm if it breaches some threshold, so that someone can change the partition number manually.
In general, if you keep the partition number moderately high, like 5000 for approximately 500 GB of data, it works for a large range, i.e. from 300 GB to 1.2 TB of data. This means you probably don't need to change the partition number very often if you have a moderate inflow of data.

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one-day window). I realize other answers on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, a day-long aggregation sounds viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
The function is documented in https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html as:
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would indeed use a constant amount of memory, making a daily window possible, assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day, so stand-alone batch aggregations are far more resource-efficient.
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all of your data arrives. A one-day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.
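For reference, the incremental form looks roughly like this with the Java streaming API; the counts-per-key use case, the one-day window and the 5-minute slide are assumptions for illustration:
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// counts is assumed to be a DStream of (key, 1L) pairs built elsewhere. The
// inverse function lets Spark subtract the batch that slides out of the window
// instead of re-reducing the whole day, which is what keeps the per-interval
// work roughly constant. Checkpointing must be enabled on the streaming context.
static JavaPairDStream<String, Long> dailyCounts(JavaPairDStream<String, Long> counts) {
    return counts.reduceByKeyAndWindow(
        (a, b) -> a + b,               // reduce: data entering the window
        (a, b) -> a - b,               // inverse reduce: data leaving the window
        Durations.minutes(24 * 60),    // window length: one day
        Durations.minutes(5));         // slide interval
}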

Statsd & Graphite statistics per given item, not time

It is easy to process metrics with StatsD and Graphite, assuming they are measured per timespan. For example, it is easy to track the number of requests per second.
On the other hand, sometimes it is useful to track a metric based on a given "base item". For example, I process a data set and I want to track the percentage of invalid fields, the number of actions necessary to process the data set, etc. I can easily see results like "we had 10 invalid values in the data set per second" and "we processed 100 data fields per second on average", but I would rather see something like "in 100 fields, there are 10 invalid values".
The results are similar when processing these fields takes a similar amount of time. However, if it varies (and especially if the time differs according to the nature of the data), the time-based statistic is somewhat confusing and does not reflect what I want to see.
Are there any ways to set up StatsD / Graphite to solve the issue I have described?
Creating a more meaningful relationship of time-series data at the boundary of StatsD/Graphite is quite difficult because, as you alluded to in the question, the data (used for deriving the percentage) is only related by time and key.
That said, for this type of data I've set up "percentage graphs" using asPercent(). Like this:
asPercent(stats_counts.myapp.messages.{ignored,dropped,recycled},
stats_counts.myapp.messages.received)
You could also consider pushing this down into your application, performing the calculation where you can better relate the data, and sending the result to StatsD as a gauge.
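For instance, a sketch of that application-side approach, assuming the timgroup java-statsd-client (the class, host, prefix and metric names are placeholders, so adapt them to the client you actually use):
import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

public class DatasetMetrics {
    // Placeholder prefix, host and port.
    private final StatsDClient statsd =
        new NonBlockingStatsDClient("myapp.dataset", "statsd.example.com", 8125);

    // Called once per processed data set: the two counts are related here,
    // inside the application, and only the ratio is shipped as a gauge.
    public void reportInvalidFields(long totalFields, long invalidFields) {
        long invalidPercent = totalFields == 0 ? 0 : invalidFields * 100 / totalFields;
        statsd.recordGaugeValue("invalid_fields_percent", invalidPercent);
    }
}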
