How to output gridgain/ignite statistics to a file - gridgain

How can I capture operation statistics such as puts/sec and gets/sec per server from Ignite/GridGain?
Is it possible to output them to a file so that we can analyze them later?

Cache stats for a particular server node can be acquired with the IgniteCache.metrics(ClusterGroup grp) method, like this:
ClusterGroup grp = ignite.cluster().forNodeId(SERVER_NODE_ID);
CacheMetrics metrics = cache.metrics(grp);
long puts = metrics.getCachePuts();
long gets = metrics.getCacheGets();
You can fetch them periodically, calculate throughput for the elapsed interval (you will have to keep the previous snapshot), and log the results to a file.
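For example, a minimal sketch of such a periodic logger (it assumes the ignite, cache and SERVER_NODE_ID variables from above, the usual java.util.concurrent and java.io imports, a 10-second window and a cache-stats.log file, all of which are illustrative):
ClusterGroup grp = ignite.cluster().forNodeId(SERVER_NODE_ID);
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
AtomicLong lastPuts = new AtomicLong();
AtomicLong lastGets = new AtomicLong();
scheduler.scheduleAtFixedRate(() -> {
    CacheMetrics metrics = cache.metrics(grp);
    long puts = metrics.getCachePuts();
    long gets = metrics.getCacheGets();
    // Throughput over the last 10-second window, based on the previous snapshot.
    double putsPerSec = (puts - lastPuts.getAndSet(puts)) / 10.0;
    double getsPerSec = (gets - lastGets.getAndSet(gets)) / 10.0;
    try (PrintWriter out = new PrintWriter(new FileWriter("cache-stats.log", true))) {
        out.printf("%d,%.1f,%.1f%n", System.currentTimeMillis(), putsPerSec, getsPerSec);
    } catch (IOException e) {
        e.printStackTrace();
    }
}, 10, 10, TimeUnit.SECONDS);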
Note that metrics are disabled by default for performance reasons. To enable them, set the statisticsEnabled flag on CacheConfiguration to true:
cacheCfg.setStatisticsEnabled(true);
Hope this helps.

Related

How to set spark Kusto connector write polling interval?

I'm writing data to Kusto using azure-kusto-spark, and I see that the write has high latency. Looking at the debug logs from the Spark cluster, I see the KustoConnector polling on write, and I believe there is a long default polling interval. Is there a way to configure it to a shorter interval?
In the azure-kusto-spark codebase I see this piece of code, which I think is responsible for the polling:
def finalizeIngestionWhenWorkersSucceeded(
...
DelayPeriodBetweenCalls,
(writeOptions.timeout.toMillis / DelayPeriodBetweenCalls + 5).toInt,
res => res.isDefined && res.get.status == OperationStatus.Pending,
res => finalRes = res,
maxWaitTimeBetweenCalls = KDSU.WriteMaxWaitTime.toMillis.toInt)
.await(writeOptions.timeout.toMillis, TimeUnit.MILLISECONDS)
....
I'm not sure I understand it correctly.
The polling operation just checks whether the data was ingested into Kusto. It has a maximum timeout, but this is not what causes the latency.
I believe the latency comes from the batching ingestion policy of your Kusto database - see the details here: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/batchingpolicy
By default it's 5 minutes, so you might want to reduce this time if that won't have any negative impact (see the doc). Note that the policy should be changed on the database, not the table, for the following reason: the connector creates a temporary table in the same Kusto database and first inserts into this temporary table. So even if you change the policy of your destination table, it will still take at least 5 minutes to write to the temporary table.
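If reducing it is acceptable, the batching policy can be changed with a Kusto management command along these lines (the database name and the values are illustrative, not a recommendation):
.alter database MyDatabase policy ingestionbatching '{"MaximumBatchingTimeSpan":"00:00:30","MaximumNumberOfItems":500,"MaximumRawDataSizeMB":1024}'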

Tracking a counter value in application insights

I'm trying to use application insights to keep track of a counter of number of active streams in my application. I have 2 goals to achieve:
Show the current (or at least recent) number of active streams in a dashboard
Activate a kind of warning if the number exceeds a certain limit.
These streams can be quite long-lived, and sometimes brief. So the number can sometimes change, say, 100 times a second, and sometimes remain unchanged for many hours.
I have been trying to track this active streams count as an application insights metric.
I'm incrementing a counter in my application when a new stream opens and decrementing it when one closes. On each change I use the telemetry client, something like this:
var myMetric = myTelemetryClient.GetMetric("Metricname");
myMetric.TrackValue(myCount);
When I query my metric values with Kusto, I see that because of these clusters of activity within a 10-second period, my metric values get aggregated. For the purposes of my alarm I can live with that, as I can look at the max value of the aggregate. But I can't present a dashboard of the number of active streams, as I have no way of knowing the number of active streams between my measurement points. I know the min, max and average of the aggregate period, but not its last value, and since that can be anywhere between 0 and 1000, it's no help.
Since the solution I have doesn't serve my needs, I thought of a couple of changes:
Adding a scheduled pump to my counter component, which will send the current counter value, once every say 5 minutes. But I don't like that I then have to add a thread for each of these counters.
Adding a timer to send the current value once, 5 minutes after the last change, with the countdown being reset each time the counter changes. This has the same problem as above, and does an excessive amount of work resetting the countdown when the counter could be changing thousands of times a second.
In the end, I don't think my needs are all that exotic, so I wonder if I'm using app insights incorrectly.
Is there some way I can change the metric's behavior to suit my purposes? I appreciate that it's pre-aggregating before sending data in order to reduce ingest costs, but it's preventing me from solving a simple problem.
Is a metric even the right way to do this? Are there alternative approaches within app insights?
You can use TrackMetric instead of the GetMetric ceremony to track individual values without aggregation. From the docs:
Microsoft.ApplicationInsights.TelemetryClient.TrackMetric is not the preferred method for sending metrics. Metrics should always be pre-aggregated across a time period before being sent. Use one of the GetMetric(..) overloads to get a metric object for accessing SDK pre-aggregation capabilities. If you are implementing your own pre-aggregation logic, you can use the TrackMetric() method to send the resulting aggregates.
But you can also use events as described next:
If your application requires sending a separate telemetry item at every occasion without aggregation across time, you likely have a use case for event telemetry; see TelemetryClient.TrackEvent (Microsoft.ApplicationInsights.DataContracts.EventTelemetry).

Using Timer for batch operations

I'm new to using Micrometer and am trying to see if there's a way to use a Timer that would also include a count of the number of items in a batch processing scenario. Since I'm processing the batch with Java streams, I didn't see an obvious way to record the timer for each item processed, so I was looking for a way to set a batch size attribute. One way I think that could work is to use the FunctionTimer from https://micrometer.io/docs/concepts#_function_tracking_timers, but I believe that requires the app to maintain a persistent monotonically increasing set of values for the total count and total time.
Is there a simpler way this can be done? Ultimately this data will be fed to New Relic. I've also tried setting tags for the batch size, but those seem to be reported as strings so I can't do any type of aggregation on the values.
Thanks!
A timer is intended for measuring an action and at a minimum results in two measurements: a count and a duration.
So a timer will work perfectly for your batch processing. In the Java stream, a peek operation might be a good place to put a timer.
If you were about to process 20 elements and you were just measuring the time for all 20 elements, you would need to create a new Counter for measuring the batch size. You could then divide the timer's total duration by your counter to get a per-item duration, or divide it by the timer's total count to get a per-batch duration.
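For illustration, a minimal sketch of the timer-in-peek approach (the SimpleMeterRegistry, meter names and processItem method are placeholders for your actual setup):
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.util.List;

public class BatchMetricsExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer itemTimer = registry.timer("batch.item.duration"); // records a count and a total duration
        Counter batchSize = registry.counter("batch.size");      // records how many items were in the batch

        List<String> batch = List.of("a", "b", "c");
        batch.stream()
             .peek(item -> batchSize.increment())                // count each item as it flows through the stream
             .forEach(item -> itemTimer.record(() -> processItem(item)));

        System.out.println("items processed: " + itemTimer.count());
    }

    private static void processItem(String item) {
        // placeholder for the real per-item work
    }
}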
Feel free to add code snippets if you would like feedback for those.

How to deal with a large amount of logs and redis?

Say I have about 150 requests coming in every second to an API (Node.js), which are then logged in Redis. At that rate, the moderately priced RedisToGo instance will fill up every hour or so.
The logs are only necessary to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, the total number of requests per day, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of sudden I have to deal with, say, 2500 requests per second?
All of a sudden I'm dealing with ~4.5 GB of data per hour - about 2.25 GB every 30 minutes. Even with how fast Redis/Node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the Redis instance while 2.25 GB worth of data is being processed (from a list, I imagine)?
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will be better served by collecting your logs on a single server and writing them to a filesystem.
Now what you can do with Redis is trying to calculate your statistics in real-time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top URLs and the number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the top 100 keywords and URLs of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incrby on the monthly data to aggregate the yearly data).
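For instance, the end-of-month roll-up for the keyword statistics could look like this, mirroring the daily block above (URLs and nb_req follow the same pattern; the key names are illustrative):
# Snapshot and reset the monthly keyword stats, then merge them into the yearly ones
multi
rename month:top:keyword tmp:month:keyword
rename year:top:keyword tmp
zunionstore year:top:keyword 2 tmp tmp:month:keyword
del tmp tmp:month:keyword
exec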
The main benefit of this approach is that the number of operations done for each log line is limited, while the monthly and yearly aggregations can still be calculated easily.
How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if available)? You could then store the log data using Hadoop/HBase or any other disk-based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/

Azure Diagnostic - how to get performance counter raw data

I am looking for a way to get the raw data from a performance counter in Windows Azure using the diagnostics API.
So far I've noticed that I can configure a counter from the known counters and set the sampling rate for that counter.
Is the sampling rate configured in the diagnostics configuration the rate that the counter calculation is based on?
If not, how can I get the raw data for that counter, since I want to get the CPU user time (for example) and do the calculation myself?
Thanks
Each counter has a sampling frequency, from 1 second up to whatever interval you choose. Azure samples each counter on every instance at the given rate and stores the values locally on that instance. Furthermore, there is a setting that tells Azure to transfer these values from each instance to the WADPerformanceCountersTable in your storage account. The transfer period is measured in minutes, with a minimum of once per minute.
To get details you want to read this:
http://convective.wordpress.com/2009/12/10/diagnostics-management-in-windows-azure/
and this:
http://convective.wordpress.com/2010/12/01/configuration-changes-to-windows-azure-diagnostics-in-azure-sdk-v1-3/
