I'm using Prometheus to collect some metrics every 30s, which is doing a great job. To save some storage space, I also created some recording rules to aggregate these raw metrics to more coarse ones, like cpu_time:avg1h.
Now the problem is, I want the aggregation to happen at the beginning of each hour. Is it possible in Prometheus?
There's no such feature in Prometheus, while you can control the interval of rule evaluations the phases are random.
It is generally advised to have your scrape end evaluation intervals the same for the sake of sanity, so I'd suggest using a 30s evaluation interval. Prometheus has pretty good compression, so it's going to be fairly cheap.
Related
I wrote a Flink app, using a Cassandra Sink. Everything works really well, but various metrics are not being generated (like Latency, NumRecordsInPerSecond, etc).
The following graphs are the only ones that show information.
I really need the latency metric Flink offers, but It doenst show any data.
Do I need to do something else to get the others graphs to work? Especially, the latency one.
Thanks.
Make sure you have not disabled LatencyMarker. By default, the LatencyMarker will be emitted every 2 seconds to calculate latency for each operator. You can set the interval to send LatencyMarker through env.getConfig().setLatencyTrackingInterval(1000L);
In our workflow, we have little ongoing work in the arangodb (~1% cpu use). For about 30 minutes of the day usage spikes and we need it to be more performant (helping do a 3s query to 1s).
Instead of moving up the instance box that it's hosted on, is there a way to get more out of arango temporarily during peak times? Would this be clustering or should we just look into temporarily boosting the instance that it's on.
Accumulating above suggestions plus adding some more that fit the generic nature of this question.
if possible split read/write workload, either in a timely fashion by holding back writes, or by switching to a new collection for the new writes.
make sure indices are properly set (use explain)
try whether query profiling can help you improve the performance
I'm playing with the idea of having long-running aggregations (possibly a one day window). I realize other solutions on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, it sounds like a day-long aggregation would be possible-viable (especially since it uses check-pointing in case of failure).
Does anyone know if this is the case?
This function is documented as: https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
Its hard to deal with late data when you're streaming exactly over a day. If your data is tagged with dates, then you need to wait till all your data arrives. A 1 day window in streaming would only be good if you were literally just doing an analysis of the last 24 hours of data regardless of its content.
Not sure the title is well suited to what I'm trying to achieve, so bear with me.
I'll start with defining my use case:
Many(say millions) IoT devices are sending data to my Spark stream. These devices are sending the current temperature level every 10 seconds.
The owner of all of these IoT devices has the ability to define a preset rules, for example: if temperature > 50 then do something.
I'm trying to figure out if I can output how many of these devices have met this if > 50 criteria in some time period. The catch is that the rules are defined in real time and should be applied to the Spark job at real time.
How would I do that. Is Spark the right tool for the job?
Many thanks
Is Spark the right tool for the job?
I think so.
the rules are defined in real time and should be applied to the Spark job at real time.
Let's assume the rules are in a database so every batch interval Spark would fetch them and apply one by one. They could also be in a file or any other storage. That's just orthogonal to the main requirement.
How would I do that?
The batch interval would be "some time period". I assume that the payload would have deviceId and temperature. With that you can just use regular filter over temperature and get deviceId back. You don't need stateful pipeline for this unless you want to accumulate data over time that is longer than your batch interval.
I'm looking for a distributed Time series database which is free to use in a cluster setup up mode and production ready plus it has to fit well in the hadoop ecosystem.
I have an IOT project which is basically around 150k Sensors which send data every 10 minutes or One hour, so I'm trying to look at time series database that has useful functions like aggregating metrics, Down-sampling, pre-aggregate (roll-ups) i have found this comparative in this Google stylesheet document time series database comparative .
I have tested Opentsdb, the data model of the hbaserowkey really suits my use case : but the functions that sill need to be developed for my use case are :
aggregate multiples metrics
do rollups
I have tested also keirosDB which is a fork of opentsdb with a richer API and it uses Cassandra as a backend storage the thing is that their API does all what my looking for downsampling rollups querying multiples metrics and a lot more.
I have tested Warp10.io and Apache Phoenix which i have read here Hortonworks link that it will be used by Ambari Metrics so i assume that its well suited for time series data too.
My question is as of now what's the best Time series Database to do real time analytics with requests performance under 1S for all the type of requests example : we want the average of the aggregated data sent by 50 sensors in a period of 5 years resampled by months ?
Such requests I assume can't be done under 1S so I believe for such requests we need some rollups/ pre aggregate mechanism, but I'm not so sure because there's a lot of tools out there and i can't decide which one suits my need the best.
I'm the lead for Warp 10 so my answer can be considered opinionated.
Given your projected data volume, 150k sensors sending data every 10 minutes, it is a mean of 250 datapoints per second and less than 40B on a period of 5 years. Such a volume can easily fit on a simple Warp 10 standalone, and if you later need to have a larger infrastructure you can migrate to a distributed Warp 10 based on Hadoop.
In terms of requests, if your data is already resampled, fetch 5 years of monthly data for 50 sensors is only 3000 datapoints, Warp 10 can do that in far less than 1s, and doing the automatic rollups is just a matter of scheduling WarpScript code in a monthly manner, nothing fancy.
Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things with integration of the WarpScript language in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform or you can load data using any other InputFormat and then manipulate them using WarpScript.
At OVH we are heavy users of #OvhMetrics which rely on Warp10/HBase, and we provide a protocol abstraction with OpenTSDB/WarpScript/PromQL/...
I'm not interested in Warp10, but it has been a great success for us. Both on the scaling challenge and for the use cases that WarpScript can cover.
Most of the time we don't even leverage hadoop/flink integration because our customers needs are addressed easily with the real time WarpScript API.
For real time analytics, you can try Druid, an open source project maintainted by Apache, or you can also check out database specialized for IoT: GridDB and CrateDB. The best way is to test these databases yourselves and see if they suit your need. You can also connect these databases as a sink to Kafka.
When you are dealing with IoT project, you need to forecast if you have to maintain large data set in the future or if you are happy with downsampled data. Some TSDB have good compression like InfluxDB, but others may not be scalable beyond tens of terabytes, so if you think you need to scale big, look also for one with scale-out architecture.