It is easy to process metrics with statsd and Graphite, assuming they are measured per timespan. For example, it is easy to track the number of requests per second.
On the other hand, it can sometimes be useful to track a metric relative to a given "base item". For example, when I process a data set I want to track the percentage of invalid fields, the number of actions necessary to process the data set, and so on. I can easily get a result like "we had 10 invalid values in the data set per second" and "we process 100 data fields per second on average", but I would rather see something like "out of 100 fields, 10 were invalid".
The results are similar when processing each field takes a similar amount of time. However, if the time varies (and especially if it varies with the nature of the data), a time-based statistic is confusing and does not reflect what I want to see.
Are there any ways to set up statsd / Graphite to solve the issue I have described?
Creating a more meaningful relationship between time-series data at the StatsD/Graphite boundary is quite difficult because, as you alluded to in the question, the data (used for deriving the percentage) is only related by time and key.
That said, for this type of data I've set up "percentage graphs" using asPercent(). Like this:
asPercent(stats_counts.myapp.messages.{ignored,dropped,recycled},
stats_counts.myapp.messages.received)
You could also consider pushing this down into your application, performing the calculation where you can better relate the data, and sending the result to StatsD as a gauge.
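For instance, a minimal sketch of that application-side approach, assuming the Python statsd client and a hypothetical metric name:

import statsd  # standard Python statsd client (assumed to be available)

client = statsd.StatsClient("localhost", 8125)

def report_invalid_ratio(invalid_count, total_fields):
    # Compute the percentage in the app, where both numbers are known
    # for the same data set, and push it as a single gauge.
    if total_fields:
        client.gauge("myapp.fields.invalid_pct",
                     100.0 * invalid_count / total_fields)

# e.g. 10 invalid values out of 100 processed fields -> gauge value 10.0
report_invalid_ratio(10, 100)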
I would like to manually tune how big my mini-batches are (in terms of cardinality). A way to set a maximum number of events would be enough, but being able to set both a maximum and a minimum would be better.
The reason I want to mess around with this is because I know for a fact that my processing code does not scale linearly.
In my particular case I'm not doing time aggregation, so I don't really care about time-frame aggregation; what I care about is draining the "input queue" as quickly as possible (by hinting to the engine how many elements to process at a time).
However, if there's no way to set the max/min batch cardinality directly, I could probably work around the limitation with a dummy time-aggregation approach, stamping my input data before Spark consumes it (see the sketch below).
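As a rough illustration of that workaround (plain Python applied before the data reaches Spark; the batch size and field name are just assumptions), the stamping step could look like this:

BATCH_SIZE = 500  # assumed target cardinality per mini-batch

def stamp(records):
    # Give every record a synthetic timestamp so that a fixed number of
    # records (BATCH_SIZE) falls into each one-second "window".
    for i, record in enumerate(records):
        record["event_time"] = i // BATCH_SIZE  # synthetic seconds-since-start
        yield record

# Aggregating on one-second windows of event_time then yields groups of
# at most BATCH_SIZE elements, regardless of how fast the data arrives.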
Thanks
Is there a way to calculate how many RUs I would need if a DocumentDB database is expected to have roughly 800 writes per second and 1,500 reads per second?
Each read is a simple retrieval based on the index, and each item will have about 15 small data fields (a few bools, short strings, and short doubles).
Each write will be an update of most of the data values for the record.
The documentation states that 1 RU corresponds to a 1 KB GET. Each GET in this case should be less than 1 KB, I suspect, so the reads would be about 1,500 RU/s, but I have no idea how to calculate the writes; any help would be greatly appreciated.
There's a simple-to-use capacity planning tool available online. You simply upload a sample JSON document, specify how many reads and writes per second you expect, and it will estimate your required RU/s throughput.
As David so eloquently pointed out, this should only be used as a starting point to give you a ballpark of what your minimum RU cost might be. If your primary read pattern was simply retrieving documents directly by their Id then it might be relatively accurate. In reality, RU is calculated based on the complexity of your queries. So once you have your baseline it's important to do proper analysis of your query patterns and get a feel for their RU cost.
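As a rough sanity check before (or alongside) the planner, you can do some back-of-envelope math. The per-operation costs below are assumed rules of thumb for small documents (a ~1 KB point read by id costs about 1 RU, a ~1 KB write with default indexing costs roughly 5 RU); your real costs will vary with document size, indexing policy, and query shape:

reads_per_sec = 1500
writes_per_sec = 800
ru_per_read = 1    # assumed: point read by id, document < 1 KB
ru_per_write = 5   # assumed: ~1 KB document, default indexing policy

baseline_rus = reads_per_sec * ru_per_read + writes_per_sec * ru_per_write
print(f"Ballpark provisioned throughput: {baseline_rus} RU/s")  # 5500 RU/s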
Luckily, the ease and speed with which you can scale Cosmos in response to load is, in my opinion, one of its most compelling features. In my experience, adding or removing RU throughput takes a matter of seconds, so you can definitely add a layer of intelligent database tuning within your application to optimize your cost and usage.
I have several questions about Cassandra JMX MBeans.
1. How is the Mean attribute calculated?
I have been monitoring metrics with JConsole, and I can see the value of the Write MBean.
There is a Mean attribute in the Write MBean, but I don't know how Cassandra computes that value, and I doubt that it is correct.
2. I made a JUnit test:
Timer latency = new Timer();
latency.update(timeTaken, TimeUnit.MILLISECONDS);
I fed in three values (0, 1, and 2 ms) and expected the mean to be 1000 microseconds, but it is actually 1131.752 microseconds, which confuses me a lot.
3. There are Mean, 50thPercentile, and other attributes, but I cannot get an instantaneous value when I want to watch quick changes in the cluster; none of these attributes reflects such changes.
It is important to note that the latencies are estimates, not exact values. Cassandra cannot store every latency that has occurred or it would run out of memory, so it keeps an approximate reservoir or histogram of the latencies (depending on version) that it uses to calculate the statistics. Assuming you're on C* 2.2 or later, it stores a histogram of the latencies and can calculate the approximate mean, min, max, and percentiles within a given error percentage.
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/EstimatedHistogram.java#L227 is the mean calculation. Since each bucket represents a range of latencies and the calculation uses the high end of each bucket, the result will at worst be higher than the actual latency.
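As a simplified illustration (not Cassandra's actual bucket boundaries or code, just the general idea of a bucketed mean):

# Toy bucketed histogram: each bucket counts latencies up to its upper bound,
# and the mean is derived from the upper bounds, so it is biased slightly high.
bucket_upper_bounds = [100, 120, 145, 175, 210, 250, 300]  # assumed offsets, microseconds
bucket_counts = [0] * len(bucket_upper_bounds)

def record(latency_micros):
    # place the value in the first bucket whose upper bound covers it
    for i, bound in enumerate(bucket_upper_bounds):
        if latency_micros <= bound:
            bucket_counts[i] += 1
            return
    bucket_counts[-1] += 1  # overflow into the last bucket

def approximate_mean():
    total = sum(bucket_counts)
    if total == 0:
        return 0
    return sum(bound * count
               for bound, count in zip(bucket_upper_bounds, bucket_counts)) / total

This rounding-up to bucket boundaries also helps explain why recording a few exact values in a test can produce a mean somewhat higher than the true arithmetic mean of those values.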
Before 2.2 this was kept differently (see http://metrics.dropwizard.io/3.1.0/ for details).
Aside: the mean is a pretty bad statistic to go by for latencies, so don't put too much stock in it; percentiles are better to look at.
I am trying to build a real-time stock application.
Every second I get some data from a web service, like below:
[{"amount":"20","date":1386832664,"price":"183.8","tid":5354831,"type":"sell"},{"amount":"22","date":1386832664,"price":"183.61","tid":5354833,"type":"buy"}]
tid is the ticket ID for the stock purchase or sale;
date is the number of seconds since 1970-01-01 (Unix time);
price/amount are the price at which the stock traded and how much stock was traded.
Requirement
My requirement is to show the user the highest/lowest price for every minute/5 minutes/hour/day in real time, and to show the user the sum of the amount traded for every minute/5 minutes/hour/day in real time.
Question
My question is how to store the data in Redis so that I can easily and quickly get the highest/lowest trade from the DB for different periods.
My design is something like below:
[date]:[tid]:amount
[date]:[tid]:price
[date]:[tid]:type
I am new to Redis. If the design is like this, does that mean I need to use a sorted set, and will there be any performance issues? Or is there another way to get the highest/lowest price for different periods?
Looking forward to your suggestions.
My suggestion is to store min/max/total for all intervals you are interested in and to update them for the current intervals with every arriving data point. To avoid network latency when reading previous data for comparison, you can do it entirely inside the Redis server using Lua scripting.
One key per data point (or, even worse, per data point field) is going to consume too much memory. For the best results, you should group them into small lists/hashes (see http://redis.io/topics/memory-optimization). Redis only allows one level of nesting in its data structures: if your data has multiple fields and you want to store more than one item per key, you need to encode it somehow yourself. Fortunately, the standard Redis Lua environment includes msgpack support, which is a very efficient binary JSON-like format. The JSON entries in your example, encoded with msgpack "as is", will be 52-53 bytes long. I suggest grouping by time so that you have 100-1000 entries per key. Suppose a one-minute interval fits this requirement. Then the keying scheme would be like this:
YYmmddHHMMSS — a hash from tid to msgpack-encoded data points for the given minute.
5m:YYmmddHHMM, 1h:YYmmddHH, 1d:YYmmdd — window data hashes which contain min, max, sum fields.
Let's look at a sample Lua script that accepts one data point and updates all keys as necessary. Due to the way Redis scripting works, we need to explicitly pass the names of all keys that will be accessed by the script, i.e. the live data key and all three window keys. Redis Lua also has a JSON parsing library available, so for the sake of simplicity let's assume we just pass it a JSON dictionary. That means we have to parse the data twice, on the application side and on the Redis side, but the performance impact of this is not clear.
local function update_window(winkey, price, amount)
    -- HGETALL returns a flat {field1, value1, field2, value2, ...} array in Lua,
    -- so turn it into a proper table before using named fields
    local raw = redis.call('HGETALL', winkey)
    local windata = {}
    for i = 1, #raw, 2 do
        windata[raw[i]] = raw[i + 1]
    end
    if price > tonumber(windata.max or 0) then
        redis.call('HSET', winkey, 'max', price)
    end
    if price < tonumber(windata.min or 1e12) then
        redis.call('HSET', winkey, 'min', price)
    end
    redis.call('HSET', winkey, 'sum', (tonumber(windata.sum) or 0) + amount)
end
-- the caller passes in the live per-minute key plus the three window keys
local currkey, fiveminkey, hourkey, daykey = unpack(KEYS)

-- the data point arrives as a JSON string in ARGV[1]
local data = cjson.decode(ARGV[1])
local packed = cmsgpack.pack(data)
local tid = data.tid

-- store the msgpack-encoded point in the per-minute hash, keyed by tid
redis.call('HSET', currkey, tid, packed)

local price = tonumber(data.price)
local amount = tonumber(data.amount)

-- update min/max/sum for each window this point falls into
update_window(fiveminkey, price, amount)
update_window(hourkey, price, amount)
update_window(daykey, price, amount)
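To call it from the application, here is a minimal sketch assuming redis-py and that the script above is saved as update_point.lua; the key names simply follow the scheme described earlier:

import json
import time
import redis

r = redis.Redis()

# register_script wraps the script so it is sent with EVALSHA (falling back to EVAL)
update_point = r.register_script(open("update_point.lua").read())

point = {"amount": "20", "date": 1386832664, "price": "183.8",
         "tid": 5354831, "type": "sell"}

t = time.gmtime(point["date"])
curr_key = time.strftime("%y%m%d%H%M", t) + "00"   # per-minute live hash
five_min_key = "5m:" + time.strftime("%y%m%d%H", t) + "%02d" % (t.tm_min - t.tm_min % 5)
hour_key = "1h:" + time.strftime("%y%m%d%H", t)
day_key = "1d:" + time.strftime("%y%m%d", t)

update_point(keys=[curr_key, five_min_key, hour_key, day_key],
             args=[json.dumps(point)])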
This setup can handle thousands of updates per second, is not very memory-hungry, and window data can be retrieved instantly.
UPDATE: On the memory side, 50-60 bytes per point is still a lot if you want to store more than a few million points. With this kind of data I think you could get as low as 2-3 bytes per point using a custom binary format, delta encoding, and subsequent compression of chunks with something like Snappy. Whether it's worth doing depends on your requirements.
I have some software which collects data over a long period of time, approximately 200 readings per second. It uses an SQL database for this. I am looking to move a lot of my old "archived" data to Azure.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
However, I am lacking a little understanding of how this will scale, so I was hoping somebody might be able to clear this up:
For performance, Azure will/might split my table by PartitionKey in order to keep things nice and quick. In this case that would result in one partition per metric.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on RowKey as well, or am I designing in a future bottleneck? I know you normally shouldn't prematurely optimise, but with something like Azure that doesn't seem as sensible as usual!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
A few comments:
Apart from storing the data, you may also want to look into how you want to retrieve it, as that may change your design considerably. Some questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If this is the case, then you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int value to a string with some "0" pre-padding so that all your ids appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, etc.
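A minimal sketch of both ideas (pure Python emulating .NET ticks; the padding width and helper names are just illustrative assumptions):

from datetime import datetime, timezone

DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
MAX_TICKS = 3155378975999999999  # DateTime.MaxValue.Ticks

def ticks(dt):
    # .NET ticks are 100-nanosecond intervals since 0001-01-01
    delta = dt - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10

def partition_key(metric_id):
    # zero-pad so ids sort numerically: "000002" comes before "000019"
    return str(metric_id).zfill(6)

def row_key(reading_time):
    # reverse-chronological key: the newest reading gets the smallest key
    return str(MAX_TICKS - ticks(reading_time)).zfill(19)

print(partition_key(2), row_key(datetime.now(timezone.utc)))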
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only, not RowKey. Within a partition, the RowKey serves as a unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition– a table partition are all of the entities in a
table with the same partition key value, and usually tables have many
partitions. The throughput target for a single table partition is:
Up to 2,000 entities per second
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning, can process up to the
20,000 entities/second, which is the overall account target described
above.
Now, you mentioned that you have 10-20 different metric points, and for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities / minute / table, which is well under the scalability target of 2,000 entities / second.
Now the question of reading remains. Assume a user reads a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (and thus each table) will fetch a maximum of 28,800 data points. The question left for you, I guess, is how many requests like this you can receive per second while staying within that threshold. If you can somehow extrapolate this information, I think you can reach some conclusions about the scalability of your architecture.
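As a rough sketch of that math (the 1,000-entity page size is the standard Table service limit per request; everything else follows the assumptions above):

metrics_per_tenant = 20
points_per_metric_per_day = 24 * 60                                       # 1,440
points_per_user_query = metrics_per_tenant * points_per_metric_per_day    # 28,800

# Table queries return at most 1,000 entities per round trip (continuation
# tokens beyond that), so a full-day pull per metric needs ceil(1440 / 1000) = 2.
round_trips_per_metric = -(-points_per_metric_per_day // 1000)
round_trips_per_user = metrics_per_tenant * round_trips_per_metric        # 40

print(points_per_user_query, round_trips_per_user)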
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.