How can I aggregate data in Time Series Insights preview using the hierarchy? - azure

I am storing 15-minute electricity consumption measurements in a TSI preview environment. Is it possible to aggregate the total energy consumption per day of multiple meters using the TSI Query API?
I have configured a hierarchy as Area-Building and the Time Series ID is the 'MeterId' of the Meter.
The Query API (https://learn.microsoft.com/en-us/rest/api/time-series-insights/preview-query#aggregate-series-api) let me aggregate the consumption per day for a single meter. I then expected to find an API to aggregate the electricity consumption to Building and Area level, but could only find the aggregate operation with a single "timeSeriesId" or "timeSeriesName" as a required parameter. Is aggregation to a level in the hierarchy not possible? If not, what would be a good alternative (within or outside TSI) to obtain these aggregated values?

What you can do is get all the instances you need with the Search Instances API (see the docs). Mind that the documentation is wrong about the URL: it should contain "search" instead of "suggest".
Then loop through the instances you get in the response and call the aggregate query for each ID, one by one. Finally, sum the results yourself to get a daily total for all the telemetry sensors matching your search; a rough sketch follows below.
Note: you can only make 9 aggregate calls at the same time (see the limitations).
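A minimal Python sketch of that flow, assuming the preview (Gen2) endpoints and a numeric "consumption" property. The URL paths, api-version, payload shapes and TSX expressions are taken loosely from the linked docs and will likely need adjusting to your environment and event schema:

import requests

# Sketch only: endpoint paths, api-version, payload shapes, the TSX expressions
# and the "consumption" property name are assumptions -- adjust them to your
# environment and event schema.
FQDN = "<your-environment>.env.timeseries.azure.com"
API_VERSION = "2018-11-01-preview"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def search_instances(path):
    """Return all instances under a hierarchy path, e.g. ["Area1", "BuildingA"]."""
    body = {
        "searchString": "",
        "path": path,
        "instances": {"recursive": True, "pageSize": 100},
    }
    resp = requests.post(
        f"https://{FQDN}/timeseries/instances/search?api-version={API_VERSION}",
        json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json().get("instances", {}).get("hits", [])

def daily_consumption(time_series_id, search_span):
    """Aggregate one meter's consumption into daily sums (Aggregate Series API)."""
    body = {
        "aggregateSeries": {
            "timeSeriesId": time_series_id,
            "searchSpan": search_span,          # {"from": "...", "to": "..."}
            "interval": "P1D",
            "inlineVariables": {
                "EnergySum": {
                    "kind": "numeric",
                    "value": {"tsx": "$event.consumption"},
                    "aggregation": {"tsx": "sum($value)"},
                },
            },
            "projectedVariables": ["EnergySum"],
        },
    }
    resp = requests.post(
        f"https://{FQDN}/timeseries/query?api-version={API_VERSION}",
        json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# Loop over the instances and sum the daily values yourself, keeping no more
# than 9 aggregate calls in flight at a time (see the limitations page).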
I hope they fix aggregates soon.
In the meantime, I hope this helps you out.
Good luck,

Related

Azure Time Series Insights query multiple time series in 1 query

Azure Time Series Insights (TSI) will no longer be supported in 2025, but for now it is a good, cheap solution for something I am working on. There is one major problem though: I need to query over 100 time series in one query and I can't find a way to do this. For now, I have 100 API calls running in a loop (roughly like the sketch below), which is terrible. On top of that, I need to do this every 60 seconds. Is there a way to grab all of these time series in a single query? If not, is there another solution besides TSI that isn't expensive? ADX is an alternative, but it is considerably more expensive than TSI.
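For context, the per-series loop described above looks roughly like the following; the environment FQDN, api-version and the Get Series payload shape are illustrative assumptions, not details from the question:

import requests

# One query per Time Series ID, repeated every 60 seconds -- the approach the
# question describes.  FQDN, api-version and payload shape are assumptions.
FQDN = "<your-environment>.env.timeseries.azure.com"
URL = f"https://{FQDN}/timeseries/query?api-version=2020-07-31"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def fetch_series(time_series_id, search_span):
    body = {"getSeries": {"timeSeriesId": time_series_id,
                          "searchSpan": search_span}}
    resp = requests.post(URL, json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

span = {"from": "2023-01-01T00:00:00Z", "to": "2023-01-01T00:01:00Z"}
series_ids = [["sensor-001"], ["sensor-002"]]                  # ...up to ~100 IDs
results = [fetch_series(ts_id, span) for ts_id in series_ids]  # 100 sequential calls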

Azure Time Series (TSI) initial considerations and best practices

My apologies for the bad title!
I am in the initial phase of designing an Azure Time Series solution and I have run into a number of uncertainties. The background for getting into TSI is that we currently have a rather badly designed Cosmos DB which contains close to 1 TB of IoT data, and it is growing by the minute. By "badly" I mean that the partition key was designed in such a manner that we do not have control over the size of the partitions. Knowing that there is a limit of 10 GB(?) per partition key, we will soon run out of space and need to come up with a new solution. Also, when running historical queries on the Cosmos DB, it does not respond within an acceptable time frame, and experiments with throughput calculations and changes have not improved the response times to an acceptable level.
We are in the business of logging IoT time series data including metadata from different sensors. We have a number of clients which have from 30 to 300 sensors each - smaller and larger clients. At the client side the sensors are grouped into locations and sub-locations.
An example of an event could be something like this:
{
  deviceId,
  datetime,
  clientId,
  locationId,
  sub-locationId,
  sensor,
  value,
  metadata {}
}
Knowing how to better design a partition key in Cosmos DB, would the same approach as described below be considered good practice in TSI when composing the TimeSeriesId?
In a totally different Cosmos DB solution we have included eventDate.datepart(YYYY-MM) as part of the partition key to stop partitions from growing out of bounds and to better predict the response time of queries within one partition.
Or will TSI handle time series data differently thus making the datepart in TimeSeriesId obsolete?
With TSI API queries in mind, should I also consider the simplicity of the composed TimeSeriesId? The TimeSeriesId has to be provided in the body of each API request, as far as I can tell, and when composing a query in a back-end service I do have access to all our client IDs and location/sub-location IDs, and these are more accessible than the deviceIds.
And finally, when storing IoT data for multiple clients would it be best practice to provision a new TSI solution for each client or does TSI support collections as seen in CosmosDB?
As stated in this article, when using a composite key you will need to query against all the key properties, not just one or some of them. That's a consideration when deciding between a single key and a composite key. Also, the article offers this tip:
If your event source is an IoT hub, your Time Series ID will likely be iothub-connection-device-id.
So, I assume you will have at least one IoT Hub sourcing the events reported from the devices, and in this case you can use the iothub-connection-device-id.
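For illustration, this is roughly how the choice plays out in a query body; the property names and values are hypothetical, and the payload shape follows the preview Query API:

# Hypothetical Time Series ID values for the event shape above.  With a
# composite ID, every query must supply all of its properties, in order.
single_id_query = {
    "aggregateSeries": {
        "timeSeriesId": ["device-001"],          # ID = iothub-connection-device-id
        # ...searchSpan, interval, variables...
    }
}

composite_id_query = {
    "aggregateSeries": {
        # ID = (clientId, locationId, deviceId): all three are always required.
        "timeSeriesId": ["client-42", "location-7", "device-001"],
        # ...searchSpan, interval, variables...
    }
}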

We migrated our app from Parse to Azure but the costs of DocumentDB is so high. Are we doing something wrong?

We migrated our mobile app (still being developed) from Parse to Azure. Everything is running, but the price of DocumentDB is so high that we can't continue with Azure without fixing that. We're probably doing something wrong.
1) The price seems to be bottlenecked by DocumentDB requests.
While running a process to load the data (about 0.5 million documents), memory and CPU were fine, but the DocumentDB request limit was a bottleneck and the price charged was very high.
2) Even after the end of this data migration (a few days of processing), Azure continues to charge us every day.
We can't understand what is going on here. The usage graphs are flat, but the price is still climbing, as you can see in the images.
Any ideas?
Thanks!
From your screenshots, you have 15 collections under the Parse database. With Parse: Aside from the system classes, each of your user-defined classes gets stored in its own collection. And given that each (non-partitioned) collection has a starting run-rate of ~$24/month (for an S1 collection), you can see where the baseline cost would be for 15 collections (around $360).
You're paying for reserved storage and RU capacity. Regardless of RU utilization, you pay whatever the cost is for that capacity (e.g. S2 runs around $50/month / collection, even if you don't execute a single query). Similar to spinning up a VM of a certain CPU capacity and then running nothing on it.
The default throughput setting for the Parse collections is 1,000 RU/s. This costs $60 per collection (at the rate of $6 per 100 RU/s). Once you finish the Parse migration, the throughput can be lowered if the workload has decreased, which will reduce the charge (see the quick arithmetic sketch a few lines below).
To learn how to do this, take a look at https://azure.microsoft.com/en-us/documentation/articles/documentdb-performance-levels/ (Changing the throughput of a Collection).
The key thing to note is that DocumentDB delivers predictable performance by reserving resources to satisfy your application's throughput needs. Because application load and access patterns change over time, DocumentDB allows you to easily increase or decrease the amount of reserved throughput available to your application.
Azure is a "pay-for-what-you-use" model, especially around resources like DocumentDB and SQL Database where you pay for the level of performance required along with required storage space. So if your requirements are that all queries/transactions have sub-second response times, you may pay more to get that performance guarantee (ignoring optimizations, etc.)
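To make the arithmetic above concrete, here is a quick back-of-the-envelope calculation using the approximate figures quoted in this answer (not current list prices):

# Approximate figures quoted above -- not current list prices.
collections = 15

# Baseline: S1-style collections at roughly $24/month each.
s1_monthly = 24
print("S1 baseline:", collections * s1_monthly)              # ~$360/month

# Parse migration default: 1,000 RU/s per collection at ~$6 per 100 RU/s.
per_collection = 1000 / 100 * 6
print("Per collection at 1,000 RU/s:", per_collection)       # ~$60/month
print("All 15 collections:", collections * per_collection)   # ~$900/month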
One thing I would seriously look into is the DocumentDB Cost Estimation tool; this allows you to get estimates of throughput costs based upon transaction types based on sample JSON documents you provide:
So in this example, I have an 8 KB JSON document, I expect to store 500K of them (to get an approximate storage cost), and I specify that I need throughput to create 100 documents/sec, read 10/sec, and update 100/sec (I used the same document as an example of what the update will look like).
NOTE this needs to be done PER DOCUMENT -- if you're storing documents that do not necessarily conform to a given "schema" or structure in the same collection, then you'll need to repeat this process for EVERY type of document.
Based on this information, I can use those values as inputs to the pricing calculator. This tells me that I can estimate about $450/mo for DocumentDB services alone (if this was my anticipated usage pattern).
There are additional ways you can optimize the Request Units (RUs -- metric used to measure the cost of the given request/transaction -- and what you're getting billed for): optimizing index strategies, optimizing queries, etc. Review the documentation on Request Units for more details.

Cassandra - multiple counters based on timeframe

I am building an application and using Cassandra as my datastore. In the app, I need to track event counts per user, per event source, and need to query the counts for different windows of time. For example, some possible queries could be:
Get all events for user A for the last week.
Get all events for all users for yesterday where the event source is source S.
Get all events for the last month.
Low latency reads are my biggest concern here. From my research, the best way I can think of to implement this is a separate counter table for each permutation of source, user, and predefined time window. For example, create a count_by_source_and_user table, where the partition key is a combination of source and user ID, and then create a count_by_user table for just the user counts.
This seems messy. What's the best way to do this, or could you point towards some good examples of modeling these types of problems in Cassandra?
You are right. If latency is your main concern, and it should be if you have already chosen Cassandra, you need to create a table for each of your queries. This is the recommended way to use Cassandra: optimize for reads and don't worry about redundant storage. Since within every table data is stored sequentially according to the primary key, you cannot index a table in more than one way (as you would with a relational DB); a rough sketch follows below. I hope this helps. Look for the "Data Modeling" presentation that is usually given at "Cassandra Day" events. You may find it on "Planet Cassandra" or John Haddad's blog.
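A rough sketch of what "one table per query" could look like for the example queries above, using the DataStax Python driver; the keyspace, table and column names are hypothetical:

from cassandra.cluster import Cluster

# Hypothetical schema: one counter table per query pattern, partitioned so that
# each of the example queries maps to one (or a few) partitions.
session = Cluster(["127.0.0.1"]).connect("events_ks")

# "All events for user A for the last week": one partition, range scan on day.
session.execute("""
    CREATE TABLE IF NOT EXISTS counts_by_user_day (
        user_id text,
        day     date,
        count   counter,
        PRIMARY KEY ((user_id), day)
    )""")

# "All events for all users yesterday where the source is S".
session.execute("""
    CREATE TABLE IF NOT EXISTS counts_by_source_day (
        source  text,
        day     date,
        user_id text,
        count   counter,
        PRIMARY KEY ((source, day), user_id)
    )""")

# "All events for the last month": bucket by month to bound partition size.
session.execute("""
    CREATE TABLE IF NOT EXISTS counts_by_day (
        month text,
        day   date,
        count counter,
        PRIMARY KEY ((month), day)
    )""")

# On every incoming event, increment each table, e.g.:
#   UPDATE counts_by_user_day SET count = count + 1 WHERE user_id = ? AND day = ?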

Design of Partitioning for Azure Table Storage

I have some software which collects data over a long period of time, approx. 200 readings per second. It uses an SQL database for this. I am looking to move a lot of my old "archived" data to Azure.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale however; so was hoping somebody might be able to clear this up:
For performance Azure will/might split my table by partitionkey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my rowkey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on rowkey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
A few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If that's the case, you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int value to a string with some "0" pre-padding so that all your IDs appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, etc. A rough sketch of both key formats follows below.
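Here is a small Python translation of those two key conventions (.NET ticks are 100-nanosecond intervals since 0001-01-01; the padding widths are illustrative):

from datetime import datetime, timezone

DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
MAX_TICKS = 3155378975999999999          # DateTime.MaxValue.Ticks

def ticks(dt):
    """.NET-style ticks: 100 ns intervals since 0001-01-01 UTC."""
    delta = dt - DOTNET_EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10

def partition_key(metric_id):
    # Zero-pad the int metric ID so string ordering matches numeric ordering.
    return f"{metric_id:010d}"

def row_key_newest_first(dt):
    # Equivalent of (DateTime.MaxValue.Ticks - dt.Ticks).ToString("d19").
    return f"{MAX_TICKS - ticks(dt):019d}"

print(partition_key(7))                                  # '0000000007'
print(row_key_newest_first(datetime.now(timezone.utc)))  # 19 digits, sorts newest first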
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only, not the RowKey. Within a partition, the RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition - a table partition is all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is: Up to 2,000 entities per second.
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to the 20,000 entities/second, which is the overall account target described above.
Now, you mentioned that you have 10-20 different metric points and that for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities/minute/table, which is well under the scalability target of 2,000 entities/second.
Now the question of reading remains. Assume a user reads at most 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. If the user gets the data for all 20 metrics for one day, then each user (and thus each table) will fetch at most 28,800 data points; the short sketch below runs through these numbers. The question that is left for you, I guess, is how many requests like this you will get per second against that threshold. If you can extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
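The same numbers as a quick sanity check:

# Figures used in the paragraphs above.
metrics = 20                          # upper bound on metric points per tenant
writes_per_minute = metrics * 1       # at most 1 reading per metric per minute
print(writes_per_minute / 60)         # ~0.33 entities/sec vs. the 2,000/sec partition target

points_per_metric_per_day = 24 * 60             # 1,440
print(metrics * points_per_metric_per_day)      # 28,800 data points per user (table) per day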
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.
