I'm looking to understand the data returned by the WADMetricsPT1HP10DV2S20160415 table inside an Azure storage account.
It has the PartitionKey, RowKey and EntityProperties. PartitionKey, I understand. It translates to the resource Id of my resource inside the storage account. However, I partially understand the RowKey. An example RowKey is:
:005CNetworkInterface:005CPacketsReceived__2519410355999999999
I understand the first part, which is a metric name. What I don't understand is the number that follows. I assume it is a timestamp, but I can't say for sure.
I was attempting to use the RowKey filter, but due to this added wrinkle it's almost impossible to generate the RowKey and use it as a filter. Does anyone know how to generate the numbers/digits in order to create a RowKey filter?
The number 2519410355999999999 is essentially reverse ticks: the event date/time ticks subtracted from the max date/time ticks value. To derive the date/time value from this number, here's what you can do:
long reverseTicks = 2519410355999999999;
long maxTicks = DateTime.MaxValue.Ticks;
var dateTime = new DateTime(maxTicks - reverseTicks, DateTimeKind.Utc);
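The same arithmetic can be sketched outside .NET, in both directions, which is what a RowKey filter needs. This is a minimal Python sketch, assuming the key layout from the question (encoded metric name, two underscores, 19-digit reverse ticks). The function names are hypothetical, and MAX_TICKS is the value of DateTime.MaxValue.Ticks.

```python
from datetime import datetime, timedelta, timezone

# Value of .NET DateTime.MaxValue.Ticks (100 ns units since 0001-01-01).
MAX_TICKS = 3155378975999999999
EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def reverse_ticks_to_datetime(reverse_ticks):
    """Decode the numeric suffix of a RowKey into a UTC datetime."""
    ticks = MAX_TICKS - reverse_ticks
    return EPOCH + timedelta(microseconds=ticks // 10)

def datetime_to_reverse_ticks(dt):
    """Encode a UTC datetime as reverse ticks (integer arithmetic only)."""
    delta = dt - EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10
    return MAX_TICKS - ticks

def metric_row_key(encoded_metric, dt):
    """Assumed layout: <encoded metric name>__<19-digit reverse ticks>."""
    return f"{encoded_metric}__{datetime_to_reverse_ticks(dt):019d}"

when = reverse_ticks_to_datetime(2519410355999999999)
print(when.isoformat())
print(metric_row_key(":005CNetworkInterface:005CPacketsReceived", when))
```

Because the ticks are reversed, a later event produces a smaller number, so the upper and lower bounds of a time range swap places when they are turned into RowKey bounds.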
How to get Last 6 Month data comparing with timestamp column using cassandra query?
I need to get all account statements from the last 3/6 months by comparing updatedTime (a timestamp column) with the current time.
For example, in SQL we use the DateAdd() function for this. I don't know how to do this in Cassandra.
If anyone knows, please reply. Thanks in advance.
Cassandra 2.2 and later allows users to define functions (UDFs) that can be applied to data stored in a table as part of a query result.
If you use Cassandra 2.2 or later, you can create your own function:
CREATE FUNCTION monthadd(date timestamp, month int)
CALLED ON NULL INPUT
RETURNS timestamp
LANGUAGE java
AS $$
  java.util.Calendar c = java.util.Calendar.getInstance();
  c.setTime(date);
  c.add(java.util.Calendar.MONTH, month);
  return c.getTime();
$$;
This function takes two parameters:
date timestamp: the date to add months to or subtract months from
month int: the number of months to add (+) or subtract (-)
It returns the resulting timestamp.
Here is how you can use this :
SELECT * FROM ttest WHERE id = 1 AND updated_time >= monthAdd(dateof(now()), -6) ;
Here monthAdd subtracts 6 months from the current timestamp, so this query returns the data from the last six months.
Note: User-defined functions are disabled by default. Set enable_user_defined_functions=true in cassandra.yaml to enable them if you are aware of the security risks.
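If enabling UDFs is not an option, the cutoff can also be computed in the application and bound as a query parameter. The following is a minimal Python sketch of the same month arithmetic, assuming the day is clamped to the end of the target month (which is what java.util.Calendar.add does). add_months is a hypothetical helper, and the placeholder syntax in the query depends on the driver.

```python
import calendar
from datetime import datetime

def add_months(dt, months):
    """Shift dt by a (possibly negative) number of months, clamping the day."""
    index = dt.year * 12 + (dt.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return dt.replace(year=year, month=month, day=min(dt.day, last_day))

# Cutoff for "the last six months", to be bound to the query below.
cutoff = add_months(datetime.utcnow(), -6)
query = "SELECT * FROM ttest WHERE id = 1 AND updated_time >= ?"
print(query, cutoff)
```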
In cassandra you have to build the queries upfront.
Also be aware that you will probably have to bucket the data depending on the number of accounts that you have within some period of time.
If your whole database doesn't contain more than, say, 100k entries, you are fine with defining a single generic partition, for example one named 'all'. But usually people have a lot of data, which goes into buckets named after a month, week, or hour. The right granularity depends on the number of inserts you get.
The reason for creating buckets is that every node can find a partition by its partition key. This is the first part of the primary key definition. On every node, the data is then sorted by the second column that you pass in to the primary key. Having the data sorted enables you to "scan" over it, i.e. you will be able to retrieve rows by giving a timestamp parameter.
Let's say you want to retrieve accounts from the last 6 months and that you are saving all the accounts from one month in the same bucket.
The schema might be something along the lines of:
create table accounts (
month text,
created_time timestamp,
account text,
PRIMARY KEY (month, created_time)
);
Usually you will do this at the application level: run one query per month bucket and merge the results. Merging queries is an anti-pattern, but it is OK for a small number of queries:
select account
from accounts
where month = '201701';
Then repeat the query for '201702', '201703', and so on.
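Generating the bucket names for the per-month queries could look like the following. This is a minimal Python sketch, assuming buckets are named YYYYMM as in the example above; month_buckets is a hypothetical helper.

```python
from datetime import date

def month_buckets(today, months):
    """Return the YYYYMM bucket names covering the last `months` months, oldest first."""
    index = today.year * 12 + (today.month - 1)
    buckets = []
    for offset in range(months - 1, -1, -1):
        year, month0 = divmod(index - offset, 12)
        buckets.append(f"{year:04d}{month0 + 1:02d}")
    return buckets

# One query per bucket; the application merges the results.
for bucket in month_buckets(date(2017, 2, 10), 6):
    print(f"select account from accounts where month = '{bucket}';")
```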
If you have something really simple with let's say expected 100 000 entries then you could use the above schema and just do something like:
create table accounts (
bucket text,
created_time timestamp,
account text,
PRIMARY KEY (bucket, created_time)
);
select account
from accounts
where bucket = 'some_predefined_name'
and created_time > '2016-10-04 00:00:00';
Once more as a wrap-up, with cassandra you always have to prepare the structures for the access pattern you are going to use.
Say I want to fetch WADLogsTable logs for last 24h of a given cloud service, perhaps narrow down to a given role, and even further down to a given instance. What does the query look like? Is there any high level api for that?
As such there is no API specific for querying diagnostics data. Because diagnostics data is stored in Azure Tables, you can simply query the table to fetch the desired data.
One thing you would notice is that the PartitionKey in these tables represents time (rounded to the nearest minute) in ticks. If you're interested in fetching the last "x" hours from the WADLogsTable table, here's what you would need to do:
Convert from and to date/time (UTC) into ticks.
Prepend 0 in front of them.
Use these in query on PartitionKey. Assuming you're using REST API, the query would look something like:
(PartitionKey ge 'from date/time ticks with prepended zero' and PartitionKey le 'to date/time ticks with prepended zero')
To further filter down by role or role instance, you would add appropriate filter criteria. For example:
(PartitionKey ge 'from date/time ticks with prepended zero' and PartitionKey le 'to date/time ticks with prepended zero') and (Role eq 'Role Name')
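The steps above can be sketched as a small filter builder. This is a minimal Python sketch, assuming UTC datetimes and the .NET tick definition (100 ns units since 0001-01-01); to_ticks and wad_logs_filter are hypothetical helper names, and the resulting string would be passed as the filter of a table query.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def to_ticks(dt):
    """Convert a UTC datetime to .NET-style ticks using integer arithmetic."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10

def wad_logs_filter(start, end, role=None):
    """Build the PartitionKey range filter, optionally narrowed to a role."""
    text = (f"(PartitionKey ge '0{to_ticks(start)}' "
            f"and PartitionKey le '0{to_ticks(end)}')")
    if role is not None:
        escaped = role.replace("'", "''")  # OData escapes a quote by doubling it
        text += f" and (Role eq '{escaped}')"
    return text

# Last 24 hours for one role.
end = datetime.now(timezone.utc)
print(wad_logs_filter(end - timedelta(hours=24), end, role="Role Name"))
```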
I wrote a blog post long time back which you may find useful: http://gauravmantri.com/2012/02/17/effective-way-of-fetching-diagnostics-data-from-windows-azure-diagnostics-table-hint-use-partitionkey/ (Caution: The code showing the use of storage client library is based on the older version of the library, so you may want to use that only for understanding purpose).
I'm considering using a timestamp as the partition key for logging messages, so they always stay ordered with the most recent first.
PartitionKey = String.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
Since the key is not increasing every time with the same pattern, will the partitions still be grouped in a range partition in this case?
It's just for logging; there will be no updates, only inserts of new entries and perhaps eventual deletion of older ones. The main properties to query are date range, user and entityName (where it happened).
Take a look at Understanding the Table Service Data Model and Designing a Scalable Partitioning Strategy for Azure Table Storage - A TableEntity already has a Timestamp property. Since you aren't doing updates the Timestamp should be sufficient for your date range query.
What do you recommend for the following scenario:
If I have 100.000.000 items in a table (a lot of items, to be more exact), how can I get those items per day?
Once the items are added to the table, they are not modified or deleted anymore. Basically, I just insert and read them.
My question is about retrieving them without having to loop through all 100.000.000 items.
Should I make the PartitionKey a datetime or just a date, and then retrieve by PartitionKey where it is equal to, for example, 22.10.2013?
What do you recommend?
If you are reading the items per day, then using the date (just the Date part, not the full DateTime) as the PartitionKey is the best solution.
When using a Date as the Key, I prefer converting it to a String in the YYYYMMDD (or YYYY-MM-DD) format.
The use of a datetime as a PartitionKey is an anti-pattern since all writes go in the same partition - which limits scalability. The Azure Storage scalability targets indicate that you can do 2,000 operations a second against a partition but 20,000 operations a second against a storage account. You can get round this by sharding inserts across a set of buckets for the day - and prepending the date with the bucket name.
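Sharding a day across buckets could look like the following. This is a minimal Python sketch, assuming 16 buckets and a CRC32 of some item identifier to pick the bucket; the bucket count, the key layout (bucket, underscore, YYYYMMDD) and both function names are illustrative assumptions.

```python
import zlib
from datetime import date

BUCKETS = 16  # assumed shard count; tune to the expected write rate

def partition_key(day, item_id, buckets=BUCKETS):
    """Spread one day's writes over several partitions: <bucket>_<YYYYMMDD>."""
    bucket = zlib.crc32(item_id.encode("utf-8")) % buckets
    return f"{bucket:02d}_{day:%Y%m%d}"

def partition_keys_for_day(day, buckets=BUCKETS):
    """All partitions that must be queried to read one full day."""
    return [f"{b:02d}_{day:%Y%m%d}" for b in range(buckets)]

day = date(2013, 10, 22)
print(partition_key(day, "item-42"))
print(partition_keys_for_day(day)[:3])
```

The trade-off is that reading a day now takes one query per bucket, which can be issued in parallel.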
I have a table that stores the online status of a user.
columns(userid, IsOnline,date)
If the user is online, the IsOnline bool flag is true; when the user goes offline, it is false. This way I can see the time ranges during which the user was online.
Would it be fine choosing partitionKey: userId,
Rowkey: the date and time of the event
?
The user can not go on and off at the same time, so rowkey should be unique. I like about this that it keeps all data for a user on the same partition. Also does choosing the date as rowId make sorting more efficient?
UserId is a good strategy for the PartitionKey
With respect to RowKey, I would suggest using "DateTime.MaxValue.Ticks - dateOfEvent.Ticks" formatted to max number of digits.
This will make your RowKeys always sort in descending order of time, and thus allow you to pick the latest status of the user without fetching all of the data in the partition.
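The effect on ordering can be checked with a short example. This is a minimal Python sketch, assuming UTC event times, 19-digit zero padding and the value of DateTime.MaxValue.Ticks as a constant; status_row_key is a hypothetical helper.

```python
from datetime import datetime, timezone

# Value of .NET DateTime.MaxValue.Ticks (100 ns units since 0001-01-01).
MAX_TICKS = 3155378975999999999
EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)

def status_row_key(event_time):
    """RowKey = reverse ticks, zero-padded so string order matches numeric order."""
    delta = event_time - EPOCH
    ticks = (delta.days * 86400 + delta.seconds) * 10**7 + delta.microseconds * 10
    return f"{MAX_TICKS - ticks:019d}"

went_online = datetime(2013, 10, 22, 8, 0, tzinfo=timezone.utc)
went_offline = datetime(2013, 10, 22, 17, 30, tzinfo=timezone.utc)

# Table Storage returns rows sorted by RowKey, so the later event comes first.
keys = sorted([status_row_key(went_online), status_row_key(went_offline)])
print(keys[0] == status_row_key(went_offline))
```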
Sounds reasonable to me. This groups all of a given user's actions together in a single partition. Each action is then delineated by an individual row with the Timestamp for the key.
You might want to keep in mind that every row in Azure Table Storage has a Timestamp column that is populated automatically on create/update. You could consider using this column for your Timestamp but searching/sorting will be slow since it is part of the tertiary data set associated with a Table Storage row.