Azure Storage Logs - How Unique Is RequestID?

I'm trying to move the Azure storage log files into Azure storage tables so I can work with them more easily, but I noticed this:
"duplicate log records may exist in logs generated for the same hour and can be detected by checking for duplicate RequestId and Operation number."
source: https://blogs.msdn.microsoft.com/windowsazurestorage/2011/08/02/windows-azure-storage-logging-using-logs-to-track-storage-requests/
(I know it's an old article, but it's all I can find.)
With this in mind, I thought it would be sensible to use a concatenation of the RequestId and the operation number as my row key.
I wanted to check whether anyone knows just how unique the RequestId is (apparently some requests, such as "copy", might have more than 1 operation, but most will have just 1).
If I'm using it as a row key, I can't afford for it to appear twice in the same partition (I'm partitioning by userID, but let's suppose each partition can contain millions of records).
Thanks

If I'm using it as a row key, I can't afford for it to appear twice in the same partition (Partitioning by userID, but lets suppose each partition can contain millions of records).
If I understand correctly, you could combine the requestID and a new GUID, separated by a delimiter, to form a unique row key. For example: requestId|newGuid.
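A minimal sketch of that idea, assuming the legacy Microsoft.WindowsAzure.Storage.Table SDK; the userId, requestId and operationNumber variables and the table reference are placeholders:

var entity = new DynamicTableEntity(userId, $"{requestId}|{Guid.NewGuid():N}");
entity["Operation"] = new EntityProperty(operationNumber); // keep the operation number queryable as a column
await table.ExecuteAsync(TableOperation.Insert(entity));

Even if the same RequestId shows up twice in an hour's log, the appended GUID keeps the two row keys distinct.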

Related

DynamoDB sorting through data

Everywhere I look, the web is telling me to never use scan() in DynamoDB.
It uses up your capacity units, is limited to a 1 MB response size, etc.
I’ve looked at querying, but that doesn’t achieve what I want either.
How am I supposed to parse through my table?
Here is my setup-
I have a table “people” with rows of people.
I have attributes “email” (partition key), “fName”, “lName”, “displayName”, “passwordHash”, and “subscribed”.
subscribed is either true or false, and I need to sort through every person who is subscribed.
I can’t use a sort key because all emails are unique…
It is my understanding that DynamoDB data is sorted as follows:
partition key 1
    sort key 1
        Item 1
    sort key 2
        Item 2
partition key 2
    sort key 1
    ...etc...
So setting subscribed as a sort key would not work… I would still need to loop through every partition key.
Right now I am just getting every item with a filterExpression to check if someone is subscribed.
If they are, they pass. But what happens when I have hundreds of users, whose data eclipses 1 MB?
I wouldn't get every subscribed user in that case, and sending repeated requests with the start key to fetch every MB of data is tedious for the processor and would slow the server down significantly.
Are there any recommendations for how I should go about getting every subscribed user?
Note: Subscribed cannot be the primary key with email as a sort key, because I have instances where I need just one user, which is easy to access if the email is the primary key.
Right now I am just getting every item with a filterExpression to check if someone is subscribed. If they are, they pass. But what happens when I have hundreds of users, whose data eclipses 1 MB?
GetItem for single person lookups
You should ideally be using a GetItem here, providing the user's email as the key, and then checking whether they are subscribed. Scanning to see if an individual is subscribed is not scalable in any way.
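A minimal sketch with the AWS SDK for .NET, reusing the table and attribute names from the question (the email value is a placeholder):

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient();
var response = await client.GetItemAsync(new GetItemRequest
{
    TableName = "people",
    Key = new Dictionary<string, AttributeValue> { ["email"] = new AttributeValue("someone@example.com") },
    ProjectionExpression = "subscribed"          // fetch only the flag we care about
});
bool isSubscribed = response.Item.TryGetValue("subscribed", out var attr) && attr.BOOL == true;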
Pagination
When data exceeds 1MB you simply paginate:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html
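For example (a sketch, reusing the client from above): keep issuing Scan requests and feeding LastEvaluatedKey back in as ExclusiveStartKey until DynamoDB stops returning one.

var request = new ScanRequest
{
    TableName = "people",
    FilterExpression = "subscribed = :t",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue> { [":t"] = new AttributeValue { BOOL = true } }
};
do
{
    var page = await client.ScanAsync(request);
    foreach (var item in page.Items)
    {
        // process each subscribed user
    }
    request.ExclusiveStartKey = page.LastEvaluatedKey;   // empty on the last page
} while (request.ExclusiveStartKey != null && request.ExclusiveStartKey.Count > 0);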
Are there any recommendations for how I should go about getting every subscribed user?
Sparse Indexes
For this use case it's best to use a sparse index: set subscribed="true" only if it's true; if it's false, don't set the attribute at all (you must also use a string, as a boolean can't be used as a key).
Once you do so, you can create a GSI on the attribute subscribed. Now only the items where it is "true" are contained in your GSI, making it sparse, so a Scan of that index is as efficient as possible, albeit it will limit throughput capacity to 1000 WCU.
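A minimal sketch of the write and read side, assuming a GSI named subscribed-index whose partition key is the string attribute subscribed (the index name and the userIsSubscribed flag are illustrative):

var item = new Dictionary<string, AttributeValue>
{
    ["email"] = new AttributeValue("someone@example.com"),
    ["fName"] = new AttributeValue("Ada")
};
if (userIsSubscribed)
    item["subscribed"] = new AttributeValue("true");   // omit the attribute entirely when false
await client.PutItemAsync("people", item);

// Reading back: every item in the sparse GSI has subscribed = "true", so one Query returns them all.
var subscribers = (await client.QueryAsync(new QueryRequest
{
    TableName = "people",
    IndexName = "subscribed-index",
    KeyConditionExpression = "subscribed = :t",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue> { [":t"] = new AttributeValue("true") }
})).Items;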
Making things scalable
An even better way is to create an attribute called GSI_PK and assign it a random number, then use subscribed as the sort key, again as a string and only when true. This means your index will not become a bottleneck that caps your throughput at 1000 WCU because of a single value being the partition key.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes-general-sparse-indexes.html
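A hypothetical sketch of that sharded variant, assuming 10 shards and a GSI (call it subscribed-shard-index) keyed on GSI_PK with subscribed as the sort key:

const int shards = 10;                                  // assumed shard count
var item = new Dictionary<string, AttributeValue>
{
    ["email"]      = new AttributeValue("someone@example.com"),
    ["GSI_PK"]     = new AttributeValue { N = Random.Shared.Next(shards).ToString() },
    ["subscribed"] = new AttributeValue("true")         // still sparse: only written when true
};
await client.PutItemAsync("people", item);

// To list every subscriber, query each shard and concatenate the results.
for (int shard = 0; shard < shards; shard++)
{
    var page = await client.QueryAsync(new QueryRequest
    {
        TableName = "people",
        IndexName = "subscribed-shard-index",
        KeyConditionExpression = "GSI_PK = :p AND subscribed = :t",
        ExpressionAttributeValues = new Dictionary<string, AttributeValue>
        {
            [":p"] = new AttributeValue { N = shard.ToString() },
            [":t"] = new AttributeValue("true")
        }
    });
    // accumulate page.Items here
}

The write load on the index now spreads across 10 partitions instead of one, at the cost of 10 cheap queries on read.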

Azure table delete pattern - delete old items

I'm working with Azure Table storage to store information about websites I work with. So, I planned this structure:
Partition Key - domain name
Row key - Webpage address
Valid until (date time) - after this date, the record will be deleted.
Other crucial data here...
Those columns will be stored in a table named after the website address (e.g. "cnn.com").
I have two main use cases (from high to low frequency):
1. Check if URL "x" is in the table - find by combination of Partition Key and Row Key - very efficient.
2. Delete old data - remove all expired data (according to the "Valid until" column). This operation takes place every midnight and may delete millions of rows - very heavy.
So, the first task (check if a URL exists) is implemented efficiently with this data model. The second task is not. I want to avoid batch deletion.
I also worry about creating "hot spots", which would hurt my performance, because of the partition key. I expect that in some hours I will run many more queries against a specific domain, which will make that partition a hotspot and hit my performance. To avoid this, I thought of applying a hash function to the URL and using the result as the partition key. Is this a good idea?
I also thought about other ways to implement this, and they seem to have some problems:
Storing the rows in a table named with the deletion date (e.g. "cnn.com-1-1-2016"). This gives us great deletion performance, but a bad search experience (the row can exist in more than one table, e.g. "cnn.com-1-1-2016" or "cnn.com-2-1-2016"...).
What is the right solution for my problem?
Have you seen the Azure Table Storage Design Guide? It describes principles and patterns for designing table solutions at scale. For hot spots, take a look at the prepend/append anti-pattern for some extra information. This is where all your operations occur within a single partition, which prevents additional resources from being added. For these types of scenarios you will get better scale if you can distribute the operations across partitions instead.
Let's assume you have a page like https://www.yahoo.com/news/death-omar-al-shishani-could-mean-war-against-203132664.html?nhp=1. You can make the PK the domain name + "/news/" + the first 2 letters of the page address, i.e. https://www.yahoo.com/news/de, and the RK the rest of the full address. This will split your domain partition into nearly 1000 partitions. If that's not enough, use the first 3 letters in the PK.
Remove obsolete data every 15 minutes (create a separate service for it); your millions will become just tens of thousands. Or keep less data (e.g. 2 weeks instead of a month). And do not forget to optimize the deletion (retrieve PK and RK only, set the ETag to "*", delete as a DynamicTableEntity, and batch if possible).
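A minimal sketch of that purge with the legacy Microsoft.WindowsAzure.Storage.Table SDK; ValidUntil is the property name assumed for the "Valid until" column:

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

static async Task PurgeExpiredAsync(CloudTable table)
{
    var query = new TableQuery<DynamicTableEntity>()
        .Where(TableQuery.GenerateFilterConditionForDate("ValidUntil", QueryComparisons.LessThan, DateTimeOffset.UtcNow))
        .Select(new[] { "PartitionKey", "RowKey" });          // keys only, nothing else comes over the wire

    TableContinuationToken token = null;
    do
    {
        var segment = await table.ExecuteQuerySegmentedAsync(query, token);
        token = segment.ContinuationToken;

        // Batches must share a partition key and hold at most 100 operations.
        foreach (var partition in segment.Results.GroupBy(e => e.PartitionKey))
        {
            foreach (var chunk in partition.Select((e, i) => (e, i)).GroupBy(x => x.i / 100))
            {
                var batch = new TableBatchOperation();
                foreach (var (entity, _) in chunk)
                {
                    entity.ETag = "*";                        // unconditional delete
                    batch.Delete(entity);
                }
                await table.ExecuteBatchAsync(batch);
            }
        }
    } while (token != null);
}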

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal which is great because it sounds like that's what you will be doing a lot of. If partition key has no logical meaning (ie, you won't want all the entities in a particular partition) you're best splitting out to many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if we hit timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
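A small sketch of both access patterns with the legacy Microsoft.WindowsAzure.Storage.Table SDK, assuming the unique id is repeated as the row key:

// Point lookup: partition key + row key in one round trip.
var retrieve = TableOperation.Retrieve<DynamicTableEntity>(id, id);
var single = (DynamicTableEntity)(await table.ExecuteAsync(retrieve)).Result;

// Full listing: follow continuation tokens until the scan is exhausted.
TableContinuationToken token = null;
do
{
    var segment = await table.ExecuteQuerySegmentedAsync(new TableQuery<DynamicTableEntity>(), token);
    token = segment.ContinuationToken;
    foreach (var entity in segment.Results)
    {
        // process each entity
    }
} while (token != null);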
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are good option to you.
The table service's auto-scale feature may not work perfectly, I think. When some of the data in a partition is 'hot', the table service will move it to another cluster to improve performance. But since you have unique partition keys, probably none of your entities will be determined to be 'hot', whereas if you grouped them into partitions, some partitions would become 'hot' and be moved. The problem below may also exist if you are using a static partition key.
Besides, the table service may return only some of the entities matching your query when:
More than 1000 entities in result.
Partition boundary is crossed.
From your question you also need a full query (return all entities). If you are using unique partition keys, each entity is its own partition, so your query may return only 1 entity along with a continuation token, and you need to fire another query with that continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is to select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.

Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: The name of the log.
RowKey: Inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
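For reference, a sketch of that query against the legacy .NET storage SDK; because the RowKey stores inverted ticks, the later timestamp produces the smaller key, so the bounds swap (the dates are read as 2015-01-01):

string upper = (DateTime.MaxValue.Ticks - new DateTime(2015, 1, 1, 12, 0, 0, DateTimeKind.Utc).Ticks).ToString("D19");
string lower = (DateTime.MaxValue.Ticks - new DateTime(2015, 1, 1, 13, 0, 0, DateTimeKind.Utc).Ticks).ToString("D19");

var query = new TableQuery<DynamicTableEntity>().Where(TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "MyApiLogs"),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, lower),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, upper))));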
Take a look at our new Table Design Patterns Guide - specifically the log-data anti-pattern as it talks about this scenario and alternatives. Often when people write log files they use a date for the PK which results in a partition being hot as all writes go to a single partition. Quite often Blobs end up being a better destination for log data - as people typically end up processing the logs in batches anyway - the guide talks about this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. You can add to the hash code of the message the hash code(s) of any additional key/value pairs as well if you want, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can use a partition key that marks the start of the window you want to retrieve, usually something like 1 minute or 1 hour before the current date/time, and then page backwards another minute or hour with a different timestamp. This avoids the weird date/time hack of subtracting the timestamp from DateTime.MaxValue.
If you get extra fancy and put a search service on top of the Azure table storage, then you can lookup key/value pairs quickly.
This will be much cheaper than Application Insights if you are using Azure Functions (I would suggest disabling Application Insights in that case). If you need multiple log names, just add another table.

How to structure a Azure Table to hold user messages

I'm still trying to get my head around the correct way to use Azure Tables. I understand that they have a partition key and a row key, and that's it. Everything else is just data that you keep in that row.
Use Case
My web app gets files uploaded by a user, puts them in a queue, then has a worker role process the queue and do analytics on those files.
I would like to put messages about those files in an Azure Table based on what we find when we process those files.
I then plan on making an AJAX call to get a member's messages when they visit a webpage. If the user clicks on the message or closes the message then I'll delete it from the table. Very StackOverflowish.
Question
My question is on how to best store these messages in my Azure Table.
Here's my thinking so far:
PartitionKey: MemberID
RowKey: ??? (not sure what to have)
Column Data: Message data including any links and a time stamp. Probably a view count too.
I can't think of what I would put in a separate index for the row key. A timestamp could work so I can order messages correctly, but I don't think I'll get much bang for my buck with that.
I have found that the best way to choose partition and row keys is to think about the data access patterns and to have a single row/entity represent something meaningful in your system. In your case it sounds like userid/fileid uniquely identifies the entity. From this, you have three options:
userid for partition key, fileid for row key
constant value for partition key, and a combination of userid and fileid for row key
constant value for row key, and a combination of userid and fileid for partition key
The decision there comes down to your other access patterns. Are you going to be querying for all files for a particular user? Then you would want userid as the partition or row key. If you will only ever query by the fileid/userid combination, then it doesn't really matter.
Erick
Before thinking about actual storage, you should try to think about what entities you're going to have.
Sounds like something like this:
User entity
UserFile entity
FileMessage entity
Do you have one FileMessage per UserFile or can you have more than one? It sounds like (from your explanation of the deletion logic) you would only have one FileMessage per UserFile.
If my assumptions so far are correct and if it were me, the FileMessage table would have the following structure:
PartitionKey: userId
RowKey: fileId (name/url/etc)
Other columns: as you see fit
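A minimal sketch of that layout with the legacy Microsoft.WindowsAzure.Storage.Table SDK (the property names and values are illustrative):

var message = new DynamicTableEntity(memberId, fileId);
message["Message"]    = new EntityProperty("Analysis finished: 2 issues found");
message["Link"]       = new EntityProperty("https://example.com/files/" + fileId);
message["CreatedUtc"] = new EntityProperty(DateTimeOffset.UtcNow);
message["ViewCount"]  = new EntityProperty(0);
await table.ExecuteAsync(TableOperation.InsertOrReplace(message));

// One partition query returns every outstanding message for a member:
var query = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, memberId));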
HTH
I would think of it as: the Partition Key is how you want to break the data out, so if data is related, you want to keep the partition key the same. If you are dealing with a lot of data, you may want to use something like the date for the Partition Key. The Row Key is the index within the partition, so that is what you will use to query the data.
