Query Azure Table Storage by list of partition and row key pairs

I have a list of known partition and row key pairs from the same table (e.g. P1R1, P2R2, P3R3, where P is the PartitionKey and R is the RowKey). Does anyone know how to query Azure Table Storage to get these 3 entities in one request?

I don't think there's an option besides just spelling it out in the Where/Filter clause explicitly.

You can do that, but you don't want to. It will result in a table scan and be extremely slow. You'd be much better served by firing off each request separately.
If you're dead set on doing it with one query, comment here and I'll get you the code. I'm on my iPad right now so I don't have it handy.
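For illustration, here is a minimal Python sketch of the two approaches discussed above: one combined filter that spells out every pair, and one filter per pair for separate point queries. It only builds OData-style filter strings and makes no SDK calls; the function names are hypothetical, and how you send the filters depends on your client library.

```python
from typing import Iterable, List, Tuple


def _quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def build_pair_filter(pairs: Iterable[Tuple[str, str]]) -> str:
    """Build one filter that matches any of the (PartitionKey, RowKey) pairs."""
    clauses = [
        f"(PartitionKey eq {_quote(pk)} and RowKey eq {_quote(rk)})"
        for pk, rk in pairs
    ]
    return " or ".join(clauses)


def build_point_filters(pairs: List[Tuple[str, str]]) -> List[str]:
    """Build one filter per pair, to be sent as separate point queries."""
    return [build_pair_filter([pair]) for pair in pairs]


if __name__ == "__main__":
    pairs = [("P1", "R1"), ("P2", "R2"), ("P3", "R3")]
    # Single request, but the answer above warns this may turn into a scan.
    print(build_pair_filter(pairs))
    # Separate requests, each one addressing exactly one entity.
    for point_filter in build_point_filters(pairs):
        print(point_filter)
```

The per-pair filters can be issued concurrently, which is the approach the answer above recommends over the single combined filter.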

Related

Get all the Partition Keys in Azure Cosmos DB collection

I have recently started using Azure Cosmos DB in our project. For the reporting purpose, we need to get all the Partition Keys in the collection. I could not find any suitable API to achieve it.
UPDATE: According to Brian in the comments below, DISTINCT is now supported. Try something like:
SELECT DISTINCT c.partitionKey FROM c
Prior answer: Idea that could work but for one thing...
The only way to get the actual partition key values is to do a unique aggregate on that field.
You can directly hit the REST endpoint at https://{your endpoint domain}.documents.azure.com/dbs/{your collection's uri fragment}/pkranges to pull back the minInclusive and maxExclusive ranges for each partition but those are hash space ranges and I don't know how to convert those into partition key values nor do a fanout using the actual minInclusive hash.
Also, there is a slim possibility that the pkranges can change between the time you retrieve them and the time you go to do something with them.

partitionkey and rowkey in azure table storage

I understand the benefit of a partition key in azure table storage. However, given my relational database background, I am a bit confused about how to retrieve an entity from azure table storage given just the rowkey. As far as I know, this is impossible. This means that I have to store the partition key/rowkey pair somewhere to just get the entity given the rowkey. Should I just introduce a 'sharding' table with one arbitrary partition key, which allows me to look up the partition key given the rowkey?
It is possible but will result in a table scan as described in this section of MSDN.
If you don't need multiple partitions, then it is absolutely fine to use a single partition (e.g. using a constant), as long as your data isn't going to be enormous in size and doesn't need the scalability of multiple partitions.
Another possible approach is to use your current RowKey as PartitionKey which would give you a highly scalable solution but would result in bad performance if you need to query ranges of rows.
The linked MSDN page talks about the pros and cons of both so I think with your knowledge about your specific problem domain you should be able to find a balanced solution.

What is the disadvantage to unique partition keys?

My data set will only ever be directly queried (meaning I am looking up a specific item by some identifier) or will be queried in full (meaning return every item in the table). Given that, is there any reason to not use a unique partition key?
From what I have read (e.g.: https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#choosing-an-appropriate-partitionkey) the advantage of a non-unique partition key is being able to do transactional updates. I don't need transactional updates in this data set so is there any reason to partition by anything other than some unique thing (e.g., GUID)?
Assuming I go with a unique partition key per item, this means that each partition will have one row in it. Should I repeat the partition key in the row key or should I just have an empty string for a row key? Is a null row key allowed?
Zhaoxing's answer is essentially correct but I want to expand on it so you can understand a bit more why.
A table partition is defined as the table name plus the partition key. A single server can have many partitions, but a partition can only ever be on one server.
This fundamental design means that access to entities stored in a single partition cannot be load-balanced because partitions support atomic batch transactions. For this reason, the scalability target for an individual table partition is lower than for the table service as a whole. Spreading entities across many partitions allows Azure storage to scale your load much better.
Point queries are optimal, which is great because it sounds like that's what you will be doing a lot of. If the partition key has no logical meaning (i.e., you won't want all the entities in a particular partition), you're best off spreading the data across many partition keys. Listing all entities in a table will always be slower because it's a scan. Azure storage will return continuation tokens if the query hits a timeout, 1000 entities, or a server boundary (as discussed above). Many of the storage client libraries have convenience methods which should help you handle this by automatically following these tokens as you iterate through the list.
TL;DR: With the information you've given I'd recommend a unique partition key per item. Null row keys are not allowed, but however else you'd like to construct the row key is fine.
Reading:
Azure Storage Table Design Guide
Azure Storage Performance Check List
If you don't need EntityGroupTransaction to update entities in batch, unique partition keys are a good option for you.
That said, I think the table service's auto-scale feature may not work perfectly in that case. When some of the data in a partition is 'hot', the table service will move it to another cluster to improve performance. But since you have unique partition keys, probably none of your entities will be identified as 'hot', whereas if you grouped them into partitions, some partitions would be 'hot' and would be moved. The problem described below may also apply if you are using a static partition key.
Besides, the table service may return only part of the entities matching your query when:
There are more than 1000 entities in the result.
A partition boundary is crossed.
From your question, you also need the full query (return all entities). If you are using unique partition keys, each entity is its own partition, so your query will return only 1 entity along with a continuation token, and you need to fire another query with this continuation token to retrieve the next entity. I don't think this is what you want.
So my suggestion is to select a reasonable partition key in any case, even if it looks useless for your business, because it helps the table service optimize your data.
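Both answers mention continuation tokens. As a rough sketch of what "following the tokens" means, the Python below keeps requesting pages until no token comes back. `query_page` and `make_fake_table` are hypothetical stand-ins, not part of any Azure SDK; real client libraries usually wrap this loop for you.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page is (entities, continuation_token); the token is None on the last page.
Page = Tuple[List[dict], Optional[str]]


def iterate_all(query_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every entity by following continuation tokens until none is left."""
    token: Optional[str] = None
    while True:
        entities, token = query_page(token)
        # A page may be empty and still carry a token, so keep looping on the token.
        yield from entities
        if token is None:
            break


def make_fake_table(rows: List[dict], page_size: int) -> Callable[[Optional[str]], Page]:
    """Hypothetical in-memory table that returns at most page_size rows per call."""
    def query_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(rows) else None
        return rows[start:end], next_token
    return query_page


if __name__ == "__main__":
    rows = [{"PartitionKey": f"p{i}", "RowKey": ""} for i in range(5)]
    for entity in iterate_all(make_fake_table(rows, page_size=2)):
        print(entity)
```

With a page size of 1 this models the worst case described above, where every entity requires its own round trip.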

Setting RowKey in Azure Table Storage

I am storing some logs in Azure Table Storage. I've identified the PartitionKey I should use. However, I'm having trouble determining what I should use for the RowKey. If I was using Sql Server, I would use an auto-incrementing integer. From what I can tell, having an auto-generated RowKey is not an option with Azure Table Storage. I'm fine using a GUID, however, everyone seems to warn against using a GUID. Yet, I'm not sure what I should be using.
Can anyone provide me a pointer for what I should use as the RowKey for storing log data? I've seen the following syntax (RowKey: {'_': '1'}), as shown below, but can't find out what it means:
var task = {
  PartitionKey: {'_': 'hometasks'},
  RowKey: {'_': '1'}
};
Thanks!
There are many approaches you can take. One such approach would be to store date/time value in ticks as RowKey. This would help you in fetching logs data for a particular time range. Just remember that since RowKey is of String data type, you may want to pre-pad it with zeros so that all values are of same length. For example,
DateTime.UtcNow.Ticks.ToString("d20")
With this, you could take 2 approaches:
Store them in chronological order as shown in example above.
Store them in reverse chronological order. The advantage of this approach is that the latest entries will always be added on top of the table. So you could just query the table on PartitionKey and get top 'x' rows and they will be latest. You will do something like:
(DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20")
You also mentioned in your comment that you "expect the data sets to be quite large". I hope you are not using a single PartitionKey, because if the number of records is large and all of them are put in the same partition, performance might suffer.
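To show why the zero padding and the reverse ordering matter, here is a small Python sketch of the same two row key schemes. It assumes .NET-style ticks (100-nanosecond intervals since 0001-01-01) and naive UTC datetimes; the function names are made up for this example.

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1, 1, 1)       # .NET ticks count from 0001-01-01T00:00:00
MAX_TICKS = 3155378975999999999  # value of DateTime.MaxValue.Ticks in .NET


def to_ticks(dt: datetime) -> int:
    """Convert a naive UTC datetime to 100-nanosecond ticks since 0001-01-01."""
    return ((dt - _EPOCH) // timedelta(microseconds=1)) * 10


def chronological_key(dt: datetime) -> str:
    """Zero-padded key: oldest entries sort first."""
    return f"{to_ticks(dt):020d}"


def reverse_chronological_key(dt: datetime) -> str:
    """Zero-padded key: newest entries sort first."""
    return f"{MAX_TICKS - to_ticks(dt):020d}"


if __name__ == "__main__":
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    print(chronological_key(now))
    print(reverse_chronological_key(now))
```

Because RowKey values are compared as strings, the fixed width is what makes string order agree with numeric order.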

How to structure an Azure Table to hold user messages

I'm still trying to get my head around the correct way to use Azure Tables. I understand that they have a partition key and a row key, and that's it. Everything else is just data that you keep in that row.
Use Case
My web app gets files uploaded by a user, puts them in a queue, then has a worker role process the queue and do analytics on those files.
I would like to put messages about those files in an Azure Table based on what we find when we process those files.
I then plan on making an AJAX call to get a member's messages when they visit a webpage. If the user clicks on the message or closes the message, then I'll delete it from the table. Very StackOverflowish.
Question
My question is on how to best store these messages in my Azure Table.
Here's my thinking so far:
PartitionKey: MemberID
RowKey: ???(not sure what to have)
Column Data: Message data including any links and a time stamp. Probably a view count too.
I can't think of what I would put in a separate index for the row key. Timestamp could work so I can order messages correctly, but I don't think I'll get much bang for my buck with that.
I have found that the best way to think about the choice of partition and row keys is to think about the data access patterns. Typically a single row/entity represents something meaningful in your system; in your case it sounds like userid/fileid uniquely identifies the entity. From this, you have three options:
userid for partition key, fileid for row key
constant value for partition key, and a combination of userid and fileid for row key
constant value for row key, and a combination of userid and fileid for partition key
The deciding factor is what other access patterns you have. Are you going to be querying for all files for a particular user? Then you would want userid as partition or row key. If you will only ever be querying based on fileid/userid, then it doesn't really matter.
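As a rough illustration of the three layouts, the Python sketch below builds the (PartitionKey, RowKey) pair for each option and uses a tiny in-memory stand-in for a table. Everything here is hypothetical: the separator, the constant values, and the `InMemoryTable` class are assumptions for the example, not an Azure API.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

SEP = "_"  # assumed separator; user and file ids are assumed not to contain it


def keys_user_partition(user_id: str, file_id: str) -> Tuple[str, str]:
    """Option 1: userid as partition key, fileid as row key."""
    return user_id, file_id


def keys_constant_partition(user_id: str, file_id: str) -> Tuple[str, str]:
    """Option 2: constant partition key, userid plus fileid as row key."""
    return "messages", f"{user_id}{SEP}{file_id}"


def keys_constant_row(user_id: str, file_id: str) -> Tuple[str, str]:
    """Option 3: userid plus fileid as partition key, constant row key."""
    return f"{user_id}{SEP}{file_id}", "message"


class InMemoryTable:
    """Toy model of a table: partitions map row keys to entities."""

    def __init__(self) -> None:
        self._partitions: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def upsert(self, pk: str, rk: str, entity: dict) -> None:
        self._partitions[pk][rk] = entity

    def get(self, pk: str, rk: str) -> Optional[dict]:
        """Point lookup by full key."""
        return self._partitions.get(pk, {}).get(rk)

    def query_partition(self, pk: str) -> List[dict]:
        """All entities in one partition, ordered by row key."""
        rows = self._partitions.get(pk, {})
        return [rows[rk] for rk in sorted(rows)]

    def delete(self, pk: str, rk: str) -> None:
        self._partitions.get(pk, {}).pop(rk, None)


if __name__ == "__main__":
    table = InMemoryTable()
    for file_id in ("report.csv", "notes.txt"):
        pk, rk = keys_user_partition("user42", file_id)
        table.upsert(pk, rk, {"file": file_id, "text": "processed"})
    # With option 1, "all messages for a user" is a single-partition query.
    print(table.query_partition("user42"))
```

With option 1 the per-user listing touches one partition, while options 2 and 3 only serve lookups where both ids are known, or need a scan or range filter for anything else.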
Erick
Before thinking about actual storage, you should try to think about what entities you're going to have.
Sounds like something like this:
User entity
UserFile entity
FileMessage entity
Do you have one FileMessage per UserFile, or can you have more than one? It sounds like (going by your explanation of the deletion logic) you would only have one FileMessage per File.
If my assumptions so far are correct and if it were me, the FileMessage table would have the following structure:
PartitionKey: userId
RowKey: fileId (name/url/etc)
Other columns: as you see fit
HTH
I would think of it this way: the Partition Key is how you want to break the data out, so if data is related, you want to keep the partition key the same. If you are working with a lot of data, you may want to use something like the date for the Partition Key. The Row Key is the index, so that is what you will use to query the data.
