Does TableBatchOperation guarantee that all entities will have the same Timestamp field value?

I'm adding multiple entries to Azure CloudTable:
TableBatchOperation tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
    tableBatchOperation.InsertOrReplace(entity);
}
table.ExecuteBatch(tableBatchOperation);
Is there any guarantee that all entries inserted / updated in this batch operation will have the same Timestamp property value?

The short answer is: entities inserted in the same batch can have different timestamps.
It depends on the batch size and, I suspect, on the current load of the Table Service.
I wrote a simple unit test to check this (you can find it here). In one batch of 100 items (each with a 30 KB string property) I can see a few different timestamps (in ticks):
635516539271235769
635516539271245771
635516539271225762
For smaller batches the timestamp is sometimes the same.
The differences are really small (a few ticks), but I would definitely not depend on Timestamp, since it's an internal Azure Table Service property and it changes on every update.
I would rather add another property to the entity with a batch timestamp.
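For example, a minimal sketch of that idea, assuming the entities expose a custom BatchTimestamp property (the property name is illustrative):
// Stamp every entity with the same client-generated value before executing the batch,
// instead of relying on the service-managed Timestamp property.
DateTimeOffset batchTimestamp = DateTimeOffset.UtcNow;
TableBatchOperation tableBatchOperation = new TableBatchOperation();
foreach (var entity in entities)
{
    entity.BatchTimestamp = batchTimestamp; // custom property declared on the entity class
    tableBatchOperation.InsertOrReplace(entity);
}
table.ExecuteBatch(tableBatchOperation);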

Related

New item inserted in Azure Table Storage is not immediately available

I have:
an endpoint in an Azure Function called "INSERT" that inserts a record in Table Storage using a batch operation.
an endpoint in a different Azure Function called "GET" that gets a record from Table Storage.
If I insert an item and then immediately get that same item, then the item has not appeared yet!
If I delay by one second after saving, then I find the item.
If I delay by 10ms after saving, then I don't find the item.
I see the same symptom when updating an item. I set a date field on the item. If I get immediately after updating, then sometimes the date field is not set yet.
Is this known behavior in Azure Table Storage? I know about ETags as described here but I cannot see how it applies to this issue.
I cannot easily provide a code sample because this is distributed among multiple functions, and I think if I put it in a simpler example, there would be some mechanism that would see I am calling from the same IP or with the same client and manage to return the recently saved item.
As mentioned in the comments, Azure Table Storage is Strongly Consistent. Data is available to you as soon as it is written to Storage.
This is in contrast with the Cosmos DB Table API, where there are multiple consistency levels and data may not be immediately available to read after it is written, depending on the consistency level set.
The issue was related to my code and queues running in the background.
I had shut down the Function that has queue triggers, but to my surprise I found that the Function in my staging slot was picking items off the queue. That is why it made a difference whether I delayed for a second or two.
As for the second part, why a date field seemingly was not set as fast as I could read it back: it turns out I had filtered by columns, like this:
var operation = TableOperation.Retrieve<Entity>(partitionKey, id, new List<string> { "Content", "IsDeleted" });
To make matters worse, the class "Entity" that I deserialize to of course had default primitive values (such as false), so it wasn't obvious that those properties simply weren't being returned.
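In other words, any property you want back has to be listed in the projection, or the column list omitted entirely. A minimal sketch, using "DateModified" as a stand-in name for the date column in question:
// Either include every column you need in the selected-columns list...
var operation = TableOperation.Retrieve<Entity>(partitionKey, id, new List<string> { "Content", "IsDeleted", "DateModified" });

// ...or drop the column list entirely to retrieve the full entity.
var fullOperation = TableOperation.Retrieve<Entity>(partitionKey, id);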
So the answer does not have much to do with the question. In summary, for anyone finding this question because they are wondering the same thing:
The answer is YES - Table Storage is in fact strongly consistent and it doesn't matter whether you're 'very fast' or connect from another location.

MongoDB (+ Node/Mongoose): How to track processed records without "marking" all of them?

I have several large "raw" collections of documents which are processed in a queue, and the processed results are all placed into a single collection.
The queue only runs when the system isn't otherwise indisposed, and new data is being added into the "raw" collections all the time.
What I need to do is make sure the queue knows which documents it has already processed, so it doesn't either (a) process any documents more than once, or (b) skip documents. Updating each raw record with a "processed" flag as I go isn't a good option because it adds too much overhead.
I'm using MongoDB 4.x, with NodeJS and Mongoose. (I don't need a strictly mongoose-powered answer, but one would be OK).
My initial attempt was to do this by retrieving the raw documents sorted by _id in a smallish batch (say 100), then grabbing the first and last _id values in the return result, and storing those values, so when I'm ready to process the next batch, I can limit my find({}) query to records with an _id greater than what I stored as the last-processed result.
But looking into it a bit more, unless I'm misunderstanding something, it appears I can't really count on a strict ordering by _id.
I've looked into ways to implement an auto-incrementing numeric ID field (SQL style), which would have a strict ordering, but the solutions I've seen look like they add a nontrivial amount of overhead each time I create a record (not dissimilar to what it would take to mark processed records, just would be on the insertion end instead of the processing end), and this system needs to process a LOT of records very fast.
Any ideas? Is there a way to do an auto-incrementing numeric ID that's super efficient? Will default _id properties actually work in this case and I'm misunderstanding? Is there some other way to do it?
As per the documentation of ObjectId:
While ObjectId values should increase over time, they are not necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
So if you are creating that many records per second, then _id ordering is not for you.
However, the BSON Timestamp within a mongod instance is guaranteed to be unique:
BSON has a special timestamp type for internal MongoDB use and is not associated with the regular Date type. Timestamp values are a 64 bit value where:
the first 32 bits are a time_t value (seconds since the Unix epoch)
the second 32 bits are an incrementing ordinal for operations within a given second.
Within a single mongod instance, timestamp values are always unique.
Although the documentation clearly states that this is for internal use, it may be something for you to consider. Assuming you are dealing with a single mongod instance, you can decorate your records with timestamps as they go into the "raw" collections, and then you only need to remember the last processed record. Your queue would only pick records with timestamps larger than the last processed timestamp.
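A minimal sketch of that idea using the official MongoDB driver (shown in C#; the Node.js driver exposes an equivalent Timestamp type). It relies on the documented mongod behavior of replacing an empty top-level Timestamp field with the current timestamp on insert; the collection and field names are illustrative:
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var raw = client.GetDatabase("mydb").GetCollection<BsonDocument>("raw");

// On insert, include an empty BSON Timestamp; mongod fills it in server-side,
// so ordering is consistent even if the writing clients have skewed clocks.
raw.InsertOne(new BsonDocument
{
    { "payload", "..." },
    { "ingestTs", new BsonTimestamp(0, 0) }
});

// In the queue: fetch the next batch strictly after the last processed timestamp,
// ordered by that timestamp, then persist the highest ingestTs you processed.
BsonTimestamp lastProcessed = new BsonTimestamp(0, 0); // replace with the value persisted by the previous run
var batch = raw.Find(Builders<BsonDocument>.Filter.Gt("ingestTs", lastProcessed))
               .Sort(Builders<BsonDocument>.Sort.Ascending("ingestTs"))
               .Limit(100)
               .ToList();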

What is the best way of creating Partition key in azure table storage for storing sensor output data?

I searched for best practices for storing sensor output data in Azure Table Storage but didn't find a good answer. I am currently working on a project that stores sensor data in Azure Table Storage. Currently I am using the sensor ID as the partition key, and I store the sensor outputs every second. About 100 sensors are currently in use, so a large amount of data is stored every day. As a result, I am getting slow performance in my web application when I search a particular sensor's data by date. Is there a better way to improve the performance of the web app? How about using the date instead of the sensor ID as the partition key? Code is not important here; I need a logical solution. Maybe this question will help a lot of developers working on such a scenario.
UPDATE
Each sensor provides 10 different outputs plus a date, which is the output datetime; they are all stored in the same row for each sensor ID. I query sensor data by date range and sensor ID.
PartitionKey - sensor id, RowKey - datetime, 10 output columns and the output date.
Here is my code:
var query = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID);
var dateFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.GreaterThanOrEqual, Convert.ToDateTime(from)),
    TableOperators.And,
    TableQuery.GenerateFilterConditionForDate("outputdate", QueryComparisons.LessThanOrEqual, Convert.ToDateTime(to))
);
query = TableQuery.CombineFilters(query, TableOperators.And, dateFilter);
var rangeQuery = new TableQuery<TotalizerTableEntity>().Where(query);
var entitys = table.ExecuteQuery(rangeQuery).OrderBy(j => j.date).ToList();
outputdate indicates the time the output was generated. It is stored as a DateTime. All outputs in a row have the same output time.
First, I would highly recommend that you read Azure Storage Table Design Guide: Designing Scalable and Performant Tables. This will give you a lot of ideas about how to structure your data.
Now coming to your current implementation. What I am noticing is that you're including PartitionKey in your query (which is very good, BTW) but then also adding a non-indexed attribute (outputdate) to your query. This results in what is known as a Partition Scan. For larger tables this becomes a problem, because your query will scan the entire partition for entities matching the outputdate attribute.
You mentioned that you're storing the datetime value as the RowKey. Assuming the RowKey value matches the value of the output date, I would recommend using RowKey in your query instead of this non-indexed attribute. RowKey and PartitionKey are the only two indexed attributes in a table, so the query will be comparatively much faster.
When saving the date/time as the RowKey, I would recommend converting it into ticks (DateTime.Ticks) and saving that instead of simply converting the date/time value to a string. If you go with this approach, I would suggest zero-padding the ticks so that all values have the same length (e.g. DateTime.Ticks.ToString("d19")).
You can also save the RowKey as reverse ticks, i.e. (DateTime.MaxValue.Ticks - DateTime.Ticks).ToString("d20"). This ensures that the latest entries are added to the top of the table instead of the bottom, which helps in scenarios where you are more interested in querying the latest records.
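For example, a minimal sketch of both suggestions, using a reverse-tick RowKey (variable names are illustrative):
// Write side: store the output time as a reverse-tick RowKey so the newest rows sort first.
string rowKey = (DateTime.MaxValue.Ticks - outputDate.Ticks).ToString("d20");
var entity = new TotalizerTableEntity { PartitionKey = sensorID, RowKey = rowKey /*, output columns ... */ };
table.Execute(TableOperation.InsertOrReplace(entity));

// Read side: query on PartitionKey + RowKey range instead of the non-indexed outputdate column.
// Note: with reverse ticks, the later date produces the smaller RowKey, so the bounds swap.
string upper = (DateTime.MaxValue.Ticks - Convert.ToDateTime(from).Ticks).ToString("d20");
string lower = (DateTime.MaxValue.Ticks - Convert.ToDateTime(to).Ticks).ToString("d20");
string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, sensorID),
    TableOperators.And,
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, lower),
        TableOperators.And,
        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, upper)));
var results = table.ExecuteQuery(new TableQuery<TotalizerTableEntity>().Where(filter)).ToList();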
If you will always query for a particular sensor, it may not hurt to save the data for each sensor in a separate table, i.e. each sensor gets its own table. This frees up one key for you: you can use the date/time value (which you're currently storing as the RowKey) as the PartitionKey and some other value as the RowKey. Furthermore, it allows you to scale across storage accounts: data for some sensors goes into one storage account while data for other sensors goes into another. You just need to store this relationship somewhere so that the data reaches the correct storage account/table.

Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: the name of the log.
RowKey: inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
Take a look at our new Table Design Patterns Guide, specifically the log-data anti-pattern, as it talks about this scenario and alternatives. Often when people write log files they use a date for the PK, which results in a hot partition because all writes go to a single partition. Quite often Blobs end up being a better destination for log data, as people typically end up processing the logs in batches anyway; the guide talks about this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. You can also mix the hash code(s) of any additional key/value pairs into the message hash code if you want, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can query with a partition key range that starts at the point in time you want to retrieve logs from, usually something like 1 minute or 1 hour before the current date/time. You can then page backwards another minute or hour by shifting the date/time stamp. This avoids the weird date/time hack that suggests subtracting the timestamp from DateTime.MaxValue.
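For instance, a minimal sketch of such a page query against the partition key format above (the logTable variable is an illustrative CloudTable reference):
// A "page" of logs = all partition keys that fall inside a time window.
// The bounds use the same format as the partition key above so that
// lexicographic comparison lines up with chronological order.
DateTime windowEnd = DateTime.UtcNow;
DateTime windowStart = windowEnd.AddHours(-1); // page backwards by shifting this window

string lowerBound = windowStart.ToString("o").Trim('Z', '0');
string upperBound = windowEnd.ToString("o").Trim('Z', '0');

string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, lowerBound),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, upperBound));

var logs = logTable.ExecuteQuery(new TableQuery<DynamicTableEntity>().Where(filter)).ToList();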
If you get extra fancy and put a search service on top of the Azure table storage, then you can look up key/value pairs quickly.
This will be much cheaper than Application Insights (which I would suggest disabling) if you are using Azure Functions. If you need multiple log names, just add another table.

Azure Table Storage batch insert with potentially pre-existing rowkeys

I'm trying to send a simple batch of Insert operations to Azure Table Storage, but it seems that the whole batch transaction is invalidated and, using the managed Azure storage client, the ExecuteBatch method itself throws an exception if a single Insert in the batch targets a pre-existing record (using the 2.0 client):
public class SampleEntity : TableEntity
{
    public SampleEntity(string partKey, string rowKey)
    {
        this.PartitionKey = partKey;
        this.RowKey = rowKey;
    }
}

var acct = CloudStorageAccount.DevelopmentStorageAccount;
var client = acct.CreateCloudTableClient();
var table = client.GetTableReference("SampleEntities");

var foo = new SampleEntity("partition1", "preexistingKey");
var bar = new SampleEntity("partition1", "newKey");

var batchOp = new TableBatchOperation();
batchOp.Add(TableOperation.Insert(foo));
batchOp.Add(TableOperation.Insert(bar));

var result = table.ExecuteBatch(batchOp); // throws exception: "0:The specified entity already exists."
The batch-level exception is avoided by using InsertOrMerge, but then every individual operation response returns a 204, whether that particular operation resulted in an insert or a merge. So it seems it's impossible for the client application to know whether it, or another node in the cluster, inserted the record. Unfortunately, in my current case, this knowledge is necessary for some downstream synchronization.
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?
As you already know, since a batch is a transactional operation, you get an all-or-nothing kind of deal. One interesting thing about batch transactions is that you get the index of the first failed entity in the batch. So assuming you're trying to insert 100 entities in a batch and the 50th entity is already present in the table, the batch operation will give you the index of the failed entity (49 in this case).
Is there some configuration or technique to allow the batch of inserts to proceed and return the particular response code per-item without throwing a blanket exception?
I don't think so. The transaction would fail as soon as the first entity fails. It will not even attempt to process other entities.
Possible Solutions (Just thinking out loud :))
If I understand correctly, your key requirement is to identify whether an entity was inserted or merged (or replaced). For this, the approach would be to separate out the failed entities from the batch and process them separately. Based on this, I can think of two approaches (a sketch of the second one follows the list):
1. What you could do is split that batch into 3 batches: the 1st batch contains the 49 entities that precede the failed one, the 2nd batch contains just the 1 entity which failed, and the 3rd batch contains the remaining 50 entities. You could then insert all entities in the 1st batch, decide what you want to do with the failed entity, and try to insert the 3rd batch. You would need to repeat the process until the whole operation is complete.
2. Another idea would be to remove the failed entity from the batch and retry that batch. So in the example above, in your 1st attempt you'll try with 100 entities, in your 2nd attempt with 99 entities, and so on, keeping track of the failed entities all the while (along with the reason why they failed). Once the batch operation completes successfully, you can work with all the failed entities.
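A minimal sketch of the second approach with the classic storage client, assuming the failing operation's index can be parsed from the error message prefix shown above ("0:The specified entity already exists.") and that entities holds the full list to insert:
// Retry the batch, pulling out the entity that caused the failure each time.
var pending = new List<SampleEntity>(entities);   // entities we still want to insert
var failed = new List<SampleEntity>();            // entities that already existed (or otherwise failed)

while (pending.Count > 0)
{
    var batchOp = new TableBatchOperation();
    foreach (var entity in pending)
    {
        batchOp.Add(TableOperation.Insert(entity));
    }

    try
    {
        table.ExecuteBatch(batchOp);
        break; // whole batch succeeded
    }
    catch (StorageException ex)
    {
        // The error message is prefixed with the index of the failing operation, e.g. "0:...".
        string message = ex.RequestInformation.ExtendedErrorInformation.ErrorMessage;
        int failedIndex = int.Parse(message.Substring(0, message.IndexOf(':')));

        failed.Add(pending[failedIndex]);
        pending.RemoveAt(failedIndex);
    }
}
// "failed" now holds the entities that were not inserted; handle them separately.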
