Deconstructing ZonedDateTime - nodatime

I'm using NodaTime and looking to deconstruct a ZonedDateTime in order to save it to a SQL database. It seems to me there are a few options. I could deconstruct it to an Instant and a DateTimeZone and save them as a datetime2 and an nvarchar(50). Or I could deconstruct it to a DateTimeOffset and DateTimeZone, or a LocalDateTime and DateTimeZone, and in either case save them as a datetimeoffset and an nvarchar(50).
Is there a difference, or a reason to choose one over the other?
The only thing I can think of is that the datetimeoffset plus nvarchar(50) might be better in case the db is ever read by a service that doesn't have as robust a timezone -> offset conversion system as NodaTime. In that situation I've at least captured what the offset was, in that timezone, at that point in time, which is lost (or at least needs to be recalculated from historical timezone information) with a datetime2 plus nvarchar(50) approach.
Are there other considerations I'm missing?

I would suggest using a datetimeoffset and a separate time zone ID. I'm assuming datetimeoffset still allows you to perform a total ordering (i.e. by instant) - although I suppose it's possible that that is less efficient than if you've stored a datetime2. It also may well take more space in the database, given that it's storing more data.
Even if the database is read by a service that does have time zone conversion operations, storing the offset in the database allows you to perform queries over the data based on the local date, e.g. "Show me all my appointments on Tuesday". You can't perform that query purely database-side if you only have instants.
One other thing you might want to consider if you're storing future date/time values is that the predicted time zone offset may change due to changes in rules. If your original input data was a local date/time (which is usually the case if you're working with ZonedDateTime) then the datetimeoffset approach stores "what the user gave you" plus the inferred offset - you can easily then update all the data with a later version of the time zone database if necessary. If you only have the computed instant, you'd need to work out what the original local date/time was in the "old" time zone database before adjusting it to the "new" time zone database. That may also have lost information, e.g. if the input value used to be ambiguous (so you picked one offset or the other) but no longer is.
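For reference, a minimal sketch of the round trip with the datetimeoffset plus zone ID approach, assuming the TZDB provider (the method names are illustrative, not anything prescribed by NodaTime):

using System;
using NodaTime;

// Deconstruct: these two values map onto a datetimeoffset column and an nvarchar(50) column.
static (DateTimeOffset value, string zoneId) Deconstruct(ZonedDateTime zdt) =>
    (zdt.ToDateTimeOffset(), zdt.Zone.Id);

// Reconstruct: the offset pins the exact instant; the zone ID restores the time zone.
static ZonedDateTime Reconstruct(DateTimeOffset value, string zoneId) =>
    Instant.FromDateTimeOffset(value).InZone(DateTimeZoneProviders.Tzdb[zoneId]);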


Difference between Search and Incremental Exports?

Incremental Exports:
I query at a certain time for all records which have changed since a specific time. I would personally use the max updated_at timestamp in my database.
https://developer.zendesk.com/rest_api/docs/support/incremental_export
Search:
I can query a certain endpoint/table for all tickets which have been updated since the max updated_at timestamp in my database.
https://developer.zendesk.com/rest_api/docs/support/search
path='/api/v2/search.json', query='query=type:ticket updated>=2019-06-10T00:00:00Z'
It seems like both of these methods achieve the same goal, but I want to be certain that I choose the right one and that there are no caveats or issues I will run into later.
I assume that if I keep track of the max update timestamp that I have already retrieved, then I can always pull new/changed records >= from that timestamp (and only have minor duplication that I need to address from records with the exact same timestamp). Any suggestions?
The main difference between Search and Incremental Export is the query.
You might think that start_time in an Incremental Export and a Search filter such as country=US&toll_free=true amount to the same thing, but they don't.
start_time is not really a query at all: it is just a starting point, and you have no say over what values come back after it. A Search query, on the other hand, expresses conditions on field values: "give me only the records whose values match this".
I hope that clears up the difference.
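To make the contrast concrete, here is a minimal sketch of both calls; the subdomain, credentials, and timestamps are placeholders, and the incremental tickets path comes from the docs linked above:

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

var http = new HttpClient { BaseAddress = new Uri("https://yoursubdomain.zendesk.com") };
var token = Convert.ToBase64String(Encoding.UTF8.GetBytes("agent@example.com/token:YOUR_API_TOKEN"));
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", token);

// Incremental export: start_time is a cursor, not a filter -- "everything changed at or after this point".
long startTime = DateTimeOffset.UtcNow.AddDays(-1).ToUnixTimeSeconds();
var export = await http.GetStringAsync($"/api/v2/incremental/tickets.json?start_time={startTime}");

// Search: the query expresses conditions on field values -- "only tickets whose updated field matches".
var search = await http.GetStringAsync(
    "/api/v2/search.json?query=" + Uri.EscapeDataString("type:ticket updated>=2019-06-10T00:00:00Z"));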

MongoDB (+ Node/Mongoose): How to track processed records without "marking" all of them?

I have several large "raw" collections of documents which are processed in a queue, and the processed results are all placed into a single collection.
The queue only runs when the system isn't otherwise indisposed, and new data is being added into the "raw" collections all the time.
What I need to do is make sure the queue knows which documents it has already processed, so it doesn't either (a) process any documents more than once, or (b) skip documents. Updating each raw record with a "processed" flag as I go isn't a good option because it adds too much overhead.
I'm using MongoDB 4.x, with NodeJS and Mongoose. (I don't need a strictly mongoose-powered answer, but one would be OK).
My initial attempt was to do this by retrieving the raw documents sorted by _id in a smallish batch (say 100), then grabbing the first and last _id values in the return result, and storing those values, so when I'm ready to process the next batch, I can limit my find({}) query to records with an _id greater than what I stored as the last-processed result.
But looking into it a bit more, unless I'm misunderstanding something, it appears I can't really count on a strict ordering by _id.
I've looked into ways to implement an auto-incrementing numeric ID field (SQL style), which would have a strict ordering, but the solutions I've seen look like they add a nontrivial amount of overhead each time I create a record (not dissimilar to marking processed records, just on the insertion end instead of the processing end), and this system needs to process a LOT of records very fast.
Any ideas? Is there a way to do an auto-incrementing numeric ID that's super efficient? Will default _id properties actually work in this case and I'm misunderstanding? Is there some other way to do it?
As per the documentation of ObjectID:
While ObjectId values should increase over time, they are not necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
So if you are creating that many records per second then _id ordering is not for you.
However, the BSON Timestamp within a single mongod instance is guaranteed to be unique:
BSON has a special timestamp type for internal MongoDB use and is not associated with the regular Date type. Timestamp values are a 64 bit value where:
the first 32 bits are a time_t value (seconds since the Unix epoch)
the second 32 bits are an incrementing ordinal for operations within a given second.
Within a single mongod instance, timestamp values are always unique.
Although it clearly states that this is for internal use, it may be something for you to consider. Assuming you are dealing with a single mongod instance, you can decorate your records with timestamps as they go into the "raw" collections ... then you only need to remember the last processed record's timestamp. Your queue would then pick only records with timestamps larger than the last processed timestamp.
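To make that concrete, here is a minimal sketch of the idea. The question is about Node/Mongoose, but the approach is driver-agnostic, so this uses the official MongoDB .NET driver for illustration; the database, collection, and field names ("mydb", "raw", "ingestedAt") are made up, and it relies on the documented behaviour that the server replaces an empty (0,0) BSON timestamp in a top-level field with its current timestamp on insert.

using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var db = client.GetDatabase("mydb");                  // hypothetical database name
var raw = db.GetCollection<BsonDocument>("raw");      // one of the "raw" collections

// On insert: the empty (0,0) timestamp is documented to be replaced server-side,
// giving each record a value that is unique within a single mongod instance.
await raw.InsertOneAsync(new BsonDocument
{
    { "payload", "..." },
    { "ingestedAt", new BsonTimestamp(0, 0) }
});

// When the queue runs: fetch the next batch strictly after the saved checkpoint.
var lastProcessed = new BsonTimestamp(0, 0);          // load this from wherever you persist it
var batch = await raw.Find(Builders<BsonDocument>.Filter.Gt("ingestedAt", lastProcessed))
                     .Sort(Builders<BsonDocument>.Sort.Ascending("ingestedAt"))
                     .Limit(100)
                     .ToListAsync();

if (batch.Count > 0)
{
    // ... process the batch, then persist the new checkpoint value.
    var newCheckpoint = batch[^1]["ingestedAt"].AsBsonTimestamp;
}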

Unreliable performance in Azure Table Storage when joining point queries together

I am having a consistent problem with the performance of Azure Table Storage. I'm querying a table which holds user accounts. The table stores the userId in both the PartitionKey and RowKey so I can easily make point queries.
My issue arises because in several cases I need to retrieve multiple users in a single query. To achieve that I have a class which builds filter strings for me. The manner in which this works is not relevant to the problem, but this is an example of the output:
(PartitionKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0' and RowKey eq '00540de6-dd2b-469f-8730-e7800e06ccc0') or
(PartitionKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99' and RowKey eq '02aa11b7-974a-4ee9-9a8e-5fc09970bb99') or
(PartitionKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82' and RowKey eq '040aec50-ebcd-4e5d-8f58-82aea616bd82') or
// up to 22 more (25 total)
On first execution the query takes a long time, between 2 and 5 seconds, and is missing data, which leads to errors. When run a second time the query takes between 0.2 and 0.5 seconds and returns all the data.
Note that I also tried supplying just the PartitionKey; it made no difference. I had assumed that a point query would perform better.
From this presentation of the bug I can only presume it's caused by the data being 'cold' when first requested and then pulled from a 'hot' cache upon successive requests.
If this is the case, how can I change the filter string to improve performance? Alternatively, how can I change the timeout of the table storage query to give it more time to complete? Is it possible to increase the scaling of my table storage?
Don't use point queries concatenated with 'or': Azure Table Storage can't treat such a filter as multiple point queries. Instead it treats it as a full table scan, which is terrible for performance. You should execute the 25 point queries separately to improve performance.
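For reference, a minimal sketch of issuing the point queries separately and in parallel, assuming the classic table SDK (Microsoft.Azure.Cosmos.Table or the older WindowsAzure.Storage); the table and the ID list are placeholders:

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos.Table;

static async Task<List<DynamicTableEntity>> GetUsersAsync(CloudTable table, IEnumerable<string> userIds)
{
    // One point query per user, issued in parallel; each is a direct PartitionKey + RowKey lookup.
    var lookups = userIds
        .Select(id => table.ExecuteAsync(TableOperation.Retrieve<DynamicTableEntity>(id, id)))
        .ToList();

    var results = await Task.WhenAll(lookups);

    return results
        .Select(r => r.Result as DynamicTableEntity)
        .Where(e => e != null)
        .ToList();
}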

Strategy for storing application logs in Azure Table Storage

I am trying to determine a good strategy for storing logging information in Azure Table Storage. I have the following:
PartitionKey: The name of the log.
RowKey: Inverted DateTime ticks.
The only issue here is that partitions could get very large (millions of entities) and the size will increase with time.
But that being said, the type of queries being performed will always include the PartitionKey (no scanning) AND a RowKey filter (a minor scan).
For example (in a natural language):
where `PartitionKey` = "MyApiLogs" and
where `RowKey` is between "01-01-15 12:00" and "01-01-15 13:00"
Provided that the query is done on both PartitionKey and RowKey, I understand that the size of the partition doesn't matter.
Take a look at our new Table Design Patterns Guide - specifically the log-data anti-pattern, as it talks about this scenario and alternatives. Often when people write log files they use a date for the PartitionKey, which results in a hot partition because all writes go to a single partition. Quite often blobs end up being a better destination for log data, as people typically end up processing the logs in batches anyway; the guide talks about this as an option.
Adding my own answer so people can have something inline without needing external links.
You want the partition key to be the timestamp plus the hash code of the message. This is good enough in most cases. You can add to the hash code of the message the hash code(s) of any additional key/value pairs as well if you want, but I've found it's not really necessary.
Example:
string partitionKey = DateTime.UtcNow.ToString("o").Trim('Z', '0') + "_" + ((uint)message.GetHashCode()).ToString("X");
string rowKey = logLevel.ToString();
DynamicTableEntity entity = new DynamicTableEntity { PartitionKey = partitionKey, RowKey = rowKey };
// add any additional key/value pairs from the log call to the entity, i.e. entity["key"] = value;
// use InsertOrMerge to add the entity
When querying logs, you can filter on a partition key representing the start of the range you want to retrieve, usually something like 1 minute or 1 hour before the current date/time. You can then page backwards another minute or hour by using a different date/time stamp. This avoids the weird hack of subtracting the date/time stamp from DateTime.MaxValue.
If you get extra fancy and put a search service on top of the Azure table storage, then you can lookup key/value pairs quickly.
This will be much cheaper than Application Insights (which I would suggest disabling) if you are using Azure Functions. If you need multiple log names, just add another table.
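To make the paging concrete, here is a minimal sketch of querying one hour of logs by partition key range, assuming the same classic table SDK; logTable (a CloudTable) and the one-hour window are assumptions:

using System;
using Microsoft.Azure.Cosmos.Table;

// Boundaries built the same way as the partition keys above.
string from = DateTime.UtcNow.AddHours(-1).ToString("o").Trim('Z', '0');
string to = DateTime.UtcNow.ToString("o").Trim('Z', '0');

string filter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, from),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, to));

var query = new TableQuery<DynamicTableEntity>().Where(filter);
TableContinuationToken token = null;
do
{
    // Page through this hour of logs; shift the window back to go further into the past.
    var segment = await logTable.ExecuteQuerySegmentedAsync(query, token);
    token = segment.ContinuationToken;
    // ... handle segment.Results here
} while (token != null);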

Updating an object to Azure Table Storage - is there any way to get the new Timestamp?

I'm updating an object in AzureTableStorage using the StorageClient library with
context.UpdateObject(obj);
context.SaveChangesWithRetries(obj);
when I do this, is there any way to get hold of the new timestamp for obj without making another request to the server?
Thanks
Stuart
To supplement Seva Titov's answer: the excerpt reported was valid at least until May 2013, but as of November 2013 it has changed (emphasis added):
The Timestamp property is a DateTime value that is maintained on the server side to record the time an entity was last modified. The Table service uses the Timestamp property internally to provide optimistic concurrency. The value of Timestamp is a monotonically increasing value, meaning that each time the entity is modified, the value of Timestamp increases for that entity. This property should not be set on insert or update operations (the value will be ignored).
Now the Timestamp property is no longer regarded as opaque, and it is documented that its value increases after each edit -- this suggests that Timestamp could now be used to track subsequent updates (at least with regard to a single entity).
Nevertheless, as of November 2013 another request to Table Storage is still needed to obtain the new timestamp when you update an entity (see the documentation of the Update Entity REST method). Only when inserting an entity does the REST service return the entire entity with its timestamp (but I don't remember whether this is exposed by the StorageClient/Windows Azure storage library).
The MSDN page has some guidance on the usage of the Timestamp field:
Timestamp Property
The Timestamp property is a DateTime value that is maintained on the server side to record the time an entity was last modified. The Table service uses the Timestamp property internally to provide optimistic concurrency. You should treat this property as opaque: It should not be read, nor set on insert or update operations (the value will be ignored).
This implies that it is really an implementation detail of Table Storage; you should not rely on the Timestamp field to represent the time of the last update.
If you want a field which is guaranteed to represent the time of the last write, create a new field and set it on every update operation. I understand this is more work (and more storage space) to maintain the field, but it automatically resolves your question of how to get the timestamp back, because you would already know it when calling context.UpdateObject().
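If you go that route, here is a minimal sketch using the same StorageClient data-service context as in the question; "LastModifiedUtc" is a property name made up for illustration:

using System;
using Microsoft.WindowsAzure.StorageClient;

public class MyEntity : TableServiceEntity
{
    // A client-maintained field, independent of the service-assigned Timestamp.
    public DateTime LastModifiedUtc { get; set; }
}

static void TouchAndSave(TableServiceContext context, MyEntity obj)
{
    obj.LastModifiedUtc = DateTime.UtcNow;   // set it yourself on every update
    context.UpdateObject(obj);
    context.SaveChangesWithRetries();
    // The client already knows the value that was written -- no second request needed.
}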
The Timestamp property is actually a Lamport timestamp. It is guaranteed to always grow over time and while it is presented as a DateTime value it's really not.
On the server side, that is, in Windows Azure Storage, the following happens for each change:
nextTimestamp = Math.Max(currentTimestamp + 1, DateTime.UtcNow)
This is all there is to it. And it's of course guaranteed to happen in a transactional manner. The point of all this is to provide a logical clock (a monotonic function) that can be used to ensure that events happen in the intended order.
Here's a link to a version of the actual WAS paper, and while it doesn't contain any information on the timestamp scheme specifically, it has enough stuff there that you quickly realize there's only one logical conclusion you can draw from this. Anything else would be stupid. Also, if you have any experience with LevelDB, Cassandra, Memtables and their ilk, you'll see that the WAS team went the same route.
Though I should add to clarify, since WAS provides a strong consistency model, the only way to maintain the timestamp is to do it under lock and key, so there's no way you can guess the correct next timestamp. You have to query WAS for the information. There's no way around that. You can however hold on to an old value and presume that it didn't change. WAS will tell you if it did and then you can resolve the race condition any way you see fit.
I am using Windows Azure Storage 7.0.0
And you can check the result of the operation to get the ETag and Timestamp properties:
var tableResult = cloudTable.Execute(TableOperation.Replace(entity));
var updatedEntity = tableResult.Result as ITableEntity;
var eTag = updatedEntity.ETag;
var timestamp = updatedEntity.Timestamp;
I don't think so; as far as I know, Timestamp and ETag are set by Azure Storage itself.
