Azure Function CosmosDbTrigger (Start from the beginning option)

I have an Azure Function with a Cosmos DB trigger that makes some calculations and writes the results to the database. If something goes wrong, I want the ability to start again from the first item (or from a specific item) and redo the calculations. Is that possible? Thanks.
public static void Run([CosmosDBTrigger(
databaseName: "db",
collectionName: "collection",
ConnectionStringSetting = "DocDbConnStr",
CreateLeaseCollectionIfNotExists = true,
LeaseCollectionName = "leases")]IReadOnlyList<Document> input, TraceWriter log)
{
...
}

Right now, the StartFromBeginning option is not exposed on the Cosmos DB Trigger. The default behavior is to start receiving changes from the moment the Function starts running; leases/checkpoints are generated so that if the Host/Runtime shuts down, it will pick up from the last checkpointed item when it comes back up.
The Trigger does not implement dead-lettering or error handling, because that could generate infinite loops, unexpected billing, or multiple processing of the same batch when the error is not related to the batch itself (for example, you process the documents and then send an email, and the email fails: the entire batch would be re-processed for an error unrelated to the feed). We therefore recommend that users implement their own try/catch or error-handling logic inside the Function's code. It's the same approach as the Event Hub Trigger.
That being said, we are in the process of exposing several new options on the Trigger, and a contributor is working on an advanced retry mechanism.
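To illustrate the try/catch recommendation with the Function from the question, here is a minimal per-document sketch (the error-handling body is just a placeholder; writing failed ids to an "errors" container or a queue is one option):
public static void Run([CosmosDBTrigger(
    databaseName: "db",
    collectionName: "collection",
    ConnectionStringSetting = "DocDbConnStr",
    CreateLeaseCollectionIfNotExists = true,
    LeaseCollectionName = "leases")]IReadOnlyList<Document> input, TraceWriter log)
{
    foreach (var document in input)
    {
        try
        {
            // calculations + write results to db for this document
        }
        catch (Exception ex)
        {
            // Record the failure here instead of letting the exception bubble up,
            // which would cause the whole batch to be re-processed.
            log.Error($"Failed to process document {document.Id}", ex);
        }
    }
}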

As @Matias Quaranta and @Pankaj Rawat say in the comments, the accepted answer is out of date and no longer true. You can set StartFromBeginning as a C# attribute property within your Azure Function's parameter list, like so:
[FunctionName(nameof(MyAzureFunction))]
public async Task RunAsync([CosmosDBTrigger(
databaseName: "myCosmosDbName",
collectionName: "myCollectionName",
ConnectionStringSetting = "cosmosConnectionString",
LeaseCollectionName = "leases",
CreateLeaseCollectionIfNotExists = true,
MaxItemsPerInvocation = 1000,
StartFromBeginning = true)]IReadOnlyList<Document> documents)
{
....
}
Please change the accepted answer.

The current offsets (positions in the Cosmos DB change feed) are managed by clients, in this case the Azure Functions runtime.
Functions store the offsets in the lease collection (it's called leases in your example).
To restart from a specific item, you would have to take a snapshot of the documents in the leases collection at some point, and then restore the leases collection to that snapshot when needed.
I am not familiar with a tool that automates that for you, other than generic tools working with Cosmos DB collections.
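If you want to script that, a rough sketch with the .NET SDK could look like the following; it assumes an existing DocumentClient named client, and the "leases-snapshot" collection name is made up. The Function host should be stopped while you restore.
// Snapshot: copy every lease document into a side collection.
var leasesUri = UriFactory.CreateDocumentCollectionUri("db", "leases");
var snapshotUri = UriFactory.CreateDocumentCollectionUri("db", "leases-snapshot");
var options = new FeedOptions { EnableCrossPartitionQuery = true };
foreach (Document lease in client.CreateDocumentQuery<Document>(leasesUri, options))
{
    await client.UpsertDocumentAsync(snapshotUri, lease);
}
// Restore: with the Function host stopped, copy the snapshot back over the leases
// collection so the Trigger resumes from those checkpoints.
foreach (Document lease in client.CreateDocumentQuery<Document>(snapshotUri, options))
{
    await client.UpsertDocumentAsync(leasesUri, lease);
}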

Check the startFromBeginning option available in Functions v2. Unfortunately, I'm still using v1 and am not able to verify it.
When set, it tells the Trigger to start reading changes from the beginning of the history of the collection instead of the current time. This only works the first time the Trigger starts, as in subsequent runs, the checkpoints are already stored. Setting this to true when there are leases already created has no effect.

Related

Force Azure CosmosDB Java SDK Read Latest Value

I am setting the TTL on my Cosmos container to 1 to force the deletion of all items. I then query SELECT VALUE COUNT(1) FROM c to check that all items are deleted before setting the TTL back to its previous value.
My issue is that I can see via the portal that the items are deleted, but my query via the SDK returns the "old", wrong value for an inordinately long time. Is there a way to force it to read the "real" value from the back end, or to establish a fresh connection, etc.?
// I create my client like so; setting ConsistencyLevel.STRONG will throw an error as
// it is a higher consistency level than the DB's
CosmosClient cosmosClient= new CosmosClientBuilder().endpoint(DATABASE_HOST)
.key(DATABASE_KEY)
.consistencyLevel(ConsistencyLevel.SESSION)
.contentResponseOnWriteEnabled(true)
.buildClient();
//get the database
CosmosDatabase dataBase = cosmosClient.getDatabase(databaseName);
return dataBase;
//get the container
CosmosContainer container = theDatabase.getContainer(containerProps.getId());
//update the TTL
containerProps.setDefaultTimeToLiveInSeconds(1);
container.replace(containerProps);
Thread.sleep(1000);
//now confirm that the container contents are deleted
//i tried refreshing my client/db/container objects to see if it would help
CosmosClient refreshedCosmosClient = createSyncCosmosClient();
CosmosDatabase refreshedDatabase = refreshedCosmosClient.getDatabase(DATABASE_NAME);
CosmosContainer refreshedContainer = refreshedDatabase.getContainer(container.getId());
//query the number of ITEMS in the container
CosmosPagedIterable<JsonNode> countOfDocs =
refreshedContainer.queryItems(CHECK_CONTAINER_EMPTY_QUERY, new CosmosQueryRequestOptions(),
JsonNode.class);
context.getLogger().info("wooooooooooooooooaaaa" +countOfDocs.toString());
//THIS VALUE IS NOT UP TO DATE. IT IS THE OLD VALUE
JsonNode count = countOfDocs.iterator().next();
int numberOfDocuments = count.asInt();
When you set the TTL to 1 second, yes, it will eventually delete all the documents, but this does not happen instantaneously. Depending on the volume of data, this can take some time. What happens is that documents in the process of being deleted by TTL cannot be seen by read operations (hence the COUNT shows 0).
If you disable TTL and there are documents still in the process of TTL deletion, then those are now back to being accessible (because the process by which they were being deleted is disabled).
Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/time-to-live
Deletion of expired items is a background task that consumes left-over Request Units, that is Request Units that haven't been consumed by user requests. Even after the TTL has expired, if the container is overloaded with requests and if there aren't enough RU's available, the data deletion is delayed. Data is deleted once there are enough RUs available to perform the delete operation. Though the data deletion is delayed, data is not returned by any queries (by any API) after the TTL has expired.
I can set the ConsistencyLevel to BOUNDED_STALENESS (5 seconds), and this seems to improve things. I cannot use ConsistencyLevel.STRONG due to the DB configuration.

Access CosmosDB from Azure Function (without input binding)

I have 2 collections in CosmosDB, Stocks and StockPrices.
StockPrices collection holds all historical prices, and is constantly updated.
I want to create Azure Function that listens to StockPrices updates (CosmosDBTrigger) and then does the following for each Document passed by the trigger:
Find stock with matching ticker in Stocks collection
Update stock price in Stocks collection
I can't do this with a CosmosDB input binding, as the CosmosDBTrigger passes a list (the input binding only works when the trigger passes a single item).
The only way I see this working is if I foreach on CosmosDBTrigger List, and access CosmosDB from my function body and perform steps 1 and 2 above.
Question: How do I access CosmosDB from within my function?
One of the CosmosDB binding forms is to get a DocumentClient instance, which provides the full range of operations on the container. This way, you should be able to combine the change feed trigger and the item manipulation into the same function, like:
[FunctionName("ProcessStockChanges")]
public async Task Run(
[CosmosDBTrigger(/* Trigger params */)] IReadOnlyList<Document> changedItems,
[CosmosDB(/* Client params */)] DocumentClient client,
ILogger log)
{
// Read changedItems,
// Create/read/update/delete with client
}
It's also possible with .NET Core to use dependency injection to provide a full-fledged custom service/repository class to your function instance to interface to Cosmos. This is my preferred approach, because I can do validation, control serialization, etc with the latest version of the Cosmos SDK.
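A minimal sketch of that DI approach, assuming the Microsoft.Azure.Functions.Extensions.DependencyInjection package and an app setting named "CosmosConnection" (the class and setting names are placeholders):
using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Functions.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection;

[assembly: FunctionsStartup(typeof(MyApp.Startup))]

namespace MyApp
{
    public class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
            // Register one CosmosClient for the whole Function app; functions (or a
            // repository class) then take it as a constructor dependency.
            builder.Services.AddSingleton(_ =>
                new CosmosClient(Environment.GetEnvironmentVariable("CosmosConnection")));
        }
    }
}
With this in place, the function class stops being static and receives the CosmosClient (or a repository wrapping it) through its constructor.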
You may have done so intentionally, but just mentioning to consider combining your data into a single container partitioned by, for example, a combination of record type (Stock/StockPrice) and identifier. This simplifies things and can be more cost/resource efficient relative to multiple containers.
Ended up going with @Noah Stahl's suggestion. Leaving this here as an alternative.
Couldn't figure out how to do this directly, so came up with a work-around:
Add function with CosmosDBTrigger on StockPrices collection with Queue output binding
foreach over Documents from the trigger, serialize and add to the Queue
Add function with QueueTrigger, CosmosDB input binding for Stocks collection (with PartitionKey and Id set to StockTicker), and CosmosDB output binding for Stocks collection
Update Stock from CosmosDB input binding with values from the QueueTrigger
Assign updated Stock to CosmosDB output binding parameter (updates record in DB)
This said, I'd like to hear about more straightforward ways of doing this, as my approach seems like a hack.
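For reference, a rough sketch of the second function in the list above, assuming the queue message carries the ticker and price and that the Stocks container uses the ticker as both id and partition key (every name here is a placeholder):
// using Microsoft.Azure.WebJobs; using Newtonsoft.Json;
public class StockPriceMessage
{
    public string Ticker { get; set; }
    public decimal Price { get; set; }
}

public class Stock
{
    [JsonProperty("id")]
    public string Id { get; set; }
    public string Ticker { get; set; }
    public decimal Price { get; set; }
}

[FunctionName("ApplyPriceToStock")]
public static void Run(
    [QueueTrigger("stock-price-updates")] StockPriceMessage message,
    [CosmosDB("myDb", "Stocks",
        ConnectionStringSetting = "cosmosConnectionString",
        Id = "{Ticker}", PartitionKey = "{Ticker}")] Stock stock,               // input: existing Stock document
    [CosmosDB("myDb", "Stocks",
        ConnectionStringSetting = "cosmosConnectionString")] out Stock updatedStock) // output: writes on assignment
{
    stock.Price = message.Price;   // apply the latest price from the queue message
    updatedStock = stock;          // assigning to the output binding persists the change
}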

How to avoid race condition when updating Azure Table Storage record

Azure Function utilising Azure Table Storage
I have an Azure Function which is triggered from an Azure Service Bus topic subscription; let's call it the "Process File Info" function.
The message on the subscription contains file information to be processed. Something similar to this:
{
"uniqueFileId": "adjsdakajksajkskjdasd",
"fileName":"mydocument.docx",
"sourceSystemRef":"System1",
"sizeBytes": 1024,
... and other data
}
The function carries out the following two operations:
Check the individual file storage table for the existence of the file. If it exists, update that file. If it's new, add the file to the storage table (stored on a per system | per fileId basis).
Capture metrics on the file size in bytes and store them in a second storage table, called metrics (constantly incrementing the bytes, stored on a per system | per year/month basis).
The following diagram gives a brief summary of my approach:
The difference between the individualFileInfo table and the fileMetric table is that the individual table has one record per file, whereas the metric table stores one record per month that is constantly updated (incremented), gathering the total bytes that pass through the function.
Data in the fileMetrics table is stored as follows:
The issue...
Azure Functions are brilliant at scaling; in my setup I have a maximum of 6 of these functions running at any one time. Presuming each file message being processed is unique, updating (or inserting) the record in the individualFileInfo table works fine, as there are no race conditions.
However, updating the fileMetric table is proving problematic: say all 6 functions fire at once, they all try to update the metrics table at the same time (incrementing the new file counter or the existing file counter).
I have tried using the ETag for optimistic updates, along with a little bit of recursion to retry should a 412 response come back from the storage update (code sample below). But I can't seem to avoid this race condition. Does anyone have any suggestions on how to work around this constraint, or has anyone come up against something similar before?
Sample code that is executed in the function for storing the fileMetric update:
internal static async Task UpdateMetricEntry(IAzureTableStorageService auditTableService,
string sourceSystemReference, long addNewBytes, long addIncrementBytes, int retryDepth = 0)
{
const int maxRetryDepth = 3; // only recursively attempt a max of 3 times
var todayYearMonth = DateTime.Now.ToString("yyyyMM");
try
{
// Attempt to get existing record from table storage.
var result = await auditTableService.GetRecord<VolumeMetric>("VolumeMetrics", sourceSystemReference, todayYearMonth);
// If the volume metrics table exists in storage - add or edit the records as required.
if (result.TableExists)
{
VolumeMetric volumeMetric = result.RecordExists ?
// Existing metric record.
(VolumeMetric)result.Record.Clone()
:
// Brand new metrics record.
new VolumeMetric
{
PartitionKey = sourceSystemReference,
RowKey = todayYearMonth,
SourceSystemReference = sourceSystemReference,
BillingMonth = DateTime.Now.Month,
BillingYear = DateTime.Now.Year,
ETag = "*"
};
volumeMetric.NewVolumeBytes += addNewBytes;
volumeMetric.IncrementalVolumeBytes += addIncrementBytes;
await auditTableService.InsertOrReplace("VolumeMetrics", volumeMetric);
}
}
catch (StorageException ex)
{
if (ex.RequestInformation.HttpStatusCode == 412)
{
// Retry to update the volume metrics.
if (retryDepth < maxRetryDepth)
await UpdateMetricEntry(auditTableService, sourceSystemReference, addNewBytes, addIncrementBytes, retryDepth + 1); // pass retryDepth + 1; retryDepth++ passes the original value, so the depth never increases
}
else
throw;
}
}
The ETag keeps track of conflicts, and if this code gets a 412 HTTP response it will retry, up to a max of 3 times (an attempt to mitigate the issue). My issue here is that I cannot guarantee the updates to table storage across all instances of the function.
Thanks for any tips in advance!!
You can put the second part of the work into a second queue and function, maybe even put a trigger on the file updates.
Since the other operation sounds like it might take most of the time anyway, it could also remove some of the heat from the second step.
You can then solve any remaining race conditions by focusing only on that function. You can use sessions to limit the concurrency effectively. In your case, the system id could be a possible session key. If you use that, only one Azure Function instance will process data from a given system at any one time, effectively solving your race conditions.
https://dev.to/azure/ordered-queue-processing-in-azure-functions-4h6c
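A rough sketch of what the trigger side looks like, assuming the publisher stamps each message's SessionId with the source system reference (topic, subscription and connection names are placeholders):
[FunctionName("ProcessFileInfo")]
public static void Run(
    [ServiceBusTrigger("file-topic", "file-subscription",
        Connection = "ServiceBusConnection",
        IsSessionsEnabled = true)] string message,
    ILogger log)
{
    // With sessions enabled (SessionId = sourceSystemRef set by the sender), messages for a
    // given system are handled one at a time, so the read-increment-write on the VolumeMetrics
    // row for that system no longer races other function instances.
}
Note that the subscription itself must be created with sessions enabled for this to work.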
Edit: If you can't use Sessions to logically lock the resource, you can use locks via blob storage:
https://www.azurefromthetrenches.com/acquiring-locks-on-table-storage/
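Following that link's approach, a rough sketch of a blob-lease lock using the Azure.Storage.Blobs package (container and blob names are placeholders, and the lock blob must already exist):
using System;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static async Task UpdateMetricsUnderLockAsync(string storageConnectionString)
{
    var lockBlob = new BlobClient(storageConnectionString, "locks", "volume-metrics-lock");
    var lease = lockBlob.GetBlobLeaseClient();

    // Acquire an exclusive 15-second lease; other instances get a 409 until it is released.
    await lease.AcquireAsync(TimeSpan.FromSeconds(15));
    try
    {
        // Read, increment and write the VolumeMetrics row here while holding the lease.
    }
    finally
    {
        await lease.ReleaseAsync();
    }
}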

Azure Cosmos DB Functions - Delete a document

I can get a document response using an HTTP trigger in a Function App on Azure, like a REST API. However, I cannot delete the document.
I'm selecting DELETE under 'Selected HTTP methods', but I am not clear on what to do for the next step.
In Input Parameters, when I write 'Delete from mydocument' in the 'SQL Query (optional)' textbox, it doesn't work.
I probably need to change the 'run.csx' code, but how?
Any clue?
I believe the 'SQL Query' section is for the input binding, for 'finding' the document that you wish to work with. This may still be useful, depending on how you want to proceed. You can still use an HTTP DELETE trigger if you want, but merely 'saying' that it's a DELETE verb doesn't automatically perform a delete. Instead, it means that you can 'invoke' the function only if you specify it as a DELETE action.
I've previously deleted documents by binding directly to the DocumentClient itself and deleting the Document programmatically.
[FunctionName("DeleteDocument")]
public static async Task Run(
[TimerTrigger("00:01", RunOnStartup = true)] TimerInfo timer,
[DocumentDB] DocumentClient client,
TraceWriter log)
{
var collectionUri = UriFactory.CreateDocumentCollectionUri("ItemDb", "ItemCollection");
var documents = client.CreateDocumentQuery(collectionUri);
foreach (Document d in documents)
{
await client.DeleteDocumentAsync(d.SelfLink);
}
}
See DocumentDBSamples

Is it possible to generate a unique BlobOutput name from an Azure WebJobs QueueInput item?

I have a continuous Azure WebJob that is running off of a QueueInput, generating a report, and outputting a file to a BlobOutput. This job will run for differing sets of data, each requiring a unique output file. (The number of inputs is guaranteed to scale significantly over time, so I cannot write a single job per input.) I would like to be able to run this off of a QueueInput, but I cannot find a way to set the output based on the QueueInput value, or any value except for a blob input name.
As an example, this is basically what I want to do, though it is invalid code and will fail.
public static void Job([QueueInput("inputqueue")] InputItem input, [BlobOutput("fileoutput/{input.Name}")] Stream output)
{
//job work here
}
I know I could do something similar if I used BlobInput instead of QueueInput, but I would prefer to use a queue for this job. Am I missing something or is generating a unique output from a QueueInput just not possible?
There are two alternatives:
Use IBinder to generate the blob name, as shown in these samples (a rough sketch follows below)
Have an autogenerated property in the queue message object and bind the blob name to that property. See here (the BlobNameFromQueueMessage method) for how to bind a queue message property to a blob name
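For the first option, a rough sketch with the current WebJobs SDK (attribute names differ slightly from the older SDK used in the question) might look like this:
public static async Task Job(
    [QueueTrigger("inputqueue")] InputItem input,
    IBinder binder)
{
    // Bind the output blob at runtime so its name can come from the queue message.
    using (var output = await binder.BindAsync<Stream>(
        new BlobAttribute($"fileoutput/{input.Name}", FileAccess.Write)))
    {
        // write the report to "output" here
    }
}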
Found the solution at Advanced bindings with the Windows Azure Web Jobs SDK via Curah's Complete List of Web Jobs Tutorials and Videos.
Quote for posterity:
One approach is to use the IBinder interface to bind the output blob and specify the name that equals the order id. The better and simpler approach (SimpleBatch) is to bind the blob name placeholder to the queue message properties:
public static void ProcessOrder(
[QueueInput("orders")] Order newOrder,
[BlobOutput("invoices/{OrderId}")] TextWriter invoice)
{
// Code that creates the invoice
}
The {OrderId} placeholder from the blob name gets its value from the OrderId property of the newOrder object. For example, newOrder is (JSON): {"CustomerName":"Victor","OrderId":"abc42"} then the output blob name is "invoices/abc42". The placeholder is case-sensitive.
So, you can reference individual properties from the QueueInput object in the BlobOutput string and they will be populated correctly.
