I currently have a Timer triggered Azure Function that checks a data endpoint to determine if any new data has been added. If new data has been added, then I generate an output blob (which I return).
However, returning output appears to be mandatory: although I'd only like to generate an output blob under specific conditions, I have to do it every time, which clogs up my storage.
Is there any way to generate output only under specified conditions?
If you have the blob output binding set to your return value, but you do not want to generate a blob, simply return null to ensure the blob is not created.
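For illustration, here is a minimal sketch of that pattern, assuming a timer-triggered C# function whose return value carries the blob output binding; the container name, schedule, and GetNewDataAsync helper are placeholders, not part of the original question:

[FunctionName("CheckForNewData")]
[return: Blob("output-container/{rand-guid}.json", Connection = "AzureWebJobsStorage")]
public static async Task<string> Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
{
    string newData = await GetNewDataAsync(); // hypothetical check of the data endpoint

    // Returning null means the output binding writes nothing, so no blob is created
    if (newData == null)
        return null;

    return newData; // a blob is generated only when there is new data
}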
You're free to execute whatever logic you want in your functions. You may need to remove the output binding from your function (this is what is making the output required) and construct the connection to blob storage in your function instead. Then you can conditionally create and save the blob.
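A sketch of that second approach using the Azure.Storage.Blobs package might look like this; the container name, blob name pattern, and GetNewDataAsync helper are again illustrative assumptions:

[FunctionName("CheckForNewData")]
public static async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
{
    string newData = await GetNewDataAsync(); // hypothetical check of the data endpoint
    if (newData == null)
        return; // nothing new, so no blob is written at all

    // Build the blob client inside the function body instead of using an output binding
    var container = new BlobContainerClient(
        Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "output-container");
    await container.CreateIfNotExistsAsync();

    using var content = new MemoryStream(Encoding.UTF8.GetBytes(newData));
    await container.GetBlobClient($"report-{DateTime.UtcNow:yyyyMMddHHmmss}.json")
                   .UploadAsync(content);
}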
I have 2 collections in CosmosDB, Stocks and StockPrices.
StockPrices collection holds all historical prices, and is constantly updated.
I want to create Azure Function that listens to StockPrices updates (CosmosDBTrigger) and then does the following for each Document passed by the trigger:
Find stock with matching ticker in Stocks collection
Update stock price in Stocks collection
I can't do this with CosmosDB input binding, as CosmosDBTrigger passes a List (binding only works when trigger passes a single item).
The only way I see this working is if I foreach on CosmosDBTrigger List, and access CosmosDB from my function body and perform steps 1 and 2 above.
Question: How do I access CosmosDB from within my function?
One of the CosmosDB binding forms is to get a DocumentClient instance, which provides the full range of operations on the container. This way, you should be able to combine the change feed trigger and the item manipulation into the same function, like:
[FunctionName("ProcessStockChanges")]
public async Task Run(
[CosmosDBTrigger(/* Trigger params */)] IReadOnlyList<Document> changedItems,
[CosmosDB(/* Client params */)] DocumentClient client,
ILogger log)
{
// Read changedItems,
// Create/read/update/delete with client
}
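For the Find/Update steps from the question, the body could look roughly like this. This is a sketch only: the "db"/"Stocks" names, the Stock class, and the Ticker/Price property names are assumptions.

var stocksUri = UriFactory.CreateDocumentCollectionUri("db", "Stocks");

foreach (var priceDoc in changedItems)
{
    var ticker = priceDoc.GetPropertyValue<string>("Ticker");
    var price = priceDoc.GetPropertyValue<decimal>("Price");

    // 1. Find the stock with the matching ticker
    var stock = client.CreateDocumentQuery<Stock>(stocksUri,
            new FeedOptions { PartitionKey = new PartitionKey(ticker) })
        .Where(s => s.Ticker == ticker)
        .AsEnumerable()
        .FirstOrDefault();

    if (stock == null) continue;

    // 2. Update the stock price and write it back
    stock.Price = price;
    await client.UpsertDocumentAsync(stocksUri, stock);
}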
It's also possible with .NET Core to use dependency injection to provide a full-fledged custom service/repository class to your function instance to interface to Cosmos. This is my preferred approach, because I can do validation, control serialization, etc with the latest version of the Cosmos SDK.
You may have done so intentionally, but it's worth considering combining your data into a single container partitioned by, for example, a combination of record type (Stock/StockPrice) and identifier. This simplifies things and can be more cost/resource efficient than multiple containers.
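For illustration only (all names invented), two documents that could live side by side in such a container:

// Partition key combines record type and ticker, e.g. "Stock|MSFT" vs "StockPrice|MSFT"
var stockDoc = new { id = "MSFT", pk = "Stock|MSFT", type = "Stock", price = 321.50m };
var priceDoc = new { id = "MSFT-2020-04-15", pk = "StockPrice|MSFT", type = "StockPrice", price = 321.50m };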
Ended up going with @Noah Stahl's suggestion. Leaving this here as an alternative.
Couldn't figure out how to do this directly, so I came up with a workaround (sketched in code after the steps below):
Add function with CosmosDBTrigger on StockPrices collection with Queue output binding
foreach over Documents from the trigger, serialize and add to the Queue
Add function with QueueTrigger, CosmosDB input binding for Stocks collection (with PartitionKey and Id set to StockTicker), and CosmosDB output binding for Stocks collection
Update Stock from CosmosDB input binding with values from the QueueTrigger
Assign updated Stock to CosmosDB output binding parameter (updates record in DB)
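A rough sketch of those two functions under the bindings above; the Stock/StockPrice classes, database and collection names, and connection setting name are assumptions on my part:

[FunctionName("StockPricesToQueue")]
public static void EnqueueChanges(
    [CosmosDBTrigger("db", "StockPrices", ConnectionStringSetting = "CosmosConnection",
        LeaseCollectionName = "leases")] IReadOnlyList<Document> changedPrices,
    [Queue("stock-price-updates")] ICollector<string> queue)
{
    foreach (var doc in changedPrices)
        queue.Add(doc.ToString()); // serialize each changed price document onto the queue
}

[FunctionName("UpdateStock")]
public static void UpdateStock(
    [QueueTrigger("stock-price-updates")] StockPrice priceUpdate,
    [CosmosDB("db", "Stocks", ConnectionStringSetting = "CosmosConnection",
        Id = "{StockTicker}", PartitionKey = "{StockTicker}")] Stock stock,
    [CosmosDB("db", "Stocks", ConnectionStringSetting = "CosmosConnection")] out Stock updatedStock)
{
    stock.Price = priceUpdate.Price; // update the stock with values from the queue message
    updatedStock = stock;            // assigning to the output binding updates the record in the DB
}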
This said, I'd like to hear about more straightforward ways of doing this, as my approach seems like a hack.
I am using node-storage in the following code to store a value in a file; however, when I create a new storage object, changes from another storage object are not yet saved. I need a way to save the changes before creating the new storage object.
Below is a program called code.js which I am running in the console like so: node code.js. If you run it you will see that the first time it is run the key-value pair doesn't yet exist; however, it does exist the second time.
key = "key"
storage = require('node-storage')
const store1 = new storage("file")
const store2 = new storage("file")
store1.put(key,'val')
console.log(store2.get(key))
My motivation for this is that I want to be able to have a function called "set" which takes a key and a value and sets the key-value pair in a dictionary of values that is stored in a file. I want to be able to refer to this dictionary later, with for example a 'get' function, and have the changes present.
I am thinking there might be a function called "save" or something similar that applies the changes to the file. Is there such a function or some other solution?
node-storage saves the changes in the dictionary to disk after every call to put or remove. This is not the issue.
Your problem is that the dictionary in store2 has not been updated with the new properties. node-storage only loads the file from disk when the object is first created.
My suggestion would be to only have one instance of storage per file.
However, if this is not possible, then you might want to consider updating store2's cache before you get the property. This can be done using:
store2.store = store2._load();
This may not be the best for performance, as _load loads the entire file from disk synchronously every time it is called, so try to limit its use.
I'm using the Microsoft.Azure.Storage.DataMovement nuget package to transfer multiple, very large (150GB) files into Azure cold storage using
TransferManager.UploadDirectoryAsync
It works very well, but a choke point in my process is that after upload I am attaching to the FileTransferred event and reading the local file all over again to calculate the md5 checksum and compare it to the remote copy:
private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    // Re-read the local file to compute its MD5 and compare with the uploaded blob's hash
    var sourceFile = e.Source.ToString();
    var destinationFile = (ICloudBlob) e.Destination;
    var localMd5 = CalculateMd5(sourceFile);
    var remoteMd5 = destinationFile.Properties.ContentMD5;

    if (localMd5 == remoteMd5)
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToDisplayText());
        destinationFile.SetMetadata();
    }
}
It is slower than it needs to be since every file is getting double handled - first by the library, then by my MD5 check.
Is this check even necessary or is the library already doing the heavy lifting for me? I can see Md5HashStream but after quickly looking through the source it isn't clear to me if it is being used to verify the entire remote file.
Note that blob.Properties.ContentMD5 for the entire blob is actually set by the Microsoft.Azure.Storage.DataMovement library, based on its own local calculation after it has uploaded all the blocks of the blob; it is not computed by the Azure Storage Blob service.
The data integrity of the upload is guaranteed by the Content-MD5 HTTP header sent when putting every single block, not by the blob.Properties.ContentMD5 metadata of the entire blob, since the Azure Storage Blob service doesn't actually validate that value when the Microsoft.Azure.Storage.DataMovement library sets it (see the description of the x-ms-blob-content-md5 HTTP header).
The main purpose of blob.Properties.ContentMD5 is to verify the data integrity when downloading the blob back to local disk via Microsoft.Azure.Storage.DataMovement library (if DownloadOptions.DisableContentMD5Validation is set to false, which is the default behavior).
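As an example, a download along these lines would have the library recompute the hash of the downloaded bytes and compare it to blob.Properties.ContentMD5 (a sketch; the blob reference and local path are placeholders):

// MD5 validation is on by default; it is set explicitly here only for clarity
var options = new DownloadOptions { DisableContentMD5Validation = false };
await TransferManager.DownloadAsync(blob, @"C:\downloads\bigfile.dat", options, new SingleTransferContext());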
Is this check even necessary or is the library already doing the heavy lifting for me?
As far as I know, we just need to check whether the blob has a value for the ContentMD5 property.
When Microsoft.Azure.Storage.DataMovement uploads a large file, the upload actually consists of multiple PutBlock requests plus one final PutBlockList request. Each PutBlock request uploads only part of the content, so the MD5 in those requests covers only the block being uploaded and cannot be used as the MD5 value of the final blob.
The body of the PutBlockList request is just the list of IDs of the blocks uploaded above, so the MD5 value on that request can only verify the integrity of that list.
Once all of these requests have been validated, the integrity of the content is guaranteed. For performance reasons, the storage service does not read back the contents of all the previously uploaded blocks to calculate the MD5 of the entire blob; instead it provides a special request header, x-ms-blob-content-md5, and stores that header's value as the blob's MD5. So as long as the client computes the MD5 of the entire content and sends it as x-ms-blob-content-md5 on the final PutBlockList request, everything has been verified and the blob also ends up with an MD5 value.
So the MD5-based integrity workflow for a block upload is:
The file to upload is split into blocks
Each block is sent as a PutBlock request, with the MD5 of that block's content in the Content-MD5 header
After all the blocks have been sent, the PutBlockList request is sent:
The MD5 of the entire uploaded file is calculated and set in the x-ms-blob-content-md5 header
The IDs of the blocks sent earlier are listed as the body of the request
The MD5 of that block ID list is set in the Content-MD5 header
The service then stores the x-ms-blob-content-md5 value from the PutBlockList request as the blob's MD5 property
In summary, for a block upload, it comes down to whether x-ms-blob-content-md5 was given a value.
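Put differently, the callback from the question could skip re-reading the local file and just check that the library populated the property. Here is a sketch that reuses the question's Md5VerifiedKey metadata idea:

private void FileTransferredCallback(object sender, TransferEventArgs e)
{
    var destinationFile = (ICloudBlob) e.Destination;

    // DataMovement has already computed the hash locally and stored it via
    // x-ms-blob-content-md5, so a non-empty ContentMD5 means the value is present.
    if (!string.IsNullOrEmpty(destinationFile.Properties.ContentMD5))
    {
        destinationFile.Metadata.Add(Md5VerifiedKey, DateTimeOffset.UtcNow.ToString("o"));
        destinationFile.SetMetadata();
    }
}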
Hi guys, I am building a client which interacts with the Azure Storage REST API.
I was going through the documentation at https://learn.microsoft.com/ru-ru/rest/api/storageservices/fileservices/list-containers2
and didn't understand the use of the prefix and marker parameters which can be sent along with the request.
It says:
prefix
Optional. Filters the results to return only containers whose name
begins with the specified prefix.
marker
Optional. A string value that identifies the portion of the list of
containers to be returned with the next listing operation. The
operation returns the NextMarker value within the response body if the
listing operation did not return all containers remaining to be listed
with the current page. The NextMarker value can be used as the value
for the marker parameter in a subsequent call to request the next page
of list items.
The marker value is opaque to the client.
With prefix, I think:
If I have this directory structure:
file01.txt
images/image01.jpg
images/folder/image001.jpg
fightVideo/subFolder/current/video001.mpg
fightVideo/subFolder/current/video002.mpg
and I give "fight" as the container name prefix, it should return fightVideo.
But I am not sure.
And for marker, I don't understand what its use is.
Please can someone explain the use of Prefix and Marker with examples?
In the context of listing containers, if you specify the prefix parameter it will list only the containers whose names start with that prefix value. It has nothing to do with listing blobs.
The List Blobs operation also supports this prefix parameter, and when you specify it there, it will list the blobs whose names start with that prefix value.
So the example you have given is for listing blobs: when you specify fight as the prefix there, you will get back fightVideo/subFolder/current/video001.mpg and fightVideo/subFolder/current/video002.mpg in the response, but not when you call List Containers with this prefix.
Regarding marker, Kalyan's explanation is correct but let me add a little bit more to that.
Essentially Azure Storage Service is a shared service and you simply can't ask it to return all the results in one go (if we were to take an analogy from SQL world, you simply can't do SELECT * FROM TABLE kind of thing). Each request to the service is assigned a predefined timeout and the response would include the items fetched in that time + optionally a token if the service thinks that there's more data available. This token is called continuation token. In order to get the next set of items, you would need to pass this continuation token in the marker parameter in your next request.
Each call to the storage service will return at most a predefined maximum number of items. For listing blob containers/blobs, this limit is 5000 items. For listing tables/entities, it is 1000 items. If there are more items in your account, then along with this data the storage service returns a continuation token, which tells you that there's more data available.
Please note that even though the limit is there, you can't always assume that you will get that many records back. Based on a number of conditions, it is quite possible that you don't get back any data but still receive a continuation token, so your code needs to handle this condition as well.
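As an illustration with the classic Microsoft.Azure.Storage.Blob client library (which wraps this REST operation), prefix maps to the prefix argument and marker to the continuation token; the connection string and prefix value below are placeholders:

string connectionString = "<your storage connection string>";
var client = CloudStorageAccount.Parse(connectionString).CreateCloudBlobClient();

BlobContinuationToken marker = null;        // null marker = start at the beginning
do
{
    // "fight" plays the role of the prefix query parameter
    ContainerResultSegment segment = await client.ListContainersSegmentedAsync("fight", marker);

    foreach (CloudBlobContainer container in segment.Results)
        Console.WriteLine(container.Name);

    marker = segment.ContinuationToken;     // this is the NextMarker from the response
} while (marker != null);                   // keep paging until no marker comes back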
If there are too many containers to be listed, then the response contains the NextMarker element.
<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://myaccount.blob.core.windows.net">
  <Prefix>string-value</Prefix>
  <Marker>string-value</Marker>
  <MaxResults>int-value</MaxResults>
  <Containers>
    <Container>
      <Name>container-name</Name>
      <Properties>
        <Last-Modified>date/time-value</Last-Modified>
        <Etag>etag</Etag>
        <LeaseStatus>locked | unlocked</LeaseStatus>
        <LeaseState>available | leased | expired | breaking | broken</LeaseState>
        <LeaseDuration>infinite | fixed</LeaseDuration>
        <PublicAccess>container | blob</PublicAccess>
      </Properties>
      <Metadata>
        <metadata-name>value</metadata-name>
      </Metadata>
    </Container>
  </Containers>
  <NextMarker>marker-value</NextMarker>
</EnumerationResults>
The REST API documentation mentions that the marker value can be used in a subsequent call to request the next set of list items.
You can imagine marker as a pagination index.
I have a continuous Azure WebJob that is running off of a QueueInput, generating a report, and outputting a file to a BlobOutput. This job will run for differing sets of data, each requiring a unique output file. (The number of inputs is guaranteed to scale significantly over time, so I cannot write a single job per input.) I would like to be able to run this off of a QueueInput, but I cannot find a way to set the output based on the QueueInput value, or any value except for a blob input name.
As an example, this is basically what I want to do, though it is invalid code and will fail.
public static void Job(
    [QueueInput("inputqueue")] InputItem input,
    [BlobOutput("fileoutput/{input.Name}")] Stream output)
{
    //job work here
}
I know I could do something similar if I used BlobInput instead of QueueInput, but I would prefer to use a queue for this job. Am I missing something or is generating a unique output from a QueueInput just not possible?
There are two alternatives:
Use IBinder to generate the blob name at runtime, as shown in these samples (sketched below)
Have a (possibly auto-generated) property in the queue message object and bind the blob name to that property. See here (the BlobNameFromQueueMessage method) for how to bind a queue message property to a blob name
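A sketch of the first alternative, using the current WebJobs attribute names (IBinder/BlobAttribute); the older SDK that still has QueueInput/BlobOutput names things slightly differently, and the Name property on InputItem is assumed:

public static async Task Job(
    [QueueTrigger("inputqueue")] InputItem input,
    IBinder binder)
{
    // Bind the output blob at runtime so its name can come from the queue message
    using (var output = await binder.BindAsync<TextWriter>(
        new BlobAttribute($"fileoutput/{input.Name}")))
    {
        await output.WriteAsync("report contents"); // job work here
    }
}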
Found the solution at Advanced bindings with the Windows Azure Web Jobs SDK via Curah's Complete List of Web Jobs Tutorials and Videos.
Quote for posterity:
One approach is to use the IBinder interface to bind the output blob and specify the name that equals the order id. The better and simpler approach (SimpleBatch) is to bind the blob name placeholder to the queue message properties:
public static void ProcessOrder(
    [QueueInput("orders")] Order newOrder,
    [BlobOutput("invoices/{OrderId}")] TextWriter invoice)
{
    // Code that creates the invoice
}
The {OrderId} placeholder in the blob name gets its value from the OrderId property of the newOrder object. For example, if newOrder is (JSON): {"CustomerName":"Victor","OrderId":"abc42"}, then the output blob name is "invoices/abc42". The placeholder is case-sensitive.
So, you can reference individual properties from the QueueInput object in the BlobOutput string and they will be populated correctly.
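Applied to the code in the question, assuming InputItem exposes a Name property, the placeholder should reference the property name itself rather than the parameter:

public static void Job(
    [QueueInput("inputqueue")] InputItem input,
    [BlobOutput("fileoutput/{Name}")] Stream output)
{
    // {Name} is resolved from the Name property of the deserialized queue message
}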