How does the delimiter work in Azure Blob Storage?

I am storing image files as blobs in Azure Storage with the following naming convention:
directory/image-name
When retrieving the blobs on the server with BlobService.listBlobs(container, options, callback) in JavaScript, I use:
var options = { "prefix":directory }
and it gets back only blobs that start with the directory name, as I expect, but I thought I would also be able to use:
var options = { "delimiter":"/", "prefix":directory }
and get back the same blobs, perhaps without the prefix in their names. Instead I get back nothing at all. What is the correct way to use the delimiter? What is the point of the delimiter if the prefix alone already returns the items I want?

I've not used the REST APIs from JavaScript, but I think what you are missing is a trailing slash after the directory name, so I suggest:
var options = { "delimiter":"/", "prefix":directory+"/" }
Windows Azure Storage doesn't really have directories. In the underlying implementation, all the blobs in a container are just a flat list, and blob names (not container names) may contain slashes. The delimiter is an option of the List Blobs REST API that lets you simulate directory-like behavior. If the delimiter option is set and the part of the blob name past the prefix contains the delimiter, the reply omits that blob from the blob list and instead reports the name up to and including that delimiter once, as a BlobPrefix entry.
To illustrate, let's name some blobs, assuming all of them are in the same container, https://myaccount.blob.core.windows.net/mycontainer:
a/b/extra.txt
a/bloba.txt
a/blobb.txt
other.txt
So then if you invoke listBlobs on that container with the prefix "a/" and without specifying the delimiter, it will return the first three names, because they all have the "a/" prefix.
If instead you invoke listBlobs with the same "a/" prefix and set the delimiter to "/", you only get the middle two names; the service leaves out a/b/extra.txt because it's in a (simulated) sub-directory "b".
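The behavior described above can be reproduced locally without calling the service. The following is a minimal sketch in plain JavaScript; simulateListBlobs is a hypothetical helper written for this illustration, not part of any Azure SDK, and it only imitates the prefix and delimiter rules over the four example names.

```javascript
// Hypothetical helper: mimics a prefix + delimiter listing over a flat name list.
function simulateListBlobs(names, prefix = "", delimiter = "") {
  const blobs = [];
  const prefixes = new Set();
  for (const name of names) {
    if (!name.startsWith(prefix)) continue; // the prefix filter always applies
    const rest = name.slice(prefix.length);
    const cut = delimiter ? rest.indexOf(delimiter) : -1;
    if (cut === -1) {
      blobs.push(name); // no delimiter past the prefix: listed as a blob
    } else {
      // Roll everything below the first delimiter up into one virtual directory.
      prefixes.add(prefix + rest.slice(0, cut + delimiter.length));
    }
  }
  return { blobs, prefixes: [...prefixes] };
}

const names = ["a/b/extra.txt", "a/bloba.txt", "a/blobb.txt", "other.txt"];
const withoutDelimiter = simulateListBlobs(names, "a/");
const withDelimiter = simulateListBlobs(names, "a/", "/");
const noTrailingSlash = simulateListBlobs(names, "a", "/");
console.log(withoutDelimiter, withDelimiter, noTrailingSlash);
```

With the prefix "a" and no trailing slash, every matching name still has a slash past the prefix, so the blob list comes back empty. That is consistent with the "nothing at all" result described in the question.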

Related

How to get files from Google Cloud Storage bucket based on a regular expression?

I am trying to get files from a Google Cloud Storage bucket. The file names are something like 20180618_1400/SOMEID_20180618.jpg, 20180618_1200/SOMEID_20180618.jpg, 20180617_1400/SOMEOTHERID_20180617.jpg, etc.
I want to get files based on SOMEID.
I tried the following code with a regular expression:
bucket.getFiles({
  prefix: new RegExp(`[0-9_]*\/SOMEID_`),
}, (err, files) => {
  if (err) return reject(err);
  resolve(files);
});
The expected result is files 20180618_1400/SOMEID_20180618.jpg and 20180618_1200/SOMEID_20180618.jpg. But the code returns all the files in the bucket.
I searched on the internet but couldn't find anything.
Is there any other way to achieve this?

The prefix has to be a string; it is a prefix, not a regex. I checked the documentation to be sure and, as expected, this is not possible.
The correct way to do this in GCS is to structure your bucket so that a string prefix is usable. For example, have one directory for profile pictures, another for PDFs, and so on, with all files named after your user ID.
Example:
profiles/1245.jpg
profiles/7561.jpg
billing/1245-2018-10.pdf
billing/1245-2018-09.pdf
billing/7561-2018-10.pdf
...
If you cannot, you will have to get all items and then apply your regex to them. There is an example at the end of the getFiles() documentation.
I think (it's been a while) you can use a regex with gsutil, but gsutil gets all the files and then applies the regex on the client side, so it won't be a better solution.
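The "get all items, then apply your regex" fallback can be sketched as a plain filter over object names. This is only an illustration: filterByPattern is a hypothetical helper, and the array below stands in for the names you would read from the file objects returned by bucket.getFiles() (for example via file.name).

```javascript
// Hypothetical helper: keep only the object names that match a pattern.
// The pattern should not use the "g" flag, so that test() stays stateless.
function filterByPattern(fileNames, pattern) {
  return fileNames.filter((name) => pattern.test(name));
}

// Stand-in for the names of all files listed from the bucket.
const fileNames = [
  "20180618_1400/SOMEID_20180618.jpg",
  "20180618_1200/SOMEID_20180618.jpg",
  "20180617_1400/SOMEOTHERID_20180617.jpg",
];

// Anchored at the start so SOMEOTHERID_ does not match.
const matches = filterByPattern(fileNames, /^[0-9_]+\/SOMEID_/);
console.log(matches);
```

Because the filtering happens on the client, the whole listing is still transferred; this only narrows the result after the fact.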

Optionally generate output with an Azure Function

I currently have a Timer triggered Azure Function that checks a data endpoint to determine if any new data has been added. If new data has been added, then I generate an output blob (which I return).
However, returning output appears to be mandatory. Whereas I'd only like to generate an output blob under specific conditions, I must do it all of the time, clogging up my storage.
Is there any way to generate output only under specified conditions?

If you have the blob output binding set to your return value, but you do not want to generate a blob, simply return null to ensure the blob is not created.
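As a rough sketch of that idea, the decision can live in a small function whose result is what the Azure Function returns. Everything here is an assumption for illustration: buildOutput and its argument are hypothetical names, the question does not say which language the function uses, and the wiring of the return value to the blob output binding is not shown.

```javascript
// Hypothetical helper: decide what the function should return.
// Assumption: the function's return value is bound to the blob output.
function buildOutput(newRecords) {
  if (!newRecords || newRecords.length === 0) {
    return null; // nothing new: return null so that no blob is written
  }
  return JSON.stringify(newRecords); // content for the output blob
}

console.log(buildOutput([]));
console.log(buildOutput([{ id: 1 }]));
```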

You're free to execute whatever logic you want in your functions. You may need to remove the output binding from your function (this is what is making the output required) and construct the connection to blob storage in your function instead. Then you can conditionally create and save the blob.

Azure Rest Api List container : Parameter Marker

Hi guys, I am building a client that interacts with the Azure Storage REST API.
I was going through the documentation at https://learn.microsoft.com/ru-ru/rest/api/storageservices/fileservices/list-containers2
and didn't understand the use of the prefix and marker parameters, which can be sent along with an Azure request.
It says:
prefix
Optional. Filters the results to return only containers whose name
begins with the specified prefix.
marker
Optional. A string value that identifies the portion of the list of
containers to be returned with the next listing operation. The
operation returns the NextMarker value within the response body if the
listing operation did not return all containers remaining to be listed
with the current page. The NextMarker value can be used as the value
for the marker parameter in a subsequent call to request the next page
of list items.
The marker value is opaque to the client.
With prefix, I think:
If I have this directory structure:
file01.txt
images/image01.jpg
images/folder/image001.jpg
fightVideo/subFolder/current/video001.mpg
fightVideo/subFolder/current/video002.mpg
If I give "fight" as the prefix, it should return
fightVideo.
But I am not sure.
And for marker, I don't understand what its use is.
Can someone please explain the use of prefix and marker with examples?

In the context of listing containers, if you specify the prefix parameter, the service lists the containers whose names start with that prefix value. It has nothing to do with listing blobs.
The List Blobs operation also supports the prefix parameter; when you specify it, the service lists the blobs whose names start with that prefix value.
So the example you have given is for listing blobs. When you specify fight as the prefix there, you get back fightVideo/subFolder/current/video001.mpg and fightVideo/subFolder/current/video002.mpg in the response, but not when you call List Containers with this prefix.
Regarding marker, Kalyan's explanation is correct, but let me add a little more to it.
Essentially, Azure Storage is a shared service, and you simply can't ask it to return all the results in one go (to take an analogy from the SQL world, you simply can't do a SELECT * FROM TABLE kind of thing). Each request to the service is assigned a predefined timeout, and the response includes the items fetched in that time plus, optionally, a token if the service thinks there is more data available. This token is called a continuation token. To get the next set of items, you pass this continuation token in the marker parameter of your next request.
Each call to the storage service tries to return at most a predefined maximum number of items. For listing blob containers or blobs, this limit is 5000 items. For listing tables or entities, it is 1000 items. If there are more items in your account, then in addition to this data the storage service returns a continuation token, which tells you that more data is available.
Please note that even though the limit exists, you can't always assume that you will get that number of records. Depending on a number of conditions, it is quite possible to get back no data at all and still receive a continuation token, so your code needs to handle this condition as well.
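The loop this implies can be sketched as follows. It is a minimal, synchronous illustration against a fake in-memory service: makeFakeService, fetchPage and listAll are hypothetical names, and a real client would issue an asynchronous HTTP request per page. The client code treats the marker as opaque and keeps going even when a page is empty, stopping only when no continuation token comes back.

```javascript
// Hypothetical stand-in for one listing request per call.
// Inside the fake, the marker is just a page index; clients must not rely on that.
function makeFakeService(pages) {
  return function fetchPage(marker) {
    const index = marker ? Number(marker) : 0;
    const hasMore = index + 1 < pages.length;
    return {
      items: pages[index] || [],
      nextMarker: hasMore ? String(index + 1) : "",
    };
  };
}

// Collect every item by following the continuation token until it is empty.
function listAll(fetchPage) {
  const all = [];
  let marker = ""; // an empty marker means "start from the beginning"
  do {
    const page = fetchPage(marker);
    all.push(...page.items); // a page may be empty and still carry a token
    marker = page.nextMarker;
  } while (marker);
  return all;
}

// The second page is deliberately empty but still followed by more data.
const fetchPage = makeFakeService([["c1", "c2"], [], ["c3"]]);
const allContainers = listAll(fetchPage);
console.log(allContainers);
```

Stopping on "empty page" instead of "empty token" would lose c3 in this example, which is the condition the answer warns about.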

If there are too many blobs to be listed, then the response contains the NextMarker element.
<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://myaccount.blob.core.windows.net">
  <Prefix>string-value</Prefix>
  <Marker>string-value</Marker>
  <MaxResults>int-value</MaxResults>
  <Containers>
    <Container>
      <Name>container-name</Name>
      <Properties>
        <Last-Modified>date/time-value</Last-Modified>
        <Etag>etag</Etag>
        <LeaseStatus>locked | unlocked</LeaseStatus>
        <LeaseState>available | leased | expired | breaking | broken</LeaseState>
        <LeaseDuration>infinite | fixed</LeaseDuration>
        <PublicAccess>container | blob</PublicAccess>
      </Properties>
      <Metadata>
        <metadata-name>value</metadata-name>
      </Metadata>
    </Container>
  </Containers>
  <NextMarker>marker-value</NextMarker>
</EnumerationResults>
The REST API documentation mentions that the marker value can be used in a subsequent call to request the next set of list items.
You can think of the marker as a pagination cursor.

Is it possible to generate a unique BlobOutput name from an Azure WebJobs QueueInput item?

I have a continuous Azure WebJob that is running off of a QueueInput, generating a report, and outputting a file to a BlobOutput. This job will run for differing sets of data, each requiring a unique output file. (The number of inputs is guaranteed to scale significantly over time, so I cannot write a single job per input.) I would like to be able to run this off of a QueueInput, but I cannot find a way to set the output based on the QueueInput value, or any value except for a blob input name.
As an example, this is basically what I want to do, though it is invalid code and will fail.
public static void Job([QueueInput("inputqueue")] InputItem input, [BlobOutput("fileoutput/{input.Name}")] Stream output)
{
    //job work here
}
I know I could do something similar if I used BlobInput instead of QueueInput, but I would prefer to use a queue for this job. Am I missing something or is generating a unique output from a QueueInput just not possible?

There are two alternatives:
Use IBinder to generate the blob name, as shown in these samples.
Have an autogenerated property in the queue message object and bind the blob name to that property. See here (the BlobNameFromQueueMessage method) for how to bind a queue message property to a blob name.

Found the solution at Advanced bindings with the Windows Azure Web Jobs SDK via Curah's Complete List of Web Jobs Tutorials and Videos.
Quote for posterity:
One approach is to use the IBinder interface to bind the output blob and specify the name that equals the order id. The better and simpler approach (SimpleBatch) is to bind the blob name placeholder to the queue message properties:
public static void ProcessOrder(
    [QueueInput("orders")] Order newOrder,
    [BlobOutput("invoices/{OrderId}")] TextWriter invoice)
{
    // Code that creates the invoice
}
The {OrderId} placeholder from the blob name gets its value from the OrderId property of the newOrder object. For example, newOrder is (JSON): {"CustomerName":"Victor","OrderId":"abc42"} then the output blob name is "invoices/abc42". The placeholder is case-sensitive.
So, you can reference individual properties from the QueueInput object in the BlobOutput string and they will be populated correctly.
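To make the placeholder rule concrete, here is a small sketch of the substitution in plain JavaScript. It only imitates the behavior described in the quote and is not the WebJobs SDK's implementation; resolveBlobName is a hypothetical helper.

```javascript
// Hypothetical helper: fill {Name} placeholders from a queue message object.
// Property lookup is case-sensitive, as the quoted text says of the placeholder.
function resolveBlobName(template, message) {
  return template.replace(/\{(\w+)\}/g, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(message, key)) {
      throw new Error(`No property named "${key}" on the queue message`);
    }
    return String(message[key]);
  });
}

const order = JSON.parse('{"CustomerName":"Victor","OrderId":"abc42"}');
const blobName = resolveBlobName("invoices/{OrderId}", order);
console.log(blobName);
```

In this sketch a placeholder such as {orderid} throws, because no property with exactly that name exists on the message.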

How to check whether a CloudBlobDirectory exists or not?

In the software that I am programming, I am attempting to create a virtual file system over the blobs structure of Azure.
Many times in the process, I get a path from the system and need to tell whether it is the path of a blob or just a virtual BlobDirectory that Azure provides. I did this by casting it from one form to another and handling the error.
But now, if I know that a path points to a virtual directory, how can I check whether this virtual directory exists or not?
I can get the reference to the CloudBlobDirectory with the following code:
var blobDirectory = client.GetBlobDirectoryReference("Path_to_dir");

In blob storage, directories don't exist as items by themselves. What you can have is a blob whose name can be interpreted as being in a directory. If you look at the underlying REST API, you'll see that there's nothing in there about directories. What the storage client library does for you is search for blobs whose names start with the directory name followed by the delimiter, e.g. "DirectoryA/DirectoryB/FileName.txt". This means that for a directory to exist, it must contain a blob. To check whether the directory exists, you can try either:
var blobDirectory = client.GetBlobDirectoryReference("Path_to_dir");
bool directoryExists = blobDirectory.ListBlobs().Count() > 0;
or
bool directoryExists = client.ListBlobsWithPrefix("DirectoryA/DirectoryB/").Count() > 0;
I'm aware that listing everything in the directory just to get the count isn't that great an idea, I'm sure you can come up with a better method.
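The underlying idea, "a directory exists if at least one blob name starts with its path plus the delimiter", can be sketched without the SDK. directoryExists below is a hypothetical helper over an in-memory list of names; it stops at the first match instead of counting everything.

```javascript
// Hypothetical helper: a virtual directory exists if any blob name starts with
// the directory path followed by the delimiter.
function directoryExists(blobNames, directoryPath, delimiter = "/") {
  const prefix = directoryPath.endsWith(delimiter)
    ? directoryPath
    : directoryPath + delimiter;
  // some() returns as soon as one matching name is found.
  return blobNames.some((name) => name.startsWith(prefix));
}

const blobNames = ["DirectoryA/DirectoryB/FileName.txt"];
console.log(directoryExists(blobNames, "DirectoryA/DirectoryB"));
console.log(directoryExists(blobNames, "DirectoryA/Dir"));
```

Appending the delimiter before matching matters: without it, "DirectoryA/Dir" would wrongly match the blob under "DirectoryA/DirectoryB/".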

I'm not sure whether you can use the GetAttributes method here: if it raises an exception, the directory does not exist. I used a similar approach to verify whether a blob exists, but haven't tested it on a directory yet.

For Java, this can be used:
container.getDirectoryReference(directoryName).listBlobs().iterator().hasNext()
If it returns true, the directory exists; otherwise it does not.

val storageAccountString: String = s"BlobEndpoint=https://$account.$endpoint;SharedAccessSignature=$SASTOKEN"
val client: CloudBlobContainer = CloudStorageAccount.parse(storageAccountString).createCloudBlobClient().getContainerReference(container)
val blobPattern: String = s"wasbs://$containerName@$accountName.blob.core.windows.net/$dirPath"
client
  .getBlockBlobReference(blobPattern)
  .exists
  .booleanValue
