Slow Stream Analytics with blob input - Azure

I've inherited a solution that uses Stream Analytics with blobs as the input and then writes to an Azure SQL database.
Initially, the solution worked fine, but after adding several million blobs to a container (and not deleting old blobs), Stream Analytics has become slow at processing new blobs. Also, it appears that some blobs are being missed/skipped.
Question: How does Stream Analytics know there are new blobs in a container?
Prior to Event Grid, Blob storage did not have a push-notification mechanism to let Stream Analytics know that a new blob needs to be processed, so I'm assuming that Stream Analytics polls the container for the list of blobs (with something like CloudBlobContainer.ListBlobs()) and saves that list internally, so that on the next poll it can compare the new list with the old one and work out which blobs are new and need to be processed.
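Roughly, I'm imagining something like the following poll-and-diff loop (purely illustrative on my part, written against the storage client library; this is not a claim about how Stream Analytics is actually implemented):

using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobPollingSketch
{
    // Names of blobs that have already been handed off for processing.
    private static readonly HashSet<string> seenBlobs = new HashSet<string>();

    public static IEnumerable<string> GetNewBlobNames(CloudBlobContainer container)
    {
        // List every blob in the container (flat listing) and keep only the unseen ones.
        var currentNames = container.ListBlobs(null, true)
                                    .OfType<CloudBlockBlob>()
                                    .Select(b => b.Name)
                                    .ToList();
        var newNames = currentNames.Where(n => !seenBlobs.Contains(n)).ToList();
        newNames.ForEach(n => seenBlobs.Add(n));
        return newNames;
    }
}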
The documentation states:
Stream Analytics will view each file only once
However, besides that note, I have not seen any other documentation to explain how Stream Analytics knows which blobs to process.

ASA uses the List Blobs operation to get the list of blobs.
If you can partition the blob path by a date/time pattern, it would be better: ASA then only has to list a specific path to discover new blobs. Without a date pattern, all blobs have to be listed, which is probably why it gets slower with a huge number of blobs.
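To illustrate why the prefix matters, here is a rough sketch using the storage client library directly (the container layout and "logs/" prefix are made up):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

public static class PrefixListingSketch
{
    public static void ListNewBlobs(CloudBlobContainer container)
    {
        // With a {date}/{time}-style layout (e.g. "logs/2017/06/01/10/"), only one
        // narrow prefix has to be enumerated to find the latest blobs...
        string currentPrefix = "logs/" + DateTime.UtcNow.ToString("yyyy/MM/dd/HH") + "/";
        var recentBlobs = container.ListBlobs(currentPrefix, true);

        // ...whereas without a date pattern, every blob ever written has to be listed.
        var allBlobs = container.ListBlobs(null, true);
    }
}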

Related

Azure Storage Webhook being triggered by historical events

I have an Azure Function which uses the webhook bindings to be triggered by each upload or modification of a blob in an Azure Storage container.
This seems to work fine on an empty test container, i.e. when uploading the first blob or modifying one of two or three blobs in the test container.
However, when I point it towards a container with approximately a million blobs it receives a continuous stream of historic blob events.
I've read that
If the blob container being monitored contains more than 10,000 blobs, the Functions runtime scans log files to watch for new or changed blobs.
[source]
Is there any way I can ignore these historical events and consider only current events?

Limits on File Count for Azure Blob Storage

Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? Or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure Subscription, so essentially the amount of data you will be able to store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then essentially archived. For this I suggest you take a look at Azure Cool Blob Storage. It is meant for exactly this purpose: storing objects that are not frequently accessed, yet are available almost immediately when you need them. The advantage of using Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts, however reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, they are moved to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure Subscription.
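As a rough sketch of that hot-to-cool move with the storage client library (the container and blob names are placeholders; note that a copy between accounts needs the source to be readable, e.g. via a short-lived SAS):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ArchiveToCool
{
    public static void MoveToCool(CloudBlobContainer hotContainer, CloudBlobContainer coolContainer, string blobName)
    {
        CloudBlockBlob source = hotContainer.GetBlockBlobReference(blobName);

        // A copy between accounts needs the source to be readable by the destination
        // service, so generate a short-lived read-only SAS for it.
        string sas = source.GetSharedAccessSignature(new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddHours(1)
        });

        CloudBlockBlob destination = coolContainer.GetBlockBlobReference(blobName);
        destination.StartCopy(new Uri(source.Uri + sas));

        // Once the copy has completed, the blob in the hot account can be deleted.
    }
}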
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, totaling nearly 400 GB. The average file size is around 50 KB, but some files could exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreement (SLA) for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file-processing work. Since it only needs to run once a day, you could add a TimerTrigger function.
using Microsoft.Azure.WebJobs;

public static class DailyProcessing
{
    // This function is executed once a day at midnight (CRON fields: second minute hour day month day-of-week)
    public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
    {
        // Write the processing job here
    }
}
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time you want.
In addition, if your data-processing job is very complicated, you could also store your data in Azure Data Lake Store and do the data processing using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage

Fast mechanism for querying Azure blob names

I'm trying to get a list of blob names in Azure and I'm looking for ways to make this operation significantly faster. Within a given sub-folder, the number of blobs can exceed 150,000 elements. The filenames of the blobs are an encoded ID which is what I really need to get at, but I could store that as some sort of metadata if there was a way to query just the metadata or a single field of the metadata.
I'm finding that something as simple as the following:
var blobList = container.ListBlobs(null, false);
can take upwards of 60 seconds to run from my desktop and typically around 15 seconds when running on a VM hosted in Azure. These times are based on a test of 125k blobs in an otherwise empty container and were several hours after they were uploaded, so they've definitely had time to "settle", so to speak.
I've attempted multiple variations and tried using ListBlobsSegmented but it doesn't really help because the function is returning a lot of extra information that I simply don't need. I just need the blob names so I can get at the encoded ID to see what's currently stored and what isn't.
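For reference, the segmented variant I tried looked roughly like this (a sketch only; the exact options are illustrative):

using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ListNamesSketch
{
    public static List<string> GetAllBlobNames(CloudBlobContainer container)
    {
        var names = new List<string>();
        BlobContinuationToken token = null;
        do
        {
            // Flat listing, no extra details, up to 5000 results per round trip.
            var segment = container.ListBlobsSegmented(null, true, BlobListingDetails.None, 5000, token, null, null);
            foreach (var item in segment.Results)
            {
                var blob = item as CloudBlockBlob;
                if (blob != null)
                {
                    names.Add(blob.Name);
                }
            }
            token = segment.ContinuationToken;
        } while (token != null);
        return names;
    }
}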
The query for the blob names and extracting the encoded ID is somewhat time-sensitive, so if I could get it under 1 second, I'd be happy with it. If I stored the files locally, I could get the entire list of files in a few ms, but I have to use Azure storage for this, so that's not an option.
The only thing I can think of to reduce the time it takes to identify the available blobs is to track the names of the blobs being added to or removed from a given folder and store that list in a separate blob. Then, when I need to know the blob names in that folder, I would read the blob with the metadata rather than using ListBlobs. I suppose another option would be to use Azure Table storage in a similar way, but either way it seems like I'm being forced into caching information about a given folder in the container.
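What I have in mind for that separate "index" blob is something like this (a sketch only; the names are made up, and concurrent uploads would need ETag checks or an append blob):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

public static class FolderIndexSketch
{
    // Append a newly uploaded blob's name to a single "index" blob for its folder.
    public static void RecordUpload(CloudBlobContainer container, string folder, string blobName)
    {
        CloudBlockBlob index = container.GetBlockBlobReference(folder + "/_index.txt");
        string existing = index.Exists() ? index.DownloadText() : string.Empty;
        index.UploadText(existing + blobName + Environment.NewLine);
    }

    // Read back every name in the folder without listing the container itself.
    public static string[] GetNames(CloudBlobContainer container, string folder)
    {
        CloudBlockBlob index = container.GetBlockBlobReference(folder + "/_index.txt");
        return index.DownloadText().Split(new[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries);
    }
}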
Is there a better way of doing this or is this generally what people end up doing when you have hundreds of thousands of blobs in a single folder?
As mentioned, Azure Blob storage is a storage system and doesn't help you index the content. There is now an Azure Search indexer which indexes content uploaded to Azure Blob storage; see https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/. With this you can use all the features supported by Azure Search, e.g. listing, searching, paging, sorting, etc. Hope this helps.

Azure blob storage and stream analytics

I read that Azure Blob storage is a good place to save data for statistics or similar purposes, and that you can then query the blobs and show the statistics on a website (dashboard).
But I don't know how to use Stream Analytics to show those statistics. Is there some SDK for creating queries against the blobs and generating JSON data? Or ... I don't know.
And I have more questions about it:
How do I save data to a blob (as JSON data or something else)? I don't know which data format to use.
How do I use Stream Analytics to query the blob and then get the data to show in a dashboard?
Maybe you know how to use this technology. Please help. Thanks, and have a nice day.
@Taras - did you get a chance to toy with the Stream Analytics UI?
When you add a blob input, you can either add an entire container, which means Stream Analytics will scan the entire container for new files, or you can specify a path prefix pattern, which will make Stream Analytics look only in that path.
You can also specify tokens such as {date}, {time} on the path prefix pattern to help guide Stream Analytics on the files to read.
Generally speaking - it is highly recommended to use Event Hub as input for the improved latency.
As for output - you can either use Power BI which would give you an interactive dashboard or you can output to some storage (blob, table, SQL, etc...) and build a dashboard on top of that.
You can also try to do one of the walkthroughs to get a feel for Stream Analytics: https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-twitter-sentiment-analysis-trends/
Thanks!
Ziv.

Azure - Check if a new blob is uploaded to a container

Are there ways to check if a container in Azure has a new blob (it doesn't matter which blob it is)? LastModifiedUtc does not seem to change when a blob is dropped into the container.
You should use a BlobTrigger function in an App Service resource.
Documentation
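A minimal sketch of such a blob-triggered function (classic C#/WebJobs-style syntax; the container name is a placeholder):

using System.IO;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class NewBlobWatcher
{
    // Runs whenever a blob is added or updated in the "uploads" container.
    public static void Run(
        [BlobTrigger("uploads/{name}")] Stream blobStream,
        string name,
        TraceWriter log)
    {
        log.Info($"New or updated blob detected: {name} ({blobStream.Length} bytes)");
    }
}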
Windows Azure Blob Storage does not provide this functionality out of the box. You would need to handle this on your end. A few things come to my mind (just thinking out loud):
If the blobs are uploaded through your application (and not through 3rd-party tools), then after a blob is uploaded you could just update the container properties (for example, add/update a metadata entry with information about the last blob uploaded). You could also make an entry in Azure Table Storage and keep updating it with information about the last blob uploaded. As I said above, this method will only work if all blobs are uploaded through your application.
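For the first option, updating the container metadata after each upload could look roughly like this (the metadata keys are made up):

using System;
using Microsoft.WindowsAzure.Storage.Blob;

public static class UploadTracker
{
    public static void RecordLastUpload(CloudBlobContainer container, string blobName)
    {
        // Read current metadata first so existing entries are not overwritten.
        container.FetchAttributes();
        container.Metadata["LastBlobUploaded"] = blobName;
        container.Metadata["LastBlobUploadedAtUtc"] = DateTime.UtcNow.ToString("o");
        container.SetMetadata();
    }
}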
You could periodically iterate through the blobs in the container and sort them by last-modified date. This method works fine for a blob container with a smaller number of blobs. If the number of blobs is larger (say, in the tens of thousands), you would end up fetching a long list, because blob storage only sorts blobs by blob name.
