azureml register datastore file share or blob storage - azure

I have a folder called data with a bunch of CSVs (about 80); the file sizes are fairly small. This data is clean and has already been preprocessed. I want to upload this data folder and register it as a datastore in AzureML. Which would be best for this scenario: a datastore created with File Share, or a datastore created with Blob Storage?

AFAIK, based on your scenario you can make use of Azure File Share to create the datastore.
Please note that Azure Blob storage is suitable for uploading large amounts of unstructured data, whereas Azure File Share is suitable for uploading and processing structured files in chunks (more interaction with the app to share files).
I have a folder called data with a bunch of CSVs (about 80); the file sizes are fairly small. This data is clean and has already been preprocessed.
As you mentioned, the CSV data is clean and preprocessed, so it counts as structured data. So you can make use of Azure File Share to create the datastore.
To register a datastore with Azure File Share you can make use of this MsDoc.
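As a rough illustration, here is a minimal sketch using the azureml-core Python SDK; the workspace config, datastore name, share name, and account details below are placeholder assumptions:

from azureml.core import Workspace, Datastore

ws = Workspace.from_config()  # assumes a config.json for your workspace

# Register the file share as a datastore (placeholder names)
datastore = Datastore.register_azure_file_share(
    workspace=ws,
    datastore_name="csv_datastore",
    file_share_name="myshare",
    account_name="mystorageaccount",
    account_key="<account-key>",
)

# Upload the local ./data folder of CSVs to the datastore
datastore.upload(src_dir="./data", target_path="data", overwrite=True)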
To know more about Azure File Share and Azure Blob storage, please find the links below:
Azure Blob Storage or Azure File Storage by Mike
azureml.data.azure_storage_datastore.AzureFileDatastore class - Azure Machine Learning Python | Microsoft Docs

Related

Storing files in wwwroot folder vs storing in Azure blob storage

I have a classic ASP.NET MVC project which needs to be migrated to Azure and hosted in App Services. Currently this project saves files in the root folder, and a file could be as large as 2 GB.
Now the question is: should I leave the current logic that stores files in the wwwroot folder, as in \wwwroot\Files\myfile.txt, or should I store them in a blob?
I am looking for the best practice and do not want to change the current logic. Can someone give me an idea?
Thanks
Storing files in Azure Blob Storage:
According to the documentation:
Azure Blob Storage enables the creation of data lakes for analytics purposes and provides storage for the development of powerful cloud-native and mobile apps. Reduce costs by using tiered storage for long-term data and scalability for high-performance computing and machine learning workloads.
The documentation also says:
SAS enables you to securely upload and download files from Azure Blob Storage without having to share the connection string.
While uploading, you can split a large file into small blocks, which decreases the upload time; after uploading, the blocks are combined back into a single blob.
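For illustration, here is a minimal sketch of that chunked upload via a SAS URL with the azure-storage-blob Python SDK; the SAS URL and file name are placeholder assumptions:

import base64
from azure.storage.blob import BlobClient, BlobBlock

# Hypothetical SAS URL pointing at the destination blob
sas_url = "https://myaccount.blob.core.windows.net/files/myfile.txt?<sas-token>"
blob_client = BlobClient.from_blob_url(sas_url)

chunk_size = 4 * 1024 * 1024  # upload in 4 MiB blocks
block_ids = []
with open("myfile.txt", "rb") as f:
    index = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # Block IDs must be base64-encoded strings of equal length
        block_id = base64.b64encode(f"{index:08d}".encode()).decode()
        blob_client.stage_block(block_id=block_id, data=chunk)
        block_ids.append(BlobBlock(block_id=block_id))
        index += 1

# Commit the staged blocks so they are combined into a single blob
blob_client.commit_block_list(block_ids)

Note that upload_blob alone would also chunk large uploads automatically; the explicit block staging above just mirrors the split-and-combine behavior described.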
Storing file in wwwroot folder:
According to the documentation:
Static resource files are stored in the web root; the default directory is {content root}/wwwroot.
The documentation also makes clear that storage capacity depends on the pricing tier.
Web app performance can be affected by uploading large files.
In Azure Linux web apps, uploading a file of around 2 GB might lead to a timeout exception.

Moving files among azure blob without downloading

Currently, I have a blob container with about 5 TB of archive files. I need to move some of those files to another container. Is there a way to avoid downloading and re-uploading the files involved? I do not need to access the data of those files. I do not want to incur any charges for reading archive files either.
Thanks.
I suggest that you use Data Factory. It is usually used to transfer big data.
Copy performance and scalability achievable using ADF
You can learn from the tutorial below:
Copy and transform data in Azure Blob storage by using Azure Data Factory
Hope this helps.
You can use AzCopy for that. It is a command-line utility that you can use to initiate server-to-server transfers:
AzCopy uses server-to-server APIs, so data is copied directly between storage servers. These copy operations don't use the network bandwidth of your computer.
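If you would rather do this from code than from ADF or AzCopy, here is a minimal sketch of the same kind of server-side copy using the azure-storage-blob Python SDK; the connection string and container/blob names are placeholders, and note that archive-tier blobs have additional rehydration constraints:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
source_blob = service.get_blob_client("source-container", "archive/file1.dat")
dest_blob = service.get_blob_client("dest-container", "archive/file1.dat")

# The copy runs server-side; nothing is downloaded to your machine
dest_blob.start_copy_from_url(source_blob.url)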

Limits on File Count for Azure Blob Storage

Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, for a total of nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure subscription, so essentially the amount of data that you will be able to store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then essentially archived. For this I suggest you take a look at Azure Cool Blob Storage. It is meant precisely for the case where you want to store objects that are not frequently accessed, yet when you do need them they are accessible almost immediately. The advantage of using Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, they are moved to Cool Blob Storage. This Cool Blob Storage account can be in the same or a different Azure subscription.
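As a sketch of the move step: with the newer azure-storage-blob Python SDK you can also change the access tier of an individual blob after processing, a blob-level alternative to the separate Cool account described above (connection string and names are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob_client = service.get_blob_client("data", "processed/file1.txt")

# Move the processed blob to the Cool access tier
blob_client.set_standard_blob_tier("Cool")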
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, for a total of nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreements for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file processing work. Since it will run once a day, you could add a TimerTrigger function.
using Microsoft.Azure.WebJobs;

public static class DailyProcessing
{
    // This function will be executed once a day (at midnight, per the CRON expression)
    public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
    {
        // write the processing job here
    }
}
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time you want.
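For example, a minimal download sketch with the azure-storage-blob Python SDK (connection details and names are placeholders):

from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    "<connection-string>", container_name="data", blob_name="sensor/readings.txt"
)

# Download the blob's contents for review or reprocessing
content = blob_client.download_blob().readall()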
In addition, if your data processing job is very complicated, you could also store your data in Azure Data Lake Store and do the data processing with Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage

Creating large blob out of small blobs in Azure

I have a large number of tiny blob files created by the Azure Application Insights service.
I would like to combine these blob files and create one blob file per hour. This is because we have on-premises data that I would like to integrate with this data, and I wouldn't want to download millions of small blob files.
My question is what Azure Service I can use for this?

Azure Blob storage and HDF file storage

I am in the middle of developing a cloud server and I need to store HDF files ( http://www.hdfgroup.org/HDF5/ ) using blob storage.
Functions related to creating, reading, writing, and modifying data elements within the file come from the HDF APIs.
I need to get the file path to create the file or read or write it.
Can anyone please tell me how to create a custom file on Azure Blob ?
I need to be able to use the API like shown below, but passing the Azure storage path to the file.
http://davis.lbl.gov/Manuals/HDF5-1.4.3/Tutor/examples/C/h5_crtfile.c
The files I am trying to create can get really huge, ~10-20 GB, so downloading them locally and modifying them is not an option for me.
Thanks
Shashi
One possible approach, admittedly fraught with challenges, would be to create the file in a temporary location using the code you included, and then use the Azure API to upload the file to Azure as a file input stream. I am in the process of researching how size restrictions are handled in Azure storage, so I can't say whether an entire 10-20 GB file could be moved in a single upload operation. But since the Azure API reads from an input stream, you should be able to create a combination of operations that would result in the information you need residing in Azure storage.
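As a sketch of the upload step under that approach, using the azure-storage-blob Python SDK, which streams the local file in blocks so a 10-20 GB file does not need to fit in memory; connection details and file names are placeholder assumptions:

from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    "<connection-string>", container_name="hdf-files", blob_name="results.h5"
)

# Stream the locally created HDF file up to blob storage
with open("results.h5", "rb") as data:
    blob_client.upload_blob(data, overwrite=True, max_concurrency=4)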
Can anyone please tell me how to create a custom file on Azure Blob? I need to be able to use the API like shown below, but passing the Azure storage path to the file.
http://davis.lbl.gov/Manuals/HDF5-1.4.3/Tutor/examples/C/h5_crtfile.c
Windows Azure Blob storage is a service for storing large amounts of unstructured data that can be accessed via HTTP or HTTPS. So from an application point of view, Azure Blob storage does not work like a regular disk.
Microsoft provides quite good APIs (C#, Java) to work with blob storage. They also provide the Blob Service REST API to access blobs from any other language (where a specific blob storage API is not provided, as with C++).
A single block blob can be up to 200 GB, so it should easily store files of ~10-20 GB.
I am afraid that the provided example will not work with Windows Azure Blob. However, I do not know the HDF file storage internals; maybe they provide some Azure Blob storage support.
