Copy millions of files from root AZStorage Blob to subfolders


I’ve got multiple Azure storage blob containers, each with over 1M JSON files in the root. They're impossible to work with (no shocker), so I'm trying to use Data Factory to move them into multiple folders, using a timestamp in the files to create a YYYY-MM-DD/HH folder setup as a partition system. But every approach I’ve tried fails with timeouts or too-many-items limits. I need to open each file, get the timestamp, and use it to move the file to a dynamic path built from the timestamp data. Ideas?
UPDATE: I was able to get around this, but I wouldn't call it an "answer", so I'll just update the question. To create smaller collections, I parameterized the pipeline to accept a file name wildcard. I then created another pipeline that loops over an array of 0-9,a-z and passes each value as a parameter on the dataset. Brute-force workaround... I assume there's got to be a better solution, but this works for now.
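The wildcard workaround from the update can be sketched roughly like this. It is only an illustration of the sharding idea, not the actual pipeline: the function names are made up, and the "other" bucket is an assumption added to show which file names a 0-9,a-z prefix array would never match.

```python
import string
from collections import defaultdict

# The prefix array described in the update: 0-9 followed by a-z.
PREFIXES = list(string.digits + string.ascii_lowercase)


def wildcard_for(prefix):
    """Build the file name wildcard that one pipeline run would receive."""
    return f"{prefix}*"


def shard_by_first_char(names):
    """Group blob names by the prefix wildcard that would pick them up.

    Blob names are case-sensitive, so names starting with an uppercase
    letter or a symbol land in "other": no wildcard in PREFIXES matches them.
    """
    shards = defaultdict(list)
    for name in names:
        first = name[:1]
        shards[first if first in PREFIXES else "other"].append(name)
    return dict(shards)
```

Checking the size of the "other" bucket before running the pipelines shows whether the 36 wildcards really cover every file.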

Read doc: Move data to and from Azure Blob storage
The following articles describe how to move data to and from Azure Blob storage using different technologies.
Azure Storage-Explorer
AzCopy
Python SDK (others: .NET, Java, Node.js, Go, PHP, Ruby)
SSIS
In your case, I would suggest using the SDK, which is available for .NET, Java, Node.js, Python, Go, PHP, and Ruby.
Believe me, if you want to migrate your data from Azure Blob storage, Data Factory is not a good way to do it; it makes the problem more complicated.
(This is my suggestion after migrating over 100 million JSON files (over 2 TB) from Azure Blob storage.)
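A minimal sketch of what the SDK route could look like in Python, not the answerer's actual script. It assumes each JSON file carries an ISO 8601 value in a top-level field called `timestamp` (the field name is hypothetical) and that `container` is a `ContainerClient` from the `azure-storage-blob` v12 package. Blob storage has no rename, so a "move" here is a server-side copy followed by a delete. Depending on how the client is authenticated, the copy source may need a SAS token appended to `src.url`.

```python
import json
from datetime import datetime


def partition_path(blob_name, payload, field="timestamp"):
    """Return the YYYY-MM-DD/HH/<name> destination for one JSON blob.

    `field` is a hypothetical name; use whatever key holds the timestamp.
    """
    doc = json.loads(payload)
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    ts = datetime.fromisoformat(doc[field].replace("Z", "+00:00"))
    return f"{ts:%Y-%m-%d}/{ts:%H}/{blob_name}"


def move_root_blobs(container, prefix=""):
    """Move blobs sitting in the container root into dated virtual folders.

    `container` is expected to be an azure.storage.blob.ContainerClient.
    `prefix` limits one run to a slice of the names (for example "a").
    """
    moved = 0
    for props in container.list_blobs(name_starts_with=prefix or None):
        if "/" in props.name:
            continue  # already inside a virtual folder
        src = container.get_blob_client(props.name)
        dest_name = partition_path(props.name, src.download_blob().readall())
        dest = container.get_blob_client(dest_name)
        dest.start_copy_from_url(src.url)
        # Delete the source only once the copy is reported as finished.
        if dest.get_blob_properties().copy.status == "success":
            src.delete_blob()
            moved += 1
    return moved
```

Running several copies of `move_root_blobs` with different prefixes is one way to spread the work across processes or machines.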

If you have time... I would do the following:
Create an Azure Function that reads the file, gets your timestamp, and does the move operation. Scope the function to a single file. Then use events (Event Grid) in the storage account to trigger the function when a blob is created. That way any new file is moved to the right spot automatically. (Remember that you need to reach a million executions in the consumption model before functions start billing, so this is a low-cost option.)
For the current files, create another function (or, if you want more control, use a logic app, though it will cost a bit more) and set the parallelism on the function or logic app to a low value (to keep an eye on your executions). Have it run a simple for-each loop, with limits, that calls your first function. This will slowly move your files out of that container, eventually getting you down to a reasonable item count to work with in tools like ADF. It might just solve your problem for the long run, as any new files will be categorized accordingly and your backlog is slowly moved as required. If you need to update a DB with a pointer to where each file lives, you could put that piece of code in your function or logic app too. Just my two cents :)
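The two pieces described above can be sketched without any Azure dependencies. This is an illustration under assumptions, not a ready-made function app: `parse_blob_subject` relies on the subject format that Event Grid uses for blob-created events (`/blobServices/default/containers/<container>/blobs/<name>`), while `drain_backlog`, `move_one`, `max_workers` and `limit` are hypothetical names standing in for the throttled for-each over the existing files.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_blob_subject(subject):
    """Split an Event Grid blob event subject into (container, blob name)."""
    head, _, blob_name = subject.partition("/blobs/")
    container = head.rsplit("/containers/", 1)[-1]
    return container, blob_name


def drain_backlog(names, move_one, max_workers=4, limit=1000):
    """Run `move_one` over at most `limit` names with low parallelism.

    `move_one` stands in for the single-file function; keeping
    `max_workers` small makes it easy to watch the execution count.
    """
    batch = list(names)[:limit]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(move_one, batch))
```

Calling `drain_backlog` repeatedly, for example from a timer, chips away at the backlog one bounded batch at a time.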

It is not clear whether you are using the hierarchical folder structure provided by Azure Data Lake Storage Gen2. Generation 1 only simulates a folder structure, which is not optimal.
ADLS Gen2 has several advantages that should help in your case, mainly related to move operations.
To migrate from ADLS Gen 1 to ADLS Gen 2 have a look here.
Additionally, you may explore optimizations on your specific case with the following paper here.

Related

Azure Blob - Create subfolder or create multiple containers

I am migrating some files stored in SQL Server to Azure Storage (Blobs), it's a legacy .NET Framework web application.
The "issue" is: I have multiple countries using this webapp (each one uses its own database instance), for example, let's say: USA, Canada and Mexico.
What would be a good approach to store these files in Azure Blob? I was thinking about creating a single container, for example, orders-container, and inside create a folder structure by country, like this:
orders-container > USA > report.pdf
orders-container > CAN > report1.pdf
orders-container > MEX > report2.pdf
However, I'm second-guessing this approach when I think about performance and management. I don't know whether it would be better to do it this way or to create a container per country, for example:
orders-container-USA > report.pdf
and so on for the other countries.
I also think that maybe if someday I would have to move these files to somewhere else, it would be easier to move if they would have a container per country and not a single container for everyone.
Has anyone faced this kind of design "issue"?
Many thanks!

Since you are using blob storage, you can use either option but the benefit of using separate containers for each country is that access can be segregated. If you use only 1 container, you cannot segregate access to the files within the container (unless you enable hierarchical namespace).
If you have no requirement for access segregation now or in the near future, then go with one container, since it offers ease of file movement and management.
If you need to segregate access to each country's data in the blob, then I would suggest you go with separate containers per country.
If you enable hierarchical namespaces, then you can use a single container and you still have the flexibility of controlling access to each of the (physical) folders.
As @silent mentioned, there is no difference in performance between using a single container and using multiple containers.
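The two layouts from the question can be compared with a small sketch. The helper names are made up for illustration. One detail worth noting: container names must be 3 to 63 characters of lowercase letters, digits and single hyphens, so a name like orders-container-USA would have to become orders-container-usa.

```python
import re

# Container naming rule: 3-63 chars, lowercase letters, digits and hyphens,
# starting and ending with a letter or digit, no consecutive hyphens.
CONTAINER_RE = re.compile(r"^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def single_container_layout(country, filename, container="orders-container"):
    """One container, with the country as a virtual folder in the blob name."""
    return container, f"{country}/{filename}"


def per_country_layout(country, filename, base="orders-container"):
    """One container per country, with the file at the container root."""
    container = f"{base}-{country.lower()}"
    if not CONTAINER_RE.match(container):
        raise ValueError(f"invalid container name: {container}")
    return container, filename
```

Either function returns a (container, blob name) pair, so the rest of the application does not need to know which layout was chosen.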

Azure Data Factory copy activity very slow

Summarize the problem
I'm seeing particularly slow performance out of Azure Data Factory. Searching for similar questions on Stack Overflow turns up nothing except the advice to contact support.
I'm rolling the dice here to see if anyone has seen something similar and knows how to fix it.
In short, every operation I try in ADF results in excruciatingly slow performance.
This includes:
Extracting a zip in blob storage to blob storage
Copying a number of small compressed files into Azure Data Explorer
Copying a number of small uncompressed json files into Azure Data Explorer
Extracting ZIP
Copying to ADX
In both cases the performance is in the kilobytes per second range.
In both cases the copy/import will eventually work but it can take hours.
Describe what you've tried
I've tried:
using different regions
creating and using my own Integration Runtime
playing with different parameters that could potentially affect performance such as parallel connections etc.
Contacting Microsoft support (who sent me here)
Show some code
Not really any code to share. To reproduce just try extracting a zip to and from blob storage. I get ~400KB/s.
In summary, any advice would be gratefully received. If I can't get this bit working I'll have to implement the ingestion factory manually, which on reflection sounds like more fun than I've been having with ADF.

These 'deep' folders will affect copy speed. You should minimize the folder depth and increase the number of copy activities. You can reference this document to troubleshoot copy activity performance, or you can send feedback to Microsoft Azure.

Azure: SQL Compact possible?

I have a RESTful service running on azure. Currently, it has zero persistence. (It is just a REST gateway to another api.) I run it in a single, minimal Azure instance, and expect this will handle all the load this will ever get.
I now need to add some very lightweight persistence to it. A simple table, of 40-200 rows, eight data columns. The data is very static.
Doing the whole SQL Azure thing seems like big overkill for my needs.
My thoughts have been to use:
An XML file, loaded into memory as the db. The XML file is deployed with the code.
Some better way to deploy the XML, so it can be rolled out/updated more easily
SQL Compact (can I do this on Azure?)
___ ?
What is the right path here?
Thank you!

SQL Server Compact would need to store its data somewhere in a persistent manner, so you would need to sync it regularly to persistent storage. That's a lot of extra work, and I have no idea how to do it reliably, so it's likely not a very good idea.
For your simple table the Azure Table Storage might be just enough. If that's not enough then SQL Azure is the next choice.

You can use the XML file as your store. There is no harm in it; in fact it is a very easy and cost-efficient solution. But there is a catch. As you mentioned, you are currently using only one Azure instance, and in that case you can store the XML file in your App_Data. If in future you want to move to 2 Azure instances, you will have to replicate the App_Data folder. In other words, you will need to keep the App_Data folder in sync.
Suggestion
Instead of storing the file in App_Data, store it in a BLOB. You can retrieve it using WebClient and then store it in memory.
Pros: The advantage of a BLOB is that you don't have to sync it.
Cons: There is a cost associated with the number of transactions you make. This will depend on how many times you update the file.
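The "load the XML into memory" idea is language-independent; here is a rough sketch in Python rather than .NET. The XML shape, the `row` element name and the `id` key are all made up for illustration. In practice the text would be downloaded from the blob once at startup instead of coming from a constant.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape for the 40-200 row table from the question.
SAMPLE_XML = """
<rows>
  <row id="1" name="alpha" region="east"/>
  <row id="2" name="beta" region="west"/>
</rows>
"""


def load_table(xml_text, key="id"):
    """Parse the XML once and keep it in memory as a dict keyed by `key`."""
    root = ET.fromstring(xml_text)
    return {row.attrib[key]: dict(row.attrib) for row in root.iter("row")}
```

Because the data is small and static, lookups after that are plain dictionary reads with no storage transactions.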
Summary
If you are going to work with only one Azure Instance, use App_Data
More than one Azure Instance, use BLOB with no syncing or use App_Data with sync.
Do not use Azure Table, as BLOB is the designated store provided for this purpose only.
EDIT
From MSDN post
As far as I know, Windows Azure does not support SQL Compact Edition. SQL Compact Edition stores data in file system which will not be synchronized in multiple instances (a web role may be deployed to more than one instance. An instance is similar to a virtual machine). And files stored in file system will lost when the instance is restarted or reimaged.
Hope this helps you.

How can I back up my Windows Azure table storage?

I would like to be able to back up my table storage, and I also need to move the data (export and import) from my production environment to the development environment on my desktop.
Does anyone know of any tools or methods that I can use to do this?

You can use Cerebrata's Azure Management Cmdlets product. It allows you to download and restore your Azure table storage (and many more things). You can download it from here.

http://clumsyleaf.com/products/tablexplorer
TableXplorer will let you export all table data to an XML or CSV file.

As mentioned by others, there are tools out there that let you download your data in various formats, but it's worth noting that none of these are a true backup like you might be used to getting with SQL server.
As far as I'm aware they all just run a regular table storage query to scan through all of the records in the table and save out the results. If you have a reasonable amount of data (and if you're using table storage then you probably do) it's quite possible that this backup could take an hour or more.
For the sake of simplicity let's say you have two large related tables, A and B. If the backup starts by backing up table A, then moves on to table B, by the time it finishes backing up table B, it might contain records that rely on data in table A that's just not there.
If you just want to refresh the data in your development environment, this could be perfectly acceptable, but you do need to be aware of it.
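To make the "scan and save" behaviour concrete, here is a minimal sketch of what such an export amounts to. It is not how any particular tool is implemented. `entities` is assumed to be any iterable of dicts, for example the results of a table query. Since table entities are schemaless, the column set is the union of all keys seen.

```python
import csv
import io


def export_entities(entities, out):
    """Write entities to CSV; entities without a column get an empty cell."""
    rows = list(entities)
    columns = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


def export_to_text(entities):
    """Convenience wrapper that returns the CSV as a string."""
    buf = io.StringIO()
    export_entities(entities, buf)
    return buf.getvalue()
```

Nothing here takes a snapshot: each table is read as it is at scan time, which is exactly why two related tables can end up inconsistent in the export.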

You can try this solution; it lets you back up and restore your tables and blobs to the same or a different storage account.

Use Sql Server FileStream or traditional File Server?

I am designing a system that's going to have 10 million+ users, each with a photo of about 1-2 MB.
We are going to deploy both the database and the web app on Microsoft Azure.
I am wondering how I should store the photos. There are currently two options:
1. Store all photos using SQL Server FILESTREAM
2. Use a file server
I have no experience with BLOB data at this scale using FILESTREAM.
Can anybody give me any suggestions? What are the pros and cons?
And anyone with Microsoft Azure experiences concerning the large photos store is really appreciated!
Thx
Ryan.

I vote for neither. Use Windows Azure Blob storage. Simple REST API, $0.15/GB/month. You can even serve the images directly from there, if you make them public (like <img src="http://myaccount.blob.core.windows.net/container/image.jpg" />), meaning you don't have to funnel them through your web app.

A database is almost always a horrible choice for any large-scale binary storage needs. A database is best for relational data only; instead, store references in your database to the actual storage location. There are a few factors you should consider:
Cost - SQL Azure costs quite a lot per GB of storage, and has small storage limitations (50GB per database), both of which make it a poor choice for binary data. Windows Azure Blob storage is vastly cheaper for serving up binary objects (though has a bit more complicated pricing system, still vastly cheaper per GB).
Throughput - SQL Azure has pretty good throughput, as it can scale well. However, Windows Azure Blob storage has even greater throughput, as it can scale to any number of nodes.
Content Delivery Network - A feature not available to SQL Azure (though a complex, custom wrapper could be created), but can easily be setup within minutes to piggy-back off your Windows Azure Blob storage to provide limitless bandwidth to your end-users, so you never have to worry about your binary objects being a bottleneck in your system. CDN costs are similar to that of Blob storage, but you can find all that stuff here: http://www.microsoft.com/windowsazure/pricing/#windows
In other words, no reason not to go with Blob storage. It is simple to use, cost effective, and will scale to any needs.

I can't speak to anything Azure-related, but for my money the biggest advantage of using FILESTREAM is that the data gets backed up as part of the normal SQL Server backup process. The size of the data that you are talking about also suggests that FILESTREAM may be a good choice.
I've worked on a SCM system with a RDBMS back end and one of our big decisions was whether to store the file deltas on the file system or inside the DB itself. Because it was cross-RDBMS we had to cook up a generic non-FILESTREAM way of doing it but the ability to do a single shot backup sold us.

FILESTREAM is a horrible option for storing images. I'm surprised MS ever promoted it.
We're currently using it for our images on our website. Mainly the user generated images and any CMS related stuff that admins create. The decision to use FILESTREAM was made before I started. The biggest issue is related to serving the images up. You better have a CDN sitting in front. If not, plan on your system coming to a screeching halt. Of course, most sites have a CDN, but you don't want to be at the mercy of that service going down meaning your system will get overloaded. The amount of stress put on your sql server is the main problem here.
In terms of ease of backup. Your tradeoff there is that your db is MUCH MUCH LARGER and, therefore, the backup takes longer. Potentially, much longer and the system runs slower during the backup. Not to mention, moving backups around takes longer (i.e., restoring prod data in a dev environment or on local machines for dev purposes). Don't use this as a deciding factor.
Most cloud services have automatic redundancy of any files that you store on their system (i.e., aws's S3 and azure's blob). If you're on premise, just make sure you use a shared location for the images and make sure that location is backed up. I think the best option is to set it up so each image (other UGC file types too) has an entry in your db with a path to that file. Going one step further, separate the root path into a config setting and only store the remaining path with the entry. For example, root path in config might be a base url, a shared drive or virtual dir, or a blank entry. Then your entry might have "/files/images/image.jpg". This way, if you move your filestore, you can just update the root config. I would also suggest creating a FileStoreProvider interface (Singleton) that can be used for managing (saving, deleting, updating) these files. This way, if you switch between AWS, Azure, or on premise, you can just create a new Provider.
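The root-path-in-config idea and the provider interface could look something like this. It is only a sketch: the class and method names are illustrative, and the on-premise provider is the only one shown. An Azure or AWS provider would implement the same three methods against its own SDK.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class FileStoreProvider(ABC):
    """Operations the application needs, independent of where files live."""

    @abstractmethod
    def save(self, relative_path, data): ...

    @abstractmethod
    def delete(self, relative_path): ...

    @abstractmethod
    def locate(self, relative_path): ...


class LocalFileStore(FileStoreProvider):
    """On-premise provider: the root comes from config, the DB stores the rest."""

    def __init__(self, root):
        self.root = Path(root)

    def locate(self, relative_path):
        # "/files/images/image.jpg" in the DB becomes <root>/files/images/image.jpg
        return self.root / relative_path.lstrip("/")

    def save(self, relative_path, data):
        target = self.locate(relative_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def delete(self, relative_path):
        self.locate(relative_path).unlink(missing_ok=True)
```

Moving the file store then means changing the configured root, or swapping the provider, while the paths stored in the database stay the same.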

I have a client-server DB where I manage many files (doc, txt, pdf, ...), and all of them go into a FILESTREAM BLOB. Customers have 50+ MB dbs. If you can do the same in Azure, go for it. Having everything in the db is a wonderful thing. It is also considered good policy for Postgres and MySQL.
