I have built an Azure WebJob that takes a queue trigger and a blob reference as input, processes the file, and creates multiple output blob files (it's breaking a PDF into individual pages). In order to output the multiple blobs, I have code in the job that explicitly creates the storage/container connection and does the output. It would be cleaner to let WebJobs handle this, if possible, using attributes.
Is there a way to output multiple blobs to a container? I can output multiple queue messages using the QueueAttribute and an ICollector, but I don't see whether that's possible with a blob (e.g. a container reference to which I can send multiple blobs). Thanks.
Correct - BlobAttribute does not support an ICollector binding. In the current beta release, we have added some new bindings that might help you a bit. For example, you can now bind to a CloudBlobContainer and you could use that to create additional blobs. See the release notes for more details.
Another possibility would be for you to use the IBinder binding (example here). It allows you to imperatively bind to a blob. You could do that multiple times in your function.
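Stripped of the C# attribute machinery, the imperative pattern that IBinder (or a CloudBlobContainer binding) enables is just a loop that creates one output blob per page. A minimal stdlib-only Python sketch of that shape, where `upload` stands in for whatever blob-writing call your binding gives you (all names here are hypothetical, not WebJobs SDK API):

```python
# Sketch of the "one output blob per input page" loop that an imperative
# binding enables. The upload callable is a stand-in for the real blob
# write; names are hypothetical.

def page_blob_name(source_name, page_index):
    """Derive a per-page blob name, e.g. report.pdf -> report/page-0001.pdf."""
    stem = source_name.rsplit(".", 1)[0]
    return "{}/page-{:04d}.pdf".format(stem, page_index)

def write_pages(source_name, pages, upload):
    """Upload each page's bytes as its own blob; return the names written."""
    written = []
    for i, page_bytes in enumerate(pages, start=1):
        name = page_blob_name(source_name, i)
        upload(name, page_bytes)  # in C#: binder/container creates the blob here
        written.append(name)
    return written
```

The point is that each iteration binds a fresh output, which is exactly what a single declarative Blob attribute cannot express.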
In Synapse I've setup 3 different pipelines. They all gather data from different sources (SQL, REST and CSV) and sink this to the same SQL database.
Currently they all run during the night, but I already know that the question of running them more frequently is coming. I want to prevent my pipelines from running through all the sources while nothing has changed in them.
Therefore I would like to store the last successful sync run of each pipeline (or pipeline activity). Before the next start of each pipeline, I want to create a new pipeline, a fourth one, which checks whether something has changed in the sources. If so, it triggers the execution of one, two, or all three of the pipelines.
I still see some complications in doing this, so I'm not fully sure how to approach it. All help and thoughts are welcome. Does anyone have experience doing this?
This is (at least in part) the subject of the following Microsoft tutorial:
Incrementally load data from Azure SQL Database to Azure Blob storage using the Azure portal
You're on the correct path - the crux of the issue is creating and persisting "watermarks" for each source from which you can determine if there have been any changes. The approach you use may be different for different source types. In the above tutorial, they create a stored procedure that can store and retrieve a "last run date", and use this to intelligently query tables for only rows modified after this last run date. Of course this requires the cooperation of the data source to take note of when data is inserted or modified.
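Reduced to its essentials, the watermark pattern from that tutorial is: read the last stored watermark, fetch only rows modified after it, then advance the watermark. A stdlib-only Python sketch of that flow, where `fetch_since` stands in for the real source query and the dict stands in for the stored-procedure-backed store (names are illustrative):

```python
# Sketch of the high-watermark pattern: only rows modified after the last
# successful run are fetched, then the watermark advances. fetch_since is
# a stand-in for the real source query (e.g. WHERE LastModified > @last).

def incremental_load(store, source_id, fetch_since):
    """store maps source_id -> last watermark (here an ISO timestamp string)."""
    last = store.get(source_id, "1900-01-01T00:00:00")
    rows = fetch_since(last)
    if rows:
        # Advance the watermark to the newest row we actually saw, so a
        # failed run does not skip rows on the next attempt.
        store[source_id] = max(r["modified"] for r in rows)
    return rows
```

A second run with an unchanged source then fetches nothing, which is precisely the "skip pipelines whose sources are unchanged" behaviour the question asks for.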
If you have a source that cannot be intelligently queried in part (e.g. a CSV file), you still have options, such as using the Get Metadata activity to query the lastModified property of a source file (or even its contentMD5 if using Blob or ADLS Gen2) and comparing this to a value saved during your last run (you would have to pick a place to store this, e.g. an operational DB, an Azure Table, or a small blob file) to determine whether it needs to be reprocessed.
If you want to go crazy, you can look into streaming patterns (might require dabbling in HDInsight or getting your hands dirty with Azure Event Hubs to trigger ADF) to move from the scheduled trigger to automatic ingestion as new data appears at the sources.
I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?
I have tried:
Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary, but I chose 1000 because I was concerned about overloading the command. Even 1000 files makes a command with 84k characters.) Extra caveat here: I cannot rename the files with this method, which is required for about 25% of them due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.
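The batching in that third approach can be generated rather than hand-built. A stdlib-only sketch that chunks blob paths into semicolon-joined groups of bounded size for --include-path (the batch size of 1000 is the question's own arbitrary choice, not an azcopy limit I can vouch for):

```python
def include_path_batches(paths, batch_size=1000):
    """Yield semicolon-joined groups of paths for azcopy's --include-path.

    One azcopy invocation per batch amortises process startup overhead
    across many files instead of paying it once per file.
    """
    for i in range(0, len(paths), batch_size):
        yield ";".join(paths[i:i + batch_size])
```

Each yielded string becomes the value of one --include-path argument in one azcopy command line.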
Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy then copying the files with that tag, but I couldn't find the arguments to do that.
Please check the references below:
1. AzCopy would give the best performance for copying blobs within the same storage account or to another storage account. You can force a synchronous copy by specifying the /SyncCopy parameter, which ensures that the copy operation gets consistent speed (azcopy sync | Microsoft Docs). But note that AzCopy performs a synchronous copy by downloading the blobs to local memory and then uploading them to the destination blob storage, so performance will also depend on network conditions between the location where AzCopy is run and the Azure datacenter location. Also note that /SyncCopy might generate additional egress cost compared to an asynchronous copy; the recommended approach is to use this sync option with AzCopy on an Azure VM in the same region as your source storage account to avoid egress cost.
Choose a tool and strategy to copy blobs - Learn | Microsoft Docs
2. StartCopyAsync is one of the ways you can try for a copy within a storage account.
References:
1. .net - Copying file across Azure container without using azcopy - Stack Overflow
2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
3. You may consider Azure Data Factory in the case of millions of files, but note that it may be expensive and occasional timeouts may occur; it may be worth it for repeated work.
References:
1. Copy millions of files (andrewconnell.com), GitHub (Microsoft docs)
2. File Transfer between container to another container - Microsoft Q&A
4. Also check out and try Azure Storage Explorer to copy a blob container to another.
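One more angle on the question's Python approach: each start_copy_from_url call only schedules an asynchronous server-side copy, so the 7 files/second is mostly per-request round-trip latency, and fanning the calls out over a thread pool usually helps. A stdlib-only sketch where `copy_one` stands in for the real SDK call (the worker count is illustrative, not a tuned recommendation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(blob_names, copy_one, max_workers=32):
    """Run copy_one(name) for every blob concurrently.

    Because a server-side copy request is dominated by request latency
    rather than data transfer, many requests in flight at once raise
    throughput. Returns the names whose copy call raised, for retry.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_one, name): name for name in blob_names}
        for fut, name in futures.items():
            try:
                fut.result()
            except Exception:
                failed.append(name)
    return failed
```

In real use, `copy_one` would build the destination BlobClient and call start_copy_from_url on it, exactly as in the question's single-threaded version.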
We have a few third-party companies sending us emails with CSV/Excel data files attached. I want to build a pipeline (preferably in ADF) to get the attachments, load the raw files (attachments) to blob storage, process/transform them, and finally load the processed files to another directory in blob storage.
To get the attachment, I think I can use the instructions (using Logic App) in this link. Then, trigger an ADF pipeline using storage trigger, get the file and process it and do the rest of the stuff.
However, first, I'm not sure how reliable storage triggers are.
Second, although it seems OK, this approach makes it difficult to monitor the runs and make sure things are working properly. For example, if the Logic App fails to read/load the attachments for any reason, you can't pick that up in ADF, as nothing has been written to the blob to trigger the pipeline.
Anyway, is this approach good, or are there better ways to do this?
Thanks
If you are able to save the attachments into a blob, you can schedule an ADF pipeline that imports every file in the blob every minute, or every 5 minutes or so.
Do the files have the same data structure every time? (That makes things much easier.)
It is most common to schedule imports in ADF, rather than triggering on external events.
I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes in to ADLS. During execution I notice that there is a folder created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and many files in it, also GUIDs. The folder is temporary and after around 30 seconds it disappears.
Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? Although it appears in the Data Explorer in the portal, does it show up through the API? I am concerned it may create issues down the line, even though it is a temporary structure.
Any insight appreciated - a Google turned up little.
So what you're seeing here is something that Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.
For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory for each thread of the parallel operation. Once complete the process concatenates the threads into the single expected destination file.
Comparison: this is similar to what PolyBase does in SQLDW with its 8 external readers that hit a file in 512MB blocks.
I understand your concerns here. I've also done battle with this where the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
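If a downstream service lists the output directory rather than targeting an explicit file path, one defensive option is to ignore the transient GUID-named entries. A stdlib-only sketch, assuming the temp folders are bare GUID names as described in the question (that naming is an observation, not a documented contract):

```python
import re

# Matches a bare GUID, e.g. "0f8fad5b-d9cb-469f-a165-70867728950e".
_GUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")

def visible_entries(names):
    """Filter transient GUID-named temp entries out of a directory listing."""
    return [n for n in names if not _GUID.match(n)]
```

Being explicit about the target file path, as advised above, remains the safer option; this filter is only a fallback for listing-based consumers.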
One other thing: I've had problems using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4 - 8GB region. Be warned!
Side note. I've found PowerShell most reliable for handling uploads into Data Lake Store.
Hope this helps.
Is an Azure blob available for download whilst it is being overwritten with a new version?
From my tests using Cloud Storage Studio the download is blocked until the overwrite is completed, however my tests are from the same machine so I can't be sure this is correct.
If it isn't available during an overwrite, then I presume the solution (to maintain availability) would be to upload using a different blob name and then rename once complete. Does anyone have any better solution than this?
The blob is available during overwrite. What you see will depend on whether you are using a block blob or a page blob, however. For block blobs, you will download the older version until the final block commit. That final PutBlockList operation atomically updates the blob to the new version. I am not actually sure, however, what happens to a very large blob that you are in the middle of downloading when a PutBlockList atomically updates it. The choices are: a) the request continues with the older blob, b) the connection is broken, or c) you start downloading bytes of the new blob. What a fun thing to test!
If you are using page blobs (without a lease), you will read inconsistent data as the page ranges are updated underneath you. Each page range update is atomic, but it will look weird unless you lease the blob and keep other readers out (readers can snapshot a leased blob and read the state).
I might try to test the block blob update in middle of read scenario to see what happens. However, your core question should be answered: the blob is available.
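The block-blob behaviour described above, where staged blocks stay invisible to readers until the block list is committed in one atomic step, can be mimicked in a few lines. A stdlib-only toy model, not Azure SDK code (the method names echo the REST operations for readability):

```python
class BlockBlobModel:
    """Toy model of block blob semantics: readers see the last committed
    version; staged blocks only become visible on commit_block_list."""

    def __init__(self, initial=b""):
        self._committed = initial
        self._staged = {}

    def stage_block(self, block_id, data):
        self._staged[block_id] = data          # invisible to readers

    def commit_block_list(self, block_ids):
        # Atomic switch: the new version appears all at once, mirroring
        # what PutBlockList does server-side.
        self._committed = b"".join(self._staged[b] for b in block_ids)
        self._staged.clear()

    def download(self):
        return self._committed                 # old version until commit
```

This is why the question's "upload under a different name, then rename" workaround is unnecessary for block blobs: the commit itself is the atomic swap.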