I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.
There is a mix of .csv files and .zip files (each containing only one csv file).
I have one dataset for copying the csv files and another for copying zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder containing the csv file, and I need it to respect the folder hierarchy without creating any folders.
Is this possible in Azure Data Factory?
Good question, I ran into similar trouble* and it doesn't seem to be well documented.
If I remember correctly, Data Factory assumes a ZipDeflate source could contain more than one file and appears to create a folder no matter what.
If, on the other hand, you have Gzip files (which can only hold a single file), then it will create just that file.
You'll probably already know this bit, but having it at the forefront of my mind helped me realise why Data Factory's default is sensible:
My understanding is that the Zip standard is an archive format which happens to use the Deflate algorithm. Being an archive format, it can naturally contain multiple files.
Gzip (for example), by contrast, is just a compression algorithm; it doesn't support multiple files (unless they are tar-archived first), so it decompresses to a single file without creating a folder.
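To make that distinction concrete, here is a small Python illustration (file names are made up); it isn't ADF-specific, it just shows why one format implies a folder and the other doesn't:

import gzip
import zipfile

# Illustration only -- file names are made up.
# A .zip is an archive: it can hold any number of members,
# which is why Data Factory unpacks it into a folder.
with zipfile.ZipFile("data.zip") as zf:
    print(zf.namelist())          # e.g. ['data.csv', 'other.csv']

# A .gz is just a compressed stream: exactly one payload, no member listing,
# so it decompresses to a single file.
with gzip.open("data.csv.gz", "rt") as f:
    print(f.readline())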
You could have an additional data factory step to take the hierarchy and copy it to a flat folder perhaps, but that leads to random file names (which you may or may not be happy with). For us it didn't work as our next step in the pipeline needed predictable filenames.
n.b. Data Factory does not move files, it copies them, so if they're very large this could be a pain. You can, however, trigger a metadata move operation via the Data Lake Store API or PowerShell etc.
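For what it's worth, if you do end up needing to flatten or relocate the extracted files afterwards, a rename on ADLS Gen2 is a metadata operation and avoids re-copying the data. A minimal sketch using the azure-storage-file-datalake Python SDK, with hypothetical account, filesystem and path names:

from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account, filesystem and paths -- adjust to your lake.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key-or-credential>",
)
fs = service.get_file_system_client("my-filesystem")

# Rename is a metadata operation on ADLS Gen2: the data is not re-copied.
file_client = fs.get_file_client("landing/archive_name/data.csv")
file_client.rename_file("my-filesystem/landing/data.csv")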
*Mine was a slightly crazier situation, in that I was receiving files named .gz from a source system which were in fact zip files in disguise! In the end the best option was to ask our source system to change to true gzip files.
I am trying to create an ADF pipeline that does the following:
Takes in a csv with 2 columns, eg:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt
Essentially I want to copy/move the filepath from the source directory into the Destination directory (Both these directories are in Azure blob storage).
I think there is a way to do this using lookups, but lookups are limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?
Thanks in advance,
This is a complex scenario for Azure Data Factory. As you mentioned, there are more than 5000 file path records in your CSV file, which also means the same number of Source and Destination paths. If you build this architecture in ADF, it goes like this:
You would use the Lookup activity to read the Source and Destination paths, but you can't read all of them in one go because of the Lookup activity's limit.
You would then iterate over the records using a ForEach activity.
You would also need to split each path so that you get the container, directory and file names separately to pass to the datasets created for the Source and Destination locations. Once you split the paths, you would use Set Variable activities to store the Source and Destination container, directory and file names, and these variables would then be passed to the datasets dynamically. This is the tricky part: if even a single record fails to split properly, your pipeline fails.
If the above steps complete successfully, you don't need to worry about the Copy activity; as long as all the parameters receive the expected values under the Source and Destination tabs of the Copy activity, it will work properly.
My suggestion is to use a programmatic approach instead. Use Python, for example, to read the CSV file with the pandas module, iterate over each path, and copy the files. This works fine even if you have 5000+ records.
You can refer to this SO thread, which shows how to implement the same thing programmatically.
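As a rough sketch of that programmatic approach (the connection string and CSV name are placeholders, and the source URLs may need a SAS token appended if the containers are private):

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and CSV name -- adjust to your account.
service = BlobServiceClient.from_connection_string("<connection-string>")
df = pd.read_csv("paths.csv")  # columns: Source, Destination

for _, row in df.iterrows():
    src_container, src_path = row["Source"].split("/", 1)
    dst_container, dst_path = row["Destination"].split("/", 1)

    # Server-side copy: the data never passes through the client.
    # If the source container is private, append a SAS token to src_url.
    src_url = service.get_blob_client(src_container, src_path).url
    service.get_blob_client(dst_container, dst_path).start_copy_from_url(src_url)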
First, if you want to maintain a hierarchical pattern in your data, I recommend using ADLS (Azure Data Lake Storage); this will guarantee a certain structure for your data.
Second, if you have a folder in Blob Storage and you would like to copy files into it, use a Copy Activity; you should define two datasets, one for the source and one for the sink.
Check this link: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
Let's say there are 10 folders in my bucket. I want to split the contents of the folders in a ratio of 0.8, 0.1, 0.1 and move them into three new folders: Train, Test and Val. I have previously done this by downloading the folders, splitting them and uploading them again. I now want to split the folders in the bucket itself.
I was able to connect to the bucket using the "google-cloud-storage" library from a notebook, following the post here. I was able to download and upload files. I'm not sure how to split the folders without downloading the content.
Appreciate the help.
PS: I don't need the full code, just an outline of the approach will do
With Cloud Storage you can only READ and WRITE (CREATE/DELETE). You can't truly move a blob inside a bucket; even if a move operation exists in the console or in some client libraries, under the hood it is a WRITE/CREATE of the content at another path followed by a DELETE of the previous path.
Thus, your strategy must follow the same logic:
Perform a gsutil ls to list all the files
Copy (or move) 80% into one directory, and 10% and 10% into the two other directories
Delete the old directory (not needed if you used a move operation).
It's quicker than downloading and uploading the files, but it still takes time. Because this isn't a file system, only API calls, each file costs a call, and if you have thousands of files it can take hours!
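With the google-cloud-storage Python library you already have working, a rough sketch for one source folder (bucket and folder names are made up; repeat or loop for each of your 10 folders) could look like this:

import random
from google.cloud import storage

# Made-up bucket and folder names -- adjust to your project.
client = storage.Client()
bucket = client.bucket("my-bucket")
prefix = "folder1/"

blobs = list(client.list_blobs(bucket, prefix=prefix))
random.shuffle(blobs)

n = len(blobs)
cuts = [int(0.8 * n), int(0.9 * n)]
splits = {
    "Train/": blobs[:cuts[0]],
    "Test/": blobs[cuts[0]:cuts[1]],
    "Val/": blobs[cuts[1]:],
}

for target, subset in splits.items():
    for blob in subset:
        new_name = target + blob.name[len(prefix):]  # keep the relative path
        bucket.copy_blob(blob, bucket, new_name)      # server-side copy
        blob.delete()                                 # "move" = copy + delete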
I have a requirement to regularly update an existing set of 30+ CSV files with new data (append to the end). There is also a requirement to possibly remove the first X rows as Y rows are added to the end.
Am I using the correct services for this and in the correct manner?
Azure Blob Storage to store the Existing and Update files.
Azure Data Factory with Data Flows. A pipeline and Data Flow per CSV I want to transform, which conducts a merge of datasets (existing + update), producing a sink fileset that drops the new combined CSV back into Blob Storage.
A trigger on the Blob Storage Updates directory to trigger the pipeline when a new update file is uploaded.
Questions:
Is this the best approach for this problem? I need a solution with minimal input from users (I'll take care of the Azure ops, so long as all they have to do is upload a file and download the new one).
Do I need a pipeline and Data Flow per CSV file? Or could I have one per transformation type (i.e. one for just appending, another for appending and removing the first X rows)?
I was going to create a directory in Blob Storage for each of the CSVs (30+ dirs) and create a dataset for each directory's existing and update files.
Then create a dataset for each output file into some new/ directory.
Depending on the size of your CSVs, you can either perform the append right inside the data flow by taking both the new data and the existing CSV file as sources and then using a Union to combine the two files into a new one.
Or, with larger files, use the Copy Activity's "merge files" setting to merge the two files together.
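This isn't ADF itself, but as a sanity check on the logic the data flow (or merged copy) has to reproduce, a rough pandas equivalent of the append-and-trim step, with made-up file names, would be:

import pandas as pd

# Made-up paths -- in practice these are the existing and update blobs.
existing = pd.read_csv("existing/data.csv")
update = pd.read_csv("updates/data.csv")

combined = pd.concat([existing, update], ignore_index=True)

# Optionally drop the first X rows as the new ones are appended.
rows_to_drop = 100  # X, per whatever the retention rule says
combined = combined.iloc[rows_to_drop:]

combined.to_csv("merged/data.csv", index=False)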
I am trying to generate very large Microsoft Excel files in a browser application. While there are JavaScript libraries which allow me to generate XLSX files from the browser, the issue with them is that they require all of the document contents to be loaded in memory before writing them, which gives me an upper bound on how much I can store in a single file before the browser crashes. Thus I would like to have a write stream that allows me to write data sequentially into a Excel file using something like StreamSaver.js.
Doing such a thing with CSV would be trivial:
// Assuming writer comes from StreamSaver.js, e.g. streamSaver.createWriteStream("rows.csv").getWriter()
const encoder = new TextEncoder(); // encode each row as bytes before writing
for (let i = 0; i < paginatedRequest.length; i++) {
  writer.write(encoder.encode(paginatedRequest[i].join(",") + "\n"));
}
The approach above would allow me to write an extremely large number of CSV rows to an output stream without having to store all of the data in memory. My question is: is this technically feasible to do with an XLSX file?
My main concern here is that internally XLSX files are ZIP archives, so my first idea was to use an uncompressed ZIP archive and stream writes to it, but every file inside a ZIP archive comes with a header which indicates its size and I can't possibly know that beforehand. Is there a workaround that I could possibly use for this?
Lastly, if not possible, are there any other streamable spreadsheet formats which can be opened in Excel and "look nice"? (There is a flat OpenDocument specification with the .fods extension, so I could stream writes to such a file. Sadly, Microsoft Office does not support flat OpenDocument files.)
A possible solution would be to generate a small, static XLSX file which imports an external CSV file using Excel's Data Model. Since generating a streaming CSV file is almost trivial, that could be a feasible solution. However, it's somewhat unsatisfactory:
It's rather annoying to have the user download two files instead of one (or a compressed file that they'd need to uncompress).
Excel does not support relative paths to external CSV files, so we'd also need a macro to update the path every time the file is opened (if this is feasible at all). This requires the user to accept the use of macros, which comes with a security warning and is not terribly nice for them.
I am in the process of copying a large set of data to an Azure Blob Storage area. My source set of data has a large number of files that I do not want to move, so my first thought was to create a DataSet.csv file of just the files I do want to copy. As a test, I created a csv file where each row is a single file that I want to include.
BasePath,DstBlobPathOrPrefix,BlobType,Disposition,MetadataFile,PropertiesFile
"\\SERVER\Share\Folder1\Item1\Page1\full.jpg","containername/Src/Folder1/Item1/Page1/full.jpg",BlockBlob,overwrite,"None",None
"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg","containername/Src/Folder1/Item1/Page1/thumb.jpg",BlockBlob,overwrite,"None",None
etc.
When I run the Import/Export tool (WAImportExport.exe) it seems to create a single folder on the destination for each file, so that it ends up looking like:
session#1
-session#1-0
-session#1-1
-session#1-2
etc.
All the files share the same base path, and each file's destination name is given in the CSV. Is there any way to avoid this, so that all the files go into a single "session#1" folder? If possible, I'd like to avoid creating N-thousand folders on the destination drive.
I don't think you should worry about the way the files are stored on the disk, as they will be converted back to the directory structure you specified in the .csv file.
Here's what the documentation says:
How does the WAImportExport tool work on multiple source dir and disks?
If the data size is greater than the disk size, the WAImportExport tool will distribute the data across the disks in an optimized way. The data copy to multiple disks can be done in parallel or sequentially. There is no limit on the number of disks the data can be written to simultaneously. The tool will distribute data based on disk size and folder size. It will select the disk that is most optimized for the object-size. The data when uploaded to the storage account will be converged back to the specified directory structure.