Azure Import/Export tool dataset.csv and multiple session folders - azure

I am in the process of copying a large set of data to an Azure Blob Storage area. My source set of data has a large number of files that I do not want to move, so my first thought was to create a DataSet.csv file of just the files I do want to copy. As a test, I created a csv file where each row is a single file that I want to include.
BasePath,DstBlobPathOrPrefix,BlobType,Disposition,MetadataFile,PropertiesFile
"\\SERVER\Share\Folder1\Item1\Page1\full.jpg","containername/Src/Folder1/Item1/Page1/full.jpg",BlockBlob,overwrite,"None",None
"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg","containername/Src/Folder1/Item1/Page1/thumb.jpg",BlockBlob,overwrite,"None",None
etc.
When I run the Import/Export tool (WAImportExport.exe) it seems to create a single folder on the destination for each file, so that it ends up looking like:
session#1
-session#1-0
-session#1-1
-session#1-2
etc.
All the files share the same base path, and each row specifies its own filename in the CSV. Is there any way to avoid this, so that all the files go into a single "session#1" folder? If possible, I'd like to avoid creating N-thousand folders on the destination drive.

I don't think you should worry about the way the files are stored on the disk, as they will be converted back to the directory structure you specified in the .csv file.
Here's what the documentation says:
How does the WAImportExport tool work on multiple source dir and disks?
If the data size is greater than the disk size, the WAImportExport
tool will distribute the data across the disks in an optimized way.
The data copy to multiple disks can be done in parallel or
sequentially. There is no limit on the number of disks the data can be
written to simultaneously. The tool will distribute data based on disk
size and folder size. It will select the disk that is most optimized
for the object-size. The data when uploaded to the storage account
will be converged back to the specified directory structure.
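As an aside, a dataset.csv like the one above does not have to be written by hand. Below is a minimal Python sketch that builds the rows from a list of selected files; the share path, container name and `Src` prefix are placeholders taken from the question, not a prescribed layout:

```python
import csv

# Hypothetical selection of files to copy (paths follow the question's example).
files = [
    r"\\SERVER\Share\Folder1\Item1\Page1\full.jpg",
    r"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg",
]

def make_dataset_rows(paths, share_root=r"\\SERVER\Share", prefix="containername/Src"):
    """Build WAImportExport dataset rows: one BlockBlob entry per source file."""
    rows = []
    for p in paths:
        # Convert the Windows path under the share root to a forward-slash blob path.
        rel = p[len(share_root):].lstrip("\\").replace("\\", "/")
        rows.append([p, f"{prefix}/{rel}", "BlockBlob", "overwrite", "None", "None"])
    return rows

with open("dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["BasePath", "DstBlobPathOrPrefix", "BlobType",
                     "Disposition", "MetadataFile", "PropertiesFile"])
    writer.writerows(make_dataset_rows(files))
```
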

Related

Databricks reading from a zip file

I have mounted an Azure Blob Storage in the Azure Databricks workspace filestore. The mounted container has zipped files with csv files in them.
I mounted the data using dbutils:
dbutils.fs.mount(
  source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
  mount_point = mountPoint,
  extra_configs = {f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": sasKey})
And then I followed the following tutorial:
https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html
However, the shell command in that notebook does not work, probably because the data resides in mounted blob storage rather than DBFS, and it gives the error:
unzip: cannot find or open `/mnt/azureblobstorage/file.zip, /mnt/azureblobstorage/Deed/file.zip.zip or /mnt/azureblobstorage/file.zip.ZIP.`
What would be the best way to read the zipped files and write into a delta table?
The "zip" utility in unix does work. I will walk through the commands so that you can code a dynamic notebook to extract zip files. The zip file is in ADLS Gen 2 and the extracted files are placed there too.
Because we are using a shell command, this runs on a single node, not across the worker nodes. Therefore there is no parallelization.
We can see that I have the storage mounted.
The S&P 500 data covers the top 505 stocks and their data for 2013. All these files are in a windows zip file.
Cell 2 defines widgets (parameters) and retrieves their values. This only needs to be done once. The calling program can pass the correct parameters to the notebook.
Cell 3 creates variables in the OS (shell) for both the file path and file name.
In cell 4, we use a shell call to the unzip program to overwrite the existing directory/files with the contents of the zip file. If there is no existing directory, we just get the uncompressed files.
Last but not least, the files do appear in the sub-directory as instructed.
To recap, it is possible to unzip files using Databricks (Spark) with both remote storage and already-mounted (local) storage. Use the above techniques to accomplish this task in a notebook that can be called repeatedly.
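If you'd rather avoid shell cells, the same unzip step can be done from Python with the standard zipfile module, which works against any mounted path. A minimal sketch (the /dbfs mount paths in the comment are placeholders for your own mount; like the shell approach, this still runs only on one node):

```python
import os
import zipfile

def extract_zips(src_dir, dst_dir):
    """Walk src_dir, extract every .zip file it contains into dst_dir,
    and return the list of extracted member names."""
    extracted = []
    os.makedirs(dst_dir, exist_ok=True)
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if name.lower().endswith(".zip"):
                with zipfile.ZipFile(os.path.join(root, name)) as zf:
                    zf.extractall(dst_dir)   # overwrites existing files, like unzip -o
                    extracted.extend(zf.namelist())
    return extracted

# On Databricks (hypothetical mount paths):
# extract_zips("/dbfs/mnt/azureblobstorage/raw", "/dbfs/mnt/azureblobstorage/unzipped")
# then read the extracted CSVs with spark.read.csv(...) and write to a delta table
```
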
If I remember correctly, gzip is natively supported by Spark. However, it might be slow or chew up a lot of memory when using dataframes. See this article on RDDs:
https://medium.com/@parasu/dealing-with-large-gzip-files-in-spark-3f2a999fc3fa
On the other hand, if you have a windows zip file, you can use a unix script to uncompress the files and then read them. Check out this article:
https://docs.databricks.com/external-data/zip-files.html

Azure Data Factory - Copy files using a CSV with filepaths

I am trying to create an ADF pipeline that does the following:
Takes in a csv with 2 columns, eg:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt
Essentially I want to copy/move the filepath from the source directory into the Destination directory (Both these directories are in Azure blob storage).
I think there is a way to do this using lookups, but lookups are limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?
Thanks in advance,
This is a complex scenario for Azure Data Factory. As you mentioned, there are more than 5000 file path records in your CSV file, which also means the same number of Source and Destination paths. If you build this architecture in ADF, it goes like this:
You would use the Lookup activity to read the Source and Destination paths. Even there you can't read all of the paths, due to the Lookup activity's limit.
Then you would iterate over the records using a ForEach activity.
You would also need to split each path to get the container, directory and file names separately, so those details can be passed to the Datasets created for the Source and Destination locations. Once you split the paths, you would use Set variable activities to store the Source and Destination container, directory and file names; these variables are then passed to the Datasets dynamically. This is the tricky part: if even a single record fails to split properly, your pipeline fails.
If the above steps complete successfully, you don't need to worry about the Copy activity: if all the parameters receive the expected values under the Source and Destination tabs, it will work properly.
My suggestion is to use a programmatic approach instead. Use Python, for example, to read the CSV file with the pandas module, iterate over each path, and copy the files. This works fine even if you have 5000+ records.
You can refer to this SO thread, which will help you implement the same programmatically.
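As a sketch of that programmatic approach, the loop below reads the two-column manifest from the question and copies each file. To keep it dependency-free it uses the stdlib csv module rather than pandas, and shutil.copy stands in for the actual Azure copy call (for example a blob-to-blob copy via the azure-storage-blob SDK), which is why the copy function is pluggable:

```python
import csv
import shutil

def copy_from_manifest(csv_path, copy_fn=shutil.copy):
    """Read a Source,Destination manifest and copy each listed file.
    copy_fn is pluggable so an Azure-SDK copy could be substituted."""
    copied = 0
    with open(csv_path, newline="") as f:
        # skipinitialspace tolerates "Source, Destination" style headers/values
        for row in csv.DictReader(f, skipinitialspace=True):
            copy_fn(row["Source"], row["Destination"])
            copied += 1
    return copied
```

There is no 5000-row ceiling here: the reader streams the manifest a row at a time.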
First, if you want to maintain a hierarchical pattern in your data, I recommend using ADLS (Azure Data Lake Storage); it guarantees a certain structure for your data.
Second, if you have a folder in Blob Storage and you would like to copy files to it, use the Copy activity. You should define 2 datasets, one for the source and one for the sink.
Check this link: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview

Splitting folders in a Google Bucket either from the Console or from a Notebook

Let's say there are 10 folders in my bucket. I want to split the contents of the folders in a ratio of 0.8, 0.1, 0.1 and move them to three new folders: Train, Test and Val. I have previously done this by downloading the folders, splitting them, and uploading them again. I now want to split the folders in the bucket itself.
I was able to connect to the bucket using the "google-cloud-storage" library from a Notebook using the post here, and I was able to download and upload files. I'm not sure how to achieve splitting the folders without downloading the content.
Appreciate the help.
PS: I don't need the full code, just how to approach will do
With Cloud Storage you can only READ and WRITE (CREATE/DELETE). You can't move a blob inside a bucket; even if a move operation exists in the console or in some client libraries, it is actually a WRITE/CREATE of the content under a new path followed by a DELETE of the previous path.
Thus, your strategy must follow the same logic:
Perform a gsutil ls to list all the files
Copy (or move) 80% into one directory, and 10% into each of the other two
Delete the old directory (unnecessary if you used the move operation).
It's quicker than downloading and uploading the files, but it still takes time. Because it's not a file system, only API calls, each file takes time, and if you have thousands of files it can take hours!
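A sketch of that strategy in Python: the split itself is pure logic and shown runnable below, while the commented lines show where the google-cloud-storage calls (bucket.copy_blob, blob.delete) would go; the bucket name and folder names are placeholders from the question:

```python
import random

def split_names(names, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the blob names and split them into three lists
    (e.g. Train/Test/Val) according to ratios."""
    names = list(names)
    random.Random(seed).shuffle(names)
    n = len(names)
    n_train = int(n * ratios[0])
    n_test = int(n * ratios[1])
    return (names[:n_train],
            names[n_train:n_train + n_test],
            names[n_train + n_test:])

# With google-cloud-storage (assumed usage, not run here):
# client = storage.Client()
# bucket = client.bucket("my-bucket")
# train, test, val = split_names(b.name for b in bucket.list_blobs(prefix="folder1/"))
# for blob_name in train:
#     blob = bucket.blob(blob_name)
#     bucket.copy_blob(blob, bucket, "Train/" + blob_name.split("/")[-1])
#     blob.delete()  # only if you want a move rather than a copy
```
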

Azure Data Flows

I have a requirement to regularly update an existing set of 30+ CSV files with new data (append to the end). There is also a requirement to possibly remove the first X rows as Y rows are added to the end.
Am I using the correct services for this and in the correct manner?
Azure Blob Storage to store the Existing and Update files.
Azure Data Factory with Data Flows. A pipeline and Data Flow per CSV I want to transform, which conducts a merge of datasets (existing + update), producing a sink fileset that drops the new combined CSV back into Blob storage.
A trigger on the Blob Storage Updates directory to trigger the pipeline when a new update file is uploaded.
Questions:
Is this the best approach for this problem? I need a solution with minimal input from users (I'll take care of the Azure ops, so long as all they have to do is upload a file and download the new one).
Do I need a pipeline and Data Flow per CSV file? Or could I have one per transformation type (i.e. one for just appending, another for appending and removing the first X rows)?
I was going to create a directory in blob storage for each of the CSVs (30+ dirs) and create a dataset for each directory's existing and update files.
Then create a dataset for each output file into some new/ directory.
Depending on the size of your CSVs, you can either perform the append right inside of the data flow by taking both the new data as well as the existing CSV file as a source and then Union the 2 files together to make a new file.
Or, with larger files, use the Copy Activity "merge files" setting to merge the 2 files together.
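The append-and-trim logic itself is simple enough to validate outside Data Factory before wiring it into a Data Flow. A minimal plain-Python sketch (the X/Y behaviour follows the question; this is not the ADF implementation itself):

```python
def merge_csv(existing_rows, update_rows, drop_first_x=0):
    """Append update_rows to the end of existing_rows, then drop the
    first drop_first_x data rows. Row 0 is assumed to be the header."""
    header, data = existing_rows[0], existing_rows[1:]
    data = data + update_rows          # append the Y new rows at the end
    data = data[drop_first_x:]         # remove the first X rows
    return [header] + data
```

In the Data Flow this corresponds to a Union of the two sources followed by a filter on a row-number column.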

Azure Data Factory deflate without creating a folder

I have a Data Factory v2 job which copies files from an SFTP server to an Azure Data Lake Gen2.
There is a mix of .csv files and .zip files (each containing only one csv file).
I have one dataset for copying the csv files and another for copying zip files (with Compression type set to ZipDeflate). The problem is that ZipDeflate creates a new folder that contains the csv file, and I need this to respect the folder hierarchy without creating any folders.
Is this possible in Azure Data Factory?
Good question, I ran into similar trouble* and it doesn't seem to be well documented.
If I remember correctly, Data Factory assumes a ZipDeflate file could contain more than one file, and it appears to create a folder no matter what.
If, on the other hand, you have Gzip files, which only ever contain a single file, then it will create just that file.
You'll probably already know this bit, but keeping it at the forefront of your mind helped me see why this is a sensible default for Data Factory:
My understanding is that the Zip standard is an archive format which happens to use the Deflate algorithm. Being an archive format, it naturally can contain multiple files.
Whereas gzip (for example) is just a compression algorithm; it doesn't support multiple files (unless they are tar-archived first), so it decompresses to a single file without a folder.
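That archive-vs-algorithm distinction can be seen directly with Python's standard zipfile and gzip modules: a zip has named members, while gzip is a bare compressed stream with no member list. A small in-memory sketch:

```python
import gzip
import io
import zipfile

# A zip archive can hold multiple named files...
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.csv", "1,2\n")
    zf.writestr("b.csv", "3,4\n")
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    members = zf.namelist()            # named members: a.csv, b.csv

# ...whereas gzip compresses exactly one stream, with no file names inside,
# so decompressing yields just the bytes back (no folder, no members).
blob = gzip.compress(b"1,2\n")
single = gzip.decompress(blob)
```
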
You could have an additional data factory step to take the hierarchy and copy it to a flat folder perhaps, but that leads to random file names (which you may or may not be happy with). For us it didn't work as our next step in the pipeline needed predictable filenames.
n.b. Data Factory does not move files, it copies them, so if they're very large this could be a pain. You can, however, trigger a metadata move operation via the Data Lake Store API or PowerShell etc.
*Mine was a slightly crazier situation in that I was receiving files named .gz from a source system which were in fact zip files in disguise! In the end the best option was to ask our source system to switch to true gzip files.
