I am trying to create an ADF pipeline that does the following:
Takes in a csv with 2 columns, eg:
Source, Destination
test_container/test.txt, test_container/test_subfolder/test.txt
Essentially I want to copy/move the filepath from the source directory into the Destination directory (Both these directories are in Azure blob storage).
I think there is a way to do this using lookups, but lookups are limited to 5000 rows and my CSV will be larger than that. Any suggestions on how this can be accomplished?
Thanks in advance,
This is a complex scenario for Azure Data Factory. Also as you mentioned there are more than 5000 file paths records in your CSV files, it also means same number of Source and Destination paths. So now if you create this architecture in ADF, it will goes like this:
You will use the Lookup activity to read the Source and Destination paths. In that also you can't read all the paths due to Lookup activity limitation.
Later you will iterate over the records using ForEach activity.
Now you also need to split the path so that you will get container, directory and file names separately to pass the details to Datasets created for Source and Destination location. Once you split the paths, you need to use the Set variable activity to store the Source and Destination container, directory and file names. These variables will be then passed to Datasets dynamically. This is a tricky part as even if a single record is unable to split properly then your pipeline would fail.
If above step completed successfully, then you not need to worry about copy activity. If all the parameters got the expected values under Source and Destination tabs in copy activity it will work properly.
My suggestion is to use programmatical approach for this. Use python, for example, to read the CSV file using pandas module and iterate over each path and copy the files. This will work fine even if you have 5000+ records.
You can refer this SO thread which will help you to implement the same programmatically.
First, if you want to maintain a hierarchical pattern in your data, i recommend using ADLS (Azure Data Lake Storage) this will guarantee a certain structure for your data.
second, if you have a Folder in Blob Storage and you would like to copy files to it, use Copy Activity, you should define 2 datasets, one for the source and one for the sink.
check this link : https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
Related
I need to process multiple files at once in a mapping data flow, but need to maintain the source folder structure when syncing.
In other words, the structure
year=yyyy/month=mm/day=dd/files
must be maintained on the sink side as well.
Is there a way to configure the sink settings to accomplish this?
Please check the attached folder structure of the source data and the design of the mapping data flow.
Any help would be appreciated.
Thank you.
Yes, you can make the Sink path dynamic in the Dataset itself and then use that Dataset in the Sink in Data flow Mapping.
Under Connection tab in Dataset, for File path option use the below dynamic expression:
#concat(formatDateTime(utcNow(),'yyyy'),'/',formatDateTime(utcNow(),'MM'),'/',formatDateTime(utcNow(),'dd'),'/')
Use the same Dataset in Data flow sink and run the pipeline.
Path will be created with the current YYYY/MM/dd format and file will be dumped there. Refer below image.
To be clear about the format, this is how the DataFrame is saved in Databricks:
folderpath = "abfss://container#storage.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderPath)
This produces a set of Parquet files (often in 2-4 chunks) that are in the main folder, with a _delta_log folder that contains the files describing the data upload.
The delta log folder dictates which set of Parquet files in the folder should be read.
In Databricks, i would read the latest dataset for exmaple, by doing the following:
df = spark.read.format("delta").load(folderpath)
How would i do this in Azure Data Factory?
I have chosen Azure Data Lake Gen 2, then the Parquet format, however this doesn't seem to work, as i get the entire set of parquets read (i.e. all data sets) and not just the latest.
How can i set this up properly?
With Data Factory pipeline, it seems to be hard to achieve that. But I have some ideas for you:
Use lookup active to get the content of delta_log file. If there many files, use get metadata to get the all the files schema(last modified date).
Use an if condition active or swich active to filter the latest data.
After the data filtered, pass the lookup output to set the copy active source(set as parameter).
The hardest thing is that you need figure out how to filter the latest dataset with delta_log. You could try this way, the whole work flow should like this but I can't tell you if it really works. I couldn't test that for you without same environment.
HTP.
I have a requirement to regularly update an existing set of 30+ CSV files with new data (append to the end). There is also a requirement to possibly remove the first X rows as Y rows are added to the end.
Am I using the correct services for this and in the correct manner?
Azure Blob Storage to store the Existing and Update files.
Azure DataFactory with DataFlows. A PipeLine and DataFlow per CSV I want to transform that conducts a merge of datasets (existing + update), producing a
sink fileset that drops the new combined CSV back into Blob
storage.
A trigger on the Blob Storage Updates directory to trigger the pipeline when a new update file is uploaded.
Questions:
Is this the best approach for this problem, I need a solution with minimal input from users (I'll take care of Azure ops so long as all they have to do is upload a file and download the new one)
Do I need a pipeline and dataflow per CSV file? Or could I have one per transformation type (ie one for just appending, another for appending and removing first X rows)
I was going to create a directory in blob storage for each of the CSVs (30+ Dirs) and create a dataset for each directories existing and update files.
Then create a dataset for each output file into some new/ directory
Depending on the size of your CSVs, you can either perform the append right inside of the data flow by taking both the new data as well as the existing CSV file as a source and then Union the 2 files together to make a new file.
Or, with larger files, use the Copy Activity "merge files" setting to merge the 2 files together.
I am trying to copy one file from one container to another another in a storage account. the scenario i implemented works fine for a single file. but for multiple files, it is copying both of them in one copy activity. i want the file to be moved one at a time and after a single copy to provide a delay of 1 min, then proceed with the next file copy.
i created a pipeline with the move File template but it did not work for multiple files.
i have taken the source and sink dataset as csv datasets and not binary. i will not be aware of the pattern or the names of the files.
when a user input say about 10 files, i want to copy it one at a time and also provide a delay between each copy. this has to happen between 2 storage account containers.
i have tried to use move files template too. but it did not work for multiple. Please help me.
Sanaa, to force a sequential processing, check the "Sequential" checkbox:
Time delay could be achieved by adding "Wait" action:
I am in the process of copying a large set of data to an Azure Blob Storage area. My source set of data has a large number of files that I do not want to move, so my first thought was to create a DataSet.csv file of just the files I do want to copy. As a test, I created a csv file where each row is a single file that I want to include.
BasePath,DstBlobPathOrPrefix,BlobType,Disposition,MetadataFile,PropertiesFile
"\\SERVER\Share\Folder1\Item1\Page1\full.jpg","containername/Src/Folder1/Item1/Page1/full.jpg",BlockBlob,overwrite,"None",None
"\\SERVER\Share\Folder1\Item1\Page1\thumb.jpg","containername/Src/Folder1/Item1/Page1/thumb.jpg",BlockBlob,overwrite,"None",None
etc.
When I run the Import/Export tool (WAImportExport.exe) it seems to create a single folder on the destination for each file, so that it ends up looking like:
session#1
-session#1-0
-session#1-1
-session#1-2
etc.
All files share the same base, but do output their filename in the CSV. Is there any way to avoid this, so that all the files go into a single "session#1" folder? If possible, I'd like to avoid creating N-thousand folders on the destination drive.
I don't think you should worry about the way the files are stored on the disk, as they will be converted back to the directory structure you specified in the .csv file.
Here's what the documentation says:
How does the WAImportExport tool work on multiple source dir and disks?
If the data size is greater than the disk size, the WAImportExport
tool will distribute the data across the disks in an optimized way.
The data copy to multiple disks can be done in parallel or
sequentially. There is no limit on the number of disks the data can be
written to simultaneously. The tool will distribute data based on disk
size and folder size. It will select the disk that is most optimized
for the object-size. The data when uploaded to the storage account
will be converged back to the specified directory structure.