How to read files from a subfolder under a nested parent folder using Azure Data Factory?

I have the following folder structure:
Container/ABC/Transcation/07654/Audit/Report.csv
Container/CDF/Transcation/07654/Audit/Tranfee/report0910201.csv
Container/FGS/Transcation/07654/Audit/custom/report08092021.csv
I want to retrieve all the files (including files under subfolders) under the Audit folder.

While creating the dataset, specify the folder path.
In the copy activity source settings, enable the Recursively option so that files in subfolders are read as well.
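For reference, a minimal sketch of what the copy activity source could look like in the pipeline JSON, assuming a delimited-text dataset pointed at the container root; the wildcard pattern is only illustrative and would need to match your actual folder layout:

{
    "source": {
        "type": "DelimitedTextSource",
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "wildcardFolderPath": "*/Transcation/*/Audit",
            "wildcardFileName": "*.csv"
        }
    }
}

With recursive set to true, files in subfolders of Audit (such as Tranfee and custom) are also picked up.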

Related

Extract Only Specific Files with Azure Data Factory

I have to extract only "daily" files from a folder on my C: drive into Azure Data Factory, but there are "weekly" files in the same folder that I don't want to extract. Also, I can't separate the two kinds of files into different on-prem folders. I have to do this for a client, but first I'm practicing on my own computer. Here is the on-prem folder that I'm referring to. So the ultimate goal is to transfer only the "daily" files out of the folder and into Azure Data Factory.
As suggested by Scott Mildenberger in the comments, since your files have a similar naming convention, you can use a wildcard file path to filter files by name.
Sample data
Dataset settings
Source Setting
In File path type, select Wildcard file path and give daily* as the wildcard file name. It will filter all files in the folder whose names start with daily.
Output
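To make the setting concrete, here is a minimal sketch of how the copy activity source could look in the pipeline JSON, assuming the on-prem folder is reached through a self-hosted integration runtime and a File System linked service, the folder itself is set on the dataset, and the files are delimited text:

{
    "source": {
        "type": "DelimitedTextSource",
        "storeSettings": {
            "type": "FileServerReadSettings",
            "wildcardFileName": "daily*"
        }
    }
}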

Azure Data Factory Copy Behaviour "Preserve Hierarchy" not working

I am trying to copy data from one container in Azure Data Lake Gen2 into another container in the same storage account. I want to preserve the same hierarchy of folders and subfolders, but whatever I try, it only copies the JSON files and not the folders.
As of now, I have only the target container set in the target dataset. Should I add something more (such as directory and file)?
I have tested this and it works; please follow these steps:
1. My container's structure:

examplecontainer
    +test
        +re
            json files
        +pd
            json files

2. Setting of Source in Copy activity:
3. Setting of Sink in Copy activity:
4. Result:
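As a rough sketch of the copy activity source and sink in pipeline JSON, assuming both datasets point at ADLS Gen2 containers with no directory or file name set on them; the key parts are recursive on the source and copyBehavior on the sink:

{
    "source": {
        "type": "JsonSource",
        "storeSettings": {
            "type": "AzureBlobFSReadSettings",
            "recursive": true
        }
    },
    "sink": {
        "type": "JsonSink",
        "storeSettings": {
            "type": "AzureBlobFSWriteSettings",
            "copyBehavior": "PreserveHierarchy"
        }
    }
}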

How to copy all files and folders in specific directory using azure data factory

I have one folder in ADLS Gen2, say mysource1, which has hundreds of subfolders, and each subfolder again contains folders and many files.
How can I copy all of the folders and files under mysource1 using Azure Data Factory?
You could use Binary as the source format. It will let you copy all the folders and files from source to sink.
For example: this is my container test:
Source dataset:
Sink dataset:
Copy activity:
Output:
You can follow my steps.
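As a rough JSON sketch (the linked service, container, and dataset names are placeholders), the source Binary dataset could point at the mysource1 folder:

{
    "name": "SourceBinary",
    "properties": {
        "type": "Binary",
        "linkedServiceName": { "referenceName": "AdlsGen2Ls", "type": "LinkedServiceReference" },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "test",
                "folderPath": "mysource1"
            }
        }
    }
}

and the copy activity would then read recursively and preserve the hierarchy on the sink:

{
    "source": {
        "type": "BinarySource",
        "storeSettings": { "type": "AzureBlobFSReadSettings", "recursive": true }
    },
    "sink": {
        "type": "BinarySink",
        "storeSettings": { "type": "AzureBlobFSWriteSettings", "copyBehavior": "PreserveHierarchy" }
    }
}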
Alternatively, use the Ingest tile on the ADF home page (the Copy Data tool); there you can specify the source location using a linked service as well as the target location.

Append files to existing S3 bucket folder via Spark

I am working in Spark, where we need to write data to an S3 bucket after performing some transformations. I know that writing data to HDFS/S3 via Spark throws an exception if the folder path already exists. So in our case, if s3://bucket_name/folder already exists while writing data to that same S3 path, it will throw an exception.
One possible solution is to use the OVERWRITE mode while writing through Spark, but that would delete all the files already present in the folder. I want a kind of APPEND functionality for the same folder: if the folder already has some files, it should just add more files to it.
I am not sure whether the API gives any such functionality out of the box. Of course, I could create a temporary folder inside the folder, save the file there, then move the file to its parent folder and delete the temporary folder. But that kind of approach is not ideal.
So please suggest how to proceed with this.
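As a minimal PySpark sketch of the APPEND behavior described above (assuming the S3A connector and credentials are configured, and using the bucket path from the question): the DataFrameWriter "append" save mode adds new part files under an existing prefix instead of failing (the default "errorifexists") or deleting existing files ("overwrite").

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-to-s3").getOrCreate()

# Example DataFrame standing in for the transformed data.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# "append" keeps whatever is already under the prefix and only adds new files.
df.write.mode("append").parquet("s3a://bucket_name/folder")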

How can I have Azure File Share automatically generate non-existing directories?

With AWS S3, I can upload a file test.png to any directory I like, regardless of whether or not it exists, because S3 will automatically generate the full path and directories.
For example, if, when I upload to S3, I use the path this/is/a/new/home/for/test.png, S3 will create directories this, is, a, ... and upload test.png to the correct folder.
I am migrating over to Azure, and I am looking to use their file storage. However, it seems that I must manually create EVERY directory. I could obviously do it programmatically by checking whether the folder exists and creating it if not, but why should I have to work so hard?
I did try:
file_service.create_file_from_path('testshare', 'some/long/path', 'test.png', 'path/to/local/location/of/test.png')
However, that complains that the directory does not exist, and it only works if I either manually create the directories first or replace some/long/path with None.
Is it possible to just hand Azure a path and have it create the directories?
Azure Files closely mimics an OS file system, so in order to put a file in a directory, that directory must exist. That means if you need to create a file in a nested directory structure, the whole directory structure must exist first; the Azure File service will not create it for you.
A better option in your scenario would be Azure Blob Storage. It closely mimics the Amazon S3 behavior you mentioned above. You can create a container (similar to a bucket in S3) and then upload a file with a name like this/is/a/new/home/for/test.png.
However, please note that folders in Blob Storage are virtual (just as in S3), not real ones. Essentially, the name under which the blob (similar to an object in S3) is saved is this/is/a/new/home/for/test.png.
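As a small sketch of the Blob Storage approach, using the current azure-storage-blob package rather than the older file_service API from the question (the connection string and container name are placeholders); the nested path is simply part of the blob name, so nothing has to be created first:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# The "directories" exist only as part of the blob name; no create-directory calls needed.
blob = service.get_blob_client(
    container="mycontainer",
    blob="this/is/a/new/home/for/test.png",
)

with open("path/to/local/location/of/test.png", "rb") as data:
    blob.upload_blob(data, overwrite=True)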
