Copy a set of files using ADF - azure

I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
1. Using a Lookup activity to retrieve the file names from a JSON file, then feeding them to a ForEach iterator.
2. Using a Get Metadata activity to get the file names from the source folder, then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!

There are 3 ways you can select your files, depending on your requirements or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for FilePath property in Source Dataset.
2. Wildcard characters in the source folder and file path in the source dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has wildcard or this escape char inside. See more examples in Folder and file filter examples.
3. List of Files
Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When using this option, do not specify file name in dataset. See more examples in File list examples.
Example:
Parameterize the source dataset and set the source file name to the one that passes the expression evaluation in the If Condition activity.
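For the List of Files option (option 3), the text file itself is easy to generate. The following is only a minimal Python sketch of that file's layout, one relative path per line. The file names and the name files_to_copy.txt are hypothetical, and uploading the file to the storage location that the copy activity points to is not shown.

```python
import tempfile
from pathlib import Path


def write_file_list(file_names, list_path):
    """Write one relative path per line, as the file-list option expects."""
    lines = [name.strip() for name in file_names if name.strip()]
    Path(list_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical names: the 4 files to copy out of the 10 in the source folder.
files_to_copy = ["file01.csv", "file04.csv", "file07.csv", "file09.csv"]

with tempfile.TemporaryDirectory() as tmp:
    list_file = Path(tmp) / "files_to_copy.txt"
    write_file_list(files_to_copy, list_file)
    written = list_file.read_text(encoding="utf-8")
    print(written)
```

Since the paths are relative to the folder configured in the dataset, the dataset itself should point at the folder only, with no file name.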

Related

Copy files in subdirs to azure storage with ADF

I have a folder structure like this:
folder1/folder2
    /YearNumber1
        /monthYear1
            /somefile.csv, tbFiles.csv
        /monthYear2
            /somefile2.csv, tbFiles2.csv
        ...(many folders as above)
    /YearNumber2
        /montYear11
            /somefileXXYYZz.csv, otherFile.csv
        /monthYear12
            /someFileRandom.csv, dedFile.csv
        ...(many folders as above)
Source:
Binary, linked via fileshare linked service
Destination:
Binary, on azure blob storage
I don't want to retain the structure, just need to copy all csv files.
Using CopyActivity:
Wildcard Path: @concat('folder1/folder2/','*/','*/',) / '*.csv'
with recursive
But it copies nothing, 0 Bytes.
You can use the following options in the Copy activity source settings:
1. File path type
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or single character); use ^ to escape if your actual folder name has wildcard or this escape char inside.
See official MS docs for more examples in Folder and file filter examples.
wildcardFolderPath - The folder path with wildcard characters under your file system configured in dataset to filter source folders.
wildcardFileName - The file name with wildcard characters under your file system + folderPath/wildcardFolderPath to filter source files.
2. recursive - When set to true, the data is read recursively from the subfolders.
Example:
If there are only .csv files in your source directories, you can simply specify wildcardFileName as just *.
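To check which files a two-level folder wildcard plus a *.csv file wildcard is meant to select, a local stand-in can help. The sketch below is plain Python on a local file system, not ADF, and its matching rules are only assumed to be close to ADF's. The folder and file names are hypothetical, modelled on the layout in the question.

```python
import tempfile
from pathlib import Path


def select_csv_files(root):
    """Return CSV files exactly two folder levels below root, sorted by path."""
    return sorted(p for p in Path(root).glob("*/*/*.csv") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "folder1" / "folder2"
    # Hypothetical layout mirroring the question's Year/monthYear folders.
    for rel in ["2019/jan2019/somefile.csv", "2019/feb2019/tbFiles2.csv",
                "2020/jan2020/otherFile.csv", "2020/notes.txt"]:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("x", encoding="utf-8")
    # Flattened view: keep only file names, dropping the folder structure.
    flattened = [p.name for p in select_csv_files(root)]
    print(flattened)
```

Here the folder wildcard and the file wildcard are kept as separate path segments, which corresponds to filling wildcardFolderPath and wildcardFileName separately rather than concatenating them into one expression.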

ADF Azure Data-Factory loop over folder syntax - wildcard?

I'm trying to loop over different country folders that each contain a fixed subfolder named survey (i.e. Spain/survey, USA/survey).
Where and how do I need to define a wildcard or parameter for the countries so that I can loop over all the files in the survey folders?
What is the right wildcard syntax (the equivalent of LIKE 'survey%' in SQL)?
I tried several ways to define it with no success, and I would be happy to get some help on this. Thanks!
If the list of paths is static, you can store it in a pipeline parameter or in a SQL database table and retrieve it with a Lookup activity.
Pass the output to a ForEach activity, and use a Copy activity inside the ForEach.
You can parameterize the input dataset to receive the file paths, so you need not use any wildcard characters and can use the actual paths themselves.
Hope this is helpful.
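The idea of iterating over explicit paths instead of a wildcard can be sketched in a few lines of Python. This only illustrates the list that the ForEach would receive. The country list is hypothetical and would come from the parameter or the Lookup output in a real pipeline.

```python
def survey_folders(countries, subfolder="survey"):
    """Build one explicit folder path per country, e.g. 'Spain/survey'."""
    return [f"{country}/{subfolder}" for country in countries]


# Hypothetical list; in the pipeline this would come from a parameter or a Lookup.
countries = ["Spain", "USA"]
paths = survey_folders(countries)
for path in paths:
    # Each path would be passed to the parameterized input dataset.
    print(path)
```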

How to read multiple CSV (leaving out specific ones) from a nested directory in PySpark?

Let's say I have a directory called 'all_data', and inside this, I have several other directories based on the date of the data that they contain. These directories are named date_2020_11_01 to date_2020_11_30, and each one of these contains csv files which I intend to read into a single dataframe.
But I don't want to read the data for date_2020_11_15 and date_2020_11_16. How do I do it?
I'm not sure how to exclude certain files, but you can specify a range of names using brackets. The code below would select the csv files in all date directories except 11_15 and 11_16:
spark.read.csv("all_data/date_2020_11_{1[0-47-9],[023][0-9]}/*.csv")
df = spark.read.format("parquet").option("header", "true").load(paths)
where paths is a list of all the paths where the data is present. This worked for me.
A simple method is to read the whole data directory as it is and apply a filter condition:
df.filter("dataColumn NOT IN ('date_2020_11_15', 'date_2020_11_16')")
Alternatively, you can use the os module to read the directory, iterate over that list, and eliminate those date directories with a condition.
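The os module approach could look roughly like the sketch below. It builds the list of directories to read and leaves out the two unwanted dates. The temporary directory stands in for all_data, and the final Spark call is shown only as a comment because it assumes an existing SparkSession named spark.

```python
import os
import tempfile


def dirs_to_read(base_dir, excluded):
    """Return sorted paths of date_* subdirectories not named in excluded."""
    return sorted(
        os.path.join(base_dir, name)
        for name in os.listdir(base_dir)
        if name.startswith("date_")
        and name not in excluded
        and os.path.isdir(os.path.join(base_dir, name))
    )


excluded = {"date_2020_11_15", "date_2020_11_16"}

with tempfile.TemporaryDirectory() as all_data:
    # Stand-in for the real all_data directory, with four sample dates.
    for day in range(14, 18):
        os.mkdir(os.path.join(all_data, f"date_2020_11_{day:02d}"))
    paths = dirs_to_read(all_data, excluded)
    kept = [os.path.basename(p) for p in paths]
    print(kept)
    # With a SparkSession named `spark` (assumed, not created here):
    # df = spark.read.option("header", "true").csv(paths)
```

Note that os.listdir only works for a local or mounted file system. For data on HDFS or cloud storage, the listing step would need the corresponding client instead.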

Unable to copy file from SFTP in Azure Data Factory when using wildcard(*) in the filename

I am unable to copy csv files from an SFTP connection to blob storage when using the wildcard(*) in the filename.
More specifically, I receive csv files in the SFTP on a daily basis, and they are of the format: "ddMMyyyyxxxxxx.csv", where "xxxxxx" is the timestamp. More concretely, my csv file for the 13th of March is: "13032019083647.csv", while for the 14th of March: "14032019083556.csv". Obviously, the timestamp is different for every day, thus I want to copy the file independently of whatever string exists between the date and the file extension.
In the "File" subfield of the "File path" of the "Connection" tab of my subset, I give as input: "13032019*.csv", as instructed by the help icon next to the field:
When I do so, my Debug run fails with:
{"errorCode": "2200", "message": "ErrorCode=UserErrorInvalidCopyBehaviorBlobNameNotAllowedWithPreserveOrFlattenHierarchy,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Cannot adopt copy behavior PreserveHierarchy when copying from folder to a single file.,Source=Microsoft.DataTransfer.ClientLibrary}
I receive a similar error no matter which type of copy behaviour I choose. I have also tried experimenting with the fileFilter parameter (even though ADF warns that the same behaviour can be achieved with the fileName option), but I still end up getting the same error.
For further clarification, I am attaching the Code segment that ADF produces for this configuration:
I should also mention, that when using the full fileName in the corresponding field, namely the value: "13032019083647.csv", copying works normally.
Any help would be greatly appreciated!
My guess is that the wildcard might match two files.
In such cases, we need to use a Get Metadata activity, a Filter activity, and a ForEach activity to copy these files.
1. Get Metadata activity: Use a dataset in this activity to point to the location of the files, and pass the child items as the parameter.
2. Filter activity: Use a filter to select the files based on your needs.
3. ForEach activity: Get the items from the previous activity and add a Copy activity inside the ForEach.
In the Copy activity, the source dataset file name should be @item().name.
I hope this will solve your issue.
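The logic of the three activities above can be mimicked in plain Python to check the filter condition before building the pipeline. This is only an illustration, not ADF code. The child item list is a hypothetical stand-in for the Get Metadata output, assumed to be a list of entries with a name and a type.

```python
def filter_child_items(child_items, prefix, suffix=".csv"):
    """Keep file entries whose name starts with prefix and ends with suffix."""
    return [
        item for item in child_items
        if item["type"] == "File"
        and item["name"].startswith(prefix)
        and item["name"].endswith(suffix)
    ]


# Hypothetical Get Metadata output: a list of {name, type} entries.
child_items = [
    {"name": "13032019083647.csv", "type": "File"},
    {"name": "14032019083556.csv", "type": "File"},
    {"name": "archive", "type": "Folder"},
]

matched = filter_child_items(child_items, prefix="13032019")
for item in matched:
    # In the pipeline, this name would be passed to the copy source dataset.
    print(item["name"])
```

Because each iteration copies exactly one named file, the copy runs file-to-file, which is the case that worked in the question when the full file name was given.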
What worked for me was the following: I kept the same regex for the input file, but I defined as "Copy behaviour: Merge Files". Since as mentioned, there is only 1 file that satisfies the regex condition, only 1 file was created as output. I am aware that this is a sort of "dirty" solution, but it did the trick for me.

pentaho create archive folder with MM-YYYY

I would like to archive every file in a folder by putting it in another archive folder with a name like this: "Archive/myfolder-06-2014"
My problem is how to retrieve the current month and year and then how to create a folder (if it does not already exist) with these data.
This solution may be a little awkward (due to the required fuss) but it seems to work. The idea is to precompute the target filename in a separate transformation and store it as a system variable (TARGET_ZIP_FILENAME):
The following diagrams show the settings of selected components.
Get the current time...
Provide the pattern of the target filename as a string constant...
Extract the month and year as formatted integers...
Replace the month in the pattern (the year will work equivalently)
Set the resulting filename as a system variable
The main job will call the transformation and use the system variable as the zip target filename.
Also, you have to make sure that the setting Create Parent folder is active.
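Outside of Pentaho, the naming logic itself is small. The Python sketch below is not part of the Pentaho solution. It only illustrates how the MM-YYYY suffix is derived from a date and how the folder is created when it does not exist yet. The function names are hypothetical.

```python
import os
import tempfile
from datetime import date


def archive_folder(base, name, today):
    """Return e.g. 'Archive/myfolder-06-2014' for a date in June 2014."""
    return f"{base}/{name}-{today.strftime('%m-%Y')}"


def ensure_folder(path):
    """Create the folder (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


example = archive_folder("Archive", "myfolder", date(2014, 6, 15))
print(example)

with tempfile.TemporaryDirectory() as tmp:
    # Use the current month and year, as the archive job would.
    target = os.path.join(tmp, archive_folder("Archive", "myfolder", date.today()))
    created = ensure_folder(target)
    created_ok = os.path.isdir(created)
```

Creating missing parents here plays the same role as the Create Parent folder setting in the job.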
