Copy files in subdirs to Azure Storage with ADF

I have a folder structure like this:
folder1/folder2/
    YearNumber1/
        monthYear1/
            somefile.csv, tbFiles.csv
        monthYear2/
            somefile2.csv, tbFiles2.csv
        ...(many folders as above)
    YearNumber2/
        monthYear11/
            somefileXXYYZz.csv, otherFile.csv
        monthYear12/
            someFileRandom.csv, dedFile.csv
        ...(many folders as above)
Source:
Binary dataset, linked via a file share linked service
Destination:
Binary dataset, on Azure Blob Storage
I don't want to retain the structure, I just need to copy all the csv files.
Using CopyActivity:
Wildcard Path: #concat('folder1/folder2/','*/','*/',) / '*.csv'
with recursive enabled
But it copies nothing, 0 bytes.

You can use the below options in the CopyActivity source settings:
1. File path type
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual folder name has a wildcard or this escape character inside.
See the official MS docs for more examples in Folder and file filter examples.
wildcardFolderPath - the folder path with wildcard characters under the file system configured in the dataset, used to filter source folders.
wildcardFileName - the file name with wildcard characters under the configured file system + folderPath/wildcardFolderPath, used to filter source files.
2. recursive - when set to true, the data is read recursively from the subfolders.
Example:
If there are only .csv files in your source directories, you can simply specify wildcardFileName as just *.
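For the folder structure in the question, a minimal sketch of the source settings (assuming the wildcard folder path is resolved from the file share root, as described above) would be:
wildcardFolderPath: folder1/folder2/*/*
wildcardFileName: *.csv
recursive: true
If the folder hierarchy should also be dropped on the destination side, setting the sink copy behaviour to "Flatten hierarchy" is the usual way to land every file directly in the target folder.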

Related

Graph API DriveItem: How can I only get query results from the root directory (PREVENT recursive searching)?

https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/search(q='Matrix')
The above correctly returns all drive files with the word "Matrix" in them within the Shared%20Documents directory for the site's provided site ID (00000000-1111-2222-3333-444444444444).
However, it's recursive: it returns files with the word "Matrix" in them within subfolders too. I only want to query files in the root directory.
How do I search for file names, only within the root directory? I tried changing /drive to /drive/root like below, but it did not make a difference:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/root/search(q='Matrix')
ChatGPT recommended adding the filter $filter=parentReference/path eq '/drive/root':
https://graph.microsoft.com/v1.0/sites/MySite.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/search(q='Matrix')?$filter=parentReference/path eq '/drive/root'
...but I got the error "Only createdDateTime,remoteItem.shared.sharedBy.group.id,remoteItem.shared.sharedBy.user.id is supported for filtering", which ChatGPT didn't know how to get past.
I solved this by obtaining the folder ID of the root folder and using the /drive/items/{folderId}/children?$filter URI instead of /drive/search. I obtained the folder ID of the root folder by copying the id within the parentReference of an item that lies within my root directory from the output of my first command.
Then I queried the files in the root directory with the following format:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,{siteId}/drive/items/{folderId}/children?$filter=startswith(name,'MyWord')
So in my case, the URI ended up looking like below:
https://graph.microsoft.com/v1.0/sites/MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444/drive/items/01NCSFADN6Y2GOVW7725BZO354PWSELRRZ/children?$filter=startswith(name,'Install')
Unfortunately, I couldn't use the contains function (which functions similarly to /search) and had to use startswith because contains isn't supported on $filter for text fields.
Finally, you can optionally tack on the end whichever field(s) you're interested in retrieving with the select parameter:
&$select=name,@microsoft.graph.downloadUrl
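A minimal sketch of the same query from Python (assuming you already hold a valid Graph access token, e.g. obtained via MSAL, and using the requests library; the site ID, folder ID and name prefix are the placeholder values from above):

import requests

SITE_ID = "MyDomain.sharepoint.com,00000000-1111-2222-3333-444444444444"  # placeholder
FOLDER_ID = "01NCSFADN6Y2GOVW7725BZO354PWSELRRZ"  # root folder id taken from parentReference
TOKEN = "<access-token>"  # assumed to be acquired elsewhere

url = f"https://graph.microsoft.com/v1.0/sites/{SITE_ID}/drive/items/{FOLDER_ID}/children"
params = {
    "$filter": "startswith(name,'Install')",
    "$select": "name,@microsoft.graph.downloadUrl",
}
resp = requests.get(url, params=params, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
for item in resp.json().get("value", []):
    print(item["name"], item.get("@microsoft.graph.downloadUrl"))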

python 3.x: how to combine dictionary-style zipped log files that exist in two different directories into another directory

I have the below structure of directories containing .log.gz files. The log files hold string dictionaries (JSON). Either of these directories may not exist.
xyz1/
    2022-08-08T01:31Z.log.gz
    2022-08-08T01:33Z.log.gz
xyz2/
    2022-08-08T01:30Z.log.gz
    2022-08-08T01:33Z.log.gz
I want to create another directory and combine the above files into it:
xyz/
    2022-08-08T01:30Z.log.gz
    2022-08-08T01:31Z.log.gz
    2022-08-08T01:33Z.log.gz
Conditions:
Either of xyz1 and xyz2 may exist, or both can exist.
If a file with the same name exists in both directories, combine them into the third directory "xyz".
While combining, the string dictionary format should be retained.
Solution I opted for:
Check whether one directory exists and iterate over its files.
For each file, check if the same file exists in the other directory.
If yes, decompress both files, combine them, and gzip the result into the xyz directory.
If not, copy it into xyz.
Is there any better way to perform the above operation? Below is how I combine two log files.
import json
# after decompressing the two .log.gz files, merge the JSON dictionaries
combinefile = {}
combinefile.update(json.loads(open("xyz1/file1.log").read()))
combinefile.update(json.loads(open("xyz2/file1.log").read()))
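A sketch of the full flow described above (decompress, merge, re-compress), assuming each .log.gz file holds a single JSON object and using only the standard library; directory names follow the example above:

import gzip
import json
import shutil
from pathlib import Path

src_dirs = [Path("xyz1"), Path("xyz2")]   # either directory may be missing
dst_dir = Path("xyz")
dst_dir.mkdir(exist_ok=True)

# collect every distinct file name across the directories that exist
names = set()
for d in src_dirs:
    if d.is_dir():
        names.update(p.name for p in d.glob("*.log.gz"))

for name in names:
    sources = [d / name for d in src_dirs if (d / name).exists()]
    if len(sources) == 1:
        # present in only one directory: copy it unchanged
        shutil.copy2(sources[0], dst_dir / name)
        continue
    # present in both: decompress, merge the dictionaries, re-compress
    merged = {}
    for src in sources:
        with gzip.open(src, "rt", encoding="utf-8") as f:
            merged.update(json.loads(f.read()))
    with gzip.open(dst_dir / name, "wt", encoding="utf-8") as f:
        f.write(json.dumps(merged))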

Copy a set of files using ADF

I have 10 files in a folder and want to move 4 of them to a different location.
I tried 2 approaches to achieve this:
using a Lookup to retrieve the filenames from a JSON file, then feeding it to a ForEach iterator
using Get Metadata to get file names from the source folder and then adding an If Condition inside a ForEach to copy the files.
But in both cases, all the files in the source folder get copied.
Any help would be appreciated.
Thanks!!
There are 3 ways you can consider for selecting your files, depending on the requirement or blockers.
Check out the official MS doc: Copy activity properties
1. Dynamic content for the FilePath property in the source dataset.
2. Wildcard characters in the source folder and file path in the source dataset.
Allowed wildcards are: * (matches zero or more characters) and ? (matches zero or one character); use ^ to escape if your actual folder name has a wildcard or this escape character inside. See more examples in Folder and file filter examples.
3. List of files
Point to a text file that includes a list of files you want to copy, one file per line, which is the relative path to the path configured in the dataset. When using this option, do not specify a file name in the dataset. See more examples in File list examples.
Example:
Parameterize the source dataset and set the source file name to the value that passes the expression evaluation in the If Condition activity.
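For the "List of files" option, a hypothetical file list (the names below are placeholders for the 4 files to move) is just a text file stored alongside the data, with one relative path per line:
file1.csv
file4.csv
file7.csv
file9.csv
The copy source's list-of-files setting then points at this text file, and only the listed files are copied.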

Can we exclude or include only particular file extensions from Databricks Autoloader?

Right now the Databricks Autoloader requires a directory path from which all the files will be loaded. But in case some other kind of log files also starts coming into that directory - is there a way to ask Autoloader to exclude those files while preparing the dataframe?
df = spark.readStream.format("cloudFiles") \
.option(<cloudFiles-option>, <option-value>) \
.schema(<schema>) \
.load(<input-path>)
Autoloader supports specification of a glob string as <input-path> - from the documentation:
<input-path> can contain file glob patterns
Glob syntax supports different options, like * for any sequence of characters, etc. So you can specify the input path as path/*.json, for example. You can exclude files as well, although building that pattern can be slightly more complicated than an inclusion pattern; it's still possible - for example, *.[^l][^o][^g] should exclude files with the .log extension.
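For instance, a sketch of a readStream call that excludes .log files via the glob in the input path (the path is a placeholder and schema is assumed to be defined elsewhere):
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .schema(schema)                                 # schema assumed to be defined elsewhere
      .load("/mnt/landing/incoming/*.[^l][^o][^g]"))  # glob skips files ending in .log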
Use pathGlobFilter as one of the options and provide a glob pattern to filter a file type or files with a specific name.
For instance, to load only files named A1.csv, A2.csv ... A9.csv from the load location (and skip everything else), the value for pathGlobFilter will look like:
df = spark.read.load("/file/load/location",
     format="csv",
     schema=schema,
     pathGlobFilter="A[0-9].csv")
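Since the question is about Auto Loader, the same option can also be set on the cloudFiles stream - a sketch under the assumption that pathGlobFilter is passed through like the other file source options (paths are placeholders):
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("pathGlobFilter", "*.csv")   # keep only .csv files, ignore e.g. *.log
      .schema(schema)                      # schema assumed to be defined elsewhere
      .load("/file/load/location"))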

Get blob content from file using wildcard

Example:
Blob: container/folder/myfile123.txt [where 123 is dynamic]
I am trying to get the content of an Azure blob file by using a wildcard for the middle part of the name, since it can vary, while the leading part (like myfile) and the extension (like .txt) are always the same. I've tested things like myfile*.txt or myfile?.txt but had no success when specifying the path.
When getting a file with the Get blob content action in Logic Apps, how can I get a file by its leading name and ending extension but any possible combination in between?
You must use the exact name of the file.
What you can do is to get a list of all blobs in the container. Then loop over that list getting each individual file.
You can use the "List blobs" connector and then the "Filter array" collections connector to get the wildcard functionality via the "contains"-operator. Then just use the "Get blob using path" and type in the expression: body('Filter_array')[0]['name']
Or in code view:
"path": "/my_catalogue/#{body('Filter_array')[0]['name']}"
to get the first filename that matches your wildcard.
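In code view, the corresponding Filter array action might look roughly like the following (a sketch; the action names and the 'myfile' prefix are placeholders, and the property name should match the casing in your List blobs output):
"Filter_array": {
    "type": "Query",
    "inputs": {
        "from": "@body('List_blobs')?['value']",
        "where": "@contains(item()?['name'], 'myfile')"
    },
    "runAfter": { "List_blobs": [ "Succeeded" ] }
}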
