Pass a Directory Query to <cfzip>?

I need to zip files from a directory, but not all the files in the directory. I determine the files that need to be zipped by running a query on the directory listing.
Currently, I'm looping over the query results to add each file to the archive individually, but this can take a while in a large directory.
Is there any way to do this outside of a loop? I couldn't find anything in the CF docs that would indicate that you can pass some sort of list to cfzip.

Unfortunately, no. You can pass it an entire directory to zip up, but not a query of files.
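For comparison, outside of ColdFusion the same task of zipping only the files a query selects is exactly this kind of loop. A minimal Python sketch of the idea, where the directory, filter, and archive path are hypothetical stand-ins for the directory query:

import os
import zipfile

src_dir = "/path/to/source"
# Stand-in for the directory-listing query: pick only the files you want.
selected = [f for f in os.listdir(src_dir) if f.endswith(".csv")]

# Add each selected file to the archive individually.
with zipfile.ZipFile("/path/to/archive.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for name in selected:
        zf.write(os.path.join(src_dir, name), arcname=name)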

Related

ADF Copy Activity problem with wildcard path

I have a seemingly simple task: integrate multiple JSON files residing in a Data Lake Gen2.
The problem is that the files to be integrated are located in multiple folders. For example, this is a typical structure I am dealing with:
Folder1\Folder2\Folder3\Folder4\Folder5\2022\Month\Day\Hour\Minute\ <---1 file in Minute Folder
Then the same structure repeats for the 2023 year, so in order to collect all the files I have to go down to the bottom of the structure, which is the Minute folder. If I use a wildcard path it looks like this:
Wildcard paths: 'source from dataset'/*.json. This copies everything, including all folders, and I just want the files. I tried to narrow it down to copy only the 2022 folder first, but whatever I do with the wildcard paths is not working. Help is much appreciated.
Trying different wildcard combinations did not help; obviously I am doing something wrong.
There is no option to copy files from multiple sub-folders to a single destination folder. The 'flatten hierarchy' copy behavior will also produce auto-generated file names in the target.
[Image: MS documentation on copy behaviour]
Instead, you can follow the below approach.
To list the file paths in the container, take a Lookup activity and connect it to an XML dataset with an HTTP linked service.
Give the Base URL in the HTTP connector as:
https://<storage_account_name>.blob.core.windows.net/<container>?restype=directory&comp=list
[Replace <storage_account_name> and <container> with the appropriate names in the above URL.]
The Lookup activity returns the list of folders and files as separate line items.
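As a rough illustration of what the Lookup activity sees, the same listing URL can be called directly and the EnumerationResults/Blobs/Blob entries parsed. This sketch assumes the container is readable by the caller (otherwise append a SAS token) and reuses the placeholder account/container names from above:

import requests
import xml.etree.ElementTree as ET

url = ("https://<storage_account_name>.blob.core.windows.net/"
       "<container>?restype=directory&comp=list")
resp = requests.get(url)
resp.raise_for_status()

# The response is XML: EnumerationResults > Blobs > Blob, one entry per item.
root = ET.fromstring(resp.text)
for blob in root.findall("./Blobs/Blob"):
    print(blob.findtext("Name"))  # folders and files appear as separate entries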
Take the Filter activity and filter the URLs that end with .json from the lookup activity output.
Settings of the Filter activity:
items:
@activity('Lookup1').output.value[0].EnumerationResults.Blobs.Blob
condition:
@endswith(item().URL,'.json')
[Image: output of the Filter activity]
Take a ForEach activity after the Filter activity and give the items of the ForEach as @activity('Filter1').output.value
Inside the ForEach activity, take a Copy activity.
Take the HTTP connector and a JSON dataset as the source, and give the base URL as
https://<account-name>.blob.core.windows.net/<container-name>/
Create a parameter for the relative URL and set the value for that parameter to @item().name
In the sink, give the container name and folder name.
Give the file name as dynamic content:
@split(item().name,'/')[sub(length(split(item().name,'/')),1)]
This expression extracts the file name from the relative URL value.
When the pipeline is run, all files from the multiple folders are copied to a single folder.
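For reference, the same list -> filter .json -> copy-to-one-folder flow can be sketched with the azure-storage-blob Python SDK instead of pipeline activities; the connection string, container, and target folder below are placeholders:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection_string>", "<container>")

for blob in container.list_blobs():
    if not blob.name.endswith(".json"):
        continue  # same condition as the Filter activity
    file_name = blob.name.split("/")[-1]  # same idea as the split() expression above
    data = container.download_blob(blob.name).readall()
    # Write every file into a single target folder, overwriting on re-run.
    container.upload_blob(f"target-folder/{file_name}", data, overwrite=True)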

Get ADLS directory and sub-directory paths till it gets the file format in a table using databricks

I have an ADLS account which has several folders, which in turn have sub-folders, and so on until the point where there is either CSV or Parquet data in them.
How do I get the folder names and sub-folders, along with the file format, in Databricks? Also, there are some junk folders which I don't want to consider at all, like Folder123, Folder_dummy, etc.
Suggestions please.
You can add a wildcard character in places where you don't know all the possible folder names. For example, if you want to query a Parquet file from a nested path, you can use this:
select * from parquet.`{Your ADLS folder}/*/{SomeSpecificFolder}/{your parquet}.parquet`
You can use wildcards to any extent with Databricks/Spark SQL, as long as you know which Parquet file you are querying and give that name alone.
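If you also want the folder/file/format listing itself as a table (rather than querying through wildcards), one option is a recursive walk with dbutils. A rough sketch, assuming a Databricks notebook where dbutils, spark, and display are available; the root path and junk-folder names are placeholders:

JUNK_FOLDERS = {"Folder123", "Folder_dummy"}

def collect_files(path, rows):
    # Recursively walk the ADLS path, skipping junk folders, and record
    # (folder_path, file_name, file_format) for CSV/Parquet files.
    for fi in dbutils.fs.ls(path):
        if fi.isDir():
            if fi.name.rstrip("/") in JUNK_FOLDERS:
                continue
            collect_files(fi.path, rows)
        elif fi.name.endswith(".csv") or fi.name.endswith(".parquet"):
            fmt = "csv" if fi.name.endswith(".csv") else "parquet"
            rows.append((path, fi.name, fmt))

rows = []
collect_files("abfss://<container>@<account>.dfs.core.windows.net/<root>/", rows)
display(spark.createDataFrame(rows, ["folder_path", "file_name", "file_format"]))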

Store a folder into a single file without using external library or database

I want to archive and unarchive a whole folder into/from a file. It's similar to compressing/uncompressing a folder as a zip file. The original folder structure should be preserved.
Here's what I'm trying to do:
- Archive a whole folder into a single archive file.
- Unarchive an archive file to a target folder (i.e., extract the archive into another folder).
- Update a file (file content or file name) in the archive without having to recreate the whole archive from scratch.
- Delete one or many files or folders (and the files within them) from the archive.
It would be much easier if I used external libraries or a database for storage, but I need to do these tasks without using any library/tool. Please give me suggestions.
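One way to picture a minimal no-library format is a flat sequence of length-prefixed entries. The sketch below covers only the archive/unarchive steps (it skips empty directories, and in-place update/delete would additionally need an index of entry offsets kept at the end of the file); the layout is hypothetical:

import os
import struct

# Hypothetical layout per entry:
#   [2-byte path length][UTF-8 relative path][8-byte data length][file bytes]

def archive(src_dir, archive_path):
    with open(archive_path, "wb") as out:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir).encode("utf-8")
                with open(full, "rb") as f:
                    data = f.read()
                out.write(struct.pack(">H", len(rel)) + rel)
                out.write(struct.pack(">Q", len(data)) + data)

def unarchive(archive_path, dest_dir):
    with open(archive_path, "rb") as inp:
        while True:
            head = inp.read(2)
            if not head:
                break
            rel = inp.read(struct.unpack(">H", head)[0]).decode("utf-8")
            size = struct.unpack(">Q", inp.read(8))[0]
            target = os.path.join(dest_dir, rel)
            os.makedirs(os.path.dirname(target) or dest_dir, exist_ok=True)
            with open(target, "wb") as f:
                f.write(inp.read(size))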

azure data factory: iterate over millions of files

Previously I had a problem with how to merge several JSON files into one single file, which I was able to resolve with the answer to this question.
At first, I tried with just some files by using wildcards in the file name in the connection section of the input dataset. But when I remove the file name, in theory all of the files in all folders should be loaded recursively, since I checked the 'copy recursively' option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input of the dataset, only some of the files get loaded: the task ends successfully but loads only around 400+ files, while each folder has 1M+ files. I want to create big CSV files by merging all the small JSON files of the source (I was already able to create a CSV file by mapping the schemas in the copy activity).
It is probably stopping due to a timeout or an out-of-memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without having the list / contents of all files in memory at the same time.
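For comparison, the same lazy-enumeration idea sketched in Python using os.scandir, which yields directory entries one at a time instead of materializing the whole listing; the root path is a placeholder:

import os

def iter_files(root):
    # Lazily yield every file path under root without building the full list.
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    yield entry.path

# Process millions of files one at a time, e.g. appending each JSON to a big CSV.
for path in iter_files("/mnt/source"):
    pass  # handle(path)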

Delete folder by name from google drive using gdrive

I have read the documentation for gdrive here, but I couldn't find a way to do what I want to do. I want to write a bash script to automatically upload a specific folder from my hard drive. The problem is that when I upload it several times, instead of replacing the old folder with the new one, it generates a new folder with the same name.
I could only find the following partial solutions to my problem:
Use update to replace files. The problem with this partial solution is that new files inside the folder would not get uploaded automatically, and I would have to change the bash script every time a new file is produced in the folder that I want to upload.
Erase the folder by its id from Google Drive and then upload the folder again. The problem here is that whenever I do this, the id of the uploaded folder changes, so I couldn't find a way to write a script to do the work.
I am looking for any method that solves my problem. But the precise questions that could help me are:
Is there a way to delete a folder from google drive (using gdrive) by its name instead of by its id?
Is there a way to get the id of a folder by its name? I guess not, since there can be several folders with the same name (but different ids) uploaded. Or am I missing something?
Is there a way to do a recursive update to renew all files that are already inside the folder uploaded on google drive and in addition upload those that are not yet uploaded?
In case it is relevant, I am using Linux Mint 18.1.
Is there a way to delete a folder from google drive (using gdrive) by its name instead of by its id?
Nope. As your next question observes, there can be multiple such folders.
Is there a way to get the id of a folder by its name? I guess not, since there can be several folders with the same name (but different ids) uploaded. Or am I missing something?
You can get the ids (plural) of all folders with a given name.
gdrive list -q "name = 'My folder name' and mimeType='application/vnd.google-apps.folder' and trashed=false"
Is there a way to do a recursive update to renew all files that are already inside the folder uploaded on google drive and in addition upload those that are not yet uploaded?
Yes, but obviously not with a single command. You'll need to write a short script using gdrive list and parse (awk works well) the output.
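A rough sketch of that script in Python rather than awk, using only the gdrive list command shown above; it assumes the default gdrive list output prints a header row followed by lines whose first whitespace-separated column is the id:

import subprocess

def folder_ids_by_name(name):
    query = ("name = '%s' and mimeType='application/vnd.google-apps.folder' "
             "and trashed=false" % name)
    out = subprocess.run(["gdrive", "list", "-q", query],
                         capture_output=True, text=True, check=True).stdout
    # Skip the header row and pull the id from the first column of each line.
    return [line.split()[0] for line in out.splitlines()[1:] if line.strip()]

print(folder_ids_by_name("My folder name"))  # zero, one, or several matching ids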
