Azure Data Factory: iterate over millions of files

Previously I had a problem with merging several JSON files into one single file,
which I was able to resolve with the answer to this question.
At first, I tried with just a few files, using wildcards in the file name in the connection section of the input dataset. When I remove the file name, in theory all of the files in all folders should be loaded recursively, since I checked the copy recursively option in the source section of the copy activity.
The problem is that when I manually trigger the pipeline after removing the file name from the input dataset, only some of the files get loaded: the task ends successfully but loads only around 400+ files, while each folder has 1M+ files. I want to create big CSV files by merging all the small JSON files in the source (I was already able to create a CSV file by mapping the schemas in the copy activity).

It is probably stopping due to a timeout or an out-of-memory exception.
One solution is to loop over the contents of the directory using
Directory.EnumerateFiles(searchDir)
This way you can process all the files without holding the list / contents of all of them in memory at the same time.
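If you drive the merge yourself from a small script (rather than from a single copy activity), the same lazy-enumeration idea applies. Below is a minimal Python sketch; the paths, the .json extension and the column names are assumptions, and each JSON file is assumed to contain a flat object:

import csv
import json
import os

SOURCE_DIR = "/data/json_input"          # hypothetical input folder
OUTPUT_CSV = "/data/merged/output.csv"   # hypothetical output file
FIELDS = ["id", "name", "value"]         # assumed keys in each JSON file

with open(OUTPUT_CSV, "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    # os.scandir yields entries lazily, so millions of file names are never
    # held in memory at once (the same idea as Directory.EnumerateFiles).
    with os.scandir(SOURCE_DIR) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(".json"):
                with open(entry.path) as f:
                    writer.writerow(json.load(f))

For files sitting in Blob Storage, the same pattern works with the storage SDK's paged blob listing in place of os.scandir, so the full file list is again never materialized in memory.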

Related

Paraview: Create a state file in an external program

My C++ code outputs a number of vtu files and stl files. Each vtu file has a different mesh and a different number of fields. I want the user to be able to open those vtu files in ParaView together so that they are all in the same pipeline. Currently, the user has to open each vtu file separately, or group-select them in the Open File dialog box and open them together. But I want to give the user a better experience: I would like the user not to worry about all the different files and to open just one "combined" file. Is there a way to create one single file from all these vtu and stl files? Or to create a single "reference" file that references those other vtu and stl files, so that the user has to open only the reference file?
If you have a way to get the list of files to load, you can create a Python script alongside your data, where you basically put:
from paraview.simple import *
# recover file list
# ...
for file in files:
    OpenDataFile(file)
Then one can just load this script as a state in ParaView.
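For example, if the files can be found by scanning a directory, the file-list recovery could look like the following. This is only a sketch: the data directory and the extensions are assumptions to adapt.

from paraview.simple import *
import glob
import os

# Hypothetical folder holding the solver output; adjust as needed.
data_dir = "/path/to/simulation/output"

# Collect the vtu and stl files in a deterministic order.
files = sorted(glob.glob(os.path.join(data_dir, "*.vtu")) +
               glob.glob(os.path.join(data_dir, "*.stl")))

# Each file becomes its own source in the ParaView pipeline.
for f in files:
    OpenDataFile(f)

The user then loads this single script as a state in ParaView and gets every dataset in the pipeline in one action.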

Import multiple Excel files to database in Pentaho 6

I want to import multiple Excel files into my DB using a loop. For example, I put all the Excel files in a for loop and import each Excel file into my DB.
When I try to import all the files in the folder at once, I can import a maximum of 2 files; with three files it shows errors related to RAM.
Thank you in advance.
You can use a Get file names step as an input to get all the Excel files.
You feed the output of the Get file names step to the Microsoft Excel input step; this step has a checkbox to accept filenames from the previous step.
To make this work, all Excel files must have the same structure. If they have different structures, you'll have to inject metadata with the differences for each file, and you'll have to build logic in previous transformations to determine the metadata to inject.

Azure Data Factory - Recording file name when reading all files in folder from Azure Blob Storage

I have a set of CSV files stored in Azure Blob Storage. I am reading the files into a database table using the Copy Data task. The Source is set to the folder where the files reside, so it grabs each file and loads it into the database. The issue is that I can't seem to map the file name in order to read it into a column. I'm sure there are more complicated ways to do it, for instance first reading the metadata and then reading the files in a loop, but surely the file metadata should be available to use while traversing through the files?
Thanks
This is not possible in a regular copy activity. Mapping Data Flows has this possibility; it's still in preview, but maybe it can help you out. If you check the documentation, you will find an option to specify a column to store the file name.

How to load files in a specific order

I would like to know how I can load some files in a specific order. For instance, I would like to load my files according to their timestamp, in order to make sure that subsequent data updates are replayed in the proper order.
Let's say I have 2 types of files: deal info files and risk files.
I would like to load T1_Info.csv, then T1_Risk.csv, T2_Info.csv, T2_Risk.csv...
I have tried to implement a comparator, as described on Confluence, but it seems that the loadInstructions file takes priority: it orders the info files and the risk files independently (loading T1_Info.csv, T2_Info.csv and then T1_Risk.csv, T2_Risk.csv...).
Do I have to implement a custom file loader, or is it possible using an AP configuration?
The loading of the files based on load instructions is done in
com.quartetfs.tech.store.csv.impl.CSVDataModelFactory.load(List<FileLoadDescriptor>). The FileLoadDescriptor list you receive is created directly from the load instructions files.
What you can do is create a simple instructions file with 2 entries, one for deal info and one for risk. Your custom implementation of CSVDataModelFactory will then be called with a list of two items. In your custom implementation you scan the directory where the files are, sort them in the order you want them to be parsed, and call super.load() with the list of FileLoadDescriptor objects you created from the directory scan.
If you also want to load files that are placed in this folder in the future, you have to add to your load instructions a line that matches all files; that will make the super.load() implementation create a directory watcher for them (you should then maybe override createDirectoryWatcher() so that it does not watch the files already present in the folder when load is called).

Pass a Directory Query to <cfzip>?

I need to zip files from a directory, but not all the files in the directory. I determine the files that need to be zipped by running a query on the directory listing.
Currently, I'm looping over the query results to add each file to the archive individually, but this can take a while in a large directory.
Is there any way to do this outside of a loop? I couldn't find anything in the CF docs that would indicate that you can pass some sort of list to cfzip.
Unfortunately, no. You can pass it an entire directory to zip up, but not a query of files.
