How to order input files in the Excel input step in Pentaho

I'm using an Excel input step in a transformation and I need to process a lot of Excel files in a directory. The problem is that Kettle processes them in an arbitrary order, so the result is not always what I was hoping for. Is there some way to specify the order in which the files are processed? I need Spoon to process them by date, from oldest to newest. Thank you.

Late reply, but maybe still helpful.
You could first use a "Get File Names" step to get the list of files in the directory. Then you use "Sort Rows" and sort by "lastmodifiedtime" (I don't think a "filecreatedtime" is available, so that is a risk). Then you write the result to the log. Afterwards you read this log and process the files one by one.
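For illustration only, the same ordering idea outside Kettle might look like this minimal Python sketch (the directory path is just an assumption):

```python
# Minimal sketch: list the Excel files and order them by last-modified time,
# oldest first - the same idea as sorting on "lastmodifiedtime" in PDI.
import os

folder = "/data/excel_input"  # hypothetical directory holding the .xls/.xlsx files

files = [entry.path for entry in os.scandir(folder)
         if entry.name.lower().endswith((".xls", ".xlsx"))]
files.sort(key=os.path.getmtime)  # oldest to newest

for path in files:
    print(path)  # process each file in date order here
```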

I don't know if there's a reliable way to make PDI process the files in a particular order at the job level.
But what you can do is go to the 'Additional output fields' tab in the Excel input step and specify a field name for the file name (either 'Full filename field' or 'Short filename field'). This will add the file name as a column in the output of the Excel input step, under the name you specify. Then simply flow this through a Sort rows step and sort by that column.
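As a rough illustration of that approach outside PDI (assuming pandas plus openpyxl are available; the folder and column name are made up):

```python
# Sketch: tag every row with its source file name, then sort on that column -
# the same pattern as 'Additional output fields' followed by a Sort rows step.
from pathlib import Path

import pandas as pd  # reading .xlsx files also needs openpyxl installed

folder = Path("/data/excel_input")  # hypothetical directory
frames = [pd.read_excel(f).assign(source_file=f.name)  # 'source_file' plays the filename field
          for f in sorted(folder.glob("*.xlsx"))]

all_rows = pd.concat(frames, ignore_index=True).sort_values("source_file")
print(all_rows.head())
```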

Related

How to output to the source data file?

I am brand new to Alteryx and am building a workflow that will be reused with several different Excel reports. Each report has a different format (different column headers, etc).
Before running the workflow, I change the Data Input and update the fields in the Select Tool.
At the end of the workflow, I need to output the results to a new sheet within the original Excel workbook.
I know that the Input Tool has the "Output File Name as Field" option, but I cannot figure out how to use that within the Output Tool.
Is there a better way to do this? Right now, I am having to select the new file in the Input Tool and the Output Tool on each run; if I forget to change the output, it will overwrite the sheet in the wrong file.
You can choose a field to determine the file that will be output.
In the Output Data tool, check "Take File/Table Name from Field" and select "Change Entire File Path". You then choose which field contains the output file name. Does that help with your problem?

How to add columns from multiple files in U-SQL in ADLA?

I have a lot of CSV files in an Azure Data Lake, consisting of data of various types (e.g., pressure, temperature, true/false). They are all time-stamped and I need to collect them in a single file according to timestamp for machine learning purposes. This is easy enough to do in Java - start a file stream, run a loop on the folder that opens each file, compare timestamps to write the relevant values to the output file, starting a new column (going to the end of the first line) for each file.
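Roughly what I mean, sketched in Python rather than Java (the folder and column names here are just placeholders):

```python
# Rough sketch of the desired result: each CSV contributes one column and the
# rows are lined up by timestamp. Folder and column names are assumptions.
from functools import reduce
from pathlib import Path

import pandas as pd

files = sorted(Path("sensor_csvs").glob("*.csv"))  # hypothetical folder

frames = []
for f in files:
    df = pd.read_csv(f, parse_dates=["timestamp"])  # assumes a 'timestamp' column
    value_col = next(c for c in df.columns if c != "timestamp")
    frames.append(df[["timestamp", value_col]].rename(columns={value_col: f.stem}))

# Outer-join everything on timestamp so every file becomes its own column.
combined = reduce(lambda a, b: a.merge(b, on="timestamp", how="outer"), frames)
combined.sort_values("timestamp").to_csv("combined.csv", index=False)
```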
While I've worked around the timestamp problem in U-SQL, I'm having trouble coming up with syntax that will help me run this on the whole folder. The wildcard syntax {*} treats all files as the same fileset, while I need to run some sort of loop to join a column from each file individually.
Is there any way to do this, perhaps using virtual columns?
First you have to think about your problem functional/declaratively and not based on procedural paradigms such as loops.
Let me try to rephrase your question to see if I can help. You have many csv files with data that is timestamped. Different files can have rows with the same timestamp, and you want to have all rows for the same timestamp (or range of timestamps) output to a specific file? So you basically want to repartition the data?
What is the format of each of the files? Do they all have the same schema or different schemas? In the latter case, how can you differentiate them? Based on filename?
Let me know in the comments if that is a correct declarative restatement and the answers to my questions and I will augment my answer with the next step.

Kettle (Spoon) - get filename for Excel output from a field in the Excel input

I'm trying to process an Excel file; I need to generate an Excel file for each row, and as the filename I need to use one of the fields in the row.
The Excel output step doesn't have the option "Accept filename from field" and I can't figure out how to achieve this.
Thanks.
You need to copy the rows into memory and then loop over them to generate multiple Excel files. Break your solution into two parts. First, read all the rows from the Excel input step into a "Copy rows to result" step. In the next transformation, read those rows back and use the field value as the file name parameter.
Please check the two links:
SO Similar Question: Pentaho : How to split single Excel file to multiple excel sheet output
Blog : https://anotherreeshu.wordpress.com/2014/12/23/using-copy-rows-to-result-in-pentaho-data-integration/
Hope this helps :)
The issue is that the step is mostly made for outputting the rows to a single file, not making a file for each row.
This isn't the most elegant solution but I do think it will work. From your transformation you can call a sub-transformation (Mapping) and send a variable to it containing the filename. The sub-transformation can simply do one thing: write the file, and it should work fine. Make sense?
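If it helps to see the overall shape, here is a minimal sketch of that per-row output done outside PDI (assumes pandas and openpyxl are installed; 'filename' is a hypothetical column holding the name to use for each row):

```python
# Sketch: write one Excel file per input row, named after a field in that row.
import pandas as pd  # writing .xlsx files also needs openpyxl installed

rows = pd.read_excel("input.xlsx")        # hypothetical source workbook
for _, row in rows.iterrows():
    out_name = f"{row['filename']}.xlsx"  # file name taken from a field in the row
    row.to_frame().T.to_excel(out_name, index=False)
```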

Delete some columns, re-arrange remaining columns and move processed files for multiple .csv files using SSIS 2008 R2

I Googled for some tips on how to crack this but did not get any helpful hits.
Now, I wonder if I can achieve the same in SSIS or not.
There are multiple .csv files in a folder. What I am trying to achieve is to:
Open each .csv file (I would use a parameter, as the filenames change)
Delete some columns
Re-arrange the remaining columns in a particular order
Save the .csv file (without the Excel confirmation message box)
Close the .csv file
Move the processed file to another folder
Re-start the entire process until all the .csv files in the folder are processed.
Initially I thought I could use the For Each Loop Container and Execute Process Task to achieve this. However, I was not able to find any resource on how to achieve the above desired objective.
Example:
Header of every Source .csv file:
CODE | NAME | Value 1 | Value 2 | Value 3 | DATE | QTY | PRICE | VALUE_ADD | ZONE
I need to delete columns: NAME | VALUE_ADD | ZONE from each file and re-arrange the columns in the below order.
Desired column order:
CODE | DATE| Value 1 | Value 2 | Value 3 | PRICE | QTY
I know this is possible within SSIS, but I am not able to figure it out. Thanks for your help in advance.
Easily done using the following four steps:
Use a "Flat file Connection" to open your CSV.
Use a "Flat file Source" component to read your CSV.
Use a "Derived column" component to rearrange your columns.
Use a "Flat file Destination" component to save your CSV.
Et voilà!
After a lot of experimenting, managed to get the desired result. In the end, it seemed so simple.
My main motive for creating this package was that I had a lot of .csv files that needed the laborious task of opening each file and running a macro that eliminated a couple of columns and rearranged the remaining columns into the desired format. Then I had to manually save each of the files after clicking through the Excel confirmation boxes. That was becoming too much; I wanted a one-click approach.
Below is a detailed account of what I did. Hope it helps people who are trying to get data from multiple .csv files as a source, then keep only the desired columns in the order they need, and finally save the desired output as .csv files into a new destination.
In brief, all I had to use was:
a For Each Loop Container
a Data Flow Task within it.
And within the Data Flow Task:
a Flat File Source
a Flat File Destination
2 Flat File Connection Managers - One each for Source and Destination.
Also, I had to use 3 Variables - all String Data Types with Project Scope - which I named: CurrFileName, DestFilePath, and FolderPath.
Detailed Steps:
Set default values to the variables:
CurrFileName: Just provide the name of one of the .csv files (test.csv) for temporary purposes.
FolderPath: Provide the path where your source .csv files are located (C:\SSIS\Data\Input)
DestFilePath: Provide the Destination path where you want to save the processed files (C:\SSIS\Data\Input\Output)
Step 1: Drag a For Each Loop Container to the Control Flow area.
Step 2: In the Collection section, set the enumerator to 'Foreach File Enumerator'.
Step 3: Under Enumerator Configuration, under Folder: provide the folder path where the .csv files are located (In my case, C:\SSIS\Data\Input) and in Files:, provide the extension (in our case: *.csv)
Step 4: Under Retrieve file name, select 'Name and extension' radio button.
Step 5: Then go to the Variable Mappings section and select the variable (in my case: User::CurrFileName).
Step 6: Create the source connection (let's call it SrcConnection)- right-click in the Connection Managers area and select the Flat File Connection manager and select one of the .csv files (for temporary purpose). Go to the Advanced tab and provide the correct desired data type for the columns you wish to keep. Click OK to exit.
Step 7: Then go to the Properties of this newly created source Flat File Connection and click the small box adjacent to the Expressions field to open the Property Expressions Editor. Under 'Property', select 'ConnectionString' and in the Expression space, enter: @[User::FolderPath] + "\\" + @[User::CurrFileName], then click OK to exit.
Step 8: In Windows Explorer, create a new folder inside your Source folder (in our case: C:\SSIS\Data\Input\Output)
Step 9: Create the Destination connection (let's call it DestConnection) - right-click in the Connection Managers area and select the Flat File Connection manager and select one of the .csv files (for temporary purpose). Go to the Advanced tab and provide the correct desired data type for the columns you wish to keep. Click OK to exit.
Step 10: Then go to the Properties of this newly created destination Flat File Connection and click the small box adjacent to the Expressions field to open the Property Expressions Editor. Under 'Property', select 'ConnectionString' and in the Expression space, enter: @[User::DestFilePath] + "\\" + @[User::CurrFileName], then click OK to exit.
Step 11: Drag the Data Flow Task to the Foreach Loop Container.
Step 12: In the Data Flow Task, drag a Flat File Source and in the Flat file connection manager: select the source connection (in this case: SrcConnection). In Columns, de-select all the columns and select only the columns that you require (in the order that you require) and click OK to exit.
Step 13: Drag a Flat File Destination to the Data Flow Task and in the Flat File Connection manager: select the destination connection (in this case: DestConnection). Then, go to the Mappings section and verify if the mappings are as per desired output. Click OK to exit.
Step 14: That's it. Execute the package; it should run without any trouble.
Hope this helped :-)
It isn't clear why you want to use SSIS to do this: your task seems to be to manipulate text files outside the database, and it's usually much easier to do this in a small script or program written in a language with good CSV parsing support (Perl, Python, PowerShell, whatever). If this should be part of a larger package then you can simply call the script using an Execute Process task. SSIS is a great tool, but I find it quite awkward for a task like this.
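For example, a small standalone script along those lines (Python here, one of the languages mentioned above; the folder layout is borrowed from the question and the 'Processed' folder name is made up) might look like:

```python
# Drop NAME, VALUE_ADD and ZONE, reorder the remaining columns, write the
# result to an output folder, then move the handled file out of the way.
# Assumes ordinary comma-delimited .csv files with a header row.
import csv
import shutil
from pathlib import Path

SRC = Path(r"C:\SSIS\Data\Input")
DEST = SRC / "Output"
PROCESSED = SRC / "Processed"  # hypothetical folder for files already handled
KEEP = ["CODE", "DATE", "Value 1", "Value 2", "Value 3", "PRICE", "QTY"]

DEST.mkdir(exist_ok=True)
PROCESSED.mkdir(exist_ok=True)

for src_file in SRC.glob("*.csv"):
    with src_file.open(newline="") as fin, \
         (DEST / src_file.name).open("w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=KEEP, extrasaction="ignore")
        writer.writeheader()
        for record in csv.DictReader(fin):
            writer.writerow(record)  # extra columns are silently dropped
    shutil.move(str(src_file), str(PROCESSED / src_file.name))
```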

Configure and link Excel to a delimited file for repeated use

I am dumping data into a tab-delimited file that I would like to view and analyze in Excel. But the file contents change frequently and I do not want to go through the import steps every time, i.e. define delimiters, column names, etc. Is there a way to save the link metadata in an Excel file so that you can skip the definition steps on subsequent openings, i.e. so it knows that the first row contains column names, that the file is tab-delimited, etc.?
Thanks
Yes, you can. Go through the Get External Data route. Once you set it up, all you have to do next is "Refresh Data". No macro needed.
