Pentaho Data Integration - Multiple Excel File Inputs Loading

I've been using Spoon as a tool to complete a project. One of the requirements is to load multiple Excel files that have the same format (sheets) and output them to a Table Output step.
However, the number of Excel files is variable (a requirement), though they are all located in the same folder. Which step(s) allow loading all the Excel files that are in a folder?
Thanks.

The Microsoft Excel Input step supports reading all files in a folder, or a subset of them matched by a regular expression. It can also read all files including subfolders.
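As a minimal sketch, the step's Files tab could be configured along these lines (the folder path and pattern are assumptions; adjust them to your layout):

    File or directory : C:\data\excel_in
    Regular Expression: .*\.xlsx$
    Include subfolders: Y

With a pattern like this, every .xlsx file in the folder is picked up at run time, so the number of files can vary between runs without changing the transformation.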

Related

Upload Microsoft Excel Workbook with Many Sheets into Azure ML Studio

I want to upload my Excel workbook into Azure Machine Learning Studio. The reason is that I have some data I would like to join with my other .csv files to create a training data set.
When I upload, the dialog doesn't offer .xlsx or .xls, only other extensions such as .csv, .txt, etc.
I uploaded anyway, and now I am getting weird characters. How can I upload my Excel workbook and get at my sheets, so I can join the data and do data preparation? Any suggestions?
You could save the workbook as a (set of) CSV file(s) and upload them separately.
A CSV file, a 'Comma-Separated Values' file, is exactly that: a flat file with values separated by commas. If you upload an Excel file instead, it will come out garbled, since an Excel file contains far more than just values separated by commas. Have a look at File -> Save As -> Save as type, where you can select 'CSV (comma delimited) (*.csv)'.
Disclaimer: no, it's not always a comma...
In addition, the term "CSV" also denotes some closely related delimiter-separated formats that use different field delimiters. These include tab-separated values and space-separated values. A delimiter that is not present in the field data (such as tab) keeps the format parsing simple. These alternate delimiter-separated files are often even given a .csv extension despite the use of a non-comma field separator.
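If the workbook has many sheets, exporting them by hand gets tedious. Here is a minimal VBA sketch that saves each sheet as its own CSV file; the output folder is an assumption. Note that plain xlCSV writes ANSI text, and since Azure ML requires UTF-8 you may prefer the 'CSV UTF-8' format (xlCSVUTF8) available in newer Excel versions.

    Sub ExportSheetsAsCsv()
        ' Save each worksheet in this workbook as its own CSV file.
        Dim ws As Worksheet
        Dim outFolder As String
        outFolder = ThisWorkbook.Path & "\"   ' assumption: write next to the workbook
        Application.DisplayAlerts = False     ' suppress overwrite prompts
        For Each ws In ThisWorkbook.Worksheets
            ws.Copy                           ' copies the sheet into a new workbook
            ActiveWorkbook.SaveAs _
                Filename:=outFolder & ws.Name & ".csv", _
                FileFormat:=xlCSV
            ActiveWorkbook.Close SaveChanges:=False
        Next ws
        Application.DisplayAlerts = True
    End Sub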
Edit
So apparently Excel files are supported: Supported data sources for Azure Machine Learning data preparation
Excel (.xls/.xlsx)
Read an Excel file one sheet at a time by specifying sheet name or number.
But also, only UTF-8 is supported: Import Data - Technical notes
Azure Machine Learning requires UTF-8 encoding. If the data you are importing uses a different encoding, or was exported from a data source that uses a different default encoding, various problems might appear in the text.

Is it possible to find the file that created an Excel file?

I have an Excel file that was created by Alteryx, but I'm not sure which Alteryx workflow generated it. I was wondering if there is a way to backtrack and see what program created an Excel file.
Thanks
You won't be able to tell which Alteryx workflow created the file, but you can tell that it was created by Alteryx. In the document properties you can find the company that "created" the file, which will show up as Alteryx,Inc.
So I've created a very simple workflow that reads in a few lines of a CSV and exports to an Excel file, "TestOutput.xlsx".
If you then open the .yxmd Alteryx file in a text editor, you can see that it's just stored as XML, including the relevant section for the output.
From here, all you need is a way of searching through text files. Using findstr, you can quickly identify the workflow that produced the Excel file:
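A plausible command along those lines (the file name and folder are assumptions taken from the example above):

    rem Search all Alteryx workflows under the current folder, printing only
    rem the names of files that mention the output workbook.
    findstr /s /m /i "TestOutput.xlsx" *.yxmd

Here /s searches subdirectories, /m prints only matching file names, and /i makes the match case-insensitive.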

SSIS - looping through excel files using dynamic file name and sheet name

I am trying to load multiple Excel files into a database. I have tried this link:
How to loop through Excel files and load them into a database using SSIS package?
but it keeps looping through the files and never ends.
Can anyone help?
This is not likely, given a small number of files, which is what you should be testing with.
You need to log the file names inside the Foreach Loop and see whether the value ever changes (a sketch follows below).
The dynamic sheet name may also cause stability problems, e.g. some characters may not be picked up by the OLE DB driver.
In general, processing dynamic data like this is not a recommended practice.
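As a sketch of the logging suggestion, a Script Task inside the loop can fire an information event with the current file name. This assumes a package variable User::FileName that the Foreach Loop maps the current file into; the variable name is hypothetical.

    ' Script Task body (Visual Basic). Add User::FileName to the task's
    ' ReadOnlyVariables so it is visible here.
    Public Sub Main()
        Dim fileName As String = Dts.Variables("User::FileName").Value.ToString()
        Dim fireAgain As Boolean = True
        ' Writes one line per iteration to the progress/execution log.
        Dts.Events.FireInformation(0, "FileLogger", _
            "Processing file: " & fileName, String.Empty, 0, fireAgain)
        Dts.TaskResult = ScriptResults.Success
    End Sub

If the same file name appears on every iteration, the loop's variable mapping is the problem rather than the data flow.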

How to compare Excel files (.xlsx) with Kdiff3?

I added KDiff3 as my external diff tool in SourceTree as shown in the figure.
But when I select two commits from Master and click on External Diff from Actions, KDiff3 shows unreadable text as shown.
To compare Excel files in SourceTree, I used WinMerge (along with a plugin to compare Excel files), which is a free tool from http://freemind.s57.xrea.com/xdocdiffPlugin/en/
It seems like you're trying to compare two Excel files. Such files are stored in a binary format and are not comparable using tools designed for comparing text files (such as KDiff3 or WinMerge).
To compare two Excel files use Excel itself: https://support.office.com/en-ca/article/Compare-two-versions-of-a-workbook-by-using-Spreadsheet-Compare-0e1627fd-ce14-4c33-9ab1-8ea82c6a5a7e

Totaling figures in .csv files using Excel

I have 12 .csv files produced by another program. The .csv files contain numeric data, separated by commas.
I need an easy way of totaling the values in certain columns in each of the files and comparing the totals across the various files, e.g. comparing the total from file 1 to the total from file 5.
The format of each file is the same, i.e. 5 values in each record, separated by commas. Each of the 12 .csv files is about 50 MB in size. Each file has a different number of records.
The environment I work in is 'secure' and I can't run any programs other than what is installed on the PC I use. I have Excel installed and assume I can write VBA code/macros, and I have access to the command line. I can't (for example) load anything from a USB key and cannot install any scripting language, e.g. Python.
I have thought of doing this manually, e.g. opening each .csv file in Excel and totaling the columns using Excel functions such as SUM().
My challenge: I need to do this many times over the next few weeks as new versions of the .csv files are produced. I now have the first version; there will be many versions of the 12 files produced as I conduct testing on the other system. For each new version I need to sum the data and compare across files.
The last thing to say is that I can't change the system that produces the .csv files, e.g. to make it create a set of totals.
I'm looking for a programming solution I can use given my limited resources (no tools other than what is already on the PC).
You should be able to do this easily using an Excel VBA macro, but it might take quite some time if it needs to load and convert a 50 MB CSV file.
JScript (a Microsoft form of JavaScript) is generally available on all machines and runs under the Windows Script Host. Just create a file with a .js extension and try to run it with a double click. Or you can use VBScript with a .vbs extension.
I think your easiest solution would be to write an Excel macro (as you will have the IDE for Excel VBA, as limited as it is).
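As a starting point, here's a minimal VBA sketch that sums one column of a CSV by reading the file line by line, which avoids loading 50 MB into a worksheet. The path and column index are assumptions; wrap it in a loop over your 12 files as needed.

    Sub SumCsvColumn()
        Dim filePath As String, lineText As String
        Dim fields() As String
        Dim total As Double
        Dim colIndex As Long
        Dim fnum As Integer
        filePath = "C:\data\file1.csv"   ' hypothetical path: adjust
        colIndex = 2                     ' 0-based index of the column to sum
        fnum = FreeFile
        Open filePath For Input As #fnum
        Do While Not EOF(fnum)
            Line Input #fnum, lineText
            fields = Split(lineText, ",")
            If UBound(fields) >= colIndex Then
                If IsNumeric(fields(colIndex)) Then
                    total = total + CDbl(fields(colIndex))
                End If
            End If
        Loop
        Close #fnum
        MsgBox "Total for column " & colIndex & ": " & total
    End Sub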
PowerShell or a batch script? A CSV is nothing more than a text file split with commas. It should be fairly easy to knock something up.
ADO can work on CSV files, and you could then use SQL statements to sum the appropriate values; see this MSDN article for full details.
If you go to the Visual Basic Editor in Excel and then try to add a reference via the Tools menu, you should see several entries for Microsoft ActiveX Data Objects (2.8 being the most recent one). Adding that reference lets you use ADO.
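A rough sketch of that approach, assuming the reference above has been added and the ACE OLE DB provider is installed (substitute the Jet 4.0 provider on older systems; the folder, file name, and column name are placeholders):

    Sub SumCsvWithAdo()
        Dim conn As New ADODB.Connection
        Dim rs As ADODB.Recordset
        ' The folder acts as the 'database'; each CSV in it is a 'table'.
        conn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
                  "Data Source=C:\data\;" & _
                  "Extended Properties=""text;HDR=Yes;FMT=Delimited"""
        ' HDR=Yes takes column names from a header row; with no header,
        ' use HDR=No and the columns are named F1, F2, ...
        Set rs = conn.Execute("SELECT SUM(Col3) AS Total FROM [file1.csv]")
        MsgBox "Total: " & rs.Fields("Total").Value
        rs.Close
        conn.Close
    End Sub

This pushes the summing into SQL, so nothing ever has to be loaded into a worksheet.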
