Pentaho Data Integration - Dump Excel into table - excel

I'm very new to this tool and I want to do a simple operation:
Dump data from an Excel file into tables.
I have an Excel file with around 10-12 sheets, and almost every sheet corresponds to a table.
With the first Excel Input step there is no problem.
The only problem is that, I don't know why, when I try to edit a second Excel Input step (show the list of sheets, or get the list of columns), the software just hangs, and when it responds it just opens a warning with an error.
This is an image of the actual diagram that I'm trying to use:

This is a typical out-of-memory problem. PDI is not able to read the file and needs more memory to process the Excel workbook. You need to give PDI more memory to work with your Excel file: try increasing Spoon's memory. You can read Increase Spoon memory.
Alternatively, create a test file that replicates your Excel file with just a few rows of data, keeping the structure of the file as it is. Use that test file to generate the necessary sheet names and columns in the Excel Input step. Once you are done, point the step back at the original file and execute the job.

Related

How to replace index/match with a connection

In order to view customer data in an Excel sheet I have used the functions INDEX/MATCH to retrieve data. However, due to the large number of customers the file has gotten very large; right now it is 13 MB. This file is regularly sent by mail, so it is a real headache having to open it every time.
Is there a way to replace INDEX/MATCH with something else in order to reduce the file size? Transforming the source file into an SQL file? Adding a connection to the source file?
Thanks.
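On the "adding a connection" idea, here is a minimal VBA sketch, assuming a hypothetical file path, sheet name ("Customers") and filter column ("Region"); it pulls only the rows you need over an ADO connection instead of keeping INDEX/MATCH formulas in the workbook:

Sub PullCustomerData()
    ' Hypothetical path, sheet and column names; adjust to the real source file.
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
            "Data Source=C:\Data\customers.xlsx;" & _
            "Extended Properties=""Excel 12.0 Xml;HDR=YES"";"
    Set rs = CreateObject("ADODB.Recordset")
    ' Pull only the customers that are needed instead of looking up everything.
    rs.Open "SELECT * FROM [Customers$] WHERE [Region] = 'West'", cn
    ' Dump the result set starting at A2 of the active sheet.
    ActiveSheet.Range("A2").CopyFromRecordset rs
    rs.Close
    cn.Close
End Sub

Note that a connection only works where the source file is reachable, so this helps most if the source stays on a shared drive rather than travelling inside the mailed workbook.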

How to select specific rows to load into an Excel Workbook from another at run-time

I have two Excel files. One has data, "source.xlsx", and one has macros, "work.xlsm". I can load the data from "source.xlsx" into "work.xlsm" using Excel's built-in load or using Application.GetOpenFilename. However, I don't want all the data in source.xlsx; I only want to select specific rows, the criteria for which will be determined at run time.
Think of this as a SELECT from a database with parameters. I need to do this to limit the time and the amount of data processed by "work.xlsm".
Is there a way to do that?
I tried using a parameterized query from Excel --> [Data] --> [From Other Sources], but when I did that, it complained about not finding a table (same with ODBC). This is because the source has no table defined, so that makes sense. But I am restricted from touching the source.
So, in short, I need to filter the data before importing it into the target sheet, without touching the source file. I want to do this either interactively or via a VBA macro.
Note: I am using Excel 2003.
Any help or pointers will be appreciated. Thx.
I used a macro to convert the source file from .xlsx to .csv format, and then loaded the CSV-formatted file using a loop that applied the desired filter during the load.
This approach may not be the best, nevertheless, no other suggestion was offered and this one works!
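A minimal sketch of that approach, assuming hypothetical paths, a hypothetical destination sheet ("Data") and a simple comma split (quoted fields would need a real CSV parser); the inner If is where the run-time criterion would go:

Sub LoadFilteredRows()
    Const SRC As String = "C:\Data\source.xlsx"        ' placeholder path
    Const CSVFILE As String = "C:\Data\source.csv"     ' placeholder path
    Dim wb As Workbook, f As Integer, rec As String, parts() As String
    Dim dest As Worksheet, r As Long

    ' 1) Convert the source workbook to CSV (SaveAs xlCSV writes the active sheet only).
    Set wb = Workbooks.Open(SRC, ReadOnly:=True)
    Application.DisplayAlerts = False
    wb.SaveAs Filename:=CSVFILE, FileFormat:=xlCSV
    wb.Close SaveChanges:=False
    Application.DisplayAlerts = True

    ' 2) Read the CSV line by line, keeping only rows that pass the filter.
    Set dest = ThisWorkbook.Sheets("Data")
    r = 1
    f = FreeFile
    Open CSVFILE For Input As #f
    Do While Not EOF(f)
        Line Input #f, rec
        parts = Split(rec, ",")
        If UBound(parts) >= 2 Then                      ' skip short or blank lines
            If parts(2) = "KeepMe" Then                 ' run-time criterion goes here
                dest.Cells(r, 1).Resize(1, UBound(parts) + 1).Value = parts
                r = r + 1
            End If
        End If
    Loop
    Close #f
End Sub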
The other approach is to abandon the idea of pre-filtering, accept the load-time delay, and perform the filtering and removal of unwanted rows in the "work.xlsm" file. Performance and memory size are major factors in this case, assuming code complexity is not the issue.

Handling huge Excel file

Need your help badly. I am dealing with a workbook which has 7000 rows x 5000 columns of data in one sheet. Each of these data points has to be manipulated and pasted into another sheet. The manipulations are relatively simple; each one takes less than 10 lines of code (simple multiplications and divisions with a couple of Ifs). However, the file crashes every now and then with various types of errors. The problem is the file size. To overcome this problem, I am trying a few approaches:
a) Separate the data and output into different files. Keep both files open, take the data chunk by chunk (typically 200 rows x 5000 columns), manipulate it, and paste it into the output file. However, if both files are open, I am not sure this remedies the problem, since the memory consumed will be the same either way, i.e. instead of one file consuming a large amount of memory, it would be two files together consuming the same amount.
b) Separate the data and output into different files. Access the data in the data file while it is still closed, by inserting links in the output file through a macro, then manipulate the data and paste it into the output. This can be done chunk by chunk.
c) Separate the data and output into different files. Run a macro to open the data file, load a chunk of data, say 200 rows, into an array in memory, and close it. Process the array, then open the output file and paste the array results. (A sketch of this approach follows the question.)
Which of the three approaches is better? I am sure there are other methods which are more efficient. Kindly suggest.
I am not familiar with Access, but I tried to import the raw data into Access and it failed because it allows only 255 columns.
Is there a way to keep the file open but wash it in and out of memory? Then slight variations of (a) and (c) above could be tried. (I am afraid repeated opening and closing will crash the file.)
Look forward to your suggestions.
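For what it's worth, a minimal sketch of approach (c), with placeholder paths and sheet indexes and a dummy manipulation standing in for the real rules:

Sub ProcessInChunks()
    Const CHUNK As Long = 200                           ' rows per chunk
    Const TOTAL_ROWS As Long = 7000
    Const TOTAL_COLS As Long = 5000
    Dim startRow As Long, i As Long, j As Long
    Dim data As Variant
    Dim wbIn As Workbook, wbOut As Workbook

    Set wbOut = Workbooks.Open("C:\Data\output.xlsx")   ' placeholder path

    For startRow = 1 To TOTAL_ROWS Step CHUNK
        ' Open the data file, pull one chunk into an array, close it again.
        Set wbIn = Workbooks.Open("C:\Data\data.xlsx", ReadOnly:=True)
        data = wbIn.Sheets(1).Cells(startRow, 1).Resize(CHUNK, TOTAL_COLS).Value
        wbIn.Close SaveChanges:=False

        ' Manipulate in memory (placeholder arithmetic for the real logic).
        For i = 1 To UBound(data, 1)
            For j = 1 To UBound(data, 2)
                If IsNumeric(data(i, j)) Then data(i, j) = data(i, j) * 2
            Next j
        Next i

        ' Paste the processed chunk into the output workbook.
        wbOut.Sheets(1).Cells(startRow, 1).Resize(CHUNK, TOTAL_COLS).Value = data
    Next startRow

    wbOut.Save
End Sub

Reopening the data file once per 200-row chunk means 35 opens for 7000 rows; if that proves too slow or unstable, the same loop works with the data file opened once outside the loop, which is closer to variation (a).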
If you don't want to leave Excel, one trick you can use is to save the base Excel file as a binary ".xlsb". This will clean out a lot of potential rubbish that might be in the file (it all depends on where it first came from).
I just shrank a load of web data by 99.5% - from 300MB to 1.5MB - by doing this, and now the various manipulations in Excel work like a dream.
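If you want to do that conversion from a macro rather than via Save As, something along these lines should work (the path is a placeholder; xlExcel12 is the built-in constant for the binary .xlsb format):

Sub SaveAsBinary()
    ' Re-save the active workbook in the compact binary format.
    ActiveWorkbook.SaveAs Filename:="C:\Data\bigfile.xlsb", FileFormat:=xlExcel12
End Sub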
The other trick (from the 80s :) ), if you are using a lot of in-cell formulae rather than a macro to iterate through, is to (see the sketch after this list):
turn calculation off.
copy in your formulae.
turn calculation on, or just run a calculation manually.
copy and paste-special-values the formula outputs.
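A rough sketch of that sequence in VBA, with a placeholder sheet name and range:

Sub RecalcThenFreeze()
    Application.Calculation = xlCalculationManual       ' turn calculation off

    ' ...enter or copy your formulae here...

    Application.Calculate                               ' run the calculation once, manually

    ' Replace the formulae with their results (the paste-special-values step).
    With Worksheets("Output").Range("A1:E7000")         ' placeholder range
        .Value = .Value
    End With

    Application.Calculation = xlCalculationAutomatic
End Sub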
My suggestion is to use a scripting language of your choice and work with decomposition/composition of spreadsheets in it.
I was composing and decomposing spreadsheets back in the day (in PHP, oh shame) and it worked like a charm. I wasn't even using any libraries.
Just grab yourself the xlutils library for Python and get your hands dirty.

VBA Access -> Excel Output: File Size BALLOON from 3mb to 20mb, why?

I'm finishing up a program that I built to import an Excel file into a database, do some manipulations/edits, and then spit back out the edited Excel file. Except my problem is that the file size just ballooned from approx. 3 MB to ~19 MB.
It has the same record count, ~20k. It has ~3 more columns (out of 40+ columns total) - but that shouldn't make the file size 6x, should it? Below is the code I use for the output:
DoCmd.OutputTo acOutputQuery, "Q_Export", acFormatXLS, txtFilePath & txtFileName
Any ideas on how I can get that file size down a bit? Or does anyone have an indication of what is causing it, at least?
There are three possible reasons that spring immediately to mind:
1) You are importing more records than you think you are. Check the table after the Excel file has been imported. Make sure the table has EXACTLY as many records as there are rows of data in Excel. Often the import process will bring in many empty records, and that data is then exported as empty strings. To you it looks the same, but to Excel it's information that must be stored, which takes up space.
2) Excel handles NULL values differently than Access does. If your data has a lot of missing information, it's going to be stored differently when it's imported into Access. This actually brings us back to Reason #1.
3) When you import data, sometimes it comes in with trailing spaces. Make sure you TRIM() your data before exporting it, to avoid wasting storage space.
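For reason 3, a hedged sketch, assuming a hypothetical staging table (T_Import) and field name; the Trim() could just as well be built into the Q_Export query itself:

Sub TrimThenExport()
    ' Strip trailing spaces in place, then run the same OutputTo export.
    CurrentDb.Execute "UPDATE T_Import SET SomeField = Trim(SomeField)", dbFailOnError
    DoCmd.OutputTo acOutputQuery, "Q_Export", acFormatXLS, "C:\Exports\edited.xls"
End Sub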

SSIS Data Flow Task Excel Source

I have a data flow task set up in SSIS.
The source is from an Excel source not an SQL DB.
The problem I seem to get is that the package is importing empty rows.
My data has 555,200 rows, but when importing, the SSIS package imports over 900,000 rows. The extra rows are imported even though they are empty.
When I then download this table into Excel, there are empty rows in between the data.
Is there any way I can avoid this?
Thanks
Gerard
The best thing to do, if you can, is export the data to a flat file, CSV or tab-delimited, and then read it in. The problem is that even though those rows are blank, they are not really empty. So when you hop across that ODBC-Excel bridge you are getting those rows as blanks.
You could possibly adjust the way the spreadsheet is generated to eliminate this problem, or manually delete the rows. The problem with these solutions is that they are not scalable or maintainable over the long term. You will also be stuck with that rickety ODBC bridge. The best long-term solution is to avoid using the ODBC-Excel bridge entirely. By dumping the data to a flat file you have total control over how to read, validate, and interpret the data. You will not be at the mercy of a translation layer that is to this day riddled with bugs and is at the best of times "quirky".
You can also add a Conditional Split component in your Data Flow task, between the source and the destination. In there, check whether some column is null or empty; pick a column that is consistent, meaning for every valid row it has some data, and for every invalid row it is empty or null.
Then discard the output for that condition, sending the rest of the rows to the destination. You should then only get the rows with valid data from Excel.
