Pentaho, how to pull data from cells

I'm a new user of Pentaho AND a fairly weak user of Excel sheets. What I need Pentaho to do is what is described in the image. At the step right before the conclusion I have several cells with different data.
I need to sort of merge them together into 1 cell with all the right data. I tried Normaliser/De-Normaliser and I couldn't get it to work properly.
In Excel, what I do is basically pull the data UP the columns to the cell I want, based on a key which is common to those lines.
Let me know if someone needs further information.
In the transformation I receive a formatted text file as input. Up until step 25 (obs) I'm reading only the first line of each entry, which is where most of the information I need is located. By the pattern there are 9 other possible lines in each entry; some entries have up to 23 lines, others have only 6. Most of the data I can extract from line 1, but I also need data from 2 other lines. The "obs" step extracts it with formulas, by comparing the 2 initial digits and then cutting the string I need from those lines. The thing is, before the "Filter rows" step, those information cells are not aggregated on the same line. I need them all to be on the same line, as in the first image I posted, but I cannot find the step that does so, or I don't have the knowledge to make that step work properly.
If you need more information please let me know.
I'm using this many steps because at some point I'll add triggers and validations for most of them to ensure data integrity.

Found the answer myself. First I had to use a Group by step with a key that is present in all lines of the same "block" of cells. Then another problem surfaced: the top line of the block contained information I needed, but it didn't have the Group by key, so I had to use the Get Previous Row Field step BEFORE the Group by step to have those rows present. Hope this helps.
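For anyone who wants to see the idea outside Pentaho, here is a minimal sketch in plain Python of the same logic, not the actual transformation. It assumes that the top line of each block has an empty key and that the lines after it carry the key; the field names (key, name, obs1, obs2) are made up for illustration. Values from a keyless top line are held and carried down to the next keyed row (roughly what Get Previous Row Field makes possible), then rows are merged per key (the Group by part).

```python
def merge_blocks(rows, key="key"):
    """Merge all rows of a block into one row per key.

    Keyless rows (the top line of a block) have their values carried down
    to the next row that has a key; then the first non-empty value of each
    field is kept per key.
    """
    merged = {}   # key value -> combined row, in first-seen order
    pending = {}  # values from keyless top lines, waiting for a keyed row
    for row in rows:
        values = {k: v for k, v in row.items()
                  if k != key and v not in (None, "")}
        if not row.get(key):
            pending.update(values)  # hold until the next keyed row shows up
            continue
        target = merged.setdefault(row[key], {key: row[key]})
        for name, value in {**pending, **values}.items():
            target.setdefault(name, value)  # keep the first non-empty value
        pending = {}
    return list(merged.values())


# Hypothetical sample: two blocks, each with a keyless top line.
rows = [
    {"key": "", "name": "Alice", "obs1": "", "obs2": ""},
    {"key": "A1", "name": "", "obs1": "first note", "obs2": ""},
    {"key": "A1", "name": "", "obs1": "", "obs2": "second note"},
    {"key": "", "name": "Bob", "obs1": "", "obs2": ""},
    {"key": "B7", "name": "", "obs1": "only note", "obs2": ""},
]

for merged_row in merge_blocks(rows):
    print(merged_row)
```

Each block ends up as a single row holding the name from the top line plus the values cut from the other lines.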

Related

How does one extract information from a dynamic table, automatically through Excel functions?

I have been searching high and low for a way to solve my dilemma, in different ways, so I am trying to post both of the things I've been trying to do:
The challenge version 1:
I want to extract the entire row with information tied to the name which is the latest entry of that name in the table. So from the table below I would want to collect the entire row which contains the information: "A, Jack Black, 01.01.2029, 10:20". I simply want to copy the entire row to another sheet. But one important factor is that it has to happen automatically.
So I need functions which can check: is there another entry with the same name higher up in the table? If so, DO NOT COPY THE ROW. If there isn't another entry with the exact same name higher up in the table, COPY THE ENTIRE ROW to another table, within another sheet.
The challenge version 2:
What I really want to do is count the number of unique people (unique names) per department, and summarize this in another table. Basically this means that "Jack Black" should be counted as 1 person in department A.
So the result I want is a table looking like this (the one beneath), where the number of people does not contain any duplicate people (names). However, what I have tried does not work with a dynamic table, which updates the information it contains on the fly. I can make this happen if I am copying from a static table, but as stated above, the table is dynamic and updates with new information every minute...
So far I've tried Excel's built-in filtering, but this does not work automatically. I've also tried using functions like in this guide: https://excel-bytes.com/how-to-extract-a-dynamic-list-from-a-data-range-based-on-a-criteria-without-filters-in-excel/. However, every solution I find seems to need criteria for filtering out duplicates, or does not work when copying information from a dynamic table.
Does anyone know how to reach my desired result, without implementing criteria for selecting the rows or counting rows as stated above? VBA code is not an option at the moment :(
In advance, THANK YOU, I've really tried solving this, but I feel like this just might break my head wide open soon if I can't solve it. HEEEEELP!
Sincerely
haakonlu

Return the value in the first non-empty cell in the column directly to the left and going upward

I'm all new to VBA and have mostly been trying to modify code after recording macros, so it's all pretty basic and the approach might not be as elegant as some of the stuff I've seen on here. So here we go.
I have coded (by brute force) my data to be arranged like a CAD design tree view with parent products/assemblies and constituent sub-assemblies/parts.
Column E contains Level 0 top assembly Part Number
Column F contains Level 1 items Part Number
... etc all the way to ...
Column M containing Level 8 items Part Number
As an example, cell G112 contains ASSY1; cells H113 to H134 contain its constituent items.
I would like to display in a new column (i.e. Column O) the value of cell G112 (ASSY1) for each of its constituents. So O113 to O134 would show the value of G112. That would need to be applied to every single level of the assembly.
I'm not sure I'm making much sense, so please have a look at the picture linked below; it speaks a thousand words. I've highlighted and colour-coded the result I would like in column O.
ADDENDUM - To clarify things:
I don't know how else to explain my request but to post a simplified version of my original picture.
SIMPLIFIED EXCEL TABLE
.CSV available here WeTransfer

A very useful tool for finding the VBA code for a given action is the macro recorder. In the ribbon, go to Developer -> Record Macro, perform your action, stop recording, and then check the code generated for the actions you recorded. It's not the cleanest code, but you can find in it the lines of code for the specific actions you want. Once you run into a concrete problem with the code you tried, you can ask for help with that specific issue, rather than expecting someone to write the code for you.
Anyhow, if you want someone to try to solve your problem, you need to post the table as accessible data instead of an image, so that whoever tries to approach your problem has the data available.
Hope that helps

Here's the answer I got from somewhere else if anyone is interested:
Formula in Cell O3:
=IF(C3=0,"N/A , ALREADY TOP LEVEL",INDEX(D$2:D2,AGGREGATE(14,6,(ROW(D$2:D2)-ROW(D$2)+1)/(C$2:C2=C3-1),1)))
Copy/Paste down in every cell in column O
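To make it easier to see what the formula does, here is a rough Python sketch of the same lookup (just an illustration, not a replacement for the formula). As in the formula, it assumes one column holds the level (column C) and one holds the part number (column D), read top to bottom. For each row it returns the part number of the nearest row above whose level is one less, which is what the INDEX/AGGREGATE combination picks out. The function name and sample values are made up.

```python
def parent_parts(levels, parts):
    """For each row, return the part number of its parent assembly.

    The parent is the closest row above whose level is exactly one less.
    Level 0 rows have no parent, so they get None (the formula shows
    "N/A , ALREADY TOP LEVEL" instead).
    """
    last_at_level = {}  # level -> most recent part number seen at that level
    parents = []
    for level, part in zip(levels, parts):
        if level == 0:
            parents.append(None)
        else:
            parents.append(last_at_level.get(level - 1))
        last_at_level[level] = part
    return parents


# Hypothetical tree: TOP > ASSY1 > (P1, P2), TOP > ASSY2 > P3
levels = [0, 1, 2, 2, 1, 2]
parts = ["TOP", "ASSY1", "P1", "P2", "ASSY2", "P3"]

for part, parent in zip(parts, parent_parts(levels, parts)):
    print(part, "->", parent)
```

Remembering the last part seen at each level avoids rescanning all the rows above for every row, which is what the formula has to do.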

Pentaho Kettle - Loading excel with almost blank rows

I got an Excel file from an uncontrolled source. It comes with a row with all the fields filled in, followed by several rows with all fields blank except one (always the same one, a comment).
The comments belong to the ID of the "row with data".
I would like to make a new field "COMENTARY AGREGATED" with the concatenation of all the comments that belong to the ID, but I don't know how to do it. As far as I know, you can't interact with the order of the rows, as they are treated as independent. Am I right that this is impossible to do inside Kettle, and that I should resort to a VB macro in Excel as a preprocessing step?
Thanks for your time

You can use a group by step, group by all fields except the comment one, and on aggregations choose “concatenate values separated by” and use a whitespace as value for the concatenation ( or nothing if you prefer).
The excel input can’t do all that on its own.

For now I've advanced a little.
I found that in the Excel input step, in the Fields tab, the Repeat column can be set to Y, and if so, it fills the blank rows with the previous value.
I still don't know how to aggregate the others, but it's a step in the right direction, I guess.
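Putting the two posts above together (repeat the previous value on blank rows, then group and concatenate), the logic looks roughly like this in plain Python. This is only a sketch of the idea, not Kettle code: the field names ID, NAME and COMMENT are made up, and it assumes rows arrive in sheet order with the comment-only rows directly below the row they belong to.

```python
def aggregate_comments(rows, id_field="ID", comment_field="COMMENT", sep=" "):
    """Attach comment-only rows to the last row that had an ID.

    Returns one row per ID with all its comments joined by `sep`.
    """
    aggregated = []  # one output row per ID, in input order
    for row in rows:
        comment = row.get(comment_field) or ""
        if row.get(id_field):
            # A "row with data" starts a new record.
            record = dict(row)
            record[comment_field] = [comment] if comment else []
            aggregated.append(record)
        elif aggregated and comment:
            # A blank row with only a comment belongs to the last ID seen.
            aggregated[-1][comment_field].append(comment)
    for record in aggregated:
        record[comment_field] = sep.join(record[comment_field])
    return aggregated


# Hypothetical sheet contents, one dict per row.
rows = [
    {"ID": "1", "NAME": "pump", "COMMENT": "ok"},
    {"ID": "", "NAME": "", "COMMENT": "checked twice"},
    {"ID": "2", "NAME": "valve", "COMMENT": ""},
    {"ID": "", "NAME": "", "COMMENT": "leaks"},
    {"ID": "", "NAME": "", "COMMENT": "replace"},
]

for record in aggregate_comments(rows):
    print(record)
```

In Kettle terms, the "last ID seen" part corresponds to Repeat = Y on the non-comment fields, and the join corresponds to the "concatenate values separated by" aggregation in the Group by step.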

Reading an Excel sheet using ADO/ODBC in Delphi 7

I'm trying to read an Excel sheet from an XLS or XLSX file in memory using Delphi 7. When possible I use automation to read the cells one by one, but when Excel is not installed, I revert to using the ADO/ODBC Jet driver.
I connect using either
Provider=Microsoft.Jet.OLEDB.4.0; Data Source=file.xls;Extended Properties="Excel 8.0;Persist Security Info=False;IMEX=1;HDR=No";
Provider=Microsoft.ACE.OLEDB.12.0; Data Source=file.xlsx;Extended Properties="Excel 12.0;Persist Security Info=False;IMEX=1;HDR=No";
My problem then is that when I use the following query:
SELECT * FROM [SheetName$]
the returned results do not contain the empty rows or empty columns, so if the sheet contains such rows or columns, the following cells are shifted and do not end up in their correct position. I need the sheet to be loaded "as is", i.e. I need to know exactly which cell position each value comes from.
I tried to read the cells one by one by issuing one query of the form
SELECT F1 FROM `SheetName$A1:A1`
but now the driver returns an error saying "There is data outside the selected region". btw I had to use backticks to enclose the name because using brackets like this [SheetName$A1:A1] gave out a syntax error message.
Is there a way to tell the driver to select the sheet as-is, without skipping blanks? Or maybe a way to know from which cell position each value is returned?
For internal policy reasons (I know they are bad but I do not decide these), it is not possible to use a third party library, I really need this to work from standard Delphi 7 components.

I assume that if your data is say in the range B2:D10 for example, you want to include the column A as an empty column? Maybe? Is that correct? If that's the case, then your data set, when you read the sheet (SELECT * FROM [SheetName$]) would also return 1 million rows by 16K columns!
Can you not execute a query like: SELECT * FROM [SheetName$B2:D10] and use the ADO GetRows function to get an array - which will give you the size of the data. Then you can index into the array to get what data you want?

OK, the correct answer is
Use a third party library no matter what your boss says. Do not even try ODBC/ADO to load arbitrary Excel files, you will hit a wall sooner or later.
It may work for excel files that contain a single data table, but not when you want to cherry pick data in a sheet primarily made for human consumption (ie where a single column contains some cells with introductory text, some with numerical data, some with comments, etc...)
Using IMEX=1 ignores empty lines and empty columns
Using IMEX=0 sometimes no longer ignores empty lines, but now some of the first non-empty cells are treated as field names instead of data, despite HDR=No. It would not work anyway, since values in a column are of mixed types.
Explicitly looping across cells and making a SELECT * FROM [SheetName$A1:A1] works until you reach an empty cell, then you get access violations (see below)
Access violation at address 1B30B3E3 in module 'msexcl40.dll'. Read of address 00000000
I'm too old to want to try and guess the appropriate value to use so it works until someone comes with yet another mix of data in a column. Sorry for having wasted everybody's time.

How to Differentiate a Data from a Column/Header in an Excel File

I hope someone can help me come up with an algorithm.
I'm still very new to Apache POI and I was assigned to come up with an algorithm to read a template (Excel) and tell the headers/column names apart from the data itself.
The following must be taken into account:
There can be multiple headers/column names in just one sheet of an Excel file.
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Headers don't necessarily have to be in the very first row of the file. There could be introductions or banner images there.
The system must allow ANY kind of Excel format, so there is no control over the formatting of the cells, the naming convention, etc.
Some headers are alphanumeric in nature, which means they also contain numbers.
Some cells are merged to make room for a specific header.
Any ideas and suggestions are very much welcome. Just let me know if you have further clarifications.

(I know nothing about Apache, but some about Excel Interop working)
If the sheets to be detected are yours, I'd recommend NAMING those header cells. (To name a cell in Excel, there's a field at the top left of the screen where the cell coordinates normally appear, like "A1" or "B2". Type a name in that place, and you will be able to identify that cell via code by its name. 'Worksheet.Range("Name")' is how you get those cells via code.)
To manage names, go to "Insert - Names" or "Formulas - Name manager", depending on your version of Excel.
(Personally, I never work with sheets via code without naming headers, then I use "Offset" to get the data cells corresponding to those headers - This allows me to freely edit the sheet later without breaking the code)
If the sheets aren't yours, then, you'll need to find out the extents of the data. (Last row and last column)
Then check for the first line that contains all columns filled, none of them blank. That's a probable horizontal header.
Likewise, check for the first column that has all rows filled. That's a probable vertical header.
You could, as well, search for completely blank lines and/or columns to find headers that are AFTER some data, in case of sheets containing multiple horizontal headers, or vertical.
You could use some formatting properties (Range.Interior or Range.Font for examples) of those cells to identify if they are headers (usually headers have different format, color, borders and so on).
If you're sure there's no numeric header, I mean, all headers contain text, check the type of data in the cells. If all are strings, header probability increases.
Even so, that's a tricky thing to do. If sheets don't follow some pattern, once in a while one of them can deceive your code and bring false results. I'd recommend, if allowed, adding a human verification to confirm the results after the process is done.

The solution to this problem involves taking away two of these freedoms. Such constraints applied will make this a tractable problem. Most of such freedoms come from overcautious thinking.
The freedoms are given as quotes below:-
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Typically, vertical headers are not used in Excel files where there is a need to programmatically detect headers, as the primary, most common and sometimes the only reason for such detection is to upload/transform the tabular data.
Funny things happen when vertical headers are introduced:
They become Labels of Forms. This implies that such forms are used for data entry rather than storage. The data from such forms is stored with horizontal/columnar headers and rowwise/vertical records of data, thus obviating the need for Upload/Transformation of the data entry sheet.
Excel is designed to have only horizontal headers. Vertical Headers cease to have autofilter support.
Even when Vertical Headers are present, a top horizontal header row can still be introduced to mark the headers themselves as descriptions / categories.
Staying true, to the core need for autodetection of headers, we can state that once our requirement states that Headers can be placed only in a horizontal alignment, the solution becomes slightly more tractable but not fully so.
Some cells are merged to make room for a specific header.
Merging cells is poison and anathema to the entire reason for transformation/upload of data. This is a pill I steadfastly have refused to take in my entire career with Excel & SQL jugglery. You may kindly merge all that you want to for all I care, however thee shall not pass into my beloved SQL Server.
For the aforementioned reasons of prejudice and ill-will towards all mergers and mergees alike, I'd respectfully suggest that you too take this course.
Solution
Staying true to the above requirements after taking away the 2 freedoms, the pseudo algorithm (solution) is to
Take a sample of, say, c x r Excel cells, e.g. 200 x 201 rows and columns
Find the count of non-empty cells (cells whose contents have a non-zero length) in each row, using an inbuilt formula like COUNTA. The count of such non-empty cells in each row is maintained as a data structure.
The type of data ie:- Number, Date, String should also be maintained in the above data structure capable of expressing the following:
Row# 22 contains
30 non-empty cells of which
28 are alphanumeric,
1 is a Date and
1 is a Number.
The First specific row that contains the maximum number of such non empty cells with the maximum number of strings should very likely be the header row.
Converting all of the above to a specific algorithm in any given language should be a deliciously occupying task for any young developer in their prime.
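As a starting point, the pseudo algorithm above could be sketched like this in Python. It is a rough sketch under a few assumptions, not a definitive implementation: the sheet is already loaded as a list of rows (lists of cell values, with None or "" for empty cells), there are no merged cells or vertical headers, and the function names are made up. Reading the cells (for example with Apache POI, as in the question) is left out.

```python
from datetime import date, datetime


def profile_row(row):
    """Count non-empty cells in a row and break them down by type."""
    counts = {"non_empty": 0, "string": 0, "date": 0, "number": 0}
    for cell in row:
        if cell is None or cell == "":
            continue  # empty cell: zero-length content
        counts["non_empty"] += 1
        if isinstance(cell, str):
            counts["string"] += 1
        elif isinstance(cell, (date, datetime)):
            counts["date"] += 1
        elif isinstance(cell, (int, float)) and not isinstance(cell, bool):
            counts["number"] += 1
    return counts


def guess_header_row(rows):
    """Return the index of the first row with the most non-empty cells,
    using the number of string cells as the tie-breaker."""
    profiles = [profile_row(row) for row in rows]
    return max(
        range(len(profiles)),
        key=lambda i: (profiles[i]["non_empty"], profiles[i]["string"]),
        default=None,  # empty sample: no header row to report
    )


# Hypothetical sample sheet: a banner line, a blank line, header, data.
sheet = [
    ["Quarterly report", None, None],
    [None, None, None],
    ["Name", "Dept", "Joined"],
    ["Jack Black", "A", date(2029, 1, 1)],
    ["Jane Doe", "B", date(2029, 1, 2)],
]

print(guess_header_row(sheet))
print(profile_row(sheet[3]))
```

Because max() returns the first of several equal candidates, ties go to the topmost row, which matches "the first specific row" in the description.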
