Pentaho Kettle - Loading excel with almost blank rows

I receive an Excel file from an uncontrolled source. It contains a row with all the fields filled in, followed by several rows in which every field is blank except one (always the same field, which holds a comment).
The comments belong to the ID of the "row with data".
I would like to create a new field, "COMENTARY AGREGATED", holding the concatenation of all the comments that belong to each ID, but I don't know how to do it. As far as I know, you can't rely on the order of the rows, because they are treated as independent. Am I right that this is impossible inside Kettle, and that I should resort to a VB macro in Excel as a preprocessing step?
Thanks for your time.

You can use a Group by step: group by all fields except the comment field, and under the aggregations choose “concatenate values separated by”, with a space as the separator (or nothing, if you prefer).
The Excel input step can't do all that on its own.

I've made a little progress.
I found that in the Excel input step, on the Fields tab, the Repeat column can be set to Y; if it is, blank cells are filled with the previous row's value.
I still don't know how to aggregate the comments, but it's a step in the right direction, I guess.
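The two suggestions above fit together: first fill the blank ID down from the last "row with data", then concatenate the comments per ID. The logic can be sketched outside Kettle as follows. This is only an illustration in Python, not a Kettle transformation; the row layout `(id, comment)` and the function name `aggregate_comments` are assumptions made for the example.

```python
def aggregate_comments(rows, sep=" "):
    """Concatenate comments per ID.

    rows: list of (row_id, comment) tuples in sheet order, where row_id is
    None on the "almost blank" rows (hypothetical layout for illustration).
    """
    grouped = {}
    current_id = None
    for row_id, comment in rows:
        # Fill-down: a blank ID inherits the ID of the last row with data,
        # which is what Repeat = Y does in the Excel input step.
        if row_id is not None:
            current_id = row_id
        if current_id is None:
            continue  # comment before any ID: nothing to attach it to
        grouped.setdefault(current_id, [])
        if comment:
            grouped[current_id].append(comment)
    # Group-by with a "concatenate" aggregation on the comment field.
    return {key: sep.join(parts) for key, parts in grouped.items()}


rows = [
    (1, "first"),
    (None, "second"),
    (None, "third"),
    (2, "only"),
]
result = aggregate_comments(rows)
print(result)
```

Because the fill-down happens before the grouping, the grouping itself no longer depends on row order.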

Related

Aligning vertically a series of tables with text

Hi I need the text to be in a specific format in a spreadsheet to be able to upload it on a translation tool.
I have already used the text split function to separate the text in a cell with bullet points, moving each bullet point to a separate cell.
Then I used the transpose function to separate each set of data. For context, you are looking at fashion products.
The name of the product is on the first row, followed by a list of features (e.g. "Bracciale" means bracelet and it is followed by the list of materials)
Now for the last step, I need these sets to be vertical, not horizontal: stacked one below the other instead of side by side.
I would like to set up an automatic system so that every time we receive a list with hundreds of these products we do not need to copy-paste them one below the other.
With pivot tables maybe? Keep in mind that if it is too complex it might be hard to train the translators to do it each time. Please let me know your suggestions. Thank you!
I am not a programmer. I tried pivot tables but the data was in the wrong order and I am not sure how to get the data out from the pivot table with values only without the sub-menus.

My suggestion would be to use the 'Unpivot Columns' feature in the Power Query Editor - it would be really simple.
Steps:
Select the whole range
Go to Data // Get & Transform Data // From Table/Range
Uncheck 'My Table has headers' (unless it does - but doesn't look like it?)
Press OK. This will open Power Query Editor and will have actually given you column names Col1/2/3 etc, but ignore that.
Go to Add Column // Index column
Select all columns EXCEPT the new index column by Shift+clicking on those headers
Go to Transform // Unpivot Columns
Assuming the order is important, click in the Attribute column and Sort Ascending
Click in the Index column and Sort Ascending
Remove the Attribute and Index columns if you want (right click header)
Go to File // Close & Load
You will get a new table - dynamically linked to the first (ie. can be updated/refreshed) - in the unpivoted format.
Let me know if you need more details / screenshot?
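For readers who want to see what the unpivot-and-sort steps produce, here is a minimal Python sketch of the same reshaping. It assumes each column holds one product set (name on top, features below) and that blank cells should be dropped, which is what unpivoting does with empty cells. The function name `stack_columns` and the sample data are made up for illustration.

```python
def stack_columns(table):
    """Stack the columns of a table one below the other, skipping blanks.

    table: list of rows (lists); blank cells are None or "".
    """
    width = max((len(row) for row in table), default=0)
    stacked = []
    for col in range(width):          # primary order: column (Attribute)
        for row in table:             # secondary order: row (Index)
            value = row[col] if col < len(row) else None
            if value not in (None, ""):
                stacked.append(value)
    return stacked


# Hypothetical sample: two product sets side by side.
table = [
    ["Bracciale", "Collana"],
    ["Argento", "Oro"],
    ["Pelle", None],
]
stacked = stack_columns(table)
print(stacked)
```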

Based on this trick, maybe the following is helpful:
Formula in A5:
=DROP(REDUCE(0,A1:A3,LAMBDA(a,b,VSTACK(a,TEXTSPLIT(b,,HSTACK(CHAR(10),"^"),1)))),1)
TEXTSPLIT() will use a combination of newline chars and the circumflex to split the input directly into a vertical array;
Iteration in REDUCE() will allow for stacked results;
DROP() the initial value from results.
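To make the behaviour of the formula easier to follow, the same idea can be sketched in Python: split every cell on either delimiter, ignore empty pieces, and append the pieces to one vertical list. This is only an analogy to the Excel formula, and `split_and_stack` is a hypothetical name.

```python
import re


def split_and_stack(cells, delimiters=("\n", "^")):
    """Split each cell on any of the delimiters and stack the pieces."""
    # Build one alternation pattern, escaping characters such as "^".
    pattern = "|".join(re.escape(d) for d in delimiters)
    stacked = []
    for cell in cells:
        # Dropping empty pieces mirrors ignore_empty = 1 in TEXTSPLIT().
        stacked.extend(part for part in re.split(pattern, cell) if part)
    return stacked


cells = ["a\nb^c", "d", "e^^f"]
parts = split_and_stack(cells)
print(parts)
```

No initial value has to be dropped here, because the list starts out empty; in the formula, DROP() is needed only because REDUCE() starts from 0.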

Taking means of irregular amounts data

I'm not able to take the means for a large dataset given that the amount of attributes is irregular.
I have posted a simplified case for the problem. It explains the problem very well.
One idea I came up with: apply a filter to condition on a single attribute. However, I still don't see an efficient way to do this (other than doing it all by hand).
see excel file:
All help is much appreciated.
I'm basically looking for a function/method to achieve taking means of all different attributes conditioned on each person for a large dataset without doing it by hand.

You can use AVERAGEIFS() inside an IF:
=IF(OR(A2<>A1,B2<>B1),AVERAGEIFS(C:C,A:A,A2,B:B,B2),"")
The first part of the IF tests whether the row starts a new group, i.e. whether the person or the attribute has changed. If so, it uses AVERAGEIFS() to return the average of that group; otherwise it returns a blank.
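What the formula computes is a mean per (person, attribute) pair. As a rough cross-check outside Excel, the same grouping can be sketched in Python; the column order `(person, attribute, value)` follows columns A to C of the formula, and the sample rows are invented for illustration.

```python
from collections import defaultdict


def group_means(rows):
    """Return the mean value for every (person, attribute) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for person, attribute, value in rows:
        acc = totals[(person, attribute)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical sample data: irregular number of values per attribute.
rows = [
    ("Ann", "height", 2.0),
    ("Ann", "height", 4.0),
    ("Ann", "weight", 10.0),
    ("Bob", "height", 6.0),
]
means = group_means(rows)
print(means)
```

Unlike the formula, this does not require the rows of a group to be adjacent.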

What you want to do can be accomplished very simply with a pivot table.
Simply select one of the cells inside the range of data you want to process. (See the video for general use of a pivot table: https://www.youtube.com/watch?v=iCiayB6GrpQ )
Go to the Insert tab and insert a pivot table.
Once you have it, simply check people, attribute, and values. Then drag people and attribute into Rows, drag value into the Values window, open the drop-down list and change it from "Sum of value" to "Average", and you should be done. https://i.stack.imgur.com/nYEzw.png

Column to rows and highlight difference between values in the same group

I have a huge table with data structured like this:
And I would like to display them in Spotfire Analyst 7.11 as follows:
Basically I need to display the columns that contain "ANTE" below the others in order to make a comparison. Values that have variations for the same ID must be highlighted.
I also have the fields "START_DATE_ANTE" and "END_DATE_ANTE" which have been omitted in the example image.

Amusingly, if you were limited to just what the title asks, this would be a very simple answer.
If you want this in a table where the rows are displayed as usual and the cells are highlighted, you can do it by going to properties, adding a new Grouping in which you select VAL_1 and VAL_1_ANTE, and adding a Rule with Rule type "Boolean expression", where the value is:
[VAL_1] - [VAL_1_ANTE] <> 0
This will highlight the affected cells, which you can place next to each other. You can even throw in a calculated column showing the difference between the two columns, and slap it on right next to it. This gives you the further option to filter down to only showing rows with discrepancies, or sorting by these values.
However, if you actually need it to display the POSTs on different lines from the ANTEs, as formatted above, things get a little tricky.
My personal preference would be to pivot (split/union/etc) the data before pulling it in to Spotfire, with an indicator flag on "is this different", yes/no. However, I know a lot of Spotfire users either aren't using a database or don't have leeway to perform the SQL themselves.
In fact, if you try to do it in Spotfire using custom expressions alone, it becomes so tricky, I'm not sure how to answer it right off. I'm inclined to think you should be able to do it in a cross table, using Subsets, but I haven't figured out a way to identify which subset you're in while inside the custom expressions.
Other options include generating a table using IronPython, if you're up to that.
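The "pivot the data before pulling it in" option mentioned above could look roughly like the following. This is a plain Python sketch of the reshaping, not Spotfire or IronPython API code. The column names VAL_1, VAL_1_ANTE, and ID come from the question; the KIND and IS_DIFFERENT columns and the function name are assumptions made for the example.

```python
def split_post_ante(rows, value_fields=("VAL_1",)):
    """Turn each wide row into a POST row and an ANTE row with a diff flag."""
    out = []
    for row in rows:
        # Flag the ID if any value differs from its *_ANTE counterpart.
        changed = any(row[f] != row[f + "_ANTE"] for f in value_fields)
        post = {"ID": row["ID"], "KIND": "POST", "IS_DIFFERENT": changed}
        ante = {"ID": row["ID"], "KIND": "ANTE", "IS_DIFFERENT": changed}
        for f in value_fields:
            post[f] = row[f]
            ante[f] = row[f + "_ANTE"]
        out.extend([post, ante])  # ANTE row sits directly below its POST row
    return out


wide = [
    {"ID": 1, "VAL_1": 10, "VAL_1_ANTE": 10},
    {"ID": 2, "VAL_1": 7, "VAL_1_ANTE": 5},
]
long_rows = split_post_ante(wide)
for r in long_rows:
    print(r)
```

With the flag stored as a regular column, highlighting reduces to a simple rule on IS_DIFFERENT.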

Update Excel ListObject header names without breaking pivots

I have an Excel sheet with a very wide table on it. For developer friendliness I'd like to use a certain style of column header naming (much like proper Hungarian notation), where I suffix each header name with "column type" tags. This allows me to easily spot where e.g. apples and oranges are compared. There are also pivot table reports based on this table.
An example to illustrate this: say you have 2 monetary columns, column A being expressed in another currency than column B. The model should thus never combine them without first applying appropriate exchange rates. To spot this I name these columns e.g. Earned - Cur1 and Saved - Cur2. Any calculation like =[#[Earned - Cur1]] + [#[Saved - Cur2]] is illegal, but due to the tags this can be picked up easily in an audit. I have several such tag groups in use already, and they already prevented some errors creeping in.
However...
The file also needs to be distributed to lots of not-so-savvy end users, and they need to fill in this table and refer to some of the outcome columns. Most intermediate columns we already hide, but the column names are now far from being user-friendly (like: fill out Actual - NK/Q1/EC/%, please?).
And this needs to run in Excel 2010.
What are my options?
Option 1
Add an extra row above the table, putting human readable names in there, and just hide the table header row. This works, but now the users can't sort and filter the table anymore, so that's a no-go.
Option 2
Augment option 1 by prepending a newline to each column name, and make the table header row 1 character high. The header cells would still be there to drive sorting and filtering and the users have human readable names in the row above. The actual header cells would appear like 'empty' buttons. Could work, but then the complex formulas become unreadable due to all the newlines from the column names all over the place.
Option 3
Add a macro that swaps the headers in the table with alternative headers from another row above the table. The macro should be run just before sending out the file to the users, and run again when they return it filled in. I happily coded this option into the file, and it works wonderfully! But then I realized this (and thus option 2 as well) breaks all the derived pivot tables, since Excel links the data by the names used in the table: update the name, and that section of the pivot will be dropped...
I'd really like the option of having our development-oriented column names in there when we ourselves work with the file, but being able to switch out the headers when needed. And of course without rebuilding all the pivots after each such switch.
An opening here would be that pivots seem to only drop the columns once they're refreshed. I could use this to update the header names, then do some magic on the pivots to remap their fields, and only then refresh them, but it seems there's no way from within VBA to accomplish that (PivotField.SourceName is read only).
Hopefully someone can think of an alternative, or am I SOL? I'm totally open to other workarounds.
Workaround 1
Insert null-terminating characters in the header names such that they show normally in the formulas, but do not show in the table header row. If only it were that simple though... Turns out Excel chokes on =Char(0)&"abc", and things like =Char(8)&"abc" (backspace, anyone?) give Unicode replacement characters when pasted into a header cell... (?)
Workaround 2
A last resort seems to be to unzip the excel file, and plough through the xml data to update everything in one go there, then rezip the file. But this code also needs to be executed by less skilled users, and I see too many ifs and buts to make me feel safe using this setup.
Workaround 3
For now I just use a variation on option 2; I have some VBA that 'empties' the header cells instead of prepending a newline to them. By 'emptying' I mean setting the font size to 1, subscript, non-bold, and then making the font color identical to the background color, followed by setting the row height to the default 14.5. The cryptic names do leak out, however: the column header drop-down arrows for sorting and filtering show the cryptic name, as do the pivot field settings and of course the formula bar when you click such a cell. But I guess it's the best I can do?
And then again, I'm probably being far too much of a perfectionist about this :) But from this point on it's about the challenge!

Make sure you tick the box "Add this data to the Data Model" when creating your pivot(s).
AFAIK, when your pivots are connected to the Data Model instead of directly to the range/table, you can change the column names in the table and your pivots will stay fine. You could even use other names in your pivot.

Reading an Excel sheet using ADO/ODBC in Delphi 7

I'm trying to read an Excel sheet from an XLS or XLSX file in memory using Delphi 7. When possible I use automation to read the cells one by one, but when Excel is not installed, I revert to using the ADO/ODBC Jet driver.
I connect using either
Provider=Microsoft.Jet.OLEDB.4.0; Data Source=file.xls;Extended Properties="Excel 8.0;Persist Security Info=False;IMEX=1;HDR=No";
Provider=Microsoft.ACE.OLEDB.12.0; Data Source=file.xlsx;Extended Properties="Excel 12.0;Persist Security Info=False;IMEX=1;HDR=No";
My problem then is that when I use the following query:
SELECT * FROM [SheetName$]
the returned results do not contain the empty rows or empty columns, so if the sheet contains such rows or columns, the following cells are shifted and do not end up in their correct position. I need the sheet to be loaded "as is", i.e. I need to know exactly which cell position each value comes from.
I tried to read the cells one by one by issuing one query of the form
SELECT F1 FROM `SheetName$A1:A1`
but now the driver returns an error saying "There is data outside the selected region". By the way, I had to use backticks to enclose the name, because using brackets like this [SheetName$A1:A1] gave a syntax error message.
Is there a way to tell the driver to select the sheet as-is, without skipping blanks? Or maybe a way to know which cell position each value is returned from?
For internal policy reasons (I know they are bad but I do not decide these), it is not possible to use a third party library, I really need this to work from standard Delphi 7 components.

I assume that if your data is say in the range B2:D10 for example, you want to include the column A as an empty column? Maybe? Is that correct? If that's the case, then your data set, when you read the sheet (SELECT * FROM [SheetName$]) would also return 1 million rows by 16K columns!
Can you not execute a query like: SELECT * FROM [SheetName$B2:D10] and use the ADO GetRows function to get an array - which will give you the size of the data. Then you can index into the array to get what data you want?
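If an explicit range is queried, the position of each value follows from the range's top-left corner plus the array indices. The arithmetic is sketched below in Python rather than Delphi, purely as an illustration. It assumes the array is indexed as (field, record), which is how ADO's GetRows is documented, and that the driver returned every cell of the range without skipping any, which, as noted elsewhere in this thread, is not guaranteed. The function names are hypothetical.

```python
def column_letters(col_number):
    """Convert a 1-based column number to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while col_number > 0:
        col_number, rem = divmod(col_number - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def cell_address(field_idx, record_idx, first_col=2, first_row=2):
    """Map 0-based (field, record) indices to a cell address.

    first_col / first_row give the top-left cell of the queried range
    (defaults correspond to B2, as in the B2:D10 example above).
    """
    return f"{column_letters(first_col + field_idx)}{first_row + record_idx}"


print(cell_address(0, 0))  # top-left cell of the range
print(cell_address(2, 8))  # bottom-right cell of B2:D10
```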

OK, the correct answer is: use a third party library no matter what your boss says. Do not even try ODBC/ADO to load arbitrary Excel files; you will hit a wall sooner or later.
It may work for excel files that contain a single data table, but not when you want to cherry pick data in a sheet primarily made for human consumption (ie where a single column contains some cells with introductory text, some with numerical data, some with comments, etc...)
Using IMEX=1 ignores empty lines and empty columns
Using IMEX=0 sometimes no longer ignores empty lines, but now some of the first non-empty cells are treated as field names instead of data, despite HDR=No. It would not work anyway, since values in a column are of mixed types.
Explicitly looping across cells and making a SELECT * FROM [SheetName$A1:A1] works until you reach an empty cell, then you get access violations (see below)
Access violation at address 1B30B3E3 in module 'msexcl40.dll'. Read of address 00000000
I'm too old to want to try and guess the appropriate value to use so it works until someone comes with yet another mix of data in a column. Sorry for having wasted everybody's time.
