I have an Excel sheet with a very wide table on it. For developer friendliness I'd like to use a certain style of column header naming (much like proper Hungarian notation), where I suffix each header name with "column type" tags. This allows me to easily spot where e.g. apples and oranges are compared. There are also pivot table reports based on this table.
An example to illustrate this: say you have 2 monetary columns, column A being expressed in a different currency than column B. The model should thus never combine them without first applying appropriate exchange rates. To spot this I name these columns e.g. Earned - Cur1 and Saved - Cur2. Any calculation like =[#[Earned - Cur1]] + [#[Saved - Cur2]] is illegal, but thanks to the tags this can be picked up easily in an audit. I have several such tag groups in use already, and they have already prevented some errors from creeping in.
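To make the audit idea concrete, here is a minimal Python sketch of a check that flags formulas mixing tags. It assumes the tag is whatever follows the last " - " in a header name and that column references look like the [#[...]] notation used above; the function names are made up for illustration and this is not part of the workbook.

```python
import re

# Matches column references written as [#[Header - Tag]], the notation used in the post.
COLUMN_REF = re.compile(r"\[#\[([^\]]+)\]\]")


def tags_in_formula(formula):
    """Return the set of type tags carried by the column references in a formula."""
    tags = set()
    for name in COLUMN_REF.findall(formula):
        # Assumption: the tag is the part after the last " - " in the header name.
        if " - " in name:
            tags.add(name.rsplit(" - ", 1)[1])
    return tags


def mixes_tags(formula):
    """True when a formula references columns carrying more than one distinct tag."""
    return len(tags_in_formula(formula)) > 1
```

Running mixes_tags over every formula in the table would give a list of candidates to review by hand. A legitimate conversion formula would also be flagged, so this is only a first filter.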
However...
The file also needs to be distributed to lots of not-so-savvy end users, and they need to fill in this table and refer to some of the outcome columns. Most intermediate columns we already hide, but the column names are now far from being user-friendly (like: fill out Actual - NK/Q1/EC/%, please?).
And this needs to run in Excel 2010.
What are my options?
Option 1
Add an extra row above the table, putting human readable names in there, and just hide the table header row. This works, but now the users can't sort and filter the table anymore, so that's a no-go.
Option 2
Augment option 1 by prepending a newline to each column name, and make the table header row 1 character high. The header cells would still be there to drive sorting and filtering and the users have human readable names in the row above. The actual header cells would appear like 'empty' buttons. Could work, but then the complex formulas become unreadable due to all the newlines from the column names all over the place.
Option 3
Add a macro that swaps the headers in the table with alternative headers from another row above the table. The macro should be run just before sending out the file to the users, and run again when they return it filled in. I happily coded this option into the file, and it works wonderfully! But then I realized this (and thus option 2 as well) breaks all the derived pivot tables, since Excel links the data by the names used in the table: update a name, and that section of the pivot will be dropped...
I'd really like the option of having our development-oriented column names in there when we ourselves work with the file, but being able to switch out the headers when needed. And of course without rebuilding all the pivots after each such switch.
An opening here would be that pivots seem to only drop the columns once they're refreshed. I could use this to update the header names, then do some magic on the pivots to remap their fields, and only then refresh them, but it seems there's no way from within VBA to accomplish that (PivotField.SourceName is read only).
Hopefully someone can think of an alternative, or am I SOL? I'm totally open to other workarounds.
Workaround 1
Insert null-terminating characters in the header names such that they show normally in the formulas, but do not show in the table header row. If only it were that simple though... Turns out Excel throws up on a =Char(0)&"abc", and things like =Char(8)&"abc" (backspace, anyone?) give Unicode replacement characters when pasted into a header cell... (?)
Workaround 2
A last resort seems to be to unzip the excel file, and plough through the xml data to update everything in one go there, then rezip the file. But this code also needs to be executed by less skilled users, and I see too many ifs and buts to make me feel safe using this setup.
Workaround 3
For now I just use a variation on option 2; I have some VBA that 'empties' the header cells instead of prepending a newline to them. By 'emptying' I mean setting the font size to 1, subscript, non-bold, and then making the font color identical to the background color, followed by setting its row height to the default 14.5. The cryptic names do leak out however; the column header drop-down arrows for sorting and filtering show the cryptic name, as do the pivot field settings and of course the formula bar when you just click such a cell. But I guess it's the best I can do?
And then again I'm probably just perfecting this thing far too much :) But from this point on it's about the challenge!
Make sure you tick the box "Add this data to the Data Model" when creating your pivot(s).
AFAIK when your pivots are connected to the Data Model instead of directly to the range/table, you can change the column names in the table and your pivots will stay fine. You could even use other names in your pivot.
I have a huge table with data structured like this:
And I would like to display them in Spotfire Analyst 7.11 as follows:
Basically I need to display the columns that contain "ANTE" below the others in order to make a comparison. Values that have variations for the same ID must be highlighted.
I also have the fields "START_DATE_ANTE" and "END_DATE_ANTE" which have been omitted in the example image.
Amusingly, if you were limited to just what the title asks, this would be a very simple answer.
If you wanted this in a table where the rows are displayed as usual, and the cells are highlighted, you can do this by going to Properties, adding a new Grouping where you select VAL_1 and VAL_1_ANTE, and adding a Rule of Rule type "Boolean expression", where the value is:
[VAL_1] - [VAL_1_ANTE] <> 0
This will highlight the affected cells, which you can place next to each other. You can even throw in a calculated column showing the difference between the two columns, and slap it on right next to it. This gives you the further option to filter down to only showing rows with discrepancies, or sorting by these values.
However, if you actually need it to display the POSTs on different lines from the ANTEs, as formatted above, things get a little tricky.
My personal preference would be to pivot (split/union/etc) the data before pulling it in to Spotfire, with an indicator flag on "is this different", yes/no. However, I know a lot of Spotfire users either aren't using a database or don't have leeway to perform the SQL themselves.
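For anyone who can pre-process outside Spotfire, here is a rough Python sketch of that split with a difference flag. It assumes each row is a dict with an ID plus value columns that each have an _ANTE twin (as with VAL_1 / VAL_1_ANTE above); the KIND and DIFFERENT column names are invented for illustration.

```python
def split_post_ante(rows, value_columns):
    """Turn each wide row into a POST line and an ANTE line, flagging rows that differ."""
    out = []
    for row in rows:
        # A row counts as different when any value column disagrees with its _ANTE twin.
        different = any(row[c] != row[c + "_ANTE"] for c in value_columns)
        post = {"ID": row["ID"], "KIND": "POST", "DIFFERENT": different}
        ante = {"ID": row["ID"], "KIND": "ANTE", "DIFFERENT": different}
        for c in value_columns:
            post[c] = row[c]
            ante[c] = row[c + "_ANTE"]
        # POST first, ANTE directly below it, as in the requested layout.
        out.extend([post, ante])
    return out
```

The resulting table has one line per ID and KIND, so the ANTE values sit directly under the current ones and the DIFFERENT flag can drive the highlighting rule.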
In fact, if you try to do it in Spotfire using custom expressions alone, it becomes so tricky, I'm not sure how to answer it right off. I'm inclined to think you should be able to do it in a cross table, using Subsets, but I haven't figured out a way to identify which subset you're in while inside the custom expressions.
Other options include generating a table using IronPython, if you're up to that.
I got an Excel file from an uncontrolled source. It comes with a row that has all the fields filled in, followed by several rows where every field is blank except one (always the same one, a comment).
The comments belong to the ID of the "row with data".
I would like to make a new field "COMENTARY AGREGATED" with the concatenation of all the comments that belong to the ID, but I don't know how to do it. As far as I know, you can't interact with the order of the rows, as they are treated as independent. Am I right that this is impossible to do inside Kettle, and that I should resort to a VB macro in Excel as a preprocessing step?
Thanks for your time.
You can use a group by step, group by all fields except the comment one, and on aggregations choose “concatenate values separated by” and use a whitespace as value for the concatenation ( or nothing if you prefer).
The excel input can’t do all that on its own.
For now I've advanced a little.
I found that in the Excel input step, in the Fields tab, the Repeat column can be set to Y, and if so, it fills the blank rows with the previous value.
I still don't know how to aggregate the others, but it's a step in the right direction I guess.
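To show how filling down and grouping fit together, here is a small Python sketch of the same logic outside Kettle. The field names ID and COMMENT are placeholders and the rows are assumed to arrive in sheet order; it only illustrates the data flow and is not Kettle code.

```python
def aggregate_comments(rows, key="ID", comment="COMMENT", sep=" "):
    """Fill the key down over blank rows, then join the comments per key."""
    grouped = {}
    current = None
    for row in rows:
        # Fill-down: a blank key means the row belongs to the last row that had one.
        if row.get(key) not in (None, ""):
            current = row[key]
        grouped.setdefault(current, [])
        text = row.get(comment)
        if text:
            grouped[current].append(text)
    # Group-by with "concatenate values separated by" sep.
    return {k: sep.join(v) for k, v in grouped.items()}
```

In Kettle terms, the fill-down part corresponds to Repeat = Y on the key fields and the final join corresponds to the Group by step with a concatenating aggregation.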
I'm trying to read an Excel sheet from an XLS or XLSX file in memory using Delphi 7. When possible I use automation to read the cells one by one, but when Excel is not installed, I revert to using the ADO/ODBC Jet driver.
I connect using either
Provider=Microsoft.Jet.OLEDB.4.0; Data Source=file.xls;Extended Properties="Excel 8.0;Persist Security Info=False;IMEX=1;HDR=No";
Provider=Microsoft.ACE.OLEDB.12.0; Data Source=file.xlsx;Extended Properties="Excel 12.0;Persist Security Info=False;IMEX=1;HDR=No";
My problem then is that when I use the following query:
SELECT * FROM [SheetName$]
the returned results do not contain the empty rows or empty columns, so if the sheet contains such rows or columns, the following cells are shifted and do not end up in their correct position. I need the sheet to be loaded "as is", i.e. I need to know exactly which cell position each value comes from.
I tried to read the cells one by one by issuing one query of the form
SELECT F1 FROM `SheetName$A1:A1`
but now the driver returns an error saying "There is data outside the selected region". btw I had to use backticks to enclose the name because using brackets like this [SheetName$A1:A1] gave out a syntax error message.
Is there a way to tell the driver to select the sheet as-is, without skipping blanks? Or maybe a way to know from which cell position each value is returned?
For internal policy reasons (I know they are bad but I do not decide these), it is not possible to use a third party library, I really need this to work from standard Delphi 7 components.
I assume that if your data is say in the range B2:D10 for example, you want to include the column A as an empty column? Maybe? Is that correct? If that's the case, then your data set, when you read the sheet (SELECT * FROM [SheetName$]) would also return 1 million rows by 16K columns!
Can you not execute a query like: SELECT * FROM [SheetName$B2:D10] and use the ADO GetRows function to get an array - which will give you the size of the data. Then you can index into the array to get what data you want?
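If you go that route, the bookkeeping is just offset arithmetic. Below is a minimal Python sketch (not Delphi) of mapping a position inside the queried range back to a cell address. It assumes the driver returns every row and column of the explicit range, which you would have to verify, and the helper names are made up. Note that ADO's GetRows returns the array with the field index first and the record index second.

```python
import re


def column_letters(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def column_index(letters):
    """Convert Excel column letters to a 1-based index (A -> 1, AA -> 27)."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def address_of(range_ref, row_offset, col_offset):
    """Address of the cell at 0-based (row_offset, col_offset) inside a range like B2:D10."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", range_ref)
    if match is None:
        raise ValueError("expected a range like B2:D10")
    first_col = column_index(match.group(1))
    first_row = int(match.group(2))
    return f"{column_letters(first_col + col_offset)}{first_row + row_offset}"
```

With a GetRows array indexed as data[field][record], the cell for data[f][r] would be address_of("B2:D10", r, f).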
OK, the correct answer is
Use a third party library no matter what your boss says. Do not even try ODBC/ADO to load arbitrary Excel files, you will hit a wall sooner or later.
It may work for excel files that contain a single data table, but not when you want to cherry pick data in a sheet primarily made for human consumption (ie where a single column contains some cells with introductory text, some with numerical data, some with comments, etc...)
Using IMEX=1 ignores empty lines and empty columns
Using IMEX=0 sometimes no longer ignores empty lines, but now some of the first non-empty cells are treated as field names instead of data, despite HDR=No. It would not work anyway, since the values in a column are of mixed types.
Explicitly looping across cells and making a SELECT * FROM [SheetName$A1:A1] works until you reach an empty cell, then you get access violations (see below)
Access violation at address 1B30B3E3 in module 'msexcl40.dll'. Read of address 00000000
I'm too old to want to try and guess the appropriate value to use so it works until someone comes with yet another mix of data in a column. Sorry for having wasted everybody's time.
I have the following problem:
A datasheet with a column (HOUR) and another column (AM/PM). Entries in the first column consist of 1,2,3,4,5,6,7,8,9,10,11, or 12, the second column consists of 'AM's or 'PM's. Together they define the time of an incident (regarding the problem below, note that I am not allowed to create a new column in the source datasheet or change existing columns). Formulas 1.) to 3.) below work fine for getting '1's or '0's for incidents that happened either between 8AM and 4PM, or outside of this time window, as long as I create a new column somewhere.
1.) =IF(AND(A1>=8, A1<=11),IF(B1="AM",1,0),0) + IF(AND(A1>=1, A1<=4),IF(B1="PM",1,0),0) + IF(AND(A1=12),IF(B1="PM",1,0),0)
2.) =--OR(AND(A1>=8, B1="AM", A1<>12), AND(OR(A1<=4, A1=12), B1="PM"))
3.) =--OR(AND(OR(A1={8,9,10,11}),B1="AM"), AND(OR(A1={1,2,3,4,12}), B1="PM"))
However, I want the "1"s to be summarized - without creating an extra column - as calculated field in a pivot table. While excel doesn't accept the 3.) formula at all in the calculated field option, excel accepts 1.) and 2.), but puts out only "0"s in all pivot cells. The below is one of the formulas that puts out only "0"s in the pivot table.
=--OR(AND(HOUR>=8,'AM/PM'="AM",HOUR<>12), AND(OR(HOUR<=4,HOUR=12),'AM/PM'="PM"))
The field value settings don't make a difference, and the new fields created with 1.) or 2.) cannot be filtered for "1"s or "0"s, so something must be wrong with the field calculation I guess. Does anybody know what I need to change to make it work? Are there special rules for formulas in pivot tables that apply to formulas 1.) and 2.) to make them work?
Thanks for any help on this
I think the limitation is not you, but Excel.
See here for description of what is possible, as well as this question
I tried your code and indeed I see it's not working. Even with a simple IF formula it doesn't seem to work. I think it is explicitly called a calculated field because you are only able to calculate on the fields as they appear in the Sum/Total/Count etc. column.
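One likely explanation, going by how Excel documents calculated fields: the formula is evaluated on the summed value of each field, not row by row, and a text field such as AM/PM sums to 0. Here is a small Python sketch of the difference, using the logic of formula 2.) from the question; the sample rows are made up.

```python
def in_window(hour, ampm):
    """Mirror of formula 2.): 1 for hours 8-11 AM or 12, 1-4 PM, else 0."""
    morning = hour >= 8 and ampm == "AM" and hour != 12
    afternoon = (hour <= 4 or hour == 12) and ampm == "PM"
    return int(morning or afternoon)


# Made-up sample data: (HOUR, AM/PM) per incident.
rows = [(9, "AM"), (2, "PM"), (11, "PM"), (12, "PM")]

# What a helper column does: evaluate per row, then add up the 1s.
per_row_total = sum(in_window(hour, ampm) for hour, ampm in rows)

# What a calculated field does (assumption based on Excel's documented behaviour):
# each field is summed first, and the text field contributes 0.
summed_hour = sum(hour for hour, _ in rows)
summed_ampm = 0
calculated_field = in_window(summed_hour, summed_ampm)
```

Because the comparison against "AM" or "PM" is made on a number, both branches are always false, which would explain the "0"s in every pivot cell. A per-row result needs a helper column in the source, or a copy of the source if the original may not be touched.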
Have a look at MS, this function is quite limited.
I would try to make another work-around to accomplish your goal.
I hope someone can help me come up with an algorithm.
I'm still very new to Apache POI and I was assigned to come up with an algorithm to read a template (Excel) and extract the headers/column names from the data itself.
The following must be taken into account:
There can be multiple headers/column names in just one sheet of an Excel file.
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Headers don't necessarily have to be in the very first row of the file. There could be introductions or banner images there.
The system must allow ANY kind of Excel format, so there is no control over the formatting of the cells, the naming convention, etc.
Some headers are alphanumeric in nature, which means it also contains numbers.
Some cells are merged to make room for a specific header.
Any ideas and suggestions are very much welcome. Just let me know if you have further clarifications.
(I know nothing about Apache, but some about Excel Interop working)
If the sheets to be detected are yours, I'd recommend NAMING those header cells. To name a cell in Excel, use the field at the top left of the screen, where the cell coordinates normally appear (like "A1" or "B2"). Type a name in that field, and you will be able to identify that cell via code by its name ('Worksheet.Range("Name")' is how you get those cells via code).
To manage names, go to "Insert - Names" or "Formulas - Name manager", depending on your version of Excel.
(Personally, I never work with sheets via code without naming headers; I then use "Offset" to get the data cells corresponding to those headers. This allows me to freely edit the sheet later without breaking the code.)
If the sheets aren't yours, then, you'll need to find out the extents of the data. (Last row and last column)
Then check for the first row in which all columns are filled, none of them blank. That's a probable horizontal header.
Likewise, check for the first column in which all rows are filled. That's a probable vertical header.
You could, as well, search for completely blank lines and/or columns to find headers that are AFTER some data, in case of sheets containing multiple horizontal headers, or vertical.
You could use some formatting properties (Range.Interior or Range.Font for examples) of those cells to identify if they are headers (usually headers have different format, color, borders and so on).
If you're sure there are no numeric headers, i.e. all headers contain text, check the type of data in the cells. If all are strings, the header probability increases.
Even so, that's a tricky thing to do. If the sheets don't follow some pattern, once in a while one of them can deceive your code and give false results. I'd recommend, if allowed, adding a human verification step to confirm the results after the process is done.
The solution to this problem involves taking away two of these freedoms. Such constraints applied will make this a tractable problem. Most of such freedoms come from overcautious thinking.
The freedoms are given as quotes below:-
Headers can be horizontal AND/OR vertical in nature. This means that there could be a mixture of vertical and horizontal headers in one sheet.
Typically, vertical headers are not used in Excel files where headers need to be detected programmatically, as the primary, most common and sometimes the only reason for such detection is to upload/transform the tabular data.
Funny things happen when vertical headers are introduced:
They become labels of forms. This implies that such forms are used for data entry rather than storage. The data from such forms is stored under horizontal/columnar headers as row-wise records of data, thus obviating the need for upload/transformation of the data entry sheet.
Excel is designed to have only horizontal headers. Vertical Headers cease to have autofilter support.
Even when Vertical Headers are present, a top horizontal header row can still be introduced to mark the headers themselves as descriptions / categories.
Staying true to the core need for autodetection of headers, we can state that once our requirement says headers can only be placed in a horizontal alignment, the solution becomes slightly more tractable, but not fully so.
Some cells are merged to make room for a specific header.
Merging cells is poison and anathema to the entire reason for transformation/upload of data. This is a pill I steadfastly have refused to take in my entire career with Excel & SQL jugglery. You may kindly merge all that you want to for all I care, however thee shall not pass into my beloved SQL Server.
For the aforementioned reasons of prejudice and ill-will towards all mergers and mergees alike, I'd respectfully suggest that you too take this course.
Solution
Staying true to the above requirements after taking away the 2 freedoms, the pseudo algorithm (solution) is to:
Take a sample of, say, r x c cells, e.g. 200 x 201 rows and columns.
Count the non-empty cells (those whose contents have a non-zero length) in each row, using a built-in function like COUNTA. Keep the count for each row in a data structure.
The type of data ie:- Number, Date, String should also be maintained in the above data structure capable of expressing the following:
Row# 22 contains
30 non-empty cells of which
28 are alphanumeric,
1 is a Date and
1 is a Number.
The First specific row that contains the maximum number of such non empty cells with the maximum number of strings should very likely be the header row.
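The steps above can be sketched roughly as follows in Python. This assumes the sampled cells have already been read into a list of rows (for instance via Apache POI or any other reader), with merged cells and vertical headers ruled out as argued above; the function names and the tie-breaking rule are illustrative choices, not a definitive implementation.

```python
from datetime import date, datetime


def classify(value):
    """Classify a cell value as 'empty', 'string', 'date' or 'number'."""
    if value is None or (isinstance(value, str) and value.strip() == ""):
        return "empty"
    if isinstance(value, (date, datetime)):
        return "date"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def profile_rows(grid):
    """Per row, count the non-empty cells and how many are strings, dates or numbers."""
    profiles = []
    for row in grid:
        counts = {"string": 0, "date": 0, "number": 0}
        for value in row:
            kind = classify(value)
            if kind != "empty":
                counts[kind] += 1
        counts["non_empty"] = counts["string"] + counts["date"] + counts["number"]
        profiles.append(counts)
    return profiles


def likely_header_row(grid):
    """0-based index of the first row with the most non-empty cells, ties broken by string count."""
    profiles = profile_rows(grid)
    if not profiles:
        return None
    # max() returns the first maximal item, which gives the "first specific row" behaviour.
    return max(
        range(len(profiles)),
        key=lambda i: (profiles[i]["non_empty"], profiles[i]["string"]),
    )
```

On a sheet with a banner line, a blank row, a header row and data rows below it, the header row wins because it is as full as the data rows but consists entirely of strings.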
Converting all of the above to a specific algorithm in any given language should be a deliciously occupying task for any young developer in their prime.