Finding the cause of Excel file corruption - excel

I have a feature that downloads things to an xls file using Apache POI. Mostly it works. But on one particular database, the resulting files are corrupted and won't open in Excel. I get the message "We found a problem with some content in 'DownloadFoo.xls'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes." Clicking Yes results in all the formatting, data validation, etc. being stripped out. On the other hand, if I open the file in Open Office Calc and save it, it's fine and can be opened in Excel from then on. (The people who want to use these files aren't allowed to download Open Office Calc, so this is not considered an acceptable workaround.)
I have tried narrowing it down to see which data is causing the problem, but it seems to occur whenever 10 or more items are downloaded, regardless of which items they are. (On other databases, it's fine to download 100+). Excluding some of the columns helps, but they are perfectly innocuous looking columns (and virtually identical to other columns which are fine) so this still hasn't got me to the bottom of it.
Are there any techniques I could use to find out what Excel has a problem with in the corrupted spreadsheets?
I can't make major changes like getting it to download to xlsx instead as this feature is going to be scrapped and replaced with something completely different in the near future, so I'd like to just focus on the problem at hand.

It turns out that the solution to the problem was to reset the data validation lists more often. Quite a lot of the cells in my spreadsheet have data validation. When the data validation lists are longer, they are stored on a hidden sheet. If several cells need the same validation, I try to get them referencing the same list in order to not write out too much stuff on the hidden sheet. However, Excel apparently dislikes it when too many cells reference the same list; it's not against the rules as far as I can tell, but it doesn't like it anyway. When I changed it to rewrite the validation lists for every 5 items, it started working.
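The batching can be sketched independently of Apache POI. This is only an illustration of the bookkeeping, not the code I actually used: the function name, the item ids and the idea of a numbered "list copy" on the hidden sheet are all made up for the example, and only the batch size of 5 comes from what worked for me.

```python
from typing import Dict, List


def assign_validation_lists(item_ids: List[str], items_per_list: int = 5) -> Dict[str, int]:
    """Map each item to the index of the hidden-sheet list copy its cells should reference.

    Hypothetical helper: a new copy of the validation list is started
    every `items_per_list` items, so no single copy is shared by too many cells.
    """
    if items_per_list < 1:
        raise ValueError("items_per_list must be at least 1")
    return {
        item_id: position // items_per_list
        for position, item_id in enumerate(item_ids)
    }


# Twelve made-up items, batched five at a time.
assignment = assign_validation_lists([f"item{n}" for n in range(12)])
copies_needed = max(assignment.values()) + 1
```

Whatever writes the spreadsheet would then write `copies_needed` copies of each list to the hidden sheet and point each item's cells at the copy given by `assignment`.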
The reason this database was different was that the items had an unusually high number of subitems, so they occupied a lot of rows even though it didn't seem like many things were being downloaded. Some of the problem columns just had true or false validation rather than using the lists on the hidden sheet, so I don't know what that was about, but resetting the validation lists helped anyway.
This doesn't really answer my question as I never managed to get any information from Excel about what the problem was, or use a particular technique, it was just a series of coincidental findings. I'm putting it here anyway in case anyone else has a similar problem. Also the thing that started me on the right track was finding an old comment when double checking that it doesn't do anything different for over 10 items (it doesn't) in response to Andrew Morton's comment, so thanks Andrew!

Related

VLOOKUP Excel returns #N/A for my boss, but not for me

I got tasked to take over some statistics for management at my company. These statistics are placed on a server that anyone from management has access to and me. We are generating a monthly report based on other reports that are placed on the same server.
Here comes the problem:
The values are generated automatically based on a VLOOKUP formula, which just checks whether the data from the other files has a certain value. When you open the report, you get a prompt asking whether you want to update the external links. For me, no matter what I pick, I get no #N/A values. However, when my boss tries to open the same report, no matter what he picks (either update links or not), he always gets #N/A in the cells.
Because of the contract that I signed, I cannot show you any code, however, I can tell you what I have tried:
At first, I thought that my boss has lost (somehow) some permissions with the external files. But this is not the case, we tested every file and he can access all of them, even being marked as an administrator in some cases.
Secondly, I compared the report from the current month with one from two months ago (which works for him, no #N/A there) and went through all of the formulas. No luck here either: the formulas are all the same except for the external file name, which changes each month (I did this step with the help of Spreadsheet Compare).
Third, I thought that maybe some macro would be running in the background (even if that makes no sense because these are .xlsx files). No luck here either, there is no VBA code written in any of the sheets of the workbook.
I got no ideas left. He has all the permissions, even more than I do, there is no difference between an old report which works just fine, and this one that is stuck, and there are no running macros. Any ideas?
Edit:
I can give you a sample of the formula, but will replace the actual path with a fake one since I cannot show this proper code:
=VLOOKUP(look_up_value, 'O:\fakepath\[file.xlsx]Report'!$C:$AZ, MATCH('Lookup Data'!$B$3, 'O:\fakepath\[file2.xlsx]Report'!$5:$5, 0) - 2, FALSE)
Also, I might also add details that were asked below here:
The language settings and DateTime format are the same for both me and my boss
Both Excel installations are the same version, including the same display language
The mapping of the external files is the same for both of us, since they are on a server and we both have the drive letter for the server "O"
No definitive answer, since the question does not have enough details or sample formulas, but some ideas:
Do the 2 client machines have the same language settings (Windows regional settings)? That may impact dates and other factors.
Are the versions of Excel identical (including language)?
External data: if there are external links, do the 2 client machines have the same drive mapping? (It could be P: on yours and R: on your boss's PC.)
EDIT: Do you both have the same rights to the source folder?

How do I return conditional formatting properties without adding additional rules in Excel VBA?

If a cell has conditional formatting that uses an Icon Set (my current situation is using the Traffic Light Icon Set), is there a way to identify in VBA what particular icon is showing in that cell?
The motivation behind it is that it will correspond to a red/amber/green value which I'm exporting in a SQL statement, so I need to find it in VBA.
I can add new rules and select icon sets just fine:
Set Newiconset = Range("H3").FormatConditions.AddIconSetCondition
It's returning the properties of an existing set of rules that has me hung up.
Thanks for your help - I scoured StackOverflow for a solution and couldn't find it. If someone's solved this, let me know and I'll gladly remove my question.
Bad news: what I'm looking to do technically isn't possible.
Here's why:
Excel data is stored in XML files inside a main Zip file (you can experiment with this by renaming an xlsx file to zip and opening it). When you finally find your workbook's data in there, you can see that the conditional formatting is stored as the actual conditions themselves, with the range values and such. Excel then takes those and computes the result on the fly every time you look at that file. Unfortunately, the resulting states are not saved when saving the file. It's worth noting though that the current state of formulas is stored - I'm assuming this is how accessing values from external workbooks is handled.
This explains why you can set and read the rules just fine, but since there's nothing officially to read a value from you can't "get" the data.
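If you want to see this for yourself without unzipping by hand, here is a rough sketch in Python (not VBA) that lists the conditional formatting rules stored in one worksheet part. The part name `xl/worksheets/sheet1.xml` is the usual default but is an assumption here, and the package built at the bottom is a tiny stand-in for demonstration, not a workbook Excel would open.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Main SpreadsheetML namespace used by worksheet parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def conditional_format_rules(xlsx_bytes, sheet_part="xl/worksheets/sheet1.xml"):
    """Return (cell range, rule type) pairs stored in one worksheet part."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as package:
        root = ET.fromstring(package.read(sheet_part))
    rules = []
    for block in root.iter(f"{{{MAIN_NS}}}conditionalFormatting"):
        for rule in block.iter(f"{{{MAIN_NS}}}cfRule"):
            rules.append((block.get("sqref"), rule.get("type")))
    return rules


# Stand-in package holding a single worksheet part with one icon set rule on H3.
SHEET_XML = (
    f'<worksheet xmlns="{MAIN_NS}"><sheetData/>'
    '<conditionalFormatting sqref="H3">'
    '<cfRule type="iconSet" priority="1">'
    '<iconSet iconSet="3TrafficLights1">'
    '<cfvo type="percent" val="0"/>'
    '<cfvo type="percent" val="33"/>'
    '<cfvo type="percent" val="67"/>'
    '</iconSet></cfRule></conditionalFormatting></worksheet>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as package:
    package.writestr("xl/worksheets/sheet1.xml", SHEET_XML)

rules = conditional_format_rules(buffer.getvalue())
```

Note what comes back: the range and the rule definition with its thresholds, but nothing saying which icon is currently displayed in the cell.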

how to display data that is related to a specific cell in excel 2010?

I have created 2 columns, the first has a category of a system using data validation, and the second has the description and failures of that system.
The purpose of that is to open a malfunction on some parts.
In a different sheet I wish to do the same only this time I want to choose the system and the description will automatically appear in the next column showing me all the malfunctions I have written on this system.
I am not very good with all the functions of Excel, but I still searched for one that might help me. I have tried using the DGET function, but it got me nowhere.
Perhaps try the solution here - it's a bit tricky to explain without copy-pasting the whole thing:
https://superuser.com/questions/536234/excel-how-to-vlookup-to-return-multiple-values
Also take a look at vlookup() if you're working across spreadsheets.
As expected, all of the responses you've seen here - and probably elsewhere - are pointers to VLookup, or a refusal to answer your question.
I'm guessing that you're using DGET() because you need to retrieve data from one named column, using a match for a search term in another named column; and you're doing that because you can't rely on column ordinals or addresses - you have to do it by name.
VLookup won't do that for you: not without extremely complex and fragile array formulae.
The bad news is: Microsoft NEVER published a working example of a DGET() formula or any corresponding VBA Worksheet Function code.
There's page after page of descriptive text and general explanation in the helpfiles and on MSDN: but no working example. Nobody in Redmond ever sat down and made the DGET() function work with a reproducible set of function parameters and published a screen-shot of the working formula.
I'll let you guess why that is.
Maybe there's an example somewhere that is, in effect, a VLookup implemented for known column ordinals using DGET(). If there is, I never found it and you won't either: and it would, of course, be useless for any application where you're working with field names instead of known ordinals.
What you need to do is capture the tabulated data range, with field names in the top row, and pass it to a SQL query using ADODB or MS-Query. The bad news for that is that all the MS-JET Excel drivers have a fatal memory leak.
After that fails, you're left exporting the data somewhere that a proper database app can run the SQL: and that's actually the right thing to do, because your attempt at using DGET() is a relational data query.
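As a rough idea of what "exporting the data somewhere that can run the SQL" looks like, here is a minimal sketch using Python's built-in sqlite3 module rather than ADODB or MS-Query. It assumes the range has already been read out of Excel as a list of rows with the field names in the first row; the function name and the sample systems and malfunctions are invented for the illustration.

```python
import sqlite3


def query_by_field(rows, match_field, match_value, return_field):
    """Return every return_field value from the rows where match_field equals match_value.

    rows[0] holds the field names, so columns are addressed by name, not ordinal.
    """
    header, *records = rows
    if match_field not in header or return_field not in header:
        raise KeyError("unknown field name")

    def quote(name):
        # Quote identifiers so field names containing spaces still work.
        return '"' + name.replace('"', '""') + '"'

    connection = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(quote(name) for name in header)
        connection.execute(f"CREATE TABLE data ({columns})")
        placeholders = ", ".join("?" for _ in header)
        connection.executemany(f"INSERT INTO data VALUES ({placeholders})", records)
        cursor = connection.execute(
            f"SELECT {quote(return_field)} FROM data "
            f"WHERE {quote(match_field)} = ? ORDER BY rowid",
            (match_value,),
        )
        return [row[0] for row in cursor]
    finally:
        connection.close()


# Made-up sample data standing in for the two columns in the question.
table = [
    ["System", "Malfunction"],
    ["Hydraulics", "Leaking seal"],
    ["Avionics", "Display flicker"],
    ["Hydraulics", "Low pressure"],
]
hydraulic_faults = query_by_field(table, "System", "Hydraulics", "Malfunction")
```

Unlike DGET(), which is meant to return a single value, this returns every matching row, which is what the question is actually asking for.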
If you're left with the need to do this entirely in Excel, you have reached a level of desperation normally associated with the last survivor of an airplane crash who, having devoured the charred remains of his unlucky fellow passengers, is finally forced to contemplate the awful exigency of opening and eating the inflight catering meals.
The grisly details for the equivalent in Excel are a Horrible Hack published here:
http://excellerando.blogspot.com/2014/09/from-time-to-time-it-necessary-to.html

certain fields get corrupted going from an Access report to an Excel spreadsheet

I can't make sense of this. I'm using Access 2003.
The problem occurs when I right-click an Access report, then click "Send..." which has only one option of "Mail recipient". Then I click Excel. An Outlook window containing an outgoing email opens with an Excel attachment. There is one column with item ids. Some get corrupted while others do not. It seems they need to begin with the numerals 20 to get corrupted. Anything else comes out OK.
For example, the following item ids are ok:
100657
100657-17
216116-115
221007-001
The following get corrupted (the corrupted version follows):
202103-001 becomes 1313049
205103-001 becomes 2408777
Does anyone have any idea as to why this might be happening?
Check the formatting of the source; it is possible that the 20's are getting converted to date formats. This is particularly likely if the original source of the data being used by Access is Excel. You likely want to set the format to text, particularly if you are using numbers to track IDs like stock numbers.
Also check what format is used in Access for the field. Excel is less likely to take liberties with your data when the Access field is set to text. If you continue to have problems after addressing the formats, post more details about the formats in your question.

Working with Office "open" XML - just how hard is it?

I'm considering replacing a (very) large body of Office-automation code with something that works with the Office XML format directly. I'm just starting out, but already I'm worried that it's too big a task.
I'll be dealing with Word, Excel and PowerPoint. So far I've only looked at Word and Excel. It looks like Word documents should be reasonably easy to manipulate, but Excel workbooks look like a nightmare. For example...
In Word, it looks like you could delete a paragraph simply by deleting the corresponding "w:p" tag. However, the supplied code snippet for deleting a row in Excel takes about 150 lines of code(!).
The reason the Excel code is so big is that deleting a row means updating the row indexes of all the subsequent rows, fixing up the "shared strings" table, etc. According to a comment at the top, the code snippet is not even complete, in that it won't deal with a workbook that has tables in it (I can live with that).
What I'm not clear on is whether that's the only restriction that the sample code has. For example, would there also be a problem if the workbook contained a Pivot Table? Or a chart that references data from the same sheet? Or some named ranges? Wouldn't you also have to update the formulae for any cells (etc.) that referenced a row whose row index had changed?
[That's not to mention the "calc chain", which (thankfully) I think you can simply delete since it's only a cache that can be re-built.]
And that's my question, woolly though it is. Just how hard do you have to work to do something as simple as deleting a row properly? Is it an insurmountable task?
Also, if there are other, similar issues either with Excel or with Word or PowerPoint, I'd love to hear about them now, before I waste too much time going down a blind alley. Thanks.
Having worked with the Open XML SDK 2.0 for almost two years now, I can say that doing seemingly trivial tasks can take many hours and sometimes days to figure out how to do properly. For example, deleting an Excel row should be fairly straightforward and easy to do, right? Nope, because not only do you need code to delete your row, but then you have to update all the row indices, update any merged cell references, update hyperlink references, etc. Our internal delete method is close to 500 lines of code just to delete a row, and I'm sure we don't have all the cases accounted for either.
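To give a feel for just the first step (removing the row and renumbering what follows), here is a deliberately incomplete sketch written in Python against the raw worksheet XML instead of the Open XML SDK. It assumes every row and cell carries an explicit `r` attribute, which the format does not require, and it does nothing about formulas, merged cells, hyperlinks, shared strings, named ranges or the calc chain. The function name and the sample sheet are made up.

```python
import re
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
# Serialize with the SpreadsheetML namespace as the default one (no ns0: prefixes).
ET.register_namespace("", MAIN_NS)


def delete_row(sheet_xml, row_number):
    """Remove one row from a worksheet part and shift the later rows up by one."""
    root = ET.fromstring(sheet_xml)
    sheet_data = root.find(f"{{{MAIN_NS}}}sheetData")
    for row in sheet_data.findall(f"{{{MAIN_NS}}}row"):
        index = int(row.get("r"))
        if index == row_number:
            sheet_data.remove(row)
        elif index > row_number:
            row.set("r", str(index - 1))
            for cell in row.findall(f"{{{MAIN_NS}}}c"):
                # Keep the column letters, replace the row part of e.g. "B3".
                column = re.match(r"[A-Z]+", cell.get("r")).group()
                cell.set("r", f"{column}{index - 1}")
    return ET.tostring(root, encoding="unicode")


# Made-up three-row sheet; deleting row 2 should turn A3/B3 into A2/B2.
sheet = (
    f'<worksheet xmlns="{MAIN_NS}"><sheetData>'
    '<row r="1"><c r="A1"><v>1</v></c></row>'
    '<row r="2"><c r="A2"><v>2</v></c></row>'
    '<row r="3"><c r="A3"><v>3</v></c><c r="B3"><v>4</v></c></row>'
    '</sheetData></worksheet>'
)
result = delete_row(sheet, 2)
```

Everything this sketch skips is where the other few hundred lines go.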
The biggest complaint I have is the lack of documentation on how to do the most common tasks. The MSDN section on the Open XML SDK is very limited and whenever you need to do anything complicated you are really on your own. I've had to read the Open XML standard a lot to figure out what certain elements mean and how they should be implemented since I could find very little online.
The other challenging part is if you insert an element in a spot where it doesn't belong or put an invalid attribute on an element you will get a corrupt file when you try and open it. Most of the time you will not get any information on what caused the error and you will have to look at the Open XML standard spec to see what you did wrong.
If you need a fast turnaround time on converting that Office automation code into Open XML and what you are doing is not really basic, then I would say pass. If you have time and the patience to read up on the Word, Excel and PowerPoint XML structures and get familiar with how they relate then I say go for it. In my opinion it is really the only way to have very fine control over these office documents, but there will be a great learning curve when you start.
Oh and just for fun here is how much code is needed to add a comment to an Excel cell.
Just for completeness, here are some libraries I found for working with Excel XML:
www.extremexml.com - a layer on top of the Open XML SDK classes; focusses on injecting data into an existing spreadsheet; handles many of the cross-reference problems I identified in my question. Open source but GPL2 not LGPL. Code looks nice, and documentation is excellent. Does not appear terribly active on codeplex though.
Closed XML - another layer on top of the Open XML SDK - again open source, but with a less restrictive license (MIT). Looks nice, and looks more "active" than the above.
SpreadsheetLight - from what I can tell, a closed-source library sitting atop the Open XML SDK classes. Targeted more at those looking to create a spreadsheet from scratch rather than making changes to existing spreadsheets.
Here is another third party library dedicated to working with OpenXML:
http://www.officewriter.com
In the example cited by amurra above of deleting Excel spreadsheet rows, this is a single method call with this tool. It updates formulas and all the other references for which it seems that 500 lines of code would otherwise be required.
The OpenXML SDK itself is a great tool for very simple things, but you still have to concern yourself with a lot of the internals of the file format and packaging structure to get things really right.
Here are some additional libraries that can manipulate OOXML formats:
- GemBox.Spreadsheet (XLSX)
- GemBox.Document (DOCX)
GemBox also published some articles that demonstrate how to manipulate the OOXML file format with pure .NET (without the use of any library). I think you'll find them interesting:
www.codeproject.com/Articles/15593/Read-and-write-Open-XML-files-MS-Office
(Introduction to SpreadsheetML format and an explanation on how we can read and write worksheet's cell content)
www.codeproject.com/Articles/649064/Show-Word-File-in-WPF
(Introduction to WordprocessingML format and demonstration on how we can read document's text)

Resources