Encoding issue in excel - excel

As mentioned in the title I have an encoding issue in Excel. It requires special characters and also some letters.
I will show you an example: 1st being a good one and 2nd a bad one.
1st example
After I refresh the Excel workbook my special characters become something like the following.
2nd example
Sadly my work around right now is, using one of my colleagues Mac, to go in and refresh the workbook again and the special characters get fixed.
These Excel workbooks are created by a Microsoft Power Automate flow, and checking every excel file he creates in order to fix this issue, is waste of time.
Does anyone have any suggestions about how to fix this.
Sincerely, Daniel

If you are creating the Excel files yourself via a Power Automate flow you could look into using the Byte Order Mark characters to the beginning of the file.
Below is an explanation how you could achieve that:
https://powerusers.microsoft.com/t5/Building-Flows/Create-a-csv-file-that-uses-UTF-8-character-encoding/m-p/1311587/highlight/true#M148980

This is exactly what my PAD is doing. There is a master excel file, and when the online flow finds a new row in the database is running this PAD flow.
After few refreshes the new data is updated in the excel file, and that is when the special characters get changed, like presented in the main post

Related

Can we specified format rules directly in file loaded in Excel?

I'm generating a csv file that looks like:
column1,column2,column3
hello,02,some comments
hello,AF,some comments
hello,15,some comments
hello,08,some comments
hello,FF,some comments
When opening into Excel, the second columns will convert automatically 02 to 2 and 08 to 8.
Recommended workaround is to tweak settings in Excel or specifying format of each columns during the file loading. (time-consuming, not user-friendly)
Another workaround for file itself, would be to append quotes around the concerned values, so they would be treated as strings.
(for some reasons, I can't do this, as the file I generate, must reflect the original source one)
But... Is there another way ?
Like adding some kind of a commentary line, containing hints for the Excel parser, directly in the CSV.
Taking Oracle SQL queries hints system as an example, which allows us to change the query optimizer way of work.
I'm not sure this exists for Excel, maybe it is, but obscure and not well documented ?
I can't figure out why they would not add such a useful feature to "preconfigure" the format instead of having the user to navigate Excel settings.
Thanks

Finding the cause of Excel file corruption

I have a feature that downloads things to an xls file using Apache POI. Mostly it works. But on one particular database, the resulting files are corrupted and won't open in Excel. I get the message "We found a problem with some content in 'DownloadFoo.xls'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes." . Clicking yes results in all the formatting, data validation, etc being stripped out. On the other hand, if I open the file in Open Office Calc and save it, it's fine and can be opened in Excel from then on. (The people who want to use these files aren't allowed to download Open Office Calc, so this is not considered an acceptable workaround.)
I have tried narrowing it down to see which data is causing the problem, but it seems to occur whenever 10 or more items are downloaded, regardless of which items they are. (On other databases, it's fine to download 100+). Excluding some of the columns helps, but they are perfectly innocuous looking columns (and virtually identical to other columns which are fine) so this still hasn't got me to the bottom of it.
Are there any techniques I could use to find out what Excel has a problem with in the corrupted spreadsheets?
I can't make major changes like getting it to download to xlsx instead as this feature is going to be scrapped and replaced with something completely different in the near future, so I'd like to just focus on the problem at hand.
It turns out that the solution to the problem was to reset the data validation lists more often. Quite a lot of the cells in my spreadsheet have data validation. When the data validation lists are longer, they are stored on a hidden sheet. If several cells need the same validation, I try to get them referencing the same list in order to not write out too much stuff on the hidden sheet. However Excel apparently dislikes it when too many cells reference the same list- it's not against the rules as far as I can tell, but it doesn't like it anyway. When I changed it to rewrite the validation lists for every 5 items, it started working.
The reason this database was different was that the items had an unusually high number of subitems, so they occupied a lot of rows even though it didn't seem like many things were being downloaded. Some of the problem columns just had true or false validation rather than using the lists on the hidden sheet, so I don't know what that was about, but resetting the validation lists helped anyway.
This doesn't really answer my question as I never managed to get any information from Excel about what the problem was, or use a particular technique, it was just a series of coincidental findings. I'm putting it here anyway in case anyone else has a similar problem. Also the thing that started me on the right track was finding an old comment when double checking that it doesn't do anything different for over 10 items (it doesn't) in response to Andrew Morton's comment, so thanks Andrew!

Excel - Variable number of leading zeros in variable length numbers?

The format of our member numbers has changed several times over the years, such that 00008, 9538, 746, 0746, 00746, 100125, and various other permutations are valid, unique and need to be retained. Exporting from our database into the custom Excel template needed for a mass update strips the leading zeros, such that 00746 and 0746 are all truncated to 746.
Inserting the apostrophe trick, or formatting as text, does not work in our case, since the data seems to be already altered by the time we open it in Excel. Formatting as zip won't work since we have valid numbers less than five digits in length that cannot have zeros added to them. And I am not having any luck with "custom" formatting as that seems to require either adding the same number of leading zeros to a number, or adding enough zeros to every number to make them all the same length.
Any clues? I wish there was some way to set Excel to just take what it's given and leave it alone, but that does not seem to be the case! I would appreciate any suggestions or advice. Thank you all very much in advance!
UPDATE - thanks everybody for your help! Here are some more specifics. We are using a 3rd party membership management app -- we cannot access the database directly, we need to use their "query builder" tool to get the data we want to mass update. Then we export using their "template" format, which is called XLSX but there must be something going on behind the scenes, because if we try to import a regular old Excel, we get an error. Only their template works.
The data is formatted okay in the database, because all of the numbers show correctly in the web-based management tool. Also, if I export to CSV, save it as a .txt and import it into Excel, the numbers show fine.
What I have done is similar to ooo's explanation below -- I exported the template with the incorrect numbers, then exported as CSV/txt, and copied / pasted THOSE numbers into the template and re-imported. I did not get an error, which is something I guess, but I will not be able to find out if it was successful until after midnight! :-(
Assuming the data is not corrupt in the database, then try and export from the database to a csv or text file.
The following can then be done to ensure the import is formatted correctly
Text file with comma delimiter:
In Excel Data/From text and selected Delimited, then next
In step 3 of the import wizard. For each column/field you want as text, highlight the column and select Text
The data should then be placed as text and retain leading zeros.
Again, all of this assumes the database contains non-corrupt data and you are able to export a simple text or csv file. It also assumes you have Excel 2010 but it can be done with minor variation across all versions.
Hopefully, #ooo's answer works for you. I'm providing another answer mainly for informational purposes, and don't feel like dealing with the constraints on comments.
One thing to understand is that Excel is very aggressive about treating "numeric-looking" data as actual numbers. If you were to open the CSV by double-clicking and letting Excel do its thing (rather than using ooo's careful procedure), those numbers would still have come up as numbers (no leading zeros). As you've found, one way to counteract this is to append clearly nonnumeric characters onto your data (before Excel gets its grubby hands on it), to really convince Excel that what it's dealing with is text.
Now, if the thing that uploads to their software is a file ending in .xlsx, then most likely it is the current Excel format (a compressed XML document, used by Excel 2007 and later). I suppose by "regular old Excel" you mean .xls (which still works with the newer Excels in "compatibility mode").
So in case what you've tried so far doesn't work, there are still avenues to explore before resorting to appending characters to the end of your data. (I'll update this answer as needed.)
You're on the right track with the apostrophe.
You'll need to store your numbers in excel as text at the time they are added to the file.
What are you using to create the original excel file / export from database?
This will likely be where your focus needs to be regarding your export.
For example one approach is that you could potentially modify the database export to include the ' symbol prefix before the numbers so that excel will know to display them as text.
I use the formula =text(cell,"# of zeros of the field") to add preceding zeros.
Example, Cell C2 has 12345 and I need it to be 10 characters long. I would put =text(c2,"0000000000").
The result will be 0000012345.

Working with Office "open" XML - just how hard is it?

I'm considering replacing a (very) large body of Office-automation code with something that works with the Office XML format directly. I'm just starting out, but already I'm worried that it's too big a task.
I'll be dealing with Word, Excel and PowerPoint. So far I've only looked at Word and Excel. It looks like Word documents should be reasonably easy to manipulate, but Excel workbooks look like a nightmare. For example...
In Word, it looks like you could delete a paragraph simply by deleting the corresponding "w:p" tag. However, the supplied code snippet for deleting a row in Excel takes about 150 lines of code(!).
The reason the Excel code is so big is that deleting a row means updating the row indexes of all the subsequent rows, fixing up the "shared strings" table, etc. According to a comment at the top, the code snippet is not even complete, in that it won't deal with a workbook that has tables in it (I can live with that).
What I'm not clear on is whether that's the only restriction that the sample code has. For example, would there also be a problem if the workbook contained a Pivot Table? Or a chart that references data from the same sheet? Or some named ranges? Wouldn't you also have to update the formulae for any cells (etc.) that referenced a row whose row index had changed?
[That's not to mention the "calc chain", which (thankfully) I think you can simply delete since it's only a chache that can be re-built.]
And that's my question, woolly though it is. Just how hard do you have to work do something as simple as deleting a row properly? Is it an insurmountable task?
Also, if there are other, similar issues either with Excel or with Word or PowerPoint, I'd love to hear about them now, before I waste too much time going down a blind alley. Thanks.
Having worked with the Open XML SDK 2.0 for almost two years now I can say that doing seemingly trivial tasks can take many hours and sometimes days to figure out how to do it properly. For example, deleting an Excel row should be fairly straightforward and easy to do right? Nope because not only do you need code to delete your row, but then you have to update all the row indices, update any merged cell references, update hyperlink references, etc. Our internal delete method is close to 500 lines of code to just delete a row and I'm sure we don't have all the cases accounted for either.
The biggest complaint I have is the lack of documentation on how to do the most common tasks. The MSDN section on the Open XML SDK is very limited and whenever you need to do anything complicated you are really on your own. I've had to read the Open XML standard a lot to figure out what certain elements mean and how they should be implemented since I could find very little online.
The other challenging part is if you insert an element in a spot where it doesn't belong or put an invalid attribute on an element you will get a corrupt file when you try and open it. Most of the time you will not get any information on what caused the error and you will have to look at the Open XML standard spec to see what you did wrong.
If you need a fast turnaround time on converting that Office automation code into Open XML and what you are doing is not really basic, then I would say pass. If you have time and the patience to read up on the Word, Excel and PowerPoint XML structures and get familiar with how they relate then I say go for it. In my opinion it is really the only way to have very fine control over these office documents, but there will be a great learning curve when you start.
Oh and just for fun here is how much code is needed to add a comment to an Excel cell.
Just for completeness, here are some libraries I found for working with Excel XML:
www.extremexml.com - a layer on top of the Open XML SDK classes; focusses on injecting data into an existing spreadsheet; handles many of the cross-reference problems I identified in my question. Open source but GPL2 not LGPL. Code looks nice, and documentation is excellent. Does not appear terribly active on codeplex though.
Closed XML - another layer on top of the Open XML SDK - again open source, but with a less restrictive license (MIT). Looks nice, and looks more "active" than the above.
SpreadsheetLight - from what I can tell, a closed-source library sitting atop the Open XML SDK classes. Targeted more at those looking to create a spreadsheet from scratch rather than making changes to existing spreadsheets.
Here is another third party library dedicated to working with OpenXML:
http://www.officewriter.com
In the example cited by amurra above of deleting Excel spreadsheet rows, this is a single method call with this tool. It updates formulas and all the other references for which it seems that 500 lines of code would be required for otherwise.
The OpenXML SDK itself is a great tool for very simple things, but you still have to concern yourself with a lot of the internals of the file format and packaging structure to get things really right.
Here are some additional libraries that can manipulate with OOXML formats:
- GemBox.Spreadsheet (XLSX)
- GemBox.Document (DOCX)
Also GemBox published some articles that demonstrate how to manipulate with OOXML file format with pure .NET (without a use of any library), I think you'll find this interesting:
www.codeproject.com/Articles/15593/Read-and-write-Open-XML-files-MS-Office
(Introduction to SpreadsheetML format and an explanation on how we can read and write worksheet's cell content)
www.codeproject.com/Articles/649064/Show-Word-File-in-WPF
(Introduction to WordprocessingML format and demonstration on how we can read document's text)

Convert richtext strings to excel

I have a form that has TinyMCE for richtext formatting. All of our data is available to export as an HTML report, PDF Report, and Excel Spreadsheet (report).
The fields, that we allow richtext in, show up as the formatted values in both the HTML and PDF reports, but in Excel we show them as strings. For instance:
<b>this part is bold</b><br />line 2 here.
I need a way to make that show up as bold/line-break in excel rather then just showing that string, or at least a way to strip the HTML tags out of there and just show plain text (though I would really like to at least keep the line breaks). Is there some type of macro I can include in the excel download or some C++ program that can convert it or something?
Thanks for your time!
I've done something similar with PHPExcel
The trick is to take your formatted data and find a pattern. In your case, it would probably be table rows/table cells. Iterate through that structure setting the excel cell values as you go. For complex formatting you could fairly simply regex replace what is necessary to get formatted as you desire. The theory may sound a little complicated, but once you get down to it, it's only an hour or two's worth of work.
Certainly there are equivalent programs based on other server technologies. But this one has worked brilliantly for me over the years, and I trust it to work on sites for very big clients with crazy inbound traffic numbers...and it's never failed. It's the only reliable way I've found to write perfect, properly formatted Excel without requiring the user to jump through hoops to get a specific browser.

Resources