I have a CSV file that is encoded in Unicode but lacks a byte order mark at the start. As a result, Excel (2013) opens it without detecting the encoding correctly (I think it assumes ASCII if no BOM is specified), meaning that certain characters are displayed incorrectly.
From reading around, I gather that a BOM of "\uFEFF" should be placed at the start of the CSV file. I have tried opening the file in a text editor and adding the characters, e.g.
\uFEFF
r1test 1, r1text2, r1text3
r2test 1, r2text2, r2text3
However, this does not solve the problem: the characters "\uFEFF" show up in the first row when I open the file in Excel, rather than being interpreted as a BOM. I am not sure what I am doing wrong, or how the text should be specified so that it is interpreted as a BOM rather than as data in the first row.
I have only very limited experience with CSV and have only just heard of a BOM, so I could be implementing this completely wrong!
(For reference, I know that I could specify the encoding if I used the import data option within Excel; however, I really want to work out how to get it correctly specified in advance, so that I can just open the CSV. I have several thousand of these files that I am creating and exporting - once I know how to do this 'manually' [i.e. by adding some bytes at the start of the file], I can configure Python to do it automatically.)
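What I am ultimately aiming for in Python is something like the minimal sketch below (the filename is illustrative). The point is to write the actual BOM bytes EF BB BF rather than the six literal characters \uFEFF, which is why typing them in a text editor did not work; Python's 'utf-8-sig' codec prepends those bytes automatically.

import csv

# 'utf-8-sig' writes the UTF-8 BOM (EF BB BF) at the start of the file,
# so Excel detects the encoding when the file is double-clicked.
with open("r1.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["r1test 1", "r1text2", "r1text3"])
    writer.writerow(["r2test 1", "r2text2", "r2text3"])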
Thanks in advance
For someone else wanting to tell Excel to add a BOM: See if you can "Save as Unicode Text".
I have searched all over the internet for how to save a CSV file as Unicode (UTF-8), but it still does not work: whenever I save and then open the file, there are ????? characters instead of the letters that are UTF-8.
Has anyone ever had this issue? How can I solve it?
This has been an annoying shortcoming of Excel for a long time.
One way to work around the issue is the following:
Save As... Unicode Text (*.txt). Make sure to keep the extension as .txt (or at least not .csv). The file will be saved with tabs instead of commas separating the columns.
Open the document. You will be prompted with an import wizard.
For File origin, choose 65001: Unicode (UTF-8)
For the rest of the options, choose the common sense options.
You will have your document back, ready to edit, with the proper unicode text intact.
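If you need to automate that round trip, here is a rough Python sketch (filenames are placeholders): Excel's "Unicode Text" export is UTF-16 with a BOM and tab-delimited, so it can be converted back to a UTF-8 CSV with a BOM that Excel will open directly.

import csv

# Read the tab-delimited UTF-16 file Excel saved and write it back out
# as UTF-8 CSV with a BOM ('utf-8-sig' adds the BOM automatically).
with open("export.txt", encoding="utf-16", newline="") as src, \
        open("export.csv", "w", encoding="utf-8-sig", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, delimiter="\t"):
        writer.writerow(row)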
I am using a file from Insideairbnb.com for my thesis. It is a csv.gz file so first I extracted it using the 'Archive Utility' for Mac.
It is comma delimited and uses double quotes as the text qualifier, which I specified in the Import popup, but Excel/SPSS is still splitting at the commas within the text.
It is a large file that includes full airbnb descriptions and reviews which are contained in double quotations. Unfortunately, there are many commas within the strings of text. I have never seen a csv file with this format but I believe it was put together correctly because I have seen Insideairbnb cited for data in quite a few scholarly articles.
I have included a link to pictures of a snippet of the data on the SPSS import window. If anyone knows how to go about importing this I would greatly appreciate your feedback :)
Thank you in advance!
SPSS screenshot 1: https://i.stack.imgur.com/Iy3dA.png
SPSS screenshot 2: https://i.stack.imgur.com/i7KcG.png
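For reference, a standards-style CSV parser keeps commas inside double-quoted fields together, so the file itself may well be fine; here is a quick Python sanity check (the filename is a placeholder):

import csv

# The csv module honours double quotes by default, so commas inside
# quoted descriptions stay within a single field.
with open("listings.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        print(len(row))  # the column count should be the same for every row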
I agree with @sarawhite's comment above; if this is a one-time problem, there are a couple of things I would try.
Open the .csv in Excel, and if it looks right, save it and then try to import it in SPSS, or save it as an .xlsx file and import that (although there can be nonsense with string variables in either scenario).
OR
Open it in Notepad++ and look at the raw data. You can find and replace double line breaks fairly easily (or script it; see the sketch below).
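A rough Python sketch of that second option, in case hand editing gets tedious (filenames are placeholders): the csv module keeps line breaks that occur inside quoted fields attached to their field, so they can be replaced with spaces before importing.

import csv

# Flatten line breaks inside quoted fields, which trip up some importers.
with open("listings.csv", encoding="utf-8", newline="") as src, \
        open("listings_clean.csv", "w", encoding="utf-8", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([field.replace("\r", " ").replace("\n", " ") for field in row])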
I copy-pasted the data into Notepad++ yesterday, then converted it to ANSI and copy-pasted it back into Excel. Yesterday, it worked, but today it doesn't...
Anyways, maybe this thread is helpful for people with the same question. I will try again at a later point in time.
It is my understanding that txt files do not store any encoding information, so text editors simply make an educated guess about the encoding of a given text file and then display the file on screen using that guessed encoding. If the editor guesses right, you get your text on the screen; if it guesses wrong, you (sometimes) get gibberish. Am I getting this right so far?
Now on to my problem. I have my bank statements in a csv file. When I open it in MS Excel 14 (MS Office 2010), it recognises the encoding and displays the problematic word as "obračun". Great. When I open the file in Emacs 24.3.1, it fails to recognise the correct encoding and displays the problematic word as "obra鑾n". Not so great.
My question is: how do I ask Excel which encoding the file is in? So I can tell that to Emacs since Excel obviously guessed correctly.
Thanks.
This could be a possible answer: http://metty-mathews.blogspot.si/2013/08/excel2013-character-encoding.html
After I opened ‘Advanced’ – ‘Web Options’ – ‘Encoding’, it said "Central European (Windows)" in the "Save this document as:" field. It turns out that's Microsoft's name for the Windows-1250 encoding, and my file was indeed encoded that way.
Is this just pure luck, or does this field really show which encoding Excel is using to display the text? That I do not know.
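If you would rather check programmatically than trust that dialog, here is a crude Python sketch that trial-decodes the raw bytes against a few candidate encodings (the candidate list and filename are only illustrative):

# Single-byte encodings like cp1250 rarely *fail* to decode, so eyeball
# the output for the problem word rather than trusting the first success.
candidates = ["utf-8", "cp1250", "cp1252", "utf-16"]

with open("statement.csv", "rb") as f:
    raw = f.read()

for enc in candidates:
    try:
        print(enc, "->", raw.decode(enc)[:60])
    except UnicodeDecodeError:
        print(enc, "-> failed")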
I am writing an Excel file through Perl code. When I insert the data into an XML file and view it in a browser, I see the correct data with special characters, but when I write the same data to an Excel file, it shows garbage characters.
For example:
(word from the XML file in a browser) Gràcia - (word from the Excel file) GrÃ cia
I am using 'Spreadsheet::XLSX' for reading excel and 'Excel::Writer::XLSX' for writing excel.
I also need help finding out the encoding format of the Excel fields.
Do you have any idea? Thanks in advance.
This looks very much like a UTF-8 to ISO-8859-1 conversion going wrong: a string that contains UTF-8, but is not marked as UTF-8, is being passed to $worksheet->write(). Since http://metacpan.org/pod/Excel::Writer::XLSX#UNICODE-IN-EXCEL claims to handle Unicode correctly, it seems to be a problem with your input string, not the write method itself.
As you don't post any code and don't tell us where your strings come from, I can't tell why the strings aren't marked correctly.
You can probably get away with
Encode::_utf8_on($str)
before passing your strings to $worksheet->write(), but this might just as well break other things if not all of your strings really are UTF-8. Basically, the answer is: get the UTF-8 flag on your strings right when you read them.
I'm currently developing a CSV export with XSLT, and the CSV file will be used 99% of the time with Excel in my case, so I have to consider Excel's behavior.
My first problem was German special characters in the CSV. Even though the CSV encoding is UTF-8, Excel cannot open a UTF-8 CSV file properly: the special characters turn into weird symbols. I found a solution for this problem: I added 3 extra bytes (EF BB BF, a.k.a. the BOM header) at the beginning of the content bytes, because a UTF-8 BOM is a way of telling Excel 'hey, it is UTF-8, open it properly'. Problem solved!
My second problem was the separator. The default separator can be a comma or a semicolon depending on the region; I think it is a semicolon in Germany and a comma in the UK. So, to prevent this problem, I had to add the line below:
<xsl:text>sep=;</xsl:text>
or
<xsl:text>sep=,</xsl:text>
(The separator is not hard-coded.)
But the problem I cannot find any solution for is this: if you add "sep=;" or "sep=," at the beginning of the file while the CSV file is generated with a UTF-8 BOM, the BOM no longer helps Excel show the special characters properly! And I'm sure the BOM bytes are always at the beginning of the byte array. This screenshot is from MS Excel on Mac OS X:
The first 3 symbols belong to the BOM header.
Have you ever had a problem like this, or do you have any suggestions? Thank you.
Edit:
Here are the screenshots.
a. With BOM and <xsl:text>sep=;</xsl:text>
b. Just with BOM
The Java code:
// Write the bytes
ServletOutputStream out = resp.getOutputStream();
if (contentType.toString().equals("CSV")) {
    // These three bytes (0xEF 0xBB 0xBF) are the UTF-8 BOM,
    // indicating that the content is in UTF-8.
    out.write(239);
    out.write(187);
    out.write(191);
}
out.write(bytes); // Content bytes, in this case the XSL output
The XSL code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" version="1.0" encoding="UTF-8" indent="yes" />
  <xsl:template match="/">
    <xsl:text>sep=;</xsl:text>
    <table>
    ...
    </table>
  </xsl:template>
</xsl:stylesheet>
You are right: there is no way in Excel 2007 to get it to load both the encoding and the separator correctly across different locales when someone double-clicks a CSV file.
It seems that when you specify sep= after the BOM, Excel forgets that the BOM told it the file is UTF-8.
You have to specify the BOM because in certain locales Excel does not detect the separator. For instance, in Danish the default separator is ;. If you output tab- or comma-separated text, Excel does not detect the separator, and in other locales, if you separate with semicolons, it doesn't load. You can test this by changing the locale format in the Windows settings - Excel then picks this up.
From this question:
Is it possible to force Excel recognize UTF-8 CSV files automatically?
and its answers, it seems the only way is to use UTF-16 LE encoding with a BOM.
Note also that, as per http://wiki.scn.sap.com/wiki/display/ABAP/CSV+tests+of+encoding+and+column+separator?original_fqdn=wiki.sdn.sap.com, it seems that if you use UTF-16 LE with tab separators, then it works.
I've wondered whether Excel reads sep=; and then re-calls the method to get the CSV text and loses the BOM. I've tried giving it incorrect text, and I can't find any workaround that tells Excel to take both the separator and the encoding.
This is the result of my testing with Excel 2013.
If you're stuck with UTF-8, there is a workaround which consists of BOM + data + sep=;
Input (written with UTF-8 encoding)
\ufeffSome;Header;Columns
Wîth;Fàncÿ;Stûff
sep=;
Output
|Some|Header|Columns|
|Wîth|Fàncÿ |Stûff |
|sep=| | |
The issue with this solution is that while Excel interprets sep=; properly, it displays sep= (yes, it swallows the ;) in the first column of the last row.
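In code, that workaround looks roughly like this Python sketch (the filename is illustrative):

# UTF-8 workaround: BOM first ('utf-8-sig' writes it), then the data,
# then "sep=;" as the LAST line. Excel honours the separator but leaves
# a stray "sep=" cell in the last row, as noted above.
with open("fancy_utf8.csv", "w", encoding="utf-8-sig", newline="") as f:
    f.write("Some;Header;Columns\n")
    f.write("Wîth;Fàncÿ;Stûff\n")
    f.write("sep=;\n")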
However, if you can write the file as UTF-16 LE, then there is an actual solution: use the \t delimiter without specifying sep, and Excel will play ball.
Input (written with UTF-16 LE encoding)
\ufeffSome;Header;Columns
Wîth;Fàncÿ;Stûff
Output
|Some|Header|Columns|
|Wîth|Fàncÿ |Stûff |
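A minimal Python sketch of that UTF-16 LE + tab combination (same toy data; the filename is illustrative):

import csv

# UTF-16 LE with an explicit BOM plus tab delimiters: per the test above,
# Excel then picks up both the encoding and the columns on double-click.
with open("fancy.csv", "w", encoding="utf-16-le", newline="") as f:
    f.write("\ufeff")  # BOM, stored as the bytes FF FE
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["Some", "Header", "Columns"])
    writer.writerow(["Wîth", "Fàncÿ", "Stûff"])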
I can't write comments yet, but I'd like to address @Pier-Luc Gendreau's solution. While it is possible to open it in European Excel (which by default uses ; as the delimiter) and have full UTF-16 LE support, it is apparently not possible to use this technique when you specify sep=,.
The issue with this solution is that while Excel interprets sep=; properly, it displays sep= (yes, it swallows the ;) in the first column of the last row.
For me it did not work when I specified a delimiter which wasn't the default one (; in my case), so I assume Excel did not interpret the last line correctly and swallowed the last delimiter because that is the default behavior.
Please correct me if I'm wrong