What charset does Microsoft Excel use when saving files?

I have a Java app which reads CSV files that were created in Excel (e.g. Excel 2007). Does anyone know what charset MS Excel uses to save these files?
I would have guessed either:
windows-1255 (Cp1255)
ISO-8859-1
UTF8
but I am unable to decode extended characters (e.g. French accented letters) using any of these charsets.

From memory, Excel uses the machine-specific ANSI encoding. So this would be Windows-1252 for an EN-US installation, Windows-1251 for Russian, and so on.
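If the file really was written with the machine's ANSI code page, decoding it explicitly should recover the accented characters. A minimal Java sketch, assuming an EN-US machine (the file name is made up):

import java.io.BufferedReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadExcelCsv {
    public static void main(String[] args) throws Exception {
        // windows-1252 is the ANSI code page of an EN-US installation
        Charset ansi = Charset.forName("windows-1252");
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("export.csv"), ansi)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}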

CSV files could be in any encoding, depending on what option was specified during the export from Excel (Save dialog → Tools button → Web Options item → Encoding tab).
UPDATE: Excel (up to and including Office 2013) doesn't actually respect the web options selected in the "Save As..." dialog, so this is a bug of some sort. I now just use OpenOffice Calc to open my XLSX files and export them as CSV (edit filter settings, choose UTF-8 encoding).

Waking up this old thread... It is now 2017, and Excel is still unable to save a simple spreadsheet to CSV while preserving the original encoding. Just amazing.
Luckily Google Docs lives in the right century. The solution for me is to open the spreadsheet in Google Docs, then download it back as CSV. The result is a correctly encoded CSV file (with all strings encoded in UTF-8).

I had a similar problem last week. I received a number of CSV files with varying encodings, so before importing them into the database I used the chardet library to automatically sniff out the correct encoding.
Chardet is a port of Mozilla's character-detection engine, and it works really well if the sample size is large enough (one accented character will not do).
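Since the original question involves a Java app: juniversalchardet is a Java port of the same Mozilla engine, and a detection pass might look roughly like this (assuming the juniversalchardet jar is on the classpath; the file name is made up):

import java.io.FileInputStream;

import org.mozilla.universalchardet.UniversalDetector;

public class SniffCharset {
    public static void main(String[] args) throws Exception {
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream in = new FileInputStream("input.csv")) {
            byte[] buf = new byte[4096];
            int nread;
            // Feed the detector until it is confident or the file ends.
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        // null means the sample was too small or too ambiguous to classify.
        System.out.println("Detected charset: " + detector.getDetectedCharset());
    }
}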

The Russian edition offers CSV, CSV (Macintosh) and CSV (DOS).
When saving as plain CSV, it uses windows-1251.
I just tried to save the French word Résumé along with some Russian text; the bytes written were 52 3F 73 75 6D 3F, where 3F is the ASCII code for a question mark.
When I opened the CSV file, the word had of course become unreadable (R?sum?).

Excel 2010 saves a UTF-16/UCS-2 TSV file if you select File > Save As > Unicode Text (.txt). The extension is forced to ".txt", which you can change to ".tsv".
If you need CSV, you can then convert the TSV file in a text editor like Notepad++, UltraEdit, Crimson Editor etc., replacing the tabs with semicolons, commas or the like (see the sketch below). Note that for reading into a DB table, TSV often works fine already (and it is often easier to read manually).
If you need a different code page like UTF-8, use one of the above-mentioned editors to convert it.
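The conversion can also be scripted. A rough Java sketch, assuming the exported file is UTF-16 with a BOM (which Excel's Unicode Text output is) and that no field contains embedded tabs or semicolons (file names are made up):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TsvToCsv {
    public static void main(String[] args) throws Exception {
        // Java's UTF-16 charset reads the BOM and picks the right byte order.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("export.txt"), StandardCharsets.UTF_16);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Naive tab-to-semicolon swap; quoted fields would need a real CSV writer.
                out.write(line.replace('\t', ';'));
                out.newLine();
            }
        }
    }
}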

cp1250 is used extensively in Microsoft Office documents, including Word and Excel 2003.
http://en.wikipedia.org/wiki/Windows-1250
A simple way to confirm this would be to:
Create a spreadsheet with higher order characters, e.g. "Veszprém" in one of the cells;
Use your favourite scripting language to parse and decode the spreadsheet;
Look at what your script produces when you print out the decoded data.
Example Perl script:
#!perl
use strict;
use warnings;
use Spreadsheet::ParseExcel::Simple;
use Encode qw( decode );

my $file  = "my_spreadsheet.xls";
my $xls   = Spreadsheet::ParseExcel::Simple->read($file);
my $sheet = [ $xls->sheets ]->[0];
while ( $sheet->has_data ) {
    my @data = $sheet->next_row;
    for my $datum (@data) {
        # decode each cell from cp1250 before printing
        print decode( 'cp1250', $datum );
    }
}

While it is true that exporting an Excel file that contains special characters to CSV can be a pain in the ass, there is a simple workaround: simply copy/paste the cells into a Google Docs spreadsheet and then save from there.

You could use this Visual Studio VB.Net code to get the encoding (note that StreamReader can only detect encodings that announce themselves with a byte order mark; anything else falls back to the default):
Dim strEncodingName As String = String.Empty
' True turns on BOM detection; CurrentEncoding is only reliable after reading
Using myStreamRdr As New System.IO.StreamReader(myFileName, True)
    Dim myString As String = myStreamRdr.ReadToEnd()
    strEncodingName = myStreamRdr.CurrentEncoding.EncodingName
End Using

You can create the CSV file using UTF-8 with a BOM (https://en.wikipedia.org/wiki/Byte_order_mark).
The first three bytes are the BOM (0xEF, 0xBB, 0xBF), followed by the UTF-8 content.
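A minimal Java sketch of that approach (file name and contents are made up); Excel sees the three-byte marker and opens the file as UTF-8:

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class WriteCsvWithBom {
    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("output.csv")) {
            // UTF-8 byte order mark
            out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });
            out.write("Name,City\nRésumé,Veszprém\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}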

OOXML files like those that come from Excel 2007 are encoded in UTF-8, according to Wikipedia. I don't know about CSV files, but it stands to reason that they would use the same encoding...

Related

How do I write the £ (GBP) sign in a CSV file from Ruby and read it back correctly in Excel?

When I write a CSV file using Ruby containing the £ sign and open it in Excel, I see ¬£ instead.
My understanding is that Ruby uses UTF-8, but Excel interprets this file using a different encoding (ASCII).
I tried to write a US-ASCII encoded CSV file and guessed the £ encoding in ASCII like this:
csv = CSV.open(filename, 'w:US-ASCII')
csv << "\xA3"
csv.close
but it fails with invalid byte sequence in UTF-8 somewhere deep inside the CSV library.
What am I doing wrong?
Thank you
For sure, Excel is not bound to use ASCII. For instance, I can easily input Japanese characters into an Excel cell, and these are certainly not representable in ASCII.
While Ruby, by default, uses Unicode in its internal representation, every String object carries its own encoding, so you could in theory mix strings with different encodings if you wanted to. In your case, you want to force a certain encoding when writing a file. This can be done either by using the w: output option, as you did, or by passing external_encoding: Encoding::US_ASCII. See here for the names of the constants in Encoding.
I don't think US-ASCII is a good choice for the encoding, simply because there is no pound symbol in the ASCII chart. I would have expected a warning on stderr when trying to write a pound symbol. If you need an 8-bit encoding, ISO-8859-1 should do the job, but my recommendation would be to write UTF-8 and tell Excel to use this encoding when reading the CSV file. Importing UTF-8 has been possible at least since Excel 2007.
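The core point holds in any language; sketched here in Java (since that is what the thread started with): U+00A3 has no mapping in US-ASCII, while ISO-8859-1 maps it to the single byte 0xA3.

import java.nio.charset.StandardCharsets;

public class PoundEncodings {
    public static void main(String[] args) {
        String pound = "\u00A3"; // £
        // US-ASCII has no pound sign, so it cannot be encoded...
        System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode(pound)); // false
        // ...while ISO-8859-1 encodes it as the single byte 0xA3.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(pound)); // true
    }
}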

Exporting from Excel to CSV replaces Japanese characters with ??? even though Windows, Office locale is Japan/Japanese

I am exporting an Excel file (Excel 2016) containing Japanese characters to CSV. (Note: I am not using the provided CSV UTF-8 option.) In the process, all Japanese characters are replaced with '?'.
My Windows/Office locale is Japan/Japanese, and the Windows/Office language and format are all Japanese.
I understand that Excel uses a code page to save the CSV in a particular encoding. My understanding was that this should be Shift-JIS (the default encoding for the Japanese locale). If that is so, why the loss of information and the replacement with '?'?
What encoding does Excel try to save the CSV in?
(FYI: if I try to open a CSV, Excel by default attempts to open it as Shift-JIS 932, as expected.)
Note: I am aware of the workaround of using UTF-8. I am interested in understanding the above behavior, more than in a workaround.
Thanks
The character 縺 appears very often when you take a byte stream containing UTF-8-encoded hiragana and decode it as Shift-JIS (MS932). FYI, CyberChef is handy for this kind of work. You will get the string まとづ…… as output from your string.
So in this situation the CSV seems to have been written in UTF-8, and your text editor (or Excel 2016; how did you open the CSV?) seems to have read it as Shift-JIS (MS932).
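You can reproduce (and reverse) the mojibake in a few lines of Java; this sketches the effect itself, not what Excel does internally:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        Charset ms932 = Charset.forName("windows-31j"); // Java's name for MS932/Shift-JIS
        byte[] utf8 = "まとづ".getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes with the wrong charset yields the telltale 縺 characters.
        String garbled = new String(utf8, ms932);
        System.out.println(garbled); // 縺ｾ縺ｨ縺･
        // Reversing the mistake recovers the original text.
        System.out.println(new String(garbled.getBytes(ms932), StandardCharsets.UTF_8)); // まとづ
    }
}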

Cannot write british pound or euro symbols to CSV file - Nodejs

I'm writing a CSV file which contains text with British pound and euro symbols; however, when I open the file in Excel, I see some rather odd behavior: a weird Â-looking symbol before the pound sign, and quotes instead of the euro symbol. I figured it's probably because Excel doesn't like a file that's UTF-8 encoded.
fs.writeFileAsync("the-file.csv", textContainingForeignCurrency, "utf8");
Does anyone know a way to get around this while creating the file? I don't want the users to have to do anything with excel after downloading the file, I just want them to be able to open the file and see the right symbols.
There shouldn't be any problem with Node writing the symbols to the file; if you open it with a text editor you should see the correct characters.
The problem is with Excel opening UTF-8 CSV files. By default it assumes ANSI encoding, so if the file is in UTF-8 it scrambles the characters. You can open the file correctly with the Text Import Wizard, or prepend a UTF-8 BOM ("\ufeff") as the very first character of the file, as described earlier in this thread, so Excel picks up the encoding on its own.
In general this is a limitation of Excel. The best workaround for you will depend on your OS and Excel version. This is a heavily discussed topic; here are some good reads:
Is it possible to force Excel recognize UTF-8 CSV files automatically?
Which encoding opens CSV files correctly with Excel on both Mac and Windows?

How to manually specify a Byte Order Mark in CSV

I have a CSV that is encoded in Unicode but lacks a byte order mark at the start. As such, Excel (2013) opens it without decoding correctly (I think it assumes ASCII if no BOM is specified...), meaning that certain characters are displayed incorrectly.
From reading around, I have read that a BOM of "\uFEFF" should be entered at the start of the CSV file. I have tried opening the file in a text editor and adding the characters, e.g.
\uFEFF
r1test 1, r1text2, r1text3
r2test 1, r2text2, r2text3
However, this does not solve the problem: the characters "\uFEFF" show up on the first row when I open the file in Excel, rather than being interpreted as a BOM. I am not sure what I am doing wrong, or how the text should be specified so that it is interpreted as a BOM rather than as text in the first row of data.
I have only very limited experience using CSV, and only just heard of a BOM... so I could be implementing this completely wrong!
(For reference, I know that I could specify the encoding via the import-data option within Excel; however, I really want to work out how to get it correctly specified in advance, so that I can just open the CSV. I have several thousand of these files that I am creating and exporting; once I know how to do this 'manually' [i.e. by adding some text at the start of the file], I can configure Python to do it automatically.)
Thanks in advance
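The catch is that the BOM must be the single encoded code point U+FEFF, not the six literal characters \uFEFF typed into the file. A minimal Java sketch using the question's sample rows (the file name is made up):

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class BomWriter {
    public static void main(String[] args) throws Exception {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("bom-test.csv"), StandardCharsets.UTF_8)) {
            w.write('\uFEFF'); // encoded as the bytes EF BB BF, which Excel reads as a BOM
            w.write("r1test 1, r1text2, r1text3\n");
            w.write("r2test 1, r2text2, r2text3\n");
        }
    }
}

Since the question mentions generating the files from Python, the same idea applies there: write the U+FEFF character first, or open the file with the utf-8-sig codec, which writes the BOM for you.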
For someone else wanting to tell Excel to add a BOM: See if you can "Save as Unicode Text".

delete special characters preceding shebang (M-oM-;M-?#!/bin/bash) [duplicate]

I have a CSV file with special accents which I save in Notepad, selecting UTF-8 encoding. When I read the file using Java, it reads the BOM characters too.
So I want to save this file in UTF-8 format without Notepad prepending a BOM.
Otherwise, is there a built-in class in Java that eliminates the BOM characters present at the beginning when reading the contents of a file?
Use Notepad++ - it is free and much better than Notepad. It will save text without a BOM from the Encoding menu: in Notepad++ v6 and older, choose Encoding → Encode in UTF-8 without BOM; in v7 and later, plain UTF-8 is BOM-less and the variant with a BOM is labelled UTF-8-BOM.
When I encountered this problem in Java, I didn't find any library to handle these first three bytes (the BOM). So my advice (a sketch follows this list):
Use PushbackInputStream(in, 3).
Read the first three bytes.
If they are not the BOM (EF BB BF), push them back.
Process the stream as UTF-8.
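A self-contained version of that recipe (the file name is made up):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    // Wraps a stream so that a leading UTF-8 BOM (EF BB BF), if present, is consumed.
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0, r;
        while (n < 3 && (r = pushback.read(head, n, 3 - n)) > 0) {
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: push the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new FileInputStream("input.csv")), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}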
I just learned from this Stack Overflow post, as @martin-geisler points out, that you can save files without the BOM in Windows Notepad by selecting ANSI as the encoding.
I'm assuming that for more advanced uses this won't work, because the resulting file is probably not in the encoding you actually wanted but in ANSI; but I tested and confirmed this works to save a very small .php script without a BOM using only Notepad.
I learned the long, hard way that Windows Notepad is not a true editor, although I'd like to point out for others that, despite this, it is misleadingly offered when you type "editor" on newer Windows machines, at least on one of mine.
I am currently using Emacs and other editors to solve this problem.
Use Notepad++ instead. See my personal blog post on it. From within Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM".
Notepad on Windows 10 version 1903 (May 2019 update) and later versions supports saving to UTF-8 without a BOM. In fact, UTF-8 is the default file format now.
Reference: Windows 10 Notepad is Getting Better UTF-8 Encoding Support
The answer is: not at all. Notepad can't do that.
In Java you can just skip the leading BOM - the first three bytes of the InputStream, or equivalently the first character after decoding - and be done.
You might want to try out Notepad2 or Notepad++. Those Notepad replacements have the option for you to choose whether to output BOM.
As for a Java solution: as far as I know, Java does not strip the BOM from UTF-8 streams automatically. I googled and found Java's UTF-8 and Unicode writing is broken - Use this fix, which might be the solution.
We're using the utility BOMStripperInputStream.java to strip the BOM from our input if present.
