Does Talend Support UTF-8 Encoding for Excel Headers? - excel

I am new to Talend, and I have an Excel sheet with UTF-8 letters in the headers that I want to profile using Talend DQ. I was able to import the list, and in the preview everything is shown correctly; however, when I click Next, all the UTF-8 encoded headers are changed into "Column0, Column1, ...".
Any ideas on how to fix this?
Thanks!

In the advanced options of your Input Excel File component there is a field that allows you to select the encoding you want to use.

The encoding in the Advanced settings applies to the content; the header is based on field names, which must follow Java naming conventions (no accents, no special characters, and so on).
In short, if you need headers with composed UTF-8 characters, don't use the standard "Include header" option; handle the header row yourself.

Related

Microsoft Excel 2016 Unicode (UTF-8) does not work

I searched all around the internet for how to save a CSV file as Unicode (UTF-8), but it still does not work: whenever I save and reopen the file, there are ????? characters instead of the letters that are UTF-8.
Has anyone ever had this issue? How can I solve it?
This has been an annoying shortcoming of Excel for a long time.
A way to work around this issue is to do the following:
Save as... Unicode Text (*.txt). Make sure to keep the extension as txt (or at least not csv). It will be saved with tabs instead of commas separating the columns.
Open the document. You will be prompted with an import wizard.
For File origin, choose 65001: Unicode (UTF-8).
For the rest of the options, choose the common-sense options.
You will have your document back, ready to edit, with the proper Unicode text intact.
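If you need to automate the round trip, a minimal Python sketch along these lines can convert the "Unicode Text" file Excel produces (UTF-16 with a BOM, tab-separated) into a UTF-8 CSV; the file names are just placeholders:

import csv

# Excel's "Unicode Text" export is UTF-16 with a BOM and tab-separated columns.
with open("export.txt", "r", encoding="utf-16", newline="") as src, \
     open("export.csv", "w", encoding="utf-8-sig", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst)  # default comma delimiter
    for row in reader:
        writer.writerow(row)

Writing with utf-8-sig adds a UTF-8 BOM, so Excel reopens the converted file with the right encoding.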

How to enable my python code to read from Arabic content in Excel?

I have two related problems. I'm working on an Arabic dataset using Excel. I think that Excel somehow reads the contents as ؟؟؟؟؟, because when I try to replace the character '؟' with '?' it replaces the whole text in the sheet, but when I replace or search for another letter it works.
Second, I'm trying to edit the sheet using Python, but I'm unable to write Arabic letters (I'm using jGRASP). For example, when I write the letter 'ل' it appears as 0644, and when I run the code this message appears: "Error encoding text. Unable to encode text using charset windows-1252".
0644 is the character code of the character in hex. jGRASP displays that when the font does not contain the character. You can use "Settings" > "Font" in jGRASP to choose a CSD font that contains the characters you need. Finding one that has those characters and also works well as a coding font might not be possible, so you may need to switch between two fonts.
jGRASP uses the system character encoding for loading and saving files by default. Windows-1252 is an 8-bit encoding used on English language Windows systems. You can use "File" > "Save As" to save the file with the same name but a different encoding (charset). Once you do that, jGRASP will remember it (per file) and you can load and save normally. Alternately, you can use "Settings" > "CSD Windows Settings" > "Workspace" and change the "Default Charset" setting to make the default something other than the system default.
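On the Python side, it also helps to be explicit about encodings rather than relying on the platform default (windows-1252 here). A minimal sketch, assuming the data ends up in a plain text/CSV file (the file name is a placeholder):

# -*- coding: utf-8 -*-
# Source-encoding declaration: required for Arabic literals in Python 2, optional in Python 3.

text = "مثال"  # Arabic sample string

# Always pass an explicit encoding instead of relying on the system default.
# "utf-8-sig" writes a BOM so Excel detects the encoding when opening the file.
with open("arabic_output.csv", "w", encoding="utf-8-sig") as f:
    f.write(text + "\n")

with open("arabic_output.csv", "r", encoding="utf-8-sig") as f:
    print(f.read())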

How to manually specify Byte Order Mark in CSV

I have a CSV that is encoded in Unicode, but it lacks a byte order mark at the start. As such, Excel (2013) opens it without detecting the encoding correctly (I think it assumes ASCII if no BOM is specified...), meaning that certain characters are displayed incorrectly.
From reading around, I gather that a BOM of "\uFEFF" should be placed at the start of the CSV file. I have tried opening it in a text editor and adding the characters, e.g.
\uFEFF
r1test 1, r1text2, r1text3
r2test 1, r2text2, r2text3
However, this does not solve the problem: the characters "\uFEFF" show up on the first row when I open the file in Excel, rather than being interpreted as a BOM. I am not sure what I am doing wrong, or what format the text should take so that it is interpreted as a BOM rather than as text in the first row of the data.
I have only very limited experience using CSV, and have only just heard of a BOM... so I could be implementing this completely wrong!
(For reference, I know that I could specify the encoding if I use the import data option within Excel... however, I really want to work out how to get it correctly specified in advance so that I can just open the CSV. I have several thousand of these files that I am creating and exporting; once I know how to do this 'manually' [i.e. by adding some text at the start of the file], I can configure it to be done automatically in Python.)
Thanks in advance
For someone else wanting to tell Excel to add a BOM: See if you can "Save as Unicode Text".
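Note that "\uFEFF" typed as six literal characters is not a BOM; the file has to start with the actual encoded bytes of that character (EF BB BF for UTF-8). Since the files are generated from Python anyway, a minimal sketch using the utf-8-sig codec, which writes those bytes for you (the file name and rows are placeholders):

import csv

rows = [
    ["r1test 1", "r1text2", "r1text3"],
    ["r2test 1", "r2text2", "r2text3"],
]

# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF), so Excel detects the
# encoding when the file is double-clicked.
with open("data.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)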

sep=";" statement breaks utf8 BOM in CSV file which is generated by XSL

I'm currently developing a CSV export with XSLT, and the CSV file will be opened with Excel 99% of the time in my case, so I have to consider Excel's behavior.
My first problem was German special characters in the CSV. Even though the CSV encoding is UTF-8, Excel cannot open a UTF-8 CSV file properly: the special characters turn into weird symbols. I found a solution for this problem: I just added 3 additional bytes (EF BB BF, a.k.a. the BOM header) at the beginning of the content bytes, because the UTF-8 BOM is a way of telling Excel 'hey, it is UTF-8, open it properly'. Problem solved!
My second problem was the separator. The default separator can be a comma or a semicolon depending on the region; I think it is a semicolon in Germany and a comma in the UK. So, in order to prevent this problem, I had to add the line below:
<xsl:text>sep=;</xsl:text>
or
<xsl:text>sep=,</xsl:text>
(The separator is not hard-coded.)
But the problem I cannot find any solution for is this: if you add "sep=;" or "sep=," at the beginning of the file while the CSV is generated with a UTF-8 BOM, the BOM no longer helps Excel show the special characters properly! And I'm sure the BOM bytes are always at the beginning of the byte array. This screenshot is from MS Excel on Mac OS X:
The first 3 symbols belong to the BOM header.
Have you ever had like this problem or do you have any suggestions? Thank you.
Edit:
I've attached the screenshots.
a. With BOM and <xsl:text>sep=;</xsl:text>
b. Just with BOM
The Java code:
// Write the bytes
ServletOutputStream out = resp.getOutputStream();
if (contentType.toString().equals("CSV")) {
    // The three bytes below are the UTF-8 BOM, indicating that the content is UTF-8.
    out.write(239);
    out.write(187);
    out.write(191);
}
out.write(bytes); // Content bytes, in this case the output of the XSL
The XSL code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:template match="/">
        <xsl:text>sep=;</xsl:text>
        <table>
        ...
        </table>
    </xsl:template>
You are right, there is no way in Excel 2007 to get it to load both the encoding and the separator correctly across different locales when someone double-clicks a CSV file.
It seems that when you specify sep= after the BOM, Excel forgets that the BOM has told it the file is UTF-8.
You have to specify the BOM because in certain locales Excel does not detect the separator. For instance in Danish, the default separator is ;. If you output tab- or comma-separated text it does not detect the separator, and in other locales if you separate with semicolons it doesn't load. You can test this by changing the locale format in Windows settings; Excel then picks this up.
From this question:
Is it possible to force Excel recognize UTF-8 CSV files automatically?
and its answers, it seems the only way is to use UTF-16 LE encoding with a BOM.
Note also that, as per http://wiki.scn.sap.com/wiki/display/ABAP/CSV+tests+of+encoding+and+column+separator?original_fqdn=wiki.sdn.sap.com
it seems that if you use UTF-16 LE with tab separators then it works.
I've wondered whether Excel reads sep=; and then re-calls the method to get the CSV text and loses the BOM. I've tried giving incorrect text and I can't find any workaround that tells Excel to take both the sep and the encoding.
This is the result of my testing with Excel 2013.
If you're stuck with UTF-8, there is a workaround which consists of BOM + data + sep=;
Input (written with UTF8 encoding)
\ufeffSome;Header;Columns
Wîth;Fàncÿ;Stûff
sep=;
Output
|Some|Header|Columns|
|Wîth|Fàncÿ |Stûff |
|sep=| | |
The issue with this solution is that while Excel interprets sep=; properly, it displays sep= (yes, it swallows the ;) in the first column of the last row.
However, if you can write the file as UTF-16 LE, then there is an actual solution: use the \t delimiter without specifying sep and Excel will play ball.
Input (written with UTF16-LE encoding)
\ufeffSome;Header;Columns
Wîth;Fàncÿ;Stûff
Output
|Some|Header|Columns|
|Wîth|Fàncÿ |Stûff |
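If you are generating the files programmatically, a minimal Python sketch of the UTF-16 LE + tab approach might look like this (the file name and rows are placeholders):

import csv

rows = [
    ["Some", "Header", "Columns"],
    ["Wîth", "Fàncÿ", "Stûff"],
]

with open("data.csv", "w", encoding="utf-16-le", newline="") as f:
    f.write("\ufeff")  # BOM, encoded as FF FE in UTF-16 LE
    csv.writer(f, delimiter="\t").writerows(rows)  # tabs, no sep= line needed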
I can't write comments yet, but I'd like to address @Pier-Luc Gendreau's solution. While it is possible to open it in European Excel (which by default uses ; as the delimiter) and have full UTF-16 LE support, it is apparently not possible to use this technique when you specify sep=,.
The issue with this solution is that while Excel interprets sep=; properly, it displays sep= (yes, it swallows the ;) in the first column of the last row.
For me it did not work if I specified a delimiter which wasn't the default one (; in my case), so I assume Excel did not interpret the last line correctly and swallowed the last delimiter because this is the default behavior.
Please correct me if I'm wrong

What charset does Microsoft Excel use when saving files?

I have a Java app which reads CSV files which have been created in Excel (e.g. 2007). Does anyone know what charset MS Excel uses to save these files in?
I would have guessed either:
windows-1255 (Cp1255)
ISO-8859-1
UTF8
but I am unable to decode extended chars (e.g. French accented letters) using any of these charsets.
From memory, Excel uses the machine-specific ANSI encoding. So this would be Windows-1252 for an EN-US installation, 1251 for Russian, etc.
CSV files could be in any format, depending on what encoding option was specified during the export from Excel: (Save Dialog, Tools Button, Web Options Item, Encoding Tab)
UPDATE: Excel (including Office 2013) doesn't actually respect the web options selected in the "save as..." dialog, so this is a bug of some sort. I just use OpenOffice Calc now to open my XLSX files and export them as CSV files (edit filter settings, choose UTF-8 encoding).
Waking up this old thread... It is now 2017, and Excel is still unable to save a simple spreadsheet to CSV while preserving the original encoding... Just amazing.
Luckily Google Docs lives in the right century. The solution for me is just to open the spreadsheet using Google Docs, then download it back as CSV. The result is a correctly encoded CSV file (with all strings encoded in UTF-8).
I had a similar problem last week. I received a number of CSV files with varying encodings. Before importing them into the database I used the chardet library to automatically sniff out the correct encoding.
Chardet is a port of Mozilla's character detection engine and, if the sample size is large enough (one accented character will not do), works really well.
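A minimal sketch of that approach (the file name is a placeholder; requires the chardet package):

import chardet

# Sniff the encoding from the raw bytes, then decode with the detected charset.
with open("export.csv", "rb") as f:
    raw = f.read()

result = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(result["encoding"])
print(result, text[:100])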
The Russian edition offers CSV, CSV (Macintosh) and CSV (DOS).
When saving in plain CSV, it uses windows-1251.
I just tried to save the French word Résumé along with Russian text; it saved it in hex as 52 3F 73 75 6D 3F, 3F being the ASCII code for the question mark.
When I opened the CSV file, the word, of course, became unreadable (R?sum?).
Excel 2010 saves a UTF-16/UCS-2 TSV file if you select File > Save As > Unicode Text (.txt). It is forcibly suffixed ".txt", which you can change to ".tsv".
If you need CSV, you can then convert the TSV file in a text editor like Notepad++, UltraEdit, Crimson Editor, etc., replacing tabs with semicolons, commas or the like. Note that e.g. for reading into a DB table, TSV often works fine already (and it is often easier to read manually).
If you need a different code page like UTF-8, use one of the above-mentioned editors to convert it.
cp1250 is used extensively in Microsoft Office documents, including Word and Excel 2003.
http://en.wikipedia.org/wiki/Windows-1250
A simple way to confirm this would be to:
Create a spreadsheet with higher order characters, e.g. "Veszprém" in one of the cells;
Use your favourite scripting language to parse and decode the spreadsheet;
Look at what your script produces when you print out the decoded data.
Example Perl script:
#!perl
use strict;
use Spreadsheet::ParseExcel::Simple;
use Encode qw( decode );

my $file  = "my_spreadsheet.xls";
my $xls   = Spreadsheet::ParseExcel::Simple->read( $file );
my $sheet = [ $xls->sheets ]->[0];

while ($sheet->has_data) {
    my @data = $sheet->next_row;
    for my $datum ( @data ) {
        print decode( 'cp1250', $datum );
    }
}
While it is true that exporting an Excel file that contains special characters to CSV can be a pain in the ass, there is a simple workaround: simply copy/paste the cells into a Google Docs spreadsheet and then save from there.
You could use this Visual Studio VB.Net code to get the encoding:
' The second StreamReader argument (True) tells it to detect the encoding from the byte order mark.
Dim strEncryptionType As String = String.Empty
Dim myStreamRdr As System.IO.StreamReader = New System.IO.StreamReader(myFileName, True)
' CurrentEncoding is only reliable after the first read.
Dim myString As String = myStreamRdr.ReadToEnd()
strEncryptionType = myStreamRdr.CurrentEncoding.EncodingName
You can create a CSV file using UTF-8 + BOM encoding (https://en.wikipedia.org/wiki/Byte_order_mark).
The first three bytes are the BOM (0xEF, 0xBB, 0xBF), followed by the UTF-8 content.
OOXML files like those that come from Excel 2007 are encoded in UTF-8, according to Wikipedia. I don't know about CSV files, but it stands to reason they would use the same format...
