Reading text from file and converting it to UTF32

Reading text from file and converting it to UTF32 - visual-c++

I'm using CSFML 1.6 library (it's multimedia library based on OpenGL). And I live in Poland, here we have special characters like:
ąęźćół
Now I have a text file which consists this characters and CSFML offer function to set UnicodeText on displayed string, it's argument is array of ints.
How can I properly read characters from file and then pass them to this function?
Any help really appreciated.

Judging to sfml-dev, the library accepts either char string in ISO-8859-1, or wchar_t string in UTF16, or there's a possibility to provide a completely own charset.
I suppose, the simplest is to stick with UTF16. Save your text in UTF16, and use the "wstring family" functions (those beginning with 'w', like wcscmp()) to handle it.

Related

How do I write the £ (GBP) sign in a CSV file from Ruby and read it back correctly in Excel?

When I write a CSV file using Ruby containing the £ sign and I open it using Excel I see this symbol instead ¬£.
My understanding is that Ruby uses UTF-8, but Excel interprets this file using a different encoding (ASCII).
I tried to write a US-ASCII encoded CSV file and guessed the £ encoding in ASCII like this:
csv = CSV.open(filename, 'w:US-ASCII')
csv << "\xA3"
csv.close
but it fails with invalid byte sequence in UTF-8 somewhere deep into the CSV library.
What am I doing wrong?
Thank you

For sure, Excel is not bound to use ASCII. For instance, I can easily input japanese characters into an Excel cell, and these are certainly not representable by ASCII.
While Ruby, by default, uses Unicode in its internal representation, every String object incorporates its own encoding, so you could in theory mix strings with different encodings, if you want to. In your case, you want to force a certain encoding when writing a file. This can be done either by using the w: output option, as you did, or by using external_encoding: Encoding::US-ASCII. See here for the names of the constants in Encoding.
I don't think US-ASCII is a good choice for the encoding, simply because there is no pound symbol in the ASCII chart. I would have expected that you get a warning message on stderr, when trying to write a pound symbol. If you need an 8-bit-encoding, ISO-8859-1 should do the job, but my recommendation would be to write UTF-8 and tell Excel to use this encoding when reading the CSV file. The possibility to import UTF exists at least since Excel 2007.

Reading txt file & opening with excel

Using Delphi 2007 & trying to read a txt file with greek characters & opening with Excel, I am not getting greek characters but symbols...Any help?
The text file is created with this code
CSVSTRList.SaveToFile('c:\test\xxx2.txt');
where CSVSTRList is a TStringList.

Looking at your code in your previous question it seems you are taking a stock TStringList and calling SaveToFile. That encodes the text as ANSI. Your Greek characters cannot be encoded as ANSI.
You want to export text using a Unicode encoding. For instance, in a modern Delphi you would use:
StringList.SaveToFile(FileName, Encoding.Unicode);
for UTF-16, or
StringList.SaveToFile(FileName, Encoding.UTF8);
for UTF-8.
I would expect that Excel will understand either of these encodings.
Since you are using a non-Unicode Delphi, things are somewhat more difficult. You'll need to change all you code, every single piece of string handling, to be Unicode aware. So you cannot use the stock string list any more, for example, because it contains 8 bit ANSI strings. The simplest way to do this with legacy Delphi is with the TNT Unicode library.
Or you could take the step of moving to a Unicode Delphi. If you care about international text it is the most sensible option.

How to write excel file with special characters through Perl script?

I am writing Excel file through perl code. When I insert data in XML file and view in any browser, I see correct data with special characters, but when I write the same data in Excel file, it is showing garbage characters.
For eg.:
(word from XML file on browser) Gràcia - (word from Excel file) GrÃ cia
I am using 'Spreadsheet::XLSX' for reading excel and 'Excel::Writer::XLSX' for writing excel.
Also need help in finding the encoding format of excel fields.
Do you have any idea? Thanks in advance.

This seems very much like UTF-8 to iso-8859-1 conversion going wrong - seems like a string that contains UTF-8, but is not marked as being UTF-8, is being passed to $worksheet->write(). Since http://metacpan.org/pod/Excel::Writer::XLSX#UNICODE-IN-EXCEL claims to handle unicode correctly, it seems to be a problem with your input string, not the write method itself.
As you don't post any code, and don't tell us where your strings come from, i can't tell why the strings aren't marked correctly.
You can probably get away with
Encode::_utf8_on($str)
before passing your strings to $worksheet->write(), but this might just as well break other things, if not all of your strings are really utf-8. Basically the answer is "get the utf-8 flag on your strings right when you read them".

How to convert chars (Â or Ã or â€) to ASCII codes while creating word document using C#?

I have string " Single 63”x14” rear window" am parsing this string into HTML and creating a word document applying styles using(System.IO.File.ReadAllText(styleSheet)).
in the document am getting this string as "Single 63â€x14â€ rear window" in C#.
How can I get the correct character to show up in Word?

You would have to find out the incoming encoding of the string
" Single 63”x14” rear window"
And also which encoding the word document allows.
It appears that the encoding characters for those funky quotes are not supported by Word. You could always create a nifty little string parser to search for characters outside the Word encoding range and replace them with either String.Empty, or search for specific supported characters that look similar.
Eg. String.Replace("”","\"");
(this probably wouldn't work without directly manipulating the encoding values, but you haven't provided those so can't give an exact example)

The encoding you are looking at appears to be UTF-8. It's actually probably exactly what you want, you just need to view it using a tool which supports UTF-8, and if you process it and put it on a web page, add the required HTML meta tag so that browsers will display it using the correct encoding.

Troubles with text encoding

I'm having some troubles with text encoding. Parsing a website gives me a Data.Text string
"Project - Fran\195\167ois Dubois",
which I need to write to a file. So I'm using Data.Text.Lazy.Encoding.encodeUtf8 to convert it into a Bytestring. The problem is that this yields garbled output:
"Project - FranÃ§ois Dubois".
What am I missing here?

If you have gotten Fran\195\167ois inside your Data.Text, you already have a UTF-8-encoded François.
That's inconvenient because Data.Text[.Lazy] is supposed to be UTF-16 encoded text, and the two code units 195 and 167 are interpreted as the unicode code points 195 resp. 167 which are 'Ã' resp. '§'. If you UTF-8-encode the text, these are converted to the byte sequences c383 ([195,131]) resp c2a7 ([194,167]).
The most likely way for getting into this situation is that the data you got from the website was UTF-8 encoded, but was interpreted as ISO-8859-1 (Latin 1) encoded (or another 8-bit encoding; 8859-15 is widespread too).
The proper way of handling it is avoiding the situation altogether [that may not be possible, unfortunately].
If the source of your data states its encoding correctly - as a website should - find out the encoding and interpret the data accordingly. If an incorrect encoding is stated, you are of course out of luck, and if no encoding is specified, you have to guess right (the natural guess nowadays is UTF-8, at least for languages using a variant of the Latin alphabet).
If avoiding the situation is not possible, the easiest ways of fixing it are
replacing the occurrences of the offending sequence with the desired one before encoding:
encodeUtf8 $ replace (pack "Fran\195\167ois") (pack "Fran\231ois") contents
assuming everything else is ASCII or inadvertent UTF-8 too, interpret the Text code units as bytes:
Data.ByteString.Lazy.Char8.pack $ Data.Text.Lazy.unpack contents
The former is more efficient, but becomes inconvenient if there are many different misencodings (caused by different accented letters, for example). The latter works only in the assumed situation (no code units above 255 in the Text) and is rather inefficient for long texts.

I am not completely sure if less can show UTF-8 encoded characters properly. GVim can. You can check this link on SO to find out how you can view UTF-8 data in gVim.
And regarding the other issue of being able to pass this to graphviz, I think you need to set the encoding on the command-line as explained in the Graph NonAscii FAQ.
From what you are explaining, I think there are no issues with how the data is being persisted. If you pass the encoding properly to graphviz, I think your problem will be resolved.
P.S: Creating an answer since it is easier to create descriptive links

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Reading text from file and converting it to UTF32 - visual-c++

Related

How do I write the £ (GBP) sign in a CSV file from Ruby and read it back correctly in Excel?

Reading txt file & opening with excel

How to write excel file with special characters through Perl script?

How to convert chars (Â or Ã or â€) to ASCII codes while creating word document using C#?

Troubles with text encoding

Categories

Resources