character recognition issue when parsing xml - python-3.x

When using Python to extract the contents of an XML file and write them to a CSV file, I found that some characters are not recognised and are converted to strange characters: for example, é becomes Ã© and Ü becomes Ãœ.
I suspect this is due to the character encoding. Because the XML's text locale is en-GB, I set NLS_LANG to AL32UTF8 in Python.
I tried applying the decode method to the following code, but it did not work at all.
abstract = i.find('abstract').text
Could anyone shed some light on this?
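The symptom described (é turning into Ã©, Ü into Ãœ) is what one would expect when UTF-8 bytes are read back as Windows-1252. The following is a minimal sketch under assumptions: the XML is taken to be UTF-8, the `records`/`record` element names, the sample text, and the function name `abstracts_to_csv` are hypothetical, and only the `abstract` element comes from the question. It shows that `.text` is already a `str` in Python 3, so there is nothing to decode, and that the encoding matters at the point where the CSV is opened for writing.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the <abstract> element name comes from the question.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<records><record><abstract>café Über</abstract></record></records>"
).encode("utf-8")


def abstracts_to_csv(xml_bytes, out_stream):
    """Parse XML given as bytes and write one abstract per CSV row."""
    # Parsing from bytes lets the parser honour the encoding in the XML declaration.
    root = ET.fromstring(xml_bytes)
    writer = csv.writer(out_stream)
    for record in root.iter("record"):
        node = record.find("abstract")
        # .text is already a Python 3 str, so no decode() call is needed here.
        writer.writerow([node.text if node is not None else ""])


# Reproducing the symptom: UTF-8 bytes decoded as Windows-1252.
mojibake = "é".encode("utf-8").decode("cp1252")

buffer = io.StringIO()
abstracts_to_csv(SAMPLE_XML, buffer)

# When writing to disk, state the encoding explicitly instead of relying on the
# platform default (the file name is a placeholder):
# with open("out.csv", "w", encoding="utf-8", newline="") as f:
#     abstracts_to_csv(SAMPLE_XML, f)
```

If the CSV itself is correct UTF-8 and the characters still look wrong, the program used to view the file is probably assuming a different encoding.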

Related

How do I write the £ (GBP) sign in a CSV file from Ruby and read it back correctly in Excel?

When I write a CSV file using Ruby containing the £ sign and I open it using Excel I see this symbol instead ¬£.
My understanding is that Ruby uses UTF-8, but Excel interprets this file using a different encoding (ASCII).
I tried to write a US-ASCII encoded CSV file and guessed the £ encoding in ASCII like this:
csv = CSV.open(filename, 'w:US-ASCII')
csv << "\xA3"
csv.close
but it fails with invalid byte sequence in UTF-8 somewhere deep into the CSV library.
What am I doing wrong?
Thank you
Excel is certainly not bound to ASCII. For instance, I can easily enter Japanese characters into an Excel cell, and these are certainly not representable in ASCII.
While Ruby, by default, uses Unicode in its internal representation, every String object carries its own encoding, so you could in theory mix strings with different encodings if you wanted to. In your case, you want to force a certain encoding when writing a file. This can be done either by using the w: output option, as you did, or by using external_encoding: Encoding::US_ASCII. See here for the names of the constants in Encoding.
I don't think US-ASCII is a good choice for the encoding, simply because there is no pound symbol in the ASCII chart. I would have expected a warning message on stderr when trying to write a pound symbol. If you need an 8-bit encoding, ISO-8859-1 should do the job, but my recommendation would be to write UTF-8 and tell Excel to use this encoding when reading the CSV file. Importing UTF-encoded files has been possible since at least Excel 2007.
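The byte-level picture behind this answer can be sketched in Python (used here for illustration instead of Ruby). The assumptions are labeled in the comments: that the reading application interprets the file as Mac OS Roman, which would account for the ¬£ in the question, and that a UTF-8 byte order mark is an acceptable hint for Excel. The helper name `csv_bytes_for_excel` is hypothetical.

```python
import csv
import io

POUND = "\u00a3"  # the pound sign

utf8_bytes = POUND.encode("utf-8")         # two bytes: C2 A3
latin1_bytes = POUND.encode("iso-8859-1")  # one byte: A3

# ASCII has no pound sign, so encoding to it fails.
try:
    POUND.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

# Assumption: the reader decodes the UTF-8 bytes as Mac OS Roman.
seen_by_reader = utf8_bytes.decode("mac_roman")


def csv_bytes_for_excel(rows):
    """Serialise rows as CSV in UTF-8 with a leading BOM (hypothetical helper)."""
    text = io.StringIO(newline="")
    csv.writer(text).writerows(rows)
    # "utf-8-sig" prepends the BOM EF BB BF, which Excel can use to detect UTF-8.
    return text.getvalue().encode("utf-8-sig")


payload = csv_bytes_for_excel([["price", POUND + "5"]])
```

Writing `payload` to a file in binary mode produces a CSV that carries its own encoding hint, which avoids guessing at an 8-bit code page.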

Translating text in Microsoft translator removes //

I am trying to translate a code file and I see the following issues with Microsoft Translator.
Translating the text below
// 行番号削除正常終
Gives output
Line number deletion successful end
where the // (double slash) is removed.
Translating the text below
System.out.println("【○】COBOL
Gives output
System.out.println_○○ COBOL
where the special characters are removed and replaced with other characters.
I am reading my file using CP932 encoding and writing using UTF-8.
Please let me know if there are any ideas on how to resolve this issue.
I have tried Google translate and it works well on the same encodings.

python reverse unicode text into readable

I believe I have a problem similar to "how to convert unicode text to utf8 text readable?", but I want a Python 3.7 solution.
I am a complete newbie with some Python experience, so I am trying to write a script that will convert a Unicode file back into the readable text it used to be.
the file is a bookmark file i have recovered using easeusa then i opened the bookmark file and it is writen in unicode something like "&PŽ¾³kÊ
k-7ÄÜÅe–?XBdyÃ8߯r×»Êã¥bÏñ ¥»X§ÈÕÀ¬Zé‚1öÄEdýŽ‹€†.c"
whereas previously it said something like " "checksum": "112d56adbd0caa2b3693bb0442dd16ff",
"roots": {
"bookmark_bar": {
"children":"
FYI, when I click Save As for the bookmark file, the encoding shown is ANSI rather than UTF-8, so maybe it was saved as ANSI. I might be waffling here, but I'm just trying to give you all the information you might need to help me.
I am a newbie who desperately needs help.
This text isn't "Unicode". It's simply gibberish.
This file has been corrupted -- it may have been overwritten with other data before you were able to recover it. It is unlikely to be recoverable.
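One way to check this for yourself, sketched here under assumptions: a bookmarks file is expected to be a JSON object (as the "checksum"/"roots" excerpt in the question suggests), so if the raw bytes do not parse as JSON under any plausible encoding, re-encoding will not bring the text back. The function name `looks_like_bookmarks` and the list of candidate encodings are hypothetical choices.

```python
import json


def looks_like_bookmarks(raw, encodings=("utf-8", "utf-16", "cp1252")):
    """Return the first encoding under which raw parses as a JSON object, else None."""
    for enc in encodings:
        try:
            parsed = json.loads(raw.decode(enc))
        except (UnicodeError, ValueError):
            # Either the bytes are invalid in this encoding or the text is not JSON.
            continue
        if isinstance(parsed, dict):
            return enc
    return None


# Usage with a recovered file (the path is a placeholder):
# with open("Bookmarks", "rb") as f:
#     print(looks_like_bookmarks(f.read()))
```

A result of `None` is consistent with the diagnosis above: the bytes are not a differently encoded copy of the original text.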

Handling print() UnicodeEncodeError

I extract unicode text from a website and do some printing that can contain non-ascii characters. Printing those raises a UnicodeEncodeError. How should I handle this case? I need a solution that requires minimum changes to the existing code.
I tried encoding the strings (by wrapping sys.stdout) with UTF-8, but then I get a TypeError since the original sys.stdout.write() method expects str not bytes.
I don't want the characters to be lost. The user may want to pipe the output into a file that would in this case be UTF8 formatted.
Edit: Setting PYTHONIOENCODING=utf8 can fix the problem, but isn't there a way I can "imitate" this behaviour in Python? Since utf8 doesn't match the Windows console encoding, some characters look strange. But this is by far better than a crash of the program.
Is there any way to format the output from Python so that I don't need to modify environment variables?
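One possible approach, sketched here and not the only option: re-wrap the binary buffer underneath `sys.stdout` in a text layer that encodes to UTF-8. Because the wrapper is a text stream, `print()` keeps passing `str` and the `TypeError` mentioned above does not occur. On Python 3.7 and later, `sys.stdout.reconfigure()` does the same job in place. The helper names `utf8_text_stream` and `force_utf8_stdout` are hypothetical.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        errors="replace",     # never raise UnicodeEncodeError
        line_buffering=True,  # flush after each newline, as a console would
    )


def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 from inside the program."""
    if hasattr(sys.stdout, "reconfigure"):  # available on Python 3.7+
        sys.stdout.reconfigure(encoding="utf-8", errors="replace")
    else:
        sys.stdout = utf8_text_stream(sys.stdout.buffer)


# Call force_utf8_stdout() once at start-up; the existing print() calls stay as they are.
```

As with PYTHONIOENCODING=utf8, a console that uses a different code page may still display some characters oddly, but output piped to a file will be UTF-8.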

How to write excel file with special characters through Perl script?

I am writing an Excel file from Perl code. When I insert the data into an XML file and view it in any browser, I see the correct data with special characters, but when I write the same data to an Excel file, it shows garbage characters.
For example:
(word from XML file on browser) Gràcia - (word from Excel file) Grà cia
I am using 'Spreadsheet::XLSX' for reading excel and 'Excel::Writer::XLSX' for writing excel.
I also need help finding the encoding of Excel fields.
Do you have any idea? Thanks in advance.
This seems very much like UTF-8 to iso-8859-1 conversion going wrong - seems like a string that contains UTF-8, but is not marked as being UTF-8, is being passed to $worksheet->write(). Since http://metacpan.org/pod/Excel::Writer::XLSX#UNICODE-IN-EXCEL claims to handle unicode correctly, it seems to be a problem with your input string, not the write method itself.
As you haven't posted any code or told us where your strings come from, I can't tell why the strings aren't marked correctly.
You can probably get away with
Encode::_utf8_on($str)
before passing your strings to $worksheet->write(), but this might just as well break other things, if not all of your strings are really utf-8. Basically the answer is "get the utf-8 flag on your strings right when you read them".
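The failure mode described here can be reproduced in a few lines of Python (used for illustration instead of Perl; the function names are hypothetical). The sketch also mirrors the caveat above: a blanket repair is only safe when the string really is UTF-8 that was misread, so the repair function leaves anything else unchanged.

```python
def misread_as_latin1(text):
    """Simulate UTF-8 bytes being interpreted as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_if_misread(text):
    """Undo a UTF-8-read-as-ISO-8859-1 mistake; return other strings unchanged."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a misread UTF-8 string, so touching it would break it.
        return text


broken = misread_as_latin1("Gràcia")  # the à becomes two characters: Ã plus a no-break space
fixed = repair_if_misread(broken)
```

Decoding once, at the point where the data is read, avoids the need for this kind of after-the-fact repair.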
