I have a string in Swift:
let flag = "Cattì ò"
I am trying to convert the UTF-8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?
Welcome to the encoding guessing game! It looks like somewhere along the way, your string didn't get decoded with the correct code page. Here's one way to guess it:
import Foundation

let flag = "Cattì ò"

// All the string encodings that Cocoa supports
let encodings: [String.Encoding] = [
    .ascii, .nextstep, .japaneseEUC, .utf8,
    .isoLatin1, .symbol, .nonLossyASCII, .shiftJIS,
    .isoLatin2, .unicode, .windowsCP1251, .windowsCP1252,
    .windowsCP1253, .windowsCP1254, .windowsCP1250,
    .iso2022JP, .macOSRoman,
    .utf16, .utf16BigEndian, .utf16LittleEndian,
    .utf32, .utf32BigEndian, .utf32LittleEndian
]

for encoding in encodings {
    // Re-encode the garbled string with the candidate code page,
    // then try to reinterpret those bytes as UTF-8
    if let bytes = flag.data(using: encoding),
       let decoded = String(data: bytes, encoding: .utf8) {
        print("\(encoding): \(decoded)")
    }
}
The array contains all the encodings that Cocoa supports.
From the results, it seems your string was encoded with .isoLatin1 (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which doesn't exactly match your desired result, but it is the closest among all the code pages.
Other good candidates are .windowsCP1252 and .windowsCP1254, so I'd suggest you check against some other strings as well.
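For illustration, the same round trip is easy to reproduce in Node, where Buffer conversions make the mechanism explicit (a minimal sketch, assuming the original bytes were UTF-8 that got misread as ISO-8859-1):

// take a correctly decoded string and mis-decode its UTF-8 bytes as Latin-1...
const mojibake = Buffer.from("Cattì ò", "utf8").toString("latin1");
console.log(mojibake); // "CattÃ¬ Ã²" - the classic mojibake pattern

// ...then repair it by reversing the mistake: re-encode as Latin-1
// and decode the resulting bytes as UTF-8
const repaired = Buffer.from(mojibake, "latin1").toString("utf8");
console.log(repaired); // "Cattì ò"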
Related
I'm receiving Polish text from a SOAP action. The Polish diacritics are encoded as XML entities, but as far as I can tell they are encoded in ISO-8859-1 rather than UTF-8, and I'm struggling to decode them properly in NodeJS.
Example text: Borek Fa&#197;&#130;&#196;&#153;cki
Expected decoding result: Borek Fałęcki
Current result: Borek FaÅÄcki
I achieved the correct result in PHP using the following code:
echo html_entity_decode('Borek Fa&#197;&#130;&#196;&#153;cki', ENT_QUOTES | ENT_SUBSTITUTE | ENT_XML1, 'ISO-8859-1');
I'm having no luck doing the same in NodeJS. There aren't many complete packages that help with decoding HTML/XML entities. I have used both entities and html-entities, but they produce the same results, and neither seems to have any charset settings.
const { decode, encode } = require('html-entities');
const entities = require('entities');
const txt = 'Borek Fa&#197;&#130;&#196;&#153;cki';
console.log('html-entities decode', decode(txt));
console.log('utf8-encoding', encode('Borek Fałęcki', {
    mode: 'nonAsciiPrintable',
    numeric: 'decimal',
    level: 'xml',
}));
console.log('entities decode', entities.decodeXML(txt));
Output:
html-entities decode Borek FaÅÄcki
utf8-encoding Borek Fa&#322;&#281;cki
entities decode Borek FaÅÄcki
As we can see, when the text is encoded with UTF-8 there is a single entity for each character:
&#322; = ł
&#281; = ę
With ISO-8859-1, there are two entities per character (one per UTF-8 byte). I have no more ideas on how to achieve the same decoding result as in PHP. If there were no entities, I could just convert the encoding to UTF-8, but with entities mixed in I don't know how to do it properly. I cannot get the other side to send UTF-8, since this is a closed old protocol that I have no control over.
The correct XML encoding of Borek Fałęcki is Borek Fa&#322;&#281;cki. The SOAP XML that you receive is incorrectly encoded.
However, the following expression converts it as needed:
Buffer.concat(
    "Borek Fa&#197;&#130;&#196;&#153;cki"
        // split into plain-text runs and numeric character references
        .match(/[^&]+|&#\d+;/g)
        // a numeric reference becomes a single raw byte (its ISO-8859-1 value);
        // plain text is copied through as UTF-8 bytes
        .map(c => c[0] === "&"
            ? Buffer.of(Number(c.substring(2, c.length - 1)))
            : Buffer.from(c))
).toString() // reinterpret the reassembled bytes as UTF-8
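Wrapped up as a reusable helper, the same idea might look like this (a sketch; decodeLatin1Entities is a hypothetical name, and it assumes every numeric reference is below 256, i.e. a single ISO-8859-1 byte):

// Hypothetical helper: numeric XML entities carry ISO-8859-1 byte values,
// so turn each entity into one raw byte and decode the whole thing as UTF-8.
function decodeLatin1Entities(input) {
    const parts = input.match(/[^&]+|&#\d+;/g) || [];
    const buffers = parts.map(part =>
        part[0] === "&"
            ? Buffer.of(Number(part.slice(2, -1))) // one entity -> one byte
            : Buffer.from(part));                  // plain text -> UTF-8 bytes
    return Buffer.concat(buffers).toString("utf8");
}

console.log(decodeLatin1Entities("Borek Fa&#197;&#130;&#196;&#153;cki"));
// Borek Fałęcki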
The problem came up when getting the result of a web service that returns JSON with Greek characters in it. Actually, it is the city of Mykonos. The challenge is that whatever encoding or conversion I use, it is always displayed as ΜΎΚΟ\xCE?ΟΣ, but it should show ΜΎΚΟΝΟΣ.
With Powershell I was able to verify, that the web service is returning the correct characters.
I narrowed the problem down to where the byte array gets converted to a String in Groovy. Below is code that reproduces the issue. myUTF8String holds, as hex, the bytes I get from URLConnection.content.text. The UTF-8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array, the byte sequence for that character is 0xce, 0x3f. The code below shows the difference at position 9 between the original byte array and the one from the converted string. For this test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;

def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"   // "ΜΎΚΟΝΟΣ" as UTF-8 bytes in hex
def bytes = myUTF8String.decodeHex();

// round-trip through a String using the platform default charset
content = new String(bytes).getBytes()

for ( i = 0; i < content.length; i++ ) {
    if ( bytes[i] != content[i] ) {
        println "Different... at pos " + i
        hex = Long.toUnsignedString( bytes[i], 16 ).toUpperCase()
        print hex.substring( hex.length()-2, hex.length() ) + " != "
        hex = Long.toUnsignedString( content[i], 16 ).toUpperCase()
        println hex.substring( hex.length()-2, hex.length() )
    }
}
Thanks a lot
Andreas
You have to specify the charset name when building a String from bytes; otherwise the default Java charset will be used, and it is not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem exists with String.getBytes() - use the charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command-line parameter:
java -Dfile.encoding=UTF-8 <your application>
but be careful, because it will affect the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
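For comparison, the same class of round-trip bug is easy to reproduce in Node (a sketch using the question's byte sequence; Node's lossy 'ascii' decoding stands in for a wrong default charset here):

// "ΜΎΚΟΝΟΣ" as UTF-8 bytes, as in the question
const bytes = Buffer.from("ce9cce8ece9ace9fce9dce9fcea3", "hex");

// decoding with the wrong charset and re-encoding corrupts the bytes...
const corrupted = Buffer.from(bytes.toString("ascii"), "ascii");
// ...while an explicit, correct charset round-trips losslessly
const intact = Buffer.from(bytes.toString("utf8"), "utf8");

console.log(bytes.equals(corrupted)); // false
console.log(bytes.equals(intact));    // true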
When there is no Chinese character, PHP and Node output the same result, but when there is a Chinese character, the PHP output is correct and the Node output is not.
const crypto = require('crypto');

function encodeDesECB(textToEncode, keyString) {
    // DES-ECB uses an 8-byte key; take the first 8 characters of the key string
    const key = Buffer.from(keyString.substring(0, 8), 'utf8');
    const cipher = crypto.createCipheriv('des-ecb', key, '');
    cipher.setAutoPadding(true);
    let c = cipher.update(textToEncode, 'utf8', 'base64');
    c += cipher.final('base64');
    return c;
}
console.log(encodeDesECB(`{"key":"test"}`, 'MIGfMA0G'))
console.log(encodeDesECB(`{"key":"测试"}`, 'MIGfMA0G'))
node output
6RQdIBxccCUFE+cXPODJzg==
6RQdIBxccCWXTmivfit9AOfoJRziuDf4
php output
6RQdIBxccCUFE+cXPODJzg==
6RQdIBxccCXFCRVbubGaolfSr4q5iUgw
The problem is not the encryption, but a different JSON serialization of the plaintext.
In the PHP code, json_encode() converts non-ASCII characters to Unicode escape sequences, i.e. it returns {"key":"\u6d4b\u8bd5"}. The NodeJS code, however, uses the literal form {"key":"测试"}.
This means that different plaintexts are encrypted in the end. To get the same ciphertext, the plaintexts must be identical at the byte level.
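The difference is easy to see by comparing the two serializations at the byte level (a quick sketch; the escaped form is what PHP's json_encode() produces by default):

const literal = '{"key":"测试"}';            // what the NodeJS code encrypts
const escaped = '{"key":"\\u6d4b\\u8bd5"}';  // what the PHP code encrypts

console.log(Buffer.from(literal, 'utf8').length); // 16
console.log(Buffer.from(escaped, 'utf8').length); // 22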
If Unicode escape sequences are to be applied in the NodeJS code (as in the PHP code), an appropriate conversion is necessary. For this the jsesc package can be used:
const jsesc = require('jsesc');
...
console.log(encodeDesECB(jsesc(`{\"key\":\"测试\"}`, {'lowercaseHex': true}), 'MIGfMA0G')); // 6RQdIBxccCXFCRVbubGaolfSr4q5iUgw
now returns the result of the posted PHP code.
If the Unicode characters are to be used unescaped in the PHP code (as in the NodeJS code), an appropriate conversion is necessary. For this, the JSON_UNESCAPED_UNICODE flag can be set in json_encode():
$data = json_encode($data, JSON_UNESCAPED_UNICODE); // 6RQdIBxccCWXTmivfit9AOfoJRziuDf4
now returns the result of the posted NodeJS code.
I need to convert a String to a byte array by using the CP500 encoding.
I tried this line:
const byteArray = Buffer.from(someString, "cp500");
Which led to:
TypeError [ERR_UNKNOWN_ENCODING]: Unknown encoding: cp500
I googled "node cp500" and looked at this answer but I wasn't able to find any information pointing to cp500 support in node/javascript.
In addition, I can't find any mention of a plugin that supports this specific encoding.
Is there way to get a buffer of bytes from a string in node.js with the cp500 encoding?
I used the codepage package that Xaqron pointed to in a comment.
I had to import it as:
const codepage: typeof import('codepage').default = require('codepage');
Then, I used the package's encode function as follows in order to encode my string:
codepage.utils.encode(500, somestring, 'arr');
The first argument, 500, corresponds to the target code page.
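For completeness, a round-trip sketch (the encode call mirrors the one above; the symmetric utils.decode call is an assumption based on the package's documented API):

const codepage = require('codepage');

const someString = 'HELLO';
// encode to CP500 (EBCDIC); 'arr' returns an array of byte values
const encoded = codepage.utils.encode(500, someString, 'arr');
console.log(encoded); // [ 200, 197, 211, 211, 214 ], i.e. 0xC8 0xC5 0xD3 0xD3 0xD6

// decode back to a JS string
const decoded = codepage.utils.decode(500, encoded);
console.log(decoded === someString); // true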
I'm using an API and it's returning something like this for other language text:
=?UTF 8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF 8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF 8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF 8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=
Is this a common format? How would I go about converting this to a regular string in golang?
Golang usually handles multiple languages well, but I'm not sure how to go about converting this.
Since Go 1.5 you can use mime.WordDecoder.DecodeHeader:
package main

import (
	"fmt"
	"mime"
)

func main() {
	dec := new(mime.WordDecoder)
	header, err := dec.DecodeHeader("=?UTF-8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF-8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF-8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF-8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=")
	if err != nil {
		panic(err)
	}
	fmt.Println(header)
	// Output: لخطوات التي تجمع بين حفظ القرآن الكريم وفهمه مما أملاه العلامة عبد الله الغديان.pdf
}
If you are using an older version of Go, you can use my replacement library: https://github.com/alexcesaro/quotedprintable
Apparently your API is returning data encoded in RFC 2047 format. Basically, this defines the following:
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
Which means your charset is UTF-8 (very handy, since this is Go's native character set), and your encoding is Base64. The text you have to decode is the one between the "B?" and the "?=". So all you have to do is take that text and call:
base64.StdEncoding.DecodeString(text)
to get the original UTF-8 string.
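For comparison, the same extraction can be sketched outside Go as well; here is a minimal Node version (assuming a single encoded-word that uses the B encoding and a UTF-8 charset, and ignoring the Q encoding entirely):

const word = "=?UTF 8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?=";

// encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
const match = word.match(/^=\?(.+)\?B\?(.+)\?=$/i);
const decoded = Buffer.from(match[2], "base64").toString("utf8");
console.log(decoded); // the original UTF-8 text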
There is a decodeRFC2047Word() function in the net/mail package of the Go stdlib, supporting encodings B and Q and charsets UTF-8, US-ASCII and ISO-8859-1. Unfortunately it's not exported, but you're free to take as much inspiration from it as you need ;)
BTW: I just noticed the charset in your example strings is UTF 8, which is a bit odd, since the official name of the encoding is UTF-8.