Unknown encoding CP500 - node.js

I need to convert a String to a byte array by using the CP500 encoding.
I tried this line:
const byteArray = Buffer.from(someString, "cp500");
Which led to:
TypeError [ERR_UNKNOWN_ENCODING]: Unknown encoding: cp500
I googled "node cp500" and looked through existing answers, but I wasn't able to find any information pointing to CP500 support in Node/JavaScript.
In addition, I can't find any mention of a plugin that supports this specific encoding.
Is there a way to get a buffer of bytes from a string in Node.js with the CP500 encoding?

I used the codepage package that Xaqron pointed to in a comment.
I had to import it as:
const codepage: typeof import('codepage').default = require('codepage');
Then I used the package's encode function as follows to encode my string:
codepage.utils.encode(500, someString, 'arr');
The first argument, 500, corresponds to the target CP500 encoding.
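Putting it together, a minimal sketch (assuming the codepage package from npm, installed with npm install codepage):

const codepage = require('codepage');

const someString = 'HELLO';
// 500 is the codepage number for CP500 (EBCDIC International);
// 'arr' asks encode() for a byte array instead of a binary string.
const byteArray = codepage.utils.encode(500, someString, 'arr');
console.log(byteArray); // CP500 bytes, e.g. 0xC8 0xC5 0xD3 0xD3 0xD6 for 'HELLO'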

Related

Decode XML/HTML entities from iso-8859-1 charset in NodeJS

I'm receiving Polish text from a SOAP action that has the Polish diacritics encoded as XML entities, but as far as I can tell, they are not encoded in UTF-8 but ISO-8859-1, and I'm struggling to decode them properly in NodeJS.
Example text: Borek Fa&#197;&#130;&#196;&#153;cki
Expected decoding result: Borek Fałęcki
Current result: Borek FaÅ‚Ä™cki
While I achieved the correct result in PHP using following code:
echo html_entity_decode('Borek Fa&#197;&#130;&#196;&#153;cki', ENT_QUOTES | ENT_SUBSTITUTE | ENT_XML1, 'ISO-8859-1');
I'm having no luck doing the same in NodeJS. There aren't many complete packages to help with decoding HTML/XML entities; I have used both entities and html-entities, but they produce the same results, and neither of them seems to have any charset settings.
const { decode, encode } = require('html-entities');
const entities = require('entities');

const txt = 'Borek Fa&#197;&#130;&#196;&#153;cki';

console.log('html-entities decode', decode(txt));
console.log('utf8-encoding', encode('Borek Fałęcki', {
    mode: 'nonAsciiPrintable',
    numeric: 'decimal',
    level: 'xml',
}));
console.log('entities decode', entities.decodeXML(txt));
Output:
html-entities decode Borek FaÅ‚Ä™cki
utf8-encoding Borek Fa&#322;&#281;cki
entities decode Borek FaÅ‚Ä™cki
As we can see, when encoded with UTF-8 there is a single entity for each character:
ł = &#322;
ę = &#281;
With ISO-8859-1, there are 2 entities per character: ł arrives as &#197;&#130; and ę as &#196;&#153;. I have no more ideas how to achieve the same decoding result as in PHP. If there were no entities, I could just convert the encoding to UTF-8, but with entities I have no idea how to do it properly. I cannot get the other side to send me UTF-8, since this is a closed old protocol that I have no control over.
The correct XML encoding of Borek Fałęcki is Borek Fa&#322;&#281;cki. The SOAP action XML that you receive is wrongly encoded.
However, the following expression converts it as needed:
Buffer.concat(
    "Borek Fa&#197;&#130;&#196;&#153;cki"
        .match(/[^&]+|&#\d+;/g)
        .map(c => c[0] === "&"
            ? Buffer.of(Number(c.substring(2, c.length - 1)))
            : Buffer.from(c))
).toString()
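The trick works because each numeric entity stands for a single raw byte of UTF-8 text, so concatenating those bytes with the plain-text chunks and decoding the whole buffer as UTF-8 restores the original string. For reuse, the expression can be wrapped in a small helper (a sketch; decodeByteEntities is an illustrative name, not from the original answer):

function decodeByteEntities(input) {
    return Buffer.concat(
        input
            .match(/[^&]+|&#\d+;/g)                      // plain chunks or numeric entities
            .map(chunk => chunk[0] === '&'
                ? Buffer.of(Number(chunk.slice(2, -1)))  // '&#197;' -> byte 0xC5
                : Buffer.from(chunk))                    // plain text -> UTF-8 bytes
    ).toString('utf8');
}

console.log(decodeByteEntities('Borek Fa&#197;&#130;&#196;&#153;cki'));
// -> Borek Fałęcki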

Trouble in decoding the Path of the Blob file in Azure Search

I have set up Azure Search for blob storage, and since the path of the file is a key property, it is encoded in Base64 format.
While searching the index, I need to decode the path and display it in the front end. But when I try to do that, in a few scenarios it throws an error.
int mod4 = base64EncodedData.Length % 4;
if (mod4 > 0)
{
    base64EncodedData += new string('=', 4 - mod4);
}
var base64EncodedBytes = System.Convert.FromBase64String(base64EncodedData);
return System.Text.Encoding.ASCII.GetString(base64EncodedBytes);
Please let me know what is the correct way to do it.
Thanks.
Refer to the Base64Encode and Base64Decode mapping functions - the encoding details are documented there.
In particular, if you're using .NET, you should use the HttpServerUtility.UrlTokenDecode method, with UTF-8 encoding rather than ASCII.
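If you need the same decoding outside .NET, here is a rough Node.js sketch of the equivalent logic (urlTokenDecode is an illustrative name), assuming the key was produced in the HttpServerUtility.UrlTokenEncode format: URL-safe Base64 where '+' and '/' become '-' and '_' and the trailing '=' padding is replaced by a digit giving the padding length:

function urlTokenDecode(token) {
    const padCount = Number(token[token.length - 1]);       // trailing padding-count digit
    const base64 = token.slice(0, -1)                       // drop the digit
        .replace(/-/g, '+')
        .replace(/_/g, '/')
        + '='.repeat(padCount);                             // restore the '=' padding
    return Buffer.from(base64, 'base64').toString('utf8');  // UTF-8, not ASCII
}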

Decode UTF8 symbols

I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?
Welcome to the encoding guessing game! It looks like somewhere along the way, your string was decoded with the wrong code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
    NSNEXTSTEPStringEncoding,
    NSJapaneseEUCStringEncoding,
    NSUTF8StringEncoding,
    NSISOLatin1StringEncoding,
    NSSymbolStringEncoding,
    NSNonLossyASCIIStringEncoding,
    NSShiftJISStringEncoding,
    NSISOLatin2StringEncoding,
    NSUnicodeStringEncoding,
    NSWindowsCP1251StringEncoding,
    NSWindowsCP1252StringEncoding,
    NSWindowsCP1253StringEncoding,
    NSWindowsCP1254StringEncoding,
    NSWindowsCP1250StringEncoding,
    NSISO2022JPStringEncoding,
    NSMacOSRomanStringEncoding,
    NSUTF16StringEncoding,
    NSUTF16BigEndianStringEncoding,
    NSUTF16LittleEndianStringEncoding,
    NSUTF32StringEncoding,
    NSUTF32BigEndianStringEncoding,
    NSUTF32LittleEndianStringEncoding]

for encoding in encodings {
    if let bytes = flag.cStringUsingEncoding(encoding),
       flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
        print("\(encoding): \(flag_utf8)")
    }
}
The array contains all the encodings that Cocoa supports.
From the results, it seems your string was encoded in NSISOLatin1StringEncoding (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which doesn't exactly match your desired result but is the closest among all the code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding, so I'd suggest checking with other strings.
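The same guessing trick can be played in Node.js, for instance with the iconv-lite package (a sketch; the bytes below are just UTF-8 'Cattì ò' misread one byte at a time):

const iconv = require('iconv-lite');

// Raw bytes as received from the wire.
const raw = Buffer.from([0x43, 0x61, 0x74, 0x74, 0xc3, 0xac, 0x20, 0xc3, 0xb2]);

for (const enc of ['utf8', 'latin1', 'win1252', 'win1254', 'macroman']) {
    console.log(`${enc}: ${iconv.decode(raw, enc)}`);
}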

Converting "=?UTF 8?.." (RFC 2047) to a regular string in golang

I'm using an API and it's returning something like this for other language text:
=?UTF 8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF 8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF 8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF 8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=
Is this a common format? How would I go about converting this to a regular string in golang?
Golang usually handles multiple languages well, but I'm not sure how to go about converting this.
Since Go 1.5 you can use mime.WordDecoder.DecodeHeader:
package main

import (
    "fmt"
    "mime"
)

func main() {
    dec := new(mime.WordDecoder)
    header, err := dec.DecodeHeader("=?UTF-8?B?2KfZhNiu2LfZiNin2Kog2KfZhNiq2Yog2KrYrNmF2Lkg2KjZitmG?= =?UTF-8?B?INit2YHYuCDYp9mE2YLYsdin2ZPZhiDYp9mE2YPYsdmK2YUg2YjZgQ==?= =?UTF-8?B?2YfZhdmHINmF2YXYpyDYp9mU2YXZhNin2Ycg2KfZhNi52YTYp9mF?= =?UTF-8?B?2Kkg2LnYqNivINin2YTZhNmHINin2YTYutiv2YrYp9mGLnBkZg==?=")
    if err != nil {
        panic(err)
    }
    fmt.Println(header)
    // Output: لخطوات التي تجمع بين حفظ القرآن الكريم وفهمه مما أملاه العلامة عبد الله الغديان.pdf
}
If you are using an older version of Go, you can use my replacement library: https://github.com/alexcesaro/quotedprintable
Apparently your API is returning data encoded in RFC 2047 format. Basically, this defines the following:
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
This means your charset is UTF-8 (very handy, since this is Go's native character set) and your encoding is Base64. The text you have to decode is the part between the "B?" and the "?=". So all you have to do is take that text and call:
base64.StdEncoding.DecodeString(text)
to get the original UTF-8 string.
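For comparison, decoding a single encoded-word by hand takes only a few lines in Node.js as well (a sketch assuming the B/Base64 encoding and UTF-8 charset from your example; decodeEncodedWord is an illustrative name):

function decodeEncodedWord(word) {
    // '=?charset?B?encoded-text?=' per RFC 2047
    const m = word.match(/^=\?([^?]+)\?B\?([^?]*)\?=$/i);
    if (!m) return word;                                  // not a B-encoded word
    return Buffer.from(m[2], 'base64').toString('utf8');  // assumes UTF-8 charset
}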
There is a decodeRFC2047Word() function in the net/mail package of the Go stdlib, supporting encodings B and Q and charsets UTF-8, US-ASCII and ISO-8859-1. Unfortunately it's not exported, but you're free to take as much inspiration from it as you need ;)
BTW: I just noticed the charset in your example strings is UTF 8, which is a bit odd, since the official name of the encoding is UTF-8.

In node.js, how do I validate UTF8 data in a buffer? [duplicate]

I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores any decoding errors in its Buffer handling, silently replacing any non-utf8 characters with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the non-utf8 input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad input errors for you.
Example which doesn't fail but really ought to:
> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true
'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.
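A quick REPL check of that claim:

> new Buffer('\ufffd', 'utf8')
<Buffer ef bf bd>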
Digging a little deeper, it looks like this stems from node just deferring to v8's strings and that those in turn have the above behaviour, v8 not having any external world full of foreign-encoded data.
Are there node modules or other means that let me catch UTF-8 decode errors, preferably with context about where the error was discovered in the input string or buffer?
I hope you solved the problem over the years; I had a similar one and eventually solved it with this ugly trick:
function isValidUTF8(buf) {
    return Buffer.compare(new Buffer(buf.toString(), 'utf8'), buf) === 0;
}
which converts the buffer back and forth and checks that it stays the same.
The 'utf8' encoding can be omitted.
Then we have:
> isValidUTF8(new Buffer('this is valid, 指事字 eè we hope','utf8'))
true
> isValidUTF8(new Buffer([128]))
false
> isValidUTF8(new Buffer('\ufffd'))
true
where the '\ufffd' character is correctly considered valid utf8.
UPDATE: now this works in JXcore, too
From node 8.3 on, you can use util.TextDecoder to solve this cleanly:
const util = require('util')
const td = new util.TextDecoder('utf8', {fatal:true})
td.decode(Buffer.from('foo')) // works!
td.decode(Buffer.from([ 128 ], 'binary')) // throws TypeError
This will also work in some browsers by using TextDecoder in the global namespace.
As Josh C. said above: "npmjs.org/package/encoding"
From the npm website: "encoding is a simple wrapper around node-iconv and iconv-lite to convert strings from one encoding to another."
Download:
$ npm install encoding
Example Usage
var encoding = require('encoding');
var result = encoding.convert(new Buffer([ 128 ], 'binary'), "utf8");
console.log(result); // <Buffer 80>
Visit the site: npm - encoding
