I am using the Gmail API to send email from a Node.js API. I am rendering the raw body using the following utility function:
function makeEmailBody(to, from, subject, message, threadId) {
message += '[DEFAULT EMOJI 😆]'
const str = [
'Content-Type: text/html; charset="UTF-8"\n',
'MIME-Version: 1.0\n',
'Content-Transfer-Encoding: 7bit\n',
'to: ',
to,
'\n',
'from: ',
from.name,
' <',
from.address,
'>',
'\n',
'subject: ',
subject + '[DEFAULT EMOJI 😆]',
'\n\n',
message
].join('');
return Buffer.alloc(str.length, str).toString('base64').replace(/\+/g, '-').replace(/\//g, '_');
}
The code I have used to send the email is:
const r = await gmail.users.messages.send({
auth,
userId: "me",
requestBody: {
raw: makeEmailBody(
thread.send_to,
{
address: user.from_email,
name: user.from_name,
},
campaign.subject,
campaign.template,
thread.id
),
},
});
The emojis are being rendered in the body but not in the subject. [Screenshot omitted: left is Gmail in Google Chrome on desktop, right is the Gmail app on mobile.]
Your utility may benefit from several improvements (we will get to the emoji problem):
First, make it RFC 822 compliant by separating the lines with CRLF (\r\n).
Be careful with the Content-Transfer-Encoding header: setting it to 7bit is easiest, but may not be generic enough (quoted-printable is likely the better option).
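For illustration only, a sketch of your builder with those fixes applied (CRLF separators, and Buffer.from instead of Buffer.alloc(str.length, str), which silently truncates multibyte content because str.length counts UTF-16 code units, not bytes):

const str = [
    'Content-Type: text/html; charset="UTF-8"',
    'MIME-Version: 1.0',
    'Content-Transfer-Encoding: 7bit',
    `to: ${to}`,
    `from: ${from.name} <${from.address}>`,
    // the subject still needs the encoded-word treatment described below
    `subject: ${subject}`,
    '',
    message,
].join('\r\n');

return Buffer.from(str).toString('base64').replace(/\+/g, '-').replace(/\//g, '_');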
Now to the emoji problem:
You need to make sure the subject is encoded separately from the body for the emoji to pass through. According to RFC 1342 [0], you can use either Base64 or quoted-printable encoding on a subject to create an encoded-word, described as:
"=" "?" charset "?" encoding "?" encoded-text "?" "="
Where encoding is either Q for quoted-printable or B for Base64.
Note that the resulting encoded string must not be longer than 76 chars: that reserves 75 chars for the encoded-word itself and 1 for the separator (to use multiple encoded-words, separate them with a space or a newline [CRLF works as well]).
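For example, a subject of "Hello 😆" could travel as the following encoded-word (the space becomes an underscore, and =F0=9F=98=86 are the UTF-8 bytes of the emoji):

=?utf-8?Q?Hello_=F0=9F=98=86?=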
So, set your charset to utf-8 and the encoding to Q, encode the actual subject with something like the snippet below [1], and you are halfway done:
/**
 * @summary RFC 1342 header encoding
 * @see {@link https://www.rfc-editor.org/rfc/rfc1342}
 */
class HeaderEncoder {
    /**
     * @summary encode using Q encoding
     */
    static quotedPrintable(str: string, encoding = "utf-8") {
        let encoded = "";
        for (const char of str) {
            // each code point becomes "=XX" with two uppercase hex digits;
            // assumes every code point fits into a single byte (see below)
            const cp = char.codePointAt(0)!;
            encoded += `=${cp.toString(16).toUpperCase().padStart(2, "0")}`;
        }
        return `=?${encoding}?Q?${encoded}?=`;
    }
}
Now, the fun part. I was working on a GAS (Google Apps Script) project that had to leverage the Gmail API directly (after all, that is what the client library does under the hood). Even with the correct encoding, an attempt to pass something like "Beep! \u{1F697}" resulted in an incorrectly parsed subject.
It turns out you need to leverage String.fromCodePoint operating on a byte array or buffer built from the original string. This snippet should suffice (don't forget to apply it only to multibyte chars):
// spreads the UTF-8 bytes of the string into individual code points,
// so every "character" of the result maps to exactly one byte
const escape = (u: string) => String.fromCodePoint(...Buffer.from(u));
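Putting the two together (a quick usage sketch; the byte values assume UTF-8):

const subject = "Beep! \u{1F697}";
const encoded = HeaderEncoder.quotedPrintable(escape(subject));
// => "=?utf-8?Q?=42=65=65=70=21=20=F0=9F=9A=97?="

Use the result as the value of the subject header, and the emoji should survive the round trip.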
[0] This is the initial RFC; it would be more appropriate to refer to RFC 2047. Also, see RFC 2231 for including locale info in the header (and some more obscure extensions).
[1] If the char falls into the range of printable US-ASCII, it can be left as-is, but due to an extensive set of rules, I recommend sticking to the 48-57 (numbers), 65-90 (uppercase) and 97-122 (lowercase) ranges.
I'm receiving Polish text from a SOAP action that has the Polish diacritics encoded as XML entities, but as far as I can tell, they are not encoded in UTF-8 but in ISO-8859-1, and I'm struggling to decode them properly in Node.js.
Example text: Borek FaÅ&#130;Ä&#153;cki
Expected decoding result: Borek Fałęcki
Current result: Borek FaÅÄcki (the numeric entities decode to invisible control characters)
While I achieved the correct result in PHP using the following code:
echo html_entity_decode('Borek FaÅ&#130;Ä&#153;cki', ENT_QUOTES | ENT_SUBSTITUTE | ENT_XML1, 'ISO-8859-1');
I'm having no luck doing the same in Node.js. There aren't many complete packages to help with decoding html/xml entities; I have used both entities and html-entities, but they provide the same results, and none of them seem to have any charset settings.
const { decode, encode } = require('html-entities');
const entities = require('entities');
const txt = 'Borek FaÅ&#130;Ä&#153;cki';
console.log('html-entities decode', decode(txt));
console.log('utf8-encoding', encode('Borek Fałęcki', {
mode: 'nonAsciiPrintable',
numeric: 'decimal',
level: 'xml',
}));
console.log('entities decode', entities.decodeXML(txt));
Output:
html-entities decode Borek FaÅÄcki
utf8-encoding Borek Fa&#322;&#281;cki
entities decode Borek FaÅÄcki
As we can see, when encoded with UTF-8 there is a single entity for each character:
&#322; = ł
&#281; = ę
With ISO-8859-1, there are 2 entities per character. I have no more ideas how to achieve the same decoding result as in PHP. If there were no entities, I could just convert the encoding to UTF-8, but with the entities I have no idea how to do it properly. I cannot get the other side to send me UTF-8, since this is a closed old protocol that I have no control over.
The correct XML encoding of Borek Fałęcki is Borek Fa&#322;&#281;cki. The SOAP action XML that you receive is wrongly encoded.
However, the following expression converts it as needed:
Buffer.concat(
    "Borek FaÅ&#130;Ä&#153;cki"
        .match(/[^&]+|&#\d+;/g)
        .map(c => c[0] === "&"
            ? Buffer.of(Number(c.substring(2, c.length - 1))) // "&#130;" -> raw byte 0x82
            : Buffer.from(c, "latin1")) // literal chars -> their single-byte values
).toString()
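Wrapped into a reusable helper (a sketch; decodeBrokenEntities is a hypothetical name, and it assumes every numeric entity stands for a single raw byte, as in this feed):

const decodeBrokenEntities = (s) =>
    Buffer.concat(
        s.match(/[^&]+|&#\d+;/g)
            .map(c => c[0] === "&"
                ? Buffer.of(Number(c.substring(2, c.length - 1))) // numeric entity -> one byte
                : Buffer.from(c, "latin1")) // keep literal chars as single bytes
    ).toString();

console.log(decodeBrokenEntities("Borek FaÅ&#130;Ä&#153;cki")); // Borek Fałęcki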
The problem came up when getting the result of a web service returning JSON with Greek characters in it. Actually, it is the city of Mykonos. The challenge is that whatever encoding or conversion I use, it is always displayed as ΜΎΚΟ\xCE?ΟΣ, but it should show ΜΎΚΟΝΟΣ.
With Powershell I was able to verify, that the web service is returning the correct characters.
I narrowed the problem down to the point where the byte array gets converted to a String in Groovy. Below is code that reproduces the issue I have. myUTF8String holds, hex-encoded, the byte array I get from URLConnection.content.text. The UTF-8 byte sequence to look at is 0xce, 0x9d. After converting this to a string and back to a byte array, the byte sequence for that character is 0xce, 0x3f. The code below will show the difference at position 9 between the original byte array and the one from the converted string. For the test I'm using Groovy Console 4.0.6.
Any hints on this one?
import java.nio.charset.StandardCharsets;
def myUTF8String = "ce9cce8ece9ace9fce9dce9fcea3"
def bytes = myUTF8String.decodeHex();
content = new String(bytes).getBytes()
for ( i = 0; i < content.length; i++ ) {
if ( bytes[i] != content[i] ) {
println "Different... at pos " + i
hex = Long.toUnsignedString( bytes[i], 16).toUpperCase()
print hex.substring(hex.length()-2,hex.length()) + " != "
hex = Long.toUnsignedString( content[i], 16).toUpperCase()
println hex.substring(hex.length()-2,hex.length())
}
}
Thanks a lot
Andreas
You have to specify the charset name when building a String from bytes, otherwise the default Java charset will be used - and it's not necessarily UTF-8.
Charset.defaultCharset() - Returns the default charset of this Java virtual machine.
The same problem applies to String.getBytes() - use the charset parameter to get the correct byte sequence.
Just change the following line in your code and the issue will disappear:
content = new String(bytes, "UTF-8").getBytes("UTF-8")
As an option, you can set the default charset for the whole JVM instance with the following command line parameter:
java -Dfile.encoding=UTF-8 <your application>
but be careful, because it will affect the whole JVM instance!
https://docs.oracle.com/en/java/javase/19/intl/supported-encodings.html#GUID-DC83E43D-52F6-41D9-8F16-318F3F39D54F
When there is no Chinese character, PHP and Node output the same result, but when there is a Chinese character, the output of PHP is correct and the output of Node is not.
const crypto = require('crypto');
function encodeDesECB(textToEncode, keyString) {
var key = Buffer.from(keyString.substring(0, 8), 'utf8'); // Buffer.from replaces the deprecated new Buffer()
var cipher = crypto.createCipheriv('des-ecb', key, '');
cipher.setAutoPadding(true);
var c = cipher.update(textToEncode, 'utf8', 'base64');
c += cipher.final('base64');
return c;
}
console.log(encodeDesECB(`{"key":"test"}`, 'MIGfMA0G'))
console.log(encodeDesECB(`{"key":"测试"}`, 'MIGfMA0G'))
node output
6RQdIBxccCUFE+cXPODJzg==
6RQdIBxccCWXTmivfit9AOfoJRziuDf4
php output
6RQdIBxccCUFE+cXPODJzg==
6RQdIBxccCXFCRVbubGaolfSr4q5iUgw
The problem is not the encryption, but a different JSON serialization of the plaintext.
In the PHP code, json_encode() converts the characters to Unicode escape sequences, i.e. the encoding returns {"key":"\u6d4b\u8bd5"}. In the NodeJS code, however, {"key":"测试"} is used as-is.
This means that different plaintexts are encrypted in the end. Therefore, for the same ciphertext, a byte-level identical plaintext must be used.
If Unicode escape sequences are to be applied in the NodeJS code (as in the PHP code), an appropriate conversion is necessary. For this, the jsesc package can be used:
const jsesc = require('jsesc');
...
console.log(encodeDesECB(jsesc(`{\"key\":\"测试\"}`, {'lowercaseHex': true}), 'MIGfMA0G')); // 6RQdIBxccCXFCRVbubGaolfSr4q5iUgw
now returns the result of the posted PHP code.
If the Unicode characters are to be used unescaped in the PHP code (as in the NodeJS code), an appropriate conversion is necessary. For this, the flag JSON_UNESCAPED_UNICODE can be set in json_encode():
$data = json_encode($data, JSON_UNESCAPED_UNICODE); // 6RQdIBxccCWXTmivfit9AOfoJRziuDf4
now returns the result of the posted NodeJS code.
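If you would rather avoid a dependency, the same escaping can be approximated in plain Node (a sketch; escapeUnicode is a hypothetical helper that escapes every non-ASCII UTF-16 code unit with lowercase hex, mirroring json_encode()'s default handling of non-ASCII text; other json_encode() quirks, such as escaped slashes, are not covered):

const escapeUnicode = (s) =>
    s.replace(/[\u0080-\uffff]/g, (ch) =>
        "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0"));

console.log(escapeUnicode(JSON.stringify({ key: "测试" }))); // {"key":"\u6d4b\u8bd5"}
// encrypting this string should match the posted PHP output
console.log(encodeDesECB(escapeUnicode(JSON.stringify({ key: "测试" })), 'MIGfMA0G'));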
I'm making a GET request in Node.js to a URL which returns an object I'm trying to parse. However, I am getting an "unexpected token" error. I have played with different encodings, as well as converting the response body into a string and removing those tokens, but nothing works. Setting the encoding to null also did not solve my problem.
Below is the response body I'm getting:
��[{"unit":"EN15","BOX":"150027","CD":"12 - Natural Gas Leak","Levl":"1","StrName":"1000 N Madison Ave","IncNo":"2020102317","Address":"1036 N Madison Ave","CrossSt":"W 5TH ST/NECHES ST"},{"unit":"EN23","BOX":"230004","CD":"44 - Welfare Check","Levl":"1","StrName":"S Lancaster Rd / E Overton Rd","IncNo":"2020102314","Address":"S Lancaster Rd / E Overton Rd","CrossSt":""}]
Those are the headers which go with my request
headers: {'Content-Type': 'text/plain; charset=utf-8'}
Here is how I'm parsing the response body:
const data = JSON.parse(response.body)
Any help would be greatly appreciated!
UPDATE: I had to do this with the response body to make it work:
const data = response.body.replace(/^\uFFFD/, '').replace(/^\uFFFD/, '').replace(/\0/g, '')
You are probably getting the byte order mark (BOM) of the UTF-8 string. The simplest workaround would be to remove it before parsing.
const data = JSON.parse(response.body.toString('utf8').replace(/^\uFFFD/, ''));
Update: Your first 2 characters are the Unicode REPLACEMENT CHARACTER (the BOM bytes mangled by decoding). In order to remove them, match the \uFFFD character.
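For completeness, a slightly more defensive variant (a sketch combining your update with a real-BOM check; stripJunk is a hypothetical helper name):

const stripJunk = (s) => s
    .replace(/^\uFEFF/, '') // a genuine BOM decoded correctly
    .replace(/^\uFFFD+/, '') // BOM bytes mangled into replacement characters
    .replace(/\0/g, ''); // stray NUL bytes

const data = JSON.parse(stripJunk(response.body.toString('utf8')));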
I have a string in swift:
let flag = "Cattì ò"
I am trying to convert the UTF-8 symbols.
I have tried using
stringByRemovingPercentEncoding
but nothing changes. How can I convert the symbols properly?
Welcome to the encoding guessing game! It looks like somewhere along the pathway, your string didn't get decoded with the correct code page. Here's one way to guess it:
let flag = "Cattì ò"
let encodings = [NSASCIIStringEncoding,
NSNEXTSTEPStringEncoding,
NSJapaneseEUCStringEncoding,
NSUTF8StringEncoding,
NSISOLatin1StringEncoding,
NSSymbolStringEncoding,
NSNonLossyASCIIStringEncoding,
NSShiftJISStringEncoding,
NSISOLatin2StringEncoding,
NSUnicodeStringEncoding,
NSWindowsCP1251StringEncoding,
NSWindowsCP1252StringEncoding,
NSWindowsCP1253StringEncoding,
NSWindowsCP1254StringEncoding,
NSWindowsCP1250StringEncoding,
NSISO2022JPStringEncoding,
NSMacOSRomanStringEncoding,
NSUTF16StringEncoding,
NSUTF16BigEndianStringEncoding,
NSUTF16LittleEndianStringEncoding,
NSUTF32StringEncoding,
NSUTF32BigEndianStringEncoding,
NSUTF32LittleEndianStringEncoding]
for encoding in encodings {
if let bytes = flag.cStringUsingEncoding(encoding),
flag_utf8 = String(CString: bytes, encoding: NSUTF8StringEncoding) {
print("\(encoding): \(flag_utf8)")
}
}
The array contains all the encodings that Cocoa supports.
From the results, it seems like your string was encoded in NSISOLatin1StringEncoding (a.k.a. ISO-8859-1), the default encoding for HTML 4.01. This gives Cattì ò in UTF-8, which doesn't exactly match your desired result but is the closest among all code pages.
Other good candidates are NSWindowsCP1252StringEncoding and NSWindowsCP1254StringEncoding, so I'd suggest you check with other strings.