NodeJS RTF ANSI Find and Replace Words With Special Chars - node.js

I have a find and replace script that works no problem when the words don't have any special characters. However, there will be a lot of times where there will be special characters since it's finding names. As of now this is breaking the script.
The script looks for {<some-text>} and attempts to replace the contents (as well as remove the braces).
Example:
text.rtf
Here's a name with special char {Kotouč}
script.ts
import * as fs from "fs";
// Ingest the rtf file.
const content: string = fs.readFileSync("./text.rtf", "utf8");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
// It correctly identifies the targeted text.
const currMatch: string = matches[i];
const isRtfMetadata: boolean = currMatch.endsWith(";}");
if (isRtfMetadata) {
continue;
}
// Here I need a way to escape `plainText` string so that it matches the source.
console.log("currMatch::", currMatch);
console.log("currMatch === plainText::", currMatch === plainText);
if (currMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS!");
console.log("newContent:", newContent);
}
}
output
content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here's a name with special char \{Kotou\uc0\u269 \}.}
currMatch:: {Kotou\uc0\u269 \}
currMatch === plainText:: false
It looks like ANSI escaping, and I've tried using jsesc but that produces a different string, {Kotou\u010D} instead of what the document produces {Kotou\uc0\u269 \}.
How can I dynamically escape the plainText string variable so that it matches what is found in the document?

What I needed was to deepen my knowledge on rtf formatting as well as general text encoding.
The raw RTF text read from the file gives us a few hints:
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600...
This part of the rtf file metadata tells us a few things.
It is RTF version 1 (\rtf1). The encoding is ANSI (\ansi), and \ansicpg1252 names the specific code page, 1252, also known as Windows-1252 or CP-1252, which is:
...a single-byte character encoding of the Latin alphabet
(source)
The valuable piece of information here is that the document uses the Latin alphabet; this will be used later.
Knowing the specific RTF version used I stumbled upon the RTF 1.5 Spec
A quick search of that spec for one of the escape sequences I was looking into revealed that \uc0 is an RTF-specific control word: \ucN states how many bytes of fallback text follow each \uN Unicode character, so \uc0 means none. Knowing that, I could parse what I was really after, \u269. Now I knew it was Unicode and had a good hunch that \u269 stood for Unicode character code 269 (decimal). So I looked that up...
The \u269 (char code 269) shows up on this page to confirm. Now I know the character set and what needs to be done to get the equivalent plain (unescaped) text, and there's a basic SO post I used here to get the function started.
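The decoding step described above can be sketched in a few lines. This is only an illustration, not the script that follows: decodeRtfUnicode is a hypothetical helper name, and it assumes the file uses \uc0 (no fallback bytes after each \uN), as the sample document does. RTF stores \uN as a signed 16-bit decimal, so negative values are shifted back into the 0 to 65535 range.

```typescript
// Hypothetical helper: turn RTF \uN escapes (optionally preceded by \ucN) into characters.
// Assumes \uc0, i.e. no fallback bytes follow the \uN control word.
function decodeRtfUnicode(rtf: string): string {
  // The single space after a control word is its delimiter, so it is consumed too.
  return rtf.replace(/(?:\\uc\d+)?\\u(-?\d+) ?/g, (_whole: string, digits: string) => {
    const n: number = Number(digits);
    // \uN is a signed 16-bit value; negative numbers wrap around.
    return String.fromCharCode(n < 0 ? n + 65536 : n);
  });
}

console.log(decodeRtfUnicode("\\{Kotou\\uc0\\u269 \\}")); // \{Kotouč\}
```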
Using all this knowledge I was able to piece it together from there. Here's the full corrected script and its output:
script.ts
import * as fs from "fs";
// Match RTF unicode control sequence: http://www.biblioscape.com/rtf15_spec.htm
const unicodeControlReg: RegExp = /\\uc0\\u/g;
// Extracts the unicode character from an escape sequence with handling for rtf.
const matchEscapedChars: RegExp = /\\uc0\\u(\d{2,6})|\\u(\d{2,6})/g;
/**
* Util function to strip junk characters from string for comparison.
* @param {string} str
* @returns {string}
*/
const cleanupRtfStr = (str: string): string => {
return str
.replace(/\s/g, "")
.replace(/\\/g, "");
};
/**
* Detects escaped unicode and looks up the character by that code.
* @param {string} str
* @returns {string}
*/
const unescapeString = (str: string): string => {
const unescaped = str.replace(matchEscapedChars, (cc: string) => {
const stripped: string = cc.replace(unicodeControlReg, "");
const charCode: number = Number(stripped);
// See unicode character codes here:
// https://unicodelookup.com/#latin/11
return String.fromCharCode(charCode);
});
// Whitespace and backslashes are stripped separately by cleanupRtfStr.
return unescaped;
};
// Ingest the rtf file.
const content: string = fs.readFileSync("./src/TEST.rtf", "binary");
console.log("content::\n", content);
// The string we are looking to match in file text.
const plainText: string = "{Kotouč}";
// Look for all text that matches the pattern `{TEXT_HERE}`.
const anyMatchPattern: RegExp = /{(.*?)}/gi;
const matches: string[] = content.match(anyMatchPattern) || [];
const matchesLen: number = matches.length;
for (let i: number = 0; i < matchesLen; i++) {
const currMatch: string = matches[i];
const isRtfMetadata: boolean = currMatch.endsWith(";}");
if (isRtfMetadata) {
continue;
}
if (currMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS!");
console.log("\n\nnewContent:", newContent);
break;
}
const unescapedMatch: string = unescapeString(currMatch);
const cleanedMatch: string = cleanupRtfStr(unescapedMatch);
if (cleanedMatch === plainText) {
const newContent: string = content.replace(currMatch, "IT_WORKS_UNESCAPED!");
console.log("\n\nnewContent:", newContent);
break;
}
}
output
content::
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \{Kotou\uc0\u269 \}}
newContent: {\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Here\'92s a name with special char \IT_WORKS_UNESCAPED!}
Hopefully that helps others who aren't familiar with character encoding/escaping and its uses in RTF-formatted documents!
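The question originally asked for the opposite direction: escaping plainText so that it matches the document. A rough sketch of that follows, assuming the writer emits every non-ASCII character as \uc0\uN followed by a space, as in the sample line above. escapeForRtf is a hypothetical helper. Real RTF writers may choose other escapes (the sample itself uses \'92 for the apostrophe), so comparing unescaped text as in the script above is the more forgiving route.

```typescript
// Hypothetical helper: escape a plain string the way the sample RTF line is escaped.
// Assumption: non-ASCII characters are written as \uc0\uN plus a delimiting space.
function escapeForRtf(plain: string): string {
  let out: string = "";
  for (let i = 0; i < plain.length; i++) {
    const ch: string = plain.charAt(i);
    const code: number = plain.charCodeAt(i);
    if (ch === "\\" || ch === "{" || ch === "}") {
      // Backslash and braces are RTF syntax and need a leading backslash.
      out += "\\" + ch;
    } else if (code > 127) {
      // \uN takes a signed 16-bit decimal value.
      const signed: number = code > 32767 ? code - 65536 : code;
      out += "\\uc0\\u" + signed + " ";
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(escapeForRtf("{Kotouč}")); // \{Kotou\uc0\u269 \}
```

The escaped string could then be passed to content.replace() directly, which works only as long as the document really uses this escaping style.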

Related

Issue concatenating two strings containing '&' in dart

I have code like this:
// Language = Dart
var someVariable = 'Hello';
var someOtherVariable = 'World';
var str = 'somedomain?x=${someVariable}&y=${someOtherVariable}';
return str;
// Expected:
// somedomain?x=Hello&y=World;
// Actual
// somedomain?x=Hello
If I replace the & character with any letter, the strings concatenate successfully. What am I doing wrong?
This is the actual code which I used in FlutterFlow, and am having issues with:
Future<String> getEventUrlFromReference(BuildContext context, DocumentReference? eventReference) async {
var userId = currentUser?.uid as String;
return "https://somedomain.com/event?eventReference=${eventReference?.id}" + "&invitedBy="+userId;
}
// result: https://somedomain.com/event?eventReference=referencevalue
This was a string encoding issue. I was using the result of my function as the body text in sms://<number>?&body=<string_containing_&_character>. The text appended to the SMS body is truncated at the & character, and I mistakenly assumed it was a string concatenation issue.
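One way to avoid that truncation is to percent-encode the value before placing it in the body parameter. The sketch below shows the idea in TypeScript with encodeURIComponent; in Dart the counterpart is Uri.encodeComponent. The phone number, event reference and user id are placeholder values, not taken from the original code.

```typescript
// Placeholder values standing in for the event reference and user id.
const eventUrl: string =
  "https://somedomain.com/event?eventReference=referencevalue&invitedBy=someUserId";

// Percent-encode the whole URL before using it as the body value,
// so its "&" becomes "%26" and no longer terminates the body parameter.
const encodedBody: string = encodeURIComponent(eventUrl);
const smsUri: string = "sms://5550100?&body=" + encodedBody;

console.log(smsUri);
```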

parameter from package.json script (Encoding problem)

https://nodejs.org/docs/latest/api/process.html#processargv
https://www.golinuxcloud.com/pass-arguments-to-npm-script/
I pass a parameter when invoking a script in package.json, as follows:
--pathToFile=./ESMM/Parametrização_Dezembro_PS1_2022.xlsx
In the code, I retrieve that parameter from the arguments:
const value = process.argv.find( element => element.startsWith( `--pathToFile=` ) );
const pathToFile=value.replace( `--pathToFile=` , '' );
The string obtained seems to be in the wrong format/encoding:
./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx
I tried converting to latin1 (other past issues were fixed with this encoding)
const latin1Buffer = buffer.transcode(Buffer.from(pathToFile), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");
but still don't get the string in the correct encoding:
./ESMM/Parametriza?º?úo_Dezembro_PS1_2022.xlsx
My package.json is in UTF-8.
My current locale is (chcp): Active code page: 850
OS: Windows
This seems to be related to:
https://code.visualstudio.com/docs/editor/tasks#_changing-the-encoding-for-a-task-output
vs code, how to change encoding for terminal triggered by "build task"
https://pt.stackoverflow.com/questions/148543/como-consertar-erro-de-acentua%C3%A7%C3%A3o-do-cmd
Get argv raw bytes in Node.js
I will try those configurations.
const min = parseInt("0xD800",16), max = parseInt("0xDFFF",16);
console.log(min);//55296
console.log(max);//57343
let textFiltered = "",specialChars = 0;
for(let charAux of pathToFile){
const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
console.log(hexChar)
const intChar = parseInt(hexChar,16);
if(hexChar.length > 2){
//if(intChar>min && intChar<max){
//console.log(Buffer.from(charAux, 'utf8').toString('hex'))
specialChars++;
console.log(`specialChars(${specialChars}): ${hexChar}`);
}else{
textFiltered += String.fromCharCode(intChar);
}
}
console.log(textFiltered); //normal characters
./ESMM/Parametrizao_Dezembro_PS1_2022.xlsx
console.log(`specialChars(${specialChars}): ${hexChar}`); //special characters
specialChars(1): e2949c
specialChars(2): c2ba
specialChars(3): e2949c
specialChars(4): c3ba
The hex value e2949c seems to mark a special character, since it repeats; ideally 0xc2ba should convert to "ç" and 0xc3ba to "ã". I am still trying to figure that out.
Each Unicode code point can be written in a string as \u{xxxxxx}, where xxxxxx represents 1–6 hex digits.
@JosefZ indicated this, but for Python. In my case I am going to use a direct conversion, since the parameter will always contain the keyword "Parametrização".
The problem in this case is that my package.json and my script are correctly encoded as UTF-8, as stated by @tripleee (thanks for the help provided), but process.argv returns a string[] that is basically UTF-16. So my solution is to deal with the ├, which in hex is "e2949c", and retrieve the correct characters:
const UTF8_Character = "e2949c" //├
// For these cases, use this lookup object, which holds the correctly encoded characters
const personalized_encoding = {
"c2ba": "ç",
"c3ba": "ã"
}
let textFiltered = "",specialChars = 0;
for(let charAux of pathToFile){
const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
//console.log(hexChar)
const intChar = parseInt(hexChar,16);
if(hexChar.length > 2){
if(hexChar === UTF8_Character) continue;
specialChars++;
//console.log(`specialChars(${specialChars}): ${hexChar}`);
textFiltered += personalized_encoding[hexChar];
}else{
textFiltered += String.fromCharCode(intChar);
}
}
console.log(textFiltered);
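The hex values reported above are consistent with the UTF-8 bytes of "ç" (C3 A7) and "ã" (C3 A3) having been read as code page 850, where byte C3 displays as ├, A7 as º and A3 as ú. Under that assumption, a slightly more general repair is to turn each ├ pair back into its two bytes and decode them as UTF-8. This is only a sketch: repairCp850Mojibake and the lookup table are hypothetical, and the table covers just the two trailing bytes seen in this path.

```typescript
// CP850 characters observed after "├", mapped back to the byte each one stood for.
// Assumption: only these two appear; extend the table for other accented letters.
const cp850TrailByte: Record<string, number> = { "º": 0xa7, "ú": 0xa3 };

// Hypothetical helper: rebuild the original UTF-8 pair (0xC3 + trail byte) and decode it.
function repairCp850Mojibake(garbled: string): string {
  return garbled.replace(/├(.)/g, (whole: string, trail: string) => {
    const second: number | undefined = cp850TrailByte[trail];
    if (second === undefined) {
      return whole; // Unknown pair: leave it untouched.
    }
    return new TextDecoder("utf-8").decode(new Uint8Array([0xc3, second]));
  });
}

console.log(repairCp850Mojibake("./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx"));
// ./ESMM/Parametrização_Dezembro_PS1_2022.xlsx
```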

Convert Specific String to JSON Object

const test ="[{contactId=2525, additionDetail=samle}]";
I need to convert this string to a JSON object. The string is loaded dynamically in this format, and I need to convert this particular string to a JSON object.
JSON.parse(test) does not work for this. I attached the error here.
For that specific string, you'd have to parse it yourself.
const test = '[{contactId=2525, additionDetail=samle}]';
const obj = {};
test.split(/[{}]/)[1].split(/, /).forEach((elm) => {
const entry = elm.split('=');
obj[entry[0]] = entry[1];
});
What I am doing is splitting the string on the braces and selecting the second element (utilising regex) then splitting that on comma and space (again regex) then loop over the result and assign to an object.
You can then JSON.stringify(obj) for the result.
:edit:
For the second string you've asked for there is another, potentially more refined, answer. You'll need to first replace the = with : (I've again used a regex), then you use a regex to match the words and sentence and use a function to add the quotes.
const test = '[{contactId=2525, additionDetail=samle}]';
const test2 = "[{contactId=2525, additionDetail=rrr additional Detail, medicationType={medicationTypeId=3333, medicationType=Tablet}, endDate=2022-12-30}]";
const replaced = test.replace(/=/g,':')
const replaced2 = test2.replace(/=/g, ':');
const replacer = function(match){
return '"' + match + '"';
}
const replacedQuote = replaced.replace(/(?!\s)[-?\w ?]+/g,replacer);
const replaced2Quote = replaced2.replace(/(?!\s)[-?\w ?]+/g,replacer);
const obj = JSON.parse(replacedQuote);
const obj2 = JSON.parse(replaced2Quote);
Note that JSON stands for JavaScript Object Notation, so you need to create a JavaScript object to get started:
const test ="[{contactId=2525, additionDetail=samle}]";
let obj = Object.create(null)
You can now define your variable as one of the object properties :
obj.test = test
Now we have a JavaScript object and we can convert it to json:
let convertedToJson = JSON.stringify(obj);
[{contactId=2525, additionDetail=samle}]
this is not a valid JSON-string, and it cannot be parsed by JSON.parse()
the correct JSON-string would be:
const test ='[{"contactId":2525, "additionDetail":"samle"}]';
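As an alternative to the regex-based rewrites above, the key=value format can be read with a small hand-written parser. The sketch below is not from any of the answers; parseLoose is a hypothetical name. It assumes values never contain a comma, a brace or a bracket, and it keeps every value as a string (so 2525 becomes "2525").

```typescript
type Parsed = string | Parsed[] | { [key: string]: Parsed };

// Hypothetical parser for strings such as "[{a=1, b={c=2}}]".
function parseLoose(input: string): Parsed {
  let pos = 0;

  function skipSeparators(): void {
    while (pos < input.length && (input[pos] === "," || input[pos] === " ")) pos++;
  }

  function parseValue(): Parsed {
    if (input[pos] === "[") return parseList();
    if (input[pos] === "{") return parseObject();
    // Scalar: read up to the next separator or closing delimiter.
    const start = pos;
    while (pos < input.length && !",}]".includes(input[pos])) pos++;
    return input.slice(start, pos).trim();
  }

  function parseList(): Parsed[] {
    const items: Parsed[] = [];
    pos++; // skip "["
    skipSeparators();
    while (pos < input.length && input[pos] !== "]") {
      const before = pos;
      items.push(parseValue());
      skipSeparators();
      if (pos === before) throw new Error("Unexpected character at position " + pos);
    }
    pos++; // skip "]"
    return items;
  }

  function parseObject(): { [key: string]: Parsed } {
    const obj: { [key: string]: Parsed } = {};
    pos++; // skip "{"
    skipSeparators();
    while (pos < input.length && input[pos] !== "}") {
      const eq = input.indexOf("=", pos);
      if (eq === -1) throw new Error("Expected '=' after position " + pos);
      const key = input.slice(pos, eq).trim();
      pos = eq + 1;
      obj[key] = parseValue();
      skipSeparators();
    }
    pos++; // skip "}"
    return obj;
  }

  return parseValue();
}

console.log(JSON.stringify(parseLoose("[{contactId=2525, additionDetail=samle}]")));
// [{"contactId":"2525","additionDetail":"samle"}]
```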

How to split a string into two parts when only knowing one part?

Hello I'm looking for a way to split a string into two parts when only knowing one part. To clarify there is NO separator to determine where to split the string on.
After splitting the string it should be possible to recognize if the resulting part is the left or right part of the string.
Consider the following use case scenario (a very simple string, JS syntax):
const subject = 'foobar';
const known = 'foo';
const [left, right] = splitBySegment(subject, known);
console.log(left, right); // foo bar
Use RegExp (JS syntax):
function splitBySegment(subject, known) {
// Escape regex metacharacters in the known part so it is matched literally.
const escapeRegExp = /[.*+?^${}()|[\]\\]/g;
const knownEscaped = known.replace(escapeRegExp, '\\$&');
const splitRegExp = new RegExp(
'^' +
`(?:${knownEscaped}|[A-Z]+(?=${knownEscaped}))` + // left
'|' +
`(?:${knownEscaped}|(?!${knownEscaped})[A-Z]+)` + // right
'$',
'gi', // the i flag lets [A-Z] match lowercase input such as 'foobar'
);
return subject.match(splitRegExp);
}
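When the known part is always a prefix or a suffix of the subject, the same result can be obtained without a regular expression. The following is a sketch of that simpler alternative, not the approach from the answer above; splitByKnownPart is a hypothetical name, and it returns null when the known part is at neither end.

```typescript
// Hypothetical alternative: split by checking whether `known` is a prefix or a suffix.
function splitByKnownPart(subject: string, known: string): [string, string] | null {
  if (subject.startsWith(known)) {
    // Known part is on the left; the remainder is the right part.
    return [known, subject.slice(known.length)];
  }
  if (subject.endsWith(known)) {
    // Known part is on the right; everything before it is the left part.
    return [subject.slice(0, subject.length - known.length), known];
  }
  return null;
}

console.log(splitByKnownPart("foobar", "foo")); // [ 'foo', 'bar' ]
console.log(splitByKnownPart("foobar", "bar")); // [ 'foo', 'bar' ]
```

Because the result is always ordered as [left, right], comparing each element with the known part tells which side it was on.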

How to use stringByAddingPercentEncodingWithAllowedCharacters() for a URL in Swift 2.0

I was using this, in Swift 1.2
let urlwithPercentEscapes = myurlstring.stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding)
This now gives me a warning asking me to use
stringByAddingPercentEncodingWithAllowedCharacters
I need to use a NSCharacterSet as an argument, but there are so many and I cannot determine what one will give me the same outcome as the previously used method.
An example URL I want to use will be like this
http://www.mapquestapi.com/geocoding/v1/batch?key=YOUR_KEY_HERE&callback=renderBatch&location=Pottsville,PA&location=Red Lion&location=19036&location=1090 N Charlotte St, Lancaster, PA
The URL character sets for encoding seem to include sets that trim my URL, e.g.:
The path component of a URL is the component immediately following the
host component (if present). It ends wherever the query or fragment
component begins. For example, in the URL
http://www.example.com/index.php?key1=value1, the path component is
/index.php.
However I don't want to trim any aspect of it.
When I used my String, for example myurlstring it would fail.
But when used the following, then there were no issues. It encoded the string with some magic and I could get my URL data.
let urlwithPercentEscapes = myurlstring.stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding)
As it
Returns a representation of the String using a given encoding to
determine the percent escapes necessary to convert the String into a
legal URL string
Thanks
For the given URL string the equivalent to
let urlwithPercentEscapes = myurlstring.stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding)
is the character set URLQueryAllowedCharacterSet
let urlwithPercentEscapes = myurlstring.stringByAddingPercentEncodingWithAllowedCharacters( NSCharacterSet.URLQueryAllowedCharacterSet())
Swift 3:
let urlwithPercentEscapes = myurlstring.addingPercentEncoding( withAllowedCharacters: .urlQueryAllowed)
It encodes everything after the question mark in the URL string.
Since the method stringByAddingPercentEncodingWithAllowedCharacters can return nil, use optional bindings as suggested in the answer of Leo Dabus.
It will depend on your url. If your url is a path you can use the character set
urlPathAllowed
let myFileString = "My File.txt"
if let urlwithPercentEscapes = myFileString.addingPercentEncoding(withAllowedCharacters: .urlPathAllowed) {
print(urlwithPercentEscapes) // "My%20File.txt"
}
Creating a Character Set for URL Encoding
urlFragmentAllowed
urlHostAllowed
urlPasswordAllowed
urlQueryAllowed
urlUserAllowed
You can create also your own url character set:
let myUrlString = "http://www.mapquestapi.com/geocoding/v1/batch?key=YOUR_KEY_HERE&callback=renderBatch&location=Pottsville,PA&location=Red Lion&location=19036&location=1090 N Charlotte St, Lancaster, PA"
let urlSet = CharacterSet.urlFragmentAllowed
.union(.urlHostAllowed)
.union(.urlPasswordAllowed)
.union(.urlQueryAllowed)
.union(.urlUserAllowed)
extension CharacterSet {
static let urlAllowed = CharacterSet.urlFragmentAllowed
.union(.urlHostAllowed)
.union(.urlPasswordAllowed)
.union(.urlQueryAllowed)
.union(.urlUserAllowed)
}
if let urlwithPercentEscapes = myUrlString.addingPercentEncoding(withAllowedCharacters: .urlAllowed) {
print(urlwithPercentEscapes) // "http://www.mapquestapi.com/geocoding/v1/batch?key=YOUR_KEY_HERE&callback=renderBatch&location=Pottsville,PA&location=Red%20Lion&location=19036&location=1090%20N%20Charlotte%20St,%20Lancaster,%20PA"
}
Another option is to use URLComponents to properly create your url
Swift 3.0 (From grokswift)
Creating URLs from strings is a minefield for bugs. Just miss a single / or accidentally URL encode the ? in a query and your API call will fail and your app won’t have any data to display (or even crash if you didn’t anticipate that possibility). Since iOS 8 there’s a better way to build URLs using NSURLComponents and NSURLQueryItems.
func createURLWithComponents() -> URL? {
var urlComponents = URLComponents()
urlComponents.scheme = "http"
urlComponents.host = "www.mapquestapi.com"
urlComponents.path = "/geocoding/v1/batch"
let key = URLQueryItem(name: "key", value: "YOUR_KEY_HERE")
let callback = URLQueryItem(name: "callback", value: "renderBatch")
let locationA = URLQueryItem(name: "location", value: "Pottsville,PA")
let locationB = URLQueryItem(name: "location", value: "Red Lion")
let locationC = URLQueryItem(name: "location", value: "19036")
let locationD = URLQueryItem(name: "location", value: "1090 N Charlotte St, Lancaster, PA")
urlComponents.queryItems = [key, callback, locationA, locationB, locationC, locationD]
return urlComponents.url
}
Below is the code to access url using guard statement.
guard let url = createURLWithComponents() else {
print("invalid URL")
return nil
}
print(url)
Output:
http://www.mapquestapi.com/geocoding/v1/batch?key=YOUR_KEY_HERE&callback=renderBatch&location=Pottsville,PA&location=Red%20Lion&location=19036&location=1090%20N%20Charlotte%20St,%20Lancaster,%20PA
In Swift 3.1, I am using something like the following:
let query = "param1=value1&param2=" + (valueToEncode.addingPercentEncoding(withAllowedCharacters: .alphanumerics) ?? "")
It's safer than .urlQueryAllowed and the others, because it encodes every character other than A-Z, a-z and 0-9. This works better when the value you are encoding may contain special characters like ?, &, =, + and spaces.
In my case where the last component was non latin characters I did the following in Swift 2.2:
extension String {
func encodeUTF8() -> String? {
//If I can create an NSURL out of the string nothing is wrong with it
if let _ = NSURL(string: self) {
return self
}
//Get the last component from the string this will return subSequence
let optionalLastComponent = self.characters.split { $0 == "/" }.last
if let lastComponent = optionalLastComponent {
//Get the string from the sub sequence by mapping the characters to [String] then reduce the array to String
let lastComponentAsString = lastComponent.map { String($0) }.reduce("", combine: +)
//Get the range of the last component
if let rangeOfLastComponent = self.rangeOfString(lastComponentAsString) {
//Get the string without its last component
let stringWithoutLastComponent = self.substringToIndex(rangeOfLastComponent.startIndex)
//Encode the last component
if let lastComponentEncoded = lastComponentAsString.stringByAddingPercentEncodingWithAllowedCharacters(NSCharacterSet.alphanumericCharacterSet()) {
//Finally append the original string (without its last component) to the encoded part (encoded last component)
let encodedString = stringWithoutLastComponent + lastComponentEncoded
//Return the string (original string/encoded string)
return encodedString
}
}
}
return nil;
}
}
Swift 4.0
let encodedData = myUrlString.addingPercentEncoding(withAllowedCharacters: CharacterSet.urlHostAllowed)

Resources