how to deal with special characters in nodejs fs readdir function - node.js

I'm reading a directory in nodejs using the fs.readdir() function. You feed it a string containing a path and it returns an array containing all the files inside that directory path in string format. It does not work for me with special characters (like ï).
I came across this similar issue, however I am on OS X).
First I created a new dir called encoding and created a file called maïs.md (with my editor Sublime Text).
fs.readdir('encoding', function(err, files) {
console.log(files); // [ 'maïs.md' ]
console.log(files[0]); // maïs.md
console.log(files[0] === 'maïs.md'); // false
console.log(files[0] == 'maïs.md'); // false
console.log(files[0].toString('utf8') === 'maïs.md'); // false
});
The above test works correctly for files without special characters. How can I compare this correctly?

you character seems to be this one. You should try with
(1) console.log(files[0] == 'ma\u00EF;s.md');
(2) console.log(files[0] == 'mai\u0308;s.md');
If (1) works it could mean that the file containing your code is not saved in utf-8 format, so the node.js engine does not interpret correctly the ï character in your code.
If (2) works it could mean that the file system gives to the node engine the ï character in its decomposed unicode form (i followed by a diacritic ¨). cf #thejh answer
In this (2) case, use the unorm library available on npm to normalize the strings before comparing them (or the original UnicodeNormalizer)

https://apple.stackexchange.com/a/10484/23863 looks relevant – it's probably because there are different ways to express ï in utf8.

Related

Regex .match() not finding matches even though it should

I'm trying to scan a .txt file in a node.js script, and scan its contents for certain pieces of data. The lines I'm interested in getting look mostly like this:
DIBH91643 5/10/2019 108,75
SIR108811 5/10/2019 187,50
SIR108845 5/10/2019 63,75
So I've been trying to match them with a regex without succes. Using a regex testing site, I've even confirmed the fact that it should find the matches I'm looking for, but it always returns null when I call data.match(regex). I'm probably missing something basic here, but I can't figure it out for the life of me. This is the code I'm using (in its entirety, since there isn't much):
var fs = require('fs');
let regex = /\w*?(\d+)\s+(\d+\/\d+\/\d+)\s+(\-{0,1}\d+\,\d+)/g;
let ihateregex = /91/g;
fs.readFile('pathToFile/fileToRead.txt',{encoding: 'utf-8'}, (err, data) => {
var result = data.match(regex);
console.log(result);
});
As shown, even an attempt with a simple pattern that is definitely inside the file still returns null. I have looked into other answers here for similar problems, and they all point to deleting bytes from the beginning of the file. I have used vim -b to delete the first 2 bytes - which did look out of place and furthermore printing the entire data with console.log() did actually show 2 weird characters in the beginning of the file, but I get the exact same error.
I can't figure out what I'm missing here.
Try the following regex:
/^[A-Z]*(\d+)\s+(\d+\/\d+\/\d+)\s+(-?\d+,\d+)/gm
Improvements compared to your regex:
^ - start from the start of line,
[A-Z]* instead of \w*? - note that \w matches also digits,
removed / in front of - and ,,
? instead of {0,1},
added m option (I assume that you want to process all rows, not the first only).
To process the matches I used the following code, using rextester.com, so
instead of e.g. console.log(...) it contains print(...):
let data = 'DIBH91643 5/10/2019 108,75\nSIR108811 5/10/2019 187,50\nSIR108845 5/10/2019 63,75'
print("Data: ")
print(data)
let re = /^[A-Z]*(\d+)\s+(\d+\/\d+\/\d+)\s+(-?\d+,\d+)/gm
print("Result: ")
while ((matches = re.exec(data)) != null) {
print(matches[1], '_', matches[2], '_', matches[3])
}
For a working example see https://rextester.com/PZU21213
So I've finally figured out what went wrong and I feel extremely stupid for taking so long to figure it out. One thing I've failed to mention even though I should have is that the file I'm reading is one created by an OCR program. An OCR program which, apparently, added an invisible char between each character in the text file, that I only saw when I switched to php (fopen(), fgets(), fclose()) and looked at the source of the page I made.
Once I copied the contents of fileToRead.txt into a newly created fileToRead2.txt (simple copy-paste), it worked perfectly.

Change Image source in Markdown text using Node JS

I have some markdown text containing image references but those are relative paths & I need to modify those to absolute paths using NodeJS script. Is there any way to achieve the same in simple way ?
Example:
Source
![](myImage.png?raw=true)
Result
![](www.example.com/myImage.png?raw=true)
I have multiple images in the markdwon content that I need to modify.
Check this working example: https://repl.it/repls/BlackJuicyCommands
First of all you need the read the file with fs.readFile() and parse the Buffer to a string.
Now you got access to the text of the file, you need the replace every image path with a new image path. One way to capture it is with a regular expression. You could for example look for ](ANY_IMAGE_PATH.ANY_IMAGE_EXTENSION. We're only interested in replacing the ANY_IMAGE_PATH part.
An example regex could be (this could be improved!) (see it live in action here: https://regex101.com/r/xRioBq/1):
const regex = /\]\((.+)(?=(\.(svg|gif|png|jpe?g)))/g
This will look for a literal ] followed by a literal ( followed by any sequence (that's the .+) until it finds a literal . followed by either svg, gif,png, or jpeg / jpg. the g after the regex is necessary to match all occurences, not only the first.
Javascript/node (< version 9) does not support lookbehind. I've used a regex that includes the ]( before the image path and then will filter it out later (or you could use a more complex regex).
The (.+) captures the image path as a group. Then you can use the replace function of the string. The replace function accepts the regex as first argument. The second argument can be a function which receives as first argument the full match (in this case ](image_path_without_extension and as the following arguments any captured group.
Change the url with the node modules path or url and return the captured group (underneath I've called that argument imagePath). Because we're also replacing the ](, that should be included in the return value.
const url = require('url');
const replacedText = data.toString().replace(regex, (fullResult, imagePath) => {
const newImagePath = url.resolve('http://www.example.org', imagePath)
return `](${newImagePath}`;
})
See the repl example to see it live in action. Note: in the repl example it is written to a different file. Use the same file name if you want to overwrite.
I ran into the same issue yesterday, and came up with this solution.
const markdownReplaced = markdown.replace(
/(?<=\]\()(.+)(?=(\)))/g,
(url) => `www.example.com/${url}`,
);
The regex finds anything starts with ]( and ends with ), without including the capturing group, which is the original URL itself.

readFileSync() doesn't return special characters regardless of encoding

My requirements are very simple… open any old ANSI-ASCII-UTF8-Unicode TXT file and replace some of the special "word processing" characters like the fancy single quote (\u2019) and double quotes (\u201C and \u201D) with the plain vanilla Ascii ones, and then do some other (irrelevant to the problem) parsing.
However, regardless of the encoding I try (ascii, utf8, binary) I just can’t get Node.js to return all characters correctly so as to replace them with their Ascii equivalents and instead I get the useless little rectangles!
Here’s the relevant part of the function…
function LoadTxtFile(Name){
fs=require('fs');
if (fs.existsSync(Name)){
var Source=fs.readFileSync(Name,'binary').toString();
/* Replace miscellaneous characters which works fine…*/
Source=Source.replace(/\©/g,'©');
Source=Source.replace(/\…/g,'...');
Source=Source.replace(/\t/g,' ');
Source=Source.replace(/\'/g,''')
/* Replace the dreaded single/double quotes but they are never located! */
Source=Source.replace(/\u2019/g,''');
Source=Source.replace(/\u201C/g,'"');
Source=Source.replace(/\u201D/g,'"');
/* And we’re stuck! */
}}
Thank you very much.
Try the Node-Iconv library see if it helps

need guidance with basic function creation in MATLAB

I have to write a MATLAB function with the following description:
function counts = letterStatistics(filename, allowedChar, N)
This function is supposed to open a text file specified by filename and read its entire contents. The contents will be parsed such that any character that isn’t in allowedChar is removed. Finally it will return a count of all N-symbol combinations in the parsed text. This function should be stored in a file name “letterStatistics.m” and I made a list of some commands and things of how the function should be organized according to my professors' lecture notes:
Begin the function by setting the default value of N to 1 in case:
a. The user specifies a 0 or negative value of N.
b. The user doesn’t pass the argument N into the function, i.e., counts = letterStatistics(filename, allowedChar)
Using the fopen function, open the file filename for reading in text mode.
Using the function fscanf, read in all the contents of the opened file into a string variable.
I know there exists a MATLAB function to turn all letters in a string to lower case. Since my analysis will disregard case, I have to use this function on the string of text.
Parse this string variable as follows (use logical indexing or regular expressions – do not use for loops):
a. We want to remove all newline characters without this occurring:
e.g.
In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since.
In my younger and more vulnerableyears my father gave me some advicethat I’ve been turning over in my mindever since.
Replace all newline characters (special character \n) with a single space: ' '.
b. We will treat hyphenated words as two separate words, hence do the same for hyphens '-'.
c. Remove any character that is not in allowedChar. Hint: use regexprep with an empty string '' as an argument for replace.
d. Any sequence of two or more blank spaces should be replaced by a single blank space.
Use the provided permsRep function, to create a matrix of all possible N-symbol combinations of the symbols in allowedChar.
Using the strfind function, count all the N-symbol combinations in the parsed text into an array counts. Do not loop through each character in your parsed text as you would in a C program.
Close the opened file using fclose.
HERE IS MY QUESTION: so as you can see i have made this list of what the function is, what it should do, and using which commands (fclose etc.). the trouble is that I'm aware that closing the file involves use of 'fclose' but other than that I'm not sure how to execute #8. Same goes for the whole function creation. I have a vague idea of how to create a function using what commands but I'm unable to produce the actual code.. how should I begin? Any guidance/hints would seriously be appreciated because I'm having programmers' block and am unable to start!
I think that you are new to matlab, so the documentation may be complicated. The root of the problem is the basic understanding of file I/O (input/output) I guess. So the thing is that when you open the file using fopen, matlab returns a pointer to that file, which is generally called a file ID. When you call fclose you want matlab to understand that you want to close that file. So what you have to do is to use fclose with the correct file ID.
fid = open('test.txt');
fprintf(fid,'This is a test.\n');
fclose(fid);
fid = 0; % Optional, this will make it clear that the file is not open,
% but it is not necessary since matlab will send a not open message anyway
Regarding the function creation the syntax is something like this:
function out = myFcn(x,y)
z = x*y;
fprintf('z=%.0f\n',z); % Print value of z in the command window
out = z>0;
This is a function that checks if two numbers are positive and returns true they are. If not it returns false. This may not be the best way to do this test, but it works as example I guess.
Please comment if this is not what you want to know.

NodeJS Extended ASCII Support

I'm trying to read a file that contains extended ascii characters like 'á' or 'è', but NodeJS doesn't seem to recognize them.
I tried reading into:
Buffer
String
Tried differente encoding types:
ascii
base64
utf8
as referenced on http://nodejs.org/api/fs.html
Is there a way to make this work?
I use the binary type to read such files. For example
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like 'á' or 'è',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
When I do save the above into foo.js, then the following is shown on the output:
var fs = require('fs');
// this comment has I'm trying to read a file that contains extended ascii characters like '⟡ 漀爀 ✀',
fs.readFile("foo.js", "binary", function zz2(err, file) {
console.log(file);
});
The wierdness above is because I have run it in an emacs buffer.
The file I was trying to read was in ANSI encoding. When I tried to read it using the 'fs' module's functions, it couldn't perform the conversion of the extended ASCII characters.
I just figured out that nodepad++ is able to actually convert from some formats to UTF-8, instead of just flagging the file with UTF-8 encoding.
After converting it, I was able to read it just fine and apply all the operations I needed to the content.
Thank you for your answers!
I realize this is an old post, but I found it in my personal search for a solution to this particular problem.
I have written a module that provides Extended ASCII decoding and encoding support to/from Node Buffers. You can see the source code here. It is a part of my implementation of Buffer in the browser for an in-browser filesystem I've created called BrowserFS, but it can be used 100% independently of any of that within NodeJS (or Browserify) as it has no dependencies.
Just add bfs-buffer to your dependencies and do the following:
var ExtendedASCII = require('bfs-buffer/js/extended_ascii').default;
// Decodes the input buffer as an extended ASCII string.
ExtendedASCII.byte2str(aBufferWithText);
// Encodes the input string as an extended ASCII string.
ExtendedASCII.str2byte("Some text");
Alternatively, just adapt the module into your project if you don't want to add an extra dependency to your project. It's MIT Licensed.
I hope this helps anyone in the future who finds this page in their searches like I did. :)

Resources