nodejs get file character encoding - node.js

How can I find out what character encoding a given text file has?
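var fs = require("fs");
var inputFile = "filename.txt";
var file = fs.readFileSync(inputFile); // Buffer with the raw bytes
var data = Buffer.from(file); // new Buffer() is deprecated
var fileEncoding = some_clever_function(file); // the function I am looking for
if (fileEncoding !== "utf8") {
    // do something
}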
Thanks

You can try an external module, such as https://www.npmjs.com/package/detect-character-encoding

The previously mentioned module works for me too. Alternatively you could have a look at detect-file-encoding-and-language which I'm using at the moment.
Installation:
$ npm install detect-file-encoding-and-language
Usage:
// index.js
const languageEncoding = require("detect-file-encoding-and-language");
const pathToFile = "/home/username/documents/my-text-file.txt"
languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: japanese, encoding: Shift-JIS, confidence: { language: 0.97, encoding: 1 } }
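If pulling in a module is not an option, a byte order mark (BOM) check covers the simplest cases. The following is a minimal sketch, not a general detector: the function name detectBom is made up for illustration, it only recognises the UTF-8 and UTF-16 BOMs, and a file without a BOM yields null, which says nothing about its real encoding. A UTF-32LE BOM also starts with FF FE and would be misreported as UTF-16LE here.

```javascript
const fs = require("fs");

// Returns "utf8", "utf16le", "utf16be", or null when no BOM is present.
// null only means "no BOM found"; the file may still be in any encoding.
function detectBom(buffer) {
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    return "utf8";
  }
  if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
    return "utf16le";
  }
  if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
    return "utf16be";
  }
  return null;
}

// Hypothetical helper: reads a file and checks its first bytes.
function detectFileBom(path) {
  return detectBom(fs.readFileSync(path));
}
```

For files without a BOM, a statistical detector such as the modules mentioned above is still needed.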

Related

Apply regex to .txt file node.js

I'm trying to escape quotes in a .txt file using Node.js and a regex.
My code looks like this:
const fs = require("fs");
const utf8 = require("utf8");
var dirname = ".\\f\\";
const regex = new RegExp(`(?<=".*)"(?=.*"$)`, "gm");
fs.readFile(dirname + "test.txt", (error, data) => {
    if (error) {
        throw error;
    }
    var d = data.toString();
    d = utf8.encode(d);
    console.log(`File: ${typeof d}`); //string
    // d = `Another string\n"Test "here"."\n"Another "here"."\n"And last one here."`;
    console.log(`Text: ${typeof d}`); //string
    var re = d.replace(regex, '\\"');
    console.log(`Result:\n${re}`);
    /* Another string
    "Test \"here\"."
    "Another \"here\"."
    "And last one here."
    */
});
The problem is:
When I uncomment the line with the hard-coded string, everything works fine. But if I read the text from the file, the replacement does not work.
Thanks for any comments on this.
Well, it turns out the problem was the file encoding. The file was encoded in UTF-16, not UTF-8. Node.js gave no sign that the encoding was wrong.
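To illustrate the fix: when UTF-16 bytes are decoded as UTF-8 (the default for toString()), ASCII-range text ends up with a NUL character after every visible character, so a pattern like "$ no longer matches. The sketch below decodes the bytes as UTF-16LE before applying the regex from the question. It assumes a little-endian file with an optional BOM; the helper name decodeUtf16le and the in-memory sample bytes are made up for illustration.

```javascript
// Regex from the question: matches quotes that sit between a line's first and last quote.
const regex = /(?<=".*)"(?=.*"$)/gm;

// Decodes a buffer as UTF-16LE and drops a leading BOM if there is one.
function decodeUtf16le(buffer) {
  const text = buffer.toString("utf16le");
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Simulated file content: a BOM followed by UTF-16LE encoded text.
const fileBytes = Buffer.concat([
  Buffer.from([0xff, 0xfe]),
  Buffer.from('Another string\n"Test "here"."', "utf16le"),
]);

const escaped = decodeUtf16le(fileBytes).replace(regex, '\\"');
console.log(escaped);
```

With a real file, the same effect should be achievable by passing "utf16le" as the encoding to fs.readFile and stripping the BOM afterwards.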

How to use the brotli npm module to compress files

I would like to use this npm module to compress files, but I'm a bit stuck on the documentation.
In a Linux shell:
npm install brotli # npm#4.1.2 # brotli#1.3.1
node # v6.9.4
Then inside node:
var fs = require('fs');
var brotli = require('brotli');
brotli.compress(fs.readFileSync('myfile.txt')); // output is numbers
fs.writeFile('myfile.txt.br', brotli.compress(fs.readFileSync('bin.tar')), function(err){ if (!err) { console.log('It works!');}});
"It works!"
But the file is full of numbers too...
I've never used streams and fs like that in Node. Can someone explain how to deal with this? Thanks!
This simple script compresses every *.html, *.css and *.js file inside the folder you choose (in this case dist/):
const fs = require('fs');
const compress = require('brotli/compress');
const brotliSettings = {
    extension: 'br',
    skipLarger: true,
    mode: 1, // 0 = generic, 1 = text, 2 = font (WOFF2)
    quality: 10, // 0 - 11
    lgwin: 12 // base-2 logarithm of the sliding window size
};
fs.readdirSync('dist/').forEach(file => {
    if (file.endsWith('.js') || file.endsWith('.css') || file.endsWith('.html')) {
        // Compress the raw file bytes and write them next to the original with a .br suffix.
        const result = compress(fs.readFileSync('dist/' + file), brotliSettings);
        fs.writeFileSync('dist/' + file + '.br', result);
    }
});
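As an alternative to the brotli npm module, Node.js ships Brotli support in its built-in zlib module (available in Node 10.16+ and 11.7+, so not in the v6.9.4 mentioned in the question). This is a different technique from the answer above, shown only as a sketch: the helper name brotliCompress and the sample CSS string are made up. The result is a Buffer, so writing it with fs.writeFileSync stores raw bytes. The "numbers" in the question were most likely a typed array being converted to a comma-separated string on an old Node version.

```javascript
const zlib = require("zlib");

// Compresses a Buffer with Node's built-in Brotli bindings.
function brotliCompress(input) {
  return zlib.brotliCompressSync(input, {
    params: {
      [zlib.constants.BROTLI_PARAM_MODE]: zlib.constants.BROTLI_MODE_TEXT,
      [zlib.constants.BROTLI_PARAM_QUALITY]: 10,
    },
  });
}

// Round trip on an in-memory sample instead of a real file.
const original = Buffer.from("body { color: red; }\n".repeat(50), "utf8");
const compressed = brotliCompress(original);
const restored = zlib.brotliDecompressSync(compressed);

console.log(original.length, compressed.length);
```

For a real file, the input would come from fs.readFileSync and the compressed Buffer would go to fs.writeFileSync with a .br suffix, as in the loop above.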

How to give a file name as input in baby parser

I am trying to use baby parser to parse a CSV file, but when I pass the file name I get the output below.
The file and the code are in the same directory.
My code:
var Papa = require('babyparse');
var fs = require('fs');
var file = 'test.csv';
Papa.parse(file, {
    step: function(row){
        console.log("Row: ", row.data);
    }
});
Output:
Row: [ [ 'test.csv' ] ]
In the browser, file must be a File object: http://papaparse.com/docs#local-files. A plain string is parsed as CSV text, which is why the output contains the file name itself. In Node.js, you should use the fs API to load the content of the file and then pass it to PapaParse: https://nodejs.org/api/fs.html#fs_fs_readfilesync_filename_options
var Papa = require('babyparse');
var fs = require('fs');
var file = 'test.csv';
var content = fs.readFileSync(file, { encoding: 'binary' });
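Papa.parse(content, {
    step: function(row){
        console.log("Row: ", row.data);
    }
});
The encoding option is important: without it, readFileSync returns a Buffer instead of a string. 'binary' is an alias for 'latin1' and maps every byte to one character, so it only decodes correctly for single-byte text; set it to 'utf8' if your file is UTF-8 encoded.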
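The difference between the two encodings can be seen without any CSV library. This small sketch uses an in-memory Buffer in place of a file; the sample text is made up, and the accented characters are written as escapes so the example does not depend on how this snippet itself is saved.

```javascript
// "\u00e9" (e with acute) and "\u00fc" (u with diaeresis) each take two bytes in UTF-8.
const bytes = Buffer.from("\u00e9,\u00fc\n", "utf8");

// Decoded with the encoding that produced the bytes: the text survives.
const asUtf8 = bytes.toString("utf8");

// "binary" is an alias for "latin1": one character per byte, so
// every two-byte sequence turns into two unrelated characters.
const asBinary = bytes.toString("binary");

console.log(JSON.stringify(asUtf8), asUtf8.length);
console.log(JSON.stringify(asBinary), asBinary.length);
```

The same applies to the encoding option of fs.readFileSync, which decodes the file's bytes in the same way.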

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. It does not return the text as a single string, but you should be able to reconstruct the final text from the generated JSON output:
'Texts': an array of text blocks with position, actual text and styling information:
    'x' and 'y': relative coordinates for positioning
    'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color' value.
    'A': text alignment, including:
        left
        center
        right
    'R': an array of text runs, each text run object has two main fields:
        'T': actual text
        'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
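A rough sketch of how the text could be reconstructed from such a 'Texts' array follows. It works on a hand-written sample in the shape described above rather than on real PDF2Json output, and the function names are made up. It also assumes that 'T' values may be URI-encoded and that blocks with the same 'y' belong to the same line; both are assumptions to check against your actual output.

```javascript
// Decodes a text run; falls back to the raw value if it is not valid URI encoding.
function decodeRun(t) {
  try {
    return decodeURIComponent(t);
  } catch (err) {
    return t;
  }
}

// Joins text blocks into lines: blocks sharing a y value form one line, ordered by x.
function textsToString(texts) {
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines = [];
  let currentY = null;
  for (const block of sorted) {
    const text = block.R.map((run) => decodeRun(run.T)).join("");
    if (block.y !== currentY) {
      lines.push(text);
      currentY = block.y;
    } else {
      lines[lines.length - 1] += " " + text;
    }
  }
  return lines.join("\n");
}

// Hypothetical sample in the shape described above (not real parser output).
const sample = [
  { x: 10, y: 2, R: [{ T: "world", S: 0 }] },
  { x: 1, y: 2, R: [{ T: "Hello%2C", S: 0 }] },
  { x: 1, y: 5, R: [{ T: "Second%20line", S: 0 }] },
];
console.log(textsToString(sample));
```

Real documents rarely have exactly equal 'y' values for every block on a line, so a tolerance when comparing 'y' would probably be needed in practice.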
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first install the package from the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");
// Returns the text items of the first page as an array of strings.
async function GetTextFromPDF(path) {
    let doc = await pdfjsLib.getDocument(path).promise;
    let page1 = await doc.getPage(1); // pages are numbered from 1
    let content = await page1.getTextContent();
    let strings = content.items.map(function(item) {
        return item.str;
    });
    return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
Thought I'd chime in here for anyone who came across this question in the future.
I had this problem and spent hours over literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda so could not depend on OS dependencies.
The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference being that we import the ES5 version which works with Node >= 12. If you just import pdfjs-dist there will be an error of "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';
export default class Pdf {
    public static async getPageText(pdf: any, pageNo: number) {
        const page = await pdf.getPage(pageNo);
        const tokenizedText = await page.getTextContent();
        const pageText = tokenizedText.items.map((token: any) => token.str).join('');
        return pageText;
    }
    public static async getPDFText(source: any): Promise<string> {
        const pdf = await pdfjslib.getDocument(source).promise;
        const maxPages = pdf.numPages;
        const pageTextPromises = [];
        for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
            pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
        }
        const pageTexts = await Promise.all(pageTextPromises);
        return pageTexts.join(' ');
    }
}
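Usage:
import * as fs from 'fs';
// readFileSync returns the file contents directly; fs.readFile needs a callback and returns nothing.
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);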
This solution worked for me on Node 14.20.1 with "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
    assert(fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
    const pdfFile = path.resolve(pathStr)
    const dataBuffer = fs.readFileSync(pdfFile);
    const data = await pdf(dataBuffer)
    return data.text
}
module.exports = {
    extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you do not depend on modesty, who owns PDF2Json, to keep its PDF.js base up to date.

wkhtmltopdf on nodejs generates corrupt pdfs

I am using wkhtmltopdf to generate PDFs in Node.js.
Below is my sample code to generate a PDF:
var wkhtmltopdf = require('wkhtmltopdf')
, createWriteStream = require('fs').createWriteStream;
var r = wkhtmltopdf('http://www.google.com', { pageSize: 'letter' })
.pipe(createWriteStream('C:/MYUSERNAME/demo.pdf'));
r.on('close', function(){
mycallback();
});
The above code generates corrupt PDFs, and I could not figure out why.
When I run the same conversion from the Windows command prompt, the PDF is generated correctly:
wkhtmltopdf http://www.google.com demo.pdf
Only when I generate the PDF from Node do I get a corrupt file.
In case it helps, I'm using wkhtmltopdf 0.11.0 rc2.
Thanks in advance.
The wkhtmltopdf module for Node has a bug on Windows, so you can write your own wrapper.
Like this:
function wkhtmltopdf(input, pageSize) {
    var spawn = require('child_process').spawn;
    var html;
    var child;
    // URLs and file:// paths are passed through; anything else is treated as HTML and sent via stdin.
    var isUrl = /^(https?|file):\/\//.test(input);
    if (!isUrl) {
        html = input;
        input = '-';
    }
    var args = ['wkhtmltopdf', '--quiet', '--page-size', pageSize, input, '-'];
    if (process.platform === 'win32') {
        // On Windows, spawn the binary directly instead of going through a shell.
        child = spawn(args[0], args.slice(1));
    } else {
        child = spawn('/bin/sh', ['-c', args.join(' ') + ' | cat']);
    }
    if (!isUrl) {
        child.stdin.end(html);
    }
    return child.stdout; // readable stream with the PDF bytes
}
// usage:
var createWriteStream = require('fs').createWriteStream;
wkhtmltopdf('http://google.com/', 'letter')
    .pipe(createWriteStream('demo1.pdf'));
wkhtmltopdf('<body>hello world!</body>', 'letter')
    .pipe(createWriteStream('demo2.pdf'));
Note: the second parameter is now the string 'letter', not the object { pageSize: 'letter' }.
