I am using Puppeteer to scrape the web from a file template that contains the data of an order.
For this, I am using a puppeteer evaluation function, which works correctly while the file is in .js
However, when the "pkg" package is used to compile the .exe file or evaluate to execute and initiate a return or error: "The passed function is not quite serializable!"
Below is the code:
const dados = {name: 'foo', year: 1}
await page.evaluate(dados => {
let dom = document.querySelector('body');
const tags = Object.keys(dados);
for (let i = 0; i < tags.length; i++) {
const tag = tags[i];
dom.innerHTML = dom.innerHTML.split(`{{${tag}}}`).join(dados[tag]);
}
}, dados);
I try to add --public argument with pkg.
Like: pkg start.js -t node14-win-x64 --public
Then I can freely use ElementHandle.evaluate( (elem)=> elem.textContent );
With the manual of pkg, the --public means that : speed up and disclose the sources of top-level project.
BTW
To fixup cannot find chrome binary file
browser = await puppeteer.launch({
executablePath: "node_modules/puppeteer/.local-chromium/win64-782078/chrome-win/chrome.exe"
});
(The path above can set anywhere as you like.)
To fixup start.exe cannot run
Sometime the output binary exe cannot executable. It
always popup a new cmd prompt window when we enter start.exe. (or just double click.)
Just delete the output exe then rerun pkg
Check the code whether is runnable with node start.js or not
The easiest solution for me is to wrap it with eval() :
async getText(selector: string) {
await this.page.waitForSelector(selector);
let text = await eval(`this.page.$eval('${selector}', el => el.textContent)`)
return text;
}
or this:
await eval(`this.page.evaluate(
(selector) => { (document.querySelector(selector).value = ''); },
selector);`);
I had this exact issue relating to puppeteer and pkg. For some reason pkg doesn't correctly interpret the contents of the callback. The solution that worked for me was to pass a string to evaluate rather than a function:
Change:
const dados = {name: 'foo', year: 1}
await page.evaluate(dados => {
let dom = document.querySelector('body');
const tags = Object.keys(dados);
for (let i = 0; i < tags.length; i++) {
const tag = tags[i];
dom.innerHTML = dom.innerHTML.split(`{{${tag}}}`).join(dados[tag]);
}
}, dados);
to
await page.evaluate(`
(() => {
const dados = {name: 'foo', year: 1};
let dom = document.querySelector('body');
const tags = Object.keys(dados);
for (let i = 0; i < tags.length; i++) {
const tag = tags[i];
dom.innerHTML = dom.innerHTML.split(`{{${tag}}}`).join(dados[tag]);
}
// return dom to do something with the data in node
return dom.innerHTML
})()`);
This answer on github suggests an alternative solution - using the pkg api to inject the callback at compile time, however it didn't work for me.
Related
Is there any way to use ts-node with WebWorkers but without using webpack?
When I do:
const worker = new Worker('path-to/workerFile.ts', { // ... });
I get:
TypeError [ERR_WORKER_UNSUPPORTED_EXTENSION]:
The worker script extension must be ".js" or ".mjs". Received ".ts" at new Worker (internal/worker.js:272:15)
// ....
Any ideas?
Tomer
You can make an function to make the magic, using eval property of WorkerOption parameter.
const workerTs = (file: string, wkOpts: WorkerOptions) => {
wkOpts.eval = true;
if (!wkOpts.workerData) {
wkOpts.workerData = {};
}
wkOpts.workerData.__filename = file;
return new Worker(`
const wk = require('worker_threads');
require('ts-node').register();
let file = wk.workerData.__filename;
delete wk.workerData.__filename;
require(file);
`,
wkOpts
);
}
so you can create the thread like this:
let wk = workerTs('./file.ts', {});
Hope it can help.
I need to access per-machine configuration data in my Node application running on Windows. I've found this documentation for how to find the location:
Where Should I Store my Data and Configuration Files if I Target Multiple OS Versions?
So, in my case, I would like to get the path for CSIDL_COMMON_APPDATA (or FOLDERID_ProgramData). However, the examples are all in C, and I would prefer to not have to write a C extension for this.
Is there any other way to access these paths from Node, or should I just hardcode them?
After doing a bit of research, I've found that it's possible to call the relevant Windows API proc. (SHGetKnownFolderPath) to get these folder locations, see docs at: https://msdn.microsoft.com/en-us/library/windows/desktop/bb762188(v=vs.85).aspx.
We call the APi using the FFI npm module: https://www.npmjs.com/package/ffi.
It is possible to find the GUIDs for any known folder here:
https://msdn.microsoft.com/en-us/library/windows/desktop/dd378457(v=vs.85).aspx
Here is a script that finds the location of several common folders,
some of the code is a little hacky, but is easily cleaned up.
const ffi = require('ffi');
const ref = require('ref');
const shell32 = new ffi.Library('Shell32', {
SHGetKnownFolderPath: ['int', [ ref.refType('void'), 'int', ref.refType('void'), ref.refType(ref.refType("char"))]]
});
function parseGUID(guidStr) {
var fields = guidStr.split('-');
var a1 = [];
for(var i = 0; i < fields.length; i++) {
var a2 = [...Buffer.from(fields[i], 'hex')];
if (i < 3) a2 = a2.reverse();
a1 = a1.concat(a2);
}
return new Buffer(a1);
}
function getWindowsKnownFolderPath(pathGUID) {
let guidPtr = parseGUID(pathGUID);
guidPtr.type = ref.types.void;
let pathPtr = ref.alloc(ref.refType(ref.refType("void")));
let status = shell32.SHGetKnownFolderPath(guidPtr, 0, ref.NULL, pathPtr);
if (status !== 0) {
return "Error occurred getting path: " + status;
}
let pathStr = ref.readPointer(pathPtr, 0, 200);
return pathStr.toString('ucs2').substring(0, (pathStr.indexOf('\0\0') + 1)/2);
}
// See this link for a complete list: https://msdn.microsoft.com/en-us/library/windows/desktop/dd378457(v=vs.85).aspx
const WindowsKnownFolders = {
ProgramData: "62AB5D82-FDC1-4DC3-A9DD-070D1D495D97",
Windows: "F38BF404-1D43-42F2-9305-67DE0B28FC23",
ProgramFiles: "905E63B6-C1BF-494E-B29C-65B732D3D21A",
Documents: "FDD39AD0-238F-46AF-ADB4-6C85480369C7"
}
// Enumerate common folders.
for(let [k,v] of Object.entries(WindowsKnownFolders)) {
console.log(`${k}: `, getWindowsKnownFolderPath(v));
}
Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. Though it is not providing the text output as a single line but I believe you may just reconstruct the final text based on the generated Json output:
'Texts': an array of text blocks with position, actual text and styling informations:
'x' and 'y': relative coordinates for positioning
'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color" value.
'A': text alignment, including:
left
center
right
'R': an array of text run, each text run object has two main fields:
'T': actual text
'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first npm install on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");
async function GetTextFromPDF(path) {
let doc = await pdfjsLib.getDocument(path).promise;
let page1 = await doc.getPage(1);
let content = await page1.getTextContent();
let strings = content.items.map(function(item) {
return item.str;
});
return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
Thought I'd chime in here for anyone who came across this question in the future.
I had this problem and spent hours over literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda so could not depend on OS dependencies.
The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference being that we import the ES5 version which works with Node >= 12. If you just import pdfjs-dist there will be an error of "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';
export default class Pdf {
public static async getPageText(pdf: any, pageNo: number) {
const page = await pdf.getPage(pageNo);
const tokenizedText = await page.getTextContent();
const pageText = tokenizedText.items.map((token: any) => token.str).join('');
return pageText;
}
public static async getPDFText(source: any): Promise<string> {
const pdf = await pdfjslib.getDocument(source).promise;
const maxPages = pdf.numPages;
const pageTextPromises = [];
for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
}
const pageTexts = await Promise.all(pageTextPromises);
return pageTexts.join(' ');
}
}
Usage
const fileBuffer = fs.readFile('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me using node 14.20.1 using "pdf-parse": "^1.1.1"
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
assert (fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
const pdfFile = path.resolve(pathStr)
const dataBuffer = fs.readFileSync(pdfFile);
const data = await pdf(dataBuffer)
return data.text
}
module.exports = {
extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty who owns PDF2Json and that he updates the PDF.js base.
i have a folder stucture like
var/
testfolder1/
myfile.rb
data.rb
testfolder2/
home.rb
sub.rb
sample.rb
rute.rb
inside var folder contains subfolders(testfolder1,testfolder2) and some files(sample.rb,rute.rb)
in the following code returing a josn object that contains folders and files inside the var folder
like
{
'0': ['sample.rb', 'rute.rb'],
testfolder1: ['myfile.rb',
'data.rb',
],
testfolder2: ['home.rb',
'sub.rb',
]
}
code
var scriptsWithGroup = {};
fs.readdir('/home/var/', function(err, subfolder) {
if(err) return context.sendJson({}, 200);
var scripts = [];
for (var j = 0; j < subfolder.length; j++) {
var scriptsInFolder = [];
if(fs.lstatSync(scriptPath + subfolder[j]).isDirectory()) {
fs.readdirSync(scriptPath + subfolder[j]).forEach(function(file) {
if (file.substr(file.length - 3) == '.rb')
scriptsInFolder.push(file);
});
scriptsWithGroup[subfolder[j]] = scriptsInFolder;
} else {
if (subfolder[j].substr(subfolder[j].length - 3) == '.rb')
scripts.push(subfolder[j]);
}
}
scriptsWithGroup["0"] = scripts;
console.log(scriptsWithGroup)
context.sendJson(scriptsWithGroup, 200);
});
What i need is i want to return the latest modified or created files.here i only use 2 files inside folders it contains lots of files.so i want to return latest created ones
I'm going to assume here that you want only the most recent two files. If you actually want them all, just sorted, just remove the slice portion of this:
scriptsInFolder = scriptsInFolder.sort(function(a, b) {
// or mtime, if you're only wanting file changes and not file attribute changes
var time1 = fs.statSync(a).ctime;
var time2 = fs.statSync(b).ctime;
if (time1 < time2) return -1;
if (time1 > time2) return 1;
return 0;
}).slice(0, 2);
I'll add, however, that it's typically considered best practice not to use to the synchronize fs methods (e.g. fs.statSync). If you're able to install async, that would be a good alternative approach.
const fs = require('fs');
const path = require('path');
const getMostRecentFile = (dir) => {
const files = orderReccentFiles(dir);
return files.length ? files[0] : undefined;
};
const orderReccentFiles = (dir) => {
return fs.readdirSync(dir)
.filter(file => fs.lstatSync(path.join(dir, file)).isFile())
.map(file => ({ file, mtime: fs.lstatSync(path.join(dir, file)).mtime }))
.sort((a, b) => b.mtime.getTime() - a.mtime.getTime());
};
const dirPath = '<PATH>';
console.log(getMostRecentFile(dirPath));
I am using wkhtmltopdf to generate pdfs in nodejs
Below is my sample code to generate pdf
var wkhtmltopdf = require('wkhtmltopdf')
, createWriteStream = require('fs').createWriteStream;
var r = wkhtmltopdf('http://www.google.com', { pageSize: 'letter' })
.pipe(createWriteStream('C:/MYUSERNAME/demo.pdf'));
r.on('close', function(){
mycallback();
});
The above code is generating corrupt pdfs. I could not figure out the issue.
Although when I generate pdfs using command prompt it is generating correctly
like when I use below code in windows command prompt
wkhtmltopdf http://www.google.com demo.pdf
I get correct pdf generated,sadly when I try to generate pdf in node environment, it generates corrupt pdfs.
Incase it helps I'm using wkhtmltopdf 0.11.0 rc2
Thanks in advance.
wkhtmltopdf for node has a bug for windows, so you can write a new one.
Like this:
function wkhtmltopdf(input, pageSize) {
var spawn = require('child_process').spawn;
var html;
var isUrl = /^(https?|file):\/\//.test(input);
if (!isUrl) {
html = input;
input = '-';
}
var args = ['wkhtmltopdf', '--quiet', '--page-size', pageSize, input, '-']
if (process.platform === 'win32') {
var child = spawn(args[0], args.slice(1));
} else {
var child = spawn('/bin/sh', ['-c', args.join(' ') + ' | cat']);
}
if (!isUrl) {
child.stdin.end(html);
}
return child.stdout;
}
// usage:
createWriteStream = require('fs').createWriteStream;
wkhtmltopdf('http://google.com/', 'letter')
.pipe(createWriteStream('demo1.pdf'));
wkhtmltopdf('<body>hello world!</body>', 'letter')
.pipe(createWriteStream('demo2.pdf'));
note: the param is now 'letter' not { pageSize: 'letter' }