I have a question.
I have a file which contains one keyword on each line (5,000 in total).
I'm developing a scraper using Puppeteer in Node.js which will go to a website that has a search bar, and I want to search in that bar using the keywords from the file. Could someone please guide me on how to accomplish this? Am I using the right tools? I would be grateful.
You can use:
// yourVariable is the text data. Note that JavaScript strings
// have an includes() method (there is no contains()).
if (yourVariable.includes("What you are looking for.")) {
  // the code
}
If that is what you mean.
Otherwise, if you mean going through each of the 5,000 lines in the text file, you can use nthline in an async function by running:
npm i nthline --save
in your console, then:
;(async () => {
  const nthline = require('nthline'),
    filePath = '/path/to/100-million-rows-file',
    rowIndex = 42

  console.log(await nthline(rowIndex, filePath))
})()
Or if you want to check whether the line contains some data:
;(async () => {
  const nthline = require('nthline'),
    filePath = '/path/to/100-million-rows-file',
    rowIndex = 42

  // Note the extra parentheses: await the line first, then check it.
  console.log((await nthline(rowIndex, filePath)).includes("Data"))
})()
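To tie this back to the original Puppeteer question, here is a minimal sketch of the search loop. The file name keywords.txt, the URL, and the #search-input selector are placeholder assumptions you would replace with the real site's values:
const fs = require('fs')
const puppeteer = require('puppeteer')

;(async () => {
  // One keyword per line; drop empty lines.
  const keywords = fs.readFileSync('keywords.txt', 'utf8')
    .split('\n')
    .filter(Boolean)

  const browser = await puppeteer.launch()
  const page = await browser.newPage()

  for (const keyword of keywords) {
    await page.goto('https://example.com') // placeholder URL
    await page.type('#search-input', keyword) // placeholder selector
    await Promise.all([
      page.keyboard.press('Enter'),
      page.waitForNavigation(),
    ])
    // ... scrape the results page here
  }

  await browser.close()
})()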
I'm not a very experienced developer, but I am looking to structure my project so it is easier to work on.
Let's say I have a function like this:
const x = async (tx, hobby) => {
  const result = await tx.run(
    "MATCH (a:Person) - [r] -> (b:$hobby) " +
    "RETURN properties(a)",
    { hobby }
  )
  return result
}
Can I put my Cypher query scripts in separate files and reference them? I have seen a similar pattern for SQL scripts.
This is what I'm thinking:
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')

const x = async (tx, hobby) => {
  const result = await tx.run(
    CYPHER_SCRIPT,
    { hobby }
  )
  return result
}
...or will I need to stringify the contents of the .cyp file?
Thanks
You can use the @cybersam/require-cypher package (which I just created).
For example, if folder/myCypherScript.cyp contains this:
MATCH (a:Person)-->(:$hobby)
RETURN PROPERTIES(a)
then after the package is installed (npm i @cybersam/require-cypher), this code will output the contents of that file:
// Just require the package. You don't usually need to use the returned module directly.
// Handlers for files with extensions .cyp, .cql, and .cypher will be registered.
require('@cybersam/require-cypher');

// Now require() will return the string content of Cypher files
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')
console.log(CYPHER_SCRIPT);
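If you would rather not add a package, a plain fs read gives you the same string - a minimal sketch, assuming the .cyp file sits next to the calling module:
const fs = require('fs')
const path = require('path')

// Read the Cypher script once, at module load time.
const CYPHER_SCRIPT = fs.readFileSync(
  path.join(__dirname, 'folder/myCypherScript.cyp'),
  'utf8'
)

const x = async (tx, hobby) => {
  const result = await tx.run(CYPHER_SCRIPT, { hobby })
  return result
}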
I am trying to write a simple Node script to resize large files (intended as a solution to an issue with large portrait-oriented files). The main part of the code comes directly from the Gatsby docs.
module.exports = optimizeImages = () => {
  const sharp = require(`sharp`)
  const glob = require(`glob`)

  const matches = glob.sync(`src/images/**/*!(optimized).{png,jpg,jpeg}`) // <-- There is the problem
  console.log('matches:', matches)

  const MAX_WIDTH = 1800
  const QUALITY = 70

  Promise.all(
    matches.map(async match => {
      const stream = sharp(match)
      const info = await stream.metadata()

      if (info.width < MAX_WIDTH) {
        return
      }

      const optimizedName = match.replace(
        /(\..+)$/,
        (match, ext) => `-optimized${ext}`
      )

      await stream
        .resize(MAX_WIDTH)
        .jpeg({ quality: QUALITY })
        .toFile(optimizedName)
        .then(newFile => console.log(newFile))
        .catch(error => console.log(error))

      return true
    })
  )
}
The code seems to be working as intended, BUT I can't figure out how to exclude the filenames which are already optimized. Their names end with an '-optimized' suffix.
src/images/foo.jpg should be processed
src/images/bar-optimized.jpg should be ignored
I've tried to use the pattern src/images/**/*!(optimized).{png,jpg,jpeg}, but this does not work. I've tried using {ignore: 'src/images/**/*!(optimized)'}, but that does not work either.
Any help would be greatly appreciated.
It turns out that this works as intended:
const matches = glob.sync(`src/images/**/*.{png,jpg,jpeg}`, {
  ignore: ['src/images/**/*-optimized.*']
})
Important clues were found in answers to this question.
Ran across this answer when I had a sync glob issue, and realized you could take it a step further with D.R.Y. and build a reusable glob helper with something like:
import path from 'path'
import glob from 'glob'

export const globFiles = (dirPath, match, exclusion = []) => {
  const arr = exclusion.map(e => path.join(dirPath, e))
  return glob.sync(path.join(dirPath, match), {
    ignore: arr,
  })
}
From the code above it would be called as:
const fileArray = globFiles('src/images/**/','*.{png,jpg,jpeg}', ['*-optimized.*'])
I apologise for the phrasing of the question - it's a bit difficult to sum up as a question - please feel free to edit it if you can clarify. Also, as this is quite a complex and long query - thank you to all those who are putting in the time to read through it!
I have 4 files (listed with the directory tree from the project root) as part of a project I'm building, which aims to scrape blockchains and take advantage of multiple cores to get the job done:
./main.js
./scraper.js
./api/api.js
./api/litecoin_api.js
main.js
const { scraper } = require('./scraper.js')
const { spawn } = require('child_process')

const blockchainCli = process.env.BLOCKSCRAPECLI || 'litecoin-cli'

const client = (args) => {
  // create child process which returns a promise which resolves after
  // data has finished buffering from locally hosted node using cli
  let child = spawn(`${blockchainCli} ${args.join(' ')}`, {
    shell: true
  })
  // ... wrap command in a promise here, etc
}
const main = () => {
  // count cores, spawn a worker per core using node cluster, add
  // message handlers, then begin scraping blockchain with each core...
  scraper(blockHeight)
}

main()

module.exports = {
  client,
  blockchainCli
}
scraper.js
const api = require('./api/api.js')

const scraper = async (blockHeight) => {
  try {
    let blockHash = await api.getBlockHashByHeight(blockHeight)
    let block = await api.getBlock(blockHash)
    // ... etc, scraper tested and working, writes to shared writeStream
  } catch (error) {
    console.error(error)
  }
}

module.exports = {
  scraper
}
api.js
const { client, blockchainCli } = require('../main.js')
const litecoin = require('./litecoin_api')

let blockchain = undefined

if (blockchainCli === 'litecoin-cli' || blockchainCli === 'bitcoin-cli') {
  blockchain = litecoin
}

// PROBLEM HERE: blockchainCli (and client) are both undefined if and
// only if running scraper from main.js (but not if running scraper
// from scraper.js)
const decodeRawTransaction = (txHash) => {
  return client([blockchain.decodeRawTransaction, txHash])
}

const getBlock = (blockhash) => {
  return client([blockchain.getBlock, blockhash])
}

const getBlockHashByHeight = (height) => {
  return client([blockchain.getBlockHash, height])
}

const getInfo = () => {
  return client([blockchain.getInfo])
}

const getRawTransaction = (txHash, verbose = true) => {
  return client([blockchain.getRawTransaction, txHash, verbose])
}

module.exports = {
  decodeRawTransaction,
  getBlock,
  getBlockHashByHeight,
  getInfo,
  getRawTransaction
}
So, I've taken out most of the noise in the files which I don't think is necessary, but it's open source, so if you need more, take a look here.
The problem is that if I start the scraper from inside scraper.js by doing, say, something like this: scraper(1234567), it works like a charm and outputs the expected data to a CSV file.
However if I start the scraper from inside the main.js file, I get this error:
Cannot read property 'getBlockHash' of undefined
at Object.getBlockHashByHeight (/home/grayedfox/github/blockscrape/api/api.js:19:29)
at scraper (/home/grayedfox/github/blockscrape/scraper.js:53:31)
at Worker.messageHandler (/home/grayedfox/github/blockscrape/main.js:81:5)
I don't know why, when launching the scraper from main.js, the blockchain is undefined. I thought it might be from the destructuring, but removing the curly braces from around the first line in the example main.js file doesn't change anything (same error).
Things are a bit messy at the moment (in the middle of developing this branch) - but the essential problem now is that it's not clear to me why the require would fail (cannot see variables inside main.js) if it's used in the following way:
main.js (execute scraper()) > scraper.js > api.js
But not fail (can see variables inside main.js) if it's run like this:
scraper.js (execute scraper()) > api.js
Thank you very much for your time!
You have a circular dependency between main and api, each requiring in the other: main requires api through scraper, and api directly requires main. When Node hits a circular require, one module receives the other's exports object before it has been filled in, which is why client and blockchainCli show up as undefined.
You have to remove the circular dependency by putting the common shared code into its own module that can be included by both, but doesn't itself include the modules that include it. It just needs better modularity.
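A minimal sketch of one way to break the cycle - moving client and blockchainCli into a new client.js (the name is just a suggestion) that requires nothing from main.js or api.js:
// client.js - owns the shared pieces; has no dependency on main.js or api.js
const { spawn } = require('child_process')

const blockchainCli = process.env.BLOCKSCRAPECLI || 'litecoin-cli'

const client = (args) => {
  let child = spawn(`${blockchainCli} ${args.join(' ')}`, {
    shell: true
  })
  // ... wrap command in a promise here, etc
}

module.exports = { client, blockchainCli }
api.js then switches to const { client, blockchainCli } = require('./client.js'), main.js requires the same module, and the dependency graph becomes a tree instead of a loop.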
I would like to add async/await support to the Node REPL.
Following this issue: https://github.com/nodejs/node/issues/8382
I've tried to use https://github.com/paulserraino/babel-repl, but it is missing async/await support.
I would like to use this snippet:
const awaitMatcher = /^(?:\s*(?:(?:let|var|const)\s)?\s*([^=]+)=\s*|^\s*)(await\s[\s\S]*)/;

const asyncWrapper = (code, binder) => {
  let assign = binder ? `root.${binder} = ` : '';
  return `(function(){ async function _wrap() { return ${assign}${code} } return _wrap();})()`;
};

// match & transform
const match = input.match(awaitMatcher);
if (match) {
  input = `${asyncWrapper(match[2], match[1])}`;
}
How can I add this snippet to a custom eval on node repl?
Example in node repl:
> const user = await User.findOne();
As of node ^10, you can use the following flag when starting the repl:
node --experimental-repl-await
$ await myPromise()
There is the project https://github.com/ef4/async-repl:
$ async-repl
async> 1 + 2
3
async> 1 + await new Promise(r => setTimeout(() => r(2), 1000))
3
async> let x = 1 + await new Promise(r => setTimeout(() => r(2), 1000))
undefined
async> x
3
async>
Another option, slightly onerous to start but with a great UI, is to use the Chrome Devtools:
$ node --inspect -r esm
Debugger listening on ws://127.0.0.1:9229/b4fb341e-da9d-4276-986a-46bb81bdd989
For help see https://nodejs.org/en/docs/inspector
> Debugger attached.
(I am using the esm package here to allow Node to parse import statements.)
Then you go to chrome://inspect in Chrome and you will be able to connect to the node instance. Chrome Devtools has top-level await, great tab-completion etc.
The idea is to preprocess the command and wrap it in an async function if there is await syntax outside an async function.
This gist is the final solution: https://gist.github.com/princejwesley/a66d514d86ea174270210561c44b71ba
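A rough sketch of how that preprocessing can be wired into a custom eval, along the lines of the gist: capture the REPL's default eval function from the eval property and wrap it.
const repl = require('repl')

const awaitMatcher = /^(?:\s*(?:(?:let|var|const)\s)?\s*([^=]+)=\s*|^\s*)(await\s[\s\S]*)/;

const asyncWrapper = (code, binder) => {
  let assign = binder ? `root.${binder} = ` : '';
  return `(function(){ async function _wrap() { return ${assign}${code} } return _wrap();})()`;
};

const replInstance = repl.start({ prompt: 'async> ' })
const defaultEval = replInstance.eval

// Preprocess each command, then delegate to the default eval.
replInstance.eval = (cmd, context, filename, callback) => {
  const match = cmd.match(awaitMatcher)
  if (match) {
    cmd = `${asyncWrapper(match[2], match[1])}`
  }
  defaultEval(cmd, context, filename, callback)
}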
Is there a way to extract text from PDFs in Node.js without any OS dependencies (like pdf2text, or xpdf on Windows)? I wasn't able to find any 'native' PDF packages in Node.js. They are always wrappers/utilities on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. Though it does not provide the text output as a single line, I believe you can reconstruct the final text from the generated JSON output (see the sketch after the field list below):
'Texts': an array of text blocks with position, actual text and styling information:
  'x' and 'y': relative coordinates for positioning
  'clr': a color index in the color dictionary, same 'clr' field as in the 'Fill' object. If a color can be found in the color dictionary, an 'oc' field will be added to the field as the 'original color' value.
  'A': text alignment, including:
    left
    center
    right
  'R': an array of text runs; each text run object has two main fields:
    'T': actual text
    'S': style index from the style dictionary. More info about the 'Style Dictionary' can be found in the 'Dictionary Reference' section
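A rough sketch of that reconstruction using pdf2json's event API - the 'T' values are URI-encoded (hence the decodeURIComponent), and note that older pdf2json versions nest the pages under pdfData.formImage.Pages rather than pdfData.Pages:
const PDFParser = require('pdf2json')

const pdfParser = new PDFParser()

pdfParser.on('pdfParser_dataError', errData => console.error(errData.parserError))
pdfParser.on('pdfParser_dataReady', pdfData => {
  // Join the 'R' runs of every 'Texts' block, page by page.
  const text = pdfData.Pages
    .map(page => page.Texts
      .map(block => block.R.map(run => decodeURIComponent(run.T)).join(''))
      .join(' '))
    .join('\n')
  console.log(text)
})

pdfParser.loadPDF('./sample.pdf')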
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first install the package on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");

async function GetTextFromPDF(path) {
  let doc = await pdfjsLib.getDocument(path).promise;
  let page1 = await doc.getPage(1);
  let content = await page1.getTextContent();
  let strings = content.items.map(function(item) {
    return item.str;
  });
  return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
Thought I'd chime in here for anyone who comes across this question in the future.
I had this problem and spent hours going through literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda, so it could not depend on OS dependencies.
The code below is adapted from another Stack Overflow answer (which I cannot currently find). The only difference is that we import the ES5 version, which works with Node >= 12. If you just import pdfjs-dist, there will be an error of "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}
Usage
// readFileSync (not the callback-based fs.readFile) so we get the buffer directly
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me on Node 14.20.1, using "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');

const extractText = async (pathStr) => {
  assert(fs.existsSync(pathStr), `Path does not exist ${pathStr}`);
  const pdfFile = path.resolve(pathStr);
  const dataBuffer = fs.readFileSync(pdfFile);
  const data = await pdf(dataBuffer);
  return data.text;
};

module.exports = {
  extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty, the owner of PDF2Json, to keep its PDF.js base up to date.