PDF to Text extractor in nodejs without OS dependencies

PDF to Text extractor in nodejs without OS dependencies - node.js

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command.
Thanks

Have you checked PDF2Json? It is built on top of PDF.js. Though it is not providing the text output as a single line but I believe you may just reconstruct the final text based on the generated Json output:
'Texts': an array of text blocks with position, actual text and styling informations:
'x' and 'y': relative coordinates for positioning
'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color" value.
'A': text alignment, including:
left
center
right
'R': an array of text run, each text run object has two main fields:
'T': actual text
'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section

After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first npm install on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");
async function GetTextFromPDF(path) {
let doc = await pdfjsLib.getDocument(path).promise;
let page1 = await doc.getPage(1);
let content = await page1.getTextContent();
let strings = content.items.map(function(item) {
return item.str;
});
return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));

Thought I'd chime in here for anyone who came across this question in the future.
I had this problem and spent hours over literally all the PDF libraries on NPM. My requirements were that I needed to run it on AWS Lambda so could not depend on OS dependencies.
The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference being that we import the ES5 version which works with Node >= 12. If you just import pdfjs-dist there will be an error of "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';
export default class Pdf {
public static async getPageText(pdf: any, pageNo: number) {
const page = await pdf.getPage(pageNo);
const tokenizedText = await page.getTextContent();
const pageText = tokenizedText.items.map((token: any) => token.str).join('');
return pageText;
}
public static async getPDFText(source: any): Promise<string> {
const pdf = await pdfjslib.getDocument(source).promise;
const maxPages = pdf.numPages;
const pageTextPromises = [];
for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
}
const pageTexts = await Promise.all(pageTextPromises);
return pageTexts.join(' ');
}
}
Usage
const fileBuffer = fs.readFile('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);

This solution worked for me using node 14.20.1 using "pdf-parse": "^1.1.1"
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
assert (fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
const pdfFile = path.resolve(pathStr)
const dataBuffer = fs.readFileSync(pdfFile);
const data = await pdf(dataBuffer)
return data.text
}
module.exports = {
extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))

Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty who owns PDF2Json and that he updates the PDF.js base.

Related

Feeding PDF generated from pdfkit as input to pdf-lib for merging

I am trying to send a pdfkit generated pdf file as input to pdflib for merging. I am using async function. My project is being developed using sails Js version:"^1.2.3", "node": "^12.16", my pdf-kit version is: "^0.11.0", "pdf-lib": "^1.9.0",
This is the code:
const textbytes=fs.readFileSync(textfile);
var bytes1 = new Uint8Array(textbytes);
const textdoc = await PDFDocumentFactory.load(bytes1)
The error i am getting is:
UnhandledPromiseRejectionWarning: Error: Failed to parse PDF document (line:0 col:0 offset=0): No PDF header found
Please help me with this issue.

You really don't need this line.
var bytes1 = new Uint8Array(textbytes);
By just reading the file and sending textbytes in the parameters is more than enough.
I use this function to merge an array of pdfBytes to make one big PDF file:
async function mergePdfs(pdfsToMerge)
{
const mergedPdf = await pdf.PDFDocument.create();
for (const pdfCopyDoc of pdfsToMerge)
{
const pdfDoc = await pdf.PDFDocument.load(pdfCopyDoc);
const copiedPages = await mergedPdf.copyPages(pdfDoc, pdfDoc.getPageIndices());
copiedPages.forEach((page) => {
mergedPdf.addPage(page);
});
}
const mergedPdfFile = await mergedPdf.save();
return mergedPdfFile;
};
So basically after you add the function mergePdfs(pdfsToMerge)
You can just use it like this:
const textbytes = fs.readFileSync(textfile);
const textdoc = await PDFDocumentFactory.load(bytes1)
let finalPdf = await mergePdfs(textdoc);

How can I import a Cypher query to my Node.js logic?

I'm not a very experienced developer but I am looking too structure my project so it is easier to work on.
Lets say I have a function like this:
const x = async (tx, hobby) => {
const result = await tx.run(
"MATCH (a:Person) - [r] -> (b:$hobby) " +
"RETURN properties(a)",
{ hobby }
)
return result
}
Can I put my cypher query scripts in seperate files, and reference it? I have seen a similar pattern for SQL scripts.
This is what I'm thinking:
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')
const x = async (tx, hobby) => {
const result = await tx.run(
CYPHER_SCRIPT,
{ hobby }
)
return result
}
..or will i need to stringify the contents of the .cyp file?
Thanks

You can use the #cybersam/require-cypher package (which I just created).
For example, if folder/myCypherScript.cyp contains this:
MATCH (a:Person)-->(:$hobby)
RETURN PROPERTIES(a)
then after the package is installed (npm i #cybersam/require-cypher), this code will output the contents of that file:
// Just require the package. You don't usually need to use the returned module directly.
// Handlers for files with extensions .cyp, .cql, and .cypher will be registered.
require('#cybersam/require-cypher');
// Now require() will return the string content of Cypher files
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')
console.log(CYPHER_SCRIPT);

Ignore files containing given substring using glob.sync()

I am trying to write a simple node script to resize large files (intended to be as a solution to an issue with large portrait oriented files). The main part of the code comes directly from gatsby docs.
module.exports = optimizeImages = () => {
const sharp = require(`sharp`)
const glob = require(`glob`)
const matches = glob.sync(`src/images/**/*!(optimized).{png,jpg,jpeg}`) // <-- There is the problem
console.log('matches:', matches)
const MAX_WIDTH = 1800
const QUALITY = 70
Promise.all(
matches.map(async match => {
const stream = sharp(match)
const info = await stream.metadata()
if (info.width < MAX_WIDTH) {
return
}
const optimizedName = match.replace(
/(\..+)$/,
(match, ext) => `-optimized${ext}`
)
await stream
.resize(MAX_WIDTH)
.jpeg({ quality: QUALITY })
.toFile(optimizedName)
.then(newFile => console.log(newFile))
.catch(error => console.log(error))
return true
})
)
}
The code seems to be working as intended, BUT I can't figure out how to unmatch the filenames which are allready optimized. Their names should end with '-optimized' suffix.
src/images/foo.jpg should be proccessed
src/images/bar-optimized.jpg should be ignored
I've tried to use the pattern src/images/**/*!(optimized).{png,jpg,jpeg}, but this does not work. I've tried using {ignore: 'src/images/**/*!(optimized)'}, but that does not work either.
Any help would be greatly appreciated.

It turns out that this works as intended:
const matches = glob.sync(`src/images/**/*.{png,jpg,jpeg}`, {
ignore: ['src/images/**/*-optimized.*']
})
Important clues were found in answers to this question.

Ran across this answer when I had a sync glob issue and realized you could take it a step further with D.R.Y. and build it as a re-usable glob with something like:
import path from 'path'
import glob from 'glob'
export const globFiles = (dirPath, match, exclusion = []) => {
const arr = exclusion.map(e => path.join(dirPath, e))
return glob.sync(path.join(dirPath, match), {
ignore: arr,
})
}
From the code above it would be called as:
const fileArray = globFiles('src/images/**/','*.{png,jpg,jpeg}', ['*-optimized.*'])

How to increase excel sheet column width while downloading in react js ? ,i am using a package called "export-from-json"

downloadPropertiesInXl = async () => {
let API_URL = "something....";
const property = await axios.get(API_URL);
const data = property.data;
const fileName = "download";
const exportType = "xls";
exportFromJSON({ data, fileName, exportType });
}
};
is there any other packages to change column width??

Use Excel.js, it have many options for customization
https://www.npmjs.com/package/exceljs#columnsex

Generating a json for a icon cheatsheet

I'm trying to generate a json file containing the filenames of all the files in a certain directory. I need this to create a cheatsheet for icons.
Currently I'm trying to run a script locally via terminal, to generate the json. That json will be the input for a react component that will display icons. That component works, the create json script doesn't.
Code for generating the json
const fs = require('fs');
const path = require('path');
/**
* Create JSON file
*/
const CreateJson = () => {
const files = [];
const dir = '../icons';
fs.readdirSync(dir).forEach(filename => {
const name = path.parse(filename);
const filepath = path.resolve(dir, filename);
const stat = fs.statSync(filepath);
const isFile = stat.isFile();
if (isFile) files.push({ name });
});
const data = JSON.stringify(files, null, 2);
fs.writeFileSync('../Icons.json', data);
};
module.exports = CreateJson;
I run it in terminal using
"create:json": "NODE_ENV=build node ./scripts/CreateJson.js"
I expect a json file to be created/overridden. But terminal returns:
$ NODE_ENV=build node ./scripts/CreateJson.js
✨ Done in 0.16s.
Any pointers?

You are creating a function CreateJson and exporting it, but you are actually never calling it.
You can get rid of the module.exports and replace it with CreateJson().
When you'll execute the file with node, it will see the function declaration, and a call to it, whereas with your current code there is no call.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

PDF to Text extractor in nodejs without OS dependencies - node.js

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command. Thanks

Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty who owns PDF2Json and that he updates the PDF.js base.

Related

Feeding PDF generated from pdfkit as input to pdf-lib for merging

How can I import a Cypher query to my Node.js logic?

Ignore files containing given substring using glob.sync()

How to increase excel sheet column width while downloading in react js ? ,i am using a package called "export-from-json"

Generating a json for a icon cheatsheet

Categories

Resources