Convert PDF to get vectorized text ("convert all text to outlines") - node.js

I'm using Node.js to process PDFs. One thing I'd like to do is outline all the fonts of the PDF (so that the text is no longer selectable with the mouse cursor afterwards).
I tried pdftk's flatten command (through a Node wrapper), but it did not do what I wanted.
Inkscape (from the command line) might be a lead, but I'm not even sure how to go about it. I'm really looking for the easiest way to do this from Node.js.
Ghostscript might be another lead: https://stackoverflow.com/a/28798374/11348232. One thing to note is that I don't work with files on disk but with Buffer objects, so it would be painful to save the PDF locally and then run the gs command.
Thanks a lot.

I finally followed @KenS' suggestion:
import util from 'util';
import childProcess from 'child_process';
import fs from 'fs';
import os from 'os';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';

const exec = util.promisify(childProcess.exec);

const unlinkCallback = (err: NodeJS.ErrnoException | null) => {
  if (err) {
    console.error(err);
  }
};

const deleteFile = (filePath: fs.PathLike) => {
  if (fs.existsSync(filePath)) {
    fs.unlink(filePath, unlinkCallback);
  }
};

const createTempPathPDF = () => path.join(os.tmpdir(), `${uuidv4()}.pdf`);

const convertFontsToOutlines = async (buffer: Buffer): Promise<Buffer> => {
  const inputPath = createTempPathPDF();
  const outputPath = createTempPathPDF();
  let bufferWithOutlines: Buffer;

  // Write the in-memory PDF to a temp file synchronously so it is fully
  // flushed before Ghostscript reads it.
  fs.writeFileSync(inputPath, buffer);

  try {
    // ! ghostscript package MUST be installed on the system
    await exec(`gs -o ${outputPath} -dNoOutputFonts -sDEVICE=pdfwrite ${inputPath}`);
    bufferWithOutlines = fs.readFileSync(outputPath);
  } catch (e) {
    console.error(e);
    bufferWithOutlines = buffer;
  }

  deleteFile(inputPath);
  deleteFile(outputPath);

  return bufferWithOutlines;
};
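Usage is then just a matter of passing the in-memory PDF around (here pdfBuffer stands for whatever Buffer you already hold):
const outlinedPdf: Buffer = await convertFontsToOutlines(pdfBuffer);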

Related

Node js TypeScript: Good way to copy one file content from File object to another File object

I am on Node version 16.5.0 with TypeScript.
Problem Statement
I have two File objects, src and destination (note: not file paths of type string, but file objects of type File).
I want to copy the src content to the destination in the best (most optimal) way. I can use streaming if needed.
Tried so far
import fs from 'fs';
import { isUndefined } from 'lodash';
import path from 'path';

export class FileUtil {
  public saveLogFile(parentDir: File, instanceId: string, content: File) {
    const srcFilePath: string = `${parentDir}/instance_${instanceId}_mediatorLog.tgz`;
    const destFilePath: string = `${parentDir}/${content.name}`;
    const srcFileWithAbsolutePath = path.resolve(srcFilePath);
    const destFileAbsolutePath = path.resolve(destFilePath);
    fs.stat(srcFilePath, function (err) {
      if (isUndefined(err)) {
        fs.copyFile(srcFileWithAbsolutePath, destFileAbsolutePath, (err) => {
          if (err) {
            throw err;
          }
        });
      }
    });
  }
}
The solution is glitchy so far. I also don't think it's well optimized (though I haven't benchmarked it yet).
Reaching out to the experts: is there a better way to do this in the Node.js + TS world?
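A minimal sketch of the streaming route with stream/promises, assuming the two File objects can first be resolved to plain path strings (copySrcToDest is just an illustrative name):
import fs from 'fs';
import { pipeline } from 'stream/promises';

// Sketch: stream the source file into the destination file.
// Assumes srcPath and destPath are path strings resolved from the File objects.
export const copySrcToDest = async (srcPath: string, destPath: string): Promise<void> => {
  await pipeline(fs.createReadStream(srcPath), fs.createWriteStream(destPath));
};
For whole-file copies, fs.promises.copyFile(srcPath, destPath) does the same job in a single call.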

Use GLTF-transform merge command - node js script

I want to use the gltf-transform "merge" command in my script, so I wrote something like this to merge two or more glTF files.
const { merge } = require('@gltf-transform/cli');
const fileA = '/model_path/fileA.gltf';
const fileB = '/model_path/fileB.gltf';
merge(fileA, fileB, output.gltf, false);
But nothing happened. No output file or console log. So I don't know where to continue my little journey.
Would be great if someone has a clue.
An alternative was...
const command = "gltf-transform merge("+fileA+", "+fileB+", output.gltf, false)";
exec(command, (error, stdout, stderr) => {
  if (error) {
    console.log(`error: ${error.message}`);
    return;
  }
  if (stderr) {
    console.log(`stderr: ${stderr}`);
    return;
  }
  console.log(`stdout: ${stdout}`);
});
... but that didn't work either and seems needlessly roundabout.
The @gltf-transform/cli package you're importing isn't really designed for programmatic usage; it's just a CLI. Code meant for programmatic use is in the /functions, /core, and /extensions packages.
If you'd like to implement something similar in JavaScript, I would try this instead:
import { Document, NodeIO } from '@gltf-transform/core';

const io = new NodeIO();

const inputDocument1 = io.read('./model_path/fileA.gltf');
const inputDocument2 = io.read('./model_path/fileB.gltf');

const outputDocument = new Document()
  .merge(inputDocument1)
  .merge(inputDocument2);

// Optional: Merge binary resources to a single buffer.
const buffer = outputDocument.getRoot().listBuffers()[0];
outputDocument.getRoot().listAccessors().forEach((a) => a.setBuffer(buffer));
outputDocument.getRoot().listBuffers().forEach((b, index) => (index > 0 ? b.dispose() : null));

io.write('./model_path/output.gltf', outputDocument);
This will create a single glTF file containing two scenes. If you wanted to merge the contents of both files into a single scene, that would require moving some nodes around with the Scene API, roughly as sketched below.
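A sketch of that single-scene variant, assuming outputDocument from the snippet above and using the Scene methods listScenes, listChildren, addChild and dispose:
// Move every node from the second scene into the first, then drop the now-empty scene.
const [sceneA, sceneB] = outputDocument.getRoot().listScenes();
sceneB.listChildren().forEach((node) => sceneA.addChild(node));
sceneB.dispose();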
You can also use @gltf-transform/cli by spawning it as a child process.
const { spawn } = require('child_process');
// you should run `node @gltf-transform/cli/bin/cli.js`
const gltf_transform_cli_path = `your project's node_modules dir path${path.sep}@gltf-transform${path.sep}cli${path.sep}bin${path.sep}cli.js`
const normPath = path.normalize(gltf_transform_cli_path);
const command = `node ${normPath} merge ${fileA} ${fileB} output.gltf false`
const result = spawn(command, { shell: true }) // shell: true lets spawn run the full command string
In addition, if you're going to use KTX texture compression (etc1s, uastc), you've got to install KTX-Software on your local machine.
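For example, compression can be run through the same spawn approach (a sketch; it assumes KTX-Software is installed and normPath points at cli.js as above):
// Sketch: run the CLI's etc1s (or uastc) command through the same spawn approach.
const compressCommand = `node ${normPath} etc1s ${fileB} compressed.glb`;
spawn(compressCommand, { shell: true });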

Ignore files containing given substring using glob.sync()

I am trying to write a simple Node script to resize large files (intended as a solution to an issue with large portrait-oriented files). The main part of the code comes directly from the Gatsby docs.
module.exports = optimizeImages = () => {
  const sharp = require(`sharp`)
  const glob = require(`glob`)

  const matches = glob.sync(`src/images/**/*!(optimized).{png,jpg,jpeg}`) // <-- There is the problem
  console.log('matches:', matches)

  const MAX_WIDTH = 1800
  const QUALITY = 70

  Promise.all(
    matches.map(async match => {
      const stream = sharp(match)
      const info = await stream.metadata()

      if (info.width < MAX_WIDTH) {
        return
      }

      const optimizedName = match.replace(
        /(\..+)$/,
        (match, ext) => `-optimized${ext}`
      )

      await stream
        .resize(MAX_WIDTH)
        .jpeg({ quality: QUALITY })
        .toFile(optimizedName)
        .then(newFile => console.log(newFile))
        .catch(error => console.log(error))

      return true
    })
  )
}
The code seems to be working as intended, BUT I can't figure out how to exclude the filenames which are already optimized. Their names end with an '-optimized' suffix.
src/images/foo.jpg should be processed
src/images/bar-optimized.jpg should be ignored
I've tried to use the pattern src/images/**/*!(optimized).{png,jpg,jpeg}, but this does not work. I've tried using {ignore: 'src/images/**/*!(optimized)'}, but that does not work either.
Any help would be greatly appreciated.
It turns out that this works as intended:
const matches = glob.sync(`src/images/**/*.{png,jpg,jpeg}`, {
  ignore: ['src/images/**/*-optimized.*']
})
Important clues were found in answers to this question.
Ran across this answer when I had a glob.sync issue and realized you could take it a step further with D.R.Y. and build a reusable glob helper with something like:
import path from 'path'
import glob from 'glob'

export const globFiles = (dirPath, match, exclusion = []) => {
  const arr = exclusion.map(e => path.join(dirPath, e))
  return glob.sync(path.join(dirPath, match), {
    ignore: arr,
  })
}
From the code above it would be called as:
const fileArray = globFiles('src/images/**/', '*.{png,jpg,jpeg}', ['*-optimized.*'])

Is there a more elegant way to read then write *the same file* with node js stream

I want to read a file, transform it with through2, then write it back into the same file. Code like:
const rm = require('rimraf')
const through2 = require('through2')
const fs = require('graceful-fs')

// source file path
const replacementPath = `./static/projects/${destPath}/index.html`
// temp file path
const tempfilePath = `./static/projects/${destPath}/tempfile.html`

// read source file then write into temp file
await promiseReplace(replacementPath, tempfilePath)
// del the source file
rm.sync(replacementPath)
// rename the temp file name to source file name
fs.renameSync(tempfilePath, replacementPath)
// del the temp file
rm.sync(tempfilePath)

// promiseify readStream and writeStream
function promiseReplace (readfile, writefile) {
  return new Promise((res, rej) => {
    fs.createReadStream(readfile)
      .pipe(through2.obj(function (chunk, encoding, done) {
        const replaced = chunk.toString().replace(/id="wrap"/g, 'dududud')
        done(null, replaced)
      }))
      .pipe(fs.createWriteStream(writefile))
      .on('finish', () => {
        console.log('replace done')
        res()
      })
      .on('error', (err) => {
        console.log(err)
        rej(err)
      })
  })
}
The above code works, but I want to know whether I can make it more elegant.
I also tried some temp libraries like node-temp;
unfortunately, it cannot read-stream and write-stream into the same file either, and I opened an issue about this.
So if anyone knows a better way to do this, please tell me. Thank you very much.
You can make the code more elegant by getting rid of unnecessary dependencies and using the newer simplified constructor for streams.
const fs = require('fs');
const util = require('util');
const stream = require('stream');
const tempWrite = require('temp-write');

const rename = util.promisify(fs.rename);

const goat2llama = async (filePath) => {
  const str = fs.createReadStream(filePath, 'utf8')
    .pipe(new stream.Transform({
      decodeStrings : false,
      transform(chunk, encoding, done) {
        done(null, chunk.replace(/goat/g, 'llama'));
      }
    }));
  const tempPath = await tempWrite(str);
  await rename(tempPath, filePath);
};
Tests
AVA tests to prove that it works:
import fs from 'fs';
import path from 'path';
import util from 'util';
import test from 'ava';
import mkdirtemp from 'mkdirtemp';
import goat2llama from '.';

const writeFile = util.promisify(fs.writeFile);
const readFile = util.promisify(fs.readFile);

const fixture = async (content) => {
  const dir = await mkdirtemp();
  const fixturePath = path.join(dir, 'fixture.txt');
  await writeFile(fixturePath, content);
  return fixturePath;
};

test('goat2llama()', async (t) => {
  const filePath = await fixture('I like goats and frogs, but goats the best');
  await goat2llama(filePath);
  t.is(await readFile(filePath, 'utf8'), 'I like llamas and frogs, but llamas the best');
});
A few things about the changes:
Through2 is not really needed anymore. It used to be a pain to set up passthrough or transform streams properly, but that is not the case anymore thanks to the simplified construction API.
You probably don't need graceful-fs, either. Unless you are doing a lot of concurrent disk I/O, EMFILE is not usually a problem, especially these days as Node has gotten smarter about file descriptors. But that library does help with temporary errors caused by antivirus software on Windows, if that is a problem for you.
You definitely do not need rimraf for this. You only need fs.rename(). It is similar to mv on the command line, with a few nuances that make it distinct, but the differences are not super important here. The point is there will be nothing at the temporary path after you rename the file that was there.
I used temp-write because it generates a secure random filepath for you and puts it in the OS temp directory (which automatically gets cleaned up now and then), plus it handles converting the stream to a Promise for you and takes care of some edge cases around errors. Disclosure: I wrote the streams implementation in temp-write. :)
Overall, this is a decent improvement. However, there remains the chunk boundary problem discussed in the comments: a search string can be split across two chunks and escape the replacement. Luckily, you are not the first person to encounter this problem! I wouldn't call the actual solution particularly elegant, certainly not if you implement it yourself. But replacestream is here to help you.
const fs = require('fs');
const util = require('util');
const tempWrite = require('temp-write');
const replaceStream = require('replacestream');
const rename = util.promisify(fs.rename);
const goat2llama = async (filePath) => {
  const str = fs.createReadStream(filePath, 'utf8')
    .pipe(replaceStream('goat', 'llama'));
  const tempPath = await tempWrite(str);
  await rename(tempPath, filePath);
};
Also...
I do not like temp files
Indeed, temp files are often bad. However, in this case, the temp file is managed by a well-designed library and stored in a secure, out-of-the-way location. There is virtually no chance of conflicting with other processes. And even if the rename() fails somehow, the file will be cleaned up by the OS.
That said, you can avoid temp files altogether by using fs.readFile() and fs.writeFile() instead of streaming. The former also makes text replacement much easier since you do not have to worry about chunk boundaries. You have to choose one approach or the other; however, for very big files, streaming may be the only option, aside from manually chunking the file.
Streams are useless in this situation, because they return chunks of the file that can split the string you're searching for. You could use streams, then merge all the chunks to get the content, then replace the string you need, but that would be longer code that raises just one question: why read the file in chunks if you don't use them?
The shortest way to achieve what you want is:
let fileContent = fs.readFileSync('file_name.html', 'utf8')
let replaced = fileContent.replace(/id="wrap"/g, 'dududud')
fs.writeFileSync('file_name.html', replaced)
All these functions are synchronous, so you don't have to promisify them.
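If you'd rather not block the event loop, the same three steps also work with the promise-based fs API (a sketch; replaceInFile is just an illustrative name):
import { readFile, writeFile } from 'fs/promises';

// Sketch: non-blocking variant of the read-replace-write above.
const replaceInFile = async (filePath: string): Promise<void> => {
  const content = await readFile(filePath, 'utf8');
  await writeFile(filePath, content.replace(/id="wrap"/g, 'dududud'));
};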

PDF to Text extractor in nodejs without OS dependencies

Is there a way to extract text from PDFs in Node.js without any OS dependencies (like pdf2text or xpdf on Windows)? I wasn't able to find any 'native' PDF packages in Node.js; they are always a wrapper/util on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. Although it does not provide the text output as a single string, I believe you can reconstruct the final text from the generated JSON output (a sketch follows the field list below):
'Texts': an array of text blocks with position, actual text and styling information:
'x' and 'y': relative coordinates for positioning
'clr': a color index in the color dictionary, the same 'clr' field as in the 'Fill' object. If a color can be found in the color dictionary, an 'oc' field will be added to the field as the 'original color' value.
'A': text alignment, including:
left
center
right
'R': an array of text runs; each text run object has two main fields:
'T': actual text
'S': style index from the style dictionary. More info about the 'Style Dictionary' can be found in the 'Dictionary Reference' section.
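A rough sketch of that reconstruction, assuming PDF2Json's output shape of data.Pages[].Texts[].R[].T with each 'T' value URI-encoded (older versions nest this under a formImage property):
import PDFParser from 'pdf2json';

// Sketch: flatten PDF2Json's block structure back into plain text.
const pdfToText = (pdfPath: string): Promise<string> =>
  new Promise((resolve, reject) => {
    const parser = new PDFParser();
    parser.on('pdfParser_dataError', (err: any) => reject(err.parserError));
    parser.on('pdfParser_dataReady', (data: any) => {
      const text = data.Pages.map((page: any) =>
        page.Texts.map((t: any) =>
          decodeURIComponent(t.R.map((r: any) => r.T).join(''))
        ).join(' ')
      ).join('\n');
      resolve(text);
    });
    parser.loadPDF(pdfPath);
  });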
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first npm install on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");

async function GetTextFromPDF(path) {
  let doc = await pdfjsLib.getDocument(path).promise;
  let page1 = await doc.getPage(1);
  let content = await page1.getTextContent();
  let strings = content.items.map(function(item) {
    return item.str;
  });
  return strings;
}

module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
Thought I'd chime in here for anyone who came across this question in the future.
I had this problem and spent hours going through literally all the PDF libraries on NPM. My requirement was that it needed to run on AWS Lambda, so it could not rely on OS dependencies.
The code below is adapted from another Stack Overflow answer (which I cannot currently find). The only difference is that we import the ES5 build, which works with Node >= 12. If you just import pdfjs-dist, there will be a "Readable Stream is not defined" error. Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}
Usage
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me on Node 14.20.1 with "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');

const extractText = async (pathStr) => {
  assert(fs.existsSync(pathStr), `Path does not exist ${pathStr}`);
  const pdfFile = path.resolve(pathStr);
  const dataBuffer = fs.readFileSync(pdfFile);
  const data = await pdf(dataBuffer);
  return data.text;
};

module.exports = {
  extractText
};
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json, you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty (the owner of PDF2Json) to keep the underlying PDF.js base up to date.