I am trying to write a simple Node script to resize large files (intended as a solution to an issue with large portrait-oriented files). The main part of the code comes directly from the Gatsby docs.
module.exports = optimizeImages = () => {
const sharp = require(`sharp`)
const glob = require(`glob`)
const matches = glob.sync(`src/images/**/*!(optimized).{png,jpg,jpeg}`) // <-- There is the problem
console.log('matches:', matches)
const MAX_WIDTH = 1800
const QUALITY = 70
Promise.all(
matches.map(async match => {
const stream = sharp(match)
const info = await stream.metadata()
if (info.width < MAX_WIDTH) {
return
}
const optimizedName = match.replace(
/(\..+)$/,
(match, ext) => `-optimized${ext}`
)
await stream
.resize(MAX_WIDTH)
.jpeg({ quality: QUALITY })
.toFile(optimizedName)
.then(newFile => console.log(newFile))
.catch(error => console.log(error))
return true
})
)
}
The code seems to be working as intended, BUT I can't figure out how to exclude the filenames which are already optimized. Their names end with an '-optimized' suffix.
src/images/foo.jpg should be processed
src/images/bar-optimized.jpg should be ignored
I've tried the pattern src/images/**/*!(optimized).{png,jpg,jpeg}, but it does not work. I've also tried {ignore: 'src/images/**/*!(optimized)'}, but that does not work either.
Any help would be greatly appreciated.
It turns out that this works as intended:
const matches = glob.sync(`src/images/**/*.{png,jpg,jpeg}`, {
ignore: ['src/images/**/*-optimized.*']
})
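An equivalent way to check the exclusion logic without glob at all is to filter the matched paths with a regex on the suffix (a minimal sketch, with hypothetical file names):

```javascript
// Drop anything whose name ends in "-optimized" before the extension,
// mirroring what the ignore pattern above does.
const files = ['src/images/foo.jpg', 'src/images/bar-optimized.jpg'];
const toProcess = files.filter(f => !/-optimized\.(png|jpe?g)$/i.test(f));
console.log(toProcess); // keeps only 'src/images/foo.jpg'
```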
Important clues were found in answers to this question.
Ran across this answer when I had a sync glob issue and realized you could take it a step further with D.R.Y. and build a re-usable glob helper, something like:
import path from 'path'
import glob from 'glob'
export const globFiles = (dirPath, match, exclusion = []) => {
const arr = exclusion.map(e => path.join(dirPath, e))
return glob.sync(path.join(dirPath, match), {
ignore: arr,
})
}
From the code above it would be called as:
const fileArray = globFiles('src/images/**/','*.{png,jpg,jpeg}', ['*-optimized.*'])
Related
I'm using nodejs and I'm processing PDFs. One thing I'd like to do is to outline all the fonts of the PDF (so that they are not selectable with the mouse cursor afterwards).
I tried the pdftk's flatten command (using a node wrapper), but I did not get what I wanted.
One possible route is using Inkscape (command line), but I'm not even sure how to do it. I'm really looking for the easiest way to do this using nodejs.
There might also be a route using ghostscript: https://stackoverflow.com/a/28798374/11348232. One notable thing is that I don't use files on disk, but Buffer objects, so it'd be painful to save the PDF locally and then use the gs command.
Thanks a lot.
I finally followed @KenS's suggestion:
import util from 'util';
import childProcess from 'child_process';
import fs from 'fs';
import os from 'os';
import path from 'path';
import { v4 as uuidv4 } from 'uuid';
const exec = util.promisify(childProcess.exec);
const unlinkCallback = (err) => {
if (err) {
console.error(err);
}
};
const deleteFile = (path: fs.PathLike) => {
if (fs.existsSync(path)) {
fs.unlink(path, unlinkCallback);
}
};
const createTempPathPDF = () => path.join(os.tmpdir(), `${uuidv4()}.pdf`);
const convertFontsToOutlines = async (buffer: Buffer): Promise<Buffer> => {
const inputPath = createTempPathPDF();
const outputPath = createTempPathPDF();
let bufferWithOutlines: Buffer;
fs.writeFileSync(inputPath, buffer); // write synchronously so the file is fully on disk before gs runs
try {
// ! ghostscript package MUST be installed on system
await exec(`gs -o ${outputPath} -dNoOutputFonts -sDEVICE=pdfwrite ${inputPath}`);
bufferWithOutlines = fs.readFileSync(outputPath);
} catch (e) {
console.error(e);
bufferWithOutlines = buffer;
}
deleteFile(inputPath);
deleteFile(outputPath);
return bufferWithOutlines;
};
I want to calculate the total archive file size before archiving to show a progress bar.
I have some folders which are excluded from zipping; they are defined with a glob pattern.
How can you get a folder size with a glob filter?
It appears you can't use regular expressions; you can use https://www.npmjs.com/package/glob, loop through the files that match your glob pattern, and get the size of each. Something roughly like this (I haven't tested this code):
const fs = require('fs')
const glob = require('glob')
let totalSize = 0 // bytes
// options is optional
glob("**/*.js", options, function (er, files) {
files.forEach(f => {
totalSize += fs.statSync(f).size // statSync returns a Stats object; we want its size in bytes
})
})
With the help of the above answer, this is the solution:
const glob = require('glob');
const fs = require('fs');
function getFolderSizeByGlob(folder, { ignorePattern }) {
const filePaths = glob.sync('**', { // "**" means search the whole folder
cwd: folder, // folder path
ignore: ignorePattern, // array of glob pattern strings
absolute: true, // make glob return absolute paths, not just file names
});
let totalSize = 0;
filePaths.forEach((file) => {
console.log('file', file);
const stat = fs.statSync(file);
totalSize += stat.size;
});
return totalSize;
}
I'm not a very experienced developer, but I am looking to structure my project so it is easier to work on.
Let's say I have a function like this:
const x = async (tx, hobby) => {
const result = await tx.run(
"MATCH (a:Person) - [r] -> (b:$hobby) " +
"RETURN properties(a)",
{ hobby }
)
return result
}
Can I put my Cypher query scripts in separate files and reference them? I have seen a similar pattern for SQL scripts.
This is what I'm thinking:
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')
const x = async (tx, hobby) => {
const result = await tx.run(
CYPHER_SCRIPT,
{ hobby }
)
return result
}
..or will i need to stringify the contents of the .cyp file?
Thanks
You can use the @cybersam/require-cypher package (which I just created).
For example, if folder/myCypherScript.cyp contains this:
MATCH (a:Person)-->(:$hobby)
RETURN PROPERTIES(a)
then after the package is installed (npm i @cybersam/require-cypher), this code will output the contents of that file:
// Just require the package. You don't usually need to use the returned module directly.
// Handlers for files with extensions .cyp, .cql, and .cypher will be registered.
require('@cybersam/require-cypher');
// Now require() will return the string content of Cypher files
const CYPHER_SCRIPT = require('./folder/myCypherScript.cyp')
console.log(CYPHER_SCRIPT);
I'm trying to write a method to find all the files in a folder, including subfolders. It's pretty simple to write using fs.readdirSync, but I'm trying to write a version which doesn't block. (i.e. uses fs.readdir).
I've got a version which works, but it's not pretty. Can someone who has a bit more experience with node see if there is a nicer way to write this? I can see a few other places in my codebase where I can apply this pattern so it would be nice to have a cleaner version!
private static findFilesFromFolder(folder: string): Promise<string[]> {
let lstat = util.promisify(fs.lstat)
let readdir = util.promisify(fs.readdir)
// Read the initial folder
let files = readdir(folder)
// Join the folder name to the file name to make it absolute
.then(files => files.map(file => path.join(folder, file)))
// Get the stats for each file (also as a promise)
.then(files =>
Promise.all(files.map(file =>
lstat(file).then(stats => { return { file: file, stats: stats } })
))
)
// If the file is a folder, recurse. Otherwise just return the file itself.
.then(info =>
Promise.all(info.map(info => {
if (info.stats.isDirectory()) {
return this.findFilesFromFolder(info.file)
} else {
return Promise.resolve([info.file])
}
}
)))
// Do some munging of the types - convert Promise<string[][]> to Promise<string[]>
.then(nested => Array.prototype.concat.apply([], nested) as string[])
return files
}
I'd do a few things to make this cleaner:
move the recursion base case from within the loop to the top level
use async/await syntax
use const instead of let
Also, move the promisify calls up to where you import fs:
const lstat = util.promisify(fs.lstat)
const readdir = util.promisify(fs.readdir)
…
private static async findFilesFromPath(folder: string): Promise<string[]> {
const stats = await lstat(folder);
// If the file is a folder, recurse. Otherwise just return the file itself.
if (stats.isDirectory()) {
// Read the initial folder
const files = await readdir(folder);
// Join the folder name to the file name to make it absolute
const paths = files.map(file => path.join(folder, file))
const nested = await Promise.all(paths.map(p => this.findFilesFromPath(p)))
// Do some munging of the types - convert string[][] to string[]
return Array.prototype.concat.apply([], nested) as string[];
} else {
return [folder];
}
}
Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on windows)? I wasn't able to find any 'native' pdf packages in nodejs. They always are a wrapper/util on top of an existing OS command.
Thanks
Have you checked PDF2Json? It is built on top of PDF.js. It does not provide the text output as a single line, but you can reconstruct the final text from the generated JSON output:
'Texts': an array of text blocks with position, actual text and styling information:
'x' and 'y': relative coordinates for positioning
'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color" value.
'A': text alignment, including:
left
center
right
'R': an array of text run, each text run object has two main fields:
'T': actual text
'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
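Reconstructing a line from that JSON shape might look like the sketch below. The page object is hand-made to match the field descriptions above, not real PDF2Json output; note that PDF2Json URI-encodes the 'T' values, so each run goes through decodeURIComponent:

```javascript
// Hypothetical PDF2Json-style page: two text blocks, one run each.
const page = {
  Texts: [
    { x: 1.2, y: 3.4, R: [{ T: 'Hello%20' }] },
    { x: 5.6, y: 3.4, R: [{ T: 'world' }] },
  ],
};

const text = page.Texts
  .map(block => block.R.map(run => decodeURIComponent(run.T)).join(''))
  .join('');
console.log(text); // 'Hello world'
```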
After some work, I finally got a reliable function for reading text from PDF using https://github.com/mozilla/pdfjs-dist
To get this to work, first npm install on the command line:
npm i pdfjs-dist
Then create a file with this code (I named the file "pdfExport.js" in this example):
const pdfjsLib = require("pdfjs-dist");
async function GetTextFromPDF(path) {
let doc = await pdfjsLib.getDocument(path).promise;
let page1 = await doc.getPage(1);
let content = await page1.getTextContent();
let strings = content.items.map(function(item) {
return item.str;
});
return strings;
}
module.exports = { GetTextFromPDF }
Then it can simply be used in any other js file you have like so:
const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
Thought I'd chime in here for anyone who comes across this question in the future.
I had this problem and spent hours going through literally all the PDF libraries on NPM. My requirement was that it had to run on AWS Lambda, so it could not depend on OS dependencies.
The code below is adapted from another Stack Overflow answer (which I cannot currently find). The only difference is that it imports the ES5 build, which works with Node >= 12. If you just import pdfjs-dist, you will get the error "Readable Stream is not defined". Hope it helps!
import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';
export default class Pdf {
public static async getPageText(pdf: any, pageNo: number) {
const page = await pdf.getPage(pageNo);
const tokenizedText = await page.getTextContent();
const pageText = tokenizedText.items.map((token: any) => token.str).join('');
return pageText;
}
public static async getPDFText(source: any): Promise<string> {
const pdf = await pdfjslib.getDocument(source).promise;
const maxPages = pdf.numPages;
const pageTextPromises = [];
for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
}
const pageTexts = await Promise.all(pageTextPromises);
return pageTexts.join(' ');
}
}
Usage
const fileBuffer = fs.readFileSync('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);
This solution worked for me on Node 14.20.1, using "pdf-parse": "^1.1.1".
You can install it with:
yarn add pdf-parse
This is the main function which converts the PDF file to text.
const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');
const extractText = async (pathStr) => {
assert (fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
const pdfFile = path.resolve(pathStr)
const dataBuffer = fs.readFileSync(pdfFile);
const data = await pdf(dataBuffer)
return data.text
}
module.exports = {
extractText
}
Then you can use the function like this:
const { extractText } = require('../api/lighthouse/lib/pdfExtraction')
extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
Instead of using the proposed PDF2Json you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). This has the advantage that you are not depending on modesty (the owner of PDF2Json) to keep it updated against the PDF.js base.