How do I replace a string in a PDF file using NodeJS? - node.js

I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.
Thanks!
Edit: Just found HummusJS, I'll see if I can make progress and post it here.

I found this question by searching, so I think it deserves an answer. I found one by BrighTide here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347
Basically, there is a very powerful package called Hummus, which uses a cross-platform library written in C++. I think the answer given in that GitHub comment can be wrapped in a function like this:
var hummus = require('hummus');
/**
* Returns the bytes of a string as an array of numbers
*
* @param {string} str - input string
*/
function strToByteArray(str) {
var myBuffer = [];
var buffer = Buffer.from(str);
for (var i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
return myBuffer;
}
function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {
var writer = hummus.createWriterToModify(sourceFile, {
modifiedFilePath: targetFile
});
var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
var pageObject = sourceParser.parsePage(pageNumber);
var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
//read the original block of text data
var data = [];
var readStream = sourceParser.startReadingFromStream(textStream);
while(readStream.notEnded()){
Array.prototype.push.apply(data, readStream.read(10000));
}
var string = Buffer.from(data).toString().replace(findText, replaceText);
//Create and write our new text object
var objectsContext = writer.getObjectsContext();
objectsContext.startModifiedIndirectObject(textObjectId);
var stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(string));
objectsContext.endPDFStream(stream);
objectsContext.endIndirectObject();
writer.end();
}
// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');
UPDATE:
The version used at the time of writing this example was 1.0.83; things may have changed since.
UPDATE 2:
Recently I ran into an issue with another PDF file that used a different font. For some reason the text was split into small chunks, i.e. the string QWERTYUIOPASDFGHJKLZXCVBNM1234567890- was represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-)
I had no better idea than to build a regex. So instead of the single line that converts the data to a string and calls replace(findText, replaceText), I now have something like this:
var string = Buffer.from(data).toString();
var characters = REPLACE_ME;
var match = [];
for (var a = 0; a < characters.length; a++) {
match.push('(-?[0-9]+)?(\\()?' + characters[a] + '(\\))?');
}
string = string.replace(new RegExp(match.join('')), function(m, m1) {
// m1 is the number captured before the first character (undefined if there is none)
return (m1 || '') + '( ' + REPLACE_WITH_THIS + ')';
});
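The chunks in the example above look like entries of a PDF TJ array, where the numbers between the parenthesised strings adjust spacing. Here is a minimal sketch of the same regex idea, assuming the text is stored as plain literal strings (not hex strings or a custom font encoding). The helper names escapeRegExp, escapePdfString, buildChunkedPattern and replaceChunkedText are hypothetical and are not part of HummusJS.

```javascript
// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Escape the characters that are special inside a PDF literal string.
function escapePdfString(text) {
  return text.replace(/[\\()]/g, "\\$&");
}

// Build a pattern that matches `marker` even when it is split into chunks
// such as (Q)9(WER)24(T): between two characters an optional
// ")number(" boundary is allowed.
function buildChunkedPattern(marker) {
  const boundary = "(?:\\)\\s*-?[0-9.]+\\s*\\()?";
  const parts = Array.from(marker).map((ch, i) =>
    (i === 0 ? "" : boundary) + escapeRegExp(ch)
  );
  return new RegExp(parts.join(""));
}

// Replace the first occurrence of `marker` in a decoded content stream.
// The surrounding parentheses and leading spacing number are left untouched.
function replaceChunkedText(content, marker, replacement) {
  const safe = escapePdfString(replacement);
  // A function is used so "$" in the replacement is not interpreted.
  return content.replace(buildChunkedPattern(marker), () => safe);
}
```

It would be called on the decoded page content string, in place of the plain replace call. Because the match starts and ends inside the parentheses, the replacement ends up as a single chunk and the spacing numbers inside the marker are dropped.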

Building on Alex's (and others') solution, I noticed an issue where some non-text data was becoming corrupted. I tracked this down to encoding/decoding the PDF text as UTF-8 instead of as a binary string. Anyway, here's a modified solution that:
Avoids corrupting non-text data
Uses streams instead of files
Allows multiple patterns/replacements
Uses the MuhammaraJS package which is a maintained fork of HummusJS (should be able to swap in HummusJS just fine as well)
Is written in TypeScript (feel free to remove the types for JS)
import muhammara from "muhammara";
interface Pattern {
searchValue: RegExp | string;
replaceValue: string;
}
/**
* Modify a PDF by replacing text in it
*/
const modifyPdf = ({
sourceStream,
targetStream,
patterns,
}: {
sourceStream: muhammara.ReadStream;
targetStream: muhammara.WriteStream;
patterns: Pattern[];
}): void => {
const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
const numPages = modPdfWriter
.createPDFCopyingContextForModifiedFile()
.getSourceDocumentParser()
.getPagesCount();
for (let page = 0; page < numPages; page++) {
const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
const objectsContext = modPdfWriter.getObjectsContext();
const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
const textStream = copyingContext
.getSourceDocumentParser()
.queryDictionaryObject(pageObject.getDictionary(), "Contents");
const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();
let data: number[] = [];
const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
while (readStream.notEnded()) {
const readData = readStream.read(10000);
data = data.concat(readData);
}
const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1
let modifiedPdfPageAsString = pdfPageAsString;
for (const pattern of patterns) {
// Note: replaceAll throws a TypeError if searchValue is a RegExp without the global (g) flag
modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
}
// Create what will become our new text object
objectsContext.startModifiedIndirectObject(textObjectID);
const stream = objectsContext.startUnfilteredPDFStream();
stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
objectsContext.endPDFStream(stream);
objectsContext.endIndirectObject();
}
modPdfWriter.end();
};
/**
* Create a byte array from a string, as muhammara expects
*/
const strToByteArray = (str: string): number[] => {
const myBuffer = [];
const buffer = Buffer.from(str, "binary"); // key change 2
for (let i = 0; i < buffer.length; i++) {
myBuffer.push(buffer[i]);
}
return myBuffer;
};
And then to use it:
/**
* Fill a PDF with template data
*/
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
const targetStream = new muhammara.PDFWStreamForBuffer();
modifyPdf({
sourceStream,
targetStream,
patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
});
return targetStream.buffer;
};
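To illustrate why the two "key change" lines matter, here is a small sketch that uses only Node's built-in Buffer. The byte values are made up for the demonstration; they stand in for non-text data that can sit in a PDF stream.

```javascript
// Arbitrary bytes: "%PDF" followed by three bytes that are not valid UTF-8.
const original = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xe9, 0xff, 0x80]);

// Round trip through a UTF-8 string: each invalid byte is turned into the
// replacement character U+FFFD, so the original bytes are lost.
const viaUtf8 = Buffer.from(original.toString("utf8"), "utf8");

// Round trip through a "binary" (latin1) string: one character per byte,
// so every byte value survives unchanged.
const viaBinary = Buffer.from(original.toString("binary"), "binary");

console.log(original.equals(viaUtf8));   // false
console.log(original.equals(viaBinary)); // true
```

The trade-off is that search and replacement strings are then also treated as one byte per character, so this approach is only a fit for text that the PDF itself stores in a single-byte encoding.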

Another option is asposepdfcloud, the Aspose.PDF Cloud SDK for Node.js, which you can use to replace text in a PDF document. Its free plan offers 150 credits monthly. Here is sample code to replace text in a PDF document; don't forget to install asposepdfcloud first.
const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest }= require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace }= require("asposepdfcloud/src/models/textReplace");
// Get App key and App SID from https://aspose.cloud
const pdfApi = new PdfApi("xxxxx-xxxxx-xxxx-xxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxb");
var fs = require('fs');
const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
//const localTestDataFolder = "C:\\Temp";
//const path = remoteTempFolder + "\\" + name;
//var data = fs.readFileSync(localTestDataFolder + "\\" + name);
const textReplace= new TextReplace();
textReplace.oldValue= "origami";
textReplace.newValue= "aspose";
textReplace.regex= false;
const textReplace1= new TextReplace();
textReplace1.oldValue= "candy";
textReplace1.newValue= "biscuit";
textReplace1.regex= false;
const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace,textReplace1];
// Upload File
//pdfApi.uploadFile(path, data).then((result) => {
// console.log("Uploaded File");
// }).catch(function(err) {
// Deal with an error
// console.log(err);
//});
// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {
console.log(result.body.code);
}).catch(function(err) {
// Deal with an error
console.log(err);
});
P.S.: I'm a developer evangelist at Aspose.

Related

How to get the most recent image in a folder?

How can I get the most recent image that was added to a folder and save its file path to a variable? I have searched here (Stack Overflow) but don't see a post that specifically answers my question. This is what I have so far: it lists all the files, but I am unsure how to sort them by most recently modified. This is a unique and specific question. I don't mind whether this code is used or not, as long as the result is code that can get the most recent file in a folder.
Code:
(async () => {
var lastdownloadedimage = "";
var pathtocheck = "C:/Users/user1/Downloads";
var pathtocheckimage = "C:/Users/user1/Downloads/ot.png";
const testFolder = pathtocheck;
const fs = require('fs');
var lastdownloadedimage;
var filescount = 0;
var filename = [];
var filedates = [];
var filessortedbytimefromcurrentdateaccending = [];
var files;
//create a tuple for the file date and name
var filedata = [];
//fs.readdirSync(testFolder).forEach( filescount++);
files = fs.readdirSync(testFolder);
filescount = files.length;
console.log(files[0]);
filedates = fs.statSync(pathtocheckimage).mtime.getTime();
filename = fs.readdirSync(testFolder);
console.log(filescount);
for(var currentfiletocheck = 0; currentfiletocheck < filescount ; currentfiletocheck++){
//get current date
//find dile that is closest to current date
//use the index of that file data to find the file name
//save the files name to a variable
//filename[currentfiletocheck] = fs.readdirSync(testFolder)[currentfiletocheck];
//filedates[currentfiletocheck] = fs.stats.mtime.getTime()[currentfiletocheck];
//filedata[currentfiletocheck][0] = filename[currentfiletocheck];
//filedata[currentfiletocheck][1] = filedate[currentfiletocheck];
//console.log(files[currentfiletocheck]);
}
filessortedbytimefromcurrentdateaccending
filedata.sort(function(a, b) {
return a < b ? -1 : (a > b ? 1 : 0);
});
for (var i = 0; i < filedata.length; i++) {
var filenamessortedbytimefromcurrentdateaccending = filedata[i][0];
var filedatesortedbytimefromcurrentdateaccending = filedata[i][1];
lastdownloadedimage = filedatesortedbytimefromcurrentdateaccending;
// do something with key and value
}
/*
*/
console.log(lastdownloadedimage);
})();
I have taken a different approach: rather than gathering the timestamps and sorting them, I iterate over the files and compare each file's timestamp with the latest one seen so far, keeping the timestamp and file path of the later one each time.
Note that this will also check directories, so you will have to implement a filter if you want to ignore them.
let fs = require('fs')
let dirToCheck = '.'
let files = fs.readdirSync(dirToCheck)
let latestPath = `${dirToCheck}/${files[0]}`
let latestTimeStamp = fs.statSync(latestPath).mtime.getTime()
files.forEach(file => {
let path = `${dirToCheck}/${file}`
let timeStamp = fs.statSync(path).mtime.getTime()
if (timeStamp > latestTimeStamp) {
latestTimeStamp = timeStamp
latestPath = path
}
});
console.log(latestPath)
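As for the directory filter mentioned above, here is one possible sketch. The function name getLatestFile is hypothetical; it relies on the withFileTypes option of fs.readdirSync so that directories can be skipped without an extra stat call.

```javascript
const fs = require("fs");
const path = require("path");

// Return the path of the most recently modified regular file in `dir`,
// or null when the directory contains no regular files.
function getLatestFile(dir) {
  let latestPath = null;
  let latestTime = -Infinity;
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (!entry.isFile()) continue; // skip directories and other non-files
    const fullPath = path.join(dir, entry.name);
    const modified = fs.statSync(fullPath).mtimeMs;
    if (modified > latestTime) {
      latestTime = modified;
      latestPath = fullPath;
    }
  }
  return latestPath;
}
```

Calling getLatestFile("C:/Users/user1/Downloads") would then give the path to store in a variable. To restrict the search to images, an extra check on path.extname(entry.name) could be added next to the isFile() test.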

Make Initialization Asynchronous in node.js

I am trying to initialize a key class in a node.js program, but the instructions are running in arbitrary order and therefore it is initializing wrong. I've tried both making initialization happen in the definition and in a separate function; neither works. Is there something that I'm missing?
Current code:
class BotState {
constructor() {
this.bios = {}
this.aliases = {};
this.stories = {};
this.nextchar = 0;
}
}
var ProgramState = new BotState();
BotState.prototype.Initialize = function() {
this.bios = {};
var aliases = {};
var nextchar = 0;
this.nextchar = 0;
fs.readdir(biosdir, function (err, files) {
if (err) throw err;
for (var file in files) {
fs.readFile(biosdir + file + ".json", {flag: 'r'}, (err, data) => {
if (err) throw err;
var bio = JSON.parse(data);
var index = bio["charid"];
this.bios[index] = bio;
for (var alias in bio["aliaslist"]) {
this.aliases[bio["aliaslist"][alias].toLowerCase()] = index;
}
if (index >= nextchar) {
nextchar = index + 1;
}
})
}
this.stories = {};
this.nextchar = Math.max(Object.keys(aliases).map(key => aliases[key]))+1;
});
}
ProgramState.Initialize();
Is there some general way to make node.js just... run commands in the order they're written, as opposed to some arbitrary one?
(Apologies if the code is sloppy; I was more concerned with making it do the right thing than making it look nice.)
You are running an asynchronous operation inside a loop. The loop keeps running without waiting, and the asynchronous operations finish in an unpredictable order, so you end up processing their results in an unpredictable order. The simplest way to control your loop is to switch to the promise-based version of the fs library and then use async/await so that the for loop pauses and waits for each asynchronous operation to complete. You can do that like this:
const fsp = require('fs').promises;
class BotState {
constructor() {
this.bios = {}
this.aliases = {};
this.stories = {};
this.nextchar = 0;
}
}
var ProgramState = new BotState();
BotState.prototype.Initialize = async function() {
this.bios = {};
this.nextchar = 0;
let aliases = {};
let nextchar = 0;
const files = await fsp.readdir(biosdir);
for (const file of files) {
const data = await fsp.readFile(biosdir + file + ".json", {flag: 'r'});
const bio = JSON.parse(data);
const index = bio.charid;
const list = bio.aliaslist;
this.bios[index] = bio;
for (const alias of list) {
this.aliases[alias.toLowerCase()] = index;
}
if (index >= nextchar) {
nextchar = index + 1;
}
}
this.stories = {};
// there is something wrong with this line of code because you NEVER
// put any data in the variable aliases
this.nextchar = Math.max(Object.keys(aliases).map(key => aliases[key]))+1;
}
ProgramState.Initialize();
Note, there's a problem with your usage of the aliases local variable because you never put anything in that data structure, yet you're trying to use it in the last line of the function. I don't know what you're trying to accomplish there so you will have to fix that.
Also, note that you should never use for/in to iterate an array. That iterates properties of an object which can include more than just the array elements. for/of is made precisely for iterating an iterable like an array and it also saves the array dereference too as it gets you each value, not each index.
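Here is a self-contained sketch of the same sequential pattern that returns the loaded state instead of mutating a class instance, so the caller can await it. The function name loadBios is hypothetical, and the sketch assumes that each file in the directory is a .json file containing a numeric charid and an aliaslist array, which is what the question's code suggests.

```javascript
const fsp = require("fs").promises;
const path = require("path");

// Read every .json file in `dir` one at a time, in sorted name order.
async function loadBios(dir) {
  const state = { bios: {}, aliases: {}, nextchar: 0 };
  const names = (await fsp.readdir(dir))
    .filter((name) => name.endsWith(".json"))
    .sort();
  for (const name of names) {
    // await pauses the loop until this file has been read
    const text = await fsp.readFile(path.join(dir, name), "utf8");
    const bio = JSON.parse(text);
    state.bios[bio.charid] = bio;
    for (const alias of bio.aliaslist) {
      state.aliases[alias.toLowerCase()] = bio.charid;
    }
    // keep nextchar one above the highest charid seen so far
    state.nextchar = Math.max(state.nextchar, bio.charid + 1);
  }
  return state;
}
```

The caller has to wait for the returned promise before using the state, for example with loadBios(biosdir).then(state => { ... }) or with await inside another async function.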

nodejs event stream setting a variable per stream

I have code that creates a readable stream. I would like to set the name of the stream in the getStream method. I tried setting a property as shown below. I am able to access the property in onceFunction, but I am not able to access it in the map function. Let me know what I am doing wrong.
var onceFunction = function(str1,record) {
console.log("OnceFunction",this.nodeName);
}
var getStream = function(csvData) {
var dirNames = csvData.split("/");
var nodeName = dirNames[dirNames.length-2];
var fileName = csvData;
stream = fs.createReadStream(csvData);
stream.nodeName = dirNames[dirNames.length-2];
return stream;
};
var myFileList = ["D:\mypath\file"];
for ( var i = 0; i< myFileList.length; i++ ) {
getStream(myFileList[i])
.once('data',onceFunction)
.pipe(es.split())
.on('end',endFunction)
.pipe(es.map(function(data,cb) {
console.log(this.nodeName);
}));
}
Because "es" has its own "this" and passes it to the es.map callback, where, of course, nodeName is not set. Refactor your code to use closures and avoid using "this".
For example, in pseudocode:
// Define the function before the loop that calls it
var processFile = function(file) {
    var stream = getStream(file);
    var nodeName = stream.nodeName; // captured by the closure below
    stream.once('data', onceFunction)
        .pipe(es.split())
        .on('end', endFunction)
        .pipe(es.map(function(data, cb) {
            console.log(nodeName);
            console.log(stream.nodeName);
            cb(null, data); // pass the data on so the stream keeps flowing
        }));
};
for (var i = 0; i < myFileList.length; i++) {
    processFile(myFileList[i]);
}
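The same closure idea can be tried out with Node's built-in stream module only, without event-stream. This is a sketch under that substitution: tagWith and processLines are hypothetical names, and a Transform stands in for es.map.

```javascript
const { Readable, Transform } = require("stream");

// Build a transform that prefixes each chunk with a name held in a closure.
function tagWith(nodeName) {
  return new Transform({
    objectMode: true,
    transform(chunk, _encoding, callback) {
      // `this` is the Transform here, not the source stream, so a property
      // set on the source stream is not visible through `this`.
      // `nodeName` is taken from the enclosing function instead.
      callback(null, nodeName + ": " + chunk);
    },
  });
}

// Pipe some lines through the tagging transform and report each result.
function processLines(nodeName, lines, onLine) {
  return Readable.from(lines).pipe(tagWith(nodeName)).on("data", onLine);
}

processLines("nodeA", ["first", "second"], (line) => console.log(line));
```

Each call to tagWith gets its own nodeName, so several files can be processed in a loop without the names interfering with each other.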

Serialization-deserialization with Apache Thrift in nodejs

I am working on a Node.js application and I need to serialize and deserialize instances of the structs defined in a .thrift file, like the following:
struct Notification {
1: string subject,
2: string message
}
Now this is easily doable in Java, according to the tutorial at http://www.gettingcirrius.com/2011/03/rabbitmq-with-thrift-serialization.html :
Notification notification = new Notification();
TDeserializer deserializer = new TDeserializer();
deserializer.deserialize(notification, serializedNotification);
System.out.println("Received "+ notification.toString());
But I can't find how this is done using the nodejs library of Thrift. Can anyone help, please?
Ok, after wasting a lot of time on research and trying different solutions, I finally came to the answer to my own question:
//SERIALIZATION:
var buffer = new Buffer(notification);
var transport = new thrift.TFramedTransport(buffer);
var binaryProt = new thrift.TBinaryProtocol(transport);
notification.write(binaryProt);
where notification is the object I wish to serialize. At this point, the byte array can be found in the transport.outBuffers field:
var byteArray = transport.outBuffers;
For deserialization:
var tTransport = new thrift.TFramedTransport(byteArray);
var tProtocol = new thrift.TBinaryProtocol(tTransport);
var receivedNotif = new notification_type.Notification();
receivedNotif.read(tProtocol);
This assumes that the following lines have been added to the index.js file of the Node.js library for Thrift:
exports.TFramedTransport = require('./transport').TFramedTransport;
exports.TBufferedTransport = require('./transport').TBufferedTransport;
exports.TBinaryProtocol = require('./protocol').TBinaryProtocol;
Here is my TypeScript version which runs in a browser. npm install buffer before use.
It should work on node if you remove import { Buffer }.
/*
Thrift serializer for browser and node.js
Author: Hirano Satoshi
Usage:
let byteArray = thriftSerialize(thriftObj);
let thriftObj2 = thriftDeserialize(byteArray, new ThriftClass())
let mayBeTrue = byteArrayCompare(byteArray, thriftSerialize(thriftObj2))
*/
import { TBufferedTransport, TFramedTransport, TJSONProtocol, TBinaryProtocol } from 'thrift';
import { Buffer } from 'buffer';
export function thriftSerialize(thriftObj: any): Buffer {
let transport = new TBufferedTransport(null);
let protocol = new TBinaryProtocol(transport);
thriftObj.write(protocol);
// copy array of array into byteArray
let source = transport.outBuffers;
var byteArrayLen = 0;
for (var i = 0, len = source.length; i < len; i++)
byteArrayLen += source[i].length;
let byteArray = Buffer.alloc(byteArrayLen);
for (var i = 0, len = source.length, pos = 0; i < len; i++) {
let chunk = source[i];
chunk.copy(byteArray, pos);
pos += chunk.length;
}
return byteArray;
}
export function thriftDeserialize(byteArray: Buffer, thriftObj: any): any {
let transport = new TBufferedTransport(byteArray);
let callback = (transport_with_data) => {
var proto = new TBinaryProtocol(transport_with_data);
// var proto = new TJSONProtocol(transport);
thriftObj.read(proto);
}
// var buf = new Buffer(byteArray);
TBufferedTransport.receiver(callback)(byteArray);
return thriftObj;
}
export function byteArrayCompare(array1, array2): boolean {
if (!array1 || !array2)
return false;
let val = array1.length === array2.length && array1.every((value, index) => value === array2[index])
return val;
}
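The chunk-copying loop in thriftSerialize can probably be shortened with Buffer.concat, which is part of Node's Buffer API. This sketch assumes that transport.outBuffers is an array of Buffer objects, as the use of chunk.copy in the code above implies; joinChunks and joinChunksManually are hypothetical names used only for the comparison.

```javascript
// Manual version: measure the total length, then copy chunk by chunk.
function joinChunksManually(chunks) {
  let total = 0;
  for (const chunk of chunks) total += chunk.length;
  const joined = Buffer.alloc(total);
  let pos = 0;
  for (const chunk of chunks) {
    chunk.copy(joined, pos);
    pos += chunk.length;
  }
  return joined;
}

// Shorter version: Buffer.concat performs the same allocation and copy.
function joinChunks(chunks) {
  return Buffer.concat(chunks);
}

const sampleChunks = [Buffer.from([1, 2]), Buffer.from([3]), Buffer.from([4, 5, 6])];
console.log(joinChunks(sampleChunks).equals(joinChunksManually(sampleChunks))); // true
```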
Somehow I did not find the byte array at:
transport.outBuffers
I needed to do the following:
var transport = new Thrift.TFramedTransport(null, function(bytes){
dataWrapper.out = bytes;
cb(dataWrapper)
})
var binaryProt = new Thrift.TCompactProtocol(transport);
notification.write(binaryProt);
transport.flush(); // important: without the flush, the transport callback will not be invoked

How can I create a new document out of a subset of another document's pages (in InDesign (CS6) using ExtendScript)?

I need to offer a feature which allows InDesign users to select a page range in an InDesign document and create a new document out of those pages. This sounds simple, but it isn't...
I have tried many different ways of doing this but they have all failed to some degree. Some methods put all pages in a single spread (which sometimes makes InDesign crash). The best I've been able to do (see code below) still has problems at the beginning and the end (see screenshots below):
The original document:
The new document:
The question: How can I create a new document out of a subset of another document's pages (in InDesign using ExtendScript) without having the problems shown in the screenshots?
note: The behavior of the script is quite different in CS5.5 and CS6. My question concerns CS6.
The second screenshot was obtained by applying the following code to the document shown in the first screenshot:
CODE
var firstPageName = { editContents: "117" }; // This page number is actually entered by the user in an integerEditbox
var lastPageName = { editContents: "136" }; // This page number is actually entered by the user in an integerEditbox
var sourceDocument = app.activeDocument;
var destDocument = app.documents.add();
destDocument.importStyles(ImportFormat.paragraphStylesFormat, new File(sourceDocument.filePath + "/" + sourceDocument.name), GlobalClashResolutionStrategy.LOAD_ALL_WITH_OVERWRITE);
destDocument.importStyles(ImportFormat.characterStylesFormat, new File(sourceDocument.filePath + "/" + sourceDocument.name), GlobalClashResolutionStrategy.LOAD_ALL_WITH_OVERWRITE);
destDocument.viewPreferences.horizontalMeasurementUnits = sourceDocument.viewPreferences.horizontalMeasurementUnits;
destDocument.viewPreferences.verticalMeasurementUnits = sourceDocument.viewPreferences.verticalMeasurementUnits;
destDocument.documentPreferences.facingPages = sourceDocument.documentPreferences.facingPages;
destDocument.documentPreferences.pageHeight = sourceDocument.documentPreferences.pageHeight;
destDocument.documentPreferences.pageWidth = sourceDocument.documentPreferences.pageWidth;
destDocument.documentPreferences.pageSize = sourceDocument.documentPreferences.pageSize;
var sourceSpreads = sourceDocument.spreads;
var nbSourceSpreads = sourceSpreads.length;
var firstPageFound = false;
var lastPageFound = false;
var i;
var newSpreadNeeded;
var currentDestSpread;
for (i = 0; !lastPageFound, i < nbSourceSpreads; ++i) {
newSpreadNeeded = true;
var sourcePages = sourceSpreads[i].pages;
var nbSourcePages = sourcePages.length;
var j;
for (j = 0; !lastPageFound, j < nbSourcePages; ++j) {
if (sourcePages[j].name === firstPageName.editContents) {
firstPageFound = true;
destDocument.documentPreferences.startPageNumber = parseInt(firstPageName.editContents); // We want to preserve page numbers
}
if (firstPageFound) {
// Copy this page over to the new document.
var firstInNewSpread = false;
if (newSpreadNeeded) {
currentDestSpread = destDocument.spreads.add();
newSpreadNeeded = false;
firstInNewSpread = true;
}
var newPage = sourcePages[j].duplicate(LocationOptions.AT_END, currentDestSpread);
var k;
for (k = 0; k < newPage.index; ++k) {
currentDestSpread.pages[k].remove();
}
}
if (sourcePages[j].name === lastPageName.editContents) {
lastPageFound = true;
}
}
}
destDocument.spreads[0].remove();
I was hacking around and came up with this little script. Although it approaches the problem from the opposite direction, it seems to work fine here. Also, I'm still running in InDesign CS5, but maybe it will work for you. Hopefully I got the gist of your question?
This will extract pages 3 through 5 into a separate document:
var doc = app.activeDocument;
var newFilePath = doc.filePath + "/subset_" + doc.name;
var newFile = File(newFilePath); // Create a new file path
doc.saveACopy(newFile); // Save a copy of the doc
var newDoc = app.open(newFile); // Open the copy
var firstPageNum = 3; // First page number in the range
var lastPageNum = 5; // Last page number in the range
var firstPage = newDoc.pages[firstPageNum-1];
var lastPage = newDoc.pages[lastPageNum-1];
// Remove all text from the last page in the range to the end of the document
var lastPageFrames = lastPage.textFrames.everyItem().getElements();
for (var i=0; i < lastPageFrames.length; i++) {
var frame = lastPageFrames[i];
var parentStory = frame.parentStory;
var lastFrameInsert = frame.insertionPoints.lastItem();
var lastStoryInsert = parentStory.insertionPoints.lastItem();
var textAfter = parentStory.insertionPoints.itemByRange(lastFrameInsert,lastStoryInsert);
textAfter.remove();
};
// Remove all text from the beginning of the document to the first page in the range
var firstPageFrames = firstPage.textFrames.everyItem().getElements();
for (var i=0; i < firstPageFrames.length; i++) {
var frame = firstPageFrames[i];
var parentStory = frame.parentStory;
var firstFrameInsert = frame.insertionPoints.firstItem();
var textBefore = parentStory.insertionPoints.itemByRange(0,firstFrameInsert.index);
textBefore.remove();
};
// Remove the pages that aren't in the range
var allPages = newDoc.pages.everyItem().getElements();
for (var i=0; i < allPages.length; i++) {
var page = allPages[i];
if (i < firstPageNum - 1 || i > lastPageNum - 1) { // i is 0-based, page numbers are 1-based
page.remove();
}
};
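The page numbers in the script are 1-based, while the pages collection is indexed from 0. The index arithmetic can be checked separately from InDesign with a plain function. This is only an illustration: indexesToRemove is a hypothetical helper, it does not call any InDesign API, and it is written with var so that it should also be acceptable to ExtendScript.

```javascript
// Return the 0-based indexes of the pages to delete so that only the
// 1-based range firstPageNum..lastPageNum remains. The indexes are listed
// from the end so that removing one page does not shift the others.
function indexesToRemove(pageCount, firstPageNum, lastPageNum) {
  var result = [];
  for (var i = pageCount - 1; i >= 0; i--) {
    if (i < firstPageNum - 1 || i > lastPageNum - 1) {
      result.push(i);
    }
  }
  return result;
}

// A 6-page document where pages 3 through 5 should be kept.
console.log(indexesToRemove(6, 3, 5)); // [ 5, 1, 0 ]
```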

Resources