PDF text search and split library


I am looking for a server-side PDF library (or command-line tool) which can:
split a multi-page PDF file into individual PDF files, based on
a search result of the PDF file content
Examples:
Search for a "Page ???" pattern in the text and split the big PDF into 001.pdf, 002.pdf, ... ???.pdf
A server program will scan the PDF, look for the search pattern, and save the page(s) which match the pattern as files on disk.
Integration with PHP / Ruby would be nice. A command-line tool is also acceptable. It will be a server-side (Linux or Win32) batch processing tool; a GUI or login is not supported. i18n support would be nice but is not required. Thanks~

My company, Atalasoft, has just released some PDF manipulation tools that run on .NET. There is a text extract class that you can use to find the text and determine how you will split your document and a very high level document class that makes the splitting trivial. Suppose you have a Stream to your source PDF and an increasingly ordered List that describes the starting page of each split, then the code to generate your split files looks like this:
public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int lastPage = mainDoc.Pages.Count - 1;
    for (int i = 0; i < pageStarts.Count; i++) {
        int startPage = pageStarts[i];
        // Each split ends just before the next start, or at the last page.
        int endPage = (i < pageStarts.Count - 1) ?
            pageStarts[i + 1] - 1 :
            lastPage;
        if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
        PdfDocument splitDoc = new PdfDocument();
        for (int j = startPage; j <= endPage; j++)
            splitDoc.Pages.Add(mainDoc.Pages[j]);
        string outputPath = Path.Combine(outputDirectory,
            string.Format("{0:D3}.pdf", i + 1));
        splitDoc.Save(outputPath);
    }
}
If you generalize this into a page range struct:
public struct PageRange {
public int StartPage;
public int EndPage;
}
where StartPage and EndPage inclusively describe a range of pages, then the code is simpler:
public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
    PdfDocument mainDoc = new PdfDocument(stm);
    int outputDocCount = 1;
    foreach (PageRange range in ranges) {
        int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
        int endPage = Math.Max(range.StartPage, range.EndPage);
        PdfDocument splitDoc = new PdfDocument();
        for (int i = startPage; i <= endPage; i++)
            splitDoc.Pages.Add(mainDoc.Pages[i]);
        string outputPath = Path.Combine(outputDirectory,
            string.Format("{0:D3}.pdf", outputDocCount));
        splitDoc.Save(outputPath);
        outputDocCount++;
    }
}
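The step that both versions above share, turning an ordered list of starting pages into inclusive page ranges, does not depend on any PDF library. As a rough sketch in Python (the function name starts_to_ranges is made up for illustration, and pages are assumed to be 0-based as in the C# code):

```python
def starts_to_ranges(page_starts, page_count):
    """Turn ascending 0-based start pages into inclusive (start, end) ranges."""
    ranges = []
    for i, start in enumerate(page_starts):
        # A range ends just before the next start, or at the last page.
        if i + 1 < len(page_starts):
            end = page_starts[i + 1] - 1
        else:
            end = page_count - 1
        if start > end:
            raise ValueError("page_starts is not ordered properly")
        ranges.append((start, end))
    return ranges


print(starts_to_ranges([0, 3, 7], 10))  # [(0, 2), (3, 6), (7, 9)]
```

Each resulting pair could then be handed to whatever library does the actual page copying.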

PDFBox is a Java library but it does have some command line tools as well:
http://pdfbox.apache.org/
PDFBox can extract text and also rebuild/split PDFs.

pdfminer + multi-line pattern matching in Python

You can use pdfsam to split your file into pages, then use pdftotext (from foolabs.com) to turn each page into text and use Ruby (or grep) to find the strings. Then you have the page ranges and can return the previously generated pages.
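Whichever tool extracts the text, deciding which pages go into which output file is plain pattern matching. Here is a minimal sketch in Python, assuming the text of each page has already been extracted into a list (one string per page) and that the "???" in the question stands for digits; the function name and the default pattern are illustrative, not part of any of the tools mentioned:

```python
import re


def group_pages_by_label(page_texts, pattern=r"Page\s+(\d+)"):
    """Map output file names such as '001.pdf' to the 0-based pages that match."""
    regex = re.compile(pattern)
    groups = {}
    for index, text in enumerate(page_texts):
        match = regex.search(text)
        if match is None:
            continue  # pages without the pattern are left out
        name = "{:03d}.pdf".format(int(match.group(1)))
        groups.setdefault(name, []).append(index)
    return groups


pages = ["Invoice Page 1 of 2", "Page 1 continued", "cover sheet", "Summary Page 2"]
print(group_pages_by_label(pages))  # {'001.pdf': [0, 1], '002.pdf': [3]}
```

The page indices in each group are what you would pass on to the splitting tool.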

Related

Export PDF file from Excel template with Qt and QAxObject

The project I am currently working on is to export an Excel file to PDF.
The Excel file is a "Template" that allows the generation of graphs. The goal is to fill some cells of the Excel file so that the graphs are generated and then to export the file in PDF.
I use Qt in C++ with the QAxObject class and all the data writing process works well but it's the PDF export part that doesn't.
The problem is that the generated PDF file also contains the data of the graphs while these data are not included in the print area of the Excel template.
The PDF export is done with the "ExportAsFixedFormat" function, whose fifth parameter, "IgnorePrintAreas", controls whether the print area is ignored. Even when I set this parameter to "false" (so that the print area is respected), the problem remains: the result is the same as when the parameter is set to "true".
I tried varying the other parameters, changing the type of the data passed as parameters, and passing no parameters at all, but the result is always the same.
Here is the link to the "documentation" of the export command "ExportAsFixedFormat":
https://learn.microsoft.com/en-us/office/vba/api/excel.workbook.exportasfixedformat
Here is a simplified version of the sequence of commands executed in the code:
Rapport::Rapport(QObject *parent) : QObject(parent)
{
//Create the template from excel file
QString pathTemplate = "/ReportTemplate_FR.xlsx";
QString pathReporter = "/Report";
this->path = QDir(QDir::currentPath() + pathReporter + pathTemplate);
QString pathAbsolute(this->path.absolutePath().replace("/", "\\\\"));
//Create the output pdf file path
fileName = QString("_" + QDateTime::currentDateTime().toString("yyyyMMdd-HHmmssff") + "_Report");
QString pathDocument = QStandardPaths::writableLocation(QStandardPaths::DocumentsLocation).append("/").replace("/", "\\\\");
QString exportName(pathDocument + fileName + ".pdf");
//Create the QAxObject that is linked to the Excel template
this->excel = new QAxObject("Excel.Application");
//Create the QAxObject "sheet" that can accept the measurement data
QAxObject* workbooks = this->excel->querySubObject("Workbooks");
QAxObject* workbook = workbooks->querySubObject("Add(const QString&)", pathAbsolute);
QAxObject* sheets = workbook->querySubObject("Worksheets");
QAxObject* sheet = sheets->querySubObject("Item(int)", 3);
//Get the measurement data as a list of the inner class Measurement
QList<Measurement*> actuMeasure = this->getSomeMeasure(); //no need to know how it works
//Create a 2-dimensional QVector so the data can be placed in the table where we want (specific index)
QVector<QVector<QVariant>> vCells(actuMeasure.size());
for(int i = 0; i < vCells.size(); i++)
    vCells[i].resize(6);
//Fill the 2-dimensional QVector with the measurement data
int row = 0;
foreach(Measurement* m, actuMeasure)
{
vCells[row][0] = QVariant(m->x);
vCells[row][1] = QVariant(m->y1);
vCells[row][2] = QVariant(m->y2);
vCells[row][3] = QVariant(m->y3);
vCells[row][4] = QVariant(m->y4);
vCells[row][5] = QVariant(m->y5);
row++;
}
//Transform the 2-dimensional QVector into a QVariant object
QVector<QVariant> vvars;
QVariant var;
for(int i = 0; i < actuMeasure.size(); i++)
vvars.append(QVariant(vCells[i].toList()));
var = QVariant(vvars.toList());
//Set the QVariant object that is the data measure on the excel file
sheet->querySubObject("Range(QString)", "M2:AB501")->setProperty("Value", var);
//Set the fileName on the page setup (not relevant for this example)
sheet->querySubObject("PageSetup")->setProperty("LeftFooter", QVariant(fileName));
//Export to PDF file with options – NOT WORKING !!!
workbook->dynamicCall("ExportAsFixedFormat(const QVariant&, const QVariant&, const QVariant&, const QVariant&, const QVariant&)", QVariant(0), QVariant(exportName), QVariant(0), QVariant(false), QVariant(false));
//Close
workbooks->dynamicCall("Close()");
this->excel->dynamicCall("Quit()");
}
At this point I really need help finding a way to solve this problem.
I also wonder if this is not a bug of the QAxObject class.

I finally found a solution on another forum.
If anyone needs help, I'll leave the link to the answer.

Read the content of a Word document via its XML

Context
I am trying to build a Word document browser in Excel to sift through a large number of documents (around 1000).
Opening a Word document proves to be rather slow (around 4 seconds per document, so in this case it takes 2 hours to look through all the items, which is far too slow for a single query), even after disabling everything that could slow down the opening. Hence I open the documents:
As read only
Without the open and repair mode (which can happen on some documents)
With the display of the document disabled
My attempt so far
These documents are tricky to look through because some keywords appear every single time but not in the same context (this is not the core of the problem here, since I can handle it once the text is loaded into arrays). Hence the often-used Windows Explorer solution (like in this link) cannot be used in my case.
For the moment, I have a working macro that analyzes the content of the Word documents by opening them.
Code
Here is a sample of the code.
Note that I used the Microsoft Word 14.0 Object Library reference
' Analyzing all the word document within the same folder '
Sub extractFile()
Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document
Set oWordApp = CreateObject("Word.Application")
sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"
With Application
.DisplayAlerts = False
.ScreenUpdating = False
End With
ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"
While (sAnalyzedDoc <> "")
' Analyzing documents only with the .doc and .docx extension '
If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
' Opening the document as mentionned above, in read only mode, without repair and invisible '
Set oDoc = Word.Documents.Open(sLibName & "\" & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
With oDoc
For i = 1 To .Sentences.Count
' Searching for the keyword within the document '
If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
If Not IsEmpty(aOut(1, 2)) Then
ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
End If
aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
GoTo closingDoc ' A dubious programming choice but that works for the moment '
End If
Next i
closingDoc:
' Intending to make the closing faster by not saving the document '
.Close SaveChanges:=False
End With
End If
'Moving on to the next document '
sAnalyzedDoc = Dir
Wend
exitSub:
With Output
.Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With
With Application
.DisplayAlerts = True
.ScreenUpdating = True
End With
End Sub
My question
My idea was to go through the XML content within the document to access its content directly (in newer versions of Word you can reach it by renaming any document with a .zip extension and opening nameOfDocument.zip\word\document.xml).
It would be a lot faster than loading all the images, charts and tables of the word document which are of no use in a text search.
Thus, I wanted to ask if there was a way in VBA to open a word document like a zip file and access that XML document to then process it like a normal string of characters in VBA, since I already have the path and the name of the file given the above code.

Do note that there is no easy answer to the above problem; the plain VBA code in my initial question will do the job perfectly well as long as you do not have a lot of documents to browse. Otherwise go for another tool (there is a Python Dynamic Link Library (DLL) that does that very well).
Ok, I'll try to make my answer as explanatory as possible.
First of all, this question led me on an endless journey through XML in C# and XPath, which I chose not to pursue at some point.
The solution below reduced the time needed to analyze the files from roughly 2 hours to 10 seconds.
Context
The backbone of reading XML documents, and therefore inner word XML documents, is the OpenXML library from Microsoft.
Keep in mind what I said above, that the method I was trying to implement cannot be done solely in VBA and thus must be done in another way.
This is probably because VBA is implemented for Office and is thus limited in accessing the core structure of Office documents, but I have no information about this limitation (any information is welcome).
The answer I will give here is writing a C# DLL for VBA.
For writing a DLL in C# and referencing it in VBA, I refer you to the following link, which summarizes this specific process better: Tutorial for creating DLL in C#
Let's start
First of all, you will need to reference the WindowsBase library and DocumentFormat.OpenXML in your project to make the solution work, as explained in the MSDN articles Manipulate Office Open XML Formats Documents and Open and add text to a word processing document (Open XML SDK).
These articles broadly explain how the OpenXML library works for manipulating Word documents.
The C# code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;
namespace BrowserClass
{
public class SpecificDirectory
{
public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
{
string sKeyWord = nameKeyword;
string sStopWord = nameStopword;
string sDirectory = nameDirectory;
sStopWord = sStopWord.ToLower();
sKeyWord = sKeyWord.ToLower();
string sDocPath = Path.GetDirectoryName(sDirectory);
// Looking for all the documents with the .docx extension
string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
string[] sDocumentList = new string[1];
string[] sDocumentText = new string[1];
// Cycling the documents retrieved in the folder
for (int i = 0; i < sDocName.Count(); i++)
{
string docWord = sDocName[i];
// Opening the documents as read only, no need to edit them
Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);
const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
PackagePart corePart = null;
Uri documentUri = null;
// We are extracting the part with the document content within the files
foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
{
documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
corePart = officePackage.GetPart(documentUri);
break;
}
// Here enter the proper code
if (corePart != null)
{
string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";
// Construction of a namespace manager to handle the different parts of the xml files
NameTable nt = new NameTable();
XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
nsmgr.AddNamespace("dc", dcPropertiesSchema);
nsmgr.AddNamespace("cp", cpPropertiesSchema);
nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);
// Loading the xml document's text
XmlDocument doc = new XmlDocument(nt);
doc.Load(corePart.GetStream());
// I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
string docInnerText = doc.DocumentElement.InnerText;
docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
docInnerText = docInnerText.Replace("Glossary.", "");
try
{
Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);
if (iPosStopWord == -1)
{
iPosStopWord = docInnerText.Length;
}
if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
{
// Redimensions the array if there was already a document loaded
if (sDocumentList[0] != null)
{
Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
}
sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
// Taking the small context around the keyword
sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, sKeyWord.Length + 60) + " (...)");
}
}
catch (ArgumentOutOfRangeException)
{
Console.WriteLine("Error reading inner text.");
}
}
// Closing the package to enable opening a document right after
officePackage.Close();
}
if (sDocumentList[0] != null)
{
// Preparing the array for output
string[,] sFinalArray = new string[sDocumentList.Length, 2];
for (int i = 0; i < sDocumentList.Length; i++)
{
sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
sFinalArray[i, 1] = sDocumentText[i];
}
return sFinalArray;
}
else
{
// Preparing the array for output
string[,] sFinalArray = new string[1, 1];
sFinalArray[0, 0] = "NO MATCH";
return sFinalArray;
}
}
}
}
The VBA code associated
Option Explicit
Const sLibname As String = "C:\pathToYourDocuments\"
Sub tester()
Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory
' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)
' Add here any necessary processing if needed '
End Sub
So in the end you get a tool that can scan .docx documents much faster than in a classic open-read-close approach in VBA at the cost of more code writing.
Above all you get a simple solution for your users that just want to perform simple search, especially when there is a huge number of word documents.
Note
Parsing Word XML files can be nightmarish in VBA, as pointed out by @Mikegrann.
Thankfully OpenXML has an XML parser (see "C#, xml parsing. get data between tags") that will do the work for you in C# and pick out the <w:t></w:t> tags that hold the text of the document. I found the following answers but could not make them work:
Parsing a MS Word generated XML file in C#, Reading specific XML elements from XML file
So I went for the .InnerText solution shown in my code above to access the internal text, at the cost of picking up some field-code text (like \\MERGEFORMAT).
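For comparison, the "open the .docx as a zip and read only the <w:t> text" idea from the question can be sketched with nothing but the Python standard library. This is not the C# DLL described above, only an illustration. It assumes the main part sits at the conventional word/document.xml path (strictly, it should be located through the package relationships, as the C# code does), and the function names are made up:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p and w:t elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the text of a .docx, one line per paragraph (w:p).

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # Join the text runs (w:t) that belong to this paragraph.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def contains_keyword(source, keyword):
    """Case-insensitive keyword test on the extracted text."""
    return keyword.lower() in docx_text(source).lower()
```

A call such as contains_keyword("report.docx", "example of a word") would then replace the open-read-close cycle for one file. Paragraphs nested inside other paragraphs (text boxes, for instance) may be reported twice by this simple traversal, and old binary .doc files are not zip archives, so they are out of scope here.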

StreamWriter trouble writing doubles to .txt file (C++)

I'm trying to write some double values to a text file the user creates via a SaveFileDialog, but every time I do a streamWriterVariable->Write(someDoubleVariable), I instead see some kind of weird ASCII character in the text file where the double should be (music note, |, copyright symbol, etc). I'm opening the file with Notepad, if that is of any significance. A basic outline of my code:
SaveFileDialog^ saveFileDialog1 = gcnew SaveFileDialog;
saveFileDialog1->Filter = "txt files (*.txt)|*.txt|All files (*.*)|*.*";
saveFileDialog1->Title = "Save File Here";
saveFileDialog1->RestoreDirectory = true;
if (saveFileDialog1->ShowDialog() == System::Windows::Forms::DialogResult::OK )
{
FileInfo ^fleTest = gcnew FileInfo(saveFileDialog1->FileName);
StreamWriter ^sWriter = fleTest->CreateText();
sWriter->AutoFlush = true;
double test = 5.635; //Some arbitrary double I made up for test purposes
sWriter->Write(test);
sWriter->Flush();
sWriter->Close();
}
Thanks for your help!

Have you tried to set the encoding explicitly?
StreamWriter^ sWriter = gcnew StreamWriter(saveFileDialog1->FileName, false, System::Text::Encoding::ASCII);

The code you've provided does exactly what you ask it to, that is, to write a double to the file in the internal computer format. What you most likely want is to write out the textual representation of the double.
In other words, you should try sWriter->Write(test.ToString()) or some variation of this to get the textual version of your double. This also applies to bool and most other variable types.

How to keep original rotate page in itextSharp (dll)

I would like to create a project that reads from Excel, writes onto a PDF, and prints that PDF.
One cell in the Excel file holds the directory where the original PDF is located on the computer or server, and the next cell holds the text to write at the top of the second PDF.
Here is the problem: the original PDF is horizontal (landscape, rotated). My program creates a copy of the original PDF and writes the info from Excel at the top of the copy. But a landscape PDF comes out rotated by 270 degrees, which is not OK. For portrait pages the program works: the copy is fine and the text at the top of the copy is fine.
Where is the problem in my code?
Code:
public int urediPDF(string inTekst)
{
    if (inTekst != "0")
    {
        string pisava_arialBD = @"..\debug\arial.ttf";
        string oldFile = null;
        string inText = null;
        string indeks = null;
        // split the input string
        string[] vhod = inTekst.Split('#');
        oldFile = vhod[0];
        inText = vhod[1];
        indeks = vhod[2];
        string newFile = @"c:\da\2";
        // open the PDF reader
        PdfReader reader = new PdfReader(oldFile);
        Rectangle size = reader.GetPageSizeWithRotation(reader.NumberOfPages);
        Document document = new Document(size);
        // open the PDF writer
        FileStream fs = new FileStream(newFile + "-" + indeks + ".pdf", FileMode.Create, FileAccess.Write);
        PdfWriter writer = PdfWriter.GetInstance(document, fs);
        //document.Open();
        document.OpenDocument();
        label2.Text = ("Status: " + reader.GetPageRotation(reader.NumberOfPages).ToString());
        // set up the PDF creation session
        PdfContentByte cb = writer.DirectContent;
        // choose the font
        BaseFont bf = BaseFont.CreateFont(pisava_arialBD, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
        cb.SetColorFill(BaseColor.RED);
        cb.SetFontAndSize(bf, 8);
        // write the text into the PDF
        cb.BeginText();
        string text = inText;
        // choose the coordinates for writing the text into the PDF (720 degrees rotation (landscape) and 90 degrees (portrait))
        if (reader.GetPageRotation(1) == 720) // landscape layout
        {
            cb.ShowTextAligned(1, text, 10, 450, 0);
            cb.EndText();
        }
        else // portrait layout
        {
            cb.ShowTextAligned(1, text + " - pokončen", 10, 750, 0); // "pokončen" = portrait
            cb.EndText();
        }
        // create the new page and add it to the pdf
        PdfImportedPage page = writer.GetImportedPage(reader, reader.NumberOfPages);
        cb.AddTemplate(page, 0, 0);
        // close the streams and voilà, the file should be changed :)
        document.Close();
        fs.Close();
        writer.Close();
        reader.Close();
    }
    else
    {
        label2.Text = "Status: Končano zapisovanje"; // "Writing finished"
        return 0;
    }
    return 0;
}

As explained many times before (ITextSharp include all pages from the input file, Itext pdf Merge : Document overflow outside pdf (Text truncated) page and not displaying, and so on), you should read chapter 6 of my book iText in Action (you can find the C# version of the examples here).
You are using a combination of Document, PdfWriter and PdfImportedPage to split a PDF. Please tell me who made you do it this way, so that I can curse the person who inspired you (because I've answered this question hundreds of times before, and I'm getting tired of repeating myself). These classes aren't a good choice for that job:
you lose all interactivity,
you need to rotate the content yourself if the page is in landscape,
you need to take the original page size into account,
...
Your problem is similar to this one itextsharp: unexpected elements on copied pages. Is there any reason why you didn't read the documentation? If you say: "I didn't have the time", please believe me if I say that I have almost 20 years of experience as a developer, and I've never seen "reading documentation" as a waste of time.
Long story short: read the documentation, replace PdfWriter with PdfCopy, replace AddTemplate() with AddPage().

The process cannot access the file 'd:\1.doc' because it is being used by another process

my code :
object c = "d:\\1.doc";
if(File.Exists(c.ToString()))
{
File.Delete(c.ToString());
}
error :
The process cannot access the file 'd:\1.doc' because it is being used
by another process.
How do I close it in code?

First of all, use string instead of object:
string c = "d:\\1.doc";
Now, as the message indicates, the file is being used by another process: either a Windows process has it open, or you opened a file stream and forgot to close it. Check where your code interacts with the file.
Edit: Since you are using Microsoft.Office.Interop.Word, make sure you close the open file(s) first, like this:
Word.ApplicationClass word = new Word.ApplicationClass();
//after using it:
if (word.Documents.Count > 0)
{
word.Documents.Close(...);
}
((Word._Application)word.Application).Quit(..);
word.Quit(..);

I had the same kind of issue when I wanted to delete a file after opening and reading it with Microsoft.Office.Interop.Word; I needed to close my document and the application like this:
private void parseFile(string filePath)
{
// Open a doc file.
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open(filePath);
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
// Write the word.
string text = document.Words[i].Text;
Console.WriteLine("Word {0} = {1}", i, text);
}
// Close document correctly
((_Document)document).Close();
((_Application)application).Quit();
}

You have that file actively open in this or another program, and Windows prevents you from deleting it in that case.

Check whether the file is still open in another application, for example:
1- Microsoft Word
2- WordPad