How to create HWPF document with apache poi - apache-poi

Please somebody help me with putting text into paragraphs. I have this code :
private void createDOCDocument(String from, File file) throws Exception {
POIFSFileSystem fs = new POIFSFileSystem(DOCGenerator.class.getClass().getResourceAsStream("/poi/template.doc"));
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
Paragraph par1 = range.insertAfter(new ParagraphProperties(), 0);
CharacterRun run1 = par1.insertAfter(from);
run1.setFontSize(11);
DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
CustomProperties cp = dsi.getCustomProperties();
if (cp == null)
cp = new CustomProperties();
cp.put("myProperty", "foo bar baz");
dsi.setCustomProperties(cp);
doc.write(new FileOutputStream(file));
}
But the problem is that if I put the "from" string directly into the range, it will be in the resulting document, but if I create a paragraph and put it in there instead, the document is empty. Even if I process it with apache tika and its WordExtractor, it gets nothing.
By the way, /poi/template.doc is an empty document.
If I do it like this :
Paragraph par1 = range.getParagraph(0);
CharacterRun run1 = par1.insertAfter(from);
and from is "whatever", then the document contains just "w" (the initial character) at the beginning... What the hell is this?

Try with a recent nightly build / svn checkout of POI. The HWPF codebase is currently being heavily worked on by Sergey, and bugs like the one you've described have recently been fixed.

Related

How to get Header / Footer parts from Excel Document

I'm trying to get the header / footer parts from an excel document so that I can do something with their contents, however I cannot seem to get anything from them.
I thought this would be pretty simple... Consider this code:
using (SpreadsheetDocument spreadsheet = SpreadsheetDocument.Open(filePath, true))
{
var headers = spreadsheet.GetPartsOfType<HeaderPart>().ToList();
foreach (var header in headers)
{
//do something
}
}
Even with a file that contains a header, headers is always empty. I've tried drilling down into the workbook -> worksheets -> etc. but I get nothing back. My test Excel file definitely has a header (headers are ghastly in Excel!).
Annoyingly, the Excel APIs in OpenXML seem to be worse, since in a docx you can get the header by calling:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, true))
{
MainDocumentPart documentPart = wordDoc.MainDocumentPart;
var headerParts = wordDoc.MainDocumentPart.HeaderParts.ToList();
foreach (var headerPart in headerParts)
{
//do something
}
}
I've seen some people on google saying that I should query the worksheet's descendants (code from this link):
HeaderFooter hf = ws.Descendants<HeaderFooter>().FirstOrDefault();
if (hf != null)
{
//here you can add your code
//I just try to append here for demo
hf = new HeaderFooter();
ws.AppendChild<HeaderFooter>(hf);
}
But I cannot see any way of querying the workbook/sheet/anything with .Descendants and obviously none of the code examples on google show how they got ws 🙃.
Any ideas? Thanks
HeaderFooter, as per your second example, is the correct way to read a Header or Footer from a Spreadsheet using OpenXML. The ws in your example refers to a Worksheet.
The following is an example that reads the HeaderFooter and dumps the InnerText to the console.
using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, false))
{
WorkbookPart workbookPart = document.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
Worksheet ws = worksheetPart.Worksheet;
HeaderFooter hf = ws.Descendants<HeaderFooter>().FirstOrDefault();
if (hf != null)
{
Console.WriteLine(hf.InnerText);
}
}
I would highly recommend that you read the documentation for the HeaderFooter element as it's more complex than you might imagine. The documentation can be found in section 18.3.1.46 of the Fifth Edition of the Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference which can be found here.
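To see what the HeaderFooter element holds without going through the SDK, here is a minimal Python sketch (not Open XML SDK code). It opens the .xlsx as a zip archive and reads the headerFooter element of each worksheet part. It assumes worksheet parts sit under xl/worksheets/, which is the usual layout but is not guaranteed, and the function name read_header_footers is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, used by worksheet parts.
NS = {"s": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_header_footers(xlsx):
    """Return {part_name: {child_tag: text}} for each worksheet that has a headerFooter."""
    found = {}
    with zipfile.ZipFile(xlsx) as package:
        for name in package.namelist():
            # Assumption: worksheet parts live directly under xl/worksheets/.
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            hf = root.find("s:headerFooter", NS)
            if hf is None:
                continue
            # Strip the "{namespace}" prefix that ElementTree puts on tags.
            found[name] = {child.tag.split("}")[1]: child.text for child in hf}
    return found
```

The text returned for children such as oddHeader or oddFooter still contains the section and formatting codes (for example &C for the centre section) that the specification describes, so it needs further parsing before display.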

Read the content of a Word document via its XML

Context
I am trying to build a Word document browser in Excel to sift through a large number of documents (around 1000).
Opening a Word document proves to be rather slow (around 4 seconds per document, so in this case it takes 2 hours to look through all the items, which is far too slow for a single query), even after disabling everything that could slow down the opening. I open each document:
As read only
Without the open and repair mode (which can happen on some documents)
Disabling the display of the document
My attempt so far
These documents are tricky to look through because some keywords do appear every single time but not in the same context (not the core of the problem here since I can handle that when the text is loaded in arrays). Hence the often used Windows explorer solution (like in this link ) cannot be used in my case.
For the moment, I have a working macro that analyzes the content of the Word documents by opening them.
Code
Here is a sample of the code.
Note that I used the Microsoft Word 14.0 Object Library reference
' Analyzing all the word document within the same folder '
Sub extractFile()
Dim i As Long, j As Long
Dim sAnalyzedDoc As String, sLibName As String
Dim aOut()
Dim oWordApp As Word.Application
Dim oDoc As Word.Document
Set oWordApp = CreateObject("Word.Application")
sLibName = ThisWorkbook.Path & "\"
sAnalyzedDoc = Dir(sLibName)
sKeyword = "example of a word"
With Application
.DisplayAlerts = False
.ScreenUpdating = False
End With
ReDim aOut(2, 2)
aOut(1, 1) = "Document name"
aOut(2, 1) = "Text"
While (sAnalyzedDoc <> "")
' Analyzing documents only with the .doc and .docx extension '
If Not InStr(sAnalyzedDoc, ".doc") = 0 Then
' Opening the document as mentioned above, in read only mode, without repair and invisible '
Set oDoc = oWordApp.Documents.Open(sLibName & sAnalyzedDoc, ReadOnly:=True, OpenAndRepair:=False, Visible:=False)
With oDoc
For i = 1 To .Sentences.Count
' Searching for the keyword within the document '
If Not InStr(LCase(.Sentences.Item(i)), LCase(sKeyword)) = 0 Then
If Not IsEmpty(aOut(1, 2)) Then
ReDim Preserve aOut(2, UBound(aOut, 2) + 1)
End If
aOut(1, UBound(aOut, 2)) = sAnalyzedDoc
aOut(2, UBound(aOut, 2)) = .Sentences.Item(i)
GoTo closingDoc ' A dubious programming choice but that works for the moment '
End If
Next i
closingDoc:
' Intending to make the closing faster by not saving the document '
.Close SaveChanges:=False
End With
End If
'Moving on to the next document '
sAnalyzedDoc = Dir
Wend
exitSub:
With Output
.Range(.Cells(1, 1), .Cells(UBound(aOut, 1), UBound(aOut, 2))) = aOut
End With
With Application
.DisplayAlerts = True
.ScreenUpdating = True
End With
End Sub
My question
My idea was to go through the XML inside the document to access its content directly (in newer versions of Word you can reach it by renaming the document with a .zip extension and opening nameOfDocument.zip\word\document.xml).
That would be a lot faster than loading all the images, charts and tables of the Word document, which are of no use in a text search.
So I wanted to ask whether there is a way in VBA to open a Word document like a zip file and access that XML document, then process it like a normal string of characters, since I already have the path and the name of the file from the code above.
Note that there is no easy answer to this problem. The VBA code in my initial question does the job perfectly well as long as you do not have a large number of documents to browse; otherwise go for another tool (there is a Python Dynamic Link Library (DLL) that does this very well).
Ok, I'll try to make my answer as explanatory as possible.
First of all, this question led me on an endless journey through XML in C# and XPath, which I chose not to pursue at some point.
The solution below reduced the time needed to analyze the files from roughly 2 hours to 10 seconds.
Context
The backbone of reading XML documents, and therefore inner word XML documents, is the OpenXML library from Microsoft.
Keep in mind what I said above: the method I was trying to implement cannot be done solely in VBA and must be done another way.
This is probably because VBA is implemented for Office and is thus limited in accessing the core structure of Office documents, but I have no information about this limitation (any information is welcome).
The answer I give here is to write a C# DLL for VBA.
For writing a DLL in C# and referencing it from VBA, I refer you to the following link, which summarizes this specific process better than I can: Tutorial for creating DLL in C#
Let's start
First of all, you will need to reference the WindowsBase library and DocumentFormat.OpenXML in your project for the solution to work, as explained in the MSDN articles Manipulate Office Open XML Formats Documents and Open and add text to a word processing document (Open XML SDK).
These articles explain broadly how the OpenXML library works for manipulating Word documents.
The C# code
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Xml;
using System.IO.Packaging;
namespace BrowserClass
{
public class SpecificDirectory
{
public string[,] LookUpWord(string nameKeyword, string nameStopword, string nameDirectory)
{
string sKeyWord = nameKeyword;
string sStopWord = nameStopword;
string sDirectory = nameDirectory;
sStopWord = sStopWord.ToLower();
sKeyWord = sKeyWord.ToLower();
string sDocPath = Path.GetDirectoryName(sDirectory);
// Looking for all the documents with the .docx extension
string[] sDocName = Directory.GetFiles(sDocPath, "*.docx", SearchOption.AllDirectories);
string[] sDocumentList = new string[1];
string[] sDocumentText = new string[1];
// Cycling the documents retrieved in the folder
for (int i = 0; i < sDocName.Count(); i++)
{
string docWord = sDocName[i];
// Opening the documents as read only, no need to edit them
Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.Read);
const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
PackagePart corePart = null;
Uri documentUri = null;
// We are extracting the part with the document content within the files
foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
{
documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
corePart = officePackage.GetPart(documentUri);
break;
}
// Here enter the proper code
if (corePart != null)
{
string cpPropertiesSchema = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
string dcPropertiesSchema = "http://purl.org/dc/elements/1.1/";
string dcTermsPropertiesSchema = "http://purl.org/dc/terms/";
// Construction of a namespace manager to handle the different parts of the xml files
NameTable nt = new NameTable();
XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
nsmgr.AddNamespace("dc", dcPropertiesSchema);
nsmgr.AddNamespace("cp", cpPropertiesSchema);
nsmgr.AddNamespace("dcterms", dcTermsPropertiesSchema);
// Loading the xml document's text
XmlDocument doc = new XmlDocument(nt);
doc.Load(corePart.GetStream());
// I chose to directly load the inner text because I could not parse the way I wanted the document, but it works so far
string docInnerText = doc.DocumentElement.InnerText;
docInnerText = docInnerText.Replace("\\* MERGEFORMAT", ".");
docInnerText = docInnerText.Replace("DOCPROPERTY ", "");
docInnerText = docInnerText.Replace("Glossary.", "");
try
{
Int32 iPosKeyword = docInnerText.ToLower().IndexOf(sKeyWord);
Int32 iPosStopWord = docInnerText.ToLower().IndexOf(sStopWord);
if (iPosStopWord == -1)
{
iPosStopWord = docInnerText.Length;
}
if (iPosKeyword != -1 && iPosKeyword <= iPosStopWord)
{
// Redimensions the array if there was already a document loaded
if (sDocumentList[0] != null)
{
Array.Resize(ref sDocumentList, sDocumentList.Length + 1);
Array.Resize(ref sDocumentText, sDocumentText.Length + 1);
}
sDocumentList[sDocumentList.Length - 1] = docWord.Substring(sDocPath.Length, docWord.Length - sDocPath.Length);
// Taking the small context around the keyword
sDocumentText[sDocumentText.Length - 1] = ("(...) " + docInnerText.Substring(iPosKeyword, sKeyWord.Length + 60) + " (...)");
}
}
catch (ArgumentOutOfRangeException)
{
Console.WriteLine("Error reading inner text.");
}
}
// Closing the package to enable opening a document right after
officePackage.Close();
}
if (sDocumentList[0] != null)
{
// Preparing the array for output
string[,] sFinalArray = new string[sDocumentList.Length, 2];
for (int i = 0; i < sDocumentList.Length; i++)
{
sFinalArray[i, 0] = sDocumentList[i].Replace("\\", "");
sFinalArray[i, 1] = sDocumentText[i];
}
return sFinalArray;
}
else
{
// Preparing the array for output
string[,] sFinalArray = new string[1, 1];
sFinalArray[0, 0] = "NO MATCH";
return sFinalArray;
}
}
}
}
The VBA code associated
Option Explicit
Const sLibname As String = "C:\pathToYourDocuments\"
Sub tester()
Dim aFiles As Variant
Dim LookUpDir As BrowserClass.SpecificDirectory
Set LookUpDir = New BrowserClass.SpecificDirectory
' The array will contain all the files which contain the "searchedPhrase" '
aFiles = LookUpDir.LookUpWord("searchedPhrase", "stopWord", sLibname)
' Add here any necessary processing if needed '
End Sub
So in the end you get a tool that can scan .docx documents much faster than in a classic open-read-close approach in VBA at the cost of more code writing.
Above all, you get a simple solution for users who just want to perform a simple search, especially when there is a huge number of Word documents.
Note
Parsing Word XML files can be nightmarish in VBA, as pointed out by @Mikegrann.
Thankfully OpenXML has an XML parser (see "C#, xml parsing. get data between tags") that will do the work for you in C# and pick out the <w:t></w:t> tags that hold the text of the document. I found these answers but couldn't make them work:
Parsing a MS Word generated XML file in C#, Reading specific XML elements from XML file
So I went for the .InnerText solution in my code above to access the internal text, at the cost of picking up some field-code text (like \\MERGEFORMAT).
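For comparison, the idea from the question (open the .docx as a zip and read only the <w:t> elements) can be sketched in a few lines of Python. This is not the author's C# method, just an illustration under stated assumptions: the main part is taken to be word/document.xml (the C# code resolves it through the package relationship instead), and the function names docx_paragraphs and find_keyword are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace in ElementTree's "{uri}tag" notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx):
    """Return the visible text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(docx) as package:
        # Assumption: the main document part is stored at word/document.xml.
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # Only <w:t> carries visible text; field codes sit in <w:instrText>.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs


def find_keyword(docx, keyword, context=60):
    """Return a snippet starting at the first case-insensitive match, or None."""
    text = "\n".join(docx_paragraphs(docx))
    pos = text.lower().find(keyword.lower())
    if pos == -1:
        return None
    return text[pos:pos + len(keyword) + context]
```

Because field instructions such as DOCPROPERTY and MERGEFORMAT are stored in <w:instrText> rather than <w:t>, reading only <w:t> avoids the string replacements that the InnerText approach needs. Paragraphs nested inside other paragraphs (text boxes, for instance) would be counted twice by this sketch.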

How to keep original rotate page in itextSharp (dll)

I would like to create a project that reads from Excel, writes onto a PDF, and prints that PDF.
One cell in the Excel file holds the path of the original PDF on the computer or server, and the next cell holds the text to write at the top of a second PDF.
Here is the problem: when the original PDF is landscape (rotated), my program creates a copy of the original and writes the text from Excel at the top of the copy, but the copied page comes out rotated by 270 degrees, which is wrong. For portrait pages the program works fine: both the copy and the text at the top are OK.
Where is the problem in my code?
Code:
public int urediPDF(string inTekst)
{
if (inTekst != "0")
{
string pisava_arialBD = @"..\debug\arial.ttf";
string oldFile = null;
string inText = null;
string indeks = null;
//split the input string
string[] vhod = inTekst.Split('#');
oldFile = vhod[0];
inText = vhod[1];
indeks = vhod[2];
string newFile = @"c:\da\2";
//open the PDF reader
PdfReader reader = new PdfReader(oldFile);
Rectangle size = reader.GetPageSizeWithRotation(reader.NumberOfPages);
Document document = new Document(size);
//open the PDF writer
FileStream fs = new FileStream(newFile + "-" + indeks + ".pdf", FileMode.Create, FileAccess.Write);
PdfWriter writer = PdfWriter.GetInstance(document, fs);
//document.Open();
document.OpenDocument();
label2.Text = ("Status: " + reader.GetPageRotation(reader.NumberOfPages).ToString());
//set up the content layer used to draw on the PDF
PdfContentByte cb = writer.DirectContent;
//choose the font
BaseFont bf = BaseFont.CreateFont(pisava_arialBD, BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
cb.SetColorFill(BaseColor.RED);
cb.SetFontAndSize(bf, 8);
//write the text into the PDF
cb.BeginText();
string text = inText;
//choose the coordinates for writing the text into the PDF (720 degrees rotation (landscape) and 90 degrees (portrait))
if (reader.GetPageRotation(1) == 720) //landscape layout
{
cb.ShowTextAligned(1, text, 10, 450, 0);
cb.EndText();
}
else //portrait layout
{
cb.ShowTextAligned(1, text + " - pokončen", 10, 750, 0);
cb.EndText();
}
// create the new page and add it to the pdf
PdfImportedPage page = writer.GetImportedPage(reader, reader.NumberOfPages);
cb.AddTemplate(page, 0, 0);
// close the streams and voilà the file should be changed :)
document.Close();
fs.Close();
writer.Close();
reader.Close();
}
else
{
label2.Text = "Status: Končano zapisovanje";
return 0;
}
return 0;
}
As explained many times before (ITextSharp include all pages from the input file, Itext pdf Merge : Document overflow outside pdf (Text truncated) page and not displaying, and so on), you should read chapter 6 of my book iText in Action (you can find the C# version of the examples here).
You are using a combination of Document, PdfWriter and PdfImportedPage to split a PDF. Please tell me who made you do it this way, so that I can curse the person who inspired you (because I've answered this question hundreds of times before, and I'm getting tired of repeating myself). These classes aren't a good choice for that job:
you lose all interactivity,
you need to rotate the content yourself if the page is in landscape,
you need to take the original page size into account,
...
Your problem is similar to this one itextsharp: unexpected elements on copied pages. Is there any reason why you didn't read the documentation? If you say: "I didn't have the time", please believe me if I say that I have almost 20 years of experience as a developer, and I've never seen "reading documentation" as a waste of time.
Long story short: read the documentation, replace PdfWriter with PdfCopy, replace AddTemplate() with AddPage().
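The point about rotation and page size in the list above can be made concrete with a small Python sketch. It is not iText code; it only models what a PDF page rotation means. A page's rotation is a multiple of 90 degrees, and a quarter turn swaps the displayed width and height. As far as I know, iText reports the rotation normalised to 0, 90, 180 or 270, in which case the == 720 test in the question would never be true, but treat that as an assumption. The names displayed_size and is_landscape are made up for this example.

```python
def displayed_size(width, height, rotation):
    """Return (width, height) as displayed for a page with the given rotation."""
    rotation %= 360  # normalise, e.g. 720 -> 0 and -90 -> 270
    if rotation % 90 != 0:
        raise ValueError("page rotation must be a multiple of 90 degrees")
    # A quarter turn in either direction swaps the two dimensions.
    if rotation in (90, 270):
        return (height, width)
    return (width, height)


def is_landscape(width, height, rotation):
    """True when the page is wider than tall once its rotation is applied."""
    shown_width, shown_height = displayed_size(width, height, rotation)
    return shown_width > shown_height
```

An A4 page stored as 595 x 842 points with a rotation of 270 is displayed as 842 x 595, so any text position chosen for the unrotated page has to be transformed too. That bookkeeping is what you take on when you import pages through PdfWriter.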

The process cannot access the file 'd:\1.doc' because it is being used by another process

my code :
object c = "d:\\1.doc";
if(File.Exists(c.ToString()))
{
File.Delete(c.ToString());
}
error :
The process cannot access the file 'd:\1.doc' because it is being used
by another process.
How can I close it with code?
First of all, use string instead of object:
string c = "d:\\1.doc";
Now, as the message indicates, the file is being used by another process: either a Windows process, or you opened a file stream and forgot to close it. Check where your code interacts with the file.
Edit: Since you are using Microsoft.Office.Interop.Word, make sure you close any open file(s) first, like this:
Word.ApplicationClass word = new Word.ApplicationClass();
//after using it:
if (word.Documents.Count > 0)
{
word.Documents.Close(...);
}
((Word._Application)word.Application).Quit(..);
word.Quit(..);
I had the same type of issue when I wanted to delete a file after opening and reading it with Microsoft.Office.Interop.Word. I needed to close both the document and the application, like this:
private void parseFile(string filePath)
{
// Open a doc file.
Microsoft.Office.Interop.Word.Application application = new Microsoft.Office.Interop.Word.Application();
Document document = application.Documents.Open(filePath);
// Loop through all words in the document.
int count = document.Words.Count;
for (int i = 1; i <= count; i++)
{
// Write the word.
string text = document.Words[i].Text;
Console.WriteLine("Word {0} = {1}", i, text);
}
// Close document correctly
((_Document)document).Close();
((_Application)application).Quit();
}
You have that file actively open in this or another program, and Windows prevents you from deleting it in that case.
Check whether the file is still open in another application, for example:
1- Microsoft Word
2- WordPad
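The ordering that all of these answers rely on (finish with the file, release the handle, only then delete) is the same in any language. Here is a minimal Python sketch of it, offered only as an illustration and not as a replacement for closing the Word document and application through Interop; the function name read_then_delete is hypothetical.

```python
import os


def read_then_delete(path):
    """Read a file, release the handle, then delete the file."""
    # The with-block closes the handle even if reading raises.
    with open(path, "rb") as handle:
        data = handle.read()
    # This process no longer holds the file open, so deletion can proceed.
    os.remove(path)
    return data
```

If another process still has the file open (Word, for example), the delete can still fail on Windows, which is why the Interop answers call Close and Quit first.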

PDF text search and split library

I am looking for a server-side PDF library (or command line tool) which can:
split a multi-page PDF file into individual PDF files, based on
a search result of the PDF file content
Examples:
Search for a "Page ???" pattern in the text and split the big PDF into 001.pdf, 002.pdf, ... ???.pdf
A server program will scan the PDF, look for the search pattern, and save the page(s) which match the pattern as files on disk.
Integration with PHP / Ruby would be nice. A command line tool is also acceptable. It will be a server-side (Linux or Win32) batch processing tool; no GUI or login is available. i18n support would be nice but is not required. Thanks~
My company, Atalasoft, has just released some PDF manipulation tools that run on .NET. There is a text extract class that you can use to find the text and determine how you will split your document and a very high level document class that makes the splitting trivial. Suppose you have a Stream to your source PDF and an increasingly ordered List that describes the starting page of each split, then the code to generate your split files looks like this:
public void SplitPdf(Stream stm, List<int> pageStarts, string outputDirectory)
{
PdfDocument mainDoc = new PdfDocument(stm);
int lastPage = mainDoc.Pages.Count - 1;
for (int i=0; i < pageStarts.Count; i++) {
int startPage = pageStarts[i];
int endPage= (i < pageStarts.Count - 1) ?
pageStarts[i + 1] - 1 :
lastPage;
if (startPage > endPage) throw new ArgumentException("list is not ordered properly", "pageStarts");
PdfDocument splitDoc = new PdfDocument();
for (int j = startPage; j <= endPage; j++)
splitDoc.Pages.Add(mainDoc.Pages[j]);
string outputPath = Path.Combine(outputDirectory,
string.Format("{0:D3}.pdf", i + 1));
splitDoc.Save(outputPath);
}
}
If you generalize this into a page range struct:
public struct PageRange {
public int StartPage;
public int EndPage;
}
where StartPage and EndPage inclusively describe a range of pages, then the code is simpler:
public void SplitPdf(Stream stm, List<PageRange> ranges, string outputDirectory)
{
PdfDocument mainDoc = new PdfDocument(stm);
int outputDocCount = 1;
foreach (PageRange range in ranges) {
int startPage = Math.Min(range.StartPage, range.EndPage); // assume not in order
int endPage = Math.Max(range.StartPage, range.EndPage);
PdfDocument splitDoc = new PdfDocument();
for (int i=startPage; i <= endPage; i++)
splitDoc.Pages.Add(mainDoc.Pages[i]);
string outputPath = Path.Combine(outputDirectory,
string.Format("{0:D3}.pdf", outputDocCount));
splitDoc.Save(outputPath);
outputDocCount++;
}
}
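The start-page bookkeeping in the first version is independent of any PDF library, so it can be checked on its own. The following Python sketch mirrors that logic under the same assumptions (zero-based page indexes, an ordered list of starts). The function name ranges_from_starts is made up for illustration and is not part of the Atalasoft API.

```python
def ranges_from_starts(page_starts, page_count):
    """Turn ordered, zero-based start pages into inclusive (start, end) ranges."""
    ranges = []
    for i, start in enumerate(page_starts):
        # Each split ends just before the next one starts;
        # the last split runs to the final page of the document.
        if i + 1 < len(page_starts):
            end = page_starts[i + 1] - 1
        else:
            end = page_count - 1
        if start > end:
            raise ValueError("page_starts is not ordered properly")
        ranges.append((start, end))
    return ranges
```

For a six-page document with splits starting at pages 0, 3 and 4, this yields the ranges (0, 2), (3, 3) and (4, 5), which can then be fed to the range-based version of SplitPdf.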
PDFBox is a Java library but it does have some command line tools as well:
http://pdfbox.apache.org/
PDFBox can extract text and also rebuild/split PDFs
pdfminer + multi-line pattern matching in python
You can use pdfsam to split your file into pages, then use pdftotext (from foolabs.com) to turn these into text and use ruby (or grep) to find the strings. Then you have the page ranges and can return the previously generated pages.
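The search step of that pipeline could look roughly like the Python sketch below. It assumes the text of the whole PDF has already been extracted with pdftotext, which separates pages with a form feed character by default, and it leaves running pdftotext and the actual splitting to the tools named above. The function name pages_matching and the example pattern are illustrative only.

```python
import re


def pages_matching(text, pattern):
    """Return zero-based indexes of the pages whose text matches the regex."""
    # pdftotext ends each page with a form feed, so splitting on it
    # gives one string per page (plus an empty tail after the last page).
    pages = text.split("\f")
    return [i for i, page in enumerate(pages) if re.search(pattern, page)]
```

Called with a pattern such as r"Page \d{3}", the result is the list of pages that start a new output file, which is the input a splitter needs.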
