How to Extract docx (Word 2007 above) using Apache POI - text

Hi, I'm using Apache POI 3.6.
I've already written some code:
XWPFDocument doc = new XWPFDocument(new FileInputStream(file));
wordxExtractor = new XWPFWordExtractor(doc);
text = wordxExtractor.getText();
System.out.println("adding docx " + file);
d.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
Unfortunately, it throws an error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/dom4j/DocumentException
at org.apache.poi.openxml4j.opc.OPCPackage.init(OPCPackage.java:149)
at org.apache.poi.openxml4j.opc.OPCPackage.<init>(OPCPackage.java:136)
at org.apache.poi.openxml4j.opc.Package.<init>(Package.java:54)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:98)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:199)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:178)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:53)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:98)
at org.apache.lucene.demo.Indexer.indexDocs(Indexer.java:153)
at org.apache.lucene.demo.Indexer.main(Indexer.java:88)
It seems that it used the constructor
XWPFWordExtractor(OPCPackage container)
rather than this one:
XWPFWordExtractor(XWPFDocument document)
Any idea why?
Or any idea how I can extract the text of the .docx file into a String?

You need to add the dom4j library to your classpath or your project libraries.

It looks like you don't have all of the dependencies on your classpath.
If you look at http://poi.apache.org/overview.html you'll see that dom4j is a required library when working with the OOXML files. From the exception you got, it seems that you don't have it... If you look in the POI binary download, you should find it in the ooxml-libs subdirectory.
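One quick way to confirm that a dependency really is missing is to probe the classpath before constructing the document. This is only a sketch: the class name ClasspathCheck and the method isOnClasspath are made up for this example, and org.dom4j.DocumentException is simply the class named in the stack trace above.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class taken from the NoClassDefFoundError in the question
        String required = "org.dom4j.DocumentException";
        if (isOnClasspath(required)) {
            System.out.println(required + " found");
        } else {
            System.out.println(required + " is missing; add the dom4j jar to the classpath");
        }
    }
}
```

Running this with the same classpath as the indexer should show whether the dom4j jar from the ooxml-libs directory is actually being picked up.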

You could try docx4j instead; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java

Related

merging two pdf files using gembox c#

I need to merge two PDF files using GemBox. Is it possible to merge them with GemBox?
I know that we can merge with iText. Please suggest a way to merge files with GemBox, if there is one.
EDIT 2019-11-15:
A more appropriate solution for merging multiple PDF files into one is to use GemBox.Pdf instead, as shown in this example.
string[] files = { "File 1.pdf", "File 2.pdf", "File 3.pdf" };
using (var document = new PdfDocument())
{
    foreach (var file in files)
        using (var source = PdfDocument.Load(file))
            document.Pages.Kids.AddClone(source.Pages);
    document.Save("Merged Files.pdf");
}
EDIT 2017-11-22:
A newer version of GemBox.Document (version 2.5) supports PDF as both an input and an output file format; see the release post.
On the following link you can find the demonstration sample for "Read and Extract PDF Text in C# and VB.NET":
https://www.gemboxsoftware.com/document/examples/c-sharp-read-pdf/305
ORIGINAL ANSWER:
The current version of GemBox.Document (version 2.3) supports PDF only as an output format.
There is a feature request for this, but at this moment I'm not sure when it will be available:
https://support.gemboxsoftware.com/feedback/view/support-for-reading-pdf-files

How do I read word document with bold and italic formatting by using POI

I am using Apache POI.
I am able to read text from a .doc file using "org.apache.poi.hwpf.extractor.WordExtractor".
I have even fetched the tables using "org.apache.poi.hwpf.usermodel.Table".
But how can I fetch the bold/italic formatting of the text?
Thanks in advance.
WordExtractor returns only the text, nothing else.
The simplest way for you to get the text+formatting of a word document is to switch to using Apache Tika. Apache Tika builds on top of Apache POI (amongst others), and offers both plain text extraction and rich extraction (XHTML with formatting).
Alternately, if you want to write the code yourself, I'd suggest you review the code in Tika's WordExtractor, which demonstrates how to use Apache POI to get the formatting information of runs of text out.
Instead of using WordExtractor, you can read with Range:
...
HWPFDocument doc = new HWPFDocument(fis);
Range r = doc.getRange();
...
Range is the central class of that model. Once you have a Range, you can work with the properties of the text; for instance, you can iterate through all of its CharacterRuns and check whether each one is italic (.isItalic()) or make it italic (.setItalic(true)).
for (int i = 0; i < r.numCharacterRuns(); i++)
{
    CharacterRun cr = r.getCharacterRun(i);
    cr.setItalic(true);
    ...
}
...
File fon = new File(yourFilePathOut);
FileOutputStream fos = new FileOutputStream(fon);
doc.write(fos);
...
This works if you are sticking with HWPF. By the way, it is often more convenient to work at the level of Paragraphs.
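Once the bold and italic flags have been read from each run, turning them into marked-up text is straightforward. The following is a minimal sketch rather than POI code: StyledRun and RunMarkup are hypothetical names invented here, and the <b>/<i> output format is an assumption. With HWPF, each StyledRun would be filled from a CharacterRun using text(), isBold() and isItalic().

```java
import java.util.Arrays;
import java.util.List;

public class RunMarkup {

    /** Hypothetical stand-in for one run of text with uniform formatting. */
    public static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        public StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /**
     * Wraps each run in b/i tags according to its flags and concatenates the result.
     * Note: the text is not HTML-escaped; this sketch only shows the flag handling.
     */
    public static String toMarkup(List<StyledRun> runs) {
        StringBuilder out = new StringBuilder();
        for (StyledRun run : runs) {
            if (run.bold) out.append("<b>");
            if (run.italic) out.append("<i>");
            out.append(run.text);
            if (run.italic) out.append("</i>");
            if (run.bold) out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<StyledRun> runs = Arrays.asList(
                new StyledRun("Plain, ", false, false),
                new StyledRun("bold, ", true, false),
                new StyledRun("both", true, true));
        // prints: Plain, <b>bold, </b><b><i>both</i></b>
        System.out.println(toMarkup(runs));
    }
}
```

Keeping the formatting flags in a small value type like this separates reading the document from rendering it, so the rendering part can be tested without a Word file.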

apache-poi inserting filename in header or footer

I'm using the apache poi library (poi-3.8-20120326.jar)
How can I add the filename in the footer of an xls (hssf) document?
My approach is the following:
final static public String FILE_NAME = "&[File]";
public static void insertFilename(Sheet sheet) {
    sheet.getFooter().setLeft(FILE_NAME);
}
The problem is that Microsoft Excel 2003 displays
File]
If I open the footer editor, click in the field, change nothing, and save, it works.
The editor shows it as
&[File]
Is there a workaround or a dirty trick to avoid this?
Thank you
It may look like "&[File]" in Excel, but that's not how it's stored internally. You're using HSSF for your .xls file, so use the following static HeaderFooter method to get the internal Excel code for the filename:
import org.apache.poi.hssf.usermodel.HeaderFooter;
String fileIndicator = HeaderFooter.file();
A quick look at the source code shows that the internal code is the string "&F".
If someone is using XSSF for a .xlsx file, then there is no corresponding file method. However, the documentation for XSSFHeaderFooter indicates that you can use the string "&F" directly.
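Because header and footer fields are stored as plain strings containing & codes, a footer can be assembled by ordinary string concatenation. The sketch below is illustrative only: the class FooterCodes and its method are hypothetical, and only the "&F" code comes from the answer above; "&P" (page number) and "&N" (total pages) are the usual Excel codes for those fields.

```java
public class FooterCodes {
    // Internal Excel header/footer field codes (not the "&[File]" form the editor displays)
    public static final String FILE_NAME = "&F";
    public static final String PAGE_NUMBER = "&P";
    public static final String TOTAL_PAGES = "&N";

    /** Builds a footer string such as "&F - page &P of &N". */
    public static String fileAndPageFooter() {
        return FILE_NAME + " - page " + PAGE_NUMBER + " of " + TOTAL_PAGES;
    }

    public static void main(String[] args) {
        System.out.println(fileAndPageFooter());
    }
}
```

The resulting string would be passed to sheet.getFooter().setLeft(...) exactly as in the question's insertFilename method.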

Groovy htmlunit

I'm having issues importing htmlunit (htmlunit.sf.net) into a groovy script.
I'm currently just using the example script that was on the web, and it gives me the error "unable to resolve class com.gargoylesoftware.htmlunit.WebClient".
The script is:
import com.gargoylesoftware.htmlunit.WebClient
client = new WebClient()
page = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')
I downloaded the source from the website and placed the com folder (and all its contents) where my script was located.
Does anyone know what issue I'm encountering? I'm not quite sure why it won't import it
You could use Grape to fetch the dependency for you at script runtime. The easiest way to do it is to add a @Grab annotation to your import statement.
Like this:
@Grab('net.sourceforge.htmlunit:htmlunit:2.7')
import com.gargoylesoftware.htmlunit.WebClient
client = new WebClient()
// JavaScript disabled because HtmlUnit had problems with the page's JavaScript
client.javaScriptEnabled = false
page = client.getPage('http://www.msnbc.msn.com/')
println page.anchors.collect{ it.hrefAttribute }.sort().unique().join('\n')
There's only one problem: the page seems to be a bit too much for HtmlUnit to chew. When I ran the code I got an OutOfMemoryError every time. I'd suggest downloading the HTML the normal way instead and then using something like NekoHTML or TagSoup to parse the HTML into XML and work with it that way.
This example uses TagSoup to work with html as xml in Groovy: http://blog.foosion.org/2008/06/09/parse-html-the-groovy-way/
You just need to download the zip file, extract the jar file(s), and place them on the classpath when compiling. You don't need the source.
http://sourceforge.net/projects/htmlunit/files/htmlunit/2.8/htmlunit-2.8.zip/download

SharePoint and Office Open XML interaction question

I've been frustrated by this for the entire weekend, plus a day or two, so any help would be greatly appreciated.
I'm trying to write a program that can programmatically go into a SharePoint 2007 doc library, open a file, change the contents of the file, then put the file back. I've gotten all but the last part of this down. The reason Office Open XML is involved is that that's how I'm opening the document and modifying it - through the Office Open XML SDK. My question is: How do I get it from the document back into the library?
The problem as I see it is that there's no save function on the WordprocessingDocument object itself. This prevents me from saving it into the SPFile's SaveBinary function.
You should use streams to write the changed OOXML back into the SPFile.
I hope this example helps!
Stream fs = mySPFile.OpenBinaryStream();
using (WordprocessingDocument ooxmlDoc = WordprocessingDocument.Open(fs, true))
{
    MainDocumentPart mainPart = ooxmlDoc.MainDocumentPart;
    XmlDocument xmlMainDocument = new XmlDocument();
    xmlMainDocument.Load(mainPart.GetStream());
    // change the contents of the ooxmlDoc / xmlMainDocument
    Stream stream = mainPart.GetStream(FileMode.Open, FileAccess.ReadWrite);
    xmlMainDocument.Save(stream);
    // the stream should not be longer than the DocumentPart
    stream.SetLength(stream.Position);
}
mySPFile.SaveBinary(fs);
fs.Dispose();
Yesterday I saw a webcast with Andrew Connell where he opened a doc from a doc library, added a watermark and saved the file again. It sure sounds like you should have a look at that webcast:
https://msevents.microsoft.com/CUI/WebCastRegistrationConfirmation.aspx?culture=en-US&RegistrationID=1299758384&Validate=false
By the way, I found all 10 of the webcasts in that series very good.

Resources