Is it possible to extract text by page for word/pdf files using Apache Tika?

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after it ends. You can easily set up a page counter in your handler like this (counting pages using only <p>):
public abstract class MyContentHandler implements ContentHandler {
    private String pageTag = "p";
    protected int pageNumber = 0;
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        // A new page begins whenever the page tag opens.
        if (pageTag.equals(qName)) {
            startPage();
        }
    }
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // The page ends when the page tag closes.
        if (pageTag.equals(qName)) {
            endPage();
        }
    }
    protected void startPage() throws SAXException {
        pageNumber++;
    }
    protected void endPage() throws SAXException {
        // Hook for subclasses; nothing to do by default.
    }
    ...
}
When doing this with PDF you may run into a problem where the parser doesn't send text lines in the proper order; see "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" for how to handle this.
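The handler idea above can be tried out without Tika, since a ContentHandler is plain SAX. The following is a minimal sketch under stated assumptions: the class name PageTextCollector and its methods are hypothetical, the sample XHTML is hand-written, and it assumes each page arrives wrapped in <div class="page">, which is how Tika's PDF output appears to mark pages in my experience; verify this against your Tika version. It collects the text of each page separately instead of only counting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers the text found inside each <div class="page">.
public class PageTextCollector extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page
    private int divDepth;          // nesting depth of <div> elements inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null && "page".equals(atts.getValue("class"))) {
            // Assumed page marker: start buffering text for a new page.
            current = new StringBuilder();
            divDepth = 1;
        } else if (current != null) {
            divDepth++; // nested <div> inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "div".equals(qName) && --divDepth == 0) {
            // The page's own <div> has closed: store its text.
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    // Parses an XHTML string with the JDK SAX parser and returns one entry per page.
    public static List<String> splitPages(String xhtml) throws Exception {
        PageTextCollector handler = new PageTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page text</p></div>"
                + "</body></html>";
        List<String> pages = splitPages(xhtml);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With Tika, an instance of such a handler would be passed to parser.parse(...) in place of the string parsing shown here; the same startElement/endElement/characters callbacks would then be driven by the parser.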

You'll need to work with the underlying libraries - Tika doesn't do anything at the page level.
For PDF files, PDFBox should be able to give you some page stuff. For Word, HWPF and XWPF from Apache POI don't really do page level things - the page breaks aren't stored in the file, but instead need to be calculated on the fly based on the text + fonts + page size...

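You can get the number of pages in a PDF from the metadata object's xmpTPg:NPages key, as in the following:
// fis is an InputStream opened on the PDF file
Parser parser = new AutoDetectParser();
// Any ContentHandler works here; BodyContentHandler is one option
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
parser.parse(fis, handler, metadata, parseContext);
// The page count is returned as a String and is only populated after parsing
String pageCount = metadata.get("xmpTPg:NPages");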

Related

How to read real numeric values instead of formatted value using Apache XSSF POI streaming API?

I use the streaming POI API and would like to read the real value of a cell instead of the formatted one. My code below works fine, but if the user doesn't display all the digits of a value in the Excel sheet read by my code, I get the same truncated value in my result. I didn't find any solution in the streaming API, which I need in order to solve the memory issues I had using the POI API without streaming.
/**
 * @see org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler cell(java.lang.String,
 * java.lang.String)
 */
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
    useTheCellValue(formattedValue);
}

If you are constructing the XSSFSheetXMLHandler yourself, you can provide a DataFormatter. A custom DataFormatter gives you full control over how values are formatted.
Here is an example of how this could look, changing the public void processSheet method of the XLSX2CSV example in svn:
...
public void processSheet(
        StylesTable styles,
        ReadOnlySharedStringsTable strings,
        SheetContentsHandler sheetHandler,
        InputStream sheetInputStream) throws IOException, SAXException {
    //DataFormatter formatter = new DataFormatter();
    DataFormatter formatter = new DataFormatter(java.util.Locale.US) {
        // never format double values, but do format dates
        @Override
        public java.lang.String formatRawCellContents(double value, int formatIndex, java.lang.String formatString) {
            if (org.apache.poi.ss.usermodel.DateUtil.isADateFormat(formatIndex, formatString)) {
                return super.formatRawCellContents(value, formatIndex, formatString);
            } else {
                //return java.lang.String.valueOf(value);
                return super.formatRawCellContents(value, 0, "General");
            }
        }
    };
    InputSource sheetSource = new InputSource(sheetInputStream);
    try {
        XMLReader sheetParser = SAXHelper.newXMLReader();
        ContentHandler handler = new XSSFSheetXMLHandler(
                styles, null, strings, sheetHandler, formatter, false);
        sheetParser.setContentHandler(handler);
        sheetParser.parse(sheetSource);
    } catch (ParserConfigurationException e) {
        throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
    }
}
...
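The truncation described in the question can be reproduced without POI. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and java.text.DecimalFormat merely stands in for a cell's display format, which is an assumption made for illustration rather than a description of how POI's DataFormatter works internally.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical illustration: a display format drops digits that the stored value still has.
public class RawVersusFormatted {
    // Mimics a cell display format: renders the value with the given pattern.
    public static String formatted(double value, String pattern) {
        DecimalFormat format = new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
        return format.format(value);
    }

    // Mimics bypassing the display format: every stored digit is kept.
    public static String raw(double value) {
        return Double.toString(value);
    }

    public static void main(String[] args) {
        double stored = 3.14159265;
        System.out.println(formatted(stored, "0.00")); // what a two-decimal display format yields
        System.out.println(raw(stored));               // the full stored value
    }
}
```

This is why overriding the formatter for non-date cells, as in the processSheet change above, can recover the digits that the sheet's display format hides.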

I've seen a ticket on POI about this point: https://bz.apache.org/bugzilla/show_bug.cgi?id=61858
It proposes a first solution that changes the existing class.
This could be an interesting workaround, even if the ideal would be to use a standard solution.

Using RazorEngine with TextWriter

I want to use RazorEngine to generate some HTML files. It's easy to generate strings first and then write them to files, but if the generated strings are too large, that will cause memory issues.
So I wonder whether there is a non-cached way to use RazorEngine, such as using a StreamWriter as its output rather than a string.
I googled this for a while, but with no luck.
I think using a custom base template should be the right way, but the documentation on the official RazorEngine homepage is sparse (and even out of date).
Any hint will be helpful!

OK. I figured it out.
Create a class that inherits TemplateBase<T> and takes a TextWriter parameter in the constructor.
public class TextWriterTemplate<T> : TemplateBase<T>
{
    private readonly TextWriter _tw;
    public TextWriterTemplate(TextWriter tw)
    {
        _tw = tw;
    }
    // override Write and WriteLiteral methods, write text using the TextWriter.
    public override void Write(object value)
    {
        _tw.Write(value);
    }
    public override void WriteLiteral(string literal)
    {
        _tw.Write(literal);
    }
}
Then use the template like this:
private static void Main(string[] args)
{
    using (var sw = new StreamWriter(@"output.txt"))
    {
        var config = new FluentTemplateServiceConfiguration(c =>
            c.WithBaseTemplateType(typeof(TextWriterTemplate<>))
             .ActivateUsing(context => (ITemplate)Activator.CreateInstance(context.TemplateType, sw))
        );
        using (var service = new TemplateService(config))
        {
            service.Parse("Hello @Model.Name", new {Name = "Waku"}, null, null);
        }
    }
}
The content of output.txt should be Hello Waku.

How do I Response.Write from a Sandboxed Solution?

I have a requirement to create an HTML table and send it to the user as a spreadsheet. I've done this many times in plain old ASP.NET, but it's kicking my behind in SharePoint. It's a simple Button with an event handler that reads 3 files stored in a Module (accessible to a sandboxed solution) and then tries to send the text to the user along with the appropriate content-type and content-disposition. When I get this part working, the SampleData.txt portion will be replaced with data from multiple lists. Unfortunately, the page does its postback, but no file is offered to the user for opening or download.
I've read in a few places that Response.Write is not usable in web parts, but surely there is some other way of accomplishing this task...? Here's what I want to do - is there a usable alternative?
protected override void CreateChildControls()
{
    Button ReportButton = new Button();
    ReportButton.Text = "Generate DCPDS Report";
    ReportButton.Click += new EventHandler(ReportButton_Click);
    this.Controls.Add(ReportButton);
}
protected void ReportButton_Click(Object sender, EventArgs e)
{
    string header = "";
    string sampleData = "";
    string footer = "";
    using (SPWeb webRoot = SPContext.Current.Site.RootWeb)
    {
        header = webRoot.GetFileAsString("SpreadsheetParts/Header.txt");
        sampleData = webRoot.GetFileAsString("SpreadsheetParts/SampleData.txt");
        footer = webRoot.GetFileAsString("SpreadsheetParts/Footer.txt");
    }
    StringBuilder sb = new StringBuilder();
    sb.Append(header);
    sb.Append(sampleData);
    sb.Append(footer);
    Page.Response.Clear();
    HttpContext.Current.Response.ContentType = "application/vnd.ms-excel";
    HttpContext.Current.Response.AddHeader("content-disposition", "attachment; filename=June2010Spreadsheet.xls");
    HttpContext.Current.Response.Write(sb.ToString());
    //HttpContext.Current.Response.End();
}

I have not tested this, and it would make an interesting proof of concept, but I don't have any clean SharePoint 2010 servers ready for testing, so I cannot test this for you. But just as a thought, could you not place the contents of your Excel sheet in a hidden field and do something like:
HTML:
<script type="text/javascript">
function downloadExcel() {
location.href='data:application/download,' + encodeURIComponent(
document.getElementById('<%= hfExcelData.ClientID %>').value
);
}
</script>
C#:
protected override void CreateChildControls()
{
    Button ReportButton = new Button();
    ReportButton.Text = "Generate DCPDS Report";
    ReportButton.Click += new EventHandler(ReportButton_Click);
    this.Controls.Add(ReportButton);
}
protected void ReportButton_Click(Object sender, EventArgs e)
{
    string header = "";
    string sampleData = "";
    string footer = "";
    using (SPWeb webRoot = SPContext.Current.Site.RootWeb)
    {
        header = webRoot.GetFileAsString("SpreadsheetParts/Header.txt");
        sampleData = webRoot.GetFileAsString("SpreadsheetParts/SampleData.txt");
        footer = webRoot.GetFileAsString("SpreadsheetParts/Footer.txt");
    }
    StringBuilder sb = new StringBuilder();
    sb.Append(header);
    sb.Append(sampleData);
    sb.Append(footer);
    // hfExcelData is assumed to be a HiddenField that has been added to the page
    this.hfExcelData.Value = sb.ToString();
    Page.ClientScript.RegisterStartupScript(this.GetType(), "ExcelUploader", "downloadExcel();", true);
}

I'm struggling with something similar. My understanding is that in the sandbox, SP is actually proxying calls to your page, getting the response, and then returning its own response to the client. That proxy doesn't allow all functions of HttpContext or the SP classes. In my testing, ContentType and headers are not returned. SP (in sandboxed code) will always return text/html and ignore anything else from your webpart.
One way to get the HTML part of what you want is to use an asp:literal. You could follow these steps to create an application page with a webpart on it. Remove the master page reference and the placeholders, so that the only thing returned is your webpart content. In the webpart, add a literal and set the text to the HTML table. At this point, you would have a page that just returns the table as HTML. Here is another post outlining the exact technique. However, you still would not be able to change the content type or add headers. Maybe combined with the first answer that would get you a solution though?

Use the default JavaScript function STSNavigate on a button or LinkButton:
STSNavigate(dUrl);
where dUrl is the URL of the page that writes to HttpContext.Current.Response.

How to override Web.UI.Page - Events in Sharepoint?

I want to compress the ViewState. Therefore I need to override SavePageStateToPersistenceMedium, which belongs to Web.UI.Page. In "normal" ASP.NET that's quite easy, but in my SharePoint project I cannot find any place where I have a class that inherits from Web.UI.Page.
My PageLayouts have no code behind, neither has the masterPage.
The best solution would be for me to be able to handle that in a pageLayout, because I do not want every Page to cache the ViewState.
To make it a bit clearer. This is the code I want to put "somewhere":
public abstract class BasePage : System.Web.UI.Page
{
    private ObjectStateFormatter _formatter =
        new ObjectStateFormatter();
    protected override void
        SavePageStateToPersistenceMedium(object viewState)
    {
        MemoryStream ms = new MemoryStream();
        _formatter.Serialize(ms, viewState);
        byte[] viewStateArray = ms.ToArray();
        ClientScript.RegisterHiddenField("__COMPRESSEDVIEWSTATE",
            Convert.ToBase64String(
                CompressViewState.Compress(viewStateArray)));
    }
    protected override object
        LoadPageStateFromPersistenceMedium()
    {
        string vsString = Request.Form["__COMPRESSEDVIEWSTATE"];
        byte[] bytes = Convert.FromBase64String(vsString);
        bytes = CompressViewState.Decompress(bytes);
        return _formatter.Deserialize(
            Convert.ToBase64String(bytes));
    }
}

I would inherit from PublishingLayoutPage (which in turn ultimately inherits from Page) instead, and let all of my page layouts use this base page as code-behind.
This means you need to alter your page layouts' Page directive like so:
<%@ Page Language="C#" Inherits="YourNameSpace.BasePage, $SharePoint.Project.AssemblyFullName$" %>

How can I remove the tables rendered around the webparts in the Rich Content area?

How would I override the tables rendered around the webparts in the "Rich Content" area?
I have successfully removed the tables around webpartzones and their webparts but can't figure how to remove the tables around Rich Content area webparts.
I am not using the Content Editor WebPart.
The "Rich Content" area I am using is created using the PublishingWebControls:RichHtmlField.
This is the control which has content and webparts.

I have pondered this myself in the past and have come up with two options, though neither is very appealing, so I have not implemented them:
1. Create a custom rich text field. Override Render, call base.Render using a TextWriter object and place the resulting HTML in a variable, which you then "manually" clean up before writing to output.
2. Create a custom rich text field. Override Render, but instead of calling base.Render, take care of the magic of inserting the webparts yourself. (This is probably trickier.)
Good luck!
Update, some example code I use to minimize the output of the RichHtmlField:
public class SlimRichHtmlField : RichHtmlField
{
    protected override void Render(HtmlTextWriter output)
    {
        if (IsEdit() == false)
        {
            //This will remove the label which precedes the bodytext which identifies what
            //element this is. This is also identified using the aria-labelledby attribute
            //used by for example screen readers. In our application, this is not needed.
            StringBuilder sb = new StringBuilder();
            StringWriter sw = new StringWriter(sb);
            HtmlTextWriter htw = new HtmlTextWriter(sw);
            base.Render(htw);
            htw.Flush();
            string replaceHtml = GetReplaceHtml();
            string replaceHtmlAttr = GetReplaceHtmlAttr();
            sb.Replace(replaceHtml, string.Empty).Replace(replaceHtmlAttr, string.Empty);
            output.Write(sb.ToString());
        }
        else
        {
            base.Render(output);
        }
    }
    private string GetReplaceHtmlAttr()
    {
        return " aria-labelledby=\"" + this.ClientID + "_label\"";
    }
    private string GetReplaceHtml()
    {
        var sb = new StringBuilder();
        sb.Append("<div id=\"" + this.ClientID + "_label\" style='display:none'>");
        if (this.Field != null)
        {
            sb.Append(SPHttpUtility.HtmlEncode(this.Field.Title));
        }
        else
        {
            sb.Append(SPHttpUtility.HtmlEncode(SPResource.GetString("RTELabel", new object[0])));
        }
        sb.Append("</div>");
        return sb.ToString();
    }
    private bool IsEdit()
    {
        return SPContext.Current.FormContext.FormMode == SPControlMode.Edit || SPContext.Current.FormContext.FormMode == SPControlMode.New;
    }
}
This control is then used in your page layout like this:
<YourPrefix:SlimRichHtmlField ID="RichHtmlField1" HasInitialFocus="false" MinimumEditHeight="200px" FieldName="PublishingPageContent" runat="server" />

Got it:
https://sharepoint.stackexchange.com/a/32957/7442
