Apache TIKA vs PdfBox (HTML) - text

I'm trying to convert PDF files to HTML.
When I run the PDFBox app jar as follows:
java -jar pdfbox-app-2.0.7.jar ExtractText -html 1.pdf
I'm getting a valid HTML file as expected.
But when using
tika --html 1.pdf
I'm getting a file that is missing a lot of tags, such as <b> and <i>.
I saw that Tika uses PDFBox as its PDF extractor, so I assume there must be a way to get the same result, but I can't find the right way to do it.
Any suggestions?

Related

Convert Cobertura.xml to html report

Is it possible to convert a cobertura.xml file to an html report or similar, so it can be viewed locally? I already have it pretty-printing in Jenkins but we need it locally too...
The report was produced using istanbul/nyc, and the code it reports coverage for is written in Node.js.
pycobertura converts a Cobertura XML report into a human-friendly HTML page:
pycobertura show --format html --output coverage.html cobertura.xml
Then open coverage.html in your browser to check the report.

How can I automate my webpack build to auto escape data URIs in SASS and HTML files?

The issue I am facing is that Firefox does not support # characters in data URIs. Chrome and Safari are fine with them.
Our UI developers have inlined a lot of SVGs, and these all contain data URIs,
for example in SCSS files:
content: url('data:image/svg+xml;utf8,<svg ...</svg>');
and in HTML files:
<img src='data:image/svg+xml;utf8,<svg width="234px" height="205px"...</svg>'>
There are hundreds of examples like this, and none of them work in Firefox because they contain a # character, which produces an error.
When we use %23 in place of that character, the SVGs load correctly.
How can I automate the build so that these get URL-encoded?
The string replacement has to be extremely specific: it should apply only inside img tags in HTML files and inside url('data:image/svg+xml;utf8 in less files.
If that is too hard to do with webpack, this is what I am thinking of doing: find all stroke="# and replace it with stroke="%23, and do the same for fill.
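
The replacement described above could also be sketched as a small pre-build script that runs outside webpack. The Python below is only an illustration under stated assumptions: the function name escape_svg_data_uris and the regular expression are hypothetical, and it assumes every inline SVG starts with data:image/svg+xml and ends at the first </svg>. It rewrites # to %23 only inside those data URIs and leaves everything else, such as ordinary CSS hex colours, untouched.

```python
import re

# Assumption: an inline SVG data URI runs from "data:image/svg+xml"
# up to and including the first closing </svg> tag.
SVG_DATA_URI = re.compile(r"data:image/svg\+xml[^,]*,.*?</svg>", re.DOTALL)


def escape_svg_data_uris(text: str) -> str:
    """Return text with '#' percent-encoded inside inline SVG data URIs only."""
    return SVG_DATA_URI.sub(lambda m: m.group(0).replace("#", "%23"), text)


if __name__ == "__main__":
    scss = (
        "a { content: url('data:image/svg+xml;utf8,"
        "<svg fill=\"#fff\"></svg>'); color: #000; }"
    )
    # Only the '#' inside the data URI is rewritten; 'color: #000' is kept.
    print(escape_svg_data_uris(scss))
```

Whether this runs as a separate step before webpack or the same pattern is ported into a loader depends on the build; the sketch only shows the scoped replacement itself.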

How to open an HTML file in a non-default application (browser) using Python?

Does anyone know how to open an HTML file in a browser that isn't the default browser for the HTML file type?
You can use the code below:
webbrowser.get("C:/Program Files (x86)/Google/Chrome/Application/chrome.exe %s").open("file/path/name.html")
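
A slightly more explicit variant of the same idea registers the browser under a name first and converts the file path to a file:// URL. This is a sketch rather than the only way to do it: the Chrome path is the one used above and is an assumption about your machine, and make_browser and open_html are hypothetical helper names. webbrowser.BackgroundBrowser and webbrowser.register come from the standard library.

```python
import webbrowser
from pathlib import Path

# Assumed install location (taken from the answer); adjust for your machine.
CHROME_PATH = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"


def make_browser(executable, name="custom-browser"):
    """Register `executable` under `name` and return its controller."""
    controller = webbrowser.BackgroundBrowser(executable)
    webbrowser.register(name, None, controller)
    return webbrowser.get(name)


def open_html(html_path, executable=CHROME_PATH):
    """Open a local HTML file in the given browser instead of the default one."""
    url = Path(html_path).resolve().as_uri()  # e.g. file:///C:/reports/name.html
    return make_browser(executable).open(url)


if __name__ == "__main__":
    open_html("file/path/name.html")
```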

How to share a pytest-html HTML report without breaking the CSS

I am using Python 3.6 and pytest-html to generate HTML reports.
Everything works, but when I share the HTML report with my manager, the CSS of the entire document is out of place. Can someone explain why this happens and how to fix it?
The view of the report when I run it:
The view of the report when I share the document with my manager:
Include --self-contained-html when you call pytest, for example:
pytest your.pyfilename --html=pathandfilename --self-contained-html
The resulting file will have its CSS inlined.
--html=report.html --self-contained-html
It seems you are not sharing the other items, such as the CSS, along with the HTML file. Place the CSS inside the HTML file rather than referencing it by path, and that will solve your problem.
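
If the report is generated from a script rather than typed on the command line, the same flags can be passed programmatically. The following is a minimal sketch, assuming pytest and pytest-html are installed in the current interpreter; build_pytest_args and run_with_report are hypothetical helper names, while --html and --self-contained-html are the pytest-html options mentioned above.

```python
import subprocess
import sys


def build_pytest_args(test_path, report_path):
    """Build the pytest arguments for a single-file HTML report."""
    return [test_path, f"--html={report_path}", "--self-contained-html"]


def run_with_report(test_path, report_path="report.html"):
    """Run pytest in a subprocess and return its exit code.

    Requires pytest and pytest-html to be installed (an assumption here).
    """
    cmd = [sys.executable, "-m", "pytest", *build_pytest_args(test_path, report_path)]
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(run_with_report("tests"))
```

The resulting report.html can then be shared as a single file, since its CSS travels inside it.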

JSP to Excel encoding problem. Value=?

The JSP page displays Arabic characters correctly, like this:
about something bla bla تضارب توقعات شهر أكتوبر فيق الـ و الـ
But when I export it to Excel and try to open the file, Excel says:
The file you are trying to open 'example.xls' is in a different format than specified by the file extension. Verify that the file is not corrupted and is from a trusted source before opening the file. Do you want to open the file now?
After I click Yes, the value shown above appears as:
about something bla bla ??????????????????????????
The JSP page has:
<%@ page isELIgnored="false" import="java.util.*" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
If I copy the Arabic characters and paste them into Excel, Excel shows them correctly.
I use charset=cp1254.
When I change it to charset=cp1256, the value turns into garbage characters.
Any ideas on how to fix this?
You're fooling Excel with a plain vanilla HTML file that has the wrong extension. This is not going to work flawlessly, as Excel has warned you.
You need to serve a real binary XLS file using a Servlet, not some HTML table using a JSP. You can use Apache POI HSSF or JExcelAPI for this.
