I am writing a JBoss web app with Struts2 and would like to produce reports in PDF and XLS format. How can I do this? Are there popular packages that can do this for me?
Here's a list of PDF libraries for Java. We use iText extensively.
jFreeReport (scroll down on linked page to find) also offers Excel generation, though I have not used that.
Related
I am trying to parse text from a PDF file using Computer Vision 2.0. I am following the example and have changed the MediaTypeHeaderValue to "application/pdf". I get an error that the content type is not supported. When I change it to "multipart/form-data", I get a processing error instead. How do I use Computer Vision to process PDF files?
Kevin,
You are using the legacy "OCR" API, which does not support PDF input. Please use the new OCR technology available as the "Read" API; see the overview for processing PDF documents. Version 3.0 has been in GA since May. Read supports large images and multi-page, mixed-language documents up to 2000 pages long.
Please see the Read REST API QuickStart in C#.
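For readers not using C#, here is a rough Python sketch of the same idea, using only the standard library. It assumes the v3.0 Read route is `/vision/v3.0/read/analyze`, that the key goes in the `Ocp-Apim-Subscription-Key` header, and that the finished result carries its text under `analyzeResult.readResults[].lines[].text`. The function names and the example endpoint are made up for illustration, so check everything against the official reference before relying on it.

```python
import urllib.request


def build_read_request(endpoint, key, pdf_bytes):
    """Build (but do not send) the POST that submits a PDF to the Read API."""
    # Assumed route for Read v3.0; verify against the REST reference.
    url = endpoint.rstrip("/") + "/vision/v3.0/read/analyze"
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            # Raw bytes of the local file are sent as a binary body.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_lines(result):
    """Collect text lines from a finished Read result (already parsed JSON)."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

Read is asynchronous: sending the request with `urllib.request.urlopen` should return an `Operation-Location` header, and you then poll that URL with the same key header until the reported status is `succeeded` before passing the JSON to `extract_lines`.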
Note that Form Recognizer is great if you want to extract not just text, but layout insights such as tables, check-boxes, and key value pairs from forms, use pre-built models, and build custom models to process your documents. It's now in GA.
Take a look at the Form Recognizer service for extracting data from the PDF.
https://azure.microsoft.com/en-us/services/cognitive-services/form-recognizer/
We have a requirement to extract dark data from unstructured sources such as letters, rad reports, etc. Please suggest an Azure resource to extract data from common document formats (DOC, DOCX, PDF, RTF, TXT, HTML, etc.) and then analyze the extracted data.
It sounds like you just want to extract raw text or images from these rich-text documents. If that is all you need, what you really want is a set of libraries for parsing the different document formats.
Here are some libraries in Java or Python to do that. If you are using .NET, which I'm not familiar with, you can search Google or Bing for the .NET alternatives.
To parse Office documents such as DOC and DOCX: for Java, Apache POI is a good library for extracting data from MS Office files; for Python, there does not seem to be a package for this, other than using a COM object such as Word.Application, or IronPython on .NET on Windows (see Reading/Writing MS Word files in Python).
To parse PDF files: there are Apache PDFBox, jPDFText for Java and PyPDF2 for Python.
To read RTF files: Java supports this natively via javax.swing.text.rtf.RTFEditorKit, and you can find sample code by searching; as with #1, there seems to be nothing for Python.
To parse HTML files: jsoup for Java and BeautifulSoup & HTMLParser for Python are best for extracting data from HTML.
Reading TXT files is simple in any language. To extract valuable information from the text content, Stanford NLP for Java and NLTK for Python are useful, and the Azure Text Analytics API of Cognitive Services can help with tasks such as key phrase extraction and language detection.
The Apache Tika toolkit for content analysis is a good solution, too. You can even deploy it standalone and invoke its REST APIs from Python or other languages.
If you want to extract text from images, you can use the Azure Computer Vision API of Cognitive Services to extract printed or handwritten text, or use a third-party library such as Tess4J or others you can find on GitHub.
Almost all of the above depend on third-party dev kits rather than Azure resources. However, you can store these documents in Azure Storage and process them on an Azure VM or Batch service, and even analyze the extracted data in Azure Jupyter Notebook or use Azure ML for deeper research.
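As a small illustration of the HTML case, here is a minimal sketch using only Python's built-in html.parser module (the HTMLParser mentioned above). The class and function names are made up for this example, and BeautifulSoup would handle messy real-world markup more gracefully.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The extracted string can then be passed to whichever analysis step you choose, such as NLTK or the Text Analytics API.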
Is there any library that can parse and generate a PNG from a Doc, Docx and PDF file?
We're implementing a training system using Node, Sails.js, Express and SQL and would like to generate some PNG image tiles for training modules based on a file upload.
I've done some searching and found some libraries in C# that can do all 3, as well as a PDF-only implementation for Node, but can't find anything that does more than that.
A point towards any 3rd party libraries or standard implementations of this method would be great.
Thanks
You can do that sort of stuff with C# (probably only on Windows) because C# is from MS stables, the same stable that churns out doc and docx. I am not sure whether the same implementation would work on Linux or Mac (even with Mono).
If you want to achieve this in NodeJS, just create the app in C#, wrap it in a RESTful cover and call this RESTful service from NodeJS (via Kue or something similar).
Honestly, converting file formats is a compute-intensive process. I wouldn't recommend doing it on the main thread anyway. If you're gonna spawn a worker anyway, you might as well do it in C#, where it's perhaps faster.
Not necessarily an exact match for your requirement, but since you mentioned training purpose, I would recommend Watson Developer Cloud - it has document conversion among many other features which may be relevant and useful for your objective as a whole.
Speaking of the current problem, please see Document conversion overview to see how we can convert a PDF into a desired format such as HTML. Then you could actually get the PNG files from the HTML resource bundle.
Hope this helps.
This is my project topic given by my college. Can somebody please give me an idea on where to start with this topic.
I have seen a lot of topics on pdf vulnerability but the problem is they require knowing a lot of security stuff beforehand. I have less than a week to submit the project.
If somebody could just guide me to where I should start I would be really grateful.
I have already looked at Didier Stevens's site, but it's getting really tough for me to understand since there is no time.
The most important point about PDF security is that most 'popular' attacks are targeting:
application related vulnerabilities in most popular free PDF reading applications: Adobe Reader and Foxit Reader;
humans to get them to click on the malicious attachment inside PDF to initiate attack;
Check these analysis and parsing utilities and documents:
Didier Stevens's PDF tools, which include make-pdf-javascript.py (a JavaScript injection tool), pdfid.py, which scans a PDF and its embedded JavaScript for keywords, and others;
PDF Stream Dumper and its source code;
PDF Miner Py - pdf parsing library made with python;
PDF.js - JavaScript-based PDF rendering that could help you learn PDF structure parsing right from your browser console (widely used in a lot of online services such as Dropbox)
Official PDF Format Specification from Adobe for PDF 1.4 and PDF 1.7
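To get a feel for what a keyword scan does, here is a very rough Python sketch in the spirit of that kind of tool. It is not how pdfid.py is actually implemented. The keyword list is an illustrative selection made for this example, and the scan only looks at raw bytes, so it will miss names that are obfuscated or hidden inside compressed streams.

```python
import re

# Name objects that often deserve a closer look in a suspicious PDF
# (illustrative, non-exhaustive list chosen for this sketch).
KEYWORDS = [b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile"]


def count_keywords(pdf_bytes):
    """Count raw occurrences of each keyword in the bytes of a PDF file."""
    counts = {}
    for kw in KEYWORDS:
        # The lookahead keeps /JS from matching inside a longer name such as /JSON.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts
```

You would call it with the file contents, for example `count_keywords(open("sample.pdf", "rb").read())`, where sample.pdf is a placeholder name. A non-zero count for /OpenAction together with /JavaScript is a reason to inspect the file further, not proof that it is malicious.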
I have a JSF/Icefaces project and currently I'm looking for a pdf viewer. There is a library called icepdf and it seems to be a Swing library.
Are there any other Java based libraries for viewing pdf files in the web browser?
Can we use the icepdf solution as an applet in my app?
You can also look at solutions like PDF.js