Is there any library that can parse and generate a PNG from a Doc, Docx and PDF file?
We're implementing a training system using Node, Sails.js, Express and SQL and would like to generate some PNG image tiles for training modules based on a file upload.
I've done some searching and found some C# libraries that can handle all three, as well as a PDF-only implementation for Node, but I can't find anything that does more than that.
A pointer towards any third-party libraries or standard implementations of this would be great.
Thanks
You can do that sort of stuff with C# (probably only on Windows) because C# comes from Microsoft, the same company that produces the doc and docx formats. I am not sure whether the same implementation would work on Linux or Mac (even with Mono).
If you want to achieve this in Node.js, just create the app in C#, wrap it in a RESTful service, and call that service from Node.js (via Kue or something similar).
Honestly, converting file formats is a compute-intensive process. I wouldn't recommend doing it on the main thread anyway. If you're going to spawn a worker anyway, you might as well do it in C#, where it's perhaps faster.
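To illustrate the "call a RESTful service from Node" idea, here is a minimal sketch. The service URL, the `format` query parameter, the `X-File-Name` header, and the assumption that the service replies with raw PNG bytes are all hypothetical; they stand in for whatever the C# service actually exposes.

```javascript
// Sketch only: the endpoint, the "format" query parameter, the X-File-Name
// header, and the response shape are hypothetical placeholders.

// Build the HTTP request for a conversion job (pure function, easy to test).
function buildConversionRequest(serviceUrl, fileName, fileBuffer) {
  const url = new URL(serviceUrl);
  url.searchParams.set('format', 'png');
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/octet-stream',
        'X-File-Name': encodeURIComponent(fileName),
      },
      body: fileBuffer,
    },
  };
}

// Send the file to the service and resolve with the returned bytes.
// Uses the global fetch(), which is available in Node 18 and later.
async function convertViaService(serviceUrl, fileName, fileBuffer) {
  const { url, options } = buildConversionRequest(serviceUrl, fileName, fileBuffer);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP ${response.status}`);
  }
  return Buffer.from(await response.arrayBuffer());
}
```

Because the heavy work happens in the other process, the Node event loop only waits on the HTTP response. A queue such as Kue would wrap the call to `convertViaService` in a job handler.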
Not necessarily an exact match for your requirement, but since you mentioned training purpose, I would recommend Watson Developer Cloud - it has document conversion among many other features which may be relevant and useful for your objective as a whole.
As for the current problem, please see the Document conversion overview for how to convert a PDF into a desired format such as HTML. You could then extract the PNG files from the HTML resource bundle.
Hope this helps.
Okay, so this is a pretty generic and vague question, so please let me elaborate.
We have a large codebase that we have been splitting up over the past years into more individual, self-contained libraries.
One of the larger and more unwieldy parts is our Word export module. It uses docx4j currently, however we run into memory issues with large exports with a lot of pictures. Besides that, it is pretty difficult to update the exporter due to changes in our domain model.
It has been a while since someone worked on it (like years...), so I took it upon myself to investigate the state of generating Word documents in 2021. I hoped a lot had changed, but some Google searches led me to posts from 2010 and libraries from 2012. Of course, it may be that a library from 2012 is simply that good.
I have identified the following solutions, though I am probably missing a lot:
Docx4j (JVM), still maintained, we run into memory problems with that.
Docx4j with Content Control Data Binding. Seems to be some way to use templating?
Apache POI (JVM), have some okay experience with the Excel part, no experience with the Word part. The 'consensus' online appears to be that Docx4j is more user-friendly.
JasperReports. Don't know anything about that.
DocX, .NET library, no experience.
Office Add-In using Office.js (JS). Official API from Microsoft. Runs on the client inside Word, so it requires a connection to an API.
docxtemplates (Node / Browser). No experience. Looks complete, don't know about performance though.
officegen (Node). Last release 2019.
Carbone (node). https://github.com/Ideolys/carbone. No experience also.
probably more...
So, as expected a lot of libraries in JS popping up as well.
Looking at my requirements:
using a template would be nice
running it as a service would be nice
efficient (memory wise, don't mind if it takes some time to generate)
We have quite a good JSON API available, which is very easy to maintain and maps pretty well to our domain model. My preference would be to use that as a source, of course.
What are people's experiences, and am I missing some very good libraries out there?
I am working on an OCR solution using the Azure Read API, and it provides an out-of-the-box solution for raster PDFs:
https://learn.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text#read-api
but I don't see whether it supports vector-based PDFs. I have another solution using third-party libraries such as Aspose and PDFxStream, but I would prefer to stay within the Azure Vision API ecosystem.
So my question is: is it possible to use the Read API for vector PDFs, and if not, what is the best practical approach?
To answer my own question: yes, it supports vector-based PDFs, although this is not explicitly mentioned in the API documentation. We checked both through the Azure portal and through API code, and it works. There is no problem with mixing raster and vector-based PDFs.
I have a requirement where I have a config file with a bunch of properties. The user has to download the property file from the server using a browser. Some of these properties have to be changed based on the user's input before the file is downloaded. This fits the description of templating perfectly: keep a template and generate the file at run time by substituting the values provided by the user. How can I achieve this using Node.js? Any pointers will be deeply appreciated. Please pardon my limited knowledge of the MEAN stack.
Template engines are a common thing and it's quite easy to use one with express.
I suggest you start with the docs on using template engines with express. They also have a wiki entry with a list of available engines.
Most template engines are meant to generate HTML; if you want to output something else (even plain text), it can sometimes be a bit tricky.
Otherwise the choice mainly depends on what you're familiar with. I can recommend Mozilla Nunjucks.
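For a plain-text properties file, the substitution step itself is small. The sketch below uses a hand-rolled replacer rather than Nunjucks or any other engine; the `{{name}}` placeholder syntax, the property names, and the sample values are assumptions made for illustration.

```javascript
// Hand-rolled substitution for illustration only; a real project would more
// likely use a template engine. The {{name}} syntax is an assumption.
function renderTemplate(template, values) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) => {
    // Fail loudly instead of silently emitting an unfilled placeholder.
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Missing value for placeholder "${key}"`);
    }
    return String(values[key]);
  });
}

// Hypothetical properties template; in practice this would be read from disk.
const template = [
  'server.host={{host}}',
  'server.port={{port}}',
  'app.mode=production',
].join('\n');

// Values that would normally come from the user's form input.
const rendered = renderTemplate(template, { host: 'example.org', port: 8080 });
console.log(rendered);
```

In an Express route handler, the rendered string could then be returned as a download with something like `res.attachment('app.properties').type('text/plain').send(rendered)`.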
I want to make a Node.js server that converts documents into ppt presentations. I think I will use Open Office for this job, but I am not sure how to start. Could someone point me in a good direction, and maybe to some tutorials on how to use Open Office, possibly from other programming languages?
If you're not planning on doing the conversion yourself, you should start by finding a command-line tool or a Node.js library that does the conversion. Then build the web service around it.
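As a rough sketch of the "wrap a command-line tool" approach, the code below shells out to LibreOffice's `soffice` binary in headless mode. That `soffice` is installed and on the `PATH` is an assumption, and whether a given source/target format pair is supported depends on the installed filters, so verify that before relying on it. The function names are hypothetical.

```javascript
const { execFile } = require('node:child_process');
const path = require('node:path');

// Build the argument list for a headless conversion.
// Assumes targetFormat is a plain extension such as "pdf" or "ppt".
function buildConvertArgs(inputPath, targetFormat, outDir) {
  return ['--headless', '--convert-to', targetFormat, '--outdir', outDir, inputPath];
}

// Run the converter in a child process so the Node event loop stays free.
// Resolves with the path where the converted file is expected to be written.
function convertDocument(inputPath, targetFormat, outDir) {
  return new Promise((resolve, reject) => {
    const args = buildConvertArgs(inputPath, targetFormat, outDir);
    execFile('soffice', args, (error) => {
      if (error) {
        reject(error);
        return;
      }
      const baseName = path.basename(inputPath, path.extname(inputPath));
      resolve(path.join(outDir, `${baseName}.${targetFormat}`));
    });
  });
}
```

A web service would accept the upload, save it to a temporary directory, call `convertDocument`, and stream the resulting file back to the client.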
I am looking into the simplest way to integrate Wikipedia into a node.js app.
The requirements are to be able to search for entries and find entities in each entry.
Any known existing libs/methods for that?
Thanks
There's a newly available open source parser for wiki text (http://sweble.org/) that might be useful to you if you roll your own solution. Of course, that would require you to download the Wikipedia data dump, parse it, and store the entities in a database.
You could also look at DBpedia (http://dbpedia.org/About), though that would require integrating the RDF stack into your app (either running a local RDF repository or communicating with the often flaky online version via SPARQL).
One easy approach is to use a search engine api and restrict to site:wikipedia.org - e.g:
http://www.google.com/search?q=node.js+site%3Awikipedia.org
I've found that can work really well.
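A small sketch of building that kind of site-restricted query URL in Node follows. The helper name is hypothetical, and the code only constructs the URL; fetching and parsing the results is left out.

```javascript
// Build a search URL restricted to a single site, matching the example above.
// URLSearchParams encodes spaces as "+" and ":" as "%3A".
function buildSiteSearchUrl(query, site) {
  const params = new URLSearchParams({ q: `${query} site:${site}` });
  return `http://www.google.com/search?${params.toString()}`;
}

console.log(buildSiteSearchUrl('node.js', 'wikipedia.org'));
// prints "http://www.google.com/search?q=node.js+site%3Awikipedia.org"
```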
Spider for scraping using jquery is fantastic:
https://github.com/mikeal/spider
Mikeal is the man
Presumably you'd be using this for a side (personal) project though. Not sure how kosher it is to run wild on wikipedia with a scraper.