I'm using the Node.js elasticsearch package to interact with Elasticsearch. I have a document that has a file field. I want to be able to upload a file to the index, but the only way I have found is to use the elasticsearch-mapper-attachment plugin.
The problem is that if I use it, I have to load the whole file into memory, encode it to Base64, and then pass the string to Elasticsearch.
I'd like to be able to pass a stream to Elasticsearch instead (for any binary file: PDF, XLS, DOC, PPT).
The elasticsearch-mapper-attachment plugin parses the uploaded binary file and extracts text for further indexing using the built-in Tika extractor.
Some applications (for example, Search Technologies' Aspire) run binaries through Tika locally, extract the text, and upload just that text with the documents to be indexed.
It might not be the answer you are looking for, but you really have just two options: use the Elastic plugin (and convert the binary to Base64 in your code before uploading the document to Elastic), or parse the binary and extract the text in your code, then upload just that text to Elastic. The former is easier; the latter gives you more control over the process.
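If you go the plugin route, a minimal sketch of the Base64 approach might look like the following (assuming the legacy elasticsearch npm client, plus a hypothetical docs index whose mapping declares file as an attachment field):

```js
const fs = require('fs');
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function indexFile(path) {
  // The plugin only accepts Base64 strings, so the whole file
  // has to be read into memory before indexing.
  const data = fs.readFileSync(path).toString('base64');
  await client.index({
    index: 'docs',        // hypothetical index name
    type: 'doc',          // hypothetical mapping type
    body: { file: data }, // `file` must be mapped as an `attachment` field
  });
}

indexFile('./report.pdf').catch(console.error);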
Related
I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in Node.js. Here is a list of libraries that I have tried with no success:
pdf-lib: This library almost fulfils all the requirements, except that it doesn't redact the text permanently; this is a known limitation: https://github.com/Hopding/pdf-lib/issues/827 (see the sketch after this list).
PDF.js: To overcome the above limitation, I tried to convert the PDF to an image so the redaction black boxes are applied permanently (example code: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js). However, this library proved unreliable, as it fails to extract content from most PDFs during the process.
Finally, pdf2pic: This library overcomes the limitation of the first library (pdf-lib) by converting the PDF into images, but it internally uses two non-Node libraries (GraphicsMagick and Ghostscript) which I am trying to avoid.
Is there a Node.js-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to overcome the limitation of pdf-lib?
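To make the pdf-lib limitation concrete, the overlay approach looks roughly like this (a sketch only; file names and coordinates are made up, and the underlying text survives in the content stream):

```js
const fs = require('fs');
const { PDFDocument, rgb } = require('pdf-lib');

async function overlayRedaction(src, dst) {
  const pdf = await PDFDocument.load(fs.readFileSync(src));
  const page = pdf.getPage(0);
  // Draws an opaque box over the sensitive area, but the text
  // underneath stays in the PDF and is still extractable/copyable.
  page.drawRectangle({ x: 50, y: 700, width: 200, height: 20, color: rgb(0, 0, 0) });
  fs.writeFileSync(dst, await pdf.save());
}

overlayRedaction('input.pdf', 'redacted.pdf').catch(console.error);
```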
My script reads from a web service (JSON) that returns files (PDF, Excel, Word) in Base64. I need to extract the content and index it into Elasticsearch. What is the best library for this? Is there any way to do it without having to save the file to disk?
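If the payload is Base64, you can decode it to a Buffer and hand that straight to a parser; nothing has to touch disk. A minimal sketch for PDFs, assuming the pdf-parse npm package:

```js
const pdfParse = require('pdf-parse');

async function extractPdfText(base64) {
  // Decode the web-service payload into an in-memory Buffer.
  const buffer = Buffer.from(base64, 'base64');
  const { text } = await pdfParse(buffer);
  return text; // plain text, ready to index into Elasticsearch
}
```

For Word or Excel files you would swap in a format-specific parser, or run everything through Apache Tika as suggested in the answer above.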
I want to convert documents, images, or text files into PDF on any OS.
I tried node-msoffice-pdf, and it works fine on Windows but does not work on other operating systems.
Question:
How can I convert docs, images, and text files to PDF in Node.js?
I have used wkhtmltopdf for years to manage PDF conversion.
https://github.com/devongovett/node-wkhtmltopdf
You can either render an HTML file and pass it to the module, or render a PDF directly from a URL.
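A minimal sketch with that module (it requires the wkhtmltopdf binary to be installed and on the PATH; file names and URL are placeholders):

```js
const fs = require('fs');
const wkhtmltopdf = require('wkhtmltopdf');

// Render a PDF directly from a URL and stream it to disk.
wkhtmltopdf('http://example.com').pipe(fs.createWriteStream('page.pdf'));

// Or render from an HTML string.
wkhtmltopdf('<h1>Hello</h1><p>Rendered by wkhtmltopdf.</p>')
  .pipe(fs.createWriteStream('hello.pdf'));
```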
If fidelity/conversion quality is important to you, for Word documents (doc/docx) you could try our freemium https://www.npmjs.com/package/@nativedocuments/docx-wasm, which performs the conversion locally (i.e. where Node is running), without the need for LibreOffice etc.
In my application I am uploading a PDF file; after uploading, I need to display the information in the PDF in an HTML form. We are using Angular 2 for the frontend and Node.js for the backend. Can anyone help me with this?
Please remember: this is PDF to HTML.
You can do one thing: convert your PDF to JSON using pdf2json.
pdf2json is a node.js module that parses and converts PDF from binary to JSON format; it's built with pdf.js and extends it with interactive form elements and text content parsing outside the browser. The goal is to enable server-side PDF parsing with interactive form elements when wrapped in a web service, and also to enable parsing a local PDF to a JSON file when used as a command-line utility.
Run npm install pdf2json.
Create an empty JSON object whose keys are the main headings from the PDF (customer, age, etc.); its values are obtained from the uploaded PDF.
Fill your form with these JSON values; on saving the form, use Node.js to save it to your DB. Is this what you want?
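A minimal sketch of that flow with pdf2json (the headings and file name are hypothetical):

```js
const PDFParser = require('pdf2json');

const pdfParser = new PDFParser();

pdfParser.on('pdfParser_dataError', err => console.error(err.parserError));
pdfParser.on('pdfParser_dataReady', pdfData => {
  // Map whatever you find in the parsed output onto your own keys
  // (customer, age, ...) and send that JSON to the Angular form.
  const form = { customer: null, age: null }; // hypothetical headings
  console.log(JSON.stringify(pdfData));
  // ...populate `form` from pdfData, then persist it via your Node backend.
});

pdfParser.loadPDF('./uploaded.pdf');
```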
Simply put, what you need is to render a PDF in your application.
You could use the ng2-pdf-viewer library.
Almost all the basic functionality is exposed as properties on this component, which you can adapt to your requirements.
I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several instances where my searches fail, i.e. they do not return a document that contains the word I searched for.
I suspect this has to do with Lucene 4.2.0 not being able to correctly index non-.txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader). In general, Lucene doesn't handle parsing different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.