Extract content from file in base64 - python-3.x

My script reads from a web service (JSON) that returns files (PDF, Excel, Word) encoded in base64. I need to extract their content and index it in Elasticsearch. What is the best library for this? Is there a way to do it without saving the file to disk?
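On the "without saving to disk" part, here is a minimal sketch using only the standard library. It assumes the base64 text has already been pulled out of the JSON response; the function name `decode_to_stream` and the sample payload are made up for illustration.

```python
import base64
import io


def decode_to_stream(b64_text: str) -> io.BytesIO:
    """Decode a base64 string into an in-memory, file-like binary stream."""
    raw = base64.b64decode(b64_text)
    return io.BytesIO(raw)


# Stand-in payload (hypothetical): in practice this string comes from the JSON response.
payload = base64.b64encode(b"%PDF-1.4 sample").decode("ascii")

stream = decode_to_stream(payload)
header = stream.read(4)  # the first bytes can be inspected without touching the disk
```

The resulting `BytesIO` object can be passed to any extraction library that accepts a file-like object, so nothing needs to be written to disk.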

Related

How to index binary file in ElasticSearch without using Base64

I'm using the NodeJS elasticsearch package to interact with ElasticSearch. I have a document that has a file field. I want to be able to upload a file to the index but the only way that I have found is by using the elasticsearch-mapper-attachment plugin.
The problem is that if I use it, I have to load the whole file in memory, encode it to Base64 and then pass the String to ElasticSearch.
I'd like to be able to pass a Stream to ElasticSearch (referencing any binary file: pdf, xls, doc, ppt).
The elasticsearch-mapper-attachment plugin parses the uploaded binary file and extracts text for further indexing using the built-in Tika extractor.
Some applications (for example Search Technology's Aspire) run the binaries through Tika locally, extract the text, and upload just that text with the documents to index.
It might not be the answer you are looking for, but you really have just two options: use the Elastic plugin (and convert the binary to base64 in your code before uploading the document to Elastic), or parse the binary and extract the text in your own code, then upload just that text to Elastic. The former is easier; the latter gives you more control over the process.
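The first option (encode in your own code, then upload) could look roughly like this. The sketch is in Python rather than NodeJS, and the field name `file` and the helper `build_attachment_doc` are assumptions for illustration, not part of the plugin's API.

```python
import base64
import json


def build_attachment_doc(data: bytes, field: str = "file") -> str:
    """Return a JSON document whose `field` holds the base64-encoded bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return json.dumps({field: encoded})


# Hypothetical binary content; in practice this would be the file's bytes.
doc = build_attachment_doc(b"\x00\x01binary-content")
```

As the question points out, the whole file still has to be held in memory to do this.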

Detect correct file extension for OpenXmls?

If we are given only the XML parts of a document (as an input stream, already unzipped, or as a byte array), can we detect the file extension by parsing them? I want to know which node in which XML part determines whether this is a DOCX, PPTX, or XLSX file.
I unzipped the documents, dug around, and found this:
In \docProps\app.xml, the Application node defines it:
<Application>Microsoft Excel</Application> for Excel,
<Application>Microsoft Office PowerPoint</Application> for PowerPoint, and
<Application>Microsoft Office Word</Application> for Word.
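A minimal Python sketch of reading that node, assuming the bytes of docProps/app.xml are already in memory. The keyword matching and the helper name `guess_extension` are illustrative assumptions. The Application value is written by whatever program produced the file, so files created by other tools may carry a different value or lack app.xml entirely.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the extended-properties part (docProps/app.xml).
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def guess_extension(app_xml: bytes) -> Optional[str]:
    """Guess docx/pptx/xlsx from the <Application> node, or None if unknown."""
    root = ET.fromstring(app_xml)
    node = root.find(f"{{{EP_NS}}}Application")
    if node is None or not node.text:
        return None
    name = node.text.lower()
    # Assumption: match on the product keyword, as in the values listed above.
    if "excel" in name:
        return "xlsx"
    if "powerpoint" in name:
        return "pptx"
    if "word" in name:
        return "docx"
    return None


# Hand-written sample part for illustration only.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<Properties xmlns="' + EP_NS.encode("ascii") + b'">'
    b"<Application>Microsoft Excel</Application></Properties>"
)
result = guess_extension(SAMPLE)
```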

How to send data into an OpenOffice word template from NodeJS

How can I pass data from NodeJS into the placeholders of an OpenOffice template file? Are there any npm packages that can parse an ODT template file so that I can print data into it?
I have a 13-page Word file (a template for printing reports), and I want to populate it with certain details from the DB across the different pages of this file. I would like to pass the data in JSON format.
I know how to write to a plain text or Excel file from Node, but I want to write into the placeholders of a Word template without losing the other parts of the template. I did the same with VBScript (with a Microsoft Word template) in the past and now want to achieve it with NodeJS. Please share your ideas. Thanks.

How to pull data or files from a website using Spoon/Kettle

We need to pull data from a website using Pentaho Kettle. If anyone has some pointers, please let me know.
The files are in zip format, available through a link on the website.
Simple. Create a job that downloads the file from the website.
Then create a transformation, called from the job, that loads the zipped files (you can use Text File Input to read zipped text files as they are) and writes them to your DB.

Render image or pdf stream from SQL database in asp.net

I have a table with saved documents, some of them PDFs and some of them images.
I want to create a web app that shows the documents (which can be either PDF or JPG) in the same control.
I can display a PDF if I set Response.ContentType = "application/pdf", or an image if I set "application/jpg". The problem is: how can I get the file type when I only have the stream saved in the database? Does the stream carry the file type information?
Thanks.
No, a stream does not have a content type associated with it. If you had the original filename, you could attempt to derive the content type from that, but it wouldn't be foolproof.
Many file formats have a series of "magic bytes" that allow you to detect what might be in the file. PDF, for example, begins with the bytes "%PDF" (note: I'm not an expert on PDF, and there may be situations where that is not true).
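The magic-byte check can be sketched like this. The sketch is in Python rather than C#, the signature table covers only PDF, JPEG, and PNG, and the helper name `sniff_content_type` is made up for illustration. A match only suggests the type; it does not prove the rest of the file is valid.

```python
from typing import Optional

# (leading bytes, content type) pairs; a deliberately small, illustrative table.
MAGIC_SIGNATURES = [
    (b"%PDF", "application/pdf"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff_content_type(data: bytes) -> Optional[str]:
    """Return a likely content type based on the leading bytes, or None."""
    for magic, content_type in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return content_type
    return None


# Hypothetical leading bytes of a stored document.
guess = sniff_content_type(b"%PDF-1.7\n...")
```

Only the first few bytes of the stream are needed, so the check can run before deciding which content type to send.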
If you have no other option, you could attempt to parse the file using various libraries until you found one that worked (System.Drawing.Image.FromStream(), iTextSharp, etc).
