What filetypes can Crafter Search index?

Crafter Search can index attached files. For example, I can search the contents of an attached PDF...
However, is it also compatible with DOCX?
And with image metadata (for example, in a JPG)?
Is there a compatibility list somewhere?
I'm having trouble getting a DOCX and a JPG indexed, although a PDF is working perfectly.

Crafter Search relies on Solr, which uses the Apache Tika library to index binary documents. You can find the list of compatible formats here: https://tika.apache.org/1.16/formats.html (for CrafterCMS 3.0) and https://tika.apache.org/1.5/formats.html (for CrafterCMS 2.x).
For proprietary formats such as DOCX, you need to check the compatibility and the supported fields in the library documentation.
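A quick way to check what Tika will actually extract from a given file is to send it straight to Solr's ExtractingRequestHandler, which wraps Tika. A minimal sketch in NodeJS, assuming Node 18+ (built-in fetch) and a hypothetical local Solr core named "crafter"; adjust the URL to match your setup:

const fs = require('fs');

async function tikaExtract(filePath, contentType) {
  // extractOnly=true returns the parsed content instead of indexing it
  const url = 'http://localhost:8983/solr/crafter/update/extract?extractOnly=true&wt=json';
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': contentType },
    body: fs.readFileSync(filePath),
  });
  console.log(await res.json()); // extracted text plus Tika metadata
}

tikaExtract('report.docx',
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document');

If the DOCX or JPG comes back empty here, the problem is on the Tika/Solr side rather than in Crafter itself.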

Related

How to redact text in a PDF file in NodeJS

I am struggling to apply text redaction to a PDF file in an AWS Lambda function written in NodeJS. Here is a list of libraries that I have tried with no success:
pdf-lib: This library fulfils almost all the requirements, except that it doesn't redact the text permanently; this is a documented limitation: https://github.com/Hopding/pdf-lib/issues/827
PDF.js: To overcome the above limitation, I tried to convert the PDF to images so the redaction black boxes are applied permanently. Example code here: https://github.com/mozilla/pdf.js/blob/master/examples/node/pdf2png/pdf2png.js However, this library is not reliable, as it cannot extract the contents of most PDFs during the process.
Finally, pdf2pic: This library overcomes the limitation of the first library (pdf-lib) by converting the PDF into images, but it internally uses two non-Node libraries (GraphicsMagick and Ghostscript), which I am trying to avoid.
Is there a NodeJS-based solution that can apply redaction permanently to a PDF file, or any solution that can convert a PDF to images to overcome the limitation of pdf-lib?
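For reference, this is roughly what the pdf-lib approach described above looks like (a minimal sketch with hypothetical file names and hard-coded box coordinates); per issue #827 it only paints over the text, which remains extractable underneath:

const fs = require('fs');
const { PDFDocument, rgb } = require('pdf-lib');

async function coverText(inPath, outPath) {
  const pdf = await PDFDocument.load(fs.readFileSync(inPath));
  const page = pdf.getPage(0);
  // Paint an opaque black box over the sensitive region. Caveat: the
  // original text is still in the content stream, so this is visual
  // masking only, not true redaction (see pdf-lib issue #827).
  page.drawRectangle({ x: 50, y: 700, width: 200, height: 20, color: rgb(0, 0, 0) });
  fs.writeFileSync(outPath, await pdf.save());
}

coverText('in.pdf', 'out.pdf');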

Package to convert from .doc to HTML

I am using the mammoth npm package to convert DOCX to HTML, but it cannot handle .doc files.
So which package can I use to convert .doc files? I have searched many but found none.
I am using NodeJS.
I don't think you'll find a package that can convert .doc files (and if you ever find one, I doubt it will have good accuracy).
DOCX is an open file format (essentially a ZIP of XML parts), whereas the DOC format is proprietary, so it will be much harder to get the information you want out of it, if it is possible at all.
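You can see this for yourself by opening a .docx as a ZIP archive; a minimal sketch, assuming the jszip npm package and a hypothetical example.docx:

const fs = require('fs');
const JSZip = require('jszip');

async function dumpDocxXml(filePath) {
  const zip = await JSZip.loadAsync(fs.readFileSync(filePath));
  // word/document.xml holds the main body text as WordprocessingML
  const xml = await zip.file('word/document.xml').async('string');
  console.log(xml.slice(0, 500));
}

dumpDocxXml('example.docx');

Trying the same on a .doc file fails immediately, because it is a binary container rather than a ZIP of XML.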

How to index binary file in ElasticSearch without using Base64

I'm using the NodeJS elasticsearch package to interact with ElasticSearch. I have a document that has a file field. I want to be able to upload a file to the index, but the only way I have found is by using the elasticsearch-mapper-attachment plugin.
The problem is that if I use it, I have to load the whole file in memory, encode it to Base64 and then pass the String to ElasticSearch.
I'd like to be able to pass a Stream to ElasticSearch (referencing any binary file: pdf, xls, doc, ppt).
The elasticsearch-mapper-attachment plugin parses the uploaded binary file and extracts text for further indexing using the built-in Tika extractor.
Some applications (for example, Search Technologies' Aspire) run the binaries through Tika locally, extract the text, and upload just that text with the documents to index.
It might not be the answer you are looking for, but you really have just two options: use the Elastic plugin (and convert the binary to Base64 in your code prior to uploading the document), or parse the binary and extract the text in your code and then upload just that text. The former is easier; the latter gives you more control over the process.
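A minimal sketch of the former option, assuming the legacy elasticsearch npm client and a hypothetical index "docs" whose mapping declares "file" as an attachment field:

const fs = require('fs');
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({ host: 'localhost:9200' });

async function indexBinary(filePath) {
  // Reads the whole file into memory; with mapper-attachment there is
  // no streaming path, which is exactly the cost the question is about.
  const base64 = fs.readFileSync(filePath).toString('base64');
  await client.index({
    index: 'docs',
    type: 'doc',
    id: filePath,
    body: { file: base64 },
  });
}

indexBinary('report.pdf');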

Lucene 4.2.0 index pdf

I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several instances where my searches fail, i.e. they do not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, and docx files as written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank you
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene won't handle parsing different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.
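To illustrate the extraction step by itself, here is a minimal NodeJS sketch that shells out to the Tika command-line app (assuming tika-app.jar has been downloaded from tika.apache.org); the plain text it returns is what you would feed to Lucene as a document field:

const { execFileSync } = require('child_process');

// Extract plain text from a binary document via the Tika CLI; the
// resulting string can be added to a Lucene document as a text field.
function extractText(filePath) {
  return execFileSync('java', ['-jar', 'tika-app.jar', '--text', filePath],
    { encoding: 'utf8' });
}

console.log(extractText('document.docx'));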

How to save an EXIF format image file in .NET 3.5

I want to save an image in EXIF format using System.Drawing.Image.Save or a similar method in a C# application using .NET Framework v3.5. The MSDN documentation lists EXIF as an option for ImageFormat. However, it does not seem to be supported, at least not without some configuration unknown to me. When I enumerate the built-in encoders via ImageCodecInfo.GetImageEncoders(), EXIF is not included. (The built-in encoders on my machine (Vista Ultimate x64) are: BMP, JPEG, GIF, TIFF, and PNG.) If I save an image using the ImageFormat.Exif property, I simply get the default PNG format.
How can I save an image in EXIF format using .NET 3.5?
EXIF isn't an image file format per se, but a format for metadata found within JPEG images conforming to the DSC (Digital Still Camera) standard as specified by JEITA.
GDI+ (i.e. the Microsoft .NET Framework) allows you to read/write image metadata via the Image.PropertyItems property; however, the EXIF properties exposed by GDI+ are pretty cumbersome and don't convert the values the way you would expect. A lot of work is actually needed to read/write these values natively (e.g. you'd need to unpack binary fields containing specially encoded values according to the JEITA spec).
A straightforward open-source library which implements all the standard EXIF properties can be found at http://code.google.com/p/exif-utils/ - this is probably the easiest way to do it. See the included demo, which reads in a file, prints out all the EXIF properties, and then adds a property to the image.
Have you seen this: Lossless JPEG Rewrites in C#
