Indexing SVG files with SOLR - svg

I am new to Solr, but I suppose that there is an easy way to index SVG files with Solr. I have installed Solr 6.3.0 and I am using an example 'files' core. It works well, but it seems that it parses the SVG files as plain text.
Is there an easy way to take only the texts between the text tags?
Ideally, I want to combine some metadata from a JSON file with the text from the SVG files. The JSON file looks like:
{
"id":"000001",
"title":"Some diagram",
...
} ...
The associated SVG file is 000001.svg. Is there a way to create a schema in Solr that can take the fields from the JSON and merge in a field with the text from the SVG file?

The most flexible way to do what you want is to write a custom indexing utility that parses your JSON, picks up the SVG and extracts the relevant elements, then submits the complete structure to Solr. Depending on your programming language of choice you'll do this with something like SolrJ, SolrNet or another client library.
This is far more flexible and maintainable than integrating it directly into Solr. If you only want custom SVG indexing (without the additional JSON), you could use the XSLT support in the regular update handler, or an XPathEntityProcessor in a DataImportHandler configuration.
My choice would be the custom indexing code.
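As a rough illustration of the custom indexing route, here is a minimal sketch that uses only the JDK's built-in XML parser. It assumes the JSON metadata has already been read into a map by a JSON library of your choice. The class name, method names and the "svg_text" field name are made up for this example and are not part of Solr or SolrJ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
final class SvgTextExtractor {

    // Returns the trimmed text content of every <text> element, in document order.
    static List<String> extractText(String svgXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // SVG elements live in the SVG namespace
            // Many SVG files reference an external DTD; do not try to download it.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(svgXml)));
            NodeList nodes = doc.getElementsByTagNameNS("*", "text");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes text inside nested <tspan> elements.
                String text = nodes.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    texts.add(text);
                }
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse SVG", e);
        }
    }

    // Merges the JSON metadata with the extracted text into one flat document.
    // "svg_text" is an assumed field name; use whatever your schema defines.
    static Map<String, Object> buildDocument(Map<String, String> metadata, String svgXml) {
        Map<String, Object> doc = new LinkedHashMap<>(metadata);
        doc.put("svg_text", extractText(svgXml));
        return doc;
    }

    public static void main(String[] args) {
        String svg = "<svg xmlns=\"http://www.w3.org/2000/svg\">"
            + "<text x=\"0\" y=\"10\">Pump A</text></svg>";
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("id", "000001");
        metadata.put("title", "Some diagram");
        System.out.println(buildDocument(metadata, svg));
    }
}
```

With SolrJ, each entry of the resulting map would typically be copied onto a SolrInputDocument with addField before sending it to the core. That step is left out here so the sketch runs without extra dependencies.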

Related

Using a list for a feature in an ML model

I want to run a machine learning algorithm on some data, so I'm exporting the data into a file first.
But one of my features for the text I'm classifying is a list of tags,
and each text can have multiple tags, e.g. (["mystery", "thriller"]).
Is it recommended that when I write to my CSV file for exporting the data, I write that entire list as one of the features for my data (the "tags" feature)?
Or is it better to make a separate feature for each tag? The only problem then is that most examples will only have one tag, so the other feature columns for those will be blank.
So it seems like writing this list of tags as one feature makes the most sense, but then when parsing it for training, would I then treat every element of that list as its own feature still or no?
If you do it as a single feature just make sure to use some delimiter to separate the tags that won't occur in any of the tags, and also isn't a comma (as that will mess with the csv format), something like | would probably do fine. When you go to build your models and read in that list of tags you can then split it based on that delimiter. In Java this would look like:
String[] tagList = inputString.split("\\|"); // split() takes a regex, so the pipe must be escaped
I'm sure most languages will have a similar method to do this.
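For a round trip (writing the tags into one CSV cell and reading them back), a small sketch could look like this. The class and method names are hypothetical. Note that String.split treats its argument as a regular expression, and an unescaped | would split between every character, so the delimiter is quoted with Pattern.quote.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for storing a tag list in a single CSV cell.
final class TagColumn {

    // Assumed delimiter: must never appear inside a tag and must not be a comma.
    private static final String DELIMITER = "|";

    // Joins the tags into one cell value, e.g. for the "tags" column.
    static String join(List<String> tags) {
        return String.join(DELIMITER, tags);
    }

    // Splits a cell value back into tags; an empty cell means "no tags".
    static List<String> split(String cell) {
        if (cell.isEmpty()) {
            return Collections.emptyList();
        }
        // Pattern.quote makes the delimiter a literal instead of a regex operator.
        return Arrays.asList(cell.split(Pattern.quote(DELIMITER)));
    }

    public static void main(String[] args) {
        String cell = join(Arrays.asList("mystery", "thriller"));
        System.out.println(cell);
        System.out.println(split(cell));
    }
}
```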

Is it possible to add an image to a PDF without rendering the PDF?

I'm looking at adding an image to an existing PDF in Node.js. None of the PDF libraries I found appear to have the ability to modify an existing PDF though, so I'm planning on implementing it myself. I'm trying to figure out if it's too much work, as I can always do it server side using iTextPDF instead, but I'd prefer to do it in my app (Electron which uses Node.js).
If I just want to modify an existing PDF and add an image, will I have to write a complete rendering library or is PDF structured in such a way that I can write a very small parser that just gets the page I want and inserts an image using the correct format?
Specifically, I'm asking because I've previously looked into writing a text extraction library, but in order to get the position of text you have to render pretty much the entire PDF because of how positioning is handled. That's too much work to get around server side processing in this case.
To be clear, just asking if it's possible to do, not how to do it (don't want to be too broad, I'm sure I can figure that part out).
To perform even a small manipulation of a PDF, you'll need to implement generalized reading, decompression, encryption and traversal of PDF data structures. Some of the things you would need to handle include:
basic parsing of PDF syntax
indexing via the cross-reference table, and/or cross-reference streams and object streams
objects (numbers, byte strings, hex strings, dictionaries, arrays, booleans...)
filters and variants (LZW, Flate, RunLength, Predictors)
encryption (RC4, AES, Custom security handlers)
page tree traversal
basic handling of page content streams
image handling
serialization, either rewriting of the entire PDF, or incremental updates to an existing PDF
Anything's possible, but realistically, you will need a PDF library or toolkit, client or server-side, to accomplish this.
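To give a feel for the scale, here is a sketch of only the very first step: finding where the cross-reference data starts by reading the number that follows the last startxref keyword near the end of the file. The class and method names are hypothetical. It does not parse the cross-reference data itself and ignores cross-reference streams, encryption and incremental updates, which is where most of the work lies.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; covers only the "where does the xref start" lookup.
final class PdfTrailer {

    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword or the number after it is missing.
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indices line up with byte offsets. Real code would read only
        // the tail of the file instead of decoding all of it.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no offset after it
        }
        return Long.parseLong(text.substring(start, i));
    }

    public static void main(String[] args) {
        byte[] pdf = ("%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
            + "xref\n0 1\n0000000000 65535 f \n"
            + "trailer\n<< /Size 1 >>\nstartxref\n29\n%%EOF\n")
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartXref(pdf));
    }
}
```

Everything after this point (reading the table, resolving objects, decoding streams, writing the modified file) is what a PDF library takes care of for you.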

Lucene 4.2.0 index pdf

I am using example source code from the Lucene 4.2.0 demo API:
http://lucene.apache.org/core/4_2_0/demo/overview-summary.html
I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and notice several cases where my searches fail, i.e. they do not return a document that contains the word I searched for.
I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.
Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?
Thank You
No, it can't. IndexFiles is a demo, an example for you to learn from, but not really designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped with an InputStreamReader, wrapped with a BufferedReader). Generally, Lucene won't handle how to parse different file formats (except its own index files, of course). How to parse a file to provide meaningful content to Lucene is up to you to define.
Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
You might also consider using Solr.
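To see why words go missing when a binary format is read as plain text, here is a small JDK-only sketch. It builds a stand-in for a .docx (just a ZIP archive with one XML part, not a real Word document), then compares what a plain reader sees with what you get after unpacking the container first. All class and method names are made up for this illustration, and readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demonstration class; not part of Lucene or Tika.
final class RawVersusParsed {

    // Builds a tiny stand-in for a .docx: a ZIP archive holding one XML part.
    static byte[] fakeDocx(String bodyText) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(out)) {
                zip.putNextEntry(new ZipEntry("word/document.xml"));
                zip.write(("<doc>" + bodyText + "</doc>").getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What a plain Reader over the file effectively yields:
    // the compressed bytes decoded as if they were text.
    static String readAsPlainText(byte[] file) {
        return new String(file, StandardCharsets.UTF_8);
    }

    // What a format-aware parser does first: unpack the container
    // and read the parts stored inside it.
    static String readUnzipped(byte[] file) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(file))) {
            StringBuilder content = new StringBuilder();
            while (zip.getNextEntry() != null) {
                content.append(new String(zip.readAllBytes(), StandardCharsets.UTF_8));
            }
            return content.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = fakeDocx("lucene indexing demo lucene indexing demo");
        System.out.println(readAsPlainText(file).contains("indexing"));
        System.out.println(readUnzipped(file).contains("indexing"));
    }
}
```

A library such as Tika does the format-specific part of this for real documents, so the text you hand to Lucene is the extracted content rather than the raw bytes.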

Add custom XMP Tags

I am looking for a tool or a way (.NET) to add custom XMP fields. Also, can someone explain the purpose of needing to know if the XMP tag is a textfield, textarea or a select?
XMP is written inside files as an XML packet or as a separate XML file. The XMP specification uses a subset of RDF/XML. So you could look at (RDF/)XML manipulation tools.
For embedded XMP packets however, the packet length needs to be calculated and written at the start of the packet, so it may help to have a purpose-built library. Adobe provides an XMP SDK (C++) for that.
XMP supports several content types for fields, like Text, Number or URL. Text fields, for example, could be restricted to values from a controlled vocabulary, for which it may make sense to use a select or dropdown form element in a GUI.
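To make the RDF/XML structure concrete, here is a minimal sketch that builds a packet with one simple text property in a custom namespace and reads it back with a standard XML parser. It is written in Java rather than .NET, but the same approach works with any XML API. The namespace URI, prefix and field name are placeholders you would choose yourself, the class is hypothetical, and the result is only suitable as a starting point for a standalone file. Embedding it into a host file involves format-specific steps that are not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of the Adobe XMP SDK.
final class XmpPacketSketch {

    // Builds a minimal XMP packet with one custom text property.
    static String build(String namespaceUri, String prefix, String field, String value) {
        return "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n"
            + " <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n"
            + "  <rdf:Description rdf:about=\"\" xmlns:" + prefix + "=\"" + namespaceUri + "\">\n"
            + "   <" + prefix + ":" + field + ">" + escape(value)
            + "</" + prefix + ":" + field + ">\n"
            + "  </rdf:Description>\n"
            + " </rdf:RDF>\n"
            + "</x:xmpmeta>\n"
            + "<?xpacket end=\"w\"?>";
    }

    // Escapes the characters that are not allowed verbatim in XML text.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Reads the property back; returns null if it is not present.
    static String readField(String packet, String namespaceUri, String field) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(packet)));
            NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, field);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XMP packet", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace and field, chosen only for this example.
        String packet = build("http://example.com/ns/1.0/", "ex", "Reviewer", "A & B");
        System.out.println(packet);
        System.out.println(readField(packet, "http://example.com/ns/1.0/", "Reviewer"));
    }
}
```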

Can Solr index/search static files?

I've been reading this but I was just wondering, does Solr have the capability to search static files (i.e. outside of a content management system or a database)?
Some of my files are just straight up html...or server side code with html "blocks"...
Solr can index any text input. The important bit is that it indexes text. So if your static files are not text files, you may need to run them through a tool like Tika first. Then Solr should have no problem indexing the extracted textual data.
There is the ExternalFileField field type, but its use looks limited.
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
