Hit Location in SOLR When Indexing a Large Text

We are creating a Solr index of a corpus of digital books that are marked up in XML. The index is to be used for NLP analysis of the corpus as well as for locating passages within it. I am processing each text into a JSON Solr document and posting that to the index.
The XML of each book includes markers for all the page breaks in the text, each marked with a "milestone" tag. I have created the initial version of the index by stripping all the milestones out of the body of the text and putting the text of the whole book into a single field, so that strings or passages that cross a page break can be found.
My question is: How can I determine the location of hits in a text (i.e. what page they are on) when a search is made?
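One possible approach, sketched here in Python, is to record where each page starts while the milestones are being stripped, then map a hit's character offset back to a page with a binary search. The milestone syntax (`<milestone unit="page" n="…"/>`), the function names, and the idea of keeping the offset table alongside the indexed text are all assumptions for illustration; the sketch also assumes the character offset of a hit in the stripped text is available.

```python
import bisect
import re

# Assumed markup: page breaks look like <milestone unit="page" n="12"/>.
MILESTONE = re.compile(r'<milestone\b[^>]*\bn="([^"]+)"[^>]*/>')


def strip_milestones(marked_up, first_page="1"):
    """Return (plain_text, starts, pages).

    pages[i] is the label of the page that begins at plain_text[starts[i]].
    """
    parts, starts, pages = [], [0], [first_page]
    length, last_end = 0, 0
    for match in MILESTONE.finditer(marked_up):
        chunk = marked_up[last_end:match.start()]
        parts.append(chunk)
        length += len(chunk)
        starts.append(length)          # the new page starts here in the stripped text
        pages.append(match.group(1))   # page label taken from the n attribute
        last_end = match.end()
    parts.append(marked_up[last_end:])
    return "".join(parts), starts, pages


def page_at(offset, starts, pages):
    """Return the label of the page containing the given character offset."""
    return pages[bisect.bisect_right(starts, offset) - 1]


marked = 'It was a dark <milestone unit="page" n="2"/>and stormy night.'
text, starts, pages = strip_milestones(marked)
hit = text.find("dark and")
first = page_at(hit, starts, pages)
last = page_at(hit + len("dark and") - 1, starts, pages)
print(first, last)  # a hit that crosses the break reports both pages
```

Because the offsets refer to the stripped text, a passage that crosses a page break is still found as one hit, and its start and end offsets simply resolve to different pages.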

Related

How can I easily get search context around search term with Typesense?

I currently use Typesense to search in an HTML database. When I search for a term, I would like to retrieve N characters before and N characters after the term found in search.
For example, I search for "query" and this is the sentence that matches:
Let's repeat the query we made earlier with a group_by parameter
I would like to easily retrieve a fixed number of letters (or words) before and after the term, without breaking any words, so that I can show it in the presumably small area where the search results are displayed.
For this particular example, I would be showing:
..repeat the query we made earlier..
Is there a feature like this in Typesense?
I have checked Typesense's documentation without any luck.
The feature you're referring to is called snippets/highlights and it's enabled by default. You can control how many words are returned on either side of the matched text using the highlight_affix_num_tokens search parameter, documented under the table here: https://typesense.org/docs/0.23.1/api/search.html#results-parameters
highlight_affix_num_tokens
The number of tokens that should surround the highlighted text on each side. This controls the length of the snippet.
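As a minimal sketch of passing this parameter, the Python snippet below builds a search URL by hand. The host, the collection name `pages`, and the field name `content` are assumptions, not values from the question; a real request would also need the API key header.

```python
from urllib.parse import urlencode


def build_search_url(base_url, collection, query, query_by, affix_tokens):
    """Build a Typesense document search URL with a custom snippet width."""
    params = {
        "q": query,
        "query_by": query_by,
        # Number of tokens kept on each side of the highlighted match.
        "highlight_affix_num_tokens": affix_tokens,
    }
    return f"{base_url}/collections/{collection}/documents/search?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_search_url("http://localhost:8108", "pages", "query", "content", 3)
print(url)
# Send it with the X-TYPESENSE-API-KEY header set to your search key.
```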

Can Elasticsearch search long documents?

I have a study project on identifying text content, and it must use JS. The input is a paragraph of at least 15 lines, which is searched for in 100 text files of 3 to 5 pages each. The output is the text file that has the same content as the input text.
Can Elasticsearch solve this, or can you recommend other solutions?
I found a blog entry at https://ambar.cloud/blog/2017/01/02/es-large-text/ that answers your question. It has an in-depth example similar to yours.
Elasticsearch can deal with large documents and still perform well, but for cases like yours it's important to set up the index correctly.
Let's suppose you have Elasticsearch documents with a text field holding 3 to 5 pages' worth of text.
When you query for documents that contain a paragraph in the large text field, Elasticsearch searches through all the terms from all the documents and their fields, including the large text field.
During the merge, Elasticsearch collects all the found documents into memory, including the large text field. After building the results in memory, Elasticsearch tries to send these large documents as a single JSON response. This is very expensive in terms of performance.
Elasticsearch should handle the large text field separately from the other fields. To do this, set the parameter store: true for the large text field in the index mapping. This tells Elasticsearch to store the field separately from the document's other fields. You should also exclude the large text field from _source by adding this to the index mapping:
"_source": {
  "excludes": [
    "your_large_text_field"
  ]
}
If you set up your indexes this way, the large text field is kept out of _source. Querying the large text field is then much more efficient, since it is stored separately and does not have to be merged with _source.
To conclude: yes, Elasticsearch can handle searching large text fields, and the blog post reports that these extra settings increased search performance by 1100 times.
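Putting the two settings together, a create-index request body might look like the sketch below, written as a Python dictionary. It assumes the typeless mapping layout used by recent Elasticsearch versions (older versions nest the mapping under a type name), and the helper name is hypothetical.

```python
import json


def large_text_index_body(field_name):
    """Build a create-index body that stores one large field outside _source."""
    return {
        "mappings": {
            # Keep the large field out of the _source document.
            "_source": {"excludes": [field_name]},
            "properties": {
                # store: true keeps a separately retrievable copy of the field.
                field_name: {"type": "text", "store": True},
            },
        }
    }


body = large_text_index_body("your_large_text_field")
print(json.dumps(body, indent=2))
```

With this mapping the field no longer comes back in _source, so a search that needs its contents has to request it explicitly through stored_fields.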

SOLR - match tags against text

I have a Solr collection of tags with 100k records. It has a simple structure. Example node:
{
"id": "57301",
"name": "Roof repair"
}
The task is to automatically bind a list of tags to any input text using the Solr search engine. Our current algorithm is:
1. Send the whole text as a query to the tags collection, searching it against the "name" field. We receive a big list of tags.
2. Loop over the tags received in step 1 and send one request per tag to another collection, which contains the document with the input text (its id is known). Example query: id:38373 AND _text_:"Roof repair". If this query returns any results, we add Roof repair to the matched tags.
3. Finally, we have a checked tag list for the given input text.
The quality of this automatic tag binding is good (for us, of course).
But we have a performance problem: some texts yield 10k tags in step 1, and each tag is then checked in step 2 with an HTTP request to Solr. 10k requests is far too many. We can cut down the number of tags we analyse, but then the tag-linking quality becomes much worse.
Is there a way to match the Solr tag collection against a text without a separate request for each tag?
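One way to avoid the per-tag requests, sketched here under the assumption that the input text is available to the client, is to verify the candidate tags from step 1 locally instead of querying Solr for each one. This is a different technique from the phrase query above: the tokenizer below is a plain lowercase word split, not the analyzer configured on the _text_ field, so results can differ where stemming or synonyms are involved. The function names are hypothetical.

```python
import re


def tokenize(s):
    """Lowercase word tokens; a stand-in for the real field analyzer."""
    return re.findall(r"\w+", s.lower())


def tags_in_text(text, candidate_tags):
    """Return the candidate tags that occur in text as an exact token phrase."""
    tokens = tokenize(text)
    tag_tokens = {tag: tuple(tokenize(tag)) for tag in candidate_tags}
    # Only build phrases of the lengths that actually occur among the tags.
    lengths = {len(t) for t in tag_tokens.values() if t}
    phrases = set()
    for n in lengths:
        for i in range(len(tokens) - n + 1):
            phrases.add(tuple(tokens[i:i + n]))
    return [tag for tag, t in tag_tokens.items() if t in phrases]


matched = tags_in_text(
    "We offer roof repair and window cleaning.",
    ["Roof repair", "Roof painting", "Window cleaning"],
)
print(matched)
```

The text is scanned once per distinct tag length, so checking 10k candidates costs set lookups rather than 10k HTTP round trips.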
Please elaborate on your question. I didn't follow the first part, or how the second step arrives at id:38373 AND _text_:"Roof repair".
You wrote: "First we send whole text as query to tags collection. We receive a big list of tags."
Does that mean you are searching the whole text in the "name" field?

Does Solr store the original contents of the document after indexing?

If I mark a field as "don't store," does Solr retain the original contents of that field anywhere, or does it only retain the "bag of words" that it culls for the index itself?
I'm asking from the standpoint of document security. If someone cracked into the machine running our Solr index, could they get the original text passed into Solr for this "don't store" field, or not?
No, the Solr index does not store the original value in any retrievable or viewable way for fields that are set to stored="false". Common Field options on the Solr wiki states the following behavior of setting the stored option.
True if the value of the field should be retrievable during a search
If someone cracked into the machine running the Solr index and ran Solr queries, they would not be able to see the contents of the field, because Solr would not return it. However, if they had access to the disk and to the actual index folder and segment files written by Lucene, they could see the terms that Solr stored for each document in that field, using Luke (the Lucene Index Toolbox) to examine the index folder.
When a field is Field.Store.NO, only enough information is stored for Lucene to perform the search.
However, if you specify the WITH_POSITIONS_OFFSETS term vector option when constructing each field, there is usually enough information to retrieve:
lowercase(EXACTSTRINGINDEXED) - LUCENEDELIMITERS - STOPWORDS
For example, if you indexed:
Jerry&Mary's Live Bait and Yellow Cab
with an analyzer that treats "&" and "'" as delimiters, did not index single letters, and treated 'and' as a stopword, you would see in the index something like:
jerry mary live bait [null word] yellow cab
(You can verify this with Luke, as mentioned above.)
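The example above can be reproduced with a rough Python imitation of that analyzer. This is only a sketch of the described behavior (lowercasing, splitting on delimiters, skipping single letters, leaving a gap for the stopword), not Lucene's actual analysis chain; the function name and the choice to split on every non-alphanumeric character are assumptions.

```python
import re

STOPWORDS = {"and"}
GAP = "[null word]"  # placeholder for the position a removed stopword leaves behind


def indexed_view(text):
    """Approximate what could be reconstructed from positions and offsets."""
    out = []
    # Lowercase, then split on anything that is not a letter or digit.
    for token in re.split(r"[^a-z0-9]+", text.lower()):
        if len(token) <= 1:
            continue  # empty strings and single letters are not indexed
        out.append(GAP if token in STOPWORDS else token)
    return " ".join(out)


print(indexed_view("Jerry&Mary's Live Bait and Yellow Cab"))
# jerry mary live bait [null word] yellow cab
```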

How to retrieve paragraphs by content from a Word file using Open XML and C# 4.0?

I am using C# 4.0 and the Open XML SDK 2.0 to access a Word file. I want to retrieve paragraphs based on a given text: if a paragraph contains my text, retrieve that paragraph.
For example:
Given word: TEST
Retrieve the paragraphs that contain the word "TEST".
I want to search for the given word in each paragraph. If any matches are found, I want to display those paragraphs. If no match is found, there is no need to get the paragraph.
How do I do this?
The main content of a Word document is stored in the body element.
At the simplest level, paragraphs can be located using LINQ queries performed on the document:
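// Requires: using System.Linq; using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
using (WordprocessingDocument document = WordprocessingDocument.Open(documentStream, true)) {
    // InnerText concatenates the text of every run in the paragraph.
    foreach (Paragraph p in document.MainDocumentPart.Document.Body.Descendants<Paragraph>()
                 .Where(para => para.InnerText.Contains("SOME TEXT"))) {
        // Do something with the matching paragraph.
    }
}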
However, I would advise that the problem is a little more complicated than this. Under each paragraph there may be more than one Run (a span of text that shares the same formatting), each containing part of the paragraph's text. It is quite likely that the text the user entered as "SOME TEXT" is split across several runs, so matching against a single run can miss it.
Still, this should point you in the right direction.
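To illustrate the point about runs, here is a small Python sketch (not the Open XML SDK) that works directly on the WordprocessingML markup. It joins the text of all runs in each paragraph before searching, so a word split across runs is still found. It assumes the XML of the main document part (word/document.xml inside the .docx package) has already been read into a string; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraphs_containing(document_xml, needle):
    """Return the text of each paragraph whose combined run text contains needle."""
    root = ET.fromstring(document_xml)
    hits = []
    for para in root.iter(W + "p"):
        # Join every w:t element so text split across runs is reassembled.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if needle in text:
            hits.append(text)
    return hits


sample = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>A TE</w:t></w:r><w:r><w:t>ST here</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Nothing</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
print(paragraphs_containing(sample, "TEST"))
```

In the sample, "TEST" is split between two runs, and only the joined paragraph text matches it.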
