Can Elasticsearch search long documents? - node.js

I have a study project about identifying text content, and it must use JS. The input is a paragraph of at least 15 lines, to be searched across 100 text files of 3 to 5 pages each. The output is which text file contains the same content as the input text.
Can Elasticsearch solve this? Or can you recommend some solutions?

I found a blog entry at https://ambar.cloud/blog/2017/01/02/es-large-text/ that answers your question. It contains an in-depth example similar to yours.
ElasticSearch can deal with large documents and still perform well, but for cases like yours it's important to set up the index correctly.
Let's suppose you have ElasticSearch documents with a text field containing 3 to 5 pages' worth of text.
When you try to query documents that contain a paragraph in the large text field, ElasticSearch will perform a search through all the terms from all the documents and their fields, including the large text field.
During the merge, ElasticSearch collects all the found documents into memory, including the large text field. After building the results in memory, ElasticSearch will try to send these large documents as a single JSON response. This is very expensive in terms of performance.
ElasticSearch should handle the large text field separately from other fields. To do this, in the index mapping you should set the parameter store:true for the large text field. This tells ElasticSearch to store the field separately from the document's other fields. You should also exclude the large text field from _source by adding this parameter in the index settings:
_source: {
  excludes: [
    "your_large_text_field"
  ]
}
If you set your indexes this way, the large text field will be separated from _source. Querying the large text field is now much more effective since it is stored separately and there is no need to merge it with _source.
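For illustration, here is a minimal sketch of what the combined index mapping could look like (the field name is taken from the snippet above; the exact syntax varies by ElasticSearch version, e.g. older versions use "string" instead of "text" and nest the mapping under a type name):

{
  "mappings": {
    "_source": {
      "excludes": [ "your_large_text_field" ]
    },
    "properties": {
      "your_large_text_field": {
        "type": "text",
        "store": true
      }
    }
  }
}

With a mapping like this, queries that need the large field returned would request it through the stored-fields mechanism instead of pulling it out of _source.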
To conclude: yes, ElasticSearch can handle searching large text fields, and with some extra settings it can improve search performance dramatically (the linked post reports up to 1100 times).

Related

SOLR - match tags against text

I have a tags Solr collection with 100k records. It has a simple structure; an example document:
{
    "id": "57301",
    "name": "Roof repair"
}
The task is to automatically bind a tag list to any input text using the Solr search engine. Our current algorithm is:
First we send the whole text as a query to the tags collection, searching the whole text against the "name" field. We receive a big list of tags.
Then we send requests in a loop (over the tags received at step 1) to another collection that contains the document with the input text (its id is known). Example query:
id:38373 AND _text_:"Roof repair"
If this query returns any results, we add "Roof repair" to the matched tags.
Finally, we have a checked tag list for the given input text. The quality of this automatic tag binding is good (for our purposes, of course).
But we have a performance problem: some texts yield 10k tags at step 1, and each tag is then checked at step 2 with its own HTTP request to Solr. 10k requests is far too many. We could cap the number of tags to analyse, but the tag-linking quality becomes much worse.
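Roughly, the current flow looks like this (a sketch only; the collection names "tags" and "documents", the local Solr URL, and Node 18's built-in fetch are assumptions, not part of the original setup):

const SOLR = 'http://localhost:8983/solr';

async function matchTags(docId, inputText) {
  // Step 1: query the tags collection with the whole input text against "name"
  const tagResp = await fetch(`${SOLR}/tags/select?` + new URLSearchParams({
    q: `name:(${inputText})`,
    fl: 'name',
    rows: '10000'
  })).then(r => r.json());
  const candidates = tagResp.response.docs.map(d => d.name);

  // Step 2: one request per candidate tag -- this loop is the performance problem
  const matched = [];
  for (const tag of candidates) {
    const check = await fetch(`${SOLR}/documents/select?` + new URLSearchParams({
      q: `id:${docId} AND _text_:"${tag}"`,
      rows: '0'                       // only need numFound, not the documents
    })).then(r => r.json());
    if (check.response.numFound > 0) matched.push(tag);
  }
  return matched;
}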
Is there a way to match the Solr tag collection against a text without a cyclic request for each tag?
Please elaborate your question. I didn't get the first part, and for the second part, how does this query come about: id:38373 AND _text_:"Roof repair"?
"First we send the whole text as a query to the tags collection. We receive a big list of tags." ?
Does that mean you are searching the whole text against the "name" field?

Find logs which do not contain a specified field in Kibana

I use ELK to organize my logs. Logs come from many places, and some records might not contain certain fields. The question is: what is the best way to find such records? Can I find logs which do not contain several specific fields?
Kibana uses the query string query syntax of Elasticsearch in its filters. The syntax to find a document that does not have a given field is _missing_:FIELD_NAME. The + operator is the preferred way to ensure that a particular search term must be present in the matched documents. Combining these two will allow you to search for documents that are missing multiple fields:
+_missing_:foo +_missing_:bar +_missing_:baz

What exactly is a Lotus Notes document's "size" field?

The background is that I am trying to eliminate duplicate documents that have arisen (from a rogue replication?) in a huge database - tens of thousands of docs. I am composing a view column that uniquely identifies a document so that I can sort on that column and remove the duplicates via LotusScript code. The trouble is that there is a rich text "Body" field that contains most of the action/variability, and you can't (AFAICS) use @Abstract in a view column...
Looking around for an alternative I can see that there is a system variable "size" which includes any attachments size, so I am tentatively using this. I note that duplicated docs (to the eye) can differ in reported size by up to about 30 bytes.
Why the small differences? I can code the LotusScript to use a 30-byte leeway, but I'd like to know the reason for it. And is it really 30 bytes, or something else?
Probably the document's $Revisions item has one more entry, which means the document was saved one more time.
If the cause of this "thousands of copied documents" accident was replication, then you might be lucky and the documents will contain a $Conflict item. This item contains the DocumentUniqueId of the original document, so you could pair them up this way.

Search keywords in files stored in mongodb

I have stored a .txt file in mongodb using gridFS with node.js.
Can we store .pdf and other formats? When I tried to store a .pdf and print the content to the console, it displayed the text in the doc along with some junk values. I used this line to retrieve it: "GridStore.read(db,id,function(err, fileData)"
Is there any other better way to do it?
Can we do text search on the content of the files stored in MongoDB directly? If so, how can we do that?
Also, can you please tell me where the data of files stored in MongoDB is kept, and in what format?
Any help in this regard will be great.
--Thanks
What you really seem to want here is "text search" capabilities, which in MongoDB requires you to simply store the "text" in a field or fields within your document. Putting "text" into MongoDB is really very simple, as you just supply the "text" as the content for the field and MongoDB will store it. The same goes for other data of just about any type, which will simply be stored under the field you specify.
The general case here is that you really seem to want "text search" and for that you must store the "text" of your data. But before implementing that, let's talk about what GridFS actually is and also what it is not, and how it most certainly is not what you think it is.
GridFS
GridFS is not software or a special function of MongoDB. It is in fact a specification for functionality to be implemented by available drivers for the sole intent of enabling you to store content that exceeds the 16MB BSON storage limit.
For this purpose, the implementation uses two collections. By default these are named fs.files and fs.chunks, but in fact they can be whatever you tell your driver implementation to actually use. These collections store what their default names indicate: one holds the unique identifier and metadata for the "file", and the other collection stores the "chunks" of its content.
Here is a quick snippet of what the data you send via the GridFS API looks like as a document in the "chunks" collection:
{
    "_id" : ObjectId("539fc66ac8b5e6dc058b4568"),
    "files_id" : ObjectId("539fc66ac8b5e6dc058b4567"),
    "n" : NumberLong(0),
    "data" : BinData(2,"agQAADw/cGhwCgokZGJ....
}
For context, that data belongs to a "text" file I sent via the GridFS API functions. As you can see, despite the actual content being text, what is displayed here is the raw binary data in an encoded (base64) form.
This is in fact what the API functions do: they read the data that you provide as a stream of bytes and submit that binary stream in manageable "chunks", so in all likelihood parts of your "file" will not in fact be kept in the same document. Which actually is the point of the implementation.
To MongoDB itself these are just ordinary collections and you can treat them as such for all general operations such as find and delete and update. The GridFS API spec as implemented by your driver gives you functions to "read" from all of those chunks and even return that data as if it were a file. But in fact it is just data in a collection, in a binary format, and split across documents. None of which is going to help you with performing a "search", since it is neither "text" nor contained in the same document.
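For completeness, here is a rough node.js sketch of reading such a file back through the GridFS API (this uses the newer GridFSBucket interface of the node driver rather than the older GridStore mentioned in the question; the id is the _id of the fs.files document):

const { GridFSBucket, ObjectId } = require('mongodb');

async function readWholeFile(db, id) {
  const bucket = new GridFSBucket(db);              // defaults to fs.files / fs.chunks
  const chunks = [];
  for await (const piece of bucket.openDownloadStream(new ObjectId(id))) {
    chunks.push(piece);                             // the driver walks the chunk documents for you
  }
  return Buffer.concat(chunks);                     // the raw bytes, exactly as uploaded
}

Note that what comes back is still the original binary content, which is the point being made here: reading it back does not make it searchable.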
Text Search
So what you really seem to want here is "text search" to allow you to find the words you are searching for. If you want to store "text" from a PDF file for example, then you would need to externally extract that text and store it in documents. Or otherwise use an external text search system, which will do much the same.
For the MongoDB implementation, any extracted text would be stored in a document, or possibly several documents, which you would then cover with a "text index" to enable the search functionality. Basically you would do this on a collection like this:
db.collection.ensureIndex({ "content": "text" })
Once the field or "fields" on the documents in your collection are covered by a text index, you can actually search using the $text operator with .find():
db.collection.find({ "$text": { "$search": "word" } })
This form of query allows you to match documents on the terms you specify in your search, and also to determine a relevance score for each match and "rank" the documents accordingly.
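For example, to surface that relevance score and sort by it (using the same placeholder collection as above):

db.collection.find(
    { "$text": { "$search": "word" } },
    { "score": { "$meta": "textScore" } }
).sort({ "score": { "$meta": "textScore" } })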
More information can be found in the tutorials section on text search.
Combined
There is nothing stopping you from in fact taking a combined approach. Here you would actually store your original data documents using the GridFS API methods, and then store the extracted "text" in another collection that is aware of and contains a reference to the original fs.files document referring to your large text document or PDF file or whatever.
But you would need to extract the "text" from the original "documents" and store that within the MongoDB documents in your collection. Otherwise a similar approach can be taken with an external text search solution, where it is quite common for them to provide interfaces that can do things such as extract text from things like PDF documents.
With an external solution you would also send along the reference to the GridFS form of the document, so that if it is your intention to deliver the original content, it can be retrieved from any search result with another request.
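To make the combined MongoDB-only approach concrete, here is a rough node.js sketch (the "file_text" collection name and the already-extracted text are assumptions for illustration; the extraction itself has to come from an external tool or library):

const { ObjectId } = require('mongodb');

// store the extracted text next to a reference back to the GridFS fs.files entry
async function indexExtractedText(db, filesId, extractedText) {
  const texts = db.collection('file_text');
  await texts.createIndex({ content: 'text' });     // one-time text index
  await texts.insertOne({ files_id: new ObjectId(filesId), content: extractedText });
}

// search the extracted text and get back references to the original files
function searchFiles(db, words) {
  return db.collection('file_text')
    .find({ $text: { $search: words } }, { projection: { files_id: 1 } })
    .toArray();
}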
So ultimately you see that the two methods are in fact for different things. You can build your own approach around "combining" functionality, but "search" is for search and the "chunked" storage is just for storing content that would otherwise be too large for a single document.
Of course if your content is always under 16MB, then just store it in a document as you normally would. But of course, if that is binary data and not text, it is no good to you for search unless you explicitly extract the text.

Controlling how elasticsearch fields are tokenized for faceting

I'm new to elasticsearch (and the underlying Lucene engine).
We're storing some metadata about documents, e.g. a single document might be described as:
UniqueHash: ABC123
CreatedBy: John Smith
ApplicationName: MSExcel
ContentType: application/vnd.ms-excel
WordCount: 7000
...
This all works very well for indexing/searching but when we come to faceting, things get interesting.
Faceting on (say) CreatedBy would return
John: 1
Smith: 1
or on ContentType
application: 1
vnd.ms: 1
excel: 1
neither of these is desirable. I have no direct control over the contents of the field (that is to say, I can't change the underlying data). I can perform a transform on the way in but that would result in storing dodgy data just so the searching works as expected which feels like the wrong approach.
How can I convince elasticsearch to treat the whole contents of each field (or at least of specified fields) as the value to use for faceting?
You can index your field twice using the Multi Field Type. After reindexing, you will be able to continue using the analyzed version of your field for search, and use the "untouched" version of the field for facets.
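A sketch of such a mapping (the "untouched" sub-field name is just illustrative; on current Elasticsearch versions the untouched variant is a keyword field, while on the older versions this answer targets it would be a multi_field / string mapping with "index": "not_analyzed"):

"CreatedBy": {
  "type": "text",
  "fields": {
    "untouched": {
      "type": "keyword"
    }
  }
}

Searches keep hitting CreatedBy as before, while facets/aggregations run against CreatedBy.untouched and therefore return "John Smith: 1" as a single term.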
