Search keywords in files stored in mongodb - node.js

I have stored a .txt file in mongodb using gridFS with node.js.
Can we store .pdf and other formats? When I tried to store a .pdf and retrieve the content on the console, it displayed the text in the document along with some junk values. I used this line to retrieve it: "GridStore.read(db, id, function(err, fileData))".
Is there any other better way to do it?
Can we do text search on the content of the files stored in MongoDB directly? If so, how can we do that?
Also, can you please tell me where the data of the files is stored in MongoDB, and in what format?
Any help in this regard will be great.
--Thanks

What you really seem to want here is "text search" capabilities, which in MongoDB simply requires you to store the "text" in a field or fields within your document. Putting "text" into MongoDB is very simple: you just supply the "text" as the content of the field and MongoDB will store it. The same goes for data of just about any other type, which will simply be stored under the field you specify.
The general case here is that you really seem to want "text search" and for that you must store the "text" of your data. But before implementing that, let's talk about what GridFS actually is and also what it is not, and how it most certainly is not what you think it is.
GridFS
GridFS is not software or a special function of MongoDB. It is in fact a specification for functionality implemented by the available drivers, with the sole intent of enabling you to store content that exceeds the 16MB BSON document limit.
For this purpose, the implementation uses two collections. By default these are named fs.files and fs.chunks, but they can in fact be whatever you tell your driver implementation to use. These collections store what their default names indicate: one holds the unique identifier and metadata for the "file", and the other holds the file content itself, split into "chunks" of binary data.
Here is a quick snippet of what happens to the data you send via the GridFS API as a document in the "chunks" collection:
{
    "_id" : ObjectId("539fc66ac8b5e6dc058b4568"),
    "files_id" : ObjectId("539fc66ac8b5e6dc058b4567"),
    "n" : NumberLong(0),
    "data" : BinData(2,"agQAADw/cGhwCgokZGJ....
}
For context, that data belongs to a "text" file I sent via the GridFS API functions. As you can see, despite the actual content being text, what is displayed here is a base64 representation of the raw binary data.
This is in fact what the API functions do: they read the data that you provide as a stream of bytes and submit that binary stream in manageable "chunks", so in all likelihood parts of your "file" will not be kept in the same document. Which is in fact the point of the implementation.
To MongoDB itself these are just ordinary collections, and you can treat them as such for all general operations such as find, delete and update. The GridFS API spec as implemented by your driver gives you functions to "read" from all of those chunks and even return that data as if it were a file. But in fact it is just data in a collection, in a binary format, and split across documents. None of which is going to help you with performing a "search", as it is neither "text" nor contained in the same document.
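As a side note on retrieval: the GridStore API mentioned in the question has since been superseded in the Node.js driver by GridFSBucket, which streams the chunks in and out for you. A minimal sketch of storing and reading back a file this way, assuming a recent driver version, a local mongod, and an illustrative file name report.pdf:

const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('test');
  const bucket = new GridFSBucket(db); // uses the default "fs" prefix

  // Upload: stream the raw bytes into fs.files / fs.chunks
  await new Promise((resolve, reject) => {
    fs.createReadStream('./report.pdf')
      .pipe(bucket.openUploadStream('report.pdf'))
      .on('error', reject)
      .on('finish', resolve);
  });

  // Download: reassemble the chunks into a single buffer.
  // For a PDF this is still the raw binary file, not searchable text.
  const chunks = [];
  for await (const chunk of bucket.openDownloadStreamByName('report.pdf')) {
    chunks.push(chunk);
  }
  const fileBuffer = Buffer.concat(chunks);
  console.log('retrieved %d bytes', fileBuffer.length);

  await client.close();
}

main().catch(console.error);

That buffer being raw binary is why a PDF printed to the console shows the text mixed with "junk" values; the text still has to be extracted in a separate step before it can be searched.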
Text Search
So what you really seem to want here is "text search" to allow you to find the words you are searching for. If you want to store "text" from a PDF file for example, then you would need to externally extract that text and store in documents. Or otherwise use an external text search system which will do much the same.
For the MongoDB implementation, any extracted text would be stored in a document, or possibly several documents, and covered by a "text index" to enable the search functionality. Basically you would do this on a collection like this:
db.collection.ensureIndex({ "content": "text" })
Once the field or "fields" on the documents in your collection are covered by a text index, you can actually search using the $text operator with .find():
db.collection.find({ "$text": { "$search": "word" } })
This form of query allows you to match documents on the terms you specify in your search and to also determine a relevance to your search and "rank" the documents accordingly.
More information can be found in the tutorials section on text search.
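Since the question is specifically about Node.js, here is a rough equivalent of the two shell commands above using the native driver: createIndex() for the one-time index build, then a $text query sorted by relevance. The collection name articles and the field name content are just placeholders:

const { MongoClient } = require('mongodb');

async function search(term) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const collection = client.db('test').collection('articles');

  // One-time setup: a text index over the field(s) holding the extracted text
  await collection.createIndex({ content: 'text' });

  // Match documents containing the term and rank them by relevance
  const results = await collection
    .find(
      { $text: { $search: term } },
      { projection: { content: 1, score: { $meta: 'textScore' } } }
    )
    .sort({ score: { $meta: 'textScore' } })
    .toArray();

  await client.close();
  return results;
}

search('word').then(console.log).catch(console.error);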
Combined
There is nothing stopping you from taking a combined approach. Here you would store your original data documents using the GridFS API methods, and then store the extracted "text" in another collection that is aware of, and contains a reference to, the original fs.files document for your large text document or PDF file or whatever.
But you would need to extract the "text" from the original "documents" and store that within the MongoDB documents in your collection. Otherwise a similar approach can be taken with an external text search solution, where it is quite common to provide interfaces that can extract text from formats such as PDF documents.
With an external solution you would also send the reference to the GridFS form of the document to allow this data to be retrieved from any search with another request if it was your intention to deliver the original content.
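As a hedged illustration of that combined approach: the binary file goes into GridFS as before, and the extracted text (obtained however you like, for example with a PDF text-extraction library of your choice) goes into a separate, text-indexed collection that carries the _id of the fs.files document, so a search hit can always lead back to the original file. The collection and field names here are only placeholders:

// Assumes `db` is a connected Db instance and `bucket` a GridFSBucket, as above
async function storeWithText(db, bucket, filename, fileStream, extractedText) {
  // 1. Store the original binary content in GridFS
  const uploadStream = bucket.openUploadStream(filename);
  await new Promise((resolve, reject) => {
    fileStream.pipe(uploadStream).on('error', reject).on('finish', resolve);
  });

  // 2. Store the extracted text alongside a reference to the fs.files document
  const extracted = db.collection('extracted_text');
  await extracted.createIndex({ content: 'text' });
  await extracted.insertOne({
    file_id: uploadStream.id, // the _id of the fs.files document
    filename: filename,
    content: extractedText
  });
}

// A search hit on extracted_text then carries file_id, which can be passed to
// bucket.openDownloadStream(file_id) to deliver the original content.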
So ultimately you see that the two methods are in fact for different things. You can build your own approach around "combining" functionality, but "search" is for search and the "chunk" storage is for doing exactly what you want it to do.
Of course if your content is always under 16MB, then just store it in a document as you normally would. But of course, if that is binary data and not text, it is no good to you for search unless you explicitly extract the text.

Related

Azure Cognitive Search - return full json as SearchDocument?

I'm using Azure.Search.Documents in C# to index JSON documents in Azure blob storage. About half of the fields of each json doc are meant to be searchable or fielded. The JSON also includes some fields that I don't want evaluated by my search.
My goal is to return the entire JSON document in my search results.
It seems like my choices are to (a) add SearchField records to my SearchIndex for every aspect of the document (in which the SearchDocument results are ready for me to use) or (b) leverage metadata_storage_path / metadata_storage_name and do a separate fetch for the document itself.
Option (b) feels less efficient, considering that the SearchDocument returned is already so close to the full JSON; it seems a shame to have to make a separate fetch for each document. But for option (a) to work, I'd need to tell the SearchIndex about the extra fields without them triggering false positive search results.
For (a) is there a way to add SearchFields (or the equivalent) and have them not trigger false positives? (IsSearchable seems to affect how, but not whether, they are evaluated). Also, if (b) is the better approach, is there a way to do this using "new SearchField" as opposed to declared via attributes? Thanks.
Thank you Vince. Adding your comment as an answer to help other community users.
Set IsSearchable to FALSE

Full text search with different types besides String in MongoDB

I want to use full text search in MongoDB, and I know the solution of using text indexes (https://docs.mongodb.com/manual/core/index-text/). But, this solution is meant for searching on String type fields only.
How can I perform full text search on other types as well? Suppose I have a collection with documents in which I have fields from a variety of types like String, Number etc.
What can I do?
P.S: I use MongoDB native driver for Nodejs.
Wildcard Indexing
There can be scenarios where you want any text content in your documents to be searchable. Maybe this will help you.
db.collection.createIndex({"$**":"text"})
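A hedged Node.js sketch of the same thing through the native driver. Note that a wildcard text index still only indexes string content, so Number and other non-string values would have to be stored (or duplicated) as strings if they need to be matched by $text; the collection name items is just a placeholder:

const { MongoClient } = require('mongodb');

async function wildcardSearch(term) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const collection = client.db('test').collection('items');

  // Index every string field, in every document, under one text index
  await collection.createIndex({ '$**': 'text' });

  const hits = await collection.find({ $text: { $search: term } }).toArray();
  await client.close();
  return hits;
}

wildcardSearch('some words').then(console.log).catch(console.error);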

Can elastic search long document?

I have a study project about identifying text content, and it must use JS. The input is a paragraph of at least 15 lines, to be searched against 100 text files of 3 to 5 pages each. The output is which text file has the same content as the input text.
Can Elasticsearch solve this? Or can you recommend some solutions?
I found a blog entry at https://ambar.cloud/blog/2017/01/02/es-large-text/ that answers your question. It has an in-depth example similar to yours.
ElasticSearch can deal with large documents and still deliver good performance, but for cases like yours it's important to set up the index correctly.
Let's suppose you have ElasticSearch documents with a text field holding 3 to 5 pages worth of text.
When you try to query documents that contain a paragraph in the large text field, ElasticSearch will perform a search through all the terms from all the documents and their fields, including the large text field.
During the merge ElasticSearch collects all the found documents into memory, including the large text field. After building the results in memory, ElasticSearch will try to send these large documents as a single JSON response. This is very expensive in terms of performance.
ElasticSearch should handle the large text field separately from the other fields. To do this, in the index mapping you should set the parameter store: true for the large text field. This tells ElasticSearch to store the field separately from the other fields of the document. You should also exclude the large text field from _source by adding this parameter to the mapping:
_source: {
    excludes: [
        "your_large_text_field"
    ]
}
If you set your indexes this way, the large text field will be separated from _source. Querying the large text field is now much more effective since it is stored separately and there is no need to merge it with _source.
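Put together, and expressed through the Node.js client since the project must use JS, an index set up this way might look roughly like the sketch below. It assumes a 7.x-style @elastic/elasticsearch client, and the index and field names are placeholders:

const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function createIndex() {
  await client.indices.create({
    index: 'documents',
    body: {
      mappings: {
        // Keep the large field out of _source so it is not merged into every hit
        _source: { excludes: ['your_large_text_field'] },
        properties: {
          title: { type: 'text' },
          // Stored separately, so it can still be fetched explicitly when needed
          your_large_text_field: { type: 'text', store: true }
        }
      }
    }
  });
}

async function findSource(paragraph) {
  // Phrase search over the large field; only the small fields come back in _source
  const { body } = await client.search({
    index: 'documents',
    body: {
      query: {
        match_phrase: { your_large_text_field: { query: paragraph, slop: 2 } }
      }
    }
  });
  return body.hits.hits;
}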
To conclude, yes, ElasticSearch can handle the search of large text fields, and, with some extra settings it can increase the search performance by 1100 times.

What exactly is a Lotus Notes document's "size" field?

Background is that I am trying to eliminate duplicated documents that have arisen (from a rogue replication?) in a huge database - tens of thousands of docs. I am composing a view column that uniquely identifies a document so that I can sort on that column and remove the duplicates via LotusScript code. Trouble is that there is a rich text "Body" field that contains most of the action/variability, and you can't (AFAICS) use @Abstract in a view column...
Looking around for an alternative I can see that there is a system variable "size" which includes any attachments size, so I am tentatively using this. I note that duplicated docs (to the eye) can differ in reported size by up to about 30 bytes.
Why the small differences? I can code the LS to use a 30-byte leeway, but I'd like to know the reason for it. And is really 30 bytes or something else?
Probably the document's $Revisions item has one more entry. That means the document was saved one more time.
If the cause of this "copy thousands of documents" accident was replication, then you might be lucky and the documents contain a $Conflict item. This item contains the DocumentUniqueId of the original document, and you could pair them up this way.

Array of attachment type - how to get a filename for highlighted fragment?

I use ElasticSearch to index resources. I create document for each indexed resource. Each resource can contain meta-data and an array of binary files. I decided to handle these binary files with attachment type. Meta-data is mapped to simple fields of string type. Binary files are mapped to array field of attachment type (field named attachments). Everything works fine - I can find my resources based on contents of binary files.
Another ElasticSearch's feature I use is highlighting. I managed to successfully configure highlighting for both meta-data and binary files, but...
When I ask for highlighted fragments of my attachments field I only get fragments of these files without any information about the source of the fragment (there are many files in the attachment array field). I need a mapping between a highlighted fragment and the element of the attachment array it came from - for instance the name of the file, or at least the index in the array.
What I get:
"attachments" => ["Fragment <em>number</em> one", "Fragment <em>number</em> two"]
What I need:
"attachments" => [("file_one.pdf", "Fragment <em>number</em> one"), ("file_two.pdf", "Fragment <em>number</em> two")]
Without such a mapping, the user of the application knows that a particular resource contains files with the keyword, but has no indication of which file it is.
Is it possible to achieve what I need using ElasticSearch? How?
Thanks in advance.
So what you want here is to store the filename.
Did you send the filename in your json document? Something like:
{
    "my_attachment" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "content" : "... base64 encoded attachment ..."
    }
}
If so, you can probably ask for field my_attachment._name.
If it's not the right answer, could you refine your question a little and give a sample JSON document (without the base64 content) and your mapping, if any?
UPDATE:
When the content comes from an array of attachments you can't tell which file each fragment came from, because everything is flattened behind the scenes. If you really need that, you may want to have a look at nested fields instead.
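To sketch what that nested alternative could look like (assuming the text of each file has already been extracted into a content field, for example by an ingest pipeline, rather than relying on the old attachment type; the field names are illustrative): each attachment keeps its filename next to its content, and inner_hits returns the matching nested objects together with their highlights, so every fragment comes back paired with its filename.

// Mapping: each element of `attachments` is its own nested document,
// so the filename and content of one file stay paired together.
const mapping = {
  mappings: {
    properties: {
      title: { type: 'text' },
      attachments: {
        type: 'nested',
        properties: {
          filename: { type: 'keyword' },
          content: { type: 'text' }
        }
      }
    }
  }
};

// Query: inner_hits returns the matching nested objects (including their
// filename) along with highlighted fragments of their content.
const query = {
  query: {
    nested: {
      path: 'attachments',
      query: { match: { 'attachments.content': 'number' } },
      inner_hits: {
        _source: ['attachments.filename'],
        highlight: { fields: { 'attachments.content': {} } }
      }
    }
  }
};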
