I have nearly 200 000 rows of tuples in my Pandas DataFrame, and I have already indexed that data into Elasticsearch. Now, each time I run the program, it should check whether each row is already present in Elasticsearch and insert it only if it is not.
I'd recommend not worrying about it and just loading everything into Elasticsearch. As long as your _ids are consistent, the existing documents will be overwritten instead of duplicated. So just be sure to specify an _id for each document and you are fine; the bulk helpers in the elasticsearch-py client all already support setting an _id value for each document.
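One way to keep the _ids consistent is to derive them from the row content itself. The following is only a minimal sketch under that assumption: the helper names stable_id and build_actions, the index name "my-index", and the sample rows are all hypothetical, and hashing the whole row is just one possible choice of key.

```python
import hashlib
import json


def stable_id(row):
    """Derive a deterministic _id from the row's content."""
    # sort_keys makes the serialization independent of key order,
    # so the same row always hashes to the same value.
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def build_actions(rows, index_name):
    """Yield one bulk action per row, each carrying an explicit _id."""
    for row in rows:
        yield {"_index": index_name, "_id": stable_id(row), "_source": row}


# Hypothetical rows; with pandas these could come from df.to_dict("records").
rows = [
    {"name": "alice", "score": 1},
    {"name": "bob", "score": 2},
]
actions = list(build_actions(rows, "my-index"))
```

Actions shaped like this can be handed to elasticsearch.helpers.bulk(client, actions). Re-running the load then overwrites documents whose _id already exists rather than duplicating them.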
I have a study project on identifying text content, and it must use JS. The input is a paragraph of at least 15 lines, which is searched for in 100 text files of 3 to 5 pages each. The output is the text file that has the same content as the input text.
Can Elastic handle this? Or can you recommend some other solutions?
I found a blog entry at https://ambar.cloud/blog/2017/01/02/es-large-text/ that addresses your question. It has an in-depth example similar to yours.
ElasticSearch can deal with large documents and still deliver good performance, but for cases like yours it's important to set up the index correctly.
Let's suppose you have ElasticSearch documents with a text field holding 3 to 5 pages' worth of text.
When you try to query documents that contain a paragraph in the large text field, ElasticSearch will perform a search through all the terms from all the documents and their fields, including the large text field.
During the merge phase, ElasticSearch collects all the found documents into memory, including the large text field. After building the results in memory, ElasticSearch will try to send these large documents as a single JSON response. This is very expensive in terms of performance.
ElasticSearch should handle the large text field separately from other fields. To do this, set the parameter store: true for the large text field in the index mapping. This tells ElasticSearch to store the field separately from the document's other fields. You should also exclude the large text field from _source by adding this parameter to the index mapping:
"_source": {
  "excludes": [
    "your_large_text_field"
  ]
}
If you set your indexes this way, the large text field will be separated from _source. Querying the large text field is now much more effective since it is stored separately and there is no need to merge it with _source.
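Put together, the two settings described above could look like the sketch below. It assumes a recent, typeless Elasticsearch mapping; the helper names large_text_index_body and paragraph_query are hypothetical, and the field name is the placeholder used above.

```python
import json


def large_text_index_body(field_name):
    """Build an index-creation body that stores one large text field
    separately and keeps it out of _source."""
    return {
        "mappings": {
            "_source": {"excludes": [field_name]},
            "properties": {
                field_name: {"type": "text", "store": True},
            },
        }
    }


def paragraph_query(field_name, paragraph):
    """Build a search body that matches the paragraph against the field."""
    return {"query": {"match": {field_name: paragraph}}}


body = large_text_index_body("your_large_text_field")
query = paragraph_query("your_large_text_field", "some input paragraph")
print(json.dumps(body, indent=2))
```

Because the field is excluded from _source, hits come back without the large text. If the text itself is needed, it can be requested explicitly through the stored_fields search parameter.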
To conclude: yes, ElasticSearch can handle searching large text fields, and according to the linked blog post, these extra settings can speed up the search by as much as 1100 times.
When replicating from CouchDB (with GeoCouch), is there a way to sync only the documents within a bbox (bounds)?
I'm displaying markers on a map, and I'd like to only replicate documents within the bounds of that map.
Thank you.
Yes, because the underlying documents are the same. When you query for a bounding box, the result set should contain a list of document ids. The replicator API in CouchDB can accept a list of document ids to replicate. So all you have to do is query for the bbox to get a list of _ids and pass that list to the CouchDB replicator.
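As a rough sketch of the second step, the ids from the bbox result can be collected into the doc_ids field of a replication request. The function name replication_request, the database names, and the sample rows are hypothetical, and the row shape (an "id" key per row) is an assumption about the spatial query response.

```python
def replication_request(source, target, bbox_rows):
    """Build a body for POST /_replicate limited to the given documents."""
    # Each spatial result row is assumed to carry the document id under "id".
    # dict.fromkeys removes duplicates while keeping the original order.
    doc_ids = list(dict.fromkeys(row["id"] for row in bbox_rows))
    return {"source": source, "target": target, "doc_ids": doc_ids}


# Hypothetical rows, shaped like the "rows" array of a bbox query response.
rows = [
    {"id": "marker-1", "bbox": [10, 10, 10, 10], "value": None},
    {"id": "marker-2", "bbox": [12, 11, 12, 11], "value": None},
    {"id": "marker-1", "bbox": [10, 10, 10, 10], "value": None},
]
body = replication_request("http://example.com/places", "places_local", rows)
```

The resulting dictionary would be sent as the JSON body of the replication request. The bbox query has to be repeated whenever the map bounds change, since the id list is fixed at request time.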
I have stored a .txt file in MongoDB using GridFS with node.js.
Can we store .pdf and other formats? When I tried to store a .pdf and print the retrieved content to the console, it displayed the text of the document mixed with some junk values. I used this line to retrieve it: "GridStore.read(db,id,function(err, fileData)"
Is there a better way to do it?
Can we do a text search on the content of the files stored in MongoDB directly? If so, how can we do that?
Also, can you please tell me where the file data is stored in MongoDB, and in what format?
Any help in this regard will be great.
--Thanks
What you really seem to want here is "text search" capabilities, which in MongoDB requires you to simply store the "text" in a field or fields within your document. Putting "text" into MongoDB is really very simple, as you just supply the "text" as the content for the field and MongoDB will store it. The same goes for data of just about any other type, which will simply be stored under the field you specify.
The general case here is that you really seem to want "text search" and for that you must store the "text" of your data. But before implementing that, let's talk about what GridFS actually is and also what it is not, and how it most certainly is not what you think it is.
GridFS
GridFS is not software or a special function of MongoDB. It is in fact a specification for functionality to be implemented by available drivers for the sole intent of enabling you to store content that exceeds the 16MB BSON storage limit.
For this purpose, the implementation uses two collections. By default these are named fs.files and fs.chunks, but in fact they can be whatever you tell your driver implementation to actually use. These collections store what is indicated by those default names: one holds the unique identifier and metadata for the "file", and the other stores the actual "chunks" of data.
Here is a quick snippet of what happens to the data you send via the GridFS API as a document in the "chunks" collection:
{
"_id" : ObjectId("539fc66ac8b5e6dc058b4568"),
"files_id" : ObjectId("539fc66ac8b5e6dc058b4567"),
"n" : NumberLong(0),
"data" : BinData(2,"agQAADw/cGhwCgokZGJ....
}
For context, that data belongs to a "text" file I sent via the GridFS API functions. As you can see, despite the actual content being text, what is displayed here is an encoded (base64) form of the raw binary data.
This is in fact what the API functions do: they read the data you provide as a stream of bytes and submit that binary stream in manageable "chunks", so in all likelihood the parts of your "file" will not be kept in the same document. That is actually the point of the implementation.
To MongoDB itself these are just ordinary collections, and you can treat them as such for all general operations such as find, delete and update. The GridFS API spec, as implemented by your driver, gives you functions to "read" from all of those chunks and even return that data as if it were a file. But in fact it is just data in a collection, in a binary format, and split across documents. None of which is going to help you with performing a "search", as this is neither "text" nor contained in the same document.
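To make the chunking concrete, here is a simplified sketch of what a driver does with the bytes you hand it. It is not any driver's actual code: the function names and the payload are hypothetical, the files_id is a plain string rather than an ObjectId, and the 255 KiB chunk size is the default from the current GridFS spec (older drivers used a different default).

```python
CHUNK_SIZE = 255 * 1024  # default chunk size in the current GridFS spec


def split_into_chunks(files_id, data, chunk_size=CHUNK_SIZE):
    """Return chunk documents shaped like those in the "chunks" collection."""
    return [
        {"files_id": files_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks):
    """Join chunk payloads back into the original bytes, ordered by "n"."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))


# Hypothetical payload: 700000 bytes standing in for an uploaded file.
payload = b"some text " * 70000
chunks = split_into_chunks("file-1", payload)
restored = reassemble(chunks)
```

A word or phrase in the original file can straddle two chunk documents, which is one more reason a query against the chunks collection cannot act as a text search.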
Text Search
So what you really seem to want here is "text search" to allow you to find the words you are searching for. If you want to store "text" from a PDF file for example, then you would need to externally extract that text and store in documents. Or otherwise use an external text search system which will do much the same.
For the MongoDB implementation, any extracted text would be stored in a document, or possibly several documents, so that you can create a "text index" to enable the search functionality. Basically you would do this on a collection like this (createIndex replaces the older ensureIndex helper, which is deprecated):
db.collection.createIndex({ "content": "text" })
Once the field or "fields" on the documents in your collection are covered by a text index, you can actually search using the $text operator with .find():
db.collection.find({ "$text": { "$search": "word" } })
This form of query allows you to match documents on the terms you specify in your search and to also determine a relevance to your search and "rank" the documents accordingly.
More information can be found in the tutorials section on text search.
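For readers working from Python, the same index and query can be expressed as plain data structures. This is a sketch only: the helper names text_index_keys and text_search are hypothetical, and the "content" field and search term are the ones from the shell examples above.

```python
def text_index_keys(*fields):
    """Key specification for a text index over the given fields."""
    return [(field, "text") for field in fields]


def text_search(terms):
    """Return (filter, projection, sort) for a relevance-ranked $text query."""
    score = {"$meta": "textScore"}
    query_filter = {"$text": {"$search": terms}}
    projection = {"score": score}
    sort = [("score", score)]
    return query_filter, projection, sort


keys = text_index_keys("content")
query_filter, projection, sort = text_search("word")
```

Assuming a PyMongo collection object, these would be used roughly as collection.create_index(keys) followed by collection.find(query_filter, projection).sort(sort), which returns the matching documents ordered by their text score.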
Combined
There is nothing stopping you from in fact taking a combined approach. Here you would actually store your original data documents using the GridFS API methods, and then store the extracted "text" in another collection that is aware of and contains a reference to the original fs.files document referring to your large text document or PDF file or whatever.
But you would need to extract the "text" from the original "documents" and store that within the MongoDB documents in your collection. Otherwise a similar approach can be taken with an external text search solution, where it is quite common to provide interfaces that can do things such as extract text from things like PDF documents.
With an external solution you would also send the reference to the GridFS form of the document to allow this data to be retrieved from any search with another request if it was your intention to deliver the original content.
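A searchable document in the combined approach might be shaped like the sketch below. The field names file_id, filename and content, both helper names, and the sample values are all hypothetical; the only requirement taken from the text above is that the document carries a reference back to the fs.files entry.

```python
def make_search_document(file_id, filename, extracted_text):
    """Build a searchable document that points back at a GridFS file."""
    return {
        "file_id": file_id,          # _id of the matching fs.files document
        "filename": filename,
        "content": extracted_text,   # field a text index would cover
    }


def file_ids_from_hits(hits):
    """Collect the GridFS file ids referenced by a list of search hits."""
    return [hit["file_id"] for hit in hits]


# Hypothetical values; the text would come from an external PDF extractor.
doc = make_search_document("file-1", "report.pdf", "text pulled out of the PDF")
ids = file_ids_from_hits([doc])
```

After a search, the collected ids can be used in a second request to read the original files back through the GridFS API.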
So ultimately you see that the two methods are in fact for different things. You can build your own approach around "combining" functionality, but "search" is for search and the "chunk" storage is for doing exactly what you want it to do.
Of course, if your content is always under 16MB, then just store it in a document as you normally would. But if that is binary data and not text, it is no good to you for search unless you explicitly extract the text.
I have three document types: MainCategory, Category and SubCategory. Each has a parentid which relates to the id of its parent document.
So I want to set up a view so that I can get a list of SubCategories which sit under the MainCategory (preferably just using a map function)... I haven't found a way to arrange the view so this is possible.
I currently have set up a view which gets the following output -
{"total_rows":16,"offset":0,"rows":[
{"id":"11098","key":["22056",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22056",1,"11098"],"value":"Cat...."},
{"id":"33610","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"33989","key":["22056",2,"null"],"value":"SubCat...."},
{"id":"11810","key":["22245",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22245",1,"11810"],"value":"Cat...."},
{"id":"33106","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"33321","key":["22245",2,"null"],"value":"SubCat...."},
{"id":"11098","key":["22479",0,"11098"],"value":"MainCat...."},
{"id":"11098","key":["22479",1,"11098"],"value":"Cat...."},
{"id":"11810","key":["22945",0,"11810"],"value":"MainCat...."},
{"id":"11810","key":["22945",1,"11810"],"value":"Cat...."},
{"id":"33123","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33453","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33667","key":["22945",2,"null"],"value":"SubCat...."},
{"id":"33987","key":["22945",2,"null"],"value":"SubCat...."}
]}
Which query string parameters would I use to get, say, the rows whose key starts with ["22945".... when all I have at query time is the id "11810"? (At query time I don't know the id "22945".)
If any of that makes sense.
Thanks
The way you store your categories seems to be suboptimal for the query you are trying to perform on them.
MongoDB.org has a page on various strategies for implementing tree structures (they should apply to Couch and other document databases as well). You should consider Array of Ancestors, where you always store the full path to your node. This makes updating or moving categories more difficult, but querying is easy and fast.
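The idea can be sketched in a few lines. The documents below are illustrative only: the ids are borrowed from the question, but the "ancestors" field, the "type" values as stored, and the build_ancestor_index helper are assumptions. The loop plays the role of a map function that emits one row per ancestor.

```python
from collections import defaultdict

# Each document stores the full path of ancestor ids, root first.
docs = [
    {"_id": "11810", "type": "MainCategory", "ancestors": []},
    {"_id": "22945", "type": "Category", "ancestors": ["11810"]},
    {"_id": "33123", "type": "SubCategory", "ancestors": ["11810", "22945"]},
    {"_id": "33453", "type": "SubCategory", "ancestors": ["11810", "22945"]},
]


def build_ancestor_index(documents):
    """Index every document under each (ancestor id, document type) pair."""
    index = defaultdict(list)
    for doc in documents:
        for ancestor_id in doc["ancestors"]:
            index[(ancestor_id, doc["type"])].append(doc["_id"])
    return index


index = build_ancestor_index(docs)
subcategories = index[("11810", "SubCategory")]
```

In a CouchDB view the equivalent would be a map function that calls emit([ancestorId, doc.type], null) for each ancestor, queried with key=["11810","SubCategory"], so the intermediate Category id never has to be known at query time.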