I have a huge amount of data to be searched. I want to search it with Solr, and Solr only returns the ids of the matching records.
I also have all the records that are in Solr cached in Memcache, and I want to fetch the whole data set from the cache.
Is that a good idea? Any help?
Thanks.
And the problem is? Just define the 'fl' parameter to be the list of fields you want returned, in this case 'id'.
And if you never actually return the other fields, don't bother storing them; define them as index-only. If you do have to store them but only want them very rarely, test the enableLazyFieldLoading setting in solrconfig.xml.
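For example, a minimal sketch in Python (using requests), assuming a Solr core named "products" on the default port; only the id field comes back:

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/products/select",
        params={
            "q": "name:widget",  # whatever your search query is
            "fl": "id",          # return only the id field
            "wt": "json",        # JSON response format
            "rows": 50,
        },
    )
    ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
    # ...then fetch the full records for these ids from Memcache.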
I have two fairly general questions about full-text search in a database. I was looking into Elasticsearch and Solr, and it seems to me that one needs to produce separate documents made up of table entries, which then get searched. So the result of such a search is not actually a database entry? Or did I misunderstand something?
I also looked into Whoosh, which does index table columns, and the results from Whoosh are actual table rows.
When using Solr or Elasticsearch, should I put the row id into the document that gets searched, and after I have my result, use that id to retrieve the relevant rows from the table? Or is there a better solution?
Another question I have: if I have an id like abc/123.64664 stored as a string, is there any advantage in searching such a column with FTS? It seems to me there is not much to be gained by indexing it. Or am I wrong?
Thanks.
Elasticsearch can store the indexed document, and you can retrieve it as part of the query result. Usually people still store the original data in a regular DB; it gives you more reliability and flexibility for reindexing. Mind that ES indexes non-relational data. You can have your data stored in a relational manner and compose denormalized documents for indexing.
As for "abc/123.64664", you can index it as a tokenized string, or you can tune the index for prefix search, etc. It's up to you.
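As a rough sketch of that pattern (Python with requests, assuming Elasticsearch 7+ on localhost:9200; the index name, fields, and ids are made up):

    import requests

    # Index a denormalized document that carries the original DB row id.
    requests.put(
        "http://localhost:9200/articles/_doc/42",
        json={"row_id": 42, "title": "Intro to FTS", "body": "full text here"},
    )

    # Search, asking only for row_id, then fetch the authoritative rows from the DB.
    resp = requests.post(
        "http://localhost:9200/articles/_search",
        json={"query": {"match": {"body": "full text"}}, "_source": ["row_id"]},
    )
    row_ids = [hit["_source"]["row_id"] for hit in resp.json()["hits"]["hits"]]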
(TL;DR) Don't think about how your data is structured in your RDBMS. Think about what you are searching for.
Content storage for good full-text search is quite different from standard relational database storage, so the data going into the search engine can end up looking quite different from the way you stored it.
This is all driven by your expected search results. You may increase the granularity of the data or - the opposite - denormalize it so that parent/related record content shows up in the records you actually want returned as part of the search. Text processing (copyField, tokenization, pre-processing, etc.) is also where a lot of content modification happens to make a record findable.
Sometimes, relational databases support full-text search. PostgreSQL is getting better and better at that. But most of the time, relational databases just do not provide enough flexibility to support good relevancy-driven search.
Finally, if the original schema is quite complex, it may make sense to use the search engine only to get the right - relevant - IDs out, and then merge them in the client code with the details from the original database records.
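For instance, a minimal sketch of that merge step (Python, with sqlite3 standing in for the original database; the table and column names are hypothetical):

    import sqlite3

    ids = [3, 17, 42]  # pretend these came back from the search engine, in rank order

    conn = sqlite3.connect("app.db")
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, name, price FROM products WHERE id IN ({placeholders})", ids
    ).fetchall()

    # The DB returns rows in arbitrary order, so re-sort them by search rank.
    by_id = {row[0]: row for row in rows}
    ranked = [by_id[i] for i in ids if i in by_id]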
In our project we are creating a REST API in Node.js, and we want to index every field in the database, even fields that are not used much in querying.
How can we achieve this without having to create an index for each field, and again whenever a new field is added?
Also, let me know: is indexing every field the correct approach?
Or might it create problems, e.g. for scaling?
I came to this decision because I have used Microsoft DocumentDB, which indexes every field by default.
That's not possible.
It also isn't a good idea, for many reasons. Every index you have needs to be maintained, so every insert into the DB has to be written to the record and to every single index; having a lot of indexes makes writes slower. It also uses more memory. You should only index the fields that you search and sort by. If you want the entire DB to be in memory, maybe consider Redis.
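For illustration only: if the store were MongoDB (the question mentions DocumentDB, which auto-indexes by default; MongoDB does not), a sketch with pymongo of indexing just the fields you query and sort by - the collection and field names are made up:

    from pymongo import MongoClient, ASCENDING, TEXT

    db = MongoClient("mongodb://localhost:27017")["shop"]

    # One compound index covering the common query+sort path,
    # rather than one index per field.
    db.products.create_index([("category", ASCENDING), ("price", ASCENDING)])

    # A single text index can cover full-text search across several string fields.
    db.products.create_index([("name", TEXT), ("description", TEXT)])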
I would like to update all documents where doc.type = "article".
From what I understand, _bulk_docs works on all documents. To narrow down the affected docs, one can use a key value/range.
This is not ideal because I have different types of documents in the database. I hoped I could update all documents returned by a view, but it seems that is not possible (please correct me if I'm wrong).
The only solution I can think of is prefixing all keys with the document type, but is that a reasonable approach?
There is no way of doing this in CouchDB. Moreover, there is not much sense in doing it, since in CouchDB you can only update a whole document, not just some properties. So if it were possible to achieve what you want, it would make all the documents identical.
You could:
1. fetch all documents where doc.type == "article" (you'd probably use a view for this)
2. make all modifications locally
3. upload all documents using _bulk_docs
If the number of documents matching your criterion is too large to fit in a single request, you'd have to make multiple requests to _bulk_docs. Also, doing this can introduce conflicts that you'd have to resolve afterwards.
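A sketch of that fetch/modify/upload round trip (Python with requests; the database name "mydb", the view "by_type", and the "reviewed" field are assumptions, and the view's map function is assumed to emit doc.type as the key):

    import requests

    base = "http://localhost:5984/mydb"

    # 1. Fetch all docs with doc.type == "article" via the view, with full bodies.
    resp = requests.get(
        f"{base}/_design/app/_view/by_type",
        params={"key": '"article"', "include_docs": "true"},
    )
    docs = [row["doc"] for row in resp.json()["rows"]]

    # 2. Modify locally.
    for doc in docs:
        doc["reviewed"] = True

    # 3. Upload in one request; each doc keeps its _id and _rev so CouchDB
    #    can detect conflicts.
    requests.post(f"{base}/_bulk_docs", json={"docs": docs})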
While posting XML docs to Solr for indexing, some docs are getting added and duplicate records are discarded. Some records are getting updated with new values as well. How can I know what changes were made to the index? I mean, how will I get to know the number of records added, the number of records updated, and the number of docs posted to the Solr core?
In Solr 4, under the collection's section, there is a sub-section called Plugins/Stats. In it, there is a category for UpdateHandler with stats similar to what you are asking for.
Also, it is possible to Watch for changes. Combined, this might give you a way to see whether these are what you want. If they are, then you should be able to access the same values via JMX for more flexible, long-term tracking.
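As a hedged sketch, the same stats can also be read over HTTP from the MBeans handler (Python with requests; the core name "mycore" is an assumption, and the exact category and stat names vary between Solr versions):

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/mycore/admin/mbeans",
        params={"stats": "true", "cat": "UPDATEHANDLER", "wt": "json"},
    )
    # Look for stats such as adds, cumulative_adds, deletesById, commits.
    print(resp.json())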
So for a new project, I'm building a system for an ecommerce site. The idea is to import products from suppliers and, instead of inserting them directly into our catalog, store all the information in a staging area. Each supplier has their own stage (i.e. a table in the database), and then I will flatten the multiple staging areas into a single entity (currently a single table, but later on perhaps Sphinx or Solr). Our merchandisers will then be able to search the staged products' relevant fields (name and description), be shown a list of matching products, and choose to have those products pushed into the live catalog. The search will query the single table (the flattened staging areas).
My design calls for storing only searchable and filterable fields in the single flattened table - e.g. name, description, supplier_id, supplier_prod_id, etc. The search queries will return only the IDs of the matching items and a class (supplier_id) that identifies which staging area the product is from.
Another senior engineer feels the flattened search table should include other meta fields (which would not be searched on) that could be used when 'pushing' the products from stage to the live catalog. He also feels that the query should return all this other information.
I feel pretty strongly about having only searchable fields in the flattened table and having the search return only class/id pairs, which could then be used to fetch all the other necessary metadata about the product (a simple SELECT * FROM class_table WHERE id IN (1,2,3)).
Part of my reasoning is that this will make it easier later on to switch the flattened table from the database to a search server like Sphinx or Solr, and the rest of the code wouldn't have to change just because the search implementation changed.
Am I on the right path? How can I convince the other engineer that it is important to keep only searchable fields and return only IDs? Or more specifically, why should a search application return only the IDs of matching objects?
I think that you're on the right path. If those other fields provide no value in either uniquely identifying a staged item or letting the user filter staged items, then the data is fundamentally useless until the item is pushed to the live environment. If the other engineer feels that the extra metadata will help the users make a more informed decision, then you might as well make those extra fields searchable (thereby meeting your stated purpose for the table(s)).
The only reason I could think of to pre-fetch that other, non-searchable data would be for a performance improvement on the push to the live environment.
You should use each tool for what it does best. A full text search engine, such as Solr or Sphinx, excels at searching textual fields and ranking the hits quickly. It has no special advantage in retrieving stored data in a select-like fashion. A database is optimized for that. So, yes, you are on the right path. Please see Search Engine versus DBMS for other issues involved in deciding what to store inside the search engine.
In the case of Sphinx, it only returns document ids and named attributes back to you anyway (attributes being numerical data, for the most part). I'd say you've got the right idea, as the other metadata is just a simple JOIN away from the flattened table if you need it.
You can regard Solr as a powerful index; as an index gives IDs back, it is logical for Solr to do the same.
You can use the Solr query parameter fl to ask for identifier-only results, for instance fl=id.
However, there is one feature that needs Solr to give you back some data too: the highlighting of search terms in the matched documents. If you don't need it, then using Solr to retrieve identifiers only is fine (I assume you need only the document list, and no other features like facets, related docs, or spell checking).
That said, what matters is how you build your objects in your search function: either from the DB (using Solr only to retrieve IDs), from the fields Solr returns (provided they're stored), or even a mix of both - think Solr for the 'highlighted' content fields and the DB for the other ones. Again, if you don't need highlighting, this is not an issue.
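For illustration, a sketch of combining fl=id with highlighting (Python with requests; the core name "products" and the field "description" are assumptions, and the highlighted field must be stored):

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/products/select",
        params={
            "q": "description:wireless",
            "fl": "id",              # ids only in the doc list
            "hl": "true",            # enable highlighting
            "hl.fl": "description",  # field(s) to highlight
            "wt": "json",
        },
    )
    data = resp.json()
    ids = [doc["id"] for doc in data["response"]["docs"]]
    snippets = data["highlighting"]  # snippets keyed by document id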
I'm using Solr with thousands of documents but only return the ids, for the following reasons:
For Solr:
- if some sync mistake happens, it's not a big deal (especially in your case, where displaying a wrong price would be a big issue... at worst the item shows up in the wrong place in the results, but the data displayed are right)
- you will save a lot of time, because you don't ask Solr to return the 'description' of documents (I mean many lines of text)
For your DB:
- you can cache your results, so it's even faster with an ID (you don't need all the data from Solr every time!)
- you build your results in the same way (you don't need one method to build HTML from Solr results and another one from your DB)
I think there are a lot more reasons...
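To illustrate the caching point, a sketch of the ids-from-Solr, records-from-cache pattern (Python; a plain dict stands in for Memcache, and fetch_from_db and the "products" core are hypothetical):

    import requests

    cache = {}  # stand-in for Memcache

    def fetch_from_db(missing_ids):
        # hypothetical DB lookup returning {id: record}
        return {i: {"id": i, "name": f"product {i}"} for i in missing_ids}

    def search(query):
        resp = requests.get(
            "http://localhost:8983/solr/products/select",
            params={"q": query, "fl": "id", "wt": "json"},
        )
        ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
        hits = {i: cache[i] for i in ids if i in cache}
        missing = [i for i in ids if i not in hits]
        fetched = fetch_from_db(missing)
        cache.update(fetched)  # warm the cache for next time
        hits.update(fetched)
        return [hits[i] for i in ids]  # preserve Solr's ranking order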