I use MVC3 on Azure, and I would like to have a "LIKE"-style search,
e.g. http://msdn.microsoft.com/en-us/library/ms179859.aspx
First question: does Lucene support "LIKE" search? I tried Googling this, but it is very difficult to search for the word "like" without getting results such as: I like to use Lucene :)
Second: what kind of performance can I get from SQL Azure for a "LIKE" search, with only id (int) as the key, text (string(100)) as the column being searched, and around 10 million rows? I tried it but could not get it to work; it always times out. Alternatively, you can answer the question as: I know there's a way to improve "LIKE" search in SQL Azure.
Third question: is there any other product that works well with the Azure platform and supports "LIKE" search with reasonable performance (less than 2 seconds for the sample database above)?
Thanks.
SQL Azure doesn't support full-text indexing, so 'LIKE' is limited to the ANSI SQL operator. This is wholly inadequate for general searching. In general, on the cloud (Azure) you want to avoid using SQL for searching anyway - it is the wrong place for it from a scalability point of view.
As you suggest, a Lucene-based search engine is the way to go, but I would recommend using Solr (the Apache/Java Lucene server). Solr can still be hosted in Azure, and you will find a lot more community support, documentation and help for it.
Lucene does support LIKE-style search, and there is a library specifically for Lucene.NET that uses Azure Storage to hold the Lucene index. This lets you provide a fault-tolerant Lucene index that scales well in the cloud.
http://code.msdn.microsoft.com/windowsazure/Azure-Library-for-83562538
Solr is a good option, but you will have to manage the storage of the index yourself unless you extend Solr to run on Azure storage.
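To illustrate why an index helps with the LIKE-style search discussed above: a LIKE pattern with a leading wildcard cannot use an ordinary B-tree index, so the database has to scan every row. The sketch below uses a trigram index, which is one common way to narrow down candidates for a substring match. It is a toy illustration, not how Lucene or SQL Azure implement anything, and the class and method names are made up for this example.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of lowercase 3-character substrings of text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


class TrigramIndex:
    """Toy substring index: maps each trigram to the ids of rows containing it."""

    def __init__(self):
        self.rows = {}
        self.postings = defaultdict(set)

    def add(self, row_id, text):
        self.rows[row_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(row_id)

    def like(self, term):
        """Emulate WHERE text LIKE '%term%' (case-insensitive)."""
        grams = trigrams(term)
        if not grams:
            # Terms shorter than 3 characters cannot use the index: full scan.
            candidates = set(self.rows)
        else:
            # Only rows that contain every trigram of the term can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        needle = term.lower()
        # Verify candidates: sharing all trigrams does not guarantee a match.
        return sorted(r for r in candidates if needle in self.rows[r].lower())
```

For example, after adding a few rows, `idx.like("lucene")` only has to verify the rows that share all of the trigrams "luc", "uce", "cen" and "ene" instead of scanning the whole table.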
You may want to look into implementing Solr on Azure. There's a good write-up with demos and tutorials here:
http://wiki.apache.org/solr/SolrOnWindowsAzure
My team is implementing Azure Cognitive Search on one of our websites. We noticed that there are two ways to set it up: one is to use the Azure Portal to import the data, create the index, and expose the APIs, which requires no coding at all; the other is to use the @azure/search-documents library, which requires a lot of coding to make the search work.
We don't know for sure which way is better. So far we have noticed the following:
Using the portal: setting up the search is easy and quick.
Using @azure/search-documents: setting up the search is more tedious, but it gives us flexibility over the index definition and the rules for when to update the index.
Other than the above points, we don't know what other pros and cons the two approaches have.
Any insight on this would be greatly appreciated!
Thank you!
Which way is 'better' depends on the use case, but typically, for minimal business logic and simple data sources, you can use the Portal to index and enrich documents quickly.
You can check out our React template: once you have an index, it lets you display UI elements for searching, filtering, sorting, and faceting documents.
https://github.com/dereklegenzoff/azure-search-react-template
You can also check out the Knowledge Mining Accelerator, which shows a step-by-step process for building a Cognitive Search solution.
https://learn.microsoft.com/en-us/samples/azure-samples/azure-search-knowledge-mining/azure-search-knowledge-mining/
The problem:
I am setting up a product that uses Azure Search, and one of the requirements is that search results go through multi-stage learning-to-rank, where the final stage involves a pairwise, query-dependent machine-learned model such as RankNet.
Is there any existing support in Azure Search for this? If not, where in the Azure Search pipeline would you recommend I start?
What I have tried:
I had been hoping to find something similar to the ElasticSearch LTR Plugin but have not been able to.
The only option I can currently think of is to set up a server that forwards the query from the front-end to Azure Search, re-ranks the search results with my pairwise LTR methods, reconstructs the re-ranked result list, and sends it to the front-end.
However, I am very apprehensive about the inefficiency of this option and it would be unnecessary if there is an existing way for me to do this.
Language / Libraries
If relevant: I am coding primarily in C# and would be using CNTK for machine-learning.
At this time, your suggestion is the way to go. Azure Search does not currently offer a way to inject a custom ranker into the search pipeline. You would need to configure your query to return a large number of results and then re-rank them yourself. Sorry we do not have a better answer than this right now. If you have time, it would be great if you could cast your vote for this here, as we have been hearing this request more often lately.
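The "fetch a large result set, then re-rank it yourself" approach could look roughly like the sketch below. It is written in Python for brevity (the question uses C#), and `prefer` is a hypothetical stand-in for a trained pairwise model such as RankNet; `toy_prefer` is a made-up placeholder so the example runs. Summing pairwise win probabilities is just one simple way to turn pairwise preferences into an ordering, not something Azure Search prescribes.

```python
def rerank(query, candidates, prefer, top_k=10):
    """Re-rank candidates using a pairwise preference model.

    prefer(query, a, b) is a hypothetical stand-in for a trained pairwise
    model; it returns the estimated probability that a should rank above b.
    """
    scores = []
    for i, a in enumerate(candidates):
        # Sum of pairwise win probabilities against every other candidate.
        total = sum(
            prefer(query, a, b) for j, b in enumerate(candidates) if i != j
        )
        scores.append((total, i))
    # Highest total first; ties keep the search engine's original order.
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [candidates[i] for _, i in ordered[:top_k]]


def toy_prefer(query, a, b):
    """Placeholder model: prefers the text sharing more words with the query."""
    q = set(query.lower().split())
    overlap_a = len(q & set(a.lower().split()))
    overlap_b = len(q & set(b.lower().split()))
    if overlap_a == overlap_b:
        return 0.5
    return 1.0 if overlap_a > overlap_b else 0.0
```

Note that this compares every pair, so the cost grows quadratically with the number of results pulled from the search service; that is one reason to keep the candidate set to the top few hundred.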
I would like to build a search engine for my website so I can quickly find relevant content. I've done quite a few Google searches and discovered ElasticSearch and Solr (which both sit on top of Lucene), and Whoosh (Python-based).
But are all of these search engines just building an "inverted-index" on top of the data? What are some other algorithmic approaches for getting higher quality searches?
I was intrigued by this blog post using collaborative filtering on top of Solr, which returns related search queries:
http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/
Are there other common techniques that I should be aware of? Are there other libraries sitting on top of ElasticSearch/Solr that I could just plug into, and use "out-of-the-box"?
Any links or tips would be greatly appreciated!
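For readers unfamiliar with the "inverted index" mentioned in the question, here is a minimal sketch of the idea: each word maps to the set of documents that contain it, so a query only touches the postings for its own words. The function names are made up for this example, and real engines add analysis (stemming, stop words) and relevance scoring on top.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return sorted ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Intersect with the postings of each further query word.
        result &= index.get(word, set())
    return sorted(result)
```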
You haven't mentioned what tech stack you are working on.
If you use Ruby on Rails, I would recommend Tire, which is a gem that gives a DSL wrapper over ElasticSearch. Essentially, it allows you to index your data in Elasticsearch.
For Rails, Sunspot is a very popular gem that people use to interface with Solr.
For .NET - SolrNET is a great Solr client.
The other part of your question (around implementing a good search engine) is too broad - I would recommend reading a good book such as Lucene in Action to get a feel for what Solr/Elasticsearch can do.
I do have a few notes that I wrote a while back, you can read about some of my experience in search here.
Edit:
Since you work in Python, I would recommend Haystack, although it is specific to Django. It is very versatile for our needs, and it works with both Solr and Elasticsearch. If you are not using Django, solrpy is a Solr client you could use instead.
I suggest you learn the Solr API: it has been in development for 4-5 years, so you can find lots of plug-ins for it, such as a related-search API. Elasticsearch, on the other hand, is very easy to configure, but it is a very young engine and still needs more development.
Pyes is a well-documented Python client for Elasticsearch.
Also, this YouTube video provides a good overview of using Elasticsearch with Python.
I suggest you use Google Custom Search Engine.
Have a look here:
https://www.google.com/cse/all
We have developed several search engines both on Solr and Elastic. Solr used to be the best as it provided most of the tools needed to admin and debug your indexes. Right now Elastic offers the same features as Solr either natively or via plugins. Plus it is easier to configure in high performance/high availability scenarios (easy to shard or cluster).
Your technology stack is irrelevant. Both Solr and Elastic have clients for nearly every language, plus you can access both via plain HTTP.
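As a rough illustration of the plain-HTTP access mentioned above (this is not the original poster's example), the sketch below only builds the request URLs. It assumes default local ports (8983 for Solr, 9200 for Elasticsearch) and a hypothetical core/index called `articles`; the exact path layout can differ between versions and deployments.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select request handler."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


def elasticsearch_search_url(base_url, index, query, size=10):
    """Build a URL for Elasticsearch's URI search (_search?q=...)."""
    params = urlencode({"q": query, "size": size})
    return f"{base_url}/{index}/_search?{params}"


# Hypothetical hosts and names, for illustration only.
print(solr_select_url("http://localhost:8983", "articles", "title:lucene"))
print(elasticsearch_search_url("http://localhost:9200", "articles", "title:lucene"))
```

Either URL can then be fetched with any HTTP client (curl, a browser, `urllib.request`), which is why the client language matters so little.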
That said, each search engine applies to a problem domain. Tuning Elastic or Solr to retrieve relevant results is a bit of an art with some trial and error.
You will have to define analyzers for each field you'll search on, according to your search patterns and the kind of results you expect.
Finally, to create a search engine with a single input that searches across disparate attributes of a document type, you may need DisMax queries, which let you boost results depending on which document fields the search terms match.
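To make the DisMax idea concrete: a disjunction-max query scores a document by its best-matching field, plus an optional tie-breaker fraction of the other fields' scores, with per-field boosts applied first. The sketch below shows that arithmetic only; the function name and the example scores are invented, and the raw per-field scores would come from the engine in practice.

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Combine per-field scores the way a disjunction-max query does.

    field_scores: {field: raw match score of the query in that field}
    boosts: {field: multiplier}; fields without a boost use 1.0
    tie: weight given to the non-best fields (0.0 means pure max)
    """
    boosted = [
        score * boosts.get(field, 1.0) for field, score in field_scores.items()
    ]
    if not boosted:
        return 0.0
    best = max(boosted)
    # Best field counts fully; the remaining fields count at the tie weight.
    return best + tie * (sum(boosted) - best)
```

With a title boost of 3.0, a modest title match can outrank a stronger body match, which is usually what you want for a single search box.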
To summarize: go for Elastic, and get some plugins or frontends. Two suggestions:
Inquisitor: for testing your analyzers
Elastic Head: for administration purposes
Are there any NoSQL databases that support word proximity searching similar to lucene?
I have a client that would like the flexibility of NoSQL with the search power of Lucene or some other search tool. The average amount of data to be searched is 200 GB.
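For context, word proximity search is usually built on a positional index, which records where each word occurs in each document. The sketch below is a simplified illustration with made-up function names; it checks plain position distance and does not reproduce Lucene's exact slop semantics.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def within(index, word_a, word_b, max_distance):
    """Return sorted ids of docs where both words occur within max_distance positions."""
    postings_a = index.get(word_a.lower(), {})
    postings_b = index.get(word_b.lower(), {})
    hits = []
    # Only documents that contain both words can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(
            abs(pa - pb) <= max_distance
            for pa in postings_a[doc_id]
            for pb in postings_b[doc_id]
        ):
            hits.append(doc_id)
    return sorted(hits)
```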
Take a look at tjake's Solandra (formerly Lucandra). "Solandra is a real-time distributed search engine built on Apache Solr and Apache Cassandra."
Solandra "supports most out-of-the-box Solr functionality (search, faceting, highlights)"
If you can manage a .NET/Windows solution, also check out RavenDB - it has Lucene baked in. If not, Schild's answer is a good one. You can also use Lucene separately with MongoDB, but your app would have to maintain the index itself...
Lucene is a NoSQL database.
Probably too late to be useful but check out MarkLogic. It's a document database with integrated full-text search (not bolt-on Lucene). You can see a quick demo via http://developer.marklogic.com/try/corona/index
We have a web app that allows users to upload documents, create their own documents, and so on. Uploaded files are stored on Amazon S3, created information is stored in a MySQL database. What I'm looking for is some sort of search engine, where I feed it all of our text documents, each with a unique ID, and it builds an index or whatever. Later, I can give it search queries, and it will pull out the best matching documents (via their ID), along with snippets of matching text.
Basically we want to allow our users to search through their repository of uploaded files, along with anything that other users have marked as public. The solution should run on a standard Linux server, and ideally it would be open source, but I'll also consider paid solutions if they aren't outrageously priced.
So far, I've found three potential candidates:
MySQL Full Text Search - some reports I've read are that it's very slow
Apache Lucene - unfortunately written in Java, but I'll use it if I have to. Supposedly fast
Sphinx - doesn't seem to be as popular, ideally whatever solution I find will have lots of community support.
Please let me know if there are any other good choices that I've overlooked, or if you have experience with any of the above.
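The workflow described in the question (feed in documents with unique IDs, get back matching IDs with snippets) could be prototyped as below. This is only a sketch of the interface with made-up function names: it does a linear scan with no relevance ranking, whereas any of the engines listed would use an index and scoring.

```python
def make_snippet(text, query, width=30):
    """Return a short excerpt around the first query word found, or None."""
    lowered = text.lower()
    for word in query.lower().split():
        pos = lowered.find(word)
        if pos != -1:
            start = max(0, pos - width)
            end = min(len(text), pos + len(word) + width)
            # Mark truncation on either side with an ellipsis.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(text) else ""
            return prefix + text[start:end] + suffix
    return None


def search(docs, query):
    """Return (doc_id, snippet) pairs for documents matching any query word."""
    results = []
    for doc_id, text in docs.items():
        snippet = make_snippet(text, query)
        if snippet is not None:
            results.append((doc_id, snippet))
    return results
```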
Take a look at Solr. It's based on Lucene, so it's very fast, and it's really easy to use from any platform.
Sphinx may be worth your consideration, as it works well with several common RDBMSs (notably MySQL).
There is also Xapian which is fast and is quite customizable.
It has support for custom indexers allowing one to index data that is not stored in a database which might be useful for your documents stored on S3.
I imagine that Google will have a solution that meets your needs. Start here: Google Enterprise
There is a Ruby port of Lucene called "Ferret". In addition to the Ruby API, you can get at the underlying C implementation, called "cFerret".
Lucene is very good. And although it was originally written in Java, there is a PHP implementation: http://framework.zend.com/manual/en/zend.search.lucene.html