My aim is to build an aggregator of news feeds and blog feeds, so as to make searching and tracking of the entities in them easy. I have been looking at many solutions out there, like Terrier, Lucene, SWISH-E, etc.

I could find only two comparison studies of these engines, and one of them is somewhat outdated. Basically, I want a search engine for a case in which the data size is not too large, but indexing will be frequent, every 30 minutes or so. I feel Terrier is not a good tool for this case; it works better when the data size is large and the update frequency is low. Can somebody who has worked in the Information Retrieval field offer some advice?
Lucene is well known and supported, so personally, that would be my first choice.
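For the "frequent re-indexing, modest data size" case in the question, a minimal Lucene sketch might look like this. The index path, field names, and the Article holder are all illustrative, and the API shown is the post-4.x Java one:

```java
// A minimal sketch of frequent incremental indexing with Lucene (Java).
// Paths, field names, and the Article holder are hypothetical.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;
import java.util.List;

public class FeedIndexer {
    public static void indexBatch(List<Article> articles) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("feed-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            for (Article a : articles) {
                Document doc = new Document();
                doc.add(new StringField("id", a.id, Field.Store.YES));
                doc.add(new TextField("title", a.title, Field.Store.YES));
                doc.add(new TextField("body", a.body, Field.Store.NO));
                // updateDocument replaces any existing doc with the same id,
                // so re-running the job every 30 minutes stays idempotent.
                writer.updateDocument(new Term("id", a.id), doc);
            }
            writer.commit();
        }
    }

    // Hypothetical article record, just for the sketch.
    static class Article { String id, title, body; }
}
```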
If you're looking for a ready-to-use search engine, check out fastcatsearch. It was developed for commercial search and has been deployed on a variety of sites. It claims to be faster than Lucene and includes a web-based manager for ease of use. It's hosted on GitHub: https://github.com/fastcatgroup/fastcatsearch
What are some ways, including machine learning, that I can use in my projects to generate items related to one another, such as related apps, related websites, related products, etc.?
I've been brainstorming, and these are the strategies I've come up with:

1. Show items from the same category. But that would be too broad.
2. Improving on the previous step: keep track of what people click next and promote those items, while keeping the bottom of the list randomized so that other relevant items can show up and get clicked (see the sketch below).
3. Use machine learning: provide training data somehow and build a model from it.

I want something simple but smart, so that it gets better with time.
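As a concrete illustration of strategy 2, here is a minimal sketch assuming an in-memory store; every name in it is illustrative:

```java
// A minimal sketch of strategy 2: count "clicked next" transitions and
// promote the most-clicked items, keeping a randomized tail so other
// items get a chance. All names here are illustrative.
import java.util.*;
import java.util.stream.Collectors;

public class RelatedItems {
    // item -> (next item -> click count)
    private final Map<String, Map<String, Integer>> nextClicks = new HashMap<>();
    private final Random random = new Random();

    public void recordClick(String fromItem, String toItem) {
        nextClicks.computeIfAbsent(fromItem, k -> new HashMap<>())
                  .merge(toItem, 1, Integer::sum);
    }

    // Most-clicked items first, then a shuffled sample of the rest.
    public List<String> related(String item, List<String> candidates, int topN) {
        Map<String, Integer> counts = nextClicks.getOrDefault(item, Map.of());
        List<String> ranked = candidates.stream()
                .sorted(Comparator.comparingInt(
                        (String c) -> counts.getOrDefault(c, 0)).reversed())
                .collect(Collectors.toList());
        List<String> head = new ArrayList<>(
                ranked.subList(0, Math.min(topN, ranked.size())));
        List<String> tail = new ArrayList<>(
                ranked.subList(head.size(), ranked.size()));
        Collections.shuffle(tail, random); // let less-clicked items surface
        head.addAll(tail);
        return head;
    }
}
```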
Collaborative filtering is designed to solve exactly this problem. The catch is that it only produces good results when you have a lot of data. I mean... A LOT. And it's not a really simple thing to use; then again, no machine learning technique is. There are some Node.js packages for CF available, but I have no idea how good they are.
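To make the idea concrete, here is a toy sketch of item-based co-occurrence counting, the core of the simplest CF variants. A real system would need far more data and proper scoring on top of the raw counts (Mahout, for instance, uses log-likelihood ratio scoring):

```java
// A toy illustration of item-based collaborative filtering via simple
// co-occurrence counts. All names are illustrative.
import java.util.*;

public class CooccurrenceCF {
    // For a target item, count how many users interacted with both it
    // and each other item; higher counts suggest "more related".
    public static Map<String, Integer> related(
            String target, Collection<Set<String>> userHistories) {
        Map<String, Integer> cooccur = new HashMap<>();
        for (Set<String> history : userHistories) {
            if (!history.contains(target)) continue;
            for (String other : history) {
                if (!other.equals(target)) {
                    cooccur.merge(other, 1, Integer::sum);
                }
            }
        }
        return cooccur; // sort by value descending for a "related items" list
    }
}
```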
I'm pretty new to search engines and a complete newbie in machine learning, but I want to know if there is a way to combine the functionality of search engines like Elasticsearch or Apache Solr with machine learning projects like Apache Mahout, H2O, or PredictionIO.

For example, say you work on a travel website where users can search for a destination. A user starts typing "au", so the first suggestions are "AUstria", "AUstralia", "mAUrice island", "mAUritania", etc. This is typically what Elasticsearch can do.

But you know that this user has already travelled to Mauritania three times, so you want Mauritania to appear first in the suggestions. And I guess that's typically what machine learning can do.

Are there bridges between these two types of technologies? Can machine learning make a search engine's results more relevant?

I'm open to all answers, regardless of the technologies used. If you have ever worked on this type of problem, my ears are wide open :-)

Thank you.
Your question is very general in nature, so my answer will have to be the same.

Consider a recommender framework such as Apache Mahout's correlated co-occurrence. Unlike the vanilla Spark recommender, this implementation allows for multiple types of actions, such as having viewed a website, having booked a trip there before, demographic information, etc.

You would then calculate the recommendations for each user at whatever interval suits you. The recommendations are based on multiple criteria and on what other people similar to this user have done. Consider your 'items' in this case to be every destination in the world, so you now have every possible destination ranked for each user.

It is then a trivial extension to index Elasticsearch with each user's ordered list of recommended destinations.

For example, say we have a user who has visited Berlin, looked at several hotels in Vienna, and is from Romania. When the user types "au", we would expect "Austria" to come up in the results much higher than "Australia".
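To make that concrete, here is a minimal sketch of the re-ranking idea: take the engine's prefix matches and sort them by a per-user score computed offline by the recommender. The score map is a stand-in for real recommender output; inside Elasticsearch itself you could achieve a similar effect with a function_score query or a per-user boost field.

```java
// A sketch of the bridge described above: filter destinations by the
// typed prefix, then re-rank by a per-user recommendation score.
// recommendationScore is a hypothetical stand-in for recommender output.
import java.util.*;
import java.util.stream.Collectors;

public class PersonalizedSuggest {
    public static List<String> suggest(String typedLower,
                                       List<String> destinations,
                                       Map<String, Double> recommendationScore) {
        return destinations.stream()
                .filter(d -> d.toLowerCase().contains(typedLower))
                .sorted(Comparator.comparingDouble(
                        (String d) -> recommendationScore.getOrDefault(d, 0.0))
                        .reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // For the user described above, Austria outranks Australia.
        System.out.println(suggest("au",
                List.of("Australia", "Austria", "Mauritania"),
                Map.of("Austria", 2.5, "Australia", 0.3)));
    }
}
```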
Per the comments and downvotes: you probably should have either (a) asked a more specific programming question, or (b) asked this on another forum such as Data Science Stack Exchange, FYI.
At the moment we have an ASP.NET application with search based on Lucene.Net 3.0.3, and we are going to implement a search service to work with more than 2.5 million items. We have a similar question: which search engine will be the quickest in this situation?

As we know, Lucene.Net follows the design of the classic Java Lucene, and ideally it should have almost the same speed. But we found that the Lucene.Net 3.0.3 engine has issues with the speed of fuzzy search.

We found some explanation of why fuzzy search performance is poor in Lucene (on our data every request takes 6-8 seconds): Solr/Lucene fuzzy search too slow

Our own speed issues with Lucene.Net 3.0.3 are also described here.

So we have a list of questions for the Lucene community and for experienced IT professionals:

Does it make sense to move from .NET to Java?
Do you see any other alternatives for working with such a large amount of data?
Do you have such experience, and can you share some numbers for Lucene fuzzy search? (We saw 4-8 seconds per search request against a 2.5-million-item index; see the link above for details.)
Do you have experience with FlexLucene? Is it better than Lucene.Net?
Thank you.
I would suggest you upgrade to a newer version of Lucene, as the performance of fuzzy search has been improved significantly (by leveraging finite state machines).
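For illustration, fuzzy matching in a recent Lucene release looks like this (field and index names are illustrative). maxEdits is capped at 2, and the query is executed with a Levenshtein automaton rather than the old brute-force term scan:

```java
// A sketch of fuzzy search in a recent Lucene (Java) release.
// Field and index names are illustrative.
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class FuzzyExample {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("my-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Matches terms within 2 edits of "lucene"; the automaton
            // makes this fast even on large term dictionaries.
            Query q = new FuzzyQuery(new Term("title", "lucene"), 2);
            TopDocs hits = searcher.search(q, 10);
            System.out.println("hits: " + hits.totalHits);
        }
    }
}
```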
We are building a university website and are looking for a search solution for it. The website gets high traffic because the university includes an open-university faculty, so there are a great many students (approximately 1.5 million). We already use caching to speed up the website. Which search engine do you suggest for our situation?

Note: We are thinking of Solr, Elasticsearch, or Sphinx for now, but it could also be something else.

Update: We need a full-text search engine that is fast and extensible, with features like matching similar queries and support for indicating priority (field weighting).

Thanks.
It really depends on your use case, what features you want, and whether you have any experience with any of the technologies. I could paraphrase the arguments, but there's a very good discussion here: "ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?" that covers the pros and cons of each.
Edit (in response to the question's edit):
Of these technologies I have only used Solr (and SQL), but I've found it easy to use and would recommend it. It supports native sharding and replication, which should cover the extensibility requirement. It also supports things like joins and field weighting, which I think covers all your needs, if I read your requirements correctly.
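As a small illustration of the field-weighting point, here is a SolrJ sketch; the core name, URL, query text, and field names are assumptions:

```java
// A sketch of field weighting with SolrJ via the edismax query parser.
// Assumes a hypothetical core named "articles" running locally.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class WeightedSearch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build()) {
            SolrQuery q = new SolrQuery("open university courses");
            q.set("defType", "edismax");           // extended dismax parser
            q.set("qf", "title^5 summary^2 body"); // boost title matches most
            QueryResponse resp = client.query(q);
            resp.getResults().forEach(doc ->
                    System.out.println(doc.getFieldValue("title")));
        }
    }
}
```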
I'm writing a small helper utility for an obscure piece of software that is used at a local shop. Basically, I would like to know whether anyone searches for anything associated with that software, and whether publishing my work on the Internet would make any sense. I entered the name of the software into Google Trends, but my terms "do not have enough search volume to show graphs", despite the fact that Google lists 250,000 results for the software name, or 35,000 if I explicitly exclude terms such as serial and warez from the search.
Does anyone know of alternatives to Google Trends? Or of another way to find out if people search for a particular keyword?
I found what I was looking for.
Google AdWords Keyword Tool
Yahoo Clues is a service similar to Google Trends, but I don't think it's as effective for non-entertainment categories.
If you don't get an answer here, another place to ask might be The Business of Software.
Google Trends was also telling me there wasn't enough data for my query. I found Google Insights did the job nicely, and unlike the AdWords tool mentioned in the author's answer, it actually shows a trend.

Here's an example showing the emergence of three terms whose volume is too low to show up on Trends: #bigdata, #datascientist, and #datajournalism.
Here's a related SO question.