Web Crawling and PageRank

I'm a computer science student and a bit inexperienced when it comes to web crawling and building search engines. At the moment I am using the latest version of OpenSearchServer and am crawling several thousand domains. When I use the built-in search engine creation tool, I get results that are related to my query, but they are ranked using a vector space model of the documents rather than the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude PageRank algorithm in OpenSearchServer? If not, is there a similarly easy-to-use open-source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.

I am not familiar with OpenSearchServer, but I know that most students working on search engines use Lucene or Indri. If you read papers on novel approaches to document search, you will find that the majority of them use one of these two APIs. Lucene is more flexible than Indri when it comes to defining different ranking algorithms. I suggest taking a look at both and seeing whether they suit your purpose.
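To illustrate that flexibility, here is a minimal sketch of swapping the ranking model in Lucene. The API details shift across major versions; this assumes a recent (8.x-style) release, and the field name and sample text are made up for the example. For a fully custom scoring formula you would subclass SimilarityBase instead of using the stock BM25Similarity shown here.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SimilarityDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();       // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        cfg.setSimilarity(new BM25Similarity());          // ranking model applied at index time
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "hiking trails in national forests", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(new BM25Similarity()); // should match the index-time model
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("hiking"), 10);
            System.out.println(hits.totalHits);
        }
    }
}
```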

As you mention, the web crawl template of OpenSearchServer uses a search query whose relevancy is based on the vector space model, but if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You can change the weight given to the backlink-based score; by default it is set to 1.
We are currently working on providing more control over relevance. This will appear in future versions of OpenSearchServer.
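For experimenting outside OpenSearchServer, a crude PageRank is short enough to write yourself. Below is a power-iteration sketch in plain Java; it assumes you have already extracted a link graph from your crawl, and the toy graph in main is purely illustrative.

```java
import java.util.*;

public class PageRank {
    // links.get(i) = the list of pages that page i links to
    static double[] pageRank(List<List<Integer>> links, double damping, int iterations) {
        int n = links.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);      // teleportation term
            for (int i = 0; i < n; i++) {
                List<Integer> out = links.get(i);
                if (out.isEmpty()) {                     // dangling page: spread its rank everywhere
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    for (int j : out) next[j] += damping * rank[i] / out.size();
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        List<List<Integer>> links = List.of(List.of(1, 2), List.of(2), List.of(0));
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

The usual approach is to compute these scores offline and feed them back into the engine as a per-document boost.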

Related

Is it better to combine a search engine and a recommender system?

In our project we use a search engine, but the results need to be ranked according to each user's interests, similar to making recommendations based on the user's keywords.
If we keep the two systems separate, it costs a lot of time.
Is there a better way to combine a search engine and a recommender system?
Or is there a simple way to customize my ranking strategy to achieve this?
This is what we were trying to do in our project as well. There are two things to balance in this problem: relevancy versus personalization. You should look at how much the personalization hurts the relevancy of the query. For example, if I'm suggesting news, then it makes sense to suggest based on location. I hope you have already analyzed the use cases.
The approach I followed was: after getting the search results, re-rank them to reflect personal preferences. For example, if I were searching for a specific algorithm to implement, then taking the result set and re-ranking it by my preference, say for Java (based on my previous history), would make sense. In any case relevancy is of utmost importance, and only then do we fit in the user's preferences; a sketch of this retrieve-then-re-rank step is given below.
Again, the use case is important: if this were a news search, then directly querying and filtering by location is the best way to do it.
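As a concrete illustration of that retrieve-then-re-rank step, here is a small sketch. The linear blend and all names are my own illustrative choices, not taken from any particular library.

```java
import java.util.*;

public class ReRanker {
    record Result(String id, double relevance) {}

    // Blend the engine's relevance score with a per-user preference score.
    // alpha = 0 keeps pure relevance; alpha = 1 is pure personalization.
    static List<Result> reRank(List<Result> hits,
                               Map<String, Double> userPreference,
                               double alpha) {
        List<Result> out = new ArrayList<>(hits);
        out.sort(Comparator.comparingDouble((Result r) ->
                (1 - alpha) * r.relevance()
                + alpha * userPreference.getOrDefault(r.id(), 0.0)).reversed());
        return out;
    }

    public static void main(String[] args) {
        List<Result> hits = List.of(new Result("java-post", 0.70),
                                    new Result("python-post", 0.75));
        Map<String, Double> prefs = Map.of("java-post", 0.9); // user's history favors Java
        System.out.println(reRank(hits, prefs, 0.3));         // java-post now ranks first
    }
}
```

Keeping alpha small preserves the point made above: relevancy first, preferences fitted in afterwards.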

DNN Search: Indexing, what to index, and filters

I am having a lot of trouble figuring out how the search function works in DNN. To begin with, I only have admin credentials for the site (I know this already limits what I can do with search).
I will be putting a large document on the site, and I want it indexed so that the search function supports filtered search. The document will go into the FAQs module, arranged in a tree-structured hierarchy. Any ideas on how indexing specific modules might work, and how to get the search function to work with filters? I downloaded the Enhanced Search module, but learned that it doesn't do much for searching with filters.
Thank you, any leads would be much appreciated!
What you are looking for is not possible with the combination of elements you have noted (DNN + the FAQ module).
You have a few options that might make this more of a reality, but they require more control over the installation as a whole:
Use a third-party module such as Document Exchange, which allows you to search/index within files.
Use DotNetNuke Professional Edition's "Spider" to crawl the site rather than the regular index process that comes with the Community Edition.

What algorithm does Freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps, and I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking", you get:
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for it by playing with the service for a while. As you can see from the API, there is also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google, the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
They probably use an inverted index over selected fields, such as the English name, the aliases, and the Wikipedia snippet that is displayed. In your own application you can achieve this using something like Lucene; a toy sketch of the underlying data structure is given after the reference below.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006), "Inverted Files for Text Search Engines", ACM Computing Surveys 38(2).
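To make the inverted-index idea concrete, here is a toy sketch in Java. Real engines such as Lucene add compression, ranking, and positional data on top of this; everything here is illustrative.

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted set of document ids containing that term (the "posting list")
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // AND query: documents containing every query term
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(0, "Apo Hiking Society");
        idx.add(1, "Hiking Virginia's national forests");
        idx.add(2, "Hiking trail");
        System.out.println(idx.search("hiking")); // [0, 1, 2]
    }
}
```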
Most likely it's a trie traversed in lexicographical order.
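If it is a trie, a minimal prefix-lookup structure might look like the sketch below; this illustrates the general technique only, not Freebase's actual code.

```java
import java.util.*;

public class Trie {
    private final TreeMap<Character, Trie> children = new TreeMap<>(); // sorted keys => lexicographic walks
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.isWord = true;
    }

    // All stored strings starting with `prefix`, in lexicographical order.
    List<String> withPrefix(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return List.of();
        }
        List<String> out = new ArrayList<>();
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private void collect(Trie node, StringBuilder path, List<String> out) {
        if (node.isWord) out.add(path.toString());
        for (Map.Entry<Character, Trie> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        for (String s : new String[] {"hiking", "hiking trail", "hike"}) t.insert(s);
        System.out.println(t.withPrefix("hik")); // [hike, hiking, hiking trail]
    }
}
```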
There are a number of string-matching algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to read up on edit-distance algorithms such as Levenshtein. You will need to experiment to see which best suits your purpose.
One implementation of such algorithms is the SimMetrics library from the University of Sheffield.
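Of those, Levenshtein edit distance is the easiest to try first. The standard dynamic-programming version:

```java
public class EditDistance {
    // Classic DP Levenshtein distance: the minimum number of
    // insertions, deletions, and substitutions to turn a into b.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + sub);     // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("hiking", "hikes")); // 3
    }
}
```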

How would I go about creating a custom search index much like Lucene?

I implemented a Lucene search solution a while back, and it got me interested in compressed file indexes that are searchable. At the time I could not find any good information on how exactly you would go about creating a custom search index, so I wonder if anyone can point me in the right direction?
My primary interest is in file formatting, compression, and something similar to the concept of Lucene's documents and fields. It need not be language-specific, but if you can point me to online resources with language-specific implementations and full descriptions of the process, that is okay too.
Managing Gigabytes by Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
You may also try looking at the source code of the excellent Sphinx search engine.
It is a modern, open-source full-text search engine, and it uses cleverly optimized indexes; one classic index-compression trick is sketched below.
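To give a flavor of the compression techniques those resources cover, here is a sketch of variable-byte (VByte) gap encoding for posting lists, a classic scheme discussed in Managing Gigabytes. The exact convention (here, a stop bit set on the last byte of each number) is one common variant.

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

public class VByte {
    // Posting lists are sorted doc ids; store the gaps between them, then
    // encode each gap 7 bits at a time, setting the high bit on the final byte.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : sortedDocIds) {
            int gap = id - prev;
            prev = id;
            while (gap >= 128) {
                out.write(gap & 0x7F);  // low 7 bits, continuation implied
                gap >>>= 7;
            }
            out.write(gap | 0x80);      // final byte: set the stop bit
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        int[] ids = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0, shift = 0, b;
            do {
                b = bytes[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) == 0);  // keep reading until the stop bit
            prev += gap;
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] postings = {3, 7, 11, 200, 1025};
        byte[] packed = encode(postings);
        System.out.println(Arrays.toString(decode(packed, postings.length)));
    }
}
```

Since most gaps in a large index are small, most of them fit in a single byte, which is where the space savings come from.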

Open-source full-text article recommendation engines

I'm wondering if there are any good .NET recommendation algorithms available in open source projects, whether attached to a search engine or not. By recommendation I mean something that accepts a full-text article and recommends other articles from its index based on keyword similarity.
At the high end there are document classification engines like Autonomy; at the low-end spam filters and blog "related posts" widgets. Possibly advertisement-to-article matching, too. I'd like to incorporate one into a project but can't afford the high end and the low end seems to all be LAMP-based.
[Sorry, one answer asked for clarification: what I'm looking for is ideally a standalone library, but I'm willing to adapt good source code as necessary. The end result is that I need to be able to create a C# service that accepts an arbitrary amount of text and returns a list of similar previously-indexed articles. Basically, the exact thing that StackOverflow itself does as you are submitting a question!]
Thanks!
Steve
I think that on StackOverflow they strip all the common English words (stop words) from the text and then compare the remaining words with those of other posts to find the "Related" posts.
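A toy version of that idea might look like the sketch below; the stop-word list and the Jaccard similarity are my own illustrative choices, not StackOverflow's actual implementation.

```java
import java.util.*;
import java.util.stream.Collectors;

public class RelatedPosts {
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "or", "of", "to", "in", "is", "for");

    // Tokenize, lowercase, and drop stop words.
    static Set<String> keywords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.toSet());
    }

    // Jaccard overlap of the two keyword sets: |A ∩ B| / |A ∪ B|
    static double similarity(String a, String b) {
        Set<String> ka = keywords(a), kb = keywords(b);
        Set<String> inter = new HashSet<>(ka);
        inter.retainAll(kb);
        Set<String> union = new HashSet<>(ka);
        union.addAll(kb);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n",
            similarity("Open-source full-text article recommendation engines",
                       "Recommendation engines for full-text articles"));
    }
}
```

A weighted scheme such as TF-IDF with cosine similarity usually works better than raw overlap, and Lucene ships a MoreLikeThis component that does exactly this kind of matching.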
The question is not very clear (algorithm or library?), but the only thing that comes to mind is Lucene.NET, a port of the popular Lucene library to the .NET Framework. HTH.
