Search Engine Indexing based results and correlations to your professional life - search

This might not be a coding-related question on its own, but over the years I've wondered about the page ranking and indexing of my search results, and why the documentation links to a code base sometimes appear as the first result instead of tutorial-based results. It's not that important for me to select the documentation link; on some days I find it easier to use the tutorials, and that choice is driven more by their relative page ranking than by any intent to finish a task quickly rather than take more time. This question is too vague.

Related

Better or Not combine Search Engine and Recommend System?

In our project we use a search engine, but the results need to be ranked according to each user's interests, similar to a recommendation based on the user's keywords.
If we keep the two systems separate, it will cost a lot of time.
Is there a better way to combine a search engine and a recommender system?
Or is there a simple way to customize my ranking strategy to achieve this?
This is what we were trying to do in our project as well. There are two competing concerns when solving this problem: relevancy and personalization. You should look at how much the personalization damages the relevancy of the results for a query. For example, if I'm suggesting news, then it makes sense to suggest based on location. I hope you have already analyzed your use cases.
The approach I followed was to get the results from the search first and then re-rank them to give personal suggestions. For example, if I was searching for a specific algorithm to code, then getting the result set and re-ranking it on my preference, let's say Java (based on my previous history), makes sense. In any case, relevancy is of the utmost importance, and the user's preferences are fitted in afterwards.
Again, the use case is important: if this was for a news search, then querying and retrieving directly on location is the best way to do it.
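The "search first, then re-rank" approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the Hit class, the tags field, the personal_weight blend, and the 0..1 relevance scale are all hypothetical and not taken from any particular search engine.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    relevance: float   # engine score, assumed here to be normalized to 0..1
    tags: frozenset    # hypothetical document attributes (language, topic, ...)


def rerank(hits, user_interests, personal_weight=0.2, top_k=50):
    """Re-rank the top_k most relevant hits by blending in a user-affinity score."""
    # Only the most relevant hits are candidates, so relevancy stays in charge.
    head = sorted(hits, key=lambda h: h.relevance, reverse=True)[:top_k]

    def blended(hit):
        # Affinity = share of the hit's tags that match the user's interests.
        overlap = len(hit.tags & user_interests)
        affinity = overlap / len(hit.tags) if hit.tags else 0.0
        return (1 - personal_weight) * hit.relevance + personal_weight * affinity

    return sorted(head, key=blended, reverse=True)


hits = [
    Hit("quicksort-in-python", 0.90, frozenset({"python", "sorting"})),
    Hit("quicksort-in-java", 0.88, frozenset({"java", "sorting"})),
    Hit("java-news", 0.30, frozenset({"java"})),
]
ranked = rerank(hits, user_interests=frozenset({"java"}))
print([h.doc_id for h in ranked])
```

Keeping personal_weight small means a preference only reorders results that are already close in relevance; the weakly relevant "java-news" hit stays at the bottom even though it matches the user's interest perfectly.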

Web Crawling and Pagerank

I'm a computer science student and I am a bit inexperienced when it comes to web crawling and building search engines. At this time, I am using the latest version of Open Search Server and am crawling several thousand domains. When using the built-in search engine creation tool, I get search results that are related to my query, but they are ranked using a vector space model of the documents as opposed to the PageRank algorithm or something similar. As a result, the top results are only marginally helpful, whereas higher-quality results from sites such as Wikipedia are buried on the second page.
Is there some way to run a crude Pagerank algorithm in Open Search Server? If not, is there a similarly easy to use open source package that does this?
Thanks for the help! This is my first time doing anything like this so any feedback is greatly appreciated.
I am not familiar with Open Search Server, but I know that most students working on search engines use Lucene or Indri. If you read papers on novel approaches to document search, you will find that the majority of them use one of these two APIs. Lucene is more flexible than Indri in terms of defining different ranking algorithms. I suggest taking a look at these two to see whether they suit your purpose.
As you mention, the web crawl template of OpenSearchServer uses a search query with a relevancy based on the vector space model. But if you use the latest version (v1.5.11), it also mixes in the number of backlinks.
You can change the weight that the backlinks contribute to the score; by default it is set to 1.
We are currently working on providing more control on the relevance. This will be visible in future versions of OpenSearchServer.
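For readers who want to experiment with a crude PageRank outside of OpenSearchServer, a minimal power-iteration sketch is shown below. It is not an OpenSearchServer feature or API; the links dictionary (page to outgoing links), the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Crude PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


crawl = {"a": ["wiki"], "b": ["wiki"], "wiki": ["a"]}
scores = pagerank(crawl)
print(sorted(scores, key=scores.get, reverse=True))
```

The resulting score could then be mixed with the text relevance score, which is roughly the role the backlink weight plays in the answer above.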

Strange results from DBpedia lookup API for common words

I ran the keyword and prefix search for some generic keywords such as "it", "there", "he", etc.
The most surprising part was that it gave wrong results and took around 10 times longer to process the request than for named entities like Nokia, Samsung, McDonald's.
Can anyone explain the weird results I get for these keywords?
it ====> http://dbpedia.org/resource/United_States
there ====> http://dbpedia.org/resource/United_States
Why are the results wrong and why does it take so much time to process these requests?
I wonder what kind of results you were looking for with a query like "there" or "it"?
In search engine terminology these are often referred to as stop words, and they are sometimes ignored completely because they are so common that they add very little relevance to the search query or result. I think this is actually what the lookup tool does now, as I do not get the same results you mentioned.
Why did the query take longer? This is likely because the words are very frequent and a query for them returns many more results. This means the search engine has more work to do in figuring out the most relevant result.
Why is United_States the top result? Probably because the wiki page for United_States is the highest ranked in terms of inbound links from other Wikipedia pages. This is the heart of the relevance algorithm used within the lookup tool. Essentially there are more links with the words "there", "it", etc. pointing to United_States than to any other page, so it is judged to be the most relevant for those terms.
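If you want to avoid sending such queries at all, one option is to drop stop words on the client side before calling the lookup service. The sketch below assumes a tiny hand-written stop word list and a hypothetical content_terms helper; it does not reflect how the DBpedia lookup tool itself handles these words.

```python
# Illustrative stop word list; a real one would be much longer.
STOP_WORDS = frozenset({"it", "there", "he", "the", "a", "an", "of", "and"})


def content_terms(query):
    """Return the lowercased query terms that are not stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]


for q in ["it", "there", "Nokia phones"]:
    terms = content_terms(q)
    # An empty list means the query carries no useful terms and can be skipped.
    print(q, "->", terms if terms else "skip lookup")
```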

What's the best way to tune my Foursquare API search queries?

I'm getting some erratic results from Foursquare's venue search API and I'm wondering if anyone has any tips on how to process my input parameters for the most "intuitive" results.
For example, suppose I am searching for a venue called "Ise Sushi", around "New York, NY", which is equivalent to (lat: 40.7143528, lon: -74.00597309999999) using Google Maps API. Plugging into the Foursquare Venue API, we get:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7143528%2C-74.00597309999999
This yields pretty underwhelming results: the venue I'm looking for ends up rather far down the list, at 11th place. What's interesting is that reducing the precision of the coordinates appears to produce much better results. For example, suppose we were to round the coordinates to 3 significant digits:
https://api.foursquare.com/v2/venues/search?query=ise%20sushi&ll=40.7%2C-74.0
This time, the venue I'm looking for ends up in 2nd place, even though it is actually farther from the center of the search (1072 meters, vs. 833 meters using the first query).
Another modification that appears to help improve the quality of search is substituting underscores for spaces to separate our search terms. For example, here's the original query with underscores:
https://api.foursquare.com/v2/venues/search?query=ise_sushi&ll=40.7143528%2C-74.00597309999999
This produces the most intuitive-seeming results: the venue I'm looking for appears first, and is accompanied by just one other result, "Ise Restaurant" (which is tagged as a "sushi restaurant"). For what it's worth, this actually seems to be the result set of the same search conducted on Foursquare's own website.
I'm curious what lessons I should be learning from this. Should I be reducing the precision of my coordinates? Should I be connecting my search terms with underscores, and if so, does that limit how a user can order their search terms?
Although there are ranking improvements we can make on our end to find this distant exact match, it generally also helps to specify intent=browse (although it looks like in this case, for now, it may give you worse results). By default, /venues/search uses intent=checkin, which tries really hard to find close-by matches for checking in to, at the expense of other ways a venue might match your search. Learn more at https://developer.foursquare.com/docs/venues/search
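To compare the variants discussed here (full-precision coordinates, rounded coordinates, and intent=browse), it can help to generate the request URLs from one helper. The sketch below only builds the URL string and sends nothing; venue_search_url and its decimals argument are hypothetical names, and any credentials or other parameters the API requires are not shown in the question and are left out, so the URL would need those added before use.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def venue_search_url(query, lat, lon, intent=None, decimals=None):
    """Build a /venues/search URL (authentication parameters are omitted)."""
    if decimals is not None:
        # Optionally reduce coordinate precision, as tried in the question.
        lat, lon = round(lat, decimals), round(lon, decimals)
    params = {"query": query, "ll": f"{lat},{lon}"}
    if intent is not None:
        params["intent"] = intent  # e.g. "browse"; the default is "checkin"
    # quote_via=quote encodes spaces as %20, matching the URLs in the question.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(venue_search_url("ise sushi", 40.7143528, -74.00597309999999, decimals=1))
print(venue_search_url("ise sushi", 40.7143528, -74.00597309999999,
                       intent="browse", decimals=1))
```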

Counts of web search hits [closed]

I have a set of approximately 10 million search queries. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 for the query "stackoverflow".
The problems are:
1- The Google API is limited to 100 queries per day. This is far from being useful for my task, since I would have to get lots of counts.
2- I used the Bing API, but it does not return an accurate number. Accurate in the sense of matching the number of hits shown in the Bing UI. Has anyone come across this issue before?
3- Issuing search queries to a search engine and parsing the HTML is one solution, but it results in a CAPTCHA and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestion.
Well, I was really hoping that someone would answer this, since it is something I was also interested in finding out, but since it doesn't look like anyone will, I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different people (seems like a lot of work). Or you can download Wikipedia and write something to parse the data there so that when you search a term you can see how many pages it occurs in. Of course that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms occur in. Maybe a combination of these methods could boost the accuracy for any given search term.
Certainly none of these methods is as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.
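The "download a corpus and count the pages a term occurs in" idea amounts to computing document frequencies. A minimal sketch is below; it assumes the pages are already available as plain-text strings, handles single-word terms only, and uses a hypothetical document_frequencies helper with a very simple tokenizer.

```python
import re
from collections import Counter


def document_frequencies(documents):
    """Count, for every term, the number of documents it occurs in."""
    df = Counter()
    for text in documents:
        # set() ensures a term is counted once per document, not per occurrence.
        df.update(set(re.findall(r"[a-z0-9]+", text.lower())))
    return df


pages = [
    "Stack Overflow is a question and answer site",
    "A stack of books",
    "Buffer overflow",
]
df = document_frequencies(pages)
for term in ["stack", "overflow", "stackoverflow"]:
    print(term, df[term])
```

Building the table once and then looking up each query term avoids rescanning the corpus for every one of the 10 million queries.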
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, you get an estimate that can change wildly because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. So when the search concludes, for a large result set, the search engine doesn't actually know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you get an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go way higher. Doing this may allow you to find the end of the result set (this worked for me last year I'm sure although today it only worked up to the first few thousand). Even if it doesn't allow you to get to the end of the result set you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in the above way - but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine which isn't that surprising. Bing also seems to hit limits although they looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).

Resources