Every time I search for something on Google's image tab sometimes it just pops up random images. Say I Google 'Harry Potter' and the 60th search item is a picture of 'Macaulay Culkin'. I mean how does that even happen? I know this isn't a strictly coding question but I would like to know why exactly Google's search algorithm fails (if I can say that).
This is not a case of a failing algorithm. It may be that the picture is on a harry potter website, or that the picture is tagged with "Harry Potter." It's most likely the first senario.
Also, you'll notice that as you scroll farther to the bottom, the results are less like your search. It's supposed to show the most relevant results at the top (unless you changed that order option).
Related
I am pretty sure it has a name other than a progress bar, which shows very fluidly amount loaded or percent complete. I am not interested in what shows percent complete. I am looking for the name of a thing that shows you, in named steps or page numbers, where you are in filling out a form.
I remember seeing somewhere a line of interlocking arrow-shaped buttons that accomplished this, but the specifics of the graphic representation is not so important to me. I will experiment with various ways to render this. But in order for me to research more,... what is this thing called by web developers?
I have a question about running an A/B test against different pages on a website and if I should worry about them interfering with either test's results. Not that it matters, but I'm using Visual Website Optimizer to do the testing.
For example, if I have two A/B tests running on different pages in the order placement flow, should I worry about the tests having an effect on one anothers goal conversion rate for the same conversion goal? For example, I have two tests running on a website, one against the product detail page and another running on the shopping cart. Ultimately I want to know if a variation of either page affects the order placement conversion rate. I'm not sure if I should be concerned with the different test's results interfering with one another if they are run at the same time.
My gut is telling me we don't have to worry about it, as the visitors on each page will be distributed across each variation of the other page. So the product detail page version A visitors will be distributed across the A and B variations of the cart, therefore the influence of the product detail page's variation A on order conversion will still be measured correctly even though the visitor sees different versions of the cart from the other test. Of course, I may be completely wrong, and hopefully someone with a statistics background can answer this question more precisely.
The only issue I can think of, is a combination between one page's variation and another page's variation worked together better than other combinations. But this seems unlikely.
I'm not sure if I'm explaining the issue clearly enough, so please let me know if my question makes sense. I searched the web and Stackoverflow for an answer, but I'm not having any luck finding anything.
I understand your problem and there is no quick answer to it and it depends on the types of test you are running. There are times that A/B tests on different pages influence each other, specially if they are within the same sequence of actions, e.g. checkout.
A simple example, if on your first page, variation A says "Click here to view pricing" and variation B says "Click here to get $500 cash". You may find that click through on B is higher and declare that one successful. Once the user clicks, on the following page, there are asked to enter their credit card details, with variations being "Pay" button being either green or red. In a situation like this, people from variation A might have a better chance of actually entering their CC details and converting as opposed to variation B who may feel cheated.
I have noticed when websites are in their seminal stages and they are trying to get a feel of what customers respond to well, drastic changes are made these multivariate tests are more important. When there is some stability and traffic, however, the changes tend to be very subtle and overall message and flow are the same and A/B tests become more micro refinements. In those cases, there might be less value in multi page cross testings (does background colour on page one means anything three pages down the process? probably not!).
Hope this answer helps!
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I have a set of search queries in the size of approx. 10 millions. The goal is to collect the number of hits returned by a search engine for all of them. For example, Google returns about 47,500,000 for the query "stackoverflow".
The problem is that:
1- Google API is limited to 100 query per day. This is far from being useful to my task since I would have to get lots of counts.
2- I used Bing API but it does not return an accurate number. Accureate in the sense of matching the number of hits shown in Bing UI. Has anyone came across this issue before?
3- Issuing search queries to a search engine and parsing the html is one solution but it results in CAPTCHA and does not scale to this number of queries.
All I care about is that the number of hits and I am open for any suggestion.
Well, I was really hoping that someone would answer this since this is something that I also was interested in finding out but since it doesn't look like anyone will I will throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query google as seemingly different people (seems like a lot of work). Or you can download wikipedia and write something to parse the data there so that when you search a term you can see how many pages it falls in. Of course that is a much smaller dataset than the whole web but it should get you started. Another possible data source is the google n-grams data which you can download and parse to see how many books and pages the search terms fall in. Maybe a combination of these methods could boost the accuracy on any given search term.
Certainly none of these methods are as good as if you could just get the google page counts directly but understandably that is data they don't want to give out for free.
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go way higher. Doing this may allow you to find the end of the result set (this worked for me last year I'm sure although today it only worked up to the first few thousand). Even if it doesn't allow you to get to the end of the result set you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in the above way - but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine which isn't that surprising. Bing also seems to hit limits although they looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).
I am trying to build a site that searches a database of user comments for the most often mentioned names of movies. However, with certain movie titles like Up and Warrior(2011), there are far too many irrelevant results and I want to only search for the title in threads about movies or else make sure it's mentioned in the right context. Is there a more generalized question that this problem is a subset of (I'm sure there is but google yielded nothing so far).
working out the context of a chunk of text to determin whether the word "up" is refering to a film or not is, unfortunately, something only a human can do at the moment.
have a look at amazon's mechanical turk service, you can pay people to search thru the text for you. this might not be great if you are trying to offer a free service however.
The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever.
The problem is, we've sent, literally, hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to implement "and" and "or" and negation and text search to the query to create complex searches. These searches also have to be fast enough to function as dynamic RSS feeds.
I really don't know anything about search theory, or how it's properly done. The way we are getting by right now is using a data mart to store the locations the releases were sent to in a single table. However, because of the subset thing mentioned above, the data mart is gigantic with millions of rows. And we haven't even implemented cities yet, and there are about 50,000 cities in the United States, which will exponentially increase the size of the data mart by so much I'm afraid it just won't work anymore.
Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction where I can learn about how massive searches are done? Because I really know nothing about it. And such a search engine is turning out to be incredibly difficult to make. Thanks! I know there must be a way because if Google can search the entire internet we must be able to search our own database :-)
Google can search the entire internet, and your data via a Google Appliance!