I am interacting with a search engine programmatically and I need to trick it into thinking that I am a human making queries, as opposed to a robot. This involves generating queries that an ordinary user might plausibly search for, like "ncaa football schedule" or "When was the lunar landing?" I'll be making over a thousand of these queries daily, and searching for random words out of a dictionary won't cut it, since that's not a very typical search habit.
So far I have thought of a few ways to generate realistic queries:
Obtain a list of the top google (or Yahoo or Bing, etc) searches for the day
Make use of Google's autocomplete feature by entering a random word from the dictionary followed by a space and scraping the recommended queries.
The latter approach sounds like it would involve a lot of reverse engineering. And with the former approach, I've been unable to find a list of more than 80 or so queries - the only sources I've found are AOL trends (50-100 queries) and Google Trends (30).
How might I go about generating a large set of human-like search phrases?
(For any language-dependent answers: I'm programming in Python)
Although this most likely breaks Google's TOS, you can scrape the autocomplete data easily:
import requests
import json

def autocomplete(query, depth=1, lang='en'):
    if depth == 0:
        return
    # Hit the (undocumented) Google autocomplete endpoint.
    response = requests.get('https://clients1.google.com/complete/search', params={
        'client': 'hp',
        'hl': lang,
        'q': query
    }).text
    # The response is JSON wrapped in a JavaScript callback; strip the wrapper.
    data = response[response.index('(') + 1:-1]
    o = json.loads(data)
    for result in o[1]:
        # Suggestions come back with <b> highlighting around the completed part.
        suggestion = result[0].replace('<b>', '').replace('</b>', '')
        yield suggestion
        if depth > 1:
            # Recurse on each suggestion to go one level deeper.
            for s in autocomplete(suggestion, depth - 1, lang):
                yield s
autocomplete('a', depth=2) gives you the top 110 queries that start with a (with some duplicates). Scrape each letter to a depth of 2, and you should have a ton of legitimate queries to choose from.
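For instance, a small driver along those lines could look like this (a sketch built on the autocomplete generator above; the sleep is only there to avoid hammering the endpoint):

import string
import time

def build_query_pool(depth=2, lang='en'):
    # Walk the whole alphabet and collect suggestions, de-duplicating as we go.
    seen = set()
    for letter in string.ascii_lowercase:
        for suggestion in autocomplete(letter, depth=depth, lang=lang):
            if suggestion not in seen:
                seen.add(suggestion)
                yield suggestion
        time.sleep(0.5)  # be polite between letters

queries = list(build_query_pool())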
Let's say I have a database of books that includes their titles. For a given listing from eBay or Craigslist or some other such site, I want to compare its title string to all of the book titles in my database to try to find a match.
It's unlikely there will ever be an exact string match, since users on those sites like to add things like "perfect condition" and "fast shipping" to their listing titles to attract buyers.
What algorithm(s) should I use to do this type of correlation? I'm aware of n-grams and Levenshtein distance, but I don't know which would do the most accurate job.
For the various applicable algorithms, how does their computational performance compare? Would it make sense to use multiple algorithms and average their results to balance their strengths and weaknesses? Would it be possible to set a minimum level of confidence? I'd rather have no match than a very poor quality match.
For the task at hand, I think you'd get the best results with some pre-processing: remove common "null" phrases (the ones you don't want to see), so that you're left with a shorter title in which the actual book title is the major part.
The next step depends on your DB size and request overhead. If those are inexpensive, then pull a list of titles from your DB, and see which ones occur in the eBay text (a single command in many languages). If that works for you, then even that pre-processing is likely unnecessary overhead.
If the full DB listing is expensive, but the DB is indexed well, then try grabbing likely n-grams (say, 2-3 words) from the eBay text, and searching for them in the DB. You should get relatively few return values, which you can then try in toto against the full eBay text for a match.
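As a rough sketch of the cheap-DB path (pre-processing plus a substring check) - the null-phrase list and the shape of the title data here are just illustrative assumptions:

import re

NULL_PHRASES = ['perfect condition', 'fast shipping', 'brand new', 'free shipping']  # illustrative only

def clean_listing_title(title):
    # Lower-case, strip filler phrases and punctuation, collapse whitespace.
    title = title.lower()
    for phrase in NULL_PHRASES:
        title = title.replace(phrase, ' ')
    title = re.sub(r'[^a-z0-9 ]+', ' ', title)
    return re.sub(r'\s+', ' ', title).strip()

def match_titles(listing_title, book_titles):
    # book_titles: an iterable of titles already pulled from the DB.
    cleaned = clean_listing_title(listing_title)
    return [t for t in book_titles if t.lower() in cleaned]

For the expensive-DB case, you would instead generate 2-3 word n-grams from the cleaned listing title and query the indexed title column with those.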
What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
I can also think of having a lookup hash table with names of countries and cities and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text. So the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
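As one Python option (spaCy rather than Stanford NER, but the same named-entity idea), location extraction is only a few lines; this sketch assumes the en_core_web_sm model has been downloaded:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def extract_locations(text):
    # GPE = countries/cities/states, LOC = other locations (mountains, rivers, ...).
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ('GPE', 'LOC')]

print(extract_locations("Just flew from New York to Beijing via London"))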
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
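A minimal sketch of that idea in Python, using bisect for the binary search and the greedy multi-word extension described above (the location list is obviously a toy one):

import bisect

locations = sorted(['3rd street', 'berlin', 'new york', 'new york city', "people's republic of china"])

def has_prefix(sorted_list, prefix):
    # True if some entry in the sorted list starts with this prefix.
    i = bisect.bisect_left(sorted_list, prefix)
    return i < len(sorted_list) and sorted_list[i].startswith(prefix)

def find_locations(text, sorted_list=locations):
    words = text.lower().split()
    matches = []
    for start in range(len(words)):
        phrase = ''
        for end in range(start, len(words)):
            phrase = words[end] if not phrase else phrase + ' ' + words[end]
            if not has_prefix(sorted_list, phrase):
                break  # nothing in the list can start this way; give up on this start position
            i = bisect.bisect_left(sorted_list, phrase)
            if i < len(sorted_list) and sorted_list[i] == phrase:
                matches.append(phrase)  # exact (possibly multi-word) match
    return matches

print(find_locations("I moved from Berlin to New York City last year"))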
How fast are the tweets coming in? As in, is it the full Twitter firehose or some filtered stream?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter text because of all the leetspeak. The NLP step can be tuned for precision or recall depending on your needs, to limit the number of lookups performed against the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.
I've been a long time browser here, but never have had a question that wasn't already asked. So here goes:
I've run into a problem using SOLR search where some searches on SOLR (let's say DVD Players) tend to return a lot of search results from the same manufacturer in the first 50 results.
Now, assuming that I want to provide my end users with the best search experience, but also the best variety of products in my catalog, how would I go about applying a kind of demerit so that the same brand doesn't show up in the search results more than 5 times? For the record, I'm using a fairly standard DisMax search handler.
This logic would only be applied to extremely broad queries like 'DVD Players', or 'Hard Drives', and naturally I wouldn't use it to shape 'Samsung DVD Players' search results.
I don't know if SOLR has a nifty feature that does this automatically, or if I would have to start modifying search handler logic.
I haven't used this but I believe field collapsing / grouping would be what you want.
http://wiki.apache.org/solr/FieldCollapsing
If I understand this feature correctly, it groups similar results much like http://news.google.com/ does by grouping similar news stories.
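If the brand lives in its own field (I'll assume it's called manufacturer; the core URL is likewise an assumption), the grouping parameters look roughly like this - a sketch using Python's requests:

import requests

SOLR_URL = 'http://localhost:8983/solr/products/select'  # assumed core location

params = {
    'q': 'dvd players',
    'defType': 'dismax',
    'group': 'true',
    'group.field': 'manufacturer',  # one group per brand
    'group.limit': 5,               # at most 5 documents returned per brand
    'rows': 10,                     # with grouping on, this is the number of groups
    'wt': 'json',
}
response = requests.get(SOLR_URL, params=params).json()
for group in response['grouped']['manufacturer']['groups']:
    print(group['groupValue'], group['doclist']['numFound'])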
Some ideas here, although I've not tried them myself.
You can use the Carrot clustering plugin for Solr to cluster the search results, let's say on manufacturer, and then feed that into a custom RequestHandler that re-orders the results (cherry-picking from each manufacturer cluster) for diversity.
However, there are downsides to this approach: you may need to fetch more results than necessary, and the final ordering will be synthetic.
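The cherry-picking step itself is straightforward once you have the clusters; a plain-Python sketch of a round-robin interleave over per-manufacturer result lists (each assumed to be sorted by score already):

from itertools import zip_longest

def interleave_clusters(clusters):
    # Take the best remaining document from each cluster in turn, for variety.
    merged = []
    for round_of_docs in zip_longest(*clusters):
        merged.extend(doc for doc in round_of_docs if doc is not None)
    return merged

print(interleave_clusters([['sony-1', 'sony-2'], ['lg-1'], ['samsung-1', 'samsung-2', 'samsung-3']]))
# -> ['sony-1', 'lg-1', 'samsung-1', 'sony-2', 'samsung-2', 'samsung-3']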
Achieving this is a lengthy and complex process, but it's worth trying. Let's say the main field you are searching on is a single field called title. First, you'll need to make sure that all the documents containing "dvd player" get the same score. You can do this by neutralizing Solr's scoring parameters such as field norms (set omitNorms=true) and term frequency (write a Solr plugin to ignore it; code attached).
Implementation Details:
1) compile the following class and put it into Solr WEB-INF/classes
package my.package;

import org.apache.lucene.search.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        // Ignore term frequency: any number of occurrences scores the same.
        return freq > 0 ? 1.0f : 0.0f;
    }
}
2) In schema.xml, register the new similarity class:
<similarity class="my.package.CustomSimilarity"/>
All this ensures that every document with "dvd player" in its title gets the same score. After that, you can define a field of random type. Then, when you query Solr, sort first by score and then by the random field. Since the score for all documents containing "dvd player" is the same, the results end up ordered by the random field, giving the customer a better variety of products from your catalog.
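On the query side, assuming the schema defines a random_* dynamic field backed by solr.RandomSortField (the stock Solr example schema ships with one), the compound sort could be issued like this - a sketch with requests, with the URL and seed just illustrative:

import requests

# Assumes a dynamic field such as:
# <dynamicField name="random_*" type="random" indexed="true" stored="false"/>
# where the "random" field type is solr.RandomSortField.
SOLR_URL = 'http://localhost:8983/solr/products/select'

params = {
    'q': 'title:"dvd player"',
    # Primary sort on score (identical after the tf/norm changes above),
    # then break ties with a seeded random field for a shuffled order.
    'sort': 'score desc, random_1337 desc',
    'rows': 50,
    'wt': 'json',
}
docs = requests.get(SOLR_URL, params=params).json()['response']['docs']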
I am trying to create a website which returns a random interesting website. The way I am doing this is I am creating a large word pool (over 10,000 words) randomly selecting several words out of it and then sending them in to a search engine (Bing, Google etc...).
The words in the pool will be ranked based on how users of the website rate the sites they are given, and bad words will be removed from the pool. After the first query returns, some further optimization will be done on the returned set of websites to select the best one.
What I need for this to work from the beginning is a decent list of words that are good and that give many results, also when paired with other words. Is there a place where I can find a large list of words that will return better websites?
So, what I am looking for is a (very large) list of words optimized for searches, anyone got ideas?
Maybe if someone has a good way of creating random queries, that would be good too, because simply selecting 3 random English words does not create a good query.
Per a google search for 'english wordlists download'
http://www.net-comber.com/wordurls.html
I hope this helps.
To get a list of words optimized for searches, you can use http://www.google.com/insights/search/# and query it iteratively for each date in, say, the last 2 years.
Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and still get answers in under 0.5 seconds! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
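The postings-list walk itself is the classic two-pointer merge; a toy sketch in Python over sorted lists of document IDs:

def intersect_postings(postings_a, postings_b):
    # Both lists are sorted by document ID; walk them together in O(len(a) + len(b)).
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

david = [2, 5, 9, 14, 21, 30]   # doc IDs containing 'david'
john = [3, 5, 10, 14, 22, 30]   # doc IDs containing 'john'
print(intersect_postings(david, john))  # -> [5, 14, 30]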
I think you're approaching the problem from the wrong angle.
Google doesn't keep its tables/indices on a single machine. Instead, it partitions its dataset heavily across its servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, sum(score) total_score, count(score) matches FROM rev_index
WHERE word IN ('david', 'john') GROUP BY document_id HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set, and a simple count determines whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
Searching for 'david' resulted in me setting the relevant bit in one of the bitmaps to signify that the record contained the word 'david'. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
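In Python the same trick can be sketched with arbitrary-precision integers standing in for the bitmaps (bit i set means record i contains the word):

NUM_RECORDS = 128 * 1024  # matches the 128K-bit bitmaps described above

def build_bitmap(matching_record_ids):
    # Set bit i for every record id i whose record contains the word.
    bitmap = 0
    for rec_id in matching_record_ids:
        bitmap |= 1 << rec_id
    return bitmap

david_bitmap = build_bitmap([2, 7, 1024, 99999])
john_bitmap = build_bitmap([7, 500, 99999])

both = david_bitmap & john_bitmap  # binary 'and' of the two bitmaps

# Scan the result bitmap for set bits to recover the matching record numbers.
matches = [i for i in range(NUM_RECORDS) if (both >> i) & 1]
print(matches)  # -> [7, 99999]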
This technique wouldn't work for google though, so consider this my $0.02 worth.