Optimize random query in search engine - search

I am trying to create a website that returns a random interesting website. To do this, I am building a large word pool (over 10,000 words), randomly selecting several words from it, and sending them to a search engine (Bing, Google, etc.).
The words in the original pool will be ranked by the site's users, based on how they rate the website they are given, and bad words will then be removed from the pool. After the first query, some further optimization will be done on the returned set of websites to select the best one.
What I need for this to work from the beginning is a decent list of words that are good on their own and also give many results when paired with other words. Is there a place where I can find a large list of words that will return better websites?
So, what I am looking for is a (very large) list of words optimized for searches. Anyone got ideas?
A good way of creating random queries would help too, because simply selecting 3 random English words does not create a good query.

Per a google search for 'english wordlists download'
http://www.net-comber.com/wordurls.html
I hope this helps.

To get a list of words optimized for searches, you can use http://www.google.com/insights/search/# and call it iteratively for each date in, say, the last 2 years.

Related

Does NLTK have just a list of emotive/sentiment words?

I'm looking to try and improve my first ever machine learning attempt.
At the moment, I've been getting a good ~90% for my tweet data, using the entire word list as my feature list.
I would like to filter this feature list so that it only includes words relevant for gauging sentiment (i.e. words like good, bad, happy, and not robot/car/technology).
Anyone have some advice?
I've made use of their stop words, but non-sentiment words like "technology" aren't really stopwords.
My main approach is to just manually filter out all the words I think won't help, although this assumes I will always use the same input data.
One thing that comes to mind is the AFINN list. Have you run across it? It's a list of English words rated for valence with an integer between minus five (negative) and plus five (positive), created specifically for microblogs. A quick search for AFINN on Google also turns up a lot of interesting resources.
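A minimal sketch of how such a valence list could be used to cut the feature list down to sentiment-bearing words. The "word, tab, score" line layout and the three sample entries are assumptions made for illustration; check them against the actual file you download.

```python
def load_valence_lexicon(lines):
    """Parse 'word<TAB>score' lines into a dict (the layout is an assumption)."""
    lexicon = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, score = line.rsplit("\t", 1)
        lexicon[word] = int(score)
    return lexicon


def sentiment_features(tokens, lexicon):
    """Keep only the tokens that have a valence score in the lexicon."""
    return [t for t in (tok.lower() for tok in tokens) if t in lexicon]


# Illustrative placeholder entries, not copied from the real list.
SAMPLE_LINES = ["good\t3", "bad\t-3", "happy\t3"]
LEXICON = load_valence_lexicon(SAMPLE_LINES)

tweet = "My new robot car is good and I am happy".split()
features = sentiment_features(tweet, LEXICON)
```

Because the filter is driven by the lexicon rather than by a hand-made list, it does not depend on always seeing the same input data.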

Advice on how to search and return strings

Forgive me, this will be my first ever post to SO, so do let me know how I can improve.
I am currently looking for advice on a problem I am facing. I have a list of one billion unique strings of text. These text strings also have a list of tags associated with them to indicate the content of the string.
Example:
StringText: The cat ate on Sunday
AnimalCode: c001
ActionCode: a001
TimeCode: d001
where
c001 = The cat
a001= ate
d001 = on Sunday
I have loaded all of the strings and their codes as individual documents in an instance of MongoDB.
At present, I am trying to devise a method by which I can enter a string and search against the database to return the match. My problem is that the search is taking far too long to return results.
I have created an index on the StringText field, but am guessing that it is too large to hold in memory.
Each string has an equal probability of being searched for so I can't reliably predict which strings have a higher probability of being searched for and pull them out into another collection.
Currently, I am running the DB off a single box with 16GB of RAM and a 4TB HDD.
Does anybody have any advice on how I might accomplish my task more efficiently? Is Mongo the right technology or are there others more adept at doing this kind of search and return?
My goal (forgive me if foolish) would be to try and return a result within 2 seconds or less.
I am very new to this whole arena so any and all advice would be welcome.
Thanks much to all in advance for the help and time.
Sincerely,
Zinga
As discussed in the comments, you could preprocess the input string to find the associated Animal and Action codes and search for StringText based on the indexed codes, which is much faster than text search.
You can't totally avoid text search, so reduce it to the Animal and/or Action collection by tokenizing the input string. See how you can use map/reduce techniques just for queries of this sort.
In your case, if you know that the first word or two will always contain the name of the animal, just use those one or two words to search for the relevant animal. Searching through the Animal/Actions collection shouldn't take long. In case it does, you can keep a periodically updating list of most common animals/actions (based on their frequency) and search against that to make it faster. This is also discussed in the articles on the linked page.
If even after that your search against StringText is slow, you could shard the StringText collection by Animal/Action codes. The official documentation should suffice for this, and there's not much involved in the setup, so you might try this anyway. The basic idea everywhere is to restrict your target space as much as possible. Searching through a billion records for every query is plain overkill. Cache where you can, preprocess where you can, show guesses while you run a slow query.
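The preprocessing idea above can be sketched in plain Python, with dictionaries standing in for the indexed code fields. The phrase tables, the `STRINGS_BY_CODES` mapping and the function names are all hypothetical; in MongoDB the last step would be a query on the indexed code fields rather than a dict lookup.

```python
# Hypothetical phrase -> code tables, taken from the question's example.
ANIMAL_CODES = {"the cat": "c001"}
ACTION_CODES = {"ate": "a001"}
TIME_CODES = {"on sunday": "d001"}

# Stand-in for a compound index on (AnimalCode, ActionCode, TimeCode).
STRINGS_BY_CODES = {
    ("c001", "a001", "d001"): ["The cat ate on Sunday"],
}


def find_code(text, table):
    """Return (code, rest) for the longest phrase in `table` that starts `text`."""
    for phrase in sorted(table, key=len, reverse=True):
        if text == phrase or text.startswith(phrase + " "):
            return table[phrase], text[len(phrase):].strip()
    return None, text


def lookup(query):
    """Map the query to codes, then fetch candidates by code instead of by text."""
    normalized = " ".join(query.lower().split())
    rest = normalized
    codes = []
    for table in (ANIMAL_CODES, ACTION_CODES, TIME_CODES):
        code, rest = find_code(rest, table)
        if code is None:
            return []
        codes.append(code)
    candidates = STRINGS_BY_CODES.get(tuple(codes), [])
    # The exact text comparison now runs over a handful of candidates only.
    return [s for s in candidates if s.lower() == normalized]
```

Calling `lookup("The cat ate on Sunday")` resolves the three codes first and compares text only within the small bucket those codes select.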
Good luck!

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
I can also think of keeping a lookup hash table with the names of countries and cities, and then comparing every token extracted from the text against that hash table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high number of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
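A minimal sketch of the sorted-list idea, using Python's `bisect` module for the binary search. Splitting on whitespace is a naive assumption about what a "word" is (punctuation stays attached to tokens), and the function name is mine. Multi-word names are handled by extending the phrase for as long as some entry in the list still starts with it, and remembering the longest exact match.

```python
import bisect


def find_locations(text, locations):
    """Return locations found in `text`, preferring the longest match at each word."""
    gazetteer = sorted(loc.lower() for loc in locations)
    words = text.lower().split()
    found = []
    i = 0
    while i < len(words):
        phrase = ""
        best_end = None
        for j in range(i, len(words)):
            phrase = words[j] if j == i else phrase + " " + words[j]
            pos = bisect.bisect_left(gazetteer, phrase)
            if pos == len(gazetteer) or not gazetteer[pos].startswith(phrase):
                break  # no entry starts with this phrase, stop extending
            if gazetteer[pos] == phrase:
                best_end = j  # exact match; keep going in case a longer one exists
        if best_end is None:
            i += 1  # nothing starts here, move on to the next word
        else:
            found.append(" ".join(words[i:best_end + 1]))
            i = best_end + 1
    return found
```

Each word costs a few binary searches, so the work per tweet stays small even with a large location list.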
How fast are the tweets coming in? As in is it the full twitter fire hose or some filtering queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter text because of all the leet speak. The NLP can be tuned for precision or recall depending on your needs, to limit the number of lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Get random site links in bash [duplicate]

Possible Duplicate:
Get random site names in bash
I'm writing a program for university that has to count the occurrences of words on the web. I need an algorithm that finds sites, counts the words used on them, records the counts and sorts the words by how often they are used. Therefore, the more sites my program checks, the better. At first I thought of generating random IPs, but that takes far too long (I left the computer searching all night and it found only 15 sites). I guess this is because sites' IPs aren't distributed evenly across the address space and most IPs belong to users or other services. Now I have a couple of new approaches in mind and I'd like to know what you think:
What if I make random searches through Google using some sort of dictionary? The dictionary would start out empty, and each time I perform a search I would check one site and add to the dictionary only the words that occur once, so that a later search won't send me to that site again and corrupt the occurrence counts.
Is this easy?
The first thing I want to do is to also pick random pages of the Google search results, not only the first one. How can this be done? I can't figure out how to calculate the maximum number of result pages for a search, or how to go directly to a specific page.
Thanks
While I don't think you could (or should) do this in bash alone, take a look at Google Custom Search API and this question. It lets you query Google search programmatically.
As for what queries to use, you could pick words at random from a dictionary file, though that would not give you a uniform distribution, as words like 'cat' are more popular than 'epichorial', say. If you need something that takes those differences into account, you can use a word frequency dictionary, although building one seems to be the point of your research, so perhaps that would not be appropriate.
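A rough sketch of the frequency-weighted variant, using `random.choices` from the standard library. The counts in `WORD_FREQUENCIES` are invented for illustration; a real word-frequency list would supply them.

```python
import random

# Hypothetical frequency counts; a real word-frequency list would supply these.
WORD_FREQUENCIES = {"cat": 5000, "music": 3000, "garden": 1200, "epichorial": 1}


def random_query(frequencies, k=3, rng=random):
    """Draw up to k distinct words, each draw weighted by word frequency."""
    pool = dict(frequencies)
    chosen = []
    for _ in range(min(k, len(pool))):
        words = list(pool)
        weights = [pool[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(word)
        del pool[word]  # sample without replacement
    return " ".join(chosen)


# A seeded generator makes the draw repeatable while testing.
query = random_query(WORD_FREQUENCIES, k=3, rng=random.Random(42))
```

Setting every weight to 1 gives the plain uniform pick from a dictionary file.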

How do search engines conduct 'AND' operation?

Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion on the number of keys and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and you still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
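To make the walk along two postings lists concrete, here is a small sketch. The function names and the toy documents are mine; it assumes both lists hold document IDs in ascending order, which is what lets a single pass find the intersection.

```python
def build_inverted_index(docs):
    """Map each term to an ascending list of IDs of the documents containing it."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def intersect_postings(a, b):
    """Intersect two ascending postings lists in O(len(a) + len(b))."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1
    return result


DOCS = {1: "David Smith", 2: "John and David", 3: "John Doe", 4: "david john"}
INDEX = build_inverted_index(DOCS)
both = intersect_postings(INDEX["david"], INDEX["john"])
```

Neither list is ever searched by brute force: each pointer only moves forward, so the cost is linear in the combined length of the two lists.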
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't keep its tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google have an archive of their publications available with lots of juicy information about their technologies. This thread on metafilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, SUM(score) AS total_score, COUNT(score) AS matches
FROM rev_index
WHERE word IN ('david', 'john')
GROUP BY document_id
HAVING COUNT(score) = 2  -- both terms present; avoids relying on alias support in HAVING
ORDER BY total_score DESC
This returns the IDs of all documents that have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It takes about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and a simple count determines whether all terms were matched (the 2 in the HAVING clause is the number of search terms).
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
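The single-pass query can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table layout follows the description above, the rows and scores are invented, and `COUNT(DISTINCT word)` is my own adjustment so that the match test does not depend on there being exactly one row per word and document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rev_index (document_id INTEGER, word TEXT, score REAL)")
conn.executemany(
    "INSERT INTO rev_index VALUES (?, ?, ?)",
    [
        (1, "david", 0.75), (1, "john", 0.5),
        (2, "david", 0.5),
        (3, "john", 0.75),
        (4, "david", 0.25), (4, "john", 0.25),
    ],
)


def search_all(conn, terms):
    """Return (document_id, total_score) for documents containing every term."""
    terms = sorted(set(terms))
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT document_id, SUM(score) AS total_score "
        "FROM rev_index "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY document_id "
        "HAVING COUNT(DISTINCT word) = ? "
        "ORDER BY total_score DESC"
    )
    return conn.execute(sql, (*terms, len(terms))).fetchall()


rows = search_all(conn, ["david", "john"])
```

Adding a third term only lengthens the IN list and raises the expected count; the shape of the query stays the same.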
I did something similar to this years ago on a 16 bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so there was a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
The search for "david" resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
This technique wouldn't work for google though, so consider this my $0.02 worth.
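The bitmap approach can be sketched with Python integers used as arbitrary-length bit sets. This is not the original 16 bit implementation, just an illustration of the same idea on invented records: one bit per record number, one bitmap per search word, and a bitwise AND to combine them.

```python
def build_bitmap(records, word):
    """Set bit n when record number n contains `word`."""
    bitmap = 0
    for number, text in enumerate(records):
        if word in text.lower().split():
            bitmap |= 1 << number
    return bitmap


def matching_records(bitmap):
    """Scan the bitmap and return the record numbers whose bit is set."""
    numbers = []
    number = 0
    while bitmap:
        if bitmap & 1:
            numbers.append(number)
        bitmap >>= 1
        number += 1
    return numbers


RECORDS = ["David John Smith", "John Doe", "David Jones", "john david"]
both = build_bitmap(RECORDS, "david") & build_bitmap(RECORDS, "john")
hits = matching_records(both)
```

The AND itself touches every bit once, which is why this works well for a bounded record count and poorly for a web-scale index.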

Resources