Get random site links in bash [duplicate] - linux

Possible Duplicate:
Get random site names in bash
I'm making a program for university that has to find occurrences of words on the web. I need an algorithm that finds sites, counts the number of times each word is used, and then records the words and sorts them by how often they are used. Therefore, the more sites my program checks, the better. At first I was thinking of generating random IPs, but the problem is that the process takes far too long (I left the computer searching the whole night and it found only 15 sites). I guess this is because sites' IPs aren't distributed evenly across the address space and most IPs belong to users or other services. Now I have a couple of new approaches in mind and I wanted to know what you guys think:
What if I make random searches through Google using some sort of dictionary? The dictionary would start out empty, and each time I perform a search I would check one site and add to the dictionary only the words that occur once, so that later searches won't send me back to the same site and corrupt the occurrence counts.
Is this easy?
The first thing I want to do is also search random result pages in the Google search, not only the first one. How can this be done? I can't figure out how to calculate the maximum number of result pages for a search, or how to jump directly to a specific page.
thanks

While I don't think you could (or should) do this in bash alone, take a look at the Google Custom Search API and this question. It lets you query Google search programmatically.
As for what queries to use, you could resort to picking words randomly from a dictionary file - though that would not give you a uniform distribution, as words like 'cat' are far more popular than 'epichorial', say. If you need something that takes those differences into account, you can use a word frequency dictionary, although that seems to be the point of your research in itself, so perhaps it would not be appropriate.
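Below is a minimal sketch of that suggestion (in Python rather than bash), assuming the documented Custom Search JSON API endpoint; the API key, search engine ID, and word-list path are placeholders you would need to fill in. The start parameter is the documented way to request a later result page, which also speaks to the question about jumping to a specific page.

    # Sketch: pick a random word from a local dictionary file and query the
    # Google Custom Search JSON API for a random result page.
    # API_KEY and ENGINE_ID are placeholders from the Google developer console.
    import random
    import requests

    API_KEY = "YOUR_API_KEY"
    ENGINE_ID = "YOUR_ENGINE_ID"   # a "search the entire web" engine

    def random_word(path="/usr/share/dict/words"):
        with open(path) as f:
            words = [w.strip() for w in f if w.strip().isalpha()]
        return random.choice(words)

    def search(query, start=1):
        # 'start' is the 1-based index of the first result to return,
        # so start=11 fetches the second page of 10 results.
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "start": start},
            timeout=10,
        )
        resp.raise_for_status()
        return [item["link"] for item in resp.json().get("items", [])]

    word = random_word()
    print(word, search(word, start=random.choice([1, 11, 21])))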

Related

Does NLTK have just a list of emotive/sentiment words?

I'm looking to try and improve my first ever machine learning attempt.
At the moment I've been getting a good ~90% on my tweet data, using the entire word list as my feature list.
I would like to filter this feature list so it only includes words relevant to gauging sentiment (i.e. words like good, bad, happy, and not robot/car/technology).
Anyone have some advice?
I've made use of their stop words, but for this purpose non-sentiment words like "technology" aren't really stop words.
My main approach is just to manually filter out all the words I think won't help, although this assumes I will always use the same input data.
One thing that comes to mind is the AFINN list, have you run across it? It's a list of English words rated for valence with an integer between minus five (negative) and plus five (positive), created specifically for microblogs. A quick search for AFINN on Google also turns up a lot of interesting resources.
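As a rough illustration, here is how you might use it to cut a feature list down to sentiment-bearing words, assuming a local copy of the tab-separated AFINN-111 file (one word<TAB>score pair per line); the file path is a placeholder.

    # Sketch: keep only tokens that carry an AFINN valence score, dropping
    # non-sentiment words like "technology" from the feature list.
    def load_afinn(path="AFINN-111.txt"):
        scores = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                word, score = line.rstrip("\n").split("\t")
                scores[word] = int(score)
        return scores

    def sentiment_features(tokens, afinn):
        return {tok: afinn[tok] for tok in tokens if tok in afinn}

    afinn = load_afinn()
    print(sentiment_features(["good", "robot", "bad", "car"], afinn))
    # expected: only 'good' and 'bad' survive, with their AFINN scores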

Efficiently searching large list of strings

I have a large list of strings which needs to be searched by the user of an iPhone/Android app. The strings are sorted alphabetically, but that isn't actually all that useful, since a string should be included in the results if the search query falls anywhere inside it, not just at the beginning. As the user types their search query, the results should update to reflect what they have entered so far (e.g. if they type "cat", it should display the results for "c", "ca", and "cat" as they type).
My current approach is the following:
I have a stack of "search results", which starts out empty. If the user types something to make the search query longer, I push the current search results onto the stack, then search through only the current search results for the new ones (it's impossible for something to be in the full string list but not the current results in this case).
If the user hits backspace, I only need to pop the search results off of the stack and restore them. This can be done nearly instantaneously.
This approach is working great for "backwards" searching (making the search query shorter) and cases where the search query is already long enough for the number of results to be low. However, it still has to search through the full list of strings in O(n) time for each of the first few letters the user types, which is quite slow. (A small sketch of this approach is shown below, after the question.)
One approach I've considered is to have a pre-compiled list of results of all possible search queries of 2 or 3 letters. The problem with this approach is that it would require 26^2 or 26^3 such lists, and would take up a pretty large amount of space.
Any other optimizations or alternate approaches you can think of?
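For concreteness, a minimal sketch of the stack-of-results approach described in the question, using in-memory lists and illustrative names:

    # Sketch of the incremental filtering described above: push the current
    # result list when the query grows, filter only within it, and pop on
    # backspace.
    class IncrementalSearch:
        def __init__(self, all_strings):
            self.query = ""
            self.results = list(all_strings)
            self.stack = []   # previous result lists, one per typed character

        def type_char(self, ch):
            self.stack.append(self.results)
            self.query += ch
            # Only strings already in the current results can match the longer query.
            self.results = [s for s in self.results if self.query in s.lower()]
            return self.results

        def backspace(self):
            if self.stack:
                self.query = self.query[:-1]
                self.results = self.stack.pop()
            return self.results

    search = IncrementalSearch(["a large stake", "category", "cat scan"])
    print(search.type_char("c"))   # ['category', 'cat scan']
    print(search.type_char("a"))   # ['category', 'cat scan']
    print(search.backspace())      # back to the results for "c"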
You should consider using a prefix tree (trie) to build a precomputed index. I'm not sure showing results for 'c', 'ca', and 'cat' on a per-character basis is a good idea. For example, say the user is searching for the word 'eat': your algorithm would have to find all the strings that contain 'e', then 'ea', and finally 'eat', most of which will be of no use to the user. For a phone app, it would probably be better to match on a word basis. A multi-word string can be tokenized, so searching for 'stake' in 'a large stake' will work fine, but searching for 'take' will not.
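A toy sketch of that word-level trie idea (an in-memory structure for illustration, not a tuned mobile implementation):

    # Sketch: tokenize each string and index every token in a trie, so a typed
    # prefix like "sta" finds "a large stake" but "take" does not.
    class TrieNode:
        def __init__(self):
            self.children = {}
            self.strings = set()   # original strings reachable via this prefix

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, original):
            for token in original.lower().split():
                node = self.root
                for ch in token:
                    node = node.children.setdefault(ch, TrieNode())
                    node.strings.add(original)

        def search(self, prefix):
            node = self.root
            for ch in prefix.lower():
                if ch not in node.children:
                    return set()
                node = node.children[ch]
            return node.strings

    trie = Trie()
    trie.insert("a large stake")
    print(trie.search("sta"))    # {'a large stake'}
    print(trie.search("take"))   # set()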
I notice that Google, and I imagine others, do not provide a full list when only 1 or 2 characters have been pressed. In your case, perhaps a good starting point is to only begin populating the search query results when the user has typed a minimum of 3 characters.
For later versions, if it's important, you can take a cue from the way Google does it, and do more sophisticated processing: keep track of the actual entries that previous users have selected, and order these by frequency. Then, run a cron job every day on your server to populate a small database table with the top 10 entries starting with each letter, and if only 1 or 2 letters have been pressed, use the results from this small table instead of scanning the full list.
You could use a compressed suffix tree.

Counts of web search hits [closed]

I have a set of approximately 10 million search queries. The goal is to collect the number of hits returned by a search engine for each of them. For example, Google returns about 47,500,000 hits for the query "stackoverflow".
The problem is that:
1- The Google API is limited to 100 queries per day. This is far from sufficient for my task, since I would have to get a very large number of counts.
2- I used the Bing API, but it does not return an accurate number - accurate in the sense of matching the number of hits shown in the Bing UI. Has anyone come across this issue before?
3- Issuing search queries to a search engine and parsing the HTML is one solution, but it runs into CAPTCHAs and does not scale to this number of queries.
All I care about is the number of hits, and I am open to any suggestion.
Well, I was really hoping that someone would answer this, since it's something I was also interested in finding out, but since it doesn't look like anyone will, I'll throw in these suggestions.
You could set up a series of proxies that change their IP every 100 requests so that you can query Google as seemingly different people (seems like a lot of work). Or you could download Wikipedia and write something to parse the data so that when you search a term you can see how many pages it appears in (a rough sketch of this follows below). Of course that is a much smaller dataset than the whole web, but it should get you started. Another possible data source is the Google n-grams data, which you can download and parse to see how many books and pages the search terms appear in. Maybe a combination of these methods could boost the accuracy for any given search term.
Certainly none of these methods are as good as getting the Google page counts directly, but understandably that is data they don't want to give out for free.
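A rough sketch of the Wikipedia idea, assuming a locally downloaded MediaWiki XML dump; the filename is a placeholder, and the tag names follow the standard export schema, whose namespace version varies between dumps (hence the endswith check).

    # Sketch: stream a MediaWiki XML dump and count how many pages mention
    # each query term. Requires Python 3.8+ for the {*} namespace wildcard.
    import bz2
    import xml.etree.ElementTree as ET

    def page_counts(dump_path, terms):
        terms = [t.lower() for t in terms]
        counts = {t: 0 for t in terms}
        with bz2.open(dump_path, "rb") as dump:
            for _, elem in ET.iterparse(dump, events=("end",)):
                if elem.tag.endswith("}page"):
                    text_elem = elem.find(".//{*}text")
                    if text_elem is not None and text_elem.text:
                        page_text = text_elem.text.lower()
                        for t in terms:
                            if t in page_text:
                                counts[t] += 1
                    elem.clear()   # discard the finished page to bound memory
        return counts

    print(page_counts("enwiki-latest-pages-articles.xml.bz2", ["stackoverflow"]))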
I see this is a very old question but I was trying to do the same thing which brought me here. I'll add some info and my progress to date:
Firstly, the reason you get an estimate that can change wildly is because search engines use probabilistic algorithms to calculate relevance. This means that during a query they do not need to examine all possible matches in order to calculate the top N hits by relevance with a fair degree of confidence. That means that when the search concludes, for a large result set, the search engine actually doesn't know the total number of hits. It has seen a representative sample though, and it can use some statistics about the terms used in your query to set an upper limit on the possible number of hits. That's why you only get an estimate for large result sets. Running the query in such a way that you got an exact count would be much more computationally intensive.
The best I've been able to achieve is to refine the estimate by tricking the search engine into looking at more results. To do this you need to go to page 2 of the results and then modify the 'first' parameter in the URL to go way higher. Doing this may allow you to find the end of the result set (this definitely worked for me last year, although today it only got me up to the first few thousand results). Even if it doesn't let you reach the end of the result set, you will see that the estimate gets better as the query engine considers more hits.
I found Bing slightly easier to use in the above way - but I was still unable to get an exact count for the site I was considering. Google seems to be actively preventing this use of their engine which isn't that surprising. Bing also seems to hit limits although they looked more like defects.
For my use case I was able to get both search engines to fairly similar estimates (148k for Bing, 149k for Google) using the above technique. The highest hit count I was able to get from Google was 323 whereas Bing went up to 700 - both wildly inaccurate but not surprising since this is not their intended use of the product.
If you want to do it for your own site you can use the search engine's webmaster tools to view indexed page count. For other sites I think you'd need to use the search engine API (at some cost).

Optimize random query in search engine

I am trying to create a website which returns a random interesting website. The way I am doing this is by creating a large word pool (over 10,000 words), randomly selecting several words from it, and then sending them to a search engine (Bing, Google, etc.).
The words in the pool will be ranked based on how users of the site rate the websites they are given, and bad words will then be removed from the pool. After the first query, some further optimization will be done on the returned set of websites to select the best one.
What I need for this to work from the start is a decent list of words that are good and will give many results, even when paired with other words. Is there a place where I can find a large list of words that will return better websites?
So, what I am looking for is a (very large) list of words optimized for searches, anyone got ideas?
Maybe if someone has a good way of creating random queries that would be good too, because simply selecting 3 random English words does not create a good query.
Per a google search for 'english wordlists download'
http://www.net-comber.com/wordurls.html
I hope this helps.
To get a list of words optimized for searches, you could use http://www.google.com/insights/search/# and call it iteratively for each date in, say, the last 2 years.
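As a rough sketch of pulling these ideas together, you could weight word selection by frequency rather than picking uniformly; the file format here (one "word count" pair per line) is an assumption about whatever wordlist you end up downloading.

    # Sketch: build multi-word queries by sampling words weighted by frequency,
    # so common, search-friendly words appear more often than obscure ones.
    import random

    def load_frequencies(path="word_frequencies.txt"):   # placeholder file
        words, weights = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue
                word, count = parts
                words.append(word)
                weights.append(int(count))
        return words, weights

    def random_query(words, weights, n_words=3):
        return " ".join(random.choices(words, weights=weights, k=n_words))

    words, weights = load_frequencies()
    print(random_query(words, weights))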

Why doesn't Google offer partial search? Is it because the index would be too large?

Google/GMail/etc. don't offer partial or prefix search (e.g. stuff*), though it could be very useful. Often I can't find a mail in GMail because I don't remember the exact expression.
I know there is stemming and such, but it's not the same, especially if we talk about languages other than English.
Why doesn't Google add such a feature? Is it because the index would explode? But databases offer partial search, so surely there are good algorithms to tackle this problem.
What is the problem here?
Google doesn't actually store the text that it searches. It stores search terms, links to the page, and where in the page the term exists. That data structure is indexed in the traditional database sense. I'd bet using wildcards would make the index of the index pretty slow and as Developer Art says, not very useful.
Google does search partial words. Gmail does not, though. Since you ask what the problem is, my answer is lack of effort. This problem has a solution that lets you search in time proportional to the query length and linear space, though it is not very cache friendly: suffix trees. Suffix arrays are another option that is more cache-friendly and still time efficient.
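A toy sketch of the suffix-array idea: build a sorted list of suffix start positions once, then answer substring queries by binary search (a naive build, for illustrating the principle only).

    # Sketch: naive suffix array plus binary search for substring lookup.
    def build_suffix_array(text):
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_substring(text, sa, pattern):
        # Lower-bound binary search for the first suffix starting with `pattern`.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        matches = []
        for i in sa[lo:]:
            if text[i:i + len(pattern)] != pattern:
                break
            matches.append(i)
        return sorted(matches)

    text = "carpets and carrots in cars"
    sa = build_suffix_array(text)
    print(find_substring(text, sa, "car"))   # start offsets of every "car"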
It is possible via the Google Docs - follow this article:
http://www.labnol.org/internet/advanced-gmail-search/21623/
Google Code Search can search based on regular expressions, so they do know how to do it. Of course, the amount of data Code Search has to index is tiny compared to the web search. Using regex or wildcard search in the web search would increase index size and decrease performance to impractical levels.
The secret to finding anything in Google is to enter a combination of search terms (or quoted phrases) that are very likely to be in the content you are looking for, but unlikely to appear together in unrelated content. A wildcard expression does the opposite of this. Just enter the terms you expect the wildcard to match, keeping in mind that Google will do stemming for you. Back in the days when computers ran on steam, Lycos (iirc) had pattern matching, but they turned it off several years ago. I presume it was putting too much load on their servers.
Because you can't sensibly derive what is meant by car*:
Cars?
Carpets?
Carrots?
Google's algorithms compare document texts, as well as external inbound links, to determine what a document is about. With wildcards, all of these algorithms break down.
