web crawling, ruby, python, cassandra - cassandra

I need to write a script that inserts one million records of usernames or emails, gathered by crawling the web, into a database.
The script can be in any language, such as Python, Ruby, or PHP.
Please let me know if this is possible, and if so, how I can build such a script.
Thanks

You should also look at Apache Nutch and Apache Gora, which would do what you're looking for. Nutch does the actual crawling, while Gora stores the results in Cassandra, Hive, or MySQL.

It's possible, though it may take some time depending on your machine's performance and your internet connection. You could use PHP's cURL library to automatically send web requests, and then you could easily parse the data using a library such as Simple HTML DOM or the native PHP DOM. But beware of running out of memory; I also highly recommend running the script from the shell rather than a web browser. Also consider using the multi cURL functions to speed up the process.
This is extremely easy and fast to implement, although multi-threading would give a huge performance boost in this scenario, so I suggest using one of the other languages you proposed. I know you could do this in Java easily using the Apache HttpClient library, and manipulate the DOM and extract data using native XPath support, regex, or one of the many third-party DOM implementations in Java.
I also strongly recommend checking out the Java library HtmlUnit, which could make your life much easier, though you might take a performance hit for it. A good multi-threading implementation would give a huge performance boost, but a bad one could make your program run worse.
Here are some resources for Python:
http://docs.python.org/library/httplib.html
http://www.boddie.org.uk/python/HTML.html
http://www.tutorialspoint.com/python/python_multithreading.htm
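
As a rough illustration of the fetch-and-parse loop described above, here is a minimal Python sketch that downloads a list of seed URLs in a thread pool using only the standard library. The URLs are placeholders and the parsing step is left out; this is just to show the shape of the multi-threaded fetching, not a finished crawler.

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder seed list -- replace with the pages you actually want to crawl.
    SEED_URLS = [
        "http://example.com/page1",
        "http://example.com/page2",
    ]

    def fetch(url):
        """Download a page and return its HTML, or None on failure."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    # A small thread pool gives the multi-threading speedup mentioned above.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, html in zip(SEED_URLS, pool.map(fetch, SEED_URLS)):
            if html:
                print(url, len(html), "bytes")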

I would add a little on the crawling side.
You said crawl the web, so the crawling direction (i.e. after fetching a page, which link to visit next) becomes very important. But if you already have a list of web pages (called a seed URL list), then you simply need to download them and parse out the required data. If you just need to extract email addresses, then regex is your option, because HTML does not have any tag for emails, so an HTML DOM parser wouldn't help you.
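As a rough illustration of that regex approach, here is a minimal Python sketch; the pattern is deliberately simple and the sample HTML is made up, so expect to tighten it for real-world addresses:

    import re

    # A deliberately simple email pattern; real-world addresses have many edge cases.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def extract_emails(html):
        """Return the unique email-like strings found in a page's HTML."""
        return set(EMAIL_RE.findall(html))

    sample = '<p>Contact us at info@example.com or sales@example.org</p>'
    print(extract_emails(sample))  # {'info@example.com', 'sales@example.org'}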

Related

Apache Nutch: crawl only new pages for semantic analysis

I plan to tune Nutch 2.2.x in such a way that, after an initial crawl of a list of sites, I launch the crawler daily and get the HTML or plain text of only the new pages that appeared on those sites that day. Number of sites: hundreds.
Please note that I'm not interested in updated pages, only new ones. Also, I need new pages only starting from a certain date; let's suppose it is the date of the initial crawl.
Reading the documentation and searching the web, I came up with the following questions, which I can't find answered anywhere else:
Which backend should I use with Nutch for my task? I need a page's text only once, then I never return to it. MySQL doesn't seem to be an option, as it is no longer supported by Gora. I tried using HBase, but it seems I have to roll back to Nutch 2.1.x to get it working correctly. What are your ideas? How can I minimize disk space and other resource utilization?
Can I perform my task without using an indexing engine like Solr? I'm not sure I need to store large full-text indexes. Can Nutch >2.2 be launched without Solr, and does it need specific options to be launched that way? The tutorials don't explain this clearly: everybody needs Solr, except me.
If I'd like to add a site to the crawling list, how should I best do it? Let's suppose I'm already crawling a list of sites and want to add a site to the list to monitor it from now on. So I need to crawl the new site, skipping page content, to add it to the WebDB, and then run the daily crawl as usual. For Nutch 1.x it may be possible to perform separate crawls and then merge them. What might this look like for Nutch 2.x?
Can this task be performed without custom plugins, and can it be performed with Nutch at all? Perhaps I could write a custom plugin which somehow detects whether a page is already indexed or is new, and then puts the content into XML, a database, etc. Should I write a plugin at all, or is there a way to solve the task with less effort? And what might the plugin's algorithm look like, if there is no way to live without one?
P.S. There are a lot of Nutch questions/answers/tutorials, and I honestly searched the web for two weeks, but haven't found answers to the questions above.
I'm not using Solr either. I just checked this documentation: https://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
It seems like there are command-line tools that can show the data fetched using the WebDB. I'm new to Nutch, but I just follow this documentation. Check it out.

In Haskell, how can I write an HTTP client to traverse a website and submit forms?

I'm pretty sure Network.Browser is the library I want to use, but I'm not sure how to use it. I'm a Haskell newbie. I've read Learn You a Haskell and a third of Real World Haskell.
I want to write a program that visits a website, logs in to it (which would require submitting a form and cookies), and then gives me the HTML of some pages.
I'd like to see some examples of how to do these things. The documentation only gives one example. Also, please teach a man to fish: if there's some other place I should be looking to find examples (IMO the best way to learn how to use a library), I'd like to know. Reading the API documentation isn't cutting it.
Side note: the library named Shpider looks perfect, but I'm on Windows and I can't figure out how to install and use curl, which is one of the libraries it depends on.

Is it possible to create an app for a site without an API?

I would like to create an app for a myBB forum, so that the forum will look nicer and much cleaner on an iPhone or Android.
Is it possible without an API? It isn't my site either.
Everything is possible; it's just a matter of resources...
Technically, you can write an app for anything on the web, but:
an API will tell you how you can do things with the site, without having to reverse engineer all pages/posts/... and the format of every output resulting from POST/GET operations. Reverse engineering may take a long time, and you will surely not come across all possible results (error pages, bad authentication...);
an API is quite stable and is always updated with great care by the developers so as not to break existing applications. Without an API, there is no guarantee that your app will not break with the next release of the forum when it is upgraded;
a web API generally defines an output format which is easily parseable: many APIs output XML or JSON, which can be processed with standard libraries. Without an API, the output format is plain HTML, which may be difficult to reorganize in order to show the results in a different format.
So, yes, you can definitely write an app for a myBB forum, but it may require a fair amount of work.
You can; it's called screen scraping, and it's what was done before XML, the semantic web, SOAP, web services, and then JSON APIs tried to solve the problem better.
In screen scraping, you grab the site's HTML, parse it, get the data you want out of it, and then do what you need with that data. It's more work, and it breaks each time the site's layout changes, hence the history of improvements to it.
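To make that grab-parse-extract loop concrete, here is a minimal Python sketch. The URL and the "topictitle" class name are placeholders I made up, not anything specific to myBB; you would need to inspect the real forum's markup first.

    import urllib.request
    from html.parser import HTMLParser

    class TitleGrabber(HTMLParser):
        """Collect the text of every <a> tag whose class contains 'topictitle'.
        The class name is a made-up example; check the real forum's HTML first."""
        def __init__(self):
            super().__init__()
            self._in_link = False
            self.titles = []

        def handle_starttag(self, tag, attrs):
            cls = dict(attrs).get("class") or ""
            if tag == "a" and "topictitle" in cls:
                self._in_link = True

        def handle_data(self, data):
            if self._in_link and data.strip():
                self.titles.append(data.strip())

        def handle_endtag(self, tag):
            if tag == "a":
                self._in_link = False

    # Grab the HTML, parse it, and pull out the thread titles.
    html = urllib.request.urlopen("http://forum.example.com/index.php").read().decode("utf-8", "replace")
    parser = TitleGrabber()
    parser.feed(html)
    print(parser.titles)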
You mention the site in question is not yours. Many sites do not regard screen scraping as fair use, so check the site's terms and conditions to make sure you can legally create an app from the data posted there.
You could also consider using HTML5... do you think that would be doable for your app?

Are there any building blocks for a search engine that will scrape other sites?

I want to build a search service for one particular thing. The data is freely available out there via free classifieds services and a host of other sites.
Are there any building blocks, e.g. open-source crawlers, that I could customize, rather than building from scratch?
Any advice on building such a product? Not just technical, but any privacy/legal things that I might need to take into consideration.
E.g. do I need to 'give credit' where the results are from and put a link to the original - if I get them from many places?
Edit: By the way, I am using GWT with JS for the front-end, haven't decided on the language for the back-end. Either PHP or Python. Thoughts?
There are a few building blocks in Python you can use.
BeautifulSoup [http://www.crummy.com/software/BeautifulSoup/] for parsing HTML. It can handle bad markup too, and its API is very easy... way better than any DOM-like tool for me. My friend used it to scrape his old phpBB forum with success. It has pretty good docs.
mechanize [http://wwwsearch.sourceforge.net/mechanize/] is a browser-simulating HTTP client library. It handles cookies, form filling, and so on. Also easy to use, but it helps if you understand how HTTP works.
http://dev.scrapy.org/ -- this is a relatively new thing: a whole scraping framework based on Twisted. I haven't played with it much.
I use the first two for my needs; for example, it takes about 20 lines of code to build an automatic testing tool for a 3-stage poll, with simulated waits for the user entering data and so on.
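As a rough idea of how little code the BeautifulSoup route takes, here is a minimal sketch using the current bs4 package; the URL is a placeholder for whichever classifieds page you want to index:

    import urllib.request
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    # Placeholder URL -- point it at one of the listing pages you want to index.
    html = urllib.request.urlopen("http://example.com/listings").read()
    soup = BeautifulSoup(html, "html.parser")

    # Pull out every link and its text -- BeautifulSoup tolerates broken markup.
    for a in soup.find_all("a", href=True):
        print(a["href"], a.get_text(strip=True))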
I made a screen-scraper in Ruby that took like five minutes. Apparently this dude has it down to 60 seconds! I'm not sure if Ruby is as scalable or fast as what you're looking for, but I've never seen a faster route to a proof-of-concept or a prototype.
The secret is a library called "hpricot", which was built for exactly this purpose.
I don't know anything about PHP or Python or what's available for those development systems/languages.
Good luck!

"tag cloud" generators?

I would like to add a "tag cloud" to a project I'm working on. I see tons of them via Google, but they mostly seem to be of the "enter a URL" type.
Here's an example of what I mean:
I'm looking for one which has either:
a nice web-accessible api
a standalone local executable (linux preferred)
a linkable library (c,python preferred)
of course, other options and suggestions appreciated!
Update: it seems what I am looking for is commonly called a tag cloud and not a text cloud, even though I am interested in using it to view blocks of text.
Update 2: the Most Excellent Jonathan Feinberg and IBM have released Wordle... hooray!!!
http://www.wordle.net
This question is old and already answered, but I would like to say that Wordcram seems to be very nice. And it's open source.
I'm not sure if you are referring to a simple (ala Flickr) tag cloud, or something a little more complicated like Wordle.
Anyway, if you are looking for a simple tag cloud, it wouldn't be too difficult to implement it yourself (as long as you already have the ability to render HTML) as it is just changing the size and/or colour of each item based on its frequency (or some other measure).
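For illustration, here is a tiny Python sketch of that idea, scaling font size linearly with word frequency and emitting plain HTML; the size range is an arbitrary choice:

    from collections import Counter

    def tag_cloud_html(text, min_px=12, max_px=40):
        """Return a simple HTML tag cloud where font size scales with word frequency."""
        counts = Counter(text.lower().split())
        lo, hi = min(counts.values()), max(counts.values())
        spans = []
        for word, n in sorted(counts.items()):
            # Linear interpolation between the smallest and largest font sizes.
            size = min_px if hi == lo else min_px + (max_px - min_px) * (n - lo) / (hi - lo)
            spans.append('<span style="font-size:%dpx">%s</span>' % (int(round(size)), word))
        return " ".join(spans)

    print(tag_cloud_html("ruby python python crawl crawl crawl cassandra"))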
If you want to use an existing library, you could look at one of the open-source PHP versions, like Tag Cloud, but just run them locally on your machine using PHP rather than through a web server. Just install PHP and run php filename.php, similar to how you would execute a Python script.
Looking at the Wordle service, there appears to be no way to create one automatically, as they use a Java applet to generate the graphics, which cannot easily be scripted using curl. They do have a question in their FAQ about an API, however:
Could you expose Wordle as a web service that generates images?
A scalable web service should take no more than a few tens of milliseconds to do its work. To create a Wordle requires multiple seconds in a Java runtime. (That pretty animation is not for show; it's really laying things out during the animation.) Therefore, Wordle will always apportion the CPU-intensive stuff to you, the user, and your CPU.
As of this writing, Wordle is sustaining 10 hits per second. There's no way on Earth to render Wordles at that speed. Well, there is a way, but it involves way more money than I've got.
Also, this previous question may help.
Here are two Python versions of a tag cloud:
https://github.com/atizo/PyTagCloud
http://peekaboo-vision.blogspot.de/2012/11/a-wordcloud-in-python.html
I searched a lot these days, and it seems that those two are among the few "standalone" tag cloud generators that run on Linux (in particular, they run in Python) on the command line.
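If it helps, the word cloud generator from the second link can, as far as I can tell, be driven from a short script like the one below. The package name, parameters, and method names here are my assumptions; check the project's README for the exact API.

    from wordcloud import WordCloud  # pip install wordcloud (API details assumed; see the project's README)

    # Read a plain-text file and render a PNG entirely from the command line.
    text = open("input.txt").read()
    cloud = WordCloud(width=800, height=400).generate(text)
    cloud.to_file("cloud.png")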
