Screen Scraping Efficiency

We are going to be scraping thousands of websites each night to update client data, and we are in the process of deciding which language we would like to use to do the scraping.
We are not locked into any platform or language, and I am simply looking for efficiency. If I have to learn a new language to make my servers perform well, that is fine.
Which language/platform will provide the highest scraping efficiency per dollar for us? Really, I'm looking for real-world experience with high-volume scraping. It will be about maximizing CPU, memory, and bandwidth.

You will be I/O-bound anyway; the performance of your code won't matter at all (unless you're a really bad programmer...).

Using a combination of Python and Beautiful Soup, it's incredibly easy to write screen-scraping code very quickly. There is a learning curve for Beautiful Soup, but it's worth it.
Efficiency-wise, I'd say it's just as quick as any other method out there. I've never done thousands of sites at once, but I'd wager that it's definitely up to the task.

For web scraping I use Python with lxml and a few other libraries: http://webscraping.com/blog
I/O is the main bottleneck when crawling - to download data at a good rate you need to use multiple threads.
I cache all downloaded HTML to disk, so memory use stays low.
Often after crawling I need to re-scrape different features from that cache, and then CPU becomes important.
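A minimal sketch of that pattern (not the exact code behind that blog): threaded downloads with each page cached to disk, so a later re-scrape only parses the cached copies. The URL list and cache directory are placeholders:
import os, hashlib, queue, threading
import requests
import lxml.html

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URL list
cache_dir = "html_cache"
os.makedirs(cache_dir, exist_ok=True)
q = queue.Queue()
for u in urls:
    q.put(u)

def worker():
    while True:
        try:
            url = q.get_nowait()
        except queue.Empty:
            return
        path = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest() + ".html")
        if not os.path.exists(path):
            html = requests.get(url, timeout=30).text
            with open(path, "w", encoding="utf-8") as f:
                f.write(html)
        # parse from the cached copy; re-running only re-parses, no re-download
        tree = lxml.html.parse(path)
        print(url, tree.findtext(".//title"))

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()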

If you know C, a single-stream synchronous transfer (the "easy" interface) is a short day's work with libcURL. Multiple asynchronous streams (the "multi" interface) are a couple of hours more.
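If C itself isn't appealing, the same easy/multi split is available from Python through pycurl, which wraps libcurl. A rough sketch of the multi interface, with placeholder URLs:
import pycurl
from io import BytesIO

urls = ["http://example.com", "http://example.org"]  # placeholder URLs
multi = pycurl.CurlMulti()
transfers = []
for url in urls:
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    multi.add_handle(c)
    transfers.append((c, buf))

# drive all transfers concurrently on a single thread
num_active = len(transfers)
while num_active:
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    if num_active:
        multi.select(1.0)  # wait for sockets to become ready instead of spinning

for c, buf in transfers:
    print(c.getinfo(pycurl.EFFECTIVE_URL), len(buf.getvalue()), "bytes")
    multi.remove_handle(c)
    c.close()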

With the volume that thousands of sites would require, you may be better off economically by looking at commercial packages. They eliminate the IO problem, and have tools specifically designed to handle the nuances between every site, as well as post-scraping tools to normalize the data, and scheduling to keep the data current.

I would recommend Web Scraping Language.
Compare a simple WSL query:
GOTO example.com >> EXTRACT {'column1':td[0], 'column2': td[1]} IN table.spad
with the following example:
# Python 2 / BeautifulSoup 3 syntax
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())
for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

Related

What language would provide stable processing of millions of dirty addresses into standardized format?

I'm currently using NodeJS to create a program that takes uncleaned, mistyped, dirty addresses and converts them into a standardized format with all components found or filled in, for further use in digital use cases (such as a computer being able to recognize all the addresses, which is improbable given how dirty they can be).
However, I feel I've finally hit the limit of NodeJS's capabilities for this task: there's so much data that it either takes hours to run, or the multithreading is unstable and crashes constantly. So I'd like to get the pros and cons of other languages that might be a good fit for this case.
I guess Python will do the job.
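If you do try Python, the standard-library multiprocessing module is one way around shaky multithreading: separate worker processes and no shared state. A rough sketch, where normalize_address stands in for whatever cleaning logic you already have and the file names are placeholders:
import csv
from multiprocessing import Pool

def normalize_address(raw):
    # placeholder for the real cleaning/standardization logic
    return " ".join(raw.upper().split())

if __name__ == "__main__":
    with open("dirty_addresses.txt", encoding="utf-8") as f:  # hypothetical input file
        raw_addresses = [line.strip() for line in f if line.strip()]
    with Pool(processes=8) as pool:
        cleaned = pool.map(normalize_address, raw_addresses, chunksize=1000)
    with open("clean_addresses.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerows(zip(raw_addresses, cleaned))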

Trying to apply the concept of coroutines to existing code

I have a very basic sitemap scraper built in Python 3 using requests and lxml. The aim is to build a database of the URLs of a certain website. Currently it works as follows: for each top-level sitemap to be scraped, I trigger a Celery task. In this task, the sitemap is parsed to check whether it's a sitemapindex or a urlset. Sitemapindexes point to other sitemaps hierarchically, whereas urlsets point to end URLs - they're like the leaves of the tree.
If the sitemap is identified as a sitemapindex, each URL it contains, which points to a sub-sitemap, is processed in a separate thread, repeating the process from the beginning.
If the sitemap is identified as a urlset, the URLs within are stored in the database and this branch finishes.
I've been reading about coroutines, asyncio, gevent, async/await, etc and I'm not sure if my problem is suitable to be developed using these technologies or whether performance would be improved.
As far as I've read, coroutines are useful when dealing with I/O operations, to avoid blocking execution while the I/O operation is running. However, I've also read that they're inherently single-threaded, so I understand there's no parallelization when, e.g., the code starts parsing the XML response from the I/O operation.
So essentially the questions are: how could I implement this using coroutines/asyncio/insert_similar_technology, and would I benefit from it performance-wise?
Edit: by the way, I know Scrapy has a specialized SitemapSpider, just in case anyone suggests using it.
Sorry, I'm not sure I fully understand how your code works, but here are some thoughts:
Does your program download multiple URLs?
If yes, asyncio can be used to reduce the time your program spends waiting on network I/O. If not, asyncio won't help you.
How does your program download the URLs?
If one by one, then asyncio can help you grab them much faster. On the other hand, if you're already grabbing them in parallel (with different threads, for example), you won't get much benefit from asyncio.
I advise you to read my answer about asyncio here. It's short and it can help you understand why and when to use asynchronous code.
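For the sitemap case in the question, here's a rough sketch of what an asyncio version could look like, using aiohttp for the downloads and lxml for the parsing; the top-level sitemap URL is a placeholder and the database write is stubbed out:
import asyncio
import aiohttp
from lxml import etree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

async def process_sitemap(session, url, found_urls):
    async with session.get(url) as resp:
        body = await resp.read()
    root = etree.fromstring(body)
    locs = [e.text.strip() for e in root.iter(SITEMAP_NS + "loc") if e.text]
    if root.tag == SITEMAP_NS + "sitemapindex":
        # a sitemapindex: recurse into every sub-sitemap, all fetched concurrently
        await asyncio.gather(*(process_sitemap(session, loc, found_urls) for loc in locs))
    else:
        # a urlset: these are the leaf URLs; stand-in for the database write
        found_urls.extend(locs)

async def main(top_level_sitemaps):
    found_urls = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(process_sitemap(session, u, found_urls) for u in top_level_sitemaps))
    return found_urls

urls = asyncio.run(main(["https://example.com/sitemap.xml"]))  # placeholder top-level sitemap
print(len(urls), "URLs collected")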

Does different language = different performance in couchDB lists?

I am writing a list function in CouchDB. I want to know if using a faster language than JavaScript would boost performance (I was thinking Python, just because I know it).
Does anyone know if this is true, and has anyone tested whether it is true?
Generally the different view engines are going to give you the same speed.
Except Erlang, which is much faster.
The reason for this is that Erlang is what CouchDB is written in; for all other languages the data needs to be converted into standard JSON, sent to the external view server, and then converted back into the native Erlang format for writing.
BUT this performance "boost" only happens on view generation, which typically happens out-of-line of a request, or only on the changed documents.
In other words, in real-world usage the performance difference between view servers is irrelevant most of the time.
Here is the list of all the view server implementations: http://wiki.apache.org/couchdb/View_server
I've never used the python ones, but if that is where you are comfortable, go for it.
You can use the V8 engine for Couch if you want. A guy from IrisCouch wrote couchjs to do this (I've seen him on Stack Overflow quite a bit too).
https://github.com/iriscouch/couchjs
Also, for views, filtered replication, and things like that, you can write the functions in Erlang instead of JavaScript. I've done that and seen around a 50% performance increase.
Seems you can write list functions in Erlang: http://tisba.de/2010/11/25/native-list-functions-with-couchdb/

Best method to screen-scrape data off of many different websites

I'm looking to scrape public data off of many different local government websites. This data is not provided in any standard format (XML, RSS, etc.) and must be scraped from the HTML. I need to scrape this data and store it in a database for future reference. Ideally the scraping routine would run on a recurring basis and only store the new records in the database. I need an easy way to tell the new records from the old ones on each of these websites.
My big question is: What's the best method to accomplish this? I've heard some use YQL. I also know that some programming languages make parsing HTML data easier as well. I'm a developer with knowledge in a few different languages and want to make sure I choose the proper language and method to develop this so it's easy to maintain. As the websites change in the future the scraping routines/code/logic will need to be updated so it's important that this will be fairly easy.
Any suggestions?
I would use Perl with modules WWW::Mechanize (web automation) and HTML::TokeParser (HTML parsing).
Otherwise, I would use Python with the Mechanize module (web automation) and the BeautifulSoup module (HTML parsing).
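A rough sketch of that Python route using the newer requests/BeautifulSoup 4 combination, including one simple way to store only new records: hash each scraped row and let the database ignore hashes it has already seen. The URL and table layout are placeholders:
import hashlib
import sqlite3
import requests
from bs4 import BeautifulSoup

db = sqlite3.connect("records.db")
db.execute("CREATE TABLE IF NOT EXISTS records (hash TEXT PRIMARY KEY, data TEXT)")

html = requests.get("http://example.com/public-data").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
for row in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if not cells:
        continue
    record = " | ".join(cells)
    digest = hashlib.sha1(record.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE skips rows we've stored before, so only new records land
    db.execute("INSERT OR IGNORE INTO records (hash, data) VALUES (?, ?)", (digest, record))
db.commit()
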
I agree with David about Perl and Python. Ruby also has Mechanize and is excellent for scraping. The only one I would stay away from is PHP, due to its lack of scraping libraries and clumsy regex functions. As far as YQL goes, it's good for some things, but for scraping it really just adds an extra layer of things that can go wrong (in my opinion).
Well, I would use my own scraping library or the corresponding command line tool.
It can use templates which can scrape most web pages without any actual programming, normalize similar data from different sites to a canonical format and validate that none of the pages has changed its layout...
The command line tool doesn't support databases, though; there you would need to program something...
(on the other hand Webharvest says it supports databases, but it has no templates)

NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that wikipedia requests no crawling and provides this export tool instead. Project Gutenberg has a policy specified here, bottom line, try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They are some work to clean up, but well worth it, and they contain a lot of useful data in the links.
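For anyone taking the same route: the dump is one huge XML file of <page> elements, so it can be streamed instead of loaded whole. A minimal sketch using only the standard library, assuming a downloaded pages-articles dump (the path is a placeholder):
import bz2
import xml.etree.ElementTree as ET

dump_path = "enwiki-20090306-pages-articles.xml.bz2"  # placeholder: path to the downloaded dump
with bz2.open(dump_path, "rb") as f:
    context = ET.iterparse(f, events=("start", "end"))
    _, root = next(context)                   # the <mediawiki> root element
    ns = root.tag[:root.tag.index("}") + 1]   # namespace prefix, e.g. {http://www.mediawiki.org/xml/export-0.3/}
    for event, elem in context:
        if event == "end" and elem.tag == ns + "page":
            title = elem.findtext(ns + "title")
            text = elem.findtext(ns + "revision/" + ns + "text") or ""
            # text is still raw wiki markup; template/table/link cleanup goes here
            print(title, len(text))
            root.clear()                      # drop finished pages so memory stays flat
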
Use the Wikipedia dumps (they need lots of cleanup).
See if anything in nltk-data helps you (the corpora are usually quite small).
The Wacky people have some free corpora (tagged; you can spider your own corpus using their toolkit).
Europarl is free and is the basis of pretty much every academic MT system (spoken language, translated).
The Reuters Corpora are free of charge, but only available on CD.
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
If you do this commercially, the LDC might be a viable alternative.
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
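If you go the RSS route, feedparser keeps it to a few lines. A rough sketch that collects entry summaries into a plain-text corpus file; the feed URLs are placeholders:
import re
import feedparser

feeds = ["http://example.com/blog/feed", "http://example.org/news/rss"]  # placeholder feeds
with open("corpus.txt", "w", encoding="utf-8") as out:
    for feed_url in feeds:
        parsed = feedparser.parse(feed_url)
        for entry in parsed.entries:
            text = entry.get("summary", "")
            text = re.sub(r"<[^>]+>", " ", text)  # crude strip of any residual HTML tags
            out.write(text.strip() + "\n")
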
Scraping mailing lists and/or the Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project in construction that allows linguistic processing on arbitrary web pages. It should be ready for use within the next few weeks, but it's so far not really meant to be a scraper. I could write a module for it, I guess; the functionality is already there.
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
Wikipedia seems to be the best way. Yes you'd have to parse the output. But thanks to wikipedia's categories you could easily get different types of articles and words. e.g. by parsing all the science categories you could get lots of science words. Details about places would be skewed towards geographic names, etc.
You've covered the obvious ones. The only other areas that I can think of to supplement:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
Looking into the Wikipedia data, I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible; it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
You can get quotations content (in limited form) here:
http://quotationsbook.com/services/
This content also happens to be on Freebase.
