Can anyone recommend a library that really supports RTL languages such as Hebrew or Arabic, and that also supports links, images, and thumbnails/in-place images, handles large spreadsheets, can set filters and page breaks, and has good performance? These are very important criteria.
I started to use PDFKit, which is really good, but its language support is awful and the data we work with cannot be manipulated; because of that, we need to change libraries. Price is not a problem. Someone recommended Docmosis; does anyone have experience with it?
Thank you very much
Docmosis has customers that have been generating RTL documents successfully for years. Docmosis supports the image handling requirements you mention and large data sets, but it has no direct spreadsheet functionality.
I suggest doing a free trial (please note I work for Docmosis) and asking their support team about specific points of interest. The cloud is the easiest way to test functionality (since you have no install or setup to do), even if you wouldn't choose the cloud for your actual deployment.
I am trying to create shipping labels for a lot of different customers by filling in forms on the UPS website. Is there a programmatic way of doing this?
It is different from the usual auto-filling of a web form, because the name, address, etc. fields aren't filled with "constants": 100 customers need 100 different forms.
Before I dig into python-mechanize or AutoIt's IE.au3, is there an easier way of doing this?
UPDATE 2019-09-09: Generally, I would no longer recommend FF.au3 unless you're very much into AutoIt or have tons of legacy code. I'd rather suggest using Selenium with a "proper" programming language (see the sketch after the setup steps below).
You could check out FF.au3 for AutoIt. Together with Firefox and MozRepl, it allows for web automation, including dynamic websites/forms.
The feature set should be sufficient for your task (e.g. XPath for content extraction and for filling out forms; just have a look at the link and you'll get an idea of what it can do). It's also fairly easy to use.
The downside is that it's not the most performant approach, and I've encountered a bug, but that doesn't say much. Overall it worked well for me for small and medium-ish projects.
Setup:
Install AutoIt: https://www.autoitscript.com/site/autoit-tools/
Get the FF.au3 lib: https://www.thorsten-willert.de/index.php/software/autoit/ff/ff-au3
Get an old Firefox version (<v57 or ESR; see the remarks on the FF.au3 page above)
Install MozRepl: http://legacycollector.org/firefox-addons/264678/index.html
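And here is a minimal sketch of the Selenium route from the update above, in Python. The URL and form-field names are hypothetical placeholders; inspect the real form to find the actual ones:

    # Minimal Selenium sketch (Python). URL and field names are placeholders.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    customers = [
        {"name": "Jane Doe", "address": "1 Example St"},
        {"name": "John Roe", "address": "2 Sample Ave"},
    ]

    driver = webdriver.Firefox()  # or webdriver.Chrome()
    for customer in customers:
        # Load the form fresh for each customer and fill in their data.
        driver.get("https://example.com/shipping-form")  # placeholder URL
        driver.find_element(By.NAME, "name").send_keys(customer["name"])
        driver.find_element(By.NAME, "address").send_keys(customer["address"])
        driver.find_element(By.NAME, "submit").click()
    driver.quit()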
I have an email account whose sole purpose is to store interesting and useful links to programming articles, code, and blog posts. It has become a little knowledge base of sorts. I can even do a search on it, which is pretty cool.
However, after using this account for a couple of years, I now have 775 links, and it has become this big blob of amorphous information, most of which I have never looked at again. I take comfort in the fact that, if I really needed to, I could find something in there again, if I even remember putting it in there in the first place. But it has developed a "smell," if you will.
How are you organizing your programming library of cool stuff? Do you have a system or tool, and is it better than the way I'm doing it?
I would use something that is made for storing bookmarks. I use delicious.com for all of my bookmarks. The tagging system works perfectly for technology sites because you can tag each page with a specific language or tech abbreviation. This coupled with the Delicious Bookmarks plugin will make it very easy to tag sites and get back to them.
Use one word or abbreviations for languages: java c# vb.net python
Use acronyms for technologies: wpf wcf
I used to use the standard bookmark system in the browser, but since I bounce around between various machines and browsers throughout the day, I started using bookmark synchronizers: both Foxmarks and the one that Google came out with. I wasn't completely satisfied with either. Plus, Delicious has a great web interface as well as a decent API to extend for your own purposes.
IMHO, using Evernote to store this information is great.
1) you can go back and search through it easily
2) organize by tags and "notebooks collections"
3) available on multiple platforms (even mobile)
4) available as browser plugins (for direct archiving in-browser)
The only drawback is that its copy-paste functionality is a little lacking (it sometimes doesn't import/display the CSS styles correctly).
Otherwise, it's a great alternative to store web "bookmarks" (and also archive the content at the same time).
Hi, I want to create a desktop app (C#, probably) that scrapes or manipulates a form on a third-party web page. Basically I enter my data in the form in the desktop app; it goes away to the third-party website and, using the script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!
Not having done much (any!) work in this area, I was wondering whether a scripting language like Perl, Python, Ruby, etc. would allow me to do this. Or should I simply do all the scraping using C# and .NET? Which one is best, in your opinion?
I was thinking of a script because I may need to hook the same script into applications on different platforms (e.g. Symbian mobile, where I wouldn't be able to develop it in C# as I would the desktop version).
It's not a web app, otherwise I may as well use the original site. I realise it all sounds pointless, but the automation for this specific form would be a real time saver for me.
Do not forget to look at BeautifulSoup; it comes highly recommended.
See, for example, options-for-html-scraping.
If you need to select a programming language for this task, I'd say Python.
For a more direct solution to your question, see twill, a simple scripting language for web browsing.
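As a minimal sketch of the BeautifulSoup approach mentioned above (assuming the current bs4 package; the URL is a placeholder):

    # Minimal BeautifulSoup sketch: fetch a page and list its links.
    # Assumes the bs4 package (pip install beautifulsoup4).
    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen("http://example.com/").read()
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a"):
        print(a.get("href"), a.get_text(strip=True))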
I use C# for scraping. See the helpful HtmlAgilityPack package.
For parsing pages, I either use XPath or regular expressions. .NET can also easily handle cookies if you need that.
I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. The end result is that for most situations I can just call GetRequest/PostRequest and get an HtmlDocument back.
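The same wrapper idea sketched in Python, for comparison (using the third-party requests library; the class and method names here are made up for illustration, this is not the author's C# code):

    # Rough Python analogue of the wrapper described above, not the
    # author's C# class. requests.Session persists cookies across calls,
    # and HTTPAdapter(max_retries=...) retries on network errors.
    import requests
    from requests.adapters import HTTPAdapter

    class ScrapingClient:
        def __init__(self, retries=3):
            self.session = requests.Session()
            adapter = HTTPAdapter(max_retries=retries)
            self.session.mount("http://", adapter)
            self.session.mount("https://", adapter)

        def get_request(self, url):
            response = self.session.get(url, timeout=30)
            response.raise_for_status()
            return response.text  # feed this to your HTML parser

        def post_request(self, url, data):
            response = self.session.post(url, data=data, timeout=30)
            response.raise_for_status()
            return response.text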
You could try using the .NET HTML Agility Pack:
http://www.codeplex.com/htmlagilitypack
"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."
C# is more than suitable for your screen scraping needs. .NET's Regex functionality is really nice. However, with such a simple task, you'll be hard pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.
The built-in screen scraping functionality is also top notch.
We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)
Here is some example runnable code on the Keplar blog:
Better competitive intelligence through scraping with Groovy
IMO, Perl's built-in regular expression functionality and its ability to manipulate text would make it a pretty good contender for screen scraping.
Ruby is pretty great!
Try its hpricot and mechanize libraries.
Groovy is very good.
Example:
http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html
Groovy and HtmlUnit is also a very good match:
http://groovy.codehaus.org/Testing+Web+Applications
HtmlUnit will simulate a full browser, with JavaScript support.
PHP is a good contender due to its good Perl-compatible regex support and its cURL library.
HTML Agility Pack (C#)
Pros: simple to use.
Cons: XPath is borked; the way the HTML is cleaned to make it XML-compliant drops tags, and you have to adjust the expression to get it to work.
Mozilla Parser (Java)
Pros: solid XPath support; the best solution for XPath.
Cons: you have to set environment variables before it will work, which is a pain; casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain; it dies on non-standard HTML (0.3 fixes this); there are problems accessing data on Nodes in a NodeList (use a for (int i = 1; i <= list_size; i++) loop to get around that).
Beautiful Soup (Python)
I don't have much experience with it, but here's what I've found.
Pros: nice interface for navigating HTML.
Cons: no XPath support.
Overall, I prefer Mozilla HTML Parser.
Take a look at HP's Web Language (formerly WEBL).
http://en.wikipedia.org/wiki/Web_Language
Or stick with WebClient in C# and some string manipulation.
I second the recommendation for Python (or Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), while also having the benefit of letting you handle projects like this quickly (not too much overhead in dealing with low-level stuff). It's also relatively cross-platform.
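For example, the cookie handling mentioned above boils down to a few lines. A minimal Python 3 sketch (the URLs and form fields are placeholders):

    # Python 3 sketch: a cookie-aware opener via http.cookiejar and urllib.
    import urllib.parse
    import urllib.request
    from http.cookiejar import CookieJar

    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Log in once; the session cookie lands in the jar automatically.
    form = urllib.parse.urlencode({"user": "me", "pass": "secret"}).encode()
    opener.open("http://example.com/login", form)  # placeholder URL/fields

    # Later requests through the same opener reuse the stored cookie.
    page = opener.open("http://example.com/protected").read()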
What are the best open-source image gallery engines, both stand-alone and for existing frameworks such as WordPress or Drupal?
Hopefully we can build a good list here over time.
Gallery is the classic choice. It has skins, security layers, heaps of plugins, etc., but can easily be run with the default settings if you want to. I've used it for years.
Good question! Lots of people ask this in many web forums, so hopefully we will get some good responses and end up with a good list of solutions.
Personally I always used to suggest something like Gallery or some other open-source script, but recently I have found myself more and more using a simple PHP script which just spits out a list of images (maybe 7 per page) and relies on a JavaScript library such as MooTools or Ext to provide all the functionality, particularly for small or individual galleries. I'm particularly loving the noobslide MooTools class at the moment, which has some lovely gallery effects.
Noobslide
I suppose at the end of the day it's all down to what you need. There will be no one answer that fits everyone, but hopefully a number of different solutions that suit different people's needs will show up here.