What's the best screen scraping language? [closed]

Closed 10 years ago.
Hi, I want to create a desktop app (probably in C#) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data into a form in the desktop app; the app then goes to the third-party website and, using a script or something similar in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!
Not having done much (any!) work in this area, I was wondering whether a scripting language like Perl, Python, or Ruby would allow me to do this. Or should I simply do all the scraping using C# and .NET? Which one is best, in your opinion?
I was thinking of a script because I may need to hook the same script into applications on different platforms (e.g. Symbian mobile, where I wouldn't be able to develop in C# as I would for the desktop version).
It's not a web app; otherwise I might as well use the original site. I realise it all sounds pointless, but automating this specific form would be a real time saver for me.

Do not forget to look at BeautifulSoup; it comes highly recommended.
See, for example, options-for-html-scraping.
If you need to select a programming language for this task, I'd say Python.
For a more direct solution to your question, see twill, a simple scripting language for web browsing.

I use C# for scraping. See the helpful HtmlAgilityPack package.
For parsing pages, I use either XPath or regular expressions. .NET can also easily handle cookies if you need that.
I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. The end result is that in most situations I can just call "GetRequest" or "PostRequest" and get an HtmlDocument back.
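The wrapper described above is not shown in the answer, so here is a rough sketch of the same idea in Python, using only the standard library. The class name Session, its parameters, and the retry policy are all assumptions made for illustration, and it returns the raw page text rather than a parsed document.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


class Session:
    """Hypothetical helper: one cookie jar shared by every request."""

    def __init__(self, retries=3, delay=1.0, opener=None):
        self.retries = retries
        self.delay = delay
        # The default opener stores cookies from responses and resends them.
        self.opener = opener or urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(CookieJar())
        )

    def _send(self, request):
        last_error = None
        for attempt in range(self.retries):
            try:
                with self.opener.open(request, timeout=30) as response:
                    return response.read().decode("utf-8", errors="replace")
            except (urllib.error.URLError, OSError) as error:
                # Assumed policy: retry any network or HTTP error.
                last_error = error
                if attempt < self.retries - 1:
                    time.sleep(self.delay)
        raise last_error

    def get(self, url):
        return self._send(urllib.request.Request(url))

    def post(self, url, fields):
        # Supplying a body makes urllib send a POST with form encoding.
        body = urllib.parse.urlencode(fields).encode("ascii")
        return self._send(urllib.request.Request(url, data=body))
```

With something like this, logging in and submitting a form would be two post() calls on the same Session, since the cookie set by the login response is sent back automatically on the second request. A real version would probably retry only on errors that are likely to be temporary.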

You could try using the .NET HTML Agility Pack:
http://www.codeplex.com/htmlagilitypack
"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

C# is more than suitable for your screen scraping needs. .NET's Regex functionality is really nice. However, with such a simple task, you'll be hard pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.
The built-in screen scraping functionality is also top notch.

We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)
Here is some example, runnable code on the Keplar blog:
Better competitive intelligence through scraping with Groovy

IMO Perl's built-in regular expression functionality and ability to manipulate text make it a pretty good contender for screen scraping.

Ruby is pretty great!
Try its Hpricot and Mechanize libraries.

Groovy is very good.
Example:
http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html
Groovy and HtmlUnit is also a very good match:
http://groovy.codehaus.org/Testing+Web+Applications
Htmlunit will simulate a full browser with Javascript support.

PHP is a good contender due to its good Perl-Compatible Regex support and cURL library.

HTML Agility Pack (C#)
  XPath is broken: the way the HTML is cleaned to make it XML-compliant drops tags, so you have to adjust the expression to get it to work
  simple to use
Mozilla Parser (Java)
  solid XPath support
  you have to set environment variables before it will work, which is a pain
  casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain
  dies on non-standard HTML (0.3 fixes this)
  best solution for XPath
  problems accessing data on Nodes in a NodeList
    use a for(int i=1;i<=list_size;i++) to get around that
Beautiful Soup (Python)
  I don't have much experience, but here's what I've found:
  no XPath support
  nice interface for pathing HTML
I prefer Mozilla HTML Parser.

Take a look at HP's Web Language (formerly WEBL).
http://en.wikipedia.org/wiki/Web_Language

Or stick with WebClient in C# and some string manipulations.

I second the recommendation for Python (or Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), and it has the benefit of letting you handle projects like this quickly (not too much overhead in dealing with low-level stuff). It's also relatively cross-platform.

Related

Is it viable to build a Web Dashboard in Clojure?

I am planning to build a web dashboard where I can analyze the financial records from a company through graphics, tables, ...
I already have the software, so the dashboard will only read the data, and not manipulate it.
It will be something like this, but simpler, containing reports, graphics, options to select dates, intervals, etc.
But I am wondering: is it viable to use Clojure, along with jQuery, CSS, and HTML?
Currently I work with the Luminus web framework for Clojure, but I am wondering whether it is worth doing this in Clojure or whether other languages are better suited to it.
Of course I am familiar with the language already, so it is a pro. But I am also open to suggestions.
It is not that hard at all! In fact, there are great libraries that solve all the challenges involved in building a dashboard: scheduling, caching, transferring data to the client, and visualization (with auto-reloading).
We are working on a framework for building realtime Clojure dashboards. Have a look at https://github.com/multunus/dashboard-clj. We have used the following libraries:
Immutant's scheduler for scheduling
Core.async to simplify data flow on the backend
Sente for websocket communication
re-frame for client side state and view management
Stuart Sierra's component library for managing stateful components
In order to create beautiful visualizations you may take a look at d3 or highcharts. CLJSJS and the Reagent cookbook will give a good overview of how to use these JS libraries (and many, many more).
Clojure is an absolutely fantastic tool for building a web dashboard. The other answers here do a pretty good job of laying out the landscape as far as basic web technologies go. On that side of things, I'll simply add that I'm a big Reagent / re-frame fan, and would go that route for a React wrapper rather than Om.
As far as data visualizations, you may be interested in checking out Vega-Lite & Vega, which you can use from Clojure or ClojureScript (Reagent) by using a simple but flexible dataviz library I wrote called Oz:
https://github.com/metasoarous/oz
Vega-Lite & Vega are designed based on the ideas of the Grammar of Graphics, which inspired R's popular ggplot2 library. The core idea is that data visualizations should be built according to declarative descriptions of how properties of the data map to aesthetics of the visualization. Vega-Lite & Vega however take things one step further in providing a grammar of interaction, which allows for the construction of interactive data visualizations and sophisticated explorer views. Moreover, it ups the ante on the declarative nature of the GG in that Vega-Lite and Vega specifications are described as pure data (JSON), making it very in line with the data-driven philosophy of the Clojure world, and paving the way for seamless interoperability with other languages and sharing features.
Vega-Lite is more or less the higher-level, day-to-day data science tool, focusing on providing high leverage and automation based on very spartan specifications. It compiles to Vega, which is somewhat lower level and more powerful, but less automated. Starting out with Vega-Lite and switching to Vega only as needed is usually sufficient.
For more on Vega & Vega-Lite see: https://vega.github.io.
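To make the "specifications are pure data" point concrete, here is a small sketch that builds a Vega-Lite bar chart specification as a plain dictionary and serializes it to JSON. It is written in Python only for illustration; the same structure can be expressed as a Clojure map. The sample records, field names, and the helper bar_chart_spec are made up for the example.

```python
import json

# Hypothetical monthly figures; a real dashboard would read these from its data store.
records = [
    {"month": "Jan", "revenue": 120},
    {"month": "Feb", "revenue": 95},
    {"month": "Mar", "revenue": 140},
]


def bar_chart_spec(rows, x_field, y_field):
    """Return a Vega-Lite bar chart specification as plain data."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        # The encoding block maps data fields to visual channels.
        "encoding": {
            "x": {"field": x_field, "type": "ordinal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


spec = bar_chart_spec(records, "month", "revenue")
print(json.dumps(spec, indent=2))
```

Because the specification is just nested maps and lists, it can be generated on the server, sent to the browser as JSON, and rendered there without any chart-specific code on the backend.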
I don't see any reasons why it wouldn't be viable to build a web dashboard in Clojurescript.
I suggest that you look into a library called reagent, which provides a minimalistic interface between React and ClojureScript, so theoretically everything you can do with React should be possible in ClojureScript/reagent (with the added benefit that it will be faster than React). You might also be interested in re-frame, which is a framework for building single-page applications.
React has been proven as a robust tool to build powerful UI.
You can do everything you can do in JavaScript using ClojureScript (just as you can do everything you do in Java using Clojure). So as others have commented, I would definitely recommend ClojureScript, especially since you know Clojure already. You may find out that you do not need jQuery etc.
The common choice to generate html is to use React.js via a wrapper library like:
reagent
Om
Both can generate HTML.
Reagent (and maybe re-frame) is the easiest one to get started with, especially since there are component libraries like soda-ash and a hiccup-like syntax.
Om (by the creator of ClojureScript), and maybe untangled are also a good choice, especially if you need to manage complex data. You can get a hiccup-like syntax via sablono.
Dashboards have been built using it (see the circleCI dashboard as a real-life dashboard example). This is the one I use personally.
Hoplon is also an interesting choice, as you mentioned.
Also have a look at cljsjs for pre-packaged js libraries.
As for the CSS, this is an orthogonal concern, but yes, of course you can use it (or even Less and Sass; there are Clojure wrappers for them). You can even generate CSS from Clojure code with garden.
You can find an example project using boot (by the same authors as hoplon), sass, reagent called saapas, but there are many more in the wild.
As you see there are many viable options in ClojureScript to build a dashboard. I am myself building one and settled on Om.next, partly because I was using React.js before.

Best method to screen-scrape data off of many different websites

I'm looking to scrape public data off of many different local government websites. This data is not provided in any standard format (XML, RSS, etc.) and must be scraped from the HTML. I need to scrape this data and store it in a database for future reference. Ideally the scraping routine would run on a recurring basis and only store the new records in the database. There should be a way for me to detect the new records from the old easily on each of these websites.
My big question is: What's the best method to accomplish this? I've heard some use YQL. I also know that some programming languages make parsing HTML data easier as well. I'm a developer with knowledge in a few different languages and want to make sure I choose the proper language and method to develop this so it's easy to maintain. As the websites change in the future the scraping routines/code/logic will need to be updated so it's important that this will be fairly easy.
Any suggestions?
I would use Perl with modules WWW::Mechanize (web automation) and HTML::TokeParser (HTML parsing).
Otherwise, I would use Python with the Mechanize module (web automation) and the BeautifulSoup module (HTML parsing).
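As a rough illustration of the HTML parsing step, the sketch below pulls table rows out of a page. To stay dependency-free it uses html.parser from the Python standard library rather than BeautifulSoup, so it shows the general idea and not the BeautifulSoup API. The class name, the helper extract_rows, and the sample permit table are invented for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def extract_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    parser._close_row()  # flush a row left open at end of input
    return parser.rows


sample = """
<table>
  <tr><th>Permit</th><th>Issued</th></tr>
  <tr><td>B-1041</td><td>2011-03-02</td></tr>
  <tr><td>B-1042</td><td>2011-03-05</td></tr>
</table>
"""
print(extract_rows(sample))
```

Each site would still need its own small rule for which table or rows to keep, which is where most of the maintenance effort goes when a site changes its layout.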
I agree with David about Perl and Python. Ruby also has Mechanize and is excellent for scraping. The only one I would stay away from is PHP, due to its lack of scraping libraries and clumsy regex functions. As far as YQL goes, it's good for some things, but for scraping it really just adds an extra layer of things that can go wrong (in my opinion).
Well, I would use my own scraping library or the corresponding command line tool.
It can use templates which can scrape most web pages without any actual programming, normalize similar data from different sites to a canonical format and validate that none of the pages has changed its layout...
The command line tool doesn't support databases, though; for that you would need to program something yourself...
(On the other hand, Webharvest says it supports databases, but it has no templates.)
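None of the answers covers the question's requirement to store only new records on each run. One possible approach, sketched here under assumptions, is to fingerprint each scraped record and let the database reject fingerprints it has already seen. The table layout, the function names, and the use of SQLite are all choices made for this example.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the record store; the schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        " fingerprint TEXT PRIMARY KEY,"
        " source TEXT NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return db


def fingerprint(source, fields):
    # Hash the source name plus all field values, separated by a
    # control character that is unlikely to appear in scraped text.
    joined = "\x1f".join([source, *fields])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def store_new(db, source, rows):
    """Insert rows not seen before and return how many were added."""
    before = db.total_changes
    for fields in rows:
        # OR IGNORE skips rows whose fingerprint is already stored.
        db.execute(
            "INSERT OR IGNORE INTO records VALUES (?, ?, ?)",
            (fingerprint(source, fields), source, "\x1f".join(fields)),
        )
    db.commit()
    return db.total_changes - before
```

Running store_new on every scheduled scrape then reports only the records that appeared since the previous run. If a site edits a record in place, it will show up as a new row under this scheme, so a site that exposes a stable ID is better keyed on that ID instead.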

What is your programming language of choice for a multi-threaded http downloading application?

I'm eager to learn a new programming language.
Which one(s) would you suggest for a program that:
downloads millions of URLs, in a multi-threaded manner
interacts with a DB of some sort to store downloaded data
Think web crawler/search engine styled projects. And know that I'm up for learning literally anything.
Please post your favorite language, why you chose it, and your favorite tutorial/reference manual (preferably free!) for said language.
Note: I will update this post occasionally to include everyone's best answers.
F# is a nice choice, because idiomatic patterns for async operations (especially I/O) and parallelization are key strengths of the language.
You can do it easily, and the .NET Framework's BCL is at your service too.
Personally, I use Python for stuff like this. You can use the urllib2 module (urllib.request in Python 3) to download content via HTTP, and I find the syntax of Python to be pleasing.
Furthermore, you can thread easily in Python.
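A minimal sketch of threaded downloading in Python, assuming Python 3 (so urllib.request instead of urllib2) and the standard-library thread pool. The function names and the default of eight workers are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch, workers=8):
    """Fetch every URL on a thread pool; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields each future as soon as its download finishes.
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:
                # One failed download should not stop the rest.
                errors[url] = error
    return results, errors
```

Threads work well here because the time is spent waiting on the network. For millions of URLs, the list would need to be fed to download_all in batches, with each batch written to the database before the next starts, since this version keeps every result in memory.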
Good Luck.

Can I (relatively easily) test ZK interfaces in Watir?

How easily will Watir interact with a ZK interface? If "not at all" do you have any recommendations for automated testing of the web interface for me?
Edit: Another way to put this would be: can I test a Spring/ZK generated page (Ajax/JScript)? I found another issue too: I need to avoid using a proxy to test (as Sahi does) if at all possible.
Edit: I have been testing ZK interfaces now for quite some time. With a higher knowledge of Watir (and now webdriver) I can say it's definitely possible. Timing isn't usually an issue, but finding the elements certainly can be as the ids are dynamically generated. I recommend a strong, maintainable, object oriented approach with a powerful and dynamic DSL, or you'll be listing every element on the page in a custom built object library of some sort. So... it works, but it needs extra effort.
If you're talking about this: http://zssdemo.zkoss.org/ you can take a look at the DOM output. It's atrocious, but it is possible to test it with Watir. I've dealt with some apps that generate awful output like that. It makes for a challenge. :) Search the Watir Google group for testing Ajax; plenty of people do it.
HTH,
Charley

Personal Code Library [closed]

Closed 11 years ago.
So I assume I'm not the only one. I'm wondering if there are others out there who have compiled a personal code library. Something that you take from job to job that has examples of best practices, things you are proud of, or just common methods you see yourself using over and over.
I just recently started my C# library. It already has quite a few small items. Common Regex validations, interfaces for exception handling, some type conversion overloads, enum wrappers, sql injection detection methods, and some common user controls with AJAX toolkit examples.
I'm wondering what kind of things do you have in yours?
I use my own wiki where I post code snippets and commentaries.
I find that more useful than having my own library. And since they are essentially notes and not full programs, there isn't a problem with who owns the code (you or your employer).
PS: I don't hide the fact that I have that from my employer. In fact most of them were positive and even asked for a copy.
Because I primarily do web development, I've abstracted out some common features that I end up doing frequently on sites for clients.
Ajax Emailer. Nearly every site I work on has some type of contact form. I wrote a utility that allows me to drop some HTML on a page, having JavaScript field validation, and a PHP library that requires me to change a few parameters to work with each client's mail server. The only thing I have to write is CSS each time I include it on to a page.
Stylesheet skeleton generator. I wrote a small JavaScript utility that walks the DOM for whatever page it has been included on and then stubs out a valid CSS skeleton so that I can immediately start writing styles without having to do the repetitive task for every site I work on.
JavaScript Query String Parser. Occasionally I need to parse the query string but it doesn't warrant any major modifications to the server (such as installing PHP), so I wrote a generic JavaScript utility that I can easily configure for each site.
I've got other odds-and-ends utilities as well, but they are kind of hacked together for personal use. I'd be embarrassed to let anyone see the source.
Update
Several people have asked for my stylesheet skeleton generator in the comments so I'm providing a link to the project here. It's more or less based on the way that I structure my XHTML and format my CSS, but hopefully you'll find it useful.
I have found that using Snipplr makes this incredibly convenient. You can tag items, save favorites, search by keyword, etc. I mostly use it for Vim-related snippets (common commands, vimrc file, etc.), but it can be used for anything. Check it out.
I have my personal C++ cross platform library here: http://code.google.com/p/kgui/
It's open source LGPL, I use it in my hobby / volunteer projects. I started it about 3 years ago and have been slowly adding functionality to it.
Back in the days of C programming on MacOS 7, I wrote a fairly extensive OO library (yes, OOP in very old C), mostly to handle dialog windows. I abandoned it for PowerPlant (a nice C++ framework from Metrowerks) during the switch from 68k to PPC processors.
A little after that, I began writing web apps, first in PHP, recently in Django. On this side, my reusable code is limited to some tricks and code style.
But for everything non-web (or with only small web components), I've been using Lua. It's so fast to write and rewrite code that there's very little incentive to reuse code. I mean, what's the point of copying a 10-line function and then adapting it? It's faster to rewrite it just for this project.
That's not as wasteful as it sounds. Lua code is so succinct that my apps can be very complex, but seldom have more than a couple of thousand lines.
At the same time, several Lua projects involve interfacing with C libraries. It's very easy to write bindings to existing libraries, so I just do that as a subproject. And these modules are what I do reuse, again and again, with very little (if any) change from one project to the next.
In short: non-web projects are usually one-off Lua code, and some heavily reused binding modules.
I use Source Code Library from http://www.highdots.com/products/source-code-library/ since I can manage different textfiles, notes, screenshots and different programming languages.
I have several utility MATLAB functions that I have taken with me as I move from job to job, particularly ones that enforce W3C standards on the plots I make to ensure that text and background colors have a good luminosity ratio. I also have a function that uses ActiveX to insert a MATLAB figure into PowerPoint.
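The answer does not say which W3C rule its MATLAB functions enforce. One plausible candidate is the WCAG 2.0 contrast ratio, which is sketched below in Python as an assumption, not as a reconstruction of the author's code. Colors are given as 8-bit sRGB triples, and 4.5 is the WCAG 2.0 level AA threshold for normal-size text.

```python
def _linear(channel):
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """WCAG 2.0 relative luminance of an (r, g, b) color, from 0 to 1."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG 2.0 contrast ratio, from 1 (identical) to about 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def readable(foreground, background, minimum=4.5):
    """True if the pair meets the given minimum contrast ratio."""
    return contrast_ratio(foreground, background) >= minimum
```

A plotting helper could call readable() on the text and background colors before drawing and fall back to black or white text when the check fails.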
I keep my personal code libraries on CPAN. I'm not even sure how I'd do this in other languages anymore. It's just too integrated in the way that I think about programming now.
For my PHP work I started with a small file of simple things: a mail function that checks inputs for header attacks, an email validator, an input scrubber, that type of thing. Over time it has grown into an application framework for quickly developing one-off applications that can be templated by our graphic designer.
I have a library that I use quite extensively. I started fresh with C# and kind of threw all of the legacy stuff out the window. I find the pieces very handy, and I rewrite/refactor some of them often. Some of the stuff I have is:
Auxiliary (things like IsRunningLocal, InternetDetection)
Standard Classes or Structs for: Address, CreditCard, Person
I have .dll's for both win and web stuff, some very logical like a .dll for shopping cart stuff
I wrote a quick and simple library in Java which I can add code snippets to. I plan to extend it into a full framework for development at some point, but only when time allows. I have all sorts in there, from simple functions to full-blown pages and features. It's so helpful to have when developing because, as a web designer, all I need to do is change the CSS of the page.
