I am thinking of making a little program for myself that takes products off a page like Amazon and similar sites. What would be the best way to get all the info about the product off the page? Beautiful Soup? Is there anything that would be better?
Btw I am using Python
If you are looking for a general-purpose library, a scraper such as Beautiful Soup (an HTML parser) or Scrapy (a full scraping framework) would work.
Specifically, Amazon has a powerful API that you might want to take a look at (and which works well with Scrapy), which could make the process easier. Don't forget to check whether your usage falls within Amazon's usage guidelines.
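To give a feel for the Beautiful Soup route, here is a minimal sketch that fetches a product page and pulls out a title and a price. The URL and the selectors are illustrative guesses rather than real Amazon markup, and heavily scripted pages may need an API or a browser-automation tool instead.

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/dp/B000000000"  # placeholder product URL
resp = requests.get(url, headers={"User-Agent": "my-scraper/0.1"}, timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.select_one("#productTitle")   # hypothetical element id
price = soup.select_one("span.price")      # hypothetical class name
print(title.get_text(strip=True) if title else "title not found")
print(price.get_text(strip=True) if price else "price not found")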
This question is about process / approach, more so than how to write the code itself. I'm a process learner, so this is the part that's creating personal anxiety for me.
I am very much a beginner, and still learning about importing libraries and the like. However, I have an idea for something I'd like to be able to do as a Capstone Project while I learn.
I have a spreadsheet that I use each Sprint as part of our Capacity Planning process. I want to use Python to query target tickets in our client's GitHub account (while logged in), and our Jira account, to pull specific data into the cells that I currently populate manually. Others have expressed interest in seeing what I come up with, as they use the same Google Sheets template in a similar way.
From Sheets for Developers > API v4, through trial and error, I should be able to figure out how to import data into Google Sheets in general. Likewise, this GoTrained Python Tutorial looks like it has an approach for obtaining information from the GitHub API. I'm fairly certain that I can find something similar for Jira (though the first site I tried wanted to use a fake "captcha" script to trick me into accepting notifications from the site, which was a red flag to me).
But which are the highest-quality, most efficient approaches? Especially for a Python beginner just starting out, like myself? The last time I coded was 15-20 years ago, using LPC to build rooms/mobs/objects on a MU*, accessed via the Telnet protocol.
I need to learn more about how to set up the program, which libraries might be useful, and the best way, after decomposition, to identify the components and the methods to use in solving for the project goal:
import selected field data from Jira and GitHub into a Sheet, using Python (a rough sketch of this appears after the list)
How do I know which libraries are best to import, like Tkinter, for the functions I will need? (This one came up in a search on creating dropdown lists in Python, so that the repo names can be standardized.)
I'm seeing lots of references to REST APIs, but we haven't talked about that in the course yet.
what are some quality resources to learn more about principles that I should understand better before attempting this project?
w3schools.com is on my radar, but it is also extensive, and I'm not sure whether there are resources focused on this type of challenge.
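To make the overall shape concrete, here is a rough sketch of what such a script could look like, using the requests library against the GitHub REST API and the gspread library to write into a Google Sheet. The repository name, token, sheet name, and chosen fields are all placeholders, and Jira would follow the same request-then-write pattern with its own REST endpoints and credentials.

import requests
import gspread

GITHUB_TOKEN = "..."      # personal access token (placeholder)
REPO = "owner/repo"       # placeholder repository

# 1. Query the GitHub REST API for open issues in the repo.
resp = requests.get(
    f"https://api.github.com/repos/{REPO}/issues",
    headers={"Authorization": f"token {GITHUB_TOKEN}"},
    params={"state": "open"},
    timeout=10,
)
resp.raise_for_status()
issues = resp.json()

# 2. Write selected fields into a Google Sheet (gspread authenticates
#    with a service-account JSON key file).
gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("Capacity Planning").sheet1   # placeholder sheet name

for issue in issues:
    worksheet.append_row([issue["number"], issue["title"], issue["state"]])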
I'm looking to scrape public data off of many different local government websites. This data is not provided in any standard format (XML, RSS, etc.) and must be scraped from the HTML. I need to scrape this data and store it in a database for future reference. Ideally the scraping routine would run on a recurring basis and only store the new records in the database. There should be a way for me to detect the new records from the old easily on each of these websites.
My big question is: What's the best method to accomplish this? I've heard some use YQL. I also know that some programming languages make parsing HTML data easier as well. I'm a developer with knowledge in a few different languages and want to make sure I choose the proper language and method to develop this so it's easy to maintain. As the websites change in the future the scraping routines/code/logic will need to be updated so it's important that this will be fairly easy.
Any suggestions?
I would use Perl with modules WWW::Mechanize (web automation) and HTML::TokeParser (HTML parsing).
Otherwise, I would use Python with the Mechanize module (web automation) and the BeautifulSoup module (HTML parsing).
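A minimal sketch of that Python route, with requests standing in for Mechanize on the fetch side, might look like the following. It also shows one simple way to meet the "only store the new records" requirement: hash each scraped record and let the database ignore hashes it has already seen. The URL, the CSS selector, and the table schema are illustrative only.

import hashlib
import sqlite3

import requests
from bs4 import BeautifulSoup

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (hash TEXT PRIMARY KEY, content TEXT)")

html = requests.get("https://www.example.gov/notices", timeout=10).text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

for node in soup.select("div.record"):   # hypothetical selector for one record
    text = node.get_text(" ", strip=True)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # INSERT OR IGNORE skips rows whose hash already exists, i.e. records seen before.
    conn.execute("INSERT OR IGNORE INTO records (hash, content) VALUES (?, ?)", (digest, text))

conn.commit()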
I agree with David about Perl and Python. Ruby also has Mechanize and is excellent for scraping. The only one I would stay away from is PHP, due to its lack of scraping libraries and clumsy regex functions. As far as YQL goes, it's good for some things, but for scraping it really just adds an extra layer of things that can go wrong (in my opinion).
Well, I would use my own scraping library or the corresponding command line tool.
It can use templates which can scrape most web pages without any actual programming, normalize similar data from different sites to a canonical format and validate that none of the pages has changed its layout...
The command line tool doesn't support databases, though; there you would need to program something yourself...
(on the other hand Webharvest says it supports databases, but it has no templates)
We are going to be scraping thousands of websites each night to update client data, and we are in the process of deciding which language we would like to use to do the scraping.
We are not locked into any platform or language, and I am simply looking for efficiency. If I have to learn a new language to make my servers perform well, that is fine.
Which language/platform will provide the highest scraping efficiency per dollar for us? Really, I'm looking for real-world experience with high-volume scraping. It will be about maximizing CPU/memory/bandwidth.
You will be I/O bound anyway; the performance of your code won't matter at all (unless you're a really bad programmer...).
Using a combination of Python and Beautiful Soup, it's incredibly easy to write screen-scraping code very quickly. There is a learning curve for Beautiful Soup, but it's worth it.
Efficiency-wise, I'd say it's just as quick as any other method out there. I've never done thousands of sites at once, but I'd wager that it's definitely up to the task.
For web scraping I use Python with lxml and a few other libraries: http://webscraping.com/blog
I/O is the main bottleneck when crawling - to download data at a good rate you need to use multiple threads.
I cache all downloaded HTML, so memory use is low.
Often after crawling I need to rescrape different features, and CPU becomes important.
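As a rough illustration of that workflow (threads to hide network latency, a disk cache of the raw HTML, then CPU-bound parsing with lxml afterwards), a minimal sketch might look like this. The URLs and the XPath are placeholders, and a real crawler would add retries and per-domain politeness delays.

import hashlib
import pathlib
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

CACHE = pathlib.Path("cache")
CACHE.mkdir(exist_ok=True)

def fetch(url):
    """Download a URL once; later runs re-read the cached copy from disk."""
    path = CACHE / hashlib.sha256(url.encode()).hexdigest()
    if not path.exists():
        path.write_bytes(requests.get(url, timeout=10).content)
    return path.read_bytes()

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

# Threads overlap the download waits; parsing afterwards is the CPU-heavy part.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))

for page in pages:
    tree = html.fromstring(page)
    print(tree.xpath("//title/text()"))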
If you know C, a single-stream synchronous link (called the "easy" method) is a short day's work with libcURL. Multiple asynchronous streams (called the "multi" method) is a couple hours more.
With the volume that thousands of sites would require, you may be better off economically by looking at commercial packages. They eliminate the IO problem, and have tools specifically designed to handle the nuances between every site, as well as post-scraping tools to normalize the data, and scheduling to keep the data current.
I would recommend Web Scraping Language
compare a simple WSL query:
GOTO example.com >> EXTRACT {'column1':td[0], 'column2': td[1]} IN table.spad
with the following example:
import urllib.request
from bs4 import BeautifulSoup

# fetch the page and locate the table with class "spad"
soup = BeautifulSoup(urllib.request.urlopen('http://example.com').read(), 'html.parser')
for row in soup.find('table', class_='spad').tbody.find_all('tr'):
    tds = row.find_all('td')
    print(tds[0].string, tds[1].string)
How easily will Watir interact with a ZK interface? If "not at all" do you have any recommendations for automated testing of the web interface for me?
Edit: Another way to put this would be: can I test a Spring/ZK-generated page (Ajax/JavaScript)? I found another issue too: I need to avoid using a proxy for testing (as Sahi does), if at all possible.
Edit: I have been testing ZK interfaces for quite some time now. With a better knowledge of Watir (and now WebDriver) I can say it's definitely possible. Timing isn't usually an issue, but finding the elements certainly can be, as the ids are dynamically generated. I recommend a strong, maintainable, object-oriented approach with a powerful and dynamic DSL, or you'll end up listing every element on the page in a custom-built object library of some sort. So... it works, but it needs extra effort.
If you're talking about this: http://zssdemo.zkoss.org/, you can take a look at the DOM output. It's atrocious, but it is possible to test it with Watir. I've dealt with some apps that generate awful output like that. It makes for a challenge. :) Search the Watir Google group for testing Ajax; plenty of people do it.
HTH,
Charley
What are the best open source image gallery engines? Both stand-alone, and for existing frameworks such as Wordpress or Drupal.
Hopefully we can build a good list here over time.
Gallery is the classic choice. It has skins, security layers, heaps of plugins, etc, but can be run with the default settings easily if you want to. I've used it for years.
Good question; lots of people ask this in many web forums, so hopefully we will get some good responses here and build up a good list of solutions.
Personally, I always used to say something like Gallery or some other open-source script, but recently I have found myself more and more using a simple PHP script which just spits out a list of images (maybe 7 a page), relying on a JavaScript library such as MooTools or Ext to provide all the functionality, particularly for small or individual galleries. I'm particularly loving the Noobslide MooTools class at the moment, which has some lovely gallery effects.
Noobslide
I suppose at the end of the day it's all down to what you need; there will be no one answer that fits all, but a number of different solutions will hopefully show up here that suit different people's needs.