I am writing a Python program that uses BeautifulSoup to scrape the image link off a website and then categorize the image. The website puts its images on separate pages with the following URL format:
website.com/(a-z)(a-z)(0-9)(0-9)(0-9)(0-9)
This means the number of possible URLs is very high (over 1 million). I am afraid that if I send a GET request to the site this many times, it might harm the site or put me in legal danger. How can I scrape the largest number of URLs without damaging the site or putting myself in legal trouble? Please let me know if you would like any more information. Thank you!
P.S. I have left pseudocode of what my code does below, if that helps.
P.P.S. Sorry if the format is weird or messed up, I am posting from mobile.
for url in url_possibilities:
    page = requests.get(url)
    img_link = find_img_link(page)
    categorize(img_link)
A few options I can think of...
1) Is there a way to get a listing of these image URLs, e.g. a sitemap or a page with a large list of them? This would be the preferred way, because with that listing you scrape only what you know exists. Based on your question I feel this is unlikely, but if you have one URL, is there no way to work backwards and find more?
2) Is there a pattern to the image naming? The letters might be random but the numbers might incrementally count up. E.g. AA0001 and AA0002 might exist but there may be no other images for the AA prefix?
3) Responsible scraping - if the naming within that structure truly is random and you have no option but to try all URLs until you get a hit, do so responsibly. Respect the site's robots.txt and limit the rate of requests.
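To put "try all URLs" into numbers, here is a rough sketch that enumerates the pattern from the question and estimates how long a rate-limited sweep would take. The base URL `https://website.example/` and the one-second delay are placeholders assumed for illustration, not details from the question.

```python
import itertools
import string

BASE_URL = "https://website.example/"  # placeholder, not the real site


def candidate_urls(base=BASE_URL):
    """Yield every URL made of two lowercase letters followed by four digits."""
    for letters in itertools.product(string.ascii_lowercase, repeat=2):
        for number in range(10_000):
            yield f"{base}{''.join(letters)}{number:04d}"


def total_candidates():
    # 26 choices for each of the two letters, 10 for each of the four digits
    return 26 ** 2 * 10 ** 4


def sweep_days(delay_seconds):
    """Days needed to request every candidate once with a fixed delay."""
    return total_candidates() * delay_seconds / 86_400


print(total_candidates())         # size of the URL space
print(round(sweep_days(1.0), 1))  # days needed at one request per second
```

The pattern gives 6,760,000 candidates, so even at one request per second a full sweep takes about 78 days. That is a strong argument for options 1 and 2 before resorting to brute force.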
I am currently writing this code to grab restaurants' official website links from their Yelp pages. The code mostly works, but it returns the first link twice instead of going through the list and returning each item once. I tried to work it out, but I'm stuck on what is causing this. Can you spot what I am doing wrong?
I also have another question about grabbing links from Yelp. I know Yelp may not like it, but I really cannot copy and paste links from 20,000 pages by hand so I have to use this.
Would they block my IP? Will inserting 2-second delays between requests keep them from blocking me? Are there any other ways besides inserting delays?
import urllib
import urllib.request
from bs4 import BeautifulSoup
url=[
    "https://www.yelp.com/biz/buffalo-wild-wings-ann-arbor-3",
    "https://www.yelp.com/biz/good-burger-east-dearborn-dearborn?osq=mac+donalds"
]
def make_soup(url):
    for i in url:
        thepage=urllib.request.urlopen(i)
        soupdata=BeautifulSoup(thepage, "html.parser")
        return soupdata
compoundfinal=''
soup=make_soup(url)
for i in url:
    for thing1 in soup.findAll('div',{'class':'mapbox-text'}):
        for thing2 in thing1.findAll('a',{'rel':'nofollow'}):
            final='"http://www.'+thing2.text+'",\n'
            compoundfinal=compoundfinal+final
print(compoundfinal)
An answer for your secondary question:
Yes, putting a delay between scrapes would be a very good idea. I would say a static 2-second delay may not be enough - consider a random delay between 2 and 5 seconds, perhaps. That will make the scrapes seem less deterministic, though you might still get caught based on scrapes per hour. It would be worth writing your script so you can restart it, in case there are problems mid-scrape - you don't want to have to start again from the beginning.
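A minimal sketch of both ideas, a random pause and a restartable run, might look like this. The checkpoint file name `done.txt`, the function names, and the `fetch` callback (whatever downloads and processes one page) are all hypothetical names introduced for illustration.

```python
import random
import time
from pathlib import Path


def load_done(path):
    """Return the set of URLs recorded as finished by an earlier run."""
    p = Path(path)
    return set(p.read_text(encoding="utf-8").splitlines()) if p.exists() else set()


def scrape_all(urls, fetch, checkpoint="done.txt",
               min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for every URL not yet listed in the checkpoint file."""
    done = load_done(checkpoint)
    for url in urls:
        if url in done:
            continue  # already handled in a previous run
        fetch(url)
        # Record progress immediately so a crash loses at most one page.
        with open(checkpoint, "a", encoding="utf-8") as log:
            log.write(url + "\n")
        done.add(url)
        sleep(random.uniform(min_delay, max_delay))  # random 2-5 second pause
    return done
```

If the script dies halfway through, running it again with the same list skips everything already written to the checkpoint file.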
Please also download Yelp's Robots Exclusion File (robots.txt) and check your scraping list against their no-scrape list. I notice they request a 10-second delay for Bing, so consider increasing the delay I suggested above.
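Python's standard library can do this check for you. The sketch below parses a small, made-up robots.txt (it is not Yelp's real file, only an example echoing the 10-second Bing delay mentioned above) and asks whether a URL may be fetched. The user agent `my-scraper` and the example URLs are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Invented sample rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""


def load_rules(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS)
allowed = rules.can_fetch("my-scraper", "https://example.com/biz/some-place")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/page")
bing_delay = rules.crawl_delay("bingbot")
print(allowed, blocked, bing_delay)
```

Against a live site you would instead call `set_url()` with the address of the site's robots.txt followed by `read()`, then filter your URL list through `can_fetch()` before requesting anything.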
You might also want to consider the legal aspects of this. Most sites want to be scraped, so they can appear in search engines. However some data aggregators may not have the same enthusiasm: they probably want to be found by search engines, but they don't want to be replaced by competitors. Remember that it costs a lot of money to collect the data in the first place, and they may object to third parties getting a free ride. Thus, if you plan to do this regularly in order to update your own website, I think you might run into either technical or legal obstacles.
You may be tempted to use proxies to hide your scraping traffic, but this carries with it an implicit message that you believe you are doing something wrong. Your scrape target will probably make more efforts to block you in this case, and may be more likely to take legal action against you if they find which website you are republishing the data on.
You are trying to split your processing between two different loops, but you are not saving the data from the first loop and then iterating over it in the second. It also looks like the return statement in the function definition has the wrong indentation, so the function returns after the first iteration regardless of the number of items in the list. The code below seems to work by placing all the processing into one function. It was the quickest way to get working code from your example, but it isn't the best way to tackle the problem. You would be better off defining your function to process one page, then looping over url to call the function.
import urllib
import urllib.request
from bs4 import BeautifulSoup
url=[
    "https://www.yelp.com/biz/buffalo-wild-wings-ann-arbor-3",
    "https://www.yelp.com/biz/good-burger-east-dearborn-dearborn?osq=mac+donalds"
]
def make_soup(url):
    compoundfinal = ""
    for i in url:
        thepage=urllib.request.urlopen(i)
        soupdata=BeautifulSoup(thepage, "html.parser")
        for thing1 in soupdata.findAll('div',{'class':'mapbox-text'}):
            for thing2 in thing1.findAll('a',{'rel':'nofollow'}):
                final='"http://www.'+thing2.text+'",\n'
                compoundfinal=compoundfinal+final
    return compoundfinal
final = make_soup(url)
print( final )
output
"http://www.buffalowildwings.com",
"http://www.goodburgerrestaurant.com"
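The "one function per page, then loop" structure suggested above could be sketched as follows. To keep the example runnable without extra installs it swaps BeautifulSoup for the standard library's `html.parser`, and it takes the download step as a `fetch` argument. The names `WebsiteLinkParser`, `extract_links` and `collect` are hypothetical.

```python
from html.parser import HTMLParser


class WebsiteLinkParser(HTMLParser):
    """Collect the text of <a rel="nofollow"> links inside <div class="mapbox-text">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # greater than 0 while inside the target div
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.div_depth or "mapbox-text" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth:
            if "nofollow" in (attrs.get("rel") or "").split():
                self.in_link = True
                self.links.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data.strip()


def extract_links(html):
    """Process exactly one page and return the link texts found on it."""
    parser = WebsiteLinkParser()
    parser.feed(html)
    return parser.links


def collect(urls, fetch):
    """Loop over the URLs, calling the one-page function for each."""
    results = []
    for url in urls:
        results.extend(extract_links(fetch(url)))
    return results
```

With this split, each page is parsed independently, so a failure on one URL cannot silently reuse the previous page's data. In real use `fetch` could be a small wrapper that downloads the page with `urllib.request.urlopen` and decodes the response.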
I want to scrape only four data items for each and every product on the following page, which is an infinite-scroll page:
name of the product
price of the product
href of the product
img src of the product.
All the data will be stored in a single csv file.
How can I do this?
Any idea?
I am not sure about this method, but:
get the original page source, where you can find all of the site's information, including the photo links and any text.
This is usually considered a bad idea. If you write code to scrape a website for its content, what happens when they change their markup? Or what happens when they realize you're scraping (stealing) their original content and ban your server's IP address, or even your IP range? It's a losing battle, so unless you have permission from them to do so I wouldn't recommend trying. It may work for a little while, but probably not for long. It's generally considered poor form to do something like this, so personally I wouldn't encourage anyone to teach someone how to scrape a website for its content.
Furthermore, it says very clearly in their Terms of Use not to do exactly that:
You agree not to access (or attempt to access) the Website and the materials
or Services by any means other than through the interface that is provided by
Snapdeal. You shall not use any deep-link, robot, spider or other automatic
device, program, algorithm or methodology, or any similar or equivalent manual
process, to access, acquire, copy or monitor any portion of the Website or
Content (as defined below), or in any way reproduce or circumvent the
navigational structure or presentation of the Website, materials or any
Content, to obtain or attempt to obtain any materials, documents or
information through any means not specifically made available through the
Website.
My program currently goes through the pages of a website gathering information. How do I set my loop to end when I have visited all of the website's pages?
Is there some way of knowing the amount of webpages in any site?
Or do I have to compare a block of pages I have visited, e.g. 10, and if the pages are checked in that order again I know it's repeating itself?
I'm sure there has to be a better way of knowing when to stop.
Keep track of the pages you have visited (for example, by keeping the visited URLs in a set), and before scanning a new page, check whether it has already been visited.
Breadth first search
Depth first search
Check these two algorithms. Think of the site as a graph
whose nodes are the pages and whose edges/arcs are the links
from one page to another. So two pages are neighboring
A → B, if there's a link from page A to page B.
Then just implement one of these two algorithms
(whichever you find more appropriate for your case).
Both of them have their respective stop conditions.
Your search in both cases should start with the root
page(s) which is usually default.ext or index.ext or
something similar (ext = html, asp, aspx, jsp, php, whatever).
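As a rough sketch of the breadth-first variant combined with a visited set: the loop below stops by itself once the queue of unvisited pages is empty. The tiny `SITE` dictionary is a made-up stand-in for real pages, and `get_links` is a hypothetical callback that would normally download a page and return the links found on it.

```python
from collections import deque


def crawl(start, get_links):
    """Breadth-first traversal; returns the pages in the order visited."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:  # stop condition: nothing left to visit
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:  # never queue a page twice
                visited.add(link)
                queue.append(link)
    return order


# A made-up site: each page maps to the pages it links to.
SITE = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/index.html"],
    "/b.html": ["/c.html"],
    "/c.html": ["/index.html"],
}

print(crawl("/index.html", lambda page: SITE.get(page, [])))
# prints ['/index.html', '/a.html', '/b.html', '/c.html']
```

Swapping `queue.popleft()` for `queue.pop()` turns the same loop into a depth-first traversal. On a real site you would also want to restrict `get_links` to URLs on the same domain, or the crawl will wander off across the web.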
You may want to pre-process the website with a SitemapGenerator and only visit the webpages included in the sitemap.
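Assuming the generator (or the site itself) produces a sitemap in the standard sitemaps.org XML format, a minimal sketch for reading the page list out of it could look like this. The sample XML and the `site.example` URLs are invented.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample sitemap for illustration.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.example/</loc></url>
  <url><loc>https://site.example/about.html</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


pages = sitemap_urls(SAMPLE_SITEMAP)
print(len(pages), pages)
```

The length of the returned list is then the number of pages to visit, which gives the loop a definite end.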
Is there some way of knowing the amount of webpages in any site?
No. All you can do to examine a web-site is to make HTTP GET (or HEAD) requests and examine the response. That will tell you whether the URI is a valid identifier for a resource, and get you a representation of that resource. You can not know which requests will indicate a valid resource, nor can you practically generate all the possible URIs to perform an exhaustive search.
At best, all you can do is to start with a URI and find all the resources reachable from that URI, by examining resources that contain links to other resources, and then following those links.
I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have infinite "pages". A given page will be calculated when asked for, and forgotten as soon as it is sent. Each can have a link to the next page. Since many websites compose their pages on the fly, there is no real difference (except that there are infinite pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't recognize you as a Denial of Service attack.
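Once the mirror has finished, counting the pages is a short script. This sketch assumes the mirror landed in a local directory (wget normally names it after the host) and that "page" means a file ending in `.html` or `.htm`, which is a simplification. `count_pages` is a hypothetical helper name.

```python
from pathlib import Path


def count_pages(mirror_dir, suffixes=(".html", ".htm")):
    """Count files under mirror_dir whose extension marks them as pages."""
    return sum(
        1
        for path in Path(mirror_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

For the command above, that would be something like `count_pages("domain.test")`. Pages saved without an HTML extension are not counted by this sketch.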
Most search engines will allow you to search their index by site, as well, though the information on result pages might be confusing for more than a rough order of magnitude and there's no way to know how much they've indexed.
I don't see where they keep or have access to the database at a glance, but down the search engine path, you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.
I'm about to launch a set of multi-domain affiliate sites which have one thing in common: their content. Reading about the problem with duplicate content and Google, I'm a little worried that the parent domain or sub-sites could get banned from the search engine for duplicated content.
If I have 100 sites with similar look and feel and basically same content with some minor element changes, how will I go on preventing banning, indexing these correctly?
Should I just prevent the sub-sites from being indexed completely with robots?
If so, how will people be able to find their site? I actually think the parent is the only one that should be indexed to avoid this, but I would love to hear other expert thoughts.
Google has recently released an update that allows you to include a link tag in the head of pages that use duplicated content, pointing to the original version. These are called canonical links, and they exist for the exact reason you mention: to be able to use duplicated content without penalisation.
For more information, look here:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for the duplicated content, but it does mean the original is "protected". For decent ranking on the duplicated sites you will need to provide unique content.
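If the sub-sites are generated from templates, the canonical tag can be produced by a small helper like the sketch below. The function name `canonical_tag` and the `parent.example` URL are made up for illustration; the output is the `<link rel="canonical">` element that goes in the head of each duplicated page.

```python
from html import escape


def canonical_tag(original_url):
    """Build the <link> element that points a duplicate page at its original."""
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


print(canonical_tag("https://parent.example/widgets?colour=red&size=2"))
```

Escaping the URL matters because query strings often contain `&`, which has to appear as `&amp;` inside an HTML attribute.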
If I have 100 sites with similar look
and feel and basically same content
with some minor element changes, how
will I go on preventing banning,
indexing these correctly?
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want hundreds of sites to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation), you would be in basically the same spot you would have been in if you had decided to "noindex" all the sites.