I want to scrape only four data items for each and every product listed on the following link, which is an infinite-scroll page:
name of the product
price of the product
href of the product
img src of the product.
All the data will be stored in a single csv file.
How can I do this?
Any idea?
I'm not sure about this method, but you could fetch the page's original source code, where you can get all of the website's information, including the photo links and any text.
This is usually considered a bad idea. If you write code to scrape a website for its content, what happens when they change their markup? Or what happens when they realize you're scraping (stealing) their original content and ban your server's IP address, or even your whole IP range? It's a losing battle, so unless you have permission from them to do so I wouldn't recommend trying. It may work for a little while, but probably not for long. It's generally considered poor form to do something like this, so personally I wouldn't encourage anyone to teach someone how to scrape a website for its content.
Furthermore, it says very clearly in their Terms of Use not to do exactly that:
You agree not to access (or attempt to access) the Website and the materials
or Services by any means other than through the interface that is provided by
Snapdeal. You shall not use any deep-link, robot, spider or other automatic
device, program, algorithm or methodology, or any similar or equivalent manual
process, to access, acquire, copy or monitor any portion of the Website or
Content (as defined below), or in any way reproduce or circumvent the
navigational structure or presentation of the Website, materials or any
Content, to obtain or attempt to obtain any materials, documents or
information through any means not specifically made available through the
Website.
I am writing a python program that uses beautifulsoup to scrape the image link off a website and then categorize the image. The website puts their images on separate pages in the given url format:
website.com/(a-z)(a-z)(0-9)(0-9)(0-9)(0-9)
This means the number of possible URLs is very high (over 1 million). I am afraid that if I send a GET request to the site this many times, it might harm the site or put me in legal danger. How can I scrape as many URLs as possible without damaging the site or getting into legal trouble? Please let me know if you want any more information. Thank you!
P.S. I have left pseudocode of what my code does below, if that helps.
P.P.S. Sorry if the format is weird or messed up, I am posting from mobile.
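import requests

for url in url_possibilities:                  # every candidate page URL
    response = requests.get(url, timeout=10)   # download the page
    img_link = find_img_link(response.text)    # my helper: pull the image link out of the HTML
    categorize(img_link)                       # my helper: categorize the image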
A few options I can think of...
1) Is there a way to get a listing of these image URLs? E.g. a site map, or a page with a large list of them. This would be the preferred way as by using that listing you can then only scrape what you know to exist. Based on your question I feel this is unlikely but if you have one URL is there no way to work backwards and find more?
2) Is there a pattern to the image naming? The letters might be random but the numbers might incrementally count up. E.g. AA0001 and AA0002 might exist but there may be no other images for the AA prefix?
3) Responsible scraping - if the naming within that structure truly is random and you have no option but to try all URLs until you get a hit, do so responsibly. Respect robots.txt and limit the rate of requests.
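As a rough illustration of option 3, here is a minimal sketch using only the standard library's urllib.robotparser. The function names, the "my-image-bot" user agent, the one-second delay and the injected fetch callable are all assumptions made for this example, not anything the site requires.

```python
import time
from urllib.robotparser import RobotFileParser


def load_robots(lines):
    """Build a robots.txt parser from the file's lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def polite_fetch_all(urls, robots, fetch, user_agent="my-image-bot",
                     delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL that robots.txt allows, pausing after every request.

    `fetch` is a hypothetical callable (url -> page content) supplied by the caller.
    """
    results = {}
    for url in urls:
        if not robots.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path, so skip it
        results[url] = fetch(url)
        sleep(delay_seconds)  # rate limit: wait before the next request
    return results
```

Against a live site you would probably call set_url() and read() on the RobotFileParser instead of passing lines in, and pass something like a wrapper around requests.get as fetch. The right delay depends on the site, so treat one second as a placeholder.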
My program currently goes through the pages of a website gathering information. How do I set my loop to end when I have visited all of the website's pages?
Is there some way of knowing the number of web pages in any site?
Or do I have to compare a block of pages I have visited (e.g. 10), and if the pages are checked in that order again, I know it's repeating itself?
I'm sure there has to be a better way of knowing when to stop.
Keep track of the pages you have visited (for example, by keeping the visited URLs in a set), and before scanning a new page, check whether it has already been visited.
Breadth first search
Depth first search
Check these two algorithms. Think of the site as a graph
whose nodes are the pages and whose edges/arcs are the links
from one page to another. So two pages are neighboring
A → B, if there's a link from page A to page B.
Then just implement one of these two algorithms
(whichever you find more appropriate for your case).
Both of them have their respective stop conditions.
Your search in both cases should start with the root
page(s) which is usually default.ext or index.ext or
something similar (ext = html, asp, aspx, jsp, php, whatever).
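A minimal sketch of the breadth-first variant, with the visited set acting as the stop condition. The get_links callable is an assumption for this example: it stands in for whatever code downloads a page and extracts its links.

```python
from collections import deque


def crawl_breadth_first(root, get_links):
    """Visit every page reachable from `root` exactly once, breadth first.

    `get_links` is a hypothetical callable (page -> iterable of linked pages).
    The loop ends on its own when the queue of unvisited pages is empty.
    """
    visited = {root}
    queue = deque([root])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in visited:  # skip pages already seen, so cycles terminate
                visited.add(link)
                queue.append(link)
    return order
```

Swapping queue.popleft() for queue.pop() turns this into an iterative depth-first search. In a real crawler you would probably also normalize URLs and restrict the links to the site's own domain before adding them to the queue.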
You may want to pre-process the website with a SitemapGenerator and only visit the webpages included in the sitemap.
Is there some way of knowing the amount of webpages in any site?
No. All you can do to examine a website is to make HTTP GET (or HEAD) requests and examine the response. That will tell you whether the URI is a valid identifier for a resource, and get you a representation of that resource. You cannot know in advance which requests will indicate a valid resource, nor can you practically generate all the possible URIs to perform an exhaustive search.
At best, all you can do is to start with a URI and find all the resources reachable from that URI, by examining resources that contain links to other resources, and then following those links.
I'd like a list of the top 100,000 domain names sorted by the number of distinct, public web pages.
The list could look something like this
Domain Name 100,000,000 pages
Domain Name 99,000,000 pages
Domain Name 98,000,000 pages
...
I don't want to know which domains are the most popular. I want to know which domains have the highest number of distinct, publicly accessible web pages.
I wasn't able to find such a list in Google. I assume Quantcast, Google or Alexa would know, but have they published such a list?
For a given domain, e.g. yahoo.com you can google-search site:yahoo.com; at the top of the results it says "About 141,000,000 results (0.41 seconds)". This includes subdomains like www.yahoo.com, and it.yahoo.com.
Note also that some websites generate pages on the fly, so they might, in fact, have infinite "pages". A given page will be calculated when asked for, and forgotten as soon as it is sent. Each can have a link to the next page. Since many websites compose their pages on the fly, there is no real difference (except that there are infinite pages, which you can't find out unless you ask for them all).
Keep in mind a few things:
Many websites generate pages dynamically, leaving a potentially infinite number of pages.
Pages are often behind security barriers.
Very few companies are interested in announcing how much information they maintain.
Indexes go out of date as they're created.
What I would be inclined to do for specific answers is mirror the sites of interest using wget and count the pages.
wget -m --wait=9 --limit-rate=10K http://domain.test
Keep it slow, so that the company doesn't recognize you as a Denial of Service attack.
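Once the mirror has finished, counting the pages could look something like the sketch below. It assumes the mirror sits in a local directory (wget normally creates one named after the host) and that pages were saved with .html or .htm extensions. The function name and the extension list are my own choices for illustration.

```python
import os


def count_mirrored_pages(mirror_dir, extensions=(".html", ".htm")):
    """Count files under `mirror_dir` whose names end in one of `extensions`."""
    count = 0
    for _dirpath, _dirnames, filenames in os.walk(mirror_dir):
        for name in filenames:
            if name.lower().endswith(extensions):
                count += 1
    return count
```

Dynamically generated pages are often saved without an .html suffix, so the count may come out low unless the mirror was made with wget's --adjust-extension option or the extension list is widened.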
Most search engines will allow you to search their index by site, as well, though the information on result pages might be confusing for more than a rough order of magnitude and there's no way to know how much they've indexed.
At a glance, I don't see where they keep the database or how to get access to it, but if you go down the search engine path, you might also be interested in the Seeks and YaCy search engine projects.
The only organization I can think of that might (a) have the information easily available and (b) be friendly and transparent enough to want to share it would be the folks at The Internet Archive. Since they've been archiving the web with their Wayback Machine for a long time and are big on transparency, they might be a reasonable starting point.
I don't know how to make my site hackproof at all. I have inputs where people can enter information that gets published on the site. What should I filter and how?
Should I not allow script tags? (issue is, how will they put YouTube embed code on the site?)
iFrame? (People can put inappropriate sites in iFrames...)
Please let me know some ways I can prevent issues.
First of all, run the user's input through a strict XML parser.
Reject any invalid markup.
You should use a whitelist of HTML tags and attributes (in the parsed XML).
Do not allow <script> tags, <iframe>s, or style attributes.
Run all URLs (href and src attributes) through a URI parser (eg, .Net's Uri class), and ensure that the protocol is http, https, or perhaps mailto. Again, reject any invalid URLs.
If you want to allow YouTube embedding, add your own <youtube> tag that takes a URL or video ID as a parameter (content or attribute), and transform it into a script on the server (after validating the parameter).
After you finish, make sure that you're blocking everything on this giant list.
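To illustrate the URL check and the parameter validation for a custom <youtube> tag, here is a rough Python sketch (the answer mentions .Net's Uri class; urllib.parse plays that role here). The function names are made up for this example, and the 11-character video ID pattern is an assumption based on how current IDs look, not a documented guarantee.

```python
import re
from urllib.parse import urlsplit

# Protocols the answer suggests allowing in href and src attributes.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

# Assumption: video IDs are 11 characters drawn from letters, digits, "-" and "_".
VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def is_safe_url(url):
    """Return True only if `url` parses and uses a whitelisted protocol."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False  # malformed URL: reject it rather than guess
    return parts.scheme in ALLOWED_SCHEMES


def is_plausible_youtube_id(video_id):
    """Validate the parameter of a custom <youtube> tag before using it."""
    return VIDEO_ID_PATTERN.fullmatch(video_id) is not None
```

Relative URLs have no scheme and are rejected by this check. Whether that is what you want depends on the site. This covers only the URL step; the tag and attribute whitelist still has to be applied to the parsed markup.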
There is no such thing as hacker-proof. You want to do everything you can to decrease the possibility of being hacked. The most obvious weaknesses to guard against are XSS (cross-site scripting) and SQL injection attacks. There are easy ways to avoid both, most notably using newer technologies that guard against them by default (text output that is encoded by default, queries that are converted, e.g. parameterized, before execution, etc.).
If you need to go beyond those levels, the options range from automated services that will "test" your system (mostly giving fuzzy numbers you can hand your sales guys once they all come back "good") to hard-core analysts who will pick apart your system for various audits.
Other than the basics mentioned above (XSS and SQL injection), the level of security you should try to obtain will really depend on your market.
Didn't see this mentioned explicitly, but also use fuzzers ( http://en.wikipedia.org/wiki/Fuzz_testing ).
It basically shoves random crap (strings of varying characters and lengths) into your input fields. It's used in industry practice because it finds lots of bugs (e.g. overflows).
http://www.fuzzing.org/ has a list of great fuzzers for you to try.
You can check a penetration testing framework like ISSAF. It gives you a checklist and a methodology to test important security aspects of your application.
I'm about to launch a set of multi-domain affiliate sites that have one thing in common: their content. Having read about the problem with duplicate content and Google, I'm a little worried that the parent domain or the sub-sites could get banned from the search engine for duplicated content.
If I have 100 sites with a similar look and feel and basically the same content with some minor element changes, how do I prevent a ban and get these indexed correctly?
Should I just prevent the sub-sites from being indexed completely with robots?
If so, how will people be able to find their site? I actually think the parent is the only one that should be indexed to avoid this, but I would love to hear other expert thoughts.
Google has recently released an update that allows you to include a link tag in the head of pages that use duplicated content, pointing to the original version. These are called canonical links, and they exist for the exact reason you mention: to be able to use duplicated content without penalisation.
For more information, look here:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
This doesn't mean that your sites with duplicated content will be ranked well for the duplicated content, but it does mean the original is "protected". For decent ranking in the duplicated sites you will need to provide unique content.
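If the sub-sites are generated from templates, emitting the canonical link could be as small as the sketch below. The helper name and the example URL are hypothetical. The output is the standard link rel="canonical" element that goes in each duplicate page's head.

```python
from html import escape


def canonical_link_tag(original_url):
    """Render the <link> element pointing a duplicate page at its original."""
    return '<link rel="canonical" href="{}" />'.format(escape(original_url, quote=True))
```

For example, canonical_link_tag("https://parent.example/widgets") would be placed in the head of the matching page on each sub-site, with the URL escaped so that characters such as & stay valid inside the attribute.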
If I have 100 sites with a similar look
and feel and basically the same content
with some minor element changes, how
do I prevent a ban and get these
indexed correctly?
Unfortunately for you, this is exactly what Google downgrades in its search listings, to make search results more relevant, and less rigged / gamed.
Fortunately for us (i.e. users of Google), their techniques generally work.
If you want hundreds of sites to be properly ranked, you'll need to make sure they each have unique content.
You won't get banned straight away. You will have to be reported by a person.
I would suggest launching with the duplicate content and then iterating over it in time, creating unique content that is dispersed across your network. This will ensure that not all sites are spammy copies of each other and will result in Google picking up the content as fresh.
I would say go ahead with it, but try to work in as much unique content as possible, especially where it matters most (page titles, headings, etc).
Even if the sites did get banned (more likely they would just have results omitted, but it is certainly possible they would be banned in your situation), you would be in basically the same spot you would have been in if you had decided to "noindex" all the sites.