How do you prevent crawling of your web site? - iis

I am running a website on IIS with more than 1,000 paginated page links, and I want to prevent others from crawling/stealing these pages by running a crawler script that grabs the info page by page.
Is there any way to tell whether a request comes from a real user or is being run by a script? Or is there some filter for this at a higher level, before the request even reaches the application?

You can't prevent automated crawling.
You can make it harder to crawl your content automatically, but if you allow users to see the content, it can be automated (automating browser navigation is not hard, and computers generally don't mind waiting a long time between requests).
One option is to require a single "user" (authenticated or not) to leave some minimal delay between requests (e.g. 1-5 seconds). That makes generic crawling much less useful (it requires some "user id" in the request plus a delay between requests), and someone would have to write custom crawling code for your site, which is clearly more time intensive.
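Here is a minimal sketch of that per-"user" delay, assuming requests can be attributed to some session or user id (cookie, auth token, or IP address); the id extraction and the 2-second threshold are my own assumptions:

const MIN_DELAY_MS = 2000;                       // minimum gap between requests
const lastRequestAt = new Map<string, number>(); // userId -> time of last request

// Returns true if the request should be served, false if it arrives too soon
// after the previous request from the same user.
function allowRequest(userId: string): boolean {
  const now = Date.now();
  const last = lastRequestAt.get(userId);
  lastRequestAt.set(userId, now);
  return last === undefined || now - last >= MIN_DELAY_MS;
}

A request that comes back false could be rejected, delayed, or answered with a CAPTCHA instead of the content.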
Note that writing a special "crawler" for your site may be considered a "noble" challenge, which can significantly increase the incentive to create one (e.g. check out the "how to make Google Maps available offline" questions).

How can I do pagination?

Which is better: making one request that gets all articles for all pages, or making a request for each page as needed? Right now I am using the second approach (a request per page as needed).
P.S. After the request, all data is written to Redux.
It's usually better to paginate your results; otherwise you load a large amount of data for nothing, which can be slow if the user has limited bandwidth. Very large quantities of data loaded in a web browser can also slow down the browser itself in some cases.
If the calls to get the results of one page take too long when browsing through multiple pages, you could load two pages at once and have your UI display the second page immediately when the user clicks 'next', while contacting the backend to get the third page. That way you keep a responsive UI while only loading what's necessary.
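A rough sketch of that prefetching idea, assuming a Redux store with a thunk middleware and a hypothetical /api/articles endpoint (the action type and endpoint are made up for illustration):

interface Article { id: number; title: string; }

const pageLoaded = (page: number, items: Article[]) =>
  ({ type: "articles/pageLoaded" as const, page, items });

// Thunk: fetch the requested page, store it, then prefetch the next page in
// the background so the UI can switch to it instantly when the user clicks 'next'.
const loadPage = (page: number) => async (dispatch: (action: unknown) => void) => {
  const res = await fetch(`/api/articles?page=${page}`);
  dispatch(pageLoaded(page, await res.json()));

  // fire-and-forget prefetch of the following page
  fetch(`/api/articles?page=${page + 1}`)
    .then((r) => r.json())
    .then((items: Article[]) => dispatch(pageLoaded(page + 1, items)))
    .catch(() => { /* a failed prefetch can simply be ignored */ });
};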

tabs permission or content script?

I'm writing an extension that needs to show a page action on amazon.com pages.
Would it be better to request the "tabs" permission or to inject a content script into amazon.com pages?
The tabs permission strikes me as using fewer resources (because it just checks the URL against a regex in the background script), but I think it has a scarier permission message ("access your tabs and browsing activity").
Injecting a content script into amazon.com pages seems like it would take more resources, but it would only need permission for amazon.com...
This is a generic question, and the answer varies from client to client. You have already pointed out the pros and cons of each.
I suggest you go for content scripts if your users are particular about security and privacy; in this case you are adding extra load to the pages (the content scripts and message passing), which may slow down their normal execution.
I suggest you go for the tabs permission if you care most about performance. It is a native API that runs in the background page, so there is no extra load on the tabs. Many extensions on the Web Store use the tabs API, and I don't think it will scare users, as it is nothing new.
However, it all comes down to your target group of users.
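As a rough sketch of the tabs-permission approach, assuming a Manifest V2 extension with "tabs" and a page_action declared in the manifest (Manifest V3 would use chrome.action or declarativeContent instead):

// The "tabs" permission lets the background page see tab URLs, so the check
// is a simple regex with no code injected into the page itself.
const AMAZON_URL = /^https?:\/\/(www\.)?amazon\.com\//;

chrome.tabs.onUpdated.addListener((tabId, _changeInfo, tab) => {
  if (tab.url && AMAZON_URL.test(tab.url)) {
    chrome.pageAction.show(tabId); // show the page action icon on this tab
  }
});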

How would one technically describe this desired website functionality (involving timeout and log-in)?

On websites like eBay, if your session times out while you were looking at, say, a shoe, when you come back to the page (after your sleep) you still see the shoe, but you are logged out.
However, once you log back in, you get directed to that shoe immediately.
I am thinking of describing it as: "After timeout from the website, upon re-login go back directly to the page where the timeout happened."
But how is this functionality described (as in, what technologies would we use)? Also, is it something that needs a lot of resources?
Quite often it's done by having a "Return To" URL passed along to the login page.
So... (logged in):
Visit the shoe page.
The session times out.
Either via JavaScript on timeout or on the next page refresh, the user sees the logged-out shoe page.
The login link on that page includes the URL of the shoe page, e.g. Login.php?ReturnTo=ShoePage.php
Note that this applies to websites. You've also added a web service tag, which is completely different - web services have no concept of a "current page".
If you decide instead to store the last page for the user in the DB, what happens if the last page visited is no longer valid? You'd also be adding one DB operation per page load (to update the last visited page). That's no real performance concern, but worth knowing. It's slightly non-standard behavior, so you'd need to make sure the user knows why they've been redirected.
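A minimal sketch of the ReturnTo flow above (the answer's example uses Login.php, but the same pattern applies to any stack; the helper names here are made up):

// When rendering the logged-out page, the login link carries the current URL:
function buildLoginLink(currentPath: string): string {
  return `/Login.php?ReturnTo=${encodeURIComponent(currentPath)}`;
}

// After a successful login, redirect to ReturnTo, but only accept same-site
// relative paths so the parameter cannot be abused as an open redirect.
function resolveReturnTo(returnTo: string | null): string {
  if (returnTo && returnTo.startsWith("/") && !returnTo.startsWith("//")) {
    return returnTo;
  }
  return "/"; // fall back to the home page
}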

How to ensure a web page is visited by a human and not by a bot?

How can I ensure that different web pages are visited by humans, not by bot programs?
Is there some technique?
Thanks.
// Check the User-Agent header for the string "googlebot".
// Note: the User-Agent can be spoofed, so this is only a heuristic.
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // Googlebot (or something claiming to be it) is visiting
}
This is a PHP example of finding out whether the visitor is Googlebot.
You can either check the User-Agent in the HTTP headers, or look for bot-like activity, such as a very high frequency of hits over a wide range of pages coming from a single IP address (though you might see that with a proxy server too). You can also look for hits on robots.txt and assume that other visits within the same session were also from a robot.
In reality there is no sure-fire way of doing it, as sophisticated robot writers can pretend to be browsers.
Time can be a good measure of whether a visit was from a human or a bot.
If you set a delay on the JavaScript that tracks the visit so that it only executes after 1 or 2 seconds, most humans will stay on a page at least that long (even if they don't like it), whereas a bot can usually scan the page and move on within that time.
Just a thought.
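For example, a minimal sketch of that delayed tracking in client-side code, assuming a hypothetical /track endpoint (a bot that fetches the HTML and moves straight on never fires the beacon):

const TRACK_DELAY_MS = 2000; // only count visits that last at least 2 seconds

window.setTimeout(() => {
  // navigator.sendBeacon survives the user navigating away mid-request
  navigator.sendBeacon("/track", JSON.stringify({ page: location.pathname }));
}, TRACK_DELAY_MS);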

How do I block web scraping without blocking well-behaved bots?

I'm building an e-commerce website with a large database of products. Of course, it is nice when Google indexes all the products of the website. But what if some competitor wants to web-scrape the website and grab all the images and product descriptions?
I was observing some websites with similar lists of products, and they place a CAPTCHA, so "only humans" can read the list of products. The drawback is... it is invisible to Google, Yahoo or other "well-behaved" bots.
You can discover the IP addresses that Google and the others are using by checking visitor IPs with whois (on the command line or on a web site). Then, once you've accumulated a stash of legitimate search engines, allow them into your product list without the CAPTCHA.
If you're worried about competitors using your text or images, how about a watermark or customized text?
Let them take your images and you'd have your logo on their site!
Since a potential screen-scraping application can spoof the user agent and HTTP referrer (for images) in the header and use a timing schedule that is similar to a human browser, it is not possible to completely stop professional scrapers. But you can still check for these things and prevent casual scraping.
I personally find CAPTCHAs annoying for anything other than signing up on a site.
One technique you could try is the "honey pot" method: it can be done either by mining log files or via some simple scripting.
The basic process is to build your own "blacklist" of scraper IPs by looking for IP addresses which look at 2+ unrelated products in a very short period of time. Chances are these IPs belong to machines. You can then do a reverse lookup on them to determine if they are nice (like GoogleBot or Slurp) or bad.
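A rough sketch of that blacklist-building heuristic; here "unrelated" is approximated by product category, and the window, threshold and names are my own assumptions:

const WINDOW_MS = 5000;  // "a very short period of time"
const recentViews = new Map<string, { category: string; at: number }[]>();
const blacklist = new Set<string>();

// Call this on every product-page hit; it flags IPs that view products from
// two or more unrelated categories within the window.
function recordProductView(ip: string, category: string): void {
  const now = Date.now();
  const recent = (recentViews.get(ip) ?? []).filter((v) => now - v.at < WINDOW_MS);
  recent.push({ category, at: now });
  recentViews.set(ip, recent);

  if (new Set(recent.map((v) => v.category)).size >= 2) {
    blacklist.add(ip); // candidate scraper: verify with a reverse lookup next
  }
}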
Blocking web scrapers is not easy, and it's even harder to avoid false positives.
Anyway, you can add some netranges to a whitelist and not serve any CAPTCHA to them.
All the well-known crawlers (Bing, Googlebot, Yahoo, etc.) always use specific netranges when crawling, and all those IP addresses resolve to specific reverse lookups.
Few examples:
Google IP 66.249.65.32 resolves to crawl-66-249-65-32.googlebot.com
Bing IP 157.55.39.139 resolves to msnbot-157-55-39-139.search.msn.com
Yahoo IP 74.6.254.109 resolves to h049.crawl.yahoo.net
So let's say that '*.googlebot.com', '*.search.msn.com' and '*.crawl.yahoo.net' addresses should be whitelisted.
There are plenty of whitelists available on the internet that you can implement.
That said, I don't believe CAPTCHA is a solution against advanced scrapers, since services such as deathbycaptcha.com or 2captcha.com promise to solve any kind of CAPTCHA within seconds.
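A minimal sketch of that whitelist check using forward-confirmed reverse DNS, assuming a Node.js backend (the suffix list is the one given above; everything else is illustrative):

import { promises as dns } from "dns";

const ALLOWED_SUFFIXES = [".googlebot.com", ".search.msn.com", ".crawl.yahoo.net"];

// True if `ip` reverse-resolves to a whitelisted crawler hostname and that
// hostname resolves back to the same IP (forward confirmation), so a scraper
// cannot simply fake its reverse DNS record.
async function isWhitelistedCrawler(ip: string): Promise<boolean> {
  try {
    for (const host of await dns.reverse(ip)) {
      if (!ALLOWED_SUFFIXES.some((suffix) => host.endsWith(suffix))) continue;
      const addresses = await dns.resolve4(host); // forward lookup
      if (addresses.includes(ip)) return true;    // confirmed
    }
  } catch {
    // reverse lookup failed: treat as not whitelisted
  }
  return false;
}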
Please have a look at our wiki, http://www.scrapesentry.com/scraping-wiki/ - we wrote many articles there on how to prevent, detect and block web scrapers.
Perhaps I over-simplify, but if your concern is server performance, then providing an API would lessen the need for scrapers and save you bandwidth and processor time.
Other thoughts listed here:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/
