Automatically saving web pages requiring login/HTTPS - browser

I'm trying to automate some data scraping from a website. However, because the user has to go through a login screen, a wget cron job won't work, and because I need to make an HTTPS request, a simple Perl script won't work either. I've tried looking at the "DejaClick" addon for Firefox to simply replay a series of browser events (logging into the website, navigating to where the interesting data is, downloading the page, etc.), but the addon's developers for some reason didn't include saving pages as a feature.
Is there any quick way of accomplishing what I'm trying to do here?

A while back I used mechanize (wwwsearch.sourceforge.net/mechanize) and found it very helpful. It builds on urllib2, so from what I've just read it should also work with HTTPS requests, which would hopefully prove my comment above wrong.
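For what it's worth, here is a minimal sketch of how a mechanize script for this could look (the URLs and form field names below are made up; you'd have to inspect the real login form):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # the site is meant for interactive users anyway
br.open("https://example.com/login")  # hypothetical login URL
br.select_form(nr=0)                  # assume the login form is the first form on the page
br.form["username"] = "me"            # hypothetical field names
br.form["password"] = "secret"
br.submit()
# mechanize keeps the session cookies, so protected pages are now reachable
page = br.open("https://example.com/interesting-data").read()
with open("data.html", "wb") as f:
    f.write(page)

Run from cron, something like that could replace the wget job, since the login carries over to the later requests.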

You can record your actions with the IRobotSoft web scraper. See the demo here: http://irobotsoft.com/help/
Then use the saveFile(filename, TargetPage) function to save the target page.

Related

How to get a URL from python 'webbrowser'?

I'm using 'webbrowser' for opening multiple tabs in Chrome/Firefox.
For example:
webbrowser.get("open -a /Applications/Firefox.app %s").open('http://google.com',new=1)
But the problem is, in some cases the URL gets changed (it may be redirected to a CAPTCHA provider), and then I have to stop opening new tabs and terminate. Is there any way to read the URL after opening it using 'webbrowser'?
'Selenium' doesn't help me here, because I need some extensions to be installed in the browser, and I have to provide my login details and configure some settings in those extensions/plugins. So I don't think Selenium WebDriver will be useful here.
Can you suggest any solution for this?
Thanks a lot in advance!
If I understood your question correctly, you want to open the link in the browser each time. If that's the case, you could use
webbrowser.open_new_tab("link")
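That said, webbrowser only hands the URL off to the browser and never reports back where it ends up, so the module itself can't tell you the final URL. One possible workaround, assuming the redirect happens server-side and the page is reachable outside the browser, is to resolve the URL first and only open a tab if it still points where you expect (a sketch; the check uses Python 3's urllib.request, which is urllib2 on Python 2):

import urllib.request
import webbrowser

def open_if_not_redirected(url):
    # urlopen follows server-side redirects; geturl() returns the final URL
    with urllib.request.urlopen(url) as resp:
        final_url = resp.geturl()
    if final_url.rstrip("/") == url.rstrip("/"):
        webbrowser.open_new_tab(url)
        return True
    print("Stopped: %s redirected to %s" % (url, final_url))
    return False

This won't catch redirects that only happen inside the browser (for example ones triggered by your extensions or login state), so treat it as a best-effort check.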

Why does the Foursquare API JS not work with HTTPS?

In a system I have to maintain (didn't build it, just inherited it) we have a Foursquare implementation that hasn't been used in quite a while. Trying to revive it failed, because our page is now loaded via HTTPS, which it wasn't previously.
We are using the "Save to Foursquare" button as well as the API request to retrieve the number of Check-ins. I already switched all the JS includes and intent links from http to https and at least now it shows the number and the button correctly.
However, I can't click the button and checking the browser's console I found that it added a script tag to the head of this page which tries to access http://platform.foursquare.com/js/modules/widgets.asyncbundle.js. The browser obviously blocks this, because it's not using HTTPS.
The file we are explicitly loading is https://platform.foursquare.com/js/widgets.js. It seems to me like this script is not reacting correctly to HTTP vs. HTTPS. There is probably a very simple solution to this, so what am I missing?
I don't know if you've tried it yet, but the Foursquare website says this on the matter:
Change the source of the JavaScript file to https://platform-s.foursquare.com/js/widgets.js
Add {"secure":true} to the global configuration block (window.___fourSq)`
The same link (see below) has all the different ways to call the Save To Foursquare function using its .saveTo() function.
https://developer.foursquare.com/overview/widgets
I hope this information and these links help! Cheers.

how to perform a POST through a Chrome extension?

How can I perform a POST through a Chrome extension? Let's say I want to send the current tab's page title to a web page.
You can do POST XHRs from chrome extensions to any URL, as long as you have host permissions defined in your manifest. See these docs.
In a Chrome extension, the best way to do what I think you want is via a content script (see the documentation). A word of warning, however: pinging your server with a POST request every time someone with your extension installed opens a web page is going to be extremely heavy going on your servers, especially if you have a lot of installs. A possible solution is to use the content script to keep a tally of the sites a user visits and save this data in an HTML5 database (which Chrome supports), then have background.html send the data in bulk at given intervals with an AJAX request; this will significantly cut down the number of times your server is pinged.

How can I pass a message from outside URL to my Chrome Extension?

I know there's a way for extensions and pages to communicate locally, but I need to send a message from an outside URL and have my Chrome extension listen for it.
I have tried easyXDM in the background page, but it seems to stop listening after a while, as if Google "turns off" the JavaScript in the background page.
I think you could try a workaround: build a site with some specific data structure, then implement a content script that looks for that specific data structure, and when it finds it, it can fetch the data you want passed to your extension.
Yes, you need a content script that communicates with the page using DOM events. Instructions on how to do that are here:
http://code.google.com/chrome/extensions/content_scripts.html#host-page-communication

how to check if my website is being accessed using a crawler?

How can I check whether a certain page is being accessed by a crawler or a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of engineering to put a solution together.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or anything else well behaved, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point when trying to determine whether it's humanly possible to consume the data that fast. This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
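To make checks one and three concrete, here is a rough Python sketch (the bot signatures, the two-second threshold and the looks_automated helper are purely illustrative, not taken from the answers above; the client id would come from a cookie, as noted):

import time

KNOWN_BOT_SIGNATURES = ("googlebot", "bingbot", "slurp", "crawler", "spider")
MIN_SECONDS_BETWEEN_REQUESTS = 2.0  # rough "humanly possible" threshold, tune per site
last_seen = {}  # client id (e.g. a cookie value) -> timestamp of the previous request

def looks_automated(user_agent, client_id, now=None):
    now = now or time.time()
    ua = (user_agent or "").lower()
    # Check one: well-behaved spiders identify themselves in the User-Agent header
    if any(sig in ua for sig in KNOWN_BOT_SIGNATURES):
        return True
    # Check three: flag clients that request pages faster than a human could read them
    previous = last_seen.get(client_id)
    last_seen[client_id] = now
    return previous is not None and (now - previous) < MIN_SECONDS_BETWEEN_REQUESTS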
You can use the robots.txt file to block access for crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved automated system should check your site.
I had a similar issue in my web application because I created some bulky data in the database for each user that browsed the site, and the crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers, because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce the time taken to crawl.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET JavaScript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking on a button. Some crawlers do process JavaScript, and they would quite obviously trigger the onclick event of a button element, but not that of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application (www.so-much-to-do.com) to implement this feature:
<div
class="all rndCorner"
style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">
Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably so far, although crawlers could be changed to be even cleverer, maybe after reading this article :D
