I'd like to scrape all the URLs my searches return when searching for stuff via Google. I've tried making a script, but Google did not like it, and adding cookie support and captcha handling was too tedious. I'm looking for something that - while I'm browsing through the Google search pages - will simply take all the URLs on the pages and put them in a .txt file or store them somehow.
Do any of you know of something that will do that? Perhaps a Greasemonkey script or a Firefox addon? It would be greatly appreciated. Thanks!
See the JSON/Atom Custom Search API.
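For illustration, a minimal query from Python might look like this (the API key and search engine ID are placeholders you get from the Google API console; each result item exposes its URL in the "link" field):

import json
import urllib.parse
import urllib.request

# Placeholders: substitute your own API key and custom search engine ID (cx)
params = urllib.parse.urlencode({
    "key": "YOUR_API_KEY",
    "cx": "YOUR_SEARCH_ENGINE_ID",
    "q": "pokemon",
})
url = "https://www.googleapis.com/customsearch/v1?" + params

with urllib.request.urlopen(url) as response:
    data = json.load(response)

# Each result item carries its URL in the "link" field
for item in data.get("items", []):
    print(item["link"])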
I've done something similar for Google Scholar, where there's no API available. My approach was basically to create a proxy web server (a Java web app on Tomcat) that would fetch the page, do something with it and then show it to the user. This is a 100% functional solution but requires quite some coding. If you are interested I can get into more details and put up some code.
Google search results are very easy to scrape. Here is an example in PHP.
<?php
# a trivial example of how to scrape google
$html = file_get_contents("http://www.google.com/search?q=pokemon");
$dom = new DOMDocument();
@$dom->loadHTML($html); # @ suppresses warnings about malformed HTML in the results page
$x = new DOMXPath($dom);
# result links live inside the div with id "ires"
foreach($x->query("//div[@id='ires']//h3//a") as $node)
{
    echo $node->getAttribute("href")."\n";
}
?>
You may try the IRobotSoft bookmark addon at http://irobotsoft.com/bookmark/index.html
I want to ask a question: how do I delete all cookies when a visitor visits my website? I am using WordPress.
I have searched a lot of questions like mine, but I can't find a satisfactory answer.
Please help me! Sorry for my poor English!
You can either retrieve and manipulate cookies on the server side using PHP or client side, using JavaScript.
In PHP, you set cookies using setcookie(). Note that this must be done before any output is sent to the browser, which can be quite the challenge in WordPress. You're pretty much limited to some of the early-running hooks, which you can set via a plugin or theme file (functions.php for example), e.g.
add_action('init', function() {
    // yes, this is a PHP 5.3 closure, deal with it
    if (!isset($_COOKIE['my_cookie'])) {
        setcookie('my_cookie', 'some default value', strtotime('+1 day'));
    }
});
Retrieving cookies in PHP is much easier. Simply get them by name from the $_COOKIE superglobal, e.g.
$cookieValue = $_COOKIE['cookie_name'];
Unsetting a cookie requires setting one with an expiration date in the past, something like
setcookie('cookie_name', null, strtotime('-1 day'));
For JavaScript, I'd recommend having a look at one of the jQuery cookie plugins (seeing as jQuery is already part of WordPress). Try http://plugins.jquery.com/project/Cookie
Also refer to these:
http://codex.wordpress.org/WordPress_Cookies
http://codex.wordpress.org/Function_Reference/wp_clear_auth_cookie
Since it is the WordPress platform, it is most probably in PHP, so you can use either a WordPress function or a plain PHP function to do it.
<?php wp_clear_auth_cookie(); ?>
I have some documents in HTML and I need them to be printed/generated on a server (no UI, automated, Linux-based).
I'm very satisfied with Google Chrome's "html to pdf" output for these documents, but I'm wondering: is it possible to use that "html to pdf" printing engine from the Google Chrome browser somehow for this purpose?
Actually, I found a solution:
First, wkhtmltopdf: http://code.google.com/p/wkhtmltopdf/
And in the end I realised that mpdf (a PHP library) can help me too :)
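For the record, wkhtmltopdf is easy to drive from a script. A minimal Python sketch (it assumes the wkhtmltopdf binary is installed and on the PATH, and input.html is the document to convert; a URL also works in place of the local file):

import subprocess

# Convert a local HTML file to PDF using the wkhtmltopdf command-line tool.
# Assumes wkhtmltopdf is installed and available on the PATH.
subprocess.run(["wkhtmltopdf", "input.html", "output.pdf"], check=True)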
If you need an HTTP API service to convert HTML to PDF from a URL, you may want to check this answer I wrote, which explains how to do it.
Example:
https://dhtml2pdf.herokuapp.com/api.php?url=https://www.github.com&result_type=show
shows the generated PDF of the site https://www.github.com in the browser.
See the project on GitHub.
Hope it helps.
How do I check whether a certain page is being accessed by a crawler or a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of engineering to build a proper solution.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or anything else legitimate, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it's accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point when trying to determine whether it's humanly possible to consume the data that fast (see the sketch below). This is tricky; you'll most likely have to rely on cookies, since an IP can be shared by multiple users.
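Here is a rough sketch of that third point, just to illustrate the idea (Python for brevity; the session id is assumed to come from a cookie, and the threshold is something you would have to tune for your content):

import time

# Track the timestamp of each session's previous request and flag anything
# that asks for pages faster than a human plausibly could.
MIN_SECONDS_BETWEEN_PAGES = 2.0   # assumption: tune this to your content
last_request = {}                 # session_id -> timestamp of previous request

def looks_automated(session_id):
    now = time.time()
    previous = last_request.get(session_id)
    last_request[session_id] = now
    # The session id would normally come from a cookie, since an IP address
    # can be shared by several users.
    return previous is not None and (now - previous) < MIN_SECONDS_BETWEEN_PAGES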
You can use the robots.txt file to block access by crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved crawler should check your site (malicious ones will simply ignore it).
I had a similar issue in my web application: I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce the time spent serving crawls.
I solved the problem the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search-engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET Javascript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking a button. Some crawlers do process JavaScript, and they would obviously trigger the onclick event of a button element, but not of a non-interactive element such as a div. The following is the HTML/JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
class="all rndCorner"
style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
onclick="$TodoApp.$AddSampleTree()">
Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably so far, although crawlers could be made even cleverer, maybe after reading this article :D
I'm trying to automate some data scraping from a website. However, because the user has to go through a login screen, a wget cronjob won't work, and because I need to make an HTTPS request, a simple Perl script won't work either. I've tried looking at the "DejaClick" addon for Firefox to simply replay a series of browser events (logging into the website, navigating to where the interesting data is, downloading the page, etc.), but the addon's developers for some reason didn't include saving pages as a feature.
Is there any quick way of accomplishing what I'm trying to do here?
A while back I used mechanize (wwwsearch.sourceforge.net/mechanize) and found it very helpful. It is built on urllib2, so as I read it now it should also handle HTTPS requests.
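A minimal sketch of a mechanize session (the login URL, form index, field names and credentials below are placeholders for whatever the real site uses):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)           # robots.txt would otherwise block the script
br.open("https://example.com/login")  # placeholder login URL

br.select_form(nr=0)                  # assume the login form is the first form on the page
br["username"] = "me"                 # placeholder field names and credentials
br["password"] = "secret"
br.submit()

# Navigate to the page with the interesting data and save it
response = br.open("https://example.com/data")
with open("page.html", "wb") as f:
    f.write(response.read())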
You can record your actions with the IRobotSoft web scraper. See the demo here: http://irobotsoft.com/help/
Then use the saveFile(filename, TargetPage) function to save the target page.
I would like to download images with a certain tag from Instagram, together with their likes. With this post I hope to get some advice or tips on how to do this. I have no experience with web scraping or web API usage. One of my questions is: can you create a program like this in Python code, or can you only do this through a web page?
So far I have understood the following. To get images with a certain tag you have to:
you need a valid access_token to even gain access to images by tag, which can be done like this. However, when I sign up, it asks for a website. Does this indicate that you can only use the APIs on websites, rather than from a Python program for instance?
you use a media Tag Endpoint to search for tags by name.
I have no idea what the last step will return exactly, but I expect it will give me the IDs of images that carry the tag. Correct? Now I will also need to get the likes belonging to these images. Just like the previous step:
you use a media Likes endpoint to get a list of the users who liked the image, and of course you can take the length of that list (a rough sketch of both calls follows below).
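In Python I imagine the two calls would look roughly like this (untested; the access token and tag are placeholders, and I'm assuming the response layout described in the endpoint docs):

import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"   # placeholder
TAG = "cats"                         # placeholder tag name

# 1. Recent media tagged with TAG
media = requests.get(
    "https://api.instagram.com/v1/tags/%s/media/recent" % TAG,
    params={"access_token": ACCESS_TOKEN},
).json()

for item in media.get("data", []):
    media_id = item["id"]
    image_url = item["images"]["standard_resolution"]["url"]

    # 2. Users who liked this media item
    likes = requests.get(
        "https://api.instagram.com/v1/media/%s/likes" % media_id,
        params={"access_token": ACCESS_TOKEN},
    ).json()
    print(image_url, len(likes.get("data", [])))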
If I can accomplish all of these steps, it seems like I can achieve my original goal. I googled to see whether there was something out there already. The only thing I could find was InstaRaider, but it did not seem to fit my description because it only scrapes the images from a specific user, not by tag, and not the likes. Any suggestions or ideas would be very helpful; I have only programmed in Python and Java before.
I can only tell you that for the website URL you can use localhost, like this:
http://127.0.0.1
OR
http://localhost
I also tried to do exactly the same thing before, but I could not, so I used a website to search for tags and images:
http://iconosquare.com/search/[HASHTAG]