Recently I've been working on an idea that requires me to query Google Images and retrieve links to the images matching a given search term. My most promising candidate for a usable Google Images API was the Google Web Search API, but it looks like it's going out of service as of tomorrow:
https://developers.google.com/web-search/docs/
The API that replaced it is the Google Custom Search API, but it's a little discouraging to use:
Google API Custom Search with Python - Programmatic Search Results
100 search queries a day is a very strict limit; that works out to only about four searches per hour. I also don't want to go through the hassle of creating some custom search bar that I'm never going to use except through Python.
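(For reference, an image query against the Custom Search JSON API looks roughly like the sketch below; the key and cx values are placeholders for credentials set up in the Google API console, and the free tier is what imposes the daily query limit.)

    # Minimal sketch of an image search via the Custom Search JSON API.
    # API_KEY and SEARCH_ENGINE_ID are placeholders, not real credentials.
    import requests

    API_KEY = "YOUR_API_KEY"
    SEARCH_ENGINE_ID = "YOUR_CX_ID"

    params = {
        "key": API_KEY,
        "cx": SEARCH_ENGINE_ID,
        "q": "kittens",
        "searchType": "image",
        "num": 10,
    }
    resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item["link"])  # direct URL of each matching image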
I decided to turn to parsing HTML directly from the results page. This presents a problem, though, because nowhere inside the page's HTML is there any direct link to the image, only referrer URLs. This is true of both the JavaScript-enabled and JavaScript-disabled versions of Google Images (so even if Python spoofs JavaScript as enabled, it gains nothing). I'm not sure where to go from here. Could anyone refer me to some obscure, updated library that I've somehow overlooked, or give me some pointers?
You could use Selenium Webdriver to actually execute the JavaScript and click on the images in the thumbnail view. Once an image has been opened, the link is in the DOM and you can scrape it from there. All Webdriver does is open an actual browser and simulate a user. You can even run it as a headless browser if you use xvfbwrapper. The downside is that even then, you will need all the dependencies of the browser you are using installed on your server.
However, scraping Google is against their terms of service, and they will make an effort to block you as quickly as possible. So unless you get past the captchas (which are linked to sessions), you probably won't be able to make a whole lot of searches before being blocked this way, either.
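In case it helps, here is a minimal sketch of that Selenium approach in Python. The search URL is real, but the selectors and the wait are placeholders; Google changes its markup often, so they will almost certainly need adjusting against the live page.

    # Sketch only: open the image results, click a thumbnail, then read image
    # URLs out of the DOM once the preview has loaded.
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # Chrome works the same way; both need their driver binary
    try:
        driver.get("https://www.google.com/search?q=kittens&tbm=isch")
        # Click the first thumbnail so the full-size image gets loaded into the DOM.
        thumbnails = driver.find_elements(By.CSS_SELECTOR, "img")  # placeholder selector
        thumbnails[0].click()
        time.sleep(2)  # crude wait; WebDriverWait on the preview pane is more robust
        # After the preview opens, the full-resolution URL shows up in an <img> src.
        for img in driver.find_elements(By.TAG_NAME, "img"):
            src = img.get_attribute("src")
            if src and src.startswith("http") and "gstatic" not in src:
                print(src)
    finally:
        driver.quit()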
Related
I set up a very basic headless browser implementation with Puppeteer on a server, and the way I have it configured currently, I have the system scrape arbitrary websites based on user input. I then have the server send the html code of the page to a client using response.write. (I'm not actually deploying this as a solution to anything - it's really just a proof of concept.)
The results are mixed depending on which website the system attempts to scrape, but one thing they all have in common is that things like links and external stylesheets either work sporadically or not at all. My question is: is there a way to view the entire website, with clickable links and all, using Puppeteer? Or is this ridiculously impractical and totally hopeless?
If there is a way to approach this, some example code would be great.
Thanks!
I'm using js-crawler to crawl websites, and have now run into a problem with CNN: part of its landing-page links are inside scripts (URLs that are generated dynamically, for some reason).
The thing is, the crawler doesn't really touch scripts. How should I address this? Should I write my own code in addition to my Node.js crawler, or is there a more advanced crawler that knows how to handle this dynamic behavior?
Making my comment into an answer:
Crawling content that is generated by client-side Javascript is a complicated problem that not even Google has fully solved.
The only way to truly do it is to use some sort of headless browser, safely sandboxed on your server, where the page is loaded into a browser-like environment in which it can run its own scripts and generate its own content; then you can examine the resulting DOM.
Even then, it won't necessarily generate content that requires user interaction (like clicking on a tab to show some content).
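The question uses a Node crawler, but the headless-browser idea is the same in any language. Purely as an illustration, here is a sketch using Python's Selenium bindings with headless Chrome to collect links that only exist after the page's own scripts have run (the URL is just an example):

    # Sketch: let a headless browser execute the page's JavaScript, then read
    # the links out of the resulting DOM.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://edition.cnn.com/")
        # By this point the page's scripts have run, so links generated
        # client-side are present in the DOM alongside the static ones.
        links = {a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")}
        links.discard(None)
        print(len(links), "links found")
    finally:
        driver.quit()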
I'm working on a DNN website. I have a user account with Admin privileges but don't have access to the Host account. I do have FTP access, have been browsing around the file structure, and have seen some files referring to search.
The search is not working on the website so I was hoping I could replace the back-end code which runs the search, via FTP.
Which files would need to be replaced to make sure they are not corrupted or buggy?
I realize doing this may not solve the problem, so any other advice on troubleshooting or possible solutions is appreciated.
EDIT (for those asking in what way the search does not work):
Here is an image of what happens when I search for 'sheep' (the website is all about sheep). I was told by the company that built the original website that the search runs on our pages' 'Keywords'. I've made sure the pages contain keywords, but they still do not show up in the search.
The solution I ended up using for this problem, since I could find no other fix without Super-User account access, was to implement Google's Custom Search Engine with the multi-page option.
http://www.google.com/cse/
In my case the original search engine worked via a GET request with a query parameter named q. This is the same as Google's CSE multi-page option, so I was able to simply remove the old search-results HTML from a module and replace it with the HTML snippet provided by Google.
Say someone else has a website generated by JavaScript, so I can't go look at the source and read what should be on the screen. How can I grab the text on the screen so I can feed it into another program? Also, how can I write a program that automatically clicks on radio buttons, links, etc. that satisfy certain criteria?
You can write a web scraping tool in Perl or Python. Or, you can use existing tools and frameworks to achieve that.
Check out Scrapy, an open-source tool written in Python.
Take a look at Selenium too.
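For a feel of what a Scrapy crawler looks like, here is a minimal spider sketch (the start URL and the link-filtering criterion are placeholders). Note that Scrapy on its own does not execute JavaScript, which is why Selenium is worth looking at for script-generated pages.

    # Minimal Scrapy spider sketch; run with: scrapy runspider this_file.py
    import scrapy

    class TextSpider(scrapy.Spider):
        name = "text_spider"
        start_urls = ["https://example.com/"]  # placeholder start page

        def parse(self, response):
            # Yield the visible paragraph text from the page.
            for paragraph in response.css("p::text").getall():
                yield {"text": paragraph.strip()}
            # Follow links that satisfy some criterion (hypothetical filter).
            for href in response.css("a::attr(href)").getall():
                if "interesting" in href:
                    yield response.follow(href, callback=self.parse)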
To parse dynamic content you could look at the JavaScript source and fetch that same content the same way the webpage fetches it (i.e. by replicating AJAX calls and such).
If you want to submit data as if the elements had been clicked/edited/selected (rather than actually clicking them), you can also send a request containing the same data the server is expecting, using some HTTP library such as cURL.
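A minimal sketch of that idea in Python with the requests library (the URL and form fields are made up for illustration; the real ones come from watching the site's requests in the browser's developer tools):

    # Sketch: replicate a form submission directly instead of clicking the
    # button in a browser. URL and field names are hypothetical placeholders.
    import requests

    payload = {
        "choice": "option-2",  # hypothetical radio-button value
        "submit": "Send",
    }
    resp = requests.post("https://example.com/form-handler", data=payload)
    resp.raise_for_status()
    print(resp.text[:500])  # the server's response, as if the form had been submitted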
If you need to handle content generated by script, then your first problem is to cause the script to execute. Further, the script will want to generate the content into a DOM. That means you need to have a DOM, and a script engine, and probably HTTP access to the Internet, and XML handling, etc.
If that sounds a lot like a web browser, then you're listening.
What you basically need is a web browser that you can control from a program. You'll need to be able to tell it to browse to a page, click buttons and links, etc., then you'll need to read back the resulting DOM.
Only then will you need to parse the page.
If you're in the Microsoft world, then you can use the WebBrowser control. There are several forms of this, and they all amount to the same thing: you can have Internet Explorer run inside of your program, and your program can control it.
I understand there are other browsers that can be controlled from a program, but since I don't know their details, I'll wait for someone else to tell us both.
We have a requirement for people to be able to look at documents others have uploaded to us (mainly Word, possibly some RTF) via our web app. We want the user to be able to open the docs inside the browser, keeping the original formatting and without needing another application (like Word, Acrobat, etc.).
We thought about using Google Docs to do this; there appear to be some batch-uploading options to get documents in there, but does anyone know if we can use the APIs to keep users on our site, without them having to log in to Google Docs themselves and without redirecting them to Google Docs to view the documents?
Cheers
There's an option to make documents public (Somewhere in Share->Advanced Options).
Using the API you can get a list of the documents in your Google Docs account; you can even search them. In your app you could make a link to the document in Google Docs that opens in a new window. That way your user never navigates away from your page. An alternative would be to use an IFrame, but that's considered bad practice.
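As a rough sketch of that idea using the current Drive API (the old Documents List API this kind of answer relied on has since been retired; this assumes OAuth credentials are already available as creds, and the query is illustrative):

    # Sketch: list/search Google Docs files and get shareable view links.
    from googleapiclient.discovery import build

    def list_documents(creds, query_text=None):
        service = build("drive", "v3", credentials=creds)
        q = "mimeType = 'application/vnd.google-apps.document'"
        if query_text:
            q += f" and fullText contains '{query_text}'"
        results = service.files().list(q=q, fields="files(id, name, webViewLink)").execute()
        return results.get("files", [])

    # Each returned webViewLink can be used as the target of a link that opens
    # the document in a new window, as described above.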
A completely different approach could be to automatically generate and host a PDF each time someone uploads a file. There are scripts/programs that can do that; just call them after you receive a file.
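For example, LibreOffice's headless converter can do this from the command line; here is a sketch of calling it after an upload (assuming LibreOffice is installed on the server; it handles .doc, .docx, and .rtf):

    # Sketch: convert an uploaded document to PDF with LibreOffice's headless mode.
    import subprocess
    from pathlib import Path

    def convert_to_pdf(uploaded_file: str, output_dir: str = "converted") -> Path:
        Path(output_dir).mkdir(exist_ok=True)
        subprocess.run(
            ["libreoffice", "--headless", "--convert-to", "pdf",
             uploaded_file, "--outdir", output_dir],
            check=True,
        )
        return Path(output_dir) / (Path(uploaded_file).stem + ".pdf")

    # e.g. call convert_to_pdf(path) right after the upload handler saves the
    # file, then serve the resulting PDF for in-browser viewing.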