I'm using js-crawler to crawl websites, and I've now run into a problem with CNN: part of its landing-page links live inside scripts (URLs that are generated dynamically, for some reason).
The thing is, the crawler doesn't really touch scripts. How should I address this? Should I write my own code in addition to my Node.js crawler? Is there a more advanced crawler that knows how to handle this dynamic behavior?
Making my comment into an answer:
Crawling content that is generated by client-side JavaScript is a complicated problem that not even Google has fully solved.
The only way to truly do it is to use some sort of headless browser, safely sandboxed on your server, where the page is loaded into a browser-like environment in which it can run its own scripts and generate its own content; you can then examine the resulting DOM.
Even then, it won't necessarily generate content that requires user interaction (like clicking on a tab to show some content).
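For a sense of what that looks like, here is a minimal sketch using Puppeteer, one common headless-browser option (the URL and the link-collection step are just illustrative):

```ts
import puppeteer from "puppeteer";

// Load a page in a sandboxed headless browser, let its scripts run,
// then examine the resulting DOM (here: collect every generated link).
async function crawlRendered(url: string): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so script-generated content exists.
    await page.goto(url, { waitUntil: "networkidle0" });
    return await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
  } finally {
    await browser.close();
  }
}

// crawlRendered("https://www.cnn.com/").then((links) => console.log(links));
```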
I set up a very basic headless-browser implementation with Puppeteer on a server; the way I have it configured currently, the system scrapes arbitrary websites based on user input, and the server then sends the HTML code of the page to a client using response.write. (I'm not actually deploying this as a solution to anything; it's really just a proof of concept.)
The results are mixed depending on which website the system attempts to scrape, but one thing they all have in common is that things like links and external stylesheets either work sporadically or not at all. My question is: is there a way to view the entire website, with clickable links and all, using Puppeteer? Or is this ridiculously impractical and totally hopeless?
If there is a way to approach this, some example code would be great.
Thanks!
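For reference, here is a stripped-down sketch of the kind of setup described above, assuming Express and Puppeteer (the route name and query parameter are made up). One frequent cause of the broken links/stylesheets symptom is that relative URLs in the scraped markup resolve against your server instead of the original site; injecting a <base> element is a common, if partial, workaround:

```ts
import express from "express";
import puppeteer from "puppeteer";

const app = express();

// Hypothetical route: GET /render?url=https://example.com
app.get("/render", async (req, res) => {
  const target = String(req.query.url);
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(target, { waitUntil: "networkidle0" });

    // Make relative URLs (links, stylesheets, images) resolve against the
    // original site rather than this server.
    await page.evaluate(() => {
      const base = document.createElement("base");
      base.href = document.location.href;
      document.head.prepend(base);
    });

    const html = await page.content();
    res.write(html); // send the rendered markup back, as in the question
    res.end();
  } finally {
    await browser.close();
  }
});

app.listen(3000);
```

Even then, clicking a link in the proxied page navigates straight to the target site rather than back through your server, so fully reproducing a site this way takes considerably more machinery.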
I'm new to website dev, though I have done a lot of coding of other sorts in the past. I've just set up a personal blog for my own amusement on Bluehost using WordPress, and I have it installed locally for dev.
I have a training log on another site which anyone can open read-only to view training stats, using a URL of the form below; this works fine in a browser:
https://www.othersite.com/logs/1234xyz/authenticate?password=thisismypassword
What I have tried to do, unsuccessfully, is have this open in a frame on one of my web pages (I used iframe/object in HTML). It seems impossible to do this, as the authentication string is not passed across and the screen displayed prompts for manual input of the password. Can I open this automatically in some way?
If I understood correctly, your solution is insecure whether it works or not: this way, your visitors can see the password for your (or the other) site.
I suggest querying the external content with custom PHP code and displaying (printing) it on your page.
There are several ways to get the content of an external page:
https://www.php.net/manual/en/function.stream-context-create.php
https://www.php.net/manual/en/function.curl-init.php
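Those links cover the PHP side (stream contexts and cURL). Purely to illustrate the idea, here is the same server-side "fetch and print" pattern sketched in TypeScript; in WordPress you would do this with the PHP functions linked above, so that the password-bearing URL stays on the server:

```ts
// Sketch only: fetch the protected log on the server so the password never
// reaches the visitor's browser, then emit the returned markup into your page.
async function fetchTrainingLog(): Promise<string> {
  const secretUrl =
    "https://www.othersite.com/logs/1234xyz/authenticate?password=thisismypassword";
  const response = await fetch(secretUrl); // global fetch (Node 18+ assumed)
  if (!response.ok) {
    throw new Error(`Upstream returned ${response.status}`);
  }
  return response.text(); // print/echo this into your own page template
}
```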
If you need a tutorial on WP plugins, check this out:
https://www.wpbeginner.com/wp-tutorials/how-to-create-a-wordpress-plugin/
Recently I've been working on an idea that requires me to query Google Images and retrieve links to images matching a given search term. My most promising candidate for a usable Google Images API was the Google Web Search API, but it looks like it's going out of service as of tomorrow:
https://developers.google.com/web-search/docs/
The API that replaced it is the Google Custom Search API, but it's a little discouraging to use:
Google API Custom Search with Python - Programmatic Search Results
100 search results a day is a very strict limit; that's just four searches per hour. I also don't want to have to go through the hassle of creating some custom search bar that I'm never going to use except through Python.
I decided to turn to parsing HTML directly from the results page. This presents a problem, though, because nowhere inside the page's HTML is there any direct link to the image, only referrer URLs. This is true of both the JavaScript-enabled and JavaScript-disabled versions of Google Images (so even if Python spoofs JavaScript as enabled, it gets nothing). I'm not sure where to go from here. Could anyone refer me to some obscure, updated library that I've somehow overlooked, or give me some pointers?
You could use Selenium Webdriver to actually execute the JavaScript and click on the images in the thumbnail view. Once an image has been opened, the link is in the DOM and you can scrape it from there. All Webdriver does is open an actual browser and simulate a user. You can even run it as a headless browser if you use xvfbwrapper. The downside is that even then, you will need all the dependencies of the browser you are using installed on your server.
However, scraping Google is against their terms of service, and they will make an effort to block you as quickly as possible. So unless you get past the CAPTCHAs (which are tied to sessions), you probably won't be able to make a whole lot of searches before being blocked this way, either.
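Here is a rough sketch of that click-and-scrape flow with the selenium-webdriver bindings (TypeScript shown here; the Python bindings follow the same pattern). The URL format and selectors are guesses and will need adjusting to whatever markup Google currently serves:

```ts
import { Builder, By, until } from "selenium-webdriver";
import * as chrome from "selenium-webdriver/chrome";

async function firstImageLink(query: string): Promise<string | null> {
  const options = new chrome.Options().addArguments("--headless=new");
  const driver = await new Builder()
    .forBrowser("chrome")
    .setChromeOptions(options)
    .build();
  try {
    await driver.get(
      "https://www.google.com/search?tbm=isch&q=" + encodeURIComponent(query)
    );
    // Click the first thumbnail so the full-size image gets injected into the DOM.
    const thumb = await driver.wait(until.elementLocated(By.css("img")), 10000);
    await thumb.click();
    // Placeholder selector for the opened, full-size image.
    const full = await driver.wait(
      until.elementLocated(By.css("img[src^='http']")),
      10000
    );
    return await full.getAttribute("src");
  } finally {
    await driver.quit();
  }
}
```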
Say someone else has a website generated by JavaScript, so I can't go look at the source and read what should be on the screen. How can I grab the text on the screen so I can feed it into another program? Also, how can I write a program that automatically clicks on radio buttons, links, etc. that satisfy certain criteria?
You can write a web scraping tool in Perl or Python. Or, you can use existing tools and frameworks to achieve that.
Check out Scrapy, an open-source tool written in Python.
Take a look at Selenium too.
To parse dynamic content, you could look at the JavaScript source and get that same content the same way the webpage is getting it (i.e., by replicating AJAX calls and such).
If you want to submit data (without actually clicking on the elements) as if it had been clicked/edited/selected, you could also send a request containing the same data the server is expecting, using some HTTP library such as cURL (see the sketch below).
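As a sketch of that request-replay idea, using Node's built-in fetch rather than cURL (the endpoint and field names are invented; copy the real ones from your browser's network tab):

```ts
// Send the same POST the page's own form or AJAX call would send, instead of
// driving a browser to click the radio button. Endpoint/fields are hypothetical.
async function submitChoice(): Promise<string> {
  const response = await fetch("https://example.com/vote", {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({ option: "choice-3" }).toString(),
  });
  return response.text();
}
```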
If you need to handle content generated by script, then your first problem is to cause the script to execute. Further, the script will want to generate the content into a DOM. That means you need to have a DOM, and a script engine, and probably HTTP access to the Internet, and XML handling, etc.
If that sounds a lot like a web browser, then you're listening.
What you basically need is a web browser that you can control from a program. You'll need to be able to tell it to browse to a page, click buttons and links, etc., then you'll need to read back the resulting DOM.
Only then will you need to parse the page.
If you're in the Microsoft world, then you can use the WebBrowser control. There are several forms of this, and they all amount to the same thing: you can have Internet Explorer run inside of your program, and your program can control it.
I understand there are other browsers that can be controlled from a program, but since I don't know their details, I'll wait for someone else to tell us both.
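For what it's worth, libraries such as Puppeteer give you that same browse/click/read-back loop over Chromium from a program; a minimal sketch (URL and selectors are placeholders):

```ts
import puppeteer from "puppeteer";

// Browse to a page, click something, then read back the resulting DOM.
async function run(): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto("https://example.com/", { waitUntil: "networkidle0" });

  await page.click("a#some-link");       // "click buttons and links, etc."
  await page.waitForSelector("#result"); // wait for whatever the click produces

  // Only then parse the page - here, the text content of one element.
  const text = await page.$eval("#result", (el) => el.textContent ?? "");
  console.log(text);

  await browser.close();
}

run().catch(console.error);
```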
There are client-side solutions for nasty adware and its recursive links, but is it possible to use a script in the HTML to prevent the links from displaying in the browser of a user who has adware on their machine and is visiting my web site?
I am NOT a programmer. I am a designer, and I know just enough to create problems that send me to forums like this.
I doubt it. Malware like that injects links and creates popups by manipulating the internals of the browser.