What's better to scrape a website: Selenium or requests? - python-3.x

I have a URL to scrape and I'm wondering what the best method is.
With Selenium, for example:
from selenium import webdriver
executable_path = "....\\chromedriver"
browser = webdriver.Chrome(executable_path=executable_path)
url = "xxxxxxxxxx"
browser.get(url)
timeout = 20
# find_elements_by_css_selector returns a list of Selenium WebElement objects.
titles_element = browser.find_elements_by_css_selector('[data-test-id="xxxx"]')
This method launches the Chrome browser. On Windows I have to install both the Chrome browser and a ChromeDriver of the same version. But what about a Linux server: installing ChromeDriver is no problem, but is it a problem to install the Chrome browser on a server without a graphical interface?
Would you suggest using the requests module rather than Selenium, given that my URL is already built?
Is the risk of being detected by the website greater with Selenium or with requests?
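For what it's worth, Chrome can run on a Linux server without a graphical interface by using headless mode. A minimal sketch, assuming a Chrome/ChromeDriver pair recent enough to support it (the driver path and URL are placeholders):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")               # run Chrome without a display
options.add_argument("--no-sandbox")             # often needed when running as root in containers
options.add_argument("--disable-dev-shm-usage")  # work around a small /dev/shm on some servers
browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver", options=options)
browser.get("xxxxxxxxxx")                        # placeholder URL, as in the question
print(browser.title)
browser.quit()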

If you have just one URL to scrape, Selenium is better because it's easier to code than requests.
For example: if you need to scroll down to make your data appear, that will be harder to do without a browser.
If you want to do intensive scraping, you should probably try requests with BeautifulSoup; it will use far fewer resources on your server.
You can also use Scrapy; it makes it very easy to spoof the user agent, which makes your bot harder to detect.
If you scrape responsibly, with a delay between two requests, you should not be detected with either method. You can check the robots.txt file to be safe.
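As an illustration of the requests + BeautifulSoup route with a spoofed user agent and a delay between requests, here is a minimal sketch (the URL list, User-Agent string, and CSS selector are placeholders taken from the question):

import time
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}  # spoofed UA (placeholder)
urls = ["xxxxxxxxxx"]  # the already-built URLs (placeholders)

for url in urls:
    response = requests.get(url, headers=headers, timeout=20)
    soup = BeautifulSoup(response.text, "html.parser")
    titles = soup.select('[data-test-id="xxxx"]')  # same CSS selector as in the Selenium snippet
    print(url, response.status_code, len(titles))
    time.sleep(5)  # pause between two requests to scrape responsibly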

Related

When you launch the Chrome browser using the Chrome DevTools Protocol, where does that session store things like history, cookies, and added extensions?

I realized today that you can merge Chrome DevTools Protocol with Selenium in order to automate some very specific parts of a process within a website.
For instance: after some initial conditions have been met, automating the process of uploading some files to an account, etc.
According to the official repository, you run a command like the following in cmd to create a new Chrome session with your user data:
chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\Users\ResetStoreX\AppData\Local\Google\Chrome\User Data"
So in my case, the command above opens a new Chrome session with my existing user data.
The thing is, in my original session I had some Chrome extensions added, and I know that if I were to work only with Selenium using its chromedriver.exe, I could easily add an extension (which must be packed as a .crx file) by using the following code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opt = Options()  # the variable that will store the Selenium options
opt.add_extension(fr'{extension_path}')  # path to the packed .crx extension
driver = webdriver.Chrome(options=opt)
But it seems that Chrome DevTools Protocol can't take as many options as Selenium, so I would have to install all my extensions again in this pseudo-session of mine, which is no problem.
But after installing those extensions, will they stay installed and ready for use the next time I run chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\Users\ResetStoreX\AppData\Local\Google\Chrome\User Data", and if so, where?
Or if not, does that mean I would have to reinstall them every single time I need to test with Chrome DevTools Protocol and my Chrome extensions? Thanks in advance.
Can confirm: a session opened with Chrome DevTools Protocol permanently stores the extensions you reinstalled. It also remembers any specific credentials you used for logging in to some sites.
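For reference, Selenium's chromedriver can attach to a Chrome instance started with --remote-debugging-port via the debuggerAddress experimental option, so the existing session (and its installed extensions) is reused. A minimal sketch, assuming Chrome is already running with the command above and that chromedriver is on the PATH:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opt = Options()
opt.add_experimental_option("debuggerAddress", "127.0.0.1:9222")  # attach to the already-running session
driver = webdriver.Chrome(options=opt)
print(driver.title)  # controls the existing window, extensions and all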

How to fetch the fired tags on a webpage using Python and Selenium dynamically

I have a website on which Google Analytics code fires (through Google Tag Manager). The site has a lot of pages and I want to check whether the Google Analytics code fires on all of them. One way would be to open the URL, open the GA debugger, and check the pageview firing in the console. Since there are a lot of URLs that need to be checked, is there a way to automate this process (preferably with Python)?
What I've tried so far: I've managed to fetch the source code of the pages and then regex my way to specific code snippets (of GA and GTM). You can find the code below. But the problem is that this fetches only the static code; any pixels/tags that fire after the page actually loads will not be captured.
from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get("url")  # placeholder URL
html1 = driver.page_source  # static source as served
html2 = driver.execute_script("return document.documentElement.innerHTML;")  # DOM after JavaScript has run
print(html2)
I also tried using BS4 and requests, but nothing useful came of it.
I am using BrowserMob Proxy with a Selenium driver to capture all HTTP requests and responses sent while a test is running, and then I loop through each one checking for a request URL that contains 'google-analytics'. I then parse that request to check that the event values match what I am expecting.
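A rough sketch of that BrowserMob Proxy setup (assuming the browsermob-proxy Python package and a local copy of the BrowserMob Proxy binary; the binary path, driver path, and URL are placeholders):

from browsermobproxy import Server
from selenium import webdriver

server = Server("path/to/browsermob-proxy")  # placeholder path to the BrowserMob Proxy binary
server.start()
proxy = server.create_proxy()

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server={}".format(proxy.proxy))  # route browser traffic through the proxy
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=options)

proxy.new_har("ga_check")  # start recording a HAR
driver.get("url")          # placeholder URL, as in the question

# Collect every captured request whose URL contains 'google-analytics'
ga_hits = [entry["request"]["url"]
           for entry in proxy.har["log"]["entries"]
           if "google-analytics" in entry["request"]["url"]]
print(ga_hits)

driver.quit()
server.stop()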

cURL returns a response, but unable to get the same response when accessed using the Selenium Firefox webdriver

I am trying to extract some data from a website from a Linux server using Selenium.
When the URL is requested through cURL, we are able to see the response from the server, but when the same URL is requested through the Selenium Firefox webdriver, we don't receive any response for hours.
For example, one of the links we are trying to reach is the following:
http://www.vudu.com/movies/#!content/776990
Can you point out a possible issue?
Is it common for websites to react this way?
What might be the way to overcome this issue?
Thanks in advance for the help.
NOTE: The websites we are trying to reach are already whitelisted on the server.
The server might be responding differently based upon your User-Agent string. Many extensions allow you to set this, including "User-Agent Switcher for Chrome".
You might also experiment with the Selenium chromedriver and look to see exactly what the request is inside the browser. You can find this in the "Network" tab of the developer tools.
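For example, the User-Agent of the Selenium Firefox webdriver mentioned in the question could be overridden via a profile preference (a sketch, assuming a Selenium version that still supports FirefoxProfile; the User-Agent string is a placeholder):

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override",
                       "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0")
driver = webdriver.Firefox(firefox_profile=profile)
driver.get("http://www.vudu.com/movies/#!content/776990")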

Detect broken SSL or insecure content warning with Selenium, BrowserStack, & Node.js

I'm trying to set up some automated testing using BrowserStack's Selenium and their Node.js driver. I want to check whether the page shows any insecure-content warnings when accessing the URL via HTTPS.
Is there a way to detect that in Selenium? If one browser makes it easier than another, that's fine.
Here are a few different ways to detect this using Selenium and other tools:
- iterate through all links and ensure they all start with https:// (though via Selenium, this won't detect complex loaded content, XHR, JSONP, and interframe RPC requests)
- automate running the page through Why No Padlock?, which may not do more than the above method
- use Sikuli to take a screenshot of the region of the browser address bar showing the green padlock (in the case of Chrome) and fail if it is not present (caveat of using this in parallel testing mentioned here)
There is also mention here of the Content Security Policy in browsers, which will prevent the loading of any non-secure objects and perform a callback to an external URL when encountered.
UPDATE:
These proposed solutions are intended to detect any non-secure objects being loaded on the page, which should be the best practice for asserting that the content is secure. However, if you literally need to detect whether the specific browser's insecure-content warning message is being displayed (i.e., testing the browser rather than your website), then using Sikuli to match either the visible warning messages or the absence of your page's content could do the job.
Firefox creates a log entry each time it runs into mixed content, so you can check the logs from Selenium. Example:
driver = webdriver.Firefox()
driver.get("https://googlesamples.github.io/web-fundamentals/fundamentals/security/prevent-mixed-content/simple-example.html")
browser_logs = driver.get_log("browser")
and, in browser_logs, look for entries like:
{u'timestamp': 1483366797638, u'message': u'Blocked loading mixed active content "http://googlesamples.github.io/web-fundamentals/samples/discovery-and-distribution/avoid-mixed-content/simple-example.js"', u'type': u'', u'level': u'INFO'}
{u'timestamp': 1483366797644, u'message': u'Blocked loading mixed active content "http://googlesamples.github.io/web-fundamentals/samples/discovery-and-distribution/avoid-mixed-content/simple-example.js"', u'type': u'', u'level': u'INFO'}
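To turn that into an automated check, the log entries can be filtered for the mixed-content message (a small sketch based on the log format above):

# Fail if any mixed-content entries show up in the browser log.
mixed_content = [entry for entry in browser_logs
                 if "Blocked loading mixed" in entry.get("message", "")]
assert not mixed_content, "Insecure (mixed) content detected: {}".format(mixed_content)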

How to control web browser using some programming language?

I am looking for a way to control a web browser such as Firefox or Chrome. I need something like Selenium WebDriver, but one that will allow me to open many instances, load URLs, and get HTTP headers, response codes, response content, load time, etc.
Is there any library, framework, or API that I could use to do this? I couldn't find one that does all of that; Selenium opens a browser and goes to a URL, but I can't get the HTTP headers.
Selenium and Jellyfish are strong options in general. Jellyfish is an option that uses Node.js - although I have no experience with it, I've heard good things from my colleagues.
If you just want to get headers and such, you could use the cURL library or wget. I've used cURL with NuSOAP to query XML web services in PHP, for example. The downside is that these are not functional browsers, and merely perform the HTTP requests and consume the response.
http://seleniumhq.org/
https://github.com/admc/jellyfish
http://curl.haxx.se/
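As an aside, if Python is acceptable, the requests library covers the headers/response-code/load-time part without driving a browser (a minimal sketch; the URL is a placeholder):

import requests

response = requests.get("http://example.com")  # placeholder URL
print(response.status_code)                    # HTTP response code
print(response.headers)                        # HTTP response headers
print(response.elapsed.total_seconds())        # rough load time for the request
print(len(response.content))                   # size of the response body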
