Python: Is it possible to download ENTIRE web page in PhantomJS


I have used PhantomJS for scraping. Is it possible to download all the contents of a URL (including images, CSS and JS) and save them locally for offline browsing?

# -*- coding: utf-8 -*-
from selenium import webdriver #for cookies collections after all AJAX/JS being executed
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36")
driver = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any', '--web-security=false'])
driver.set_window_size(1366,768)
driver.get('http://stackoverflow.com')
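html = driver.page_source  # rendered HTML after AJAX/JS has run
This is the complete code using Python Selenium + PhantomJS. At the end, driver.page_source holds the rendered HTML of the page (the markup only, not the image, CSS or JS files it references).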
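Since the question also asks for images, CSS and JS, the page source is only the starting point. The following is a minimal sketch, using only the standard library, of how the asset URLs could be pulled out of that HTML. The names AssetCollector and collect_asset_urls and the example.com URLs are made up for illustration; the sketch only looks at img, script and stylesheet link tags, so assets referenced from inside CSS or loaded later by scripts would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = None
        if tag in ("img", "script"):
            url = attrs.get("src")
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            url = attrs.get("href")
        if url:
            # Resolve relative references against the page URL.
            absolute = urljoin(self.base_url, url)
            if absolute not in self.assets:
                self.assets.append(absolute)


def collect_asset_urls(html, base_url):
    """Return absolute asset URLs in document order, without duplicates."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets


# Hypothetical sample markup standing in for driver.page_source.
sample = (
    '<html><head><link rel="stylesheet" href="/static/site.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="https://cdn.example.com/logo.png"></body></html>'
)
print(collect_asset_urls(sample, "https://example.com/docs/"))
```

With a live driver, the call would presumably be collect_asset_urls(driver.page_source, driver.current_url), and each returned URL could then be saved with urllib.request.urlretrieve.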

We can use the evaluate() function to get content out of the page. I use this in a script run by PhantomJS:
var webPage = require('webpage');
var page = webPage.create();
page.open('http://google.com', function(status) {
    var title = page.evaluate(function() {
        return document.title;
    });
    console.log(title);
    phantom.exit();
});

If wget is installed, this task is rather easy:
from subprocess import call
domain = "www.google.de"
# -m mirrors the site recursively, -k converts links for local browsing
call(["wget", "-mk", domain])
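Note that -m mirrors the whole site, and that wget does not execute JavaScript, so it saves the HTML as served rather than as rendered by PhantomJS. If only one page and the files it needs are wanted, a variant along these lines might be closer to the question. This is a sketch, not a tested recipe: build_wget_command, save_page and the saved_page directory are names chosen here for illustration, while the wget flags themselves are standard ones.

```python
import shutil
import subprocess


def build_wget_command(url, target_dir="saved_page"):
    """Build a wget command line that saves one page plus its requisites."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS and JS the page needs
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow requisites hosted on other domains (CDNs)
        "--directory-prefix=" + target_dir,
        url,
    ]


def save_page(url, target_dir="saved_page"):
    """Run wget if it is installed; return its exit code, or None if missing."""
    if shutil.which("wget") is None:
        return None
    return subprocess.call(build_wget_command(url, target_dir))


# Only prints the command; call save_page(...) to actually download.
print(build_wget_command("https://www.google.de"))
```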

Related

Unable to set multiple Chrome options at the same time (blocking notifications and cookies) in Selenium and Python

The code below only blocks notifications:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from time import sleep
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver=webdriver.Chrome(executable_path="C:\\Users\\Desktop\\chromedriver.exe",chrome_options=chrome_options)
driver.maximize_window()
driver.get("https://www.hurriyet.com.tr/")
sleep(5)
Hello friends, I am unable to set multiple Chrome options (blocking notifications and cookies) at the same time. How can I block notifications and cookies together? Is there a solution? I think these two could be combined somehow, but I could not manage it:
"prefs", {"profile.default_content_settings.cookies": 2}
"prefs", {"profile.default_content_setting_values.notifications": 2}

Why not something like this? Both settings go into one prefs dictionary, because a second add_experimental_option("prefs", ...) call replaces the value set by the first:
from selenium import webdriver
executable_path = r"C:\Users\Selenium+Python\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 2,
    "profile.default_content_settings.cookies": 2,
})
options.add_argument("start-maximized")
driver = webdriver.Chrome(executable_path, options=options)

selenium: bypass access denied

I'm trying to navigate a website with Selenium, but I'm getting an error: Access Denied. You do not have permission to access "http://tokopedia.com/" on this server.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
CHROMEDRIVER_PATH = r'C:/chromedriver.exe'
tokopedia = "https://tokopedia.com/"
options = Options()
options.add_argument("--headless")
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=options)
driver.get(tokopedia)
print(driver.page_source)
How can I solve this? Thank you for the help.

Try the code below. It works for me. The difference from your code is that it sets an ordinary browser user agent:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
tokopedia = "https://tokopedia.com/"
options = Options()
options.add_argument("--headless")
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
options.add_argument('user-agent={0}'.format(user_agent))
driver = webdriver.Chrome(options=options)
driver.get(tokopedia)
print(driver.page_source)

Downloading files with selenium (python3) on ubuntu server 18.04

I wrote a simple script using user Fayçal's code from "Downloading with chrome headless and selenium". The script worked on my Mac, but when I ran it on the server nothing was downloaded.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_experimental_option("prefs", {
"download.default_directory": "/download/path/",
"download.prompt_for_download": False,
})
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': "/Download/path/"}}
command_result = driver.execute("send_command", params)
driver.set_page_load_timeout(10)
#navigate to advanced search
driver.get(loginUrl)
driver.find_element_by_name("login_username").send_keys("username")
driver.find_element_by_name("login_password").send_keys("password")
driver.find_element_by_name("login").click()
driver.get(targetUrl)
file = driver.find_element_by_xpath("/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr/td/table/tbody/tr[2]/td[6]/a")
file.click()
The script runs without returning any errors, but the target path remains empty.
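One way to narrow this down, offered as a diagnostic sketch rather than a confirmed fix, is to keep the script alive after the click and poll the download directory until a finished file shows up. The helper name wait_for_download is made up here, and the logic assumes Chrome keeps a ".crdownload" suffix on files that are still being written. It may also be worth checking that the two paths in the script match exactly: as posted they differ in case ("/download/path/" and "/Download/path/"), which matters on a case-sensitive Linux file system if the real paths differ the same way.

```python
import os
import time


def wait_for_download(directory, timeout=30.0, poll_interval=0.5):
    """Return the names of finished files in `directory`, or [] on timeout.

    Assumption: Chrome keeps the ".crdownload" suffix on a file while it is
    still being written, so such files are treated as unfinished.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory) if os.path.isdir(directory) else []
        finished = sorted(n for n in names if not n.endswith(".crdownload"))
        pending = [n for n in names if n.endswith(".crdownload")]
        if finished and not pending:
            return finished
        if time.monotonic() >= deadline:
            return []
        time.sleep(poll_interval)
```

In the script above this would be called right after file.click(), for example wait_for_download("/download/path/"), before the driver is shut down.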

Setting up tor with selenium web driver. (Windows)

I have tried to set up Tor with Selenium, but it continuously throws exceptions.
I have tried setting the binary as well as profiles, but no luck.
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
import os
torexe = os.popen(r'C:\Users\Jawad Ahmad Khan\Desktop\Tor Browser\Browser\firefox.exe')
profile = FirefoxProfile(r'C:\Users\Jawad Ahmad Khan\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default')
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
driver = webdriver.Firefox(firefox_profile= profile,
executable_path=r'D:\geckodriver\geckodriver.exe')
driver.get("http://check.torproject.org")
This is the error message:
selenium.common.exceptions.WebDriverException: Message: Reached error page: about:neterror?e=proxyConnectFailure&u=https%3A//check.torproject.org/&c=UTF-8&f=regular&d=Firefox%20is%20configured%20to%20use%20a%20proxy%20server%20that%20is%20refusing%20connections.
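The error says the configured proxy is refusing connections, which usually means nothing is listening on 127.0.0.1:9050 when Firefox starts. A standalone tor service uses SOCKS port 9050 by default, while the tor process bundled with Tor Browser uses 9150. The sketch below checks both ports before starting the driver; port_is_open is a helper name invented for this example.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 9050 is the default SOCKS port of a standalone tor service,
# 9150 is the default used by the tor process bundled with Tor Browser.
for candidate in (9050, 9150):
    state = "open" if port_is_open("127.0.0.1", candidate) else "closed"
    print(f"127.0.0.1:{candidate} is {state}")
```

If only 9150 turns out to be open, changing network.proxy.socks_port to that value would be the first thing to try. If neither is open, Tor is probably not running yet.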

This works on my Mac with Chrome with Tor.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def get_chrome_webdriver():
    tor_proxy = "127.0.0.1:9150"
    chrome_options = Options()
    chrome_options.add_argument("--test-type")
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('disable-infobars')
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument('--proxy-server=socks5://%s' % tor_proxy)
    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
    return driver
def get_chrome_browser(url):
    browser = get_chrome_webdriver()
    browser.get(url)
    return browser
get_chrome_browser('https://check.torproject.org/')

web.Whatsapp headlessly using phantomjs

I am using PhantomJS to start a web session on web.whatsapp.com, with Chrome's user agent because WhatsApp does not support PhantomJS as a user agent.
The code is as follows:
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36';
page.viewportSize = {
    width: 1200,
    height: 800
};
page.open('https://web.whatsapp.com/', function() {
    page.render('home.png');
    phantom.exit();
});
But the output is a blank white screen with a dot in the center.
Is there a bug in my code, or is it a compatibility issue?

PhantomJS is not waiting for the page to load completely; what you see is the loading-page icon.
Try this code, which waits explicitly for an element to appear:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
user_agent = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
)
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap, executable_path=r'/bin/phantomjs')
driver.get('http://web.whatsapp.com')
timeout = 30  # seconds to wait for the QR code element
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'qrcode'))
    WebDriverWait(driver, timeout).until(element_present)
except TimeoutException:
    print("Timed out waiting for page to load")
Note: WhatsApp needs a browser that supports cryptoSha256 and cryptoAesCbc for proper crypto handling, and PhantomJS does not support them.
