urllib3 returns 404 Not Found for an existing website - python-3.x

Different results from urllib and urllib3
I can open the web page by pasting the address into Chrome, and urllib also returns the page's source code. I just do not understand why urllib3 returns 404 Not Found for this page when everything else works.
Below is the original code:
url = 'http://www.webmd.com/drugs/2/condition-12862/depression%20associated%20with%20bipolar%20disorder'
import urllib3
http = urllib3.PoolManager()
r = http.request('GET',url)
r.data
import urllib.request
req = urllib.request.Request(url=url)
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))

My guess: you are calling from behind a proxy. urllib uses the system proxy (on Linux, the http_proxy environment variable), while for urllib3 you need to configure the proxy explicitly through the urllib3 library.
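If a proxy really is the cause, urllib3 can be pointed at it explicitly with a ProxyManager. A minimal sketch, with a placeholder proxy address:

```python
import urllib3

# Placeholder proxy URL - substitute your real proxy here.
proxy = urllib3.ProxyManager("http://proxy.example.com:8080")

# Requests made through this manager are routed via the proxy:
# r = proxy.request("GET", url)

# The manager keeps the parsed proxy URL:
print(proxy.proxy.host, proxy.proxy.port)  # proxy.example.com 8080
```

Another common reason a site returns 404/403 to scripts but not to a browser is a missing User-Agent header; urllib3 lets you pass one per request with headers={"User-Agent": "Mozilla/5.0"}.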

Related

Parse a site with DDoS guard

I have read piles of information about using Selenium and chromedriver. Nothing helped.
Then I tried undetected_chromedriver:
import undetected_chromedriver as uc
url = "<url>"
driver = uc.Chrome()
driver.get(url)
driver.quit()
However, I get this error:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>
Guides on the web for avoiding this error didn't help.
Is there perhaps a way to make the code wait 5 seconds while the browser check is in progress?
Well, I used Grab methods instead of requests, and now it works. I think it has a bypass method.
Grab documentation: https://grab.readthedocs.io/en/latest/
You will need to install the beautifulsoup4 and requests libraries:
pip install beautifulsoup4
pip install requests
After that, try this code:
from bs4 import BeautifulSoup
import requests
html = requests.get("your url here").text
soup = BeautifulSoup(html, 'html.parser')
print(soup)
#use this to try to find elements:
#find_text = soup.find('pre', {'class': 'brush: python; title: ; notranslate'}).get_text()
Here is BeautifulSoup's documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
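Back to the CERTIFICATE_VERIFY_FAILED error above: the traceback comes from urllib, which undetected_chromedriver appears to use internally to download the driver, so the Python process likely cannot find a local CA bundle. As a diagnostic-only sketch (this disables HTTPS certificate verification for urllib in the whole process, so treat it as a last resort), you can swap in an unverified SSL context before importing the driver:

```python
import ssl

# WARNING: disables certificate verification for urllib in this process.
# Use only to confirm the CA bundle is the problem, not in production.
ssl._create_default_https_context = ssl._create_unverified_context

# import undetected_chromedriver as uc   # the driver download should now succeed
```

A cleaner fix is to repair the CA bundle itself, for example on macOS by running the "Install Certificates.command" script that ships with the python.org installer.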

python3 urllib can't download a URL containing umlauts ("ä, ö, ü")

I have written a Python 3 script which downloads a URL. However, it does not work if there is an umlaut in the URL (in this case "ü"). The URL also does not work if I write "ue" instead. How can I encode the URL as UTF-8?
import urllib.request
url = "https://www.corona-in-zahlen.de/landkreise/sk%20würzburg/"
urllib.request.urlretrieve(url, "webpage.txt")
Your example works if you replace the ü with a regular u:
import urllib.request
url = "https://www.corona-in-zahlen.de/landkreise/sk%20wurzburg/"
urllib.request.urlretrieve(url, "webpage.txt")
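Swapping the ü for a plain u changes the address, though, so if the umlaut must be kept, the standard library can percent-encode the path as UTF-8 for you. A minimal sketch using urllib.parse.quote:

```python
from urllib.parse import quote

path = "sk würzburg"
encoded = quote(path)   # percent-encodes the space and the UTF-8 bytes of "ü"
print(encoded)          # sk%20w%C3%BCrzburg

url = "https://www.corona-in-zahlen.de/landkreise/" + encoded + "/"
# urllib.request.urlretrieve(url, "webpage.txt")   # network call, left commented out
```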

Why is Selenium overriding my firefox profile

I am writing a Python Selenium script to automate some Google searches with Firefox.
I am using python 3.7 on Windows 10 64b.
Something weird happened. When I run the Python script, it's fine.
But when I compile it with Nuitka and run the exe, Firefox opens with a proxy added (127.0.0.1:53445).
So I added this line:
profile.set_preference("network.proxy.type", 0)
Again, the script runs fine, but the compiled exe still opens Firefox with a proxy.
This is a pain, as the 127.0.0.1 proxy breaks access to Google and my program with it.
Has anyone already faced this weird behaviour of Selenium?
Without seeing the webdriver-related code you are using, I am mostly guessing here. I would suggest rotating the proxy as well.
import requests
from bs4 import BeautifulSoup
from random import choice
from selenium import webdriver
firefox_capabilities = webdriver.DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True
def proxy_generator():
    # REQUIRED - pip install html5lib
    response = requests.get("https://sslproxies.org/")
    soup = BeautifulSoup(response.content, 'html5lib')
    # The IP sits in every 8th <td>, its port in the <td> right after it.
    ips = [td.text for td in soup.findAll('td')[::8]]
    ports = [td.text for td in soup.findAll('td')[1::8]]
    return choice([ip + ':' + port for ip, port in zip(ips, ports)])
PROXY = proxy_generator()  # e.g. "58.216.202.149:8118"
firefox_capabilities['proxy'] = {
    "proxyType": "MANUAL",
    "httpProxy": PROXY,
    "ftpProxy": PROXY,
    "sslProxy": PROXY
}
driver = webdriver.Firefox(capabilities=firefox_capabilities)

Download files with Python - "unknown url type"

I need to download a list of RTF files locally with Python3.
I tried with urllib
import urllib
url = "www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"
urllib.request.urlopen(url)
but I get a ValueError
ValueError: unknown url type: 'www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf'
How to deal with this kind of file format?
Try adding http:// in front of the URL:
import urllib
url = "http://www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"
urllib.request.urlopen(url)
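For a whole list of URLs where some entries may lack a scheme, one way (a sketch, using only the standard library) is to check the scheme field that urllib.parse.urlparse reports and prepend http:// when it is empty:

```python
from urllib.parse import urlparse

def ensure_scheme(url, default="http"):
    """Prepend a scheme if the URL has none, so urlopen accepts it."""
    if urlparse(url).scheme == "":
        return default + "://" + url
    return url

print(ensure_scheme("www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"))
# -> http://www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf
print(ensure_scheme("https://example.com/a.rtf"))  # already has a scheme, unchanged
```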

proxy server refusing connections when trying to run tor browser with selenium (using TorBrowserDriver not profile and binary) [duplicate]

I am trying to connect to the Tor browser but get a "proxyConnectFailure" error. I have made multiple attempts to get the basics of the Tor browser connected, but all in vain. If anyone could help, life could be saved big time:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
binary = FirefoxBinary(r"C:\Users\Admin\Desktop\Tor Browser\Browser\firefox.exe")
profile = FirefoxProfile(r"C:\Users\Admin\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default")
# Configured profile settings.
proxyIP = "127.0.0.1"
proxyPort = 9150
proxy_settings = {"network.proxy.type": 1,
                  "network.proxy.socks": proxyIP,
                  "network.proxy.socks_port": proxyPort,
                  "network.proxy.socks_remote_dns": True,
                  }
driver = webdriver.Firefox(firefox_binary=binary, proxy=proxy_settings)
def interactWithSite(driver):
    driver.get("https://www.google.com")
    driver.save_screenshot("screenshot.png")
interactWithSite(driver)
To connect to a Tor Browser through a FirefoxProfile you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
import os
torexe = os.popen(r'C:\Users\AtechM_03\Desktop\Tor Browser\Browser\TorBrowser\Tor\tor.exe')
profile = FirefoxProfile(r'C:\Users\AtechM_03\Desktop\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default')
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
driver = webdriver.Firefox(firefox_profile=profile, executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get("http://check.torproject.org")
Browser Snapshot:
You can find a relevant discussion in How to use Tor with Chrome browser through Selenium
I would like to expand on @DebanjanB's answer by adding the Linux counterpart:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
import os
torexe = os.popen('some/path/tor-browser_en-US/Browser/start-tor-browser')
# in my case, I installed it under a folder tor-browser_en-US after
# downloading and extracting it from
# https://www.torproject.org/download/ for linux
profile = FirefoxProfile(
    'some/path/tor-browser_en-US/Browser/TorBrowser/Data/Browser/profile.default')
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
firefox_options = webdriver.FirefoxOptions()
firefox_options.binary_location = '/usr/bin/firefox'
# /usr/bin/firefox is default location of firefox - for me anyway
driver = webdriver.Firefox(
    firefox_profile=profile, options=firefox_options,
    executable_path='wherever/you/installed/geckodriver')
# I keep my geckodriver(s) in a special folder sorted by versions.
# Geckodriver downloadable here:
# https://github.com/mozilla/geckodriver/releases/
driver.get("http://check.torproject.org")
The verified answer does not work for opening .onion sites (I believe that has something to do with the Tor network not allowing access from a normal Firefox).
As for the latest Tor Browser (from the Tor Browser Bundle), starting it via Selenium causes an error that prevents the browser from starting its own Tor proxy, leading to proxy and timeout errors (regardless of whether a Tor proxy was started by Python, started manually, or not started at all). This could also be because port 9050 or 9150 is already in use by a running Tor proxy and therefore unavailable to the browser's own Tor instance, but that does not explain the error seen when no Tor proxy is running at all.
The solution I have found is to start the Tor proxy as normal, manually or via os.popen("tor.exe"), and to configure the Tor Browser not to start its own Tor proxy.
Here's the code:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import os
# Start the tor proxy as normal before launching the browser.
os.popen(r'e:\bla\bla\bla\tor\Tor\tor.exe')
binary = FirefoxBinary(r'e:\bla\bla\bla\Tor Browser\Browser\firefox.exe')
fp = FirefoxProfile(r'e:\foo\bar\bla\Tor Browser\Browser\TorBrowser\Data\Browser\profile.default')
fp.set_preference('extensions.torlauncher.start_tor', False)  # note this
fp.set_preference('network.proxy.type', 1)
fp.set_preference('network.proxy.socks', '127.0.0.1')
fp.set_preference('network.proxy.socks_port', 9050)
fp.set_preference("network.proxy.socks_remote_dns", True)
fp.update_preferences()
driver = webdriver.Firefox(firefox_profile=fp, firefox_binary=binary)
driver.get("http://check.torproject.org")
driver.get('https://www.bbcnewsv2vjtpsuy.onion/')
*Note: fp.set_preference('extensions.torlauncher.start_tor', False) configures Tor Browser not to start its own Tor instance, so that it uses the proxy configuration and the Tor instance started above.
Lo and behold, the Tor Browser Bundle starts working like a normal automated Firefox browser.
