I was trying to access a website and read its content using urlopen from urllib.request, but I got a 403 Forbidden error.
When I open the same link in a web browser, it loads fine. It seems the website applies some kind of security measure, probably to prevent malicious or automated access.
What mechanisms let a site keep its content accessible via a web browser while blocking access from a script like the one I am running?
{code}
>>> from urllib.request import urlopen
>>> html= urlopen("http://www.english-for-students.com/A-Wise-Counting.html")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>>
{code}
One possible approach is to send a User-Agent header that identifies the request as coming from a regular browser. More information about the header itself: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
Check the section titled "Headers" in the urllib documentation; from that page:
Some websites dislike being browsed by programs, or send different versions to different browsers. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object you can pass a dictionary of headers in.
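For example, a minimal sketch of that approach against the URL from the question (the User-Agent value below is just an illustrative browser string, not anything the site specifically requires):
{code}
from urllib.request import Request, urlopen

url = "http://www.english-for-students.com/A-Wise-Counting.html"
# Pretend to be a regular browser; the exact string is only an example.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
req = Request(url, headers=headers)
html = urlopen(req).read()
{code}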
Related
I can't use extensions with the UndetectedChromedriver PyPI package (Python). With normal Selenium the extensions work, but not with this package. I tried installing extensions directly from the Web Store, but the Chrome Web Store prompt is not an alert that Selenium can handle; it is a window-level event, so you need AutoIT, PyAutoGUI, or similar to deal with it.
The only thing that works is loading profiles, but I am running many windows in parallel, and I would need to create hundreds of windows and then delete them. I also can't clone profiles, because UndetectedChromedriver then stops working; I have to create each one manually.
Finally, I tried the Google Chrome Enterprise Bundle and used the extensions policy to force-install the extension for all profiles. That works, but with the policy enabled Selenium no longer works properly.
The error traceback log is:
Exception in thread Thread-2:
Traceback (most recent call last):
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\threading.py", line 950, in _bootstrap_inner
self.run()
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\threading.py", line 888, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\andre\OneDrive\Documentos\(A1)_Inicio\(A2)_CyberEspacio\LAB\(A1)_Programador123\(A1)_Programming_(Section)\VSCode Snippets\python\selenium\app.py", line 72, in test
seleniumCaptchaSolver.reCaptchaServiceLogin(apiKey='MYAPIKEY', solverType = SeleniumCaptchaSolverType().Capmonster)
File "C:\Users\andre\OneDrive\Documentos\(A1)_Inicio\(A2)_CyberEspacio\LAB\(A1)_Programador123\(A1)_Programming_(Section)\VSCode Snippets\python\selenium\modules\seleniumCaptchaSolver.py", line 103, in reCaptchaServiceLogin
self.__driver.get('chrome-extension://pabjfbciaedomjjfelfafejkppknjleh/popup.html')
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\site-packages\undetected_chromedriver\__init__.py", line 535, in get
return super().get(url)
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 447, in get
self.execute(Command.GET, {'url': url})
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 435, in execute
self.error_handler.check_response(response)
File "C:\Users\andre\anaconda3\envs\selenium-env\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 247, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot determine loading status
from disconnected: received Inspector.detached event
(Session info: chrome=103.0.5060.134)
This happens only when chrome-extension://pabjfbciaedomjjfelfafejkppknjleh/popup.html is opened to log in (send the API key). Normally I can log in fine, but with the policy activated I can't, because of this error.
Does anyone know how to fix this, or how to properly use extensions with UndetectedChromedriver?
Note: this error only occurs when I load the chrome-extension://pabjfbciaedomjjfelfafejkppknjleh/popup.html link; other links work.
I found this solution:
import undetected_chromedriver as uc
import os

working_dir = os.getcwd()
# I'm using a proxy extension
proxy_plugin = f'{working_dir}/proxy_plugin'
options = uc.ChromeOptions()
options.add_argument(f'--load-extension={proxy_plugin}')
# {proxy_plugin} is the path to the unpacked extension folder. I tried importing a
# .zip file and that didn't work; importing a .crx file might be worth trying.
# I also enable extensions.ui.developer_mode via the prefs below.
options.add_experimental_option('prefs', {'extensions.ui.developer_mode': True})
driver = uc.Chrome(options=options)
I also came across these pages:
Load unpacked Chrome extension programmatically
Installing extension into V2
I have a Python script, main.py, and to run it daily via crontab I created the following file (I believe it's called a shell script):
#!/bin/sh
source /Users/PathToProject/venv/bin/activate
python /Users/PathToProject/main.py
It has run daily for quite a while without any problems.
Recently I added a feature to main.py that afterwards saves a .CSV file with some results to my Google Drive via PyDrive2. When I run the new script from the command line, it completes successfully every time, without any errors.
I assumed the cron job would keep working as well, but now I get the traceback below.
/Users/PathToProject/venv/lib/python3.8/site-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access mycreds.json: No such file or directory
warnings.warn(_MISSING_FILE_MESSAGE.format(filename))
Traceback (most recent call last):
File "/Users/PathToProject/venv/lib/python3.8/site-packages/oauth2client/clientsecrets.py", line 121, in _loadfile
with open(filename, 'r') as fp:
FileNotFoundError: [Errno 2] No such file or directory: 'client_secrets.json'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/PathToProject/venv/lib/python3.8/site-packages/pydrive2/auth.py", line 431, in LoadClientConfigFile
client_type, client_info = clientsecrets.loadfile(
File "/Users/PathToProject/venv/lib/python3.8/site-packages/oauth2client/clientsecrets.py", line 165, in loadfile
return _loadfile(filename)
File "/Users/PathToProject/venv/lib/python3.8/site-packages/oauth2client/clientsecrets.py", line 124, in _loadfile
raise InvalidClientSecretsError('Error opening file', exc.filename,
oauth2client.clientsecrets.InvalidClientSecretsError: ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/PathToProject/main.py", line 5, in <module>
main()
File "/Users/PathToProject/version2.py", line 20, in main
PYD.download_file(data_file)
File "/Users/PathToProject/PyDrive_Modul.py", line 58, in download_file
file_ID = get_ID_of_title(filename)
File "/Users/PathToProject/PyDrive_Modul.py", line 47, in get_ID_of_title
drive = google_drive_auth()
File "/Users/PathToProject/PyDrive_Modul.py", line 11, in google_drive_auth
gauth.LocalWebserverAuth()
File "/Users/PathToProject/venv/lib/python3.8/site-packages/pydrive2/auth.py", line 123, in _decorated
self.GetFlow()
File "/Users/PathToProject/venv/lib/python3.8/site-packages/pydrive2/auth.py", line 507, in GetFlow
self.LoadClientConfig()
File "/Users/PathToProject/venv/lib/python3.8/site-packages/pydrive2/auth.py", line 411, in LoadClientConfig
self.LoadClientConfigFile()
File "/Users/PathToProject/venv/lib/python3.8/site-packages/pydrive2/auth.py", line 435, in LoadClientConfigFile
raise InvalidConfigError("Invalid client secrets file %s" % error)
pydrive2.settings.InvalidConfigError: Invalid client secrets file ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)
If I edit the Python script to skip the Google Drive upload/download part, it works fine.
I don't understand why this error occurs or how to solve it. The error message seems misleading, because client_secrets.json is in the project directory and everything works from the command line.
When you run the script from the command line, your working directory is the project folder, so relative paths like client_secrets.json resolve correctly. Cron starts the job from a different working directory, so it cannot find those files. Use absolute paths and it will run smoothly; if absolute paths are not possible, make the relative paths resolve against a directory you control rather than against wherever cron happens to start the process.
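As a rough sketch of one alternative, assuming client_secrets.json (and the credentials file) live next to main.py, the script can change into its own directory before doing anything else, so the relative paths keep resolving no matter where cron launches it from:
import os

# Switch to the directory containing this script so that relative paths
# such as 'client_secrets.json' and 'mycreds.json' resolve correctly,
# even when cron starts the process from a different working directory.
os.chdir(os.path.dirname(os.path.abspath(__file__)))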
Both the server and my computer have geckodriver 0.26.0, Firefox 71, and Selenium 3.141.0.
My computer runs macOS Mojave with Python 3.8, and the server is CentOS 7 with Python 3.7. The code runs perfectly on my computer but returns errors on the server.
I don't remember exactly how it started, but I have been getting different errors depending on whether I add breakpoints, run it in the terminal, or submit the job through SLURM.
In the terminal:
File "Main.py", line 230, in <module>
main()
File "Main.py", line 179, in main
dfs=get_data(stations, inidate, findate)
File "Main.py", line 113, in get_data
list_files=return_list_day(date)
File "Main.py", line 66, in return_list_day
driver.get(webdir)
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: Failed to decode response from marionette
or
Submitted through SLURM (the most frequent error):
Traceback (most recent call last):
File "Main.py", line 230, in <module>
main()
File "Main.py", line 179, in main
dfs=get_data(stations, inidate, findate)
File "Main.py", line 113, in get_data
list_files=return_list_day(date)
File "Main.py", line 66, in return_list_day
driver.get(webdir)
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 333, in get
self.execute(Command.GET, {'url': url})
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/home/user/.local/lib/python3.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: Timeout loading page after 300000ms
I can use Selenium without issues in another, unrelated script that finishes perfectly fine, so I don't understand why Selenium is having a hard time here or where the problem comes from.
The code fails here:
## The commented-out parts are from trying the MOZ_HEADLESS flag in the terminal;
## it didn't make a difference as far as I could tell.
#options = Options()
#options.headless = True
driver = webdriver.Firefox()  # options=options
driver.get(webdir) ##<-- HERE
## Added these because of other replies to this issue I found
driver.implicitly_wait(7)
time.sleep(3)
.
.
.
driver.close()
That bit of code retrieves a list from the webpage and is run inside a for loop. The error is not in the retrieving part; it is in driver.get(webdir). I can't share the website, since it is essentially a view into a partner institution's server. webdir is a directory listing, and I am basically waiting until its contents are loaded so I can retrieve the file names.
I know you can't help much without the website, but do the errors I've shown give any indication of what the problem might be? Can retrieving a specific website behave differently depending on the OS? I have googled the errors, found related questions here, read through them, and applied the suggestions, but either nothing changed or I got a different error (one of the two above).
I found a report describing an incompatibility between geckodriver and Firefox, but since I can successfully run another (unrelated) script with exactly the same Selenium calls (only the URL differs), I don't think that is my issue.
Thanks for any help! Let me know what other information I could give that might help.
Edit:
This is not the same as the linked question: I have already added sleep time, which changed nothing, and the job has 40 GB of RAM allocated, so it is not dying from lack of memory. Those are the solutions given in that question.
For anyone with this issue: the server updated Firefox to version 79.0 and it now works without any problems. As far as I was told, nothing else was changed.
I assume the version change fixed it, though I don't know exactly how. It's worth a try if anyone else is seeing what I was: different errors depending on how the script is run.
I use Bazel to publish Docker images to the GitLab registry. Last week the Bazel commands started failing, and I was able to narrow the issue down to httplib2.
The code sample below can be used to reproduce the issue.
import httplib
import httplib2

# Plain httplib (Python 2) succeeds against the same host...
conn = httplib.HTTPSConnection("registry.gitlab.com")
conn.request("GET", "/")
r1 = conn.getresponse()
print r1.status, r1.reason

# ...while the same request through httplib2 raises SSLHandshakeError.
httplib2.Http().request('https://registry.gitlab.com')
The output for the above is:
200 OK
Traceback (most recent call last):
File "deleteMe.py", line 9, in <module>
httplib2.Http().request('https://registry.gitlab.com')
File "/Users/joint/Library/Python/2.7/lib/python/site-packages/httplib2/__init__.py", line 2135, in request
cachekey,
File "/Users/joint/Library/Python/2.7/lib/python/site-packages/httplib2/__init__.py", line 1796, in _request
conn, request_uri, method, body, headers
File "/Users/joint/Library/Python/2.7/lib/python/site-packages/httplib2/__init__.py", line 1701, in _conn_request
conn.connect()
File "/Users/joint/Library/Python/2.7/lib/python/site-packages/httplib2/__init__.py", line 1411, in connect
raise SSLHandshakeError(e)
httplib2.SSLHandshakeError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:726)
The error shown in Wireshark is 'Description: Unknown CA (48)'.
I have tried verifying the GitLab certificates with openssl and I don't see any issue with them.
I have also tried specifying the GitLab certificate when constructing the httplib2 client, but I get the same error.
h = httplib2.Http(ca_certs='./registrygitlabcom.crt')
h.request('https://registry.gitlab.com')
Any pointers on what I should be doing or trying out... thanks!
I think I have figured out the answer. Posting it here for anyone else who might run into this.
The root certificates used by httplib2 come from its bundled cacerts.txt file (https://github.com/httplib2/httplib2/blob/master/python2/httplib2/cacerts.txt).
registry.gitlab.com probably switched last week to a root CA that is not in that bundle, and that triggered the problem.
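One possible workaround, sketched below under the assumption that the certifi package is installed, is to point httplib2 at an up-to-date CA bundle instead of its bundled cacerts.txt:
import certifi
import httplib2

# Use certifi's current CA bundle so the newer root CA is trusted.
h = httplib2.Http(ca_certs=certifi.where())
resp, content = h.request('https://registry.gitlab.com')
print(resp.status)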
I was trying to scrape some data off Google Finance's website as practice, using Python 3.6.2. This is the code:
import urllib.request

url = "https://www.google.com/search?num=40&newwindow=1&tbm=fin&q="
stockName = input("The stock you want to search for:")
url = url + stockName
url = "https://www.google.com/search?num=40&newwindow=1&tbm=fin&q=FB"  # note: this overwrites the URL built above
data = urllib.request.urlopen(url).read()
But I kept getting HTTP Error 403.
The error looked like this:
Traceback (most recent call last):
File "<pyshell#101>", line 1, in <module>
data=urllib.request.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 564, in error
result = self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 756, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
What should I do? Before this I got an SSL certificate error, but that was solved thanks to an answer I found on this forum.
Some sites do not take kindly to headless scraping: as a bot countermeasure they may reject requests that are missing the proper headers or that lack JS support, returning a 403 status or something else other than what you expected. I'm not familiar enough with urllib to comment on it, but when I try the same URL with the requests module, it seems to work.
import requests
res = requests.get("https://www.google.com/search?num=40&newwindow=1&tbm=fin&q=FB")
res.raise_for_status()
# no exception raised, so the status code is OK
You might also want to try urllib3. Both of the libraries I've mentioned need to be installed from pip.
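A minimal sketch of the urllib3 route, assuming it is installed from pip (the User-Agent value is again just an example browser string):
import urllib3

http = urllib3.PoolManager()
resp = http.request(
    "GET",
    "https://www.google.com/search?num=40&newwindow=1&tbm=fin&q=FB",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)
print(resp.status)  # the HTML is available via resp.data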
A solution for urllib also exists: you need to add a User-Agent header manually. Personally I use the fake_useragent library (again, installed from pip) to spoof the header:
from fake_useragent import UserAgent
from urllib import request

ua = UserAgent()
req = request.Request("https://www.google.com/search?num=40&newwindow=1&tbm=fin&q=FB")
req.add_header('User-Agent', ua.chrome)  # ua.chrome yields a Chrome-like User-Agent string
data = request.urlopen(req)
You can, if you're familiar enough, set your own User-Agent string without using fake_useragent; in that case, simply replace the ua.chrome portion with your User-Agent string. As you can see, though, requests works without even needing a header in this case; if you're up for increasing your skillset, it's a viable option that might save you some headaches in the future.
Edit: just to add my personal experience. One good way I've found to debug these issues is to save the page your code has retrieved and compare it to what you see in an actual browser. That way you'll know whether some contents are JS-driven (and therefore can't be obtained with a simple scrape) or whether you are receiving something different entirely (which means your scraper is missing something the page expects, e.g. headers or JS support).
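For example, a rough sketch of that debugging workflow (the output filename is arbitrary):
import requests

res = requests.get("https://www.google.com/search?num=40&newwindow=1&tbm=fin&q=FB")
# Save the retrieved HTML so it can be opened in a browser and compared
# with what the site renders during normal browsing.
with open("retrieved_page.html", "w", encoding="utf-8") as f:
    f.write(res.text)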