I have been scraping a small amount of data once a day, and the code was working until about five days ago. Now I can't seem to get anything but a 503 error code when trying.
Here's a stripped-down cut of the code causing the issue:
from urllib.request import Request, urlopen
website = "https://bitcointalk.org/index.php?board=1.0"
req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(req).read()
print(data)
which gives the error:
Traceback (most recent call last):
File "C:/Users/Clay/Desktop/python/airdrop/1.py", line 6, in <module>
data = urlopen(req).read()
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\Clay\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable
Any ideas? I am quite lost after changing the User-Agent and still having the same issue.
I also have multiple proxy servers that I've tried cycling through, and it doesn't work on any of them.
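For what it's worth, a minimal way to inspect what the server actually sends back with the 503 (the response body often says which layer blocked the request); the extra header below is an assumption, not a known fix:
from urllib.error import HTTPError
from urllib.request import Request, urlopen

website = "https://bitcointalk.org/index.php?board=1.0"
req = Request(website, headers={
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml',  # illustrative browser-like header
})
try:
    data = urlopen(req).read()
except HTTPError as e:
    # HTTPError is file-like: the 503 page body may name the blocking layer
    print(e.code, e.reason)
    print(e.read()[:500])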
I'm trying to download a file from the URL -> https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519
I can manually download the file by accessing the URL via a browser, and the file is automatically saved to the Downloads folder on the local machine. (The file is in JSON format.)
However, I need to achieve this using a Python script. I tried using urllib.request and wget, but in both cases I keep getting this error:
urllib.request.urlretrieve(url, path)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 247, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Is there a workaround for this? How do I deal with the dynamically changing download link?
You could try the following script to get the download URL and download the JSON file:
import requests
import re
import urllib.request

# Fetch the confirmation page and pull the real download link out of its HTML
rq = requests.get("https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519")
t = re.search(r"https://download\.microsoft\.com/download/.*?\.json", rq.text)
a = t.group()
print(a)

path = r"$(Build.sourcesdirectory)\agent.json"
urllib.request.urlretrieve(a, path)
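Note that $(Build.sourcesdirectory) is an Azure Pipelines variable; outside that environment, substitute an ordinary local path. The confirmation page embeds whatever dated download link is currently live, which is why the regex approach keeps working when the filename changes.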
Python 3
import urllib.request, json

with urllib.request.urlopen("https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/ServiceTags_Public_20210329.json") as url:
    data = json.loads(url.read().decode())

print(data)
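The filename in that direct link is dated and rotates periodically, so a hedged sketch combining the two answers (same regex assumption as above) keeps the script pointed at the currently published file:
import json
import re
import urllib.request

import requests

# Locate the current dated download link on the confirmation page,
# then fetch and parse the JSON directly without saving it to disk.
page = requests.get("https://www.microsoft.com/en-us/download/confirmation.aspx?id=56519")
match = re.search(r"https://download\.microsoft\.com/download/.*?\.json", page.text)
with urllib.request.urlopen(match.group()) as resp:
    data = json.loads(resp.read().decode())
print(list(data))  # top-level keys of the downloaded document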
I'm using pandas read_csv to download and read a file to a dataframe.
import pandas as pd
df = pd.read_csv('https://some-monitor.com/rest/data', sep=';', thousands='.', decimal=',')
Locally, the script works fine and the data is read to the dataframe. However, when I ssh into a remote server and run the script there, I get the following error:
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 424, in _read
filepath_or_buffer, encoding, compression)
File "/usr/lib/python3/dist-packages/pandas/io/common.py", line 195, in get_filepath_or_buffer
req = _urlopen(filepath_or_buffer)
File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.7/urllib/request.py", line 531, in open
response = meth(req, response)
File "/usr/lib/python3.7/urllib/request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.7/urllib/request.py", line 569, in error
return self._call_chain(*args)
File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/usr/lib/python3.7/urllib/request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 413:
Why is this occurring? Why does the script work locally but not on the server? The server's OS and my local OS are the same: Ubuntu.
Thanks to @DeadSec's suggestion, the script now runs fine on the server too.
I used pretty-downloader to download the file first, then loaded it into pandas.
import pandas as pd
from pretty_downloader import download

# Download to disk first, then read the local copy with pandas
download('https://some-monitor.com/rest/data', file_name='my_file.csv')
df = pd.read_csv('my_file.csv', sep=';', thousands='.', decimal=',')
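An alternative sketch that bypasses urllib's default opener entirely: fetch the bytes with requests and hand them to pandas in memory. Whether this dodges the 413 depends on what the server is rejecting, so treat it as an assumption rather than a guaranteed fix:
from io import StringIO

import pandas as pd
import requests

# Fetch the CSV over HTTP ourselves, then parse the text in memory
resp = requests.get('https://some-monitor.com/rest/data')
resp.raise_for_status()
df = pd.read_csv(StringIO(resp.text), sep=';', thousands='.', decimal=',')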
I was following a tutorial, and when using request.urlopen(url) I get an error. I have tried checking the URL
(https://www.wsj.com/market-data/quotes/PH/XPHS/JFC/historical-prices/download?MOD_VIEW=page&num_rows=150&range_days=150&startDate=06/01/2020&endDate=07/05/2020)
and it's fine.
Here is my code:
from urllib import request
import datetime

def download_stock_from_day_until_today(stock_code, start_date):
    current_day = datetime.date.today()
    formatted_current_day = datetime.date.strftime(current_day, "%m/%d/%Y")  # formats today's date for the link

    # formatted url
    url = "https://www.wsj.com/market-data/quotes/PH/XPHS/" + stock_code + "/historical-prices/download?MOD_VIEW=page&num_rows=150&range_days=150&startDate=" + start_date + "&endDate=" + formatted_current_day
    print(url)

    response = request.urlopen(url)  # requests the csv file
    csv = response.read()  # reads the csv file
    csv_str = str(csv)
    lines = csv_str.split("\\n")
    dest_url = r'asd.csv'
    fx = open(dest_url, "w")
    for line in lines:
        fx.write(line + "\n")
    fx.close()

download_stock_from_day_until_today("JFC", "06/01/2020")
and the error I get in the console is:
Traceback (most recent call last):
File "C:/Users/Lathrix/PycharmProject/StockExcelDownloader/main.py", line 23, in <module>
download_stock_from_day_until_today("JFC", "06/01/2020")
File "C:/Users/Lathrix/PycharmProject/StockExcelDownloader/main.py", line 12, in download_stock_from_day_until_today
response = request.urlopen(url) #requests the csv file
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Users\Lathrix\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Looks like wsj.com does not like urllib's default User-Agent. With the line
response = request.urlopen(request.Request(url, headers={'User-Agent': 'Mozilla/5.0'}))
your code works correctly.
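For context, a minimal sketch of the change inside download_stock_from_day_until_today (everything else stays the same):
from urllib import request

# wsj.com answers 404 to urllib's default User-Agent, so send a browser-like one
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = request.urlopen(req)  # the CSV download now goes through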
>>> import re
>>> import urllib.request
>>> url = "https://www.google.com/search?q=googlestock"
>>> print(url)
https://www.google.com/search?q=googlestock
>>> data = urllib.request.urlopen(url).read()
I get an error, even though the URL works fine when opened manually. The error is:
File "<pyshell#4>", line 1, in <module>
data=urllib.request.urlopen(url).read()
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\SHARM\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
If you want to do web scraping from Google, you can use the "google" library.
On your command prompt, run pip install google (the package is literally named "google").
Then just try something like this:
from googlesearch import search

for s in search("googlestock"):
    print(s)
This will print all of the results from the Google search "googlestock". You can learn more about the library here: https://pypi.org/project/google/
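If Google starts throttling the scraper, the library also exposes pacing parameters; stop and pause below are from the google package's search() signature, though treat the exact names as an assumption:
from googlesearch import search

# stop caps the number of results; pause spaces out the underlying HTTP requests
for s in search("googlestock", stop=10, pause=2.0):
    print(s)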
I hope it helps,
BR
I have a list of URLs and I am using the following code to scrape images from websites, using urllib in Python 3.
import urllib.request
import requests
from bs4 import BeautifulSoup

i = 0
all_image_links = []
r = requests.get(urllink)
data = r.text
soup = BeautifulSoup(data, "lxml")
name = soup.find('title')
name = name.text
for link in soup.find_all('img'):
    image_link = link.get('src')
    final_link = urllink + image_link
    all_image_links.append(final_link)
for each in all_image_links:
    urllib.request.urlretrieve(each, name + str(i))
    i = i + 1
I encountered the following error:
Traceback (most recent call last):
File "j1.py", line 91, in <module>
import_personal_images(each)
File "j1.py", line 63, in import_personal_images
urllib.request.urlretrieve(each,name+str(i))
File "/usr/lib/python3.5/urllib/request.py", line 188, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.5/urllib/request.py", line 472, in open
response = meth(req, response)
File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.5/urllib/request.py", line 510, in error
return self._call_chain(*args)
File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
result = func(*args)
File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I found a few solutions on the web and changed the code to:
1):
all_image_links = []
i = 0
req = Request(urllink, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
r = webpage.decode('utf-8')
soup = BeautifulSoup(r, "lxml")
for link in soup.find_all('img'):
    image_link = link.get('src')
    all_image_links.append(urllink + image_link)
for each in all_image_links:
    urllib.request.urlretrieve(each, str(i))
    i = i + 1
2):
all_image_links = []
i = 0
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(urllink)
soup = BeautifulSoup(page.text, "html.parser")
for link in soup.find_all('img'):
    image_link = link.get('src')
    print(image_link)
    all_image_links.append(urllink + image_link)
for each in all_image_links:
    urllib.request.urlretrieve(each, str(i))
    i = i + 1
and I am still getting the same error. Can someone explain where my code is incorrect?
HTTP Error 403: Forbidden means the server understood the request but will not fulfill it, for some reason unrelated to authorization.
The server is actively denying you access to this file: either you have exceeded a rate limit, or you are not logged in and are attempting to access a privileged resource.
No amount of code, unless it is of the authorization / authentication type, will resolve this error.
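One thing worth noting about the code above: urllib.request.urlretrieve accepts no headers argument, so even in the reworked versions the image downloads still go out with urllib's default User-Agent. A minimal sketch that sends headers on the image request itself (the header values are illustrative assumptions, not a guaranteed fix):
from urllib.request import Request, urlopen

def fetch_image(url, filename, referer):
    # urlretrieve cannot send custom headers, so build the Request by hand
    req = Request(url, headers={
        'User-Agent': 'Mozilla/5.0',
        'Referer': referer,  # some hosts reject image requests lacking a Referer
    })
    with urlopen(req) as resp, open(filename, 'wb') as out:
        out.write(resp.read())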