Download files with Python - "unknown url type" - python-3.x

I need to download a list of RTF files locally with Python 3.
I tried urllib:
import urllib.request
url = "www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"
urllib.request.urlopen(url)
but I get a ValueError
ValueError: unknown url type: 'www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf'
How do I deal with this error?

Try adding http:// in front of the URL:
import urllib.request
url = "http://www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"
urllib.request.urlopen(url)
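More generally, urlopen raises this ValueError whenever the URL lacks an explicit scheme. If the list of URLs may or may not include one, a small helper can prepend it defensively (a sketch; ensure_scheme is a made-up name, not a stdlib function):

```python
from urllib.parse import urlparse

def ensure_scheme(url, default="http"):
    # urlopen() rejects URLs without an explicit scheme,
    # so prepend one when it is missing.
    if not urlparse(url).scheme:
        return f"{default}://{url}"
    return url

print(ensure_scheme("www.calhr.ca.gov/Documents/wfp-recruitment-flyer-bachelor-degree-jobs.rtf"))
```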

Related

Parse a site with DDoS guard

I have read plenty of information about using selenium and chromedriver, but nothing helped.
Then I tried undetected_chromedriver:
import undetected_chromedriver as uc
url = "<url>"
driver = uc.Chrome()
driver.get(url)
driver.quit()
However, I get this error:
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)>
The guides I found online for avoiding this error didn't help.
Maybe there's a way to make the code wait 5 seconds until the browser check finishes?
Well,
I used Grab instead of requests, and now it works. I think it has a bypass method.
Grab documentation: https://grab.readthedocs.io/en/latest/
Alternatively, you can install the beautifulsoup4 and requests libraries:
pip install beautifulsoup4
pip install requests
After that, try this code:
from bs4 import BeautifulSoup
import requests
html = requests.get("your url here").text
soup = BeautifulSoup(html, 'html.parser')
print(soup)
#use this to try to find elements:
#find_text = soup.find('pre', {'class': 'brush: python; title: ; notranslate'}).get_text()
Here is the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
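To make the find() pattern above concrete without a live request, here is a self-contained example that parses a static HTML snippet (the snippet and class name are invented for illustration):

```python
from bs4 import BeautifulSoup

# Parse a small static snippet offline to demonstrate find()
html = '<html><body><pre class="code">print("hello")</pre></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find() takes the tag name and an attribute filter, here the class
found = soup.find('pre', {'class': 'code'}).get_text()
print(found)
```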

python3 urllib cant download url when umlauts "ä,ö,ü"

I have written a Python 3 script which downloads a URL. However, it does not work if there is an umlaut in the URL (in this case "ü"). The URL also does not work if I write "ue" instead. How can I encode it as UTF-8?
import urllib.request
url = "https://www.corona-in-zahlen.de/landkreise/sk%20würzburg/"
urllib.request.urlretrieve(url, "webpage.txt")
Your example works if you replace the ü with a regular u:
import urllib.request
url = "https://www.corona-in-zahlen.de/landkreise/sk%20wurzburg/"
urllib.request.urlretrieve(url, "webpage.txt")
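If the server really does expect the umlaut spelling, a more general approach (and arguably what the question is after) is to percent-encode the non-ASCII characters as UTF-8 with urllib.parse.quote, so the original spelling is preserved:

```python
from urllib.parse import quote

url = "https://www.corona-in-zahlen.de/landkreise/sk würzburg/"

# Percent-encode non-ASCII characters (and the space) as UTF-8,
# while leaving ':' and '/' intact so the URL structure survives
safe_url = quote(url, safe=":/")
print(safe_url)
```

The encoded URL can then be passed to urlretrieve as before.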

Download xml file from the server with Python3

I am trying to download an XML file from a public data bank:
http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml
I tried to do it with requests:
import requests
url = "http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml"
response = requests.get(url)
response.encoding = 'utf-8'  # or response.apparent_encoding
print(response.content)
and wget
import wget
url = "http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=xml"
wget.download(url, './my.xml')
But both approaches produce a mess instead of a correct file (it looks like broken encoding, but I cannot fix it).
If I download the file via a web browser, I get a correct UTF-8 XML file.
What am I doing wrong in the code?
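One likely explanation (an assumption, not verified here): the downloadformat endpoint returns a ZIP archive rather than raw XML, so printing the bytes looks like garbage even though nothing is wrong with the encoding. Checking the ZIP magic bytes before decoding makes this easy to detect; maybe_unzip is a hypothetical helper, demonstrated offline:

```python
import io
import zipfile

def maybe_unzip(data: bytes) -> bytes:
    # Some download endpoints return a ZIP archive even for "xml"
    # downloads; detect the ZIP magic bytes before treating it as XML.
    if data[:2] == b"PK":
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            first = zf.namelist()[0]
            return zf.read(first)
    return data

# Offline demo: wrap a tiny XML payload in an in-memory zip archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.xml", "<root/>")
print(maybe_unzip(buf.getvalue()))
```

With a real response, you would pass response.content through the same check before decoding it as UTF-8.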

How to download zip file from a Hyperlink in python

There is a website at
https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files
and it contains one downloadable file,
Drugs@FDA Download File (ZIP - 3.2MB), as a hyperlink in the page content.
I have tried the code below:
import urllib.request
import gzip
url = 'https://www.fda.gov/media/89850/download'
with urllib.request.urlopen(url) as response:
    with gzip.GzipFile(fileobj=response) as uncompressed:
        file_header = uncompressed.read()
But I am getting the error: Not a gzipped file
You can use the Python requests library to get the data from the URL, then write the contents to a file:
import requests
with open('my_zip_file.zip', 'wb') as my_zip_file:
    data = requests.get('https://www.fda.gov/media/89850/download')
    my_zip_file.write(data.content)
This will create the file in the same directory; you can, of course, name the file anything.
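As for the original error: the download is a ZIP archive, but gzip.GzipFile expects a gzip stream, and the two are different formats. A quick offline comparison of their magic bytes shows why the mix-up fails:

```python
import gzip
import io
import zipfile

payload = b"hello"

# A gzip stream is a single compressed member with no file names...
gz_bytes = gzip.compress(payload)
print(gz_bytes[:2])   # gzip magic bytes

# ...while a zip file is an archive of named members with an index,
# handled by the zipfile module, not gzip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", payload)
zip_bytes = buf.getvalue()
print(zip_bytes[:2])  # zip magic bytes
```

Once the archive is saved, zipfile.ZipFile('my_zip_file.zip') is the right tool to read it.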

urllib3 return 404 not found for existing website

I get different results from urllib and urllib3.
I can open the web page by copying the address into Chrome, and urllib also returns the page source. I just do not understand why urllib3 returns 404 Not Found for this page when everything else works.
Below is the original code:
url = 'http://www.webmd.com/drugs/2/condition-12862/depression%20associated%20with%20bipolar%20disorder'
import urllib3
http = urllib3.PoolManager()
r = http.request('GET',url)
r.data
import urllib.request
req = urllib.request.Request(url=url)
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
My guess is that you are calling from behind a proxy. urllib uses the system proxy (on Linux, the http_proxy environment variable), but with urllib3 you need to specify the proxy explicitly.
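If a proxy is indeed the cause, urllib3 takes it via ProxyManager instead of PoolManager (the proxy address below is a hypothetical placeholder):

```python
import urllib3

# urllib3 does not pick up the system proxy automatically the way
# urllib.request does; pass it explicitly. The address is a placeholder.
http = urllib3.ProxyManager("http://proxy.example.com:8080")

# Requests are then issued exactly as with PoolManager:
# r = http.request('GET', url)
```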
