Here is the full script:
import requests
import bs4
res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()
multiline_code = """{}""".format(page_HTML_code)
f = open("testfile.txt","w+")
f.write(multiline_code)
f.close()
So I'm trying to write the entire downloaded HTML to a file while keeping it neat and clean.
I understand that the write fails on certain characters, but I'm not sure how to encode the text correctly.
Can anyone help?
This is the error message I get:
"C:\Location", line 16, in <module>
f.write(multiline_code)
File "C:\\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0421' in position 209: character maps to <undefined>
I did some digging around and this worked:
import requests
import bs4
res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')
page_HTML_code = soup.prettify()
multiline_code = """{}""".format(page_HTML_code)
# adding the encoding argument when opening the file did the trick
with open('testfile.html', 'w+', encoding='utf-8') as fb:
    fb.write(multiline_code)
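A related option, if you'd rather sidestep the text-encoding question entirely: prettify() returns UTF-8 bytes when you pass it an encoding, and you then write the file in binary mode. A minimal sketch of that variant:

import requests
import bs4

res = requests.get('https://example.com')
soup = bs4.BeautifulSoup(res.text, 'lxml')

# prettify('utf-8') returns bytes rather than str, so the file is opened
# in binary mode and the cp1252 default never comes into play
with open('testfile.html', 'wb') as fb:
    fb.write(soup.prettify('utf-8'))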
I'm doing a course on Python at Coursera.
There is this assignment where I have to scrape an HTML web page and use it in my code.
Here is the code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('http://py4e-data.dr-chuck.net/comments_828036.html')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the span tags
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
print(sum)
I'm using OnlineGDB to run the code.
On running it, this problem arises:
Traceback (most recent call last):
File "main.py", line 11, in <module>
html = urllib.request.urlopen(url).read()
File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.4/urllib/request.py", line 448, in open
req = Request(fullurl, data)
File "/usr/lib/python3.4/urllib/request.py", line 266, in __init__
self.full_url = url
File "/usr/lib/python3.4/urllib/request.py", line 292, in full_url
self._parse()
File "/usr/lib/python3.4/urllib/request.py", line 321, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''
Now, can anyone explain what this problem is and how to fix it?
It would seem that the problem lies in this line:
url = input('http://py4e-data.dr-chuck.net/comments_828036.html')
input() in Python lets the user type something for your code to use. The parameter you pass to input() (in this case the URL) is just the prompt text displayed to the user. E.g. age = input('Enter your age -> ') would prompt the user like so:
Enter your age -> # you would enter it here, and the age variable would be assigned the input
Anyway, it doesn't seem that you need an input at all. So all you have to do to fix the code is remove the input() call and assign the URL to the url variable directly, like so:
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
Final code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the span tags
tags = soup('span')
sum = 0
for tag in tags:
    sum = sum + int(tag.contents[0])
print(sum)
Output: 2525
Furthermore, your code can be simplified a bit by using the requests module:
import requests
from bs4 import BeautifulSoup
url = 'http://py4e-data.dr-chuck.net/comments_828036.html'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
sum = 0
for tag in tags:
    sum += int(tag.contents[0])
print(sum)
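One small point on top of that: sum shadows Python's built-in sum() function. Renaming the accumulator avoids surprises, and the built-in then collapses the loop into one line; a minimal sketch:

# using the built-in sum() over a generator expression
total = sum(int(tag.contents[0]) for tag in tags)
print(total)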
I'm new to coding and I've been trying to scrape a page to practice. I have everything almost ready but I don't know why it gives an error.
from variables import MY_URL , OUT_FILE
import requests
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import csv
def agarrar_pagina():
    for i in range(1,22):
        uclient = ureq(MY_URL+'categorias/todas/?page={}'.format(i))
        page_html = uclient.read()
        page_soup = soup(page_html, "html.parser")
        contenedores = page_soup.findAll('div', {'class':'cambur'})
        contenedor=[]
        for link in contenedor:
            link = contenedor.findAll('a',['href'])
            ulink = ureq(MY_URL + link)
            page_link = ulink.read()
            ulink = close()
            uclient.close()
            return page_link
This is the error:
Traceback (most recent call last):
File "prueba.py", line 93, in <module>
main()
File "prueba.py", line 89, in main
cajitas = seleccionar_caja(pagina)
File "prueba.py", line 30, in seleccionar_caja
page_soup = soup(html, "html.parser")
File "C:\Users\PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 267, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
contenedor=[] is an empty list, so the for link in contenedor: loop never runs. I think you intended to iterate over contenedores and pull the href out of each element:

for contenedor in contenedores:
    link = contenedor.find('a')['href']
    ulink = ureq(MY_URL + link)
    page_link = ulink.read()
    ulink.close()
uclient.close()

Note also that ulink = close() raises a NameError (you want ulink.close()), and uclient.close without parentheses never actually calls the method.
The function agarrar_pagina is not mentioned in the traceback. I think that is because agarrar_pagina always returns None: the return statement sits inside a loop over an empty list, so it is never reached. The same thing happens in the corrected loop whenever findAll finds nothing.
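To make the function actually return something, one option is to collect the downloaded pages in a list and return it after both loops finish. A minimal sketch, assuming each 'cambur' div contains an <a> whose href is a path relative to MY_URL:

def agarrar_pagina():
    paginas = []
    for i in range(1, 22):
        uclient = ureq(MY_URL + 'categorias/todas/?page={}'.format(i))
        page_soup = soup(uclient.read(), "html.parser")
        uclient.close()
        for contenedor in page_soup.findAll('div', {'class': 'cambur'}):
            link = contenedor.find('a')['href']  # assumes the div holds an <a href=...>
            ulink = ureq(MY_URL + link)
            paginas.append(ulink.read())
            ulink.close()
    return paginas  # an empty list, not None, if nothing matched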
I'm trying to download a picture from a website with my Python script, but every time I use the Georgian alphabet in the URL it fails with "UnicodeEncodeError: 'ascii' codec can't encode characters".
Here is my code:
import os
import urllib.request
def download_image(url):
    fullfilename = os.path.join('/images', 'image.jpg')
    urllib.request.urlretrieve(url, fullfilename)
download_image(u'https://example.com/media/სდასდსადადსაფა_8QXjrbi.jpg')
I think it's better to use the requests library in your example, since it handles UTF-8 characters in URLs.
Here is the code:
import requests
def download_image(url):
    request = requests.get(url)
    local_path = 'images/images.jpg'
    with open(local_path, 'wb') as file:
        file.write(request.content)
my_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/ერეკლე_II_ბავშვობის_სურათი.jpg/459px-ერეკლე_II_ბავშვობის_სურათი.jpg'
download_image(my_url)
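If you'd rather stay with urllib.request, percent-encoding the path part of the URL also works, since the request line urllib sends must be pure ASCII. A minimal sketch of that variant:

import os
import urllib.parse
import urllib.request

def download_image(url):
    # Split the URL and percent-encode only the path, leaving the
    # scheme and host untouched; quote() keeps '/' intact by default.
    parts = urllib.parse.urlsplit(url)
    safe_url = parts._replace(path=urllib.parse.quote(parts.path)).geturl()
    fullfilename = os.path.join('images', 'image.jpg')
    urllib.request.urlretrieve(safe_url, fullfilename)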
I need to edit hundreds of .html files with BeautifulSoup 4.
My CSS formatting is lost when I write the changes back to the files.
(Screenshots of the markup before and after prettify() omitted.)
My code:
from bs4 import BeautifulSoup
import os
files = []
path = r"C:\Files"
for file in os.listdir(path):
    if file.endswith('.html'):
        files.append(file)

for htmlfile in files:
    soup = BeautifulSoup(open(htmlfile, encoding="utf-8"), "html.parser")
    soup.header.decompose()
    soup.menu.decompose()
    pretty_html = soup.prettify('utf-8', 'minimal')
    with open(htmlfile, "wb") as outfile:
        outfile.write(pretty_html)
If I don't prettify() and write it out as below:
with open(htmlfile, "w") as outfile:
    outfile.write(str(soup))
I get an encoding error:
outfile.write(str(soup))
File "...env\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 2027: character maps to <undefined>
Seems to be "utf-8" to "cp1252" enconding issue.
I can't wrap my head around this encoding stuff.
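For what it's worth, the same fix that worked in the first answer above applies here too: pass encoding="utf-8" to open() so the Windows default (cp1252) is never used. A minimal sketch of the non-prettify branch:

# "w" mode with an explicit encoding; '\u2192' (→) is representable
# in UTF-8, so the charmap error goes away
with open(htmlfile, "w", encoding="utf-8") as outfile:
    outfile.write(str(soup))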
The script below returns 'UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)'
and I can't find a good explanation for it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
results = {}
for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

df = pd.concat([v for v in results.values()], axis = 0)
You are using the standard library to open the URL. It forces the request to be encoded as ASCII, so non-ASCII characters like ø raise a UnicodeEncodeError.
Lines 1116-1117 of http/client.py:
# Non-ASCII characters should have been eliminated earlier
self._output(request.encode('ascii'))
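You can reproduce the failure in isolation by applying that same encode call to the non-ASCII part of the address:

# the same ASCII encode that http/client.py performs on the request line
'Entreprenører'.encode('ascii')  # raises UnicodeEncodeError: 'ascii' codec can't encode character '\xf8'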
As an alternative to urllib.request, the third-party requests library is great:
import requests
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
html = requests.get(address).text
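If you'd rather keep urlopen, percent-encoding the non-ASCII query value also clears the error. A minimal sketch (page_num fixed to 0 here just for illustration):

from urllib.parse import quote
from urllib.request import urlopen

page_num = 0  # one page, for illustration; the question loops over many
address = ('https://www.proff.no/nyetableringer?industryCode=p441'
           '&fromDate=22.01.2007&location=Nord-Norge&locationId=N'
           '&offset=' + str(page_num) + '&industry=' + quote('Entreprenører'))
html = urlopen(address)  # the request line is now pure ASCII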