Error with HTMLParser and feed in Python - python-3.x

Hello, I am using Python 3 to create an application that, given a URL, returns the text of the website without the HTML tags, just clean, plain text.
Here is my code; it is supposed to work but does not:
import urllib, formatter, sys
from urllib.request import urlopen
from html.parser import HTMLParser
website = urlopen("http://www.google.com")
data = website.read()
website.close()
format = formatter.AbstractFormatter(formatter.DumbWriter(sys.stdout))
ptext = HTMLParser(format)
ptext.feed(data)
ptext.close()
The error:
File "app.py", line 11, in <module>
ptext.feed(data)
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/html/parser.py", line 144, in feed
self.rawdata = self.rawdata + data
TypeError: Can't convert 'bytes' object to str implicitly
I did find a solution that overcomes the error, but I don't get the appropriate result. The solution was to change the following line:
ptext.feed(data)
to:
ptext.feed(data.decode("utf-8"))
The problem now is that there is no output in the terminal: the program runs, but nothing is printed. The code is taken from a tutorial I saw, where it works.
Thanks.
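For context: in Python 3, HTMLParser no longer accepts a formatter argument, and the formatter module is deprecated, so feeding decoded data into a parser built this way parses it without printing anything. A minimal sketch of plain-text extraction by subclassing HTMLParser (the TextExtractor name is mine, not from the tutorial):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text runs between tags into a single string."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

parser = TextExtractor()
parser.feed("<html><body><p>Hello <b>world</b></p></body></html>")
print(parser.text())  # Hello world
```

With a real page you would still decode the bytes first, e.g. `parser.feed(website.read().decode("utf-8"))`.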

Related

It shows TypeError after running python scraping code (O'Reilly example code)

I followed the example code from "O'Reilly Web Scraping with Python: Collecting More Data from the Modern Web" and found that it raises an error.
The versions are:
python3.7.3, BeautifulSoup4
The code is as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import random
import datetime
import codecs
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
random.seed(datetime.datetime.now())
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}',format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a',
        href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
links.encoding = 'utf8'
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
TypeError: POST data should be bytes, an iterable of bytes, or a file
object. It cannot be of type str.
Looking at this rather old question, I see the problem is a typo (and I VTCed as such):
html = urlopen('http://en.wikipedia.org{}',format(articleUrl))
^
That comma (,) should be a dot (.); otherwise, according to the documentation, we are passing a second parameter, data:
data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables of bytes-like objects.
Note the last sentence: the function expects bytes, an iterable of bytes, or a file object, but format() returns a string, thus the error:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
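A small sketch of the fix (the articleUrl value here is just an example):

```python
# Buggy: the comma passes format(articleUrl) as urlopen's second
# positional argument, `data`, which must be bytes, hence the TypeError.
# html = urlopen('http://en.wikipedia.org{}', format(articleUrl))

# Fixed: call str.format on the URL template instead.
articleUrl = '/wiki/Kevin_Bacon'
url = 'http://en.wikipedia.org{}'.format(articleUrl)
print(url)  # http://en.wikipedia.org/wiki/Kevin_Bacon
```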

Automating efetch does not return an xml file

Keywords: Entrez NCBI PubMed Python3.7 BeautifulSoup xml
I would like to retrieve some xml data from a list of Pubmed Ids.
When I use the url provided as an example on the Entrez website (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=10890170&retmode=xml), the data is downloaded correctly as an xml file. But if I automate the search by replacing the id with a variable (temp_id), text is returned instead of an xml file.
I therefore get this error (because there is no xml file with xml tags):
Traceback (most recent call last):
File "test.py", line 14, in
pub_doi = soup.find(idtype="doi").text
AttributeError: 'NoneType' object has no attribute 'text'
from bs4 import BeautifulSoup
import certifi
import urllib3
temp_id=str(10890170)
#efetch_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=10890170&retmode=xml'#this url works
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
efetch_url = '%sefetch.fcgi?db=pubmed&id=%s&retmode=xml' % (base_url, temp_id)
try:
    http = urllib3.PoolManager()
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
    url = efetch_url
    results = http.request('GET', url)
    soup = BeautifulSoup(results.data, features='xml')
    pub_doi = soup.find(idtype="doi").text
    pub_abstract = soup.pubmedarticleset.pubmedarticle.article.abstract.abstracttext.text
except (urllib3.exceptions.HTTPError, IOError) as e:
    print("ERROR!", e)
else:
    pass
For some reason, when I copy and paste the url in my browser, it appears as text in safari, and xml in chrome.
I would like to get some help as I suspect my url is not constructed well.
It turned out to be an issue in the way Beautiful Soup handled the url link. I used ElementTree instead and it worked.
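A sketch of the ElementTree route, using a tiny hand-written sample in the shape of PubMed's efetch XML (the element names here are illustrative; with a live request you would call ET.fromstring(results.data) instead):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for an efetch response body (real ones are larger).
sample = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Abstract><AbstractText>Some abstract text.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="doi">10.1000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(sample)
# ElementTree's limited XPath supports attribute predicates like this:
pub_doi = root.find(".//ArticleId[@IdType='doi']").text
pub_abstract = root.find(".//AbstractText").text
print(pub_doi)       # 10.1000/example
print(pub_abstract)  # Some abstract text.
```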

Can you please show me how to fix this "ValueError: not enough values to unpack" in beautiful soup?

I am going through the Beautiful Soup page of this book
Python for Secret Agents by Steven Lott Dec 11, 2015 2nd edition
http://imgur.com/EgBnXmm
I ran the code from the page and got this error:
Traceback (most recent call last):
File "C:\Python35\Draft 1.py", line 12, in
timestamp_tag, *forecast_list = strong_list
ValueError: not enough values to unpack (expected at least 1, got 0)
For the life of me I cannot figure out the correct way to fix the code listed here in its entirety:
from bs4 import BeautifulSoup
import urllib.request

query = "http://forecast.weather.gov/shmrn.php?mz=amz117&syn=amz101"
with urllib.request.urlopen(query) as amz117:
    document = BeautifulSoup(amz117.read())
content = document.body.find('div', id='content').div
synopsis = content.contents[4]
forecast = content.contents[5]
strong_list = list(forecast.findAll('strong'))
timestamp_tag, *forecast_list = strong_list
for strong in forecast_list:
    desc = strong.string.strip()
    print(desc, strong.nextSibling.string.strip())
Thanks a million.
You are experiencing the differences between parsers. Since you have not provided one explicitly, BeautifulSoup picked one automatically, based on its internal ranking and what you have installed in your Python environment; I suspect that in your case it picked up lxml or html5lib. Switch to html.parser:
document = BeautifulSoup(amz117.read(), "html.parser")
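The ValueError itself comes from starred unpacking: `timestamp_tag, *forecast_list = strong_list` needs at least one element, and with a different parser the search for `<strong>` tags came back empty. A stdlib-only sketch of that behavior:

```python
# Unpacking with a starred target requires at least one element
# for each non-starred name on the left.
strong_list = ['timestamp', 'a', 'b']
timestamp_tag, *forecast_list = strong_list
print(timestamp_tag)  # timestamp
print(forecast_list)  # ['a', 'b']

try:
    timestamp_tag, *forecast_list = []  # an empty result list
except ValueError as e:
    print(e)  # not enough values to unpack (expected at least 1, got 0)
```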

TypeError with HTML escaping in python3

I am trying to HTML-escape a string:
from html import escape
project_name_unescaped = Project.get('name').encode('utf-8')
project_name = escape(project_name_unescaped)
I get the following error
Traceback (most recent call last):
File "hosts.py", line 70, in <module>
project_name = escape(bytes(project_name_unescaped))
File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/html/__init__.py", line 19, in escape
s = s.replace("&", "&amp;") # Must be done first!
TypeError: a bytes-like object is required, not 'str'
I also tried it with the following, but still have the same error
project_name = escape(bytes(project_name_unescaped))
The strange thing is, I'm pretty sure this worked before. Any ideas what's wrong or what I need to change to make it work again? I also tried it before with
from cgi import escape
The error message is a little confusing: html.escape expects s to be a str, not bytes as you would expect from the message.
Remove all the encoding-related stuff and it will work:
from html import escape
project_name_unescaped = Project.get('name')
project_name = escape(project_name_unescaped)
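A quick sketch of what escape() does with a str input (by default it also quotes double and single quotes, since quote=True):

```python
from html import escape

# & must be escaped first, then <, >, and the quote characters.
print(escape('<a href="x">Tom & Jerry</a>'))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```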

Django-haystack with whoosh

I am getting
SearchBackendError at /forum/search/
No fields were found in any search_indexes. Please correct this before attempting to search.
with search_indexes placed in the djangobb app root directory:
from haystack.indexes import *
from haystack import site
import djangobb_forum.models as models
class PostIndex(RealTimeSearchIndex):
    text = CharField(document=True, use_template=True)
    author = CharField(model_attr='user')
    created = DateTimeField(model_attr='created')
    topic = CharField(model_attr='topic')
    category = CharField(model_attr='topic__forum__category__name')
    forum = IntegerField(model_attr='topic__forum__pk')

site.register(models.Post, PostIndex)
settings.py
# Haystack settings
HAYSTACK_SITECONF = 'search_sites'
HAYSTACK_SEARCH_ENGINE = 'whoosh'
HAYSTACK_WHOOSH_PATH = os.path.join(PROJECT_ROOT, 'djangobb_index')
I also have haystack and whoosh in my installed apps.
In python interpreter:
>>> import haystack
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.../lib/python2.7/django_haystack-1.2.5-py2.5.egg/haystack/__init__.py", line 26, in <module>
raise ImproperlyConfigured("You must define the HAYSTACK_SITECONF setting before using the search framework.")
django.core.exceptions.ImproperlyConfigured: You must define the HAYSTACK_SITECONF setting before using the search framework.
Does anyone have any ideas? Thanks in advance for any help you might have to offer.
Notice that the value shown in the documentation for HAYSTACK_SITECONF is only an example. The real value should name the module where the SearchIndex-derived classes are defined. Since in your case that module is search_indexes, you should have HAYSTACK_SITECONF = 'search_indexes'.
Also, about that error that appears at the interpreter, did you get it using python ./manage.py shell? If not, settings.py wasn't loaded at the interpreter.
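A sketch of the corrected settings fragment (PROJECT_ROOT is assumed to be defined elsewhere in your settings, as in the question; here it is stubbed so the snippet stands alone):

```python
# settings.py (sketch)
import os

PROJECT_ROOT = os.path.dirname(os.path.abspath(__file__))

# Haystack settings: point at the module that defines the
# SearchIndex classes, not at the example name from the docs.
HAYSTACK_SITECONF = 'search_indexes'
HAYSTACK_SEARCH_ENGINE = 'whoosh'
HAYSTACK_WHOOSH_PATH = os.path.join(PROJECT_ROOT, 'djangobb_index')
```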

Resources