How do I resolve AttributeError: 'NoneType' object has no attribute 'text' when using BeautifulSoup?
below is the current code I have.
from bs4 import BeautifulSoup
import urllib.request
with open('websites_mn.txt') as f:
txtdata = f.readlines()
for raw_url in txtdata:
raw_url = raw_url.strip('\n')
url = urllib.request.urlopen(raw_url)
content = url.read()
soup = BeautifulSoup(content, 'lxml')
table = soup.findAll('div',attrs={"class":"journal-content-article"})
for x in table:
print(x.find('p').text)
here is the websites txt file.
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/vsemirnye-zimnie-igry-special-noj-olimpiady-2022-goda-projdut-v-kazani?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/mezdunarodnyj-voenno-tehniceskij-forum-armia-2020-projdet-v-period-s-23-po-29-avgusta-2020-goda-na-territorii-kongressno-vystavocnogo-centra-patriot-?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/crezvycajnyj-i-polnomocnyj-posol-rossijskoj-federacii-v-mongolii-i-k-azizov-vstretilsa-s-ministrom-obrazovania-i-nauki-mongolii-l-cedevsuren-v-hode-be?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/prezident-rossijskoj-federacii-v-v-putin-pozdravil-prezidenta-mongolii-h-battulgu-s-nacional-nym-prazdnikom-naadam-?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/predsedatel-pravitel-stva-rossijskoj-federacii-m-v-misustin-pozdravil-prem-er-ministra-mongolii-u-hur?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/inistr-inostrannyh-del-rossijskoj-federacii-s-v-lavrov-pozdravil-ministra-inostrannyh-del-mongolii-n-enhtajvana-s-naznaceniem-i-nacional-nym-prazdniko?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/pozdravlenie-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-nacional-nym-prazdnikom-mongolii-naada-1?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
Instead of printing all the p tags under div with journal-content-article, it stops because of NoneType error.
In your loop you can check if the object exists before accessing an attribute:
for x in table:
if x.find('p'): # add this check
print(x.find('p').text)
Related
I am trying to do an assignment: write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= vaues from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a number of times and report the last name you find.
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Kylen.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: P[enter image description here][1]
#Code I used:
import re
import urllib
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter URL:')
count = int(input('Enter count:'))
position = int(input('Enter position:'))-1
html = urllib.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html,"html.parser")
href = soup('a')
#print href
for i in range(count):
link = href[position].get('href', None)
print(href[position].contents[0])
html = urllib.urlopen(link).read()
soup = BeautifulSoup(html,"html.parser")
href = soup('a')
But got an error: html = urllib.urlopen(url, context=ctx).read()
AttributeError: module 'urllib' has no attribute 'urlopen'
Can anyone provide solutions for this?
You imported urlopen already, but never used it. Instead you used urllib.urlopen which doesn't exist.
Instead of using urllib.urlopen just use urlopen
Example:
from urllib.request import urlopen
# before: html = urllib.urlopen(url, context=ctx).read()
html = urlopen(url, context=ctx).read()
Here is my code :
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
links = pd.read_csv('C:\\Users\\acer\\Desktop\\links.csv',encoding = 'utf-8',dtype=str)
for i in range(1,10):
link = links.iloc[i,0]
for count in range(1,5):
r = requests.get(link + str(count))
soup = BeautifulSoup(r.text,'lxml')
##comp links
for links in soup.find_all('th',{"id":"c_name"}):
link = links.find('a')
li = link['href'][3:]
print("https://www.hindustanyellowpages.in/Ahmedabad/" + li)
I am getting the below error :
TypeError: unsupported operand type(s) for +: 'Tag' and 'str'
In the last line of your source code. You made a concatenation of an instance of "Tag" class (li) with the string that has an URL.
Try to extract the information that you want from the tag before the concatenation. For example, if you want to get the text in the link use this.
print("https://www.hindustanyellowpages.in/Ahmedabad/" + li.text)
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://en.wikipedia.org/wiki/Berlin_Wall/'
cream = requests.get(url).content
soup= BeautifulSoup(cream, 'lxml')
table = soup.find('table', {'class' : 'infobox vcard'})
type(table)
table_rows = table.find_all('tr')
for tr in table_rows:
print(td.text)
I am using python3. I was trying to scrap infobox from wikipedia pages, but keep getting AttributeError: 'NoneType' object has no attribute 'find_all'. Anyone knows what's that problem with this one?
You have a couple of simple mistakes in your script:
Remove the last forward-slash ( / ) off your url string.
url = 'https://en.wikipedia.org/wiki/Berlin_Wall'
td does not exist in your loop, so change it to tr:
print(tr.text)
I would like to scrape the infobox for a number of artist wikipedias. But I keep on getting the no attribute error. What am I doing wrong?
from bs4 import BeautifulSoup
from urllib.request import urlopen
url= "http://en.wikipedia.org/wiki/The_Beatles"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "lxml")
table = soup.find('table', class_='infobox vcard plainlist')
result = {}
exceptional_row_count = 0
for tr in table.find_all('tr'):
if tr.find('th'):
result[tr.find('th').text] = tr.find('td').text
else:
exceptional_row_count += 1
if exceptional_row_count > 1:
print ('WARNING ExceptionalRow>1: ', table)
print (result)
The error is in this line:
---> 13 result[tr.find('th').text] = tr.find('td').text
AttributeError: 'NoneType' object has no attribute 'text'
Scraping a column from Wikipedia with Beautifulsoup returns the last row, while I want all of them in a list:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})
for row in table.find_all("tr")[1:]:
col = row.find_all("td")
if len(col) > 0:
com = str(col[1].string.strip("\n"))
list(com)
com
Out: 'ZTS'
So it only shows the last row of the string, I was expecting to get a list with each line of the string as a string value. So that I can assign the list to new variable.
"Rice", "Buffalo milk", "Cow milk", "Wheat"
Can anyone help me?
Your method will not work because you are not "adding" anything to com.
One way to do what you desire is:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})
com=[]
for row in table.find_all("tr")[1:]:
col = row.find_all("td")
if len(col)> 0:
temp=col[1].contents[0]
try:
to_append=temp.contents[0]
except Exception as e:
to_append=temp
com.append(to_append)
print(com)
This will give you what you require.
Explanation
col[1].contents[0] gives the first child of the tag. .contents gives you a list of children of the tag. Here we have a single child so 0.
In some cases, the content inside the <tr> tag is a <a href> tag. So I apply another .contents[0] to get the text.
In other cases it is not a link. For that I used an exception statement. If there is no descendant of the child extracted, we would get an exception.
See the official documentation for details