from urllib import request
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        x = 16958 + int(page)
        url = 'http://mery.jp/' + str(x)
        source_code = request.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        print(soup)
        break

trade_spider(2)
When I try to run the code above in Python 3.4, I get the following error:
  File "web_crawler.py", line 15, in <module>
    trade_spider(2)
  File "web_crawler.py", line 9, in trade_spider
    source_code = request.get(url)
AttributeError: 'module' object has no attribute 'get'
Any ideas on how to fix this? Thank you in advance.
You imported request from urllib, but urllib.request has no get function. I think you want to import the third-party requests module and call requests.get instead of request.get. Alternatively, keep urllib and replace get with urlopen. Since the next line reads the .text attribute, which a requests response provides and a urlopen response does not, the former is probably what you intended.
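For the second option (keeping urllib and swapping get for urlopen), a minimal standard-library sketch might look like the following. The helper names page_url and fetch_text are hypothetical and not part of the question; the UTF-8 fallback is an assumption for servers that do not declare a charset.

```python
from urllib import request


def page_url(page, base='http://mery.jp/', offset=16958):
    """Build the URL the question's loop requests for a given page number."""
    return base + str(offset + int(page))


def fetch_text(url):
    """Fetch a URL with urlopen and decode the body to str.

    urlopen returns bytes, so decoding here stands in for the .text
    attribute that a requests response would provide.
    """
    with request.urlopen(url) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')
```

Inside trade_spider this would be used as plain_text = fetch_text(page_url(page)). With the requests package installed, the equivalent line is plain_text = requests.get(url).text.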
Related
I am trying to do an assignment: write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name you find.
Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Kylen.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.
Hint: The first character of the name of the last page that you will load is: P
#Code I used:
import re
import urllib
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter URL:')
count = int(input('Enter count:'))
position = int(input('Enter position:'))-1
html = urllib.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html,"html.parser")
href = soup('a')
#print href
for i in range(count):
    link = href[position].get('href', None)
    print(href[position].contents[0])
    html = urllib.urlopen(link).read()
    soup = BeautifulSoup(html,"html.parser")
    href = soup('a')
But I got an error:
    html = urllib.urlopen(url, context=ctx).read()
AttributeError: module 'urllib' has no attribute 'urlopen'
Can anyone provide a solution for this?
You already imported urlopen but never used it. Instead you called urllib.urlopen, which does not exist in Python 3.
Call urlopen directly instead of urllib.urlopen; the same applies to the second call inside your for loop.
Example:
from urllib.request import urlopen
# before: html = urllib.urlopen(url, context=ctx).read()
html = urlopen(url, context=ctx).read()
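The difference between the two names can be checked directly. This is a small sketch assuming Python 3; the variable names are only for illustration.

```python
import urllib
import urllib.request

# In Python 3 the top-level urllib package has no urlopen attribute;
# the function lives in the urllib.request submodule.
has_top_level = hasattr(urllib, 'urlopen')
has_in_request = hasattr(urllib.request, 'urlopen')

print(has_top_level)   # False
print(has_in_request)  # True
```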
How do I resolve AttributeError: 'NoneType' object has no attribute 'text' when using BeautifulSoup?
Below is the current code I have.
from bs4 import BeautifulSoup
import urllib.request
with open('websites_mn.txt') as f:
    txtdata = f.readlines()

for raw_url in txtdata:
    raw_url = raw_url.strip('\n')
    url = urllib.request.urlopen(raw_url)
    content = url.read()
    soup = BeautifulSoup(content, 'lxml')
    table = soup.findAll('div',attrs={"class":"journal-content-article"})
    for x in table:
        print(x.find('p').text)
Here is the websites text file:
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/vsemirnye-zimnie-igry-special-noj-olimpiady-2022-goda-projdut-v-kazani?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/mezdunarodnyj-voenno-tehniceskij-forum-armia-2020-projdet-v-period-s-23-po-29-avgusta-2020-goda-na-territorii-kongressno-vystavocnogo-centra-patriot-?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/crezvycajnyj-i-polnomocnyj-posol-rossijskoj-federacii-v-mongolii-i-k-azizov-vstretilsa-s-ministrom-obrazovania-i-nauki-mongolii-l-cedevsuren-v-hode-be?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/prezident-rossijskoj-federacii-v-v-putin-pozdravil-prezidenta-mongolii-h-battulgu-s-nacional-nym-prazdnikom-naadam-?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/predsedatel-pravitel-stva-rossijskoj-federacii-m-v-misustin-pozdravil-prem-er-ministra-mongolii-u-hur?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/inistr-inostrannyh-del-rossijskoj-federacii-s-v-lavrov-pozdravil-ministra-inostrannyh-del-mongolii-n-enhtajvana-s-naznaceniem-i-nacional-nym-prazdniko?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/pozdravlenie-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-nacional-nym-prazdnikom-mongolii-naada-1?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
Instead of printing all the p tags under the div with class journal-content-article, it stops with a NoneType error.
find returns None when a div contains no p tag, so in your loop you can check that the object exists before accessing an attribute:
for x in table:
    if x.find('p'): # add this check
        print(x.find('p').text)
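The same guard can be written so that find is called only once per container. The sketch below uses the standard library's xml.etree.ElementTree purely as a stand-in for BeautifulSoup so that it runs without third-party packages; both return None from find when nothing matches. The helper name paragraph_texts and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def paragraph_texts(containers):
    """Collect the text of the first <p> in each container, skipping
    containers where find('p') returns None."""
    texts = []
    for container in containers:
        p = container.find('p')
        if p is not None:      # guard before touching .text
            texts.append(p.text)
    return texts


# Stand-in data: the second div has no <p>, like the pages that fail.
root = ET.fromstring(
    '<body>'
    '<div><p>first article</p></div>'
    '<div><span>no paragraph here</span></div>'
    '<div><p>third article</p></div>'
    '</body>'
)
print(paragraph_texts(root.findall('div')))  # ['first article', 'third article']
```

The helper relies only on find and .text, so it should apply equally to the tags returned by soup.findAll in the question.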
I'm new to coding and I've been trying to scrape a page to practice. I have everything almost ready but I don't know why it gives an error.
from variables import MY_URL, OUT_FILE
import requests
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import csv

def agarrar_pagina():
    for i in range(1,22):
        uclient = ureq(MY_URL+'categorias/todas/?page={}'.format(i))
        page_html = uclient.read()
        page_soup = soup(page_html, "html.parser")
        contenedores = page_soup.findAll('div', {'class':'cambur'})
        contenedor=[]
        for link in contenedor:
            link = contenedor.findAll('a',['href'])
            ulink = ureq(MY_URL + link)
            page_link = ulink.read()
            ulink = close()
            uclient.close()
            return page_link
This is the error:
Traceback (most recent call last):
  File "prueba.py", line 93, in <module>
    main()
  File "prueba.py", line 89, in main
    cajitas = seleccionar_caja(pagina)
  File "prueba.py", line 30, in seleccionar_caja
    page_soup = soup(html, "html.parser")
  File "C:\Users\PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 267, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
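contenedor=[] is an empty list, so the body of your inner loop never runs. I think you intended to iterate over contenedores and read the href of each link:
for contenedor in contenedores:
    # href=True keeps only anchors that actually have an href attribute
    for a in contenedor.findAll('a', href=True):
        ulink = ureq(MY_URL + a['href'])
        page_link = ulink.read()
        ulink.close()
uclient.close()
The function agarrar_pagina does not appear in the stack trace. I think that is because agarrar_pagina returns None every time, and that None is presumably what later reaches BeautifulSoup in seleccionar_caja.
This happens with your code because it loops over an empty list, and it would also happen with the corrected function whenever findAll finds nothing.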
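A minimal sketch of that failure mode, assuming agarrar_pagina returns from inside its loop (the original indentation is not recoverable, so this is a guess). The function first_upper is a hypothetical stand-in, not code from the question.

```python
def first_upper(items):
    """Return from inside the loop, as agarrar_pagina appears to do."""
    for item in items:
        return item.upper()
    # No explicit return here: an empty list falls through and yields None.


result = first_upper([])
print(result)  # None

# Passing that None on reproduces the TypeError from the traceback.
try:
    len(result)
except TypeError as exc:
    print(exc)  # object of type 'NoneType' has no len()
```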
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://en.wikipedia.org/wiki/Berlin_Wall/'
cream = requests.get(url).content
soup= BeautifulSoup(cream, 'lxml')
table = soup.find('table', {'class' : 'infobox vcard'})
type(table)
table_rows = table.find_all('tr')
for tr in table_rows:
    print(td.text)
I am using Python 3. I was trying to scrape the infobox from Wikipedia pages, but I keep getting AttributeError: 'NoneType' object has no attribute 'find_all'. Does anyone know what the problem is?
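You have a couple of simple mistakes in your script:
Remove the trailing forward slash (/) from your url string. With the slash, the request most likely lands on a different page that has no infobox, so soup.find returns None:
url = 'https://en.wikipedia.org/wiki/Berlin_Wall'
td does not exist in your loop, so change it to tr:
    print(tr.text)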
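To make this kind of problem easier to spot, one option is to fail early with a descriptive message instead of letting None reach find_all. A small sketch; the helper names clean_url and require are hypothetical.

```python
def clean_url(url):
    """Drop trailing slashes so the URL points at the article itself."""
    return url.rstrip('/')


def require(found, description):
    """Return found unchanged, or raise a descriptive error if it is None."""
    if found is None:
        raise LookupError('nothing matched: ' + description)
    return found


print(clean_url('https://en.wikipedia.org/wiki/Berlin_Wall/'))
```

In the script above this would be used as table = require(soup.find('table', {'class': 'infobox vcard'}), 'infobox table'), so a missing table is reported at the line where the lookup happens.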
# import libraries
import urllib2
from bs4 import BeautifulSoup
# new libraries:
import csv
import requests
import string
# Defining variables:
i = 1
str_i = str(i)
seqPrefix = 'seq_'
seq_1 = str('https://anyaddress.com/')
quote_page = seqPrefix + str_i
#Then, make use of the Python urllib2 to get the HTML page of the url declared.
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(quote_page)
#Finally, parse the page into BeautifulSoup format so we can use BeautifulSoup to work on it.
# parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
As a result, all is fine... except for this error message:
    page = urllib2.urlopen(quote_page)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 423, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 285, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: seq_1
Why?
Thanks.
You can use the local variable dictionary vars():
page = urllib2.urlopen(vars()[quote_page])
The way you had it, it was trying to open the string "seq_1" as the URL, not the value of the seq_1 variable, which is a valid URL.
It looks like you need to concatenate seq_1 and str_i.
Example:
seq_1 = str('https://anyaddress.com/')
quote_page = seq_1 + str_i
Output:
https://anyaddress.com/1
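If the goal is to pick one of several URLs by number, an explicit dictionary is an alternative to looking names up through vars(). The sketch below uses Python 3's urllib.request rather than the question's Python 2 urllib2, and the sequences dictionary is a hypothetical replacement for separately named variables such as seq_1.

```python
from urllib.request import Request

# Hypothetical lookup table replacing the separately named seq_1, seq_2, ...
sequences = {'seq_1': 'https://anyaddress.com/'}

i = 1
key = 'seq_' + str(i)

# The key itself is not a URL: it has no scheme, so building a request
# fails with ValueError before any network access happens.
try:
    Request(key)
except ValueError as exc:
    print(exc)

# Looking the key up yields the real URL, which does have a scheme.
quote_page = sequences[key]
print(Request(quote_page).full_url)  # https://anyaddress.com/
```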