Web Crawler keeps saying no attribute even though it really has - python-3.x

I have been developing a web-crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I have a trouble at crawling each title of the stock. I am pretty sure that there is attribute for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
URL = target_url
res = urllib.request.urlopen(URL)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='cyber')
#Car Info and Link
carinfo = table.find_all('td', class_='carinfo')
carinfo_title = carinfo.find_all('a', class_='title')
print (carinfo_title)
return carinfo_title
fetch_post_list()

You have multiple elements with the carinfo class and for every "carinfo" you need to get to the car title. Loop over the result of the table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
carinfo_title = carinfo.find('a', class_='title')
print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.

Related

Beautiful Soup Nested Loops

I was hoping to create a list of all of the firms featured on this list. I was hoping each winner would be their own section in the HTML but it looks like there are multiple grouped together across several divs. How would you recommend going about solving this? I was able to pull all of the divs but i dont know how to cycle through them appropriately. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses css selectors
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# if you have an older version you'll need to use contains instead of -soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags whose parent h5 tag contain the word "Firm"
See the SoupSieve Docs for more info on bs4's CSS Selectors

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click 'Next' button without actually opening the page in a browser?
At a high level this is possible, you will need to use requests or selenium in addition to beautifulsoup.
Here is an example of defining a element and clicking the button by xpath:
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
sleep(1) # Time in seconds.
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
Yes, of course you can do what you described. Although you didn't post an actual couple solutions to help you get started.
import requests
from bs4 import BeautifulSoup
url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
res = requests.get(url,headers = {"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
for page in range(7):
formdata = {}
for item in soup.select("#aspnetForm input"):
if "ctl00$Contenido$GoPag" in item.get("name"):
formdata[item.get("name")] = page
else:
formdata[item.get("name")] = item.get("value")
req = requests.post(url,data=formdata)
soup = BeautifulSoup(req.text,"lxml")
for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
data = [item.get_text(strip=True) for item in items.select("td")]
print(data)

How can I scrape data which is not having any of the source code?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
data = open('scrapeFile','r')
html = data.read()
data.close()
soup = BeautifulSoup(html,features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class':'main-bar z-depth-1'}):
# print(div.a.get('href'))
links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)
file = open("links.txt", "w")
for link in links:
file.write(link + '\n')
print(link)
I have successfully got the list of links by using this code. But When I want to scrape the data from those links from their html page, these don't have any of the source code that contains data,and to extract them it my job tough . I have used selenium driver , but it won't work well for me.
I want to scrape the data from the below link , that contains data in the html sections , which have Customer details, licence and automation, commercial details, Floor wise, operational details . I want to extract these data with name , location , contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
it 's link here . If someone finds solution , please give it to me.
Using Developer tools in your browser, you'll notice whenever you visit that link there is a request for https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a json response probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
Here you go , the main problem with the site seems to be it takes time to load that's why it was returning incomplete page source. you have to wait until page loads completely. notice time.sleep(8) this line in code below :
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
CHROMEDRIVER_PATH ="C:\Users\XYZ\Downloads/Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
responce = wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8) # wait untill page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []
div = soup.find_all('div', {'class':'row'})
for childtags in div[6].findChildren('div',{'class':'col s12 m4 info-col'}):
props = childtags.find("span").contents
props_list.append(props)
propvalue = childtags.find("p",recursive=True).contents
propvalues_list.append(propvalue)
print(props_list)
print(propvalues_list)
note: code will return Construction details in 2 seperate list.

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snipped of my code, which is successfully returning the data I want from the first table, but does not find anything from the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd
#settings
http = urllib3.PoolManager(
cert_reqs='CERT_REQUIRED',
ca_certs=certifi.where())
gamelog_offense = []
#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
try:
col=row.findAll('td')
Pass_cmp = col[4].get_text()
Pass_att = col[5].get_text()
gamelog_offense.append([Pass_cmp, Pass_att])
cnt += 1
except:
pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is completely contained within one big comment(why?). I've managed to extract this comment into a single comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment in same way I'm successfully parsing the first table. I've tried using the ideas from similar stack overflow questions (selenium, phantomjs)...no luck.
import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find helpful. Many thanks to #TomasCarvalho for directing me to find a solution. I was able to pass the big comment as html into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (note: the try/except is because some of the teams have no gamelog, and you can't call .children on a NoneType.
try:
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
html = item
Dsoup = BeautifulSoup(html, features="html.parser")
except:
html = ''
Dsoup = BeautifulSoup(html, features="html.parser")

Parsing through HTML with BeautifulSoup in Python

Currently my code is as follows:
from bs4 import BeautifulSoup
import requests
main_url = 'http://www.foodnetwork.com/recipes/a-z'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = [t for tags in soup.find_all(class_='m-PromoList o-Capsule__m-
PromoList') for t in tags if (t!='\n')]
As of now, I get a list containing the correct information but its still inside of HTML tags. An example of an element of the list is given below:
<li class="m-PromoList__a-ListItem">"16 Bean" Pasta E Fagioli</li>
from this item I want to extract both the href link and also the following string separately, but I am having trouble doing this and I really don't think getting this info should require a whole new set of operations. How do?
You can do this to get href and text for one element:
href = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')['href']
text = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a').text
For a list of items:
my_list = soup.find_all('li', attrs={'class':'m-PromoList__a-ListItem'})
for el in my_list:
href = el.find('a')['href']
text = el.find('a').text
print(href)
print(text)
Edit:
An important tip to reduce run time: Don't search for the same tag more than once. Instead, save the tag in a variable and then use it multiple times.
a = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
In large HTML codes, finding a tag takes up lot of time, so doing this will reduce the time taken to find the tag as it will run only once.
Several ways you can achieve the same. Here is another approach using css selector:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.foodnetwork.com/recipes/a-z')
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".m-PromoList__a-ListItem a"):
print("Item_Title: {}\nItem_Link: {}\n".format(item.text,item['href']))
Partial result:
Item_Title: "16 Bean" Pasta E Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
Item_Title: "16 Bean" Pasta e Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-1-3753755
Item_Title: "21" Apple Pie
Item_Link: //www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900

Resources