Using Beautifulsoup to parse a big comment? - python-3.x

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snipped of my code, which is successfully returning the data I want from the first table, but does not find anything from the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd
#settings
http = urllib3.PoolManager(
cert_reqs='CERT_REQUIRED',
ca_certs=certifi.where())
gamelog_offense = []
#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
try:
col=row.findAll('td')
Pass_cmp = col[4].get_text()
Pass_att = col[5].get_text()
gamelog_offense.append([Pass_cmp, Pass_att])
cnt += 1
except:
pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is completely contained within one big comment(why?). I've managed to extract this comment into a single comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment in same way I'm successfully parsing the first table. I've tried using the ideas from similar stack overflow questions (selenium, phantomjs)...no luck.
import bs4
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....

Posting an answer here in case others find helpful. Many thanks to #TomasCarvalho for directing me to find a solution. I was able to pass the big comment as html into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (note: the try/except is because some of the teams have no gamelog, and you can't call .children on a NoneType.
try:
defense = soup.find(id="all_defense")
for item in defense.children:
if isinstance(item, bs4.element.Comment):
html = item
Dsoup = BeautifulSoup(html, features="html.parser")
except:
html = ''
Dsoup = BeautifulSoup(html, features="html.parser")

Related

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using beautiful soup. However, the code I wrote returns an empty array or nothing. The code I used is below:
The page code is like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page) # take the page link
page = request.content # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div',{'class':'ST-c2-dv2'}):
print each_div
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all
I think you can use the following url
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5,6,7,8,9,10 correspond with PS, PSD,CDS,CDU,Bloco,Outros/Brancos/Nulos
You can drop unwanted columns, add appropriate headers etc.

How can I scrape data which is not having any of the source code?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request
data = open('scrapeFile','r')
html = data.read()
data.close()
soup = BeautifulSoup(html,features="html.parser")
# code to extract links
links = []
for div in soup.find_all('div', {'class':'main-bar z-depth-1'}):
# print(div.a.get('href'))
links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)
file = open("links.txt", "w")
for link in links:
file.write(link + '\n')
print(link)
I have successfully got the list of links by using this code. But When I want to scrape the data from those links from their html page, these don't have any of the source code that contains data,and to extract them it my job tough . I have used selenium driver , but it won't work well for me.
I want to scrape the data from the below link , that contains data in the html sections , which have Customer details, licence and automation, commercial details, Floor wise, operational details . I want to extract these data with name , location , contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
it 's link here . If someone finds solution , please give it to me.
Using Developer tools in your browser, you'll notice whenever you visit that link there is a request for https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 that returns a json response probably containing the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
Here you go , the main problem with the site seems to be it takes time to load that's why it was returning incomplete page source. you have to wait until page loads completely. notice time.sleep(8) this line in code below :
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
CHROMEDRIVER_PATH ="C:\Users\XYZ\Downloads/Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
responce = wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8) # wait untill page loads completely
soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []
div = soup.find_all('div', {'class':'row'})
for childtags in div[6].findChildren('div',{'class':'col s12 m4 info-col'}):
props = childtags.find("span").contents
props_list.append(props)
propvalue = childtags.find("p",recursive=True).contents
propvalues_list.append(propvalue)
print(props_list)
print(propvalues_list)
note: code will return Construction details in 2 seperate list.

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and URLLIB as a personal project. I'm facing an issue where I'm trying to scrape all the links to the 50 jobs listed on each page. I'm using a regex to identify these links. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output(after accounting for an removing an initial irrelevant link)
There's a difference between how the links are ordered in the source code and my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
50
Process:
Use soup.find rather than soup.find_all. This will give a JSON bs4.element.Tag
json.loads(s.text) is a nested dict. Access the values for itemListElement key to get a dict of urls, and convert to list.

Web Crawler keeps saying no attribute even though it really has

I have been developing a web-crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1). But I have a trouble at crawling each title of the stock. I am pretty sure that there is attribute for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
URL = target_url
res = urllib.request.urlopen(URL)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='cyber')
#Car Info and Link
carinfo = table.find_all('td', class_='carinfo')
carinfo_title = carinfo.find_all('a', class_='title')
print (carinfo_title)
return carinfo_title
fetch_post_list()
You have multiple elements with the carinfo class and for every "carinfo" you need to get to the car title. Loop over the result of the table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
carinfo_title = carinfo.find('a', class_='title')
print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.

Iterate through list while parsing

I am trying to download worksheets for this workout, all the workouts are split on different days. All that needs to be done is add a new number at the end of the link. Here is my code.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count <29:
urls.append(theurl + str(count))
count +=1
print(urls)
for url in urls:
thepage = urllib
thepage = urllib.request.urlopen(urls)
soup = BeautifulSoup(thepage,"html.parser")
init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
workout = []
for data_all in soup.findAll('div',{'class':"b-workout-program-day-exercises"}):
try:
for item in data_all.findAll('div',{'class':"b-workout-part--item"}):
for desc in item.findAll('div', {'class':"b-workout-part--description"}):
workout.append(desc.find('h4',{'class':"b-workout-part--exercise-count"}).text.strip("\n") +",\t")
workout.append(desc.find('strong',{'class':"b-workout-part--promo-title"}).text +",\t")
workout.append(desc.find('span',{'class':"b-workout-part--equipment"}).text +",\t")
for instr in item.findAll('div', {'class':"b-workout-part--instructions"}):
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-sets"}).text.strip("\n") +",\t")
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-reps"}).text.strip("\n") +",\t")
workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
workout.append("\n*3")
except AttributeError:
pass
init_data.write("".join(map(lambda x:str(x), workout)))
init_data.close
The problem is that the server times out, I'm assuming its not iterating through the list properly or adding characters I do not need and crashing the server parser.
I have also tried to write another script that grabs all the link and put them in a text document, then reopen the text in this script and iterate through the text, but that also gave me the same error. What are your thoughts?
There's a typo here:
thepage = urllib.request.urlopen(urls)
You probably wanted:
thepage = urllib.request.urlopen(url)
Otherwise you are trying to open an array of urls rather than a single one.

Resources