Extract Web Data with Beautiful Soup - python-3.x

I am having an issue getting the text of a field from a web page using Python 3 and bs4. Code below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("https://www.mlssoccer.com/players")
content = page.content
soup = BeautifulSoup(content, "html.parser")
data = soup.find('div', class_='item-list')
names = []
for player in data:
    name = data.find_all('div', class_='name')
    names.append(name)
df = pd.DataFrame({'player': names})
The code works (i.e. executes), but I get the HTML tags in the output rather than the text of the field (the player name). I tried:
name = data.find_all('div', class_ = 'name').text
in the for loop, but that doesn't work either.
Any pointers or references would be appreciated.

What you get from find_all is a ResultSet, so yes, you need .text to retrieve the name data you want, but it won't work on the whole set. You need a for loop to retrieve the elements one by one.
However, each name div actually contains an a tag, so you need to dig further into it with find('a'):
for player in data:
    name = data.find_all('div', class_='name')
    for obj in name:
        names.append(obj.find('a').text)
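Put together, a minimal end-to-end sketch of the fix (assuming the page still uses the item-list/name markup shown in the question; the redundant outer loop is dropped):

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("https://www.mlssoccer.com/players")
soup = BeautifulSoup(page.content, "html.parser")

names = []
# each div.name is expected to wrap an a tag whose text is the player name
for div in soup.find_all('div', class_='name'):
    link = div.find('a')
    if link is not None:
        names.append(link.text)

df = pd.DataFrame({'player': names})
print(df.head())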

You only need to loop once; use .text to get the text inside the element:
...
soup = BeautifulSoup(content, "html.parser")
data = soup.find_all('a', class_='name_link')
names = []
for player in data:
    names.append(player.text)
...
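From there, the DataFrame step from the question works as intended (assuming pandas is imported as in the question):

df = pd.DataFrame({'player': names})
print(df.head())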

Related

What does "[ ]" as output mean while web scraping

I am trying to web scrape a site to get the complete table of players, age, value and other columns, but I got "[ ]" as output. What does "[ ]" mean, and how can I get the complete table?
This is my code:
import requests
from bs4 import BeautifulSoup
link = "https://sofifa.com/team/1/arsenal/?&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc"
get_link = requests.get(link)
get_text = get_link.text
objBs = BeautifulSoup("get_text", "lxml")
objBs.findAll("table", {"class":"table table-hover persist-area"})
[] is an empty list, meaning no results were found. The problem is that you passed the literal string "get_text" to Beautiful Soup, instead of the actual web page content. You can get the table like this:
get_text = requests.get(link)
soup = BeautifulSoup(get_text.content, "lxml")
table = soup.find("table", {"class":"table table-hover persist-area"})
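If you then want the matched table as a DataFrame, pandas can parse the tag directly. A sketch, assuming the class names above still match; some sites reject the default requests User-Agent, so a browser-like header is sent here as a precaution (an assumption, not something the question confirms):

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = "https://sofifa.com/team/1/arsenal/"
# Assumption: a browser-like User-Agent avoids a blocked request.
headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get(link, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
table = soup.find("table", {"class": "table table-hover persist-area"})
if table is not None:
    df = pd.read_html(str(table))[0]  # parse the matched table into a DataFrame
    print(df.head())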

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using Beautiful Soup. However, the code I wrote returns an empty array or nothing.
The page markup looks like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup

request = requests.get(quote_page)  # take the page link
page = request.content  # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div', {'class': 'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all.
I think you can use the following URL:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9 and 10 correspond to PS, PSD, CDS, CDU, Bloco and Outros/Brancos/Nulos respectively.
You can drop unwanted columns, add appropriate headers, etc.
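For instance, a sketch of that cleanup (the column positions are taken from the note above; adjust them if the feed changes):

# Assumption: columns 5-10 hold the poll figures, as noted above.
cols = {5: 'PS', 6: 'PSD', 7: 'CDS', 8: 'CDU', 9: 'Bloco', 10: 'Outros/Brancos/Nulos'}
df = df[list(cols)].rename(columns=cols)  # keep only the poll columns, with readable headers
print(df.head())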

Web Scraping with Try: Except: in For Loop

I have written the code below attempting to practice web-scraping with Python, Pandas, etc. In general I have four steps I am trying to follow to achieve my desired output:
1. Get a list of names to append to a base url
2. Create a list of player-specific urls
3. Use the player urls to scrape tables
4. Add the player name to the table I scraped, to keep track of which player belongs to which stats - i.e., in each row of the table, add a column with the name of the player used to scrape it
I was able to get #1 and #2 working. The components of #3 seem to work, but I believe I have something wrong with my try/except, because if I run just the line of code to scrape a specific playerUrl, the tables DataFrame populates as expected. The first player scraped has no data, so I believe the error catching is failing.
For #4 I really haven't been able to find a solution. How do I add the name to the list as it iterates in the for loop?
Any help is appreciated.
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find('div', class_='item-list')
names = []
for player in data:
    name = data.find_all('div', class_='name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ', '-'))

### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = []
x = 0
for name in names:
    playerList = names
    newUrl = url + str(playerList[x])
    print("Gathering url..." + newUrl)
    playerUrl.append(newUrl)
    x += 1

### now take the list of urls and gather stats tables
tbls = []
i = 0
for url in playerUrl:
    try:  ### added the try, except, pass because some players have no stats table
        tables = pd.read_html(playerUrl[i], header=0)[2]
        tbls.append(tables)
        i += 1
    except Exception:
        continue
There is a lot of redundancy in your script. You can clean it up as follows. I've used select() instead of find_all() to shake off the verbosity. To get rid of the IndexError, you can make use of the continue keyword, as shown below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://www.mlssoccer.com/players?page=0"
url = 'https://www.mlssoccer.com/players/'

res = requests.get(base_url)
soup = BeautifulSoup(res.text, 'lxml')
names = []
for player in soup.select('.item-list .name a'):
    names.append(player.get_text(strip=True).replace(" ", "-"))

playerUrl = {}
for name in names:
    playerUrl[name] = f'{url}{name}'

tbls = []
for url in playerUrl.values():
    if len(pd.read_html(url)) <= 2:
        continue
    tables = pd.read_html(url, header=0)[2]
    tbls.append(tables)

print(tbls)
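One caveat: as written, each player page is downloaded twice, once for the length check and once for the table read. A variant of the final loop that fetches each page only once (same layout assumptions as above):

for url in playerUrl.values():
    tables = pd.read_html(url, header=0)  # single download per page
    if len(tables) <= 2:
        continue
    tbls.append(tables[2])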
You can do a couple of things to improve your code and get steps #3 and #4 done.
(i) When using the for name in names loop, there is no need to explicitly use indexing; just use the variable name.
(ii) You can save each player's name and its corresponding URL in a dict, with the name as the key. Then in steps 3/4 you can use that name.
(iii) Construct a DataFrame for each parsed HTML table and just append the player's name to it. Save this data frame individually.
(iv) Finally concatenate these data frames to form a single one.
Here is your code modified with above suggested changes:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find('div', class_='item-list')
names = []
for player in data:
    name = data.find_all('div', class_='name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ', '-'))

### create a dict of player specific urls, keyed by player name
url = 'https://www.mlssoccer.com/players/'
playerUrl = {}
for name in names:
    newUrl = url + str(name)
    print("Gathering url..." + newUrl)
    playerUrl[name] = newUrl

### now take the urls and gather stats tables, tagging each with the player name
tbls = []
for name, url in playerUrl.items():
    try:
        tables = pd.read_html(url, header=0)[2]
        df = pd.DataFrame(tables)
        df['Player'] = name
        tbls.append(df)
    except Exception as e:
        print(e)
        continue

result = pd.concat(tbls)
print(result.head())
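If you prefer a clean sequential row index on the combined frame, pandas' concat accepts ignore_index:

result = pd.concat(tbls, ignore_index=True)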

Parsing through HTML with BeautifulSoup in Python

Currently my code is as follows:
from bs4 import BeautifulSoup
import requests

main_url = 'http://www.foodnetwork.com/recipes/a-z'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = [t for tags in soup.find_all(class_='m-PromoList o-Capsule__m-PromoList') for t in tags if (t != '\n')]
As of now, I get a list containing the correct information, but it's still inside HTML tags. An example element of the list is given below:
<li class="m-PromoList__a-ListItem">"16 Bean" Pasta E Fagioli</li>
From this item I want to extract both the href link and the string that follows it, separately, but I am having trouble doing this, and I really don't think getting this info should require a whole new set of operations. How do I do it?
You can do this to get the href and text for one element:
href = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')['href']
text = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a').text
For a list of items:
my_list = soup.find_all('li', attrs={'class': 'm-PromoList__a-ListItem'})
for el in my_list:
    href = el.find('a')['href']
    text = el.find('a').text
    print(href)
    print(text)
Edit:
An important tip to reduce run time: Don't search for the same tag more than once. Instead, save the tag in a variable and then use it multiple times.
a = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
In large HTML documents, finding a tag takes a lot of time, so saving it in a variable reduces the time spent, since the search runs only once.
There are several ways you can achieve the same thing. Here is another approach using a CSS selector:
from bs4 import BeautifulSoup
import requests

response = requests.get('http://www.foodnetwork.com/recipes/a-z')
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".m-PromoList__a-ListItem a"):
    print("Item_Title: {}\nItem_Link: {}\n".format(item.text, item['href']))
Partial result:
Item_Title: "16 Bean" Pasta E Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
Item_Title: "16 Bean" Pasta e Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-1-3753755
Item_Title: "21" Apple Pie
Item_Link: //www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900
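Note that the returned links are protocol-relative (they begin with //). If you need absolute URLs, prepend a scheme inside the loop:

full_link = 'https:' + item['href']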

Web Crawler keeps saying no attribute even though it really has one

I have been developing a web crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1), but I am having trouble crawling the title of each listing. I am pretty sure the attribute exists for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request

target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"

def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')
    # Car Info and Link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class, and for every "carinfo" you need to get the car title. Loop over the result of table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.
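For reference, that one-liner does the same work as nesting the find calls explicitly; a sketch, assuming the table/carinfo/title markup from the question:

titles = []
for td in soup.find('table', class_='cyber').find_all('td', class_='carinfo'):
    a = td.find('a', class_='title')
    if a is not None:
        titles.append(a.get_text())
print(titles)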
