Converting a BeautifulSoup-scraped table to a list of strings

Scraping a column from Wikipedia with BeautifulSoup returns only the last row, while I want all of them in a list:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})

for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        com = str(col[1].string.strip("\n"))

list(com)
com
Out: 'ZTS'
So it only shows the last row. I was expecting a list with one string per table row, so that I can assign the list to a new variable:
"Rice", "Buffalo milk", "Cow milk", "Wheat"
Can anyone help me?

Your method will not work because you are not "adding" anything to com.
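To illustrate, each pass through the loop rebinds com to a new string, and list(com) on the final value just splits that string into characters (a minimal sketch):
com = "Rice"
com = "ZTS"        # each loop iteration replaces the previous value
print(list(com))   # ['Z', 'T', 'S'] -- the characters of the last string, not the rows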
One way to do what you desire is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://en.wikipedia.org/wiki/Agriculture_in_India"
html = urlopen(site)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {'class': 'wikitable sortable'})

com = []
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if len(col) > 0:
        temp = col[1].contents[0]
        try:
            to_append = temp.contents[0]
        except Exception:
            to_append = temp
        com.append(to_append)
print(com)
This will give you what you require.
Explanation
col[1].contents[0] gives the first child of the tag; .contents returns a list of the tag's children, and here there is a single child, hence index 0.
In some cases the content inside the <td> tag is an <a> tag, so I apply another .contents[0] to get the text inside it.
In other cases it is plain text rather than a link. For those I used an exception handler: the extracted child has no children of its own, so the second .contents[0] raises an exception and we fall back to the child itself.
See the official documentation for details.
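An alternative sketch that sidesteps the try/except entirely: get_text() concatenates all the text in a tag's subtree, so it handles both the plain-text and the <a>-wrapped cells (standard BeautifulSoup, though untested against this exact table; reuses table from above):
com = []
for row in table.find_all("tr")[1:]:
    col = row.find_all("td")
    if col:
        # get_text() works whether the cell holds bare text or a nested <a> tag
        com.append(col[1].get_text(strip=True))
print(com)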

Related

Script is not returning proper output when trying to retrieve data from a newsletter

I am trying to write a script that retrieves the album title and band name from a music store newsletter. The band name and album title sit inside h3 and h4 elements. When executing the script I get a blank output in the CSV file.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'a' elements with the class 'row'
albums = soup.find_all('a', attrs={'class': 'row'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('td', attrs={'td_class': 'h3 class'})
    band_name_element = album.find('td', attrs={'td_class': 'h4 class'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
I think the error is in the attrs part, not sure how to fix it properly. Thanks in advance!
Looking at your code I agree that the error lies in the attrs part. The problem you are facing is that the site you are trying to scrape does not contain 'a' elements with the 'row' class. Thus find_all returns an empty list. There are plenty of 'div' elements with the 'row' class, maybe you meant to look for those?
You had the right idea by looking for 'td' elements and extracting their 'h3' and 'h4' elements, but since albums is an empty list, there are no elements to find.
I changed your code slightly to look for 'td' elements directly and extract their 'h3' and 'h4' elements. With these small changes your code found 29 albums.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'} )
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3')
    band_name_element = album.find('h4')
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv', index=False)
I also took the liberty of adding index=False to the last line of your code. This keeps each row from starting with a bare comma (the index column).
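For illustration, a minimal sketch of the difference, using a hypothetical one-row frame:
import pandas as pd

df = pd.DataFrame({'album_title': ['Abbey Road'], 'band_name': ['The Beatles']})

# Default: the index is written as an unnamed first column
print(df.to_csv())
#   ,album_title,band_name
#   0,Abbey Road,The Beatles

# With index=False the leading column disappears
print(df.to_csv(index=False))
#   album_title,band_name
#   Abbey Road,The Beatles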
Hope this helps.
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Use the requests library to fetch the HTML content of the page
url = "https://www.musicmaniarecords.be/_sys/newsl_view?n=260&sub=Tmpw6Rij5D"
response = requests.get(url)
# Use the BeautifulSoup library to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Find all 'td' elements with the class 'block__cell'
albums = soup.find_all('td', attrs={'class': 'block__cell'})
# Iterate over the found elements and extract the album title and band name
album_title = []
band_name = []
for album in albums:
    album_title_element = album.find('h3', attrs={'class': 'header'})
    band_name_element = album.find('h4', attrs={'class': 'header'})
    album_title.append(album_title_element.text)
    band_name.append(band_name_element.text)
# Use the pandas library to save the extracted data to a CSV file
df = pd.DataFrame({'album_title': album_title, 'band_name': band_name})
df.to_csv('music_records.csv')
Thanks to the anonymous hero for helping out!
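One hedged refinement to either version: if a td happens to lack an h3 or h4, calling .text on the resulting None raises AttributeError. A defensive variant of the loop (untested against this page; reuses albums and the two lists from above):
for album in albums:
    title_el = album.find('h3')
    band_el = album.find('h4')
    if title_el is not None and band_el is not None:
        # only keep cells that contain both headers
        album_title.append(title_el.get_text(strip=True))
        band_name.append(band_el.get_text(strip=True))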

How to Add to an Index From a Web Scrape

Am I able to run a while loop and add to the index below to gather all the odd-numbered indexes on the page?
Basically, I want to skip all the even indexes and print the odd indexes without writing the same line (in asterisks below) over and over as [1], [3], [5], etc.
Is there a way to write a while loop and add to the index number?
Thanks!!
import bs4
from bs4 import BeautifulSoup
import requests
import lxml

vegas_insider = requests.get('https://www.vegasinsider.com/nfl/matchups/').text
soup = BeautifulSoup(vegas_insider, 'lxml')

**team = soup.find_all('a', class_='tableText')[1].text**
print(team)
teams = [team.text for team in soup.find_all('a', class_ = 'tableText')[1::2]]
Or, to print:
for team in soup.find_all('a', class_='tableText')[1::2]:
    print(team.text)
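And to answer the literal question: yes, a while loop that adds to the index works too. A minimal sketch, equivalent to the [1::2] slice above (reuses soup from the question):
teams = soup.find_all('a', class_='tableText')
i = 1
while i < len(teams):   # start at index 1 and step by 2 to hit the odd indexes
    print(teams[i].text)
    i += 2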

BeautifulSoup get text in tag

I am trying to get data from Yellow Pages, but I need only the numbered plumbers. I can't get the numbered text in the h2 with class='n'. I can get the class="business-name" text, but I need only the numbered plumbers, not the advertisements. What is my mistake? Thank you very much.
This is the HTML:
<div class="info">
    <h2 class="n">1. <a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>
And this is my Python code:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})
for link in links:
    for content in link.contents:
        try:
            print(content.find("h2", {"class": "n"}).text)
        except:
            pass
You need a different class selector to limit results to that section:
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
print(links)
.organic is a single class selector, taken from a compound class, for a parent element that restricts matches to the numbered plumbers listed after the ads.
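If you then want the rank and the name separately, a small follow-up sketch, assuming each entry keeps the "1. Johnny Rooter" shape shown in the question's HTML (reuses links from above):
for entry in links:
    rank, _, name = entry.partition('. ')   # split "1. Johnny Rooter" on the first ". "
    print(rank, name)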

Parsing through HTML to extract data from table rows with beautiful soup

I'm using BeautifulSoup to extract stock information from the NASDAQ website. I want to retrieve information specifically from the table rows on the HTML page, but I always get an error (line 12).
# import html-parser
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.nasdaq.com/symbol/amzn'  # AMZN is just an example
response = get(url)

# Create parse tree (BeautifulSoup object)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find_all(class_='column span-1-of-2')
table = data.find(class_='table-row')  # This is where the error occurs
print(table)
You can do something like this to get the data from table rows.
import requests
from bs4 import BeautifulSoup
import re
r = requests.get("https://www.nasdaq.com/")
print(r)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find('table',{'id':'indexTable', 'class':'floatL marginB5px'}).script.text
matches = re.findall(r'nasdaqHomeIndexChart.storeIndexInfo(.*);\r\n', data)
table_rows = [re.findall(r'\".*\"', row) for row in matches]
print(table_rows)
table_rows is list of lists containing table data.
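As an aside on the original error: find_all returns a ResultSet (essentially a list), which has no .find method; .find has to be called on a single element. A minimal fix for the failing line, if you prefer that route (reuses soup from the question):
data = soup.find_all(class_='column span-1-of-2')
if data:
    table = data[0].find(class_='table-row')  # call find on one element, not on the list
    print(table)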

Accessing commented HTML lines with BeautifulSoup

I am attempting to webscrape stats from this specific webpage: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/
However, the table for the 'Defensive Game Log' appears to be commented out in the HTML source (it starts with <!-- and ends with -->).
Because of this, the following code only grabs the offensive data, which is not commented out, and misses the commented-out defensive data.
from urllib.request import Request,urlopen
from bs4 import BeautifulSoup
import re
accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")
tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)
I am curious whether there is any way to collect all of the defensive values into a list, the same way the offensive data is stored, whether inside or outside of BeautifulSoup4. Thanks!
Note that I added onto the solution given below, derived from here:
data = []
table = defensive_log
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
The Comment object will give you what you want:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link, "lxml")

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment.find('table')  # search the comment's markup as ordinary tags
    if defensive_log:
        break
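As a hedged alternative, once a comment containing a <table> is found, pandas can parse it straight into a DataFrame (assumes pandas is installed; a sketch, untested against this exact page):
from io import StringIO

import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
soup = BeautifulSoup(urlopen(Request(accessurl)), 'lxml')

for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        # read_html returns a list of DataFrames; take the first table in the comment
        df = pd.read_html(StringIO(str(comment)))[0]
        print(df.head())
        break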
