Python - issues with for loops to create dataframe with BeautifulSoup scrape

I'm a beginner in Python and I'm trying to create a new dataframe using BeautifulSoup to scrape a webpage. I'm following some code that worked in a different page, but it's not working here. My final table of data is blank, so seems it's not appending. Any help is appreciated. This is what I've done:
from bs4 import BeautifulSoup
import requests
import pandas as pd

allergens = requests.get(url = 'http://guetta.com/diginn/allergens/')
allergens = BeautifulSoup(allergens.content)
items = allergens.find_all('div', class_ = 'menu-item-card')
final_table = {}
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for row in item.find_all('h4', recursive = False)[0:]:
        for column in row.find_all('p', class_ = 'menu-item__allergens'):
            col_name = column['class'][0].split('__')[1]
            if col_name not in final_table:
                final_table[col_name] = []
            final_table[col_name].append(column.text)
df_allergens = pd.DataFrame(final_table)
This returns nothing. No errors, just empty brackets. I was able to retrieve each element individually, so I think the items should work but obviously I'm missing something.
Edit:
Here is what the output needs to be:
Item Name   | Allergens
Classic Dig | Soy
Item2       | allergen1, allergen2
Item3       | allergen2

You don't need to find all h4 tags in every item. The inner loop searches for the allergens paragraph inside each h4, and it is presumably not nested there, so nothing is ever appended. Search each item directly instead:
...
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for column in item.find_all('p', class_ = 'menu-item__allergens'):
        col_name = column['class'][0].split('__')[1]
        if col_name not in final_table:
            final_table[col_name] = []
        final_table[col_name].append(column.text)
...
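To get the two-column layout the question asks for (item name next to its allergens), one option is to build one record per card. This is only a sketch: the inline HTML is a hypothetical stand-in for the real page, and it assumes each card holds the item name in an h4 and the allergens in a p with class menu-item__allergens.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup standing in for the real page (structure is assumed).
html = """
<div class="menu-item-card"><h4>Classic Dig</h4><p class="menu-item__allergens">Soy</p></div>
<div class="menu-item-card"><h4>Item2</h4><p class="menu-item__allergens">allergen1, allergen2</p></div>
<div class="menu-item-card"><h4>Item3</h4><p class="menu-item__allergens">allergen2</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for card in soup.find_all("div", class_="menu-item-card"):
    name = card.find("h4")
    allergens = card.find("p", class_="menu-item__allergens")
    # Keep the row even if one piece is missing, so names and allergens stay aligned.
    records.append({
        "Item Name": name.get_text(strip=True) if name else None,
        "Allergens": allergens.get_text(strip=True) if allergens else None,
    })

df_allergens = pd.DataFrame(records)
print(df_allergens)
```

Building a list of dicts keeps each name paired with its own allergens, which two independently appended lists would not guarantee.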

Related

Scraping and creating new df

I'm basically trying to create a dataframe using the following code.
Here is the resulting table I’m trying to achieve:
info_list = []
data_list = []
mini_exc = ['CLFAR', 'CLFLE', 'CLHOL', 'CLCAN', 'CLCLE']
for exc in mini_exc:
    grab_page = requests.get(f"http://availability.samknows.com/broadband/exchange/{exc}")
    soup = BeautifulSoup(grab_page.content, 'html.parser')
    warning = soup.findAll('div', class_='item-content')
    for x in warning:
        for y in x.findAll('th'):
            info_list.append(y.text)
        for z in x.findAll('td'):
            data_list.append(z.text)
Basically, I would like a dataframe with the elements of info_list as column names and the elements of data_list as the rows under the corresponding columns.
As you can see I obtained a dataframe with correct info data, but I could not add the new columns. I know that:
for y in x.findAll('th'):
    info_list.append(y.text)
should be outside the loop because I just need it once, but I put it there so you could get the column names.
This is what I did in the end: I created a DataFrame (exch_only_serv in the code below) from a list with the info values I needed, and then ran this code:
"""
for exc in ['CLFAR', 'CLFLE', 'CLHOL', 'CLCAN', 'CLCLE']:
    info_list = []
    grab_page = requests.get(f"http://availability.samknows.com/broadband/exchange/{exc}")
    soup = BeautifulSoup(grab_page.content, 'html.parser')
    warning = soup.findAll('div', class_='item-content')
    for x in warning:
        for z in x.findAll('td'):
            info_list.append(z.text)
    exch_only_serv[exc] = info_list
"""
So each pass through the loop builds a list with the values for one exchange and adds it to the dataframe as a new column.
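For the layout the question originally asked for (th texts as column names, one row per exchange), a rough sketch is to pair each th with its td and collect one dict per exchange. The pages dict and its field names and values are made up for illustration; it stands in for the downloaded HTML and assumes the th and td cells line up one to one.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded pages, keyed by exchange code.
# Field names and values are invented for illustration only.
pages = {
    "CLFAR": '<div class="item-content"><table>'
             '<tr><th>Field A</th><td>a1</td></tr>'
             '<tr><th>Field B</th><td>b1</td></tr></table></div>',
    "CLFLE": '<div class="item-content"><table>'
             '<tr><th>Field A</th><td>a2</td></tr>'
             '<tr><th>Field B</th><td>b2</td></tr></table></div>',
}

rows = []
for exc, html in pages.items():
    soup = BeautifulSoup(html, "html.parser")
    record = {"exchange": exc}
    for block in soup.find_all("div", class_="item-content"):
        headers = [th.get_text(strip=True) for th in block.find_all("th")]
        values = [td.get_text(strip=True) for td in block.find_all("td")]
        # Pair each header with the value in the same position.
        record.update(zip(headers, values))
    rows.append(record)

print(rows)
```

Passing rows to pd.DataFrame would then give one row per exchange, with the header texts as column names.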

How to get all the rows of a table, not just the first row?

Good afternoon community! I need help writing a parser, I'm just starting to program in Python 3, maybe I'm missing something. The task is this:
The site has a table with football teams, using Requests and BeautifulSoup I was able to get the source code of this table into the firsttable variable, the print command normally displays all the data I need, but when I try to display it in a list of the form:
10:00 Team 1 Team 2
11:00 Team 3 Team 4
12:00 Team 5 Team 6
And so on. However, I can only get the first value from the list. I tried using a while loop (for example, while i < 10), but it just repeats the first value from the table 10 times and does not parse the remaining ones. What am I doing wrong?
def get_data(html):
    soup = BeautifulSoup(html, 'lxml')
    firsttable = soup.findAll('table', class_='predictionsTable')[0]
    print(firsttable) #Here, all the data that I need is displayed in the console as html source
    for scrape in firsttable:
        try:
            hometeam = scrape.find('td', class_='COL-3').text
        except:
            hometeam = 'Hometeam Error'
        try:
            awayteam = scrape.find('td', class_='COL-5').text
        except:
            awayteam = 'Away Team Error'
        try:
            btts = scrape.find('td', class_='COL-10').text
        except:
            btts = 'BTTS Score Error'
        datenow = str(datetime.date.today())
        print(datenow,hometeam,awayteam,btts)
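The loop for scrape in firsttable iterates over the table's direct children (typically a single body element holding all the content), not over its rows, and find() returns only the first matching cell, which is why you are finding only the first row. Instead of that loop I would recommend the find_all method. This worked for me: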
import datetime
import requests
from bs4 import BeautifulSoup

url = 'https://www.over25tips.com/both-teams-to-score-tips/'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
firsttable = soup.findAll('table', class_='predictionsTable')[0]
# Collect each column separately; the three lists line up row by row
hometeams = [x.text for x in firsttable.find_all('td', {'class': 'COL-3 right-align'})]
awayteams = [x.text for x in firsttable.find_all('td', {'class': 'COL-5 left-align'})]
btts = [x.text for x in firsttable.find_all('td', {'class': 'COL-10 hide-on-small-only'})]
datenow = str(datetime.date.today())
for i in range(len(hometeams)):
    print(datenow, hometeams[i], awayteams[i], btts[i])
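An alternative to three parallel lists is to walk the table row by row, which keeps each match's cells together. The snippet below is a sketch against made-up inline HTML (the team names and BTTS values are placeholders); it relies on class_ matching any one of an element's classes, so 'COL-3' also matches 'COL-3 right-align'.

```python
from bs4 import BeautifulSoup

# Made-up table using the class names from the question.
html = """
<table class="predictionsTable">
  <tr><th>Home</th><th>Away</th><th>BTTS</th></tr>
  <tr><td class="COL-3 right-align">Team 1</td><td class="COL-5 left-align">Team 2</td><td class="COL-10 hide-on-small-only">Yes</td></tr>
  <tr><td class="COL-3 right-align">Team 3</td><td class="COL-5 left-align">Team 4</td><td class="COL-10 hide-on-small-only">No</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="predictionsTable")

matches = []
for tr in table.find_all("tr"):  # one iteration per row
    home = tr.find("td", class_="COL-3")
    away = tr.find("td", class_="COL-5")
    btts = tr.find("td", class_="COL-10")
    if home is None or away is None or btts is None:
        continue  # header or spacer row without the expected cells
    matches.append((home.get_text(strip=True),
                    away.get_text(strip=True),
                    btts.get_text(strip=True)))

for home, away, btts in matches:
    print(home, away, btts)
```

Checking for None explicitly avoids the bare except blocks, which would otherwise hide why a row failed.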
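The second argument to BeautifulSoup's constructor is a string that names the parser to use.
Both 'lxml' and 'html.parser' parse HTML: 'lxml' requires the third-party lxml package, while 'html.parser' ships with Python.
If lxml is not installed, pass 'html.parser' as the second argument instead:
soup = BeautifulSoup(html, 'lxml') => soup = BeautifulSoup(html, 'html.parser')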

Python, Beautifulsoup - Extracting strings from tags based on items in list

I am trying to scrape the site https://www.livechart.me/winter-2019/tv to get the number of episodes that have currently aired for certain shows this season. I do this by extracting the "episode-countdown" tag data which gives something like "EP11:" then a timestamp after it, and then I slice that string to only give the number (in this case the "11") and then subtract by 1 to get how many episodes have currently aired (as the timestamp is for when EP11 will air).
I have a list of the different shows I am watching this season in order to filter what shows I extract the episode-countdown strings for instead of extracting the countdown for every show airing. The big problem I am having is that the "episode-countdown" strings are not in the same order as my list of shows I am watching. For example, if my list is [show1, show2, show3, show4], I might get the "episodes-countdown" string tag in the order of show3, show4, show1, show2 if they are listed in that order on the website.
My current code is as follows:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

def countdown():
    html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(html)
    soup = BeautifulSoup(page, 'html.parser')
    shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        if any(x in tag['data-romaji'] for x in shows):
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            print('{} has aired {} episodes so far'.format(tag['data-romaji'], int(r2)-1))
Each show listed on the website is inside of an "article" tag so for every show in the soup.find_all() statement, if the "data-romaji" (the name of the show listed on the website) matches a show in my "shows" list, then I extract the "episode-countdown" string and then slice the string to just the number as previously explained and then print to make sure I did it correctly.
If you go to the website, the order that the shows are listed are "Yakusoku no Neverland", "Mob Psycho", "Dororo", and "Jojo" which is the order that you get the episode-countdown strings in if you run the code. What I want to do is have it in order of my "shows" list so that I have a list of shows and a list of episodes aired that match each other. I want to add the episodes aired list as a column in a pandas dataframe I am currently building so having it not match the "shows" column would be a problem.
Is there a way for me to extract the "episode-countdown" string based on the order of my "shows" list instead of the order used on the website (if that makes sense)?
Is this what you're looking for?
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd

html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(html)
soup = BeautifulSoup(page, 'html.parser')
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
master = []
for show in shows:
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        show_info = []
        if show in tag['data-romaji']:
            show_info.append(tag['data-romaji'])
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            show_info.append(r2)
            master.append(show_info)
df=pd.DataFrame(master,columns=['Show','Episodes'])
df
Output:
                                      Show Episodes
0  Jojo no Kimyou na Bouken: Ougon no Kaze       23
1                                   Dororo       11
2                        Mob Psycho 100 II       11
3                    Yakusoku no Neverland       11
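The nested loop above rescans every article once per show. A possible variation is to scan the page once into a dict keyed by title and then read it back in the order of the shows list. The sketch below uses invented inline HTML and countdown values, matches titles exactly rather than by substring, and subtracts 1 as the question describes.

```python
import re
from bs4 import BeautifulSoup

# Invented markup and countdown values, listed in "website order".
html = """
<article class="anime" data-romaji="Yakusoku no Neverland">
  <div class="episode-countdown">EP11: 2d 03h</div></article>
<article class="anime" data-romaji="Some Other Show">
  <div class="episode-countdown">EP5: 6d 01h</div></article>
<article class="anime" data-romaji="Dororo">
  <div class="episode-countdown">EP12: 1d 10h</div></article>
"""

shows = ["Dororo", "Yakusoku no Neverland"]  # the order we want to keep

soup = BeautifulSoup(html, "html.parser")

# One pass over the page: title -> number of the next episode to air.
next_episode = {}
for tag in soup.find_all("article", class_="anime"):
    countdown = tag.find("div", class_="episode-countdown")
    if countdown is None:
        continue
    found = re.match(r"EP(\d+):", countdown.get_text(strip=True))
    if found:
        next_episode[tag["data-romaji"]] = int(found.group(1))

# Read back in the order of `shows`; None marks a show missing from the page.
aired = [next_episode[s] - 1 if s in next_episode else None for s in shows]
print(list(zip(shows, aired)))
```

Because aired has exactly one entry per show, it can be assigned directly as a dataframe column next to the shows column.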

Taking Average of List of Integers

I'm scraping a list of daily stock volume numbers, and I'm wanting to take an average of the first 20 results in the volume column of the page. My code looks like:
from bs4 import BeautifulSoup
import re, csv, random, time
import pandas as pd
import os
import requests

page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.select('table[class="W(100%) M(0)"] tr')
for row in rows[1:20]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    print(numbers2)
avg20vol = sum(numbers2(1,20))/len(numbers2)
...but I'm getting stuck when trying to take the average of the returned numbers2. I receive either "TypeError: 'int' object is not callable" or "TypeError: 'int' object is not iterable" with the solutions I've tried. How do I handle taking an average of a list? Does it involve turning it into a dataframe first? Thanks!
UPDATE
Here's a working example of the applicable code segment:
numberslist=[]
for row in rows[1:21]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    numberslist.append(numbers2)
    print(numbers2)
average = sum(numberslist)/len(numberslist)
print('Average = ',average)
When scraping, actually create a list of numbers, like so:
# stuff before
number_list = [] # empty list
for row in rows[1:20]:
    # get the number
    number_list.append(int(number_as_string)) # add the new number at the end of the list
average = sum(number_list)/len(number_list)
You can also .append() the string forms and then convert them to ints with list(map(int, list_of_strings)) or [int(x) for x in list_of_strings].
Note: rows[1:20] skips the first item (the header row, as you stated) and returns only 19 rows. Use rows[1:21] to get 20 data rows, or rows[:20] to get the first 20 items in general.
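The slicing and conversion steps can be tried without any scraping. In this small sketch the cells list is a stand-in for the extracted column (a header followed by volume strings taken from the output shown in this thread), and n is set to 3 only to keep the example short.

```python
# Stand-in for the scraped column: a header cell, then comma-formatted numbers.
cells = ["Volume", "650,100", "370,500", "374,700", "500,700"]

n = 3  # how many data rows to average (20 in the question)

# cells[1:n + 1] skips the header and takes exactly n values.
volumes = [int(c.replace(",", "")) for c in cells[1:n + 1]]

average = sum(volumes) / len(volumes)
print(volumes, average)
```

The original errors came from calling sum and len on a single int; both need the whole list, which is why the values are collected first.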
Your CSS selector is also wrong and gave me an error.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find('table',class_="W(100%) M(0)").find_all('tr')
numbers=[]
for row in rows[1:20]:
    col = row.find_all("td")
    print(col[6].text)
    number = col[6].text.replace(',', '')
    number = int(number)
    numbers.append(number)
avg20vol =sum(numbers)/len(numbers)
print("Average: ",avg20vol)
Output
650,100
370,500
374,700
500,700
452,500
1,401,800
2,071,200
1,005,800
441,500
757,000
901,200
563,400
1,457,000
637,100
692,700
725,000
709,000
1,155,500
496,400
Average: 808584.2105263158

Python & Beautiful Soup - Searching result strings

I am using Beautiful Soup to parse an HTML table.
Python version 3.2
Beautiful Soup version 4.1.3
I am running into an issue when trying to use the findAll method to find the columns within my rows. I get an error that says list object has no attribute findAll. I found this method through another post on stack exchange and this was not an issue there. (BeautifulSoup HTML table parsing)
I realize that findAll is a method of BeautifulSoup, not of Python lists. The weird part is that findAll works when I find the rows within the table (I only need the 2nd table on the page), but fails when I attempt to find the columns in the rows list.
Here's my code:
from urllib.request import URLopener
from bs4 import BeautifulSoup
opener = URLopener() #Open the URL Connection
page = opener.open("http://www.labormarketinfo.edd.ca.gov/majorer/countymajorer.asp?CountyCode=000001") #Open the page
soup = BeautifulSoup(page)
table = soup.findAll('table')[1] #Get the 2nd table (index 1)
rows = table.findAll('tr') #findAll works here
cols = rows.findAll('td') #findAll fails here
print(cols)
findAll() returns a result list; you need to loop over it, or pick one element, to reach a contained element with its own findAll() method:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')  # call findAll on each row, not on the list
    print(cols)
or pick one row:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
cols = rows[0].findAll('td') # columns of the *first* row.
print(cols)
Note that findAll is deprecated, you should use find_all() instead.
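Putting the two ideas together, the second table can be turned into a list of rows with a nested comprehension. The sketch below uses find_all and a tiny made-up page with two tables in place of the real URL.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; only the second one is wanted.
html = """
<table><tr><td>skip me</td></tr></table>
<table>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[1]  # 2nd table (index 1)

# find_all is called on each individual row, never on the list of rows.
data = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in table.find_all("tr")
]
print(data)
```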
