Taking Average of List of Integers - python-3.x

I'm scraping a list of daily stock volume numbers, and I want to take an average of the first 20 results in the volume column of the page. My code looks like:
from bs4 import BeautifulSoup
import re, csv, random, time
import pandas as pd
import os
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.select('table[class="W(100%) M(0)"] tr')
for row in rows[1:20]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    print(numbers2)
avg20vol = sum(numbers2(1,20))/len(numbers2)
...but I'm getting stuck when trying to take the average of the returned numbers2. I receive either "TypeError: 'int' object is not callable" or "TypeError: 'int' object is not iterable" with the solutions I've tried. How do I handle taking an average of a list? Does it involve turning it into a dataframe first? Thanks!
UPDATE
Here's a working example of the applicable code segment:
numberslist=[]
for row in rows[1:21]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    numberslist.append(numbers2)
    print(numbers2)
average = sum(numberslist)/len(numberslist)
print('Average = ',average)
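As a side note, the standard library's statistics.mean computes the same thing, and raises a clear StatisticsError on an empty list instead of a ZeroDivisionError. A minimal sketch with made-up volume values:

```python
from statistics import mean

volumes = [650100, 370500, 374700]   # made-up sample volumes
average = mean(volumes)              # same result as sum(volumes) / len(volumes)
print('Average =', average)
```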

When scraping, actually create a list of numbers, like so:
# stuff before
number_list = [] # empty list
for row in rows[1:20]:
    # get the number
    number_list.append(int(number_as_string))  # add the new number at the end of the list
average = sum(number_list)/len(number_list)
You can also .append() the string forms and then transform them to ints with list(map(int, list_of_strings)) or [int(x) for x in list_of_strings].
Note: rows[1:20] will leave out the first item, which in your case is the header row, as you stated. Use rows[:20] to get the first 20 items in general.
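To see the slicing difference with a small stand-in list:

```python
rows = ['header', 'row1', 'row2', 'row3', 'row4']
print(rows[1:3])   # skips index 0: ['row1', 'row2']
print(rows[:3])    # keeps index 0: ['header', 'row1', 'row2']
```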

Your CSS selector is also wrong and gave me an error.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find('table',class_="W(100%) M(0)").find_all('tr')
numbers=[]
for row in rows[1:20]:
    col = row.find_all("td")
    print(col[6].text)
    number = col[6].text.replace(',', '')
    number = int(number)
    numbers.append(number)
avg20vol = sum(numbers)/len(numbers)
print("Average: ", avg20vol)
Output
650,100
370,500
374,700
500,700
452,500
1,401,800
2,071,200
1,005,800
441,500
757,000
901,200
563,400
1,457,000
637,100
692,700
725,000
709,000
1,155,500
496,400
Average: 808584.2105263158

Related

Python - issues with for loops to create dataframe with BeautifulSoup scrape

I'm a beginner in Python and I'm trying to create a new dataframe using BeautifulSoup to scrape a webpage. I'm following some code that worked on a different page, but it's not working here. My final table of data is blank, so it seems nothing is being appended. Any help is appreciated. This is what I've done:
from bs4 import BeautifulSoup
import requests
import pandas as pd
allergens = requests.get(url = 'http://guetta.com/diginn/allergens/')
allergens = BeautifulSoup(allergens.content)
items = allergens.find_all('div', class_ = 'menu-item-card')
final_table = {}
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for row in item.find_all('h4', recursive = False)[0:]:
        for column in row.find_all('p', class_ = 'menu-item__allergens'):
            col_name = column['class'][0].split('__')[1]
            if col_name not in final_table:
                final_table[col_name] = []
            final_table[col_name].append(column.text)
df_allergens = pd.DataFrame(final_table)
This returns nothing. No errors, just empty brackets. I was able to retrieve each element individually, so I think the items should work but obviously I'm missing something.
Edit:
Here is what the output needs to be:
Item Name Allergens
Classic Dig | Soy
Item2 | allergen1, allergen2
Item3 | allergen2
You don't need to find all h4 tags in every item. So make a change like below:
...
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for column in item.find_all('p', class_ = 'menu-item__allergens'):
        col_name = column['class'][0].split('__')[1]
        if col_name not in final_table:
            final_table[col_name] = []
        final_table[col_name].append(column.text)
...
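For reference, here is a runnable miniature of that pattern against hypothetical markup (the class names mimic the ones on the page); dict.setdefault stands in for the if/not-in check:

```python
from bs4 import BeautifulSoup

# hypothetical minimal markup mimicking the menu-item structure on the page
html = """
<div class="menu-item-card"><p class="menu-item__allergens">Soy</p></div>
<div class="menu-item-card"><p class="menu-item__allergens">Wheat, Soy</p></div>
"""
soup = BeautifulSoup(html, 'html.parser')

final_table = {}
for item in soup.find_all('div', class_='menu-item-card'):
    for column in item.find_all('p', class_='menu-item__allergens'):
        col_name = column['class'][0].split('__')[1]   # 'menu-item__allergens' -> 'allergens'
        final_table.setdefault(col_name, []).append(column.text)

print(final_table)
```

Passing final_table to pd.DataFrame then builds one column per extracted class name.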

How to extract numerical (or string) data from wikipedia tables via webscraping?

I would like to use BeautifulSoup to webscrape data from wikipedia articles for the purpose of creating an HR Diagram. For the example below, I have chosen the star named Arcturus, though the purpose of the code is to be general enough to work for (almost?) any star. The rightmost table of the wikipedia page for each star contains all the information necessary to construct the diagram.
As an example, consider the wikipedia page for Arcturus. The spectral type can be found under the Characteristics subheader; the absolute magnitude can be found under the Astrometry subheader; the luminosity and temperature can be found under the Details subheader. Since all of this information is contained within the same main table, I tried the following:
import requests
from bs4 import BeautifulSoup
# import numpy as np
# import matplotlib.pyplot as plt
hyperlink = 'https://en.wikipedia.org/wiki/Arcturus'
webdata = requests.get(hyperlink)
soup = BeautifulSoup(webdata.text, 'lxml')
# print("\nPRETTY SOUP:\n{}\n".format(soup.prettify()))
res = []
right_table = soup.find('table', class_='infobox')
for row in right_table.findAll('tr'):
    cells = row.findAll('td')
    print("\n .. CELLS:\n{}\n".format(cells))
This code will run a separate print command for each row of the table. I used ctrl + f to find the occurrences of the word "temperature", from which I found the relevant print statement:
.. CELLS:
[<td><b>Temperature</b></td>, <td><span class="nowrap"><span data-sort-value="7003428600000000000♠"></span>4286<span style="margin-left:0.3em;margin-right:0.15em;">±</span>30</span><sup class="reference" id="cite_ref-ramirez_prieto_2011_7-3">[7]</sup> K</td>]
The actual value is 4286 ± 30 K. Is there an easy-to-generalize method to parse this html string? I would like to believe the methods to extract the other relevant parameters (such as spectral type) will not be much different.
If you want to extract just specific information, you could use this as example (using CSS selectors to obtain info):
import requests
from bs4 import BeautifulSoup
hyperlink = 'https://en.wikipedia.org/wiki/Arcturus'
webdata = requests.get(hyperlink)
soup = BeautifulSoup(webdata.text, 'lxml')
def remove_sup(tag):
    for s in tag.select('sup'):
        s.extract()
    return tag
spectral = remove_sup(soup.select_one(":matches(td, th):contains('Spectral') + td")).get_text(strip=True)
magnitude = remove_sup(soup.select_one(":matches(td, th):contains('Absolute') + td")).get_text(strip=True)
lum = remove_sup(soup.select_one(":matches(td, th):contains('Luminosity') + td")).get_text(strip=True)
temp = remove_sup(soup.select_one(":matches(td, th):contains('Temperature') + td")).get_text(strip=True)
print('{: <25}{}'.format('Spectral type :', spectral))
print('{: <25}{}'.format('Absolute magnitude :', magnitude))
print('{: <25}{}'.format('Luminosity :', lum))
print('{: <25}{}'.format('Temperature :', temp))
Prints:
Spectral type : K0 III
Absolute magnitude : −0.30±0.02
Luminosity : 170L☉
Temperature : 4286±30K
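If you then need the number separated from its uncertainty and unit, a small regex works on the extracted string; this is a sketch assuming the "value±uncertainty unit" shape shown above:

```python
import re

temp = '4286±30K'   # text produced by the Temperature selector above
m = re.match(r'([\d.]+)±([\d.]+)\s*(\S+)', temp)
value, uncertainty, unit = float(m.group(1)), float(m.group(2)), m.group(3)
print(value, uncertainty, unit)  # 4286.0 30.0 K
```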
You could use
for row in right_table.findAll('tr'):
    cells = ' '.join([i.get_text() for i in row.findAll('td')])
    print(cells)
But that will include the superscripts and subscripts, for example.

Python, Beautifulsoup - Extracting strings from tags based on items in list

I am trying to scrape the site https://www.livechart.me/winter-2019/tv to get the number of episodes that have currently aired for certain shows this season. I do this by extracting the "episode-countdown" tag data which gives something like "EP11:" then a timestamp after it, and then I slice that string to only give the number (in this case the "11") and then subtract by 1 to get how many episodes have currently aired (as the timestamp is for when EP11 will air).
I have a list of the different shows I am watching this season in order to filter what shows I extract the episode-countdown strings for instead of extracting the countdown for every show airing. The big problem I am having is that the "episode-countdown" strings are not in the same order as my list of shows I am watching. For example, if my list is [show1, show2, show3, show4], I might get the "episodes-countdown" string tag in the order of show3, show4, show1, show2 if they are listed in that order on the website.
My current code is as follows:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
def countdown():
    html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(html)
    soup = BeautifulSoup(page, 'html.parser')
    shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        if any(x in tag['data-romaji'] for x in shows):
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            print('{} has aired {} episodes so far'.format(tag['data-romaji'], int(r2)-1))
Each show listed on the website is inside of an "article" tag so for every show in the soup.find_all() statement, if the "data-romaji" (the name of the show listed on the website) matches a show in my "shows" list, then I extract the "episode-countdown" string and then slice the string to just the number as previously explained and then print to make sure I did it correctly.
If you go to the website, the order that the shows are listed are "Yakusoku no Neverland", "Mob Psycho", "Dororo", and "Jojo" which is the order that you get the episode-countdown strings in if you run the code. What I want to do is have it in order of my "shows" list so that I have a list of shows and a list of episodes aired that match each other. I want to add the episodes aired list as a column in a pandas dataframe I am currently building so having it not match the "shows" column would be a problem.
Is there a way for me to extract the "episode-countdown" string based on the order of my "shows" list instead of the order used on the website (if that makes sense)?
Is this what you're looking for?
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
html = Request('https://www.livechart.me/winter-2019/tv', headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(html)
soup = BeautifulSoup(page, 'html.parser')
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo', 'Mob Psycho 100 II', 'Yakusoku no Neverland']
master = []
for show in shows:
    for tag in soup.find_all('article', attrs={'class': 'anime'}):
        show_info = []
        if show in tag['data-romaji']:
            show_info.append(tag['data-romaji'])
            rlist = tag.find('div', attrs={'class': 'episode-countdown'}).text
            r2 = rlist[:rlist.index(":")][2:]
            show_info.append(r2)
            master.append(show_info)
df=pd.DataFrame(master,columns=['Show','Episodes'])
df
Output:
Show Episodes
0 Jojo no Kimyou na Bouken: Ougon no Kaze 23
1 Dororo 11
2 Mob Psycho 100 II 11
3 Yakusoku no Neverland 11
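An alternative sketch (with made-up countdown values, listed in site order) scrapes once into a dict keyed by title and then reads it back in watch-list order, which avoids re-scanning every article tag once per show:

```python
# hypothetical (show, episodes) pairs in the order the site lists them
site_order = [('Yakusoku no Neverland', '11'), ('Mob Psycho 100 II', '11'),
              ('Dororo', '11'), ('Jojo no Kimyou na Bouken: Ougon no Kaze', '23')]
shows = ['Jojo no Kimyou na Bouken: Ougon no Kaze', 'Dororo',
         'Mob Psycho 100 II', 'Yakusoku no Neverland']

episodes = dict(site_order)                  # title -> episode count
master = [[s, episodes[s]] for s in shows]   # reorder to match the watch list
print(master)
```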

python3 - how to scrape the data from span

I'm trying to use Python 3 and BeautifulSoup.
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.binance.com/pl"
#get the data
data = requests.get(url)
soup = BeautifulSoup(data.text,'lxml')
print(soup)
If I open the HTML code in the browser I can see:
[screenshot: HTML code in browser]
But in my data (printed to the console) I can't see the BTC price:
[screenshot: console output missing the data]
Could you give me some advice on how to scrape this data?
Use .findAll() to find all the rows, and then you can use it to find all the cells in a given row. You have to look at how the page is structured. It's not a standard row, but a bunch of divs made to look like a table. So you have to look at the role of each div to get to the data you want.
I'm assuming that you're going to want to look at specific rows, so my example uses the Para column to find those rows. Since the star is in its own little cell, the Para column is the second cell, or index 1. With that, it's just a question of which cells you want to export.
You could take out the filter if you want to get everything. You can also modify it to see if the value of a cell is above a certain price point.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
# Ignore the insecure warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Set options and which rows you want to look at
url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']
# Get the page and convert it into beautiful soup
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all table rows
rows = soup.findAll('div', {'role':'row'})
# Process all the rows in the table
for row in rows:
    try:
        # Get the cells for the given row
        cells = row.findAll('div', {'role':'gridcell'})
        # Convert them to just the values of the cell, ignoring attributes
        cell_values = [c.text for c in cells]
        # see if the row is one you want
        if cell_values[1] in desired_rows:
            # Output the data however you'd like
            print(cell_values[1], cell_values[-1])
    except IndexError:  # there was a row without cells
        pass
This resulted in the following output:
ADA/BTC 1,646.39204255
ADX/BTC 35.29384873
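Note that the scraped values are strings with thousands separators; if you want to compare them numerically, for example against a price threshold, strip the commas before converting:

```python
raw = '1,646.39204255'                 # a scraped cell value
value = float(raw.replace(',', ''))    # drop separators, then convert
print(value)
```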

Python & Beautiful Soup - Searching result strings

I am using Beautiful Soup to parse an HTML table.
Python version 3.2
Beautiful Soup version 4.1.3
I am running into an issue when trying to use the findAll method to find the columns within my rows. I get an error that says "'list' object has no attribute 'findAll'". I found this method through another post on Stack Exchange, and it was not an issue there. (BeautifulSoup HTML table parsing)
I realize that findAll is a method of BeautifulSoup, not of Python lists. The weird part is that the findAll method works when I find the rows within the table list (I only need the 2nd table on the page), but fails when I attempt to find the columns within the rows list.
Here's my code:
from urllib.request import URLopener
from bs4 import BeautifulSoup
opener = URLopener() #Open the URL Connection
page = opener.open("http://www.labormarketinfo.edd.ca.gov/majorer/countymajorer.asp?CountyCode=000001") #Open the page
soup = BeautifulSoup(page)
table = soup.findAll('table')[1] #Get the 2nd table (index 1)
rows = table.findAll('tr') #findAll works here
cols = rows.findAll('td') #findAll fails here
print(cols)
findAll() returns a result list; you'd need to loop over it, or pick one element, to get to another contained element with its own findAll() method:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    print(cols)
or pick one row:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
cols = rows[0].findAll('td')  # columns of the *first* row
print(cols)
Note that findAll is deprecated, you should use find_all() instead.
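For completeness, here is the same row-by-row pattern in the find_all() spelling, run against a toy two-table document:

```python
from bs4 import BeautifulSoup

# toy markup standing in for a page with two tables
html = """
<table><tr><td>skip me</td></tr></table>
<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table')[1]            # the second table, index 1
for row in table.find_all('tr'):             # loop over the ResultSet row by row
    cols = [td.text for td in row.find_all('td')]
    print(cols)
```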
