Python & Beautiful Soup - Searching result strings

I am using Beautiful Soup to parse an HTML table.
Python version 3.2
Beautiful Soup version 4.1.3
I am running into an issue when trying to use the findAll method to find the columns within my rows. I get an error that says 'list' object has no attribute 'findAll'. I found this method through another post on Stack Overflow, where it was not an issue (BeautifulSoup HTML table parsing).
I realize that findAll is a method of BeautifulSoup, not of Python lists. The weird part is that findAll works when I find the rows within the table list (I only need the 2nd table on the page), but fails when I attempt to find the columns in the rows list.
Here's my code:
from urllib.request import URLopener
from bs4 import BeautifulSoup
opener = URLopener() #Open the URL Connection
page = opener.open("http://www.labormarketinfo.edd.ca.gov/majorer/countymajorer.asp?CountyCode=000001") #Open the page
soup = BeautifulSoup(page)
table = soup.findAll('table')[1] #Get the 2nd table (index 1)
rows = table.findAll('tr') #findAll works here
cols = rows.findAll('td') #findAll fails here
print(cols)

findAll() returns a ResultSet (a list of matching elements); you'd need to loop over it or pick one element to get to a contained element with its own findAll() method:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    print(cols)
or pick one row:
table = soup.findAll('table')[1]
rows = table.findAll('tr')
cols = rows[0].findAll('td') # columns of the *first* row.
print(cols)
Note that findAll is deprecated; you should use find_all() instead.
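For reference, a minimal sketch of the same traversal with the non-deprecated spelling, collecting the text of every cell in the second table (same page and structure assumed as above):
table = soup.find_all('table')[1]
data = [[td.get_text(strip=True) for td in tr.find_all('td')]  # cell texts per row
        for tr in table.find_all('tr')]
print(data)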

Related

Python - issues with for loops to create dataframe with BeautifulSoup scrape

I'm a beginner in Python and I'm trying to create a new dataframe using BeautifulSoup to scrape a webpage. I'm following some code that worked on a different page, but it's not working here. My final table of data is blank, so it seems nothing is being appended. Any help is appreciated. This is what I've done:
from bs4 import BeautifulSoup
import requests
import pandas as pd
allergens = requests.get(url = 'http://guetta.com/diginn/allergens/')
allergens = BeautifulSoup(allergens.content)
items = allergens.find_all('div', class_ = 'menu-item-card')
final_table = {}
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for row in item.find_all('h4', recursive = False)[0:]:
        for column in row.find_all('p', class_ = 'menu-item__allergens'):
            col_name = column['class'][0].split('__')[1]
            if col_name not in final_table:
                final_table[col_name] = []
            final_table[col_name].append(column.text)
df_allergens = pd.DataFrame(final_table)
This returns nothing. No errors, just empty brackets. I was able to retrieve each element individually, so I think the items should work, but obviously I'm missing something.
Edit:
Here is what the output needs to be:
Item Name   | Allergens
Classic Dig | Soy
Item2       | allergen1, allergen2
Item3       | allergen2
You don't need to find all h4 tags in every item. So make a change like below:
...
for item in allergens.find_all('div', class_ = 'menu-item-card'):
    for column in item.find_all('p', class_ = 'menu-item__allergens'):
        col_name = column['class'][0].split('__')[1]
        if col_name not in final_table:
            final_table[col_name] = []
        final_table[col_name].append(column.text)
...
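If the goal is the two-column output shown in the question (item name plus its allergens), here is a sketch of one way to assemble it; this assumes each menu-item-card div contains an h4 with the item name alongside the menu-item__allergens paragraphs (class names taken from the question, untested against the live page):
rows = []
for item in allergens.find_all('div', class_='menu-item-card'):
    name_tag = item.find('h4')  # item name, if the card has one
    name = name_tag.get_text(strip=True) if name_tag else ''
    allergen_texts = [p.get_text(strip=True)
                      for p in item.find_all('p', class_='menu-item__allergens')]
    rows.append({'Item Name': name, 'Allergens': ', '.join(allergen_texts)})
df_allergens = pd.DataFrame(rows)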

Taking Average of List of Integers

I'm scraping a list of daily stock volume numbers, and I want to take an average of the first 20 results in the volume column of the page. My code looks like:
from bs4 import BeautifulSoup
import re, csv, random, time
import pandas as pd
import os
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.select('table[class="W(100%) M(0)"] tr')
for row in rows[1:20]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    print(numbers2)
avg20vol = sum(numbers2(1,20))/len(numbers2)
...but I'm getting stuck when trying to take the average of the returned numbers2. I receive either "TypeError: 'int' object is not callable" or "TypeError: 'int' object is not iterable" with the solutions I've tried. How do I handle taking an average of a list? Does it involve turning it into a dataframe first? Thanks!
UPDATE
Here's a working example of the applicable code segment:
numberslist=[]
for row in rows[1:21]:
    col = row.find_all("td")
    numbers = col[6].text.replace(',', '')
    numbers2 = int(numbers)
    numberslist.append(numbers2)
    print(numbers2)
average = sum(numberslist)/len(numberslist)
print('Average = ',average)
When scraping, actually create a list of numbers, like so:
# stuff before
number_list = [] # empty list
for row in rows[1:20]:
    # get the number
    number_list.append(int(number_as_string)) # add the new number at the end of the list
average = sum(number_list)/len(number_list)
You can also .append() the string forms and then transform them to ints with list(map(int, list_of_strings)) or [int(x) for x in list_of_strings].
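For instance, a quick sketch of both conversions (input values are hypothetical):
list_of_strings = ['650100', '370500']
numbers = list(map(int, list_of_strings))    # [650100, 370500]
numbers = [int(x) for x in list_of_strings]  # same result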
Note: rows[1:20] will leave out the first item and yield only 19 rows; in your case the first row is a header, so rows[1:21] gives the first 20 data rows (as in your update). Use rows[:20] to get the first 20 items in general.
Your CSS selector is also wrong and gave me an error.
from bs4 import BeautifulSoup
import requests
page = requests.get('https://finance.yahoo.com/quote/BDSI/history?period1=1517033117&period2=1548569117&interval=1d&filter=history&frequency=1d')
soup = BeautifulSoup(page.text, 'html.parser')
rows = soup.find('table',class_="W(100%) M(0)").find_all('tr')
numbers=[]
for row in rows[1:20]:
    col = row.find_all("td")
    print(col[6].text)
    number = col[6].text.replace(',', '')
    number = int(number)
    numbers.append(number)
avg20vol = sum(numbers)/len(numbers)
print("Average: ",avg20vol)
Output
650,100
370,500
374,700
500,700
452,500
1,401,800
2,071,200
1,005,800
441,500
757,000
901,200
563,400
1,457,000
637,100
692,700
725,000
709,000
1,155,500
496,400
Average: 808584.2105263158
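As an aside (not part of the answer above), the standard library's statistics module computes the same mean; a sketch using the numbers list built in the loop:
from statistics import mean
avg20vol = mean(numbers)  # equivalent to sum(numbers)/len(numbers)
print("Average: ", avg20vol)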

python3 - how to scrape the data from span

I'm trying to use Python 3 and BeautifulSoup.
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.binance.com/pl"
#get the data
data = requests.get(url)
soup = BeautifulSoup(data.text,'lxml')
print(soup)
If I open the HTML in the browser I can see the BTC price, but in the data printed to the console I can't see it (screenshots omitted).
Could you give me some advice on how to scrape this data?
Use .findAll() to find all the rows, and then you can use it to find all the cells in a given row. You have to look at how the page is structured. It's not a standard row, but a bunch of divs made to look like a table. So you have to look at the role of each div to get to the data you want.
I'm assuming that you're going to want to look at specific rows, so my example uses the Para column to find those rows. Since the star is in its own little cell, the Para column is the second cell, or index 1. With that, it's just a question of which cells you want to export.
You could take out the filter if you want to get everything. You can also modify it to check whether the value of a cell is above a certain price point (a sketch follows the output below).
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
# Ignore the insecure warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Set options and which rows you want to look at
url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']
# Get the page and convert it into beautiful soup
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all table rows
rows = soup.findAll('div', {'role':'row'})
# Process all the rows in the table
for row in rows:
    try:
        # Get the cells for the given row
        cells = row.findAll('div', {'role':'gridcell'})
        # Convert them to just the values of the cell, ignoring attributes
        cell_values = [c.text for c in cells]
        # see if the row is one you want
        if cell_values[1] in desired_rows:
            # Output the data however you'd like
            print(cell_values[1], cell_values[-1])
    except IndexError: # there was a row without cells
        pass
This resulted in the following output:
ADA/BTC 1,646.39204255
ADX/BTC 35.29384873
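A sketch of the price-point variant mentioned above, replacing the desired_rows filter (the threshold value is hypothetical; assumes the last cell holds a numeric string with comma separators):
# inside the try block of the loop above, instead of the desired_rows check:
threshold = 100.0  # hypothetical cutoff
value = float(cell_values[-1].replace(',', ''))
if value > threshold:
    print(cell_values[1], value)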

How to extract data from multiple dt and dd tags in tabled form (within a looped statement) using python v3 beautiful soup v4?

Source:
I’ve only chosen one year for simplicity, but my intention is for all years (n=117).
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/
(2018 only)
https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018/
Resources:
I’ve found 2 blogs and 2 Stack Overflow threads that have steered my attempts to replicate their work, but my lack of experience and the uniqueness of the website and task have proven difficult. I’ve tried next_siblings a little, but to no success.
Blog #1: Extract tabled data as a table:
https://journalistsresource.org/tip-sheets/research/python-scrape-website-data-criminal-justice
https://gist.github.com/phillipsm/404780e419c49a5b62a8
Blog #2: Extract data from tags into a table:
https://www.dataquest.io/blog/web-scraping-beautifulsoup/
Stack Overflow thread #1: Using BeautifulSoup to extract specific dl and dd list elements
Stack Overflow thread #2: Use BeautifulSoup to get a value after a specific tag
Problems encountered:
1. Each year’s publications have different “Additional Publication Details”. To help with this, I can run the code I have and compile the unique dt tag text headers (not in tabled form) to make sure all are captured for 2018. But again, doing this for all years would take time…right? I'll add it in a comment if necessary.
2. For statements…I find I keep having to nest “for” statements to get to the final webpage where the publication details live (a minimum of 2 links deep). This seems restricting in what/how I can return data, and without limiting repeated returns ([:1]), my code can very easily fail (whether it’s from the source server or what have you).
3. I have to first extract the dt element text, then extract the dd element text (see the pairing sketch after the desired result below).
Code:
(The commented-out dt element grab and print statements are only for my record keeping of what’s being done. Again, I compiled unique dt element text headers for reference…see the comment above. Apologies upfront if my code is ‘dizzying’…)
import requests
from bs4 import BeautifulSoup
import csv
import re
import time
url = 'https://pubs.er.usgs.gov/browse/Report/USGS%20Numbered%20Series/Open-File%20Report'
url2 = 'https://pubs.er.usgs.gov'
response = requests.get(url)
data = response.text
pubti_links = []
soup = BeautifulSoup(data, "html.parser")
type(soup)
year_containers = soup.findAll('li',{'class':'pubs-browse-list-theme'})
for year in year_containers[:1]:
    for a in soup.findAll('a'):
        if '/browse/Report/USGS%20Numbered%20Series/Open-File%20Report/2018' in a['href']:
            link_containers = a.get('href')
            #print (link_containers)
            pubti_links = url2 + link_containers
            #print (pubti_links)
for pubti_link in pubti_links[:1]:
    response2 = requests.get(pubti_links)
    soup2 = BeautifulSoup(response2.text, "html.parser")
    time.sleep(2)
    for elm in soup2.find_all('li',{'class':'pubs-browse-list-theme'}):
        for a_elm in elm.findAll('a'):
            #print(a.get('href'))
            pub_containers = a_elm.get('href')
            pubdetails_links = url2 + pub_containers
            response3 = requests.get(pubdetails_links)
            soup3 = BeautifulSoup(response3.text, "html.parser")
            pubdetail_containers = soup3.findAll('dd',{'class':["","dark"]})
            dd_data = soup3.findAll('dd',{'class':["","dark"]})
            #dt_data = soup3.findAll('dt',{'class':["","dark"]})
            for dd_item in dd_data:
                print(dd_item.string)
            #for dt_item in dt_data:
            #    print (dt_item.string)
Desired result (the goal is to create a table of all USGS publications for each year): an output table pairing each dt header with its dd value, one row per publication (example image omitted).
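On problem #3, a minimal sketch of pairing each dt header with the dd value that follows it on a single detail page, so the pairs can later become table columns (the URL is a hypothetical placeholder; assumes the details sit in dt/dd siblings as on the publication pages described above):
import requests
from bs4 import BeautifulSoup

detail_url = 'https://pubs.er.usgs.gov/publication/...'  # hypothetical detail page
soup = BeautifulSoup(requests.get(detail_url).text, 'html.parser')
details = {}
for dt in soup.find_all('dt'):
    dd = dt.find_next_sibling('dd')  # the dd immediately after this dt
    if dd is not None:
        details[dt.get_text(strip=True)] = dd.get_text(strip=True)
print(details)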

Data getting scraped in a single column instead of table format

I've written a script in Python using selenium to parse data out of a table from a webpage. However, when I run it I get the scraped data in a single column instead of a table format. What change should I make in my script to get the data in a table format? Here is what I've tried so far:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://fantasy.premierleague.com/player-list/")
table_data = driver.find_elements_by_xpath("//table[@class='ism-table']")[0]
for item in table_data.find_elements_by_xpath(".//td"):
    print(item.text)
driver.quit()
What I meant by table format is several columns per row, like the table on the page (example omitted). However, I'm getting the data in a single column instead.
Try:
for item in table_data.find_elements_by_xpath(".//tr"):
    print(item.text.split())
It will give you a list for each player separately.
Notice that the tag in .find_elements_by_xpath() is changed (tr instead of td).
Further, you can make a readable table like this:
...(your previous code)...
data=[]
for item in table_data.find_elements_by_xpath(".//tr"):
    data.append(item.text.split())
format_table = '{:8s}' + 4 * '{:>10s}'
for lst in data:
    print(format_table.format(*lst))
Another version (to properly catch names with whitespace like "de Goa"):
data=[]
temp=[]
for item in table_data.find_elements_by_xpath(".//tr"):
    for i in item.find_elements_by_xpath('td'):
        temp.append(i.text)
    data.append(temp)
    temp=[]
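As an alternative approach (not from the answer above), pandas can parse the rendered table straight from the page source, which may be the quickest route to a proper table; a sketch assuming the ism-table is the first table on the page:
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://fantasy.premierleague.com/player-list/")
tables = pd.read_html(driver.page_source)  # parses every <table>; requires lxml or html5lib
driver.quit()
print(tables[0].head())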
