python beautiful soup table? - python-3.x

How can I extract individual values from this table? I need the date, time, play and reserve values.
Every attempt gives me one list containing the whole table, and I don't know how to pick the individual values out of it.
Thank you very much for your help.
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>

First, you should post some code that you've tried yourself. But anyway, here's another way for you.
from simplified_scrapy import SimplifiedDoc,req
html = '''
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
'''
doc = SimplifiedDoc(html)
# First method
table = doc.getTable('table')
print (table)
# Second method
table = doc.getElement('table', attr='class', value='list').trs.children.text
print (table)
Result:
[['Date', 'Time', 'Play', 'Tickets', ''], ['Th 03. 09. 2020', '19:00', 'Racek', '4', 'reserve']]
[['Date', 'Time', 'Play', 'Tickets', ''], ['Th 03. 09. 2020', '19:00', 'Racek', '4', 'reserve']]
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

This script will parse the table with BeautifulSoup and then print individual rows to screen:
import re
from bs4 import BeautifulSoup
html = '''
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
all_data = []
for row in soup.select('tr'):
    all_data.append([re.sub(r'\s{2,}', ' ', d.get_text(strip=True)) for d in row.select('td, th')])
# print data to screen:
# print header:
print('{:<25}{:<15}{:<15}{:<15}{:<15}'.format(*all_data[0]))
# print rows:
for date, time, play, tickets, reserve in all_data[1:]:
    print('{:<25}{:<15}{:<15}{:<15}{:<15}'.format(date, time, play, tickets, reserve))
Prints:
Date                     Time           Play           Tickets
Th 03. 09. 2020          19:00          Racek          4              reserve
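If the goal is to work with the individual values rather than print a formatted table, one option is to turn each data row into a dictionary. The following is a minimal sketch, assuming every data row carries the class `t1` as in the question's HTML. The helper name `parse_rows` and the dictionary keys are illustrative choices, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = '''
<table class="list">
<tr class="head">
<th>Date</th><th>Time</th><th>Play</th><th>Tickets</th><th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
'''


def parse_rows(markup):
    """Return one dict per data row (hypothetical helper for illustration)."""
    soup = BeautifulSoup(markup, "html.parser")
    performances = []
    for row in soup.select("table.list tr.t1"):
        # The date sits in a <th>, the other values in <td> cells
        cells = row.find_all(["th", "td"])
        link = row.find("a", href=True)
        performances.append({
            # " ".join(text.split()) collapses newlines and repeated spaces
            "date": " ".join(cells[0].get_text().split()),
            "time": cells[1].get_text(strip=True),
            "play": cells[2].get_text(strip=True),
            "tickets": cells[3].get_text(strip=True),
            "reserve": link["href"] if link else None,
        })
    return performances


for performance in parse_rows(html):
    print(performance)
```

Each value is then available by name, for example `performance["play"]`, and the reserve entry holds the link target instead of the link text.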

A simple way using pandas
import pandas as pd
table = """
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618" title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
"""
df = pd.read_html(table)[0]
Then you can access the data within "df"
df["Date"]
# 0 Th 03. 09. 2020
# Name: Date, dtype: object
df["Time"]
# 0 19:00
# Name: Time, dtype: object
df["Play"]
# 0 Racek
# Name: Play, dtype: object
df["Tickets"]
# 0 4

Related

Colspan not working properly using Python Pandas

I have some data that I need to convert into an Excel sheet, which should end up looking like this:
I've tried the following code:
import pandas as pd
result = pd.read_html(
"""<table>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</table>"""
)
writer = pd.ExcelWriter('out/test_pd.xlsx', engine='xlsxwriter')
print(result[0])
result[0].to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
The issue is that the colspan is not handled properly. The output looks like this instead:
Can someone help me with how to use colspan in pandas?
I would prefer not to use read_html() and to do it directly in Python code, but if that's not possible, I can use read_html().
pandas cannot tell the header rows from the data rows in this HTML, so you have to mark them yourself. Once the HTML uses the standard structure, pandas handles it correctly: wrap the header rows in thead and the data rows in tbody, like this.
result = pd.read_html("""
<table>
<thead>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</tbody>
</table>
"""
)
To write the DataFrame to an Excel file, you can use the pandas to_excel method.
result[0].to_excel("out.xlsx")
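Since the question asks whether the same layout can be built without read_html(), here is a minimal sketch that constructs the two-level header directly with a MultiIndex. The function name `build_status_frame` is hypothetical, and the values are the ones from the question's table.

```python
import pandas as pd


def build_status_frame():
    """Build the question's table with a two-level column header."""
    # The first level plays the role of the colspan="2" header cell
    columns = pd.MultiIndex.from_tuples(
        [("Status N", "Merchant"), ("Status N", "Count")]
    )
    return pd.DataFrame([["John Doe", 10]], columns=columns)


df = build_status_frame()
print(df)
```

Calling `df.to_excel("out.xlsx")` on such a frame should write "Status N" as a header spanning both columns, provided an Excel engine such as openpyxl or xlsxwriter is installed. Some pandas versions raise NotImplementedError when `index=False` is combined with MultiIndex columns, which is why the call here keeps the index.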

python bs4 find data in class

Good day,
How do I print all the data in this row?
<tr class="t1">
<th>Th
27. 02. 2020</th>
<td>
19:00</td>
<td>Záskok</td>
<td class="center">0</td>
<td>
My code is:
for tr in soup.find_all("tr", class_="t1"):
    date = soup.find("th").text
    name = soup.find("td", text="Záskok").text
    number = soup.find("td", text="Záskok").find_next_sibling("td").text
    print(date, ":",name, ":", number)
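The loop in the question calls `soup.find(...)` on every iteration, which always searches the whole document and therefore returns the same first match each time. A minimal sketch of a per-row lookup follows. The first row of the sample markup comes from the question; the second row and the closing tags are made up for illustration.

```python
from bs4 import BeautifulSoup

# First row is from the question; the second row and the closing tags are
# assumptions added so the loop has more than one row to visit.
html = '''
<table>
<tr class="t1">
<th>Th
27. 02. 2020</th>
<td>
19:00</td>
<td>Záskok</td>
<td class="center">0</td>
</tr>
<tr class="t1">
<th>Fr
28. 02. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
</tr>
</table>
'''

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr", class_="t1"):
    # Search inside the current row (tr), not the whole document (soup)
    date = " ".join(tr.find("th").get_text().split())
    time, name, number = [td.get_text(strip=True) for td in tr.find_all("td")[:3]]
    rows.append((date, time, name, number))
    print(date, ":", name, ":", number)
```

Reading the cells by position also avoids hard-coding the play name "Záskok" in the lookup.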

beautiful soup not to parse nested table data

I have a nested table structure. I am using the below code for parsing the data.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        dataset = td.get_text()
The problem is that with nested tables (in my case there are tables inside <td></td>), the inner cells are parsed a second time, because find_all("tr") and find_all("td") search all descendants. How can I avoid parsing the nested table again once it has already been parsed?
Input:
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>
Expected Output:
1 2
3 4
5
11 22
6
But what I am getting is:
1 2
3 4
5
11 22
11 22
6
That is, the inner table is parsed again.
Specs:
beautifulsoup4==4.6.3
Data order should be preserved and content could be anything including any alphanumeric characters.
Using a combination of bs4 and re, you might achieve what you want.
I am using bs4 4.6.3
from bs4 import BeautifulSoup as bs
import re
html = '''
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
ans = []
for x in soup.findAll('td'):
    if x.findAll('td'):
        for y in re.split('<table>.*</table>', str(x)):
            ans += re.findall(r'\d+', y)
    else:
        ans.append(x.text)
print(ans)
For each td, we test whether it contains a nested td. If so, we split its markup on the inner table and use a regex to collect every number from the remaining pieces.
Note that this only works for two levels of nesting, but it can be adapted to any depth.
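One way to handle any nesting depth while keeping the document order is to walk only the direct children at each level and recurse when a nested table is found. This is a sketch of a different technique from the regex answer above, not a drop-in extension of it. It assumes the `html.parser` backend (which does not insert `<tbody>` elements) and nested tables that are direct children of their `<td>`. The function name `table_lines` is made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '''
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''


def table_lines(table):
    """Return the table's content as lines, descending into nested tables."""
    lines = []
    for tr in table.find_all("tr", recursive=False):
        cells = []
        for td in tr.find_all("td", recursive=False):
            if td.find("table") is None:
                # Plain cell: collect it for the current output line
                cells.append(td.get_text(strip=True))
                continue
            # Cell with a nested table: emit what was collected so far,
            # then walk the cell's direct children in document order
            if cells:
                lines.append(" ".join(cells))
                cells = []
            for node in td.children:
                if isinstance(node, Tag) and node.name == "table":
                    lines.extend(table_lines(node))
                else:
                    text = node.get_text(strip=True) if isinstance(node, Tag) else node.strip()
                    if text:
                        lines.append(text)
        if cells:
            lines.append(" ".join(cells))
    return lines


soup = BeautifulSoup(html, "html.parser")
result = table_lines(soup.find("table"))
print("\n".join(result))
```

Because `recursive=False` restricts each search to direct children, the inner rows are only visited through the recursive call and never a second time by the outer loop.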
I tried the findChildren() method and managed to produce the expected output. I am not sure whether this will help in other circumstances.
from bs4 import BeautifulSoup
data='''<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup=BeautifulSoup(data,'html.parser')
for child in soup.find('table').findChildren("tr" , recursive=False):
    tdlist = []
    if child.find('table'):
        for td in child.findChildren("td", recursive=False):
            print(td.next_element.strip())
            for td1 in td.findChildren("table", recursive=False):
                for child1 in td1.findChildren("tr", recursive=False):
                    for child2 in child1.findChildren("td", recursive=False):
                        tdlist.append(child2.text)
                    print(' '.join(tdlist))
                    print(child2.next_element.next_element.strip())
    else:
        for td in child.findChildren("td" , recursive=False):
            tdlist.append(td.text)
        print(' '.join(tdlist))
Output:
1 2
3 4
5
11 22
6
EDITED for Explanation
Step 1:
When findChildren() is called on the table, it first returns 3 records.
for child in soup.find('table').findChildren("tr", recursive=False):
    print(child)
Output:
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
Step 2:
Check whether any child contains a <table> tag, and if so handle it separately.
if child.find('table'):
Step 3:
Follow step 1 and use findChildren() to get the <td> tags.
Once you have a <td>, follow step 1 again to get its children.
Step 4:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
next_element returns the first text node inside the tag, so in this case it returns the value 5.
Step 5:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
    for td1 in td.findChildren("table", recursive=False):
        for child1 in td1.findChildren("tr", recursive=False):
            for child2 in child1.findChildren("td", recursive=False):
                tdlist.append(child2.text)
            print(' '.join(tdlist))
            print(child2.next_element.next_element.strip())
As you can see, I have simply repeated step 1 at each level. I also used child2.next_element.next_element to get the value 6 that follows the </table> tag.
You can check if another table exists inside a td tag, if it exists then simply skip that td, otherwise use it as a regular td.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        if td.find('table'): # check if td has nested table
            continue
        dataset = td.get_text()
For your example, with bs4 4.7.1 I use :has and :not to exclude rows that contain a nested table from the loop:
from bs4 import BeautifulSoup as bs
html = '''
<table>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>11</td>
<td>22</td>
</tr>
</table>
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('tr:not(:has(table))'):
    print([td.text for td in tr.select('td')])

How do I use Python csv to write multiple beautifulsoup table rows for only two specific columns?

I am wanting to use beautifulsoup to scrape HTML to pull out only two columns from every row in one table. However, each "tr" row has 10 "td" cells, and I only want the [1] and [8] "td" cell from each row. What is the most pythonic way to do this?
From my input below I've got one table, one body, three rows, and 10 cells per row.
Input
<table id ="tblMain">
<tbody>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
Things I Have Tried
I understand how to use the index of the cells in order to loop through and get "td" at [1] and [8]. However, I'm getting all confused when trying to get that data on one line written back to the csv.
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
data1_columns = []
data2_columns = []
for row in rows[1:]:
    data1 = row.findAll('td')[1]
    data1_columns.append(data1.text)
    data2 = row.findAll('td')[8]
    data2_columns.append(data2.text)
This is my current code which finds the table, rows, and all "td" cells and prints them correctly to a .csv. However, instead of writing all ten "td" cells per row back to the csv line, I just want to grab "td"[1] and "td"[8].
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll("td"):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
Expected Results
I want to write "td"[1] and "td"[8] to csv_row so that each list is written to the csv with writer.writerow.
Writing row back to csv_row which then writes to my csv file:
['data1', 'data2']
['data1', 'data2']
['data1', 'data2']
You almost did it
for row in rows:
    row = row.findAll("td")
    csv_row = [row[1].get_text(), row[8].get_text()]
    writer.writerow(csv_row)
Full code
html ='''<table id ="tblMain">
<tbody>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
'''
from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
reportname = 'output'
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        row = row.findAll("td")
        csv_row = [row[1].get_text(), row[8].get_text()]
        writer.writerow(csv_row)
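Indexing `row[1]` and `row[8]` raises an IndexError on any row with fewer than nine cells, such as a header or spacer row. The following is a minimal sketch of the same idea with a length check, assuming such short rows should simply be skipped. The sample markup is a shortened stand-in for the real page, and `pick_columns` and `WANTED` are names invented for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened stand-in for the real page; the second row is deliberately too short.
html = '''<table id="tblMain"><tbody>
<tr><td>a0</td><td>data1</td><td>a2</td><td>a3</td><td>a4</td>
    <td>a5</td><td>a6</td><td>a7</td><td>data2</td><td>a9</td></tr>
<tr><td>short row</td></tr>
</tbody></table>'''

WANTED = (1, 8)  # zero-based positions of the cells to keep


def pick_columns(markup, wanted=WANTED):
    """Return the wanted cells of every row that is long enough."""
    soup = BeautifulSoup(markup, "html.parser")
    picked = []
    for tr in soup.select("#tblMain tr"):
        cells = tr.find_all("td")
        if len(cells) <= max(wanted):
            continue  # skip rows that lack the wanted cells
        picked.append([cells[i].get_text(strip=True) for i in wanted])
    return picked


buffer = io.StringIO()  # stands in for open(filename, "w", newline="")
csv.writer(buffer).writerows(pick_columns(html))
print(buffer.getvalue())
```

Keeping the column positions in one tuple means a third column can be added without touching the loop.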
You should be able to use nth-of-type pseudo class css selector
from bs4 import BeautifulSoup as bs
import pandas as pd
html = 'actualHTML'
soup = bs(html, 'lxml')
results = []
for row in soup.select('#tblMain tr'):
    out_row = [item.text.strip() for item in row.select('td:nth-of-type(2), td:nth-of-type(9)')]
    results.append(out_row)
df = pd.DataFrame(results)
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
Whenever I need to pull a table and it has the <table> tag, I let pandas do the work for me, then just manipulate the dataframe it returns if needed. That's what I would do here:
html = '''<table id ="tblMain">
<tbody>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>'''
import pandas as pd
# .read_html() returns a list of dataframes
tables = pd.read_html(html)
# we want the dataframe from that list in position [0]
df = tables[0]
# Use .iloc to say I want all the rows, and columns 1, 8
df = df.iloc[:,[1,8]]
# Write the dataframe to file
df.to_csv('path.filename.csv', index=False)

How to scrape for specific tables and specific rows/cells of data python

So this is my first python project and my goal is to scrape the final score from last night's Mets game and send it to a friend through twilio, but right now I'm having issues with extracting the scores from this website:
http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true
The scraper below works but it obviously finds all the tables/rows/cells rather than the one I want. When I look at the html code for the each table, they're all the same:
<table class="shsTable shsLinescore" cellspacing="0">
My question is how can I scrape a specific table if the class attribute for all the games are the same?
from bs4 import BeautifulSoup
import urllib
import urllib.request
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
playerdatasaved =""
soup = make_soup("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true")
for row in soup.findAll('tr'): #finds all rows
    playerdata=""
    for data in row.findAll('td'):
        playerdata = playerdata+","+data.text
    playerdatasaved =playerdatasaved+"\n" +playerdata[1:]
print(playerdatasaved)
Use the team name which is in the text of the anchors with the teamName class, find that then pull the previous table:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
table = soup.find("a",class_="teamName", text="NY Mets").find_previous("table")
for row in table.find_all("tr"):
    print(row.find_all("td"))
Which gives you:
[<td style="text-align: left">Final</td>, <td class="shsTotD">1</td>, <td class="shsTotD">2</td>, <td class="shsTotD">3</td>, <td class="shsLinescoreSpacer">\xa0</td>, <td class="shsTotD">4</td>, <td class="shsTotD">5</td>, <td class="shsTotD">6</td>, <td class="shsLinescoreSpacer">\xa0</td>, <td class="shsTotD">7</td>, <td class="shsTotD">8</td>, <td class="shsTotD">9</td>, <td class="shsLinescoreSpacer">\xa0</td>, <td class="shsTotD">R</td>, <td class="shsTotD">H</td>, <td class="shsTotD">E</td>]
[<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsMLBteam7sm_trans"></span></span><a class="teamName" href="/mlb/teamstats.asp?team=07&type=teamhome">Kansas City</a></td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td></td>, <td class="shsTotD">0</td>, <td class="shsTotD">1</td>, <td class="shsTotD">0</td>, <td></td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td></td>, <td class="shsTotD">1</td>, <td class="shsTotD">7</td>, <td class="shsTotD">0</td>]
[<td class="shsNamD" nowrap=""><span class="shsLogo"><span class="shsMLBteam21sm_trans"></span></span><a class="teamName" href="/mlb/teamstats.asp?team=21&type=teamhome">NY Mets</a></td>, <td class="shsTotD">1</td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td></td>, <td class="shsTotD">1</td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td></td>, <td class="shsTotD">0</td>, <td class="shsTotD">0</td>, <td class="shsTotD">x</td>, <td></td>, <td class="shsTotD">2</td>, <td class="shsTotD">6</td>, <td class="shsTotD">1</td>]
To get the score data:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
table = soup.find("a",class_="teamName", text="NY Mets").find_previous("table")
a, b = [a.text for a in table.find_all("a",class_="teamName")]
inn, a_score, b_score = ([td.text for td in row.select("td.shsTotD")] for row in table.find_all("tr"))
print(" ".join(inn))
print("{}: {}".format(a, " ".join(a_score)))
print("{}: {}".format(b, " ".join(b_score)))
Which gives you:
1 2 3 4 5 6 7 8 9 R H E
Kansas City: 0 0 0 0 1 0 0 0 0 1 7 0
NY Mets: 1 0 0 1 0 0 0 0 x 2 6 1
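Since the stated goal is the final score, the runs column can be picked out by its "R" label instead of by position. Below is a minimal sketch run against a trimmed sample modelled on the rows printed above; the live page may differ or may no longer be available. The function name `final_score` is hypothetical, and `find_parent` is used here because the team link sits inside the table in this sample.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the rows printed above (innings omitted);
# this is an assumption about the structure, not a copy of the live page.
html = '''
<table class="shsTable shsLinescore">
<tr><td>Final</td><td class="shsTotD">R</td><td class="shsTotD">H</td><td class="shsTotD">E</td></tr>
<tr><td><a class="teamName">Kansas City</a></td><td class="shsTotD">1</td><td class="shsTotD">7</td><td class="shsTotD">0</td></tr>
<tr><td><a class="teamName">NY Mets</a></td><td class="shsTotD">2</td><td class="shsTotD">6</td><td class="shsTotD">1</td></tr>
</table>
'''


def final_score(markup, team):
    """Return {team name: runs} for the game that `team` played, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    anchor = soup.find("a", class_="teamName", string=team)
    if anchor is None:
        return None
    table = anchor.find_parent("table")
    header, *team_rows = table.find_all("tr")
    # Locate the runs column by its header label rather than a fixed index
    labels = [td.get_text(strip=True) for td in header.select("td.shsTotD")]
    runs_at = labels.index("R")
    score = {}
    for row in team_rows:
        name = row.find("a", class_="teamName").get_text(strip=True)
        values = [td.get_text(strip=True) for td in row.select("td.shsTotD")]
        score[name] = int(values[runs_at])
    return score


print(final_score(html, "NY Mets"))
```

The returned dictionary can then be formatted into the text message, and the `None` result covers days on which the team did not play.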
