Beautiful Soup: avoid parsing nested table data twice - python-3.x

I have a nested table structure. I am using the code below to parse the data.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        dataset = td.get_text()
The problem is that when there are nested tables (in my case, tables inside <td></td>), the inner cells are parsed again after the initial pass, because I am using find_all("tr") and find_all("td"). How can I avoid parsing a nested table a second time when it has already been parsed?
Input:
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>
Expected Output:
1 2
3 4
5
11 22
6
But what I am getting is:
1 2
3 4
5
11 22
11 22
6
That is, the inner table is parsed again.
Specs:
beautifulsoup4==4.6.3
Data order should be preserved, and the content could be anything, including alphanumeric characters.

Using a combination of bs4 and re, you might achieve what you want.
I am using bs4 4.6.3.
from bs4 import BeautifulSoup as bs
import re
html = '''
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
ans = []
for x in soup.findAll('td'):
    if x.findAll('td'):
        for y in re.split('<table>.*</table>', str(x)):
            ans += re.findall(r'\d+', y)
    else:
        ans.append(x.text)
print(ans)
For each td we test whether it contains a nested td. If so, we split the string on the inner table and pull every number out of each piece with a regex.
Note this works only for two levels of nesting, but it can be adapted to any depth.
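If you do need arbitrary depth, a minimal sketch of my own (not the code above) is to walk each cell's direct children recursively; it assumes html.parser, so no tbody elements are inserted:
from bs4 import BeautifulSoup, NavigableString

def walk_table(table):
    # yield text fragments in document order, descending into nested tables
    for tr in table.find_all('tr', recursive=False):
        for td in tr.find_all('td', recursive=False):
            for node in td.children:
                if isinstance(node, NavigableString):
                    text = node.strip()
                    if text:
                        yield text
                elif node.name == 'table':
                    yield from walk_table(node)

soup = BeautifulSoup(html, 'html.parser')  # reusing the html string above
print(list(walk_table(soup.find('table'))))
# expected: ['1', '2', '3', '4', '5', '11', '22', '6']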

I have tried with the findChildren() method and somehow managed to produce the output. I am not sure if this will help you in other circumstances.
from bs4 import BeautifulSoup
data='''<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = BeautifulSoup(data, 'html.parser')
for child in soup.find('table').findChildren("tr", recursive=False):
    tdlist = []
    if child.find('table'):
        for td in child.findChildren("td", recursive=False):
            print(td.next_element.strip())
            for td1 in td.findChildren("table", recursive=False):
                for child1 in td1.findChildren("tr", recursive=False):
                    for child2 in child1.findChildren("td", recursive=False):
                        tdlist.append(child2.text)
            print(' '.join(tdlist))
            print(child2.next_element.next_element.strip())
    else:
        for td in child.findChildren("td", recursive=False):
            tdlist.append(td.text)
        print(' '.join(tdlist))
Output:
1 2
3 4
5
11 22
6
EDITED for Explanation
Step 1:
When you use findChildren() on the outer table, it first returns 3 records:
for child in soup.find('table').findChildren("tr", recursive=False):
    print(child)
Output:
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
Step 2:
Check whether the row contains a nested <table> tag and branch accordingly.
if child.find('table'):
Step 3:
Follow step 1 and use findChildren() to get the <td> tags.
Once you have a <td>, follow step 1 again to get its children.
Step 4:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
next_element returns the first text node of the tag, so in this case it returns the value 5.
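A quick demo of what next_element points at here (a sketch run against the same soup as above):
td = soup.find_all('td')[4]     # the cell that wraps the nested table
print(repr(td.next_element))    # '5\n' - the cell's first text node
print(td.next_element.strip())  # 5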
Step 5:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
    for td1 in td.findChildren("table", recursive=False):
        for child1 in td1.findChildren("tr", recursive=False):
            for child2 in child1.findChildren("td", recursive=False):
                tdlist.append(child2.text)
    print(' '.join(tdlist))
    print(child2.next_element.next_element.strip())
As you can see, I have just followed step 1 recursively. Again, I have used child2.next_element.next_element to get the value 6 that comes after the </table> tag.
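As an aside, the two next_element hops can be avoided: a cell's own text nodes can be listed directly with recursive=False (a sketch of my own, applied to the row that wraps the nested table):
for td in child.findChildren("td", recursive=False):
    # only the td's immediate text children, skipping the nested table
    own_text = [s.strip() for s in td.find_all(text=True, recursive=False) if s.strip()]
    print(own_text)  # ['5', '6'] in the sample data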

You can check whether another table exists inside a td tag; if it does, simply skip that td, otherwise use it as a regular td.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        if td.find('table'):  # check if td has a nested table
            continue
        dataset = td.get_text()
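If you want to keep the outer cell's own text ("5" and "6" in the sample) instead of dropping the whole cell, one option (a sketch, not the answer's exact code) is to read only the td's direct text children:
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        if td.find('table'):
            # keep the cell's own text but exclude the nested table's text
            dataset = ' '.join(s.strip() for s in td.find_all(text=True, recursive=False) if s.strip())
        else:
            dataset = td.get_text()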

In your example, with bs4 4.7.1, I use :not(:has(table)) to exclude rows that have a table child:
from bs4 import BeautifulSoup as bs
html = '''
<table>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>11</td>
<td>22</td>
</tr>
</table>
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('tr:not(:has(table))'):
    print([td.text for td in tr.select('td')])

Related

Colspan not working properly using Python Pandas

I have some data that I need to convert into an Excel sheet, which needs to look like this at the end of the day:
I've tried the following code:
import pandas as pd
result = pd.read_html(
"""<table>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</table>"""
)
writer = pd.ExcelWriter('out/test_pd.xlsx', engine='xlsxwriter')
print(result[0])
result[0].to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
The issue here is that the colspan is not handled properly. The output is like this instead:
Can someone help me with how to use colspan with pandas in Python?
It would be better if I didn't have to use read_html() and could do it directly in Python code, but if that's not possible, I can use read_html().
Since pandas can't tell the values from the column titles, you should mark them explicitly. If you convert the HTML to the standard format, pandas can handle it correctly: use thead and tbody to split the header from the values, like this.
result = pd.read_html("""
<table>
<thead>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</tbody>
</table>
"""
)
To write the DataFrame to an Excel file, you can use the pandas to_excel method.
result[0].to_excel("out.xlsx")
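One thing to watch: with two header rows, read_html usually parses the columns as a pandas MultiIndex (e.g. ('Status N', 'Merchant')). If you want plain column names in the Excel sheet, you can flatten them first; a sketch, assuming the thead version above:
df = result[0]
if isinstance(df.columns, pd.MultiIndex):
    # join the two header levels into single column names
    df.columns = [' - '.join(str(level) for level in col if str(level)) for col in df.columns]
df.to_excel("out.xlsx", index=False)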

python beautiful soup table?

Please, how do I catch the values in this table? I need the date, time, reserve, and play values.
Each time I only get a list of the whole table; I don't know how to pick out the individual values.
Thank you very much for your help.
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
First, you should post some code that you've tried yourself. But anyway, here's another way for you.
from simplified_scrapy import SimplifiedDoc,req
html = '''
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
'''
doc = SimplifiedDoc(html)
# First method
table = doc.getTable('table')
print (table)
# Second method
table = doc.getElement('table', attr='class', value='list').trs.children.text
print (table)
Result:
[['Date', 'Time', 'Play', 'Tickets', ''], ['Th 03. 09. 2020', '19:00', 'Racek', '4', 'reserve']]
[['Date', 'Time', 'Play', 'Tickets', ''], ['Th 03. 09. 2020', '19:00', 'Racek', '4', 'reserve']]
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
This script will parse the table with BeautifulSoup and then print individual rows to screen:
import re
from bs4 import BeautifulSoup
html = '''
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618"
title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
all_data = []
for row in soup.select('tr'):
    all_data.append([re.sub(r'\s{2,}', ' ', d.get_text(strip=True)) for d in row.select('td, th')])
# print data to screen:
# print header:
print('{:<25}{:<15}{:<15}{:<15}{:<15}'.format(*all_data[0]))
# print rows:
for date, time, play, tickets, reserve in all_data[1:]:
    print('{:<25}{:<15}{:<15}{:<15}{:<15}'.format(date, time, play, tickets, reserve))
Prints:
Date                     Time           Play           Tickets
Th 03. 09. 2020          19:00          Racek          4              reserve
A simple way using pandas
import pandas as pd
table = """
<table class="list">
<tr class="head">
<th>Date</th>
<th>Time</th>
<th>Play</th>
<th>Tickets</th>
<th> </th>
</tr>
<tr class="t1">
<th>Th
03. 09. 2020</th>
<td>
19:00</td>
<td>Racek</td>
<td class="center">4</td>
<td>
<a href="/rezervace/detail?id=2618" title="Reserve tickets for this performance">
reserve
</a>
</td>
</tr>
</table>
""""
df = pd.read_html(table)[0]
Then you can access the data within "df"
df["Date"]
# 0 Th 03. 09. 2020
# Name: Date, dtype: object
df["Time"]
# 0 19:00
# Name: Time, dtype: object
df["Play"]
# 0 Racek
# Name: Play, dtype: object
df["Tickets"]
# 0 4

WebScraping: Get nested element in HTML Table

Hi, I am new to web scraping and got stuck on getting a nested HTML element tag in a table. Here is the HTML I get from the URL http://www.geonames.org/search.html?q=+Leisse&country=FR:
<table class="restable">
<tr>
<td colspan=6 style="text-align: right;"><small>1 records found for "col de la Leisse"</small></td>
</tr>
<tr>
<th></th>
<th>Name</th>
<th>Country</th>
<th>Feature class</th>
<th>Latitude</th>
<th>Longitude</th>
</tr>
<tr>
<td><small>1</small> <img src="/maps/markers/m10-ORANGE-T.png" border="0" alt="T"></td>
<td>Col de la Leisse<br><small></small><span class="geo" style="display:none;"><span class="latitude">45.42372</span><span class="longitude">6.906828</span></span></td>
<td>France, Auvergne-Rhône-Alpes<br><small>Savoy > Albertville > Tignes</small></td>
<td>pass</td>
<td nowrap>N 45° 25' 25''</td>
<td nowrap>E 6° 54' 24''</td>
</tr>
<tr class="tfooter">
<td colspan=6></td>
</tr>
</table>
This is the HTML for only one row, to keep things simple, but in my case I iterate over each row and check whether the text of a <td> element equals a target value; if it does, I scrape the values of the <span> elements with class longitude and latitude. Here I want the row with the value Col de la Leisse.
Here is my code: (not good)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.findAll('table')[1] # second table
rows = table.find_all('tr')
target = "Col de la Leisse"
longitude = latitude = 0
for row in rows:
    cols = row.find_all('td')
    # I am stuck here...
    # if cols.text == target:
    # ...
Result:
longitude = 6.906828
latitude = 45.42372
With bs4 4.7.1 you can use :has and :contains to ensure the row has an a tag element containing your target string.
target = 'Col de la Leisse'
rows = soup.select('.restable tr:has(a:contains("' + target + '"))')
for row in rows:
    print([item.text for item in row.select('.latitude, .longitude')])
You can of course separate out .latitude and .longitude if you think they will not both be present, or if they can occur in a different order.
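If you are on an older bs4 without :has/:contains support, a plain-iteration fallback along these lines should also work (a sketch, matching on the cell text, since in the snippet above the name appears as plain text rather than inside a link):
target = 'Col de la Leisse'
for row in soup.find_all('tr'):
    # look for the target name in any cell of this row
    if any(target in td.get_text() for td in row.find_all('td')):
        lat = row.find('span', class_='latitude')
        lon = row.find('span', class_='longitude')
        if lat and lon:
            print('latitude =', lat.text, 'longitude =', lon.text)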

How to get all td[3] tags from the tr tags with selenium Xpath in python

I have a webpage HTML like this:
<table class="table_type1" id="sailing">
<tbody>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
</tbody>
</table>
The tr tags are dynamic, so I don't know how many of them exist. I need all the td[3] cells of the tr tags in a list for some slicing; it would be even better to iterate with built-in tools if find_element(s)_by_xpath("") supports iteration.
Try
cells = driver.find_elements_by_xpath("//table[@id='sailing']//tr/td[3]")
to get third cell of each row
Edit
For iterating just use a for loop:
print ([i.text for i in cells])
Try the following code:
tdElements = driver.find_elements_by_xpath('//table[@id="sailing"]/tbody//td')
Edit: for the 3rd element:
tdElements = driver.find_elements_by_xpath('//table[@id="sailing"]/tbody/tr/td[3]')
To print the text (e.g. 1) from each of the third <td> elements, you can use either the get_attribute() method or the text property. Note that find_elements_* returns a list, so you need to iterate over it; either of the following should work:
Using CssSelector and get_attribute():
print([td.get_attribute("innerHTML") for td in driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)")])
Using CssSelector and text property:
print([td.text for td in driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)")])
Using XPath and get_attribute():
print([td.get_attribute("innerHTML") for td in driver.find_elements_by_xpath('//table[@class="table_type1" and @id="sailing"]//tr//following::td[3]')])
Using XPath and text property:
print([td.text for td in driver.find_elements_by_xpath('//table[@class="table_type1" and @id="sailing"]//tr//following::td[3]')])
To get the 3rd td of each row, you can try either with XPath:
driver.find_elements_by_xpath('//table[@id="sailing"]/tbody//td[3]')
or you can try with css selector like
driver.find_elements_by_css_selector('table#sailing td:nth-child(3)')
As it returns a list, you can iterate over it with a for loop:
elements = driver.find_elements_by_xpath('//table[@id="sailing"]/tbody//td[3]')
for element in elements:
    print(element.text)

find_elements_by_xpath() not producing the desired output python selenium scraping

I'm trying to find a tr by its class of .tableOne. Here is my code:
browser = webdriver.Chrome(executable_path=path, options=options)
cells = browser.find_elements_by_xpath('//*[@class="tableone"]')
But the output of the cells variable is [], an empty array.
Here is the html of the page:
<tbody class="tableUpper">
<tr class="tableone">
<td><a class="studentName" href="//www.abc.com"> student one</a></td>
<td> <span class="id_one"></span> <span class="long">Place</span> <span class="short">Place</span></td>
<td class="hide-s">
<span class="state"></span> <span class="studentState">student_state</span>
</td>
</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
</tbody>
Please try this:
import re
cells = browser.find_elements_by_xpath("//*[contains(local-name(), 'tr') and contains(@class, 'tableone')]")
for e in cells:
    insides = e.find_elements_by_xpath("./td")
    for i in insides:
        result = re.search(r'">(.*)</', i.get_attribute("outerHTML"))
        print(result.group(1))
This gets all the tr elements that have class tableone, then iterates through each one and lists its tds. It then searches the outerHTML of each td and strips the string to get the text value.
It's quite unrefined and will return empty strings, I think. You might need to put some more work into the final product.
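A less brittle variant (a sketch of my own) is to let Selenium return the text directly instead of regexing the outerHTML:
rows = browser.find_elements_by_xpath("//tr[contains(@class, 'tableone')]")
for row in rows:
    # .text already returns the visible text of each cell
    print([td.text for td in row.find_elements_by_xpath("./td")])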
