Hi, I am new to web scraping and got stuck on getting a nested HTML element tag inside a table. Here is the HTML I get from the URL http://www.geonames.org/search.html?q=+Leisse&country=FR:
<table class="restable">
<tr>
<td colspan=6 style="text-align: right;"><small>1 records found for "col de la Leisse"</small></td>
</tr>
<tr>
<th></th>
<th>Name</th>
<th>Country</th>
<th>Feature class</th>
<th>Latitude</th>
<th>Longitude</th>
</tr>
<tr>
<td><small>1</small> <img src="/maps/markers/m10-ORANGE-T.png" border="0" alt="T"></td>
<td>Col de la Leisse<br><small></small><span class="geo" style="display:none;"><span class="latitude">45.42372</span><span class="longitude">6.906828</span></span></td>
<td>France, Auvergne-Rhône-Alpes<br><small>Savoy > Albertville > Tignes</small></td>
<td>pass</td>
<td nowrap>N 45° 25' 25''</td>
<td nowrap>E 6° 54' 24''</td>
</tr>
<tr class="tfooter">
<td colspan=6></td>
</tr>
</table>
This is the HTML for only one row, to keep things simple. In my case I iterate over each row and check whether the text of a <td> element equals a target value; if it does, I scrape the values of the <span> elements with the classes longitude and latitude. Here I want to get the row with the value Col de la Leisse.
Here is my code: (not good)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.findAll('table')[1]  # second table
rows = table.find_all('tr')
target = "Col de la Leisse"
longitude, latitude = 0, 0
for row in rows:
    cols = row.find_all('td')
    # I am stuck here...
    # if cols.text == target:
    #     ...
Expected result:
longitude = 6.906828
latitude = 45.42372
With bs4 4.7.1 you can use :has and :contains to ensure the row has an a tag element with your target string in it.
target = 'Col de la Leisse'
rows = soup.select('.restable tr:has(a:contains("' + target + '"))')
for row in rows:
    print([item.text for item in row.select('.latitude, .longitude')])
You can of course separate out .latitude and .longitude if you think they will not both be present, or if they can occur in a different order.
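For example, a minimal sketch of pulling the two values out separately (an assumption, not part of the original answer; it reuses the same soup, rows and target as above):
for row in rows:
    # select_one returns None when the span is absent, so guard before reading .text
    lat = row.select_one('.latitude')
    lon = row.select_one('.longitude')
    latitude = lat.text if lat else None
    longitude = lon.text if lon else None
    print(latitude, longitude)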
Good day. Please, how do I write out all the data in
<tr class="t1">
<th>Th
27. 02. 2020</th>
<td>
19:00</td>
<td>Záskok</td>
<td class="center">0</td>
<td>
My code is:
for tr in soup.find_all("tr", class_="t1"):
    date = soup.find("th").text
    name = soup.find("td", text="Záskok").text
    number = soup.find("td", text="Záskok").find_next_sibling("td").text
    print(date, ":", name, ":", number)
I have a nested table structure. I am using the below code for parsing the data.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        dataset = td.get_text()
The problem is that there are nested tables: in my case there are tables inside <td></td>, so their cells get parsed a second time because I am using find_all("tr") and find_all("td"). How can I avoid parsing a nested table again once it has already been parsed?
Input:
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>
Expected Output:
1 2
3 4
5
11 22
6
But what I am getting is:
1 2
3 4
5
11 22
11 22
6
That is, the inner table is parsed again.
Specs:
beautifulsoup4==4.6.3
The data order should be preserved, and the content could be anything, including alphanumeric characters.
Using a combination of bs4 and re, you might achieve what you want.
I am using bs4 4.6.3
from bs4 import BeautifulSoup as bs
import re
html = '''
<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
ans = []
for x in soup.findAll('td'):
    if x.findAll('td'):
        for y in re.split('<table>.*</table>', str(x)):
            ans += re.findall(r'\d+', y)
    else:
        ans.append(x.text)
print(ans)
For each td we test whether it contains a nested td. If so, we split its string representation on the <table>…</table> part and use a regex to pull every number out of what remains.
Note that this only works for two levels of nesting, but it can be adapted to deeper structures.
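As an illustration of that adaptation, here is a rough recursive sketch (not part of the original answer; it re-parses the same html string with html.parser, which, unlike lxml, does not insert <tbody> elements, so recursive=False still sees the rows):
from bs4 import BeautifulSoup
from bs4.element import Tag

def extract(table):
    # walk one table, descending into nested tables in document order
    out = []
    for tr in table.find_all("tr", recursive=False):
        row = []
        for td in tr.find_all("td", recursive=False):
            for child in td.children:
                if isinstance(child, Tag) and child.name == "table":
                    # flush the text collected so far, then recurse into the nested table
                    if row:
                        out.append(" ".join(row))
                        row = []
                    out.extend(extract(child))
                else:
                    text = (child.get_text() if isinstance(child, Tag) else str(child)).strip()
                    if text:
                        row.append(text)
        if row:
            out.append(" ".join(row))
    return out

soup = BeautifulSoup(html, "html.parser")
print("\n".join(extract(soup.find("table"))))
For the input above this prints the rows in the expected order: 1 2, 3 4, 5, 11 22, 6.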
I have tried the findChildren() method and somehow managed to produce the output. I am not sure whether this will help you in other circumstances.
from bs4 import BeautifulSoup
data='''<table>
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
</table>'''
soup = BeautifulSoup(data, 'html.parser')
for child in soup.find('table').findChildren("tr", recursive=False):
    tdlist = []
    if child.find('table'):
        for td in child.findChildren("td", recursive=False):
            print(td.next_element.strip())
            for td1 in td.findChildren("table", recursive=False):
                for child1 in td1.findChildren("tr", recursive=False):
                    for child2 in child1.findChildren("td", recursive=False):
                        tdlist.append(child2.text)
                    print(' '.join(tdlist))
                    print(child2.next_element.next_element.strip())
    else:
        for td in child.findChildren("td", recursive=False):
            tdlist.append(td.text)
        print(' '.join(tdlist))
Output:
1 2
3 4
5
11 22
6
EDITED for Explanation
Step 1:
When you use findChildren() on the outer table it first returns 3 <tr> records.
for child in soup.find('table').findChildren("tr", recursive=False):
    print(child)
Output:
<tr>
<td>1</td><td>2</td>
</tr>
<tr>
<td>3</td><td>4</td>
</tr>
<tr>
<td>5
<table><tr><td>11</td><td>22</td></tr></table>
6
</td>
</tr>
Step 2:
Check whether the row contains a <table> tag and, if so, handle it separately.
if child.find('table'):
Step 3:
As in step 1, use findChildren() to get the <td> tags.
Once you have a <td>, follow step 1 again to get its children.
Step 4:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
next_element returns the first text node of the tag, so in this case it returns the value 5.
Step 5:
for td in child.findChildren("td", recursive=False):
    print(td.next_element.strip())
    for td1 in td.findChildren("table", recursive=False):
        for child1 in td1.findChildren("tr", recursive=False):
            for child2 in child1.findChildren("td", recursive=False):
                tdlist.append(child2.text)
            print(' '.join(tdlist))
            print(child2.next_element.next_element.strip())
As you can see, I have just followed step 1 recursively. Again, I have used child2.next_element.next_element to get the value 6 that comes after the </table> tag.
You can check whether another table exists inside a td tag; if it does, simply skip that td, otherwise use it as a regular td.
for row in table.find_all("tr")[1:][:-1]:
    for td in row.find_all("td")[1:]:
        if td.find('table'):  # check if td has a nested table
            continue
        dataset = td.get_text()
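Note that skipping the cell entirely also drops the 5 and the 6 that sit directly inside the outer td in the example. If you need that text as well, one possible variant (an assumption, not part of this answer) is to keep only the td's direct strings when a nested table is present:
from bs4 import BeautifulSoup

html = "<table><tr><td>5<table><tr><td>11</td><td>22</td></tr></table>6</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

for td in soup.find("table").find_all("td"):
    if td.find("table"):
        # text belonging to this td itself, excluding the nested table
        own_text = " ".join(s.strip() for s in td.find_all(text=True, recursive=False) if s.strip())
        print(own_text)        # -> 5 6
    else:
        print(td.get_text())   # -> 11, then 22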
In your example, with bs4 4.7.1, I use :not(:has(table)) to exclude rows that have a table child from the loop.
from bs4 import BeautifulSoup as bs
html = '''
<table>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>11</td>
<td>22</td>
</tr>
</table>
</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
for tr in soup.select('tr:not(:has(table))'):
    print([td.text for td in tr.select('td')])
I want to use BeautifulSoup to scrape HTML and pull out only two columns from every row of one table. Each "tr" row has 10 "td" cells, and I only want the [1] and [8] "td" cells from each row. What is the most Pythonic way to do this?
From my input below I've got one table, one body, three rows, and 10 cells per row.
Input
<table id ="tblMain">
<tbody>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
Things I Have Tried
I understand how to use the index of the cells to loop through and get the "td" at [1] and [8]. However, I get confused when trying to write that data back to the csv as a single line per row.
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
data1_columns = []
data2_columns = []
for row in rows[1:]:
    data1 = row.findAll('td')[1]
    data1_columns.append(data1.text)
    data2 = row.findAll('td')[8]
    data2_columns.append(data2.text)
This is my current code which finds the table, rows, and all "td" cells and prints them correctly to a .csv. However, instead of writing all ten "td" cells per row back to the csv line, I just want to grab "td"[1] and "td"[8].
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        csv_row = []
        for cell in row.findAll("td"):
            csv_row.append(cell.get_text())
        writer.writerow(csv_row)
Expected Results
I want to write "td"[1] and "td"[8] into csv_row so that each list can be passed to writer.writerow.
Each csv_row written to my csv file should look like:
['data1', 'data2']
['data1', 'data2']
['data1', 'data2']
You almost did it
for row in rows:
row = row.findAll("td")
csv_row = [row[1].get_text(), row[8].get_text()]
writer.writerow(csv_row)
Full code
html ='''<table id ="tblMain">
<tbody>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
'''
from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
reportname = 'output'
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        row = row.findAll("td")
        csv_row = [row[1].get_text(), row[8].get_text()]
        writer.writerow(csv_row)
You should be able to use the nth-of-type pseudo-class CSS selector.
from bs4 import BeautifulSoup as bs
import pandas as pd
html = 'actualHTML'
soup = bs(html, 'lxml')
results = []
for row in soup.select('#tblMain tr'):
    out_row = [item.text.strip() for item in row.select('td:nth-of-type(2), td:nth-of-type(9)')]
    results.append(out_row)
df = pd.DataFrame(results)
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
Whenever I need to pull a table and it has the <table> tag, I let Pandas do the work for me, then just manipulate the dataframe it returns if needed. That's what I would do here:
html = '''<table id ="tblMain">
<tbody>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>'''
import pandas as pd
# .read_html() returns a list of dataframes
tables = pd.read_html(html)
# we want the dataframe from that list in position [0]
df = tables[0]
# Use .iloc to say I want all the rows, and columns 1, 8
df = df.iloc[:,[1,8]]
# Write the dataframe to file
df.to_csv('path.filename.csv', index=False)
I'm trying to find a tr by its class of .tableone. Here is my code:
browser = webdriver.Chrome(executable_path=path, options=options)
cells = browser.find_elements_by_xpath('//*[@class="tableone"]')
But the output of the cells variable is [], an empty array.
Here is the html of the page:
<tbody class="tableUpper">
<tr class="tableone">
<td><a class="studentName" href="//www.abc.com"> student one</a></td>
<td> <span class="id_one"></span> <span class="long">Place</span> <span class="short">Place</span></td>
<td class="hide-s">
<span class="state"></span> <span class="studentState">student_state</span>
</td>
</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
<tr class="tableone">..</tr>
</tbody>
Please try this:
import re
cells = browser.find_elements_by_xpath("//*[contains(local-name(), 'tr') and contains(@class, 'tableone')]")
for e in cells:
    insides = e.find_elements_by_xpath("./td")
    for i in insides:
        result = re.search('">(.*)</', i.get_attribute("outerHTML"))
        print(result.group(1))
This gets all the tr elements that have the class tableone, then iterates through each one and lists all of its tds. For each td it takes the outerHTML and strips the markup with a regex to get the text value.
It's quite unrefined and will return empty strings, I think. You might need to put some more work into the final product.
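For what it's worth, a simpler variant (just a sketch, assuming the same browser session and the Selenium 3-style API used in the question) is to let Selenium return the text directly instead of regexing the outerHTML:
rows = browser.find_elements_by_css_selector("tr.tableone")
for row in rows:
    # .text gives the rendered text of each cell, no regex needed
    print([td.text for td in row.find_elements_by_tag_name("td")])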
I know this is really easy for some of you out there, but I have been digging deep on the internet and I cannot find an answer. I need to get the company name that sits inside the a tag (eBay-Tradera.com) under tbody > tr > td, and the value 970,80 that sits inside the last td with class="bS aR", in the following table:
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com
</a>
</td>
<td class="aR">
175</td>
<td class="bS aR">0</td><td class="bS aR">0</td><td class="bS aR">187</td>
<td class="aR">0,00%</td><td class="bS aR">124</td>
<td class="aR">0,00%</td>
<td class="bS aR">26</td>
<td class="aR">20,97%</td>
<td class="bS aR">32</td>
<td class="aR">60,80</td>
<td class="aR">25,81%</td>
<td class="bS aR">5 102,00</td>
<td class="bS aR">0,00</td>
<td class="aR">0,00</td>
<td class="bS aR">
970,80
</td>
</tr>
</tbody>
This is my code, where I only try to get the a tag to start off with, but I can't get that to work either:
Set TDelements = document.getElementById("matrix1_group0").document.getElementsbytagname("a").innerHTML
r = 0
C = 0
For Each TDelement In TDelements
Blad1.Range("A1").Offset(r, C).Value = TDelement.innerText
r = r + 1
Next
Thanks in advance. I know this might be too simple, but I hope other people have the same issue and this will be helpful for them as well. The reason for the "r = r + 1" is that there are many more companies on this list; I just wanted to make it as easy as I could. Thanks again!
You will need to specify the element's location in the table. eBay seems to be obfuscating the class names, so we cannot rely on those being consistent. Nor would I usually rely on elements keeping a consistent table index, but I don't see any way around this.
I am assuming that this is the HTML document you are searching
<tbody id="matrix1_group0">
<tr class="oR" onmouseover="onMouseOver(this, false)" onmouseout="onMouseOut(this, false)" onclick="onClick(this, false)">
<td class="bS"> </td>
<td>
<a href="aProgramInfoApplyRead.action?programId=175&affiliateId=2014848" title="http://www.tradera.com/" target="_blank">
eBay-Tradera.com <!-- <=== You want this? -->
</a>
</td>
<!-- ... -->
</tr>
<!-- ... -->
</tbody>
We can ignore the rest of the document as the table element has an ID. In short, we assume that
.getElementById("matrix1_group0").getElementsByTagName("TR")
will return a collection of html row objects sorted by their appearance.
Set matrix = document.getElementById("matrix1_group0")
Set firstRow = matrix.getElementsByTagName("TR")(0)             ' collections are 0-based: first row
Set firstRowSecondCell = firstRow.getElementsByTagName("TD")(1) ' second cell holds the a tag
traderaName = firstRowSecondCell.innerText
Of course you could inline this all as
document.getElementById("matrix1_group0").getElementsByTagName("TR")(1).getElementsByTagName("TD")(2).innerText
but that would make debugging harder. Also, if the web page is ever presented to you in a different format then this won't work; eBay deliberately makes it hard to scrape data off the site.
With only the HTML you have shown you can use CSS selectors to obtain these:
a[href*='aProgramInfoApplyRead.action?programId']
Which selects a tags whose href attribute contains the string 'aProgramInfoApplyRead.action?programId'. This matches two elements, but the first is the one you want.
VBA:
You can use the .querySelector method of .document to retrieve the first match:
Debug.Print ie.document.querySelector("a[href*='aProgramInfoApplyRead.action?programId']").innerText