Retrieve 'tr' by text of children


I want to select a tr by the text it contains, including the text of the children.
My html is as follows:
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>
I load it like this: _soup = soup(html, "html.parser").
If I run _soup.find("span", text="Sanskrit").parent.parent.text, I get the result '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'
but if I run print(_soup.find("tr", text='\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'))
I get None.

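The issue here is that text needs an exact match against the tag's own string, and a <tr> with several children has no single string (its .string is None), so it never matches. You could use a regex on a child element, or use CSS selectors with :-soup-contains():
soup.select('tr:has(:-soup-contains("Sanskrit"))')
or, based on the comment, check whether your text is in the text of each <tr>:
for row in soup.select('tr'):
    if '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n' in row.text:
        print(row)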
Example
from bs4 import BeautifulSoup
html = '''
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html, "html.parser")
soup.select('tr:has(:-soup-contains("Sanskrit"))')
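The membership check above depends on the exact newlines in the markup. A minimal sketch of a whitespace-independent variant follows; the helper name find_rows_containing and the shortened HTML are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (illustrative only).
html = """
<table><tbody>
<tr><td><span>Sanskrit</span></td><td><span>0655-0700 </span></td>
    <td><a href="#">Download</a></td><td><span>24 October</span></td></tr>
<tr><td><span>Sanskrit</span></td><td><span>1810-1815 </span></td>
    <td><a href="#">Download</a></td><td><span>23 October</span></td></tr>
</tbody></table>
"""


def find_rows_containing(soup, *needles):
    """Return every <tr> whose combined descendant text contains all needles."""
    matches = []
    for row in soup.find_all("tr"):
        # get_text joins the text of all descendants; strip=True trims each
        # piece, so the result does not depend on newlines in the markup.
        row_text = row.get_text(" ", strip=True)
        if all(needle in row_text for needle in needles):
            matches.append(row)
    return matches


soup = BeautifulSoup(html, "html.parser")
rows = find_rows_containing(soup, "Sanskrit", "24 October")
print(len(rows))
print(rows[0].get_text(" ", strip=True))
```

Passing several short needles (subject, date) tends to be less brittle than reproducing the whole row text, since it only needs the values you already know.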

Related

How do I parse this html with Python lxml & xpath that finds the parent table of a specific span id?

Here is the HTML I don't have any control over. This is condensed HTML of the real page.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Little League</title>
</head>
<body>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<tbody>
<tr>
<td class="rightTD">
<p>
<span id="teams_players">Player Teams</span>
</p>
</td>
</tr>
<tr>
<td>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr id="team_listings">
<td colspan="3">Team Listings
<br>
<br>
</td>
</tr>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Foxes</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">1</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tualatin</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<br>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Tigers</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">3</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tigard</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
I am trying to get to the table tag immediately preceding the span tag with id teams_players.
I tried these but failed:
//table/span[@id="teams_players"]
ancestor::table[span[@id="teams_players"][position() = 1]]
This works, but it is not elegant and I prefer not to hardcode it:
//span[@id="teams_players"]/../../../../..
While //table[@class="tableData"] might seem like it should work, there are many such tables in the HTML that have the same class with unrelated data, so this is ruled out.
Here is the code so far with my attempts (definitely not efficient; once I find a way of fetching both tables, I plan on looping through them to extract the data):
from io import StringIO
from lxml import etree

def parse_team():
    # team data structure
    teams = []
    team_dict = { 'team': '', 'rank': '', 'location': { 'city': '', 'state': '', 'country': '' } }
    filename = 'team.html'
    f = open(filename, encoding="utf8").read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(f), parser)
    # fetch the table dom and parse each team table
    # fetch the parent table that contains teams_players span id
    team_tables = tree.xpath('ancestor::table[span[@id="teams_players"][position() = 1]]')
    print(team_tables)
    root_tables = tree.xpath('//table/span[@id="teams_players"]')
    print("root tables", root_tables)
    # this provides each team table but in full html, the same class is being used for other unrelated data
    name = tree.xpath('//table[@class="tableData"]')
    print(name)
    eachvaltr = name[0].xpath('.//tr')
    teamname = name[0].xpath('.//td[contains(text(),"Team Name")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamname", teamname)
    teamrank = name[0].xpath(
        './/td[contains(text(),"Team Rank")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamrank", teamrank)
    city = name[0].xpath(
        './/td[contains(text(),"City")]//span[@class="blue_color"]/text()')
    state = name[0].xpath(
        './/td[contains(text(),"State")]//span[@class="blue_color"]/text()')
    country = name[0].xpath(
        './/td[contains(text(),"Country")]//span[@class="blue_color"]/text()')
    print(city[0], state[0], country[0])
    team_dict['team'] = teamname
    team_dict['rank'] = teamrank
    team_dict['location']['city'] = city[0]
    team_dict['location']['state'] = state[0]
    team_dict['location']['country'] = country[0]
    print(team_dict)
Desired output is a list of teams where each team is a dict.
[{'team': ['Foxes'], 'rank': ['1'], 'location': {'city': 'Tualatin', 'state': 'Oregon', 'country': 'United States'}}]

//table[.//span[@id="teams_players"]]
or
//span[@id="teams_players"]/ancestor::table
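A minimal sketch of how the ancestor-based expression could be used with lxml to scope the search to the team tables only. The condensed markup, the variable names, and the extra unrelated tableData table are assumptions for illustration; the real page will differ.

```python
from lxml import html

# Condensed stand-in for the page in the question (illustrative only).
page = """
<html><body>
<table class="tableData"><tr><td><span class="blue_color">unrelated</span></td></tr></table>
<table>
  <tbody>
    <tr><td><p><span id="teams_players">Player Teams</span></p></td></tr>
    <tr><td>
      <table class="tableData"><tbody>
        <tr><td>Team Name</td></tr>
        <tr><td><span class="blue_color">Foxes</span></td></tr>
      </tbody></table>
      <table class="tableData"><tbody>
        <tr><td>Team Name</td></tr>
        <tr><td><span class="blue_color">Tigers</span></td></tr>
      </tbody></table>
    </td></tr>
  </tbody>
</table>
</body></html>
"""

tree = html.fromstring(page)

# ancestor:: is a reverse axis, so [1] selects the nearest enclosing table.
outer = tree.xpath('//span[@id="teams_players"]/ancestor::table[1]')[0]

# A relative path (leading ".") keeps the search inside that table, so
# tableData tables elsewhere on the page are not picked up.
team_tables = outer.xpath('.//table[@class="tableData"]')
names = [t.xpath('.//span[@class="blue_color"]/text()')[0] for t in team_tables]
print(names)
```

Each element of team_tables can then be passed to the per-team extraction logic from the question instead of indexing name[0].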

beautifulsoup for looping and getting text and Href

I'm in a bit of a pinch here.
It's a really messy ASP site that I am trying to get data from.
I'm trying to use a for loop to get the href and the text of all the rows of the 4th table on the site, so I first did:
table = soup.findAll('table')[3]
Then from this table I need to get all the text inside the <tr> tags and the hrefs of the <a> tags inside.
I tried something like this:
for product in table.findAll('tbody'):
    product_title = product.find('tr').text
    product_link = product.find('a')['href']
    print (product_title, product_link)
But I get nothing in return.
The table I'm working on:
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Ankara
</td>
<td width="10">
-
</td>
<td>
Konya
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Van
</td>
<td width="10">
-
</td>
<td>
Istanbul
</td>
<td colspan="2">
</td>
</tr>

Instead of extracting text from the tbody of the table, you can get all the tr tags directly.
Based on your snippet, you can refer to this code for extracting the data from the table.
from bs4 import BeautifulSoup

# text holds the HTML of the table
soup = BeautifulSoup(text, 'html.parser')
all_products = []
for tr in soup.find_all('tr'):
    row_text = tr.get_text(separator=' ', strip=True)
    if row_text:
        a_tag = tr.find('a')
        if a_tag:
            product_link = a_tag.attrs['href']
            all_text = row_text + ' ' + product_link
            all_products.append(all_text.split(' '))
print(all_products)
Output is:
[['4342141', '25.07.2018', '09:00', 'Ankara', '-', 'Konya', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Van', '-', 'Istanbul', 'free.asp?detail=hide&c_id=4134123']]
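Splitting on spaces breaks "25.07.2018 09:00" into two items. A minimal sketch of a per-cell variant follows, assuming a shortened two-row table; the records structure and its key names are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (illustrative only).
html = """
<table>
<tr><td><a href="free.asp?detail=hide&amp;c_id=4342141"><img src="pic/bullet.gif"/></a></td>
    <td> 4342141 </td><td> 25.07.2018 09:00 </td><td> Ankara </td><td> - </td><td> Konya </td></tr>
<tr height="3"><td colspan="10"></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for tr in soup.find_all("tr"):
    link = tr.find("a", href=True)
    if link is None:
        continue  # spacer rows carry no link, so skip them
    # Read each cell separately so values containing spaces stay intact.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append({"cells": [c for c in cells if c], "href": link["href"]})
print(records)
```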

Find label name which doesn't contain input immediately from a table

I have a table that contains multiple td/tr elements.
My problem is that the label I want is not immediately followed by an input, and the label has no identifying attributes.
I want to get "Countries of Permanent Residence" and "Dates", which are in span elements with no identifying attributes; each span sits in a div that has none either.
I tried:
formElement[input[name="icims_gh_Permanent_Country_Residence"]]
but I don't know how to get the associated label.
The HTML looks like this:
<div style="margin:0px">
<table style="width:100%" border="1" cellspacing="2" cellpadding="2">
<tbody>
<tr>
<td valign="top">
<div style="margin:0px">
<span style="font-family:arial,helvetica,sans-serif;font-size:12px"> Countries of Permanent Residence (list all) </span>
</div>
</td>
<td valign="top">
<div style="margin:0px">
<span style="font-family:arial,helvetica,sans-serif;font-size:12px"> Dates </span>
</div>
</td>
</tr>
<tr>
<td colspan="2" valign="top">
<div style="margin:0px">
<div>
<input type="hidden" name="icims_gh_Permanent_Country_Residence" value="1">
<a name="0.1_icims_ga_Permanent_Country_Residence"></a>
<div>
<table style="width:100%;border:0px">
<tbody>
<tr>
<td>
<table style="width:100%" border="1" cellspacing="2" cellpadding="2">
<tbody>
<tr>
<td valign="top">
<div style="margin:0px">
<span style="font-family:arial,helvetica,sans-serif;font-size:12px">
<input type="text" name="icims_0_permResidenceCountry"> </span>
</div>
</td>
<td valign="top">
<div style="margin:0px">
<span style="font-family:arial,helvetica,sans-serif;font-size:12px">
<input type="text" name="icims_0_permResidenceCountryDates"> </span>
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</div>

How to remove all "document.write(' ');" with beautifulsoup

How can I remove all "document.write(' ');" wrappers around the <table> </table> using BeautifulSoup?
I have the following raw HTML:
document.write('<table>');
document.write('
<tr>
<td>
<span class="prod">
some text
</span>
</td>
');
document.write('
<td>
<span class="prod">
7.70.022
</span>
</td>
</tr>
');
document.write('</table>');
I need the following result with BeautifulSoup:
<table>
<tr>
<td>
<span class="prod">
some text
</span>
</td>
<td>
<span class="prod">
7.70
</span>
</td>
</tr>
</table>

Why don't you just use a regular expression to remove the parts you don't want and then parse the result using BeautifulSoup?
import re
data = """document.write('<table>');
document.write('
<tr>
<td>
<span class="prod">
some text
</span>
</td>
');
document.write('
<td>
<span class="prod">
7.70.022
</span>
</td>
</tr>
');
document.write('</table>');"""
pattern = re.compile(r"document\.write\('\n?([^']*?)(?:\n\s*)?'\);")
data = pattern.sub(r'\g<1>', data)
print(data)
Output
<table>
<tr>
<td>
<span class="prod">
some text
</span>
</td>
<td>
<span class="prod">
7.70.022
</span>
</td>
</tr>
</table>
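Once the wrappers are stripped, the cleaned string can be handed to BeautifulSoup. A minimal sketch of that second step, assuming the written HTML never contains a single quote (the pattern would stop early if it did); the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

raw = """document.write('<table>');
document.write('
<tr>
<td>
<span class="prod">
some text
</span>
</td>
');
document.write('
<td>
<span class="prod">
7.70.022
</span>
</td>
</tr>
');
document.write('</table>');"""

# Keep only what sits between document.write(' and ');
pattern = re.compile(r"document\.write\('\n?([^']*?)(?:\n\s*)?'\);")
cleaned = pattern.sub(r"\g<1>", raw)

# Parse the recovered markup and read the cell values.
soup = BeautifulSoup(cleaned, "html.parser")
values = [span.get_text(strip=True) for span in soup.find_all("span", class_="prod")]
print(values)
```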

Scala find location of string in a string

I have this string:
var htmlString;
Assigned to:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
<head>
<title>Payment Receipt</title>
<link rel="stylesheet" type="text/css" href="content/PaymentForm.css">
<style type="text/css">
</style>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head>
<body>
<div id="divPageOuter" class="PageOuter">
<div id="divPage" class="Page">
<!--[1]-->
<div id="divThankYou">
Thank you for your order!
</div>
<hr class="HrTop">
<div id="divReceiptMsg">
You may print this receipt page for your records.
</div>
<div class="SectionBar">
Order Information
</div>
<table id="tablePaymentDetails1Rcpt">
<tr>
<td class="LabelColInfo1R">
Merchant:
</td>
<td class="DataColInfo1R">
<!--Merchant.val-->
Ryan
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Description:
</td>
<td class="DataColInfo1R">
<!--x_description.val-->
Rasmussenpayment
<!--end-->
</td>
</tr>
</table>
<table id="tablePaymentDetails2Rcpt" cellspacing="0" cellpadding="0">
<tr>
<td id="tdPaymentDetails2Rcpt1">
<table>
<tr>
<td class="LabelColInfo1R">
Date/Time:
</td>
<td class="DataColInfo1R">
<!--Date/Time.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Customer ID:
</td>
<td class="DataColInfo1R">
<!--x_cust_id.val-->
<!--end-->
</td>
</tr>
</table>
</td>
<td id="tdPaymentDetails2Rcpt2">
<table>
<tr>
<td class="LabelColInfo1R">
Invoice Number:
</td>
<td class="DataColInfo1R">
<!--x_invoice_num.val-->
176966244
<!--end-->
</td>
</tr>
</table>
</td>
</tr>
</table>
<hr id="hrBillingShippingBefore">
<table id="tableBillingShipping">
<tr>
<td id="tdBillingInformation">
<div class="Label">
Billing Information
</div>
<div id="divBillingInformation">
Test14 Rasmussen<br>
1234 test st<br>
San Diego, CA 92107 <br>
</div>
</td>
<td id="tdShippingInformation">
<div class="Label">
Shipping Information
</div>
<div id="divShippingInformation">
</div>
</td>
</tr>
</table>
<hr id="hrBillingShippingAfter">
<div id="divOrderDetailsBottomR">
<table id="tableOrderDetailsBottom">
<tr>
<td class="LabelColTotal">
Total:
</td>
<td class="DescrColTotal">
</td>
<td class="DataColTotal">
<!--x_amount.val-->
US $250.00
<!--end-->
</td>
</tr>
</table>
<!-- tableOrderDetailsBottom -->
</div>
<div id="divOrderDetailsBottomSpacerR">
</div>
<div class="SectionBar">
Visa ****0027
</div>
<table class="PaymentSectionTable" cellspacing="0" cellpadding="0">
<tr>
<td class="PaymentSection1">
<table>
<tr>
<td class="LabelColInfo2R">
Date/Time:
</td>
<td class="DataColInfo2R">
<!--Date/Time.1.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Transaction ID:
</td>
<td class="DataColInfo2R">
<!--Transaction ID.1.val-->
2173493354
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Authorization Code:
</td>
<td class="DataColInfo2R">
<!--x_auth_code.1.val-->
07I3DH
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Payment Method:
</td>
<td class="DataColInfo2R">
<!--x_method.1.val-->
Visa ****0027
<!--end-->
</td>
</tr>
</table>
</td>
<td class="PaymentSection2">
<table>
</table>
</td>
</tr>
</table>
<div class="PaymentSectionSpacer">
</div>
</div>
<!-- entire BODY -->
</div>
<div class="PageAfter">
</div>
</body>
</html>
And I want to find the location of "x_auth_code.1.val" in the string. And then I want to obtain a string from the location plus a certain number of characters. The goal would be to return the Authorization code.

You can use indexOfSlice, and then slice() in StringOps
scala> val myString = "Hello World!"
myString: java.lang.String = Hello World!
scala> val index = myString.indexOfSlice("Wo")
index: Int = 6
scala> val slice = myString.slice(index, index+5)
slice: String = World
With your html string:
scala> htmlString.indexOfSlice("x_auth_code.1.val")
res4: Int = 2771

Why aren't you using an XML parser? Don't treat XML as strings -- you'll get bitten if you do.
Here's a regex to do it, but my advice is: DO NOT USE IT! Use xml tools.
"""\Qx_auth_code.1.val\E[^>]*>([^<]*)""".r.findFirstMatchIn(htmlString).map(_ group 1)
