How do I parse this html with Python lxml & xpath that finds the parent table of a specific span id? - python-3.x

Here is the HTML I don't have any control over. This is condensed HTML of the real page.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Little League</title>
</head>
<body>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<tbody>
<tr>
<td class="rightTD">
<p>
<span id="teams_players">Player Teams</span>
</p>
</td>
</tr>
<tr>
<td>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr id="team_listings">
<td colspan="3">Team Listings
<br>
<br>
</td>
</tr>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Foxes</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">1</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tualatin</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<br>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Tigers</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">3</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tigard</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
I am trying to get to the table tag immediately preceding the span tag with id team_players.
I tried these but failed -
//table/span[#id="teams_players"]
ancestor::table[span[#id="teams_players"][position() = 1]]
This works but is not elegant and I prefer not to hardcode it -
//span[#id="teams_players"]/../../../../..
While //table[#class="tableData"] this might seem like it should work, there are many such tables in the HTML that has the same class with unrelated data. So this is ruled out.
Here is the code so far with my attempts (definitely not efficient, once I find a way of fetching both tables, I plan on looping through them to extract the data -
def parse_team():
# team data structure
teams = []
team_dict = { 'team': '', 'rank': '', 'location': { 'city': '', 'state': '', 'country': '' } }
filename = f'team.html'
f = open(filename, encoding="utf8").read()
parser = etree.HTMLParser()
tree = etree.parse(StringIO(f), parser)
# fetch the table dom and parse each team table
# fetch the parent table that contains teams_players span id
team_tables = tree.xpath('ancestor::table[span[#id="teams_players"][position() = 1]]')
print(team_tables)
root_tables = tree.xpath('//table/span[#id="teams_players"]')
print("root tables", root_tables)
# this provides each team table but in full html, the same class is being used for other unrelated data
name = tree.xpath('//table[#class="tableData"]')
print(name)
eachvaltr = name[0].xpath('.//tr')
teamname = name[0].xpath('.//td[contains(text(),"Team Name")]//parent::tr/following-sibling::tr[1]//span[#class="blue_color"]/text()')
print("teamname", teamname)
teamrank = name[0].xpath(
'.//td[contains(text(),"Team Rank")]//parent::tr/following-sibling::tr[1]//span[#class="blue_color"]/text()')
print("teamrank", teamrank)
city = name[0].xpath(
'.//td[contains(text(),"City")]//span[#class="blue_color"]/text()')
state = name[0].xpath(
'.//td[contains(text(),"State")]//span[#class="blue_color"]/text()')
country = name[0].xpath(
'.//td[contains(text(),"Country")]//span[#class="blue_color"]/text()')
print(city[0], state[0], country[0])
team_dict['team'] = teamname
team_dict['rank'] = teamrank
team_dict['location']['city'] = city[0]
team_dict['location']['state'] = state[0]
team_dict['location']['country'] = country[0]
print(team_dict)
Desired output is a list of teams where each team is a dict.
[{'team': ['Foxes'], 'rank': ['1'], 'location': {'city': 'Tualatin', 'state': 'Oregon', 'country': 'United States'}}]

//table[.//span[#id="teams_players"]]
or
//span[#id="teams_players"]/ancestor::table

Related

Retrieve 'tr' by text of children

I want to select a tr by the text it contains, including the text of the children.
My html is as follows:
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>
I load it thus _soup = soup(html, "html.parser").
If i run _soup.find("span", text="Sanskrit").parent.parent.text then I get the result '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'
but if i run print(_soup.find("tr", text='\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'))
i get None
Issue here is that text needs an exact match - You could regex or use css selectors and :-soup-contains():
soup.select('tr:has(:-soup-contains("Sanskrit"))')
or based on comment, check if your text is in <tr>s text:
for row in soup.select('tr'):
if '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n' in row.text:
print(row)
Example
from bs4 import BeautifulSoup
html = '''
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html)
soup.select('tr:has(:-soup-contains("Sanskrit"))')

Need the full correct html code for this Table

Image table
Please write full html codes for this table. You may see the Table image from above via the link.
<table border="1" width="800">
<tr>
<th>Level1</th>
<th>Level2</th>
<th>Level2</th>
<th>Info</th>
<th>Name</th>
</tr>
<tr>
<td rowspan="6">System</td>
</tr>
<tr>
<td rowspan="4">System Apps</td>
<td rowspan="2">System Memory</td>
</tr>
<tr>
<td rowspan="3">SystemEnv</td>
<td rowspan="1">SystemEnv2</td>
<td rowspan="2">Memeory Test</td>
</tr>
Here is the table code :
<table border="2">
<tr>
<th>Level 1</th>
<th>Level 2</th>
<th>Level 3</th>
<th>info</th>
<th>Name</th>
</tr>
<tr>
<td rowSpan="6">System</td>
<td rowSpan="4">System apps</td>
<td rowSpan="3">SystemEnv</td>
<td>App Text</td>
<td>foo</td>
</tr>
<tr>
<td>App memory</td>
<td>foo</td>
</tr>
<tr>
<td>App test</td>
<td>bar</td>
</tr>
<tr>
<td>Systemenv2</td>
<td>App test</td>
<td>bar</td>
</tr>
<tr>
<td rowSpan="2">System Memory</td>
<td rowSpan="2">Memory test</td>
<td>memory func</td>
<td>foo</td>
</tr>
<tr>
<td>Memory Func</td>
<td>foo</td>
</tr>
</table>

beautifulsoup for looping and getting text and Href

I'm in a bit of a quinch here:
its an ASP site which is really messy that I am trying to get data from:
I'm trying to use a for loop to get an href and the text of all the rows of the 4th table that is on the site, so I first did:
table = soup.findAll('table')[3]
Then from this table I need to get all text inside the <tr> tags and the href's of the <a> inside.
i tried something like this:
for product in table.findAll('tbody'):
product_title = product.find('tr').text
product_link = product.find('a')['href']
print (product_title, product_link)
But I get nothing in return
The table Im working on:
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Ankara
</td>
<td width="10">
-
</td>
<td>
Konya
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Van
</td>
<td width="10">
-
</td>
<td>
Istanbul
</td>
<td colspan="2">
</td>
</tr>
Instead of extracting text from tbody from table, you can directly get all tr tags.
Based on your snippet you can refer to this code snippet for data extraction from table.
soup = BeautifulSoup(text, 'html.parser')
all_products = []
for tr in soup.find_all('tr'):
text = tr.get_text(separator=' ', strip=True)
if text:
a_tag = tr.find('a')
if a_tag:
product_link = a_tag.attrs['href']
all_text = text + ' ' + product_link
all_products.append(all_text.split(' '))
print(all_products)
Output is:
[['4342141', '25.07.2018', '09:00', 'Ankara', '-', 'Konya', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Van', '-', 'Istanbul', 'free.asp?detail=hide&c_id=4134123']]

How to pass product TVs to SimpleCart's scGetCart snippet?

I need some TVs (weight, dimensions, etc) I've associated with my products to appear in the Cart page of my SimpleCart site.
Problem is I have no idea how to do this. I don't understand how the SimpleCart cart is built and there isn't documentation for this.
Would anyone know how I can show TVs associated with each product in the cart output chunk?
The cart snippet has the following code which gets data from the cart and puts it into Chunks:
$sc = $modx->getService('simplecart','SimpleCart',$modx->getOption('simplecart.core_path',null,$modx->getOption('core_path').'components/simplecart/').'model/simplecart/',$scriptProperties);
if (!($sc instanceof SimpleCart)) return '';
 
$controller = $sc->loadController('Cart');
$output = $controller->run($scriptProperties);
The output Chunk looks like:
<div id="simplecart">
<form action="[[~[[*id]]]]" method="post" id="form_cartoverview">
<input type="hidden" name="updatecart" value="true" />
<table>
<tr>
<th class="desc">[[%simplecart.cart.description]]</th>
<th class="price">[[%simplecart.cart.price]]</th>
<th class="quantity">[[%simplecart.cart.quantity]]</th>
[[+cart.total.vat_total:notempty=`<th class="quantity">[[%simplecart.cart.vat]]</th>`:isempty=``]]
<th class="subtotal">[[%simplecart.cart.subtotal]]</th>
<th> </th>
</tr>
[[+cart.wrapper]]
[[+cart.total.discount:notempty=`<tr class="total first discount">
<td colspan="[[+cart.total.vat_total:notempty=`3`:isempty=`2`]]"> </td>
<td class="label">[[%simplecart.cart.discount]]</td>
<td class="value">- [[+cart.total.discount_formatted]]</td>
<td class="extra">[[+cart.total.discount_percent:notempty=`([[+cart.total.discount_percent]]%)`:isempty=` `]]</td>
</tr>`:isempty=``]]
[[+cart.total.vat_total:notempty=`
<tr class="total [[+cart.total.discount:notempty=`second`:isempty=`first`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_ex_vat]]</td>
<td class="value">[[+cart.total.price_ex_vat_formatted]]</td>
<td class="extra"> </td>
</tr>
[[+cart.vat_rates]]
<tr class="total [[+cart.total.discount:notempty=`third`:isempty=`second`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_vat]]</td>
<td class="value">[[+cart.total.vat_total_formatted]]</td>
<td class="extra"> </td>
</tr>
<tr class="total [[+cart.total.discount:notempty=`fourth`:isempty=`third`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_in_vat]]</td>
<td class="value">[[+cart.total.price_formatted]]</td>
<td class="extra"> </td>
</tr>
`:isempty=`
<tr class="total [[+cart.total.discount:notempty=`second`:isempty=`first`]]">
<td colspan="2"> </td>
<td class="label">[[%simplecart.cart.total]]</td>
<td class="value">[[+cart.total.price_formatted]]</td>
<td class="extra"> </td>
</tr>
`]]
</table>
<div class="submit">
<input type="submit" value="[[%simplecart.cart.update]]" />
</div>
</form>
This does appear to be documented:
Product Options (TVs)
and to output them:
Modifying the Product Template
It appears that you would just output them normally [[*myProductOptions]]
Though, it appears that your template is using a placeholder, I would try
[[+cart.myProductOptions] as well. If all else fails you might try debugging the simplecart class and dump the array of product data before it populates the chunk, there might be a clue in there.
Found (through trial and error) you must use:
[[+product.tv.name_of_tv]]

Scala find location of string in a string

I have this string:
var htmlString;
Assigned to:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
<head>
<title>Payment Receipt</title>
<link rel="stylesheet" type="text/css" href="content/PaymentForm.css">
<style type="text/css">
</style>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head>
<body>
<div id="divPageOuter" class="PageOuter">
<div id="divPage" class="Page">
<!--[1]-->
<div id="divThankYou">
Thank you for your order!
</div>
<hr class="HrTop">
<div id="divReceiptMsg">
You may print this receipt page for your records.
</div>
<div class="SectionBar">
Order Information
</div>
<table id="tablePaymentDetails1Rcpt">
<tr>
<td class="LabelColInfo1R">
Merchant:
</td>
<td class="DataColInfo1R">
<!--Merchant.val-->
Ryan
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Description:
</td>
<td class="DataColInfo1R">
<!--x_description.val-->
Rasmussenpayment
<!--end-->
</td>
</tr>
</table>
<table id="tablePaymentDetails2Rcpt" cellspacing="0" cellpadding="0">
<tr>
<td id="tdPaymentDetails2Rcpt1">
<table>
<tr>
<td class="LabelColInfo1R">
Date/Time:
</td>
<td class="DataColInfo1R">
<!--Date/Time.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Customer ID:
</td>
<td class="DataColInfo1R">
<!--x_cust_id.val-->
<!--end-->
</td>
</tr>
</table>
</td>
<td id="tdPaymentDetails2Rcpt2">
<table>
<tr>
<td class="LabelColInfo1R">
Invoice Number:
</td>
<td class="DataColInfo1R">
<!--x_invoice_num.val-->
176966244
<!--end-->
</td>
</tr>
</table>
</td>
</tr>
</table>
<hr id="hrBillingShippingBefore">
<table id="tableBillingShipping">
<tr>
<td id="tdBillingInformation">
<div class="Label">
Billing Information
</div>
<div id="divBillingInformation">
Test14 Rasmussen<br>
1234 test st<br>
San Diego, CA 92107 <br>
</div>
</td>
<td id="tdShippingInformation">
<div class="Label">
Shipping Information
</div>
<div id="divShippingInformation">
</div>
</td>
</tr>
</table>
<hr id="hrBillingShippingAfter">
<div id="divOrderDetailsBottomR">
<table id="tableOrderDetailsBottom">
<tr>
<td class="LabelColTotal">
Total:
</td>
<td class="DescrColTotal">
</td>
<td class="DataColTotal">
<!--x_amount.val-->
US $250.00
<!--end-->
</td>
</tr>
</table>
<!-- tableOrderDetailsBottom -->
</div>
<div id="divOrderDetailsBottomSpacerR">
</div>
<div class="SectionBar">
Visa ****0027
</div>
<table class="PaymentSectionTable" cellspacing="0" cellpadding="0">
<tr>
<td class="PaymentSection1">
<table>
<tr>
<td class="LabelColInfo2R">
Date/Time:
</td>
<td class="DataColInfo2R">
<!--Date/Time.1.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Transaction ID:
</td>
<td class="DataColInfo2R">
<!--Transaction ID.1.val-->
2173493354
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Authorization Code:
</td>
<td class="DataColInfo2R">
<!--x_auth_code.1.val-->
07I3DH
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Payment Method:
</td>
<td class="DataColInfo2R">
<!--x_method.1.val-->
Visa ****0027
<!--end-->
</td>
</tr>
</table>
</td>
<td class="PaymentSection2">
<table>
</table>
</td>
</tr>
</table>
<div class="PaymentSectionSpacer">
</div>
</div>
<!-- entire BODY -->
</div>
<div class="PageAfter">
</div>
</body>
</html>
And I want to find the location of "x_auth_code.1.val" in the string. And then I want to obtain a string from the location plus a certain number of characters. The goal would be to return the Authorization code.
You can use indexOfSlice, and then slice() in StringOps
scala> val myString = "Hello World!"
myString: java.lang.String = Hello World!
scala> val index = myString.indexOfSlice("Wo")
index: Int = 6
scala> val slice = myString.slice(index, index+5)
slice: String = World
With your html string:
scala> htmlString.indexOfSlice("x_auth_code.1.val")
res4: Int = 2771
Why aren't you using an XML parser? Don't treat XML as strings -- you'll get bitten if you do.
Here's a regex to do it, but my advice is: DO NOT USE IT! Use xml tools.
"""\Qx_auth_code.1.val\E[^>]*>([^<]*)""".r.findFirstMatchIn(htmlString).map(_ group 1)

Resources