Scala find location of string in a string

I have this string:
var htmlString;
Assigned to:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
<head>
<title>Payment Receipt</title>
<link rel="stylesheet" type="text/css" href="content/PaymentForm.css">
<style type="text/css">
</style>
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
</head>
<body>
<div id="divPageOuter" class="PageOuter">
<div id="divPage" class="Page">
<!--[1]-->
<div id="divThankYou">
Thank you for your order!
</div>
<hr class="HrTop">
<div id="divReceiptMsg">
You may print this receipt page for your records.
</div>
<div class="SectionBar">
Order Information
</div>
<table id="tablePaymentDetails1Rcpt">
<tr>
<td class="LabelColInfo1R">
Merchant:
</td>
<td class="DataColInfo1R">
<!--Merchant.val-->
Ryan
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Description:
</td>
<td class="DataColInfo1R">
<!--x_description.val-->
Rasmussenpayment
<!--end-->
</td>
</tr>
</table>
<table id="tablePaymentDetails2Rcpt" cellspacing="0" cellpadding="0">
<tr>
<td id="tdPaymentDetails2Rcpt1">
<table>
<tr>
<td class="LabelColInfo1R">
Date/Time:
</td>
<td class="DataColInfo1R">
<!--Date/Time.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo1R">
Customer ID:
</td>
<td class="DataColInfo1R">
<!--x_cust_id.val-->
<!--end-->
</td>
</tr>
</table>
</td>
<td id="tdPaymentDetails2Rcpt2">
<table>
<tr>
<td class="LabelColInfo1R">
Invoice Number:
</td>
<td class="DataColInfo1R">
<!--x_invoice_num.val-->
176966244
<!--end-->
</td>
</tr>
</table>
</td>
</tr>
</table>
<hr id="hrBillingShippingBefore">
<table id="tableBillingShipping">
<tr>
<td id="tdBillingInformation">
<div class="Label">
Billing Information
</div>
<div id="divBillingInformation">
Test14 Rasmussen<br>
1234 test st<br>
San Diego, CA 92107 <br>
</div>
</td>
<td id="tdShippingInformation">
<div class="Label">
Shipping Information
</div>
<div id="divShippingInformation">
</div>
</td>
</tr>
</table>
<hr id="hrBillingShippingAfter">
<div id="divOrderDetailsBottomR">
<table id="tableOrderDetailsBottom">
<tr>
<td class="LabelColTotal">
Total:
</td>
<td class="DescrColTotal">
</td>
<td class="DataColTotal">
<!--x_amount.val-->
US $250.00
<!--end-->
</td>
</tr>
</table>
<!-- tableOrderDetailsBottom -->
</div>
<div id="divOrderDetailsBottomSpacerR">
</div>
<div class="SectionBar">
Visa ****0027
</div>
<table class="PaymentSectionTable" cellspacing="0" cellpadding="0">
<tr>
<td class="PaymentSection1">
<table>
<tr>
<td class="LabelColInfo2R">
Date/Time:
</td>
<td class="DataColInfo2R">
<!--Date/Time.1.val-->
09-Jul-2012 12:26:46 PM PT
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Transaction ID:
</td>
<td class="DataColInfo2R">
<!--Transaction ID.1.val-->
2173493354
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Authorization Code:
</td>
<td class="DataColInfo2R">
<!--x_auth_code.1.val-->
07I3DH
<!--end-->
</td>
</tr>
<tr>
<td class="LabelColInfo2R">
Payment Method:
</td>
<td class="DataColInfo2R">
<!--x_method.1.val-->
Visa ****0027
<!--end-->
</td>
</tr>
</table>
</td>
<td class="PaymentSection2">
<table>
</table>
</td>
</tr>
</table>
<div class="PaymentSectionSpacer">
</div>
</div>
<!-- entire BODY -->
</div>
<div class="PageAfter">
</div>
</body>
</html>
I want to find the location of "x_auth_code.1.val" in the string, and then obtain a substring that starts at that location and runs for a certain number of characters. The goal is to return the authorization code.

You can use indexOfSlice to find the position, and then slice(), both available through StringOps:
scala> val myString = "Hello World!"
myString: java.lang.String = Hello World!
scala> val index = myString.indexOfSlice("Wo")
index: Int = 6
scala> val slice = myString.slice(index, index+5)
slice: String = World
With your html string:
scala> htmlString.indexOfSlice("x_auth_code.1.val")
res4: Int = 2771

Why aren't you using an XML parser? Don't treat XML as strings -- you'll get bitten if you do.
Here's a regex to do it, but my advice is: DO NOT USE IT! Use xml tools.
"""\Qx_auth_code.1.val\E[^>]*>([^<]*)""".r.findFirstMatchIn(htmlString).map(_ group 1)

Related

Retrieve 'tr' by text of children

I want to select a tr by the text it contains, including the text of the children.
My html is as follows:
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>
I load it thus _soup = soup(html, "html.parser").
If I run _soup.find("span", text="Sanskrit").parent.parent.text then I get the result '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'
but if I run print(_soup.find("tr", text='\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n'))
I get None.
The issue here is that text needs an exact match. You could use a regex, or use CSS selectors with :-soup-contains():
soup.select('tr:has(:-soup-contains("Sanskrit"))')
Or, based on the comment, check whether your text is contained in each <tr>'s text:
for row in soup.select('tr'):
    if '\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n' in row.text:
        print(row)
Example
from bs4 import BeautifulSoup
html = '''
<table>
<tbody>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label2">0655-0700 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl21_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl21$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl21_Label3">24 October</span>
</td>
</tr>
<tr>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label4">Sanskrit</span>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label2">1810-1815 </span>
</td>
<td>
<a id="ctl00_ContentPlaceHolder1_GridView1_ctl22_LinkButtonDownloadPdf" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1$ctl22$LinkButtonDownloadPdf','')" style="color: Navy;
font-weight: bold;">Download</a>
</td>
<td>
<span id="ctl00_ContentPlaceHolder1_GridView1_ctl22_Label3">23 October</span>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
soup.select('tr:has(:-soup-contains("Sanskrit"))')
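The exact-match problem can also be sidestepped by comparing whitespace-normalized text instead of the raw string. Below is a minimal, dependency-free sketch of that idea. The two strings stand in for the `row.text` values from the question, and `normalize` and `wanted` are hypothetical names introduced here for illustration. With BeautifulSoup you would presumably apply the same `normalize` to each `row.get_text()`.

```python
def normalize(text):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())

# Stand-ins for the .text of the two <tr> elements in the question.
row_texts = [
    "\n\nSanskrit\n\n\n0655-0700 \n\n\nDownload\n\n\n24 October\n\n",
    "\n\nSanskrit\n\n\n1810-1815 \n\n\nDownload\n\n\n23 October\n\n",
]

# The target is written the way a person would read the row.
wanted = "Sanskrit 0655-0700 Download 24 October"

# Indices of the rows whose normalized text equals the target.
matches = [i for i, text in enumerate(row_texts) if normalize(text) == wanted]
print(matches)
```

Comparing normalized text means the lookup no longer depends on how many newlines the parser happens to keep between cells.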

I used Python Selenium to choose a "Year" from the dropdown box, clicked "search," and it apparently worked. But the data says otherwise

There is a page with tables of statistics I'm trying to pull.
The page has the default year as 2020, with a dropdown box to select different years. I wrote this code to select 2009.
from selenium import webdriver as wd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from pandas.io.html import read_html
from selenium.webdriver.support.ui import Select
import numpy as np
import re
import pandas as pd
driver=wd.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.cpbl.com.tw/standings/history')
select = Select(driver.find_element_by_id('Year'))
# select by visible text
select.select_by_visible_text('2009')
button = driver.find_elements_by_xpath("//input[@value='查詢']")[0]
button.click()
main=driver.find_elements_by_xpath('//*[(@id = "PageListContainer")]')[0]
main_att=main.get_attribute('innerHTML')
tab=pd.read_html(main_att)
I purposely didn't say driver.close() to leave the browser open, so I can look at it, and apparently the selection of 2009 worked. The browser had tables for 2009. However, the data my code pulled was still from the default year (2020). I want data from 2009. Anyone know why?
I am using Python 3.7 and Spyder 4.0.1
To select 2009 from the html-select and extract the innerHTML of the PageListContainer, you need to use WebDriverWait with visibility_of_element_located(), for example with the following locator strategies:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.cpbl.com.tw/standings/history")
Select(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select#Year")))).select_by_visible_text("2009")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@value='查詢']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id = 'PageListContainer']"))).get_attribute("innerHTML"))
Console Output:
<!--上半季戰績-->
<div class="RecordTableWrap">
<div class="record_table_caption">上半季戰績</div>
<div class="record_table_swipe_guide" style="display: none;">
<div class="desktop"></div>
<div class="mobile"></div>
</div>
<div class="record_table_scroll_ctrl" style="display: none;">
</div>
<div class="RecordTableOuter">
<div class="RecordTable">
<table>
<tbody><tr>
<th class="sticky">
<div class="sticky_wrap">
<div class="rank">排名</div>
<div class="team-w-trophy">球隊</div>
</div>
</th>
<th class="num">出賽數</th>
<th class="num">勝-和-敗</th>
<th class="num">勝率</th>
<th class="num">勝差</th>
<th class="num">中信兄弟</th>
<th class="num">樂天桃猿</th>
<th class="num">統一7-ELEVEn獅</th>
<th class="num">富邦悍將</th>
<th class="num">主場戰績</th>
<th class="num">客場戰績</th>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">1</div>
<div class="team-w-trophy">
中信兄弟
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">37-0-23</td>
<td class="num">0.617</td>
<td class="num">-</td>
<td class="num"> </td>
<td class="num">8-0-12</td>
<td class="num">16-0-4</td>
<td class="num">13-0-7</td>
<td class="num">18-0-12</td>
<td class="num">19-0-11</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">2</div>
<div class="team-w-trophy">
樂天桃猿
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">34-0-26</td>
<td class="num">0.567</td>
<td class="num">3</td>
<td class="num">12-0-8</td>
<td class="num"> </td>
<td class="num">9-0-11</td>
<td class="num">13-0-7</td>
<td class="num">18-0-12</td>
<td class="num">16-0-14</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">3</div>
<div class="team-w-trophy">
統一7-ELEVEn獅
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">26-0-34</td>
<td class="num">0.433</td>
<td class="num">11</td>
<td class="num">4-0-16</td>
<td class="num">11-0-9</td>
<td class="num"> </td>
<td class="num">11-0-9</td>
<td class="num">13-0-17</td>
<td class="num">13-0-17</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">4</div>
<div class="team-w-trophy">
富邦悍將
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">23-0-37</td>
<td class="num">0.383</td>
<td class="num">14</td>
<td class="num">7-0-13</td>
<td class="num">7-0-13</td>
<td class="num">9-0-11</td>
<td class="num"> </td>
<td class="num">13-0-17</td>
<td class="num">10-0-20</td>
</tr>
</tbody></table>
</div>
</div>
</div>
<!--上半季戰績 end-->
<!--下半季戰績-->
<div class="RecordTableWrap">
<div class="record_table_caption">下半季戰績</div>
<div class="record_table_swipe_guide" style="display: none;">
<div class="desktop"></div>
<div class="mobile"></div>
</div>
<div class="record_table_scroll_ctrl" style="display: none;">
</div>
<div class="RecordTableOuter">
<div class="RecordTable">
<table>
<tbody><tr>
<th class="sticky">
<div class="sticky_wrap">
<div class="rank">排名</div>
<div class="team-w-trophy">球隊</div>
</div>
</th>
<th class="num">出賽數</th>
<th class="num">勝-和-敗</th>
<th class="num">勝率</th>
<th class="num">勝差</th>
<th class="num">中信兄弟</th>
<th class="num">樂天桃猿</th>
<th class="num">統一7-ELEVEn獅</th>
<th class="num">富邦悍將</th>
<th class="num">主場戰績</th>
<th class="num">客場戰績</th>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">1</div>
<div class="team-w-trophy">
統一7-ELEVEn獅
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">32-1-27</td>
<td class="num">0.542</td>
<td class="num">-</td>
<td class="num">13-1-6</td>
<td class="num">10-0-10</td>
<td class="num"> </td>
<td class="num">9-0-11</td>
<td class="num">16-0-14</td>
<td class="num">16-1-13</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">2</div>
<div class="team-w-trophy">
富邦悍將
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">31-1-28</td>
<td class="num">0.525</td>
<td class="num">1</td>
<td class="num">9-1-10</td>
<td class="num">11-0-9</td>
<td class="num">11-0-9</td>
<td class="num"> </td>
<td class="num">15-1-14</td>
<td class="num">16-0-14</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">3</div>
<div class="team-w-trophy">
中信兄弟
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">30-2-28</td>
<td class="num">0.517</td>
<td class="num">1.5</td>
<td class="num"> </td>
<td class="num">14-0-6</td>
<td class="num">6-1-13</td>
<td class="num">10-1-9</td>
<td class="num">16-1-13</td>
<td class="num">14-1-15</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">4</div>
<div class="team-w-trophy">
樂天桃猿
</div>
</div>
</td>
<td class="num">60</td>
<td class="num">25-0-35</td>
<td class="num">0.417</td>
<td class="num">7.5</td>
<td class="num">6-0-14</td>
<td class="num"> </td>
<td class="num">10-0-10</td>
<td class="num">9-0-11</td>
<td class="num">16-0-14</td>
<td class="num">9-0-21</td>
</tr>
</tbody></table>
</div>
</div>
</div>
<!--下半季戰績 end-->
<!--全年戰績-->
<div class="RecordTableWrap">
<div class="record_table_caption">全年戰績</div>
<div class="record_table_swipe_guide" style="display: none;">
<div class="desktop"></div>
<div class="mobile"></div>
</div>
<div class="record_table_scroll_ctrl" style="display: none;">
</div>
<div class="RecordTableOuter">
<div class="RecordTable">
<table>
<tbody><tr>
<th class="sticky">
<div class="sticky_wrap">
<div class="rank">排名</div>
<div class="team-w-trophy">球隊</div>
</div>
</th>
<th class="num">出賽數</th>
<th class="num">勝-和-敗</th>
<th class="num">勝率</th>
<th class="num">勝差</th>
<th class="num">中信兄弟</th>
<th class="num">樂天桃猿</th>
<th class="num">統一7-ELEVEn獅</th>
<th class="num">富邦悍將</th>
<th class="num">主場戰績</th>
<th class="num">客場戰績</th>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">1</div>
<div class="team-w-trophy">
中信兄弟
</div>
</div>
</td>
<td class="num">120</td>
<td class="num">67-2-51</td>
<td class="num">0.568</td>
<td class="num">-</td>
<td class="num"> </td>
<td class="num">22-0-18</td>
<td class="num">22-1-17</td>
<td class="num">23-1-16</td>
<td class="num">34-1-25</td>
<td class="num">33-1-26</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">2</div>
<div class="team-w-trophy">
樂天桃猿
</div>
</div>
</td>
<td class="num">120</td>
<td class="num">59-0-61</td>
<td class="num">0.492</td>
<td class="num">9</td>
<td class="num">18-0-22</td>
<td class="num"> </td>
<td class="num">19-0-21</td>
<td class="num">22-0-18</td>
<td class="num">34-0-26</td>
<td class="num">25-0-35</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">3</div>
<div class="team-w-trophy">
統一7-ELEVEn獅
</div>
</div>
</td>
<td class="num">120</td>
<td class="num">58-1-61</td>
<td class="num">0.487</td>
<td class="num">9.5</td>
<td class="num">17-1-22</td>
<td class="num">21-0-19</td>
<td class="num"> </td>
<td class="num">20-0-20</td>
<td class="num">29-0-31</td>
<td class="num">29-1-30</td>
</tr>
<tr>
<td class="sticky">
<div class="sticky_wrap">
<div class="rank">4</div>
<div class="team-w-trophy">
富邦悍將
</div>
</div>
</td>
<td class="num">120</td>
<td class="num">54-1-65</td>
<td class="num">0.454</td>
<td class="num">13.5</td>
<td class="num">16-1-23</td>
<td class="num">18-0-22</td>
<td class="num">20-0-20</td>
<td class="num"> </td>
<td class="num">28-1-31</td>
<td class="num">26-0-34</td>
</tr>
</tbody></table>
</div>
</div>
</div>
<!--全年戰績 end-->
It worked when I used time.sleep(10)
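A fixed sleep works because it gives the page time to swap in the 2009 tables, but it always waits the full ten seconds. An explicit wait is essentially a polling loop that returns as soon as the content has changed. The sketch below illustrates that idea without Selenium. `wait_until`, `read_container_html` and `refreshed` are hypothetical names, and the canned responses stand in for successive reads of the PageListContainer innerHTML. It is not Selenium's own implementation.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)

# Stand-in for the page: the old table is returned twice, then the new one.
_responses = iter(["<table>2020</table>", "<table>2020</table>", "<table>2009</table>"])

def read_container_html():
    return next(_responses)

old_html = "<table>2020</table>"  # captured before clicking the search button

def refreshed():
    """Return the container HTML once it differs from the old copy, else None."""
    html = read_container_html()
    return html if html != old_html else None

new_html = wait_until(refreshed, timeout=2.0, poll=0.01)
print(new_html)
```

Waiting for the content to differ from the copy taken before the click guards against reading the container while it still holds the default year.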

How do I parse this html with Python lxml & xpath that finds the parent table of a specific span id?

Here is the HTML I don't have any control over. This is condensed HTML of the real page.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Little League</title>
</head>
<body>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<span>lot of unrelated text</span>
</table>
<table>
<tbody>
<tr>
<td class="rightTD">
<p>
<span id="teams_players">Player Teams</span>
</p>
</td>
</tr>
<tr>
<td>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr id="team_listings">
<td colspan="3">Team Listings
<br>
<br>
</td>
</tr>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Foxes</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">1</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tualatin</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<br>
<table border="1" cellspacing="0" cellpadding="0" class="tableBorder table table-bordered" width="100%">
<tbody>
<tr>
<td>
<table border="0" width="100%" class="tableData">
<tbody>
<tr>
<td>(a) </td>
<td colspan="2">Team Name </td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">Tigers</span>
</td>
</tr>
<tr>
<td>(b) </td>
<td colspan="2">Team Rank</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<span class="blue_color">3</span>
</td>
</tr>
<tr>
<td>(c) </td>
<td colspan="2">Team Location
</td>
</tr>
<tr>
<td></td>
<td colspan="2">
<table width="100%">
<tbody>
<tr>
<td>City:
<br>
<span class="blue_color">Tigard</span>
</td>
<td>State:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">Oregon</span>
</td>
<td>Country:
<br>
<span class="blue_colorLined"></span>
<br>
<span class="blue_color">United States</span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
I am trying to get to the nearest table tag that encloses the span tag with id teams_players.
I tried these but failed -
//table/span[@id="teams_players"]
ancestor::table[span[@id="teams_players"][position() = 1]]
This works but is not elegant and I prefer not to hardcode it -
//span[@id="teams_players"]/../../../../..
While //table[@class="tableData"] might seem like it should work, many tables in the HTML share that class and hold unrelated data, so this is ruled out.
Here is the code so far with my attempts (definitely not efficient; once I find a way of fetching both tables, I plan on looping through them to extract the data) -
from io import StringIO
from lxml import etree
def parse_team():
    # team data structure
    teams = []
    team_dict = { 'team': '', 'rank': '', 'location': { 'city': '', 'state': '', 'country': '' } }
    filename = 'team.html'
    f = open(filename, encoding="utf8").read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(f), parser)
    # fetch the table dom and parse each team table
    # fetch the parent table that contains teams_players span id
    team_tables = tree.xpath('ancestor::table[span[@id="teams_players"][position() = 1]]')
    print(team_tables)
    root_tables = tree.xpath('//table/span[@id="teams_players"]')
    print("root tables", root_tables)
    # this provides each team table but in full html, the same class is being used for other unrelated data
    name = tree.xpath('//table[@class="tableData"]')
    print(name)
    eachvaltr = name[0].xpath('.//tr')
    teamname = name[0].xpath('.//td[contains(text(),"Team Name")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamname", teamname)
    teamrank = name[0].xpath(
        './/td[contains(text(),"Team Rank")]//parent::tr/following-sibling::tr[1]//span[@class="blue_color"]/text()')
    print("teamrank", teamrank)
    city = name[0].xpath(
        './/td[contains(text(),"City")]//span[@class="blue_color"]/text()')
    state = name[0].xpath(
        './/td[contains(text(),"State")]//span[@class="blue_color"]/text()')
    country = name[0].xpath(
        './/td[contains(text(),"Country")]//span[@class="blue_color"]/text()')
    print(city[0], state[0], country[0])
    team_dict['team'] = teamname
    team_dict['rank'] = teamrank
    team_dict['location']['city'] = city[0]
    team_dict['location']['state'] = state[0]
    team_dict['location']['country'] = country[0]
    print(team_dict)
Desired output is a list of teams where each team is a dict.
[{'team': ['Foxes'], 'rank': ['1'], 'location': {'city': 'Tualatin', 'state': 'Oregon', 'country': 'United States'}}]
//table[.//span[@id="teams_players"]]
or
//span[@id="teams_players"]/ancestor::table
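Once the enclosing table is found, the team tables can be searched only inside it, which avoids the unrelated tableData tables elsewhere on the page. The sketch below shows that scoping idea with the standard library's xml.etree.ElementTree instead of lxml, which is a deliberate substitution. ElementTree has no ancestor axis, so the hypothetical helper `nearest_ancestor` walks a child-to-parent map. `DOC` is a made-up, well-formed miniature of the page, not the real HTML. With lxml, the XPath expressions above do the ancestor step directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed miniature of the page in the question.
DOC = """
<body>
  <table><tr><td><span>unrelated</span></td></tr></table>
  <table>
    <tr><td><p><span id="teams_players">Player Teams</span></p></td></tr>
    <tr><td>
      <table class="tableData"><tr><td><span class="blue_color">Foxes</span></td></tr></table>
      <table class="tableData"><tr><td><span class="blue_color">Tigers</span></td></tr></table>
    </td></tr>
  </table>
</body>
"""

def nearest_ancestor(root, node, tag):
    """Walk up from node to the closest ancestor named tag (None if absent)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    current = parent_of.get(node)
    while current is not None and current.tag != tag:
        current = parent_of.get(current)
    return current

root = ET.fromstring(DOC)
marker = root.find(".//span[@id='teams_players']")
outer = nearest_ancestor(root, marker, "table")

# Search for team tables only inside the table that encloses the marker span.
team_names = [
    table.find(".//span[@class='blue_color']").text
    for table in outer.findall(".//table[@class='tableData']")
]
print(team_names)
```

Scoping the second query to `outer` is what replaces the hardcoded chain of `..` steps from the question.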

beautifulsoup for looping and getting text and Href

I'm in a bit of a pinch here:
I am trying to get data from an ASP site which is really messy.
I'm trying to use a for loop to get the href and the text of all the rows of the 4th table on the site, so I first did:
table = soup.findAll('table')[3]
Then from this table I need to get all text inside the <tr> tags and the href's of the <a> inside.
I tried something like this:
for product in table.findAll('tbody'):
    product_title = product.find('tr').text
    product_link = product.find('a')['href']
    print (product_title, product_link)
But I get nothing in return.
The table I'm working on:
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Ankara
</td>
<td width="10">
-
</td>
<td>
Konya
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Van
</td>
<td width="10">
-
</td>
<td>
Istanbul
</td>
<td colspan="2">
</td>
</tr>
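The HTML you posted shows no tbody tags, so table.findAll('tbody') presumably returns an empty list and the loop body never runs. Instead of going through tbody, you can get all tr tags directly.
Based on your snippet, you can use the following code to extract the data from the table.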
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')  # text holds the HTML shown in the question
all_products = []
for tr in soup.find_all('tr'):
    row_text = tr.get_text(separator=' ', strip=True)
    if row_text:  # skip spacer rows that contain no text
        a_tag = tr.find('a')
        if a_tag:
            product_link = a_tag.attrs['href']
            all_text = row_text + ' ' + product_link
            all_products.append(all_text.split(' '))
print(all_products)
Output is:
[['4342141', '25.07.2018', '09:00', 'Ankara', '-', 'Konya', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Van', '-', 'Istanbul', 'free.asp?detail=hide&c_id=4134123']]
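The same row-by-row idea can be sketched with only the standard library, which also keeps each cell (such as the date and time) together instead of splitting on spaces. This is an illustrative alternative, not the answer's method. `RowCollector` and `SAMPLE` are hypothetical names, and `SAMPLE` is a shortened copy of the markup in the question.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the cell texts and the first link of every non-empty <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows: (list of cell texts, href or None)
        self._texts = None   # texts of the row being read; None outside a row
        self._href = None    # first href seen in the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._texts, self._href = [], None
        elif tag == "a" and self._texts is not None and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._texts is not None and data.strip():
            self._texts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self._texts is not None:
            if self._texts:  # skip spacer rows that contain no text
                self.rows.append((self._texts, self._href))
            self._texts = None

SAMPLE = """
<table>
<tr bgcolor="#EFEFEF">
<td><a href="free.asp?detail=hide&c_id=4342141"><img src="pic/bullet.gif"/></a></td>
<td> 4342141 </td><td> 25.07.2018 09:00 </td><td> Ankara </td><td> - </td><td> Konya </td>
</tr>
<tr bgcolor="#EFEFEF" height="3"><td colspan="10"> </td></tr>
</table>
"""

collector = RowCollector()
collector.feed(SAMPLE)
collector.close()
for texts, href in collector.rows:
    print(texts, href)
```

Note that no tbody element is involved anywhere: the parser only reports tags that actually occur in the source.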

why does the Gmail app on iPhone ignore media queries?

I'm putting together an email newsletter for a client, and nearly every email client and app gives a reasonably readable result (they still need some work), except for the Gmail app. Once the breakpoint is reached, the newsletter should display as one column, but it does not.
I'm not sure why this is. Is there a way to force the app to display the newsletter in desktop mode shrunken down to fit the screen width? Or is there a way to target Gmail with a conditional so that the content will obey the media query?
Related: the Litmus tests I've run look nothing like the result I'm getting on my iPhone. Why is that?
http://codepen.io/sabaeus/pen/ZGQWdZ?editors=100
This is in my document head:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="width=device-width; initial-scale=1.0; maximum-scale=1.0; user-scalable=no;">
<meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE" />
<title>Title</title>
</head>
And then this is in my body:
<body>
<!-- background table start -->
<table width="100%" bgcolor="#ffffff" cellpadding="0" cellspacing="0" border="0" id="background_table">
<tbody>
<tr>
<td>
<!-- end of background table start -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td style="display:inline-block;" width="100%"><img src="http://placehold.it/197x41" style="display:block;"></td>
</tr>
<tr>
<td width="100%" height="20"> </td>
</tr>
<tr>
<td width="100%" height="100">
<img src="http://placehold.it/699x400" style="display:block;">
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- hello/quick links -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="393" class="column" style="height:100%;display:inline-block;margin-right:53px">Hello,
<br>
<br> Intro text
</td>
<td width="230" class="column" style="height:100%;padding:20px;">
<span class="h1">Quick Links</span>
<br>
<br>
<br>
<span style="display:inline-block; padding-bottom:5px;"><strong>Link 1</strong></span>
<br> Info for link 1
<br>
<br>
<span style="display:inline-block; padding-bottom:5px;"><strong>Link 2</strong></span>
<br>Link
<br>
<br>
<span style="display:inline-block; padding-bottom:5px;"><strong>Link 3</strong></span>
<br>Link
</td>
</tr>
</tbody>
</table>
<!-- hello/quick links -->
<!-- marketing communications -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td>
<span class="h1">Headline 1</a>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="148" class="column-img" style="height:100%;display:inline-block;margin-right:17px">
<img src="http://placehold.it/148x111" style="display:block;">
</td>
<td width="503" class="column-text-1" style="padding:20px"><span style="font-size:18px;display:inline-block; padding-bottom:5px;"><strong>Sub head</strong></span>
<br>Info info info info info info info info info info info</td>
</tr>
</tbody>
</table>
<!-- marketing communications -->
<!-- new print collateral -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1">Headline 2</span>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="148" class="column-img" style="height:100%;display:inline-block;margin-right:17px"><img src="http://placehold.it/148x220" style="display:block;"></td>
<td width="503" class="column-text-1" style="padding:20px"><span style="font-size:18px;display:inline-block; padding-bottom:5px;"><strong>Sub head</strong></span>
<br> info info info info</td>
</tr>
</tbody>
</table>
<!-- new print collateral -->
<!-- advertising -->
<!-- brand ads -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1" style="display:inline-block;">Headline 3</span>
<br>
<span style="font-size:18px;">
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td style="padding-top:0px;">
<span style="font-size:18px;"><strong>Sub head</strong></span>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="350" class="column" style="height:100%;margin-right:131px"><img src="http://placehold.it/246x264" style="min-width:350px; display:block;"></td>
<td style="height:100%;" width="350" class="column">
<img src="http://www.placehold.it/267x324" style="min-width:350px; display:block;"></td>
</tr>
</tbody>
</table>
<!-- brand ads -->
<!-- community ads -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td style="padding-top:30px;padding-bottom:0px;">
<span style="font-size:18px;"><strong>Sub head</strong></span>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="350" class="column" style="height:100%;margin-right:131px">
<img src="http://placehold.it/197x320" style="min-width:350px; display:block">
<table>
<tbody>
<tr>
<td>
info info info info
</td>
</tr>
</tbody>
</table>
</td>
<td style="height:100%;" width="350" class="column"><img src="http://placehold.it/212x328" style="min-width:350px;display:block">
<table>
<tbody>
<tr>
<td style="padding-top:10px">
<br> info info info info info
</td>
</tr>
</tbody>
</table>
</td>
</td>
</tr>
</tbody>
</table>
<!-- community ads -->
<!-- advertising -->
<!-- talent acquisition -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1">Headline 4</span>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<!-- <tr>
<td width="100%" height="10"> </td>
</tr>
-->
<tr>
<td width="100%" height="100">
<a href="#" target="_blank">
<img src="http://placehold.it/668x195" style="width:100%;display:block;"></a>
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- text -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
</tr>
<tr>
<td width="100%" height="100">
<span style="font-size:18px;display:inline-block; padding-bottom:5px;"><strong>Sub head</strong></span>
<br>Info info info
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- text -->
<!-- talent acquisition -->
<!-- new expert advice -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1">Headline 5</span>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td>
<span style="display:inline-block;">Info info info</span>
</td>
</tr>
</tbody>
</table>
<!-- 0 -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="345" class="column" style="height:100%;display:inline-block;margin-right:46px"><img src="http://placehold.it/345x281" style="width:100%;display:block;"></td>
<td width="322" class="column" style="padding:20px;">
<span style="display:inline-block; padding-bottom:5px;"><strong>info info</strong></span>
<br>info info info</td>
</tr>
</tbody>
</table>
<!-- 0 -->
<!-- 1 -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="345" class="column" style="height:100%;display:inline-block;margin-right:46px"><img src="http://placehold.it/345x281" style="width:100%;display:block;"></td>
<td width="322" class="column" style="padding:20px;">
<span style="display:inline-block; padding-bottom:5px;"><strong>info info</strong></span>
<br>info info info</td>
</tr>
</tbody>
</table>
<!-- 1 -->
<!-- 2 -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="345" class="column" style="height:100%;display:inline-block;margin-right:46px"><img src="http://placehold.it/345x281" style="width:100%;display:block;"></td>
<td width="322" class="column" style="padding:20px">
<span style="display:inline-block; padding-bottom:5px;"><strong>info info</strong></span>
<br>
info info info</td>
</tr>
</tbody>
</table>
<!-- 2 -->
<!-- 3 -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="345" class="column" style="height:100%;display:inline-block;margin-right:46px"><img src="http://placehold.it/345x281" style="width:100%;display:block;"></td>
<td width="322" class="column" style="padding:20px;">
<span style="display:inline-block; padding-bottom:5px;"><strong>info info</strong></span>
<br>info info info
</td>
</tr>
</tbody>
</table>
<!-- 3 -->
<!-- new expert advice -->
<!-- epic speaker videos -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1">Headline 6</span>
</td>
</tr>
</tbody>
</table>
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="345" style="height:100%;display:inline-block;margin-right:17px;" class="column"><img src="http://placehold.it/258x154" style="width:100%;display:block;"></td>
<td width="423" class="column" style="padding:20px;">info info info info info</td>
</tr>
</tbody>
</table>
<!-- epic speaker videos -->
<!-- upcoming events -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td class="top-buffer">
<span class="h1">Headline 7</span>
</td>
</tr>
</tbody>
</table>
<table width="800" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<!-- <tr>
<td width="100%" height="10"> </td>
</tr>
-->
<tr>
<td width="100%" height="100">
<span style="font-size:17px"><strong>
May is: Physical Fitness Month / Jewish American Heritage Month</strong></span>
<table>
<tbody>
<tr>
<td>
<ul style="line-height: 150%; width: 582px;">
<li style="list-style-type:none; padding-left:10px;background-color:#ededed">May 10th - <span style="font-weight:300">Mother’s Day</span> </li>
<li style="list-style-type:none;padding-left:10px;">May 25th - <span style="font-weight:300">Memorial Day</span> </li>
<li style="list-style-type:none; padding-left:10px; background-color:#ededed">June 6th - <span style="font-weight:300">D-Day</span></li>
<li style="list-style-type:none;padding-left:10px;">June 14th - <span style="font-weight:300">Flag Day</span></li>
<li style="list-style-type:none; padding-left:10px; background-color:#ededed">June 21st - <span style="font-weight:300">Father’s Day</span></li>
<li style="list-style-type:none;padding-left:10px;">June 21st - <span style="font-weight:300">Alzheimer’s Association Longest day (click below for details)</span></li>
</ul>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- alzheimer's -->
<table width="699" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="100%" height="10"> </td>
</tr>
<tr>
<td width="100%" height="100">
<img src="http://placehold.it/454x174" style="width:100%;display:block;">
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- alzheimer's -->
<!-- prior -->
<table width="800" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="100%" height="10"> </td>
</tr>
<tr>
<td width="100%" height="100" style="padding:20px;">
<span style="display:inline-block;padding-bottom:5px">
Info info info
</span>
</td>
</tr>
<tr>
<td width="100%" height="10"> </td>
</tr>
</tbody>
</table>
<!-- prior-->
<!-- upcoming events -->
<!-- footer -->
<table width="600" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<tr>
<td width="100%">
<table width="600" cellpadding="0" cellspacing="0" border="0" align="center" class="body_table">
<tbody>
<!-- Spacing -->
<tr>
<td height="20" style="font-size:1px; line-height:1px; mso-line-height-rule: exactly;"> </td>
</tr>
<!-- Spacing -->
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<!-- end of footer -->
<!-- end of background table-->
</td>
</tr>
</tbody>
</table>
</body>
</html>
CSS:
@import url(http://fonts.googleapis.com/css?family=Open+Sans:400italic,400,300,700);
body {
width: 100% !important;
-webkit-text-size-adjust: 100%;
-ms-text-size-adjust: 100%;
margin: 0;
padding: 0;
}
#background_table {
margin: 0;
padding: 0;
width: 100%!important;
line-height: 100%!important;
}
img {
outline: none;
text-decoration: none;
border: none;
-ms-interpolation-mode: bicubic;
max-width: 100%;
height: auto;
display: block;
}
table td {
border-collapse: collapse;
vertical-align: middle;
font-family: 'Open Sans', Trebuchet, sans-serif;
font-size: 17px;
line-height:120%;
color: #000;
}
table td[class="column"] {
height: 100px;
width: 393px;
padding: 15px;
}
table {
border-collapse: collapse;
mso-table-lspace: 0pt;
mso-table-rspace: 0pt;
}
table[class="body_table"] {
width: 699px;
margin-top: 21px;
}
table span[class="h1"] {
font-weight:300;
font-size:35px;
color:#ff9001;
}
table td[class="top-buffer"] {
padding-top: 25px;
}
@media only screen and (max-width: 640px) {
table[class="body_table"] {
width: 600px!important;
}
table td[class="column"] {
width: 100%!important;
display: block!important;
}
table span[class="h1"] {
line-height:110%!important;
font-size:23px!important;
}
*[class="mob-hide"] { display: none !important }
}
The Gmail app and Gmail's web client strip all class- and ID-based styling out of your style tags. More info here.
Gmail's web client does allow some element-level styling, as described in the link above, but everything else has to be inline. That rules out responsive design there, because media queries cannot be inlined.
Your best bet is a fluid design (percentage-based, so it fits both small and large screens) or a mobile-first hybrid design: one built first for Gmail/Outlook that then uses media queries and style tags to make it work in all other email clients.
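As a rough illustration of the fluid approach, here is a minimal sketch of one section of the email with the class-based rules moved inline and the fixed width replaced by a percentage. The values are copied from the stylesheet in the question. The 100% width with a max-width cap is an assumption about how the layout should behave, not a tested fix, and some clients may ignore max-width, so check it in the clients you care about.

```html
<!-- Fluid container: fills narrow screens and stops growing at 699px. -->
<table width="100%" cellpadding="0" cellspacing="0" border="0" align="center"
       style="width:100%; max-width:699px; margin:21px auto 0; border-collapse:collapse;">
  <tbody>
    <tr>
      <!-- Inline equivalent of the table td and td[class="top-buffer"] rules. -->
      <td style="padding-top:25px; vertical-align:middle; font-family:'Open Sans', Trebuchet, sans-serif; font-size:17px; line-height:120%; color:#000;">
        <!-- Inline equivalent of the span[class="h1"] rule. -->
        <span style="font-weight:300; font-size:35px; color:#ff9001;">Headline 4</span>
      </td>
    </tr>
    <tr>
      <td style="vertical-align:middle;">
        <!-- Percentage width lets the image scale with its container. -->
        <img src="http://placehold.it/668x195" alt=""
             style="width:100%; height:auto; display:block; border:none; outline:none; text-decoration:none;">
      </td>
    </tr>
  </tbody>
</table>
```

Because nothing here depends on a class, an ID, or a media query, the layout should survive a client that drops those rules from the style tag. The existing media queries can stay in the style tag as an enhancement for clients that honor them.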

Resources