I am trying to scrape a page with BeautifulSoup in Python 3

I am trying to scrape this page: http://sports.williamhill.com/bet/en-gb/results///E/10337992/today//////+England+v+Ukraine.html
I am using requests and BeautifulSoup in Python 3. I am specifically trying to extract the text that says 'Draw (13/5)' within the <td> tag. I have tried to navigate to that spot with BeautifulSoup in many different ways, and it is as if BS4 cannot 'see' it.
When I search for all <tr> within <tbody>, it only gives me the first row.
<tbody>
<tr>
<th width="41%" scope="col">Market</th>
<th width="3%"></th>
<th width="45%" scope="col">Result</th>
<th width="10%" scope="col" style="text-align: center">Settled</th>
</tr>
<tr class="rowGroup top bottom">
<td style="padding-top: 0pt;" class="borderBottom" rowspan="1">Match Betting</td>
<td>
</td>
<td>
Draw (13/5)
</td>
<td style="padding-top: 0pt; text-align: center" class="borderBottom" rowspan="1">Y</td>
</tr>
How could I best retrieve that information with BeautifulSoup? Thank you in advance.
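One possible approach, sketched under the assumption that the rows arrive in the parsed HTML as shown above: look up the row by the text of its first cell instead of navigating by position. The enclosing <table> wrapper and the helper name result_for_market are illustrative and not taken from the original page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the <table> wrapper is an assumption.
HTML = """
<table>
<tbody>
<tr>
<th width="41%" scope="col">Market</th>
<th width="3%"></th>
<th width="45%" scope="col">Result</th>
<th width="10%" scope="col" style="text-align: center">Settled</th>
</tr>
<tr class="rowGroup top bottom">
<td style="padding-top: 0pt;" class="borderBottom" rowspan="1">Match Betting</td>
<td>
</td>
<td>
Draw (13/5)
</td>
<td style="padding-top: 0pt; text-align: center" class="borderBottom" rowspan="1">Y</td>
</tr>
</tbody>
</table>
"""


def result_for_market(html, market):
    """Return the 'Result' cell text for the row whose first cell equals `market`."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # The header row has only <th> cells, so it is skipped here.
        if len(cells) >= 3 and cells[0].get_text(strip=True) == market:
            return cells[2].get_text(strip=True)
    return None


print(result_for_market(HTML, "Match Betting"))
```

If the live page still yields only the first row, the markup may be malformed enough that the choice of parser matters; passing "lxml" or "html5lib" instead of "html.parser" to BeautifulSoup is worth trying, since each parser repairs broken HTML differently.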

Related

Use HTML Table without headers

I have a table that doesn't follow the normal guidance for HTML tables.
My best option is probably to just create a proper JSON object and use that.
But I'd like to ask whether there is any option for parsing an HTML table "without headers" and defining the headers in Tabulator instead.
I know the case is odd, but I'd just like to hear :-)
Example where there is no thead and no th in the HTML source:
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr height="16">
<td colspan="16">
Something
</td>
<td colspan="16">
14
</td>
<td colspan="16">
2020-01-28
</td>
</tr>
</tbody>
</table>

I'm afraid not.
When Tabulator is built on a table element, it parses the HTML to create a JavaScript object for each row of the table, using the column headers as property names.
Without headers it would have no reasonable way to map the column values onto an object.
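The JSON route mentioned in the question can be sketched roughly as follows, assuming the table is available as a string on the server side. It uses only Python's standard library; the field names "title", "count" and "date" are hypothetical and would be whatever property names the Tabulator column definitions expect.

```python
import json
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html, field_names):
    """Map each row's cells onto the given field names, in order."""
    collector = RowCollector()
    collector.feed(html)
    return [dict(zip(field_names, row)) for row in collector.rows]


TABLE = """
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr height="16">
<td colspan="16">Something</td>
<td colspan="16">14</td>
<td colspan="16">2020-01-28</td>
</tr>
</tbody>
</table>
"""

# Field names are hypothetical; they stand in for the missing headers.
records = table_to_records(TABLE, ["title", "count", "date"])
print(json.dumps(records, indent=2))
```

The resulting list of objects could then be handed to Tabulator as its data, with the column titles defined in the Tabulator configuration rather than in the HTML.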

I solved the problem like this:
<table border="0" cellpadding="0" cellspacing="0">
<tr height="16" hidden>
<td colspan="16">
</td>
<td colspan="16">
</td>
<td colspan="16">
</td>
</tr>
<tr height="16">
<td colspan="16">
Something
</td>
<td colspan="16">
14
</td>
<td colspan="16">
2020-01-28
</td>
</tr>
</table>

Get tr having specific th value from the table having an id using Xpath

I am using Python 3.6 with an XPath library. Querying inside the table gives me an empty list, and I need to reach a specific th.
My tr contents are dynamically generated, and I need to reach the tr that has a specific th value. For example, in the HTML below, Rank appears in the second tr, but it can appear in any tr; it doesn't have a fixed index. I need to get the href from the tr that has the Rank th.
My HTML file:
<tbody>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Number
</th>
<td class="a-size-base">
B003NR57BY
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Rank
</th>
<td>
<span>
<span>#3 in Computer Mice</span>
<br>
</span>
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Created Date
</th>
<td class="a-size-base">
June 7, 2010
</td>
</tr>
</tbody>
</table>
Python code:
listings_details = parser.xpath(XPATH_PRODUCT_DETAILS)
for row in listings_details:
    th = row.xpath("./th/text()")
    # Guard against rows without a th before indexing.
    if th and th[0].strip() == 'Rank':
        categories = row.xpath("./td/span/span//text()")
        qid_url = row.xpath("./td/span/span//@href")
I expect the output to be
Rank: 3,
url : /gp/bestsellers/pc/11036491/ref=pd_zg_hrsr_pc_1_1_last,
category: Computer Mice

Need to get the href from the tr having the Rank th.
Use:
/table/tbody/tr[normalize-space(th)='Rank']/td//a/@href
Note: this works for your provided fragment (now well-formed). You will still need to add a context for selecting the table element.
<table>
<tbody>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Product Number</th>
<td class="a-size-base">B003NR57BY</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Rank</th>
<td>
<span>
<span>#3 in
Computer Mice
</span>
<br/>
</span>
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">Created Date</th>
<td class="a-size-base">June 7, 2010</td>
</tr>
</tbody>
</table>
Test in http://www.xpathtester.com/xpath/53808ee94dfbc5b38f12791cf857ffb9
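A minimal sketch of how that expression might be used from Python, assuming the XPath library in question is lxml. The fragment in the question contains no <a> element, so the link below is an assumption added for illustration, with its href taken from the expected output in the question.

```python
from lxml import html

# Shortened fragment; the <a href> element is hypothetical, since the
# question's markup does not show where the link actually sits.
FRAGMENT = """
<table>
<tbody>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Number
</th>
<td class="a-size-base">B003NR57BY</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Rank
</th>
<td>
<span>
<span>#3 in <a href="/gp/bestsellers/pc/11036491/ref=pd_zg_hrsr_pc_1_1_last">Computer Mice</a></span>
<br>
</span>
</td>
</tr>
</tbody>
</table>
"""

tree = html.fromstring(FRAGMENT)

# normalize-space() trims the whitespace around "Rank", so the row is
# matched wherever it appears in the table.
rank_rows = tree.xpath("//tr[normalize-space(th)='Rank']")
hrefs = tree.xpath("//tr[normalize-space(th)='Rank']/td//a/@href")

# Collapse the cell's text into a single line.
rank_text = rank_rows[0].xpath("normalize-space(td)")

print(rank_text)
print(hrefs)
```

Because the predicate tests the th content rather than a row index, the same query keeps working when the Rank row moves to a different position.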

how to extract blue color hidden text between rows in table using python beautifulsoup

I am trying to crawl all the hidden comments in the table rows, after rows 2 and 3, but I fail to extract them.
I have tried the code below to extract these comments, but it fails.
My code is below. Could someone please help me crack this problem?
from bs4 import BeautifulSoup, Comment
import requests
r = requests.get('http://www.esuppliersindia.com/krishna-agro-traders/aboutus-p17322178-u10731500-swa.html')
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table', class_='text-listing')
trs = table.find_all('tr')
for tr in trs[2:3]:
    print(tr.text)
for tr in trs[3:4].find_next_sibling('td'):
    print(tr.text)

I am not sure whether the comments inside the table are what you are looking for, but the code below extracts them.
from bs4 import BeautifulSoup,Comment
import requests
r =requests.get('http://www.esuppliersindia.com/krishna-agro-traders/aboutus-p17322178-u10731500-swa.html')
soup = BeautifulSoup(r.text,'lxml')
table = soup.find('table',class_='text-listing')
comments=table.find_all(string=lambda text:isinstance(text,Comment))
print(comments[0].split('</tr>')[0])
for i in range(1, len(comments)):
    print(comments[i])
It prints output like this:
<td align="right" bgcolor="#FFFFFF" class="text-f11-b">No. Of Employees</td>
<td bgcolor="#FFFFFF" class="text-f11">10</td>
<tr>
<td align="right" bgcolor="#FFFFFF" class="text-f11-b">Export Turnover</td>
<td bgcolor="#FFFFFF" class="text-f11"></td>
</tr>
<tr>
<td align="right" valign="top" bgcolor="#FFFFFF" class="text-f11-b">Annual Turnover</td>
<td valign="top" bgcolor="#FFFFFF" class="text-f11">10 </td>
</tr>
<tr>
<td align="right" valign="top" bgcolor="#FFFFFF" class="text-f11-b">Import Turnover</td>
<td valign="top" bgcolor="#FFFFFF" class="text-f11"> </td>
</tr>
<tr>
<td align="right" valign="top" bgcolor="#ffffff" class="text-f11-b">Bankers</td>
<td valign="top" bgcolor="#ffffff" class="text-f11">Hdfc Bank </td>
</tr>
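Since each comment holds ordinary table markup, one option is to parse the comment text a second time and read the cells directly. The following is a minimal sketch on a small inline sample rather than the live page; the visible "Year Established" row and its value are hypothetical, and the commented-out row mirrors the output above.

```python
from bs4 import BeautifulSoup, Comment

# Inline stand-in for the real page; the first row is a hypothetical example.
HTML = """
<table class="text-listing">
<tr><td class="text-f11-b">Year Established</td><td class="text-f11">2005</td></tr>
<!-- <tr>
<td align="right" bgcolor="#FFFFFF" class="text-f11-b">No. Of Employees</td>
<td bgcolor="#FFFFFF" class="text-f11">10</td>
</tr> -->
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="text-listing")

hidden = {}
for comment in table.find_all(string=lambda s: isinstance(s, Comment)):
    # Re-parse the comment body as HTML so its rows can be walked normally.
    inner = BeautifulSoup(comment, "html.parser")
    for row in inner.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            hidden[cells[0]] = cells[1]

print(hidden)
```

This yields label/value pairs instead of raw markup, which may be easier to work with than splitting the comment strings by hand.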

MorningStar KeyStat to pandas Dataframe

I am trying to read the key stats from Morningstar, and I know the data is HTML wrapped in JSON. So far I can make a request that gets the JSON and then parse the HTML with BeautifulSoup:
url = 'http://financials.morningstar.com/ajax/keystatsAjax.html?t=tou&culture=en-CA&region=CAN'
lm_json = requests.get(url).json()
ksContent = BeautifulSoup(lm_json["ksContent"],"html.parser")
What seems a bit weird to me is that the HTML under 'ksContent' contains the actual data as a table. I am not a fan of HTML and am wondering how I can turn it all into a nice pandas DataFrame. As the table is long, here is part of it:
<table cellpadding="0" cellspacing="0" class="r_table1 text2">
<colgroup>
<col width="23%"/>
<col span="11" width="7%"/>
</colgroup>
<thead>
<tr>
<th align="left" scope="row"></th>
<th align="right" id="Y0" scope="col">2008-12</th>
<th align="right" id="Y1" scope="col">2009-12</th>
<th align="right" id="Y2" scope="col">2010-12</th>
<th align="right" id="Y3" scope="col">2011-12</th>
<th align="right" id="Y4" scope="col">2012-12</th>
<th align="right" id="Y5" scope="col">2013-12</th>
<th align="right" id="Y6" scope="col">2014-12</th>
<th align="right" id="Y7" scope="col">2015-12</th>
<th align="right" id="Y8" scope="col">2016-12</th>
<th align="right" id="Y9" scope="col">2017-12</th>
<th align="right" id="Y10" scope="col">TTM</th>
</tr>
</thead>
<tbody>
<tr class="hr">
<td colspan="12"></td>
</tr>
<tr>
<th class="row_lbl" id="i0" scope="row">Revenue <span>CAD Mil</span></th>
<td align="right" headers="Y0 i0">—</td>
<td align="right" headers="Y1 i0">40</td>
<td align="right" headers="Y2 i0">212</td>
<td align="right" headers="Y3 i0">349</td>
<td align="right" headers="Y4 i0">442</td>
<td align="right" headers="Y5 i0">759</td>
<td align="right" headers="Y6 i0">1,379</td>
<td align="right" headers="Y7 i0">1,074</td>
<td align="right" headers="Y8 i0">1,125</td>
<td align="right" headers="Y9 i0">1,662</td>
<td align="right" headers="Y10 i0">1,760</td>
</tr> ...
The header tr defines the ids Y0, Y1 ... Y10 for the actual dates, and the cells in the following tr refer to them through their headers attribute.
Your help is appreciated!

You can use read_html() to convert it into a list of DataFrames:
import requests
import pandas as pd
url = 'http://financials.morningstar.com/ajax/keystatsAjax.html?t=tou&culture=en-CA&region=CAN'
lm_json = requests.get(url).json()
df_list=pd.read_html(lm_json["ksContent"])
You can iterate through it and get the dataframes one by one. You can also use dropna() to get rid of the NaN only rows.
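The clean-up step can be sketched offline on a shortened copy of the table from the question, assuming pandas with an HTML parser such as lxml is installed. The variable name KS_CONTENT is illustrative and stands in for lm_json["ksContent"]; only three of the year columns are kept for brevity.

```python
from io import StringIO

import pandas as pd

# Shortened stand-in for lm_json["ksContent"], based on the table in the question.
KS_CONTENT = """
<table cellpadding="0" cellspacing="0" class="r_table1 text2">
<thead>
<tr>
<th align="left" scope="row"></th>
<th align="right" id="Y8" scope="col">2016-12</th>
<th align="right" id="Y9" scope="col">2017-12</th>
<th align="right" id="Y10" scope="col">TTM</th>
</tr>
</thead>
<tbody>
<tr class="hr">
<td colspan="4"></td>
</tr>
<tr>
<th class="row_lbl" id="i0" scope="row">Revenue <span>CAD Mil</span></th>
<td align="right" headers="Y8 i0">1,125</td>
<td align="right" headers="Y9 i0">1,662</td>
<td align="right" headers="Y10 i0">1,760</td>
</tr>
</tbody>
</table>
"""

# Wrapping the markup in StringIO avoids the deprecation warning that newer
# pandas versions raise for literal HTML strings.
frames = pd.read_html(StringIO(KS_CONTENT), thousands=",")

df = frames[0].dropna(how="all")   # drop the empty separator row
df = df.set_index(df.columns[0])   # use the row labels as the index
df.index.name = None

print(df)
```

Setting the first column as the index is a design choice rather than a requirement; it makes lookups such as df.loc["Revenue CAD Mil", "TTM"] possible.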

How to pass product TVs to SimpleCart's scGetCart snippet?

I need some TVs (weight, dimensions, etc) I've associated with my products to appear in the Cart page of my SimpleCart site.
Problem is I have no idea how to do this. I don't understand how the SimpleCart cart is built and there isn't documentation for this.
Would anyone know how I can show TVs associated with each product in the cart output chunk?
The cart snippet has the following code which gets data from the cart and puts it into Chunks:
$sc = $modx->getService('simplecart','SimpleCart',$modx->getOption('simplecart.core_path',null,$modx->getOption('core_path').'components/simplecart/').'model/simplecart/',$scriptProperties);
if (!($sc instanceof SimpleCart)) return '';
 
$controller = $sc->loadController('Cart');
$output = $controller->run($scriptProperties);
The output Chunk looks like:
<div id="simplecart">
<form action="[[~[[*id]]]]" method="post" id="form_cartoverview">
<input type="hidden" name="updatecart" value="true" />
<table>
<tr>
<th class="desc">[[%simplecart.cart.description]]</th>
<th class="price">[[%simplecart.cart.price]]</th>
<th class="quantity">[[%simplecart.cart.quantity]]</th>
[[+cart.total.vat_total:notempty=`<th class="quantity">[[%simplecart.cart.vat]]</th>`:isempty=``]]
<th class="subtotal">[[%simplecart.cart.subtotal]]</th>
<th> </th>
</tr>
[[+cart.wrapper]]
[[+cart.total.discount:notempty=`<tr class="total first discount">
<td colspan="[[+cart.total.vat_total:notempty=`3`:isempty=`2`]]"> </td>
<td class="label">[[%simplecart.cart.discount]]</td>
<td class="value">- [[+cart.total.discount_formatted]]</td>
<td class="extra">[[+cart.total.discount_percent:notempty=`([[+cart.total.discount_percent]]%)`:isempty=` `]]</td>
</tr>`:isempty=``]]
[[+cart.total.vat_total:notempty=`
<tr class="total [[+cart.total.discount:notempty=`second`:isempty=`first`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_ex_vat]]</td>
<td class="value">[[+cart.total.price_ex_vat_formatted]]</td>
<td class="extra"> </td>
</tr>
[[+cart.vat_rates]]
<tr class="total [[+cart.total.discount:notempty=`third`:isempty=`second`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_vat]]</td>
<td class="value">[[+cart.total.vat_total_formatted]]</td>
<td class="extra"> </td>
</tr>
<tr class="total [[+cart.total.discount:notempty=`fourth`:isempty=`third`]]">
<td colspan="3"> </td>
<td class="label">[[%simplecart.cart.total_in_vat]]</td>
<td class="value">[[+cart.total.price_formatted]]</td>
<td class="extra"> </td>
</tr>
`:isempty=`
<tr class="total [[+cart.total.discount:notempty=`second`:isempty=`first`]]">
<td colspan="2"> </td>
<td class="label">[[%simplecart.cart.total]]</td>
<td class="value">[[+cart.total.price_formatted]]</td>
<td class="extra"> </td>
</tr>
`]]
</table>
<div class="submit">
<input type="submit" value="[[%simplecart.cart.update]]" />
</div>
</form>

This does appear to be documented:
Product Options (TVs)
and to output them:
Modifying the Product Template
It appears that you would just output them normally [[*myProductOptions]]
Though it appears that your template is using placeholders, so I would try
[[+cart.myProductOptions]] as well. If all else fails, you might try debugging the SimpleCart class and dumping the array of product data before it populates the chunk; there might be a clue in there.

Found (through trial and error) you must use:
[[+product.tv.name_of_tv]]
