How to extract text strings between two tags at different levels with Python? - python-3.x

I want to extract all the content between two tags that sit at different nesting levels.
After searching Google, I couldn't find an effective solution for my need.
I hope you have a solution in Python.
Please see the HTML below:
<span id='info'>
<span>
<span class='cl'>text1</span>
text2 is Chinese string.
text3
</span>
<br 1>
<span class='cl'> text4</span> text5
<br 2>
<span class='cl'>
<span>text7</span>
text8
text9
</span>
<br 3>
<span class='cl'> text10</span> text11
<br 4>
<span class='cl'> text12</span> text13
<br 5>
</span>
Before the first <br 1> tag, the HTML on some pages looks like one of the following:
# Situation one:
<span id='info'>
<span>
<span class='cl'>text1</span>
text2 is Chinese string.
text2 is Chinese string.
text3
</span>
<br 1>
<span class='cl'> text4</span> text5
......other html....
</span>
# Situation two:
<span id='info'>
<span class='cl'>text1</span>
text2 is Chinese string.
text3
<br 1>
<span class='cl'> text4</span> text5
......other html....
</span>
I want to extract the contents:
text1 text2 text3 .... text13
I tried many approaches with XPath and bs4, but none of them did what I need.
Could you tell me the right way to use XPath, bs4, or another method to get the expected output above?
Thank you in advance!
The XPath I tried:
str = response.xpath("//span[@id='info']/descendant::span[contains(text(),'text1')]/following::br[1]/preceding-sibling::node()")
str = str.xpath('string(.)').extract()
print(str)
Then I got output like this:
[' \n \n text1 \n \n tex2 \n tex3 \n \n]
The above is the content before the first <br> tag. The HTML inside <span id='info'> is not stable across pages, even though they come from the same website, so I have to extract the content between each pair of neighbouring <br> tags separately.
BeautifulSoup didn't work for me either, and I haven't studied it much.
For my purposes, only the <br> tags have stable positions, so I want to use the <br> tags to locate the information (one <br>, one piece of information).
So, how can I do this?

Give this a try for a BeautifulSoup solution... It's almost completely stolen from this answer: https://stackoverflow.com/a/1983219/684776
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text that lives in non-rendered parts of the document.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Strip surrounding whitespace and drop strings that end up empty.
    return [t.strip() for t in visible_texts if t.strip()]
html = """
<span id='info'>
<span class='cl'>
<span>text1</span>
text2
text3
</span>
<br 1>
<span class='cl'> text4</span> text5
<br 2>
<span class='cl'>
<span>text7</span>
text8
text9
</span>
<br 3>
<span class='cl'> text10</span> text11
<br 4>
<span class='cl'> text12</span> text13
<br 5>
</span>
"""
print(text_from_html(html))
Test it on Repl.it: https://repl.it/@mac9416/SO-57890610
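Note that the function above returns one flat list. If you need one entry per <br>, as the question asks, the text has to be grouped while walking the markup. Here is a minimal sketch of that idea, assuming the standard library's html.parser is acceptable in place of bs4 or XPath (a swapped-in technique, not the method of the answer above; BrSplitter and split_on_br are hypothetical names):

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect text inside <span id="info">, starting a new group at each <br>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # open <span> tags, counting the id="info" span itself
        self.groups = []    # finished groups, one per <br>
        self.current = []   # words seen since the last <br>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Start counting at the target span, then track every nested span.
            if self.depth or dict(attrs).get("id") == "info":
                self.depth += 1
        elif tag == "br" and self.depth:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self._flush()   # text after the last <br>, if any

    def handle_data(self, data):
        if self.depth:
            self.current.extend(data.split())

    def _flush(self):
        if self.current:
            self.groups.append(" ".join(self.current))
        self.current = []


def split_on_br(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()
    return parser.groups


sample = """
<span id='info'>
<span>
<span class='cl'>text1</span>
text2
text3
</span>
<br 1>
<span class='cl'> text4</span> text5
<br 2>
</span>
"""
print(split_on_br(sample))  # ['text1 text2 text3', 'text4 text5']
```

Because the grouping depends only on the <br> tags, it does not matter how the spans before each <br> are nested.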

Generally, don't use regex to parse HTML.
But if all the HTML you want to parse looks very similar to your example, then... do what works!
(?<=>)(?:\s*)[^(<|\s).]+(?=\s*<)
That will capture the text along with its leading whitespace. In Python you can then call .strip() on each match (the equivalent of .trim()) to get the content you want.
See it in action on regex101.
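In Python, the pattern above can be applied with re.findall. This is only a small sketch on a made-up sample string. Keep in mind that the character class excludes whitespace, so only single tokens sitting directly between a > and a < are matched; multi-word text such as "text2 is Chinese string." is skipped.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"(?<=>)(?:\s*)[^(<|\s).]+(?=\s*<)")

# Hypothetical sample, taken from one line pair of the question's HTML.
sample = "<span class='cl'> text4</span> text5\n<br 2>"

# Each match still carries its leading whitespace, so strip it.
tokens = [m.strip() for m in PATTERN.findall(sample)]
print(tokens)  # ['text4', 'text5']
```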

Related

NoSuchElementException xPath mistake in python

I can't figure out where my mistake is...
The HTML code is:
<div class="P(16px) C($c-secondary) BreakWord Whs(pl)">
<div class="">
<span class="">some text1 to retrieve</span>
<span class="">some text2 to retrieve</span>
</div>
</div>
I get a NoSuchElementException with this XPath:
descriptions = browser.find_elements_by_xpath("//div[contains(@class, 'BreakWord')/div/span")
Any help, please?
You just missed the closing ] bracket.
Try this:
descriptions = browser.find_elements_by_xpath("//div[contains(@class, 'BreakWord')]/div/span")
Or
descriptions = browser.find_elements_by_xpath("//div[contains(@class, 'BreakWord')]//span")
Or use the following CSS selector:
descriptions = browser.find_elements_by_css_selector("div.BreakWord>div>span")

How can I get texts with certain criteria in python with selenium? (texts with certain siblings)

It's a really tricky one for me, so I'll describe the question in as much detail as possible.
First, let me show you an example of the HTML.
....
....
<div class="lawcon">
<p>
<span class="b1">
<label> No.1 </label>
</span>
</p>
<p>
"I Want to get 'No.1' label in span if the div[@class='lawcon'] has a certain <a> tags with "bb" title, and with a string of 'Law' in the text of it."
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Law Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.2 </label>
</p>
<p>
"But I don't want to get No.2 label because, although it has <a> tag with "bb" title, but it doesn't have a text of law in it"
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Just Power</a>
</p>
</div>
<div class="lawcon">
<p>
<span class="b1">
<label> No.3 </label>
</p>
<p>
"If there are multiple <a> tags with the right criteria in a single div, I want to get span(No.3) for each of those" <a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">Lawyer</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">By the Law</a>
<a title="bb" class="link" onclick="javascript:blabla('12345')" href="javascript:;">But not this one</a>
...
...
...
So, here is the thing. I want to extract the text of the span (e.g. No.1) in div[@class='lawcon'] only if the div has an <a> tag with the "bb" title and the string 'Law' in its text.
If the div contains no <a> tag with the "bb" title, or none with the string "Law" in its text, the span should not be collected.
What I tried was
div_list = [div.text for div in driver.find_elements_by_xpath('//span[following-sibling::a[@title="bb"]]')]
But the problem is that when a single div has multiple tags matching the criteria, it returns just one div.
What I want is a list (or tuple) that pairs the text of each matching tag with its location (the span number).
So it should look like
[[No.1 - Law Power], [No.3 - Lawyer], [No.3 - By the Law]]
I'm not sure I have explained it well enough. Thank you for your interest; I hope you can enlighten me with your knowledge! I really appreciate it in advance.
Here is a simple Python script to get your desired output.
links = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]")
linkData = []
for link in links:
    currentList = []
    # Pair the label of the enclosing div with the text of the matching link.
    currentList.append(link.find_element_by_xpath("./ancestor::div[@class='lawcon']//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer'], ['No.3-By the Law']]
I am not sure why you want the output in that format. I would prefer the approach below: it groups the matching links by div, so you can see how many divs have matching links and access each div's links from the output. Just a thought.
divs = driver.find_elements_by_xpath("//a[@title='bb' and contains(.,'Law')]//ancestor::div[@class='lawcon']")
linkData = []
for div in divs:
    currentList = []
    # Collect every matching link inside this div under one list.
    for link in div.find_elements_by_xpath(".//a[@title='bb' and contains(.,'Law')]"):
        currentList.append(div.find_element_by_xpath(".//label").text + '-' + link.text)
    linkData.append(currentList)
print(linkData)
Output:
[['No.1-Law Power'], ['No.3-Lawyer', 'No.3-By the Law']]
Since you want to extract the texts No.1 and so on, which sit inside <label> tags, you have to induce WebDriverWait for visibility_of_all_elements_located(). You will get only 2 matches (against your expectation of 3), because both matching links in the third div share the same label. You can use the following locator strategy:
Using XPATH:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='lawcon']//a[@title='bb' and contains(.,'Law')]//preceding::label[1]")))])

Scraping multiple similar lines with python

Using a simple request, I'm trying to get some information stored in "alt" attributes from this HTML page. The problem is that, within each instance, the information is spread over multiple lines that start with "img"; when I try to access it, I can only read the first "img" and not the rest, and I'm not sure how to get them all. Here's the HTML:
<div class="archetype-tile-description-wrapper">
<div class="archetype-tile-description">
<h2>
<span class="deck-price-online">
Golgari Midrange
</span>
<span class="deck-price-paper">
Golgari Midrange
</span>
</h2>
<div class="manacost-container">
<span class="manacost">
<img alt="b" class="common-manaCost-manaSymbol sprite-mana_symbols_b" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
<img alt="g" class="common-manaCost-manaSymbol sprite-mana_symbols_g" src="//assets1.mtggoldfish.com/assets/s-d69cbc552cfe8de4931deb191dd349a881ff4448ed3251571e0bacd0257519b1.gif" />
</span>
</div>
<ul>
<li>Jadelight Ranger</li>
<li>Merfolk Branchwalker</li>
<li>Vraska's Contempt</li>
</ul>
</div>
</div>
Having said that, what I'm looking to get from this is both "b" and "g" and store them in a single variable.
You can probably grab those <img> elements with the class "common-manaCost-manaSymbol" like this:
imgs = soup.find_all("img",{"class":"common-manaCost-manaSymbol"})
and then you can iterate over each <img> and grab its alt attribute.
alts = []
for i in imgs:
    alts.append(i['alt'])
or with a list comprehension
alts = [i['alt'] for i in imgs]

How to select only divs with specific children span with xpath python

I am trying to scrape information from a particular e-commerce site, and I only want product information (name, price, color, and sizes) for products whose prices have been slashed.
I am currently using XPath.
This is my Python scraping code:
from lxml import html
import requests

class CategoryCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.items = set()

    def __str__(self):
        # __str__ must return a string, not a tuple.
        return 'All Items: {}'.format(self.items)

    def crawl(self):
        self.get_item_from_link(self.starting_url)

    def get_item_from_link(self, link):
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)
        names = tree.xpath('//span[@class="name"][@dir="ltr"]/text()')
        print(names)

Note: this is not the original URL.
crawler = CategoryCrawler('https://www.myfavoriteecommercesite.com/')
crawler.crawl()
When the program is run, this is the HTML content returned by the e-commerce site.
Div of products with a price slash:
<div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="sale-flag-percent">-22%</span>
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
<span class="price -old ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="500000">500,000</span>
</span>
</span>
</div>
</div>
Div of products with no price slash:
<div class="products-info">
<h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
<div class="price-container clearfix">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="NGN">₦</span>
<span dir="ltr" data-price="388990">388,990</span>
</span>
</span>
</div>
</div>
Now this is my exact question:
I want to know how to select only the parent divs, i.e.
<div class="price-container clearfix">, that also contain any of these child span classes:
<span class="price -old "> or
<span class="sale-flag-percent">
Thank you all
One solution would be to get all <div class="price-container clearfix"> elements and iterate over them, checking whether your keywords appear in the string of the whole element.
But a better solution is to use conditionals in the XPath:
from lxml import html
htmlst = 'your html'
tree = html.fromstring(htmlst)
divs = tree.xpath('//div[@class="price-container clearfix" and .//span[@class = "price -old " or @class = "sale-flag-percent"] ]')
print(divs)
This gets all divs with class="price-container clearfix" and then checks whether each one contains a span with one of the searched classes.
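The first option (fetch every price container, then check each one) can also be sketched without XPath conditionals. The example below uses the standard library's xml.etree.ElementTree on a hand-made, well-formed snippet. That is an assumption: real pages are rarely valid XML, so in practice you would keep lxml.html for the parsing step. The helper name discounted_containers is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made, well-formed excerpt modelled on the HTML in the question.
snippet = """
<root>
  <div class="price-container clearfix">
    <span class="sale-flag-percent">-22%</span>
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
      <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
    </span>
  </div>
  <div class="price-container clearfix">
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
    </span>
  </div>
</root>
"""

# Exact class strings, including the trailing space used by the site.
WANTED = {"price -old ", "sale-flag-percent"}


def discounted_containers(root):
    """Return price containers that hold at least one span with a wanted class."""
    result = []
    for div in root.iter("div"):
        if div.get("class") != "price-container clearfix":
            continue
        if any(span.get("class") in WANTED for span in div.iter("span")):
            result.append(div)
    return result


root = ET.fromstring(snippet)
divs = discounted_containers(root)
print(len(divs))  # 1
```

The XPath version above does the same filtering in a single expression, which is why it is usually the better choice once lxml is already in use.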

How to extract multiple text outside tags with BeautifulSoup?

I want to scrape a web page (German complaint website) using BeautifulSoup. Here is a good example (https://de.reclabox.com/beschwerde/44870-deutsche-bahn-berlin-erstattungsbetrag-sparpreisticket)
<div id="comments" class="kt">
<a name="comments"></a>
<span class="bb">Kommentare und Trackbacks (7)</span>
<br><br><br>
<a id="comment100264" name="comment100264"></a>
<div class="data">
19.12.2011 | 11:04
</div>
von Tom K.
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTEREST IN<br><br>MORE TEXT I AM INTEREST IN<br><br>MORETEXT I AM INTEREST IN
<br><br>
<a id="comment100265" name="comment100265"></a>
<div class="data">
19.12.2011 | 11:11
</div>
von Tom K.
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTEREST IN<br><br>MORE TEXT I AM INTEREST IN
<br><br>
<a id="comment101223" name="comment101223"></a>
<div class="commentbox comment-not-yet-solved">
<div class="data">
25.12.2011 | 10:14
</div>
von ReclaBoxler-4134668
<!--
--><img alt="noch nicht gelöste Beschwerde" src="https://a1.reclabox.com/assets/live_tracking/not_yet_solve-dbf4769c625b73b23618047471c72fa45bacfeb1cf9058655c4d75aecd6e0277.png" title="noch nicht gelöste Beschwerde">
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM NOT INTERESTED IN <br><br>TEXT I AM NOT INTERESTED IN
</div>
<br><br>
<a id="comment101237" name="comment101237"></a>
<div class="data">
25.12.2011 | 11:01
</div>
von ReclaBoxler-3315297
<!--
-->
| <a class="flinko" href="/users/login?functionality_required=1">Regelverstoß melden</a>
<div class="linea"></div>
TEXT I AM INTERESTED IN
<br><br>
etc...
<br><br>
<br><br>
</div>
I was able to scrape most of the content I want (thanks to a lot of Q&A's I read here:-)) except for the comments (<div id="comments" class="kt">) which are not in a class ="commentbox" (I got the commentboxes already with another command). The comments outside the comment boxes seem to be not in a normal tag, that's why I just did not manage to get them via "soup.find(_all)". I'd like to scrape these comments as well as information about the person posting the comment ("von") as well as the date and time (<div class="data">).
It would be absolutely fantastic if someone knows how to solve this one. Thanks in advance for your help!
A common way to extract all the text from a page is as follows:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
url = """xxxxxxxx"""  # URL of the page
doc = urlopen(Request(url)).read()  # BeautifulSoup needs the markup, not the URL
soup = BeautifulSoup(doc, "html.parser")
print(soup.get_text())
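get_text() returns everything as one string, including the boxed comments. To keep only the comments outside class="commentbox", together with their date and author, the markup has to be walked in order. Here is a rough sketch with the standard library's html.parser (swapped in for BeautifulSoup; CommentParser and parse_comments are hypothetical names, and the sample is a shortened, hand-edited copy of the HTML in the question, so the real page may need adjustments):

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collect date, author and text of comments that are not in a commentbox."""

    def __init__(self):
        super().__init__()
        self.comments = []    # finished comments
        self.current = None   # comment being built, or None
        self.field = None     # key of self.current that receives text
        self.box_depth = 0    # > 0 while inside <div class="commentbox ...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.box_depth:
            if tag == "div":
                self.box_depth += 1          # nested div inside the box
        elif tag == "div" and "commentbox" in classes:
            self.box_depth = 1
            self.current = None              # drop the boxed comment
        elif tag == "a" and (attrs.get("id") or "").startswith("comment"):
            self.finish()                    # anchor marks the next comment
            self.current = {"date": [], "author": [], "text": []}
        elif tag == "div" and "data" in classes:
            self.field = "date"
        elif tag == "a" and "flinko" in classes:
            self.field = None                # ignore the report link
        elif tag == "div" and "linea" in classes:
            self.field = "text"              # comment text follows this div

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.box_depth:
            self.box_depth -= 1
        elif self.field == "date":
            self.field = "author"            # author follows the date div

    def handle_data(self, data):
        data = " ".join(data.split())
        if data and self.current is not None and self.field and not self.box_depth:
            self.current[self.field].append(data)

    def finish(self):
        if self.current is not None:
            c = self.current
            self.comments.append({
                "date": " ".join(c["date"]),
                "author": " ".join(c["author"]).rstrip("| "),
                "text": c["text"],
            })
        self.current = None
        self.field = None


def parse_comments(html):
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    parser.finish()   # flush the last comment
    return parser.comments


sample = """
<div id="comments" class="kt">
<a id="comment1" name="comment1"></a>
<div class="data">19.12.2011 | 11:04</div>
von Tom K.
<!--
-->
| <a class="flinko" href="/users/login">Regelverstoß melden</a>
<div class="linea"></div>
FIRST TEXT<br><br>MORE TEXT
<br><br>
<a id="comment2" name="comment2"></a>
<div class="commentbox comment-not-yet-solved">
<div class="data">25.12.2011 | 10:14</div>
von ReclaBoxler-4134668
| <a class="flinko" href="/users/login">Regelverstoß melden</a>
<div class="linea"></div>
BOXED TEXT
</div>
<br><br>
<a id="comment3" name="comment3"></a>
<div class="data">25.12.2011 | 11:01</div>
von ReclaBoxler-3315297
| <a class="flinko" href="/users/login">Regelverstoß melden</a>
<div class="linea"></div>
THIRD TEXT
<br><br>
</div>
"""
for comment in parse_comments(sample):
    print(comment)
```

The parser relies on the order seen in the question: anchor, date div, author line, report link, "linea" div, then the comment text. Anything that follows a comment before the next anchor would be added to that comment's text.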

Resources