Python 3 BeautifulSoup4 search for text in source page - python-3.x

I want to search the page source for every occurrence of '1' and print where it occurs, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string. I want to see the tag surrounding that string.

Consider this context for example * :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want element nodes only (rather than their child text nodes), and text="1" to indicate that the element's text content must equal "1" (or whatever other text you are searching for). In Beautiful Soup 4.4.0 and later the same argument is also available under the name string:
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: in future questions, try to provide a context like the example above. That makes the question more concrete and easier to answer, since people don't have to invent a context of their own, which may turn out to be irrelevant to your actual situation.
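A variation on the answer above, as a rough sketch rather than the one right way: search for the text nodes themselves with string= and step up to the enclosing tag with .parent. The second search uses a function to match on a substring instead of the whole text. The name needle is only an illustrative placeholder, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: find text nodes equal to "1", then take the tag that wraps each one.
exact = [str(node.parent) for node in soup.find_all(string="1")]

# Substring match: `needle` is a hypothetical search term chosen for illustration.
needle = "1"
partial = [
    str(node.parent)
    for node in soup.find_all(string=lambda s: s is not None and needle in s)
]

for tag in exact:
    print(tag)
```

With this sample both searches find the same two tags; on a real page the substring version would also pick up text such as "10" or "21".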

Related

Selenium Can't Find Element Returning None or []

I'm having trouble accessing an element. Here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[@class="somethingcss xxx"]')
I also tried another method:
desc = driver.find_elements_by_class_name('somethingcss xxx')
The element I'm trying to find looks like this:
<div data-testid="descContainer">
<div class="abc1123">
<h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
<p data-id="paragraphxx" class="somethingcss xxx">sometext here
<br>text
<br>
<br>text
<br> and several text with
<br> tag below
</p>
</div>
<!--and another div tag below-->
I want to extract the p tag inside div class="abc1123", but I get no result: the call only returns [] when I try get_attribute or try to extract the text.
When I extract another element with a different class using the same method, it works perfectly.
Does anyone know why I can't access these elements?
Try the following CSS selector to locate the p tag:
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
Or use get_attribute("textContent"):
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))

Getting the ID if i know the specific span text

my brain crashed.
I'm trying to get the ID of a span whose text matches a specific value, using BeautifulSoup. I need a number from the ID, but the ID changes every time I search for a new product, while the text (CORRECT) stays the same. Once I have the number, 11 in this case, I can use it in another part of the code to scrape the information I need.
Example:
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>
I've been reading the documentation but never seem to get it right, or even remotely close. I know how to pull the text (CORRECT) if I know the ID, but not the reverse.
Use find_all() to get the span items with the required text, then take the id attribute and split() its value on -:
from bs4 import BeautifulSoup
html='''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('span',text='CORRECT'):
    print(item['id'].split('-')[2])
It will print:
11
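The split('-')[2] approach relies on the number always being the third hyphen-separated piece. If the random parts could themselves contain hyphens (an assumption; the question doesn't say), a regular expression that looks for a run of digits between hyphens may be a bit sturdier. A minimal sketch using only the standard library; id_number is a hypothetical helper name, and the ids are the ones from the question:

```python
import re

# The example ids from the question.
ids = [
    "random-text-10-random-again",
    "random-text-11-random-again",
    "random-text-12-random-again",
]


def id_number(element_id):
    """Return the first run of digits enclosed by hyphens as an int, or None."""
    match = re.search(r"-(\d+)-", element_id)
    return int(match.group(1)) if match else None


numbers = [id_number(i) for i in ids]
print(numbers)
```

In the answer's loop this would be called as id_number(item['id']) instead of splitting.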
I prefer to use :contains to target the innerText by a specified value. It is available with bs4 4.7.1+; newer Soup Sieve releases prefer the spelling :-soup-contains and deprecate :contains.
from bs4 import BeautifulSoup as bs
html = '''
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''
soup = bs(html, 'lxml')
target = soup.select_one('span:contains("CORRECT")[id]')
if target is None:
    print("Not found")
else:
    print(target['id'].split('-')[2])

When Scraping got html with "encoded" part, is it possible to get it

One of the final steps in my project is to get the price of a product. I have everything I need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
What I need is the value that comes after
==">
I don't know if the encoded part is some kind of protection, but the closest I get is this being returned: <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
I don't know if it is relevant, but I'm using "html.parser" for the parsing.
PS. I'm not trying to hack anything; this is just a personal project to help me learn.
Edit: if I get no price when parsing the text, can the other methods get it without a different parser?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''
With Regex
I suppose there are ways to do this using BeautifulSoup; anyway, here is one approach using regex:
# Third-party "regex" module: the standard re module rejects variable-width lookbehind
import regex
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching for: \d means a digit, \. denotes a period, and the two combined in square brackets with + mean we want one or more digits or periods
The groups before and after it (a lookbehind and a lookahead) further specify what has to precede and follow a potential match
(?<=data\-price[\=\"\w]+\>) means any match must be preceded by data-price...> where ... is one or more of the characters A-Z, a-z, 0-9, _, = and "
Finally, (?=\<\/div) means any match must be followed by </div
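If installing the third-party regex module is not an option, roughly the same result can be had with the standard re module by swapping the lookbehind for a capture group. This is a sketch of that alternative, not the pattern from the answer; source_code holds the markup posted in the question.

```python
import re

# The markup from the question, held in a string.
source_code = '''<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''

# Match the whole data-price attribute and the closing ">" literally,
# and capture only the digits and periods that follow it.
prices = re.findall(r'data-price="[^"]*">([\d.]+)</div>', source_code)
as_floats = [float(p) for p in prices]

print(prices)
print(as_floats)
```

When a pattern contains exactly one group, re.findall returns just the captured part, so no lookaround is needed.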
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]
"html.parser" works fine as a parser for your problem. Since you are already able to get <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own, all you need now is the prices, and for that you can use get_text(), a built-in method of BeautifulSoup tags.
It returns all the text found between a tag's opening and closing tags.
Syntax of get_text(): tag_name.get_text()
Solution to your problem:
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Searching for all the div tags with class:encoded
a = soup.find_all('div', {'class': 'encoded'})
# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from one span tag among several span tags with similar names. I'm using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name="nv" and a data-value attribute, but the other one has those too. I just want to extract the gross dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the span with class="text-muted" and the text Gross:.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
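One caveat with the chained one-liners above: soup.find() returns None when no span has the label, and the chained call then raises AttributeError. A guarded version might look like the sketch below. labelled_value is a hypothetical helper name and 'Budget:' is a made-up label used only to show the missing case.

```python
from bs4 import BeautifulSoup

html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''


def labelled_value(soup, label):
    """Return the data-value of the span after the `label` span as an int, or None."""
    label_span = soup.find('span', string=label)
    if label_span is None:
        return None  # no span with that exact text
    value_span = label_span.find_next('span')
    if value_span is None or not value_span.has_attr('data-value'):
        return None  # nothing usable follows the label
    return int(value_span['data-value'].replace(',', ''))


soup = BeautifulSoup(html, 'html.parser')
gross = labelled_value(soup, 'Gross:')
votes = labelled_value(soup, 'Votes:')
missing = labelled_value(soup, 'Budget:')  # label not present in this sample
print(gross, votes, missing)
```

The same helper covers the Votes: figure, since both values sit in the span right after their label.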

Search entire text for images

I have a problem with a project.
I need to search a string for images.
I want to get the source of the image and modify the html form of the img tag.
For example the image form is:
and I want to change it to:
<div class="col-md-3">
<hr class="visible-sm visible-xs tall" />
<a class="img-thumbnail lightbox pull-left" href="upload/uploader/up_164.jpg" data-plugin-options='{"type":"image"}' title="Image title">
<img class="img-responsive" width="215" src="upload/uploader/up_164.jpg"><span class="zoom"><i class="fa fa-search"></i>
</span></a>
I have done part of this.
I can find the image and change the form of the HTML, but I cannot loop this over all the images found in the string.
My code goes like this.
Using the following function, I get the substring between two strings:
// Get substring between
function GetBetween($var1="",$var2="",$pool){
    $temp1 = strpos($pool,$var1)+strlen($var1);
    $result = substr($pool,$temp1,strlen($pool));
    $dd=strpos($result,$var2);
    if($dd == 0){
        $dd = strlen($result);
    }
    return substr($result,0,$dd);
}
And then I get the image tag from the string
$imageFile = GetBetween("img","/>",$newText);
The next step was to extract the source of the image:
$imageSource = GetBetween('src="','\"',$imageFile);
And for the last part I call str_replace to do the job:
$newText = str_replace('oldform', 'newform', $newText);
The problem is that when there is more than one image, I cannot loop this process.
Thank you in advance.
The best, simplest, and safest way to read an XML file is to use an XML parser.
I think it will also save you a lot of time.
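To illustrate the parser idea: the sketch below is in Python rather than the question's PHP (PHP's DOM extension offers comparable parsing), and it only shows the looping part, collecting the src of every img tag in a string. ImageSourceCollector is a hypothetical class name, and the second image path in the sample text is invented for the example.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Hypothetical input: the first path is from the question, the second is invented.
text = (
    '<p>Intro</p><img src="upload/uploader/up_164.jpg" />'
    '<p>More</p><img class="x" src="upload/uploader/up_165.jpg">'
)

collector = ImageSourceCollector()
collector.feed(text)
collector.close()
print(collector.sources)
```

Because the parser visits each tag in turn, every image is handled without any manual searching for "img" and "/>" in the string; the replacement markup could then be built per source.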

Resources