When Scraping got html with "encoded" part, is it possible to get it - python-3.x

One of the final steps in my project is to get the price of a product , i got everything i need except the price.
Source :
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
what i need to get is after the
==">
I don't know if there is some protection from the encoded part, but the clostest i get is returnig this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div>
Don't know if is relevant i'm using "html.parser" for the parsing
PS. i'm not trying to hack anything, this is just a personal project to help me learn.
Edit: if when parsing the test i get no price, the other methods can get it without a different parser ?
EDIT2 :
this is my code :
page_soup = soup(pagehtml, "html.parser")
pricebox = page_soup.findAll("div",{ "id":"stationList"})
links = pricebox[0].findAll("a",)
det = links[0].findAll("div",)
det[7].text
#or
det[7].get_text()
the result is ''

With Regex
I suppose there are ways to do this using beautifulsoup, anyway here is one approach using regex
import regex
# Assume 'source_code' is the source code posted in the question
prices = regex.findall(r'(?<=data\-price[\=\"\w]+\>)[\d\.]+(?=\<\/div)', source_code)
# ['151.4', '184.4']
# or
[float(p) for p in prices]
# [151.4, 184.4]
Here is a short explanation of the regular expression:
[\d\.]+ is what we are actually searching: \d means digits, \. denotes the period and the two combined in the square brackets with the + means we want to find at least one digit/period
The brackets before/after further specify what has to precede/succeed a potential match
(?<=data\-price[\=\"\w]+\>) means before any potential match there must be data-price...> where ... is at least one of the symbols A-z0-9="
Finally, (?=\<\/div) means after any match must be followed by </div
With lxml
Here is an approach using the module lxml
import lxml.html
tree = lxml.html.fromstring(source_code)
[float(p.text_content()) for p in tree.find_class('encoded')]
# [151.4, 184.4]

"html.parser" works fine as a parser for your problem. As you are able to get this <div class="encoded" data-price="bzMzlXaZjkxLjUxNA=="></div> on your own that means you only need prices now and for that you can use get_text() which is an inbuilt function present in BeautifulSoup.
This function returns whatever the text is in between the tags.
Syntax of get_text() :tag_name.get_text()
Solution to your problem :
from bs4 import BeautifulSoup
data ='''
<div class="prices">
<div class="price">
<div class="P01 tooltip"><span>Product 1</span></div>€<div class="encoded" data-price="bzMzlXaZjkxLjUxNA==">151.4</div>
</div>
<div class="price">
<div class="Po1plus tooltip"><span>Product 1 +</span></div>€<div class="encoded" data-price="MGMSKJDFsTcxLjU0NA==">184.4</div>
</div>
'''
soup = BeautifulSoup(data,"html.parser")
# Searching for all the div tags with class:encoded
a = soup.findAll ('div', {'class' : 'encoded'})
# Using list comprehension to get the price out of the tags
prices = [price.get_text() for price in a]
print(prices)
Output
['151.4', '184.4']
Hope you get what you are looking for. :)

Related

Selenium Can't Find Element Returning None or []

im having trouble accessing element, here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[#class="somethingcss xxx"]')
and im trying to use another method like this
desc = driver.find_elements_by_class_name('somethingcss xxx')
the element i try to find like this
<div data-testid="descContainer">
<div class="abc1123">
<h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
<p data-id="paragraphxx" class="somethingcss xxx">sometext here
<br>text
<br>
<br>text
<br> and several text with
<br> tag below
</p>
</div>
<!--and another div tag below-->
i want to extract tag p inside div class="abc1123", but it doesn't return any result, only return [] when i try to get_attribute or extract it to text.
When i try extract another element using this method with another class, it works perfectly.
Does anyone know why I can't access these elements?
Try the following css selector to locate p tag.
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
OR Use get_attribute("textContent")
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))

Getting the ID if i know the specific span text

my brain crashed.
I'm trying to get the ID of a span if specific text matches using BeautifulSoup, this because i need a number from the ID but the ID changes every time when searching for a new product but the product (CORRECT). Purpose of this is because when i have the number, 11 in this case, i can add it in another part of the code to scrape the information i need.
Example:
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>
Been reading documentation but i never seem to get right or not even remotely close. I'm aware how to pull the text (CORRECT) if i know the ID but not reversed.
Find_all() span items with required text and then get the id attribute and split() the attribute value with -
from bs4 import BeautifulSoup
html='''<span id="random-text-10-random-again">IGNORE</span>
<span id="random-text-11-random-again">CORRECT</span>
<span id="random-text-12-random-again">IGNORE</span>'''
soup=BeautifulSoup(html,'html.parser')
for item in soup.find_all('span',text='CORRECT'):
print(item['id'].split('-')[2])
It will print:
11
I prefer to use :contains to target the innerText by a specified value. Available for bs4 4.7.1+
from bs4 import BeautifulSoup as bs
html = '''
<span id="random-text-10-random-again">IGNORE</span>,
<span id="random-text-11-random-again">CORRECT</span>,
<span id="random-text-12-random-again">IGNORE</span>'''
soup = bs(html, 'lxml')
target = soup.select_one('span:contains("CORRECT")[id]')
if target is None:
print("Not found")
else:
print(target['id'].split('-')[2])

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text from the span tag within multiple span tags with similar names. Using python, beautifulsoup to parse the website.
Just cannot uniquely identify that specific gross-amount span element.
The span tag has name=nv and a data value but the other one has that too. I just wanna extract the gross numerical dollar figure in millions.
Please advise.
this is the structure :
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
Want the text from second span under span class= text muted Gross.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701

how to get soup.find_all to work in BeautifulSoup?

I'm trying to scrape information a page consisting names of attorneys using BeaurifulSoup
#importing libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
Following is an example of each attorney's names that are nested in HTML tags
</a>
<div class="person-info search-person-info people-search-person-info">
<div class="col person-name-position">
<a href="https://www.foxrothschild.com/richard-s-caputo/">
Richard S. Caputo
</a>
I tried using the following script to extract the name of each of the attorneys using 'a' as the tag and "col person-name-position" as the class. But it does not seem to work. Instead it prints out an empty list.
page=requests.get("https://www.foxrothschild.com/people/?search%5Bname%5D=&search%5Bkeyword%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=") #insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('a',class_='col person-name-position')
print(find_name)
You need to change your soup.find_all to div since the class goes with div and not a
page=requests.get("https://www.foxrothschild.com/people/search%5Bname%5D=&search%5Bkeywod%5D=&search%5Boffice%5D=&search%5Bpeople-position%5D=&search%5Bpeople-bar-admission%5D=&search%5Bpeople-language%5D=&search%5Bpeople-school%5D=Villanova+University+School+of+Law&search%5Bpractice-area%5D=")
#insert page here
soup=BeautifulSoup(page.content,'html.parser')
#print(soup.prettify())
find_name=soup.find_all('div',class_='col person-name-position')
print(find_name)
class="col person-name-position" is a property of a div object, so you need to use:
find_name=soup.find_all('div',class_='col person-name-position')
for entry in find_name:
a_element = entry.find("a")
#...

Python 3 BeautifulSoup4 search for text in source page

I want to search for all '1' in the source code and print the location of that '1' ex: <div id="yeahboy">1</div> the '1' could be replaced by any other string. I want to see the tag around that string.
Consider this context for example * :
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html)
You can use find_all() passing parameter True to indicate that you want only element nodes (instead of the child text nodes), and parameter text="1" to indicate that the element you want must have text content equals "1" -or any other text you want to search for- :
for element1 in soup.find_all(True, text="1"):
print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: for future questions, try to give a context, just like the above context example. That will make your question more concrete and easier to answer -as people doesn't have to create context on his own, which may turn out to be not relevant to the situation that you actually have.

Resources