How to scrape a span with a certain title in scrapy?

I'm trying to use Scrapy with CSS selectors to get the text of a number of span items. The HTML looks like this:
<div class="announcement">
<span title="Name">Homer Simpson</span>
<span title="Date">2018-09-19</span>
<span title="Type">House</span>
</div>
I have tried with this:
response.css("div.announcement span::attr(title)").extract()
# ['Name', 'Date', 'Type']
response.css("div.announcement span::text").extract()
# ['Homer Simpson', '2018-09-19', 'House']
But the first only gives me the list of span titles, and the second gives me the text of all the spans at once, whereas I want just one at a time. What I would like to have is something like:
response.css("div.announcement <SomeMagicHere>('Name')").extract()
# ['Homer Simpson']
How can I get a list of only the content of each of the title items, separately?

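You can use an attribute selector. The *= operator matches any title that contains the given substring (use = instead for an exact match):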
response.css("div.announcement span[title*='Name']::text").extract()


lxml.html XPATH expression for element when the test has to be applied to the text_content not the text

I have the following html
<html>
<body>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_1"></a>
<a name="bananabread"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananabread"></a>Ban</font> <font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">ana Bread</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">The Best You Ever Tasted</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">If you don't agree that this is the best banana bread you have ever eaten well I would suggest you see your doctor</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
<p style="text-align:center;margin-bottom:0pt;margin-top:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="_marker_2"></a>
<a name="bananapudding"></a>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">
<a name="bananapudding"></a>Banana</font>
<font style="font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Pudding</font>
</p>
<p style="text-align:center;margin-top:10pt;margin-bottom:0pt;text-indent:0%;font-weight:bold;font-family:Times New Roman;font-size:10pt;font-style:normal;text-transform:none;font-variant: normal;">Creamy and Satisfying</p>
<p style="margin-top:24pt;margin-bottom:0pt;text-indent:7.69%;font-style:italic;font-family:Times New Roman;font-size:10pt;font-weight:normal;text-transform:none;font-variant: normal;">This is the same recipe your mother used when you were ten!</p>
<p style="margin-top:10pt;margin-bottom:0pt;text-indent:7.69%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Lots of text here describing what I am trying to capture</p>
</body>
</html>
I am trying to write an xpath expression to identify Banana Bread - my initial efforts were successful -
b_tree.xpath('.//*[starts-with(text(),"Banana Bread")]')
but I noticed error cases, and on investigation they look like the HTML above: another element is added inside the content I am searching for. Sometimes it is, as above, a possibly unneeded font element; sometimes it is an anchor.
I worked with a related answer but have not been successful.
I can check for elements that have text_content(), clean up the text_content and then string-match to my ultimate goal, but I am hoping to learn to apply XPath better to these types of problems.
To be absolutely clear, I need the text_content of the p element, but sometimes I just need the text of a font element. My existing XPath expression works fine in the cases where there is no intervening element. When I open a page, I do not know what structure was imposed on the document.
When the text() expression is applied to an element whose text content is interrupted by other elements, it returns a nodeset consisting of multiple text nodes, of which starts-with considers only the first. If you replace text() by ., you get the text value of the element, which is the concatenation of all text nodes, and that's what you want.
But there is still a problem with the spaces in an element like (attributes omitted, spaces are dots):
<p>
..<a></a>
..<a></a>
..<font>
....<a></a>Banana</font>
..<font>Pudding</font>
</p>
The text value of this element is _.._.._.._....Banana_..Pudding_ (underscores represent line feeds), therefore you must apply normalize-space, which normalizes this to Banana.Pudding, so that
.//*[starts-with(normalize-space(.),"Banana Pudding")]
finds this occurrence.
However, Banana Bread cannot be found, because it does not exist on the page. The element
<font>
..<a></a>Ban</font>.....<font>ana.Bread</font>
has a normalized text value of Ban.ana.Bread and you don't expect the space inside the word Banana. normalize-space removes spaces and line feeds that are invisible on the rendered page, but the two spaces in Ban.ana.Bread are both visible.
If there was no space between the two <font> elements,
.//*[starts-with(normalize-space(.),"Banana Bread")]
would detect 3 elements: the <html>, the <body> and the <p>, because "Banana Bread" are the first words in each of them. So you might better use
.//p[starts-with(normalize-space(.),"Banana Bread")]
instead.
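The difference between text() and normalize-space(.) can be checked with a small lxml sketch. The markup below is a trimmed-down version of the Banana Pudding paragraph (style attributes removed), so treat it as an illustration rather than the original page.

```python
from lxml import html

doc = html.fromstring("""
<html><body>
<p>
  <a name="_marker_2"></a>
  <font>
    <a name="bananapudding"></a>Banana</font>
  <font>Pudding</font>
</p>
<p>Creamy and Satisfying</p>
</body></html>
""")

# text() yields several text nodes; starts-with() only looks at the first one,
# which here is just the whitespace before the first <a>.
by_text = doc.xpath('.//p[starts-with(text(), "Banana Pudding")]')

# "." is the concatenation of all descendant text; normalize-space() trims it
# and collapses each run of whitespace into a single space.
by_value = doc.xpath('.//p[starts-with(normalize-space(.), "Banana Pudding")]')

print(len(by_text), len(by_value))
print(repr(by_value[0].xpath("normalize-space(.)")))
```

Restricting the test to p elements, as suggested above, keeps enclosing elements such as body out of the result.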

How do you use find_previous() in a select query in Python?

I am trying to pull the span (let's call it AAA) that comes before a specific span, BBB. The BBB span only shows up in certain places on the page, and I only want the AAAs that directly precede a BBB.
Is there a way to select only the AAAs that are followed by a BBB? Or, to get to my actual question, how can you use find_previous when you're running a select query? I am successful if I just use select_one:
AAA= selsoup.select_one('span.BBB').find_previous().text
but when I try to use select to pull all entries I get an error message (You're probably treating a list of elements like a single element.)
I've tried applying .find_previous in a for loop, but that doesn't work either. Any suggestions?
Sorry, I probably should have added this before. Here is the code from the page:
<tr class="tree">
<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>
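Instead of .find_previous() you can use the adjacent sibling combinator + in your CSS selector (.bbb + .aaa matches an .aaa element that immediately follows a .bbb sibling):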
from bs4 import BeautifulSoup
html_doc = """
<span class="ccc">txt</span>
<span class="aaa">This I don't Want</span>
<span class="bbb">txt</span>
<span class="aaa">* This I Want *</span>
<span class="ccc">txt</span>
<span class="aaa">This I don't Want</span>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for aaa in soup.select(".bbb + .aaa"):
    print(aaa.text)
Prints:
* This I Want *
EDIT: Based on your edit:
bbb = soup.select_one(".AAA .BBB")
print(bbb.text)
Prints:
(Aba: The New Look)
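If the goal is the text of the enclosing AAA cell rather than the BBB span itself, one possible approach is to loop over the list that select() returns and navigate from each element in turn. That also avoids the "treating a list of elements like a single element" error. The sketch below assumes the structure from the question; the surrounding table and the second row are made up for the demonstration.

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
<tr class="tree">
<th class="AAA">What I want right here<span class="BBB">(Aba: The New Look)</span></th>
</tr>
<tr class="tree">
<th class="AAA">No BBB inside this one</th>
</tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")

labels = []
# select() returns a list, so navigation methods must be called per element.
for bbb in soup.select("th.AAA > span.BBB"):
    aaa = bbb.find_parent("th", class_="AAA")
    # Keep only the th's own text nodes, leaving out the nested BBB span.
    own_text = "".join(aaa.find_all(string=True, recursive=False)).strip()
    labels.append(own_text)

print(labels)
```

Only the first row contributes a label, because the second th has no BBB child.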

Selenium Can't Find Element Returning None or []

I'm having trouble accessing an element. Here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[@class="somethingcss xxx"]')
I also tried another method:
desc = driver.find_elements_by_class_name('somethingcss xxx')
The element I am trying to find looks like this:
<div data-testid="descContainer">
<div class="abc1123">
<h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
<p data-id="paragraphxx" class="somethingcss xxx">sometext here
<br>text
<br>
<br>text
<br> and several text with
<br> tag below
</p>
</div>
<!--and another div tag below-->
I want to extract the p tag inside div class="abc1123", but I get no result: the call only returns [] when I try to use get_attribute or extract the text.
When I extract another element with another class using the same method, it works perfectly.
Does anyone know why I can't access these elements?
Try the following CSS selector to locate the p tag:
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
Or use get_attribute("textContent"), which returns the element's text even when it is not visible on the page:
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))

Scrape a span text from multiple span elements of same name within a p tag in a website

I want to scrape the text of one span from among several span tags with similar names, using Python and BeautifulSoup to parse the website.
I just cannot uniquely identify that specific gross-amount span element.
The span tag has name="nv" and a data-value attribute, but the votes span has those too. I just want to extract the gross dollar figure in millions.
Please advise.
This is the structure:
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>
I want the text from the span that follows the <span class="text-muted">Gross:</span> label.
What you can do is find the <span> tag that has the text 'Gross:'. Then, once it finds that tag, tell it to go find the next <span> tag (which is the value amount), and get that text.
from bs4 import BeautifulSoup as BS
html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BS(html, 'html.parser')
gross_value = soup.find('span', text='Gross:').find_next('span').text
Output:
print (gross_value)
$69.65M
or if you want to get the data-value, change that last line to:
gross_value = soup.find('span', text='Gross:').find_next('span')['data-value']
Output:
print (gross_value)
69,645,701
And finally, if you need those values as an integer instead of a string, so you can aggregate in some way later:
gross_value = int(soup.find('span', text='Gross:').find_next('span')['data-value'].replace(',', ''))
Output:
print (gross_value)
69645701
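If you need both figures, the same label-then-next-span idea can be put in a loop. This is a sketch that assumes every text-muted label is followed by a span with name="nv". The name attribute has to be passed through attrs because the name keyword of find_next means the tag name.

```python
from bs4 import BeautifulSoup

html = '''<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="93122">93,122</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="69,645,701">$69.65M</span>
</p>'''
soup = BeautifulSoup(html, "html.parser")

values = {}
for label in soup.select("p.sort-num_votes-visible span.text-muted"):
    # The value span is the next <span name="nv"> after the label.
    value_span = label.find_next("span", attrs={"name": "nv"})
    if value_span is not None:
        key = label.get_text(strip=True).rstrip(":")
        values[key] = int(value_span["data-value"].replace(",", ""))

print(values)
```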

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the page source and print where it occurs, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string; I want to see the tag around that string.
Consider this context, for example (*):
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (rather than the child text nodes) and text="1" to indicate that the element's text content must equal "1" (or any other text you want to search for):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: in future questions, try to give some context, like the example above. That makes your question more concrete and easier to answer, because people don't have to invent context on their own, which may turn out not to be relevant to the situation you actually have.
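If the search string may be only part of the text, one option is to search the text nodes with a regular expression and then step up to each node's parent tag. The tags_containing() helper below is a hypothetical name used for illustration. Note that it searches text only, not attribute values.

```python
import re
from bs4 import BeautifulSoup

html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")


def tags_containing(soup, needle):
    """Return the parent tag of every text node that contains needle."""
    pattern = re.compile(re.escape(needle))
    return [text_node.parent for text_node in soup.find_all(string=pattern)]


hits = tags_containing(soup, "1")
for tag in hits:
    print(tag.name, tag.attrs)
```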
