Search particular string in entire html using Beautiful Soup in Scrapy - python-3.x

I would like to search for a particular string in a scraped HTML page and perform some action if the string is present.
find = soup.find('word')
print(find)
But this gives None even though the word is present in the page. I also tried:
find = soup.find_all('word')
print(find)
And this gives only [].

The find method searches for a tag. So when you call soup.find('word'), you're asking BeautifulSoup to find <word></word> tags, which is not what you want.
There are several ways to do what you're asking. You can use the re module to search with a regular expression:
import re
is_present = bool(re.search('word', response.text))
But since you're using Scrapy, you can avoid importing extra modules: Scrapy selectors have a built-in re method for working with regular expressions. Just call it on a selector:
is_present = bool(response.xpath('//body').re('word'))
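For context, here is a minimal sketch of how that check might sit inside a Scrapy callback (the spider name and URL are placeholders, not from the question):
import scrapy

class WordSpider(scrapy.Spider):
    name = "word_spider"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # .re() returns a list of matches; an empty list means the string is absent
        if response.xpath('//body').re('word'):
            self.logger.info("'word' found on %s", response.url)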

Try find = soup.findAll(text="word")
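Note that text="word" only matches text nodes whose entire string equals "word". For a substring match, a compiled regex can be passed instead; a minimal sketch, with html_doc standing in for your page source (in bs4 4.4+ the argument is also available as string=):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")  # html_doc: your page source
# The compiled pattern turns this into a substring match
matches = soup.find_all(string=re.compile("word"))
print(bool(matches))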

Related

How to get specific text in a soup.find method on python?

I'm having multiple issues trying to scrape a website where the CSS classes are all the same. I'm still learning about the soup.find method and what I can do with it. The problem is that several lines on the page share <span class="list-quest">, and when I use soup.find(class_='list-quest'), for example, I only get the first result from the top of the page. Is there a way to get one exact, specific line? Possibly by using Born [dd-mm-yyyy]? Sadly I don't know how to make Python find a specific keyword like that.
<span class="list-quest">Born [dd-mm-yyyy]:</span>
By using a regex on the text attribute:
Regex:
Born \d{2}-\d{2}-\d{4}:
Python code:
from bs4 import BeautifulSoup
import re
text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text,features='html.parser')
tag = soup.find('span', attrs={'class': 'list-quest'}, text=re.compile(r'Born \d{2}-\d{2}-\d{4}'))
print(tag.text)
With bs4 4.7.1+ you might be able to use the :contains pseudo-class:
item = soup.select_one('span.list-quest:contains("Born ")')
if item is not None: print(item.text)

Processing all values of an array with get_text

(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script that first finds certain parts of the HTML markup within a local file and then displays the information without HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup

with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")

titleResults = soup.find_all('span', attrs={'class': 'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This works so far, but I want to do it for all values of titleResults, not only the first one, and I can't call get_text on a whole list.
What would be the best way to accomplish this? The number of values in titleResults keeps changing, since the local HTML file is only a sample.
Thank you in advance!
P.S. I already looked at this related thread, but sadly it wasn't enough to understand or solve the problem:
BeautifulSoup get_text from find_all
find_all returns a list, so iterate over it:
for result in titleResults:
    stripped = result.get_text()
    print(stripped)
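Equivalently, a list comprehension collects every stripped title in one pass (a small variant reusing titleResults from the question):
# strip=True also trims surrounding whitespace from each title
titles = [result.get_text(strip=True) for result in titleResults]
print(titles)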

How to get all xpaths that are matching given regex?

Is there any Python library that helps to get the XPaths of DOM nodes matching a given regex?
I am trying to fetch question-and-answer pairs from an FAQ page.
These are three different XPaths of questions from this site:
xpath1: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[7]/div[1]/a/span
xpath2: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[10]/div[1]/a/span
xpath3: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[1]/div[1]/a/span
Now let the regex be something like this:
/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/ * / * / * /div[1]/a/span
Is it possible, through some library in Python, to get all XPaths that satisfy the regex we build?
I tried using Scrapy selectors to fetch all the questions, but it fails while fetching the answers, so I want to go through all the questions first and then fetch their answers; for this I need the question XPaths.
You don't need a tool or a regex (nor absolute XPath expressions). Try the XPath below to match all the questions on the page:
//div[@class="ClsInnerDrop"]/a
If you don't know how to write your own selectors, check this cheatsheet.
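For illustration, a minimal sketch of how that XPath might be used from a Scrapy response (response is assumed to be the FAQ page's response object):
# `response` is assumed to be a scrapy.http.Response for the FAQ page
questions = response.xpath('//div[@class="ClsInnerDrop"]/a/text()').getall()
for question in questions:
    print(question)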
Finally, I found a solution for this with a combination of lxml and Scrapy.
I used @Andersson's answer to find all the text content with the selector, then for each text I iterated over the tree and used tree.getpath() from lxml.
The solution is not regex-based, but it solved my use case, so I'm posting it:
import requests
from lxml import html

def get_xpath_for_text(tree, text):
    # Walk every element; return the absolute XPath of the first exact text match
    try:
        for tag in tree.iter():
            if tag.text and tag.text == text:
                return tree.getpath(tag)
        return ' '
    except Exception:
        return ' '

webpage = requests.get(url)  # url: the FAQ page, defined elsewhere
html_content = html.fromstring(webpage.text)
tree = html_content.getroottree()
get_xpath_for_text(tree, text)  # text: the question text to locate

What is the difference between the find_all() function and the SoupStrainer of the BeautifulSoup package?

The following code prints the <a> tags of html_doc, a variable that contains HTML code:
from bs4 import BeautifulSoup, SoupStrainer
only_a_tags = SoupStrainer("a")
print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
The following code returns the same result:
print(BeautifulSoup(html_doc, "html.parser").find_all("a").prettify())
What is the difference between using the SoupStrainer and the find_all() function?
Could we use both SoupStrainer and find_all()?
I found the following but cannot understand what it does:
BeautifulSoup(response, parse_only=SoupStrainer("a", href=True)).find_all("a")
find_all returns a list of matching elements from an already-parsed soup; a SoupStrainer limits what gets parsed into the soup in the first place.
To iterate over each anchor in the page and perform some action on it, you'd use find_all. If you just want to parse out everything from the page except the anchors, you'd use a SoupStrainer. Using them together just makes things a little more efficient.
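To make that concrete, here is a minimal sketch using the two together, with html_doc standing in for your page source: the strainer keeps only anchors with an href during parsing, and find_all then iterates over what remains:
from bs4 import BeautifulSoup, SoupStrainer

only_links = SoupStrainer("a", href=True)  # parse nothing but <a href=...> tags
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

# The soup now contains only anchors, so find_all has much less to search
for link in soup.find_all("a"):
    print(link.get("href"))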

Getting all tags with multiple attributes with SoupStrainer and BeautifulSoup

I'm trying to get all the occurrences of the 'td' tag when the class attribute has one of a few different values.
I know how to do this with BeautifulSoup after parsing, but because of the time it takes, I'm trying to speed things up by selectively parsing each page with SoupStrainer. I first tried the code below, but it doesn't seem to work.
strainer = SoupStrainer('td', attrs={'class': ['Value_One', 'Value_Two']})
soup = BeautifulSoup(foo.content, "lxml", parse_only=strainer)
Does anybody know of a way to make this work (it doesn't have to involve SoupStrainer or even Beautiful Soup)?
You might be able to use Scrapy, which lets you formulate XPath expressions such as the one used here; it takes advantage of the fact that the two class attribute values share a common prefix. Many other ways of making selections are available.
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('.//td[contains(@class,"Value")]/text()').extract()
['value one', 'value two']
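If SoupStrainer isn't a hard requirement, plain BeautifulSoup with a CSS selector list can also match either class; a sketch assuming bs4 4.7.1+, whose soupsieve backend supports comma-separated selectors (foo is the response object from the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(foo.content, "lxml")  # foo: the HTTP response from the question
cells = soup.select('td.Value_One, td.Value_Two')
print([td.get_text(strip=True) for td in cells])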
