How to get all XPaths that match a given regex? - python-3.x

Is there any Python library that helps with getting the XPaths of DOM nodes that match a given regex?
I am trying to fetch question and answer pairs from an FAQ page.
These are three different XPaths of questions from this site:
xpath1: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[7]/div[1]/a/span
xpath2: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[1]/div/div[10]/div[1]/a/span
xpath3: /html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/div[3]/div[1]/div[1]/div[1]/a/span
Now let the regex be something like this:
/html/body/div[1]/div[2]/div[3]/div[2]/div/div[2]/div/*/*/*/div[1]/a/span
Is it possible, through some library in Python, to get all XPaths that satisfy the regex we build?
I tried using Scrapy selectors to fetch all the questions, but it fails while fetching the answers, so I want to go through all the questions and then fetch their answers; for this I need the question XPaths.

You don't need a separate tool or a regex (nor absolute XPath expressions). Try the XPath below to match all questions on the page:
//div[@class="ClsInnerDrop"]/a
If you don't know how to write your own selectors, check this cheatsheet.
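For illustration, here is a minimal sketch of applying that selector outside a spider, assuming the page has already been fetched (the URL is a placeholder, not from the thread):
import requests
from scrapy.selector import Selector

page = requests.get('https://example.com/faq')  # placeholder FAQ URL
sel = Selector(text=page.text)
# relative XPath from the answer above; returns every question link's text
questions = sel.xpath('//div[@class="ClsInnerDrop"]/a/text()').getall()
print(questions)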

Finally, I found the solution for this with a combination of lxml and Scrapy.
I used @Andersson's answer to find all the text content using the selector, then for each text iterated over the tree and used tree.getpath() from lxml.
The solution is not regex-based, but it solved my use case, so posting it:
import requests
from lxml import html

def get_xpath_for_text(tree, text):
    try:
        for tag in tree.iter():
            if tag.text and tag.text == text:
                return tree.getpath(tag)
        return ' '
    except Exception:
        return ' '

url = 'https://example.com/faq'  # placeholder: the FAQ page being scraped
webpage = requests.get(url)
html_content = html.fromstring(webpage.text)
tree = html_content.getroottree()
get_xpath_for_text(tree, text)  # text: a question string found via the selector
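A hedged usage sketch, reusing sel and tree from above: assuming question_texts holds the question strings scraped with @Andersson's selector (the variable name is mine, not from the post), each one can be mapped back to its absolute XPath:
# question_texts: hypothetical list of question strings from the selector above
question_texts = sel.xpath('//div[@class="ClsInnerDrop"]/a/text()').getall()
for q in question_texts:
    print(q, '->', get_xpath_for_text(tree, q))  # absolute XPath of each question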

Related

Using Scrapy to get all the articles

I'm using this script with Scrapy:
import scrapy

class PageSpider(scrapy.Spider):
    name = "page"
    start_urls = ['http://blog.theodo.com/']

    def parse(self, response):
        for article_url in response.css('.Link-sc-19p3alm-0 fnuPWK a ::attr("href")').extract():
            yield response.follow(article_url, callback=self.parse_article)

    def parse_article(self, response):
        content = response.xpath(".//div[@class='entry-content']/descendant::text()").extract()
        yield {'article': ''.join(content)}
I'm following a tutorial, but I guess some parts needed to be changed.
I have already changed:
response.css('.Link-sc-19p3alm-0 fnuPWK a ::attr("href")').extract()
I guess this is what I need to get the link of the article:
[screenshot of the link element in the page markup]
But I'm stuck with the XPath. All the content of the article is contained in a div, but there is no entry-content anymore:
[screenshot of the article markup]
I would like to know if I put the right thing in the response.css, what kind of path I need to write in the xpath, and the logic behind it.
Thank you, I hope my post is clear :)
I'm not sure, but I think you need an extra dot before fnuPWK, like:
response.css('.Link-sc-19p3alm-0 .fnuPWK a ::attr("href")').extract()
because I think it is a class.
Also good to know: you can copy XPaths, CSS selectors, etc. from the element inspector (see the example in the picture below). This way you can be sure you have the right XPath.
[screenshot: Chrome inspect element "Copy XPath" example]
Open your terminal and run scrapy shell 'blog.theodo.com'.
For the href element you have to do:
response.xpath('//a[@class="Link-sc-19p3alm-0 fnuPWK"]/@href').get()
I can't give you an example for the "text", because your picture does not show enough information for me.
Also keep in mind: if you use ' as your outer quotation marks, you have to use double quotation marks after class=, for example '//div[@class=""]'.
For the whole article on https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/ use:
response.xpath('//div[@class="course-des-content"]//text()').getall()
.get() will give you the first match, but in this case .getall() would suit better, in my opinion.
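To make the difference concrete, a small sketch to run inside that scrapy shell session (the class name is taken from the answer above):
# .get() vs .getall() on the same selector
text_nodes = response.xpath('//div[@class="course-des-content"]//text()')
first = text_nodes.get()                # only the first text node
article = ''.join(text_nodes.getall())  # every text node joined into one string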

How to get specific text with the soup.find method in Python?

I'm having multiple issues trying to scrape a website where the CSS code is all the same. I'm still learning about the soup.find method and what I can do with it. The issue is that there are several lines on the webpage with <span class="list-quest", and when I use soup.find(class_='list-quest'), for example, I only get the first result from the top of the page that uses that CSS class. Is there a way to get the exact specific line of code, possibly by using Born [dd-mm-yyyy]? Sadly, I do not know how to use a specific keyword such as that for Python to find it.
<span class="list-quest">Born [dd-mm-yyyy]:</span>
By using a regex on the text attribute:
Regex:
Born \d{2}-\d{2}-\d{4}:
Python code:
from bs4 import BeautifulSoup
import re

text = '<span class="list-quest">Born 01-01-2019:</span>'
soup = BeautifulSoup(text, features='html.parser')
tag = soup.find('span', attrs={'class': 'list-quest'},
                text=re.compile(r'Born \d{2}-\d{2}-\d{4}'))
print(tag.text)
With bs4 4.7.1+ you might be able to use :contains:
item = soup.select_one('span.list-quest:contains("Born ")')
if item is not None:
    print(item.text)

Why does this simple Python 3 XPath web scraping script not work?

I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site:
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the list is empty. I've looked at the XPath docs and tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself and wasn't able to get the number selector to work, despite this post. (One common culprit, for what it's worth: browsers insert tbody elements into rendered tables even when they're absent from the page source, so absolute paths copied from dev tools often fail.) Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]')  # 68 images
The key to this is building up slowly: find all the images, then find the images wrapped in the tag you are interested in, and so on, until you are confident you can find only the images you are interested in.
Note that the [@src] lets us access the source of that image. Using this post, we can now download any or all of the images we want:
import shutil
from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[@target="_blank"]//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, it has helped my amateur web-scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point and of some use to you. Best of luck!
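As a rough sketch of the Beautiful Soup route this answer recommends (the parser choice and the src filter are my assumptions, not from the post):
# Collecting image sources with Beautiful Soup instead of lxml
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.bvmjets.com/')
soup = BeautifulSoup(page.content, 'html.parser')
srcs = [img['src'] for img in soup.find_all('img', src=True)]
print(len(srcs), srcs[:5])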

Python BeautifulSoup

I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup

url = 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)  # was requests.get('url'): the quotes made it a literal string
    except:
        pass
print(res)
soup = BeautifulSoup(res.text, 'lxml')
songs = soup.find_all('meta', {'property': 'music:song'})
print(songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL within content as a string so that I can use it further in my program.
Someone please help me.
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case what's been returned is a Tag that supports dictionary-style access.
As explained in this StackOverflow post, you need to query not only songs[0] but also the attribute within it (the attribute name and its value form a key-value pair, the chief way to get data out of a dictionary).
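A sketch of that lookup on the meta tag from the question (the content attribute name comes from the sample output above):
# Dictionary-style attribute access on a bs4 Tag
song_url = songs[0]['content']  # -> 'https://gaana.com/song/o-saathi'
print(song_url)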
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may want to consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn XPath, which is sort of like regex for XML/HTML, but for advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
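And a hedged sketch of the lxml route that last note recommends, reusing the playlist URL from the question:
# The same extraction with lxml and XPath instead of bs4
import requests
from lxml import html

res = requests.get('https://gaana.com/playlist/gaana-dj-bollywood-top-50-1')
tree = html.fromstring(res.content)
song_urls = tree.xpath('//meta[@property="music:song"]/@content')
print(song_urls[:3])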

Search for a particular string in the entire HTML using Beautiful Soup in Scrapy

I would like to search for a particular string in a scraped HTML page and perform some action if the string is present.
find = soup.find('word')
print(find)
But this gives None even though the word is present on the page. Also, I tried:
find = soup.find_all('word')
print(find)
And it gives only [].
What the find method does is search for a tag. So when you do soup.find('word'), you're asking BeautifulSoup to find a <word></word> tag. I think that's not what you want.
There are several ways to do what you're asking. You can use the re module to search with a regular expression, like this:
import re
is_present = bool(re.search('word', response.text))
But you can avoid importing extra modules, since you use Scrapy, which has built-in methods for working with regular expressions. Just use the re method on a selector:
is_present = bool(response.xpath('//body').re('word'))
Try find = soup.findAll(text="word")
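One caveat worth noting (my addition, not from the answers above): text="word" only matches strings whose entire text is exactly "word"; for a substring search you can pass a regex, roughly like this:
# Substring search over text nodes, since text="word" is an exact match
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>a word here</p>', 'html.parser')
matches = soup.find_all(text=re.compile('word'))  # NavigableStrings containing "word"
print(bool(matches))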
