I'm using this script with Scrapy:

import scrapy

class PageSpider(scrapy.Spider):
    name = "page"
    start_urls = ['http://blog.theodo.com/']

    def parse(self, response):
        for article_url in response.css('.Link-sc-19p3alm-0 fnuPWK a ::attr("href")').extract():
            yield response.follow(article_url, callback=self.parse_article)

    def parse_article(self, response):
        content = response.xpath(".//div[@class='entry-content']/descendant::text()").extract()
        yield {'article': ''.join(content)}
I'm following a tutorial, but some parts needed to be changed, I guess.
I have already changed:

response.css('.Link-sc-19p3alm-0 fnuPWK a ::attr("href")').extract()

I guess this is what I need to get the link of the article:
[screenshot of the link element]
But I'm stuck with the XPath. All the content of the article is contained in a div, but there is no entry-content class anymore:
[screenshot of the article markup]
I would like to know if I put the right thing in response.css, what kind of path I need to write in the XPath, and the logic behind it.
Thank you, I hope my post is clear :)
I'm not sure, but I think you need an extra dot before fnuPWK, because I think it is a class:

response.css('.Link-sc-19p3alm-0 .fnuPWK a ::attr("href")').extract()

Also good to know: you can copy XPaths, CSS selectors, etc. with the element inspector (see the example in the picture below). That way you can be sure you have the right XPath.
[screenshot: Chrome inspect element "Copy XPath" example]
Open your terminal and run scrapy shell 'blog.theodo.com'.
For the href element you have to do:

response.xpath('//a[@class="Link-sc-19p3alm-0 fnuPWK"]/@href').get()

I can't give you an example for the "text", because your picture does not show enough information for me.
Also keep in mind: if you use ' as your outer quotation marks, you have to use double quotation marks after class=, for example '//div[@class=""]'.
For the whole article on https://www.formatic-centre.fr/formation/dynamiser-vos-equipes-special-post-confinement/:

response.xpath('//div[@class="course-des-content"]//text()').getall()

.get() will give you the first match, but in this case getall() would suit better, imo.
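If you want to see the get()/getall() distinction and the class predicate offline, here is a minimal sketch using only the standard library's ElementTree on made-up markup (the class name mirrors the example above; ElementTree supports just a subset of XPath, but enough for this):

```python
import xml.etree.ElementTree as ET

# Made-up markup reusing the course-des-content class from the example above.
markup = (
    '<html><body>'
    '<div class="course-des-content"><p>First paragraph.</p> <p>Second paragraph.</p></div>'
    '<div class="course-des-content"><p>Another block.</p></div>'
    '<div class="sidebar"><p>Ignore me.</p></div>'
    '</body></html>'
)
root = ET.fromstring(markup)

# Equivalent of //div[@class="course-des-content"] -- note the @ before class.
divs = root.findall(".//div[@class='course-des-content']")

# Like .get(): only the first matching div's text.
first = ''.join(divs[0].itertext())
# Like .getall(): the text of every matching div.
every = [''.join(d.itertext()) for d in divs]

print(first)   # First paragraph. Second paragraph.
print(every)   # ['First paragraph. Second paragraph.', 'Another block.']
```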
I want to parse https://2gis.kz, and I encountered the problem that I am getting an error while using .text or any other method to extract text from a class.
I am typing a search query such as "fitness".
My window variable is:

all_cards = driver.find_elements(By.CLASS_NAME, "_1hf7139")
for card_ in all_cards:
    card_.click()
    window = driver.find_element(By.CLASS_NAME, "_18lzknl")

This is a quite simplified version of how I open a mini-window with all of the essential information inside it. Below I am attaching the piece of code where I am trying to extract text from a phone number holder:

texts = window.find_elements(By.CLASS_NAME, '_b0ke8')
print(texts)  # this prints out something, from which I conclude that the element is accessible
try:
    print(texts.text)
except:
    print(".text")
try:
    print(texts.text())
except:
    print(".text()")
try:
    print(texts.get_attribute("innerHTML"))
except:
    print('getAttribute("innerHTML")')
try:
    print(texts.get_attribute("textContent"))
except:
    print('getAttribute("textContent")')
try:
    print(texts.get_attribute("outerHTML"))
except:
    print('getAttribute("outerHTML")')
Hi guys, I solved the issue. The .text was not working for some reason; I guess the developers somehow managed to protect the information from that method. I used

get_attribute("innerHTML")  # afaik this allows us to get the HTML code of a particular element

and now it works like a charm:

texts = window.find_elements(By.TAG_NAME, "bdo")
with io.open("t.txt", "a", encoding="utf-8") as f:
    for text in texts:
        nums = re.sub("[^0-9]", "", text.get_attribute("innerHTML"))
        f.write(nums + '\n')
So the problems were that:
I was trying to print a list of items just by using print(texts).
Even when I tried to print each element of the texts variable in a for loop, I was getting an error because the text was UTF-8-encoded.
I hope someone will find this useful and will not spend a plethora of time trying to fix such a simple bug.
The find_elements method returns a list of web elements. So this:

texts = window.find_elements(By.CLASS_NAME, '_b0ke8')

gives you a list of web elements in texts.
You cannot apply the .text method directly to a list.
To get each element's text, you have to iterate over the elements in the list and extract each element's text, like this:

text_elements = window.find_elements(By.CLASS_NAME, '_b0ke8')
for element in text_elements:
    print(element.text)

Also, I'm not sure about the locators you are using: the _1hf7139, _18lzknl and _b0ke8 class names seem to be dynamic, i.e. they may change each browsing session.
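To illustrate the dynamic-class problem offline, here is a small standard-library sketch with made-up markup and a hypothetical data-qa attribute (2gis's real pages may not have one; the point is to anchor on whatever attribute or tag survives redeploys instead of a hashed class name):

```python
import xml.etree.ElementTree as ET

# Made-up markup: the hashed class changes between sessions,
# the (hypothetical) data-qa attribute does not.
markup = '<div><a class="_1hf7139" data-qa="card-link" href="/card/1">Fitness club</a></div>'
root = ET.fromstring(markup)

# Fragile: tied to the generated class name.
by_class = root.find(".//a[@class='_1hf7139']")
# Sturdier: tied to the stable attribute. The Selenium equivalent would be
# driver.find_element(By.CSS_SELECTOR, '[data-qa="card-link"]').
by_attr = root.find(".//a[@data-qa='card-link']")

print(by_attr.get('href'))  # /card/1
```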
I am working on creating a program that reads a list of aircraft registrations from an Excel file and returns the aircraft type codes.
My source of information is FlightRadar24 (example: https://www.flightradar24.com/data/aircraft/n502dn).
I tried inspecting the elements on the page to find the correct class to use, and found it listed as "details". When I run my code, it extracts the aircraft name with the class name details instead of the type code.
See here for the example data.
I then changed my approach to using XPath to find the correct text, but the XPath gives no output. (For the XPath, I used a browser add-on to find the exact XPath for the element, so I'm fairly confident it is correct.)
What would you suggest in this particular instance when extracting values without a definite id?

for i in list_regs:
    driver.get('https://www.flightradar24.com/data/aircraft/' + i)
    driver.implicitly_wait(3)
    load = 0
    while load == 0:
        try:
            element = driver.find_element_by_xpath("/html/body/div[5]/div/section/section[2]/div[1]/div[1]/div[2]/div[2]/span")
            print('element')  # printing to the terminal to see if the right value is returned
You should probably change your XPath expression to:

//label[.="TYPE CODE"]/following-sibling::span[@class="details"]

and

print('element')

to

print(element)

Edit:
This works for me:

element = driver.find_element_by_xpath('//label[.="TYPE CODE"]/following-sibling::span[@class="details"]')
print(element.text)

Output:

A359
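If you want to test the following-sibling expression offline, it can be exercised with lxml (which supports full XPath) on a simplified stand-in for the FlightRadar24 details block; the markup below is invented for illustration:

```python
from lxml import html

# Simplified stand-in for the details block; not the site's real markup.
snippet = """
<div>
  <label>TYPE CODE</label>
  <span class="details">A359</span>
  <label>SERIAL NUMBER</label>
  <span class="details">217</span>
</div>
"""
tree = html.fromstring(snippet)
# Take the label whose text is exactly TYPE CODE, then the first span
# sibling after it -- independent of the element's position in the page.
code = tree.xpath('//label[.="TYPE CODE"]/following-sibling::span[@class="details"]/text()')[0]
print(code)  # A359
```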
I have a script using BeautifulSoup where I am trying to get the text within a span element:

number_of_pages = soup.find('span', attrs={'class': 'random'})
print(number_of_pages.string)

It returns a value like {{lastPage()}}, which means it is generated by JS. So I changed my script to use Selenium, but it returns an element that doesn't contain the text I need. I tried a random website to see if it works there:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://hoshiikins.com/") #navigates to hoshiikins.com
spanList= browser.find_elements_by_xpath("/html/body/div[1]/main/div/div[13]/div/div[2]/div/p")
print(spanList)
and what it returns is:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fe20e73e-5638-420e-a8a0-a8785153c157", element="3065d5b1-f8a6-4e46-9359-87386b4d1511")>]
I then thought it was an issue related to how fast the script runs. So, I added a delay/wait:
element = WebDriverWait(browser, 10).until(
EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/main/div/div[13]/div/div[2]/div/p"))
)
I even tried different parts of the page and used a class and an ID, but I am not getting any text back. Note that I had tried spanList.get_attribute('value') and spanList.text, but they return nothing.
I had this same issue: your variable spanList is a list of web element objects, and the find_elements function doesn't return meaningful text. You have to do one more step and add .text to return the text of an element. You can do this in the print statement:

print(spanList[0].text)

If the tag is an input element, then you'll need:

print(spanList[0].get_attribute('value'))

This should print what you are looking for.
It sounds like you're perhaps misunderstanding your results; the code you provided for Selenium works with one small change:

driver.get("https://hoshiikins.com/")
spanList = driver.find_elements_by_xpath("/html/body/div[1]/main/div/div[13]/div/div[2]/div/p")
for span in spanList:
    print(span.text)

This returns Indivdually Handcrafted with Love, Just for You.
You're using find_elements_by_xpath, which is different from find_element_by_xpath: the former is plural (elements) and returns a list. So all you have to do is either change it to the singular element form or iterate over your result set and get the text property of each element.
I have HTML like this:

<div class="event__scores fontBold">
<span>1</span>
-
<span>2</span>
</div>

I find this element as follows:

current_score = match.find_element(By.XPATH, '//div[contains(@class, "event__scores")]')
print(current_score.get_attribute('innerHTML'))

I cannot understand what I need to do to get the text like 1 - 2 without using bs4 or something like that.
I know I can use bs4 like this:

spans = soup.find_all('span')
result = ' - '.join([e.get_text() for e in spans])

But I want to know whether I can get a similar result using only Selenium.
Consider using an Explicit Wait instead of find, as it might be the case that the element won't be loaded yet by the time you attempt to find it. Check out the How to use Selenium to test web applications using AJAX technology article for more details. Note that the locator strategy has to be By.XPATH when you pass an XPath expression:

current_score = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//div[contains(@class, "event__scores")]')))

You're also looking for the wrong property: you should be using innerText, not innerHTML:

print(current_score.get_attribute('innerText'))

or simply retrieve the WebElement.text property:

print(current_score.text)
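The reason .text/innerText works here is that the visible score is split across several text nodes (span, bare dash, span), and those properties concatenate them. A standard-library sketch of that concatenation on the question's markup:

```python
import re
import xml.etree.ElementTree as ET

markup = """<div class="event__scores fontBold">
<span>1</span>
-
<span>2</span>
</div>"""

div = ET.fromstring(markup)
# itertext() walks every descendant text node, much like innerText/.text;
# collapsing the whitespace yields the visible string.
score = re.sub(r'\s+', ' ', ''.join(div.itertext())).strip()
print(score)  # 1 - 2
```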
I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else, I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the list is empty. I've looked at the XPath docs and tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. (One likely reason your original XPath fails: browsers insert <tbody> elements into tables when rendering, but the raw HTML that requests downloads often has none, so copied browser paths containing /tbody/ match nothing.) I'm unfamiliar with XPath myself and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:

tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')

or

tree.xpath('//table//tr//td//div//img[@src]')

or

tree.xpath('//img[@src]')  # 68 images

The key to this is building up slowly: find all the images, then find the images wrapped in the tag you are interested in, etc., until you are confident you can find only the images you are interested in.
Note that the [@src] allows us to access the source of the image. Using this post, we can now download any/all images we want:
import shutil
from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[@target=\'_blank\']//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
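One caveat about source_url = page.url + cool_images[5].attrib['src'] above: plain string concatenation only works when src is a bare relative path. The standard library's urljoin handles relative, site-rooted, and absolute src values alike (the file names below are made up for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.bvmjets.com/'

# Relative path: appended to the base, as the concatenation would do.
print(urljoin(base, 'images/plane.jpg'))
# Site-rooted path: resolved against the host, where naive concatenation
# would produce a malformed URL.
print(urljoin(base, '/images/plane.jpg'))
# Already-absolute URL: returned unchanged.
print(urljoin(base, 'http://cdn.example.com/plane.jpg'))
```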
I would highly recommend looking at Beautiful Soup; it has helped my amateur web-scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you. Best of luck!