Page with anti-scraping protection in the code? - python-3.x

I am trying to extract information from a web page. When I use XPath Helper (a Chrome extension) it shows the content perfectly, but when I take the same expression to Scrapy it returns None or an empty result:
Web: https://cutt.ly/bjj3ohW
The numbers --NN mark the forms I tested.
I have tried XPath (//*[@id="da_price"] and //*[@id="da_price"]/text()) with .get(), .extract(), and .get().strip(); the CSS selectors #da_price and #da_price::text; and I also used BeautifulSoup and scrapy-splash, which return None or an empty result as well. I would rather not resort to Selenium because the number of links is quite large.

The element you're targeting might be dynamically rendered. I tried the following and got it to work by targeting the price lower down on the page instead:
import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        return [scrapy.Request(
            url='https://cutt.ly/bjj3ohW',
        )]

    def parse(self, response):
        # target the price block lower down the page, which is
        # present in the static HTML
        price = response.css('.price-final > strong::text').get()
        print(price)
A good way to test whether content is dynamically rendered is to open the inspect panel in Chrome (F12) and look under the Network tab. Reload the page and look at the first response, which should be an .html file. Click on that file and then Response. There you can see the HTML you can actually parse in Scrapy. Press Ctrl+F and search for the CSS selector you're trying to parse.
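The same check can also be scripted. A minimal sketch with requests, reusing the id from the question and the class from the answer above (cutt.ly redirects are followed automatically):

import requests

# Fetch the raw HTML, without any JavaScript execution.
html = requests.get('https://cutt.ly/bjj3ohW').text

# If a name is missing here but visible in the browser,
# that element is rendered dynamically.
print('da_price' in html)
print('price-final' in html)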

Related

How to extract the URL of a particular page using selenium/python?

I'm building an Instagram bot using Selenium.
How do I extract the URL of a page using Python?
For example, Selenium is loading a webpage, and I want to extract the URL of that particular page (suppose: https://instagram.com/as80df67s4).
If you still don't understand what I'm talking about, please check the image below. There, I have highlighted the page link. How do I extract that link?
From webdriver.py:
@property
def current_url(self):
    """
    Gets the URL of the current page.

    :Usage:
        driver.current_url
    """
    return self.execute(Command.GET_CURRENT_URL)['value']
This means that in order to get the current URL you can use:
your_url = driver.current_url
But the page needs to be opened first.
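For example (a minimal sketch using the profile URL from the question):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://instagram.com/as80df67s4')  # open the page first

# current_url now reflects the page the browser actually ended up
# on, including any redirects
your_url = driver.current_url
print(your_url)
driver.quit()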

soup.find_all returns nothing even though the element exists

It returns nothing from page 5 onward, even though the class exists there.
This URL works fine:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=Apple&_sacat=0&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_pgn=1
But it doesn't work for pages 5 and 6:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=Apple&_sacat=0&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_pgn=5
My code so far:
import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url is one of the pages above
soup = BeautifulSoup(response.content, 'html.parser')
app = soup.find_all('li', class_='s-item')
for x in app:
    print(x)
Printing app gives an empty list: []
I have checked it manually, The class exists on all the pages.
The content is probably dynamically generated with JavaScript. You should use Selenium to run the JavaScript components, then extract the information you want from the resulting page.
Your bot may be detected, and the 5th page often turns out to be a captcha or pop-up window.
Try another approach such as Selenium so you can witness your bot's behaviour in the browser, or take a screenshot of the window at every page query.
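A minimal sketch of that idea, assuming Chrome and the page-5 URL from the question; the screenshot shows at a glance whether the page is actually a captcha:

from bs4 import BeautifulSoup
from selenium import webdriver

url = ('https://www.ebay.com/sch/i.html?_from=R40&_nkw=Apple&_sacat=0'
       '&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_pgn=5')

driver = webdriver.Chrome()
driver.get(url)

# Save what the browser actually shows; a captcha or pop-up
# will be obvious in the image.
driver.save_screenshot('page_5.png')

# The rendered source can be parsed as before.
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.find_all('li', class_='s-item')))
driver.quit()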

Python Scrapy - Trying to use pagination that is "#" and redirects me to same page

I'm building a scraper for this page.
I want to follow to the next page after making the first page work, and after setting AutoThrottle and the download speed to be gentle, I tried using:
next_page = response.xpath('//div[@class="global-pagination"]/a[@class="next"]/@href').get()
if next_page is not None:
    yield response.follow(next_page, self.parse)
The problem is that the href in that class is just "#", so it basically opens the same page again. How do I make it work?
If you take a look at your browser's developer tools, you will see that when you go to other pages the data loads from a loadresult request. Furthermore, by searching the Form Data you'll see there is a field named page holding the number of the page you requested, so you can request any other page by changing that field in the formdata of a FormRequest.
from scrapy.http import FormRequest

FormRequest(url=url, formdata={'page': <page number>}, callback=<parse method>)
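Put together as a spider, that could look roughly like the sketch below. It is untested: the loadresult URL is a placeholder you need to copy from the Network tab, along with any other fields the site sends in its Form Data.

import scrapy
from scrapy.http import FormRequest


class PagedSpider(scrapy.Spider):
    name = 'pagedspider'
    # placeholder: copy the real endpoint from the Network tab
    loadresult_url = 'https://www.example.com/loadresult'

    def start_requests(self):
        yield FormRequest(self.loadresult_url,
                          formdata={'page': '1'},
                          callback=self.parse,
                          meta={'page': 1})

    def parse(self, response):
        # ... extract the items from the returned HTML here ...

        # request the next page by bumping the form field
        # (form values must be strings); add a stop condition,
        # e.g. when the response comes back empty
        next_page = response.meta['page'] + 1
        yield FormRequest(self.loadresult_url,
                          formdata={'page': str(next_page)},
                          callback=self.parse,
                          meta={'page': next_page})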

Web-scraping and download .csv from OECD website

Sorry for bothering you with my request. I have started to get acquainted with web scraping using the BeautifulSoup library. Because I have to download some data from the OECD's websites, I wanted to try some web-scraping approaches. More specifically, I wanted to download a .csv file from the following page:
https://goingdigital.oecd.org/en/indicator/50/
As you can see, the data can easily be downloaded by clicking on 'Download data'. However, because I will have to deal with a recursive download in a loop, I tried to download it directly from the Python console. Therefore, by inspecting the page, I located the download URL shown in the following picture:
Hence, I wrote the following code:
from bs4 import BeautifulSoup
import requests
from requests import get

url = 'https://goingdigital.oecd.org/en/indicator/50/'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
containers = html_soup.find_all('div', {'class': 'css-cqestz e12cimw51'})
print(type(containers))
print(len(containers))
d = []
for a in containers[0].find_all('a', href=True):
    print(a['href'])
    d.append(a['href'])
The containers object holds three elements, since there are three divs with the specified class. The first one (the one I selected in the loop) should be the one containing the URL I am interested in. However, I get no result. Conversely, when I select the third element of containers I get the following output:
https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://twitter.com/intent/tweet?text=OECD%20Going%20Digital%20Toolkit&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
mailto:?subject=OECD%20Going%20Digital%20Toolkit%3A%20Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet&body=Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet%0A%0Ahttps%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
By the way, I guess this download could be related to the following thread. Thank you in advance!
When you pull data from a website, you should first check whether the content you are looking for is present in the page source. If it's not in the page source, you should try web scraping with Selenium.
When I examined the site you mentioned, I could not see the link in the page source, which shows that the link you want is created dynamically on this page.
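A minimal Selenium sketch along those lines, reusing the class name from the question (an assumption worth flagging: generated class names like css-cqestz e12cimw51 tend to change over time):

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://goingdigital.oecd.org/en/indicator/50/')
time.sleep(5)  # crude wait for the JavaScript to finish rendering

# Selenium has executed the JavaScript, so the dynamically created
# links are present in page_source, unlike in the requests response.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for container in soup.find_all('div', {'class': 'css-cqestz e12cimw51'}):
    for a in container.find_all('a', href=True):
        print(a['href'])
driver.quit()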

Web Scrape google search pop-up results or www.prokabaddi.com

I am trying to scrape the results after searching for 'Jaipur Pink Panthers' on Google or directly visiting the prokabaddi website. The target is to scrape the table which pops up when you click on any match, giving the total score spread for the entire match.
I have tried using Beautiful Soup and Selenium, but I end up reading nothing with the div class values. Any help in this regard is highly appreciated.
What I have tried as of now is as follows (PS: I am absolutely new to Python):
Attempt1:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(driver.page_source, "lxml")
for item in soup.select('.sipk-lb-playerName'):
    [elem.extract() for elem in soup("span")]
    print(item.text)
driver.quit()
Attempt2:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='.sipk-lb-playerName')
Little Background
Websites such as these are built to make the user's life easy by sending only the content you need at that point in time.
As you move around the website and click on something, the remaining data is sent back to you. So it basically works like a demand-based interaction between you and the server.
What is the issue in your code?
In your first approach, you are getting an empty div list even though you can see that element in the HTML source in your browser. The reason is that the element only appears after you click the Player tab on the web page; the new HTML content is generated at that point, which is why you see it there.
How to do it?
You need to simulate clicking that button before sending the HTML source to BeautifulSoup. So, first find the button using the find_element_by_id() method. Then click it.
element = driver.find_element_by_id('player_Btn')
element.click()
Now, you have the updated html source in your driver object. Just send this to BeautifulSoup constructor.
soup = BeautifulSoup(driver.page_source)
You do not need an lxml parser for this. Now, you can look for the specific class and get all the names (which I have done here).
soup.findAll('div',attrs={'class':'sipk-lb-playerName'})
Voila! You can store the returned list and get only the names formatted as you want.
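Put together, the whole flow looks roughly like this (a sketch; in Selenium 4 the deprecated find_element_by_id() is written find_element(By.ID, ...)):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.prokabaddi.com/stats/0-102-total-points-statistics')

# simulate the click that generates the player table
driver.find_element(By.ID, 'player_Btn').click()

# hand the updated source to BeautifulSoup and pull out the names
soup = BeautifulSoup(driver.page_source, 'html.parser')
for div in soup.find_all('div', attrs={'class': 'sipk-lb-playerName'}):
    print(div.text)
driver.quit()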
