How to extract the URL of a particular page using selenium/python? - python-3.x

I'm building an Instagram bot using Selenium.
How do I extract the URL of a page using Python?
For example, Selenium has loaded a web page and I want to extract the URL of that particular page (say, https://instagram.com/as80df67s4).
If that is still unclear, please check the image below, where I have highlighted the page link. How do I extract that link?

From webdriver.py:
@property
def current_url(self):
    """
    Gets the URL of the current page.
    :Usage:
        driver.current_url
    """
    return self.execute(Command.GET_CURRENT_URL)['value']
This means that to get the current URL you can use:
your_url = driver.current_url
The page has to be open first, so read this attribute only after driver.get(...) has loaded it.
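As a minimal sketch of the point about the page needing to be open first: the helper below polls driver.current_url until it matches a condition. The names wait_for_url and demo are hypothetical, and demo assumes Selenium plus a local Chrome installation; only driver.get, driver.current_url and driver.quit are Selenium's own.

```python
import time


def wait_for_url(driver, predicate, timeout=10.0, poll=0.25):
    """Poll driver.current_url until predicate(url) is true.

    `driver` only needs a `current_url` attribute, which Selenium's
    WebDriver provides. Raises TimeoutError if the URL never matches.
    """
    deadline = time.monotonic() + timeout
    while True:
        url = driver.current_url
        if predicate(url):
            return url
        if time.monotonic() >= deadline:
            raise TimeoutError(f"URL is still {url!r} after {timeout}s")
        time.sleep(poll)


def demo():
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get("https://www.instagram.com/")
        # Wait until the browser has settled on an instagram.com address.
        print(wait_for_url(driver, lambda u: "instagram.com" in u))
    finally:
        driver.quit()
```

Calling demo() runs it against a real browser. Polling is mainly useful after a click or redirect, when the address changes some time after the action that triggered it.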

Related

Page with anti-scraping protection in the code?

I am trying to extract information from a web page. With XPath Helper (a Chrome extension) the content shows up perfectly, but in Scrapy the same query returns "None" or an empty result:
Web: https://cutt.ly/bjj3ohW
The numbers (--NN) mark the forms I tested.
I have tried XPath (//*[@id="da_price"], //*[@id="da_price"]/text()) with .get(''), .extract() and .get('').strip(), as well as CSS (#da_price, #da_price::text). I also tried BeautifulSoup and scrapy_splash, and every attempt returns None or an empty result. I would rather not use Selenium yet because the number of links is quite large.
The element you're targeting might be dynamically rendered. I tried the following and got it to work by targeting the price lower down on the page instead.
import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        return [scrapy.Request(
            url='https://cutt.ly/bjj3ohW',
        )]

    def parse(self, response):
        price = response.css('.price-final > strong::text').get()
        print(price)
A good way to test whether it's dynamically rendered is to open the inspect panel in Chrome (F12) and look under the Network tab. Reload the page and look at the first response, which should be an .html file. Click on that file and then on Response. There you can see the HTML that Scrapy actually receives and can parse. Press Ctrl+F and search for the id or class name used in the selector you're trying to parse.
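The same Ctrl+F check can be scripted. This is a rough sketch using only the standard library. The function names fetch_html and find_in_source are hypothetical, and the User-Agent header is an assumption that some sites require before they return the normal page.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download the raw HTML as a client without JavaScript receives it."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def find_in_source(html, needle, context=40):
    """Return a snippet of text around each occurrence of needle in html."""
    if not needle:
        raise ValueError("needle must not be empty")
    snippets = []
    start = html.find(needle)
    while start != -1:
        lo = max(0, start - context)
        hi = min(len(html), start + len(needle) + context)
        snippets.append(html[lo:hi])
        start = html.find(needle, start + len(needle))
    return snippets


# Example (performs a network request, so it is left commented out):
# html = fetch_html("https://cutt.ly/bjj3ohW")
# print(find_in_source(html, 'id="da_price"'))
```

If the snippet shows an empty element, or the list is empty, the value is probably filled in by JavaScript and a selector that works in the browser will return nothing in Scrapy.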

Python Scrapy - Trying to use pagination that is "#" and redirects me to same page

I'm building a scraper for this page
Now that the first page works, and with auto-throttle and the download speed set to be gentle, I want to follow the link to the next page. I tried:
next_page = response.xpath('//div[@class="global-pagination"]/a[@class="next"]/@href').get()
if next_page is not None:
    yield response.follow(next_page, self.parse)
The problem is that the href on that link is "#", so it just opens the same page again. How do I make it work?
If you look at your browser's developer tools, you will see that when you go to another page the data is loaded from a request named loadresult. Its Form Data includes a field named page whose value is the page number you requested, so you can request any other page by changing that field in the formdata you pass to FormRequest.
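from scrapy.http import FormRequest
# Inside a spider callback: url, page_number and self.parse are placeholders.
# Form values must be strings.
yield FormRequest(url=url, formdata={'page': str(page_number)}, callback=self.parse)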
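To request several pages in a row, the form fields can be built in a small helper. This is only a sketch: page_formdata and page_requests are hypothetical names, and any extra fields would have to be copied from the Form Data panel of the real request.

```python
def page_formdata(page, extra=None):
    """Build the form fields for one results page.

    FormRequest expects string values, so the page number is converted
    here. `extra` holds any other fields the site's request carries.
    """
    formdata = dict(extra or {})
    formdata["page"] = str(page)
    return formdata


def page_requests(url, first, last, callback, extra=None):
    """Yield one FormRequest per page in the inclusive range first..last."""
    from scrapy.http import FormRequest  # imported lazily; requires Scrapy

    for page in range(first, last + 1):
        yield FormRequest(url=url, formdata=page_formdata(page, extra), callback=callback)
```

Inside a spider callback this could be used as `yield from page_requests(endpoint_url, 2, 5, self.parse)`, where endpoint_url stands for the request URL seen in the Network tab.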

I cant extract instagram hashtags of a post with bs4

I want to extract the hashtags from a specific post (given its URL) using BeautifulSoup4. First I fetch the page using requests, then I tried find_all() to get every hashtag, but there seems to be a hidden problem.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with the problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the web page are not there when you send a plain GET request.
One way to find out whether the page needs JavaScript to populate the information you want is to simply save the HTML to a file:
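import requests

URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
# Write the response as UTF-8 so non-ASCII text does not break the dump.
with open('dump.html', 'w', encoding='utf-8') as file:
    file.write(r.text)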
and then open that file in a web browser.
If the file does not contain the information you want to scrape, that information is most likely populated by JavaScript after the page loads.
To get around this, you can render the JavaScript using either:
- A web driver (like Selenium) that simulates a user visiting those pages in a web browser
- requests-HTML, a relatively new package that can render JavaScript on a page and has many other features that are useful for web scraping
More people work with Selenium, which makes debugging easier than with requests-HTML. But if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests, so picking it up should not be difficult.
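Once the page has been rendered (with Selenium, the rendered markup is available as driver.page_source), the hashtags still have to be picked out. The sketch below assumes that hashtag links have the form href="/explore/tags/<tag>/". That pattern is an assumption about Instagram's markup, not something confirmed above, and extract_hashtags is a hypothetical name.

```python
import re
from urllib.parse import unquote

# Assumption: hashtag links look like href="/explore/tags/<tag>/".
TAG_LINK = re.compile(r'href="/explore/tags/([^/"]+)/?"')


def extract_hashtags(rendered_html):
    """Return the hashtags linked in rendered_html, in order, without duplicates."""
    seen = []
    for raw in TAG_LINK.findall(rendered_html):
        tag = unquote(raw)  # decode percent-encoded non-ASCII tags
        if tag not in seen:
            seen.append(tag)
    return seen
```

Matching on the link target rather than on a class name such as xil3i avoids depending on generated class names, which can change between page builds.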

Web-scraping and download .csv from OECD website

Sorry for bothering you with my request. I have started to get acquainted with web scraping using the BeautifulSoup library. Because I have to download some data from the OECD's websites, I wanted to try some web-scraping approaches. More specifically, I wanted to download a .csv file from the following page:
https://goingdigital.oecd.org/en/indicator/50/
As you can see, the data can easily be downloaded by clicking on 'Download data'. However, because I will have to repeat the download in a loop, I tried to download it directly from the Python console. By inspecting the page, I found the download URL, which I have reported in the following picture:
Hence, I wrote the following code:
from bs4 import BeautifulSoup
import requests
from requests import get
url = 'https://goingdigital.oecd.org/en/indicator/50/'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
containers = html_soup.find_all('div', {'class': 'css-cqestz e12cimw51'})
print(type(containers))
print(len(containers))
d = []
for a in containers[0].find_all('a', href=True):
    print(a['href'])
    d.append(a['href'])
The object containers has three elements, since there are three divs with the specified class. The first one (the one I selected in the loop) should be the one containing the URL I am interested in. However, I get no result. By contrast, when I select the third element of containers I get the following output:
https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://twitter.com/intent/tweet?text=OECD%20Going%20Digital%20Toolkit&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
mailto:?subject=OECD%20Going%20Digital%20Toolkit%3A%20Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet&body=Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet%0A%0Ahttps%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
By the way, for this download I guess it could be related to the following thread. Thank you in advance!
When you pull data from a website, first check whether the content you are looking for is in the page source. If it is not, try scraping with Selenium instead.
When I examined the site you mentioned, I could not find the link in the page source, which indicates that the link you want is created dynamically.
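The page-source check can be done without a browser by listing every link that the static HTML really contains. This is a minimal sketch with the standard library; LinkCollector and links_in_source are hypothetical names, and the ".csv" filter is just an example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_in_source(html, containing=None):
    """Return hrefs found in html, optionally only those containing a substring."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    if containing is None:
        return collector.links
    return [link for link in collector.links if containing.lower() in link.lower()]
```

Passing response.text from the question's code to links_in_source(..., ".csv") would show whether the download link is in the static HTML at all. An empty list points to a link that is created dynamically.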

How to click a button and scrape text from a website using python scrapy

I have used Python Scrapy to extract data from a website, and I am able to scrape most of the details of the site. My main problem is that I am not able to extract all the product reviews. I can only extract the top 4 reviews displayed on the page; to get the others I have to open a pop-up window that holds all the reviews. I looked for an 'href' for the pop-up window but could not find one. This is the link I tried to scrape (the reviews and ratings are at the bottom of the page): https://www.coursera.org/learn/big-data-introduction
Can anyone help me by explaining how to extract the reviews from this pop-up window? Another thing to note is that the pop-up uses infinite scrolling.
Thanks in advance.
Scrapy, unlike tools like Selenium and PhantomJS, does not drive a full web browser in the background. You cannot just click a button.
You need to understand what the button does (e.g. does it simply submit a form? Does it do something with JavaScript? Etc.) and reproduce the functionality in your own code.
For example, you might need to read the content of a script element, apply regular expressions to it to pull a URL from a string literal, then make a new HTTP request to that URL, then pull the data you want from the new DOM.
... and then repeat for the next “page” of the infinite scroll.
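The "read a script element and pull a URL out of a string literal" step might look like the sketch below. It is illustrative only: urls_in_scripts is a hypothetical name, the regular expressions handle only plainly quoted http(s) URLs (not ones with escaped slashes), and a real site may need a different pattern.

```python
import re

# Body of each inline <script> element.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# A quoted http(s) URL inside JavaScript source.
URL_LITERAL_RE = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def urls_in_scripts(html, must_contain=""):
    """Return URLs found as string literals in inline <script> elements."""
    urls = []
    for body in SCRIPT_RE.findall(html):
        for url in URL_LITERAL_RE.findall(body):
            if must_contain in url and url not in urls:
                urls.append(url)
    return urls
```

In a Scrapy callback the input would be response.text, and each URL found could be passed to scrapy.Request with a callback that parses the new response.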
