Irregular behaviour of URL link - multiple-page web scraping

I am writing Python code to collect data from a SharePoint site using Beautiful Soup.
Each page shows 10 rows of details, so I need to collect the links for every page up to the last one and then extract the full list of data.
Issues:
When I request the page 2 URL from Python, I still get page 1 (the base URL).
In the browser, I can open the base URL (page 1) and reach page 2 with the Next button. But if I open a new tab and paste the page 2 URL directly, it refreshes and loads page 1 (the base URL) instead.
Code:
import requests
from requests_ntlm import HttpNtlmAuth
session = requests.Session()
session.auth = HttpNtlmAuth('username','password')
r = session.get("UrlLinkOfPage2")
print(r.status_code)
print(r.content)

Some websites expect specific headers to be sent with each request. Open the site's home page in your browser (log in if required), open the Network tab in the developer tools, and inspect the request your browser makes when it loads the second page. Then copy the headers it sends into a dictionary in your Python code, like:
my_headers = {
    'some-header': 'value',
    'another-header': 'another-value',
}
Then pass those headers to requests when fetching the page:
response = session.get(second_page_url, headers=my_headers)
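
Typing a long header dictionary by hand is error-prone, so here is a minimal sketch of a helper that turns header lines copied from the Network tab into a dict. The function name parse_raw_headers and the sample header values are hypothetical placeholders, not anything the SharePoint site is known to require.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the Network tab into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        headers[name.strip()] = value.strip()
    return headers


# Placeholder text; paste the real request headers from your browser here.
copied = """
User-Agent: Mozilla/5.0 (placeholder)
Accept: text/html
Referer: https://example.invalid/page1
"""
my_headers = parse_raw_headers(copied)
# response = session.get(second_page_url, headers=my_headers)
```

If the request still returns page 1 with the browser's headers, the page number may be carried by cookies or a POST body rather than the URL, which the same Network tab entry would show.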

Related

How to extract the URL of a particular page using selenium/python?

I'm building an Instagram bot using Selenium.
How do I extract the URL of a page using Python?
For example, Selenium has loaded a web page and I want to extract the URL of that particular page (say, https://instagram.com/as80df67s4).
In other words, I want the link of the page that is currently open. How do I extract that link?

From webdriver.py:
def current_url(self):
    """
    Gets the URL of the current page.
    :Usage:
        driver.current_url
    """
    return self.execute(Command.GET_CURRENT_URL)['value']
This means you can read the current URL as a property (no parentheses):
your_url = driver.current_url
But the page has to be opened first, for example with driver.get(...).
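
Reading driver.current_url right after a click can still return the old address if the navigation has not finished. Below is a minimal polling sketch under that assumption. The helper wait_for_url_change and the FakeDriver stand-in are hypothetical names used only for illustration; with a real driver, Selenium's WebDriverWait together with expected_conditions.url_changes should serve the same purpose.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll=0.25):
    """Poll driver.current_url until it differs from old_url or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        url = driver.current_url
        if url != old_url:
            return url
        if time.monotonic() >= deadline:
            raise TimeoutError(f"URL still {old_url!r} after {timeout} s")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a Selenium driver, used only to demonstrate the helper."""

    def __init__(self, urls):
        self._urls = list(urls)

    @property
    def current_url(self):
        # Return the next URL in the list, repeating the last one forever.
        return self._urls.pop(0) if len(self._urls) > 1 else self._urls[0]


demo = FakeDriver([
    "https://example.invalid/a",
    "https://example.invalid/a",
    "https://example.invalid/b",
])
new_url = wait_for_url_change(demo, "https://example.invalid/a", timeout=2.0, poll=0.01)
print(new_url)  # https://example.invalid/b
```

With a real browser session you would pass the Selenium driver itself instead of FakeDriver, since the helper only relies on the current_url attribute.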

I can't extract Instagram hashtags of a post with bs4

I want to extract the hashtags from a specific post (given its URL) using BeautifulSoup4. I fetch the page with requests and then call find_all() to get every hashtag, but something I can't see is going wrong.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with the problem?

It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the page are not present in the response to a plain GET request.
One way to find out whether the page needs JavaScript to populate the information you want is to save the HTML to a file:
import requests
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
with open('dump.html', 'w', encoding='utf-8') as file:
    file.write(r.text)
and then open that file in a web browser.
If the file does not contain the information you want to scrape, it is most likely filled in by JavaScript after the page loads.
To get around this, you can render the JavaScript using either of the following:
A web driver (like Selenium) that simulates a user visiting those pages in a web browser
requests-HTML, a newer package that can render the JavaScript on a page and has many other features that are useful for web scraping
Selenium has a larger user community, which makes debugging easier than with requests-HTML. But if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests, so picking it up should not be difficult.
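
The same check can be done in code instead of by eye. This is a minimal sketch: the helper name missing_from_html and the sample HTML are made up for illustration, and in practice you would pass r.text together with a few strings you can see on the rendered page.

```python
def missing_from_html(html, expected):
    """Return the expected snippets that do not appear in the raw HTML."""
    lowered = html.lower()
    return [text for text in expected if text.lower() not in lowered]


# Made-up HTML standing in for r.text; a JavaScript-driven page often ships
# little more than an empty container and a script tag.
sample_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
missing = missing_from_html(sample_html, ["#sunset", "app.js"])
print(missing)  # ['#sunset']
```

Anything reported as missing is a hint, not proof, that the content is added by JavaScript; it could also be behind a login or served differently to non-browser clients.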

Web-scraping and download .csv from OECD website

Sorry for bothering you with my request. I have started to get acquainted with web scraping using the BeautifulSoup library. Because I have to download some data from the OECD's websites, I wanted to try some web-scraping approaches. More specifically, I wanted to download a .csv file from the following page:
https://goingdigital.oecd.org/en/indicator/50/
As you can see, the data can easily be downloaded by clicking on 'Download data'. However, because I will have to repeat the download in a loop, I tried to do it directly from the Python console. By inspecting the page, I identified the download URL.
Hence, I wrote the following code:
from bs4 import BeautifulSoup
import requests
from requests import get
url = 'https://goingdigital.oecd.org/en/indicator/50/'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
containers = html_soup.find_all('div', {'class': 'css-cqestz e12cimw51'})
print(type(containers))
print(len(containers))
d = []
for a in containers[0].find_all('a', href=True):
    print(a['href'])
    d.append(a['href'])
The containers object has three elements, since there are three divs with the specified class. The first one (the one selected in the loop) should contain the URL I am interested in, but I get no result. In contrast, when I select the third element of containers, I get the following output:
https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://twitter.com/intent/tweet?text=OECD%20Going%20Digital%20Toolkit&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
https://www.linkedin.com/shareArticle?mini=true&url=https%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
mailto:?subject=OECD%20Going%20Digital%20Toolkit%3A%20Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet&body=Percentage%20of%20individuals%20aged%2055-74%20using%20the%20Internet%0A%0Ahttps%3A%2F%2Fgoingdigital.oecd.org%2Fen%2Findicator%2F50%2F
By the way, for this download I guess it could be related to the following thread. Thank you in advance!

When you pull data from a website, first check whether the content you are looking for is in the page source. If it is not, try scraping with Selenium instead.
When I examined the site you mentioned, I could not find the link in the page source, which indicates that it is created dynamically.
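
Checking the page source can also be scripted with the standard library alone. This is a rough sketch; the class LinkCollector, the function find_links, and the sample HTML are hypothetical and do not reflect the real OECD markup.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_links(html, suffix):
    """Return hrefs in the raw HTML whose path ends with the given suffix."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if h.split("?")[0].lower().endswith(suffix)]


# Invented markup standing in for response.text.
sample_html = """
<div><a href="https://example.invalid/share">Share</a></div>
<div><a href="/files/indicator.csv?download=1">Download data</a></div>
"""
print(find_links(sample_html, ".csv"))  # ['/files/indicator.csv?download=1']
```

If find_links(response.text, ".csv") comes back empty for the real page, that is consistent with the link being built in the browser, and a browser-driven tool or the underlying data request would be the next thing to try.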

How to scrape data after clicking button

I am trying to scrape data from a website with Beautiful Soup, but to get all of the content I have to click this button
<button class="show-more">view all 102 items</button>
to load every item. I have heard that this can be done with Selenium, but that means opening a browser from the script and then scraping the data. Are there any other ways to solve this problem?

You can use the same API endpoint the page itself uses, which returns all the information as JSON. Set the record count higher than the total number of items you expect. The code below parses the album titles and URLs out of the JSON. You can find this endpoint in the browser's Network tab when you refresh the URL you supplied.
import requests
data = {"fan_id":1812622,"older_than_token":"1557167238:2897209009:a::","count":1000}
r = requests.post('https://bandcamp.com/api/fancollection/1/wishlist_items', json = data).json()
details = [(item['album_title'], item['item_url']) for item in r['items']]
print(details)
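
The list comprehension above raises KeyError if an item lacks one of the two fields. A slightly more defensive sketch follows; the helper name album_details and the sample payload are invented, and only the keys 'items', 'album_title' and 'item_url' come from the code above.

```python
def album_details(payload):
    """Return (album_title, item_url) pairs, skipping incomplete items."""
    details = []
    for item in payload.get("items", []):
        title = item.get("album_title")
        url = item.get("item_url")
        if title and url:
            details.append((title, url))
    return details


# Invented payload shaped like the response used above.
sample = {
    "items": [
        {"album_title": "First Album", "item_url": "https://example.invalid/first"},
        {"album_title": None, "item_url": "https://example.invalid/untitled"},
    ]
}
print(album_details(sample))  # [('First Album', 'https://example.invalid/first')]
```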

scrapy pagination without href

I created a spider that scrapes the table on the page below, but I cannot move to the previous table because its button has no "href". How do I do this?
https://br.soccerway.com/teams/italy/as-roma/1241/
The previous button, which has no href:
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>

If you look at the network inspector in your browser, you can see an XHR request being made when you click the next button.
That request returns a JSON response containing the HTML changes.
You need to reverse engineer how the page generates this URL:
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
You can then use that to retrieve the following pages.
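
Decoding the URL above shows that callback_params and params are URL-encoded JSON objects, so the address for another page can be rebuilt rather than copied. This is a sketch under assumptions: the function name build_page_url is hypothetical, the field values are simply the ones decoded from the URL above, and whether the server accepts other page values or needs every field is not something the answer confirms.

```python
import json
from urllib.parse import quote, urlencode

BASE = "https://br.soccerway.com/a/block_team_matches_summary"

# Values decoded from the URL quoted in the answer.
BLOCK_ID = "page_team_1_block_team_matches_summary_7"
CALLBACK_PARAMS = {
    "page": 0,
    "bookmaker_urls": {
        "13": [{"link": "http://www.bet365.com/home/?affiliate=365_371546",
                "name": "Bet 365"}]
    },
    "block_service_id": "team_summary_block_teammatchessummary",
    "team_id": 1241,
    "competition_id": 0,
    "filter": "all",
    "new_design": False,
}


def build_page_url(page, block_id=BLOCK_ID, callback_params=CALLBACK_PARAMS):
    """Build the XHR URL for one page of the matches block."""
    compact = {"separators": (",", ":")}  # no spaces inside the JSON text
    query = {
        "block_id": block_id,
        "callback_params": json.dumps(callback_params, **compact),
        "action": "changePage",
        "params": json.dumps({"page": page}, **compact),
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return BASE + "?" + urlencode(query, quote_via=quote)


url = build_page_url(1)
print(url)
```

In a Scrapy spider the resulting URL could be passed to scrapy.Request, with the callback parsing the JSON body and then the HTML fragment it contains.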
