scrapy pagination without href

I created a spider that takes the information from the table below, but I cannot page back to the previous table because the "previous" button has no "href". How do I do it?
https://br.soccerway.com/teams/italy/as-roma/1241/
The previous button has no href:
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>

If you look at the network inspector in your browser, you can see an XHR request being made when you click the next button. That request returns a JSON response containing the HTML changes. You need to reverse-engineer how your page generates this URL:
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
The params query argument decodes to {"page": 1}; change that page number and you can retrieve the other pages the same way.
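For example, a minimal sketch of a spider that pages through that endpoint. The block_id and team_id come from the captured URL above; callback_params is shortened here for readability (copy the full value from your network inspector), and the key path into the JSON response is an assumption you should verify against the real payload:

import json
from urllib.parse import urlencode

import scrapy


class TeamMatchesSpider(scrapy.Spider):
    name = "team_matches"
    endpoint = "https://br.soccerway.com/a/block_team_matches_summary"

    def start_requests(self):
        yield self.page_request(0)

    def page_request(self, page):
        query = urlencode({
            "block_id": "page_team_1_block_team_matches_summary_7",
            # shortened; copy the full callback_params from the captured URL
            "callback_params": json.dumps({"team_id": 1241, "page": 0}),
            "action": "changePage",
            "params": json.dumps({"page": page}),  # the part that changes
        })
        return scrapy.Request(f"{self.endpoint}?{query}",
                              callback=self.parse, cb_kwargs={"page": page})

    def parse(self, response, page):
        payload = json.loads(response.text)
        # assumption: the JSON wraps an HTML fragment with the new table
        fragment = payload["commands"][0]["parameters"]["content"]
        for row in scrapy.Selector(text=fragment).css("table.matches tr"):
            yield {"row": row.get()}
        if page > -5:  # arbitrary stop; assumes lower page numbers go back in time
            yield self.page_request(page - 1)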

Related

How to extract the URL of a particular page using selenium/python?

I'm building an Instagram bot using Selenium.
How do I extract the URL of a page using Python?
For example, Selenium is loading a webpage and I want to extract the URL of that particular page (suppose: https://instagram.com/as80df67s4). How do I extract that link?
From webdriver.py:
@property
def current_url(self):
    """
    Gets the URL of the current page.

    :Usage:
        driver.current_url
    """
    return self.execute(Command.GET_CURRENT_URL)['value']
This means that in order to get the current URL you can use:
your_url = driver.current_url
But the page needs to be open first.
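For example, a minimal sketch (any driver works; the Instagram URL is just the one from the question):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://instagram.com/as80df67s4")  # the page must finish opening first

your_url = driver.current_url  # e.g. "https://instagram.com/as80df67s4"
print(your_url)
driver.quit()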

Python Scrapy - Trying to use pagination that is "#" and redirects me to same page

I'm building a scraper for this page
I want to follow to the next page after getting the first page working (and after setting AutoThrottle and the download delay to be gentle). I tried:
next_page = response.xpath('//div[@class="global-pagination"]/a[@class="next"]/@href').get()
if next_page is not None:
    yield response.follow(next_page, self.parse)
The problem is that the href in that element is "#", so it basically opens the same page again. How do I make it work?
If you take a look at your browser's developer tools, you will see that when you go to another page the data is loaded from a loadresult request. Furthermore, in the Form Data you'll see a field named page holding the number of the page you requested; you can fetch any other page by changing that value in the formdata of a FormRequest.
from scrapy.http import FormRequest

FormRequest(url=url, formdata={'page': str(page_number)}, callback=self.parse)
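Putting it together, a minimal sketch of such a spider; the loadresult endpoint name is the one seen in the network inspector above, and the URL here is only a placeholder:

import scrapy
from scrapy.http import FormRequest


class PagedSpider(scrapy.Spider):
    name = "paged"

    def start_requests(self):
        # request the first few pages; formdata values must be strings
        for page in range(1, 6):
            yield FormRequest(
                url="https://www.example.com/loadresult",  # placeholder endpoint
                formdata={"page": str(page)},
                callback=self.parse,
            )

    def parse(self, response):
        yield {"url": response.url, "size": len(response.body)}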

I want to extract the url from the <a href="#" onclick="redirectpage(2);return false">...</a>

I'm using scrapy and passing SplashRequest. I want to extract the URL from the href as usual, but when I inspect the href to get the actual URL, it does not hold the URL I'm looking for; instead I see '#', and when I hover the mouse over that '#' I can see the URL I'm looking for.
How can I get that URL and then follow it using SplashRequest?
The HTML code is shown below:
<a href="#" onclick="redirectpage(2);return false">Page 120</a>
When I hover the mouse over the href I see the URL I'm looking for, as shown below:
https://example.com/page/120
To get the href/url attribute:
//div[@class='---']/a/@href
I believe this works on any page.
For getting the URL, you should use one of the dynamic data fetching methods: click the particular link and watch the URL of the resulting request. If the content is not available in the page source, then it is loaded dynamically via some script, and we should handle it that way.
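Since the hover URL follows the pattern https://example.com/page/<n> and the page number already sits in the onclick attribute, one option is to rebuild the URLs yourself. A sketch, assuming that pattern holds for every link:

import re

import scrapy
from scrapy_splash import SplashRequest


class OnclickSpider(scrapy.Spider):
    name = "onclick_pages"

    def start_requests(self):
        yield SplashRequest("https://example.com/page/1", callback=self.parse)

    def parse(self, response):
        # pull the page number out of onclick="redirectpage(N);return false"
        for onclick in response.xpath("//a/@onclick").getall():
            match = re.search(r"redirectpage\((\d+)\)", onclick)
            if match:
                url = f"https://example.com/page/{match.group(1)}"
                yield SplashRequest(url, callback=self.parse)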

getting # after extracting href from <a> tag

Trying to scrape https://www.pagesjaunes.fr/annuaire/marseille-13/jardinier , I have a problem with pagination.
The link to the next page is stored in an <a> tag, but I get # from a['href'] instead of the link.
from bs4 import BeautifulSoup
from lxml import html

# response comes from an earlier requests.get(...) call
tree = html.fromstring(response.text)  # built but not used below
soup = BeautifulSoup(response.text, 'html.parser')
footer = soup.find(class_='result-footer')
divpagination = footer.find(class_='pagination')
atag = divpagination.find("a", {"id": "pagination-next"})
print(atag.get('href'))
Output : #
Note: I make the request without the Accept-Encoding header, so the server doesn't compress the response it sends.
The next-page <a> tag (id "pagination-next", text "Suivant") carries href="#" both in the raw HTML and as parsed by BeautifulSoup; the real target is not in the href.
As you can see if you inspect the page's source code in your browser (or just print it), this link uses js for navigation.
There are additional (non-standard) attributes on the tag, so you can try to reverse-engineer the whole thing (check the tag's attribute values, click the link in your browser, and compare with the new page's effective URL).
If that doesn't work, you'll need a headless browser and code to drive it (Selenium being the canonical Python solution).
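A sketch of that Selenium route, clicking the pagination-next link (the id comes from the code above) until it disappears:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://www.pagesjaunes.fr/annuaire/marseille-13/jardinier")

while True:
    # ... scrape the current page from driver.page_source here ...
    try:
        next_link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "pagination-next")))
    except TimeoutException:
        break  # no clickable next link: we reached the last page
    next_link.click()  # the js handler attached to the link loads the next page

driver.quit()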

Preventing javascript from running in an href

I have a problem with the following HTML:
<a href="javascript:document.formName.submit();" target="iframe">
Where formName is the name of a form inside the iframe. I would like the browser to navigate to the page "javascript:..." in the iframe, so it executes the javascript on the current page in the iframe. My problem is that the browser will attempt to execute the function and use the returned result as the url. Of course there is no form to submit on the current page, so I just get an error.
Cross-domain iframes are no-fly zones: you won't be able to do anything with or to the DOM inside a frame on a different domain. Even if the user clicked the submit button inside the frame, your page would not be able to get the new URL back out of the frame.
In this case, you can do it by reaching inside the iframe:
<a href="javascript:window.frames[N].contentDocument.FORMNAME.submit()">
(that may not be exactly the right incantation). In general, you should do this with an onclick handler for the hyperlink, that invokes a function defined by the iframe's document.
EDIT: There is NO WAY to make this work cross-domain. It's a violation of the browser's security policies.
