I am trying to scrape a website that renders all the data with JS. It has pages of tables; however, you can only reach a given page via the search box or by clicking an arrow to move to the next page. It is impossible to access a specific page by URL.
I need to change the proxy on each page. If I reload the webdriver, I have to re-run all the searches to get back to, e.g., the 124102nd page, which is both time-consuming and computationally intensive.
Could anyone help me with this?
Related
I am trying to crawl the listings on this website via scrapy: https://www.hipflat.com/search/rent/condo_y/TH.BM_r1/any_r2/any_p/any_b/any_a/any_w/any_i/100.560155,13.737171_c/16_z/list_v
However, I am stuck on the navigation. At the bottom of the page, the "next page" links show up, but as far as I can see they call an external site (Algolia) via a JavaScript query.
What would be the easiest way to make the navigation crawlable via scrapy?
The next page link is present in the page. You can get it using response.css("[rel='next']::attr(href)").get(). This will give you the next link for pagination, and you can then proceed with a GET request using response.follow(url, callback=...).
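A minimal spider sketch along those lines (the spider name, the item fields, and the ".listing" selector are illustrative placeholders, not taken from the actual site) might look like this:

    import scrapy

    class ListingsSpider(scrapy.Spider):
        # Hypothetical spider: walks the result pages by following rel="next" links.
        name = "listings"
        start_urls = [
            "https://www.hipflat.com/search/rent/condo_y/TH.BM_r1/any_r2/any_p/any_b/any_a/any_w/any_i/100.560155,13.737171_c/16_z/list_v",
        ]

        def parse(self, response):
            # Extract whatever listing data you need from the current page
            # (".listing" and the field name are placeholders).
            for listing in response.css(".listing"):
                yield {"title": listing.css("::text").get()}

            # Grab the rel="next" link and queue a plain GET request for it.
            next_url = response.css("[rel='next']::attr(href)").get()
            if next_url:
                yield response.follow(next_url, callback=self.parse)

Keep in mind this only works if the rel="next" link is actually present in the HTML Scrapy receives; if the links are injected purely by the Algolia JavaScript call, you would have to query that endpoint directly or render the page first.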
I'm working on a browser extension (compatible with Chrome, FF, Opera, and Edge) and I'm trying to figure out how to associate a page with the requests it makes to outside domains. For example, when you go to google.com, a lot of requests to domains other than google.com occur, such as requests to gstatic.com.
An extension like NoScript shows all of the domains that a page requested and lets you allow or deny each one. I'm trying to get similar functionality.
Is this something that can be done in the content script or is there some way to keep state information in the background script that I can then display in the popup? Obviously it's possible but I'm just not seeing which callback I can use.
I'm configuring desktop and mobile versions of my site and was looking to use JS to test the browser dimensions and then load the relevant version. The problem is that if someone shares a link from the mobile version and sends it to a desktop user, the check is circumvented. Is there a way to configure .htaccess (or some other method) so that the address bar shows 'mysite.com' even though I would be loading 'mysite.com/mobile.htm'? I know I can always use media queries, but those have the downside of loading unused assets, so this method would be a lot better.
Use a rewrite instead of a redirect. With a redirect, the browser is instructed to go to another address. With a URL rewrite, the server just responds with the contents of a different URL.
For just this one page it will be simple, but it could get complicated, depending on how your site is set up.
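As a rough sketch, assuming you fall back to user-agent sniffing on the server (the JS dimension check isn't available there, and the pattern and file names below are placeholder assumptions), an internal rewrite in .htaccess could look like:

    RewriteEngine On

    # Internal rewrite, not a redirect: there is no [R] flag, so the address bar
    # keeps showing mysite.com while the server answers with the mobile page.
    RewriteCond %{HTTP_USER_AGENT} (android|iphone|mobile) [NC]
    RewriteRule ^(index\.html?)?$ mobile.htm [L]

Whether this is enough depends on how reliably you can tell mobile visitors apart on the server side.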
Another way is to include a little JS in every page to make sure you are on the right one for the device and redirect to the other if not. It would help if there was some pattern to easily determine the corresponding page.
I want my site's address bar not to change when I go to subpages; it should keep showing my index.html even when I enter sub pages.
Like if I open www.xyz.com and I navigate to any page it should still show www.xyz.com.
I heard this can be done with .htaccess. Is it possible?
You really should think about why you want this, because this way of working has a couple of drawbacks:
Users can't see they are on a different page
Users can't bookmark your pages for fast access
Users can't share links with each other
Search engines may have trouble spidering your site
But basically, there are two main ways to do this:
Use frames. Put the page into a frame, and have all the links stay in this frame.
Use JavaScript. Have each page "load" into the current page, using AJAX.
I have an application that utilizes rather unfriendly dynamic URLs most of the time. I am providing friendly URLs to some content, but these are used only as an entry point into the application, after which point all of the generated URLs will be the unfriendly variety.
My question is, if I know that the user is on a page for which a friendly URL could be generated and they choose to bookmark it, is there a way to tell the browser to bookmark the friendly one instead of what is in the address bar?
I had hoped that rel="canonical" would help here, but it seems as if it's only used for indexing. Maybe one day browsers will utilise it.
No. This is by design, and a Good Thing.
Imagine the following scenario: Piskvor surfs to http://innocentlookingpage.example.com/ and clicks "bookmark". He doesn't notice that the bookmark he saved actually points to http://evilsite.example.net/. The next time he opens that bookmark, he might get a bit of a surprise.
Another example without cross-domain issues:
Piskvor the site admin clicks "bookmark" on the homepage of http://security-holes-r-us.example.org/. Unfortunately, the page is vulnerable to script injection, and the injected code changes the bookmark to http://security-holes-r-us.example.org/admin?action=delete&what=everything&sure=absolutely. If he's still logged in the next time he opens the bookmark, he may find his site purged of data. (Granted, it was his fault for not preventing script injection AND for having non-idempotent GET resources, but that is all too common.)