How to fetch website links when they're not numerically ordered - python-3.x

Using BeautifulSoup, it's easy to fetch URLs that follow a numeric order. However, how do I fetch links when the site is organized differently, such as https://mongolia.mid.ru/en_US/novosti, which has articles like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1?
Sites like this are awkward: when you first open the link, there is a » Бусад мэдээ ("Other news") button that takes you to the next page of articles. Once you click it, you get Previous and Next buttons instead, which is inconsistent.
How do I fetch all the news articles from websites like these (https://mongolia.mid.ru/en_US/novosti or https://mongolia.mid.ru/ru_RU/)?

It seems that the » Бусад мэдээ button from https://mongolia.mid.ru/ru_RU/ just redirects to https://mongolia.mid.ru/en_US/novosti. So why not start from the latter?
To scrape all the news, just go page by page, following the link from the Next button.
If you want it to be more programmatic, compare the query parameters between pages and you'll see that _101_INSTANCE_hfCjAfLBKGW0_cur is set to the current page's number (starting from 1).
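A minimal sketch of that idea, using only the standard library rather than BeautifulSoup. The helper names (`with_page`, `article_links`) are made up for illustration, and the filter on `/asset_publisher/` and `/content/` is an assumption based on the article URL shown in the question. The site may also need the other query parameters from the Next link, so `with_page` keeps whatever parameters the URL already has and only changes the page number.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Query parameter named in the answer; it holds the page number, starting from 1.
PAGE_PARAM = "_101_INSTANCE_hfCjAfLBKGW0_cur"


def with_page(url, page):
    """Return url with the page parameter set to page, keeping other parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != PAGE_PARAM
    ]
    query.append((PAGE_PARAM, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


class ArticleLinkParser(HTMLParser):
    """Collect hrefs that look like article links (assumed URL pattern)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: article URLs contain both of these path segments.
        if href and "/asset_publisher/" in href and "/content/" in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:
                self.links.append(absolute)


def article_links(html, base_url):
    """Return the de-duplicated article links found in one listing page."""
    parser = ArticleLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

To use it, copy the URL behind the Next button once, then call `with_page(next_url, n)` for n = 1, 2, 3, ... and fetch each page, stopping when `article_links` returns nothing new.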

Related

How to click a button and scrape text from a website using python scrapy

I have used Python Scrapy to extract data from a website, and I am able to scrape most of the details of the site. My main problem is that I am not able to extract all the product reviews. I can only extract the top 4 reviews displayed on the page; to get the other reviews I have to open a pop-up window that contains all of them. I looked for an 'href' for the pop-up window but was not able to find one. This is the link that I tried to scrape. The reviews and ratings are at the bottom of the page: https://www.coursera.org/learn/big-data-introduction
Can anyone help me by explaining how to extract the reviews from this pop-up window? Another thing to note is that the pop-up uses infinite scrolling.
Thanks in advance.
Scrapy, unlike tools like Selenium and PhantomJS, does not drive a full web browser in the background. You cannot just click a button.
You need to understand what the button does (e.g. does it simply submit a form? Does it do something with JavaScript? Etc.) and reproduce the functionality in your own code.
For example, you might need to read the content of a script element, apply regular expressions to it to pull a URL from a string literal, then make a new HTTP request to that URL, then pull the data you want from the new DOM.
... and then repeat for the next “page” of the infinite scroll.
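A rough sketch of the "pull a URL from a string literal in a script element" step. The sample markup, the `reviewsUrl` variable, and the example.com endpoint are all invented for illustration; the real page will look different, and the regular expressions would need adapting to it.

```python
import re

# Hypothetical page fragment: the real markup and variable name will differ.
SAMPLE_HTML = """
<html><body>
<script>
  var reviewsUrl = "https://example.com/api/reviews?page=1";
</script>
</body></html>
"""

# Matches the body of each <script> element.
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
# Matches http(s) URLs that appear inside quoted string literals.
URL_LITERAL_RE = re.compile(r"""["'](https?://[^"']+)["']""")


def urls_in_scripts(html):
    """Return every quoted http(s) URL found inside script elements."""
    urls = []
    for script_body in SCRIPT_RE.findall(html):
        urls.extend(URL_LITERAL_RE.findall(script_body))
    return urls


print(urls_in_scripts(SAMPLE_HTML))
```

In a Scrapy spider, each URL found this way could be requested from the parse callback, with the same extraction repeated on the response for the next "page" of the infinite scroll.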

Scraping in Scrapy without pagination links

I'm trying to paginate through a review site that doesn't have hyperlinked pagination buttons. Code snippet for the pagination button:
<span data-page-number="2" data-offset="5" class="nav next taLnk " onclick="(ta.prwidgets.getjs(this,'handlers')).paginate(this); ta.trackEventOnPage('NORTH_STAR_PAGINATION', 'next', '2', 0);" data-page="LqWQeVsSVuWy3KkAoWMUKvKmmmmWxfWiEoWrGVQhIpMgQJIQxGCSsVEtSIgQfSIgWwGScJMVc2GSJQwVCCtgcsJCSJB"><div class="ui_button primary ">Next</div></span>
However, there is a pagination structure in the URL, for example foobar.com/page1.
I'd like to avoid using a headless browser. Since I'm visiting many of these pages, I can't manually inspect the page count of each one.
However, I do know there are 10 reviews per page, and the review count is stated on the first page, as is the number of pages. Is there a way I can use this scraped information to paginate via URL logic from my initial start URL? Thanks!
(FYI: going to a URL with a page number that doesn't exist redirects back to page 1.)
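One way the URL logic described in the question could be sketched, assuming the review count has already been scraped from the first page and that page URLs really follow the `/pageN` pattern given as an example. The function names are illustrative only.

```python
import math

REVIEWS_PER_PAGE = 10  # stated in the question


def page_count(review_count, per_page=REVIEWS_PER_PAGE):
    """Number of pages needed for review_count reviews (at least 1)."""
    return max(1, math.ceil(review_count / per_page))


def page_urls(base_url, review_count):
    """Build every page URL, assuming the pattern <base_url>/page<N>."""
    last_page = page_count(review_count)
    return [f"{base_url}/page{n}" for n in range(1, last_page + 1)]


# Example with a made-up review count of 25, which needs 3 pages.
for url in page_urls("https://foobar.com", 25):
    print(url)
```

In a Scrapy spider, the first-page callback could compute this list and yield one request per URL. Because a non-existent page redirects back to page 1, stopping exactly at the computed last page avoids scraping page 1 twice.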

htaccess Redirecting or Rewriting to a form results page with variables intact

I am building a website that needs a page selector on product search results. The page selector currently adds a forward slash and a number (representing the page) to the end of the current URL.
e.g. If I am browsing Washing Machines on "/laundry/Washing-Machines" and I click page 2 on the selector it takes me to "/laundry/Washing-Machines/2" and page 2 loads, this is working fine.
Now, the problem I am having...
I have a form in the sidebar where the user can filter Range Cooker search results by brand, fuel type, size and colour. The form gathers the products from the database that meet the search criteria, and displays the results along side the page selector.
If I leave the form values as default and submit the form I am presented with the results on "/cooking/Range-Cookers/Search?brand=0&type=0&size=0&colour=0" but when I click page 2 on the selector I am taken to "cooking/Range-Cookers/2" which presents me with a 404. If I add "&page=2" to the end of the original URL I am presented with page 2.
Since the page selector is a php include and works fine for every product except the results from my Range Cooker form, I would rather find a solution that leaves the selector php intact.
Is there any way I can add a redirect to .htaccess which would take a link from my page selector e.g. "cooking/Range-Cookers/5" and correctly apply it to the current URL with all form variables intact e.g. "cooking/Range-Cookers/Search?brand=1&type=2&size=0&colour=0&page=5"?
I have experience in HTML, CSS and PHP, but I am new to editing .htaccess and would appreciate any insight into how I can accomplish this. Thanks.
You cannot do this with .htaccess, because the information is not available when the request reaches Apache (and therefore .htaccess).
When you click the link for page 2, the client requests the URL in the associated href attribute. It doesn't provide any other information available on the current page. If you want this information transmitted, you must modify the link for page 2 from
cooking/Range-Cookers/2
to
cooking/Range-Cookers/Search?brand=1&type=2&size=0&colour=0&page=2
when you deliver the page to the client. Same goes for any other information you need for following pages.
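The site in the question is PHP, so the following Python sketch only illustrates the logic of building such a link when the page is rendered: take the query string of the current request, drop any existing `page` value, and append the new one. The function name and arguments are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode


def page_link(search_path, current_query, page):
    """Build a page link that keeps the current filter parameters."""
    params = [
        (key, value)
        for key, value in parse_qsl(current_query, keep_blank_values=True)
        if key != "page"  # replace any page number already present
    ]
    params.append(("page", str(page)))
    return f"{search_path}?{urlencode(params)}"


print(page_link("cooking/Range-Cookers/Search", "brand=1&type=2&size=0&colour=0", 2))
```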

Using Watir How can i visit all the links of a web page and then sub links of the visited link

I have a web page that contains several links. Clicking any link leads to another page that also contains several links, and so on for every link.
I want to click on all the links. When I click the first link, the script should click all the links on the page it leads to, and so on. When it is done with those, it should go back and click the second link on the first page, and likewise for the remaining links.
Can anyone help me with this? I have written a script that can click all the links on the main (first) page, but I have no idea how to do the same for the sub-pages of the application.
Please reply ASAP, it's very urgent.
You just have to implement some recursive function like this:
# assumes `browser` is reachable inside the method (e.g. defined as a method or constant)
def crawl(url)
  browser.goto url
  # gather all hrefs before navigating away, so no stale link elements are used later
  all_links = browser.links.map(&:href).select do |href|
    href =~ /appdomain/ # do not visit external links
  end
  all_links.each do |href|
    crawl href
  end
end
crawl "http://appdomain.com/"
This is untested code, but it might work :)
Also, this code does not avoid visiting the same path twice when it is linked from different places, so there's room for optimization.
It might be that you're using the wrong tool for the job; at least it seems so when reading your question. What is the original problem?
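As an illustration of that optimization, here is a tool-agnostic sketch in Python rather than Watir. It keeps a set of visited URLs and skips anything already seen or outside the allowed host. The `get_links` callable is a stand-in for whatever loads a page and returns its hrefs; it is an assumption for this sketch and not part of Watir.

```python
from urllib.parse import urlsplit


def crawl(start_url, get_links, allowed_host):
    """Visit each same-host URL once, depth first; return the visit order."""
    visited = set()
    order = []
    stack = [start_url]
    while stack:
        url = stack.pop()
        # Skip pages already visited and links to other hosts.
        if url in visited or urlsplit(url).hostname != allowed_host:
            continue
        visited.add(url)
        order.append(url)
        # Reversed so that links are visited in the order they appear on the page.
        stack.extend(reversed(get_links(url)))
    return order


# Usage with a made-up link graph instead of a real browser.
graph = {
    "http://appdomain.com/": ["http://appdomain.com/a", "http://other.com/x"],
    "http://appdomain.com/a": ["http://appdomain.com/"],
}
print(crawl("http://appdomain.com/", lambda url: graph.get(url, []), "appdomain.com"))
```

Using an explicit stack instead of recursion also avoids hitting the recursion limit on deeply linked sites.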

How can I create a separate search page for my blog?

I made a WordPress blog: link text
I have a separate HTML page with a form and an input field for searching. It does not work. But when I have the search field and submit button in the sidebar on every page, it works fine. I left both the sidebar search and the "search blog" page available.
Is it possible to have a separate search page and have the results appear normally?
Your search box is redirecting to the wrong URL.
It sends the visitor to /blog/?name=(search term)
It should be /blog/?s=(search term)
You can see this if you look at the URL for a search that works (i.e. from your sidebar).
