Python: Scrape JavaScript generated content - python-3.x

I'm trying to scrape a web page because I need to grab text from it, but the text is generated by JavaScript. I can't use Selenium to simulate the browser because there is too much text to generate and it crashes. I can't use requests either, because the response for the requested URL doesn't contain the desired text (it is produced by JavaScript), yet if I paste the URL into Chrome I get what I need. I have not managed to get the expected results in Python and I need some help.

Related

Intercept Chrome translating with js

I have a Chrome extension, and Chrome's auto-translate option is changing some dates on the page, which causes errors during data processing.
Is it possible to turn off translation of the page from the content script?
I have tried with appending the element with also tried to add class "notranslate" to .

How to only get the "title" and "main content" of a page using puppeteer?

I'm trying to create a clone of getpocket.com for learning. In that app, every saved link gets converted into markdown, and it seems to be filtered content containing only the page title and body, without headers, footers, etc.
I can get the page's title using the puppeteer API in different ways:
using page.title()
or get the page's opengraph "og:title"
But how do I get a summarized version containing only the main content of the page?
Note that I don't know the CSS class of the main content beforehand, since I plan to just enter a URL in a textbox and scrape that site from there.
I have found what I needed for this scenario.
I used the Readability.js library, which makes web pages readable by removing certain HTML tags. Here's the library.
This library is what Mozilla uses behind the scenes when rendering its reader view.

I can't extract Instagram hashtags of a post with bs4

I want to extract hashtags from a specific post (given its URL) using BeautifulSoup4. First I fetch the page using requests, then I try find_all() to get every hashtag, but there seems to be a hidden problem.
Here is the code:
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
soup = bs(r.content,'html.parser')
items = soup.find_all('a',attrs={'class':' xil3i'})
print(items)
The result of this code is just an empty list. Can someone please help me with the problem?
It looks like the page you are trying to scrape requires JavaScript. This means that some elements of the web page are not there when you send a GET request.
One way to figure out whether the web page needs JavaScript to populate the info you want is to simply save the HTML into a file:
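import requests
URL = 'https://www.instagram.com/p/CBz7-X6AOqK/?utm_source=ig_web_copy_link'
r = requests.get(URL)
# Save the raw response so it can be opened and inspected in a browser
with open('dump.html', 'w', encoding='utf-8') as file:
    file.write(r.text)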
and then open that file in a web browser.
If the file you open does not contain the information you want to scrape, then that information is most likely filled in by JavaScript after the page loads.
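The same check can be done without opening a browser, by searching the raw response for a piece of text you know is visible on the rendered page. This is only a rough sketch: the function names, the `marker` argument, and the `dump.html` default are my own choices for illustration, and a missing marker is a hint rather than proof that JavaScript is involved.

```python
def content_in_raw_html(html, marker):
    """Return True if `marker` occurs in the raw HTML (case-insensitive)."""
    return marker.lower() in html.lower()


def fetch_and_check(url, marker, dump_path="dump.html"):
    """Fetch `url`, save the raw HTML for inspection, and look for `marker`.

    Hypothetical helper: needs the third-party `requests` package and
    network access, so the import is kept local to this function.
    """
    import requests

    response = requests.get(url, timeout=10)
    with open(dump_path, "w", encoding="utf-8") as fh:
        fh.write(response.text)
    return content_in_raw_html(response.text, marker)
```

If `fetch_and_check(URL, "some text you see in the browser")` comes back False while the browser clearly shows that text, the content is probably added after the initial page load.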
To get around this you can render the JavaScript using either of the following:
A web driver (like Selenium) that simulates a user visiting those pages in a web browser
requests-HTML, a relatively new package that can render the JavaScript on a page and has many other features that are useful for web scraping
More people work with Selenium, which makes it easier to find help when debugging than with requests-HTML. But if you do not want to learn a new module like Selenium, requests-HTML is very similar to requests and should not be difficult to pick up.
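For the requests-HTML route, a minimal sketch could look like the following. The helper names and the `#\w+` pattern are my own assumptions, and I have not tested this against Instagram itself, which may require a login or block automated requests. Note that `render()` downloads a headless Chromium the first time it runs.

```python
import re

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")


def extract_hashtags(text):
    """Return every '#word' token found in `text`, in order of appearance."""
    return HASHTAG_RE.findall(text)


def hashtags_from_rendered_page(url):
    """Render the page's JavaScript, then pull hashtags from its visible text."""
    # Imported lazily so extract_hashtags() works without the package.
    from requests_html import HTMLSession  # third-party: pip install requests-html

    session = HTMLSession()
    response = session.get(url)
    # Runs the page's scripts in headless Chromium and updates response.html.
    response.html.render()
    return extract_hashtags(response.html.text)
```

Keeping the regex step separate from the fetching step means the extraction can be tried on any string first, for example `extract_hashtags("Sunset #travel #photo_2020!")` gives `['#travel', '#photo_2020']`.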

How to click a button and scrape text from a website using python scrapy

I have used Python Scrapy to extract data from a website, and I am able to scrape most of the details of the site. My main problem is that I am not able to extract all the product reviews. I can only extract the top 4 reviews that are displayed on the page; to get the others I have to open a pop-up window that contains all the reviews. I looked for an 'href' for the pop-up window but was not able to find one. This is the link that I tried to scrape. The reviews and ratings are at the bottom of the page: https://www.coursera.org/learn/big-data-introduction
Can anyone help me by explaining how to extract the reviews from this pop-up window? Another thing to note is that the pop-up uses infinite scrolling.
Thanks in advance.
Scrapy, unlike tools such as Selenium and PhantomJS, does not drive a full web browser in the background. You cannot just click a button.
You need to understand what the button does (e.g. does it simply submit a form? Does it do something with JavaScript? Etc.) and reproduce that functionality in your own code.
For example, you might need to read the content of a script element, apply regular expressions to it to pull a URL from a string literal, make a new HTTP request to that URL, and then pull the data you want from the new response.
... and then repeat for the next “page” of the infinite scroll.
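To make that concrete, here is a rough sketch of the "pull a URL out of a script element, then page through it" idea using only the standard library. Everything site-specific is invented for illustration: the sample HTML, the `reviewsUrl` key, the `/api/reviews` path, and the `start`/`limit` parameters are assumptions, not what Coursera actually uses. You would have to find the real names in the browser's network tab.

```python
import re
from urllib.parse import urlencode, urljoin

# Hypothetical page source; the inline config object is made up.
SAMPLE_HTML = """
<html><body>
<script>
  var config = {"reviewsUrl": "/api/reviews?course=demo", "pageSize": 4};
</script>
</body></html>
"""

SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
URL_RE = re.compile(r'"reviewsUrl"\s*:\s*"([^"]+)"')


def find_reviews_url(html, base_url):
    """Search each script body for the string literal holding the reviews URL."""
    for script_body in SCRIPT_RE.findall(html):
        match = URL_RE.search(script_body)
        if match:
            # Resolve a relative path against the page's own URL.
            return urljoin(base_url, match.group(1))
    return None


def page_url(reviews_url, start, limit):
    """Build the URL for one 'page' of the infinite scroll."""
    separator = "&" if "?" in reviews_url else "?"
    return reviews_url + separator + urlencode({"start": start, "limit": limit})
```

In a Scrapy spider, the URL returned by `page_url` is what you would pass to a new request from your parse callback, increasing `start` until the endpoint stops returning reviews.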

Rendering a user modified page using PhantomJS

My use case is: a user goes to a web page and modifies it by filling in a form, populating the page with data from the database, or dragging around some draggables on the page. They can then download the modified page as a PDF. I was thinking of using PhantomJS to do the conversion from HTML to PDF.
I understand the basic functionality of PhantomJS and got the basic example working, but in all the examples I've seen, either a local file or a URL is passed in. Example:
page.open('./test.html', function () { ... }
How would I render the page that is getting modified by a user using PhantomJS? I have 2 ideas:
Have the url change as the user modifies the page, and simply pass in the url. For example, the url contains the position of a draggable div.
Send the modified html to back-end, save it, and run PhantomJS
Do these solutions make sense? I'm hoping there would be a simpler way.
