How to get visual DOM structure from url in node.js

I am wondering how to get the "visual" DOM structure from a URL in Node.js. When I try to get the HTML content with the request library, the HTML structure is not correct.
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
  console.log(body);
});
The returned HTML structure is below, where the meta tags are not correct:
<meta property="og:title" content=""/>
<meta itemprop="description" name="description" content=""/>
If I open the website in a web browser, I can see the correct meta tags in the web inspector:
<meta property="og:title" content="Trump promised to destroy the Johnson Amendment. Congress is targeting it now."/>
<meta itemprop="description" name="description" content="Observers believe the proposed legislation would make it harder for the IRS to enforce a law preventing pulpit endorsements."/>

I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser like puppeteer is probably the way to go when a website has complex loading behavior.
The advantage here is that, with puppeteer at least, you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you specified is truthy, but depending on your needs you could wait for something else, or even wait for multiple conditions to be true.
You might have to analyze the behavior of the page in question a little more deeply to figure out what you should wait for, but at the very least the following code seems to correctly load the tags in your question.
import puppeteer from 'puppeteer'

(async () => {
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(url)
  // wait until <meta property="og:title"> exists and has a truthy content attribute
  await page.waitForFunction(() => {
    const meta = document.querySelector('meta[property="og:title"]')
    return meta && meta.getAttribute('content')
  })
  const html = await page.content()
  console.log(html)
  await browser.close()
})()
(pastebin of formatted html result)
Also, since this solution uses puppeteer, I'd recommend not working with the HTML string and instead using the puppeteer API to extract the information you need.
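For example, once the waitForFunction above has resolved, you could read a tag's content directly with page.$eval instead of parsing the HTML string. A minimal sketch, reusing the page from the snippet above and the og:title tag from the question:
// evaluate a function against the first element matching the selector, in the page context
const ogTitle = await page.$eval('meta[property="og:title"]', (el) => el.getAttribute('content'))
console.log(ogTitle)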

Related

Web Scraping NodeJs - How to recover resources when the page loads in full after several requests

I'm trying to retrieve each item (composed of an image, a word and its translation) from this page.
Link of the website: https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true
I used jsdom and got.
Here is the code:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const got = require('got');

(async () => {
  const response = await got("https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true");
  console.log(response.body);
  const dom = new JSDOM(response.body);
  console.log(dom.window.document.querySelectorAll(".ld-egdn1r"));
})();
When I display the HTML code that is returned, it does not correspond to what I see when I open the site in my browser. There are no HTML tags that contain the items.
When I look at the Network tab, other resources are loaded, but again I can't find the query that retrieves the words.
I think that what I am looking for is loaded by one of several queries, but I don't know which one.
Here are the steps: in your browser's Network tab, find the request that fetches the data and copy it as a fetch call (the original answer included a screenshot of this step). You will then get code like this:
fetch("https://xcvbaysyxd-dsn.algolia.net/1/indexes/*/queries", {
  "credentials": "omit",
  "headers": {},
  "referrer": "https://livingdictionaries.app/",
  "body": "...",
  "method": "POST",
  "mode": "cors"
});
You will just have to process the data manually after that:
const fetch = require("node-fetch"); // npm i node-fetch

(async () => {
  // paste the copied fetch request in place of the "..."
  const data = await fetch(...).then(r => r.json());
  const product = data.results.map(r => r.hits); // in your case, each hits array holds the entries
})();
The site you are trying to scrape is a Single Page Application (SPA) built with Svelte, and the individual elements are dynamically rendered as needed, as on many websites today. Since the HTML is not hard-coded, these sites are notoriously difficult to scrape.
If you just log the response, you will see that the elements you are selecting for do not exist. This is because it is the browser that interprets the JavaScript at run time and updates the UI; a GET request using got, axios, fetch, or whatever else cannot perform such tasks.
You will need to use a headless browser like Puppeteer in order to dynamically render the site and scrape it.
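A minimal sketch of that approach, assuming the .ld-egdn1r selector from the question still matches the rendered entries:
const puppeteer = require('puppeteer'); // npm i puppeteer

(async () => {
  const url = 'https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  // wait until the client-side render has produced at least one entry
  await page.waitForSelector('.ld-egdn1r');
  const entries = await page.$$eval('.ld-egdn1r', (els) => els.map((el) => el.textContent));
  console.log(entries);
  await browser.close();
})();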

Can't scrape text with cheerio

I'm trying to scrape this page with Cheerio: https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all
But I can't get anything. I tried to get the 'Word-Idiom' text but I get nothing as a response.
Here's my code:
app.get("/conjugation", (req, res) => {
  axios(
    "https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all"
  )
    .then((response) => {
      const htmlData = response.data;
      const $ = cheerio.load(htmlData);
      const element = $(
        "#searchPage_entry > h3 > span.title_text.myScrollNavQuick.my_searchPage"
      );
      console.log(element.text());
    })
    .catch((err) => console.log(err));
});
The server at that URL doesn't return any body DOM structure in the HTML response. The body DOM is rendered by linked JavaScript after the response is received (note also that everything after the # in that URL is a fragment, which is never even sent to the server). Cheerio doesn't execute the JavaScript in the HTML response, so it won't be possible to scrape that page using Cheerio. Instead, you'll need to use another method which can execute the in-page JavaScript (e.g. Puppeteer).
This is a common issue in web scraping: the page loads dynamically, so when you fetch the content of the initial GET response from that website, all you're getting is script tags. Print the htmlData and you'll see what I mean. There are no loaded HTML elements in your response; what you'll have to do is use something like Selenium to wait for the elements you need to be rendered.
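A minimal sketch of that approach, using Puppeteer as the first answer suggests (Selenium would work similarly) and reusing the selector from the question:
const puppeteer = require('puppeteer'); // npm i puppeteer

(async () => {
  const url = 'https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all';
  const selector = '#searchPage_entry > h3 > span.title_text.myScrollNavQuick.my_searchPage';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // wait for the client-side render to produce the element, then read its text
  await page.waitForSelector(selector);
  const text = await page.$eval(selector, (el) => el.textContent);
  console.log(text);
  await browser.close();
})();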

Why does an await within a server endpoint block other incoming requests?

I have two different versions of an endpoint for testing.
The first one is this:
app.get('/order', async (req, res) => {
  console.log("OK")
  getPrintFile()
})
And the second one is this:
app.get('/order', async (req, res) => {
  console.log("OK")
  await getPrintFile()
})
getPrintFile is an async function that returns a promise that resolves when every operation is done. Within the function I upload an image to a server, download a new image, and re-upload that new image to another server.
I noticed that in the first example, without the await, if I send a lot of requests to the "order" endpoint, I get the "OK" instantly for each request, which is what I want, because that "OK" will be replaced by a res.status(200). I need to send a status 200 immediately after the endpoint is hit, for various reasons. After that I don't care how long the server takes to do all the processing/uploading of the images, as long as the res.send(200) executes instantly when there is a new incoming request.
However, when I use the await, even if new requests are coming in, it takes a long time to display the next "OK" if a previous request is still processing. Usually it displays the OK only when the code within the getPrintFile function is done executing (that is, the images are uploaded and everything is done).
It's like the event loop is blocked but I don't understand why.
What is happening here?
Edit:
To make this clearer, I tested it. If I send 5 requests to the "order" endpoint without the await, the "OK" is displayed in the console immediately for all of them, and then the images are processed and uploaded at their own speed for each request. In the second example, if I send 5 requests, the first OK is displayed, and then the remaining 4 are displayed one at a time as each previous request finishes executing; or if not exactly in that order, they get logged with tremendous delay, and not consecutively.
I'll try to answer based on my understanding of your problem. The first thing missing is the res.sendStatus(200) in your examples, to make them actually work. But then it does indeed happen as you describe: the /order endpoint is "blocking" you from making another request, because the await keeps you from reaching the final statement (res.send). Here is a full example:
const express = require("express");
const app = express();

async function takeTime() {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      console.log("resolved");
      resolve(1);
    }, 2000);
  });
}

app.get("/order", async (req, res) => {
  console.log("ok"); // happens immediately
  await takeTime(); // the response is delayed until the promise resolves
  return res.sendStatus(200);
});

app.get("/", async (req, res) => {
  console.log("ok"); // happens immediately
  takeTime(); // fire and forget; keeps running after the response is sent
  return res.sendStatus(200);
});

app.listen(3000, () => {
  console.log("server running on port 3000");
});
When testing the API (e.g. from Postman), you won't be able to immediately run another request on the /order endpoint, because you will be stuck waiting for the answer to the request you just sent.
On the / endpoint, on the other hand, you will receive the HTTP 200 response immediately (as there is no await), and the code in the takeTime function will keep running asynchronously until it's done.
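If the goal from the question is to acknowledge the request immediately and let the uploads run in the background, a minimal sketch (assuming the getPrintFile function from the question) would send the response first and only then kick off the work:
app.get('/order', (req, res) => {
  res.sendStatus(200); // reply immediately
  // fire and forget: the processing continues after the response,
  // but catch rejections so an upload failure doesn't crash the process
  getPrintFile().catch((err) => console.error(err));
});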
Here you can find more information, I hope it's useful.
Edit 1
Here I add the .html page I'm using to test the looped await requests:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Document</title>
  </head>
  <body>
    <script>
      for (let i = 0; i < 10; i++) {
        fetch("http://localhost:3000/order");
      }
    </script>
  </body>
</html>
The fetch alone was not working for me in the browser, but having a web page make those requests works.

Puppeteer cannot read elements loaded by data-react-helmet

I need to read a website to get some SEO tags, but these tags were embedded by React Helmet (I believe). With the standard process (page.goto(url)) everything works fine, but on SPA pages that lazy-load data this way, I cannot read the tags.
const page = await browser.newPage();
await page.emulate(device);
await page.setRequestInterception(false);
const json_headers = process.argv[3];
const extra_headers = JSON.parse(json_headers);
await page.setExtraHTTPHeaders(extra_headers);
const response = await page.goto(process.argv[2],{ waitUntil: 'networkidle0',referer: process.argv[2]});
await autoScroll(page);
If I add any kind of "wait" function, the program simply stops, because the DOM has already been received and it does not contain the expected element, for example:
await page.waitForSelector('meta[name="description"]');
I tried more than 30 different approaches, but the natural request order is not respected in this case, because the developer inserts the tags (I don't know how) after the result/response is delivered, and in this scenario it seems impossible to crawl the page.
Here is an example of a tag generated on demand (during the page load it does not exist):
<link rel="canonical" href="https://someexample.com/testes" data-react-helmet="true">
Any suggestions (again, I tried all the wait*** functions)?

Problems with xpath in nodejs scrapejs

sp = require('scrapejs').init({
  cc: 100,
  delay: 1 * 1000
});

sp.load('http://www.gatherproxy.com/proxylist/anonymity/?t=Elite')
  .then(function ($) {
    var counter = 0;
    // $.q('//*[@id="tblproxy"]/tbody/tr[3]').forEach(function (node) {
    // $.q('/html/body/div[1]/div[0]/table/tbody/tr').forEach(function (node) {
    $.q('//tbody/tr').forEach(function (node) {
      console.log(counter);
      var res = {
        prx: node.textContent
      };
      console.log(res);
      counter += 1;
    });
    console.log(counter);
  })
  .fail(function (err) {
    console.log("srsly");
  });
I am trying to scrape some proxy server information from the webpage, but the XPath extracted with the Google inspection tools doesn't work, and I want to know how to fix it.
The XPath I extracted is //*[@id="tblproxy"]/tbody/tr; I'm not sure why it doesn't work.
It is not a problem with the XPath. The problem is that the HTML you see in your browser is not the HTML that scrapejs sees: the table is generated with JavaScript.
If you download the "raw" site, for example with curl or wget, you will see that each table row consists of a bit of code:
<script type="text/javascript">
gp.insertPrx({"PROXY_CITY":"","PROXY_COUNTRY":"Singapore","PROXY_IP":"119.81.199.86","PROXY_LAST_UPDATE":"1 29","PROXY_PORT":"50","PROXY_REFS":null,"PROXY_STATE":"","PROXY_STATUS":"OK","PROXY_TIME":"22","PROXY_TYPE":"Elite","PROXY_UID":null,"PROXY_UPTIMELD":"21/13"});
</script>
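Since each row is just a JSON object passed to gp.insertPrx, one alternative to a headless browser is to fetch the raw HTML and pull those objects out with a regular expression. A minimal sketch, assuming every row follows the gp.insertPrx({...}); pattern shown above:
const got = require('got'); // npm i got

(async () => {
  const { body } = await got('http://www.gatherproxy.com/proxylist/anonymity/?t=Elite');
  // capture the JSON argument of every gp.insertPrx({...}); call
  const proxies = [...body.matchAll(/gp\.insertPrx\((\{.*?\})\);/g)]
    .map((match) => JSON.parse(match[1]));
  proxies.forEach((p) => console.log(p.PROXY_IP + ':' + p.PROXY_PORT + ' (' + p.PROXY_COUNTRY + ')'));
})();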
