Problems on xpath in nodejs scrapejs - node.js

const sp = require('scrapejs').init({
  cc: 100,
  delay: 1 * 1000
});
sp.load('http://www.gatherproxy.com/proxylist/anonymity/?t=Elite')
  .then(function ($) {
    var counter = 0
    //$.q('//*[@id="tblproxy"]/tbody/tr[3]').forEach(function(node){
    $.q('//tbody/tr').forEach(function (node) {
      //$.q("/html/body/div[1]/div[0]/table/tbody/tr").forEach(function(node){
      console.log(counter)
      var res = {
        prx: node.textContent
      }
      console.log(res)
      counter += 1
    })
    console.log(counter)
  })
  .fail(function (err) {
    console.log("srsly")
  })
I am trying to scrape proxy server information from this page, but the XPath I copied from the Google Chrome inspection tools doesn't work, and I want to know how to fix it.
The XPath I extracted is //*[@id="tblproxy"]/tbody/tr; I'm not sure why it doesn't work.

It is not a problem with xpath. The problem is that the HTML you see in your browser is not the HTML that scrapejs sees. The table is generated with JavaScript.
If you download the "raw" site, for example with curl or wget, you will see each table row consisting of a bit of code:
<script type="text/javascript">
gp.insertPrx({"PROXY_CITY":"","PROXY_COUNTRY":"Singapore","PROXY_IP":"119.81.199.86","PROXY_LAST_UPDATE":"1 29","PROXY_PORT":"50","PROXY_REFS":null,"PROXY_STATE":"","PROXY_STATUS":"OK","PROXY_TIME":"22","PROXY_TYPE":"Elite","PROXY_UID":null,"PROXY_UPTIMELD":"21/13"});
</script>
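Because each row arrives as a gp.insertPrx({...}) call with a JSON argument, one option is to skip XPath and parse those arguments out of the raw HTML. This is a minimal sketch, assuming every call keeps its JSON object on a single line as in the snippet above; extractProxies is a hypothetical helper name, and the sample string stands in for the downloaded page:

```javascript
// Hypothetical helper: pull the JSON argument out of every gp.insertPrx(...) call.
// Assumes each call sits on one line and its argument is valid JSON.
function extractProxies(html) {
  const pattern = /gp\.insertPrx\((\{.*?\})\);/g;
  const proxies = [];
  for (const match of html.matchAll(pattern)) {
    proxies.push(JSON.parse(match[1]));
  }
  return proxies;
}

// Shortened sample of the raw page source (fields taken from the snippet above).
const sampleHtml = `
<script type="text/javascript">
gp.insertPrx({"PROXY_COUNTRY":"Singapore","PROXY_IP":"119.81.199.86","PROXY_PORT":"50","PROXY_TYPE":"Elite"});
</script>`;

const proxies = extractProxies(sampleHtml);
console.log(proxies.map((p) => `${p.PROXY_IP}:${p.PROXY_PORT}`));
```

In a real run, the string passed to extractProxies would be the response body fetched with whatever HTTP client you already use.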

Related

Can't scrape text with cheerio

I'm trying to scrape this page with cheerio: https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all
But I can't get anything. I tried to get the 'Word-Idiom' text, but I get nothing in the response.
Here's my code
app.get("/conjugation", (req, res) => {
axios(
"https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all"
)
.then((response) => {
const htmlData = response.data;
const $ = cheerio.load(htmlData);
const element = $(
"#searchPage_entry > h3 > span.title_text.myScrollNavQuick.my_searchPage"
);
console.log(element.text());
})
.catch((err) => console.log(err));
});
The server at that URL doesn't return any body DOM structure in the HTML response. The body DOM is rendered by linked JavaScript after the response is received. Cheerio doesn't execute the JavaScript in the HTML response, so it won't be possible to scrape that page using Cheerio. Instead, you'll need to use another method which can execute the in-page JavaScript (e.g. Puppeteer).
This is a common issue in web scraping: the page loads dynamically, so when you fetch the content of the initial GET response from that website, all you get is script tags. Print htmlData and you will see what I mean. There are no rendered HTML elements in your response, so you'll have to use something like Selenium to wait for the elements you need to be rendered.
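A related detail worth checking: everything after the # in that URL is a fragment, and a fragment is not sent to the server as part of an HTTP request, so the request only ever reaches the site root. A small sketch using the standard URL class to show how the address is split (requestUrl and parsed are illustrative names):

```javascript
const requestUrl =
  "https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all";
const parsed = new URL(requestUrl);

// The part the server receives: origin + pathname + search.
console.log(parsed.origin + parsed.pathname + parsed.search);

// The fragment stays on the client; the page's own JavaScript reads it.
console.log(parsed.hash);
```

The search query therefore only exists for the client-side script, which is consistent with the answers above: the result markup is produced in the browser, not by the server.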

Save HTML of a rendered Pug file with data

I want to save HTML source of my pugfile rendered with data.
My route is:
res.render('pugfile', { data: resp });
How can I do that?
There's a callback from .render().
res.render('pugfile', { data: resp }, function (err, pageBody) {
if (err) throw err
/* manipulate pageBody as you will,
* but be sure to .send it to the browser if
* you use this callback. */
res.send(pageBody)
})
See here and here.
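To actually save the markup, the pageBody string from that callback can be written to disk before it is sent. This is a minimal sketch, assuming a hypothetical helper saveRenderedHtml and an output directory of your choosing (outputDir and next in the comment are placeholders, not names from the question). The synchronous write keeps the example short; fs.promises.writeFile would be the non-blocking choice inside a busy route:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: write a rendered HTML string to <dir>/<name>.html
// and return the full path of the file it created.
function saveRenderedHtml(dir, name, html) {
  fs.mkdirSync(dir, { recursive: true });
  const filePath = path.join(dir, `${name}.html`);
  fs.writeFileSync(filePath, html, "utf8");
  return filePath;
}

// Inside the route it would be called from the render callback, e.g.:
//   res.render('pugfile', { data: resp }, function (err, pageBody) {
//     if (err) return next(err)
//     saveRenderedHtml(outputDir, 'pugfile', pageBody)
//     res.send(pageBody)
//   })

// Stand-alone demonstration with a literal string in place of pageBody.
const savedPath = saveRenderedHtml(
  path.join(os.tmpdir(), "rendered-pages"),
  "pugfile",
  "<h1>Hello</h1>"
);
console.log(savedPath);
```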
Try this link and see whether it is helpful to you:
link to possible answer
The createTemplateFile function simply creates a new file if it doesn't exist.
The exportTemplateFile function takes the HTML rendered by pug (held in the html variable), prettifies it with the pretty package, and then overwrites the new template file with it.
That's according to that post.

Sending an Excel file from backend to frontend and download it at the frontend

I have created an Excel file on the backend (Express JS) using the Exceljs npm module and stored it in a temp directory. Now I want to send the file from the backend to the frontend and download it there when the user clicks a button. I am stuck on two things:
1. How to send the file from the backend to the frontend through an HTTP POST request
2. How to then download the file in the front-end
Edited content:
I need the front end to be a button that has the file attached to it and then downloads it. This is how my code looks; I am not getting the file properly from the backend to the front end.
front end file:
function(parm1,parm2,parm3){
let url =${path}?parmA=${parm1}&parmB=${parm2}&parmC=${parm3};
let serviceDetails = {};
serviceDetails["method"] = "GET";
serviceDetails["mode"] = "cors";
serviceDetails["headers"] = {
"Content-Type": "application/json"
};
fetch(url, serviceDetails)
.then(res => {
if (res.status != 200) {
return false;
}
var file = new Blob([res], { type : 'application/octet-stream' });
a = document.createElement('a'), file;
a.href = window.URL.createObjectURL(file);
a.target = "_blank"; 
a.download = "excel.xlsx";
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
}).catch(error => {
return false;
});
}
router.js
var abc = ... // this is a object for the controller.js file
router.get('/path', function(req, res) {
abc.exportintoExcel(req, res);
});
controller.js
let xyz = ... //this is a object for the service.js file
exports.exportintoExcel = function(req, res) {
xyz.exportintoExcel(reqParam,res);
}
service.js
exportintoExcel(req,response){
//I have a excel file in my server root directory
const filepath = path.join(__dirname,'../../nav.txt');
response.sendFile(filepath);
})
}
This is a complete re-write of an earlier answer, so sorry if anyone needed that one, but this version is superior. I'm using a project created with express-generator and working in three files:
routes/index.js
views/index.ejs
public/javascripts/main.js
index.ejs
Start with an anchor tag that has the download attribute, with whatever filename you wish, and a placeholder href attribute. We will replace the href in the main.js file later with an ObjectURL that represents the Excel file:
<body>
<a id="downloadExcelLink" download="excelFile.xlsx" href="#">Download Excel File</a>
<script type="text/javascript" src="/javascripts/main.js"></script>
</body>
public/javascripts/main.js
Select the anchor element, and then make a fetch() request to the route /downloadExcel. Convert the response to a Blob, then create an ObjectURL from this Blob. You can then set the href attribute of the anchor tag to this ObjectURL:
const downloadExcelLink = document.getElementById('downloadExcelLink');
(async () => {
const downloadExcelResponse = await fetch('/downloadExcel');
const downloadExcelBlob = await downloadExcelResponse.blob();
const downloadExcelObjectURL = URL.createObjectURL(downloadExcelBlob);
downloadExcelLink.href = downloadExcelObjectURL;
})();
routes/index.js
In the index router, you simply need to call the res.sendFile() function and pass it the path to the Excel file on your server.
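const path = require('path'); // needed for path.join below

router.get('/downloadExcel', (req, res, next) => {
  const excelFilePath = path.join(__dirname, '../tmp/excel.xlsx');
  // The callback runs when the transfer completes or when an error occurs
  res.sendFile(excelFilePath, (err) => {
    if (err) console.log(err);
  });
});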
That's it! You can find a git repo here of the project. Clone into it and try it out for yourself if you can't get this code to work in your project as it is.
How It Works
When the page loads, 4 requests are fired off to our server, as we can see in the console output:
GET / 200 2.293 ms - 302
GET /stylesheets/style.css 200 1.123 ms - 111
GET /javascripts/main.js 200 1.024 ms - 345
GET /downloadExcel 200 2.395 ms - 4679
The first three requests are for index.ejs (/), the CSS stylesheet, and our main.js file. The fourth request is sent by our call to fetch('/downloadExcel') in the main.js file:
const downloadExcelResponse = await fetch('/downloadExcel');
I have a route-handler setup in routes/index.js at this route that uses res.sendFile() to send a file from our filesystem as the response:
router.get('/downloadExcel', (req, res, next) => {
const excelFilePath = path.join(__dirname, '../tmp/excel.xlsx');
res.sendFile(excelFilePath, (err) => {
if (err) console.log(err);
});
});
excelFilePath needs to be the path to the file on YOUR system. On my system, here is the layout of the router file and the Excel file:
/
/routes/index.js
/tmp/excel.xlsx
The response sent from our Express server is stored in downloadExcelResponse as the return value from the call to fetch() in the main.js file:
const downloadExcelResponse = await fetch('/downloadExcel');
downloadExcelResponse is a Response object, and for our purposes we want to turn it into a Blob object using the Response.blob() method:
const downloadExcelBlob = await downloadExcelResponse.blob();
Now that we have the Blob, we can call URL.createObjectURL() to turn this Blob into something we can use as the href for our download link:
const downloadExcelObjectURL = URL.createObjectURL(downloadExcelBlob);
At this point, we have a URL that represents our Excel file in the browser, and we can assign it to the href property of the anchor element we selected earlier.
When the page loaded, this is the anchor element we selected:
<a id="downloadExcelLink" download="excelFile.xlsx" href="#">Download Excel File</a>
So we add the URL to the href here, in the function that makes the fetch request:
downloadExcelLink.href = downloadExcelObjectURL;
You can inspect the element in the browser and see that the href property has been changed by the time the page has loaded.
On my computer, the anchor tag is now:
<a id="downloadExcelLink" download="excelFile.xlsx" href="blob:http://localhost:3000/aa48374e-ebef-461a-96f5-d94dd6d2c383">Download Excel File</a>
Since the download attribute is present on the link, when the link is clicked, the browser will download whatever the href points to, which in our case is the URL to the Blob that represents the Excel document.
I pulled my information from these sources:
JavaScript.info - Blob as URL
Javascript.info - Fetch
OK, now that I see your code, I can try and help out a little. I have refactored your example a little bit to make it easier for me to understand, but feel free to adjust to your needs.
index.html
I don't know what the page looks like that you're working with, but it looks like in your example you are creating an anchor element with JavaScript during the fetch() call. I'm just creating one with HTML in the actual page, is there a reason you can't do this?
<body>
<a id="downloadLink" download="excel.xlsx" href="#">Download Excel File</a>
<script type="text/javascript" src="/javascripts/test.js"></script>
</body>
With that in hand, here is my version of your front end JS file:
test.js
const downloadLink = document.getElementById('downloadLink');
sendFetch('a', 'b', 'c');
function sendFetch(param1, param2, param3) {
const path = 'http://localhost:3000/excelTest';
const url = `${path}?parmA=${param1}&parmB=${param2}&parmC=${param3}`;
const serviceDetails = {};
serviceDetails.method = "GET";
serviceDetails.mode = "cors";
serviceDetails.headers = {
"Content-Type": "application/json"
};
fetch(url, serviceDetails).then((res) => {
if (res.status != 200) {
return false;
}
res.blob().then((excelBlob) => {
const excelBlobURL = URL.createObjectURL(excelBlob);
downloadLink.href = excelBlobURL;
});
}).catch((error) => {
return false;
});
}
I had to fill in some details because I can't tell what is going on from your code. Here are the things I changed:
Selected the DOM element instead of creating it:
Your version:
a = document.createElement('a'), file;
My version:
index.html
<a id="downloadLink" download="excel.xlsx" href="#">Download Excel File</a>
test.js
const downloadLink = document.getElementById('downloadLink');
This saves us the trouble of creating the element. Unless you need to do that for some reason, I wouldn't. I'm also not sure what that file is doing in your original.
Name the function and change parm -> param for arguments list
Your version:
function(parm1,parm2,parm3){
My version:
function sendFetch(param1, param2, param3) {
I wasn't sure how you were actually calling your function, so I named it. Also, parm isn't clear. Param isn't great either, should describe what it is, but I don't know from your code.
Create a path variable and enclose url assignment in backticks
Your version:
let url =${path}?parmA=${parm1}&parmB=${parm2}&parmC=${parm3};
My version:
const path = 'http://localhost:3000/excelTest';
const url = `${path}?parmA=${param1}&parmB=${param2}&parmC=${param3}`;
In your version, that url assignment should throw an error. It looks like you want to use string interpolation, but you need backticks for that, which I added. Also, I had to define a path variable, because I didn't see one in your code.
Cleaned up some formatting
I used 'dot' notation for the serviceDetails, but that was just personal preference. I also changed the spacing of the fetch() call, but no need to reprint that here. It shouldn't affect anything.
Create a blob from the fetch response
Your version:
var file = new Blob([res], { type : 'application/octet-stream' });
My version:
res.blob().then((excelBlob) => {
I'm not sure why you are calling the Blob constructor and what that [res] is supposed to be. The Response object returned from fetch() has a blob() method that returns a promise that resolves to a Blob with whatever MIME type the data was in. In an Excel document's case, this is application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.
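The difference can be seen without a server. This is a short sketch, assuming Node 18+ (or a browser) where Response and Blob are globals: passing the Response object itself to the Blob constructor stores its string form rather than the body, while blob() returns the body bytes. The 1000-character body is made up for illustration:

```javascript
const body = "x".repeat(1000);

// The question's approach: a Response object is not binary data, so the
// Blob constructor converts it to a string and stores that text instead.
const wrongBlob = new Blob([new Response(body)], {
  type: "application/octet-stream",
});
console.log(wrongBlob.size); // size of the stringified object, not of the body

// The Response.blob() approach: resolves to a Blob holding the body itself.
new Response(body).blob().then((rightBlob) => {
  console.log(rightBlob.size); // size of the actual body
});
```

A downloaded file built from the first Blob would contain only that short text, which matches the "not getting the file properly" symptom described in the question.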
Create an ObjectURL from the Blob and add this URL to the href of the anchor tag.
Your version:
a = document.createElement('a'), file;
a.href = window.URL.createObjectURL(file);
a.target = "_blank";
a.download = "excel.xlsx";
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
My version:
const excelBlobURL = URL.createObjectURL(excelBlob);
downloadLink.href = excelBlobURL;
You have to do a bunch of DOM manipulation, which I'm not sure why you need. If you do have to dynamically create this element, then I'm not sure why you are 'clicking' it, then removing it, if the user is supposed to be able to click it. Maybe clarify for me why you are doing this, or if you really need to do it. Either way, in my version I create the ObjectURL and then assign it, but you could just as easily not store it in a variable.
Call the function that sends the fetch request.
As my function signature is:
function sendFetch(param1, param2, param3)
I needed to call it somewhere in order to fire off the request, so I did so like this:
sendFetch('a', 'b', 'c');
The call runs right when the page loads, as you can see from the server logs:
GET / 304 0.448 ms - -
GET /javascripts/test.js 304 1.281 ms - -
GET /excelTest?parmA=a&parmB=b&parmC=c 304 0.783 ms - -
The first two requests are for the index.html page and the test.js file, then the fetch request is fired with the parameters I passed in. I'm not sure how you are doing this in your app, because that is not included in your code.
Everything I just covered is Front-End. I'm assuming your server-side code is actually sending an excel file with your call to response.sendFile() in service.js. If you are sure that the file is getting sent, then the code I've given you should work, when adjusted to your app.
So, in conclusion, what this code does is:
1. Load an HTML page with an anchor tag whose href is only a placeholder.
2. Send off a fetch() request to the server.
3. Turn the fetch response into a Blob, then create an ObjectURL from this Blob.
4. Assign that ObjectURL to the anchor tag's href attribute.
When the user clicks the 'Download Excel File' link, the Excel sheet should be downloaded. If you didn't want them to see the link until after the fetch request, you could definitely create the anchor tag in JS instead; let me know if you want to see how to do that.
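For the dynamic variant, here is a minimal sketch that builds the anchor only once the Blob is available. createDownloadLink is a hypothetical helper, and the doc parameter is an assumption added so the function can be exercised outside a browser; in a page you would pass document:

```javascript
// Hypothetical helper: build an <a download> element that points at a Blob.
// `doc` is the DOM document; it is a parameter only to keep the sketch testable.
function createDownloadLink(doc, blob, fileName) {
  const link = doc.createElement("a");
  link.href = URL.createObjectURL(blob); // blob: URL that refers to the Blob
  link.download = fileName;
  link.textContent = "Download Excel File";
  return link;
}

// In the browser, after the fetch resolves:
//   fetch(url)
//     .then((res) => res.blob())
//     .then((excelBlob) => {
//       const link = createDownloadLink(document, excelBlob, "excel.xlsx");
//       document.body.appendChild(link);
//     });
```

The link is left in the page for the user to click, rather than being clicked and removed by script.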

How to get visual DOM structure from url in node.js

I am wondering how to get "visual" DOM structure from url in node.js. When I try to get html content with request library, html structure is not correct.
const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
console.log(body);
});
The returned HTML structure is below; the meta tags are not correct:
<meta property="og:title" content=""/>
<meta itemprop="description" name="description" content=""/>
If I open website in web browser, I can see correct meta tags in web inspector:
<meta property="og:title" content="Trump promised to destroy the Johnson Amendment. Congress is targeting it now."/>
<meta itemprop="description" name="description" content="Observers believe the proposed legislation would make it harder for the IRS to enforce a law preventing pulpit endorsements."/>
I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser like puppeteer is probably the way to go when a website has complex loading behavior.
The advantage here is that, with puppeteer at least, you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you specified is truthy, but depending on your needs you could wait for something else or even wait for multiple conditions to be true.
You might have to analyze the behavior of the page in question a little deeper to figure out what you should wait for, but at the very least the following code seems to correctly load the tags in your question.
import puppeteer from 'puppeteer'

(async () => {
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    // wait until <meta property="og:title"> exists and has a truthy content attribute
    await page.waitForFunction(() => {
      const tag = document.querySelector('meta[property="og:title"]')
      return tag && tag.getAttribute('content')
    })
    const html = await page.content()
    console.log(html)
  } finally {
    // close the browser even if navigation or the wait fails
    await browser.close()
  }
})()
(pastebin of formatted html result)
Also, since this solution uses puppeteer I'd recommend not working with the html string and instead using the puppeteer API to extract the information you need.
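As a sketch of that suggestion, page.$eval runs a function against the first element matching a selector and returns its result, so the tag values can be read directly instead of parsing the HTML string. readMetaTags is a hypothetical helper that expects an already-loaded Puppeteer page; the selectors are the two tags from the question:

```javascript
// Hypothetical helper: read the two meta tags from an already-loaded page.
// `page` is expected to be a Puppeteer Page (anything with a compatible $eval).
async function readMetaTags(page) {
  const title = await page.$eval(
    'meta[property="og:title"]',
    (el) => el.getAttribute("content")
  );
  const description = await page.$eval(
    'meta[name="description"]',
    (el) => el.getAttribute("content")
  );
  return { title, description };
}

// Usage, after `await page.goto(url)` and the waitForFunction call above:
//   const meta = await readMetaTags(page)
//   console.log(meta.title, meta.description)
```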

CasperJS and downloading a file via iFrame and JavaScript

I have a script to test that - on click - generates an iFrame which downloads a file. How can I intercept the response with CasperJS?
I already tried the sequence:
casper.click('element');
casper.withFrame('frame', function(){
console.log(this.getCurrentUrl()); // only return about:blank, but should have URL
console.log("content: " + this.getHTML()); // Yep, empty HTML
this.on('resource.received', function(resource){
console.log(resource.url); // never executed
});
});
I need the content of the file but can not really produce the URL without clicking the element or changing the script I'm testing.
Ideas?
I tried other events, but none got fired when downloading via the iframe. I found another solution that works - but if you have something better, I'd like to try it.
Here it comes:
// Check downloaded File
.then(function(){
// Fetch URL via internals
var url = this.evaluate(function(){
return $('__saveReportFrame').src; // JavaScript function in the page
});
var file = fs.absolute('plaintext.txt');
this.download(url, file);
var fileString = fs.read(file);
// Whatever test you want to make
test.assert(fileString.match(/STRING/g) !== null, 'Downloaded File is good');
})
