Can't scrape text with cheerio - node.js

I'm trying to scrape this page with cheerio: https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all
But I can't get anything. I tried to get the 'Word-Idiom' text, but the response contains nothing.
Here's my code:
app.get("/conjugation", (req, res) => {
axios(
"https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all"
)
.then((response) => {
const htmlData = response.data;
const $ = cheerio.load(htmlData);
const element = $(
"#searchPage_entry > h3 > span.title_text.myScrollNavQuick.my_searchPage"
);
console.log(element.text());
})
.catch((err) => console.log(err));
});

The server at that URL doesn't return any body DOM structure in the HTML response. The body DOM is rendered by linked JavaScript after the response is received. Cheerio doesn't execute the JavaScript in the HTML response, so it won't be possible to scrape that page using Cheerio alone. Instead, you'll need a tool that can execute the in-page JavaScript (e.g. Puppeteer).
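One more detail is worth checking, shown here as a small sketch: everything after the `#` in that URL is a fragment, and HTTP clients do not send the fragment to the server. The request axios makes is therefore effectively for the site root, and the search query is only ever read by the in-page JavaScript. Node's built-in URL class shows the split (the variable names are only for illustration):

```javascript
// Split the URL from the question into the part that is sent to the server
// and the fragment that only in-page JavaScript ever sees.
const target = new URL(
  "https://en.dict.naver.com/#/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all"
);

const sentToServer = target.origin + target.pathname + target.search;
const clientOnly = target.hash;

console.log(sentToServer); // https://en.dict.naver.com/
console.log(clientOnly); // #/search?query=%EC%B6%94%EC%9B%8C%EC%9A%94&range=all
```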

This is a common issue in web scraping: the page loads dynamically, so when you fetch the content of the initial GET response from that website, all you get is script tags. Print htmlData and you'll see what I mean. There are no rendered HTML elements in your response, so you'll have to use something like Selenium to wait for the elements you need to be rendered.
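As a rough sketch of the headless-browser approach with Puppeteer (one of the tools named in these answers): the helper below loads a page, waits for a selector to appear, and returns its text. The function name and the idea of passing the `puppeteer` module in as an argument are choices made for this sketch, and whether the selector from the question still matches the live page is not something these answers confirm.

```javascript
// Render the page in a headless browser, wait for the selector, return its text.
// `puppeteer` is passed in (for example require("puppeteer")) so that the
// helper itself has no hard dependency on the package.
async function scrapeRenderedText(puppeteer, url, selector) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // Wait until client-side rendering has produced the element.
    await page.waitForSelector(selector);
    return await page.$eval(selector, (el) => el.textContent.trim());
  } finally {
    // Always release the browser, even if a step above throws.
    await browser.close();
  }
}
```

It could then be called as `scrapeRenderedText(require("puppeteer"), url, selector).then(console.log)`, with `url` and `selector` taken from the question.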

Related

Web Scraping NodeJs - How to recover resources when the page loads in full after several requests

I'm trying to retrieve each item (composed of an image, a word and its translation) from this page.
Link to the website: https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true
I used JSDOM and Got.
Here is the code:
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
const got = require('got');
(async () => {
const response = await got("https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod%5Btoggle%5D%5BhasImage%5D=true");
console.log(response.body);
const dom = new JSDOM(response.body);
console.log(dom.window.document.querySelectorAll(".ld-egdn1r"))
})();
When I display the HTML that is returned, it does not match what I see when I open the site in my browser. There are no HTML tags that contain the items.
When I look at the Network tab, other resources are loaded, but again I can't find the request that retrieves the words.
I think what I am looking for is loaded over several requests, but I don't know which one.
Here are the steps:
[screenshot of the steps]
You will then get code like this:
fetch("https://xcvbaysyxd-dsn.algolia.net/1/indexes/*/queries", {
"credentials": "omit",
"headers": {},
"referrer": "https://livingdictionaries.app/",
"body": "...",
"method": "POST",
"mode": "cors"
});
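After that, you just have to process the data manually. In your case:
const fetch = require("node-fetch"); // npm i node-fetch@2 (v3 cannot be loaded with require)
(async () => {
  // fetch(...) stands for the call copied from the Network tab above
  const data = await fetch(...).then((r) => r.json());
  const product = data.results.map((r) => r.hits);
})();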
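Note that `data.results.map(r => r.hits)` yields one array of hits per query. A minimal sketch of flattening them into a single list follows; the helper name and the sample data are invented for illustration, and real hits carry whatever fields the site stores.

```javascript
// The multi-query endpoint answers with { results: [{ hits: [...] }, ...] }.
// Collect the hits of every query into one flat array.
function collectHits(data) {
  return data.results.flatMap((result) => result.hits ?? []);
}

// Invented sample shaped like such a response.
const sample = {
  results: [
    { hits: [{ objectID: "a1" }, { objectID: "a2" }] },
    { hits: [{ objectID: "b1" }] },
  ],
};

console.log(collectHits(sample).map((hit) => hit.objectID)); // [ 'a1', 'a2', 'b1' ]
```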
The site you are trying to scrape is a Single Page Application (SPA) built with Svelte and the individual elements are dynamically rendered as needed, as many websites are today. Since the HTML is not hard-coded, these sites are notoriously difficult to scrape.
If you just log the response, you will see that the elements for which you are selecting do not exist. This is because it is the browser that interprets the JavaScript at run time and updates the UI. A GET request using got, axios, fetch, whatever, cannot perform such tasks.
You will need to implement the use of a headless browser like Puppeteer in order to dynamically render the site and scrape.

how can I get an element in a redirect url using node.js

I'm not a native English speaker, so please forgive my spelling mistakes.
For example, I want to first redirect url1 (http://localhost:3000/api/song/167278) to url2 (http://localhost:4000/api/song/167278) in order to use url2's API. url2 responds with a JSON file, which can be seen in Postman's panel.
(Postman's panel)
But there may be a lot of elements, and I only want one element from the file, such as data[0].url. How can I return just the url value (data[0].url in this JSON) when people access http://localhost:3000/api/song/167278?
I am using Express.js now. How can I edit it? Or are there any other methods?
app.get('/api/song/:id', async (req, res) => {
  try {
    const { id } = req.params
    // Express route paths need a leading slash, and url must be declared
    const url = "http://localhost:4000/api/song/" + id
    res.redirect(url)
  }
  catch (e) {
    console.log(e)
  }
})
You could either proxy the entire request there, or fetch localhost:4000/api/song/1 in your request handler (with something like node-fetch, axios, or Node's built-in APIs) and send only the fields that you want back to the client as JSON.
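A minimal sketch of the second option, assuming the upstream service answers with JSON shaped like `{ data: [{ url: ... }] }` as described in the question. The factory name, the injected `fetchImpl` parameter, and the 502 error responses are choices made for this sketch, not something the answer prescribes.

```javascript
// Build an Express-style handler that calls the second service and returns
// only data[0].url. `fetchImpl` is injected (the global fetch in Node 18+ works).
function createSongHandler(fetchImpl, upstreamBase = "http://localhost:4000") {
  return async function songHandler(req, res) {
    try {
      const upstream = await fetchImpl(
        `${upstreamBase}/api/song/${encodeURIComponent(req.params.id)}`
      );
      const body = await upstream.json();
      // Guard against an empty or unexpected payload.
      const url = body?.data?.[0]?.url;
      if (!url) {
        return res.status(502).json({ error: "No url in upstream response" });
      }
      return res.json({ url });
    } catch (err) {
      console.error(err);
      return res.status(502).json({ error: "Upstream request failed" });
    }
  };
}
```

In an Express app this would be registered as `app.get("/api/song/:id", createSongHandler(fetch));`, so the client only ever talks to port 3000 and receives just the one field.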

Sending an Excel file from backend to frontend and download it at the frontend

I created an Excel file at the backend (Express JS) using the Exceljs npm module and stored it in a temp directory. Now I want to send the file from the backend to the frontend and download it there when the user clicks a button. I am stuck on two things:
1. How to send the file from the backend to the frontend through an HTTP POST request
2. How to then download the file in the frontend
Edited content:
I need the frontend to have a button that the file is attached to and that then downloads it. This is how my code looks; I am not getting the file properly from the backend to the frontend.
front end file:
function(parm1,parm2,parm3){
let url =${path}?parmA=${parm1}&parmB=${parm2}&parmC=${parm3};
let serviceDetails = {};
serviceDetails["method"] = "GET";
serviceDetails["mode"] = "cors";
serviceDetails["headers"] = {
"Content-Type": "application/json"
};
fetch(url, serviceDetails)
.then(res => {
if (res.status != 200) {
return false;
}
var file = new Blob([res], { type : 'application/octet-stream' });
a = document.createElement('a'), file;
a.href = window.URL.createObjectURL(file);
a.target = "_blank"; 
a.download = "excel.xlsx";
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
}).catch(error => {
return false;
});
}
router.js
var abc = ... // this is a object for the controller.js file
router.get('/path', function(req, res) {
abc.exportintoExcel(req, res);
});
controller.js
let xyz = ... //this is a object for the service.js file
exports.exportintoExcel = function(req, res) {
xyz.exportintoExcel(reqParam,res);
}
service.js
exportintoExcel(req,response){
//I have a excel file in my server root directory
const filepath = path.join(__dirname,'../../nav.txt');
response.sendFile(filepath);
})
}
This is a complete re-write of an earlier answer, so sorry if anyone needed that one, but this version is superior. I'm using a project created with express-generator and working in three files:
routes/index.js
views/index.ejs
public/javascripts/main.js
index.ejs
Start with an anchor tag that has the download attribute, with whatever filename you wish, and an empty href attribute. We will fill in the href in the main.js file with an ObjectURL that represents the Excel file later:
<body>
<a id="downloadExcelLink" download="excelFile.xlsx" href="#">Download Excel File</a>
<script type="text/javascript" src="/javascripts/main.js"></script>
</body>
public/javascripts/main.js
Select the anchor element, and then make a fetch() request to the route /downloadExcel. Convert the response to a Blob, then create an ObjectURL from this Blob. You can then set the href attribute of the anchor tag to this ObjectURL:
const downloadExcelLink = document.getElementById('downloadExcelLink');
(async () => {
const downloadExcelResponse = await fetch('/downloadExcel');
const downloadExcelBlob = await downloadExcelResponse.blob();
const downloadExcelObjectURL = URL.createObjectURL(downloadExcelBlob);
downloadExcelLink.href = downloadExcelObjectURL;
})();
routes/index.js
In the index router, you simply need to call the res.sendFile() function and pass it the path to the Excel file on your server.
router.get('/downloadExcel', (req, res, next) => {
const excelFilePath = path.join(__dirname, '../tmp/excel.xlsx');
res.sendFile(excelFilePath, (err) => {
if (err) console.log(err);
});
});
That's it! You can find a git repo here of the project. Clone into it and try it out for yourself if you can't get this code to work in your project as it is.
How It Works
When the page loads, 4 requests are fired off to our server, as we can see in the console output:
GET / 200 2.293 ms - 302
GET /stylesheets/style.css 200 1.123 ms - 111
GET /javascripts/main.js 200 1.024 ms - 345
GET /downloadExcel 200 2.395 ms - 4679
The first three requests are for index.ejs (/), the CSS stylesheet, and our main.js file. The fourth request is sent by our call to fetch('/downloadExcel') in the main.js file:
const downloadExcelResponse = await fetch('/downloadExcel');
I have a route-handler setup in routes/index.js at this route that uses res.sendFile() to send a file from our filesystem as the response:
router.get('/downloadExcel', (req, res, next) => {
const excelFilePath = path.join(__dirname, '../tmp/excel.xlsx');
res.sendFile(excelFilePath, (err) => {
if (err) console.log(err);
});
});
excelFilePath needs to be the path to the file on YOUR system. On my system, here is the layout of the router file and the Excel file:
/
/routes/index.js
/tmp/excel.xlsx
The response sent from our Express server is stored in downloadExcelResponse as the return value from the call to fetch() in the main.js file:
const downloadExcelResponse = await fetch('/downloadExcel');
downloadExcelResponse is a Response object, and for our purposes we want to turn it into a Blob object using the Response.blob() method:
const downloadExcelBlob = await downloadExcelResponse.blob();
Now that we have the Blob, we can call URL.createObjectURL() to turn this Blob into something we can use as the href for our download link:
const downloadExcelObjectURL = URL.createObjectURL(downloadExcelBlob);
At this point, we have a URL that represents our Excel file in the browser, and we can point the link at it by assigning the URL to the href property of the DOM element we selected earlier:
When the page loads, we selected the anchor element with this line:
<a id="downloadExcelLink" download="excelFile.xlsx" href="#">Download Excel File</a>
So we add the URL to the href here, in the function that makes the fetch request:
downloadExcelLink.href = downloadExcelObjectURL;
You can check out the element in the browser and see that the href property has been changed by the time the page has loaded:
Notice, on my computer, the anchor tag is now:
<a id="downloadExcelLink" download="excelFile.xlsx" href="blob:http://localhost:3000/aa48374e-ebef-461a-96f5-d94dd6d2c383">Download Excel File</a>
Since the download attribute is present on the link, when the link is clicked, the browser will download whatever the href points to, which in our case is the URL to the Blob that represents the Excel document.
I pulled my information from these sources:
JavaScript.info - Blob as URL
Javascript.info - Fetch
Here's a gif of how the download process looks on my machine:
OK, now that I see your code, I can try and help out a little. I have refactored your example a little bit to make it easier for me to understand, but feel free to adjust to your needs.
index.html
I don't know what the page looks like that you're working with, but it looks like in your example you are creating an anchor element with JavaScript during the fetch() call. I'm just creating one with HTML in the actual page, is there a reason you can't do this?
<body>
<a id="downloadLink" download="excel.xlsx" href="#">Download Excel File</a>
<script type="text/javascript" src="/javascripts/test.js"></script>
</body>
With that in hand, here is my version of your front end JS file:
test.js
const downloadLink = document.getElementById('downloadLink');
sendFetch('a', 'b', 'c');
function sendFetch(param1, param2, param3) {
const path = 'http://localhost:3000/excelTest';
const url = `${path}?parmA=${param1}&parmB=${param2}&parmC=${param3}`;
const serviceDetails = {};
serviceDetails.method = "GET";
serviceDetails.mode = "cors";
serviceDetails.headers = {
"Content-Type": "application/json"
};
fetch(url, serviceDetails).then((res) => {
if (res.status != 200) {
return false;
}
res.blob().then((excelBlob) => {
const excelBlobURL = URL.createObjectURL(excelBlob);
downloadLink.href = excelBlobURL;
});
}).catch((error) => {
return false;
});
}
I had to fill in some details because I can't tell what is going on from your code. Here are the things I changed:
Selected the DOM element instead of creating it:
Your version:
a = document.createElement('a'), file;
My version:
index.html
<a id="downloadLink" download="excel.xlsx" href="#">Download Excel File</a>
test.js
const downloadLink = document.getElementById('downloadLink');
This saves us the trouble of creating the element. Unless you need to do that for some reason, I wouldn't. I'm also not sure what that file is doing in your original.
Name the function and change parm -> param for arguments list
Your version:
function(parm1,parm2,parm3){
My version:
function sendFetch(param1, param2, param3) {
I wasn't sure how you were actually calling your function, so I named it. Also, parm isn't clear. Param isn't great either, should describe what it is, but I don't know from your code.
Create a path variable and enclose url assignment in backticks
Your version:
let url =${path}?parmA=${parm1}&parmB=${parm2}&parmC=${parm3};
My version:
const path = 'http://localhost:3000/excelTest';
const url = `${path}?parmA=${param1}&parmB=${param2}&parmC=${param3}`;
In your version, that url assignment should throw an error. It looks like you want to use string interpolation, but you need backticks for that, which I added. Also, I had to define a path variable, because I didn't see one in your code.
Cleaned up some formatting
I used 'dot' notation for the serviceDetails, but that was just personal preference. I also changed the spacing of the fetch() call, but no need to reprint that here. Shouldn't affect anything.
Create a blob from the fetch response
Your version:
var file = new Blob([res], { type : 'application/octet-stream' });
My version:
res.blob().then((excelBlob) => {
I'm not sure why you are calling the Blob constructor or what that [res] is supposed to be. The Response object returned from fetch() has a blob() method that returns a promise that resolves to a Blob with whatever MIME type the data was sent with. In an Excel document's case, this is application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.
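To illustrate the difference, here is a small sketch that runs in Node 18+ (which ships Response and Blob as globals). The payload is a stand-in string rather than a real Excel file, and the function name is made up for the example.

```javascript
// Compare wrapping a Response in the Blob constructor with calling response.blob().
async function compareBlobs() {
  const payload = "pretend these are the bytes of excel.xlsx";

  // Blob parts that are not strings, buffers, or Blobs are converted to strings,
  // so this blob holds the text form of the Response object, not its body.
  const wrong = new Blob([new Response(payload)], {
    type: "application/octet-stream",
  });

  // response.blob() reads the body and resolves to a Blob of the real bytes.
  const right = await new Response(payload).blob();

  return { wrong: await wrong.text(), right: await right.text() };
}

compareBlobs().then(({ wrong, right }) => {
  console.log("constructor version:", wrong);
  console.log("blob() version:", right);
});
```

Only the second version contains the payload, which is why the download produced by the original code does not open as a spreadsheet.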
Create an ObjectURL from the Blob and add this URL to the href of the anchor tag.
Your version:
a = document.createElement('a'), file;
a.href = window.URL.createObjectURL(file);
a.target = "_blank";
a.download = "excel.xlsx";
document.body.appendChild(a);
a.click();
document.body.removeChild(a);
My version:
const excelBlobURL = URL.createObjectURL(excelBlob);
downloadLink.href = excelBlobURL;
You have to do a bunch of DOM manipulation, which I'm not sure why you need. If you do have to dynamically create this element, then I'm not sure why you are 'clicking' it, then removing it, if the user is supposed to be able to click it. Maybe clarify for me why you are doing this, or if you really need to do it. Either way, in my version I create the ObjectURL and then assign it, but you could just as easily not store it in a variable.
Call the function that sends the fetch request.
As my function signature is:
function sendFetch(param1, param2, param3)
I needed to call it somewhere in order to fire off the request, so I did so like this:
sendFetch('a', 'b', 'c');
Right when the page loads, as you can see from the server logs:
GET / 304 0.448 ms - -
GET /javascripts/test.js 304 1.281 ms - -
GET /excelTest?parmA=a&parmB=b&parmC=c 304 0.783 ms - -
The first two requests are for the index.html page and the test.js file, then the fetch request is fired with the param's I passed in. I'm not sure how you are doing this in your app, because that is not included in your code.
Everything I just covered is Front-End. I'm assuming your server-side code is actually sending an excel file with your call to response.sendFile() in service.js. If you are sure that the file is getting sent, then the code I've given you should work, when adjusted to your app.
So, in conclusion, what this code does is:
Load an HTML page with an anchor tag with no href attribute set.
Send off a fetch() request to the server.
Turn the fetch response into a Blob, then create an ObjectURL from this Blob.
Assign that ObjectURL to the anchor tag's href attribute.
When the user clicks the 'Download Excel File' link, the Excel sheet should be downloaded. If you didn't want them to see the link until after the fetch request, you could definitely create the anchor tag in JS instead; let me know if you want to see how to do that.
As before, here is a gif showing how it looks on my machine (this is with your version and my modifications):

Nodejs - React download file from s3 bucket using pre-signed url

I am trying to make an onClick button that downloads a file from an S3 bucket using a pre-signed URL. The problem comes when I receive my URL. I want an automatic redirect or something similar. In other words, how can I launch the file download after getting back my signed URL?
this is my document list
The onClick event is on the Download button.
redux action
Redux action call my nodejs route
api route nodejs
Ask for pre-signed url then send it to my redux reducer.
Now in my front-end page, I got my link but I want an automatic redirect to start the file download.
Part of Component
Hope my first post isn't too messy.
I resolved my problem with a redux action. With one click I call my action, which returns my pre-signed URL and then automatically clicks the link. This triggers a download with the original file name the file had when I uploaded it to S3.
export const downDoc = (docId) => async dispatch => {
  await axios({ url: 'myApiCall', method: 'GET', responseType: 'blob' })
    .then((response) => {
      console.log(response)
      // Wrap the downloaded bytes in an object URL and click a temporary link
      const url = window.URL.createObjectURL(new Blob([response.data]));
      const link = document.createElement('a');
      link.href = url;
      link.setAttribute('download', `${docId.originalName}`);
      document.body.appendChild(link);
      link.click();
    });
};
The other answer does direct DOM manipulation and creates a blob, which looks as though it buffers the whole file in memory before handing it to the user, and it also creates a new link each time you download. A more React-y way of doing it is:
const downloadFileRef = useRef<HTMLAnchorElement | null>(null);
const [downloadFileUrl, setDownloadFileUrl] = useState<string>();
const [downloadFileName, setDownloadFileName] = useState<string>();
const onLinkClick = (filename: string) => {
axios.get<{ url: string }>("/presigned-url")
.then((response) => {
// axios exposes the parsed response body on response.data
setDownloadFileUrl(response.data.url);
setDownloadFileName(filename);
downloadFileRef.current?.click();
})
.catch((err) => {
console.log(err);
});
};
return (
<>
<a onClick={() => onLinkClick("document.pdf")} aria-label="Download link">
Download
</a>
<a
href={downloadFileUrl}
download={downloadFileName}
className="hidden"
ref={downloadFileRef}
/>
</>)
See here for more info https://levelup.gitconnected.com/react-custom-hook-typescript-to-download-a-file-through-api-b766046db18a
The way I did it was different and has the advantage of being able to see the progress of the download as the file is being downloaded. If you're downloading a large file then it makes a difference UX wise as you see feedback immediately.
What I did was:
When creating the S3 presigned URL I set the content-disposition to attachment
I used an anchor element to download the actual item: <a href='https://presigned-url' download>Download me</a>
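As a hedged sketch of the first step, the helper below builds a Content-Disposition value that forces a download and keeps the original file name. The function name is made up for this example. The answer does not say which SDK it used; with the AWS SDK for JavaScript v3, a string like this can be passed as the `ResponseContentDisposition` parameter of `GetObjectCommand` before presigning.

```javascript
// Build a Content-Disposition header value that asks the browser to download
// the object under the given name. The plain filename is an ASCII fallback;
// filename* carries the exact name percent-encoded as UTF-8.
function attachmentDisposition(filename) {
  const fallback = filename
    .replace(/[^\x20-\x7e]/g, "_") // non-ASCII characters
    .replace(/["\\]/g, "_"); // characters that would break the quoted string
  const encoded = encodeURIComponent(filename).replace(
    /['()*]/g, // encodeURIComponent leaves these as-is, so escape them here
    (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase()
  );
  return `attachment; filename="${fallback}"; filename*=UTF-8''${encoded}`;
}

console.log(attachmentDisposition("report.xlsx"));
// attachment; filename="report.xlsx"; filename*=UTF-8''report.xlsx
```

Because the browser then downloads straight from S3, it shows its usual progress indicator, and nothing has to be buffered in page memory first.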
Others have mentioned simulating a click within the DOM or React, but another option is to use window.open(). You can set the target attribute to _blank to open a tab, but you do need window.open() inside the click event to prevent popup blockers from stopping the functionality. There's some good discussion on the subject here. I found this to be a better solution than simulating a click event.
Here's an example (though there may be more needed depending on how you fetch the signed_url).
function downloadDocument() {
  fetch("/signed_url")
    // Assumes the endpoint responds with JSON like { "signed_url": "..." }
    .then((response) => response.json())
    .then((data) => {
      window.open(data.signed_url, "_blank");
    });
}

Problems on xpath in nodejs scrapejs

sp = require('scrapejs').init({
cc:100,
delay:1*1000
});
sp.load('http://www.gatherproxy.com/proxylist/anonymity/?t=Elite')
.then(function($){
var counter = 0
//$.q('//*[@id="tblproxy"]/tbody/tr[3]').forEach(function(node){
$.q('//tbody/tr').forEach(function(node){
//$.q("/html/body/div[1]/div[0]/table/tbody/tr").forEach(function(node){
console.log(counter)
var res = {
prx: node.textContent
}
console.log(res)
counter+=1
})
console.log(counter)
})
.fail(function(err){
console.log("srsly")
})
I am trying to scrape some proxy servers' information from the webpage, but the XPath extracted by Chrome's inspection tools doesn't work. I want to know how to fix it.
The XPath I extracted is //*[@id="tblproxy"]/tbody/tr; I'm not sure why it doesn't work.
It is not a problem with xpath. The problem is that the HTML you see in your browser is not the HTML that scrapejs sees. The table is generated with JavaScript.
If you download the "raw" site, for example with curl or wget, you will see each table row consisting of a bit of code:
<script type="text/javascript">
gp.insertPrx({"PROXY_CITY":"","PROXY_COUNTRY":"Singapore","PROXY_IP":"119.81.199.86","PROXY_LAST_UPDATE":"1 29","PROXY_PORT":"50","PROXY_REFS":null,"PROXY_STATE":"","PROXY_STATUS":"OK","PROXY_TIME":"22","PROXY_TYPE":"Elite","PROXY_UID":null,"PROXY_UPTIMELD":"21/13"});
</script>
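Since the data is already present in the raw HTML as JSON, one possible workaround is to read it directly instead of waiting for the table to be rendered. The sketch below assumes every `gp.insertPrx(...)` call sits on one line and takes a flat JSON object, as in the sample above; the function name is invented for the example.

```javascript
// Pull the JSON argument out of every gp.insertPrx({...}); call in raw HTML.
function extractProxies(html) {
  const pattern = /gp\.insertPrx\((\{.*?\})\);/g;
  return Array.from(html.matchAll(pattern), (match) => JSON.parse(match[1]));
}

// Shortened copy of the sample row shown above.
const sampleHtml = `
<script type="text/javascript">
gp.insertPrx({"PROXY_COUNTRY":"Singapore","PROXY_IP":"119.81.199.86","PROXY_PORT":"50","PROXY_TYPE":"Elite"});
</script>`;

for (const proxy of extractProxies(sampleHtml)) {
  console.log(`${proxy.PROXY_IP} (${proxy.PROXY_COUNTRY})`); // 119.81.199.86 (Singapore)
}
```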

Resources