Web scraping certain web page cannot finish - node.js

So i'm learning web scraping with node 8, followed this
npm install --save request-promise cheerio puppeteer
The code is simple
const rp = require('request-promise');
const url = 'https://www.examples.com'; //good
rp(url).then( (html) => {
console.log(html);
}).catch( (e) => {
console.log(e);
});
Now if url is examples.com, i can see the plain html output, great.
Q1: If yahoo.com, it outputs binary data, e.g.
�i��,a��g�Z.~�Ż�ڔ+�<ٵ�A�y�+�c�n1O>Vr�K�#,bc���8�����|����U>��p4U>mś0��Z�M�Xg"6�lS�2B�+�Y�Ɣ���? ��*
why is this ?
Q2: Then with nasdaq.com,
const url = 'https://www.nasdaq.com/earnings/report/msft';
the above code just won't finish, seems hangs there.
Why is this please ?

I'm not sure about Q2, but I can answer Q1.
It seems like Yahoo is detecting you as a bot and preventing you from scraping the page! The most common method sites use to detect bots is via the User-Agent header. When you make a request using request-promise (which uses the request library internally), it does not set this header at all. This means websites can infer your request came from a program (instead of a web browser) because there is not User-Agent header. They will then treat you like a bot and send you back gibberish or never serve you content.
You can work around this by manually setting a User-Agent header to mimic a browser. Note this seems to work for Yahoo, but might not work for all websites. Other websites might use more advanced techniques to detect bots.
const rp = require('request-promise');
const url = 'https://www.yahoo.com'; //good
const options = {
url,
headers: {
'User-Agent': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'
}
};
rp(options).then( (html) => {
console.log(html);
}).catch( (e) => {
console.log(e);
});
Q2 might be related to this, but the above code does not solve it. Nasdaq might be running more sophisticated bot detection, such as checking for various other headers.

Related

How to do a get request with parameters [NODE.JS]

Thank you for reading. It has litteraly been hours that i've been searching for my problem: as said in the title didn't found anything even looked at docs of nodejs but there isn't what i'm searching so it's weird seems like no one does that?
const https = require("https")
https.request("http://aratools.com/dict-service?query=%7B%22dictionary%22%3A%22AR-EN-WORD-DICTIONARY%22%2C%22word%22%3A%22%D9%86%D8%B9%D9%85%22%2C%22dfilter%22%3Atrue%7D&format=json&_=1643219263345",
(res) => {
console.log(res)
}
)
the error:
enter image description here
website used: http://aratools.com/
Trying to do experiment thing later on (have something in head but first have to manage to do this get request)
You need to specify https in the URL and not http if you're using the package https. Otherwise you should use the package http.
Use the http module instead of https.
const http = require("http")
http.request("http://aratools.com/dict-service?query=%7B%22dictionary%22%3A%22AR-EN-WORD-DICTIONARY%22%2C%22word%22%3A%22%D9%86%D8%B9%D9%85%22%2C%22dfilter%22%3Atrue%7D&format=json&_=1643219263345", (res) => {
console.log(res)
});

Check the protocol of an external URL in NodeJS

Is there a way to check what the protocol is of an external site using NodeJS.
For example, for the purposes of URL shortening, people can provide a url, if they omit http or https, I'd check which it should be and add it.
I know I can just redirect users without the protocol, but I'm just curious if there is a way to check it.
Sure can. First install request-promise and its dependency, request:
npm install request request-promise
Now we can write an async function to take a URL that might be missing its protocol and, if necessary, add it:
const rq = require('request-promise');
async function completeProtocol(url) {
if (url.match(/^https?:/)) {
// fine the way it is
return url;
}
// https is preferred
try {
await rq(`https://${url}`, { method: 'HEAD' });
// We got it, that's all we need to know
return `https://${url}`;
} catch (e) {
return `http://${url}`;
}
}
Bear in mind that making requests like this could take up resources on your server particularly if someone spams a lot of these. You can mitigate that by passing timeout: 2000 as an option when calling rq.
Also consider only requesting the home page of the site, parsing off the rest of the URL, to mitigate the risk that this will be abused in some way. The protocol should be the same for the entire site.

Nuxt - Axios requests on server side not sending referer

The story
My backend app provides SEO information for my sites pages. One of this informations are OpenGraph meta tags, such as og:type and og:url.
og:url value is given by the API through the HTTP "Referer" header.
What I'm doing now
I'm using Axios module to make my requests.
Through asyncData function in my pages I can get the req variable and it's headers.referer property, which is what I want, like this:
// page.vue
async asyncData({ app, req }) {
app.$axios.setHeader('Referer', req.headers.referer);
}
The problem
If I am at the index page, let's say, then I click in a link to a dynamic page I got an error, for req is not available on asyncData function while navigating, I suppose.
The question
How can I dynamically get my requests referers to send it with Axios request for both client-side and server-side requests?
Info about the versions:
nuxt: 1.4.2
#nuxtjs/axios: 5.3.1
You can do something like this:
async asyncData({ app, req }) {
const referrer = process.client ? window.document.referrer : req.headers.referer
app.$axios.setHeader('Referer', referrer)
}

Express res.redirect shows content but not downloads

I have and API link what automatically starts downloading file if I follow it in address bar. Let's call it my-third-party-downloading-link-com.
But when in Express framework I set res.redirect(my-third-party-downloading-link-com). I get status code 301 and can see file content in Preview tab in developer tools. But I can't make Browser downloading this file.
My appropriate request handler is following:
downloadFeed(req, res) {
const { jobId, platform, fileName } = req.query;
const host = platform === 'production' ? configs.prodHost :
configs.stageHost;
const downloadLink = `${host}/api/v1/feedfile/${jobId}`;
// I also tried with these headers
// res.setHeader('Content-disposition', 'attachment;
// filename=${fileName}.gz');
// res.setHeader('Content-Type', 'application/x-gzip');
res.redirect(downloadLink)
}
P.S Now, to solve this problem, I build my-third-party-downloading-link-com on back-end, send it with res.end and then:
window.open(**my-third-party-downloading-link-com**, '_blank').
But I don't like this solution. How can I tell browser to start downloading content form this third-party API ?
According to the documentation, you should use res.download() to force a browser to prompt the user for download.
http://expressjs.com/es/api.html#res.download

Express request is called twice

To learn node.js I'm creating a small app that get some rss feeds stored in mongoDB, process them and create a single feed (ordered by date) from these ones.
It parses a list of ~50 rss feeds, with ~1000 blog items, so it's quite long to parse the whole, so I put the following req.connection.setTimeout(60*1000); to get a long enough time out to fetch and parse all the feeds.
Everything runs quite fine, but the request is called twice. (I checked with wireshark, I don't think it's about favicon here).
I really don't get it.
You can test yourself here : http://mighty-springs-9162.herokuapp.com/feed/mde/20 (it should create a rss feed with the last 20 articles about "mde").
The code is here: https://github.com/xseignard/rss-unify
And if we focus on the interesting bits :
I have a route defined like this : app.get('/feed/:name/:size?', topics.getFeed);
And the topics.getFeed is like this :
function getFeed(req, res) {
// 1 minute timeout to get enough time for the request to be processed
req.connection.setTimeout(60*1000);
var name = req.params.name;
var callback = function(err, topic) {
// if the topic has been found
if (topic) {
// aggregate the corresponding feeds
rssAggregator.aggregate(topic, function(err, rssFeed) {
if (err) {
res.status(500).send({error: 'Error while creating feed'});
}
else {
res.send(rssFeed);
}
},
req);
}
else {
res.status(404).send({error: 'Topic not found'});
}};
// look for the topic in the db
findTopicByName(name, callback);
}
So nothing fancy, but still, this getFeed function is called twice.
What's wrong there? Any idea?
This annoyed me for a long time. It's most likely the Firebug extension which is sending a duplicate of each GET request in the background. Try turning off Firebug to make sure that's not the issue.
I faced the same issue while using Google Cloud Functions Framework (which uses express to handle requests) on my local machine. Each fetch request (in browser console and within web page) made resulted in two requests to the server. The issue was related to CORS (because I was using different ports), Chrome made a OPTIONS method call before the actual call. Since OPTIONS method was not necessary in my code, I used an if-statement to return an empty response.
if(req.method == "OPTIONS"){
res.set('Access-Control-Allow-Origin', '*');
res.set('Access-Control-Allow-Headers', 'Content-Type');
res.status(204).send('');
}
Spent nearly 3hrs banging my head. Thanks to user105279's answer for hinting this.
If you have favicon on your site, remove it and try again. If your problem resolved, refactor your favicon url
I'm doing more or less the same thing now, and noticed the same thing.
I'm testing my server by entering the api address in chrome like this:
http://127.0.0.1:1337/links/1
my Node.js server is then responding with a json object depending on the id.
I set up a console log in the get method and noticed that when I change the id in the address bar of chrome it sends a request (before hitting enter to actually send the request) and the server accepts another request after I actually hit enter. This happens with and without having the chrome dev console open.
IE 11 doesn't seem to work in the same way but I don't have Firefox installed right now.
Hope that helps someone even if this was a kind of old thread :)
/J
I am to fix with listen.setTimeout and axios.defaults.timeout = 36000000
Node js
var timeout = require('connect-timeout'); //express v4
//in cors putting options response code for 200 and pre flight to false
app.use(cors({ preflightContinue: false, optionsSuccessStatus: 200 }));
//to put this middleaware in final of middleawares
app.use(timeout(36000000)); //10min
app.use((req, res, next) => {
if (!req.timedout) next();
});
var listen = app.listen(3333, () => console.log('running'));
listen.setTimeout(36000000); //10min
React
import axios from 'axios';
axios.defaults.timeout = 36000000;//10min
After of 2 days trying
you might have to increase the timeout even more. I haven't seen the express source but it just sounds on timeout, it retries.
Ensure you give res.send(); The axios call expects a value from the server and hence sends back a call request after 120 seconds.
I had the same issue doing this with Express 4. I believe it has to do with how it resolves request params. The solution is to ensure your params are resolved by for example checking them in an if block:
app.get('/:conversation', (req, res) => {
let url = req.params.conversation;
//Only handle request when params have resolved
if (url) {
res.redirect(301, 'http://'+ url + '.com')
}
})
In my case, my Axios POST requests were received twice by Express, the first one without body, the second one with the correct payload. The same request sent from Postman only received once correctly. It turned out that Express was run on a different port so my requests were cross origin. This caused Chrome to sent a preflight OPTION method request to the same url (the POST url) and my app.all routing in Express processed that one too.
app.all('/api/:cmd', require('./api.js'));
Separating POST from OPTIONS solved the issue:
app.post('/api/:cmd', require('./api.js'));
app.options('/', (req, res) => res.send());
I met the same problem. Then I tried to add return, it didn't work. But it works when I add return res.redirect('/path');
I had the same problem. Then I opened the Chrome dev tools and found out that the favicon.ico was requested from my Express.js application. I needed to fix the way how I registered the middleware.
Screenshot of Chrome dev tools
I also had double requests. In my case it was the forwarding from http to https protocol. You can check if that's the case by looking comparing
req.headers['x-forwarded-proto']
It will either be 'http' or 'https'.
I could fix my issue simply by adjusting the order in which my middlewares trigger.

Resources