Scraping from Facebook - node.js

I have a challenge I'm running into and cannot seem to find an answer for it anywhere on the web. I'm working on a personal project; it's a Node.js application that uses the request and cheerio packages to hit an end-point and scrape some data... However, the endpoint is a Facebook page... and the display of its content is dependent upon whether the user is logged in or not.
In short, the app seeks to scrape the user's saved links, you know, all that stuff you add to your "save for later" but never actually go back to (at least in my case). The end-point, then, is https://www.facebook.com/saved. If, in your browser, you are logged into Facebook, clicking that link will take you where the application needs to go. However, since the application isn't technically going through the browser that has your credentials and your session saved, I'm running into a bit of an issue...
Yes, using the request module I'm able to successfully reach "a" part of Facebook, but not the one I need... My question really is: how should I begin to handle this challenge?
This is all the code I have for the app so far:
var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app = express();
app.get('/scrape', (req, res) => {
  // Workspace
  var url = 'https://www.facebook.com/saved';
  request(url, (err, response, html) => {
    if (err) console.log(err);
    res.send(JSON.stringify(html));
  });
});
app.listen('8081', () => {
  console.log('App listening on port 8081');
});
Any input will be greatly appreciated... Currently, I'm on hold...! How could I possibly hit this end-point with credentials (safely) provided by the user so that the application could legitimately get past authentication and reach the desired end-point?

I don't think you can accomplish that with the request and cheerio modules alone, since you need to make a POST request with your login information.
A headless browser is more appropriate for this kind of project if you want it to be a scraper. Try using CasperJS or PhantomJS. It will give you more flexibility, but it's not a Node.js module, so you need to go a step further if you want to incorporate it with Express.
One Node.js module I know of that lets you POST is Osmosis. If you can get .login(user, pw) to work, great, but I don't think it can successfully log in to Facebook.
An API, if possible, would be a much nicer solution, but I'm assuming you already looked and found nothing there for what you need.
My personal choice would be to use a Robotic Process Automation (RPA) tool. WinAutomation, for example, is a great tool for manipulating and scraping web pages. It's a completely different approach, but it can do the job well and can be implemented faster than coding it by hand.

Related

How to display binary images retrieved from API in React.js?

✨ Hello everyone!✨
General Problem:
I have a web app with about 50 images that shouldn't be accessible before the user logs into the site. I suspect this has a simple answer, since plenty of sites require this basic protection. Maybe I don't know the right words to google here, but I am having a bit of trouble. Any help is appreciated.
App details:
My web app is built in typescript react, with a node.js/express/mongoDB backend. Fairly typical stuff.
What I have tried:
My best thought so far was to upload them into the public folder on the backend server hosted on Heroku. Then I protected the images by applying authentication middleware to any URL that has "/images/" as part of it. This works, partially. I am able to see the images when I call the API from Postman with the authentication header. But I cannot figure out a way to display that image in my React web app. Here is the basic call I used.
fetch(url, {
  headers: {
    Authorization: token,
  },
});
and then the actual response is just an empty object when I try to copy it
{}
but when I console.log the raw response, I also get some kind of readable stream.
Following a related question, I came up with the following (normally wrapped in an async function):
const image = await fetch(url,{headers:{ Authorization:token}});
const theBlob = await image.blob();
console.log(URL.createObjectURL(theBlob));
which gives me the link: http://localhost:3000/b299feb8-6ee2-433d-bf05-05bce01516b3 which only displays a blank page.
Any help is very much appreciated! Thanks! 😄
After lots of work trying to understand what's going on, here is my own answer:
const image = await axios(url, { responseType: "blob", headers: {Authorization: token }});
const srcForImage = URL.createObjectURL(image.data)
Why it makes sense now
So I did not understand the inner workings of what was going on. Please correct me, but the following is my understanding:
So the image was being sent as binary. What I had to do to fix that was to set the responseType in axios to "blob", so that axios hands back the response body as a Blob (the raw binary data wrapped in a file-like object; it is not base64 encoded). Then the function URL.createObjectURL creates a temporary URL that points at that Blob held in the browser's memory. Then we can just use that as the image URL. When you visit it yourself, you must type the 'blob:' part of the URL it gives you too, otherwise it's blank; or stick it in <img src={srcForImage}/> and it works great. I bet it would've worked in the original fetch example in the question; I just never put the URL in a tag or included 'blob:' as part of the URL.
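For comparison, the same idea using fetch (as in the question) might look like the sketch below. It is only a sketch: `loadProtectedImage` and its `fetchImpl` parameter are hypothetical names introduced here, and it assumes the server returns the raw image bytes when the Authorization header is valid.

```javascript
// Fetch an auth-protected image and turn it into a blob: URL usable as <img src>.
// `loadProtectedImage` and `fetchImpl` are illustrative names, not from the original post.
async function loadProtectedImage(url, token, fetchImpl = fetch) {
  const response = await fetchImpl(url, { headers: { Authorization: token } });
  if (!response.ok) {
    throw new Error('Image request failed with status ' + response.status);
  }
  // The body is raw binary; blob() wraps it without re-encoding it.
  const blob = await response.blob();
  // The returned string starts with "blob:" and is only valid in this page/session.
  return URL.createObjectURL(blob);
}
```

In a React component the returned string would go into state and then into `<img src={...} />`. Calling `URL.revokeObjectURL` on it once the image is no longer needed releases the memory held by the Blob.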
That's correct: you send the auth token and the backend uses it to authenticate the user (check that they exist in the DB, that they have the correct role, and that the JWT is valid).
The server only responds with the images if the above is true.
If your server is responding with an empty object, then the problem is in the backend, not the frontend; console.log what you're sending to the frontend.

In an Express.js server, how can I send an HTML (with style and js) acquired from a HTTP request, as a response?

This is an Express.js server. I'm trying to authenticate my Instagram API.
const express = require('express');
const path = require('path');
const bodyParser = require('body-parser');
const axios = require('axios');
const ejs = require('ejs');
var app = express();
// bodyparser middleware setup
app.use(bodyParser.urlencoded({ extended: false }));
app.use(bodyParser.json());
var instagramClientId = '123123';
app.get('/instagram', (req, res) => {
  axios({
    method: 'post',
    url: `https://api.instagram.com/oauth/authorize/?client_id=${instagramClientId}&redirect_uri=REDIRECT-URI&response_type=code`,
  }).then((response) => {
    res.send(response.data);
    console.log(response.data);
  }).catch((e) => {
    console.log(e);
  });
});
// port set-up
const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`app fired up on port ${port}`);
});
This is the error I got. It looks like the HTML file was sent just fine, but the CSS and JS weren't executed. The error messages were all about styles and scripts not being executed.
Is this because I need to enable some options in my res.send() code?
I'm going to answer two questions here, the first is what I think you're actually having problems with and the second is what would technically be the answer to this question.
What I think your actual problem is: A misunderstanding of OAuth2.0
Looking at your code, it looks like you're trying to get a user to authenticate with Instagram. This is great, OAuth2.0 is fantastic and a good way to authenticate at the moment, but you've got the wrong end of the stick with how to implement it. OAuth2.0 is about redirects, not proxying HTML to the user. In this case it looks like you're using axios to make a server side call to the instagram OAuth endpoint and then sending the user that HTML. What you should actually be doing is redirecting the user to the Instagram URL you've built.
A high level of the "dance" you go through is the following.
The user requests to login with instagram by pressing a button on your website.
You send the user to an Instagram URL; that URL contains your application's client ID plus an "approved" redirect URL. Once the user has logged in with Instagram, Instagram will redirect the user to your approved redirect URL.
The user's browser has now been redirected to a second endpoint on your server; this endpoint receives a one-time token (the authorization code) from Instagram. You take that token on your server side and use axios (or similar) to make a server-side request to fetch some user information such as their profile. Once you have that data, you can then create a user in your database if needed and issue a new session token to them. Along with the profile call, you'll also get a token given directly to you (different from the one the user's browser gave you) which will allow you to make requests to the Instagram API for the privileges you originally requested from the user.
This means you have 2 endpoints on your service: the "hello, I'd like to log in with Instagram, please redirect me to the Instagram login page" endpoint, and then the "hello, Instagram said I'm all good and gave me this token to prove it, you can now check with them directly" endpoint (this is the callback endpoint).
You can manage this whole process manually which is great for understanding OAuth, or you can use something like Passport.js to abstract this for you. This lets you inject your own logic in a few places and handles a lot of the back and forth dance for you. In this instance, I'd probably suggest handling it yourself to learn how it all works.
Ultimately, you are not sending the user any HTML via res.send or anything similar. Instead, your first endpoint simply calls res.redirect(instagramUrl). You therefore make no HTTP requests of your own during this portion; you do that in the "callback", after the user has entered their username and password with Instagram.
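The two endpoints described above could be sketched roughly like this. It is a minimal sketch, not a complete OAuth implementation: the authorize URL and its query parameters are taken from the question's code, while `buildAuthorizeUrl`, `loginHandler`, `callbackHandler` and `exchangeCode` are hypothetical names. The server-side exchange with Instagram is left as a function you supply, since its details depend on the Instagram API.

```javascript
// Build the Instagram authorize URL the browser should be redirected to.
// Endpoint and query parameters are the ones used in the question's code.
function buildAuthorizeUrl(clientId, redirectUri) {
  const url = new URL('https://api.instagram.com/oauth/authorize/');
  url.searchParams.set('client_id', clientId);
  url.searchParams.set('redirect_uri', redirectUri);
  url.searchParams.set('response_type', 'code');
  return url.toString();
}

// Endpoint 1: send the user's browser to Instagram instead of proxying HTML.
function loginHandler(clientId, redirectUri) {
  return (req, res) => res.redirect(buildAuthorizeUrl(clientId, redirectUri));
}

// Endpoint 2 (the callback): Instagram redirects back here with ?code=...
// `exchangeCode` is a hypothetical function you provide; it performs the
// server-side request to Instagram and sets up the user's session.
function callbackHandler(exchangeCode) {
  return async (req, res, next) => {
    try {
      const code = req.query.code;
      if (!code) {
        return res.status(400).send('Missing authorization code');
      }
      await exchangeCode(code);
      // Send the now-authenticated user back into the app.
      res.redirect('/');
    } catch (err) {
      next(err);
    }
  };
}
```

With the Express app from the question, these would be wired up along the lines of `app.get('/instagram', loginHandler(instagramClientId, redirectUri))` and `app.get('/instagram/callback', callbackHandler(exchangeCode))`, where the callback path is an assumed example and must match the redirect URL registered with Instagram.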
Technically the correct answer to this question: proxy the JS and CSS calls, but this is really bad for security!
You're sending some HTML from a 3rd party in this case. So you will also need to give the user access to the JS and CSS. Security-wise, this is quite iffy and you should really consider not doing this, as it's bad practice. All of the JS and CSS links in the page are most likely relative links, meaning they're asking you for some JS and CSS which you are not hosting. Your best bet is to find these exact paths (i.e. /js/app.min.js) and then proxy these requests. So you'll create a new endpoint on your service which will make a request to Instagram's /js/app.min.js and then send that back down with res.send.
Again, please do not do this, pretending to be another service is a really bad idea. If you need something from instagram, use OAuth2.0 to authenticate the user and then make requests using their tokens and the official instagram API.

Meteor allow-access-control-origin

I'm attempting to use the node-trello package to interact with the Trello API inside a Meteor app. However, after running through setup and attempting to make an API call in my client-side JavaScript file, I get this error.
This is my code in my javascript file, following the documentation for the package.
var Trello = require('node-trello');
var t = new Trello(Meteor.settings.public.trelloKey, Meteor.settings.public.trelloToken);
t.get('/1/members/me', function(err, data) {
  if (err) throw err;
  console.log(data);
});
I'm not exactly sure what the error means or how to fix it so any help would be greatly appreciated.
Google will help you find an answer to your problem, by searching for the error message.
The problem is basically a security one: you are making HTTP requests from the browser to another site (Trello), and the browser only allows such cross-origin requests when the responding server sends the appropriate CORS headers (such as Access-Control-Allow-Origin). I'll let you research the details.
A better solution is for you to write a server method to do these things. The server process is not restricted in the requests to other sites that it makes, so you avoid the need to maintain headers, and you also won't hit any firewall issues (because perhaps the user's environment doesn't allow access to 3rd party services like Trello).
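A server method along those lines might look like the sketch below. Treat it as an assumption-laden outline: `defineTrelloMethods` and the method name `'trello.me'` are hypothetical, the key and token are assumed to live in private (non-public) settings, and `Meteor` and `Trello` are passed in as parameters only so the sketch stands alone. In a real app they would be the Meteor object and `require('node-trello')`, and the function would be called from server-only code. It also relies on Meteor waiting for a Promise returned from a method, which I believe recent versions do.

```javascript
// Register a server-side Meteor method that calls Trello on the client's behalf.
// `Meteor` and `Trello` are injected so the sketch has no build-time dependencies.
function defineTrelloMethods(Meteor, Trello) {
  Meteor.methods({
    // 'trello.me' is a hypothetical method name.
    'trello.me'() {
      // Assumes private settings, so the token never reaches the browser.
      const t = new Trello(Meteor.settings.trelloKey, Meteor.settings.trelloToken);
      // Wrap node-trello's callback API in a Promise for the method result.
      return new Promise((resolve, reject) => {
        t.get('/1/members/me', (err, data) => (err ? reject(err) : resolve(data)));
      });
    },
  });
}
```

On the client, the direct Trello call would then be replaced by something like `Meteor.call('trello.me', (err, data) => { ... })`, so the browser only ever talks to your own server.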

Register new route at runtime in NodeJs/ExpressJs

I want to extend this open topic: Add Routes at Runtime (ExpressJs) which sadly didn't help me enough.
I'm working on an application that allows the creation of different APIs that run on Node.js. The UI looks like this:
As you can see, this piece of code contains two endpoints (GET, POST), and as soon as I press "Save", it creates a .js file in a path where the Node.js application looks for its endpoints (e.g. myProject\dynamicRoutes\rule_test.js).
The problem is that, because the Node.js server is already running while I'm developing the code, I'm not able to invoke these new endpoints unless I restart the server (so that Express detects the file).
Is there a way to register new routes while the Node.js (Express) server is running?
I tried to do the following things with no luck:
app.js
This works if the server is restarted. I tried to include this library (express-dynamic-router), but it is not working at runtime.
//this is dynamic routing function
function handleDynamicRoutes(req, res, next) {
  var path = req.path; //http://localhost:8080/api/rule_test
  //LoadModules(path)
  var controllerPath = path.replace("/api/", "./dynamicRoutes/");
  var dynamicController = require(controllerPath);
  dynamicRouter.index(dynamicController[req.method]).register(app);
  dynamicController[req.method] = function(req, res) {
    //invocation
  }
  next();
}
app.all('*', handleDynamicRoutes);
Finally, I read this article (#NodeJS / #ExpressJS: Adding routes dynamically at runtime), but I couldn't figure out how it can help me.
I believe that this could be possible somehow, but I feel a bit lost. Does anyone know how I can achieve this? I'm getting a CANNOT GET error after each file creation.
Disclaimer: please know that it is considered bad design, in terms of stability and security, to allow a user or even an administrator to inject executable code via web forms. Treat this thread as an academic discussion and don't use this code in production!
Look at this simple example, which adds a new route at runtime:
app.get('/subpage', (req, res) => res.send('Hello subpage'))
So basically new route is being registered when app.get is called, no need to walk through routes directory.
All you need to do is simply load your newly created module and pass your app to module.exports function to register new routes. I guess this one-liner should work just fine (not tested):
require('path/to/new/module')(app)
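The one-liner above could be wrapped in a small helper that is called right after the file is saved. This is a sketch under assumptions: `registerRouteModule` is a hypothetical name, and each generated file is assumed to export a function that receives the Express app, for example `module.exports = (app) => { app.get('/api/rule_test', handler); }`.

```javascript
const path = require('path');

// Load (or reload) a route module at runtime and let it register its routes.
// `app` only needs the usual Express methods (get, post, ...) the module calls.
function registerRouteModule(app, modulePath) {
  const resolved = require.resolve(path.resolve(modulePath));
  // Drop any cached copy so a re-saved file is read from disk again.
  delete require.cache[resolved];
  const register = require(resolved);
  if (typeof register !== 'function') {
    throw new TypeError(modulePath + ' must export a function taking the app');
  }
  register(app);
}
```

Note that registering the same path twice does not replace the earlier handler: Express matches routes in the order they were added, so an edited endpoint would still be served by its first version unless the old route is removed or the routes live on a router that gets swapped out.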
Is req.params enough for you?
app.get('/basepath/:path', (req, res) => {
  // Relative path so require looks in the local content folder, not node_modules
  const content = require('./content/' + req.params.path);
  res.send(content);
});
So the user can enter whatever they like after /basepath, for example
http://www.mywebsite.com/basepath/bergur
The router would then try to load the file content/bergur.js
and send its contents.

want to write node.js http client for web site testing

I am new to Node.js.
I want to try writing a Node.js client for testing my website
(stuff like logging in, filling in forms, etc.).
Which module should I use for that?
Since I want to test user login followed by other user functionality,
it should be able to keep a session like a browser does.
Also, is there any site with an example of using that module?
Thanks
As Amenadiel has said in the comments, you might want to use something like Phantom.js for testing websites.
But if you're new to node.js maybe try with something light, like Zombie.js.
An example from their home page:
var Browser = require("zombie");
var assert = require("assert");
// Load the page from localhost
var browser = new Browser();
browser.visit("http://localhost:3000/", function () {
  // Fill email, password and submit form
  browser.
    fill("email", "zombie@underworld.dead").
    fill("password", "eat-the-living").
    pressButton("Sign Me Up!", function() {
      // Form submitted, new page loaded.
      assert.ok(browser.success);
      assert.equal(browser.text("title"), "Welcome To Brains Depot");
    });
});
Later on, when you get the hang of it, maybe switch to Phantom (which has WebKit underneath, so it's not emulating the DOM).

Resources