How to parse a page that uses HTML5 local storage? - node.js

Apologies in advance for my English.
I have to write a parser for a site whose pages all save the entered data in HTML5 local storage. Is it possible to emulate a click on the images on a page and then retrieve every value that was saved to local storage as a result? For example, could I use Node.js with a DOM library such as jsdom (https://github.com/tmpvar/jsdom), or should I use some alternative technology?
Thank you!

It sounds like you are trying to parse a website that relies heavily on JavaScript. You can use PhantomJS to simulate user behaviour. Since you want to use Node, you can drive it through node-phantom:
var phantom = require('node-phantom');
phantom.create(function(err, ph) {
  return ph.createPage(function(err, page) {
    return page.open("you/url/", function(err, status) {
      console.log("opened site? ", status);
      page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {
        // jQuery is loaded.
        // Wait 5 seconds for AJAX calls to finish before reading the page.
        setTimeout(function() {
          return page.evaluate(function() {
            // Runs inside the page: get what you want from it,
            // e.g. return localStorage.getItem('xxx');
          }, function(err, result) {
            // Runs in Node: result is whatever the function above returned.
            console.log(result);
            ph.exit();
          });
        }, 5000);
      });
    });
  });
});
Here is PhantomJS.
Here is node-phantom.
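If you need every value rather than one known key, the function run inside the page could walk the whole storage. This is only a minimal sketch: dumpStorage is a made-up helper name, and it is written in ES5 on the assumption that it will be evaluated by PhantomJS inside the page.

```javascript
// Collect every key/value pair from a Storage-like object into a plain object.
// dumpStorage is a hypothetical helper, not part of PhantomJS or node-phantom.
function dumpStorage(storage) {
  // Inside the page no argument is passed, so fall back to window.localStorage.
  storage = storage || localStorage;
  var result = {};
  for (var i = 0; i < storage.length; i++) {
    var key = storage.key(i);
    result[key] = storage.getItem(key);
  }
  return result;
}
```

Assuming node-phantom serializes the function as usual, it could be passed as the first argument to page.evaluate, and the returned object would arrive in the result callback. Any clicks that fill the storage would have to be triggered in the page before this runs.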

Related

How does scribd prevent download

When reading books on scribd.com, the download functionality is not enabled. Even after browsing through the HTML source I was unable to download the actual book. Great stuff, but how did they do this?
I am looking to implement something similar: displaying a PDF (or something converted from a PDF) in such a way that the visitor cannot download the file.
Most solutions I have seen are based on obfuscating the URL, but with a little effort people can find the URL and download the file. Scribd seems to have covered this quite well.
Any suggestions or ideas on how to implement such download protection?
It works by dynamically building the HTML from AJAX requests made while you flip pages. It is not image based, which is why you find it difficult to download the content.
However, it is not that safe for now. I present below a way to download books that is working today (27th Jan 2020), not to teach you how to do that (it is not legal), but to show what you should prevent (or at least make harder) if you are building something similar.
If you have a paid account and open the book page (the one that opens when you click 'Start Reading'), you can download an image of each book page by loading a library such as dom-to-image.
For instance, you could load the library using the developer tools (all code shown below must be typed in the page console):
if (injectDomToImage == undefined) {
  var injectDomToImage = document.createElement('script');
  injectDomToImage.src = "https://cdnjs.cloudflare.com/ajax/libs/dom-to-image/2.6.0/dom-to-image.min.js";
  document.getElementsByTagName('head')[0].appendChild(injectDomToImage);
}
And then, you could define functions such as these:
function downloadPage(page, prefix) {
  domtoimage.toJpeg(document.getElementsByClassName('reader_and_banner_container')[0], {
    quality: 1,
  })
    .then(function(dataUrl) {
      var link = document.createElement('a');
      link.download = `${prefix}_page_${page}.jpg`;
      link.href = dataUrl;
      link.click();
      nextPage(page, prefix);
    });
}
function checkPageChanged(page, oldPageCounter, prefix) {
  let newPageCounter = $('.page_counter').html();
  if (oldPageCounter === newPageCounter) {
    setTimeout(function() {
      checkPageChanged(page, oldPageCounter, prefix);
    }, 500);
  } else {
    setTimeout(function() {
      downloadPage(page + 1, prefix);
    }, 500);
  }
}
function nextPage(page, prefix) {
  let oldPageCounter = $('.page_counter').html();
  $('.next_btn').trigger('click');
  // Wait until page counter has changed (page loading has finished).
  // Pass the current page number; checkPageChanged adds 1 for the next page.
  checkPageChanged(page, oldPageCounter, prefix);
}
function download(prefix) {
  downloadPage(1, prefix);
}
Finally, you could download each book page as a JPG image using:
download('test');
Each page is saved as test_page_N.jpg, where N is the page number.
To prevent this type of 'robot', they could, for example, use reCAPTCHA v3, which works in the background looking for robot-like behaviour.
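As a rough illustration of the server side of that idea: with reCAPTCHA v3 the browser sends a token with each page request, the server posts it to Google's siteverify endpoint, and the JSON that comes back includes success, score and action fields. The sketch below only covers the decision made on that JSON. The function name, the action label and the threshold are assumptions for illustration, not anything Scribd is known to use.

```javascript
// Decide whether to serve the next page, given the JSON body returned by
// reCAPTCHA's siteverify endpoint for the token the browser sent along.
// isLikelyHuman, expectedAction and minScore are hypothetical, app-chosen names/values.
function isLikelyHuman(verification, expectedAction, minScore) {
  if (!verification || verification.success !== true) {
    return false; // token missing, expired, or rejected
  }
  if (verification.action !== expectedAction) {
    return false; // token was issued for a different action
  }
  // v3 scores range from 0.0 (likely a bot) to 1.0 (likely a human).
  return typeof verification.score === 'number' && verification.score >= minScore;
}
```

A script that clicks the next button every half second would be expected to score low over time, at which point the server could stop answering the page requests or ask for an interactive challenge.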

File input meteor cfs

I saw this code in the docs:
Template.myForm.events({
  'change .myFileInput': function(event, template) {
    FS.Utility.eachFile(event, function(file) {
      Images.insert(file, function (err, fileObj) {
        //Inserted new doc with ID fileObj._id, and kicked off the data upload using HTTP
      });
    });
  }
});
But I don't want the file to upload immediately when the "myFileInput" input changes. I want to store the selected file and insert it later with a button. Is there some way to do this?
Also, is there a way to insert into an FS.Collection without a file, just metadata?
Sorry for my bad English, I hope you can help me.
Achieving what you want requires a trivial change to the event: switch from change .myFileInput to submit .myForm. In the submit event, you can get the file by selecting the file input, and then create the FS.File manually. Something like:
'submit .myForm': function (event, template) {
  event.preventDefault();
  var file = template.find('#input').files[0];
  file = new FS.File(file);
  // set metadata
  file.metadata = { 'caption': 'wow' };
  Images.insert(file, function (error, fileObj) {
    if (!error) {
      // do something with fileObj._id
    }
  });
}
If you're using autoform with CollectionFS, you can put that code inside the onSubmit hook. The loop you provided in your question works also.
As for your second question, I don't think FS.Files can be created without a size, so my guess is no, you can't just store metadata without attaching it to a file. Anyways, it seems to me kind of counterintuitive to store just metadata when the metadata is supposed to describe the associated image. You would be better off using a separate collection for that.
Hope that helped :)

Scraperjs interaction with the page

Does anybody use https://github.com/ruipgil/scraperjs for scraping web pages?
I cannot understand how to interact with the page. For example, how do I get Google search results? Should this be done inside the scrape() function or before it?
You should check out the cheerio API; Scraperjs uses it for parsing. If you clarify what you want to get from a specific page, I can provide sample code.
Here is code for getting the URLs from a Google query:
var scraperjs = require('scraperjs');
scraperjs.StaticScraper
  .create('https://www.google.ru/search?q=scraperjs')
  .scrape(function($) {
    return $('li.g').map(function() {
      return $(this).find('a').first().attr('href');
    }).get();
  }, function(links) {
    links.forEach(function(elm) {
      console.log(elm);
    });
  });

PhantomJs - How to render a multi page PDF

I can create one-page PDFs with PhantomJS, but I can't find in the docs how to create several pages (each page coming from an HTML view) and put them into one PDF. I am using the node-phantom module for Node.js.
You just need to specify a paperSize, like this with module "phantom": "0.5.1":
function(next) {
  phantom.create(function(doc) {
    next(null, doc);
  }, "phantomjs", Math.floor(Math.random()*(65535-49152+1)+49152));
},
function(ph, next) {
  ph.createPage(function(doc) {
    next(null, doc);
  });
},
function(page, next) {
  page.set('paperSize', {format: 'A4', orientation: 'portrait'});
  page.set('zoomFactor', 1);
  // Hand the configured page to the next waterfall step.
  next(null, page);
}
Then simply use page-break-before: always; in your HTML content each time you want to start a new page.
PS: I use async.waterfall in this example.
PPS: Math.random is used for the port number to avoid a module crash if concurrent calls to the phantom binary are triggered. It works fine; if someone wants to post something better, even if a bit off-topic, feel free to do so.
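To make the page-break part concrete, here is a minimal sketch of how several rendered HTML views could be joined into one document before it is handed to the page. buildMultiPageHtml is a hypothetical helper, and wrapping each view in a div is just one way to attach the CSS rule.

```javascript
// Join several rendered HTML views into one document, starting each view
// after the first on a new PDF page. buildMultiPageHtml is a hypothetical helper.
function buildMultiPageHtml(views) {
  var sections = views.map(function(html, index) {
    // The first view needs no break; every later one starts a new page.
    var style = index === 0 ? '' : ' style="page-break-before: always;"';
    return '<div' + style + '>' + html + '</div>';
  });
  return '<!DOCTYPE html><html><body>' + sections.join('') + '</body></html>';
}
```

The resulting string would be loaded as the page content and rendered once, so the PDF gets one page (or more, if a view overflows) per view.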

Sending text to the browser

I have managed to get file uploading to work in Node.js with Express, and in the code I'm checking whether the file the user is trying to upload is an image or not.
If the file was successfully uploaded, I want to show a message to the user directly on the HTML page with the upload form. The same should happen if the file wasn't an image, or if something else went wrong during the upload.
The code below works (res.send...), but it opens a new page containing only the message.
My question is: how can I change my code so that the message is sent directly to the HTML page instead? If it is of any use, I'm using Jade.
Thanks in advance!
app.post('/file-upload', function(req, res, next) {
  var fileType = req.files.thumbnail.type;
  var divided = fileType.split("/");
  var theType = divided[0];
  if (theType === "image"){
    var tmp_path = req.files.thumbnail.path;
    var target_path = './public/images/' + req.files.thumbnail.name;
    fs.rename(tmp_path, target_path, function(err) {
      if (err) throw err;
      fs.unlink(tmp_path, function() {
        if (err) {
          throw err;
          res.send('Something happened while trying to upload, try again!');
        }
        res.send('File uploaded to: ' + target_path + ' - ' + req.files.thumbnail.size + ' bytes');
      });
    });
  }
  else {
    res.send('No image!');
  }
});
From what I understand, you are trying to send a message to an already open browser window?
There are a few things you can do:
1. Ajax it: send the POST with Ajax and process the returned info on the client.
2. Submit it as you are doing now, but set a flash message (look at http://github.com/visionmedia/express-messages) and either res.render the form page or res.redirect to the form route.
3. Use now.js or a similar solution. This lets client-side code call server-side functions and server-side code run client-side functions. On submit, you would pass the POST values to a server-side function, which processes them and triggers a client-side function (to display a message).
For my money, option #2 is probably the safest bet, as clients without JavaScript enabled will be able to use it. As for usability, #1 or #3 would give a more streamlined appearance to the end user.
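For the Ajax option, the server side could return a small JSON object instead of a bare string, so the page's script can decide where to show the message. This is only a sketch: uploadResult and its ok/message fields are made-up names, and the messages are the ones from the question's code.

```javascript
// Build the JSON payload the upload route would send back to the page.
// uploadResult and its field names are hypothetical.
function uploadResult(fileType, targetPath, size) {
  // The MIME type looks like "image/png"; only the part before "/" matters.
  var isImage = typeof fileType === 'string' && fileType.split('/')[0] === 'image';
  if (!isImage) {
    return { ok: false, message: 'No image!' };
  }
  return { ok: true, message: 'File uploaded to: ' + targetPath + ' - ' + size + ' bytes' };
}

// In the Express handler, instead of res.send(text):
//   res.json(uploadResult(fileType, target_path, req.files.thumbnail.size));
// The page's own script would then put result.message into an element of the form.
```

Keeping the message building in one function also means the same text could be reused for the flash-message option.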
You can use WebSockets. I recommend Socket.IO; it's very easy to work with. On the client side you would have an event handler that uses JavaScript to append the new information to the page.
The server could then, for example, say:
socket.emit('upload error', "Something happened while trying to upload, try again!");
and the client would use:
socket.on('upload error', function(data){
  // A custom event name is used because several Socket.IO versions
  // treat 'error' as a built-in event.
  alert(data);
});
http://socket.io/#how-to-use
