Scraperjs interaction with the page - node.js

Does anybody use https://github.com/ruipgil/scraperjs for scraping web pages?
I can't figure out how to interact with the page. How do I get Google search results? Should this be done inside the scrape() function or before it?

You should check out the cheerio API; scraperjs uses it for parsing. If you clarify what you want to get from a specific page, I can provide sample code.
Here is code for getting the result URLs from a Google query:
var scraperjs = require('scraperjs');

scraperjs.StaticScraper
    .create('https://www.google.ru/search?q=scraperjs')
    .scrape(function($) {
        return $('li.g').map(function() {
            return $(this).find('a').first().attr('href');
        }).get();
    }, function(news) {
        news.forEach(function(elm) {
            console.log(elm);
        });
    });
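If you also want the result titles along with the URLs, a minimal variant of the same scrape callback could look like this (a sketch only; the 'li.g' and 'h3' selectors assume Google's result markup at the time and may need adjusting):

var scraperjs = require('scraperjs');

scraperjs.StaticScraper
    .create('https://www.google.ru/search?q=scraperjs')
    .scrape(function($) {
        // Collect both the heading text and the link of each result.
        return $('li.g').map(function() {
            return {
                title: $(this).find('h3').first().text(),
                url: $(this).find('a').first().attr('href')
            };
        }).get();
    }, function(results) {
        results.forEach(function(r) {
            console.log(r.title + ' -> ' + r.url);
        });
    });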

Related

How does scribd prevent download

When reading books on scribd.com, the download functionality is not enabled. Even browsing through the HTML source code, I was unable to download the actual book. Great stuff... but HOW did they do this?
I am looking to implement something similar: to display a PDF (or content converted from PDF) in such a way that the visitor cannot download the file.
Most solutions I have seen are based on obfuscating the URL, but with a little effort people can find the URL and download the file. Scribd seems to have covered this quite well.
Any suggestions or ideas on how to implement such download protection?
It actually works by dynamically building the HTML from AJAX requests made while you're flipping pages. It is not image based. That's why you're finding it difficult to download the content.
However, it is not that safe for now. I present a solution below to download books that works today (27 Jan 2020), not to teach you how to do that (it is not legal), but to show you how you should prevent (or at least make it harder for) users from downloading content if you're building something similar.
If you have a paid account and open the book page (the one that opens when you click 'Start Reading'), you can download an image of each book page by loading a library such as dom-to-image.
For instance, you could load the library using the developer tools (all code shown below must be typed into the page console):
// Load dom-to-image from a CDN if it is not already present.
if (typeof injectDomToImage === 'undefined') {
    var injectDomToImage = document.createElement('script');
    injectDomToImage.src = "https://cdnjs.cloudflare.com/ajax/libs/dom-to-image/2.6.0/dom-to-image.min.js";
    document.getElementsByTagName('head')[0].appendChild(injectDomToImage);
}
And then, you could define functions such as these:
// Render the visible reader area to a JPEG, trigger a download, then advance.
function downloadPage(page, prefix) {
    domtoimage.toJpeg(document.getElementsByClassName('reader_and_banner_container')[0], {
        quality: 1,
    })
    .then(function(dataUrl) {
        var link = document.createElement('a');
        link.download = `${prefix}_page_${page}.jpg`;
        link.href = dataUrl;
        link.click();
        nextPage(page, prefix);
    });
}

// Poll until the page counter changes, i.e. the next page has finished loading.
function checkPageChanged(page, oldPageCounter, prefix) {
    let newPageCounter = $('.page_counter').html();
    if (oldPageCounter === newPageCounter) {
        setTimeout(function() {
            checkPageChanged(page, oldPageCounter, prefix);
        }, 500);
    } else {
        setTimeout(function() {
            // 'page' was already incremented in nextPage, so use it as-is here.
            downloadPage(page, prefix);
        }, 500);
    }
}

// Click the reader's "next" button and wait for the new page to load.
function nextPage(page, prefix) {
    let oldPageCounter = $('.page_counter').html();
    $('.next_btn').trigger('click');
    // Wait until the page counter has changed (page loading has finished).
    checkPageChanged(page + 1, oldPageCounter, prefix);
}

// Start the download loop from page 1.
function download(prefix) {
    downloadPage(1, prefix);
}
Finally, you could download each book page as a JPG image using:
download('test_');
It will download each page as test__page_1.jpg, test__page_2.jpg, and so on (the prefix you pass is prepended to _page_<number>.jpg).
In order to prevent this type of 'robot', they could, for example, have used reCAPTCHA v3, which works in the background looking for robot-like behaviour.
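As a rough illustration of that mitigation, a minimal sketch of scoring each page flip with reCAPTCHA v3 follows. The site key and the /verify-token backend endpoint are placeholders of mine, not something Scribd actually uses:

// Hypothetical site key; the matching secret is used server-side to verify tokens.
var RECAPTCHA_SITE_KEY = 'YOUR_SITE_KEY';

// Load the reCAPTCHA v3 client library.
var recaptchaScript = document.createElement('script');
recaptchaScript.src = 'https://www.google.com/recaptcha/api.js?render=' + RECAPTCHA_SITE_KEY;
document.getElementsByTagName('head')[0].appendChild(recaptchaScript);

// Score every page flip; the backend verifies the token with Google and can
// throttle or block sessions whose scores look automated.
function scorePageFlip() {
    grecaptcha.ready(function() {
        grecaptcha.execute(RECAPTCHA_SITE_KEY, { action: 'page_flip' }).then(function(token) {
            fetch('/verify-token', { // hypothetical backend endpoint
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ token: token })
            });
        });
    });
}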

Accessing response headers using NodeJS

I'm having a problem right now which I can't seem to find a solution to.
I'm using UserVoice's NodeJS framework to send some requests to UserVoice regarding feedback posts. A problem I've run into is rate limits, so I want to save the header values X-Rate-Limit-Remaining, X-Rate-Limit-Limit and X-Rate-Limit-Reset locally. I've made a function for updating and getting those values and am calling it like this:
var content = "Test";

c.post(`forums/${config.uservoice.forumId}/suggestions/${id}/comments.json`, {
    comment: {
        text: content
    }
}).then(data => {
    rl.updateRL(data.headers['X-Rate-Limit-Limit'], data.headers['X-Rate-Limit-Remaining'], data.headers['X-Rate-Limit-Reset']);
});
When running this code I get the error Cannot read property 'X-Rate-Limit-Limit' of undefined.
This is not a duplicate; I've also tried the header names in lowercase as described here, but had no luck either. Thanks for helping out!
EDIT:
The function takes the following parameters:
module.exports = {
    updateRL(lim, rem, res) { /* SAVING STUFF HERE */ }
};
It is defined in the file rates.js and is imported in the above file as const rl = require('../rates').
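As a general illustration (not specific to the UserVoice client, which may resolve its promise with the parsed body rather than the raw response object), response headers in Node are normally read from the response object itself, and Node lowercases all incoming header names. A minimal sketch with the built-in https module, using a placeholder URL:

const https = require('https');

// Sketch only: Node exposes incoming headers on res.headers with lowercased names.
https.get('https://api.example.com/some-endpoint', res => {
    console.log(res.headers['x-rate-limit-limit']);
    console.log(res.headers['x-rate-limit-remaining']);
    console.log(res.headers['x-rate-limit-reset']);
    res.resume(); // discard the body
});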

Azure mobile apps CRUD operations on SQL table (node.js backend)

This is my first post here, so please don't get mad if my formatting is a bit off ;-)
I'm trying to develop a backend solution using Azure Mobile Apps and Node.js for the server-side scripts. It is a steep curve, as I am new to JavaScript and Node.js, coming from the embedded world. What I have made is a custom API that can add users to an MSSQL table, which is working fine using the tables object. However, I also need to be able to delete users from the same table. My code for adding a user is:
var userTable = req.azureMobile.tables('MyfUserInfo');
item.id = uuid.v4();

userTable.insert(item).then(function() {
    console.log("inserted data");
    res.status(200).send(item);
});
It works. The Azure Node.js documentation is really not in good shape, and I keep searching for good examples of how to do simple things. Pretty annoying and time consuming.
The SDK documentation on delete operations says it works the same way as read, but that is not true. Or I am dumb as a wet door. My code for deleting looks like this; it results in an exception:
query = queries.create('MyfUserInfo')
    .where({ id: results[i].id });

userTable.delete(query).then(function(delet) {
    console.log("deleted id ", delet);
});
I have also tried this, with no success either:
userTable.where({ id: item.id }).read()
    .then(function(results) {
        if (results.length > 0) {
            for (var i = 0; i < results.length; i++) {
                userTable.delete(results[i].id);
            }
        }
    });
Can somebody please point me in the right direction on the correct syntax for this, and explain why it has to be so difficult doing basic stuff here ;-) It seems like there are many ways of doing the exact same thing, which really confuses me.
Thanks a lot,
Martin
You could issue SQL in your API:
var api = {
    get: (request, response, next) => {
        var query = {
            sql: 'UPDATE TodoItem SET complete = @completed',
            parameters: [
                { name: 'completed', value: request.params.completed }
            ]
        };

        request.azureMobile.data.execute(query)
            .then(function(results) {
                response.json(results);
            });
    }
};

module.exports = api;
That is from their sample on GitHub
Here is the full list of samples to take a look at
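For the delete case in your question, the same pattern should carry over. This is only a sketch: the MyfUserInfo table name comes from your question, and passing the id as a query parameter is my assumption, not part of the official sample:

var api = {
    // Sketch: delete a user row by id with a parameterized query,
    // reusing the same request.azureMobile.data.execute helper.
    delete: (request, response, next) => {
        var query = {
            sql: 'DELETE FROM MyfUserInfo WHERE id = @id',
            parameters: [
                { name: 'id', value: request.query.id }
            ]
        };

        request.azureMobile.data.execute(query)
            .then(function(results) {
                response.json(results);
            });
    }
};

module.exports = api;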
Why are you doing a custom API for a table? Just define the table within the tables directory and add any custom authorization / authentication.

How to parse page that uses HTML5 local storage?

Sorry in advance for my English.
I have a task: write a parser for a site, but all of its pages save entered data in HTML5 local storage. Is it really possible to emulate a click on the images on those pages and retrieve all the variable values that were saved to local storage after that click? For example, using Node.js and a parser like jsdom (https://github.com/tmpvar/jsdom)? Or can I use some alternative technology for this?
Thank you!
Sounds like you are trying to parse a website with lots of JavaScript. You can use PhantomJS to simulate user behaviour. Since you want to use Node, you can use node-phantom for that.
var phantom = require('node-phantom');

phantom.create(function(err, ph) {
    return ph.createPage(function(err, page) {
        return page.open("your/url/", function(err, status) {
            console.log("opened site? ", status);
            page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {
                // jQuery loaded.
                // Wait a bit for any AJAX calls to finish before evaluating.
                setTimeout(function() {
                    return page.evaluate(function() {
                        // Get what you want from the page,
                        // e.g. return localStorage.getItem('xxx');
                    }, function(err, result) {
                        console.log(result);
                    });
                }, 5000);
            });
        });
    });
});
Here is PhantomJS.
Here is node-phantom.
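To address the local storage part of the question specifically, the function passed to page.evaluate could click an element and then return what was stored. A sketch only; the img.some-image selector is a placeholder for whatever image the site reacts to:

page.evaluate(function() {
    // Simulate the click that makes the site write into local storage.
    document.querySelector('img.some-image').click();

    // Return everything currently in local storage as a plain object.
    var data = {};
    for (var i = 0; i < localStorage.length; i++) {
        var key = localStorage.key(i);
        data[key] = localStorage.getItem(key);
    }
    return data;
}, function(err, stored) {
    console.log(stored);
});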

Retrieving album/playlist information API

I read in another thread (http://stackoverflow.com/questions/9474011/showing-a-album-cover) the following:
Please don't use any of the sp. APIs - they're private and going away soon.
My question is, what is the correct way of getting album and/or playlist information from the API?
I'm currently playing around with this:
sp.core.getMetadata(uri, {
    onSuccess: function(uri) {
        // Success
    },
    onFailure: function() {
        // Failure
    }
});
I guess this is private and shouldn't be used, right? Instead, should I get the info from the models.* object? If not, is there another preferred method of dealing with this?
Always use models. Documentation can be found here.
For example:
var sp = getSpotifyApi(1);
var models = sp.require('sp://import/scripts/api/models');

var a = models.Album.fromURI("spotify:album:5zyS3GEyL1FmDWgVXxUvj7", function(album) {
    console.log("Album loaded", album.name);
});
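For playlist information, the same models module should work in the same way, assuming Playlist follows the same fromURI pattern as Album. A sketch with a placeholder URI:

// Sketch: load a playlist through the models module; the URI is a placeholder.
var p = models.Playlist.fromURI("spotify:user:someuser:playlist:xxxxxxxxxxxxxxxxxxxxxx", function(playlist) {
    console.log("Playlist loaded", playlist.name);
});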
