When reading books on scribd.com, the download functionality is not enabled. Even browsing through the HTML source code, I was unable to download the actual book. Great stuff... but how did they do this?
I am looking to implement something similar: to display a PDF (or content converted from PDF) in such a way that the visitor cannot download the file.
Most solutions I have seen are based on obfuscating the URL, but with a little effort people can find the URL and download the file. Scribd seems to have covered this quite well.
Any suggestions or ideas on how to implement such download protection?
It actually works by dynamically building the HTML based on AJAX requests made while you're flipping pages. It is not image based. That's why you're finding it difficult to download the content.
However, it is not that safe for now. I present a solution below to download books that works today (27th Jan 2020), not to teach you how to do that (it is not legal), but to show you how you should prevent users from downloading content (or at least make it harder) if you're building something similar.
If you have a paid account and open the book page (the one that opens when you click 'Start Reading'), you can download an image of each book page by loading a library such as dom-to-image.
For instance, you could load the library using the developer tools (all code shown below must be typed in the page console):
if (typeof injectDomToImage === 'undefined') {
    // Inject the dom-to-image library into the page.
    var injectDomToImage = document.createElement('script');
    injectDomToImage.src = "https://cdnjs.cloudflare.com/ajax/libs/dom-to-image/2.6.0/dom-to-image.min.js";
    document.getElementsByTagName('head')[0].appendChild(injectDomToImage);
}
And then, you could define functions such as these:
// Render the current page container to a JPEG, trigger a download,
// then move on to the next page.
function downloadPage(page, prefix) {
    domtoimage.toJpeg(document.getElementsByClassName('reader_and_banner_container')[0], {
        quality: 1,
    })
    .then(function(dataUrl) {
        var link = document.createElement('a');
        link.download = `${prefix}_page_${page}.jpg`;
        link.href = dataUrl;
        link.click();
        nextPage(page, prefix);
    });
}
// Poll the page counter until it changes, which signals that the next
// page has finished loading, then download it. (On the last page the
// counter never changes, so this simply keeps polling.)
function checkPageChanged(page, oldPageCounter, prefix) {
    let newPageCounter = $('.page_counter').html();
    if (oldPageCounter === newPageCounter) {
        setTimeout(function() {
            checkPageChanged(page, oldPageCounter, prefix);
        }, 500);
    } else {
        setTimeout(function() {
            // `page` was already advanced in nextPage(), so use it as-is.
            downloadPage(page, prefix);
        }, 500);
    }
}
// Flip to the next page and wait until the page counter has changed
// (i.e. page loading has finished).
function nextPage(page, prefix) {
    let oldPageCounter = $('.page_counter').html();
    $('.next_btn').trigger('click');
    checkPageChanged(page + 1, oldPageCounter, prefix);
}

// Start downloading from page 1.
function download(prefix) {
    downloadPage(1, prefix);
}
Finally, you could download each book page as a JPG image using:
download('test_');
Each page will be saved as test__page_1.jpg, test__page_2.jpg, and so on (the prefix and the page number are combined in the file name).
In order to prevent this type of 'robot', they could, for example, have used reCAPTCHA v3, which runs in the background and scores requests for robot-like behaviour.
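For illustration, a minimal sketch of how reCAPTCHA v3 might be wired in on the client; YOUR_SITE_KEY, the 'read_page' action name, and the /verify_token endpoint are placeholder assumptions, not anything Scribd actually uses:
// Load the reCAPTCHA v3 script (it runs invisibly in the background).
var s = document.createElement('script');
s.src = 'https://www.google.com/recaptcha/api.js?render=YOUR_SITE_KEY';
document.head.appendChild(s);
s.onload = function() {
    grecaptcha.ready(function() {
        // Score this interaction; 'read_page' is a hypothetical action name.
        grecaptcha.execute('YOUR_SITE_KEY', { action: 'read_page' }).then(function(token) {
            // The backend verifies the token with Google's siteverify API
            // and refuses to serve page content when the score looks bot-like.
            fetch('/verify_token', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ token: token })
            });
        });
    });
};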
I have created a web scraper where I am trying to fetch dynamic data that loads into a div after the page has loaded.
Here is my code; the source website is https://www.medizinerkarriere.de/kliniken-sortiert-nach-name.html
async function pageFunction(context) {
    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    var result = [];
    $('#klinikListBox ul').each(function() {
        var item = {
            Name: $(this).find('li.klName').text().trim(),
            Ort: $(this).find('li.klOrt').text().trim(),
            Land: $(this).find('li.klLand').text().trim(),
            Url: ""
        };
        result.push(item);
    });
    // To make this work, make sure the "Use request queue" option is enabled.
    await context.enqueueRequest({ url: 'https://www.medizinerkarriere.de/kliniken-sortiert-nach-name.html' });
    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return result;
}
But the pagination is click-based and I am not sure how to handle it. I tried all the methods from this link, but none of them worked:
https://docs.apify.com/scraping/web-scraper#bonus-making-your-code-neater
Any help would be highly appreciated.
In this case the pagination loads dynamically on a single page, so enqueuing new pages doesn't make sense. You can get to the next page by simply clicking the page button; it is also good practice to wait a bit after the click:
$('#PGPAGES span').eq(1).click();
await context.waitFor(1000);
You can scrape all pages with a simple loop:
const numberOfPages = 8; // You can scrape this number too
for (let i = 1; i <= numberOfPages; i++) {
    // Your scraping code: push data to an array and return it at the end.
    $('#PGPAGES span').eq(i).click();
    await context.waitFor(1000);
}
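Putting the two pieces together, a sketch of how the whole pageFunction might look; the hard-coded page count, the eq(i) indexing of the pager buttons, and the 1-second wait are assumptions to tune against the real site:
async function pageFunction(context) {
    const $ = context.jQuery;
    const result = [];
    const numberOfPages = 8; // assumed; you could also read this from #PGPAGES
    for (let i = 1; i <= numberOfPages; i++) {
        // Scrape the rows currently shown on this page.
        $('#klinikListBox ul').each(function() {
            result.push({
                Name: $(this).find('li.klName').text().trim(),
                Ort: $(this).find('li.klOrt').text().trim(),
                Land: $(this).find('li.klLand').text().trim()
            });
        });
        // Click the next pager button and give the page time to re-render.
        if (i < numberOfPages) {
            $('#PGPAGES span').eq(i).click();
            await context.waitFor(1000);
        }
    }
    return result;
}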
I've been working on a small Twitter-like website to teach myself React. It's going fairly well, and I want to allow users to take photos and attach them to their posts. I found a library called React-Camera that seems to do what I want it to do: it brings up the camera and manages to save something.
I say something because I am very confused about what to actually do with what I save. This is the client-side code for the image capturing, which I basically just copied from the documentation:
takePicture() {
    try {
        this.camera.capture()
            .then(blob => {
                this.setState({
                    show_camera: "none",
                    image: URL.createObjectURL(blob)
                });
                console.log(this.state);
                this.img.src = URL.createObjectURL(blob);
                this.img.onload = () => { URL.revokeObjectURL(this.src); };
                var details = {
                    'img': this.img.src,
                };
                var formBody = [];
                for (var property in details) {
                    var encodedKey = encodeURIComponent(property);
                    var encodedValue = encodeURIComponent(details[property]);
                    formBody.push(encodedKey + "=" + encodedValue);
                }
                formBody = formBody.join("&");
                fetch('/newimage', {
                    method: 'post',
                    headers: {'Content-type': 'application/x-www-form-urlencoded;charset=UTF-8'},
                    body: formBody
                });
                console.log("Reqd post");
            });
    } catch (err) {
        console.error(err);
    }
}
But what am I actually saving here? For testing I tried adding an image to the site and setting src={this.state.img}, but that doesn't work. I can store this blob URL (which looks like, for example, blob:http://localhost:4000/dacf7a61-f8a7-484f-adf3-d28d369ae8db) or the image itself in my DB, but again the problem is I'm not sure what the correct way to go about this is.
Basically, what I want to do is this:
1. Grab a picture using React-Camera.
2. Send this in a POST to /newimage.
3. The image will then, in some form, be stored in the database.
4. Later, a client may request an image that is part of a post (i.e. a tweet can have an image), and the website will display it.
Any help would be greatly appreciated, as I feel I am just getting more confused the more libraries I look at!
From your question I understand that you are storing the image in the DB itself.
If my understanding is correct, that is a bad approach.
Instead:
you need to store the images in a project directory using your Node application;
you need to store the path of each image in the DB;
using these paths you can fetch the images and display them on the webpage.
For uploading images with Node.js you can use the Multer package; a minimal sketch follows.
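For illustration only, a sketch of that flow with Express and Multer; the /newimage route, the uploads/ directory, and the DB call in the comment are assumptions, not code from the question:
// Minimal sketch: accept an uploaded image, save it to disk,
// and store only its path in the DB.
const express = require('express');
const multer = require('multer');

const app = express();
const upload = multer({ dest: 'uploads/' }); // files land in ./uploads

// 'img' must match the field name the client sends.
app.post('/newimage', upload.single('img'), (req, res) => {
    // req.file.path is where Multer stored the file on disk.
    // Persist this path (not the file bytes) in your DB, e.g.:
    // db.posts.update({ _id: postId }, { $set: { imagePath: req.file.path } });
    res.json({ path: req.file.path });
});

// Serve the stored images back to clients.
app.use('/uploads', express.static('uploads'));

app.listen(4000);
On the client side, note that a blob: URL is only a local reference valid inside that browser session; you would send the blob itself, for example with const fd = new FormData(); fd.append('img', blob); fetch('/newimage', { method: 'POST', body: fd });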
Apologies in advance for my English.
I have a task: write a parser for a site, but all of its pages save entered data in HTML5 local storage. Is it realistic to emulate a click on images on the pages and retrieve all the variable values that were saved to local storage after this click? For example, using Node.js and a parser like jsdom (https://github.com/tmpvar/jsdom)? Or can I use some alternative technology for this?
Thank you!
Sounds like you are trying to parse a website with lots of JavaScript. You can use PhantomJS to simulate user behaviour. Since you want to use Node, you can use node-phantom to do that.
var phantom = require('node-phantom');
phantom.create(function(err, ph) {
    return ph.createPage(function(err, page) {
        return page.open("your/url/", function(err, status) {
            console.log("opened site?", status);
            page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {
                // jQuery loaded; wait a bit for any AJAX calls to finish.
                setTimeout(function() {
                    return page.evaluate(function() {
                        // Get what you want from the page,
                        // e.g. return localStorage.getItem('xxx');
                    }, function(err, result) {
                        console.log(result);
                        ph.exit();
                    });
                }, 5000);
            });
        });
    });
});
Here is PhantomJS.
Here is node-phantom.
Does anybody use https://github.com/ruipgil/scraperjs for scraping web pages?
I cannot understand how to interact with the page. How do I get Google search results? Should this be done inside the scrape() function or before it?
You should check out the cheerio API; scraperjs uses it for parsing. Clarify what you want to get from a specific page and I will provide you with sample code.
Here is code for getting the result URLs from a Google query:
var scraperjs = require('scraperjs');
scraperjs.StaticScraper
    .create('https://www.google.ru/search?q=scraperjs')
    .scrape(function($) {
        // Each organic result sits in an li.g element; take its first link's href.
        return $('li.g').map(function() {
            return $(this).find('a').first().attr('href');
        }).get();
    }, function(news) {
        // The second callback receives whatever scrape() returned.
        news.forEach(function(elm) {
            console.log(elm);
        });
    });
Is it possible to build an 'incognito mode' for loading background web-pages in a browser extension?
I am writing a non-IE cross-browser extension that periodically checks web-pages on the user's behalf. There are two requirements:
Page checks are done in the background, to be as unobtrusive as possible. I believe this could be done by opening the page in a new unfocused browser tab, or by hiding it in a sandboxed iframe in the extension's background page.
The page checks should operate in 'incognito mode', and not use/update the user's cookies, history, or local storage. This is to stop the checks polluting the user's actual browsing behavior as much as possible.
Any thoughts on how to implement this 'incognito mode'?
It would ideally work in as many browser types as possible (not IE).
My current ideas are:
Filter out cookie headers from incoming/outgoing http requests associated with the page checks (if I can identify all of these) (not possible in Safari?)
After each page check, filter out the page from the user's history.
Useful SO questions I've found:
Chrome extension: loading a hidden page (without iframe)
Firefox addon development, open a hidden web browser
Identify requests originating in the hiddenDOMWindow (or one of its iframes)
var Cu = Components.utils;
Cu.import('resource://gre/modules/Services.jsm');
Cu.import('resource://gre/modules/devtools/Console.jsm');

// Create an iframe inside the hidden window that every Firefox instance has.
var win = Services.appShell.hiddenDOMWindow;
var iframe = win.document.createElementNS('http://www.w3.org/1999/xhtml', 'iframe');
iframe.addEventListener('DOMContentLoaded', function(e) {
    var loadedWin = e.originalTarget.defaultView;
    console.log('done loaded', e.originalTarget.location);
    if (loadedWin.frameElement && loadedWin.frameElement != iframe) {
        // It's a frame within the iframe that loaded.
    }
}, false);
win.document.documentElement.appendChild(iframe);
You must keep a global variable referencing the iframe we added. Then you can change the iframe location like this; when it has loaded, it triggers the event listener above:
iframe.contentWindow.location = 'http://www.bing.com/';
That DOMContentLoaded listener identifies everything loaded in that iframe; if the page has frames, it detects those too.
To remove the page from history, use the Places history service inside the DOMContentLoaded function to remove the loaded location from history:
https://developer.mozilla.org/en-US/docs/Using_the_Places_history_service
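A hedged sketch of that removal step, assuming the synchronous nsIBrowserHistory interface that PlacesUtils exposed in that era (check the linked docs for the variant your Firefox version supports):
Cu.import('resource://gre/modules/PlacesUtils.jsm');

// Inside the DOMContentLoaded handler above: scrub the just-loaded
// URL from the user's Places history.
var visitedURI = Services.io.newURI(e.originalTarget.location.href, null, null);
PlacesUtils.bhistory.removePage(visitedURI);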
Now, to strip the cookies from requests in that page, use this code:
const {classes: Cc, Constructor: CC, interfaces: Ci, utils: Cu, results: Cr, manager: Cm} = Components;
Cu.import('resource://gre/modules/Services.jsm');
var myTabToSpoofIn = Services.wm.getMostRecentWindow('navigator:browser').gBrowser.tabContainer.childNodes[0]; //will spoof in the first tab of your browser
var httpRequestObserver = {
    observe: function(subject, topic, data) {
        var httpChannel;
        if (topic == "http-on-modify-request") {
            httpChannel = subject.QueryInterface(Ci.nsIHttpChannel);
            var goodies = loadContextGoodies(httpChannel);
            if (goodies) {
                if (goodies.contentWindow.top == iframe.contentWindow.top) {
                    //this request comes from our hidden iframe, so strip its cookies
                    httpChannel.setRequestHeader('Cookie', '', false);
                } else {
                    //this page load isn't in our iframe so ignore it
                }
            }
        }
    }
};
Services.obs.addObserver(httpRequestObserver, "http-on-modify-request", false);
//Services.obs.removeObserver(httpRequestObserver, "http-on-modify-request", false); //run this on shutdown of your addon, otherwise the observer stays registered
//this function gets the contentWindow and other good stuff from the loadContext of the httpChannel
function loadContextGoodies(httpChannel) {
    //httpChannel must be the subject of http-on-modify-request QI'ed to nsIHttpChannel, as is done above
    //start loadContext stuff
    var loadContext;
    try {
        var interfaceRequestor = httpChannel.notificationCallbacks.QueryInterface(Ci.nsIInterfaceRequestor);
        //var DOMWindow = interfaceRequestor.getInterface(Components.interfaces.nsIDOMWindow); //not to be done anymore because: https://developer.mozilla.org/en-US/docs/Updating_extensions_for_Firefox_3.5#Getting_a_load_context_from_a_request //instead do the loadContext stuff below
        try {
            loadContext = interfaceRequestor.getInterface(Ci.nsILoadContext);
        } catch (ex) {
            try {
                loadContext = httpChannel.loadGroup.notificationCallbacks.getInterface(Ci.nsILoadContext);
            } catch (ex2) {}
        }
    } catch (ex0) {}
    if (!loadContext) {
        //no load context, so don't do anything
        //this probably means it's an ajax call or something like a google ad loading
        return null;
    } else {
        var contentWindow = loadContext.associatedWindow;
        if (!contentWindow) {
            //this channel does not have a window; it's probably loading a resource
            return null;
        } else {
            var aDOMWindow = contentWindow.top.QueryInterface(Ci.nsIInterfaceRequestor)
                .getInterface(Ci.nsIWebNavigation)
                .QueryInterface(Ci.nsIDocShellTreeItem)
                .rootTreeItem
                .QueryInterface(Ci.nsIInterfaceRequestor)
                .getInterface(Ci.nsIDOMWindow);
            var gBrowser = aDOMWindow.gBrowser;
            var aTab = gBrowser._getTabForContentWindow(contentWindow.top); //the clickable tab XUL element in the tab strip; you can style it, e.g. aTab.style.backgroundColor = 'blue'
            var browser = aTab.linkedBrowser; //the browser within the tab
            return {
                aDOMWindow: aDOMWindow,
                gBrowser: gBrowser,
                aTab: aTab,
                browser: browser,
                contentWindow: contentWindow
            };
        }
    }
    //end loadContext stuff
}