Web Scraper Pagination

I have created a web scraper where I am trying to fetch dynamic data that loads into a div after the page has loaded.
Here is my code; the source website URL is https://www.medizinerkarriere.de/kliniken-sortiert-nach-name.html
async function pageFunction(context) {
    // jQuery is handy for finding DOM elements and extracting data from them.
    // To use it, make sure to enable the "Inject jQuery" option.
    const $ = context.jQuery;
    var result = [];
    $('#klinikListBox ul').each(function () {
        var item = {
            Name: $(this).find('li.klName').text().trim(),
            Ort: $(this).find('li.klOrt').text().trim(),
            Land: $(this).find('li.klLand').text().trim(),
            Url: ""
        };
        result.push(item);
    });
    // To make this work, make sure the "Use request queue" option is enabled.
    await context.enqueueRequest({ url: 'https://www.medizinerkarriere.de/kliniken-sortiert-nach-name.html' });
    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return result;
}
But the pagination is click-based and I am not sure how to handle it.
I tried every method from this link, but it didn't work:
https://docs.apify.com/scraping/web-scraper#bonus-making-your-code-neater
Please help; quick help would be highly appreciated.

In this case the pagination loads dynamically on a single page, so enqueuing new pages doesn't make sense. You can get to the next page by simply clicking the page button; it is also good practice to wait a bit after the click.
$('#PGPAGES span').eq(1).click();
await context.waitFor(1000);
You can scrape all pages with a simple loop:
const numberOfPages = 8; // You can scrape this number too
for (let i = 1; i <= numberOfPages; i++) {
    // Your scraping code: push data to an array and return it at the end
    $('#PGPAGES span').eq(i).click();
    await context.waitFor(1000);
}
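Putting the two ideas together, a full pageFunction might look like the sketch below. This is only a sketch under the assumptions already made above: the selectors (#klinikListBox ul, #PGPAGES span) and the hard-coded page count come from the snippets in this thread and may need adjusting against the live site.
async function pageFunction(context) {
    const $ = context.jQuery;
    const result = [];

    // Assumed fixed page count, as in the answer above; you could also
    // read it from the pager, e.g. $('#PGPAGES span').length.
    const numberOfPages = 8;

    for (let i = 1; i <= numberOfPages; i++) {
        // Scrape the rows currently shown in the list box.
        $('#klinikListBox ul').each(function () {
            result.push({
                Name: $(this).find('li.klName').text().trim(),
                Ort: $(this).find('li.klOrt').text().trim(),
                Land: $(this).find('li.klLand').text().trim(),
                Url: ""
            });
        });

        // Click the next pager button and give the page time to re-render.
        $('#PGPAGES span').eq(i).click();
        await context.waitFor(1000);
    }

    return result;
}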

Related

Node.js, Express: Running functions in the background

I'm asking this question because I don't know what to look for right now and my googling hasn't been great so far.
I am making a Node.js/Express/SQL app that scrapes a website. It takes 30 to 120 seconds to scrape a whole category. How can I make that function run in the background without blocking the website? The frontend template engine is EJS. If it's not possible with EJS, which framework or library should I use instead? I imagine it working like this:
The user goes to /scrape
They choose a category and send it to the server by clicking a button
Some container on /scrape gets greyed out with a spinning circle, a percentage, or something similar
The user can freely leave /scrape and click around the website, or just stay on /scrape waiting for the result
When the user comes back to /scrape the results are there, or if they stayed, the results show up with or without reloading the page
A full answer to these questions would be very helpful, but even just keywords for me to look up would help a lot.
For your case you could use Redis, or just store the scraped data directly in Node.js in a data structure you like (in my opinion, because of the categories, hashmaps (JS objects) are the best fit here). The process would then look like this:
The user goes to /scrape and selects a category.
The backend checks whether that category has already been scraped (e.g. it looks for the data in the hashmap with the category name as key).
If the data exists (the key is defined), send it to the user. Otherwise (the key is undefined), send the user a message that the data is being scraped and run the scrape function in the background. The scrape function scrapes the data and, when it is done, pushes it into the hashmap under the category key. To avoid the same category being scraped twice at the same time, you can add a "pending" property to the hashmap entry. So when the user accesses the /scrape route, you check the hashmap for the category key: if it exists and pending is false, send the data; if it exists and pending is true, send a wait alert; if the key doesn't exist, start the scrape function and send a wait alert.
Additionally, to make the whole thing "live", you could use socket.io (https://socket.io/) to implement websockets. You could then push the scraped data to the user without them having to reload the page to check whether the scraping is done.
I made a little example that doesn't implement scraping but should make the logic here a bit easier to understand. I also added some explanation to the code in the form of comments.
const express = require("express");
const app = express();

// the data hashmap
const data = {};

// scrape function
const scrape = async (id) => {
    // set pending to true to prevent multiple scrapes of the same category
    data[id] = { pending: true, data: {} };
    // this would be your scrape function; I used a promise here that
    // resolves after 5 seconds with a random number just for simplicity
    const a = await new Promise((res, rej) => {
        setTimeout(() => { res(Math.floor(Math.random() * 1000)); }, 5000);
    });
    // once the data has been scraped, set pending to false and add the data
    data[id].pending = false;
    data[id].data = { id: a };
};

// "scrape" route
app.get("/:id", async (req, res) => {
    const { id } = req.params; // id would represent the category
    // check if the id (category) is not in the hashmap; if not, then
    // start the scrape process and send a wait alert
    if (data[id] == undefined) {
        scrape(id);
        res.send("scraping...");
    // if the data is already being scraped, send a wait alert;
    // the pending property prevents multiple people triggering
    // the scrape of the same category
    } else if (data[id].pending == true) {
        res.send("still scraping...");
    // lastly, if the data is defined and not pending, just send it
    } else {
        res.send(data[id].data);
    }
});

// To test this, go to the root with any id (string, number, whatever,
// e.g. /1337 or /helloworld), wait for 5 seconds (or leave and come
// back after 5 seconds), refresh the page and you will see the random
// number. If you now go to another route (e.g. /test) and come back
// to the first one, you can still see the data; wait 5 seconds and go
// back to /test and you will see its data too.
// You can also open multiple tabs at the same time, which means the
// scraping is asynchronous, so you don't have to wait for one
// category to be scraped before scraping the next.
app.listen(5000);
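To extend this with the socket.io idea mentioned above, a minimal sketch could notify the browser as soon as a category finishes scraping, instead of waiting for a reload. This is only a sketch: it assumes the client loads the socket.io client script and joins a room named after the category, and the "watchCategory" and "scrapeDone" event names are made up for illustration.
const http = require("http");
const express = require("express");
const { Server } = require("socket.io");

const app = express();
const server = http.createServer(app);
const io = new Server(server);

// The client joins a room named after the category it is waiting for.
io.on("connection", (socket) => {
    socket.on("watchCategory", (id) => socket.join(id));
});

// Call this from the scrape function once the data is ready: everyone
// waiting on that category gets the data pushed to them live.
function notifyScraped(id, scrapedData) {
    io.to(id).emit("scrapeDone", scrapedData);
}

server.listen(5000);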

How does scribd prevent download

When reading books on scribd.com the download functionality is not enabled. Even browsing through the HTML source code I was unable to download the actual book. Great stuff... but HOW did they do this?
I am looking to implement something similar: to display a PDF (or something converted from PDF) in such a way that the visitor cannot download the file.
Most solutions I have seen are based on obfuscating the URL, but with a little effort people can find the URL and download the file. Scribd seems to have covered this quite well.
Any suggestions or ideas on how to implement such download protection?
It actually works by dynamically building the HTML based on AJAX requests made while you're flipping pages. It is not image based. That's why you're finding it difficult to download the content.
However, it is not that safe for now. I present a solution below to download books that works today (27th Jan 2020), not to teach you how to do that (it is not legal), but to show you how you should prevent users from downloading content (or at least make it harder) if you're building something similar.
If you have a paid account and open the book page (the one that opens when you click 'Start Reading'), you can download an image of each book page by loading a library such as dom-to-image.
For instance, you could load the library using the developer tools (all code shown below must be typed in the page console):
if (injectDomToImage == undefined) {
    var injectDomToImage = document.createElement('script');
    injectDomToImage.src = "https://cdnjs.cloudflare.com/ajax/libs/dom-to-image/2.6.0/dom-to-image.min.js";
    document.getElementsByTagName('head')[0].appendChild(injectDomToImage);
}
And then, you could define functions such as these:
function downloadPage(page, prefix) {
    domtoimage.toJpeg(document.getElementsByClassName('reader_and_banner_container')[0], {
        quality: 1,
    })
    .then(function(dataUrl) {
        var link = document.createElement('a');
        link.download = `${prefix}_page_${page}.jpg`;
        link.href = dataUrl;
        link.click();
        nextPage(page, prefix);
    });
}

function checkPageChanged(page, oldPageCounter, prefix) {
    let newPageCounter = $('.page_counter').html();
    if (oldPageCounter === newPageCounter) {
        setTimeout(function() {
            checkPageChanged(page, oldPageCounter, prefix);
        }, 500);
    } else {
        setTimeout(function() {
            // The counter changed, so download the page we just navigated to
            // (page was already incremented in nextPage).
            downloadPage(page, prefix);
        }, 500);
    }
}

function nextPage(page, prefix) {
    let oldPageCounter = $('.page_counter').html();
    $('.next_btn').trigger('click');
    // Wait until the page counter has changed (page loading has finished).
    checkPageChanged(page + 1, oldPageCounter, prefix);
}

function download(prefix) {
    downloadPage(1, prefix);
}
Finally, you could download each book page as a JPG image using:
download('test_');
It will download each page as <prefix>_page_<number>.jpg.
In order to prevent this type of 'robot', they could, for example, use reCAPTCHA v3, which works in the background looking for robot-like behaviour.
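As a rough illustration of that reCAPTCHA v3 idea, the server could verify the client's token before serving each page of protected content. This is only a sketch under several assumptions: the /page/:n route, the RECAPTCHA_SECRET environment variable, and the 0.5 score threshold are all made up for this example, and the client side (obtaining the token with grecaptcha.execute) is omitted.
const express = require("express");
const app = express();

// Secret key from the reCAPTCHA admin console (assumed to live in an env var).
const RECAPTCHA_SECRET = process.env.RECAPTCHA_SECRET;

// Hypothetical route that serves one page of the protected document.
app.get("/page/:n", async (req, res) => {
    // The client would obtain this token with grecaptcha.execute() and send it along.
    const token = req.query.token || "";

    // Ask Google to score the request (uses the global fetch of Node 18+).
    const verify = await fetch("https://www.google.com/recaptcha/api/siteverify", {
        method: "POST",
        headers: { "Content-Type": "application/x-www-form-urlencoded" },
        body: `secret=${RECAPTCHA_SECRET}&response=${encodeURIComponent(token)}`
    }).then(r => r.json());

    // 0.5 is an arbitrary threshold; low scores look like bots.
    if (!verify.success || verify.score < 0.5) {
        return res.status(403).send("Request looks automated.");
    }

    res.send(`page ${req.params.n} content`); // placeholder page content
});

app.listen(3000);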

How to get the progress percentage of a Node.js process?

Currently, I use this library https://github.com/mooz/node-pdf-image/ to convert a PDF to images.
When I convert a PDF that has more than 50 pages, I want to know what percentage of the conversion is done, or which page is currently being processed.
Is there any way of doing that, or any library that can report the progress of a Node process?
Thank you.
node-pdf-image supports converting the pages one by one, so you just need to emit events as the conversion progresses. You can then display the events with console.log, send them to the browser with websockets, etc.
async function convertWithProgress(eventEmitter, pdf) {
    const numberOfPages = await pdf.numberOfPages();
    for (let page = 0; page < numberOfPages; page++) {
        eventEmitter.emit('progress', page / numberOfPages);
        await pdf.convertPage(page);
    }
    eventEmitter.emit('progress', 1);
    // combine images?
    // send images?
}
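Wiring this up might look like the sketch below, assuming pdf is a PDFImage instance from the pdf-image package (the module behind node-pdf-image) and that logging the progress to the console is enough; the file path is just a placeholder.
const EventEmitter = require("events");
const { PDFImage } = require("pdf-image"); // package behind mooz/node-pdf-image

const progress = new EventEmitter();
progress.on("progress", (fraction) => {
    console.log(`converted ${Math.round(fraction * 100)}%`);
});

// "/tmp/big-document.pdf" is just a placeholder path for this sketch.
const pdf = new PDFImage("/tmp/big-document.pdf");

convertWithProgress(progress, pdf)
    .then(() => console.log("all pages converted"))
    .catch(console.error);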

data from web scraping using node.js request is different from data shown in the browser

Right now I am doing some simple web scraping, for example getting the current train arrival/departure information for one railway station. Here is the example link: http://www.thetrainline.com/Live/arrivals/chester. From this link you can see the trains currently arriving at Chester station.
I am using the Node.js request module to do this simple web scraping:
app.get('/railway/arrival', function (req, res, next) {
    console.log("/railway/arrival/ " + req.query["city"]);
    var city = req.query["city"];
    if (city == undefined) {
        console.log("city is undefined");
        city = "liverpool-james-street";
    }
    getRailwayArrival(city, function (err, data) {
        res.send(data);
    });
});

function getRailwayArrival(station, callback) {
    request({
        uri: "http://www.thetrainline.com/Live/arrivals/" + station,
    }, function (error, response, body) {
        var $ = cheerio.load(body);
        var a = [];
        $(".results-contents li a").each(function () {
            var due = $(this).find('.due').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var destination = $(this).find('.destination').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var on_time = $(this).find('.on-time-yes .on-time').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            var on_time_no = "";
            if (on_time == "") {
                on_time_no = $(this).find('.on-time-no').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            }
            var platform = $(this).find('.platform').text().replace(/(\r\n|\n|\r|\t)/gm, "");
            a.push({ due: due, destination: destination, on_time: on_time, platform: platform });
            console.log("arrival ".green + due + " " + destination + " " + on_time + " " + platform + " " + on_time_no);
        });
        console.log("get station data " + a.length + " " + $(".updated-time").text());
        callback(null, a);
    });
}
The code works and gives me a list of data; however, the data is different from what I see in the browser, even though it comes from the same URL. I don't know why. Is it because their server can distinguish requests sent from a server from requests sent by a browser, and sends me the wrong data when the request comes from a server? How can I overcome this problem?
Thanks in advance.
They must be storing a session per click event. That means when you visit the page for the first time, a session is stored and validated for the next action you perform. Say you select a value from a drop-down list: that click generates a new session value which loads the data for your selected combobox value. Then, when you click "show list", the previous session value is validated and you get accurate data.
Now, if you don't capture that session value programmatically and pass it as a parameter with your request, you will get the default loaded data or nothing at all. So the challenge for you is to capture that value. Use Firebug for help.
Another issue here could be that the content is generated by JavaScript that normally runs in your browser. jsdom is a module that will execute such scripts and produce that content, but it is not as lightweight.
Cheerio does not execute these scripts, and as a result the content may not be visible (as you're experiencing). This is an article I read a while back that led me to the same discovery; just open it and search for "jsdom is more powerful" for a quick answer:
http://encosia.com/cheerio-faster-windows-friendly-alternative-jsdom/
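For reference, a minimal jsdom-based version might look like the sketch below. It assumes the modern jsdom API (JSDOM.fromURL) and that the arrival rows still match the .results-contents li a selector used above; neither is guaranteed for the live site.
const { JSDOM } = require("jsdom");

// Load the page and let its scripts run, then read the rendered DOM.
JSDOM.fromURL("http://www.thetrainline.com/Live/arrivals/chester", {
    runScripts: "dangerously", // execute the page's own JavaScript
    resources: "usable"        // allow it to load external scripts
}).then((dom) => {
    const document = dom.window.document;
    document.querySelectorAll(".results-contents li a").forEach((row) => {
        const due = row.querySelector(".due");
        const destination = row.querySelector(".destination");
        console.log(
            due && due.textContent.trim(),
            destination && destination.textContent.trim()
        );
    });
});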

Chrome Omnibox extension to post form data to a website?

How can an Omnibox extension create and post form data to a website and then display the result?
Here's an example of what I want to do. When you type lookup bieber into the Omnibox, I want my extension to post form data looking like
searchtype: all
searchterm: bieber
searchcount: 20
to the URL http://lookup.com/search
So that the browser will end up loading http://lookup.com/search with the results of the search.
This would be trivial if I could send the data in a GET, but lookup.com expects an HTTP POST. The only way I can think of is to inject a form into the current page and then submit it, but (a) that only works if there is a current page, and (b) it doesn't seem to work anyway (maybe permissions need to be set).
Before going off down that route, I figured that somebody else must at least have tried to do this before. Have you?
You could do this by using the omnibox API:
chrome.omnibox.onInputChanged.addListener(
    function(text, suggest) {
        // your logic here...
    });
Once you have your extension 'activated' by the keyword you typed, you can call something like this:
var q = "searchtype=all&searchterm=bieber&searchcount=20"; // the params you wish to pass
var url = "http://yourSite.com";
var req = new XMLHttpRequest();
req.open("POST", url, true);
req.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
req.onreadystatechange = function() {
    if (req.readyState == 4) {
        callback(req.responseXML); // handle the response in your own callback
    }
};
req.send(q);
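Putting it together, a background-script sketch of the whole flow might look like this. It is only an illustration under a few assumptions: it uses onInputEntered (fired when the user presses Enter) rather than onInputChanged, fills in the form fields from the question above, and leaves how you display the response up to you.
// background.js (sketch): when the user presses Enter in the omnibox,
// post the typed term to lookup.com as form data.
chrome.omnibox.onInputEntered.addListener(function (text) {
    var params = "searchtype=all" +
                 "&searchterm=" + encodeURIComponent(text) +
                 "&searchcount=20";

    var req = new XMLHttpRequest();
    req.open("POST", "http://lookup.com/search", true);
    req.setRequestHeader("Content-type", "application/x-www-form-urlencoded");
    req.onreadystatechange = function () {
        if (req.readyState == 4) {
            // Display the result however you prefer, e.g. render
            // req.responseText inside a page bundled with the extension.
            console.log(req.responseText);
        }
    };
    req.send(params);
});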
