I have a website whose main URL contains several links. I want to get the first <p> element from the page behind each of those links.
The following code works fine for getting the desired links from the main page and storing them in the urls array. My issue is
that I don't know how to write a loop that loads each URL from the urls array and either prints the first <p> in each iteration or appends them
to a variable and prints them all at the end.
How can I do this? Thanks.
var request = require('request');
var cheerio = require('cheerio');
var main_url = 'http://www.someurl.com';
request(main_url, function(err, resp, body){
$ = cheerio.load(body);
links = $('a'); //get all hyperlinks from main URL
var urls = [];
//With this part I get the links (URLs) that I want to scrape.
$(links).each(function(i, link){
lnk = 'http://www.someurl.com/files/' + $(link).attr('href');
urls.push(lnk);
});
//In this part I don't know how to make a loop to load each url within urls array and get first <p>
for (i = 0; i < urls.length; i++) {
var p = $("p:first") //first <p> element
console.log(p.html());
}
});
If you can successfully get the URLs from the main page, you already know everything you need to get the first <p> from each of them, so I suppose your issue is with the way request works, in particular with its callback-based workflow.
My suggestion is to drop request since it's deprecated. You can use something like got, which is Promise-based, so you can use the newer async/await syntax with it (which usually means an easier workflow). Note that you need at least Node.js 8 for that.
Your loop would look like this:
for (let i = 0; i < urls.length; i++) {
  const source = await got(urls[i]);
  // Load the fetched HTML into cheerio and select the first <p>
  const $ = cheerio.load(source.body);
  console.log($('p').first().html());
}
Mind that your function signature needs to be adjusted. In your case you didn't declare a function of your own, so the code runs in the module's (non-async) wrapper function, which means you can't use await there. So write an async function for that:
async function pullAllUrls() {
const mainSource = await got(main_url);
...
}
If you don't want to use async/await you could work with some promise reductions but that's rather cumbersome in my opinion. Then rather go back to promises and use a workflow library like async to help you manage the URL fetching.
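For comparison, here is a minimal sketch of what a "promise reduction" looks like next to the async/await loop. fetchPage is a hypothetical stand-in that resolves after a short delay instead of making a real HTTP request, so the example runs without got or network access; the other function names are made up as well.

```javascript
// Hypothetical stand-in for an HTTP call such as got(url): resolves with fake HTML.
function fetchPage(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`<p>first paragraph of ${url}</p>`), 10);
  });
}

// Promise reduction: chain one fetch after another and collect the results.
function fetchAllWithReduce(urls) {
  return urls.reduce(
    (chain, url) =>
      chain.then((results) =>
        fetchPage(url).then((html) => results.concat(html))
      ),
    Promise.resolve([])
  );
}

// The same sequential flow written with async/await.
async function fetchAllWithAwait(urls) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
  }
  return results;
}
```

Both functions fetch the pages one at a time and resolve to the same array; the async/await version is arguably easier to read and to extend with error handling.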
A real example with async/await:
In a real life example, I'd create a function to fetch the source of the page I'd like to fetch, like so (don't forget to add got to your script/package.json):
async function getSourceFromUrl(thatUrl) {
const response = await got(thatUrl);
return response.body;
}
Then you have a workflow logic to get all those links in the other page. I implemented it like this:
async function grabLinksFromUrl(thatUrl) {
const mainSource = await getSourceFromUrl(thatUrl);
const $ = cheerio.load(mainSource);
const hrefs = [];
$('ul.menu__main-list').each((i, content) => {
$('li a', content).each((idx, inner) => {
const wantedUrl = $(inner).attr('href');
hrefs.push(wantedUrl);
});
}).get();
return hrefs;
}
I decided to take the links in the navigation menu, which are usually wrapped in <li> elements inside a <ul>. So we just take those.
Then you need a workflow to work with those links. This is where the for loop is. I decided that I wanted the title of each page.
async function mainFlow() {
const urls = await grabLinksFromUrl('https://netzpolitik.org/');
for (const url of urls) {
const source = await getSourceFromUrl(url);
const $ = cheerio.load(source);
// Netzpolitik has two <title> elements in its <head>
const title = $('head > title').first().text();
console.log(`${title} (${url}) has source of ${source.length} size`);
// TODO: More work in here
}
}
And finally, you need to call that workflow function:
return mainFlow();
The result you see on your screen should look like this:
Dossiers & Recherchen (https://netzpolitik.org/dossiers-recherchen/) has source of 413853 size
Der Netzpolitik-Podcast (https://netzpolitik.org/podcast/) has source of 333354 size
14 Tage (https://netzpolitik.org/14-tage/) has source of 402312 size
Official Netzpolitik Shop (https://netzpolitik.merchcowboy.com/) has source of 47825 size
Über uns (https://netzpolitik.org/ueber-uns/#transparenz) has source of 308068 size
Über uns (https://netzpolitik.org/ueber-uns) has source of 308068 size
netzpolitik.org-Newsletter (https://netzpolitik.org/newsletter) has source of 291133 size
netzwerk (https://netzpolitik.org/netzwerk/?via=nav) has source of 299694 size
Spenden für netzpolitik.org (https://netzpolitik.org/spenden/?via=nav) has source of 296190 size
I have a working puppeteer script that I'd like to make into an API but I'm having problems with waitForSelector.
Background:
I wrote a puppeteer script that successfully searches for and scrapes the result of a query I specify in the code e.g. let address = xyz;. Now I'd like to make it into an API so that a user can query something. I managed to code everything necessary for the local API (working with express) and everything works as well. By that I mean: I coded all the server side stuff: I can make a request, the scraper function is called, puppeteer starts up, carries out my search (I need to type in an address, choose from a dropdown and press enter).
The status:
The result of my query is a form (basically 3 columns and some rows) in an iFrame and I want to scrape all the rows (I modify them into a specific json later on). The way it works is I use waitForSelector on the form's selector and then I use frame.evaluate.
Problem:
When I run my normal scraper everything works well, but when I run the (slightly modified but essentially same) code within the API framework, waitForSelector suddenly always times out. I have tried all the usual workarounds: waitForNavigation, taking a screenshot and inspecting etc but nothing helped. I've been reading quite a bit and could it be that I'm screwing something up in terms of async/await when I call my scraper from within the context of the API? I'm still quite new to this so please bear with me. This is the code of the working script - I indicated the important part
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');
const error = chalk.bold.red;
const success = chalk.keyword("green");
address = 'Gumpendorfer Straße 12, 1060 Wien';
(async () => {
try {
// open the headless browser
var browser = await puppeteer.launch();
// open a new page
var page = await browser.newPage();
// enter url in page
await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, {waitUntil: 'networkidle2'});
// continue without newsletter
await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
// let everything load
await page.waitFor(1000)
console.log('waiting for iframe with form to be ready.');
//wait until selector is available
await page.waitForSelector('iframe');
console.log('iframe is ready. Loading iframe content');
//choose the relevant iframe
const elementHandle = await page.$(
'iframe[src="/richtwertfrontend/lagezuschlag/"]',
);
//go into frame in order to input info
const frame = await elementHandle.contentFrame();
//enter address
console.log('filling form in iframe');
await frame.type('#input_adresse', address, { delay: 100});
//choose first option from dropdown
console.log('Choosing from dropdown');
await frame.click('#react-autowhatever-1--item-0');
console.log('pressing button');
//press button to search
await frame.click('#next-button');
// scraping data
console.log('scraping')
await frame.waitForSelector('#summary > div > div > br ~ div');//This keeps failing in the API
const res = await frame.evaluate(() => {
const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
const cells = rows.map(
row => [...row.querySelectorAll('div')]
.map(cell => cell.innerText)
);
return cells;
});
await browser.close();
console.log(success("Browser Closed"));
const mapFields = (arr1, arr2) => {
const mappedArray = arr2.map((el) => {
const mappedArrayEl = {};
el.forEach((value, i) => {
if (arr1.length < (i+1)) return;
mappedArrayEl[arr1[i]] = value;
});
return mappedArrayEl;
});
return mappedArray;
}
const Arr1 = res[0];
const Arr2 = res.slice(1,3);
let dataObj = {};
dataObj[address] = [];
// dataObj['lagezuschlag'] = mapFields(Arr1, Arr2);
// dataObj['adresse'] = address;
dataObj[address] = mapFields(Arr1, Arr2);
console.log(dataObj);
} catch (err) {
// Catch and display errors
console.log(error(err));
await browser.close();
console.log(error("Browser Closed"));
}
})();
I just can't understand why it would work in the one case and not in the other, even though I barely changed anything. For the API I basically changed the name of the async function to const search = async (address) => { so that I can call it with the query in my server-side script.
Thanks in advance - I'm not attaching the API code cause I don't want to clutter the question. I can update it if it's necessary
I solved this myself. Turns out the problem wasn't as complicated as I thought and it was annoyingly simple to solve. The problem wasn't with the selector that was timing out but with the previous selectors, specifically the typing and choosing from dropdown selectors. Essentially, things were going too fast. Before the search query was typed in, the dropdown was already pressed and nonsense came out. How I solved it: I included a waitFor(1000) call before the dropdown is selected and everything went perfectly. An interesting realisation was that even though that one selector timed out, it wasn't actually the source of the problem. But like I said, annoyingly simple and I feel dumb for asking this :) but maybe someone will see this and learn from my mistake
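A fixed delay works but depends on timing. A possible alternative, sketched here under the assumption that the dropdown entry only appears once the suggestions have loaded, is to wait for the dropdown item itself before clicking it. The helper name pickFirstSuggestion is made up for illustration; frame.type, frame.waitForSelector and frame.click are the same Puppeteer calls used in the question.

```javascript
// Hypothetical helper: type the address, then wait until the first suggestion
// is visible before clicking it, instead of sleeping for a fixed time.
async function pickFirstSuggestion(frame, address) {
  const firstItem = '#react-autowhatever-1--item-0';
  await frame.type('#input_adresse', address, { delay: 100 });
  await frame.waitForSelector(firstItem, { visible: true });
  await frame.click(firstItem);
}
```

In the script from the question this would replace the frame.type and first frame.click calls, e.g. await pickFirstSuggestion(frame, address);.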
I am currently evaluating WebViewer version 5.2.8.
I need to set some javascript function/code as an action for triggers like calculate trigger, format trigger and keystroke trigger through the WebViewer UI.
Please help me on how to configure javascript code for a form field trigger in WebViewer UI.
Thanks in advance,
Syed
Sorry for the late response!
You will have to create the UI components yourself that will take in the JavaScript code. You can do something similar to what the FormBuilder demo does with just HTML and JavaScript. However, it may be better to clone the open source UI and add your own components.
As for setting the action, I would recommend trying out version 6.0 instead as there is better support for widgets and form fields in that version. However, we are investigating a bug with the field actions that will throw an error on downloading the document. You should be able to use this code to get it working first:
docViewer.on('annotationsLoaded', () => {
const annotations = annotManager.getAnnotationsList();
annotations.forEach(annot => {
const action = new instance.Actions.JavaScript({ javascript: 'alert("Hello World!")' });
// C for Calculate, and F for Format
annot.addAction('K', action);
});
});
Once the bug has been dealt with, you should be able to download the document properly.
Otherwise, you will have to use the full API and that may be less than ideal. It would be a bit more complicated with the full API and I would not recommend it if the above feature will be fixed soon.
Let me know if this helps or if you need more information about using the full API to accomplish this!
EDIT
Here is the code to do it with the full API! Since the full API works at a low level and very closely to the PDF specification, it does take a lot more to make it work. You do still have to update the annotations with the code I provided before which I will include again.
docViewer.on('documentLoaded', async () => {
// This part requires the full API: https://www.pdftron.com/documentation/web/guides/full-api/setup/
const doc = docViewer.getDocument();
// Get document from worker
const pdfDoc = await doc.getPDFDoc();
const pageItr = await pdfDoc.getPageIterator();
while (await pageItr.hasNext()) {
const page = await pageItr.current();
// Note: this is a PDF array, not a JS array
const annots = await page.getAnnots();
const numAnnots = await page.getNumAnnots();
for (let i = 0; i < numAnnots; i++) {
const annot = await annots.getAt(i);
const subtypeDict = await annot.findObj('Subtype');
const subtype = await subtypeDict.getName();
// declared with let because it is reassigned below when the dictionary is missing
let actions = await annot.findObj('AA');
// Check to make sure the annot is of type Widget
if (subtype === 'Widget') {
// Create the additional actions dictionary if it does not exist
if (!actions) {
actions = await annot.putDict('AA');
}
let calculate = await actions.findObj('C');
// Create the calculate action (C) if it does not exist
if (!calculate) {
calculate = await actions.putDict('C');
await Promise.all([calculate.putName('S', 'JavaScript'), calculate.putString('JS', 'app.alert("Hello World!")')]);
}
// Repeat for keystroke (K) and format (F)
}
}
await pageItr.next();
}
});
docViewer.on('annotationsLoaded', () => {
const annotations = annotManager.getAnnotationsList();
annotations.forEach(annot => {
const action = new instance.Actions.JavaScript({ javascript: 'app.alert("Hello World!")' });
// K for Keystroke, and F for Format
annot.addAction('C', action);
});
});
You can probably put them together under the documentLoaded event but once the fix is ready, you can delete the part using the full API.
I'm building a Node-RED node for integrating with an older ventilation system, using screen scraping in Node.js with cheerio. It works fine for fetching some values now, but I seem unable to fetch the right element in the more complex structure that tells which operating mode is active. Screenshot of the structure attached. And yes, I've never used jQuery and am quite a newbie with cheerio.
I have managed, in a way too complex manner, to get the value, but only if it is within a certain part of the tree.
const msgResult = scraped('.control-1');
const activeMode = msgResult.get(0).children.find(x => x.attribs['data-selected'] === '1').attribs['id'];
But this only works on the first match and fails if the element with data-selected === '1' isn't in that part of the tree. I thought I should be able to use just .find from the top of the tree, but it gives no matches.
const activeMode = scraped('.control-1').find(x => x.attribs['data-selected'] === '1')
What I would like to get from the html structure attached, is the ID of the div that has data-selected=1, which again can be below any of the two divs of class control-1. Maybe also the content of the underlying span, where the mode is described in text.
HTML structure
It's hard to tell what you're looking for but maybe:
$('.control-1 [data-selected="1"]').attr('id')
You should try to make some loop to check every tree.
Try this code, hope this works.
const cheerio = require('cheerio');
const fsextra = require('fs-extra');
(async () => {
  try {
    // fs-extra returns a promise when no callback is passed
    const contentHTML = await fsextra.readFile('untitled.html', 'utf-8');
    const $ = cheerio.load(contentHTML);
    const selector = $('.control-1 [data-selected="1"]');
    // print the id of every selected element below any .control-1
    for (let num = 0; num < selector.length; num++) {
      console.log(selector[num].attribs.id);
    }
  } catch (error) {
    console.log('ERROR: ', error);
  }
})();
I'm attempting to write a very basic scraper that loops through a few pages and outputs all the data from each url to a single json file. The url structure goes as follows:
http://url/1
http://url/2
http://url/n
Each of the urls has a table, which contains information pertaining to the ID of the url. This is the data I am attempting to retrieve and store inside a json file.
I am still extremely new to this and having a difficult time moving forward. So far, my code looks as follows:
app.get('/scrape', function(req, res){
var json;
for (var i = 1163; i < 1166; i++){
url = 'https://urlgoeshere.com' + i;
request(url, function(error, response, html){
if(!error){
var $ = cheerio.load(html);
var mN, mL, iD;
var json = { mN : "", mL : "", iD: ""};
$('html body div#wrap h2').filter(function(){
var data = $(this);
mN = data.text();
json.mN = mN;
})
$('table.vertical-table:nth-child(7)').filter(function(){
var data = $(this);
mL = data.text();
json.mL = mL;
})
$('table.vertical-table:nth-child(8)').filter(function(){
var data = $(this);
iD = data.text();
json.iD = iD;
})
}
fs.writeFile('output' + i + '.json', JSON.stringify(json, null, 4), function(err){
console.log('File successfully written! - Check your project directory for the output' + i + '.json file');
})
});
}
res.send(json);
})
app.listen('8081')
console.log('Magic happens on port 8081');
exports = module.exports = app;
When I run the code as displayed above, the output within the output.json file only contains data for the last url. I presume that's because I attempt to save all the data within the same variable?
If I include res.send() inside the loop, so the data writes after each page, I receive the error that multiple headers cannot be sent.
Can someone provide some pointers as to what I'm doing wrong? Thanks in advance.
Ideal output I would like to see:
Page ID: 1
Page Name: First Page
Color: Blue
Page ID: 2
Page Name: Second Page
Color: Red
Page ID: n
Page Name: Nth Page
Color: Green
I can see a number of problems:
Your loop doesn't wait for the asynchronous operations in the loop, thus you do some things like res.send() before the asynchronous operations in the loop have completed.
Inappropriate use of cheerio's .filter().
Your json variable is constantly being overwritten so it only has the last data in it.
Your loop variable i would lose its value by the time you tried to use it in the fs.writeFile() statement.
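The last point can be reproduced without any HTTP requests. A minimal sketch, using setTimeout as a stand-in for the asynchronous request callback (the function names are made up for illustration):

```javascript
// With var there is a single shared i; every callback sees its final value.
function collectWithVar() {
  return new Promise((resolve) => {
    const seen = [];
    for (var i = 1163; i < 1166; i++) {
      setTimeout(() => {
        seen.push(i);
        if (seen.length === 3) resolve(seen);
      }, 0);
    }
  });
}

// With let each iteration gets its own binding of i.
function collectWithLet() {
  return new Promise((resolve) => {
    const seen = [];
    for (let i = 1163; i < 1166; i++) {
      setTimeout(() => {
        seen.push(i);
        if (seen.length === 3) resolve(seen);
      }, 0);
    }
  });
}

collectWithVar().then((seen) => console.log('var:', seen)); // var: [ 1166, 1166, 1166 ]
collectWithLet().then((seen) => console.log('let:', seen)); // let: [ 1163, 1164, 1165 ]
```

This is why the original code would write every result to a file named after the value of i once the loop had finished, rather than the value it had when the request was started.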
Here's one way to deal with those issues:
const rp = require('request-promise');
const fsp = require('fs').promises;
app.get('/scrape', async function(req, res) {
let data = [];
for (let i = 1163; i < 1166; i++) {
const url = 'https://urlgoeshere.com/' + i;
try {
const html = await rp(url)
const $ = cheerio.load(html);
const mN = $('html body div#wrap h2').first().text();
const mL = $('table.vertical-table:nth-child(7)').first().text();
const iD = $('table.vertical-table:nth-child(8)').first().text();
// create object for this iteration of the loop
const obj = {iD, mN, mL};
// add this object to our overall array of all the data
data.push(obj);
// write a file specifically for this invocation of the loop
await fsp.writeFile('output' + i + '.json', JSON.stringify(obj, null, 4));
console.log('File successfully written! - Check your project directory for the output' + i + '.json file');
} catch(e) {
// stop further processing on an error
console.log("Error scraping ", url, e);
res.sendStatus(500);
return;
}
}
// send all the data we accumulated (in an array) as the final result
res.send(data);
});
Things different in this code:
Switch over all variable declarations to let or const
Declare route handler as async so we can use await inside.
Use the request-promise module instead of request. It has the same features, but returns a promise instead of using a plain callback.
Use the promise-based fs module (in latest versions of node.js).
Use await in order to serialize our two asynchronous (now promise-returning) operations so the for loop will pause for them and we can have proper sequencing.
Catch errors and stop further processing and return an error status.
Accumulate an object of data for each iteration of the for loop into an array.
Change .filter() to .first().
Make the response to the request handler be a JSON array of data.
FYI, you can tweak the organization of the data in obj however you want, but the point here is that you end up with an array of objects, one for each iteration of the for loop.
EDIT Jan, 2020 - request() module in maintenance mode
FYI, the request module and its derivatives like request-promise are now in maintenance mode and will not be actively developed to add new features. You can read more about the reasoning here. There is a list of alternatives in this table with some discussion of each one. I have been using got() myself and it's built from the beginning to use promises and is simple to use.
I am a total scrub with the node http module and having some trouble.
The ultimate goal here is to take a huge list of urls, figure out which are valid and then scrape those pages for certain data. So step one is figuring out if a URL is valid and this simple exercise is baffling me.
Say we have an array allURLs:
["www.yahoo.com", "www.stackoverflow.com", "www.sdfhksdjfksjdhg.net"]
The goal is to iterate this array, make a get request to each and if a response comes in, add the link to a list of workingURLs (for now just another array), else it goes to a list brokenURLs.
var workingURLs = [];
var brokenURLs = [];
for (var i = 0; i < allURLs.length; i++) {
var url = allURLs[i];
var req = http.get(url, function (res) {
if (res) {
workingURLs.push(?????); // How to derive URL from response?
}
});
req.on('error', function (e) {
brokenURLs.push(e.host);
});
}
What I don't know is how to properly obtain the URL from the request/response object itself, or really how to structure this kind of async code, because again, I am a Node.js scrub :(
For most websites using res.headers.location works, but there are times when the headers do not have this property and that will cause problems for me later on. Also I've tried console logging the response object itself and that was a messy and fruitless endeavor
I have tried pushing the url variable to workingURLs, but by the time any response comes back that would trigger the push, the for loop is already over and url is forever pointing to the final element of the allURLs array.
Thanks to anyone who can help
You need to capture the url value in a closure so that you can access it later and protect it from being changed by the next loop iteration.
For example:
(function(url){
// use url here
})(allURLs[i]);
The simplest solution for this is to use forEach instead of for.
allURLs.forEach(function(url){
//....
});
A promisified solution lets you know the moment when all the work is done:
var http = require('http');
var allURLs = [
"http://www.yahoo.com/",
"http://www.stackoverflow.com/",
"http://www.sdfhksdjfksjdhg.net/"
];
var workingURLs = [];
var brokenURLs = [];
var promises = allURLs.map(url => validateUrl(url)
.then(res => (res?workingURLs:brokenURLs).push(url)));
Promise.all(promises).then(() => {
console.log(workingURLs, brokenURLs);
});
// ----
function validateUrl(url) {
return new Promise((ok, fail) => {
http.get(url, res => ok(res.statusCode == 200))
.on('error', e => ok(false));
});
}
// Prevent Node.js from exiting; not needed if a server is listening.
var t = setTimeout(() => { console.log('Time is over'); }, 1000).ref();
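As a variation on the promisified approach, here is a minimal sketch that returns the two lists instead of pushing into shared arrays. The names checkUrl and classifyUrls are made up for illustration, and res.resume() is added on the assumption that the response body is not needed, so it is discarded to free the socket.

```javascript
const http = require('http');

// Resolve to true for a 200 response, false for any other status or a network error.
function checkUrl(url) {
  return new Promise((resolve) => {
    http.get(url, (res) => {
      res.resume(); // discard the body so the socket is released
      resolve(res.statusCode === 200);
    }).on('error', () => resolve(false));
  });
}

// Check all URLs in parallel and split them into working and broken lists.
async function classifyUrls(urls) {
  const flags = await Promise.all(urls.map((url) => checkUrl(url)));
  return {
    working: urls.filter((_, i) => flags[i]),
    broken: urls.filter((_, i) => !flags[i]),
  };
}
```

Calling classifyUrls(allURLs).then(({ working, broken }) => console.log(working, broken)); would print both lists once every request has finished. As in the answer above, the URLs need the http:// prefix for http.get to accept them.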
You can use something like this (Not tested):
const arr = ["", "/a", "", ""];
Promise.all(arr.map(url => fetch(url)))
.then(responses=>responses.filter(res=> res.ok).map(res=>res.url))
.then(workingUrls=>{
console.log(workingUrls);
console.log(arr.filter(url=> workingUrls.indexOf(url) == -1 ))
});
EDITED
Working fiddle (Note that you can't do request to another site in the browser because of Cross domain).
UPDATED with #vp_arth suggestions
const arr = ["/", "/a", "/", "/"];
let working=[], notWorking=[],
find = url=> fetch(url)
.then(res=> res.ok ?
working.push(res.url) && res : notWorking.push(res.url) && res);
Promise.all(arr.map(find))
.then(responses=>{
console.log('working', working, 'notWorking', notWorking);
/* Do whatever with the responses if needed */
});
Fiddle