I'm writing a Node-RED node that integrates with an older ventilation system via screen scraping, using Node.js with cheerio. It works fine for fetching some values, but I can't seem to fetch the right element in the more complex structure that tells which operating mode is active. A screenshot of the structure is attached. And yes, I've never used jQuery and I'm quite a newbie with cheerio.
I have managed, in a way that is far too complex, to get the value if it is within a certain part of the tree:
const msgResult = scraped('.control-1');
const activeMode = msgResult.get(0).children.find(x => x.attribs['data-selected'] === '1').attribs['id'];
But this only works on the first match and fails if the element with data-selected === '1' isn't in that part of the tree. I thought I should be able to use just .find from the top of the tree, but I get no matches:
const activeMode = scraped('.control-1').find(x => x.attribs['data-selected'] === '1')
What I would like to get from the attached HTML structure is the ID of the div that has data-selected="1", which can sit below either of the two divs of class control-1. Ideally I'd also like the content of the underlying span, where the mode is described in text.
[Screenshot: HTML structure]
It's hard to tell what you're looking for but maybe:
$('.control-1 [data-selected="1"]').attr('id')
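Note that cheerio's .find() takes a CSS selector (as in jQuery), not a predicate function like Array.prototype.find, which is why your second attempt returns no matches. Building on the selector above, a minimal sketch (assuming scraped is the cheerio document you already loaded, and that the descriptive text sits in a span inside the selected div) could be:
// scraped is the cheerio instance already loaded from the fetched HTML
const selected = scraped('.control-1 [data-selected="1"]').first();
const activeMode = selected.attr('id');                              // id of the div with data-selected="1"
const activeModeText = selected.find('span').first().text().trim();  // assumed: the mode description text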
You could loop over every matching element in the tree. Try this code; hopefully it works:
const cheerio = require('cheerio');
const fsextra = require('fs-extra');

(async () => {
    try {
        // Read the HTML file, then parse it with cheerio (cheerio.load is synchronous)
        const contentHTML = await fsextra.readFile('untitled.html', 'utf-8');
        const $ = cheerio.load(contentHTML);
        const selector = $('.control-1 [data-selected="1"]');
        for (let num = 0; num < selector.length; num++) {
            console.log(selector[num].attribs.id);
        }
    } catch (error) {
        console.log('ERROR: ', error);
    }
})();
I have a working puppeteer script that I'd like to make into an API but I'm having problems with waitForSelector.
Background:
I wrote a puppeteer script that successfully searches for and scrapes the result of a query I specify in the code, e.g. let address = xyz;. Now I'd like to turn it into an API so that a user can supply the query. I've coded everything necessary for the local API (working with express) and that part works as well. By that I mean: I can make a request, the scraper function is called, puppeteer starts up and carries out my search (I need to type in an address, choose from a dropdown and press enter).
The status:
The result of my query is a form (basically 3 columns and some rows) in an iFrame and I want to scrape all the rows (I modify them into a specific json later on). The way it works is I use waitForSelector on the form's selector and then I use frame.evaluate.
Problem:
When I run my normal scraper everything works well, but when I run the (slightly modified but essentially the same) code within the API framework, waitForSelector suddenly always times out. I have tried all the usual workarounds: waitForNavigation, taking a screenshot and inspecting, etc., but nothing helped. I've been reading quite a bit; could it be that I'm messing something up in terms of async/await when I call my scraper from within the context of the API? I'm still quite new to this, so please bear with me. This is the code of the working script; I indicated the important part.
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');
const error = chalk.bold.red;
const success = chalk.keyword("green");
address = 'Gumpendorfer Straße 12, 1060 Wien';
(async () => {
    try {
        // open the headless browser
        var browser = await puppeteer.launch();
        // open a new page
        var page = await browser.newPage();
        // enter url in page
        await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, {waitUntil: 'networkidle2'});
        // continue without newsletter
        await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
        // let everything load
        await page.waitFor(1000)
        console.log('waiting for iframe with form to be ready.');
        // wait until selector is available
        await page.waitForSelector('iframe');
        console.log('iframe is ready. Loading iframe content');
        // choose the relevant iframe
        const elementHandle = await page.$(
            'iframe[src="/richtwertfrontend/lagezuschlag/"]',
        );
        // go into frame in order to input info
        const frame = await elementHandle.contentFrame();
        // enter address
        console.log('filling form in iframe');
        await frame.type('#input_adresse', address, { delay: 100});
        // choose first option from dropdown
        console.log('Choosing from dropdown');
        await frame.click('#react-autowhatever-1--item-0');
        console.log('pressing button');
        // press button to search
        await frame.click('#next-button');
        // scraping data
        console.log('scraping')
        await frame.waitForSelector('#summary > div > div > br ~ div'); // This keeps failing in the API
        const res = await frame.evaluate(() => {
            const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
            const cells = rows.map(
                row => [...row.querySelectorAll('div')]
                    .map(cell => cell.innerText)
            );
            return cells;
        });
        await browser.close();
        console.log(success("Browser Closed"));
        const mapFields = (arr1, arr2) => {
            const mappedArray = arr2.map((el) => {
                const mappedArrayEl = {};
                el.forEach((value, i) => {
                    if (arr1.length < (i+1)) return;
                    mappedArrayEl[arr1[i]] = value;
                });
                return mappedArrayEl;
            });
            return mappedArray;
        }
        const Arr1 = res[0];
        const Arr2 = res.slice(1,3);
        let dataObj = {};
        dataObj[address] = [];
        // dataObj['lagezuschlag'] = mapFields(Arr1, Arr2);
        // dataObj['adresse'] = address;
        dataObj[address] = mapFields(Arr1, Arr2);
        console.log(dataObj);
    } catch (err) {
        // Catch and display errors
        console.log(error(err));
        await browser.close();
        console.log(error("Browser Closed"));
    }
})();
I just can't understand why it works in the one case and not in the other, even though I barely changed anything. For the API I basically changed the name of the async function to const search = async (address) => { so that I can call it with the query from my server-side script.
Thanks in advance - I'm not attaching the API code because I don't want to clutter the question. I can update it if necessary.
I solved this myself. It turns out the problem wasn't as complicated as I thought, and it was annoyingly simple to solve. The problem wasn't with the selector that was timing out but with the previous selectors, specifically the typing and the choosing-from-dropdown steps. Essentially, things were going too fast: before the search query had been typed in, the dropdown was already clicked and nonsense came out. How I solved it: I added a waitFor(1000) call before the dropdown is selected, and everything worked perfectly. An interesting realisation was that even though that one selector timed out, it wasn't actually the source of the problem. But like I said, annoyingly simple, and I feel dumb for asking this :) but maybe someone will see this and learn from my mistake.
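For reference, the relevant part of the script then looks roughly like this (a sketch based on the description above; waiting for the dropdown item selector instead of a fixed delay would be more robust):
// enter address
await frame.type('#input_adresse', address, { delay: 100 });
// give the autocomplete dropdown time to populate before clicking it
await page.waitFor(1000);
// choose first option from dropdown
await frame.click('#react-autowhatever-1--item-0');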
I have a website that has a main URL containing several links. I want to get the first <p> element from each link on that main page.
I have the following code that works fine to get the desired links from the main page and store them in the urls array. My issue is that I don't know how to make a loop that loads each URL from the urls array and prints its first <p> in each iteration, or appends them to a variable and prints everything at the end.
How can I do this? Thanks.
var request = require('request');
var cheerio = require('cheerio');
var main_url = 'http://www.someurl.com';
request(main_url, function(err, resp, body){
    $ = cheerio.load(body);
    links = $('a'); //get all hyperlinks from main URL
    var urls = [];
    //With this part I get the links (URLs) that I want to scrape.
    $(links).each(function(i, link){
        lnk = 'http://www.someurl.com/files/' + $(link).attr('href');
        urls.push(lnk);
    });
    //In this part I don't know how to make a loop to load each url within urls array and get first <p>
    for (i = 0; i < urls.length; i++) {
        var p = $("p:first") //first <p> element
        console.log(p.html());
    }
});
If you can successfully get the URLs from the main page, you already know everything needed to fetch each of them and grab its first <p>, so I suppose your issue is with the way request works, and in particular with its callback-based workflow.
My suggestion is to drop request, since it's deprecated. You can use something like got, which is Promise-based, so you can use the newer async/await features that come with it (which usually means an easier workflow). Note that you then need at least Node.js 8.
Your loop would look like this:
for (let i = 0; i < urls.length; i++) {
    const source = await got(urls[i]);
    // Do your cheerio determination, e.g.:
    const $ = cheerio.load(source.body);
    const new_p = $('p').first(); // first <p> of that page
    console.log(new_p.html());
}
Mind you that your function signature needs to be adjusted. In your case you didn't declare an async function at all, so the module's wrapper function is used, which means you can't use await there. So write a function for that:
async function pullAllUrls() {
const mainSource = await got(main_url);
...
}
If you don't want to use async/await, you could work with some promise reductions, but that's rather cumbersome in my opinion. In that case, rather go back to plain promises and use a workflow library like async to help you manage the URL fetching.
A real example with async/await:
In a real life example, I'd create a function to fetch the source of the page I'd like to fetch, like so (don't forget to add got to your script/package.json):
async function getSourceFromUrl(thatUrl) {
const response = await got(thatUrl);
return response.body;
}
Then you have a workflow logic to get all those links in the other page. I implemented it like this:
async function grabLinksFromUrl(thatUrl) {
    const mainSource = await getSourceFromUrl(thatUrl);
    const $ = cheerio.load(mainSource);
    const hrefs = [];
    $('ul.menu__main-list').each((i, content) => {
        $('li a', content).each((idx, inner) => {
            const wantedUrl = $(inner).attr('href');
            hrefs.push(wantedUrl);
        });
    }).get();
    return hrefs;
}
I decided that I'd like to get the links in the <nav> element, which are usually wrapped inside <ul> and <li> elements. So we just take those.
Then you need a workflow to work with those links. This is where the for loop is. I decided that I wanted the title of each page.
async function mainFlow() {
    const urls = await grabLinksFromUrl('https://netzpolitik.org/');
    for (const url of urls) {
        const source = await getSourceFromUrl(url);
        const $ = cheerio.load(source);
        // Netpolitik has two <title> in their <head>
        const title = $('head > title').first().text();
        console.log(`${title} (${url}) has source of ${source.length} size`);
        // TODO: More work in here
    }
}
And finally, you need to call that workflow function:
return mainFlow();
The result you see on your screen should look like this:
Dossiers & Recherchen (https://netzpolitik.org/dossiers-recherchen/) has source of 413853 size
Der Netzpolitik-Podcast (https://netzpolitik.org/podcast/) has source of 333354 size
14 Tage (https://netzpolitik.org/14-tage/) has source of 402312 size
Official Netzpolitik Shop (https://netzpolitik.merchcowboy.com/) has source of 47825 size
Über uns (https://netzpolitik.org/ueber-uns/#transparenz) has source of 308068 size
Über uns (https://netzpolitik.org/ueber-uns) has source of 308068 size
netzpolitik.org-Newsletter (https://netzpolitik.org/newsletter) has source of 291133 size
netzwerk (https://netzpolitik.org/netzwerk/?via=nav) has source of 299694 size
Spenden für netzpolitik.org (https://netzpolitik.org/spenden/?via=nav) has source of 296190 size
I'm trying to do the following: read the content of a directory to find all the .xml files (I'm using glob, though I'd like to use something like fs.readdir from fs), then read every file using fs.readFile and convert each XML file to a JSON object. I'm using xml2json for this purpose.
Once I have the JSON objects, I'd like to iterate over every one of them, get one property out of each, and push it to an array. Eventually, all the code is wrapped in a function that logs the content of the array (once it's complete). This code currently works fine, but I'm getting into the famous callback hell.
const fs = require('fs');
const glob = require('glob');
const parser = require('xml2json');
let connectors = []
function getNames(){
    glob(__dirname + '/configs/*.xml', {}, (err, files) => {
        for (let j=0; j < files.length; j++) {
            fs.readFile( files[j], function(err, data) {
                try {
                    let json = parser.toJson(data, {object: true, alternateTextNode:true, sanitize:true})
                    for (let i=0; i< json.properties.length; i++){
                        connectors.push(json.properties[i].name)
                        if (connectors.length === files.length){return console.log(connectors)}
                    }
                }
                catch(e){
                    console.log(e)
                }
            });
        }
    })
}
getNames()
However, I'd like to move to a cleaner and more elegant solution (using promises). I've been reading around the community and found some ideas in similar posts here or here.
I'd like to have your opinion on how I should proceed in this kind of situation. Should I go for a sync version of readFile instead? Should I use promisifyAll to refactor my code and use promises everywhere? If so, could you please elaborate on what my code should look like?
I've also learned that there's a promise-based version of fs from Node v10.0.0 onwards. Should I go for that option? If so, how should I proceed with the parser.toJson() part? I've also seen that there's another promise-based version called xml-to-json-promise.
I'd really appreciate your insights into this, as I'm not very familiar with promises when there are several asynchronous operations and loops involved, and I end up with dirty solutions for situations like this one.
Regards,
J
I would indeed suggest that you use the promise versions of glob and fs, and then use async, await and Promise.all to get it all done.
NB: I don't see the logic behind the connectors.length === files.length check, as in theory the number of connectors (properties) can be greater than the number of files. I assume you want to collect all of them, irrespective of their number.
So here is how the code could look (untested):
const fs = require('fs').promises; // Promise-version (node 10+)
const glob = require('glob-promise'); // Promise-version
const parser = require('xml2json');
async function getNames() {
    let files = await glob(__dirname + '/configs/*.xml');
    let promises = files.map(fileName => fs.readFile(fileName).then(data =>
        parser.toJson(data, {object: true, alternateTextNode:true, sanitize:true})
            .properties.map(prop => prop.name)
    ));
    return (await Promise.all(promises)).flat();
}
getNames().then(connectors => {
    // rest of your processing that needs access to connectors...
});
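For reference, the .properties.map(prop => prop.name) call above assumes each parsed file yields an object shaped roughly like the following (a hypothetical example; the actual shape depends on your XML):
// Hypothetical parsed result assumed by the mapping above
const exampleJson = {
    properties: [
        { name: 'connectorA' },
        { name: 'connectorB' }
    ]
};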
Since you write in the comments that you have problems accessing properties.map, add some validation and skip the cases where there is no properties:
const fs = require('fs').promises; // Promise-version (node 10+)
const glob = require('glob-promise'); // Promise-version
const parser = require('xml2json');
async function getNames() {
    let files = await glob(__dirname + '/configs/*.xml');
    let promises = files.map(fileName => fs.readFile(fileName).then(data =>
        (parser.toJson(data, {object: true, alternateTextNode:true, sanitize:true})
            .properties || []).map(prop => prop.name)
    ));
    return (await Promise.all(promises)).flat();
}
getNames().then(connectors => {
    // rest of your processing that needs access to connectors...
});
I am currently evaluating WebViewer version 5.2.8.
I need to set some JavaScript function/code as an action for triggers like the calculate trigger, format trigger and keystroke trigger through the WebViewer UI.
Please help me with how to configure JavaScript code for a form field trigger in the WebViewer UI.
Thanks in advance,
Syed
Sorry for the late response!
You will have to create the UI components yourself that will take in the JavaScript code. You can do something similar to what the FormBuilder demo does with just HTML and JavaScript. However, it may be better to clone the open source UI and add your own components.
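As a very rough sketch of that idea (the element ids and the onApplyScript callback here are hypothetical, only to show the shape of such a component):
// Hypothetical UI piece: a textarea for the trigger script and an apply button
const scriptInput = document.getElementById('field-script-input'); // hypothetical id
const applyButton = document.getElementById('field-script-apply'); // hypothetical id
applyButton.addEventListener('click', () => {
    // Hand the entered JavaScript to whatever applies it to the field (see the action code below)
    onApplyScript(scriptInput.value); // hypothetical callback
});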
As for setting the action, I would recommend trying out version 6.0 instead as there is better support for widgets and form fields in that version. However, we are investigating a bug with the field actions that will throw an error on downloading the document. You should be able to use this code to get it working first:
docViewer.on('annotationsLoaded', () => {
    const annotations = annotManager.getAnnotationsList();
    annotations.forEach(annot => {
        const action = new instance.Actions.JavaScript({ javascript: 'alert("Hello World!")' });
        // 'K' for Keystroke here; use 'C' for Calculate and 'F' for Format
        annot.addAction('K', action);
    });
});
Once the bug has been dealt with, you should be able to download the document properly.
Otherwise, you will have to use the full API and that may be less than ideal. It would be a bit more complicated with the full API and I would not recommend it if the above feature will be fixed soon.
Let me know if this helps or if you need more information about using the full API to accomplish this!
EDIT
Here is the code to do it with the full API! Since the full API works at a low level and very closely to the PDF specification, it does take a lot more to make it work. You do still have to update the annotations with the code I provided before which I will include again.
docViewer.on('documentLoaded', async () => {
    // This part requires the full API: https://www.pdftron.com/documentation/web/guides/full-api/setup/
    const doc = docViewer.getDocument();
    // Get document from worker
    const pdfDoc = await doc.getPDFDoc();
    const pageItr = await pdfDoc.getPageIterator();
    while (await pageItr.hasNext()) {
        const page = await pageItr.current();
        // Note: this is a PDF array, not a JS array
        const annots = await page.getAnnots();
        const numAnnots = await page.getNumAnnots();
        for (let i = 0; i < numAnnots; i++) {
            const annot = await annots.getAt(i);
            const subtypeDict = await annot.findObj('Subtype');
            const subtype = await subtypeDict.getName();
            let actions = await annot.findObj('AA');
            // Check to make sure the annot is of type Widget
            if (subtype === 'Widget') {
                // Create the additional actions dictionary if it does not exist
                if (!actions) {
                    actions = await annot.putDict('AA');
                }
                let calculate = await actions.findObj('C');
                // Create the calculate action (C) if it does not exist
                if (!calculate) {
                    calculate = await actions.putDict('C');
                    await Promise.all([calculate.putName('S', 'JavaScript'), calculate.putString('JS', 'app.alert("Hello World!")')]);
                }
                // Repeat for keystroke (K) and format (F)
            }
        }
        await pageItr.next();
    }
});
docViewer.on('annotationsLoaded', () => {
    const annotations = annotManager.getAnnotationsList();
    annotations.forEach(annot => {
        const action = new instance.Actions.JavaScript({ javascript: 'app.alert("Hello World!")' });
        // 'C' for Calculate here; use 'K' for Keystroke and 'F' for Format
        annot.addAction('C', action);
    });
});
You can probably put them together under the documentLoaded event but once the fix is ready, you can delete the part using the full API.
I am trying to get puppeteer to go to all <a> tags on a page, load them, add them to an array and return it. My puppeteer version is 1.5.0. Here is my code:
module.exports.scrapeLinks = async (page, linkXpath) => {
    page.waitForNavigation();
    linksElement = await page.$x(linkXpath);
    var url_list_arr = [];
    console.log(linksElement.length);
    i=1;
    for(linksElementItem in linksElement)
    {
        const linksData = await page.$x('(' + linkXpath + ')[' + (i + 1) +']');
        if (linksData.length > 0) {
            linksData[0].click();
            console.log(page.url());
            url_list_arr.push(page.url());
        }
        else {
            throw new Error('Link not found');
        }
    }
    return url_list_arr;
};
However, with this code I get an
UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
I also found out from the docs that it is not possible to use an XPath expression with the page.click function. Is there any way to achieve this?
It would also be fine if there were a function to get all the links from a page, but I couldn't find one in the docs.
To get a handle on all a-tags in an array:
const aTags= await page.$$('a')
Loop through them with:
for (const aTag of aTags) {...}
Inside the loop you can interact with each of these elementHandle separately.
Note that
await aTag.click()
will destroy (garbage collect) all elementHandles when the page context is navigated. In this case you need a workaround like loading the initial page inside a loop to always start with a fresh instance.
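If the goal is just to collect the link URLs rather than actually navigate to each page, a minimal sketch (assuming you only need the href values) avoids the destroyed-handle problem entirely by reading each anchor's href instead of clicking it:
module.exports.scrapeLinks = async (page) => {
    const aTags = await page.$$('a');
    const urlListArr = [];
    for (const aTag of aTags) {
        // Read the href property without navigating, so the element handles stay valid
        const href = await (await aTag.getProperty('href')).jsonValue();
        if (href) urlListArr.push(href);
    }
    return urlListArr;
};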