How to efficiently do web scraping in Node.js? - node.js

I am trying to scrape some data from the shopping site Express.com. Here is one of many products; each contains an image, price, title, and color(s).
<div class="cat-thu-product cat-thu-product-all item-1">
<div class="cat-thu-p-cont reg-thumb" id="p-50715" style="position: relative;"><img class="cat-thu-p-ima widget-app-quickview" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900/i81?$dcat191$" alt="ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN"><img id="widget-quickview-but" class="widget-ie6png glo-but-css-off2" src="/assets/images/but/cat/but-cat-quickview.png" alt="Express View" style="position: absolute; left: 50px;"></div>
<ul>
<li class="cat-cat-more-colors">
<div class="productId-50715">
<img class="js-swatchLinkQuickview" title="INK BLUE" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900_s/i81?$swatch$" width="16" height="6" alt="INK BLUE">
<img class="js-swatchLinkQuickview" title="GRAPHITE" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_924_s/i81?$swatch$" width="16" height="6" alt="GRAPHITE">
<img class="js-swatchLinkQuickview" title="MERCURY GRAY" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_930_s/i81?$swatch$" width="16" height="6" alt="MERCURY GRAY">
<img class="js-swatchLinkQuickview" title="HARVARD RED" src="http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_853_s/i81?$swatch$" width="16" height="6" alt="HARVARD RED">
</div>
</li>
<li class="cat-thu-name"><a href="/rocco-slim-fit-skinny-leg-corduroy-jean-50715-647/control/show/3/index.pro" onclick="var x=".tl(";s_objectID="http://www.express.com/rocco-slim-fit-skinny-leg-corduroy-jean-50715-647/control/show/3/index.pro_2";return this.s_oc?this.s_oc(e):true">ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN
</a></li>
<li>
<strong>$88.00</strong>
</li>
<li class="cat-thu-promo-text"><font color="BLACK" style="font-weight:normal">Buy 1, Get 1 50% Off</font>
</li>
</ul>
The very naive and possibly error-prone approach I've taken is to first grab all prices, images, titles and colors:
var price_objects = $('.cat-thu-product li strong');
var image_objects = $('.cat-thu-p-ima');
var name_objects = $('.cat-thu-name a');
var color_objects = $('.cat-cat-more-colors div');
Next, I populate arrays with data extracted from the DOM using the jsdom or cheerio scraping libraries for Node.js (cheerio in this case).
// price info
for (var i = 0; i < price_objects.length; i++) {
prices.push(price_objects[i].children[0].data);
}
// image links
for (var i = 0; i < image_objects.length; i++) {
images.push(image_objects[i].attribs.src.slice(0, -10));
}
// name info
for (var i = 0; i < name_objects.length; i++) {
names.push(name_objects[i].children[0].data);
}
// color info
for (var i = 0; i < color_objects.length; i++) {
colors.push(color_objects[i].attribs.src);
}
Lastly, on the assumption that price, title, image and colors will line up by index, I create a product object:
for (var i = 0; i < images.length; i++) {
items.push({
id: i,
name: names[i],
price: prices[i],
image: images[i],
colors: colors[i]
});
}
This method is slow, error-prone, and very anti-DRY. I was thinking it would be nicer to grab $('.cat-thu-product') and, using a single for-loop, extract the relevant information one product at a time.
But have you ever tried traversing the DOM in jsdom or cheerio? I am not sure how anyone can even comprehend it. Could someone show how I would use this proposed method of scraping: grab each $('.cat-thu-product') div element containing all the relevant information and then extract the necessary data?
Or perhaps there is a better way to do this?

I would still suggest using jQuery-style selectors (they are easy, fast and secure), with a single .each loop, for example:
var items = [];
$('div.cat-thu-product').each(function(index, productElement) {
  var product = {
    id: $('div.cat-thu-p-cont', productElement).attr('id'),
    name: $('li.cat-thu-name a', productElement).text().trim(),
    price: $('ul li strong', productElement).text(),
    image: $('.cat-thu-p-ima', productElement).attr('src'),
    colors: []
  };
  // Adding colors array
  $('.cat-cat-more-colors div img', productElement).each(function(index, colorElement) {
    product.colors.push({name: $(colorElement).attr('alt'), imageUrl: $(colorElement).attr('src')});
  });
  items.push(product);
});
console.log(items);
To validate that you have all the required fields, you can easily write a validator or a test. Even if you are using a different library, you should still loop through the "div.cat-thu-product" elements.
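A minimal sketch of such a validator, assuming the product shape built in the .each loop (id, name, price, image, colors). The name validateProduct and the price pattern are illustrative assumptions, not part of cheerio or jQuery:

```javascript
// Hypothetical helper: returns a list of problems found in one scraped product.
// Field names follow the object built in the .each() loop; the price pattern
// (a dollar sign, digits, optional cents) is an assumption about this site.
function validateProduct(product) {
  var problems = [];
  ['id', 'name', 'price', 'image'].forEach(function (field) {
    if (typeof product[field] !== 'string' || product[field].trim() === '') {
      problems.push('missing ' + field);
    }
  });
  var price = typeof product.price === 'string' ? product.price.trim() : '';
  if (price !== '' && !/^\$\d+(\.\d{2})?$/.test(price)) {
    problems.push('unexpected price format: ' + price);
  }
  if (!Array.isArray(product.colors) || product.colors.length === 0) {
    problems.push('no colors');
  }
  return problems;
}

// Sample built from the product markup shown in the question.
var sample = {
  id: 'p-50715',
  name: 'ROCCO SLIM FIT SKINNY LEG CORDUROY JEAN',
  price: '$88.00',
  image: 'http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900/i81?$dcat191$',
  colors: [{
    name: 'INK BLUE',
    imageUrl: 'http://t.express.com/com/scene7/s7d5/=/is/image/expressfashion/25_323_2516_900_s/i81?$swatch$'
  }]
};
console.log(validateProduct(sample)); // prints []
```

Running every scraped item through a check like this makes silent selector breakage visible: items.filter(function (p) { return validateProduct(p).length > 0; }) lists the products that need attention.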

Try node.io: https://github.com/chriso/node.io/wiki
It is a good fit for what you are trying to do.

Using https://github.com/rc0x03/node-promise-parser:
var products = [];
pp('website.com/products')
.find('div.cat-thu-product')
.set({
'id': 'div.cat-thu-p-cont #id',
'name': 'li.cat-thu-name a',
'price': 'ul li strong',
'image': '.cat-thu-p-ima',
'colors[]': '.cat-cat-more-colors div img #alt',
})
.get(function(product) {
console.log(product);
products.push(product);
})

Related

NetSuite: How can I add a record to an Advanced PDF/HTML Template?

I know that I can use N/render to generate a template, and that I can use addRecord to add record objects to the print template so they are available in the FTL.
My question is whether I can do something similar when the native print button is clicked and prints an Advanced PDF/HTML Template. I know that I can catch the PRINT event in the User Event script, but beyond that I am stuck.
I know the question is a little general; I will add context on request. I just don't know which way to go.
EDIT: I am familiar with the option of adding a custpage field to the form and then extracting the JSON in the FTL.
In this specific situation it would be much more convenient if I could simply add a full record. That is, I am on an Item Fulfillment print and want to add the FULL parent Sales Order record to the print so that I can access it in the FTL as salesorder.memo, etc. Something similar to:
require(['N/render', 'N/record'], function(render, record) {
var renderer = render.create();
renderer.addRecord('customer', record.load({ type: record.Type.CUSTOMER, id: customer }));
})
The issue is that I only know how to do this for completely custom prints but not prints that are printed from the Native print buttons on transactions.
I need this to do line matching from the Sales Order lines to the Item Fulfillment lines and would rather do it this way if possible instead of creating a custpage and inserting a custom made object.
I refer to one of my previous answers.
Use the beforeLoad hook in a UserEventScript to set extra data on the context.form. You'll be able to access this data on the template.
/**
* @NApiVersion 2.x
* @NScriptType UserEventScript
*/
define(['N/ui/serverWidget'], function(serverWidget) {
function beforeLoad(context) {
// var request = context.request;
// var newRecord = context.newRecord;
var form = context.form;
var type = context.type;
var UserEventType = context.UserEventType;
// only execute during printing...
if (type != UserEventType.PRINT) return
var customData = {
hello: 'world'
}
var field = form.addField({
id : 'custpage_custom_data',
label: 'Custom Data',
type : serverWidget.FieldType.LONGTEXT
});
field.defaultValue = JSON.stringify(customData);
}
return {
beforeLoad: beforeLoad
};
})
You can access the data within the template through:
<#if record.custpage_custom_data?has_content>
<#assign custom_data = record.custpage_custom_data?eval />
</#if>
As I understand your question, you also want to add the item sublist data from the Sales Order to the Item Fulfillment print. If so, here is what I used in the same situation.
Steps:
Write a User Event beforeLoad script that runs in print mode only. In it, create a saved search to get the item data and store the result in a custom long-text field with a space as its label.
Customize the standard PDF template that is attached to the Item Fulfillment record.
Go to Customization > Forms > Advanced PDF/HTML Templates and customize the preferred template for Item Fulfillment.
Add a table there that reads from that custom field.
This works with the standard print button. I did it for the Work Order record; you can adapt the search to use a Sales Order saved search instead.
UserEvent
/**
* @NApiVersion 2.x
* @NScriptType UserEventScript
*/
define(['N/record', 'N/search', 'N/ui/serverWidget'], function (record, search, serverWidget) {
function beforeLoad(scriptContext) {
try {
if (scriptContext.type == 'print') {
var currentRec = scriptContext.newRecord;
var recid = currentRec.id;
// Search columns and result rows are collected in these arrays.
var columns = [], results = [];
columns[0] = search.createColumn({
name: "sequence",
join: "manufacturingOperationTask",
sort: search.Sort.ASC,
label: "Operation Sequence"
});
columns[1] = search.createColumn({
name: "custevent_custom_op_name",
join: "manufacturingOperationTask",
label: "Operation Name(Instruction)"
});
columns[2] = search.createColumn({
name: "manufacturingworkcenter",
join: "manufacturingOperationTask",
label: "Manufacturing Work Center"
});
columns[3] = search.createColumn({
name: "formulanumeric",
formula: "Round({manufacturingoperationtask.runrate}*{quantity}/60,2)",
label: "BudgetHours"
});
//Creating search to get all the values for work order
var mySearch = search.create({
type: "workorder",
filters:
[
["type", "anyof", "WorkOrd"],
"AND",
["internalid", "anyof", recid],
"AND",
["mainline", "is", "T"]
],
columns: columns
});
var searchResultCount = mySearch.runPaged().count;
mySearch.run().each(function (result) {
// .run().each has a limit of 4,000 results
results.push(result);
return true;
});
//populate current printout with custom record entries
var customRecords = { columns: columns, results: results };
var columns = customRecords.columns, results = customRecords.results;
var custrecord = scriptContext.form.addField({ id: 'custpage_custrecord_to_print', type: serverWidget.FieldType.LONGTEXT, label: " " }),
custrecordArray = [];
if (results && results instanceof Array) {
for (var i = 0; i < results.length; i++) {
var singleLine = {};
for (var j = 0; j < columns.length; j++) {
if (j == 2) {
var value = results[i].getText(columns[j]);
} else {
var value = results[i].getValue(columns[j]);
}
if (j == 0 || j == 1 || j == 2) {
if (value.indexOf('.') == 0 || value.indexOf(',') == 0 || value.indexOf('-.') == 0 || value.indexOf('-,') == 0) {
value = '0' + value;
}
}
singleLine["col" + j] = (value) ? value : '';
}
custrecordArray.push(singleLine);
}
custrecord.defaultValue = JSON.stringify(custrecordArray);
}
}
} catch (e) {
log.error("ERROR", e);
}
}
return {
beforeLoad: beforeLoad,
};
});
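The row-flattening step in the script above can be isolated as a plain function, which makes it possible to try it outside NetSuite. This is only a sketch: flattenResults and fakeResult are hypothetical names, and the stand-in result objects only imitate the getValue/getText methods used above.

```javascript
// Hypothetical helper: flattens search results into [{col0: ..., col1: ...}, ...].
// `results` only need getValue(column) and getText(column), as used in the script above.
// `textColumns` lists the column indexes whose display text is wanted instead of the raw value.
function flattenResults(results, columns, textColumns) {
  return results.map(function (result) {
    var line = {};
    columns.forEach(function (column, j) {
      var useText = textColumns.indexOf(j) !== -1;
      var value = useText ? result.getText(column) : result.getValue(column);
      line['col' + j] = value ? value : '';
    });
    return line;
  });
}

// Stand-in result object for trying the helper outside NetSuite (an assumption, not a NetSuite API).
function fakeResult(values, texts) {
  return {
    getValue: function (column) { return values[column.name]; },
    getText: function (column) { return texts[column.name]; }
  };
}

var cols = [{ name: 'sequence' }, { name: 'manufacturingworkcenter' }];
var rows = [
  fakeResult({ sequence: '10', manufacturingworkcenter: '7' }, { manufacturingworkcenter: 'Assembly' })
];
console.log(JSON.stringify(flattenResults(rows, cols, [1])));
// prints [{"col0":"10","col1":"Assembly"}]
```

In the User Event script the return value would be passed to JSON.stringify and assigned to the field's defaultValue, as the original loop does.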
In the Advanced PDF template:
<#if record.custpage_custrecord_to_print?has_content>
<#assign customrecord = record.custpage_custrecord_to_print?eval />
<table width="100%" class="second_table" style="page-break-inside: auto; width: 100%; margin-top: 2px; padding-top: 0px">
<#list customrecord as customrecord_line>
<tr width="100%" border-top="solid black" margin-top="10px" style="margin-top:10px; page-break-before: auto;">
<th width="25%" align="left" style="padding: 2px 2px;">Step</th>
<th width="25%" align="center" style="padding: 2px 2px;">Activity</th>
<th width="25%" align="center" style="padding: 2px 2px;">Run Rate(Min/Unit)</th>
<th width="25%" align="center" style="padding: 2px 2px;">BudgetHours</th></tr>
<tr width="100%" style="page-break-inside: auto;">
<td width="25%" align="left" style="padding: 2px 2px;">0${customrecord_line.col0}</td>
<td width="25%" align="center" style="padding: 2px 2px;">${customrecord_line.col2}</td>
<td width="25%" align="center" style="padding: 2px 2px;">${customrecord_line.col3}</td>
<td width="25%" align="center" style="padding: 2px 2px;">${customrecord_line.col4}</td>
</tr>
</#list>
</table>
</#if>
It will be helpful.
Thanks,

how to loop over divs with the same class with Puppeteer

Using Puppeteer to scrape a page, I'm able to get the contents of a list of divs with the same class, and of the nested list of divs within those, i.e.
<div class="parent">
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
<div class="child"></div>
</div>
<div class="parent">
<div class="child"></div>
...
</div>
...
Now my problem is that I need to iterate over the list again and run page.click() on the child class divs to open lightboxes, select an element in the lightbox to click, and then run page.pdf() on it.
I currently have a for loop over the parent class divs and an inner for loop over the child class divs. I'm not sure how to select the right div using the loop index values, as there is no nth-of-class selector.
I simply want to run something like
for (let a = 0; a < data.length; a++) {
for (let b = 0; b < data[a].length; b++) {
await page.click('.parent[a] .child[b]');
// other code here...
}
}
to open the lightbox, then a
await page.waitForSelector('.ReactModal')
to scrape the lightbox HTML, and then run
await page.pdf({
path: dir + "/"+ filename,
format: 'A4'
});
Any guidance on possible approaches would be appreciated.
If I understand correctly, you can try something like this:
for (const parent of await page.$$('.parent')) {
  for (const child of await parent.$$('.child')) {
    await child.click();
    await page.waitForSelector('.ReactModal'); // maybe check if this is not the same lightbox
    await page.pdf(/*...*/);
  }
}
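If the loop indexes are still needed, for example to give each PDF its own file name, the same element-handle approach can be written with plain for loops. The following is a sketch only: pdfEveryChild, outDir and the file-name pattern are illustrative assumptions, and `page` is assumed to be a Puppeteer Page that has already navigated to the target URL.

```javascript
// Sketch: visit every .child inside every .parent, open its lightbox, save a PDF.
// The selectors come from the question; outDir and the file names are assumptions.
async function pdfEveryChild(page, outDir) {
  const saved = [];
  const parents = await page.$$('.parent');
  for (let a = 0; a < parents.length; a++) {
    // Query children relative to this parent handle, so no nth-style selector is needed.
    const children = await parents[a].$$('.child');
    for (let b = 0; b < children.length; b++) {
      await children[b].click();
      await page.waitForSelector('.ReactModal');
      const path = outDir + '/parent-' + a + '-child-' + b + '.pdf';
      await page.pdf({ path: path, format: 'A4' });
      saved.push(path);
      // Closing the lightbox before the next click is site-specific and omitted here.
    }
  }
  return saved;
}
```

It would be called as await pdfEveryChild(page, dir) and resolves to the list of written paths. If the click re-renders the list, the handles collected earlier may go stale, in which case they would need to be queried again inside the loop.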

How to implement dynamic flex value with Angular Material

I'm trying to build a div with an unknown number of child divs. I want the child div to have flex="100" when there is only one of them, so that it takes the entire row. If there is more than one child div (even three or four), they should all have exactly flex="50", so that each takes half of the row.
Any idea how I could do that?
Thanks in advance.
Another way is <div flex="{{::flexSize}}"></div> and in controller define and modify flexSize e.g $scope.flexSize = 50;
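A minimal sketch of how the controller could derive that value from the number of children, following the rule in the question (100 for a single child, 50 otherwise). The helper name flexSizeFor and the array name children are assumptions for illustration:

```javascript
// Hypothetical helper: 100 when there is exactly one child, otherwise 50.
function flexSizeFor(childCount) {
  return childCount === 1 ? 100 : 50;
}

// In a controller this could be wired up roughly as:
//   $scope.flexSize = flexSizeFor($scope.children.length);
console.log(flexSizeFor(1)); // prints 100
console.log(flexSizeFor(3)); // prints 50
```

Note that :: is a one-time binding, so the attribute will not follow later changes to the number of children; if the list can grow or shrink after the first render, binding without :: is likely the safer choice.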
Thanks for the help from @William S: I shouldn't use flexbox for a static-size layout.
So I used ng-class to solve my problem.
HTML:
<div flex layout-fill layout="column" layout-wrap>
<div ng-repeat="function in functions" ng-class="test" class="card-container">
<md-card>
contents
</md-card>
</div>
</div>
My CSS is like the following:
.test1 {
width: 100%;
}
.test2 {
width: 50%;
}
The initial value of $scope.test is 'test1'. Changing the value from 'test1' to 'test2' sets the width of the child divs to 50%.
Defining "flex" on an element without supplying a value makes it take up as much space as the parent allows. It needs to be accompanied by layout="row" or layout="column" on the parent, so that flex expands in the right direction.
Is this what you're looking for?
http://codepen.io/qvazzler/pen/LVJZpR
Note that eventually, the items will start growing outside the parent size. You can solve this in several ways, but I believe this is outside the scope of the question.
HTML
<div ng-app="MyApp" ng-controller="AppCtrl as ctrl">
<h2>Static height list with Flex</h2>
<md-button class="thebutton" ng-click="addItem()">Add Item</md-button>
<div class="page" layout="row">
<md-list layout-fill layout="column">
<md-list-item flex layout="column" class="listitem" ng-repeat="item in items" ng-click="logme('Unrelated action')">
<md-item-content layout="column" md-ink-ripple flex>
<div class="inset">
{{item.title}}
</div>
</md-item-content>
</md-list-item>
</md-list>
</div>
</div>
CSS
.page {
height: 300px; /* Or whatever */
}
.thebutton {
background-color: gray;
}
.listitem {
/*height: 100px;*/
flex;
background-color: pink;
}
JS
angular.module('MyApp', ['ngMaterial'])
.controller('AppCtrl', function($scope) {
$scope.items = [{
title: 'Item One',
active: true
} ];
$scope.addItem = function() {
var newitem = {
title: 'Another item',
active: false
};
$scope.items.push(newitem);
}
$scope.toggle = function(item) {
item.active = !item.active;
console.log("bool toggled");
}
$scope.logme = function(text) {
alert(text);
console.log(text);
}
});
Use the following:
scope.flexSize = countryService.market === 'FR'? 40: 50;
<div flex="{{::flexSize}}">

How to get children elements

I have the following html that I am trying to verify:
<div class="project-counts-nav-wrapper" style="">
<ul id="project-counts" class="project-counts-nav" style="">
<li style="">0 active projects</li>
<li style="">0 draft projects</li>
<li style="">
0 archived projects (
View
)
</li>
</ul>
</div>
How do I get text value for each 'li' tag element?
I have the following Intern js code:
.findByCssSelector('#project-counts')
.findAllByTagName('li')
.then(function(children){
console.info('Show children length: ' + children.length);
console.info('children: ' + children);
for(var i=0; i<3; i++){
// how do I get text for each element
console.info('Show children: ' + children[i]);
}
}).end();
I see the following from output:
Show children length: 3
children: [object Object],[object Object],[object Object]
Show children: [object Object]
Show children: [object Object]
Show children: [object Object]
I'd like to get:
0 active projects
0 draft projects
0 archived projects
Thanks,
Brad
The following should work for you:-
var liElements = document.getElementsByTagName('li'),
i;
for (i = 0; i < liElements.length; i++) {
console.info(liElements[i].textContent);
}
The above should print just the text content as desired. Note - you would want to go ahead and put in class names on your li and get the Elements using that so that you do not pull up all the li on the page.
Here is the solution. I chained the getVisibleText command after the findAllByTagName command:
.findByCssSelector('#project-counts')
.findAllByTagName('li')
.getVisibleText()
.then(function(children){
console.info('Show children length: ' + children.length);
console.info('children: ' + children);
for(var i=0; i < children.length; i++){
console.info('Show each child: ' + children[i]);
}
}).end();
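The third item also contains the "View" link, so its visible text will probably include the parenthesised part. A small sketch of trimming each string down to the desired output; the helper name countLabel and the exact input strings are assumptions about what getVisibleText returns for this markup:

```javascript
// Hypothetical helper: collapse whitespace and drop a trailing "( ... )" part,
// so "0 archived projects (\nView\n)" becomes "0 archived projects".
function countLabel(text) {
  return text.replace(/\s+/g, ' ').replace(/\s*\([^)]*\)\s*$/, '').trim();
}

var texts = ['0 active projects', '0 draft projects', '0 archived projects (\nView\n)'];
texts.map(countLabel).forEach(function (label) {
  console.log(label);
});
// prints:
// 0 active projects
// 0 draft projects
// 0 archived projects
```

Inside the .then callback above, children.map(countLabel) would give the cleaned list.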

Using jade to create an unordered list tree

I have an object called boundedArea which contains an array of boundedArea objects in a field children and I would like to create a tree of unordered lists.
I have the following code:
- for (var index = 0; index < rootAreas.length; index++) {
- var boundedArea = rootAreas[index];
div(class='panel panel-default')
div.panel-heading
| #{boundedArea.NAME}
div.panel-body
- printChildren(boundedArea, 0);
- }
-
- function printChildren(boundedArea, depth) {
- var children = boundedArea.children;
- if (children == null || children.length == 0) {
- return;
- }
ul
- for (var index = 0; index < children.length; index++) {
- var child = children[index];
li #{child.NAME}
- console.log("Printing %s child of %s", child.NAME, boundedArea.NAME);
- printChildren(child, depth + 1);
- }
- }
Now obviously this sort of works, in that it prints out all the values. However, because the ul and li tags sit at a fixed indentation, they do not nest and just end up printing sequentially.
Is there any way to dynamically set the level of indentation or to force these to nest? Or should I be using a completely different model of nesting altogether?
I tried creating a JavaScript variable indent filled with two spaces for each depth level and then tried to use #{indent}, but that just ended up creating tags with spaces in them, which was not what I wanted. It does suggest that something along these lines could work, though, since the indentation must be resolved at some stage before it is picked up as a token of some kind.
Try using a mixin instead of a function. Mixins respect/remember the level of indentation (not really sure why functions don't).
mixin printChildren(boundedArea, depth)
  - var children = boundedArea.children;
  - if (children == null || children.length == 0)
    - return;
  ul
    - for (var index = 0; index < children.length; index++)
      - var child = children[index];
      li #{child.NAME}
        +printChildren(child, depth + 1)
- for (var index = 0; index < rootAreas.length; index++)
  - var boundedArea = rootAreas[index];
  div(class='panel panel-default')
    div.panel-heading
      | #{boundedArea.NAME}
    div.panel-body
      +printChildren(boundedArea, 0)
I tweaked your code a bit. Mixins are invoked using a + instead of a - and they need to be defined before they are used.
I tested it with this sample data:
{
rootAreas: [
{
NAME: 'area1',
children: [
{ NAME: 'child1' },
{ NAME: 'child2' },
{
children: [
{ NAME: 'child3' },
{ NAME: 'child4' },
]
},
]
},
{
NAME: 'area2',
children: [
{ NAME: 'child5' },
{ NAME: 'child6' },
{ NAME: 'child7' },
]
}
]
}
And the template yielded this HTML code:
<div class="panel panel-default">
<div class="panel-heading">area1</div>
<div class="panel-body">
<ul>
<li>child1</li>
<li>child2</li>
<li>
<ul>
<li>child3</li>
<li>child4</li>
</ul>
</li>
</ul>
</div>
</div>
<div class="panel panel-default">
<div class="panel-heading">area2</div>
<div class="panel-body">
<ul>
<li>child5</li>
<li>child6</li>
<li>child7</li>
</ul>
</div>
</div>
If I understood you correctly, this is what you're looking for.
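For comparison, the recursion that the mixin performs can be mirrored in plain JavaScript, which can be handy for checking the expected markup in a unit test. This is a sketch, not part of Jade: renderChildren and escapeHtml are hypothetical helpers, and the output is unindented HTML.

```javascript
// Minimal escaping for text nodes (an illustrative helper, not a full sanitizer).
function escapeHtml(text) {
  return String(text).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// Hypothetical helper mirroring the mixin: nested <ul>/<li> markup for one area.
function renderChildren(area) {
  var children = area.children;
  if (children == null || children.length === 0) {
    return '';
  }
  var items = children.map(function (child) {
    var name = child.NAME == null ? '' : escapeHtml(child.NAME);
    // The nested list goes inside the <li>, which is what produces real nesting.
    return '<li>' + name + renderChildren(child) + '</li>';
  });
  return '<ul>' + items.join('') + '</ul>';
}

// First root area from the sample data above.
var area1 = {
  NAME: 'area1',
  children: [
    { NAME: 'child1' },
    { NAME: 'child2' },
    { children: [{ NAME: 'child3' }, { NAME: 'child4' }] }
  ]
};
console.log(renderChildren(area1));
```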

Resources