I'm running into trouble scraping items that don't have a single root element, which I believe x-ray requires.
Consider scraping Hacker News, where each headline is made up of two TRs:
<tbody>
<tr class="athing">content item 1</tr>
<tr>content item 1</tr>
<tr class="spacer"></tr>
<tr class="athing">content item 2</tr>
<tr>content item 2</tr>
<tr class="spacer"></tr>
</tbody>
As can be seen, there's no common root-node per item.
Does x-ray support scraping in such a case?
You could use the adjacent sibling combinator (+) to select the row that follows each tr.athing:
x(html, 'tbody', [
  'tr.athing, tr.athing+tr:not(.athing):not(.spacer)'
])(function (err, res) {
  console.log(res)
})
result:
[ 'content item 1a',
'content item 1b',
'content item 2a',
'content item 2b' ]
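Since the selector returns one flat list, the two rows of each headline still have to be grouped afterwards. A minimal sketch, assuming the rows always arrive in pairs as in the result above; `pairRows` and the `title`/`subtext` names are hypothetical and not part of x-ray:

```javascript
// Hypothetical helper (not part of x-ray): groups a flat list of row
// texts into { title, subtext } objects, one object per headline.
function pairRows(rows) {
  const items = [];
  for (let i = 0; i < rows.length; i += 2) {
    // With an odd number of rows, the last item has an undefined subtext.
    items.push({ title: rows[i], subtext: rows[i + 1] });
  }
  return items;
}

const res = [
  'content item 1a',
  'content item 1b',
  'content item 2a',
  'content item 2b'
];
console.log(pairRows(res));
```

This only works while every headline really has exactly two rows; if some items lack the second row, the pairing would drift.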
I'm building a web scraper with Node.js and Puppeteer.
Everything works fine, but now I'm stuck on how to get structured data from a table without classes.
I don't know how to iterate through the table and extract the data in JSON format. Here's an example of the table, followed by what the output should look like:
<table class="tableclass">
<tbody>
<tr>
<td>
<b>
<strong>
<span>A</span></strong> & B <strong><span>C</span></strong>Name</b>
</td>
<td >
Street No<br>
Zip City
</td>
<td >
Map | Website
</td>
</tr>
<tr>
<td>
<b>
<strong>
<span>A</span></strong> & B <strong><span>C</span></strong>Name</b>
</td>
<td >
Street No<br>
Zip City
</td>
<td >
Map | Website
</td>
</tr>
</tbody>
</table>
Obj ={
"content":[
{
"name":"A&B C Name",
"adress":[
"Street No",
"Zip",
"City"
],
"link":"http://www.websiteB.de"
},
]
}
Does the table have a consistent structure in each case? If so, you just need to figure out how to get to each element from the root of the table. For instance, to get the name, assuming that the above table structure is the same for all tables:
const table = document.querySelector('.tableclass')
Obj = {
  "content": [
    {
      "name": table.querySelectorAll('tr')[0].querySelectorAll('td')[0].innerText,
      ....
    }
  ]
}
Here, I get the table element I am interested in using document.querySelector('.tableclass') - which will return the first instance of .tableclass on the page. If you have multiple, you will have to use document.querySelectorAll and perform these operations on each table in a for-loop.
Then I use querySelectorAll, limited to this table, and grab the first <tr> element, because that's where the name is (table.querySelectorAll('tr')[0]). Here I could have just used table.querySelector('tr'), as I wanted the first element, but this is just to show you how you can access any of the <tr>s by their index. Finally, following the same logic, I select the first <td> element, as that is the element that contains all the 'name' text, and then I use its .innerText property to extract the text.
innerText will be your friend here: just traverse the DOM nodes using node.querySelector until you get to one that contains all the text you want and no more, then read the .innerText property on that node. If the table has a consistent structure, you should be able to figure this out for one table and it should work on all of them.
let data = await page.evaluate(() => {
var i = 0;
for (var i = 0; i < 5; i++) {
const table = document.querySelector('#tableclass');
let dealer = table.querySelectorAll('tr')[i].querySelectorAll('td')[0].innerText;
let adress = table.querySelectorAll('tr')[i].querySelectorAll('td')[1].innerText;
let link = table.querySelectorAll('tr')[i].querySelectorAll('td')[2].querySelectorAll('a')[1].getAttribute("href");
return {
dealer,
adress,
link
}
}
I want to loop through the table, i.e. each row in it. I know this is wrong, but I don't know how to loop in this case. Thanks for the help!
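One way to loop over the rows is to collect one record per <tr> and return the whole array at the end, instead of returning from inside the loop (which ends the function on the first row). A sketch under a few assumptions: the table is matched by the class selector '.tableclass' from the HTML above, the link is the second <a> in the third cell as in the attempt above, and `extractDealers` is a hypothetical name:

```javascript
// Hypothetical helper, meant to run in the page context, e.g.
//   const data = await page.evaluate(extractDealers, '.tableclass');
// `doc` defaults to the browser's document and is only a parameter so the
// function can also be exercised with a stand-in object.
function extractDealers(selector, doc = document) {
  const table = doc.querySelector(selector);
  if (!table) return [];

  const records = [];
  for (const row of table.querySelectorAll('tr')) {
    const cells = row.querySelectorAll('td');
    if (cells.length < 3) continue; // skip header or malformed rows

    const links = cells[2].querySelectorAll('a');
    records.push({
      dealer: cells[0].innerText.trim(),
      // innerText turns <br> into a newline, so split the address on lines
      adress: cells[1].innerText.split('\n').map((s) => s.trim()).filter(Boolean),
      // assumption: the website link is the second <a> in the third cell
      link: links.length > 1 ? links[1].getAttribute('href') : null,
    });
  }
  return records; // a plain array, so page.evaluate can serialize it
}
```

Note that this yields the address as lines ("Street No", "Zip City"); splitting the zip from the city would need an extra step that depends on how those values are formatted.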
I'm using node.js and puppeteer to get some data. How can I save the content of an element (which is divided by line break <br>) in two separate variables?
That's the HTML I'm looking at:
<table summary="">
<tbody>
<tr nowrap="nowrap" valign="top" align="left">
<td nowrap="nowrap">2018-08-14<br>16:35:41</td>
</tr>
</tbody>
</table>
I'm getting the content of the td like this (app.js):
let tableCell04;
let accepted;
tableCell04 = await page.$( 'body div table tr td' );
accepted = await page.evaluate( tableCell04 => tableCell04.innerText, tableCell04 );
console.log('Accepted: '+accepted);
The output in console is:
Accepted: 2018-08-14
16:35:41
But what I would like is to store the content that is separated by the line break in two separate variables, so that I get something like this:
Accepted_date: 2018-08-14
Accepted_time: 16:35:41
Hi, you can use tableCell04.innerHTML to get the HTML instead of the plain text, and then split it on the <br> tag:
accepted = await page.evaluate( tableCell04 => tableCell04.innerHTML, tableCell04 );
const [Accepted_date, Accepted_time] = accepted.split('<br>');
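Splitting on the literal '<br>' depends on how the browser serializes the tag. A slightly more tolerant sketch that accepts either the innerHTML or the innerText of the cell; `splitDateTime` is a hypothetical helper name:

```javascript
// Hypothetical helper: splits a cell's content on either a <br> tag
// (innerHTML) or a newline (innerText) and trims both parts.
function splitDateTime(text) {
  const [date = '', time = ''] = text
    .split(/<br\s*\/?>|\r?\n/i)
    .map((part) => part.trim());
  return { date, time };
}

const { date, time } = splitDateTime('2018-08-14<br>16:35:41');
console.log('Accepted_date: ' + date);
console.log('Accepted_time: ' + time);
```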
The item I am trying to access in the following HTML is "GMV DLL VERSION2"
<div class="container content">
<main>
<h2 id="rpcs--gmv-dll-version">RPCs → GMV DLL VERSION</h2>
<h3 id="vista-file-8994">VISTA File 8994</h3>
<table>
<thead>
<tr>
<th>property</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<td>label</td>
<td>GMV DLL VERSION2</td>
I am trying to scrape this website (http://vistadataproject.info/artifacts/vistaRPC%20documentation/GMV%20DLL%20VERSION)
and output it into a text file. I successfully did a test run with reddit.com. However, I cannot seem to get even a single element from this page. To test it, before even tackling the table, I've been trying to scrape some elements that appear early on (in the top area of the page).
The lack of classNames and Id in the tables is tricky enough, but not being able to get even the title text is really making me wonder what is going on. Any input will be appreciated.
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

request('http://vistadataproject.info/artifacts/vistaRPC%20documentation/GMV%20DLL%20VERSION', (err, res, body) => {
  if (err) {
    console.log('Error: ' + err);
  }
  console.log('Status: ' + res.statusCode);
  const $ = cheerio.load(body);
  $('header.masthead > div.container').each(( index, tr ) => {
    // var children = $(this).children();
    const tableData = $(this).find('a.logo').text();
    console.log("Table Contents: " + tableData);
    fs.appendFileSync('test.txt', tableData + '\n' + 'Captured');
  });
});
The problem is that 'masthead' is a class name, not an id. Same deal with 'container' and 'logo'. So you need to adjust your selector accordingly:
$('header.masthead > div.container').each(( index, tr ) => {
However, that only gets you the header information, which does not include the tables containing the 'property => value' data. For that information you just need to look for child tables under the '<main>' tag.
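A rough sketch of reading those 'property => value' rows, assuming `$` is the cheerio instance returned by cheerio.load(body) and that each data row has exactly two <td> cells as in the HTML above; `readPropertyTables` is a hypothetical name. It also uses the callback's element argument rather than `this`, because `this` is not bound to the element inside an arrow function:

```javascript
// Hypothetical helper: `$` is assumed to be a loaded cheerio instance.
// Collects every two-cell row under <main> into a plain object.
function readPropertyTables($) {
  const properties = {};
  $('main table tbody tr').each((index, tr) => {
    // Use the `tr` argument here; `this` is not the element in an arrow function.
    const cells = $(tr).find('td');
    const key = cells.eq(0).text().trim();
    const value = cells.eq(1).text().trim();
    if (key) properties[key] = value; // later tables overwrite duplicate keys
  });
  return properties;
}

// Usage inside the request callback, after cheerio.load(body):
//   const props = readPropertyTables($);
//   console.log(props.label);
```

If the page has several tables with the same property names, a list per table would be a safer structure than one flat object.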
I have a collection that has three fields: name(String), email(String) and appointments(Object)
I query the collection inside a router.get based on user email
router.get('/appointments', ensureAuthenticated, function(req, res, next) {
var query = {email: req.user.email};
PatientList.find(query, function(err, data){
if (err) throw err;
res.render('appointments', {data : data, title: 'Appointments'});
});
});
data from the query above looks like this:
{ __v: 0,
  name: 'dan',
  email: 'dan@gmail.com',
  appointments:
   [ { _id: 58373466542d6ae430a13337,
       position: '1',
       name: 'dan',
       email: 'dan@gmail.com',
       serviced: false,
       hospital: 'Toronto hospital',
       date: 'Thursday, November 24, 2016',
       __v: 0 },
     { _id: 5837346a542d6ae430a13339,
       position: '2',
       name: 'dan',
       email: 'dan@dan.com',
       serviced: false,
       hospital: 'Calgary hospital',
       date: 'Thursday, November 24, 2016',
       __v: 0 },
   ] }
I want to access these fields inside the appointments field and display them on a table using handlebars. My html with handlebars looks like this
<table class="table table-striped table-bordered">
<thead>
<tr>
<td>Clinic</td>
<td>Appointment Date</td>
<td>Patient Number</td>
<td>Clinic Queue Status</td>
</tr>
</thead>
<tbody>
{{#each data}}
<tr>
<td>{{this.appointments.hospital}}</td>
<td>{{this.appointments.date}}</td>
<td>{{this.appointments.position}}</td>
<td>{{#if this.appointments.serviced}}
Please return to the clinic immediately.
{{else}}
Patient is currently being served.
{{/if}}
</td>
</tr>
{{/each}}
</tbody>
</table>
But it is not printing anything in the table
Passing properties to your view (template) via res.render makes those properties available for use in your template.
You are passing a data property to your handlebars template, so you need to use that data variable inside it. It is an object that contains the appointments array.
In order to render the appointments properly, you need to use #each against data.appointments, like this:
{{#each data.appointments}}
<tr>
<td>{{hospital}}</td>
<td>{{date}}</td>
<td>{{position}}</td>
<td>{{#if serviced}}
Please return to the clinic immediately.
{{else}}
Patient is currently being served.
{{/if}}
</td>
</tr>
{{/each}}
Also keep in mind that inside an #each block the properties of the current array element are available as locals, so the this. prefix is no longer necessary.
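One caveat: Mongoose's find hands the callback an array of documents, so if data arrives as an array rather than the single object shown above, data.appointments will be undefined. A sketch of flattening the result before rendering; `collectAppointments` is a hypothetical helper and the `appointments` template variable is an assumption:

```javascript
// Hypothetical helper: accepts either one document or an array of documents
// and returns a single flat array of their appointments.
function collectAppointments(result) {
  const docs = Array.isArray(result) ? result : [result];
  return docs.flatMap((doc) => (doc && doc.appointments) || []);
}

// Possible use in the route (the template would then loop with
// {{#each appointments}}):
//   res.render('appointments', {
//     appointments: collectAppointments(data),
//     title: 'Appointments'
//   });
```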
I have a view with a table of products that can be added to a shopping cart. Each row has a DropDownList with allowed quantities that can be ordered, along with a button to add to cart. Everything is populating and displaying properly. I know how to pass the item ID in the ActionLink, but how can I get the value of the DropDownList associated with the table row of the ActionLink that was clicked?
I am guessing possibly using JQuery that fires when the ActionLink is clicked?
I also thought of making every row a form but that seems overkill.
Is there an easy MVC way to do this?
While prepping more info for a proper question, I went ahead and solved it. Thank you, Stephen, for the nudge and info.
I tried putting a Html.BeginForm around each <tr> tag in the details section. This did indeed work for me. I was able to easily get the unique form info to POST for each individual row. However, when I would enable JQuery DataTables the submit would break. DataTables must be capturing the submit or click somehow. Haven't figured that out but it made me try JQuery which seems a much better way to do it.
Here is how I construct the table data row:
@foreach (var item in Model)
{
    <tr>
        <td>
            <img src="@item.GetFrontImage()" width="100" />
        </td>
        <td>
            <strong>@Html.DisplayFor(modelItem => item.DisplayName)</strong>
        </td>
        <td>
            @Html.DisplayFor(modelItem => item.CustomerSKU)
        </td>
        <td>
            @Html.DropDownList("OrderQty", item.GetAllowedOrderQuantities(), htmlAttributes: new { @class = "form-control" })
        </td>
        <td>
            <a class="btn btn-default pull-right" data-id="@item.ID">Add to Cart</a>
        </td>
    </tr>
}
This creates a select with id of OrderQty and I embedded the item ID in data-id attribute of the link. I then used this JQuery to capture the info and POST it to my controller. Just have a test div displaying the results in this example:
// Add to Cart click
$('table .btn').click(function () {
    // Gather data for post
    var dataAddToCard = {
        ID: $(this).data('id'), // Get data-id attribute (Item ID)
        Quantity: $(this).parent().parent().find('select').val() // Get selected value of dropdown in same row as button that was clicked
    }
    // POST data to controller
    $.ajax({
        url: '@Url.Action("AddToCart","Shopping")',
        type: 'POST',
        data: JSON.stringify(dataAddToCard),
        contentType: 'application/json',
        success: function (data) { $('#Result').html(data.ID + ' ' + data.Quantity); }
    })
});
The jQuery handler receives a reference to the link being clicked, so I can extract the Item ID from its data-id attribute. I can then get a reference to the dropdown (select) in the same row by calling .parent().parent() (which gets me to the <tr> tag) and then finding the 'select' tag inside that row. Probably pretty obvious to a lot of you.
This works great for my purposes. I can also update other elements with data returned from the POST.
Thank you
Karl
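A small variation on the approach above: .parent().parent() depends on the link sitting exactly two levels below the row, whereas jQuery's .closest('tr') walks up to the nearest enclosing row regardless of nesting. A sketch, where `readCartRow` is a hypothetical helper and `$link` is assumed to be the jQuery-wrapped link that was clicked:

```javascript
// Hypothetical helper: `$link` is assumed to be the jQuery-wrapped
// "Add to Cart" link, e.g. $(this) inside the click handler.
function readCartRow($link) {
  const $row = $link.closest('tr'); // nearest enclosing table row
  return {
    ID: $link.data('id'),                // value of the data-id attribute
    Quantity: $row.find('select').val()  // selected value of the row's dropdown
  };
}

// Possible use inside the click handler:
//   var dataAddToCard = readCartRow($(this));
```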
For the table in HTML:
<div class="table-responsive">
    <table id="employeeTable" class="table table-bordered">
        <thead>
            <tr>
                <th class="text-center">EmpId</th>
                <th class="text-center">Name</th>
                <th class="text-center">Absence State</th>
            </tr>
        </thead>
        <tbody>
            @foreach (var item in Model)
            {
                <tr>
                    <td>@item.Id</td>
                    <td>@item.Name</td>
                    <td class="text-center">@Html.DropDownList("DDL_AbsentStatus", new SelectList(ViewBag.statusList, "Id", "Name"), new { @class = "form-control text-center" })</td>
                </tr>
            }
        </tbody>
    </table>
</div>
In JavaScript, to get the selected value:
// Collect data to pass to the controller
$("#btn_save").click(function (e) {
e.preventDefault();
if ($.trim($("#datepicker1").val()) == "") {
alert("Please enter a valid date!")
return;
}
var employeesArr = [];
employeesArr.length = 0;
$.each($("#employeeTable tbody tr"), function () {
employeesArr.push({
EmpId: $(this).find('td:eq(0)').html(),
EntryDate: $.trim($("#datepicker1").val()),
StatusId: $(this).find('#DDL_AbsentStatus').val()
});
});
$.ajax({
url: '/Home/SaveAbsentState',
type: "POST",
dataType: "json",
data: JSON.stringify(employeesArr),
contentType: 'application/json; charset=utf-8',
success: function (result) {
alert(result);
emptyItems();
},
error: function (err) {
alert(err.statusText);
}
});
})