I'm building a web scraper with Node.js and Puppeteer.
Everything works fine, but now I'm stuck on how to get structured data from a table whose rows and cells have no classes. I don't know how to iterate through the table and extract the data in JSON format. Here's an example table, followed by the output I'm after:
<table class="tableclass">
<tbody>
<tr>
<td>
<b>
<strong>
<span>A</span></strong> & B <strong><span>C</span></strong>Name</b>
</td>
<td >
Street No<br>
Zip City
</td>
<td >
Map | Website
</td>
</tr>
<tr>
<td>
<b>
<strong>
<span>A</span></strong> & B <strong><span>C</span></strong>Name</b>
</td>
<td >
Street No<br>
Zip City
</td>
<td >
Map | Website
</td>
</tr>
</table>
Obj ={
"content":[
{
"name":"A&B C Name",
"adress":[
"Street No",
"Zip",
"City"
],
"link":"http://www.websiteB.de"
},
]
}
Does the table have a consistent structure in each case? If so, you just need to figure out how to get to each element from the root of the table. For instance, to get the name, assuming that the above table structure is the same for all tables:
const table = document.querySelector('.tableclass');
const Obj = {
  "content": [
    {
      // First row, first cell: the cell that holds the name
      "name": table.querySelectorAll('tr')[0].querySelectorAll('td')[0].innerText,
      // ....
    }
  ]
};
Here, I get the table element I am interested in using document.querySelector('.tableclass') - which will return the first instance of .tableclass on the page. If you have multiple, you will have to use document.querySelectorAll and perform these operations on each table in a for-loop.
Then, I use querySelectorAll but limited to this table, and I grab the first <tr> element, because that's where the name is (table.querySelectorAll('tr')[0]). Here I could have just used table.querySelector('tr'), as I wanted the first element, but this is just to show you how you can access any of the <tr>s by their index. Finally, following the same logic, I need to select the first <td> element, as that is the element that contains all the 'name' text; then I just use its .innerText property to extract the text.
innerText will be your friend here: just traverse the DOM nodes using node.querySelector until you get to one that contains all the text you want and no more, then read the .innerText property on that node. If the table has a consistent structure, you should be able to figure this out for one table and it should work on all of them.
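To get from a cell's innerText to the nested shape in the question, a small helper can do the string work. This is only a sketch: `parseAddress` and `toRecord` are hypothetical names, and it assumes the address cell reads "Street No" on the first line and "Zip City" on the second, with the zip being the first whitespace-separated token.

```javascript
// Hypothetical helper: split an address cell's innerText into [street, zip, city].
// Assumes line 1 is the street and line 2 is "<zip> <city>".
function parseAddress(cellText) {
  const lines = cellText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const street = lines[0] || '';
  const zipCity = lines[1] || '';
  // Split on the first whitespace run only, so multi-word city names stay whole.
  const match = zipCity.match(/^(\S+)\s+(.+)$/);
  return match ? [street, match[1], match[2]] : [street, zipCity];
}

// Hypothetical helper: build one entry of Obj.content from the cell values.
function toRecord(nameText, addressText, href) {
  return {
    name: nameText.replace(/\s+/g, ' ').trim(), // collapse newlines and repeated spaces
    adress: parseAddress(addressText),          // key spelled as in the question
    link: href
  };
}

console.log(toRecord(' A & B C Name\n', 'Street No\nZip City', 'http://www.websiteB.de'));
```

Because Puppeteer serializes the function passed to page.evaluate, helpers like these would either have to be defined inside that function or be applied in Node to the strings it returns.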
let data = await page.evaluate(() => {
  var i = 0;
  for (var i = 0; i < 5; i++) {
    const table = document.querySelector('#tableclass');
    let dealer = table.querySelectorAll('tr')[i].querySelectorAll('td')[0].innerText;
    let adress = table.querySelectorAll('tr')[i].querySelectorAll('td')[1].innerText;
    let link = table.querySelectorAll('tr')[i].querySelectorAll('td')[2].querySelectorAll('a')[1].getAttribute("href");
    return {
      dealer,
      adress,
      link
    }
  }
});
I want to loop through the table, i.e. through each row in it. I know this is wrong, but I don't know how to loop in this case. Thanks for the help!
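One way to loop over every row is to let Puppeteer hand all matching rows to a single callback with page.$$eval and map each row to an object, instead of returning from inside a for loop (which ends after the first row). The following is a sketch under assumptions, not a tested scraper: `scrapeRows` is a hypothetical name, the selector uses `.tableclass` because the sample markup has class="tableclass" (the snippet above queries `#tableclass`, which is an id selector), and the website link is assumed to be the second <a> in the third cell.

```javascript
// Hypothetical helper; `page` is an already-opened Puppeteer Page.
async function scrapeRows(page) {
  // $$eval runs the callback in the browser with an array of all matching rows.
  const content = await page.$$eval('.tableclass tr', (rows) =>
    rows.map((row) => {
      const cells = row.querySelectorAll('td');
      // Assumption: "Map | Website" are two links and the second one is the website.
      const anchors = cells[2] ? cells[2].querySelectorAll('a') : [];
      return {
        dealer: cells[0] ? cells[0].innerText.trim() : '',
        adress: cells[1] ? cells[1].innerText.trim() : '',
        link: anchors[1] ? anchors[1].getAttribute('href') : null
      };
    })
  );
  return { content };
}
```

Calling `const data = await scrapeRows(page);` after navigating to the page would then give the `{ content: [...] }` shape. Rows without <td> cells come back with empty strings and can be filtered out afterwards.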
I'm using node.js and puppeteer to get some data. How can I save the content of an element (which is divided by line break <br>) in two separate variables?
That's the HTML I'm looking at:
<table summary="">
<tbody>
<tr nowrap="nowrap" valign="top" align="left">
<td nowrap="nowrap">2018-08-14<br>16:35:41</td>
</tr>
</tbody>
</table>
I'm getting the content of the td like this (app.js):
let tableCell04;
let accepted;
tableCell04 = await page.$( 'body div table tr td' );
accepted = await page.evaluate( tableCell04 => tableCell04.innerText, tableCell04 );
console.log('Accepted: '+accepted);
The output in console is:
Accepted: 2018-08-14
16:35:41
But what I would like is to store the content that is separated by the line break in two separate variables, so that I get something like this:
Accepted_date: 2018-08-14
Accepted_time: 16:35:41
Hi, you can use tableCell04.innerHTML to get the HTML instead of the plain text, and then split it on the <br> tag:
accepted = await page.evaluate( tableCell04 => tableCell04.innerHTML, tableCell04 );
const [Accepted_date, Accepted_time] = accepted.split('<br>');
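Splitting on the literal string '<br>' depends on how the markup is serialized. As a hedged alternative, the innerText value from the question already contains a line break where the <br> was (see the console output above), so it can be split on the newline instead. `splitDateTime` is a hypothetical helper name; the regex also tolerates Windows-style line endings.

```javascript
// Hypothetical helper: split "2018-08-14\n16:35:41" into its date and time parts.
function splitDateTime(cellText) {
  const [date = '', time = ''] = cellText
    .split(/\r?\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  return { date, time };
}

// Sample input copied from the question's console output.
const { date: Accepted_date, time: Accepted_time } = splitDateTime('2018-08-14\n16:35:41');
console.log('Accepted_date: ' + Accepted_date);
console.log('Accepted_time: ' + Accepted_time);
```

In the Puppeteer script, the `accepted` string returned by page.evaluate would be passed in place of the sample input.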
The item I am trying to access in the following HTML is "GMV DLL VERSION2"
<div class="container content">
<main>
<h2 id="rpcs--gmv-dll-version">RPCs → GMV DLL VERSION</h2>
<h3 id="vista-file-8994">VISTA File 8994</h3>
<table>
<thead>
<tr>
<th>property</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<td>label</td>
<td>GMV DLL VERSION2</td>
I am trying to scrape this website (http://vistadataproject.info/artifacts/vistaRPC%20documentation/GMV%20DLL%20VERSION)
and output it into a text file. I successfully did a test run with reddit.com. However, I cannot seem to get even a single element off this page. To test it, even before tackling the table, I've been trying to scrape some elements that appear early (in the top area) of the page.
The lack of classNames and Id in the tables is tricky enough, but not being able to get even the title text is really making me wonder what is going on. Any input will be appreciated.
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

request('http://vistadataproject.info/artifacts/vistaRPC%20documentation/GMV%20DLL%20VERSION', (err, res, body) => {
  if (err) {
    console.log('Error: ' + err);
    return; // res and body are undefined when the request fails
  }
  console.log('Status: ' + res.statusCode);
  const $ = cheerio.load(body);
  $('header.masthead > div.container').each(( index, tr ) => {
    // var children = $(this).children();
    const tableData = $(this).find('a.logo').text();
    console.log("Table Contents: " + tableData);
    fs.appendFileSync('test.txt', tableData + '\n' + 'Captured');
  });
});
The problem is that 'masthead' is a class name, not an id. Same deal with 'container' and 'logo'. So you need to adjust your selector accordingly:
$('header.masthead > div.container').each(( index, tr ) => {
However, that only gets you the header information, which does not include the tables containing the 'property => value' data. For that information you just need to look for child tables under the '<main>' tag.
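As a rough sketch of that last step, the body rows of every table under <main> can be turned into property/value pairs. `extractProperties` is a hypothetical name, it expects the `$` returned by cheerio.load(body), and it assumes each data row has exactly two <td> cells, as in the HTML shown in the question. It also avoids `this`: inside an arrow function `this` is not the current element, so `$(this)` in an arrow callback passed to `.each` does not refer to the matched node, whereas the callback's second argument does.

```javascript
// Hypothetical helper; `$` is the function returned by cheerio.load(body).
function extractProperties($) {
  const properties = {};
  // Every body row of every table under <main>; header cells use <th> inside <thead>.
  $('main table tbody tr').toArray().forEach((row) => {
    const cells = $(row).find('td').toArray().map((td) => $(td).text().trim());
    // Only keep rows that look like a "property | value" pair.
    if (cells.length === 2) {
      const [property, value] = cells;
      properties[property] = value;
    }
  });
  return properties;
}
```

With the markup from the question, `extractProperties(cheerio.load(body)).label` would be expected to hold "GMV DLL VERSION2". If several tables repeat a property name, later rows overwrite earlier ones in this sketch.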
I have a view with a table of products that can be added to a shopping cart. Each row has a DropDownList with the allowed quantities that can be ordered, along with a button to add to the cart. Everything is populating and displaying properly. I know how to pass the item ID in the ActionLink, but how can I get the value of the DropDownList associated with the table row of the ActionLink that was clicked?
I am guessing possibly using JQuery that fires when the ActionLink is clicked?
I also thought of making every row a form but that seems overkill.
Is there an easy MVC way to do this?
While prepping more info for a proper question, I went ahead and solved it. Thank you, Stephen, for the nudge and info.
I tried putting an Html.BeginForm around each <tr> tag in the details section. This did indeed work for me: I was able to easily get the unique form info to POST for each individual row. However, when I enabled jQuery DataTables, the submit would break. DataTables must be capturing the submit or click somehow. I haven't figured that out, but it made me try jQuery, which seems a much better way to do it.
Here is how I construct the table data row:
@foreach (var item in Model)
{
    <tr>
        <td>
            <img src="@item.GetFrontImage()" width="100" />
        </td>
        <td>
            <strong>@Html.DisplayFor(modelItem => item.DisplayName)</strong>
        </td>
        <td>
            @Html.DisplayFor(modelItem => item.CustomerSKU)
        </td>
        <td>
            @Html.DropDownList("OrderQty", item.GetAllowedOrderQuantities(), htmlAttributes: new { @class = "form-control" })
        </td>
        <td>
            <a class="btn btn-default pull-right" data-id="@item.ID">Add to Cart</a>
        </td>
    </tr>
}
This creates a select with an id of OrderQty, and I embedded the item ID in the data-id attribute of the link. I then used this jQuery to capture the info and POST it to my controller. This example just has a test div displaying the results:
// Add to Cart click
$('table .btn').click(function () {
    // Gather data for post
    var dataAddToCard = {
        ID: $(this).data('id'), // Get data-id attribute (Item ID)
        Quantity: $(this).parent().parent().find('select').val() // Get selected value of dropdown in same row as button that was clicked
    };
    // POST data to controller
    $.ajax({
        url: '@Url.Action("AddToCart","Shopping")',
        type: 'POST',
        data: JSON.stringify(dataAddToCard),
        contentType: 'application/json',
        success: function (data) { $('#Result').html(data.ID + ' ' + data.Quantity); }
    });
});
The jQuery handler receives a reference to the link being clicked, so I can extract the item ID from its data-id attribute. I can then get a reference to the dropdown (select) in the same row by using .parent().parent() (which gets me to the <tr> tag) and then finding the 'select' tag inside it. Probably pretty obvious to a lot of you.
This works great for my purposes. I can also update other elements with data returned from the POST.
Thank you
Karl
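A small variation on the row lookup above, offered as a sketch rather than a correction: jQuery's .closest('tr') walks up to the nearest enclosing row, so the handler keeps working if the button is later wrapped in another element. `readCartRow` is a hypothetical helper name; it takes the jQuery-wrapped button and returns the same ID/Quantity object that the answer posts.

```javascript
// Hypothetical helper: `$button` is the jQuery-wrapped "Add to Cart" link that was clicked.
function readCartRow($button) {
  // closest('tr') finds the enclosing row regardless of how deeply the button is nested.
  const $row = $button.closest('tr');
  return {
    ID: $button.data('id'),              // value of the data-id attribute
    Quantity: $row.find('select').val()  // selected option of the dropdown in that row
  };
}

// Wiring it up (assumes jQuery is loaded as `$` on the page):
// $('table .btn').on('click', function () { console.log(readCartRow($(this))); });
```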
For the table in HTML:
<div class="table-responsive">
    <table id="employeeTable" class="table table-bordered">
        <thead>
            <tr>
                <th class="text-center">EmpId</th>
                <th class="text-center">Name</th>
                <th class="text-center">Absense State</th>
            </tr>
        </thead>
        <tbody>
            @foreach (var item in Model)
            {
                <tr>
                    <td>@item.Id</td>
                    <td>@item.Name</td>
                    <td class="text-center">@Html.DropDownList("DDL_AbsentStatus", new SelectList(ViewBag.statusList, "Id", "Name"), new { @class = "form-control text-center" })</td>
                </tr>
            }
        </tbody>
    </table>
</div>
In JavaScript, to get the selected value:
// Collect data to pass to the controller
$("#btn_save").click(function (e) {
    e.preventDefault();
    if ($.trim($("#datepicker1").val()) == "") {
        alert("Enter a valid date!");
        return;
    }
    var employeesArr = [];
    employeesArr.length = 0;
    $.each($("#employeeTable tbody tr"), function () {
        employeesArr.push({
            EmpId: $(this).find('td:eq(0)').html(),
            EntryDate: $.trim($("#datepicker1").val()),
            StatusId: $(this).find('#DDL_AbsentStatus').val()
        });
    });
    $.ajax({
        url: '/Home/SaveAbsentState',
        type: "POST",
        dataType: "json",
        data: JSON.stringify(employeesArr),
        contentType: 'application/json; charset=utf-8',
        success: function (result) {
            alert(result);
            emptyItems();
        },
        error: function (err) {
            alert(err.statusText);
        }
    });
});
I just need to wrap a column's value in a hyperlink, so that a user can click an item in the Number column and be redirected.
Here is my current view:
@foreach (var item in Model) {
    <tr>
        <th>
            @Html.ActionLink("Read", "Read", new { id = item.id})
        </th>
        <td>
            @Html.DisplayFor(modelItem => item.Number)
        </td>
    </tr>
}
I'm trying to do something like this. I know it's not right, but I need to know the right way to do it. I am new to MVC.
@Html.ActionLink(@Html.DisplayFor(modelItem => item.Number).ToString(), "Read", new { id = item.id })
You can't really, but you can just use Url.Action instead:
<a href="@Url.Action("Read", new { id = item.id })">
    @Html.DisplayFor(modelItem => item.Number)
</a>
I don't know if there's a way to do this with ActionLink (I suspect there isn't, at least not in any way I'd want to support in the code.) But you can manually craft an a tag and still keep its URL dynamic by using Url.Action() instead:
<a href="@Url.Action("Read", new { id = item.id })">
    @Html.DisplayFor(modelItem => item.Number)
</a>
I'm fairly familiar with jQuery, but I'm working on a project in YUI, which I am totally new to, and am not sure how to accomplish this.
In essence, I need to display a js popup if a span element exists that has the text "Inactive" in it and is several steps down the tree from a div with a class of "list_subpanel_cases".
This is a rough example, but the point is, this is dynamically built, so my only definite selectors are the div with the class and the descendant span with a text value of "Inactive".
<div class="list_subpanel_cases">
<table>
<tbody>
<tr>
<td>
<span>Active</span>
</td>
</tr>
<tr>
<td>
<span>Inactive</span>
And I need to find out if any spans exist with the text "Inactive".
Hope this isn't too confusing!
It appears that CSS3 selectors can't examine content (only attributes) so you'd have to use a selector for the candidate span tags and then use code to look at the content for a match. Here's one way to do that:
function findInactive() {
    var found = null;
    Y.all(".list_subpanel_cases span").some(function(node, index, nodeList) {
        if (node.getContent() == "Inactive") {
            found = node;
            return(true); // stop looking for more matches
        }
        return(false); // keep looking for more matches
    });
    return(found);
}
if (findInactive()) {
    // execute code here when the Inactive span exists
}
You can see it work here: http://jsfiddle.net/jfriend00/BVzqL/.
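For comparison, the same check can be sketched without YUI using the standard DOM API. This is not the answer's method: `findInactiveSpan` is a hypothetical name, and it compares trimmed textContent so that surrounding whitespace in the markup does not defeat the match.

```javascript
// Hypothetical helper: `root` is a Document or Element, e.g. `document` in the browser.
function findInactiveSpan(root) {
  const spans = root.querySelectorAll('.list_subpanel_cases span');
  // Array.from lets us use find(), which stops at the first match.
  const match = Array.from(spans).find(
    (span) => span.textContent.trim() === 'Inactive'
  );
  return match || null;
}

// In the browser:
// if (findInactiveSpan(document)) { alert('An inactive case exists'); }
```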