I am trying to parse a site, but the HTML is a mess. Can anyone with more experience in parsing sites help me?
<tr>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Date</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Place</b></font></td>
<td><font FACE=Tahoma color='#CC0000' size=2><b>Situation</b></font></td>
</tr>
<tr><td rowspan=2>16/09/2011 10:11</td><td>New York</td><td><FONT COLOR="000000">Situation Red</font></td></tr>
<tr><td colspan=2>Optional comment hello new york</td></tr>
<tr><td rowspan=2>16/09/2011 10:08</td><td>Texas</td><td><FONT COLOR="000000">Situation Green</font></td></tr>
<tr><td colspan=2>Optional comment hello texas </td></tr>
<tr><td rowspan=1>06/09/2011 13:14</td><td>California</td><td><FONT COLOR="000000">Yellow Situation</font></td></tr>
</TABLE>
The strange thing is that the comment has no column in the table header, and the starting point (California) has no comment. So the starting point will always look like this:
Date: 06/09/2011 13:14
Place: California
Situation: Yellow Situation
Comment: null
All other places have a comment and will look like this:
Date: 16/09/2011 10:11
Place: New York
Situation: Situation Red
Comment: Optional comment hello new york.
I have tried some approaches, but I don't have much experience with node.js and even less with HTML parsing. I need a starting point for parsing messy markup like this.
I built a distributed scraper in node.js. I found it easier to parse HTML that had first been run through HTML Tidy.
Here is a module to run html through tidy:
var spawn = require('child_process').spawn;
var fs = require('fs');
var tidy = (function() {
this.html = function(str, callback) {
var buffer = '';
var error = '';
if (!callback) {
throw new Error('No callback provided for tidy.html');
}
var ptidy = spawn(
'tidy',
[
'--quiet',
'y',
'--force-output',
'y',
'--bare',
'y',
'--break-before-br',
'y',
'--hide-comments',
'y',
'--output-xhtml',
'y',
'--fix-uri',
'y',
'--wrap',
'0'
]);
ptidy.stdout.on('data', function (data) {
buffer += data;
});
ptidy.stderr.on('data', function (data) {
error += data;
});
ptidy.on('close', function (code) {
// 'close' fires after the child's stdio streams have ended, so buffer is complete
//fs.writeFileSync('last_tidy.html', buffer, 'binary');
callback(buffer);
});
ptidy.stdin.write(str);
ptidy.stdin.end();
}
return this;
}).call({}); // call with a fresh object so `this` is not the global object
module.exports = tidy;
Example (if saved as tidy.js):
var tidy = require('./tidy.js');
tidy.html('<table><tr><td>badly formatted html</tr>', function(html) { console.log(html); });
Result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
<table>
<tr>
<td>badly formatted html</td>
</tr>
</table>
</body>
</html>
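Once the markup is well formed, the table from the question can be turned into records. The following is only a minimal sketch: extractRows and toRecords are hypothetical names, the regular expressions stand in for a real HTML parser to keep the example dependency-free, and it assumes that every data row has three cells and every comment row has exactly one.

```javascript
// Sketch: turn the question's table rows into records.
// Assumes well-formed <tr>/<td> markup (for example after running tidy).
// Regexes are used only to avoid a dependency; a real parser is more robust.
function extractRows(html) {
  var rows = [];
  var rowRe = /<tr[^>]*>([\s\S]*?)<\/tr>/gi;
  var rowMatch;
  while ((rowMatch = rowRe.exec(html)) !== null) {
    var cells = [];
    var cellRe = /<td[^>]*>([\s\S]*?)<\/td>/gi;
    var cellMatch;
    while ((cellMatch = cellRe.exec(rowMatch[1])) !== null) {
      // Drop nested tags such as <font> and <b>, then trim whitespace.
      cells.push(cellMatch[1].replace(/<[^>]+>/g, '').trim());
    }
    rows.push(cells);
  }
  return rows;
}

function toRecords(rows) {
  var records = [];
  rows.forEach(function (cells) {
    if (cells.length === 3 && cells[0] !== 'Date') {
      // A three-cell row that is not the header starts a new record.
      records.push({
        date: cells[0],
        place: cells[1],
        situation: cells[2],
        comment: null
      });
    } else if (cells.length === 1 && records.length > 0) {
      // A single-cell row is the comment for the record just above it.
      records[records.length - 1].comment = cells[0];
    }
  });
  return records;
}
```

Calling toRecords(extractRows(html)) on the tidied page yields one object per place, with comment left as null for rows such as California that have no comment row.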
I am getting the error below (inspected via Google Chrome).
What I am trying to achieve is to explore the Fetch API by retrieving some data via a userApi.js file, which pulls it from srcServer.js (I have hardcoded some data there). I am using a webpack bundle, and index.js is the entry point of my project. I have created index.html to display the data via innerHTML.
Earlier I was using import 'isomorphic-fetch' in my userApi.js file, but that didn't help. I found some suggestions on Google to use isomorphic-fetch, node-fetch, etc., but nothing of that sort worked.
I have added most of the artifacts below. Can you please point out what I am missing here?
Project Structure
userApi.js
import 'isomorphic-fetch'
import 'es6-promise'
export function getUsers () {
return get('users')
}
function get (url) {
return fetch(url).then(onSuccess, onError) //eslint-disable-line
}
function onSuccess (response) {
return response.json()
}
function onError (error) {
console.log(error)
}
index.js
/* eslint-disable */ // --> OFF
import './index.css'
import {getUsers} from './api/userApi'
// Populate table of users via API call.
getUsers().then(result => {
let usersBody = ''
result.forEach(element => {
usersBody+= `<tr>
<td><a href='#' data-id='${user.id}' class='deleteUser'>Delete</a></td>
<td>${user.id}</td>
<td>${user.firstName}</td>
<td>${user.lastName}</td>
</tr>` //eslint-disable-line
})
global.document.getElementById('users').innerHTML = usersBody
})
index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>Page Title</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<h1>Users</h1>
<table>
<thead>
<th> </th>
<th>Id</th>
<th>First Name</th>
<th>Last Name</th>
</thead>
<tbody id="users">
</tbody>
</table>
<script src="bundle.js"></script>
</body>
</html>
srcServer.js
// sample api call data
app.get('/users', function (req, res) {
// Hard coded for simplicity
res.json([
{ 'id': 1, 'firstName': 'P', 'lastName': 'K' },
{ 'id': 2, 'firstName': 'M', 'lastName': 'K' },
{ 'id': 3, 'firstName': 'S', 'lastName': 'K' }
])
})
The error gives you the exact reason: user is not defined. Have you tried printing console.log(element) in your forEach loop? You will see what you need to change.
You are accessing the user information incorrectly. In your forEach loop, each value is represented as element, not user:
result.forEach(element => {
usersBody+= `<tr>
<td><a href='#' data-id='${element.id}' class='deleteUser'>Delete</a></td>
<td>${element.id}</td>
<td>${element.firstName}</td>
<td>${element.lastName}</td>
</tr>` //eslint-disable-line
})
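An equivalent fix is to keep the template as it was and rename the callback parameter instead. A minimal sketch, where renderUserRows is a hypothetical helper name and the input is assumed to be the array returned by the /users endpoint:

```javascript
// Sketch: build the table rows with the callback parameter named `user`,
// so the ${user...} references in the template resolve.
function renderUserRows (users) {
  return users.map(user => `<tr>
<td><a href='#' data-id='${user.id}' class='deleteUser'>Delete</a></td>
<td>${user.id}</td>
<td>${user.firstName}</td>
<td>${user.lastName}</td>
</tr>`).join('')
}
```

The result can then be assigned to the innerHTML of the users table body, as in the original index.js.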
For my class I have to design an app that says at the top of the page whether an incoming request is GET or POST, then has to print a table that shows all parameter names and values that were sent in the URL query string, and the property names and values that were received in the request body.
So far I have been able to get my localhost:port to work; it correctly shows whether a request is GET or POST. But when I go to the subpage that is supposed to display the tables, I get a 404 instead.
Here is the render page that I think is causing the problem:
function runQ(req) {
console.log(req.qParams);
console.log(req.body);
var context = {};
context.queryParams = [];
context.bodyParams = [];
context.queryCount = 0;
context.bodyCount = 0;
for( var p in req.qParams) {
context.queryCount++;
context.queryParams.push({'name': p, 'value': req.qParams[p] });
}
for( var p in req.body) {
context.bodyCount++;
context.bodyParams.push({'name': p, 'value': req.body[p] });
}
context.methodType = req.method;
return context;
}
app.get('/request', function(req, res) {
res.render('request', runQ(req));
});
app.post('/request', function(req, res) {
res.render('request', runQ(req));
});
I have a request.handlebar saved in my ubuntu/getpost/views folder along with the 404 and 500 handlebars.
The command I use for testing is:
$ curl --data "a=1&b=2&c=3" localhost:port
I replaced the localhost:port with an actual IP and port address when I have node running.
My console returns this on the tab that is running node:
undefined
{ a: '1', b: '2', c: '3' }
And this on the tab where I typed the cURL command:
<!doctype html>
<html>
<head>
<title>Demo Page</title>
</head>
<body>
<h1>POST Request Received</h1>
<table>
<caption><p>Request Body Table</p></caption>
<thead>
<tr>
<th>Property Names</th>
<th>Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>a</td>
<td>1</td>
</tr>
<tr>
<td>b</td>
<td>2</td>
</tr>
<tr>
<td>c</td>
<td>3</td>
</tr>
</tbody>
</table>
</body>
So everything seems to be working from the console but when I try to access localhost:port/request, I go to the 404 error instead of a page that displays the tables.
Can anyone tell me what I'm doing wrong? Thank you all for your time.
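One thing visible in the output above: the undefined printed on the node tab comes from console.log(req.qParams). Express exposes the parsed query string as req.query, so unless some middleware not shown here sets req.qParams, the query loop never runs. Below is a minimal sketch of the same context builder reading req.query instead. buildContext is a hypothetical name, and the sketch assumes a body parser has populated req.body. It only addresses the empty query table; the 404 may have a separate cause that the posted code doesn't show.

```javascript
// Sketch: build the template context from a request-like object.
// Assumes Express has parsed the query string into req.query and a
// body parser has populated req.body; either may be missing.
function buildContext(req) {
  var toPairs = function (obj) {
    return Object.keys(obj).map(function (name) {
      return { name: name, value: obj[name] };
    });
  };
  var context = {
    methodType: req.method,
    queryParams: toPairs(req.query || {}),
    bodyParams: toPairs(req.body || {})
  };
  context.queryCount = context.queryParams.length;
  context.bodyCount = context.bodyParams.length;
  return context;
}
```

In the route handlers this would be used as res.render('request', buildContext(req)).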
I'm using the html-pdf library (https://www.npmjs.com/package/html-pdf), which is based on PhantomJS, which internally uses WebKit. I'm pasting dummy HTML and JS code (keep these files in one folder) and attaching output screenshots.
The issue I'm facing is that on Windows the PDF is generated with some extra space at the top (empty space above the red box), which I can't get rid of.
Here is an (outdated) forum thread discussing similar issues: https://groups.google.com/forum/#!topic/phantomjs/YQIyxLWhmr0
input.html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<div id="pageHeader" style="border-style: solid;border-width: 2px;color:red;">
header <br/> header <br/> header <br/> header
</div>
<div id="pageContent" style="border-style: solid;border-width: 2px;color:green;">
<div>
body <br/> body <br/> body
</div>
</div>
</body>
</html>
JS
(Requires the handlebars and html-pdf npm packages; path and fs are built into Node.)
var path = require('path');
var fs = require('fs');
var handlebars = require('handlebars');
var pdf = require('html-pdf');
saveHtml();
function saveHtml() {
// readFile takes (path, options, callback); a 'w' flag would truncate the file
fs.readFile('input.html', 'utf-8', function(error, source) {
handlebars.registerHelper('custom_title', function(title) {
return title;
})
var template = handlebars.compile(source);
var data = {};
var html = template(data);
var options = {
'format': 'A4',
'base': "file://",
/* You can give more options like height, width, border */
};
pdf.create(html, options).toFile('./output.pdf', function(err, res) {
if (err) {
console.log('err pdf');
return;
} else {
console.log('no err pdf');
return;
}
});
});
}
Output on ubuntu
Output on windows
The extra space at the top (empty space above the red box) on Windows is the issue.
Things that didn't work:
1. Adding
"border": {
"top": "0px" // or mm, cm, in
}
to options in the JS file (https://www.npmjs.com/package/html-pdf#options).
2. Giving a fixed "height": "10.5in" and "width": "8in" in options in the JS file.
3. Setting margin-top and padding-top to 0px/-50px on the pageHeader div.
4. Overriding margin-top and padding of body to 0px/-20px in @media print in bootstrap.css.
5. Giving a fixed height to the header.
Any help will be greatly appreciated. Thanks.
You can manually set a CSS property on the html tag. In my case I was having problems developing the template on Windows and deploying it to Linux (Heroku).
I put zoom: 0.7 on the html tag and now both views look the same.
html{zoom: 0.7;}
I was able to get more consistent results by removing the IDs so that it treated everything as content rather than separate header and content areas.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
</head>
<body>
<div style="border-style: solid;border-width: 2px;color:red;">
header
</div>
<div style="border-style: solid;border-width: 2px;color:green;">
<div>
body
</div>
</div>
</body>
</html>
If you need an ID for styling, use something other than pageHeader / pageFooter to avoid the special treatment associated with those IDs.
Have you tried using a normalize style sheet to compensate for cross platform differences?
https://necolas.github.io/normalize.css/
I have this simple file, hello.html, which looks like this:
<!DOCTYPE html>
<html>
<head>
<style>
p { background:yellow; }
</style>
<script src="http://code.jquery.com/jquery-latest.js"></script>
</head>
<body>
<div class="thediv">
</div>
</body>
</html>
I am using the fs module to write to a file like this:
var fs = require('fs');
var randomnumber = Math.floor((Math.random()*100393)+433334);
fs.createReadStream('test.txt').pipe(fs.createWriteStream(randomnumber+'.txt'));
fs.writeFile(randomnumber+".txt", "Lorem ipsum"+randomnumber, function(err) {
if(err) {
console.log(err);
} else {
console.log("The file was saved!");
}
});
console.log(randomnumber);
I want to write
<article>
<p> lorem ipsum </p>
</article>
to the div with the class thediv. Is the fs module used for this kind of thing, or is there a module better suited to this task?
I believe the only answer is to get an HTML parser, parse the file into a DOM, make the adjustments and then save the DOM back to a file. Here's a question that answers where to find an HTML parser.
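For a file as small and regular as hello.html, a plain string insertion may be enough as a stopgap. The sketch below is not a DOM parser: it only looks for the exact opening tag and will miss markup written any other way. insertIntoDiv and updateFile are hypothetical names.

```javascript
var fs = require('fs');

// Insert `fragment` right after the opening tag of the first <div>
// whose class attribute is exactly `className`.
function insertIntoDiv(html, className, fragment) {
  var openTag = '<div class="' + className + '">';
  var index = html.indexOf(openTag);
  if (index === -1) {
    throw new Error('No ' + openTag + ' found');
  }
  var insertAt = index + openTag.length;
  return html.slice(0, insertAt) + '\n' + fragment + html.slice(insertAt);
}

// Read the file, insert the fragment, and write the result back.
function updateFile(path, className, fragment, callback) {
  fs.readFile(path, 'utf8', function (err, html) {
    if (err) {
      return callback(err);
    }
    fs.writeFile(path, insertIntoDiv(html, className, fragment), callback);
  });
}
```

For example, updateFile('hello.html', 'thediv', '<article>\n<p> lorem ipsum </p>\n</article>', callback) would place the article inside the div. For anything less regular, a real HTML parser remains the safer route.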
When using Express for Node.js, I noticed that it outputs the HTML code without any newline characters or tabs. Though it may be more efficient to download, it's not very readable during development.
How can I get Express to output nicely formatted HTML?
In your main app.js, or whatever is in its place:
Express 4.x
if (app.get('env') === 'development') {
app.locals.pretty = true;
}
Express 3.x
app.configure('development', function(){
app.use(express.errorHandler());
app.locals.pretty = true;
});
Express 2.x
app.configure('development', function(){
app.use(express.errorHandler());
app.set('view options', { pretty: true });
});
I put the pretty print in development because you'll want more efficiency with the 'ugly' in production. Make sure to set the environment variable NODE_ENV=production when you're deploying to production. This can be done in a shell script referenced from the 'scripts' field of package.json and run to start the app.
Express 3 changed this because:
The "view options" setting is no longer necessary, app.locals are the local variables merged with res.render()'s, so app.locals.pretty = true is the same as passing res.render(view, { pretty: true }).
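The merge described in that note can be pictured with plain objects. This is an illustration only, not Express's own code: mergeLocals is a hypothetical name, and it assumes that values passed to res.render() take precedence over app.locals.

```javascript
// Illustration only: plain objects standing in for app.locals and the
// locals passed to res.render(); this is not Express's implementation.
function mergeLocals(appLocals, renderLocals) {
  // Later sources win, so per-render values override app-wide ones.
  return Object.assign({}, appLocals, renderLocals);
}

var appLocals = { pretty: true, title: 'Site' };
var merged = mergeLocals(appLocals, { title: 'Users' });

// `pretty` is still present even though the render call never passed it.
console.log(merged.pretty, merged.title);
```

Because pretty sits in the app-wide object, every view sees it without each render call having to pass it.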
To "pretty-format" html output in Jade/Express:
app.set('view options', { pretty: true });
There is a "pretty" option in Jade itself:
var jade = require("jade");
var jade_string = [
"!!! 5",
"html",
" body",
" #foo I am a foo div!"
].join("\n");
var fn = jade.compile(jade_string, { pretty: true });
console.log( fn() );
...gets you this:
<!DOCTYPE html>
<html>
<body>
<div id="foo">I am a foo div!
</div>
</body>
</html>
It doesn't seem to be very sophisticated, but for what I'm after -- the ability to actually debug the HTML my views produce -- it's just fine.
In express 4.x, add this to your app.js:
if (app.get('env') === 'development') {
app.locals.pretty = true;
}
If you are using the console to compile, then you can use something like this:
$ jade views/ --out html --pretty
In express 4.x, add this to your app.js:
app.locals.pretty = app.get('env') === 'development';
Do you really need nicely formatted html? Even if you try to output something that looks nice in one editor, it can look weird in another. Granted, I don't know what you need the html for, but I'd try using the chrome development tools or firebug for Firefox. Those tools give you a good view of the DOM instead of the html.
If you really-really need nicely formatted html then try using EJS instead of jade. That would mean you'd have to format the html yourself though.
You can use tidy.
Take for example this jade file:
foo.jade
h1 MyTitle
p
a(class='button', href='/users/') show users
table
thead
tr
th Name
th Email
tbody
- var items = [{name:'Foo',email:'foo#bar'}, {name:'Bar',email:'bar#bar'}]
- each item in items
tr
td= item.name
td= item.email
Now you can process it with node testjade.js foo.jade > output.html:
testjade.js
var jade = require('jade');
var jadeFile = process.argv[2];
var options = {}; // render options; empty here, so the defaults apply
jade.renderFile(__dirname + '/' + jadeFile, options, function(err, html){
console.log(html);
});
This will give you something like:
output.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>My Title</title><link rel="stylesheet" href="/stylesheets/style.css"/><script type="text/javascript" src="../js/jquery-1.4.4.min.js"></script></head><body><div id="main"><div ><h1>MyTitle</h1><p>show users</p><table><thead><tr><th>Name</th><th>Email</th></tr></thead><tbody><tr><td>Foo</td><td>foo#bar</td></tr><tr><td>Bar</td><td>bar#bar</td></tr></tbody></table></div></div></body></html>
Then running it through tidy with tidy -m output.html will result in:
output.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org" />
<title>My Title</title>
<link rel="stylesheet" href="/stylesheets/style.css" type=
"text/css" />
<script type="text/javascript" src="../js/jquery-1.4.4.min.js">
</script>
</head>
<body>
<div id="main">
<div>
<h1>MyTitle</h1>
<p>show users</p>
<table>
<thead>
<tr>
<th>Name</th>
<th>Email</th>
</tr>
</thead>
<tbody>
<tr>
<td>Foo</td>
<td>foo#bar</td>
</tr>
<tr>
<td>Bar</td>
<td>bar#bar</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
Building off of oliver's suggestion, here's a quick and dirty way to view beautified HTML:
1) Download tidy.
2) Add this to your .bashrc:
function tidyme() {
curl "$1" | tidy -indent -quiet -output tidy.html ; open -a 'google chrome' tidy.html
}
3) Run:
$ tidyme localhost:3000/path
The open command only works on Macs. Hope that helps!