Node module 'request' returns incomplete html


I'm hoping to use the following snippet in a scraper to pull stats from remote radios on a network:
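var request = require('request');
// Include the scheme: request expects a full URL, not a bare IP address.
var radioURL = 'http://192.10.1.65';
request.get({
  url: radioURL
}, (error, response, html) => {
  if (error) { return console.error(error); }
  console.log(html);
});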
The console outputs the following html:
<html>
<head>
<link rel="stylesheet" type="text/css" href="2.22.css">
<script type="text/javascript" src="2.22.js">"></script>
</head>
<body onLoad="show('viewPage=10');">
<div id="logo"><img src="logo.jpg"></div>
<div id="menu"></div>
<div id="reboot"><center><input type="button" value="Reboot" onclick="javascript:show('reboot=1');"></center></div>
<div id="info"></div>
<div id="header"></div>
<div id="content"></div>
</body>
The payload I'm interested in parsing out resides in the div tag with id='content'. Inside there is a form, and inside the form is a table with all the data I'm after. The image below shows an inspection of the page expanded to see 'Voltage' with a corresponding value inside of td tags. I've tried different combos of headers in request, as well as timeouts thinking that network latency was part of the issue. How do I get to elements below the div element?
Thanks.
[Image: inspection of elements below the div element]

A call to request.get() retrieves the raw HTML that the web server sends to the browser, and nothing more. If you use View Source in the browser while looking at that page, you will see the same thing.
If the web page uses JavaScript to add content to the page, then you will NOT see that new content with request.get(), because no JavaScript is run when retrieving data with request.get(). You are just making an HTTP request to the server and getting the raw page content back.
If you want access to the content that is added via JavaScript, then you need what is often called a "headless browser": it fetches the raw HTML, runs the JavaScript in the page, and gives you a DOM-like interface for accessing the content that the JavaScript inserted.
You can see a listing of headless browser modules that you can use in Node.js here: https://github.com/dhamaniasad/HeadlessBrowsers. I don't have personal experience with any of them, but the ones I see mentioned most here on Stack Overflow are Nightmare, X-Ray and PhantomJS.
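One quick way to confirm that this is what is happening is to check whether the container is empty in the raw response. The sketch below is only an illustration: isEmptyContainer is a hypothetical helper, the sample markup is trimmed from the question, and a regular expression is not a general HTML parser.

```javascript
// Hypothetical helper: true when <div id="..."> has no content in the raw HTML.
function isEmptyContainer(rawHtml, id) {
  const pattern = new RegExp(
    '<div[^>]*\\bid=["\']' + id + '["\'][^>]*>\\s*</div>', 'i'
  );
  return pattern.test(rawHtml);
}

// Trimmed copy of the markup printed by console.log(html) in the question.
const rawHtml =
  '<body onLoad="show(\'viewPage=10\');">' +
  '<div id="menu"></div><div id="content"></div></body>';

// An empty container in the raw HTML suggests the content is added by script.
console.log(isEmptyContainer(rawHtml, 'content'));
```

If this reports an empty container while the browser's inspector shows a populated one, the data is being added after the page loads.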

For site scraping, I am a massive advocate of x-ray. It's well documented, but in your case you would basically write
xray('http://192.10.1.65', 'form-elements-you-are-targeting')(fn)
https://github.com/matthewmueller/x-ray
It's very, very, good.

Related

How to clean up ads injection on wordpress which injected through ISP

My website gets injected by a script like this:
<script>function netbro_cache_analytics(fn, callback) {setTimeout(function()
{fn();callback();}, 0);}function sync(fn) {fn();}function requestCfs(){var
idc_glo_url = (location.protocol=="https:" ? "https://" : "http://");var idc_glo_r
= Math.floor(Math.random()*99999999999);var url = idc_glo_url+ "cfs.u-
ad.info/cfspushadsv2/request" + "?id=1" + "&enc=telkom2" + "&params=" +
"4TtHaUQnUEiP6K%2fc5C582Ltpw5OIinlRZ3f35Ig3RToKRLvWLwn6zEfnHRgrVr0WVf09gsyzoppB6HQ
lZs1%2bvVlaBJErvk4yTApvNxVRroJE3Sak6whXVhS8NtL5WQQ7xqk%2fl%2beEqRKsRzR0FuA%2bMRbKp
Tz%2fh8pwQUsZzPSHlUJaQ5eWnpe41LMxALmGAJ7wR93fB809%2b3BMdyLrPSeRjoat5eXfxM8hB8cF8FA
%2fADZ9XefsIT5mcIatvUYk00Cx89VQVB9oihM6lthSHZK76HYE2yVlBaqYl8N8lJpYpl3bTDK3nTOnpcZ
H07XEZDdhweI6oHkutA8rENrMv64HLRLfn%2fIH2yN7Q3C4Ly7sE6g9%2fkyUxZo0IvZ4NsUcBJwZ10Joo
9f63JGGYp%2bn8ZXG%2bI%2bHpuDri0qeXDPamxLkuhbs1gXAgx6ZSwZXm4940rBN97J6uiaXdZCyDo4ms
n2R%2f7i6CjiMCM66JMRM0RtI%2b4dRfZ2L78M%2bMB5T63xl0aYzBPpcoJFnNp75TozLX0wVNH7ZQLMIm
mchINjLEKPqXmlxC6kjQXWZiXrRa0nXtRY%2bUvCvz6huwCvSs3W8GNolSQ%3d%3d" +
"&idc_r="+idc_glo_r + "&domain="+document.domain +
"&sw="+screen.width+"&sh="+screen.height;var bsa =
document.createElement('script');bsa.type = 'text/javascript';bsa.async =
true;bsa.src = url;(document.getElementsByTagName('head')
[0]||document.getElementsByTagName('body')
[0]).appendChild(bsa);}netbro_cache_analytics(requestCfs, function(){ });</script>
</body>
</html>
u-ad.info belongs to the company that runs my ISP (TELKOM). I have complained to them, but that will never solve the problem. I'm using WordPress. How do I clean up that script or block the script injection?

Bad ISP! :D
You cannot clean that script from your site, because it is injected when the page passes through your ISP's servers. You can only block it at the browser level. Read this: https://askubuntu.com/q/64303/224951. It's a pity that all of your website's visitors who use the same ISP will get the same injected page.
I think Google won't blacklist your site, because it is certainly not using your ISP and thus doesn't see the injected script.

Change the body tag to uppercase.
My experiments show that the script injector looks specifically for a body tag written in lower case.
I'm not sure how long it will stay that way, though.

See my solution at http://www.kaskus.co.id/thread/5491671f0e8b46ff29000007/mengakali-script-injeksi-spidol-as-a-web-developer
just change
</body>
to
</Body>

There is a very simple method to prevent the script injection from working.
Just add this script right before the </body> tag.
<script>
//</body>
</script>
[Images: the page source before and after applying the fix]
If you use WordPress, just make sure you have installed a plugin that lets you write that script in your footer section.
Do this before ISP TELKOM catches on.

Update: the Telkom ISP now detects </body></html> even when it is inside a comment.
My solution:
Don't write </body></html> at all.
Let the browser close the tags itself.
Tested and working as of December 2018.
Thank you.

Based on my experience, you can use the HTTPS protocol, or use this trick to avoid loading the script from your ISP :P
<!-- </body></html> -->
Add the code above right before your 'real' </body></html> tags.

Use HTTPS (if the server provides it), or use a VPN, SSH tunneling, or a secure proxy, and the problem goes away. The ISP injects the ads and analytics scripts by decompressing the response, injecting the script, and not recompressing the data. That also counts extra data against your internet quota.

Insert the code below in the head or at the end of the HTML.
<script type="text/javascript">
$(document).ready(function(){
$('body').append("</bo"+"dy>");
});
</script>
But make sure that your HTML code doesn't contain a </body> end tag and that it includes jQuery in your <head> tag.
Example:
Full HTML
<html>
<title>Foo bar</title>
<head></head>
<body>Lorem Ipsum</body>
</html>
becoming
<html>
<title>Foo bar</title>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
</head>
<body>Lorem Ipsum
<script type="text/javascript">
$(document).ready(function(){
$('body').append("</bo"+"dy>");
});
</script>
</html>
without the </body> end tag. The HTTP filter at the ISP greps for </body>, </Body>, or whatever form the closing body tag takes, then injects JavaScript code before that closing tag so that their ads appear on any website that uses the HTTP protocol.
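To see why removing the closing tag helps, here is a rough model of such a filter. It is purely an assumption for illustration: simulateInjection is a hypothetical function that does a case-sensitive search for the last "</body>", and the real ISP filter may well behave differently (the update elsewhere in this thread suggests it changes over time).

```javascript
// Hypothetical model of the injector: case-sensitive search for "</body>".
function simulateInjection(html, payload) {
  const marker = '</body>';
  const index = html.lastIndexOf(marker);
  if (index === -1) {
    return html; // no marker found, the page passes through untouched
  }
  // Insert the payload right before the closing body tag.
  return html.slice(0, index) + payload + html.slice(index);
}

const ad = '<script src="ad.js"></script>'; // placeholder payload
const normalPage = '<html><body>Lorem Ipsum</body></html>';
const openPage = '<html><body>Lorem Ipsum</html>'; // no </body> end tag

console.log(simulateInjection(normalPage, ad)); // payload appears before </body>
console.log(simulateInjection(openPage, ad));   // unchanged
```

Any trick of this kind only works until the filter's matching rule changes; serving the site over HTTPS, as suggested in the other answers, does not depend on guessing that rule.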

How to use Appcache with web frameworks?

I have a problem with changes to the main page. I use Tornado, and Tornado generates a special value every time the server is reached: a token to prevent XSRF attacks. When I use an .appcache file, the problem is that it caches everything! I only want to cache static files like CSS, JS, and fonts. Here is what the manifest contains:
CACHE MANIFEST
# v = 2
/static/css/meteo.css
/static/css/semantic.min.css
/static/js/jquery-2.1.1.min.js
/static/css/main.css
/static/js/semantic.min.js
/static/js/geo.js
/static/js/meteo.js
/static/fonts/icons.woff2
/static/fonts/icons.woff
/static/fonts/WeatherIcons-Regular.woff
NETWORK:
/
FALLBACK:
It doesn't work: / still gets cached!
So how do I do this with newer frameworks, where a route is not an HTML file but a URI bound to a function or class?
Here is a video I made about it
And it seems that the master is always cached:
Update: According to this page, there is no way!
But, you say, why don’t we not cache the HTML file, but cache all the rest.
Well. AppCache has a concept of “master entries”. A master entry is an HTML file that includes a manifest attribute in the html element that points to a manifest file (which is the only way to create an HTML5 appcache BTW). Any such HTML file is automatically added to the cache. This makes sense a lot of the time, but not always. In particular, when an HTML document changes frequently, we won’t want it cached (as a stale version of the page will most likely be served to the user as we just saw).
Is there no way to over-ride this? Well, AppCache has the idea of a NETWORK whitelist, which instructs the appcache to always use the online version of a file. What if we add HTML files we don’t want cached to this? Sorry, no dice. HTML files in a master entry stay cached, even when included in the NETWORK whitelist. See what I mean. Poor AppCache didn’t make these rules. He’s just following them literally. He’s not a douchebag, he’s a pain in the %^&*, a total “jobs-worth”.

I got the solution from here:
I made a hack.html which contains:
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>Hack 4 Manifest</title>
</head>
<body>
{% raw xsrf_form_html() %}
</body>
</html>
Then add this to the main page:
<iframe style='display: none;' src='/hack'></iframe>
And then in Tornado:
(r"/hack", handlers.Hack),
class Hack(MainHandler):
    #tornado.gen.coroutine
    def get(self):
        self.render("hack.html")
And then I use the javascript call:
xsrf = $("iframe").contents().find("input").val()
$("#laat").html('<input id="lat" type="hidden" name="lat"></input><input type="hidden" name="_xsrf" value='+xsrf+'><input id="lon" type="hidden" name="lon"></input><input class="ui fluid massive yellow button" value="Get forecast" type="submit"/>');

Orchard infinite scroll

I am trying to use an "Infinite Ajax Scrolling" Orchard module.
https://gallery.orchardproject.net/List/Modules/Orchard.Module.Orchard.jQuery.Ias
I installed the module through admin interface. I made necessary modifications described on the given link. Also, I had to do an extra modification that is described in the comments.
The infinite scrolling is just not working. I created about 30 blog posts in order to test it. When I scroll through blog posts on the public website, the first page of blog posts is loaded, and when I scroll to the bottom, nothing happens. The pager is not visible (expected), but no new content is appended to the bottom of the list (not expected).
When I scroll through blog posts using the Admin interface and scroll down far enough, the Chrome console reports a couple of things:
Uncaught Error: Syntax error, unrecognized expression: <!DOCTYPE html>
<html lang="en-US" class="static orchard-blogs">
<head>
<meta charset="utf-8" />
<title>Proba - Manage Infinite Blog</title>
<link href="/OrchardLocal/Modul...<omitted>...l> jquery-1.9.1.js:4421
Sizzle.error jquery-1.9.1.js:4421
tokenize jquery-1.9.1.js:5076
select jquery-1.9.1.js:5460
Sizzle jquery-1.9.1.js:3998
jQuery.fn.extend.find jquery-1.9.1.js:5576
jQuery.fn.jQuery.init jquery-1.9.1.js:196
jQuery jquery-1.9.1.js:62
jQuery.fn.jQuery.init jquery-1.9.1.js:201
jQuery jquery-1.9.1.js:62
(anonymous function) jquery.ias.min.js:210
fire jquery-1.9.1.js:1037
self.fireWith jquery-1.9.1.js:1148
done jquery-1.9.1.js:8074
callback
A moment after:
GET http://localhost:30321/modules/orchard.jquery.ias/styles/images/loader.gif 404 (Not Found) jquery-1.9.1.js:6469
jQuery.extend.buildFragment jquery-1.9.1.js:6469
jQuery.extend.parseHTML jquery-1.9.1.js:531
jQuery.fn.jQuery.init jquery-1.9.1.js:149
jQuery jquery-1.9.1.js:62
get_loader jquery.ias.min.js:266
show_loader jquery.ias.min.js:279
paginate jquery.ias.min.js:167
scroll_handler jquery.ias.min.js:99
jQuery.event.dispatch jquery-1.9.1.js:3074
elemData.handle jquery-1.9.1.js:2750
In the admin interface I checked Blog properties and it seems to be configured fine. All default values are in place for [Container, Item, Pager, NextAnchor], and these values are also present in the html file I'm viewing when reported errors occur.
EDIT (after justrhysism's answer)
After implementing justrhysism's answer, I focused on why infinite scrolling works in the dashboard but not in front-end.
When I opened a list of blog posts in dashboard, I located .pager element.
<ul class="pager">
<li class="first"><span>1</span></li>
<li>2</li>
<li class="last">></li>
</ul>
I opened a list of blogs in front-end, and also located .pager element.
<ul class="pager" shape-id="92" style="display: none;">
<li class="first" shape-id="92"><span shape-id="93">1</span></li>
<li shape-id="92">2</li>
<li shape-id="92">></li>
<li class="last" shape-id="92"></li>
</ul>
Then I inspected the JavaScript in charge of triggering async loading of content.
function paginate(curScrOffset, onCompleteHandler)
{
urlNextPage = $(opts.next).attr("href"); // evaluates to $(".zone-content .pager .last a").attr("href")
...............
}
And found out that the urlNextPage variable always gets set to undefined in front-end view.

Problem
I've come across this before. It's a document parsing error. There is a whitespace character (of some description) at the top of the document which Orchard returns instead of the <! which is expected. Somebody with more knowledge of AJAX and document parsing could better describe this.
Solution
To fix this, find the view Document.cshtml within Orchard's Core (located in src\Orchard.Web\Core\Shapes\Views) and copy it to your Theme's View directory.
In this file, look to Line 10 where <!DOCTYPE html> starts. Above this, remove the line break between the closing brace } and the DOCTYPE declaration.
Before:
}
<!DOCTYPE html>
After:
}<!DOCTYPE html>
This should fix your issue.
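Why a single line break matters: as far as I know, jQuery 1.9 and later only treat a string passed to $() as HTML when it begins with <, and anything else is parsed as a selector, which matches the "unrecognized expression" error in the question. The sketch below models that rule in plain JavaScript. startsLikeHtml is a hypothetical stand-in, not jQuery's actual code.

```javascript
// Hypothetical, simplified stand-in for jQuery's "is this string HTML?" check.
function startsLikeHtml(text) {
  return text.charAt(0) === '<';
}

// A response with a line break before the DOCTYPE, as described above.
const response = '\r\n<!DOCTYPE html><html><body></body></html>';

console.log(startsLikeHtml(response));        // leading whitespace: treated as a selector
console.log(startsLikeHtml(response.trim())); // trimmed: treated as HTML
```

If editing Document.cshtml is not an option, trimming the response text before handing it to jQuery should have a similar effect, at the cost of patching the plugin's script.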

The Infinite Ajax Scrolling module for Orchard CMS is on Github
https://github.com/grapto/js.Ias

In response to my last edit where urlNextPage was being set to undefined.
I changed the NextAnchor selector in Blog properties:
From default value of ".zone-content .pager .last a" ----> To ".zone-content .pager a".
Also I went into Orchard.jQuery.Ias module -> Scripts/jquery.ias.min.js
Changed paginate function.
Old:
function paginate(curScrOffset, onCompleteHandler)
{
urlNextPage = $(opts.next).attr("href");
....
New:
function paginate(curScrOffset, onCompleteHandler)
{
if ($(opts.pagination).find(".last span").length)
return; // if span element exists in .last element, that means we are on the last page
urlNextPage = $(opts.next).last().attr("href");
....
It works fine now, both in dashboard and in front-end. This is not a very reliable solution, because I made an assumption that last anchor tag in .pager will always lead to the next page. That assumption is based merely on my observations of module's behavior.

How to make so that Jade (Node.js) doesn't close the <body> tag

I'm trying to send multiple chunks of data to a client each of which is rendered by Jade templating engine in Express Node.js framework.
I have several views like header, viewA, viewB, viewC, etc.
For every request I need to render the header partial view and send it as a chunk so that the client browser starts rendering it. When the header view is rendered I don't want the <body> tag inside to be closed, because more data is to come which should be inside the <body> tag.
In the meantime, I need to do some computations and after that render another view: either A, B, or C.
Once A, B, or C view is sent, I close the response which means closing the </body></html> tags.
Sounds very simple. But the problem is that Jade closes the <html> and <body> tags when rendering the header view.
I know how to do this manually using native Node.js response object, however, can't figure out how to do this with Jade the right way.
The only solution I currently see is to manually send the header part down to open <body> tag, then render the rest as Jade partials via res.partial().
Any hint is highly appreciated.

You can output any text (including raw HTML) with !=
!= "<body>"
some
  more
    tree(attr=1)
This results in the following output:
<body>
<some>
  <more>
    <tree attr="1"></tree>
  </more>
</some>

I may accept the answer above, which looks like it should work, but it didn't work for me. The following code, however, did work.
!!! 5
<html>
head
  meta(http-equiv="Content-Type",content="text/html;charset=utf-8")
  meta(http-equiv="Cache-Control", content="no-cache")
  meta(http-equiv="Expires", content="0")
  title= title
  link(rel="shortcut icon",href='/img/nelo2_logo.ico')
  link(rel='stylesheet',href='/css/bootstrap.css')
  link(rel='stylesheet',href='/css/jquery-ui.css')
  link(rel='stylesheet',href='/css/chosen.css')
  link(rel='stylesheet',href='/css/common.css')
  link(rel='stylesheet',href='/css/style.css')
  script(src='/js/core/require.js')
<body>
So the solution was not to indent the <html> and <body> tags but have them on the same column. This way Jade doesn't close them.
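For comparison, the manual approach mentioned in the question can be sketched with nothing but write() and end() on the response. sendInChunks and renderBody are hypothetical names, and the HTML strings stand in for whatever the templates actually produce.

```javascript
// Sketch: send the header first, keep <body> open, finish once the rest is ready.
function sendInChunks(res, headerHtml, renderBody) {
  res.write(headerHtml); // first chunk: ends with an open <body> tag
  renderBody(function (bodyHtml) {
    res.write(bodyHtml);       // view A, B, or C
    res.end('</body></html>'); // close the tags only at the very end
  });
}

// Minimal stand-in for an http.ServerResponse, used to show the chunk order.
const chunks = [];
const fakeRes = {
  write: function (chunk) { chunks.push(chunk); },
  end: function (chunk) { chunks.push(chunk); }
};

sendInChunks(fakeRes, '<html><head></head><body>', function (done) {
  done('<p>view A</p>'); // pretend the computation finished
});
console.log(chunks);
```

With a real response object the browser can start rendering the header while the computation for the second chunk is still running.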

Cannot read response from AngularJS $resource JSONP get from Google Finance

I am following the tutorial at http://www.youtube.com/watch?v=IRelx4-ISbs to use AngularJS $resource to get JSON data of a stock index, such as the S&P 500, from Google Finance. However, when I write the result to console, I do not see the data. I get this in Google Chrome's console:
e
__proto: 3
$delete
$get
$query
$remove
$save
constructor
__proto
However, when I go to "Network" in Chrome's console, I see the get in the Name Path left column. When I click on "info", I see five tabs in the right panel. Under the Preview and Response tabs, I see the correct data. I just don't know how to see or retrieve that in my JavaScript.
I attempted to put my code to http://jsfiddle.net/curt00/ycYn7/ but Google Chrome's console is giving me "Uncaught Error: No module: Twitter". (Does anybody know how to get around this error?) Here is the HTML code from my jsfiddle:
<!doctype html>
<html ng-app="Twitter">
<head>
<script src="http://ajax.googleapis.com/ajax/libs/angularjs/1.0.2/angular-resource.min.js">
</script>
</head>
<body>
<div ng-controller="TwitterCtrl"></div>
</body>
</html>
Here is the Javascript code from my jsfiddle:
angular.module('Twitter', ['ngResource']);
function TwitterCtrl($scope, $resource) {
var tickerSymbol = "INDEXSP:.INX";
var tempurl = 'https://finance.google.com/finance/info?client=ig&q=' + tickerSymbol.replace(":","%3A") + '&callback=?:action';
$scope.googleFinance = $resource(tempurl,
{action:'&test=', q:'test', callback:'JSON_CALLBACK'},
{get:{method:'JSONP'}});
$scope.indexResult = $scope.googleFinance.get();
console.log('indexResult: ',$scope.indexResult);
}
Can anybody suggest how my code should be changed so that I can fetch the data from the response?

There are at least 2 things that are not correct:
1. You are duplicating the parameters on your request. The tempurl should not contain any parameters (i.e. https://finance.google.com/finance/info). If you want to pass parameters, do it within the actions ('get', 'post', 'delete', etc...) or set default parameters. For more information, check the angular.js $resource documentation.
2. The Google Finance response is actually an array, so you need to add isArray: true to your GET action in order for it to work.
jsFiddle: http://jsfiddle.net/8zVxH/1/
P.S. I'm not familiar with the Google Finance API, so I don't know if the results are the ones you expect. I've just 'fixed' your jsFiddle without changing the logic...
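The two fixes could look roughly like this. It is a sketch, not the code from the jsFiddle above: createGoogleFinance is a hypothetical wrapper that takes $resource as an argument, and the parameter names are copied from the question's URL.

```javascript
// Hypothetical wrapper: base URL without a query string, parameters passed separately.
function createGoogleFinance($resource) {
  return $resource(
    'https://finance.google.com/finance/info',
    { client: 'ig', callback: 'JSON_CALLBACK' }, // default parameters
    { get: { method: 'JSONP', isArray: true } }  // the response is an array
  );
}

// Inside the controller, read the data in the success callback:
// createGoogleFinance($resource).get({ q: 'INDEXSP:.INX' }, function (result) {
//   console.log('indexResult: ', result);
// });
```

The object returned by get() is filled in only after the response arrives, which is why logging it immediately, as in the question, shows an empty resource.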
