Scraping URLs from a node.js data stream on the fly

I am working with a node.js project (using Wikistream as a basis, so not totally my own code) which streams real-time Wikipedia edits. The code breaks each edit down into its component parts and stores it as an object (see the gist at https://gist.github.com/2770152). One of the parts is a URL. I am wondering if it is possible, when parsing each edit, to scrape the URL that shows the differences between the pre-edit and post-edit Wikipedia page, grab the difference (inside a span with the classes 'diffchange diffchange-inline', for example) and add that as another property of the object. Right now it could just be a string; it does not have to be fully structured.
I've tried using nodeio and have some code like this (I am specifically trying to scrape only edits that have been marked in the comments (m[6]) as possible vandalism):
if (m[6].match(/vandal/) && namespace === "article"){
  nodeio.scrape(function(){
    this.getHtml(m[3], function(err, $){
      //console.log('getting HTML, boss.');
      console.log(err);
      var output = [];
      $('span.diffchange.diffchange-inline').each(function(scraped){
        output.push(scraped.text);
      });
      vandalContent = output.toString();
    });
  });
} else {
  vandalContent = "no content";
}
When it hits the conditional statement it scrapes one time and then the program closes out. It does not store the desired content as a property of the object. If the condition is not met, it does store a vandalContent property set to "no content".
What I am wondering is: Is it even possible to scrape like this on the fly? Is the scraping bogging the program down? Are there other suggested ways to get a similar result?

I haven't used nodeio yet, but the signature looks to be an async callback, so from the program flow perspective, that happens in the background and therefore does not block the next statement from occurring (next statement being whatever is outside your if block).
It looks like you're trying to do it sequentially, which means you need to either rethink what you want your callback to do or else force it to be sequential by putting the whole thing in a while loop that exits only when you have vandalContent (which I wouldn't recommend).
For a test, try doing a console.log on your vandalContent in the callback and see what it spits out.
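To make the timing issue concrete, here is a minimal sketch. It does not use nodeio: fakeScrape is a hypothetical stand-in for any asynchronous scraper call, and buildEdit is an illustrative name. The point is only that code which needs vandalContent has to run inside the callback, not after the if block.

```javascript
// Hypothetical stand-in for an asynchronous scraper call; not a nodeio API.
function fakeScrape(url, callback) {
  setTimeout(function () {
    callback(null, 'diff text from ' + url);
  }, 0);
}

// Builds the edit object and hands it to `done` only once vandalContent is known.
function buildEdit(url, comment, done) {
  var edit = { url: url, comment: comment };
  if (/vandal/.test(comment)) {
    fakeScrape(url, function (err, text) {
      edit.vandalContent = err ? 'scrape failed' : text;
      done(edit);
    });
  } else {
    edit.vandalContent = 'no content';
    done(edit);
  }
  return edit; // returned immediately, possibly before the scrape has finished
}

var pending = buildEdit('http://example.org/diff', 'possible vandalism', function (edit) {
  console.log('in callback:', edit.vandalContent);
});
console.log('right after the call:', pending.vandalContent);
```

The "right after the call" line prints first and shows undefined, because the scrape has not completed yet; the value only exists by the time the callback runs.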

Related

Using webRequest API to intercept script requests, edit them and send them back

As the title says, I'm trying to intercept script requests from the user's page, make a GET request to the script url from the background, add a bit of functionality and send it back to the user.
A few caveats:
I don't want to do this with every script request
I still have to guarantee that the script tags are executed in the original order
So far I have come up with two solutions, neither of which works properly. The basic code:
chrome.webRequest.onBeforeRequest.addListener(
  function handleRequest(request) {
    // First I make the get request for the script myself SYNCHRONOUSLY,
    // because the webRequest API cannot handle async.
    const syncRequest = new XMLHttpRequest();
    syncRequest.open('GET', request.url, false);
    syncRequest.send(null);
    const code = syncRequest.responseText;
  },
  { urls: ['<all_urls>'] },
  ['blocking'],
);
Now once we have the code, there are two approaches that I've tried to insert it back into the page.
I send the code through a port to a content script, which adds it to the page inside a <script></script> tag. Along with the code, I also send an index to make sure the scripts are inserted back into the page in the correct order. This works fine for my dummy website, but it breaks on bigger apps, like YouTube, where it fails to load the images for most videos. Any tips on why this happens?
I return a redirect to a data url:
if (condition) return { cancel: false }
else return { redirectUrl: 'data:application/javascript; charset=utf-8,'.concat(alteredCode) };
This second option breaks the code formatting, sometimes removing spaces, sometimes cutting the code short. I'm not sure of the reason behind this behavior; it might have something to do with the data URL spec.
I'm stuck. I've researched pretty much every related answer on this website and couldn't find anything. Any help or information is greatly appreciated!
Thanks for your time!!!
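One likely contributor to the truncation in the second approach is that the script text is concatenated into the URL unescaped. Inside a URL, a # starts the fragment, and tabs and newlines are dropped during URL parsing, so raw source code can be cut short or lose whitespace. A minimal sketch of percent-encoding the payload first follows; toDataUrl is an illustrative name, and whether a given browser accepts a data: redirect for script requests at all is a separate question that this sketch does not address.

```javascript
// Builds a data: URL whose payload is percent-encoded, so characters such as
// '#', '%', and newlines survive URL parsing.
function toDataUrl(code) {
  return 'data:application/javascript;charset=utf-8,' + encodeURIComponent(code);
}

var alteredCode = 'var a = 1; // 100% #1\nconsole.log(a);';
var url = toDataUrl(alteredCode);

// Decoding the payload gives back the original source unchanged.
var payload = url.slice(url.indexOf(',') + 1);
console.log(decodeURIComponent(payload) === alteredCode);
```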

range.address throws context related errors

We've been developing using Excel JavaScript API for quite a few months now. We have been coming across context related issues which got resolved for unknown reasons. We weren't able to replicate these issues and wondered how they got resolved. Recently similar issues have started popping up again.
Error we consistently get:
property 'address' is not available. Before reading the property's
value, call the load method on the containing object and call
"context.sync()" on the associated request context.
We thought that, because we have multiple functions defined to modularise the code in our project, the context might differ somewhere among these functions without our noticing. So we came up with a single-context solution, implemented via the JavaScript module pattern.
var ContextManager = (function () {
  var xlContext;//single context for entire project/application.
  function loadContext() {
    xlContext = new Excel.RequestContext();
  }
  function sync(object) {
    return (object === undefined) ? xlContext.sync() : xlContext.sync(object);
  }
  function getWorksheetByName(name) {
    return xlContext.workbook.worksheets.getItem(name.toString());
  }
  //public
  return {
    loadContext: loadContext,
    sync: sync,
    getWorksheetByName: getWorksheetByName
  };
})();
NOTE: above code shortened. There are other methods added to ensure that single context gets used throughout application.
While implementing single context, this time round, we have been able to replicate the issue though.
Office.initialize = function (reason) {
  $(document).ready(function () {
    ContextManager.loadContext();
    function loadRangeAddress(rng, index) {
      rng.load("address");
      ContextManager.sync().then(function () {
        console.log("Address: " + rng.address);
      }).catch(function (e) {
        console.log("Failed address for index: " + index);
      });
    }
    for (var i = 1; i <= 1000; i++) {
      var sheet = ContextManager.getWorksheetByName("Sheet1");
      loadRangeAddress(sheet.getRange("A" + i), i);//I expect to see a1 to a1000 addresses in console. Order doesn't matter.
    }
  });
}
In the above case, only "A1" gets printed to the console as a range address. None of the other addresses (A2 to A1000) are printed; only the catch block executes. Can anyone explain why this happens?
Although I've written a for loop above, that isn't my use case. In the real use case, such situations occur when one range object in function a needs to load its range address while, in the meantime, another function b also wants to load a range address. Both function a and function b work asynchronously on separate tasks: for example, one creates a table object (the table needs an address) and the other pastes data to the sheet (there's a debug statement to see where the data was pasted).
This is something our team hasn't been able to figure out or find a solution for.

There is a lot packed into this code, but the issue you have is that you're calling sync a whole bunch of times without awaiting the previous sync.
There are several problems with this:
If you were using different contexts, you would actually see that there is a limit of ~50 simultaneous requests, after which you'll get errors.
In your case, you're running into a different (and almost opposite) problem. Given the async nature of the APIs, and the fact that you're not awaiting on the sync-s, your first sync request (which you'd think is for just A1) will actually contain all the load requests from the execution of the entire for loop. Now, once this first sync is dispatched, the action queue will be cleared. Which means that your second, third, etc. sync will see that there is no pending work, and will no-op, executing before the first sync ever came back with the values!
[This might be considered a bug, and I'll discuss with the team about fixing it. But it's still a very dangerous thing to not await the completion of a sync before moving on to the next batch of instructions that use the same context.]
The fix is to await the sync. This is far and away simplest to do with TypeScript 2.1 and its async/await feature. Otherwise you need to write the async version of the for loop, which you can look up, but it's rather unintuitive (it requires creating an uber-promise that keeps chaining a bunch of .then-s).
So, your modified TypeScript-ified code would be
ContextManager.loadContext();
async function loadRangeAddress(rng, index) {
  rng.load("address");
  await ContextManager.sync().then(function () {
    console.log("Address: " + rng.address);
  }).catch(function (e) {
    OfficeHelpers.Utilities.log(e);
  });
}
// The loop below uses await, so it must itself run inside an async function.
for (var i = 1; i <= 1000; i++) {
  var sheet = ContextManager.getWorksheetByName("Sheet1");
  await loadRangeAddress(sheet.getRange("A" + i), i);//I expect to see a1 to a1000 addresses in console. Order doesn't matter.
}
Note the async in front of the loadRangeAddress function, and the two await-s in front of ContextManager.sync() and loadRangeAddress.
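Where async/await is not available, the chained-promise loop mentioned earlier might look roughly like the sketch below. It is plain JavaScript with no Office.js calls: runSequentially and fakeLoadAndSync are illustrative names, and fakeLoadAndSync merely stands in for "load the address, then sync".

```javascript
// Runs task(i) for i = first..last strictly one after another by chaining
// each call onto the promise returned by the previous one.
function runSequentially(first, last, task) {
  var chain = Promise.resolve();
  for (var i = first; i <= last; i++) {
    (function (index) {
      chain = chain.then(function () {
        return task(index);
      });
    })(i);
  }
  return chain;
}

// Stand-in for "load the address, then sync"; resolves asynchronously.
var order = [];
function fakeLoadAndSync(index) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      order.push('A' + index);
      resolve();
    }, 0);
  });
}

var finished = runSequentially(1, 5, fakeLoadAndSync).then(function () {
  console.log(order.join(', '));
});
```

In the add-in, the task would be a function that loads the range and returns the promise from ContextManager.sync(), so that each sync completes before the next load is queued.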
Note that this code will also run quite slowly, as you're making an async roundtrip for each cell. Which means you're not using batching, which is at the very core of the object-model for the new APIs.
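As a rough illustration of batching, the sketch below queues a load for every cell first and only then calls sync once. The function name loadAddresses is hypothetical, and the context is passed in as a parameter (assumed to behave like an Excel.RequestContext) so that the logic can be read on its own.

```javascript
// Queues a load for every cell first, then performs a single sync.
// `context` is assumed to behave like an Excel.RequestContext.
function loadAddresses(context, sheetName, count) {
  var sheet = context.workbook.worksheets.getItem(sheetName);
  var ranges = [];
  for (var i = 1; i <= count; i++) {
    var rng = sheet.getRange('A' + i);
    rng.load('address');
    ranges.push(rng);
  }
  // One roundtrip for the whole batch; addresses are readable afterwards.
  return context.sync().then(function () {
    return ranges.map(function (rng) {
      return rng.address;
    });
  });
}

// Possible use inside an add-in (not executed here):
// Excel.run(function (context) {
//   return loadAddresses(context, 'Sheet1', 1000).then(function (addresses) {
//     console.log(addresses.join('\n'));
//   });
// });
```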
For completeness' sake, I should also note that creating a "raw" RequestContext instead of using Excel.run has some disadvantages. Excel.run does a number of useful things, the most important of which is automatic object tracking and un-tracking (not relevant here, since you're only reading back data; but it would be relevant if you were loading and then wanting to write back into the object).
Finally, if I may recommend (full disclosure: I am the author of the book), you will probably find a good bit of useful info about Office.js in the e-book "Building Office Add-ins using Office.js", available at https://leanpub.com/buildingofficeaddins. In particular, it has a very detailed (10-page) section on the internal workings of the object model ("Section 5.5: Implementation details, for those who want to know how it really works"). It also offers advice on using TypeScript, has a general Promise/async-await primer, describes what .run does, and has a bunch more info about the OM. Also, though not available yet, it will soon offer information on how to resume using the same context (using a newer technique than what was originally described in "How can a range be used across different Word.run contexts?"). The book is a lean-published "evergreen" book, so once I write the topic in the coming weeks, an update will be available to all existing readers.
Hope this helps!

Node.js and Express - Regarding rendering pages in a directory with one call

Hopefully I'm not duplicating a question here, but I looked around and didn't see anything that addressed this specific issue.
I'm constructing some pages by creating a bunch of unique content blocks that I store in a folder - things like this:
h4 Content Example
p.description.
  Lorem ipsum dolor sample text etc etc
in, say, pages\contentBlocks\Example\contentExample.jade.
These content blocks are added to a mixin (pages\mixins.jade) that looks like this:
mixin contentBlock(title)
  .content-block(id=title.toLowerCase())
    h3 #{title}
    if block
      .content
        block
and I call this mixin within a larger page (pages\portfolio.jade) like this:
.subsection
  //- Expects a list of objects of the format { title, html }
  each val in pageBlocks
    +contentBlock(val.title)
      != val.html
(the intention is to later enhance this with other variables that can be passed down into the mixin for in-place formatting, like background images)
I would like to make use of the ability of Express to automatically render all of the pages in a directory in a single call. Like so (index.js):
var pageBlocks = [];
res.render('pages/contentBlocks/Example/*', function (err, html) {
  if (err) throw err;
  pageBlocks.push({ html: html });
});
However, I'd also like to be able to reference the title of the file, for example for setting the class of the block that I put it in. Ideally, it would look something like this:
res.render('pages/contentBlocks/Example/*', function (err, html) {
  if (err) throw err;
  pageBlocks.push({ title: functionToGetFileName(), html: html });
});
Is there any way to do this without manually iterating through the files in that directory?
Or, since this method seems needlessly convoluted (given the presence of mixins, includes, extends, and partials, it feels like I ought to be able to do this without extra render calls), is there a better way to do this?
Edit: I've done a bit more looking around and code testing, and now I'm unsure. Can Express implicitly render all of the files in a directory? Based on this question, I assumed it could, but I might have misread. The formatting of res.render('path/to/directory/', callback) doesn't seem to be working - I get the following error:
Error: Failed to lookup view "path/to/directory/" in views directory "/path/to/app/views"

For a temporary solution, I am using the following pattern:
var exampleBlocks = [];
var eb_dir = "pages/contentBlocks/Example";
fs.readdirSync("./views/" + eb_dir).forEach(function(file) {
  res.render(eb_dir + "/" + file, function (err, html) {
    if (err) throw err;
    exampleBlocks.push({title: file, html: html});
  });
});
Which is concise enough for my tastes, though calling res.render() so many times rustles my jimmies. I'd rather it be a single batch render, or better yet, solved through clever use of jade structures - but I still haven't thought of a way to do the latter.
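A possible variation on the pattern above, sketched under a few assumptions: renderBlocks is a hypothetical helper, its render argument is assumed to have the shape of res.render(view, callback), and the title is derived from the file name with its extension stripped. Wrapping each render in a promise and collecting them with Promise.all keeps the blocks in the same order as the file list, whenever the individual callbacks happen to fire.

```javascript
var path = require('path');

// Renders every file in `files` (names relative to `dir`) and resolves with
// [{ title, html }, ...] in the same order as `files`.
// `render` is assumed to have the shape of res.render(view, callback).
function renderBlocks(render, dir, files) {
  return Promise.all(files.map(function (file) {
    return new Promise(function (resolve, reject) {
      render(dir + '/' + file, function (err, html) {
        if (err) return reject(err);
        // "contentExample.jade" becomes the title "contentExample".
        var title = path.basename(file, path.extname(file));
        resolve({ title: title, html: html });
      });
    });
  }));
}

// Possible use inside a route handler (not executed here):
// renderBlocks(res.render.bind(res), eb_dir, fs.readdirSync('./views/' + eb_dir))
//   .then(function (pageBlocks) {
//     res.render('pages/portfolio', { pageBlocks: pageBlocks });
//   });
```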
Edit:
I decided I'd rather not simply enumerate the files in the folder - I want more control of the order. Instead of using an each loop in my page, I'm simply going to call the mixin with included content, like this (pages\portfolio.jade in my previous examples):
.subsection
  +contentBlock("Block 1")
    include blocks/block1
  +contentBlock("Block 2")
    include blocks/block2
I will leave the question here, however, as others may encounter a similar issue.

Where to initialize extension related data

I am a newbie trying to write a basic extension. For my extension to work I need to initialize some data, so inside my background.js I declared something like this:
localStorage["frequency"] = 1; //I want one as Default value. This line is not inside any method, it's just the first line of the file background.js
Users can go to the Options page and change the above variable to any value using the GUI. As soon as the user changes it in the UI, I update that value.
Now the problem is that, to my understanding, background.js reloads every time the machine is restarted. So every time I restart my machine and open Chrome, the frequency value is changed back to 1. To avoid this, where do I need to initialize this value?

You could just use a specific default key. So if frequency is not set you would try default-frequency. The default keys are then still set or defined in the background.js.
I like to do that in one step, in a function like this
function storageGet(key,defaultValue){
  var item = localStorage.getItem(key);
  if(item === null)return defaultValue;
  else return item;
}
(According to the specification, localStorage.getItem must return null if no value has been set.)
So for your case it would look something like
var f = storageGet("frequency",1);
Furthermore, you might be interested in checking out the chrome.storage API. It's used similarly to localStorage but provides additional functionality which might be useful for your extension. In particular, it supports synchronizing the user data across different Chrome browsers.
Edit: I changed the if statement in response to apsillers' objection. Since the specification says it ought to be null, I think it makes sense to check for that instead of undefined.
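For the chrome.storage route, a minimal sketch might look like the following. It relies on get() accepting an object of default values, which are returned for keys that have not been stored yet. The function name loadFrequency is hypothetical, and the storage area is passed in as a parameter (assumed to behave like chrome.storage.sync or chrome.storage.local); the extension would also need the "storage" permission in its manifest.

```javascript
// Reads `frequency`, falling back to 1 when nothing has been stored yet.
// `storageArea` is assumed to behave like chrome.storage.sync or
// chrome.storage.local, whose get() accepts an object of default values.
function loadFrequency(storageArea, callback) {
  storageArea.get({ frequency: 1 }, function (items) {
    callback(items.frequency);
  });
}

// Possible use in background.js (not executed here):
// loadFrequency(chrome.storage.sync, function (frequency) {
//   console.log('frequency is', frequency);
// });
```

Because the default is supplied at read time, nothing is written on startup, so a value the user saved from the Options page is not overwritten when the background script reloads.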

This is another solution:
// background.js
initializeDefaultValues();
function initializeDefaultValues() {
  if (localStorage.getItem('default_values_initialized')) {
    return;
  }
  // set default values for your variable here
  localStorage.setItem('frequency', 1);
  localStorage.setItem('default_values_initialized', true);
}

I think the problem lies with your syntax. To get and set your localStorage values try using this:
// to set
localStorage.setItem("frequency", 1);
// to get
localStorage.getItem("frequency");

Request.Filter in an IIS Managed Module

My goal is to create an IIS Managed Module that looks at the Request and filters out content from the POST (XSS attacks, SQL injection, etc).
I'm hung up right now, however, on the process of actually filtering the Request. Here's what I've got so far:
In the Module's Init, I set HttpApplication.BeginRequest to a local event handler. In that event handler, I have the following lines set up:
if (application.Context.Request.HttpMethod == "POST")
{
    application.Context.Request.Filter = new HttpRequestFilter(application.Context.Request.Filter);
}
I also set up an HttpResponseFilter on the application.Context.Response.Filter
HttpRequestFilter and HttpResponseFilter are implementations of Stream.
In the response filter, I have the following set up (an override of Stream.Write):
public override void Write(byte[] buffer, int offset, int count)
{
    var Content = UTF8Encoding.UTF8.GetString(buffer);
    Content = ResponseFilter.Filter(Content);
    _responseStream.Write(UTF8Encoding.UTF8.GetBytes(Content), offset, UTF8Encoding.UTF8.GetByteCount(Content));
}
ResponseFilter.Filter is a simple String.Replace, and it does, in fact, replace text correctly.
In the request filter, however, there are 2 issues.
The code I have currently in the RequestFilter (an override of Stream.Read):
public override int Read(byte[] buffer, int offset, int count)
{
    var Content = UTF8Encoding.UTF8.GetString(buffer);
    Content = RequestFilter.Filter(Content);
    if (buffer[0]!= 0)
    {
        return _requestStream.Read(UTF8Encoding.UTF8.GetBytes(Content), offset, UTF8Encoding.UTF8.GetByteCount(Content));
    }
    return _requestStream.Read(buffer, offset, count);
}
There are 2 issues with this. First, the filter is called twice, not once, and one of the calls just receives a buffer of \0's (the if check on buffer[0] currently filters this out, but I think that I'm setting something up wrong).
Second, even though I am correctly grabbing content with the .GetString in the read, and then altering it in RequestFilter.Filter (a glorified string.replace()), when I return the byte-encoded Content inside the if statement, the input is unmodified.
Here's what I'm trying to figure out:
1) Is there something I can check prior to the filter to ensure that what I'm checking is only the POST and not the other time it is being called? Am I not setting the Application.Context.Request.Filter up correctly?
2) I'm really confused as to why rewriting things to the _requestStream (the HttpApplication.Context.Request.Filter that I sent to the class) isn't showing up. Any input as to something I'm doing wrong would be really appreciated.
Also, is there any difference between HttpApplication.Request and HttpApplication.Context.Request?
edit: for more information, I'm testing this on a simple .aspx page that has a text box, a button and a label, and on button click assigns the text box text to the label's text. Ideally, if I put content in the textbox that should be filtered, it is my understanding that by intercepting and rewriting the post, I can cause the stuff to hit the server as modified. I've run test though with breakpoints in the module and in code, and the module completes before the code behind on the .aspx page is hit. The .aspx page gets the values as passed from the form, and ignores any filtering I attempted to do.

There are a few issues going on here, but for future reference, what explains both the page receiving the unfiltered post and the filter being evaluated twice is that you are likely accessing the request object in some way before you set Request.Filter. This may cause it to evaluate the input stream, running the currently set filter chain as is, and returning that stream.
For example, simply accessing Request.Form["something"] would cause it to evaluate the input stream, running the entire filter chain, at that point in time. Any modification to Request.Filter after this point would have no effect, and it would appear that the filter is being ignored.
What you want to do is possible, but ASP.NET also provides Request Validation to address some of these issues (XSS). SQL injection, however, is usually averted by never constructing queries through string concatenation rather than by input sanitizing, though defense-in-depth is usually a good idea.
