NodeJs web crawler file extension handling

I'm developing a web crawler in nodejs. I've created a unique list of the URLs found in the crawled site body. But some of them have extensions like jpg, mp3, mpeg, etc. I want to avoid crawling the ones that have extensions. Is there any simple way to do that?

Two options stick out.
1) Use path to check every URL
As stated in comments, you can use path.extname to check for a file extension. Thus, this:
var path = require("path");

var test = "http://example.com/images/banner.jpg";
path.extname(test); // '.jpg'
This would work, but it feels like you'll wind up having to maintain a list of file types you either want to crawl or want to avoid. That's work.
Side note -- be careful using path. Typically, url is your best tool for parsing links because path is aimed at files/directories, not urls. On some systems (Windows), using path to manipulate a url can result in drama because of the slashes involved. Fair warning!
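For example, here's a small sketch using Node's built-in url and path modules (the sample link is just for illustration): parse the link first so a query string doesn't confuse the extension check, then take the extension of the pathname only.
var url = require("url");
var path = require("path");

var link = "http://example.com/images/banner.jpg?size=large";

// Parse the URL first, then take the extension of the pathname only,
// so "?size=large" doesn't end up inside the "extension".
var pathname = url.parse(link).pathname; // "/images/banner.jpg"
var ext = path.extname(pathname);        // ".jpg"

if (ext === "" || ext === ".html") {
    // looks like a page: keep it in the crawl list
} else {
    // .jpg, .mp3, .mpeg, ... skip it
}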
2) Get the HEAD for each link & see if content-type is set to text/html
You may have reasons to avoid making more network calls. If so, this isn't an option. But if it is OK to make additional calls, you could grab the HEAD for each link and check the MIME type stored in content-type.
Something like this:
var http = require("http");

var headersOptions = {
    method: "HEAD",
    host: "example.com", // host only -- the protocol does not belong here
    path: "/articles/content.html"
};

var req = http.request(headersOptions, function (res) {
    // you will probably need to also do things like check
    // HTTP status codes so you handle 404s, 301s, and so on
    var contentType = res.headers["content-type"] || "";
    if (contentType.indexOf("text/html") > -1) {
        // do something like queue the link up to be crawled
        // or parse the link or put it in a database or whatever
    }
});
req.end();
One benefit is that you only grab the HEAD, so even if the file is a gigantic video or something, it won't clog things up. You get the HEAD, see the content-type is a video or whatever, then move along because you aren't interested in that type.
Second, you don't have to keep track of file names because you're using a standard MIME type to differentiate html from other data formats.
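If it helps, here's a rough sketch of wrapping that HEAD check in a reusable helper you could run over your crawled URL list; the isHtml name and the example URL are illustrative, not part of the answer above.
var http = require("http");
var url = require("url");

// Illustrative helper: calls back with true when the link reports text/html.
function isHtml(link, callback) {
    var parsed = url.parse(link);
    var req = http.request({
        method: "HEAD",
        host: parsed.hostname,
        path: parsed.path
    }, function (res) {
        var contentType = res.headers["content-type"] || "";
        callback(res.statusCode === 200 && contentType.indexOf("text/html") > -1);
    });
    req.on("error", function () { callback(false); });
    req.end();
}

// Usage: only queue links that look like HTML pages.
isHtml("http://example.com/articles/content.html", function (ok) {
    if (ok) {
        // queue the link up to be crawled
    }
});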

Related

AVPlayer Not Able to Handle Relative Path URL in HLS Streams

We are running into an issue that seems unique to AVPlayer. We've built a new architecture for our HLS streams which makes use of relative URLs. We can play these channels on a number of other players just fine, but when building with AVPlayer, requests for child manifests/segments that use relative URLs get 400 errors.
Using curl, we are able to get a 200 success by getting to a url like: something.com/segmentfolder/../../abcd1234/videosegment_01.mp4
Curl is smart enough to get rid of the ../../ and create a valid path so the actual request (which can be seen using curl -v) looks something like: something.com/abcd1234/videosegment_01.mp4
Great. But AVPlayer doesn't do this. So it makes the request as is, which leads to a 400 error and it can't download the segment.
We were able to simulate this problem with Swift playground fairly easily with this code:
let hlsurl = URL(string: "website.com/segmentfolder/../../abc1234/videosegment.mp4")!
print(hlsurl)
print(hlsurl.standardized)
The first print shows the URL as is. Trying to GET that URL leads to a 400. The second print properly handles it by adding .standardized to the URL. This leads to a 200. The problem is that this only works for the top-level/initial manifest; it doesn't work for the child manifests and segments.
let url = URL(string: "website.com/segmentfolder/../../abc1234/videosegment.mp4")!
let task = URLSession.shared.dataTask(with: url.standardized) { (data, response, error) in
    guard let data = data else { return }
    print(String(data: data, encoding: .utf8)!)
}
task.resume() // don't forget to start the task
So question, does AVPlayer support relative URLs? If so, why can't it handle the ../../ in the URL path like other players and curl can? Is there a special way to get it to trigger standardizing ALL URL requests?
Any help would be appreciated.

Sending large amounts of data from an ESP8266 server

I am building a web server from an ESP8266 that will send environmental data to any web client as a web page. I'm using the Arduino IDE.
The problem is that the data can get rather large at times, and all of the examples I can find show assembling a web page in memory and sending it all at once to the client via ESP8266WebServer.send(). This is ok for small web pages, but won't work with the amount of data I need to send.
What I want to do is send the first part of the web page, then send the data out directly as I gather it, then send the closing parts of the web page. Is this even possible? I've looked unsuccessfully for documentation and there doesn't seem to be any examples anywhere.
For future reference, I think I figured out how to do it, with help from this page: https://gist.github.com/spacehuhn/6c89594ad0edbdb0aad60541b72b2388
The gist of it is that you still use ESP8266WebServer.send(), but you first send an empty string with the Content-Length header set to the size of your data, like this:
server.sendHeader("Content-Length", (String)fileSize);
server.send(200, "text/html", "");
Then you send buffers of data using ESP8266WebServer.sendContent() repeatedly until all of the data is sent.
Hope this helps someone else.
I was having a big issue and a headache serving big strings, concatenated together with other string variables, to the ESP32 Arduino web server with
server.send(200, "text/html", BIG_WEBPAGE);
which often resulted in a blank page, as I reported in my initial error.
What was happening was this error:
E (369637) uart: uart_write_bytes(1159): buffer null
I don't recommend using the above server.send() function.
After quite a lot of research I found this piece of code that simply works like a charm. I just chunked my web page into the pieces you see below.
server.sendHeader("Cache-Control", "no-cache, no-store, must-revalidate");
server.sendHeader("Pragma", "no-cache");
server.sendHeader("Expires", "-1");
server.setContentLength(CONTENT_LENGTH_UNKNOWN);
// here begin chunked transfer
server.send(200, "text/html", "");
server.sendContent(WEBPAGE_BIG_0);
server.sendContent(WEBPAGE_BIG_1);
server.sendContent(WEBPAGE_BIG_2);
server.sendContent(WEBPAGE_BIG_3);
server.sendContent(WEBPAGE_BIG_4);
server.sendContent(WEBPAGE_BIG_5);
server.client().stop();
I really owe much to this post. Hope the answer helps someone else.
After some more experiments I realized the code is faster and more efficient if you do not feed a string variable into the server.sendContent function. Instead, just paste the actual string value there:
server.sendContent("<html><head>my great page</head><body>");
server.sendContent("my long body</body></html>");
It is very important that when you chunk the web page you don't split HTML tags and you don't split a JavaScript expression (like cutting a while or an if in half); when chunking scripts, split after a semicolon or, better, between two function declarations.
Chunked transfer encoding is probably what you want, and it's helpful in the situation where the web page you are sending is being dynamically created on-the-fly and is also too large to fit into memory. In this situation, you have two problems. One, you can't send it all at once, and two, you don't know ahead of time how big the result is going to be. Both problems can be fixed like this:
String webPageChunk = "some html";
server.setContentLength(CONTENT_LENGTH_UNKNOWN);
server.send(200, "text/html", webPageChunk);
while (<page is being generated>) {
    webPageChunk = "some more html";
    server.sendContent(webPageChunk);
}
server.sendContent("");
Sending a blank line will tell the client to terminate the session. Be careful not to send one in your loop before you're done generating the whole page.

Login via POST does not yield valid session

I am currently trying to convert a smallish app from nodejs to golang (hence the two tags) but I'm running into a bit of trouble in doing so.
Essentially it is a very simple http POST login which I can't seem to realise. The background is that my university provides a calendar export function and I would like to provide this calendar as a feed that could be added to Google Cal.
Now the thing is that I have the whole thing working in Node, but I would really like to be able to realise it in Go as well.
The important bit of node code would be
var query = url.parse(req.url, true).query;
var data = {
u: query.user, // Username
p: query.password, // Password
};
needle.post(LOGIN_URL, data, {}, function (error, response) {
//extract cookies etc.
});
which is working like a charm, but if I try to do the same in Go
import "github.com/parnurzeal/gorequest"
//...
resp, body, err := gorequest.New().Post(LOGIN_URL).Send("u=user&p=pass").End()
//extract cookies etc.
I end up with an invalid (timed out) session. I already tried using just net/http in Go, which doesn't seem to change anything.
The result the POST request yields is a 302 redirect to an overview page (Btw: it is ASP based). Could it be that this is what's causing the problem, since gorequest then fetches that overview page without the cookies returned in resp, effectively creating a new session that isn't authorized, or am I overlooking something terribly simple?
So it seems that I found the answer myself by following your advice and using "net/http" and digging a little deeper into what the http.Client actually does. To anyone who might encounter similar problems, here is my solution:
http.Client automatically redirects if it receives a 30x response from the server (see the documentation). Although one can override the redirect policy, I was unable to prevent redirection entirely.
Additionally, it seems as if the client has a bug (what I would call it, at least), as it drops all headers upon redirect (see the issue, or the source where new headers are created).
While searching around in net/http I found http.DefaultTransport, which is used by http.Client and does not redirect. It is somewhat lower level and exactly what I was after. The following piece of code demonstrates how I replaced the gorequest line from above:
data := url.Values{"u": {USER}, "p": {PASS}}
req, err := http.NewRequest("POST", LOGIN_URL, bytes.NewBufferString(data.Encode()))
//I needed quite some time to figure out that I needed to set the content type accordingly
req.Header.Add("Content-Type", "application/x-www-form-urlencoded")
//...
resp, err := http.DefaultTransport.RoundTrip(req)
//...
//resp.Header["Set-Cookie"] now contains the login/session cookies
Although I need to extract cookies myself and set a few header values, the solution works perfectly and I am quite happy with it. If anybody has some improvements to my solution or any other advice I am glad to hear it. And thanks to JimB and Volker :).

Getting the mime type from a request in nodeJS

I'm learning nodejs but I ran into a roadblock. I'm trying to set up a simple server that will serve static files. My problem is that unless I explicitly type the extension in the URL, I cannot get the file extension. The HTTP header 'content-type' comes in as undefined.
Here is my code, pretty simple:
var http = require("http"),
    path = require("path"),
    fs = require("fs");

var server = http.createServer(function (request, response) {
    console.log([path.extname(request.url), request.headers['content-type']]);
    var fileName = path.basename(request.url) || "index.html",
        filePath = __dirname + '/public/' + fileName;
    console.log(filePath);
    fs.readFile(filePath, function (err, data) {
        if (err) throw err;
        response.end(data);
    });
});
server.listen(3000);
Any ideas?
FYI, I don't just want to dive into Connect or other frameworks; I want to know what's going on before I drop the grind and go straight to modules.
So static web servers generally don't do any deep magic. For example, nginx has a small mapping of file extensions to mime types here: http://trac.nginx.org/nginx/browser/nginx/conf/mime.types
There's likely also some fallback logic defaulting to html. You can also use a database of file "magic numbers" as is used by the file utility to look at the beginning of the file data and guess based on that.
But there's no magic here. It's basically
go by the file extension when available
maybe go by the beginning of the file content as next option
use a default of html because normally only html resources have URLs with no extensions, whereas images, css, js, fonts, multimedia, etc almost always do use file extensions in their URIs
Also note that browsers generally have a fairly robust set of checks that will interpret files correctly even when Content-Type headers are mismatched with the actual response body data.
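Here's a minimal sketch of that lookup approach; the table and the text/html fallback are illustrative, not a complete list:
var path = require("path");

// Tiny extension-to-MIME map, in the spirit of nginx's mime.types file.
var MIME_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".js": "application/javascript",
    ".json": "application/json",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".gif": "image/gif"
};

function mimeTypeFor(pathname) {
    var ext = path.extname(pathname).toLowerCase();
    // Fall back to html for extensionless URLs like "/" or "/about".
    return MIME_TYPES[ext] || "text/html";
}

// mimeTypeFor("/styles/site.css") -> "text/css"
// mimeTypeFor("/about")           -> "text/html"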

Best way to handle security and avoid XSS with user entered URLs

We have a high security application and we want to allow users to enter URLs that other users will see.
This introduces a high risk of XSS hacks - a user could potentially enter javascript that another user ends up executing. Since we hold sensitive data it's essential that this never happens.
What are the best practices in dealing with this? Is any security whitelist or escape pattern alone good enough?
Any advice on dealing with redirections (a "this link goes outside our site" message on a warning page before following the link, for instance)?
Is there an argument for not supporting user entered links at all?
Clarification:
Basically our users want to input:
http://stackoverflow.com
And have it output to another user as a link:
<a href="http://stackoverflow.com">http://stackoverflow.com</a>
What I really worry about is them using this in an XSS hack. I.e. they input:
javascript:alert('hacked!');
So other users get this link:
<a href="javascript:alert('hacked!');">http://stackoverflow.com</a>
My example is just to explain the risk - I'm well aware that javascript and URLs are different things, but by letting them input the latter they may be able to execute the former.
You'd be amazed how many sites you can break with this trick - HTML is even worse. If they know how to deal with links, do they also know to sanitise <iframe>, <img> and clever CSS references?
I'm working in a high security environment - a single XSS hack could result in very high losses for us. I'm happy that I could produce a Regex (or use one of the excellent suggestions so far) that could exclude everything that I could think of, but would that be enough?
If you think URLs can't contain code, think again!
https://owasp.org/www-community/xss-filter-evasion-cheatsheet
Read that, and weep.
Here's how we do it on Stack Overflow:
/// <summary>
/// returns "safe" URL, stripping anything outside normal charsets for URL
/// </summary>
public static string SanitizeUrl(string url)
{
    return Regex.Replace(url, @"[^-A-Za-z0-9+&@#/%?=~_|!:,.;\(\)]", "");
}
The process of rendering a link "safe" should go through three or four steps:
Unescape/re-encode the string you've been given (RSnake has documented a number of tricks at http://ha.ckers.org/xss.html that use escaping and UTF encodings).
Clean the link up: regexes are a good start - make sure to truncate the string or throw it away if it contains a " (or whatever you use to close the attributes in your output); if you're doing the links only as references to other information, you can also force the protocol at the end of this process - if the portion before the first colon is not 'http' or 'https', prepend 'http://'. This allows you to create usable links from incomplete input as a user would type into a browser, and gives you a last shot at tripping up whatever mischief someone has tried to sneak in.
Check that the result is a well formed URL (protocol://host.domain[:port][/path][/[file]][?queryField=queryValue][#anchor]).
Possibly check the result against a site blacklist or try to fetch it through some sort of malware checker.
If security is a priority I would hope that the users would forgive a bit of paranoia in this process, even if it does end up throwing away some safe links.
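As a rough illustration (in Node.js, since the question doesn't name a platform; sanitizeLink is a made-up helper, not part of any library mentioned here), the "force the protocol, then check for a well-formed URL" part of those steps might look like this:
// Rough sketch only: force a protocol, then require a well-formed http(s) URL.
const { URL } = require("url"); // WHATWG URL parser built into Node

function sanitizeLink(input) {
  let candidate = input.trim();

  // If the portion before the first colon is not http or https,
  // prepend "http://" (this also defangs javascript: and data: links).
  const scheme = candidate.split(":")[0].toLowerCase();
  if (scheme !== "http" && scheme !== "https") {
    candidate = "http://" + candidate;
  }

  // Check that the result is a well-formed URL; the parser also
  // normalizes and percent-encodes it.
  try {
    return new URL(candidate).href;
  } catch (e) {
    return null; // malformed: throw the link away
  }
}

// sanitizeLink("stackoverflow.com")           -> "http://stackoverflow.com/"
// sanitizeLink("javascript:alert('hacked!')") -> null
This only covers the clean-and-validate part; output encoding when the link is rendered, plus any blacklist or malware check, still apply.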
Use a library, such as OWASP-ESAPI API:
PHP - http://code.google.com/p/owasp-esapi-php/
Java - http://code.google.com/p/owasp-esapi-java/
.NET - http://code.google.com/p/owasp-esapi-dotnet/
Python - http://code.google.com/p/owasp-esapi-python/
Read the following:
https://www.golemtechnologies.com/articles/prevent-xss#how-to-prevent-cross-site-scripting
https://www.owasp.org/
http://www.secbytes.com/blog/?p=253
For example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$esapi = new ESAPI( "/etc/php5/esapi/ESAPI.xml" ); // Modified copy of ESAPI.xml
$sanitizer = ESAPI::getSanitizer();
$sanitized_url = $sanitizer->getSanitizedURL( "user-homepage", $url );
Another example is to use a built-in function. PHP's filter_var function is an example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$sanitized_url = filter_var($url, FILTER_SANITIZE_URL);
Using filter_var allows javascript calls, and filters out schemes that are neither http nor https. Using the OWASP ESAPI Sanitizer is probably the best option.
Still another example is the code from WordPress:
http://core.trac.wordpress.org/browser/tags/3.5.1/wp-includes/formatting.php#L2561
Additionally, since there is no way of knowing where the URL links (i.e., it might be a valid URL, but the contents of the URL could be mischievous), Google has a safe browsing API you can call:
https://developers.google.com/safe-browsing/lookup_guide
Rolling your own regex for sanitation is problematic for several reasons:
Unless you are Jon Skeet, the code will have errors.
Existing APIs have many hours of review and testing behind them.
Existing URL-validation APIs consider internationalization.
Existing APIs will be kept up-to-date with emerging standards.
Other issues to consider:
What schemes do you permit (are file:/// and telnet:// acceptable)?
What restrictions do you want to place on the content of the URL (are malware URLs acceptable)?
Just HTMLEncode the links when you output them. Make sure you don't allow javascript: links. (It's best to have a whitelist of protocols that are accepted, e.g., http, https, and mailto.)
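As a small illustration of that advice in Node.js (escapeHtml and renderLink are made-up helper names, not a particular library's API):
// Whitelist the protocol, then HTML-encode everything you put into the markup.
const { URL } = require("url");

const ALLOWED_PROTOCOLS = ["http:", "https:", "mailto:"];

function escapeHtml(s) {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

function renderLink(userUrl) {
  let parsed;
  try {
    parsed = new URL(userUrl);
  } catch (e) {
    return escapeHtml(userUrl); // not a valid URL: show it as plain text
  }
  if (!ALLOWED_PROTOCOLS.includes(parsed.protocol)) {
    return escapeHtml(userUrl); // javascript:, data:, etc. are never linked
  }
  return '<a href="' + escapeHtml(parsed.href) + '">' + escapeHtml(parsed.href) + '</a>';
}

// renderLink("https://stackoverflow.com")   -> '<a href="https://stackoverflow.com/">...</a>'
// renderLink("javascript:alert('hacked!')") -> escaped plain text, no link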
You don't specify the language of your application; I will presume ASP.NET, and for this you can use the Microsoft Anti-Cross Site Scripting Library.
It is very easy to use; all you need is an include and that is it :)
While you're on the topic, why not give a read to Design Guidelines for Secure Web Applications?
If you're using any other language: if there is a library for ASP.NET, one has to be available as well for other kinds of languages (PHP, Python, RoR, etc.).
For Pythonistas, try Scrapy's w3lib.
OWASP ESAPI pre-dates Python 2.7 and is archived on the now-defunct Google Code.
How about not displaying them as a link? Just use the text.
Combined with a warning to proceed at your own risk, that may be enough.
Addition: see also Should I sanitize HTML markup for a hosted CMS? for a discussion on sanitizing user input.
There is a library for JavaScript that solves this problem:
https://github.com/braintree/sanitize-url
Try it =)
In my project written in JavaScript I use this regex as white list:
url.match(/^((https?|ftp):\/\/|\.{0,2}\/)/)
The only limitation is that you need to put ./ in front of files in the same directory, but I think I can live with that.
Using regular expressions to prevent XSS vulnerabilities is becoming complicated, and thus hard to maintain over time, while it could still leave some vulnerabilities behind. Having URL validation using a regular expression is helpful in some scenarios, but it is better not to mix it with vulnerability checks.
The solution is probably to use a combination of an encoder like AntiXssEncoder.UrlEncode for encoding the query portion of the URL and UriBuilder for the rest:
public sealed class AntiXssUrlEncoder
{
    public string EncodeUri(Uri uri, bool isEncoded = false)
    {
        // Encode the query portion of the URL to prevent XSS attacks if it is not
        // already encoded. Otherwise let UriBuilder take care of it.
        var encodedQuery = isEncoded ? uri.Query.TrimStart('?') : AntiXssEncoder.UrlEncode(uri.Query.TrimStart('?'));
        var encodedUri = new UriBuilder
        {
            Scheme = uri.Scheme,
            Host = uri.Host,
            Path = uri.AbsolutePath,
            Query = encodedQuery.Trim(),
            Fragment = uri.Fragment
        };
        if (uri.Port != 80 && uri.Port != 443)
        {
            encodedUri.Port = uri.Port;
        }
        return encodedUri.ToString();
    }

    public static string Encode(string uri)
    {
        var baseUri = new Uri(uri);
        var antiXssUrlEncoder = new AntiXssUrlEncoder();
        return antiXssUrlEncoder.EncodeUri(baseUri);
    }
}
You may need to include whitelisting to exclude some characters from encoding. That could be helpful for particular sites.
HTML-encoding the page that renders the URL is another thing you may need to consider too.
BTW, please note that encoding the URL may break Web Parameter Tampering, so the encoded link may appear not to work as expected.
Also, you need to be careful about double encoding.
P.S. AntiXssEncoder.UrlEncode would have been better named AntiXssEncoder.EncodeForUrl, to be more descriptive. Basically, it encodes a string for use in a URL; it does not encode a given URL and return a usable URL.
You could use hex encoding to convert the entire URL and send it to your server. That way the client would not understand the content at first glance. After reading the content, you could decode the content (URL = ?) and send it to the browser.
Allowing a URL and allowing JavaScript are 2 different things.
